Evaluating POS Taggers: More Info on the Training Data

With the help of Bob Carpenter at Alias-i, I now have cross-validation data for LingPipe using the MorphAdorner training set. Before I get to the numbers (which are good) in another post, a little background on the training data itself. (Incidentally, one of the real advantages of all this evaluative work has been to give me a much better sense of the data and tagsets I’m working with than I would have had otherwise.)

NUPOS, the tagset developed by Martin Mueller and used by MorphAdorner, is large: it’s up to 238 tags at the moment, compared to 60 or 80 in most other mainstream tagsets. This has a lot of potential benefits for literary work, especially if one wants to examine texts from earlier periods. (For the rationale behind NUPOS, see Martin’s post on the MONK wiki, to which I’ve linked in the past. Note that the list of tags there isn’t fully up to date as of mid-January 2009; if you have access to MONK development builds, there’s a better list at http://scribe.at.northwestern.edu:8090/monk/servlet?op=pos.) The short version: It might be nice to be able to deal with “thou shouldst” and to easily find cases of nouns used as adjectives, to name only two instances. But a tagset this large has some costs, too, particularly in the system requirements for training a tagger. This is especially true because of the way NUPOS handles contractions and certain other compounds, which it treats as single tokens with compound POS tags. I think I mentioned something about this earlier, but it wasn’t clear to me then exactly what was going on. So … a couple of examples:

I'll     pns11|vmb   i|will    i'll
there's  pc-acp|vbz  there|be  there's
You've   pn22|vhb    you|have  you've
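
To make the structure of those lines concrete, here is a minimal Python sketch that splits one entry into its component tags and lemmas. The column order (token, tags, lemmas, standardized spelling) is my reading of the three examples, not a documented format, and `split_compound` and `Part` are hypothetical names, not part of MorphAdorner.

```python
from typing import List, NamedTuple


class Part(NamedTuple):
    tag: str
    lemma: str


def split_compound(line: str) -> List[Part]:
    """Split one training line into (tag, lemma) parts.

    Assumption: four whitespace-separated columns in the order
    token, tag(s), lemma(s), standardized spelling, with compound
    tags and lemmas joined by '|'.
    """
    token, tags, lemmas, _standard = line.split()
    tag_parts = tags.split("|")
    lemma_parts = lemmas.split("|")
    if len(tag_parts) != len(lemma_parts):
        raise ValueError(f"tag/lemma count mismatch for {token!r}")
    return [Part(t, l) for t, l in zip(tag_parts, lemma_parts)]


# One surface token, two parts of speech.
parts = split_compound("I'll     pns11|vmb   i|will    i'll")
for part in parts:
    print(part.tag, part.lemma)
```

A tagger that splits contractions would instead see two separate tokens here, which is where the comparison trouble starts.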

There are about 22,000 such cases (out of 3.8+ M total tokens) in the training data. As I said, NUPOS and MorphAdorner treat these as single tokens with two parts of speech, but most other taggers split them into two tokens. Martin has a nice argument for why the NUPOS/MorphAdorner method is the right one, but it makes things a bit tricky for me at the moment. First, it means there are 428 unique tags in the MorphAdorner training data (almost doubling the size of the already large NUPOS tagset), which makes for potentially enormous matrices in the training and cross-validation process of other taggers. I’ve already seen that it makes training the Stanford tagger on this data impossible on my machine, and it made it slightly harder for me to repeat Bob’s cross-validation runs with LingPipe than it might have been with a smaller tagset.
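
A rough back-of-the-envelope calculation shows why the jump from 238 to 428 tags matters. The sketch below assumes a tagger keeps a dense table indexed by pairs or triples of tags; that is a simplification of my own, and the taggers mentioned here may store things quite differently. The counts are the ones quoted above.

```python
def table_cells(num_tags: int, order: int) -> int:
    """Cells in a dense table indexed by `order` consecutive tags."""
    return num_tags ** order


NUPOS_TAGS = 238          # size of the NUPOS tagset
TRAINING_TAGS = 428       # unique tags, compounds included, in the training data
COMPOUND_TOKENS = 22_000  # "about 22,000", so approximate
TOTAL_TOKENS = 3_800_000  # "3.8+ M", so a lower bound

compound_share = COMPOUND_TOKENS / TOTAL_TOKENS
bigram_growth = table_cells(TRAINING_TAGS, 2) / table_cells(NUPOS_TAGS, 2)
trigram_growth = table_cells(TRAINING_TAGS, 3) / table_cells(NUPOS_TAGS, 3)

print(f"compound tokens: {compound_share:.2%} of the data")
print(f"tag-pair table grows by a factor of {bigram_growth:.1f}")
print(f"tag-triple table grows by a factor of {trigram_growth:.1f}")
```

Under those assumptions, well under 1% of the tokens nearly double the tag inventory, and a table over tag pairs grows roughly threefold while one over tag triples grows nearly sixfold.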

Now, I don’t suspect that such things are really impossible (in fact I’m rerunning the LingPipe cross-validation process with some tweaked settings right now), but it does make things more resource intensive.

And then there are tokenization issues, which make direct comparison of tokens and parts of speech between taggers pretty tough. Every tagger wants to tokenize incoming data differently, and they don’t necessarily preserve MorphAdorner’s existing tokenization (heck, even MorphAdorner doesn’t respect MorphAdorner’s existing tokenization; see all the instances of ‘…’ turned into ‘..’ and ‘.’). I suppose it would be possible to write a pretty trivial tokenizer for each of the packages that says “hey, every token in the input data is on a separate line, just go with that.” But mucking about with tokenizers is beyond my current investment level.
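
For what it's worth, the "just go with the existing tokens" idea might look something like this in Python. This is only a sketch of the logic; `pretokenized` is a hypothetical helper, it assumes one token per line with the token in the first whitespace-separated column, and it does not plug into any particular tagger's tokenizer interface.

```python
from typing import Iterable, List


def pretokenized(lines: Iterable[str]) -> List[str]:
    """Return the tokens exactly as they appear in the input.

    Assumption: each non-blank line holds one token in its first
    whitespace-separated column; any other columns are ignored.
    """
    tokens = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines (e.g. sentence separators)
        tokens.append(line.split()[0])
    return tokens


sample = [
    "I'll     pns11|vmb   i|will    i'll",
    "",
    "there's  pc-acp|vbz  there|be  there's",
]
print(pretokenized(sample))
```

The real work would be wiring something like this into each package's own tokenizer hooks, which is the part I'm not taking on here.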

Finally, as Bob pointed out elsewhere (I’ve forgotten now whether it was in an email or a blog comment), tokenization has an obvious impact on tagging accuracy; it’s hard to get parts of speech right (where “right” = “matches the training data”) if the tagger thinks the data consists of different tokens than does the trainer. In the end, the obvious effect is to skew accuracy results toward taggers/tokenizers more closely related to the ones used to help generate the training data in the first place. Martin tells me that the MorphAdorner training data is distantly descended from material tagged by CLAWS and so is, as far as I can tell, probably not closely related to the other taggers I’m evaluating. Takeaway point: It’s no surprise that using MorphAdorner’s training data will tend to produce accuracy evaluations that favor MorphAdorner.
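
To see how a tokenization mismatch alone can cost accuracy, here is a small sketch that scores a tagger only on tokens whose character spans line up with the training data. The scoring rule and the function names are assumptions of mine for illustration, not the method used by any of the taggers discussed; the tag `vvi` on "go" is likewise just a placeholder.

```python
from typing import List, Tuple

Tagged = List[Tuple[str, str]]  # (token, tag) pairs


def spans(tokens: List[str]) -> List[Tuple[int, int]]:
    """Character spans of tokens, ignoring whitespace between them."""
    out, pos = [], 0
    for tok in tokens:
        out.append((pos, pos + len(tok)))
        pos += len(tok)
    return out


def span_accuracy(gold: Tagged, predicted: Tagged) -> float:
    """Fraction of gold tokens matched by a predicted token with the
    same character span and the same tag."""
    gold_spans = spans([tok for tok, _ in gold])
    pred_spans = spans([tok for tok, _ in predicted])
    pred_tags = dict(zip(pred_spans, (tag for _, tag in predicted)))
    correct = sum(
        1
        for span, (_, tag) in zip(gold_spans, gold)
        if pred_tags.get(span) == tag
    )
    return correct / len(gold)


# Training data keeps "I'll" whole; the tagger under test splits it.
gold = [("I'll", "pns11|vmb"), ("go", "vvi")]
predicted = [("I", "pns11"), ("'ll", "vmb"), ("go", "vvi")]
print(span_accuracy(gold, predicted))
```

Here the split tagger gets both halves of "I'll" arguably right, yet it is scored as wrong on that token because nothing it produced covers the same span with the compound tag.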

What’s next? A couple of posts later today, one on LingPipe cross-validation results and another on raw accuracy using a sharply reduced tagset and treating the whole dataset as a big bag of tags (i.e., what is the overall frequency of adjectives, etc.). Good times. More soon.
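
As a preview of the bag-of-tags idea, here is a minimal sketch: split compound tags into their parts, map each part to a reduced class, and report overall frequencies while ignoring token order and alignment. The reduced labels and the mapping below are hypothetical placeholders, not the reduced tagset I'll actually use.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def bag_of_tags(tags: Iterable[str],
                reduce_tag: Callable[[str], str]) -> Dict[str, float]:
    """Relative frequency of each reduced tag class.

    Compound tags such as 'pns11|vmb' are split on '|' so that each
    part counts once.
    """
    counts: Counter = Counter()
    for tag in tags:
        for part in tag.split("|"):
            counts[reduce_tag(part)] += 1
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical mapping, for illustration only.
REDUCED = {
    "pns11": "pronoun", "pn22": "pronoun",
    "vmb": "verb", "vhb": "verb", "vbz": "verb",
}

freqs = bag_of_tags(
    ["pns11|vmb", "pn22|vhb", "pc-acp|vbz"],
    lambda t: REDUCED.get(t, "other"),
)
print(freqs)
```

Because only totals are compared, this kind of measure sidesteps the tokenization mismatch, at the cost of saying nothing about whether any individual token was tagged correctly.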
