Evaluating POS Taggers: TreeTagger Cross-Validation

More on TreeTagger, which I’ve now (finally) gotten to train and cross-validate.

First, a note on an additional difficulty not included in the last post. I use MorphAdorner to create the lexicon from each chunk of training data (once that data has been cleaned up to TreeTagger’s liking). This works well, except that MorphAdorner has a quirk in its handling of punctuation for the lexicon. Specifically, MorphAdorner treats left and right quotes, which are tagged in the training data with ‘lq’ and ‘rq,’ as punctuation to be passed through in the lexicon. So what appears in the training data like this (sic, incidentally, on the “ here):

“	lq	“

appears in the lexicon as:

“	“	“

I think there’s a way to fiddle with smart quotes in MorphAdorner (it would be nice to treat them all as dumb double or single quotes, though maybe MorphAdorner’s clever enough to use left and right quotes for context info), but I can’t find it at the moment. Anyway, this freaks TreeTagger out. So I’ve modified my fix-everything-up-for-TreeTagger script to change the second form into the first whenever left or right quotes appear in the lexicon. Meh.
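For what it’s worth, the fix amounts to a few lines. Here’s a minimal sketch in Python; the file names are made up, the three-column token/tag/lemma layout is just what’s shown above, and I’m assuming single smart quotes get the same lq/rq treatment as double ones:

# Rewrite lexicon entries where MorphAdorner passed a smart quote
# through as its own tag, restoring the lq/rq tags used in the
# training data. File names and field layout are assumptions.

QUOTE_TAGS = {'\u201c': 'lq', '\u2018': 'lq',   # left double/single quote
              '\u201d': 'rq', '\u2019': 'rq'}   # right double/single quote

with open('lexicon.txt', encoding='utf-8') as src, \
     open('lexicon.fixed.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        fields = line.rstrip('\n').split('\t')
        # e.g.  “<TAB>“<TAB>“  becomes  “<TAB>lq<TAB>“
        if len(fields) == 3 and fields[0] in QUOTE_TAGS and fields[1] == fields[0]:
            fields[1] = QUOTE_TAGS[fields[0]]
        dst.write('\t'.join(fields) + '\n')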

Results and Discussion

All that said, and two days of my life squandered that I’ll never get back, was it at least worth it? In a word: No. TreeTagger produces awful results in the cross-validation tests. See for yourself:

Fold     Words   Errors   Err Rate
0       382889    58934    .153919
1       382888    57586    .150399
2       382888    52527    .137186
3       382889    51714    .135062
4       382888    45794    .119601
5       382888    44436    .116054
6       382889    54037    .141129
7       382888    55332    .144512
8       382888    52674    .137570
9       382888    52441    .136961
Tot    3828883   525475    .137239
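(These rates are just token-level tag mismatches against the held-out gold data. For reference, a sketch of the tally in Python; the file names and tab-separated token/tag format are my own assumptions:

def error_rate(gold_path, tagged_path):
    # Count token-level tag mismatches between gold data and tagger output,
    # assuming parallel files with one token<TAB>tag line per token.
    words = errors = 0
    with open(gold_path, encoding='utf-8') as gold, \
         open(tagged_path, encoding='utf-8') as tagged:
        for gold_line, tagged_line in zip(gold, tagged):
            words += 1
            if gold_line.rstrip('\n').split('\t')[1] != \
               tagged_line.rstrip('\n').split('\t')[1]:
                errors += 1
    return words, errors, errors / words

for fold in range(10):
    w, e, rate = error_rate(f'gold.{fold}.txt', f'treetagger.{fold}.txt')
    print(f'{fold}\t{w}\t{e}\t{rate:.6f}')
)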

Recall that for MorphAdorner the average error rate in cross-validation was 2.3-2.9%, depending on the lexical data used. Here it’s 13.7%. This is obviously unusable. And a quick survey of the output suggests that the errors are neither trivial nor systematic enough to correct easily: there are nouns tagged as verbs and such all over the place.

[A side note: TreeTagger doesn’t show the disparity on the Shakespeare-heavy tranches, nos. 6-8, that MorphAdorner did. But the error rates are so high that it’s hard to say anything more about it. I hope to look at those chunks again in the Stanford case.]

It’s possible—likely, even—that I’ve done something wrong, given how bad the results are. TreeTagger’s authors claim to have achieved 96+% accuracy on general data using a smaller tagset. It’s also surely true that the tagset and cardinal/ordinal restrictions on TreeTagger’s training input limit its accuracy in the present case. Whatever. I’m sick of dealing with it.

[Update: Some of the errors are surely due to my not having reworked the training data to set a sentence-end tag the way TreeTagger expects. (I used ‘.’ instead; close, but of course not perfect.) But that would account for only a small fraction of the total errors, and fixing it at this point would require more work than I’m willing to spend on a solution that obviously won’t beat MorphAdorner.]
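[The mechanical part of that fix would be trivial, something like the sketch below; SENT is TreeTagger’s default sentence-end tag (the train-tree-tagger -st option overrides it). Deciding which periods genuinely end sentences is the real work, and this sketch just repeats the treat-every-period-as-a-sentence-end imperfection noted above.

# Sketch: retag periods in the training data as TreeTagger's
# sentence-end tag (SENT by default). File names are made up, and
# treating every '.' token as a sentence end is the same rough
# heuristic described above, not a real fix.

with open('train.txt', encoding='utf-8') as src, \
     open('train.sent.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        fields = line.rstrip('\n').split('\t')
        if len(fields) >= 2 and fields[0] == '.' and fields[1] == '.':
            fields[1] = 'SENT'
        dst.write('\t'.join(fields) + '\n')
]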

Out of curiosity, I’ll still run TreeTagger through the stock-training-data-and-reduced-tagset comparison of all the taggers, which will be the last test in this roundup. But for now TreeTagger’s very likely out of the running, especially since I’d like to have NUPOS compatibility.
