More on TreeTagger, which I’ve now (finally) gotten to train and cross-validate.
First, a note on an additional difficulty not included in the last post. I use MorphAdorner to create the lexicon from each chunk of training data (once that data has been cleaned up to TreeTagger’s liking). This works well, except that MorphAdorner has a quirk in its handling of punctuation for the lexicon. Specifically, MorphAdorner treats left and right quotes, which are tagged in the training data with ‘lq’ and ‘rq,’ as punctuation to be passed through in the lexicon. So what appears in the training data like this (sic, incidentally, on the “ here):
“ lq “
appears in the lexicon as:
“ “ “
I think there’s a way to fiddle with smart quotes in MorphAdorner (it would be nice to treat them all as dumb double or single quotes, though maybe MorphAdorner’s clever enough to use left and right quotes for context info), but I can’t find it at the moment. Anyway, this freaks TreeTagger out. So I’ve modified my fix-everything-up-for-TreeTagger script to change the second form back into the first whenever left or right quotes appear in the lexicon. Meh.
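For what it’s worth, here’s roughly what that lexicon fix-up amounts to. This is just a minimal sketch: the lq/rq tags come from my training data, but the file names, delimiters, and field layout are assumptions about my own intermediate files, not anything MorphAdorner or TreeTagger documents.

```python
# Sketch of the lexicon fix-up: when MorphAdorner passes a smart quote through
# as its own tag (e.g. “ “ “), rewrite the tag to the one the training data
# uses (lq/rq). File names and field layout are assumptions.
LEFT_QUOTES = {'\u2018', '\u201C'}    # ‘ “
RIGHT_QUOTES = {'\u2019', '\u201D'}   # ’ ”

def fix_lexicon_line(line):
    fields = line.split()
    if len(fields) >= 2 and fields[1] == fields[0]:
        if fields[0] in LEFT_QUOTES:
            fields[1] = 'lq'
        elif fields[0] in RIGHT_QUOTES:
            fields[1] = 'rq'
    return '\t'.join(fields) + '\n'

with open('lexicon.txt', encoding='utf-8') as src, \
     open('lexicon.fixed.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(fix_lexicon_line(line))
```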
Results and Discussion
All that said, and two days of my life squandered that I’ll never get back, was it at least worth it? In a word: no. TreeTagger produces awful results in the cross-validation tests. See for yourself:
Chunk   Words     Errors   Err Rate
0       382889    58934    .153919
1       382888    57586    .150399
2       382888    52527    .137186
3       382889    51714    .135062
4       382888    45794    .119601
5       382888    44436    .116054
6       382889    54037    .141129
7       382888    55332    .144512
8       382888    52674    .137570
9       382888    52441    .136961
Total   3828883   525475   .137239
Recall that MorphAdorner’s average error rate in cross-validation was 2.3-2.9%, depending on the lexical data used. Here it’s 13.7%, which is obviously unusable. And a quick survey of the output suggests that the errors are neither obviously systematic nor harmless: there are nouns tagged as verbs and the like all over the place.
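For the record, the scoring behind those numbers is nothing fancier than a token-by-token comparison of TreeTagger’s output tags against the gold tags for each held-out chunk. Roughly like this (a minimal sketch; the file names and the one-token-per-line, word-then-tag layout are assumptions about my own setup):

```python
# Per-chunk scoring: count tokens where the tagger's output tag differs from
# the gold tag. File names and the word/tag column layout are assumptions.
def error_rate(gold_path, tagged_path):
    words = errors = 0
    with open(gold_path, encoding='utf-8') as gold, \
         open(tagged_path, encoding='utf-8') as tagged:
        for gold_line, tagged_line in zip(gold, tagged):
            gold_fields = gold_line.split()
            tagged_fields = tagged_line.split()
            if len(gold_fields) < 2 or len(tagged_fields) < 2:
                continue  # skip blank or malformed lines
            words += 1
            if tagged_fields[1] != gold_fields[1]:
                errors += 1
    return words, errors, errors / words

# Chunk 0, for example, works out to 58934 / 382889 ≈ .154.
```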
[A side note: TreeTagger doesn’t show the disparity on the Shakespeare-heavy tranches, nos. 6-8, that MorphAdorner did. But the error rates are so high that it’s hard to say anything more about it. I hope to look at those chunks again in the Stanford case.]
It’s possible—likely, even—that I’ve done something wrong, given how bad the results are. TreeTagger’s authors claim to have achieved 96+% accuracy on general data using a smaller tagset. It’s also surely true that the tagset and cardinal/ordinal restrictions on TreeTagger’s training input limit its accuracy in the present case. Whatever. I’m sick of dealing with it.
[Update: Some of the errors are surely due to my not having reworked the training data to use the sentence-end tag TreeTagger expects. (I used ‘.’ instead, which is close, but of course not perfect.) But that would account for only a small fraction of the total errors, and fixing it at this point would require more work than I’m willing to spend on a solution that obviously won’t beat MorphAdorner.]
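Roughly speaking, the fix would amount to remapping sentence-final full stops in the training data to the sentence-end tag TreeTagger’s trainer expects (SENT by default, as I understand it). A hedged sketch, with file names and layout again assumptions about my own files rather than anything TreeTagger specifies:

```python
# Sketch: remap sentence-final '.' tags in one-token-per-line training data
# to TreeTagger's sentence-end tag (SENT by default). File names and the
# word/tag column layout are assumptions about my own intermediate files.
with open('train.txt', encoding='utf-8') as src, \
     open('train.sent.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        fields = line.split()
        if len(fields) >= 2 and fields[0] == '.' and fields[1] == '.':
            fields[1] = 'SENT'  # mark the token as a sentence boundary
        dst.write('\t'.join(fields) + '\n')
```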
Out of curiosity, I’ll still run TreeTagger through the stock-training-data, reduced-tagset comparison of all the taggers that will be the last test in this roundup. But for now TreeTagger’s very likely out of the running, especially since I’d like to have NUPOS compatibility.