Here are the results of the cross-validation trials with MorphAdorner. Note to self: Never claim that anything will be trivial.
I ran a ten-fold cross-validation trial on the MorphAdorner training data. All fairly straightforward and naïve: I chopped the training data into ten equal-size chunks (with no attention to sentence or paragraph boundaries, which introduces edge cases, but they’re trivial compared to the sample size), trained the tagger on nine of them, ran it over the remaining chunk using that new training set, and then repeated the process for each chunk. I did this in two forms: once creating new lexical data as well as a new transformation matrix (columns labeled “1” below), and a second time creating only a new transformation matrix while using the supplied lexical information (labeled “2”). The true error rate on new texts is probably somewhere between the two, but (I think) likely to be closer to the second (better) case.
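For the record, the mechanics were nothing fancier than this sort of thing (a minimal sketch, not my actual script; the file name and the one-token-per-line format are assumptions):

```python
# Minimal sketch of the ten-fold split described above. Not the actual script;
# the file name and a one-token-per-line training format are assumptions.
with open("morphadorner-training-data.txt", encoding="utf-8") as f:
    lines = f.readlines()

k = 10
size = len(lines) // k
chunks = [lines[i * size:(i + 1) * size] for i in range(k)]
chunks[-1].extend(lines[k * size:])   # any remainder goes on the last chunk

for held_out in range(k):
    train = [ln for i, c in enumerate(chunks) if i != held_out for ln in c]
    test = chunks[held_out]
    # ... write `train` and `test` out, retrain the tagger on `train`,
    # run it over `test`, and score the result against the reference tags.
```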
Results
So … the results:
#     Wds      Err1    Rate1    Err2   Rate2    Diff12  Rate12
0     382889   8607    .022479  7049   .018410  2143    .005596
1     382888   8267    .021591  5907   .015427  3120    .008148
2     382888   8749    .022850  6698   .017493  2739    .007153
3     382889   8780    .022930  7322   .019123  2168    .005662
4     382888   8039    .020995  5784   .015106  2851    .007446
5     382888   8147    .021277  5835   .015239  3020    .007887
6     382889   19660   .051346  16727  .043686  3721    .009718
7     382888   20494   .053524  17723  .046287  3679    .009608
8     382888   17078   .044603  14441  .037715  3603    .009410
9     382888   7151    .018676  4512   .011784  4436    .011585
Tot   3828883  114972  .030027  91998  .024027  31480   .008221
Key
# = Chunk being evaluated
Wds = Number of words in that chunk
Err1 = Number of tagging errors in the testing chunk using lexical and matrix data derived exclusively from the other cross-validation chunks
Rate1 = Tagging error rate for the testing chunk using this training data (1)
Err2 = Number of tagging errors using a matrix derived from the other chunks, but stock lexical data from the MorphAdorner distribution
Rate2 = Error rate in the second case
Diff12 = Number of differences between the output files generated by cases 1 and 2
Rate12 = Rate of differences between output 1 and 2
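For what it’s worth, the bookkeeping behind those counts is nothing more than token-by-token comparison, roughly like this (a sketch only; the file names and the word-tab-tag format are assumptions, not MorphAdorner’s actual output format):

```python
# Sketch of how Err1/Err2/Diff12 and the corresponding rates are tallied.
# File names and the "word<TAB>tag" one-token-per-line format are assumptions.
def read_tags(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n").split("\t")[-1] for line in f]

ref  = read_tags("chunk0-reference.txt")      # hand-corrected tags
out1 = read_tags("chunk0-case1-output.txt")   # lexicon + matrix from other chunks
out2 = read_tags("chunk0-case2-output.txt")   # matrix from other chunks, stock lexicon

words  = len(ref)
err1   = sum(r != o for r, o in zip(ref, out1))    # Err1
err2   = sum(r != o for r, o in zip(ref, out2))    # Err2
diff12 = sum(a != b for a, b in zip(out1, out2))   # Diff12

print(err1 / words, err2 / words, diff12 / words)  # Rate1, Rate2, Rate12
```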
[Update: In my original test (the data reported above), I didn’t preserve empty lines in the training and testing data. Rerunning the cross-validation with empty lines in place produces very slightly better results, with overall average error rates dropping from 3.0% to 2.9% (using limited lexical data, case 1) and from 2.4% to 2.3% (with full lexical data, case 2). Rates corrected in the analysis sections below. Corrected data (where blank lines now count as “words”):
#     Wds      Err1    Rate1    Err2   Rate2    Diff12  Rate12
0     399644   8599    .021516  7045   .017628  2142    .005359
1     399643   8140    .020368  5886   .014728  3015    .007544
2     399643   8803    .022027  6737   .016857  2769    .006928
3     399644   8806    .022034  7347   .018383  2173    .005437
4     399643   8070    .020193  5817   .014555  2843    .007113
5     399643   8817    .022062  5918   .014808  3676    .009198
6     399644   19806   .049559  16901  .042290  3727    .009325
7     399643   20246   .050660  17486  .043754  3645    .009120
8     399643   16997   .042530  14362  .035937  3597    .009000
9     399643   7150    .017890  4512   .011290  4438    .011104
Tot   3996433  115434  .028884  92011  .023023  32025   .008013
]
Notes
A couple of observations. In general, the tagging quality is quite good; it averages 97.1% accuracy over the full testing set even when it’s working without the enhanced lexicon supplied with the distribution (case 1), and rises to 97.7% accuracy with that lexicon in place (case 2).
There’s some serious variability, especially in chunks 6, 7, and 8. A quick inspection suggests that they’re mostly Shakespeare, which I guess is both reassuring and unsurprising. Reassuring because I’ll mostly be working with later stuff and with fiction rather than drama; and unsurprising particularly in the case of cross-validation, since much of the Shakespeare is necessarily excluded from the training data when it happens to occur in the chunk under consideration. Without the Shakespeare-heavy chunks, accuracy is around 97.8% and 98.4% for cases 1 and 2, respectively. That’s darn good.
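(The “without the Shakespeare-heavy chunks” figures are just the totals minus chunks 6 through 8; using the first table above, the arithmetic looks like this:)

```python
# Back-of-the-envelope check: drop chunks 6-8 from the totals in the first
# table above and recompute accuracy for the two cases.
words = 3828883 - (382889 + 382888 + 382888)
err1  = 114972 - (19660 + 20494 + 17078)
err2  = 91998  - (16727 + 17723 + 14441)

print(1 - err1 / words)   # roughly 0.978 (case 1)
print(1 - err2 / words)   # roughly 0.984 (case 2)
```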
I don’t know what, if anything, to make of the difference rate between the two testing samples, which is quite small (less than 1%) but not zero. I guess it’s good to know that different training inputs do in fact produce different outputs. Note also that the number of differences between them is not simply the sum (or the gap) of their error counts against the reference data, i.e., it’s not the case that they agree on everything but the errors. Each case gets some words right that the other one misses. No surprise there.
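Continuing the sketch above, sorting the disagreements into who got what right is a one-pass comparison over the same (assumed) files:

```python
# Where the two outputs disagree, which one (if either) matches the reference?
only_1_right = only_2_right = both_wrong = 0
for r, o1, o2 in zip(ref, out1, out2):
    if o1 == o2:
        continue               # the two outputs agree on this token
    if o1 == r:
        only_1_right += 1      # case 1 correct, case 2 not
    elif o2 == r:
        only_2_right += 1      # case 2 correct, case 1 not
    else:
        both_wrong += 1        # they disagree and both miss the reference

print(only_1_right, only_2_right, both_wrong)
```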
This is all very encouraging; accuracy at 97% and above is excellent. And this is comparing exact matches using a very large tagset; accuracy would be even better with less rigorous matching. As a practical matter, I probably won’t be able to do equivalent cross-validation for the other taggers (too time consuming, even if it’s technically possible), but I should at least be able to determine overall error rates using their default general-linguistic training data and the MorphAdorner reference corpus. I suspect they’d have to be pretty impressive to outweigh the other benefits of MorphAdorner. Time will tell. More after Thanksgiving.
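By “less rigorous matching” I mean something like collapsing the full tagset to coarse word classes before comparing. The prefix mapping below is invented purely for illustration; it is not how the real tagset is organized:

```python
# Hypothetical illustration of less rigorous matching: collapse fine-grained
# tags into coarse classes before comparing.  The prefix mapping here is
# invented for illustration and is not the real tagset's structure.
COARSE = {"av": "adverb", "n": "noun", "v": "verb", "j": "adjective"}

def coarse(tag):
    for prefix, cls in COARSE.items():
        if tag.startswith(prefix):
            return cls
    return tag   # no coarse class defined; fall back to the full tag

coarse_err1 = sum(coarse(r) != coarse(o) for r, o in zip(ref, out1))
print(1 - coarse_err1 / len(ref))   # can only match or exceed exact-match accuracy
```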
[Oh, and for reference, the MONK Wiki has more information on MorphAdorner, including some good discussion of POS tagging in general, all by Martin AFAICT. Note that the last edit was May 2007, so some things may have changed a bit.]
[Also, a check on speed: The above cross-validation runs, working on about 3.8 million words and processing all of them ten times, either as training or testing data, plus some associated calculating/processing, took about 45 minutes in sum on the same Athlon X2 3600+ as before (now with four gigs of RAM rather than two). Plenty speedy.]