Here are the results of the cross-validation trials with MorphAdorner. Note to self: Never claim that anything will be trivial.
I ran a ten-fold cross-validation trial on the MorphAdorner training data. All fairly straightforward and naïve: I chopped the training data into ten equal-size chunks (with no attention to sentence or paragraph boundaries, which introduces edge cases, but they’re trivial compared to the sample size), trained the tagger on nine of them, ran it over the remaining chunk using that new training set, and then repeated the process for each chunk. I did this in two forms: once creating new lexical data as well as a new transformation matrix (columns labeled “1” below), and a second time creating only a new transformation matrix while using the supplied lexical information (labeled “2”). The true error rate on new texts is probably somewhere between the two, but (I think) likely to be closer to the second (better) case.
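For the record, the mechanics were nothing fancier than this sort of thing (a minimal sketch, not my actual script; the file name and the one-token-per-line format are assumptions):

```python
# Minimal sketch of the ten-fold split described above. Not the actual script;
# the file name and a one-token-per-line training format are assumptions.
with open("morphadorner-training-data.txt", encoding="utf-8") as f:
    lines = f.readlines()

k = 10
size = len(lines) // k
chunks = [lines[i * size:(i + 1) * size] for i in range(k)]
chunks[-1].extend(lines[k * size:])   # any remainder goes on the last chunk

for held_out in range(k):
    train = [ln for i, c in enumerate(chunks) if i != held_out for ln in c]
    test = chunks[held_out]
    # ... write `train` and `test` out, retrain the tagger on `train`,
    # run it over `test`, and score the result against the reference tags.
```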
Results
So … the results:
#     Wds      Err1    Rate1    Err2   Rate2    Diff12  Rate12
0     382889   8607    .022479  7049   .018410  2143    .005596
1     382888   8267    .021591  5907   .015427  3120    .008148
2     382888   8749    .022850  6698   .017493  2739    .007153
3     382889   8780    .022930  7322   .019123  2168    .005662
4     382888   8039    .020995  5784   .015106  2851    .007446
5     382888   8147    .021277  5835   .015239  3020    .007887
6     382889   19660   .051346  16727  .043686  3721    .009718
7     382888   20494   .053524  17723  .046287  3679    .009608
8     382888   17078   .044603  14441  .037715  3603    .009410
9     382888   7151    .018676  4512   .011784  4436    .011585
Tot   3828883  114972  .030027  91998  .024027  31480   .008221
Key
# = Chunk being evaluated
Wds = Number of words in that chunk
Err1 = Number of tagging errors in the testing chunk using lexical and matrix data derived exclusively from the other cross-validation chunks
Rate1 = Tagging error rate for the testing chunk using this training data (1)
Err2 = Number of tagging errors using a matrix derived from the other chunks, but stock lexical data from the MorphAdorner distribution
Rate2 = Error rate in the second case
Diff12 = Number of differences between the output files generated by cases 1 and 2
Rate12 = Rate of differences between output 1 and 2
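For what it’s worth, the bookkeeping behind those counts is nothing more than token-by-token comparison, roughly like this (a sketch only; the file names and the word-tab-tag format are assumptions, not MorphAdorner’s actual output format):

```python
# Sketch of how Err1/Err2/Diff12 and the corresponding rates are tallied.
# File names and the "word<TAB>tag" one-token-per-line format are assumptions.
def read_tags(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n").split("\t")[-1] for line in f]

ref  = read_tags("chunk0-reference.txt")      # hand-corrected tags
out1 = read_tags("chunk0-case1-output.txt")   # lexicon + matrix from other chunks
out2 = read_tags("chunk0-case2-output.txt")   # matrix from other chunks, stock lexicon

words  = len(ref)
err1   = sum(r != o for r, o in zip(ref, out1))    # Err1
err2   = sum(r != o for r, o in zip(ref, out2))    # Err2
diff12 = sum(a != b for a, b in zip(out1, out2))   # Diff12

print(err1 / words, err2 / words, diff12 / words)  # Rate1, Rate2, Rate12
```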
[Update: In my original test (the data reported above), I didn’t preserve empty lines in the training and testing data. Rerunning the cross-validation with empty lines in place produces very slightly better results, with overall average error rates dropping from 3.0% to 2.9% (using limited lexical data, case 1) and from 2.4% to 2.3% (with full lexical data, case 2). Rates corrected in the analysis sections below. Corrected data (where blank lines now count as “words”):
#     Wds      Err1    Rate1    Err2   Rate2    Diff12  Rate12
0     399644   8599    .021516  7045   .017628  2142    .005359
1     399643   8140    .020368  5886   .014728  3015    .007544
2     399643   8803    .022027  6737   .016857  2769    .006928
3     399644   8806    .022034  7347   .018383  2173    .005437
4     399643   8070    .020193  5817   .014555  2843    .007113
5     399643   8817    .022062  5918   .014808  3676    .009198
6     399644   19806   .049559  16901  .042290  3727    .009325
7     399643   20246   .050660  17486  .043754  3645    .009120
8     399643   16997   .042530  14362  .035937  3597    .009000
9     399643   7150    .017890  4512   .011290  4438    .011104
Tot   3996433  115434  .028884  92011  .023023  32025   .008013
]
Notes
A couple of observations. In general, the tagging quality is quite good; it averages 97.1% accuracy over the full testing set even when it’s working without the enhanced lexicon supplied with the distribution (case 1), and rises to 97.7% accuracy with that lexicon in place (case 2).
There’s some serious variability, especially in chunks 6, 7, and 8. A quick inspection suggests that they’re mostly Shakespeare, which I guess is both reassuring and unsurprising. Reassuring because I’ll mostly be working with later stuff and with fiction rather than drama; and unsurprising particularly in the case of cross-validation, since much of the Shakespeare is necessarily excluded from the training data when it happens to occur in the chunk under consideration. Without the Shakespeare-heavy chunks, accuracy is around 97.8% and 98.4% for cases 1 and 2, respectively. That’s darn good.
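(The “without the Shakespeare-heavy chunks” figures are just the totals minus chunks 6 through 8; using the first table above, the arithmetic looks like this:)

```python
# Back-of-the-envelope check: drop chunks 6-8 from the totals in the first
# table above and recompute accuracy for the two cases.
words = 3828883 - (382889 + 382888 + 382888)
err1  = 114972 - (19660 + 20494 + 17078)
err2  = 91998  - (16727 + 17723 + 14441)

print(1 - err1 / words)   # roughly 0.978 (case 1)
print(1 - err2 / words)   # roughly 0.984 (case 2)
```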
I don’t know what, if anything, to make of the difference rate between the two testing samples, which is quite small (less than 1%) but not zero. I guess it’s good to know that different training inputs do in fact produce different outputs. Note also that the number of differences between them is not simply the sum (or the gap) of their error counts against the reference data, i.e., it’s not the case that they agree on everything but the errors. Each case gets some words right that the other one misses. No surprise there.
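Continuing the sketch above, sorting the disagreements into who got what right is a one-pass comparison over the same (assumed) files:

```python
# Where the two outputs disagree, which one (if either) matches the reference?
only_1_right = only_2_right = both_wrong = 0
for r, o1, o2 in zip(ref, out1, out2):
    if o1 == o2:
        continue               # the two outputs agree on this token
    if o1 == r:
        only_1_right += 1      # case 1 correct, case 2 not
    elif o2 == r:
        only_2_right += 1      # case 2 correct, case 1 not
    else:
        both_wrong += 1        # they disagree and both miss the reference

print(only_1_right, only_2_right, both_wrong)
```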
This is all very encouraging; accuracy at 97% and above is excellent. And this is comparing exact matches using a very large tagset; accuracy would be even better with less rigorous matching. As a practical matter, I probably won’t be able to do equivalent cross-validation for the other taggers (too time consuming, even if it’s technically possible), but I should at least be able to determine overall error rates using their default general-linguistic training data and the MorphAdorner reference corpus. I suspect they’d have to be pretty impressive to outweigh the other benefits of MorphAdorner. Time will tell. More after Thanksgiving.
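By “less rigorous matching” I mean something like collapsing the full tagset to coarse word classes before comparing. The prefix mapping below is invented purely for illustration; it is not how the real tagset is organized:

```python
# Hypothetical illustration of less rigorous matching: collapse fine-grained
# tags into coarse classes before comparing.  The prefix mapping here is
# invented for illustration and is not the real tagset's structure.
COARSE = {"av": "adverb", "n": "noun", "v": "verb", "j": "adjective"}

def coarse(tag):
    for prefix, cls in COARSE.items():
        if tag.startswith(prefix):
            return cls
    return tag   # no coarse class defined; fall back to the full tag

coarse_err1 = sum(coarse(r) != coarse(o) for r, o in zip(ref, out1))
print(1 - coarse_err1 / len(ref))   # can only match or exceed exact-match accuracy
```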
[Oh, and for reference, the MONK Wiki has more information on MorphAdorner, including some good discussion of POS tagging in general, all by Martin AFAICT. Note that the last edit was May 2007, so some things may have changed a bit.]
[Also, a check on speed: The above cross-validation runs, working on about 3.8 million words and processing all of them ten times, either as training or testing data, plus some associated calculating/processing, took about 45 minutes in sum on the same Athlon X2 3600+ as before (now with four gigs of RAM rather than two). Plenty speedy.]