Evaluating POS Taggers: MorphAdorner Cross-Validation

Here are the results of the cross-validation trials with MorphAdorner. Note to self: Never claim that anything will be trivial.

I ran a ten-fold cross-validation trial on the MorphAdorner training data. All fairly straightforward and naïve: I chopped the training data into ten equal-size chunks (paying no attention to sentence or paragraph boundaries, which introduces edge cases, but they’re trivial compared to the sample size), trained the tagger on nine of them, ran it over the remaining one using the new training set, and repeated the process for each chunk. I did this in two forms: once creating new lexical data as well as a new transformation matrix (columns labeled “1” below), and a second time creating only a new transformation matrix while using the supplied lexical information (labeled “2”). The true error rate on new texts probably lies somewhere between the two, but is (I think) likely to be closer to the second (better) case.
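The training and tagging steps themselves belong to MorphAdorner, but the chunking and file shuffling are simple enough to sketch. Something like the following (Python, hypothetical file names, assuming one word/tag token per line of training data) captures the loop:

# Minimal sketch of the ten-fold split, assuming the training data is
# one token (word/tag pair) per line; file names are hypothetical and
# the actual training/tagging is left to MorphAdorner's own tools.
from pathlib import Path

K = 10
lines = Path("training-data.txt").read_text(encoding="utf-8").splitlines()
size = len(lines) // K  # chunk boundaries ignore sentence/paragraph breaks

chunks = [lines[i * size:(i + 1) * size] for i in range(K)]
chunks[-1].extend(lines[K * size:])  # any remainder goes to the last chunk

for k in range(K):
    held_out = chunks[k]
    training = [line for i, c in enumerate(chunks) if i != k for line in c]
    Path(f"train-{k}.txt").write_text("\n".join(training) + "\n", encoding="utf-8")
    Path(f"test-{k}.txt").write_text("\n".join(held_out) + "\n", encoding="utf-8")
    # ... train MorphAdorner on train-{k}.txt, tag test-{k}.txt, score ...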

Results

So … the results:

#     Wds     Err1   Rate1    Err2    Rate2   Diff12   Rate12
0    382889   8607  .022479   7049   .018410   2143   .005596
1    382888   8267  .021591   5907   .015427   3120   .008148
2    382888   8749  .022850   6698   .017493   2739   .007153
3    382889   8780  .022930   7322   .019123   2168   .005662
4    382888   8039  .020995   5784   .015106   2851   .007446
5    382888   8147  .021277   5835   .015239   3020   .007887
6    382889  19660  .051346  16727   .043686   3721   .009718
7    382888  20494  .053524  17723   .046287   3679   .009608
8    382888  17078  .044603  14441   .037715   3603   .009410
9    382888   7151  .018676   4512   .011784   4436   .011585
Tot 3828883 114972  .030027  91998   .024027  31480   .008221

Key

# = Chunk being evaluated
Wds = Number of words in that chunk
Err1 = Number of tagging errors in the testing chunk using lexical and matrix data derived exclusively from the other cross-validation chunks
Rate1 = Tagging error rate for the testing chunk using this training data (1)
Err2 = Number of tagging errors using a matrix derived from the other chunks, but stock lexical data from the MorphAdorner distribution
Rate2 = Error rate in the second case
Diff12 = Number of differences between the output files generated by cases 1 and 2
Rate12 = Rate of differences between output 1 and 2
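For concreteness, here’s roughly how the per-chunk numbers above get computed. This is only a sketch (Python, hypothetical file names), assuming the tagged outputs line up token-for-token with the reference chunk and carry one word/tag pair per line:

# Sketch of the per-chunk bookkeeping behind the table: count tokens
# whose tags differ between files. Assumes one "word<TAB>tag" pair per
# line, aligned line-for-line; file names are hypothetical.
def read_tags(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n").split("\t")[-1] for line in f]

gold = read_tags("test-0.txt")          # reference tags for chunk 0
out1 = read_tags("tagged-0-case1.txt")  # chunk-derived lexicon and matrix
out2 = read_tags("tagged-0-case2.txt")  # stock lexicon, chunk-derived matrix

wds = len(gold)
err1 = sum(g != t for g, t in zip(gold, out1))    # Err1; Rate1 = err1 / wds
err2 = sum(g != t for g, t in zip(gold, out2))    # Err2; Rate2 = err2 / wds
diff12 = sum(a != b for a, b in zip(out1, out2))  # Diff12; Rate12 likewise

print(wds, err1, err1 / wds, err2, err2 / wds, diff12, diff12 / wds)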

[Update: In my original test (the data reported above), I didn’t preserve empty lines in the training and testing data. Rerunning the cross-validation with empty lines in place produces very slightly better results, with overall average error rates dropping from 3.0% to 2.9% (using limited lexical data, case 1) and from 2.4% to 2.3% (with full lexical data, case 2). Rates corrected in the analysis sections below. Corrected data (where blank lines now count as “words”):

#     Wds     Err1   Rate1    Err2    Rate2   Diff12   Rate12
0    399644   8599  .021516   7045   .017628   2142   .005359
1    399643   8140  .020368   5886   .014728   3015   .007544
2    399643   8803  .022027   6737   .016857   2769   .006928
3    399644   8806  .022034   7347   .018383   2173   .005437
4    399643   8070  .020193   5817   .014555   2843   .007113
5    399643   8817  .022062   5918   .014808   3676   .009198
6    399644  19806  .049559  16901   .042290   3727   .009325
7    399643  20246  .050660  17486   .043754   3645   .009120
8    399643  16997  .042530  14362   .035937   3597   .009000
9    399643   7150  .017890   4512   .011290   4438   .011104
Tot 3996433 115434  .028884  92011   .023023  32025   .008013

]

Notes

A couple of observations. In general, the tagging quality is quite good; it averages 97.1% accuracy over the full testing set even when it’s working without the enhanced lexicon supplied with the distribution, and rises to 97.7% with it.

There’s some serious variability, especially in chunks 6, 7, and 8. A quick inspection suggests that they’re mostly Shakespeare, which I guess is both reassuring and unsurprising. Reassuring because I’ll mostly be working with later material, and with fiction rather than drama; unsurprising because, under cross-validation in particular, much of the Shakespeare is necessarily excluded from the training data whenever it turns up in the chunk under evaluation. Without the Shakespeare-heavy chunks, accuracy is around 97.9% and 98.5% for cases 1 and 2, respectively. That’s darn good.
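Those Shakespeare-free figures fall straight out of the corrected table above; a quick sanity check, for anyone who wants to verify the arithmetic:

# Quick sanity check on the Shakespeare-free rates, using the corrected
# figures from the table above for chunks 0-5 and 9.
wds  = [399644, 399643, 399643, 399644, 399643, 399643, 399643]
err1 = [8599, 8140, 8803, 8806, 8070, 8817, 7150]
err2 = [7045, 5886, 6737, 7347, 5817, 5918, 4512]

total = sum(wds)
print(1 - sum(err1) / total)  # ~0.979 accuracy for case 1
print(1 - sum(err2) / total)  # ~0.985 accuracy for case 2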

I don’t know what, if anything, to make of the difference rate between the outputs of the two test cases, which is quite small (less than 1%) but not zero. I guess it’s good to know that different training inputs do in fact produce different outputs. Note also that the difference between them is not identical to the sum of the differences between each output and the reference data; that is, it’s not the case that they agree on everything but the errors. Each case gets some words right that the other one misses. No surprise there.
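To make that overlap concrete, here’s a continuation of the earlier sketch that tallies the four possible per-token outcomes (same hypothetical file names as before):

# Tally the four possible per-token outcomes to see how the two runs'
# errors overlap. Same helper and hypothetical file names as the
# earlier sketch.
from collections import Counter

def read_tags(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n").split("\t")[-1] for line in f]

gold = read_tags("test-0.txt")
out1 = read_tags("tagged-0-case1.txt")
out2 = read_tags("tagged-0-case2.txt")

cells = Counter((t1 == g, t2 == g) for g, t1, t2 in zip(gold, out1, out2))
print("both right: ", cells[(True, True)])
print("only case 1:", cells[(True, False)])
print("only case 2:", cells[(False, True)])
print("both wrong: ", cells[(False, False)])
# Note: Diff12 covers the two "only one right" cells plus any tokens
# where both outputs are wrong but disagree with each other, so it
# isn't simply Err1 + Err2 minus the shared errors.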

This is all very encouraging; accuracy at 97% and above is excellent. And this is comparing exact matches using a very large tagset; accuracy would be even better with less rigorous matching. As a practical matter, I probably won’t be able to do equivalent cross-validation for the other taggers (too time consuming, even if it’s technically possible), but I should at least be able to determine overall error rates using their default general-linguistic training data and the MorphAdorner reference corpus. I suspect they’d have to be pretty impressive to outweigh the other benefits of MorphAdorner. Time will tell. More after Thanksgiving.

[Oh, and for reference, the MONK Wiki has more information on MorphAdorner, including some good discussion of POS tagging in general, all by Martin AFAICT. Note that the last edit was in May 2007, so some things may have changed a bit.]

[Also, a check on speed: the cross-validation runs above, working over about 3.8 million words and processing all of them ten times (as either training or testing data), plus some associated calculation, took about 45 minutes in all on the same Athlon X2 3600+ as before (now with four gigs of RAM rather than two). Plenty speedy.]
