In my last post, I outlined an approach to assessing tagger accuracy by measuring the relative frequencies of a reduced set of POS tags in a tagger’s output and comparing them to the frequencies in the training data I have on hand. Here’s the first set of results, from MorphAdorner.
A couple of notes. First, the procedure for MorphAdorner differs slightly from the one I’ll use for the other taggers. In order to mitigate some of the advantage it has thanks to working with its own training data, I’ll be using the output of its cross-validation runs, rather than that of a stock run over the full corpus. That means it’s not making use of the extended lexicon built into the default distribution, nor of any other fixups that may have been applied by hand after the default training (neither of which, of course, is available to the other taggers either). I think MorphAdorner still has an advantage, but this should reduce it slightly (in cross-validation, lack of the extended lexicon reduced overall accuracy from 97.7% to 97.1%, both of which were down from 98.1% on a run of the default distribution back over the training data).
A listing of the reduced tagset is included in the last post. You can also download two versions of the translation table from MorphAdorner-ese (i.e., NUPOS) to this reduced set. The first covers only the 238 tags currently defined by NUPOS, and is derived directly from a dump of the tag statistics and help information in MONK. The second includes all 261 tags actually used in the MorphAdorner training data. Two points to note: (1.) These translations use the syntactic part of speech assigned to each tag, not the word class to which it belongs. So when a noun is used as an adjective, it’s translated to ‘jj’ not ‘nn’. (2.) Of the 23 tags that appear in the training data but not in the standard tagset, almost all are punctuation. In fact the only ones that I haven’t translated as punctuation are various iterations of asterisks, to which I’ve assigned ‘zz’ (unknown-other), though I could see the case for making them either punctuation or symbols. It doesn’t seem to have made much difference, as ‘zz’ remains underrepresented in the output, while ‘sy’ and ‘pu’ are overrepresented. Go figure.
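To make the translation concrete, here’s a minimal sketch of how such a lookup might work. The handful of NUPOS tags and mappings shown are illustrative examples only, not entries copied from the actual table, which covers all 261 tags in the training data:

```python
# Sketch of a NUPOS -> reduced-tagset lookup. Entries are illustrative;
# the real table maps every NUPOS tag in the training data.
REDUCED_TAG = {
    "n1": "nn",   # singular noun
    "n2": "nn",   # plural noun
    "cc": "cc",   # coordinating conjunction
    "av": "av",   # adverb
    "j": "jj",    # adjective
}

def reduce_tag(nupos_tag):
    """Translate a NUPOS tag to the reduced set, falling back to 'zz'
    (unknown/other) for anything not in the table, as with the stray
    asterisk tags mentioned above."""
    return REDUCED_TAG.get(nupos_tag, "zz")
```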
First, the raw numbers:
Table 1: POS frequency in MorphAdorner training data and cross-validation output
POS   Train      Cross      Tr %     Cr %     Cr Dif     Cr Err %
av    213765     212662     5.551    5.512    -0.03909   -0.516
cc    243720     240704     6.329    6.239    -0.09015   -1.237
dt    313671     313111     8.145    8.116    -0.02993   -0.179
fw    4117       4076       0.107    0.106    -0.00126   -0.996
jj    210224     210454     5.459    5.455    -0.00437   0.109
nn    565304     571418     14.680   14.811   0.13069    1.082
np    91933      84632      2.387    2.194    -0.19375   -7.942
nu    24440      24776      0.635    0.642    0.00751    1.375
pc    54518      54200      1.416    1.405    -0.01092   -0.583
pp    323212     325560     8.393    8.438    0.04498    0.726
pr    422442     422557     10.970   10.952   -0.01778   0.027
pu    632749     640074     16.431   16.590   0.15877    1.158
sy    318        1282       0.008    0.033    0.02497    303.145
uh    19492      20836      0.506    0.540    0.03388    6.895
vb    666095     666822     17.297   17.283   -0.01388   0.109
wh    40162      40316      1.043    1.045    0.00202    0.383
xx    24544      24596      0.637    0.638    0.00014    0.212
zz    167        97         0.004    0.003    -0.00182   -41.916
---------------------------------------------------------------
Tot   3850873    3858173
POS = Part of speech (see previous post for explanations)
Train = Number of occurrences in training data
Cross = Number of occurrences in cross-validation output
Tr % = Percentage of training data tagged with this POS
Cr % = Percentage of cross-validation data tagged with this POS
Cr Dif = Difference in percentage points between Tr % and Cr %
Cr Err % = Percent error in the cross-validation count relative to the training count, i.e. (Cross − Train) / Train × 100
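The derived columns can be recomputed directly from the raw counts; a quick check in Python against the ‘np’ row of Table 1:

```python
# Recompute the derived columns of Table 1 for one tag ('np'),
# using the raw counts from the table above.
train_total = 3850873
cross_total = 3858173
train_np = 91933
cross_np = 84632

tr_pct = train_np / train_total * 100            # Tr %
cr_pct = cross_np / cross_total * 100            # Cr %
cr_dif = cr_pct - tr_pct                         # Cr Dif (percentage points)
cr_err = (cross_np - train_np) / train_np * 100  # Cr Err % (on raw counts)

print(round(tr_pct, 3), round(cr_pct, 3))  # 2.387 2.194
print(round(cr_dif, 5))                    # -0.19375
print(round(cr_err, 3))                    # -7.942
```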
That’s a bit hard to read, and is better represented graphically (click for full-size images):
Figure 1: Percentage point errors in POS frequency relative to the training data
Figure 2: Percent error by POS type relative to the training data
Notes on Figure 2:
- The size of each bubble represents the relative frequency of that POS tag in the training data. Bigger bubbles mean more frequent tags.
- The x-axis is logarithmic, so the differences are both more (‘zz’, ‘sy’) and less (‘pr’, ‘nn’) dramatic than they appear.
Just a couple of points, really.
- Things look pretty good overall. With the exception of a few (relatively low-frequency) outliers, most tags are recognized with around 99% accuracy or better. So I could have pretty good confidence were I to compare, say, adjective frequency in different contexts.
- It looks, though, like there’s systematic undercounting of proper nouns and conjunctions and, to a lesser extent, adverbs and determiners. Proper nouns will probably always be hard, since they account for a disproportionate share of unknown tokens (i.e., tokens not present in the training data), on which accuracy generally drops to 70-90%; proper nouns may fare a bit better, since orthography (capitalization) offers a clue.
- Nouns are overcounted, reducing the proper-noun problem if I were to lump the two together. Punctuation is overcounted, but I find it hard to imagine a situation in which I would care. Prepositions are slightly overcounted.
- Symbols (303% error) and unknown/other tokens (42%) are totally unreliable. They don’t contribute much to the overall error rate, because there aren’t very many of them (hence the nearly invisible bubbles in Figure 2), which is likely also why they’re so unreliable in the first place. But you wouldn’t want to make much of any variation you might see in those frequencies. Same goes for interjections, which hover around 7% error rate. [Note: I’ll look into the ‘sy’ and ‘zz’ cases, since it could be something systematically askew, but also might just be small-sample effects.]
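The sensitivity of the small categories falls out of the arithmetic: with only a few hundred tokens in the denominator, a modest absolute miscount becomes an enormous percent error. A quick illustration using the ‘sy’ and ‘vb’ counts from Table 1:

```python
def pct_err(train, cross):
    """Percent error of the cross-validation count relative to training."""
    return (cross - train) / train * 100

# The same absolute overcount (~964 tokens) that yields the 303% error
# against the 318 symbol tokens would barely register against the
# 666,095 verb tokens.
print(round(pct_err(318, 1282), 3))             # 303.145
print(round(pct_err(666095, 666095 + 964), 3))  # 0.145
```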
Anyway, this is useful information on its own, and I hope it will be even more so once I have analogous data for the other taggers. That data should be ready in the next day or two, so more to follow shortly.