Following on from the MorphAdorner bag-o-tags post, here’s the same treatment for the Stanford tagger.
I’ve used out-of-the-box settings, which means the left3words tagger trained on the usual WSJ corpus and employing the Penn Treebank tagset. Translations from this set (as it exists in Stanford’s output data) to my (very) reduced tagset are also available.
There are a few reasons to expect that the results will be worse than those seen with MorphAdorner. One is the tokenizer. Another is the different (non-literary) training set. A third is incompatibility between the tagsets. This last point is unfortunate, but there’s no easy way around it. It crops up in a few of the tags, and works in both directions:
- Stanford/Penn has a ‘to’ tag for the word “to”; the reference data (MorphAdorner’s training corpus) has no such thing, using ‘pc’ and ‘pp’ as appropriate instead.
- Stanford uses ‘pdt’ for “pre-determiner,” i.e., words that come before (and qualify) a determiner. MorphAdorner lacks this tag, using ‘dt’ or ‘jj’ as appropriate.
- Easier: Stanford uses ‘pos’ for possessive suffixes, while MorphAdorner doesn’t break them off from the base token; instead it has modified versions of the base tags that indicate possessiveness. But since I’m not looking at possessives as an individual class, I can just ignore these tags; the base tokens will be tagged on their own anyway.
- Also easy-ish: Stanford doesn’t use MorphAdorner’s ‘xx’ (negation) tag. It turns out that almost everything MorphAdorner tags ‘xx’, Stanford considers an adverb, so one could lump ‘xx’ and ‘av’ together, were one so inclined. (See the sketch after this list.)
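To make the conversion issues concrete, here’s roughly what the Penn-to-reduced translation looks like in code. This is a minimal sketch of my own, not the actual translation table linked above; the choices for ‘PDT’, ‘TO’, and the catch-all are assumptions meant only to illustrate the ambiguities just listed.

```python
# Illustrative sketch of a Penn Treebank -> reduced-tagset translation.
# Incomplete, and the awkward cases below are assumptions, not the real table.
PENN_TO_REDUCED = {
    "JJ": "jj", "JJR": "jj", "JJS": "jj",           # adjectives
    "NN": "nn", "NNS": "nn",                        # common nouns
    "NNP": "np", "NNPS": "np",                      # proper nouns
    "RB": "av", "RBR": "av", "RBS": "av",           # adverbs (absorbs the 'xx' cases)
    "VB": "vb", "VBD": "vb", "VBG": "vb",
    "VBN": "vb", "VBP": "vb", "VBZ": "vb",          # verbs
    "DT": "dt",
    "PDT": "dt",   # no clean equivalent; MorphAdorner uses 'dt' or 'jj' by context
    "TO": "pc",    # no clean equivalent; MorphAdorner splits 'pc'/'pp' by function
    "POS": None,   # possessive suffix: dropped, the base token is tagged anyway
    # ... remaining Penn tags omitted for brevity
}

def reduce_tag(penn_tag):
    """Map one Penn tag to the reduced set; None means 'skip this token'."""
    return PENN_TO_REDUCED.get(penn_tag, "zz")  # 'zz' as a catch-all (my choice)
```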
Data
Table 1: POS frequency in reference data and Stanford output
| POS | Ref. | Test | Ref % | Test % | Test Dif | Test Err % |
|-----|------|------|-------|--------|----------|------------|
| av | 213765 | 222465 | 5.551 | 5.754 | 0.20264 | 3.650 |
| cc | 243720 | 165546 | 6.329 | 4.282 | -2.04736 | -32.349 |
| dt* | 313671 | 299797 | 8.145 | 7.754 | -0.39166 | -4.808 |
| fw | 4117 | 4094 | 0.107 | 0.106 | -0.00103 | -0.959 |
| jj* | 210224 | 235856 | 5.459 | 6.100 | 0.64093 | 11.741 |
| nn | 565304 | 583214 | 14.680 | 15.084 | 0.40405 | 2.752 |
| np | 91933 | 156650 | 2.387 | 4.052 | 1.66419 | 69.709 |
| nu | 24440 | 18006 | 0.635 | 0.466 | -0.16896 | -26.623 |
| pc* | 54518 | 92888 | 1.416 | 2.402 | 0.98668 | 69.694 |
| pp* | 323212 | 375382 | 8.393 | 9.709 | 1.31547 | 15.673 |
| pr | 422442 | 385059 | 10.970 | 9.959 | -1.01107 | -9.217 |
| pu | 632749 | 641420 | 16.431 | 16.589 | 0.15804 | 0.962 |
| sy | 318 | 118 | 0.008 | 0.003 | -0.00521 | -63.043 |
| uh | 19492 | 2239 | 0.506 | 0.058 | -0.44826 | -88.560 |
| vb | 666095 | 641714 | 17.297 | 16.597 | -0.70029 | -4.049 |
| wh | 40162 | 41807 | 1.043 | 1.081 | 0.03834 | 3.676 |
| xx | 24544 | 0 | 0.637 | 0.000 | -0.63736 | -100.00 |
| zz | 167 | 201 | 0.004 | 0.005 | 0.00086 | 19.874 |
| Tot | 3850873 | 3866456 | | | | |
* Tag counts for which there is reason to expect systematic errors
Legend
POS = Part of speech (see this previous post or this list for explanations)
Ref. = Number of occurrences in reference data
Test = Number of occurrences in output
Ref % = Percentage of reference data tagged with this POS
Test % = Percentage of output tagged with this POS
Test Dif = Difference in percentage points between Test % and Ref % (Test % minus Ref %)
Test Err % = Percent error in output frequency relative to reference data (Test Dif as a percentage of Ref %)
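For the record, the derived columns are computed from the raw tag counts like so; a minimal sketch, with function and variable names of my own choosing:

```python
# Build the table rows from raw counts: {pos_tag: occurrence_count} for the
# reference data and for the tagger output.
def tag_table(ref_counts, test_counts):
    ref_total = sum(ref_counts.values())
    test_total = sum(test_counts.values())
    rows = []
    for pos in sorted(set(ref_counts) | set(test_counts)):
        ref_pct = 100.0 * ref_counts.get(pos, 0) / ref_total      # Ref %
        test_pct = 100.0 * test_counts.get(pos, 0) / test_total   # Test %
        test_dif = test_pct - ref_pct                             # Test Dif
        test_err = 100.0 * test_dif / ref_pct if ref_pct else float("nan")  # Test Err %
        rows.append((pos, ref_counts.get(pos, 0), test_counts.get(pos, 0),
                     ref_pct, test_pct, test_dif, test_err))
    return rows
```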
Pictures
And then the graphs (click for large versions). Note: These graphs are corrected for the xx/av problem discussed above; ‘xx’ tags in the reference data have been rolled into ‘av’ here.
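The correction amounts to folding the reference ‘xx’ counts into ‘av’ before plotting, roughly (using the ref_counts dict from the sketch above):

```python
# Roll MorphAdorner's 'xx' (negation) counts into 'av' in the reference data,
# since Stanford tags those tokens as adverbs.
ref_counts["av"] = ref_counts.get("av", 0) + ref_counts.pop("xx", 0)
```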
Figure 1: Percentage point errors in POS frequency relative to reference data
Figure 2: Percent error by POS type relative to reference data
Notes on Figure 2:
- As before, the size of each bubble represents the relative frequency of that POS tag in the reference data. Bigger bubbles mean more frequent tags.
- The x-axis is logarithmic, so the differences are both more (‘uh’, ‘sy’) and less (‘pu’, ‘nn’) dramatic than they appear.
Discussion
First, there are a few POS types that we know will be off, since there’s no straightforward conversion for them between the NUPOS and Penn tagsets. These are: dt, jj, pc, and pp. Adjectives (jj) are the only real disappointment among them, since that’s a class in which I have some interest. On the plus side, though, this is a problem connected to the ‘pdt’ tag in Penn, and there are only about 4,000 occurrences of it in Stanford’s output, compared to 235,000+ occurrences of ‘jj.’ Even if half of those ‘pdt’ tags should be ‘jj’ rather than ‘dt’, that’s still less than a 1% contribution to the overall error rate for ‘jj’.
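For the record, that back-of-the-envelope bound works out like this (the ‘jj’ count is from Table 1; the ‘pdt’ figure is approximate):

```python
# Worst-case contribution of mis-mapped 'pdt' tags to the 'jj' error:
pdt_in_output = 4000       # approximate count of Penn 'PDT' tags in Stanford's output
jj_in_output = 235856      # 'jj' count from Table 1
print(0.5 * pdt_in_output / jj_in_output)  # ~0.0085, i.e. under 1% of the 'jj' total
```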
That said, what sticks out here? Well, the numbers are a lot worse than those for MorphAdorner, for which the worst cases (proper nouns, nouns, and punctuation) were 0.1–0.2 percentage points off, compared to 1–2 percentage points here. So the results are roughly an order of magnitude worse. And in MorphAdorner’s case, the nouns overall may not have been as bad as they look, since proper nouns were undercounted, while common nouns were overcounted by a roughly offsetting amount. For Stanford, though, both common nouns and proper nouns are overcounted, so you can’t get rid of the error by lumping them together.
Similarly, relative error percentages for most tag types are much higher. In Figure 2, the main cluster of values sits between 1% and about 50%; for MorphAdorner, it was between 0.1% and 1.0%. Nouns, verbs, adjectives, adverbs, and pronouns (all major components of the data set) are off by 3% to 10% or more.
The real question, though, is what to make of all this. How many of the errors are merely “errors,” i.e., differences between the way Stanford does things and the way MorphAdorner does them? What I’m interested in, ultimately, is a tagger that’s reliable across a historically diverse corpus; I don’t especially care if it undercounts nouns, for instance, so long as it undercounts them all the time, everywhere, to the same extent. But in the absence of literary reference data not linked to MorphAdorner, and without the ability to train the Stanford tagger on MorphAdorner’s reference corpus, it’s hard for me to assess accuracy other than by standard cross-validation and this bag-of-tags method.
Takeaway point: I don’t see any compelling reason to keep Stanford in the running at this point.