Evaluating POS Taggers: Stanford Bag of Tags Accuracy

Following on from the MorphAdorner bag-o-tags post, here’s the same treatment for the Stanford tagger.

I’ve used out-of-the-box settings, which means the left3words tagger trained on the usual WSJ corpus and employing the Penn Treebank tagset. Translations from this set (as it exists in Stanford’s output data) to my (very) reduced tagset are also available.
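
For concreteness, here’s roughly what an out-of-the-box run looks like if you drive the tagger from Python. The jar name, model filename, and file paths below are placeholders and will vary with the Stanford distribution you happen to have.

```python
# Hedged sketch of an out-of-the-box Stanford tagger run; the jar, model,
# and corpus filenames are placeholders -- adjust them for your distribution.
import subprocess

with open("corpus-tagged.txt", "w") as out:
    subprocess.run(
        [
            "java", "-Xmx1g",
            "-cp", "stanford-postagger.jar",
            "edu.stanford.nlp.tagger.maxent.MaxentTagger",
            "-model", "models/left3words-wsj-0-18.tagger",  # stock left3words WSJ model
            "-textFile", "corpus.txt",
        ],
        stdout=out,
        check=True,
    )
# Output is plain text with a Penn Treebank tag appended to each token
# (word_TAG or word/TAG, depending on the release).
```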

There are a few reasons to expect that the results will be worse than those seen with MorphAdorner. One is the tokenizer. Another is the different (non-literary) training set. A third is incompatibility between the tagsets. This last point is unfortunate, but there’s not really any easy way to get around it. It crops up in a few of the tags, and works in both directions:

  • Stanford/Penn has a ‘to’ tag for the word “to”; the reference data (MorphAdorner’s training corpus) has no such thing, using ‘pc’ and ‘pp’ as appropriate instead.
  • Stanford uses ‘pdt’ for “pre-determiner,” i.e., words that come before (and qualify) a determiner. MorphAdorner lacks this tag, using ‘dt’ or ‘jj’ as appropriate.
  • Easier: Stanford uses ‘pos’ for possessive suffixes, while MorphAdorner doesn’t break them off from the base token, instead using modified versions of the base tags to indicate possessiveness. But since I’m not looking at possessives as an individual class, I can just ignore these; the base tokens will be tagged on their own anyway.
  • Also easy-ish: Stanford doesn’t use MorphAdorner’s ‘xx’ (negation) tag. It turns out that almost everything MorphAdorner tags ‘xx’, Stanford considers an adverb, so one could lump ‘xx’ and ‘av’ together, were one so inclined (a rough translation sketch follows this list).
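
By way of illustration, here’s a minimal sketch of the kind of Penn-to-reduced-tagset translation involved. To be clear, this is not the exact mapping behind the numbers below; the awkward cases from the list above are flagged in the comments, and anything I haven’t listed just falls through to a catch-all.

```python
# Illustrative Penn Treebank -> reduced tagset mapping (not the exact table
# used for the results below); None means "drop this token".
PENN_TO_REDUCED = {
    "NN": "nn", "NNS": "nn",                  # common nouns
    "NNP": "np", "NNPS": "np",                # proper nouns
    "JJ": "jj", "JJR": "jj", "JJS": "jj",     # adjectives
    "RB": "av", "RBR": "av", "RBS": "av",     # adverbs
    "VB": "vb", "VBD": "vb", "VBG": "vb",
    "VBN": "vb", "VBP": "vb", "VBZ": "vb", "MD": "vb",
    "DT": "dt",
    "PDT": "dt",   # no real MorphAdorner equivalent; 'dt' or 'jj' are both defensible
    "TO": "pc",    # Penn's catch-all 'to'; the reference splits this across 'pc' and 'pp'
    "POS": None,   # possessive suffix: dropped, since the base token is tagged anyway
    "IN": "pp",
    "RP": "pc",
    "CC": "cc",
    "PRP": "pr", "PRP$": "pr",
    "CD": "nu",
    "UH": "uh",
    "FW": "fw",
    "SYM": "sy",
    "WDT": "wh", "WP": "wh", "WP$": "wh", "WRB": "wh",
    # Penn's punctuation tags (',', '.', ':', quotes, brackets) would all map to 'pu'.
}

def reduce_tag(penn_tag):
    """Translate one Penn tag; unknown tags fall through to the catch-all 'zz'."""
    return PENN_TO_REDUCED.get(penn_tag, "zz")

def lump_xx(reference_tag):
    """For apples-to-apples comparison, roll the reference data's 'xx' into 'av'."""
    return "av" if reference_tag == "xx" else reference_tag
```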

Data

 

Table 1: POS frequency in reference data and Stanford output

 

POS	Ref.	Test	Ref %	Test %	Test Dif	Test Err %
av	213765	222465	 5.551	 5.754	 0.20264	  3.650
cc	243720	165546	 6.329	 4.282	-2.04736	-32.349
dt*	313671	299797	 8.145	 7.754	-0.39166	 -4.808
fw	  4117	  4094	 0.107	 0.106	-0.00103	 -0.959
jj*	210224	235856	 5.459	 6.100	 0.64093	 11.741
nn	565304	583214	14.680	15.084	 0.40405	  2.752
np	 91933	156650	 2.387	 4.052	 1.66419	 69.709
nu	 24440	 18006	 0.635	 0.466	-0.16896	-26.623
pc*	 54518	 92888	 1.416	 2.402	 0.98668	 69.694
pp*	323212	375382	 8.393	 9.709	 1.31547	 15.673
pr	422442	385059	10.970	 9.959	-1.01107	 -9.217
pu	632749	641420	16.431	16.589	 0.15804	  0.962
sy	   318	   118	 0.008	 0.003	-0.00521	-63.043
uh	 19492	  2239	 0.506	 0.058	-0.44826	-88.560
vb	666095	641714	17.297	16.597	-0.70029	 -4.049
wh	 40162	 41807	 1.043	 1.081	 0.03834	  3.676
xx	 24544	     0	 0.637	 0.000	-0.63736	-100.000
zz	   167	   201	 0.004	 0.005	 0.00086	 19.874
Tot	3850873	3866456

* Tag counts for which there is reason to expect systematic errors

Legend
POS = Part of speech (see this previous post or this list for explanations)
Ref. = Number of occurrences in reference data
Test = Number of occurrences in output
Ref % = Percentage of reference data tagged with this POS
Test % = Percentage of output tagged with this POS
Test Dif = Test % minus Ref %, in percentage points
Test Err % = Percent error in output frequency relative to reference data, i.e., Test Dif as a percentage of Ref % (see the sketch below)
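
In case those last two columns are ambiguous, here’s the arithmetic as a minimal sketch. Note that the tabulated Test Err % values follow from the frequency percentages rather than the raw counts (e.g., ‘np’ comes out at 69.709 rather than the 70.4 the raw counts alone would give).

```python
# How the derived columns are computed, sketched for two rows; counts and
# totals are taken straight from Table 1.
ref_counts  = {"np": 91933, "cc": 243720}     # occurrences in the reference data
test_counts = {"np": 156650, "cc": 165546}    # occurrences in Stanford's output
ref_total, test_total = 3850873, 3866456

for pos in ref_counts:
    ref_pct  = 100.0 * ref_counts[pos] / ref_total
    test_pct = 100.0 * test_counts[pos] / test_total
    test_dif = test_pct - ref_pct                  # Test Dif: percentage-point gap
    test_err = 100.0 * test_dif / ref_pct          # Test Err %: gap relative to Ref %
    print(f"{pos}\t{ref_pct:.3f}\t{test_pct:.3f}\t{test_dif:+.5f}\t{test_err:+.3f}")
# np    2.387   4.052   +1.66419   +69.709
# cc    6.329   4.282   -2.04736   -32.349
```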

Pictures

And then the graphs (click for large versions). Note: These graphs are corrected for the xx/av problem discussed above; ‘xx’ tags in the reference data have been rolled into ‘av’ here.

Figure 1: Percentage point errors in POS frequency relative to reference data

 

[Image: ST Errors.png]

 

Figure 2: Percent error by POS type relative to reference data

 

[Image: ST Errors Bubble.png]

Notes on Figure 2:

  1. As before, the size of each bubble represents the relative frequency of that POS tag in the reference data. Bigger bubbles mean more frequent tags.
  2. The x-axis is logarithmic, so the differences are both more (‘uh’, ‘sy’) and less (‘pu’, ‘nn’) dramatic than they appear (a rough plotting sketch follows these notes).

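For anyone who wants to reproduce that kind of plot, here’s a rough matplotlib sketch, not the code behind Figure 2: only a handful of tags are included, the bubble scaling is arbitrary, and since a log axis can’t show negative values the errors are plotted as absolute magnitudes.

```python
# Rough sketch of a Figure-2-style bubble chart; the values come from Table 1,
# but the layout and scaling here are illustrative, not the original plot.
import matplotlib.pyplot as plt

pos_tags = ["pu", "nn", "vb", "cc", "np", "uh"]                # subset of the tagset
err_pct  = [0.962, 2.752, -4.049, -32.349, 69.709, -88.560]    # Test Err % column
ref_pct  = [16.431, 14.680, 17.297, 6.329, 2.387, 0.506]       # Ref % drives bubble size

fig, ax = plt.subplots()
ax.scatter(
    [abs(e) for e in err_pct],      # log axes can't take negatives, so plot magnitudes
    range(len(pos_tags)),
    s=[100 * p for p in ref_pct],   # bubble area scales with share of the reference data
    alpha=0.5,
)
ax.set_xscale("log")
ax.set_yticks(range(len(pos_tags)))
ax.set_yticklabels(pos_tags)
ax.set_xlabel("Percent error relative to reference data (log scale)")
plt.show()
```
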
Discussion

First, there are a few POS types that we know will be off, since there’s not a straightforward conversion for them between the NUPOS and Penn tagsets. These are: dt, jj, pc, and pp. Adjectives (jj) are the only ones that are a bit of a disappointment, since that’s a class in which I have some interest. On the plus side, though, this is a problem connected to the ‘pdt’ tag in Penn, and there are only about 4,000 occurrences of it in Stanford’s output, compared to 235,000+ occurrences of ‘jj.’ Even if half the ‘pdt’ tags should be ‘jj’ rather than ‘dt’, that’s still less than a 1% contribution to the overall error rate for ‘jj’.
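
Spelled out, that back-of-the-envelope estimate (using the rounded ‘pdt’ figure quoted above) looks like this:

```python
# Worst case for 'jj': suppose half of Stanford's ~4,000 'pdt' tags really
# belong under 'jj' rather than 'dt'.
pdt_in_output = 4000                 # approximate 'pdt' count quoted above
jj_in_output  = 235856               # 'jj' count from Table 1
misassigned   = pdt_in_output / 2
print(misassigned / jj_in_output)    # ~0.0085, i.e. under 1% of the 'jj' total
```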

That said, what sticks out here? Well, the numbers are a lot worse than those for MorphAdorner, for which the worst cases (proper nouns, nouns, and punctuation) were 0.1 – 0.2 percentage points off, compared to 1 – 2 here. So we have results that are about an order of magnitude worse. And in MorphAdorner’s case, the nouns overall may not have been as bad as they look, since proper nouns were undercounted, while common nouns were overcounted by a roughly offsetting amount. For Stanford, though, both common nouns and proper nouns are overcounted, so you can’t get rid of the error by lumping them together.

Similarly, relative error percentages for most tag types are much higher. In Figure 2, the main cluster of values falls between 1% and about 50%; for MorphAdorner, it fell between 0.1% and 1.0%. Nouns, verbs, adjectives, adverbs, and pronouns (all major components of the data set) are off by 3% to 10% or more.

The real question, though, is what to make of all this. How many of the errors are merely “errors,” i.e., differences between the way Stanford does things and the way MorphAdorner does them? What I’m interested in, ultimately, is a tagger that’s reliable across a historically diverse corpus; I don’t especially care if it undercounts nouns, for instance, so long as it undercounts them all the time, everywhere, to the same extent. But in the absence of literary reference data not linked to MorphAdorner, and without the ability to train the Stanford tagger on MorphAdorner’s reference corpus, it’s hard for me to assess accuracy other than by standard cross-validation and this bag of tags method.

Takeaway point: I don’t see any compelling reason to keep Stanford in the running at this point.
