In my last post, I outlined an approach to assessing tagger accuracy by measuring the relative frequencies of a reduced set of POS tags in a tagger’s output and comparing them to the frequencies in the training data I have on hand. Here’s the first set of results, from MorphAdorner.
A couple of notes. First, the procedure for MorphAdorner differs slightly from the one I’ll use for the other taggers. In order to mitigate some of the advantage it has thanks to working with its own training data, I’ll be using the output of its cross-validation runs, rather than that of a stock run over the full corpus. That means it’s not making use of the extended lexicon built into the default distribution, nor of any other fixups that may have been applied by hand after the default training (neither of which, of course, is available to the other taggers either). I think MorphAdorner still has an advantage, but this should reduce it slightly (in cross-validation, lack of the extended lexicon reduced overall accuracy from 97.7% to 97.1%, both of which were down from 98.1% on a run of the default distribution back over the training data).
A listing of the reduced tagset is included in the last post. You can also download two versions of the translation table from MorphAdorner-ese (i.e., NUPOS) to this reduced set. The first covers only the 238 tags currently defined by NUPOS, and is derived directly from a dump of the tag statistics and help information in MONK. The second includes all 261 tags actually used in the MorphAdorner training data. Two points to note: (1.) These translations use the syntactic part of speech assigned to each tag, not the word class to which it belongs. So when a noun is used as an adjective, it’s translated to ‘jj’ not ‘nn’. (2.) Of the 23 tags that appear in the training data but not in the standard tagset, almost all are punctuation. In fact the only ones that I haven’t translated as punctuation are various iterations of asterisks, to which I’ve assigned ‘zz’ (unknown-other), though I could see the case for making them either punctuation or symbols. It doesn’t seem to have made much difference, as ‘zz’ remains underrepresented in the output, while ‘sy’ and ‘pu’ are overrepresented. Go figure.
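To make the translation concrete, here’s a minimal sketch of how such a lookup might work. The handful of NUPOS tags and mappings shown are illustrative examples only, not entries copied from the actual table, which covers all 261 tags in the training data:

```python
# Sketch of a NUPOS -> reduced-tagset lookup. Entries are illustrative;
# the real table maps every NUPOS tag in the training data.
REDUCED_TAG = {
    "n1": "nn",   # singular noun
    "n2": "nn",   # plural noun
    "cc": "cc",   # coordinating conjunction
    "av": "av",   # adverb
    "j": "jj",    # adjective
}

def reduce_tag(nupos_tag):
    """Translate a NUPOS tag to the reduced set, falling back to 'zz'
    (unknown/other) for anything not in the table, as with the stray
    asterisk tags mentioned above."""
    return REDUCED_TAG.get(nupos_tag, "zz")
```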
First, the raw numbers:
Table 1: POS frequency in MorphAdorner training data and cross-validation output
POS   Train      Cross      Tr %     Cr %     Cr Dif     Cr Err %
av    213765     212662     5.551    5.512    -0.03909   -0.516
cc    243720     240704     6.329    6.239    -0.09015   -1.237
dt    313671     313111     8.145    8.116    -0.02993   -0.179
fw    4117       4076       0.107    0.106    -0.00126   -0.996
jj    210224     210454     5.459    5.455    -0.00437   0.109
nn    565304     571418     14.680   14.811   0.13069    1.082
np    91933      84632      2.387    2.194    -0.19375   -7.942
nu    24440      24776      0.635    0.642    0.00751    1.375
pc    54518      54200      1.416    1.405    -0.01092   -0.583
pp    323212     325560     8.393    8.438    0.04498    0.726
pr    422442     422557     10.970   10.952   -0.01778   0.027
pu    632749     640074     16.431   16.590   0.15877    1.158
sy    318        1282       0.008    0.033    0.02497    303.145
uh    19492      20836      0.506    0.540    0.03388    6.895
vb    666095     666822     17.297   17.283   -0.01388   0.109
wh    40162      40316      1.043    1.045    0.00202    0.383
xx    24544      24596      0.637    0.638    0.00014    0.212
zz    167        97         0.004    0.003    -0.00182   -41.916
---------------------------------------------------------------
Tot   3850873    3858173
POS = Part of speech (see previous post for explanations)
Train = Number of occurrences in training data
Cross = Number of occurrences in cross-validation output
Tr % = Percentage of training data tagged with this POS
Cr % = Percentage of cross-validation data tagged with this POS
Cr Dif = Difference in percentage points between Tr % and Cr %
Cr Err % = Percent error in the cross-validation count relative to the training count, i.e. (Cross − Train) / Train × 100
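The derived columns can be recomputed directly from the raw counts; a quick check in Python against the ‘np’ row of Table 1:

```python
# Recompute the derived columns of Table 1 for one tag ('np'),
# using the raw counts from the table above.
train_total = 3850873
cross_total = 3858173
train_np = 91933
cross_np = 84632

tr_pct = train_np / train_total * 100            # Tr %
cr_pct = cross_np / cross_total * 100            # Cr %
cr_dif = cr_pct - tr_pct                         # Cr Dif (percentage points)
cr_err = (cross_np - train_np) / train_np * 100  # Cr Err % (on raw counts)

print(round(tr_pct, 3), round(cr_pct, 3))  # 2.387 2.194
print(round(cr_dif, 5))                    # -0.19375
print(round(cr_err, 3))                    # -7.942
```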
That’s a bit hard to read, and is better represented graphically (click for full-size images):
Figure 1: Percentage point errors in POS frequency relative to the training data
Figure 2: Percent error by POS type relative to the training data
Notes on Figure 2:
- The size of each bubble represents the relative frequency of that POS tag in the training data. Bigger bubbles mean more frequent tags.
- The x-axis is logarithmic, so the differences are both more (‘zz’, ‘sy’) and less (‘pr’, ‘nn’) dramatic than they appear.
Just a couple of points, really.
- Things look pretty good overall. With the exception of a few (relatively low-frequency) outliers, most tags are recognized with around 99% accuracy or better. So I could have pretty good confidence were I to compare, say, adjective frequency in different contexts.
- It looks, though, like there’s systematic undercounting of proper nouns and conjunctions and, to a lesser extent, adverbs and determiners. Proper nouns will probably always be hard, since they account for a disproportionate share of unknown tokens (i.e., tokens not present in the training data), on which accuracy generally drops to 70-90%; proper nouns may fare a bit better, since orthography (capitalization) offers a clue.
- Nouns are overcounted, reducing the proper-noun problem if I were to lump the two together. Punctuation is overcounted, but I find it hard to imagine a situation in which I would care. Prepositions are slightly overcounted.
- Symbols (303% error) and unknown/other tokens (42%) are totally unreliable. They don’t contribute much to the overall error rate, because there aren’t very many of them (hence the nearly invisible bubbles in Figure 2), which is likely also why they’re so unreliable in the first place. But you wouldn’t want to make much of any variation you might see in those frequencies. Same goes for interjections, which hover around 7% error rate. [Note: I’ll look into the ‘sy’ and ‘zz’ cases, since it could be something systematically askew, but also might just be small-sample effects.]
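The sensitivity of the small categories falls out of the arithmetic: with only a few hundred tokens in the denominator, a modest absolute miscount becomes an enormous percent error. A quick illustration using the ‘sy’ and ‘vb’ counts from Table 1:

```python
def pct_err(train, cross):
    """Percent error of the cross-validation count relative to training."""
    return (cross - train) / train * 100

# The same absolute overcount (~964 tokens) that yields the 303% error
# against the 318 symbol tokens would barely register against the
# 666,095 verb tokens.
print(round(pct_err(318, 1282), 3))             # 303.145
print(round(pct_err(666095, 666095 + 964), 3))  # 0.145
```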
Anyway, this is useful information on its own, and I hope it will be even more so once I have analogous data for the other taggers. That data should be ready in the next day or two, so more to follow shortly.