Evaluating POS Taggers: Accuracy and Bags of Tags

As I mentioned in my last post, one of the things that’s tricky about comparing the results of different taggers—or even, for instance, about comparing MorphAdorner to its own training data—is that the tokens involved can change, since each tagger tokenizes input data differently. Even if you’re not going to hold that against a given tagger directly (i.e., you’re not going to count token mismatches as errors in themselves), you still have the problem that the number of POS tags in the output will sometimes differ from the number in the input, even for single tokens, which in turn produces mismatches/errors (except where, as with MorphAdorner, single tokens can take multiple tags, so that it’s possible in principle for different tokenizations to produce equivalent series of tags).

(NB. Just to be clear, yes, you want tokenization to be as accurate as possible in general, but this is only a direct problem when comparing the output of a tagger to known-good data, as I’m doing at the moment. In the usual case, you’ll just accept whatever tokenization the package produces. There are valid arguments to be made for different treatments of some kinds of tokens, so accuracy is, as always, a matter of conformance to expectations rather than one of abstract correctness.)

One way to get around this problem, at least partially, is to consider the results not token-by-token, but in aggregate over an entire corpus. So the question is then not “Is this word tagged correctly as an adjective?” (since two taggers may not agree on what constitutes a word), but “How does the percentage of all words tagged as adjectives by this tagger compare to that of the training set?” This isn’t the same thing as cross-validation, though one could ask the question during a cross-validation procedure. Instead, what I’ll be doing is comparing the relative frequencies of the various parts of speech assigned by each tagger (running with its stock training data) to the relative frequencies observed in MorphAdorner’s training corpus. So there’s no need to train the taggers on MorphAdorner’s data, which is good, since I’ve already seen that I can’t manage such a thing for Stanford’s tagger on my hardware, and TreeTagger produced abysmal results (probably my fault, but I’ve run out of patience with it in any case).
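
To make that concrete, here’s a rough sketch of the comparison in Python, assuming each tagger’s output has been flattened to one tab-separated token-tag pair per line (the file names below are placeholders, not anything a particular tagger actually produces):

    from collections import Counter

    def tag_proportions(path):
        # Count how often each tag appears, then convert counts to proportions.
        counts = Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                if len(fields) >= 2:          # skip blank lines / sentence breaks
                    counts[fields[1]] += 1
        total = sum(counts.values())
        return {tag: n / total for tag, n in counts.items()}

    # Hypothetical file names; each holds one token<TAB>tag pair per line.
    gold = tag_proportions("training_corpus.tsv")
    out  = tag_proportions("tagger_output.tsv")
    for tag in sorted(set(gold) | set(out)):
        print(f"{tag}\t{gold.get(tag, 0):.4f}\t{out.get(tag, 0):.4f}")

Nothing fancy: two tag distributions, printed side by side so the over- and under-used categories stand out.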

The only other complication is that of course the taggers use different tagsets by default, so I’ll need to create a translation table for each one (I’ll sketch an example after the table below). I’m interested in pretty coarse-grained POS categories, specifically the following 18:

Table 1: POS Tags Used for Bag-of-Tags Accuracy Comparisons

av      adverb
cc      conjunction
dt      determiner
fw      foreign
jj      adjective
nn      noun
np      noun-proper
nu      number
pr      pronoun
pc      participle
pp      preposition
pu      punctuation
wh      wh-word
sy      symbol
uh      interjection
vb      verb
xx      negation
zz      unknown-other

Note in particular that I’m not dividing up verbs into classes like modals or “being” verbs, nor distinguishing them by tense or mood. The downside of this is a loss of detail, and if I were doing actual research work, I’d keep more classes. But it guarantees that every tag from every tagger will slot easily into one or another of these categories, and it’ll give me a feel for how the taggers stack up at the most basic level.
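
For a sense of what one of these translation tables might look like, here’s a sketch for the Penn Treebank tagset (the one the Stanford tagger’s English models use). The bucketing decisions are my own guesses rather than anything canonical, and anything unlisted falls through to zz:

    # Illustrative Penn-Treebank-to-coarse-category mapping; the bucketing
    # choices are guesses, not a standard. Note that PTB has no separate
    # negation tag ("not" is just RB), so xx would need a token-level check.
    PTB_TO_COARSE = {
        "JJ": "jj", "JJR": "jj", "JJS": "jj",
        "RB": "av", "RBR": "av", "RBS": "av", "RP": "av",
        "NN": "nn", "NNS": "nn",
        "NNP": "np", "NNPS": "np",
        "CD": "nu",
        "PRP": "pr", "PRP$": "pr",
        "DT": "dt", "PDT": "dt",
        "IN": "pp",                  # IN also covers subordinating conjunctions
        "TO": "pp",
        "CC": "cc",
        "MD": "vb", "VB": "vb", "VBD": "vb", "VBP": "vb", "VBZ": "vb",
        "VBG": "pc", "VBN": "pc",    # treating -ing/-en forms as participles
        "WDT": "wh", "WP": "wh", "WP$": "wh", "WRB": "wh",
        "FW": "fw", "UH": "uh", "SYM": "sy",
        ".": "pu", ",": "pu", ":": "pu", "``": "pu", "''": "pu",
        "-LRB-": "pu", "-RRB-": "pu",
    }

    def coarse(ptb_tag):
        return PTB_TO_COARSE.get(ptb_tag, "zz")   # anything unlisted -> unknown/other

Each tagger gets its own table along these lines; the point is just that every native tag lands in exactly one of the 18 buckets.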

Also, this will help answer a question I have about MorphAdorner’s accuracy, specifically how its “big picture” accuracy (determining whether a word is used as an adjective or an adverb or a noun, say) compares to the very fine-grained accuracy over its full tagset (which we’ve already seen is around 97% in cross-validation). One of the reasons to care about this is that more or less every tagger claims to be (and tests as) about 97% accurate, but typically over smaller tagsets and on non-literary data. Does this mean that MorphAdorner will do better when it’s not being asked to discriminate so finely? Or is it possible that, having been trained on such fine distinctions, it suffers on coarser ones? The second possibility doesn’t seem absurd to me; more categories mean fewer examples of each one in the training data, and it’s possible that MorphAdorner will therefore do worse than taggers trained on fewer, broader categories. Now, even if that’s true (and I don’t know yet what the outcome will be), it’s still possible that the difference would be small and/or that the advantages of the large tagset would outweigh whatever drop in coarse-grained accuracy might be observed. But it’s an interesting question, and one that I’d like to be able to answer.

Plus, it bears directly on the research in service of which I’m doing this whole evaluation of taggers: I want to examine POS frequency distributions in large corpora across historical periods. So I need to have a sense of how accurate the taggers are not on individual tokens, but in sum. True, accuracy in sum largely depends on getting individual tokens right, and yes, I’ll probably care at some point about things like bi- and tri-grams that are much more sensitive to the accuracy of individual tags. But still, big-picture, bag-of-tags accuracy matters to me.
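
For what it’s worth, one crude way to put a single number on “accuracy in sum” would be something like the total variation distance between the two tag distributions. A sketch, building on the proportion dicts from the earlier snippet (it’s just half the summed absolute differences):

    def distribution_gap(gold_props, tagger_props):
        # Total variation distance between two tag-proportion dicts:
        # 0.0 means identical distributions, 1.0 means completely disjoint.
        tags = set(gold_props) | set(tagger_props)
        return 0.5 * sum(abs(gold_props.get(t, 0.0) - tagger_props.get(t, 0.0))
                         for t in tags)

I’m not committing to that particular measure; it’s just the sort of summary statistic I have in mind.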

Coming next, then, the first results of this reduction procedure with MorphAdorner. Including graphs! It’ll be a hoot.
