With apologies for the delay (I was gone last week at the Modernist Studies meeting), here are the very earliest accuracy results for the part-of-speech taggers. Or, well, for one—and it’s far from complete (again, this blog = lab notes). More to follow in the next few days, though I surely won’t wrap things up until after Thanksgiving.
I have training data from MorphAdorner/MONK, a hand-tagged corpus of mostly nineteenth-century, mostly British fiction that looks quite a bit like my testing corpus (but is not identical to it). It runs to just over 3.8 million words.
Running MorphAdorner back over this training data gives an error rate of about 1.9%. That number is exceedingly, but maybe not unexpectedly, good. It also requires a bit of specification.
First, MorphAdorner changes some of the training input data in ways that make a fully direct comparison difficult. Specifically—and this is the intended behavior, AFAIK—it separates punctuation from adjoining words. This is how the training data works, too, except in a few cases such as apostrophes in certain contractions and in (all?) plural possessives, or the period(s) in initials. So, for instance, days’ time (two tokens) in the training data becomes days ‘ time (three tokens) in the output. This isn’t a big deal, and it happens to only about 0.2% of the input data. I’ve chosen to exclude these differences from my error calculations. (For the record, if you have a great deal of free time and want to check my math, there are 7366 such instances out of 3828883 tokens in the training set. Full data for all the accuracy testing will follow in a later post.)
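For anyone who does want to check that math, the share of excluded tokens works out as follows. This is just arithmetic on the figures quoted above, not part of any tagger:

```python
# Quick arithmetic check of the exclusion figures quoted in the post.
total_tokens = 3_828_883   # tokens in the MorphAdorner/MONK training set
retokenized  = 7_366       # tokens affected by punctuation re-splitting

excluded_share = retokenized / total_tokens
print(f"excluded from scoring: {excluded_share:.2%}")  # ~0.19%, i.e. "about 0.2%"
```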
Second, the 1.9% net error rate quoted above is the worst case scenario, since it counts any difference between training tags and computed tags as an error. MorphAdorner uses the NUPOS tagset (see both MorphAdorner’s info on their POS tagger and Martin Mueller’s very helpful paper on NUPOS [PDF].) NUPOS has
185 tags (updated from the 163 I originally reported; see the MONK wiki page on NUPOS). That’s a lot of tags. The granularity it affords is nice, but I suspect one might often be satisfied if one’s tagger gets the answer right at a significantly coarser level of granularity (say: is this word a noun or a verb?). It’s great that the extra detail is there, and Martin points out some of the benefits of both this approach and the specific features of NUPOS in his paper, but I’ll also be rerunning the accuracy tests with looser matching to see how MorphAdorner and the other taggers do under less demanding conditions. Also, some sort of grouping along these lines will be necessary in order to compare the taggers to one another, since they each use a different tagset (most of them meaningfully smaller than MorphAdorner’s).
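To make the “looser matching” idea concrete, here’s a rough sketch. The tag names and coarse classes in the mapping are made up for illustration only; they are not the actual NUPOS tagset, and a real run would need a full NUPOS-to-coarse lookup table:

```python
# Sketch of looser matching: collapse fine-grained tags into coarse word
# classes before comparing. The mapping below is an ILLUSTRATIVE stand-in,
# not real NUPOS; swap in a complete tagset-specific table for actual use.
COARSE = {
    "n1": "noun", "n2": "noun",    # e.g. singular / plural noun
    "vvd": "verb", "vvz": "verb",  # e.g. past-tense / 3rd-sing. verb
    "j": "adjective",
}

def loose_match(gold_tag: str, predicted_tag: str) -> bool:
    """Count a prediction as correct if it lands in the same coarse class.

    Tags missing from the table fall back to exact-string comparison.
    """
    return COARSE.get(gold_tag, gold_tag) == COARSE.get(predicted_tag, predicted_tag)

print(loose_match("n1", "n2"))   # True: both map to "noun"
print(loose_match("n1", "vvz"))  # False: noun vs. verb
```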
Third, though, this test is really too easy on MorphAdorner, since it’s only being asked to retag exactly the data on which it was trained. I’ll run some cross-validation tests shortly. MorphAdorner is the only tagger for which this kind of literary cross-validation is trivially easy to do; the others could do it on their own (non-literary) training data, or I could look into using the MorphAdorner data to train them. Not sure how much work that would involve; it seems easy, though at the very least it would involve converting between tagsets. Depending on how MorphAdorner does with cross-validation and how the others do with their more general training sets, this may or may not be worth the effort. I’d like to do all of this, but I’m also anxious to get started on real experiments.
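For what it’s worth, the cross-validation setup I have in mind looks roughly like this. Only the fold-splitting is shown; the tagger training and tagging are stubbed out, and `k_fold_indices` is my own illustrative helper, not anything from MorphAdorner:

```python
# Sketch of k-fold cross-validation over a hand-tagged corpus: split the
# sentences into k folds, train on k-1 of them, and evaluate on the held-out
# fold, rotating through all k. Tagger training itself is not shown.
import random

def k_fold_indices(n_sentences: int, k: int = 10, seed: int = 0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = list(range(n_sentences))
    random.Random(seed).shuffle(idx)       # fixed seed for reproducibility
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Every sentence appears in exactly one test fold; train/test never overlap.
for train, test in k_fold_indices(100, k=10):
    assert not set(train) & set(test)
    assert len(train) + len(test) == 100
```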
So … more results coming soon. Also: My Saturday night = HOTT.
5 thoughts on “Evaluating POS Taggers: Basic MorphAdorner Accuracy”
Hello. I know it’s been a few years since you posted your work on evaluating POS detection tools. I’m currently working on a project doing sentiment analysis on Twitter, and I was curious whether it would be possible to request your training lexicon or trained model file for MorphAdorner for Twitter. If you would rather not share, I can understand, particularly since Twitter has a rather backwards policy about redistributing tweets, but it would be nice. I would, after all, rather not spend the next week doing manual POS tagging on raw tweets.
Hi, Andrew. Sorry, I don’t have any training data for tweets; MorphAdorner is trained on literary texts from decades and centuries past. Sounds like that’s not of interest to you, but if I’m wrong about that, the people to contact are the MorphAdorner folks themselves at Northwestern.
So what you’re saying, particularly in your work comparing the POS tagging done by MorphAdorner versus TreeTagger, is that despite the rather messy nature of tweets, MorphAdorner still outperformed TreeTagger? I recall a few other blog entries you posted on the subject, and it’s of particular interest to me, as I’m doing a long-term sentiment analysis study on Twitter data, and the first step I’m looking at is tentative subjectivity detection via naive Bayesian POS analysis.
True about MorphAdorner outperforming TreeTagger in accuracy, but no tweets were involved. My data set is 18th- and 19th-century fiction. Nothing about Twitter.
FWIW, a Twitter-specific NLP/POS tagger from Carnegie Mellon: http://www.ark.cs.cmu.edu/TweetNLP/