With apologies for the delay (I was gone last week at the Modernist Studies meeting), here are the very earliest accuracy results for the part-of-speech taggers. Or, well, for one—and it’s far from complete (again, this blog = lab notes). More to follow in the next few days, though I surely won’t wrap things up until after Thanksgiving.
I have training data from MorphAdorner/MONK, a hand-tagged corpus of mostly nineteenth century, mostly British fiction that looks quite a bit like my testing corpus (but is not identical to it). It runs to just over 3.8 million words.
Running MorphAdorner back over this training data gives an error rate of about 1.9%. That number is exceedingly, but maybe not unexpectedly, good. It also requires a bit of specification.
First, MorphAdorner changes some of the training input data in ways that make a fully direct comparison difficult. Specifically—and this is the intended behavior, AFAIK—it separates punctuation from adjoining words. This is how the training data works, too, except in a few cases such as apostrophes in certain contractions and in (all?) plural possessives, or the period(s) in initials. So, for instance, days’ time (two tokens) in the training data becomes days ‘ time (three tokens) in the output. This isn’t a big deal, and it happens to only about 0.2% of the input data. I’ve chosen to exclude these differences from my error calculations. (For the record, if you have a great deal of free time and want to check my math, there are 7366 such instances out of 3828883 tokens in the training set. Full data for all the accuracy testing will follow in a later post.)
Second, the 1.9% net error rate quoted above is the worst case scenario, since it counts any difference between training tags and computed tags as an error. MorphAdorner uses the NUPOS tagset (see both MorphAdorner’s info on their POS tagger and Martin Mueller’s very helpful paper on NUPOS [PDF].) NUPOS has
163 185 tags (updated; see the MONK wiki page on NUPOS). That’s a lot of tags. The granularity it affords is nice, but I suspect one might often be satisfied if one’s tagger gets the answer right at a significantly higher level of granularity (say, is this word a noun or a verb?). It’s great that the extra detail is there, and Martin points out some of the benefits of both this approach and the specific features of NUPOS in his paper, but I’ll also be rerunning the accuracy tests with looser matching to see how MorphAdorner and the other taggers do under less demanding conditions. Also, some sort of grouping along these lines will be necessary in order to compare the taggers to one another, since they each use a different tagset (most of them meaningfully smaller than MorphAdorner’s).
Third, though, this test is really too easy on MorphAdorner, since it’s only being asked to retag exactly the data on which it was trained. I’ll run some cross-validation tests shortly. MorphAdorner is the only tagger for which this kind of literary cross-validation is trivially easy to do; the others could do it on their own (non-literary) training data, or I could look into using the MorphAdorner data to train them. Not sure how much work that would involve; it seems easy, though at the very least it would involve converting between tagsets. Depending on how MorphAdorner does with cross-validation and how the others do with their more general training sets, this may or may not be worth the effort. I’d like to do all of this, but I’m also anxious to get started on real experiments.
So … more results coming soon. Also: My Saturday night = HOTT.