“Damn,” you may be saying, “is he still on about this stuff?” Yes, yes I am.
So … MorphAdorner is easy to retrain/validate, because I have access to the tagged, formatted training data. This is what I reported in my previous post on cross-validation.
Now I want to see how much work would be involved in trying to train the other taggers on MorphAdorner’s data. I’m interested in doing this because MorphAdorner is the only one of the bunch that’s trained on literary texts rather than general contemporary sources (often a standard corpus from the 1989 Wall Street Journal). If it’s easy enough, I’ll retrain any/all of the other taggers to see if they produce better results (and if so, how much better) before I compare them to MorphAdorner.
A related note: It will be especially nifty if the other taggers can use an arbitrary tagset. Or, well, not arbitrary. NUPOS. NUPOS is good, for reasons mentioned earlier. Plus, that would make direct use of the MorphAdorner training data much easier, and it would make a direct comparison of outputs easier as well. If this tagset agnosticism isn’t built into the other taggers, I’ll have to convert between tagsets.
The rest of this post, then, is my notes on what’s involved in training the other taggers, with an eye specifically toward using MorphAdorner’s data/tagset.
The MorphAdorner Training Data
The training data I have from MorphAdorner has a pretty simple four-column, tab-delimited format. It's just

word	pos-tag	lemma	std-spelling

And sentences are followed by a blank line, which makes it easy to pick them out if necessary. A snippet:

She	pns31	she	she
had	vhd	have	had
always	av	always	always
wanted	vvn	want	wanted
to	pc-acp	to	to
do	vdi	do	do
[...]
So it should be pretty easy to transform it into whatever (text-based) form might be required for the other taggers.
LingPipe is a bit of a problem. To train it on a new corpus, I'd need to write a moderate amount of Java code, plus munge the training data a bit. There are examples provided for analogous cases (assuming proper input), and the documentation is good, but this would probably be a week-long project. I'd be willing to invest that amount of time if I had reason to believe the results would be superior, but here's the thing: LingPipe uses a Hidden Markov Model, as does MorphAdorner. It's not that much faster than MorphAdorner (about 2x, but that's likely to drop over a larger corpus, since MorphAdorner has higher one-time startup overhead than LingPipe). And it's licensed in a way that probably makes it unusable with nonredistributable corpora. So do I want to spend a week, however edifying it might be, evaluating whether LingPipe might be meaningfully superior to MorphAdorner, when there's good reason to expect that it will not be? I'm going with no.
What I will do, though, is compare its default output (using the Brown training data) to MorphAdorner’s over the training corpus to get a sense of its accuracy using the built-in model(s). That’s just a matter of working up a set of translations between NUPOS and LingPipe’s tagsets, plus some XML-munging Perl-fu.
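For a sense of what that comparison amounts to, here's a hedged Python sketch. The mapping entries below are illustrative placeholders, not a real NUPOS-to-Brown translation table (working that table out is exactly the chore described above); the function just maps both tagsets into a shared coarse set and counts token-level agreement, skipping tokens whose gold tag has no mapping yet.

```python
# Illustrative fragments only -- a real table would cover both tagsets fully.
NUPOS_TO_COARSE = {"pns31": "PRON", "vhd": "VERB", "av": "ADV"}
BROWN_TO_COARSE = {"PPS": "PRON", "VBD": "VERB", "RB": "ADV", "NN": "NOUN"}

def agreement(gold_tags, test_tags, gold_map, test_map):
    """Fraction of tokens whose mapped tags agree.
    Tokens whose gold tag isn't in the mapping are counted as disagreements,
    so gaps in the translation table show up rather than hide."""
    assert len(gold_tags) == len(test_tags)
    hits = sum(
        1
        for g, t in zip(gold_tags, test_tags)
        if gold_map.get(g) is not None and gold_map.get(g) == test_map.get(t)
    )
    return hits / len(gold_tags)
```

The coarse common tagset inevitably flattens distinctions NUPOS makes, so a comparison like this gives a floor on disagreement, not an exact accuracy figure.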
Stanford looks easier, and it's a more interesting case, since it uses a different algorithm (maximum entropy rather than a Hidden Markov Model). I just need to transform the training data into the expected form and have at it. (Incidentally, maybe I can just set the tag delimiter to [tab]? Trivial, but investigate.) If this turns out to be as easy as expected (ha!), I'll run full cross-validation. All this probably using left3words only. I want no further part of the slow-as-molasses-in-January bidirectional version, though I suppose I should look into it for the sake of completeness.
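Assuming the expected form is one sentence per line with word/tag pairs joined by a delimiter (underscore by default; whether a tab would work is the open question just mentioned), the transformation is about as small as code gets. A sketch, taking sentences as lists of (word, pos, lemma, std-spelling) tuples from the MorphAdorner data:

```python
def to_stanford(sentences, delim="_"):
    """Emit one sentence per line, each token as word<delim>pos.
    `sentences` is a list of sentences, each a list of tuples whose
    first two fields are (word, pos); the lemma and standard-spelling
    columns are simply dropped for tagger training."""
    return "\n".join(
        " ".join(f"{word}{delim}{pos}" for word, pos, *_ in sent)
        for sent in sentences
    )
```

Since NUPOS tags can contain hyphens (pc-acp above), it's worth double-checking that whatever delimiter gets used never occurs inside a tag or a word.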
As above, I’ll also compare the retrained results to those already obtained using default training data.
TreeTagger also looks very easy to train using MorphAdorner's data, so expect the same cross-validation and default comparisons there.
OK, so Stanford and TreeTagger will get full retraining and cross-validation (as MorphAdorner did before them), plus comparison of their default output to retrained cases. LingPipe will get only the latter.
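For reference, the fold-splitting step behind that cross-validation can be sketched like so (the fold count and the round-robin assignment of sentences to folds are my assumptions for illustration, not necessarily the exact procedure used for MorphAdorner earlier):

```python
def k_folds(sentences, k=10):
    """Split sentences into k folds; return (train, held_out) pairs,
    one per fold, so each sentence is held out exactly once."""
    folds = [[] for _ in range(k)]
    for i, sent in enumerate(sentences):
        folds[i % k].append(sent)  # round-robin assignment
    return [
        ([s for j, fold in enumerate(folds) if j != i for s in fold], folds[i])
        for i in range(k)
    ]
```

Splitting at the sentence level rather than the token level matters here, since both the HMM and maximum-entropy taggers condition on sentence-internal context.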
Will write up (briefly) what’s involved in doing the retraining as I get into it. More in the next couple of days as results are available.
One thought on “Evaluating POS Taggers: Training the Other Taggers”
I’m very glad you’re finding our doc clear! The other thing you can do to train LingPipe is just convert your data into some format LingPipe already understands, like the Brown or MEDPOST formats. It’s not whitespace sensitive for POS tagging. In fact, if you send me the MorphAdorner training data, I can write the data parser directly and send it back to you so you can do the evals (the joys of being the API author with a stack of tools up his sleeve rather than a new user!).
All of the POS taggers out there are going to be in the same range of accuracy on previously seen tokens, which will dominate counts. There’ll be substantial differences on unseen tokens, which is where our HMM is different than a standard one with character-language-model-based emissions. TnT uses suffixes. I don’t know anything (yet) about MorphAdorner. The systems using even more elaborate subword features like CRFs will do even better on unseen tokens, but be even slower to evaluate (though see implementations of CRFs like Carafe from MITRE, which are very fast, because they use lots of caching).
The bidirectional tagging stuff from Stanford is very cool theoretically and practically useful for extracting that last bit of accuracy from tagging tasks like named-entity extraction, but I can see where it might not be so practical as currently written.
I’m about to roll out some EM-based semi-supervised training for LingPipe, which I hope will help with unseen word issues by seeing way more words. But that’ll be slow and an art form to train properly.