“Damn,” you may be saying, “is he still on about this stuff?” Yes, yes I am.
So … MorphAdorner is easy to retrain/validate, because I have access to the tagged, formatted training data. This is what I reported in my previous post on cross-validation.
Now I want to see how much work would be involved in trying to train the other taggers on MorphAdorner’s data. I’m interested in doing this because MorphAdorner is the only one of the bunch that’s trained on literary texts rather than general contemporary sources (often a standard corpus from the 1989 Wall Street Journal). If it’s easy enough, I’ll retrain any/all of the other taggers to see if they produce better results (and if so, how much better) before I compare them to MorphAdorner.
A related note: It will be especially nifty if the other taggers can use an arbitrary tagset. Or, well, not arbitrary. NUPOS. NUPOS is good, for reasons mentioned earlier. Plus, that would make direct use of the MorphAdorner training data much easier, and it would make a direct comparison of outputs easier as well. If this tagset agnosticism isn’t built into the other taggers, I’ll have to convert between tagsets.
The rest of this post, then, is my notes on what’s involved in training the other taggers, with an eye specifically toward using MorphAdorner’s data/tagset.
The MorphAdorner Training Data
The training data I have from MorphAdorner has a pretty simple four-column, tab-delimited format: word, pos-tag, lemma, std-spelling. Sentences are followed by a blank line, which makes it easy to pick them out when necessary. A snippet:
She      pns31   she     she
had      vhd     have    had
always   av      always  always
wanted   vvn     want    wanted
to       pc-acp  to      to
do       vdi     do      do
[...]
So it should be pretty easy to transform it into whatever (text-based) form might be required for the other taggers.
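To make that concrete, here's a minimal sketch of a reader for the four-column format described above. The function name and file path are my own placeholders; the only assumptions baked in are the ones stated earlier: tab-delimited columns and blank lines between sentences.

```python
def read_sentences(path):
    """Yield sentences as lists of (word, pos, lemma, std_spelling) tuples
    from a tab-delimited, blank-line-separated training file."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:              # blank line marks a sentence boundary
                if sentence:
                    yield sentence
                    sentence = []
                continue
            word, pos, lemma, std = line.split("\t")
            sentence.append((word, pos, lemma, std))
    if sentence:                      # in case the file lacks a trailing blank line
        yield sentence
```

From a structure like that, emitting whatever text format another tagger wants is a small formatting step.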
LingPipe is a bit of a problem. To train it on a new corpus, I’d need to write a moderate amount of Java code, plus munge the training data a bit. There are examples provided for analogous cases (assuming proper input), and the documentation is good, but this would probably be a week-long project. I’d be willing to invest that amount of time if I had reason to believe the results would be superior, but here’s the thing: LingPipe uses a Hidden Markov Model, as does MorphAdorner. It’s not that much faster than MorphAdorner (about 2x, but that advantage is likely to shrink over a larger corpus, since MorphAdorner has higher one-time startup overhead than LingPipe). And it’s licensed in a way that probably makes it unusable with nonredistributable corpora. So do I want to spend a week, however edifying it might be, evaluating whether LingPipe might be meaningfully superior to MorphAdorner, when there’s good reason to expect that it will not be? I’m going with no.
What I will do, though, is compare its default output (using the Brown training data) to MorphAdorner’s over the training corpus to get a sense of its accuracy using the built-in model(s). That’s just a matter of working up a set of translations between NUPOS and LingPipe’s tagsets, plus some XML-munging Perl-fu.
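The token-level comparison itself is simple once the tagsets are aligned. A sketch, in Python rather than Perl for legibility: the NUPOS-to-Brown entries below are purely illustrative stand-ins, not a real crosswalk, and a usable table would need an entry for every NUPOS tag.

```python
# Made-up mapping fragment for illustration only; the real NUPOS-to-Brown
# correspondence still has to be worked out tag by tag.
NUPOS_TO_BROWN = {
    "pns31": "PPS",   # illustrative guess
    "vhd":   "HVD",   # illustrative guess
    "av":    "RB",    # illustrative guess
}

def agreement(gold, predicted, mapping):
    """Fraction of tokens where the mapped gold tag matches the predicted tag.

    gold and predicted are parallel lists of (word, tag) pairs over the
    same tokenization."""
    assert len(gold) == len(predicted)
    hits = sum(
        1
        for (_, g), (_, p) in zip(gold, predicted)
        if mapping.get(g) == p
    )
    return hits / len(gold)
```

The hard part, of course, is the mapping table and the XML munging, not this arithmetic.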
Stanford looks easier, and it’s a more interesting case, since it uses a different algorithm (maximum entropy rather than a Hidden Markov Model). Just need to transform the training data into the expected form and have at it. (Incidentally, maybe I can just set the tag delimiter to [tab]? Trivial, but worth investigating.) If this turns out to be as easy as expected (ha!), I’ll run full cross-validation. All this probably using left3words only. I want no further part of the slow-as-molasses-in-January bidirectional version, though I suppose I should look into it for the sake of completeness.
As above, I’ll also compare the retrained results to those already obtained using default training data.
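The transformation for Stanford should be a few lines. A sketch, assuming the usual one-sentence-per-line, word_tag input convention (the separator is configurable via the tagger's tagSeparator property, which is why the [tab] idea above might work directly); the function name and input structure are my own:

```python
def to_stanford(sentences, sep="_"):
    """Render sentences for the Stanford tagger's training reader.

    sentences: iterable of lists of (word, pos, lemma, std_spelling) tuples,
    i.e. the structure parsed out of the four-column training data.
    Returns one sentence per line, tokens as word<sep>tag."""
    lines = []
    for sent in sentences:
        lines.append(" ".join(f"{word}{sep}{pos}" for word, pos, *_ in sent))
    return "\n".join(lines)
```

The lemma and standard-spelling columns are simply dropped here, since the tagger trains on word/tag pairs alone.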
TreeTagger also looks very easy to train using MorphAdorner’s data, so expect the same cross-validation and default-output comparisons.
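As I understand TreeTagger's training setup, it wants (among other inputs) tagged training text with one word[tab]tag token per line, plus a lexicon listing each word's possible tag/lemma analyses; the exact file layout should be checked against its documentation before trusting this sketch. Conveniently, the four-column data carries the lemma needed for the lexicon:

```python
from collections import defaultdict

def to_treetagger(sentences):
    """Return (training_text, lexicon_text) from four-column sentences.

    Training text: one "word<TAB>tag" token per line.
    Lexicon: one line per word, followed by its "tag lemma" analyses."""
    lines = []
    lexicon = defaultdict(set)
    for sent in sentences:
        for word, pos, lemma, _std in sent:
            lines.append(f"{word}\t{pos}")
            lexicon[word].add((pos, lemma))
    lex_lines = [
        word + "\t" + "\t".join(f"{pos} {lemma}" for pos, lemma in sorted(analyses))
        for word, analyses in sorted(lexicon.items())
    ]
    return "\n".join(lines), "\n".join(lex_lines)
```

Ambiguous words fall out naturally: a word seen with several tags gets several analyses on its lexicon line.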
OK, so Stanford and TreeTagger will get full retraining and cross-validation (as MorphAdorner did before them), plus comparison of their default output to retrained cases. LingPipe will get only the latter.
Will write up (briefly) what’s involved in doing the retraining as I get into it. More in the next couple of days as results are available.