As always, things aren’t as easy as it seems they should be. Specifically, TreeTagger—which looked like it was going to be very easy to retrain, hence also easy-ish to cross-validate—turns out to be mildly tricky. Tricky enough that I’m sick of dealing with it.
Why It Looks Easy
To train TreeTagger, you supply it with a lexicon (word, POS tag, lemma, plus additional POS-lemma pairs for words that can function as multiple parts of speech), a list of valid tags, a tab-delimited training file (again, word-POS-lemma), and a “sentence-end” tag (which replaces the POS tag for punctuation that ends a sentence; “SENT” by default). I have all of that data, except for the sentence-end tags (MorphAdorner tags punctuation as the mark itself, e.g., ‘!’ is tagged ‘!’ and so on, with no differentiation of sentence ends/breaks). But the SENT tag is configurable, and I could probably fudge it by setting it to ‘.’.
Why It Isn’t
The main problem is that TreeTagger insists that the tags used in the supplied tagset match exactly those used in the training data and in the lexical data, which (training and lex tags) must in turn match each other. “Well, duh,” you say, “it can only know how to use tags you tell it about.” True, but here’s the thing: It chokes on extra tags in the tagset or the lexicon, not just in the training data. So you can’t use one tagset or lexicon for all the training; you have to custom-generate a lexicon and a list of tags for every training set you want to use. You’d want to custom-generate the lexicon anyway, though it would be nice to be able to test against the enhanced lexicon supplied with MorphAdorner.
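To make the matching requirement concrete, here is a minimal sketch of a consistency check between the three tag sources. The function name and the returned keys are my own invention for illustration, not anything TreeTagger provides, and it assumes the tags have already been pulled out of each file.

```python
def tag_mismatches(training_tags, lexicon_tags, tagset):
    """Report tags that appear in one source but not in the training data,
    and vice versa. All names here are hypothetical helpers."""
    training_tags = set(training_tags)
    lexicon_tags = set(lexicon_tags)
    tagset = set(tagset)
    return {
        # extras that the training run reportedly chokes on
        "tagset_not_in_training": tagset - training_tags,
        "lexicon_not_in_training": lexicon_tags - training_tags,
        # tags used in training that the other files don't know about
        "training_not_in_tagset": training_tags - tagset,
        "training_not_in_lexicon": training_tags - lexicon_tags,
    }
```

If every set in the result is empty, the three sources agree with each other; anything left over is a tag to add or strip before training on that particular chunk.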
And then there’s the fact that, per the documentation, TreeTagger shouldn’t be trained on cardinal and ordinal numbers that contain digits, so you need to strip them from the training and lexical data, but keep them in cases where they’re spelled out. Plus, as mentioned above, MorphAdorner doesn’t have a generic punctuation tag: it passes punctuation through, tagged as the punctuation itself. And it groups adjacent punctuation marks together, most noticeably blocks of periods ‘…’, ‘..’, ‘…..’, etc. But chunks of random punctuation are of course not valid NUPOS tags, so you don’t really want them in the training or tag sets.
So here’s what would be required to run cross-validation, as far as I can tell, above and beyond the data-chunking-and-stripping scripts that I already have: a script that iterates line-by-line over the training data (recall: formatted “word-POS-lemma[-pos-lemma[-pos-lemma…]],” where the extra POS-lemma pairs are occasional features of which there may be arbitrarily many). For each line, it removes the line if it has a digit in the word and a POS tag of “crd” or “ord”, unless the line also has another POS tag, in which case only the offending POS-lemma pair is removed. It removes any line with a POS tag not in the full NUPOS set (in order to catch the multiple-punctuation business, plus any oddness that might have slipped into the training data by mistake). It then adds the (now guaranteed to be valid) POS tags from the line to an array (or probably a hash) of valid-and-existent tags in this particular training set, checking to make sure the tag it’s adding isn’t a duplicate. And then it regenerates the lexicon based on the resulting, reduced training set.
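A rough sketch of that script might look like the following. It assumes tab-delimited lines of the word-POS-lemma form described above, and it assumes the full NUPOS tag set is handed in by the caller as a set of strings. The function names (`clean_line`, `clean_training`) are hypothetical, and nothing here has been run against real MorphAdorner output.

```python
import re
from collections import defaultdict

NUMERIC_TAGS = {"crd", "ord"}      # cardinal and ordinal tags
HAS_DIGIT = re.compile(r"\d")

def clean_line(line, nupos_tags):
    """Return (word, [(pos, lemma), ...]), or None if the line should go."""
    fields = line.rstrip("\n").split("\t")
    # need a word followed by one or more complete POS-lemma pairs
    if len(fields) < 3 or len(fields) % 2 == 0:
        return None
    word = fields[0]
    pairs = list(zip(fields[1::2], fields[2::2]))
    # any tag outside the full NUPOS set kills the whole line
    if any(pos not in nupos_tags for pos, _ in pairs):
        return None
    # digit-bearing numbers lose their crd/ord pairs, keep any others
    if HAS_DIGIT.search(word):
        pairs = [(pos, lemma) for pos, lemma in pairs
                 if pos not in NUMERIC_TAGS]
    if not pairs:
        return None
    return word, pairs

def clean_training(lines, nupos_tags):
    """Filter the training lines; collect the tags and lexicon that remain."""
    kept = []
    seen_tags = set()              # a set handles the duplicate check
    lexicon = defaultdict(list)    # word -> unique (pos, lemma) pairs
    for line in lines:
        result = clean_line(line, nupos_tags)
        if result is None:
            continue
        word, pairs = result
        kept.append("\t".join([word] + [f for pair in pairs for f in pair]))
        for pair in pairs:
            seen_tags.add(pair[0])
            if pair not in lexicon[word]:
                lexicon[word].append(pair)
    return kept, seen_tags, dict(lexicon)
```

Run once per training chunk, this would give you the reduced training lines, the list of tags that actually occur in them, and a lexicon built from nothing but those lines, which is the custom-generated trio that each training run seems to need.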
And at that point it should be possible to train TreeTagger on MorphAdorner’s data, hence to cross-validate it. Maybe. Unless I’ve missed something. Which is likely.
I’m going to spend a few (more) hours banging my head against this, but only to the end of the day today, tops.