After a couple days off, I’ve been trying to run cross-validation on the Stanford tagger. The good news is that it works, more or less; I can feed training data to it and get a trained model out, which I can then use to tag new text. The bad news is that the training process keeps crashing out of memory (given 3800 MB to work with) on data sets over about 1M words. I emailed the helpful folks at Stanford, who pointed out that my training data (from MorphAdorner), in addition to being significantly larger than what they normally use (about 4M words instead of 1M; 500+ word maximum sentence length instead of about half that), was using 420+ tags, which made training impossible under 4GB of physical memory.
420 tags was news to me; NUPOS contains about 180 valid tags. So I wrote a little script to list the tags used in the training data, and immediately discovered the problem. Recall that MorphAdorner training data usually looks like this:
Her     po31   she     her
mother  n1     mother  mother
had     vhd    have    had
died    vvn    die     died
too     av     too     too
long    av-j   long    long
ago     av     ago     ago
[...]
One word, one tag, one lemma, one standard spelling. As I’ve noted before, it also passes punctuation unmodified, like this:
, , , ,
***** ***** ***** *****
Not all such punctuation is part of the standard NUPOS tagset, of course (for the record, the only punctuation included in the “official” NUPOS tagset is “. ; ( ) ,”). So some of the extra tags in the training data are simply funky punctuation. But that doesn’t explain all (or even most) of the discrepancy (there are maybe 30-40 unique instances of such punctuation in the training data).
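A minimal sketch of the kind of tag-listing script mentioned above, in Python. It is not the script actually used; it assumes the training data has four whitespace-separated fields per line (word, tag, lemma, standard spelling), and the function name and the inline sample are hypothetical stand-ins for the real file.

```python
from collections import Counter


def count_tags(lines):
    """Count how often each value appears in the tag column (second field)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines (assumed 4-column format)
        counts[fields[1]] += 1
    return counts


# Hypothetical sample in the assumed format; real data would be read from a file.
sample = [
    "Her\tpo31\tshe\ther",
    "mother\tn1\tmother\tmother",
    ",\t,\t,\t,",
    "done's\tvvn|vbz\tdone|be\tdone's",
]

tag_counts = count_tags(sample)
for tag, n in tag_counts.most_common():
    print(tag, n)

# Anything containing "|" is a compound (two-in-one) tag.
compound = [tag for tag in tag_counts if "|" in tag]
print(len(tag_counts), "distinct tags,", len(compound), "compound")
```

Sorting by frequency makes rare oddities (stray punctuation, compound forms) easy to spot at the bottom of the list.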
So what gives? Turns out that MorphAdorner does something clever with contractions: it treats them as two words in one, and gives them two tags on a single input line. A couple of examples:
done's  vvn|vbz     done|be  done's
off's   a-acp|po31  off|he   off's
Note the “|” separator. I can’t remember if I’d noticed this before (it occurs about 22K times in 4M lines of training data, so it wouldn’t necessarily jump out in a visual inspection), but in any case I definitely didn’t do anything about it when I fed the training data to TreeTagger or Stanford. This surely accounts for some of the abysmal accuracy results I saw with TreeTagger, and it’s a major culprit in my memory woes with Stanford. To Stanford, the compound forms look like 197 additional tags, which roughly doubles the tagset size; since the training process involves a two-dimensional array of tags, that roughly quadruples the memory requirements. Coincidentally, that factor of four is about the ratio between the size of the full training dataset (3.8M words) and the largest sample I’ve used successfully to train Stanford (1M words).
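The "roughly quadruples" estimate can be checked with a little arithmetic. This sketch assumes only what is said above, namely that the dominant structure grows with the square of the tagset size; it uses the 197 compound tags and the 423-tag total reported elsewhere in this post, and the function name is made up for illustration.

```python
def pairwise_growth(base_tags, extra_tags):
    """Growth factor of a tags-by-tags table when extra tags are added."""
    before = base_tags ** 2
    after = (base_tags + extra_tags) ** 2
    return after / before


total_tags = 423       # distinct tag-like items in the training data
compound_tags = 197    # items containing the "|" separator
simple_tags = total_tags - compound_tags

growth = pairwise_growth(simple_tags, compound_tags)
print(f"{simple_tags} simple tags, growth factor {growth:.2f}")
```

With these figures the factor comes out somewhat under four, which is still in line with the rough estimate.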
The good news is that it’s probably largely fixable with some data munging. If I were maximally lazy, I could just kill all the lines in question. Or I could split them in two at the apostrophe and the vertical bar in their respective fields. Either way, relevant information would be lost and output quality would presumably drop, though not so much as at the moment, when I’m feeding the Stanford tagger what is effectively bad data. I’ll try the simple deletion method first, just to see if it allows me to run the trainer on a data set this large, then worry about handling things properly later.
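Both munging options can be sketched in a few lines of Python. This is an illustration under assumptions, not a tested pipeline: it assumes four whitespace-separated fields per line, the function names are hypothetical, and the split version only handles forms with an apostrophe inside the word, leaving anything else untouched.

```python
def drop_compound(lines):
    """Lazy option: discard every line whose tag field contains '|'."""
    for line in lines:
        fields = line.split()
        if len(fields) == 4 and "|" in fields[1]:
            continue
        yield line


def split_at_apostrophe(token):
    """Split "done's" into ["done", "'s"]; return [token] if no inner apostrophe."""
    for mark in ("'", "\u2019"):
        head, sep, tail = token.partition(mark)
        if sep and head:
            return [head, sep + tail]
    return [token]


def split_compound(lines):
    """Split option: turn one contraction line into two ordinary lines."""
    for line in lines:
        fields = line.split()
        if len(fields) != 4 or "|" not in fields[1]:
            yield line
            continue
        word, tag, lemma, standard = fields
        parts = [
            split_at_apostrophe(word),
            tag.split("|"),
            lemma.split("|"),
            split_at_apostrophe(standard),
        ]
        if any(len(p) != 2 for p in parts):
            yield line  # cannot split cleanly; leave unchanged for review
            continue
        for row in zip(*parts):
            yield "\t".join(row)


# Hypothetical sample lines in the assumed format.
sample = [
    "Her\tpo31\tshe\ther",
    "done's\tvvn|vbz\tdone|be\tdone's",
    "off's\ta-acp|po31\toff|he\toff's",
]
dropped = list(drop_compound(sample))
split = list(split_compound(sample))
print(len(dropped), "lines after dropping;", len(split), "lines after splitting")
```

Lines that cannot be split cleanly are passed through unchanged, so they would still need deleting or hand-checking before training.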
[Update: Martin Mueller fills me in via email concerning the use of compound tags in the MorphAdorner training data:
Most modern taggers treat contracted forms as two tokens but take their cue from modern orthography, which uses an apostrophe as a contraction marker. MorphAdorner treats contracted forms as a single token for two reasons:
1. The orthographic practice reflects an underlying linguistic ‘reality’ that the tokenization should respect
2. In Early Modern English (as in Shaw’s orthographic reforms) contracted forms appear without apostrophes, as in ‘not’ for ‘knows not’ or ‘niltow’ for ‘wilt thou not’. It’s not obvious how to split these forms.
Contracted forms get two POS tags separated by a vertical bar, but with regard to forms like “don’t”, “cannot”, “ain’t”, MorphAdorner analyzes the forms as the negative form of a verb and does not treat the form as a contraction. It uses the symbol ‘x’ to mark a negative POS tag.
So the compound tags are in a very real sense “true” single tags that just happen to be made up of two otherwise-valid individual tags. The 423 tags in the MorphAdorner training data are thus to be treated as 423 unique features/tag-types, or at least that’s how MorphAdorner does it. This doesn’t solve my memory-use problem, but it’s good to know.]