After a couple days off, I’ve been trying to run cross-validation on the Stanford tagger. The good news is that it works, more or less; I can feed training data to it and get a trained model out, which I can then use to tag new text. The bad news is that the training process keeps crashing out of memory (given 3800 MB to work with) on data sets over about 1M words. I emailed the helpful folks at Stanford, who pointed out that my training data (from MorphAdorner), in addition to being significantly larger than what they normally use (about 4M words instead of 1M; 500+ word maximum sentence length instead of about half that), was using 420+ tags, which made training impossible under 4GB of physical memory.
420 tags was news to me; NUPOS contains about 180 valid tags. So I wrote a little script to list the tags used in the training data, and immediately discovered the problem. Recall that MorphAdorner training data usually looks like this:
Her po31 she her
mother n1 mother mother
had vhd have had
died vvn die died
too av too too
long av-j long long
ago av ago ago
[...]
One word, one tag, one lemma, one standard spelling. As I’ve noted before, it also passes punctuation unmodified, like this:
, , , ,
***** ***** ***** *****
Not all such punctuation is part of the standard NUPOS tagset, of course (for the record, the only punctuation included in the “official” NUPOS tagset is “. ; ( ) ,”). So some of the extra tags in the training data are simply funky punctuation. But that doesn’t explain all (or even most) of the discrepancy: there are maybe 30-40 unique instances of such punctuation in the training data.
So what gives? Turns out that MorphAdorner does something clever with contractions: it treats them as two words in one, and gives them two tags on a single input line. A couple of examples:
done's vvn|vbz done|be done's
off's a-acp|po31 off|he off's
Note the “|” separator. I can’t remember if I’d noticed this before (it occurs about 22K times in 4M lines of training data, so it wouldn’t necessarily jump out in a visual inspection), but in any case I definitely didn’t do anything about it when I fed the training data to TreeTagger or Stanford. This surely accounts for some of the abysmal accuracy results I saw with TreeTagger, and it’s a major culprit in my memory woes with Stanford. It adds 197 items that look like tags to Stanford, roughly doubling the tagset size, and since the training process involves a two-dimensional array of tags, it therefore roughly quadruples the memory requirements. Coincidentally, that happens to be about the difference in size between the full training dataset (3.8M words) and the largest sample I’ve used successfully to train Stanford (1M words).
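The quadrupling follows from simple arithmetic, assuming trainer memory scales with the square of the tagset size (the figures below are illustrative round numbers, not exact counts):

```python
# If the trainer holds a T x T array over tag pairs, memory grows with T**2.
# Doubling the apparent tagset (real tags plus ~197 compound tags) therefore
# roughly quadruples the memory the tag-pair array needs.
real_tags = 210        # illustrative: tags actually intended
apparent_tags = 420    # illustrative: tags as the Stanford trainer sees them

ratio = (apparent_tags / real_tags) ** 2
print(ratio)  # → 4.0
```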
The good news is that it’s probably largely fixable with some data munging. If I were maximally lazy, I could just kill all the lines in question. Or I could split them in two at the apostrophe and the vertical bar in their respective fields. Either way, relevant information would be lost and output quality would presumably drop, though not so much as at the moment, when I’m feeding the Stanford tagger what is effectively bad data. I’ll try the simple deletion method first, just to see if it allows me to run the trainer on a data set this large, then worry about handling things properly later.
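A minimal sketch of both munging options, assuming the four-column format. The split heuristic (break the word at its last apostrophe, the tag and lemma at their pipes) is my own guess at a plausible rule, not MorphAdorner's actual logic, and as noted it loses information either way:

```python
def munge(line, mode="delete"):
    """Handle a MorphAdorner compound-tag line (tag field contains '|').
    mode='delete' drops the line entirely; mode='split' breaks it into
    two ordinary word/tag/lemma/spelling lines."""
    fields = line.split()
    if len(fields) < 4 or "|" not in fields[1]:
        return [line]                        # ordinary line, pass through
    if mode == "delete":
        return []
    word, tag, lemma, _spelling = fields[:4]
    t1, t2 = tag.split("|", 1)
    l1, l2 = lemma.split("|", 1) if "|" in lemma else (lemma, lemma)
    if "'" in word:
        head, _, clitic = word.rpartition("'")
        w1, w2 = head, "'" + clitic          # keep the apostrophe on the clitic
    else:
        w1, w2 = word, word
    return [f"{w1} {t1} {l1} {w1}", f"{w2} {t2} {l2} {w2}"]

print(munge("done's vvn|vbz done|be done's", mode="split"))
```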
[Update: Martin Mueller fills me in via email concerning the use of compound tags in the MorphAdorner training data:
Most modern taggers treat contracted forms as two tokens but take their cue from modern orthography, which uses an apostrophe as a contraction marker. Morphadorner treats contracted forms as a single token for two reasons:
1. The orthographic practice reflects an underlying linguistic ‘reality’ that the tokenization should respect
2. In Early Modern English (as in Shaw’s orthographic reforms) contracted forms appear without apostrophes, as in ‘not’ for ‘knows not’ or ‘niltow’ for ‘wilt thou not’. It’s not obvious how to split these forms.
Contracted forms get two POS tags separated by a vertical bar, but with regard to forms like “don’t”, “cannot”, and “ain’t”, MorphAdorner analyzes the form as the negative form of a verb and does not treat it as a contraction. It uses the symbol ‘x’ to mark a negative POS tag.
So the compound tags are in a very real sense “true” single tags that just happen to be made up of two otherwise-valid individual tags. The 423 tags in the MorphAdorner training data are thus to be treated as 423 unique features/tag-types, or at least that’s how MorphAdorner does it. This doesn’t solve my memory-use problem, but it’s good to know.
One thought on “Evaluating POS Taggers: Stanford Memory Use, Training Data Issues”
This post highlights a couple of deep issues in system design and evaluation: tokenization and scalability.
Training data for taggers is by its nature pre-tokenized, because tags attach to tokens. But tokenization often isn’t deterministic; it depends on deeper linguistic analysis of compounds, clitics, sentence boundaries, etc.
Real text analysis will be affected by imperfect sentence-boundary detection. I wrote a blog post on this, the curse of intelligent tokenization. We had a rough time with the French Treebank on this front, as well as with the Penn Treebank, BioIE, Google’s language model data, etc.
Evaluations based on knowing the tokenization are in some sense cheating if the token discovery process is itself not 100% recreatable. In any case, you need to use a training-data-specific tokenizer at run time.
One of the things that constrained the LingPipe tagger design was the need to scale and to work out-of-the-box for different tasks (like named-entity extraction) and languages (like Chinese and Hindi).
HMMs are popular in applications because they scale very well. Standard HMM taggers can scale to huge datasets. LingPipe’s a little less scalable because of the character language model for emissions. But either way, you can prune label sequence counts and/or token emission counts and reduce run-time model size (or training-data memory requirements), often with minimal loss in accuracy.
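The pruning idea can be sketched in a few lines, assuming a simple tag-bigram count table of the kind an HMM trainer accumulates (the counts and threshold below are invented for illustration):

```python
from collections import Counter

# Toy tag-bigram (label sequence) counts, as an HMM trainer might build them.
bigram_counts = Counter({
    ("n1", "vvd"): 900,
    ("po31", "n1"): 850,
    ("av", "vvn"): 3,
    ("vvn|vbz", "av"): 1,   # rare compound-tag transition
})

def prune(counts, min_count=5):
    """Drop rare transitions before building the model: low-frequency
    events often cost a lot of memory for little accuracy."""
    return Counter({k: v for k, v in counts.items() if v >= min_count})

pruned = prune(bigram_counts)
print(len(bigram_counts), "->", len(pruned))  # 4 -> 2
```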
CRFs are going to be much more dependent on the number of features at training and run time. A standard workaround is feature selection: choosing a subset of the most relevant features (how to choose them is a black art) and using just those. Run time also depends directly on the number of features and how hard they are to extract.
Another of the constraints we felt compelled to impose on our design was being able to easily recreate the underlying character sequence. Most tokenizers used for POS will munge input sequences and ignore whitespace, leaving a lot of work on the reconstruction side. For instance, if you want to find noun phrases and highlight them in the text, it can be a painful process that requires access to the tokenizer.
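One common way to keep reconstruction cheap is to have the tokenizer emit character offsets alongside each token. A minimal sketch (this is a generic illustration, not LingPipe's actual API):

```python
import re

def tokenize_with_offsets(text):
    """Yield (token, start, end) triples so the original character
    sequence can be recovered exactly -- e.g. to highlight a tagged
    noun phrase in the source text without re-running the tokenizer."""
    for m in re.finditer(r"\S+", text):
        yield m.group(), m.start(), m.end()

text = "Her mother had died too long ago"
spans = list(tokenize_with_offsets(text))
# Highlighting needs only the offsets; whitespace is never thrown away:
assert all(text[s:e] == tok for tok, s, e in spans)
print(spans[:2])  # → [('Her', 0, 3), ('mother', 4, 10)]
```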
Thanks again for keeping up this design discussion.