Evaluating POS Taggers: LingPipe Cross-Validation

Below are the promised cross-validation results for LingPipe. They were produced by LingPipe’s own test suite rather than by my own (cruder) methods, but there’s no reason to think they aren’t directly comparable to my earlier results for MorphAdorner.

So, without delay, the out-of-the-box numbers:

Accuracy: 96.9%
Accuracy on unknown tokens: 71%

That’s using 5-grams; compiled models are about 6 MB. With a cache and 8-grams (producing 16 MB models), results are about the same, though they vary a bit with beam width:

Beams   Acc    Unk    Speed (tokens/s)*
   10   .961   .683   29K
   14   .963   .694   28K
   20   .967   .699   27K
   28   .970   .699   18K
   40   .970   .699    5K

* See the note below on speed; my own speeds are a bit lower, because my machine is slower.
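For concreteness, the speed column is presumably just tagging throughput: tokens tagged divided by wall-clock time at each beam setting. Here’s a minimal sketch of that kind of measurement, with a generic Tagger interface standing in for the actual decoder; this is an illustration, not LingPipe’s API.

```java
import java.util.List;

// Hypothetical stand-in for whatever tagger is under test.
interface Tagger {
    List<String> tag(List<String> tokens);
}

final class ThroughputCheck {
    // Tag every sentence once and report tokens per second.
    static double tokensPerSecond(Tagger tagger, List<List<String>> sentences) {
        long tokenCount = 0;
        long start = System.nanoTime();
        for (List<String> sentence : sentences) {
            tagger.tag(sentence);            // output discarded; we only time it
            tokenCount += sentence.size();
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return tokenCount / seconds;
    }
}
```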

Note, as Bob pointed out in an email to me, that overall accuracy in this configuration is very slightly higher than the out-of-the-box run, but that it actually does a little bit worse on unknown tokens.
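To be explicit about what the two figures measure: overall accuracy counts correct tags over all tokens, while unknown-token accuracy restricts the count to tokens that never appeared in the training folds. A minimal sketch of that bookkeeping follows; the method and the Set-based vocabulary are my own illustration, not LingPipe’s evaluation classes.

```java
import java.util.List;
import java.util.Set;

final class TagAccuracy {
    // Returns { overall accuracy, accuracy on tokens absent from trainingVocab }.
    static double[] score(List<String> tokens, List<String> gold,
                          List<String> predicted, Set<String> trainingVocab) {
        int total = 0, correct = 0, unkTotal = 0, unkCorrect = 0;
        for (int i = 0; i < tokens.size(); i++) {
            boolean hit = gold.get(i).equals(predicted.get(i));
            total++;
            if (hit) correct++;
            if (!trainingVocab.contains(tokens.get(i))) {   // unknown token
                unkTotal++;
                if (hit) unkCorrect++;
            }
        }
        double unkAcc = unkTotal == 0 ? 0.0 : (double) unkCorrect / unkTotal;
        return new double[] { (double) correct / total, unkAcc };
    }
}
```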

For reference, recall that MorphAdorner was 97.1% accurate when restricted to a lexicon derived solely from the cross-validation data (with the potential benefit of running over training data produced in conjunction with one of its predecessors, hence likely to do a bit better on tokenization; Martin or Phil, correct me if I’m wrong about this). Unfortunately I don’t have figures for MorphAdorner’s performance on unknown tokens.

Takeaway point: This looks to me, for practical purposes, like a dead heat as far as accuracy goes.

Next up, a comparison of overall bag of tags statistics.

[Note: The numbers above are from Bob’s report to me. I’ve tried to rerun them on my machine, but have had trouble getting the cross-validation run to finish; it keeps dying at apparently random spots, after many successful fold iterations, with a Java error (not out of memory, but something about a variable being out of range). So I don’t have full numbers from my own trial. But the process goes more than far enough to suggest that the numbers above are reasonable and repeatable: I see very similar accuracy figures for each fold, over many folds and with different fold sizes. My speeds are lower since my machine is slower (about half the numbers given above, consistent with what I’ve seen from LingPipe on this computer in the past), but they show a similar pattern: roughly constant through 14 or 20 beams, slowing at 20 or 28, and down by 3x or 4x at 40, with accuracy leveling off at 20 or 28 beams, which looks to be the speed/accuracy sweet spot. In any case, I’m satisfied with the way things stand and don’t see much reason to look into this further.]
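For anyone unfamiliar with the setup, the fold bookkeeping is roughly this: split the tagged sentences into k folds, train on k - 1 of them, evaluate on the held-out fold, and average the per-fold accuracies. A minimal sketch, with a generic trainAndScore function standing in for whatever actually trains and evaluates a model (none of this is LingPipe’s API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

final class CrossValidation {
    // Round-robin assignment of sentences to k folds; train on k - 1 folds,
    // score on the held-out fold, and average the per-fold accuracies.
    static <S> double kFoldAccuracy(List<S> sentences, int k,
                                    BiFunction<List<S>, List<S>, Double> trainAndScore) {
        double sum = 0.0;
        for (int fold = 0; fold < k; fold++) {
            List<S> train = new ArrayList<>();
            List<S> test = new ArrayList<>();
            for (int i = 0; i < sentences.size(); i++) {
                (i % k == fold ? test : train).add(sentences.get(i));
            }
            sum += trainAndScore.apply(train, test);   // accuracy on this fold
        }
        return sum / k;
    }
}
```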

2 thoughts on “Evaluating POS Taggers: LingPipe Cross-Validation”

  1. Does your machine have error-correcting memory? We used to have problems for that reason with some of the desktop machines we were using; they’d throw segmentation faults in the JVM. For some reason I’ve never had problems with laptop memory. Could you send me a copy of the error dump?

    The results above assume the tokenization’s given. I didn’t have the actual underlying text, and I don’t have code to test part-of-speech tagging with tokenization errors. It should be easy to recreate the corpus’s tokenization if it’s regex-based (see the sketch after these comments).

    Tagging accuracy is just counting, so I’m guessing we’re all computing it correctly.

  2. Huh, yes, I’m using ECC RAM. And here I thought it would *correct* errors, not cause them. Will send the error dump from work tomorrow.

    (Interesting about the ECC issue – I’ve been looking at adding memory, but was holding off because the ECC stuff’s more expensive. Perhaps I shouldn’t have bothered with it in the first place.)

    Thanks for the info on tokenization – very useful.
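Following up on the tokenization point in the first comment: if the corpus’s tokenization really is regex-based, recreating it might look something like the sketch below. The pattern here (runs of word characters, or single punctuation marks) is purely illustrative; the real rules would have to come from the corpus documentation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexTokenizer {
    // Illustrative pattern: a run of word characters, or a single
    // non-word, non-space character (e.g. punctuation).
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```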
