Below are the promised cross-validation results for LingPipe. They were produced by LingPipe’s own test suite rather than by my own (cruder) methods, but there’s no reason to suspect that these numbers aren’t directly comparable to my earlier results for MorphAdorner.
So, without delay, the out-of-the-box numbers:
Accuracy on unknown tokens: 71%
That’s using 5-grams; compiled models are about 6 MB. With a cache and 8-grams (producing 16 MB models), things are about the same:
Beams   Acc    Unk    Speed (tokens/s)*
10      .961   .683   29K/s
14      .963   .694   28K/s
20      .967   .699   27K/s
28      .970   .699   18K/s
40      .970   .699    5K/s
* See note below on speed; my own speeds are a bit lower, because my machine is slower.
Note, as Bob pointed out in an email to me, that overall accuracy is very slightly higher in this case, but that performance on unknown tokens is actually a little bit worse.
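For concreteness, here’s how an unknown-token accuracy figure like the ones above is typically computed in cross-validation: a token counts as unknown if it never appeared in that fold’s training data. A minimal Python sketch of the bookkeeping (illustrative only; the function and variable names are mine, not LingPipe’s test harness):

```python
def fold_accuracy(gold, predicted, train_vocab):
    """Overall and unknown-token accuracy for one cross-validation fold.

    gold and predicted are parallel lists of (token, tag) pairs;
    train_vocab is the set of tokens seen in this fold's training data.
    """
    total = correct = unk_total = unk_correct = 0
    for (token, gold_tag), (_, pred_tag) in zip(gold, predicted):
        total += 1
        hit = gold_tag == pred_tag
        correct += hit
        if token not in train_vocab:   # never seen in training: "unknown"
            unk_total += 1
            unk_correct += hit
    overall = correct / total
    unk = unk_correct / unk_total if unk_total else float("nan")
    return overall, unk
```

Averaging these two figures over all folds gives numbers of the same shape as the Acc and Unk columns above.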
For reference, recall that MorphAdorner was 97.1% accurate when restricted to a lexicon derived solely from the cross-validation data (with the potential benefit of running over training data produced in conjunction with one of its predecessors, hence likely to do a bit better on tokenization; Martin or Phil, correct me if I’m wrong about this). Unfortunately, I don’t have figures for MorphAdorner’s performance on unknown tokens.
Takeaway point: This looks to me, for practical purposes, like a dead heat as far as accuracy goes.
Next up, a comparison of overall bag of tags statistics.
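By “bag of tags” I mean the overall distribution of tags each tagger assigns, ignoring token order. A quick sketch of the kind of comparison I have in mind (illustrative names and helper functions, not anything from either tagger’s toolkit):

```python
from collections import Counter

def tag_distribution(tagged_tokens):
    """Relative frequency of each tag in a tagged corpus: a bag of tags."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

def distribution_diff(dist_a, dist_b):
    """Per-tag difference in relative frequency between two taggers' output."""
    tags = set(dist_a) | set(dist_b)
    return {t: dist_a.get(t, 0.0) - dist_b.get(t, 0.0) for t in tags}
```

Large per-tag differences flag places where the two taggers disagree systematically, even when their overall accuracy is similar.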
[Note: The numbers above are from Bob’s report to me. I’ve tried to rerun them on my machine but have had trouble getting the cross-validation run to finish: it keeps dying at apparently random spots, after many successful fold iterations, with a Java error (not out of memory, but something about a variable being out of range). So I don’t have full numbers from my own trial.

Still, the process goes more than far enough to suggest that the numbers above are reasonable and repeatable; I see very similar accuracy figures for each fold, over many folds and with different fold sizes. My speeds are lower (about half the figures given above, consistent with what I’ve seen from LingPipe on this computer in the past) since my machine is slower, but they show the same pattern: consistent through 14 or 20 beams, slowing at 20 or 28, and down by 3x or 4x at 40, with accuracy leveling off at 20 or 28 beams (which looks to be the speed/accuracy sweet spot). In any case, I’m satisfied with the way things stand and don’t see much reason to look into this further.]