Evaluating POS Taggers: MorphAdorner Bag of Tags Accuracy

In my last post, I outlined an approach to assessing tagger accuracy by measuring the relative frequencies of a reduced set of POS tags in a tagger’s output and comparing them to the frequencies in the training data I have on hand. Here’s the first set of results, from MorphAdorner.

A couple of notes. First, the procedure for MorphAdorner differs slightly from the one I’ll use for the other taggers. In order to mitigate some of the advantage it has thanks to working with its own training data, I’ll be using the output of its cross-validation runs, rather than that of a stock run over the full corpus. That means that it’s not making use of the extended lexicon built into the default distribution, nor of any other fixups that may have been applied by hand after the default training (to both of which the other taggers of course also lack access). I think MorphAdorner still has an advantage, but this should reduce it slightly (in cross-validation, lack of the extended lexicon reduced overall accuracy from 97.7% to 97.1%, both of which were down from 98.1% on a run of the default distribution back over the training data).

A listing of the reduced tagset is included in the last post. You can also download two versions of the translation table from MorphAdorner-ese (i.e., NUPOS) to this reduced set. The first covers only the 238 tags currently defined by NUPOS, and is derived directly from a dump of the tag statistics and help information in MONK. The second includes all 261 tags actually used in the MorphAdorner training data. Two points to note: (1.) These translations use the syntactic part of speech assigned to each tag, not the word class to which it belongs. So when a noun is used as an adjective, it’s translated to ‘jj’ not ‘nn’. (2.) Of the 23 tags that appear in the training data but not in the standard tagset, almost all are punctuation. In fact the only ones that I haven’t translated as punctuation are various iterations of asterisks, to which I’ve assigned ‘zz’ (unknown-other), though I could see the case for making them either punctuation or symbols. It doesn’t seem to have made much difference, as ‘zz’ remains underrepresented in the output, while ‘sy’ and ‘pu’ are overrepresented. Go figure.
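
For concreteness, applying one of those translation tables when counting tags amounts to something like the sketch below. This is not the script I actually used: it assumes the table is a plain two-column tab-separated file, assumes the tagged text is in (or has been massaged into) the four-column word/POS/lemma/spelling format of the training data (samples appear in the posts below), and punts on the compound tags that contracted forms carry (more on those below), letting anything unrecognized fall back to ‘zz’.

from collections import Counter

def load_translation(path):
    # Assumed format: NUPOS tag, tab, reduced tag, one pair per line.
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) >= 2:
                table[parts[0]] = parts[1]
    return table

def reduced_tag_counts(data_path, table):
    # Tally reduced tags over four-column word/POS/lemma/spelling lines.
    counts = Counter()
    with open(data_path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue                     # blank lines separate sentences
            # Compound tags on contractions (e.g. 'pns11|vmb') would need
            # their own handling; here they just fall through to 'zz'.
            counts[table.get(fields[1], "zz")] += 1
    return counts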

Results

First, the raw numbers:

Table 1: POS frequency in MorphAdorner training data and cross-validation output

 

POS	Train	Cross	 Tr %	 Cr %	 Cr Dif		Cr Err %
av	213765	212662	 5.551	 5.512	-0.03909	 -0.516
cc	243720	240704	 6.329	 6.239	-0.09015	 -1.237
dt	313671	313111	 8.145	 8.116	-0.02993	 -0.179
fw	  4117	  4076	 0.107	 0.106	-0.00126	 -0.996
jj	210224	210454	 5.459	 5.455	-0.00437	  0.109
nn	565304	571418	14.680	14.811	 0.13069	  1.082
np	 91933	 84632	 2.387	 2.194	-0.19375	 -7.942
nu	 24440	 24776	 0.635	 0.642	 0.00751	  1.375
pc	 54518	 54200	 1.416	 1.405	-0.01092	 -0.583
pp	323212	325560	 8.393	 8.438	 0.04498	  0.726
pr	422442	422557	10.970	10.952	-0.01778	  0.027
pu	632749	640074	16.431	16.590	 0.15877	  1.158
sy	   318	  1282	 0.008	 0.033	 0.02497	303.145
uh	 19492	 20836	 0.506	 0.540	 0.03388	  6.895
vb	666095	666822	17.297	17.283	-0.01388	  0.109
wh	 40162	 40316	 1.043	 1.045	 0.00202	  0.383
xx	 24544	 24596	 0.637	 0.638	 0.00014	  0.212
zz	   167	    97	 0.004	 0.003	-0.00182	-41.916
--	    --	    --
Tot	3850873	3858173		

Legend
POS = Part of speech (see previous post for explanations)
Train = Number of occurrences in training data
Cross = Number of occurrences in cross-validation output
Tr % = Percentage of training data tagged with this POS
Cr % = Percentage of cross-validation data tagged with this POS
Cr Dif = Difference in percentage points between Cr % and Tr % (i.e., Cr % − Tr %)
Cr Err % = Percent error in the cross-validation count relative to the training count (i.e., 100 × (Cross − Train) / Train); since the two totals are nearly equal, this is effectively the percent error in frequency as well

That’s a bit hard to read, and is better represented graphically:

Figure 1: Percentage point errors in POS frequency relative to the training data

 

[Image: MA Errors.png]

 

Figure 2: Percent error by POS type relative to the training data

 

[Image: MA Errors Bubble.png]

Notes on Figure 2:

  1. The size of each bubble represents the relative frequency of that POS tag in the training data. Bigger bubbles mean more frequent tags.
  2. The x-axis is logarithmic, so the differences are both more (‘zz’, ‘sy’) and less (‘pr’, ‘nn’) dramatic than they appear.

Discussion

Just a couple of points, really.

  1. Things look pretty good overall. With the exception of a few (relatively low-frequency) outliers, the output frequency of most tags is within about 1% of its frequency in the training data. So I could have pretty good confidence were I to compare, say, adjective frequency in different contexts.
  2. It looks, though, like there’s systematic undercounting of proper nouns, conjunctions and, to a lesser extent, adverbs and determiners. Proper nouns will probably always be hard, since they’ll have a disproportionate share of unknown tokens (i.e., tokens not present in the training data), on which accuracy generally drops to 70-90%; perhaps a bit better for proper nouns, since you can maybe crib from orthography.
  3. Nouns are overcounted, reducing the proper-noun problem if I were to lump the two together. Punctuation is overcounted, but I find it hard to imagine a situation in which I would care. Prepositions are slightly overcounted.
  4. Symbols (303% error) and unknown/other tokens (42%) are totally unreliable. They don’t contribute much to the overall error rate, because there aren’t very many of them (hence the nearly invisible bubbles in Figure 2), which is likely also why they’re so unreliable in the first place. But you wouldn’t want to make much of any variation you might see in those frequencies. Same goes for interjections, which hover around 7% error rate. [Note: I’ll look into the ‘sy’ and ‘zz’ cases, since it could be something systematically askew, but also might just be small-sample effects.]

Anyway, this is useful information on its own, and I hope it will be even more so once I have analogous data for the other taggers. That data should be ready in the next day or two, so more to follow shortly.

Evaluating POS Taggers: Accuracy and Bags of Tags

As I mentioned in my last post, one of the things that’s tricky about comparing the results of different taggers—or even, for instance, of comparing MorphAdorner to its own training data—is that the tokens involved can change, since each tagger tokenizes input data differently. Even if you’re not going to hold that against a given tagger directly (i.e., you’re not going to count token mismatches as errors in themselves), you still have the problem that the number of POS tags in the output will then sometimes differ from the number in the input, even for single tokens, which in turn will produce mismatches/errors (except where, as with MorphAdorner, single tokens can take multiple tags, so that it’s possible in principle for different tokenizations to produce equivalent series of tags).

(NB. Just to be clear, yes, you want tokenization to be as accurate as possible in general, but this is only a direct problem when comparing the output of a tagger to known-good data, as I’m doing at the moment. In the usual case, you’ll just accept whatever tokenization the package produces. There are valid arguments to be made for different treatments of some kinds of tokens, so accuracy is, as always, a matter of conformance to expectations rather than one of abstract correctness.)

One way to get around this problem, at least partially, is to consider the results not token-by-token, but in aggregate over an entire corpus. So the question is then not “Is this word tagged correctly as an adjective?” (since two taggers may not agree on what constitutes a word), but “How does the percentage of all words tagged as adjectives by this tagger compare to that of the training set?” This isn’t the same thing as cross-validation, though one could ask the question during a cross-validation procedure. Instead, what I’ll be doing is comparing the relative frequencies of the various parts of speech assigned by each tagger (running with its stock training data) to the relative frequencies observed in the training corpus. So there’s no need to train the taggers on MorphAdorner’s data, which is good, since I’ve already seen that I can’t manage such a thing for Stanford’s tagger on my hardware, and TreeTagger produced abysmal results (probably my fault, but I’ve run out of patience with it in any case).
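
In code, the comparison I have in mind amounts to roughly the following sketch, where the inputs are counts of reduced tags (listed below) in the training data and in a tagger’s output. The function and the exact error formula are just for illustration, not anything taken from the taggers themselves.

def compare_bags(train_counts, tagger_counts):
    # train_counts and tagger_counts map reduced tag -> number of tokens.
    # Token totals needn't match, since each tagger tokenizes differently;
    # that's the point of comparing percentages rather than raw counts.
    train_total = sum(train_counts.values())
    tagger_total = sum(tagger_counts.values())
    report = {}
    for tag in sorted(set(train_counts) | set(tagger_counts)):
        train_pct = 100.0 * train_counts.get(tag, 0) / train_total
        tagger_pct = 100.0 * tagger_counts.get(tag, 0) / tagger_total
        report[tag] = {
            "train_pct": train_pct,
            "tagger_pct": tagger_pct,
            "point_diff": tagger_pct - train_pct,
            "pct_error": (100.0 * (tagger_pct - train_pct) / train_pct
                          if train_pct else float("inf")),
        }
    return report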

The only other complication is that of course the taggers use different tagsets by default, so I’ll need to create a translation table for each one. I’m interested in pretty coarse-grained POS categories, specifically the following 18:

Table 1: POS Tags Used for Bag-of-Tags Accuracy Comparisons

av      adverb
cc      conjunction
dt      determiner
fw      foreign
jj      adjective
nn      noun
np      noun-proper
nu      number
pr      pronoun
pc      participle
pp      preposition
pu      punctuation
wh      wh-word
sy      symbol
uh      interjection
vb      verb
xx      negation
zz      unknown-other

Note in particular that I’m not dividing up verbs into classes like modals or “being” verbs, nor distinguishing by tense or mood. The down side of this is loss of detail, and if I were doing actual research work, I’d keep more classes. But it guarantees that every tag from every tagger will slot easily into one or another of these categories, and it’ll give me a feel for how they stack up at the most basic level.

Also, this will help answer a question I have about MorphAdorner’s accuracy, specifically how its “big picture” accuracy (determining whether a word is used as an adjective or an adverb or a noun, say) compares to the very fine-grained accuracy over its full tagset (which we’ve already seen is around 97% in cross-validation). One of the reasons to care about this is that more or less every tagger claims to be (and tests as) about 97% accurate, but typically over smaller tagsets and on non-literary data. Does this mean that MorphAdorner will do better when it’s not being asked to discriminate so finely? Or is it possible that having been trained on such fine distinctions, it suffers on coarser ones? The second possibility doesn’t seem absurd to me; more categories means fewer examples of each one in the training data, and it’s possible that MorphAdorner will therefore do worse than taggers trained on fewer, broader categories. Now, even if that’s true (and I don’t know yet what the outcome will be), it’s still possible that the difference would be small and/or that the advantages of the large tagset would outweigh whatever drop in coarse-grained accuracy might be observed. But it’s an interesting question, and one that I’d like to be able to answer. Plus, it bears directly on the research in service of which I’m doing this whole evaluation of taggers: I want to examine POS frequency distribution in large corpora across historical periods. So I need to have a sense of how accurate the taggers are not on individual tokens, but in sum. True, accuracy in sum largely depends on getting individual tokens right, and yes, I’ll probably care at some point about things like bi- and tri-grams that are much more sensitive to the accuracy of individual tags. But still, big-picture, bag-of-tags accuracy matters to me.

Coming next, then, the first results of this reduction procedure with MorphAdorner. Including graphs! It’ll be a hoot.

Evaluating POS Taggers: More Info on the Training Data

With the help of Bob Carpenter at Alias-i, I now have cross-validation data for LingPipe using the MorphAdorner training set. Before I get to the numbers (which are good) in another post, a little background on the training data itself. (Incidentally, one of the real advantages of all this evaluative work has been to give me a much better sense of the data and tagsets I’m working with than I would have had otherwise.)

NUPOS, the tagset developed by Martin Mueller and used by MorphAdorner, is large: it’s up to 238 tags at the moment, compared to 60 or 80 in most other mainstream tagsets. This has a lot of potential benefits for literary work, especially if one wants to examine texts from earlier periods. (For the rationale behind NUPOS, see Martin’s post on the MONK wiki, to which I’ve linked in the past. Note that the list of tags there isn’t fully up to date as of mid-January 2009; if you have access to MONK development builds, there’s a better list at http://scribe.at.northwestern.edu:8090/monk/servlet?op=pos.) The short version: It might be nice to be able to deal with “thou shouldst” and to easily find cases of nouns used as adjectives, to name only two instances. But it has some costs, too, particularly in the system requirements for training a tagger. This is especially true because of the way NUPOS handles contractions and certain other compounds, which it treats as single tokens with compound POS tags. I think I mentioned something about this earlier, but it wasn’t clear to me then exactly what was going on. So … a couple of examples:

I'll     pns11|vmb   i|will    i'll
...
there's  pc-acp|vbz  there|be  there's
...
You've   pn22|vhb    you|have  you've

There are about 22,000 such cases (out of 3.8+ M total tokens) in the training data. As I said, NUPOS and MorphAdorner treat these as single tokens with two parts of speech, but most other taggers split them into two tokens. Martin has a nice argument for why the NUPOS/MorphAdorner method is the right one, but it makes things a bit tricky for me at the moment. First, it means there are 428 unique tags in the MorphAdorner training data (almost doubling the size of the already large NUPOS tagset), which makes for potentially enormous matrices in the training and cross-validation process of other taggers. I’ve already seen that it makes training the Stanford tagger on this data impossible on my machine, and it made it slightly harder for me to repeat Bob’s cross-validation runs with LingPipe than it might have been with a smaller tagset.

Now, I don’t suspect that such things are really impossible (in fact I’m rerunning the LingPipe cross-validation process with some tweaked settings right now), but it does make things more resource intensive.

And then there are tokenization issues, which make direct comparison of tokens and parts of speech between taggers pretty tough. Every tagger wants to tokenize incoming data differently, and they don’t necessarily preserve MorphAdorner’s existing tokenization (heck, even MorphAdorner doesn’t respect MorphAdorner’s existing tokenization; see all the instances of ‘…’ turned into ‘..’ and ‘.’). I suppose it would be possible to write a pretty trivial tokenizer for each of the packages that says “hey, every token in the input data is on a separate line, just go with that.” But mucking about with tokenizers is beyond my current investment level.
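
(For what it’s worth, the trivial tokenizer in question would amount to little more than the sketch below; the real work would be wiring it into each package’s tokenizer interface, which is exactly the mucking about I’m avoiding.)

def pretokenized(path):
    # Treat the first tab-delimited field of each non-blank line as a token.
    tokens = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.strip():                 # blank lines mark sentence breaks
                tokens.append(line.split("\t")[0])
    return tokens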

Finally, as Bob pointed out elsewhere (I’ve forgotten now whether it was in an email or a blog comment), tokenization has an obvious impact on tagging accuracy; it’s hard to get parts of speech right (where “right” = “matches the training data”) if the tagger thinks the data consists of different tokens than does the trainer. In the end, the net effect is to skew accuracy results toward taggers/tokenizers more closely related to the ones used to help generate the training data in the first place. Martin tells me that the MorphAdorner training data is distantly descended from material tagged by CLAWS and so is, as far as I can tell, probably not closely related to the other taggers I’m evaluating. Takeaway point: It’s no surprise that using MorphAdorner’s training data will tend to produce accuracy evaluations that favor MorphAdorner.

What’s next? A couple of posts later today, one on LingPipe cross-validation results and another on raw accuracy using a sharply reduced tagset and treating the whole dataset as a big bag of tags (i.e., what is the overall frequency of adjectives, etc.). Good times. More soon.

Evaluating POS Taggers: Stanford Memory Use, Training Data Issues

After a couple days off, I’ve been trying to run cross-validation on the Stanford tagger. The good news is that it works, more or less; I can feed training data to it and get a trained model out, which I can then use to tag new text. The bad news is that the training process keeps crashing out of memory (given 3800 MB to work with) on data sets over about 1M words. I emailed the helpful folks at Stanford, who pointed out that my training data (from MorphAdorner), in addition to being significantly larger than what they normally use (about 4M words instead of 1M; 500+ word maximum sentence length instead of about half that), was using 420+ tags, which made training impossible under 4GB of physical memory.

420 tags was news to me; NUPOS contains about 180 valid tags. So I wrote a little script to list the tags used in the training data, and immediately discovered the problem. Recall that MorphAdorner training data usually looks like this:

Her     po31    she     her
mother  n1      mother  mother
had     vhd     have    had
died    vvn     die     died
too     av      too     too
long    av-j    long    long
ago     av      ago     ago
[...]

One word, one tag, one lemma, one standard spelling. As I’ve noted before, it also passes punctuation unmodified, like this:

,       ,       ,       ,

or

*****   *****   *****   *****

Not all such punctuation is part of the standard NUPOS tagset, of course (for the record, the only punctuation included in the “official” NUPOS tagset is “. ; ( ) ,”). So some of the extra tags in the training data are simply funky punctuation. But that doesn’t explain all (or even most) of the discrepancy (there are maybe 30-40 unique instances of such punctuation in the training data).

So what gives? Turns out that MorphAdorner does something clever with contractions: it treats them as two words in one, and gives them two tags on a single input line. A couple of examples:

done's  vvn|vbz    done|be done's
off's   a-acp|po31 off|he  off's

Note the “|” separator. I can’t remember if I’d noticed this before (it occurs about 22K times in 4M lines of training data, so it wouldn’t necessarily jump out in a visual inspection), but in any case I definitely didn’t do anything about it when I fed the training data to TreeTagger or Stanford. This surely accounts for some of the abysmal accuracy results I saw with TreeTagger, and it’s a major culprit in my memory woes with Stanford. It adds 197 items that look like tags to Stanford, roughly doubling the tagset size, and since the training process involves a two-dimensional array of tags, it therefore roughly quadruples the memory requirements. Coincidentally, that happens to be about the difference in size between the full training dataset (3.8M words) and the largest sample I’ve used successfully to train Stanford (1M words).

The good news is that it’s probably largely fixable with some data munging. If I were maximally lazy, I could just kill all the lines in question. Or I could split them in two at the apostrophe and the vertical bar in their respective fields. Either way, relevant information would be lost and output quality would presumably drop, though not so much as at the moment, when I’m feeding the Stanford tagger what is effectively bad data. I’ll try the simple deletion method first, just to see if it allows me to run the trainer on a data set this large, then worry about handling things properly later.
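
Here’s roughly what the splitting option would look like. A sketch only: splitting the word form at its last apostrophe is a naive guess, and anything that doesn’t fit the pattern (including the apostrophe-less Early Modern forms mentioned in the update below) just passes through untouched.

def split_compound_lines(in_path, out_path):
    # Split compound-tagged contractions into two training lines, e.g.
    # "done's / vvn|vbz / done|be / done's" becomes "done / vvn / done / done"
    # plus "'s / vbz / be / 's".  Columns: word, tag, lemma, standard spelling.
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 4 or "|" not in fields[1]:
                fout.write(line)
                continue
            word, tag, lemma, std = fields
            tags, lemmas = tag.split("|"), lemma.split("|")
            cut, std_cut = word.rfind("'"), std.rfind("'")
            if cut <= 0 or std_cut <= 0 or len(tags) != 2 or len(lemmas) != 2:
                fout.write(line)             # punt on anything unexpected
                continue
            words = [word[:cut], word[cut:]]
            stds = [std[:std_cut], std[std_cut:]]
            for w, t, l, s in zip(words, tags, lemmas, stds):
                fout.write("\t".join([w, t, l, s]) + "\n")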

[Update: Martin Mueller fills me in via email concerning the use of compound tags in the MorphAdorner training data:

Most modern taggers treat contracted forms as two tokens but take their cue from modern orthography, which uses an apostrophe as a contraction marker. Morphadorner treats contracted forms as a single token for two reasons:

1. The orthographic practice reflects an underlying linguistic ‘reality’ that the tokenization should respect

2. In Early Modern English ( as in Shaw’s orthographic reforms) contracted forms appear without apostrophes, as in ‘not’ for ‘knows not’ or ‘niltow’ for ‘wilt thou not’. It’s not obvious how to split these forms.

Contracted forms get two POS tags separated by a vertical bar, but with regard to forms like “don’t”, “cannot”, “ain’t”, MorphAdorner analyzes the forms as the negative form of a verb and does not treat the form as a contraction. It uses the symbol ‘x’ to mark a negative POS tag.

So the compound tags are in a very real sense “true” single tags that just happen to be made up of two otherwise-valid individual tags. The 423 tags in the MorphAdorner training data are thus to be treated as 423 unique features/tag-types, or at least that’s how MorphAdorner does it. This doesn’t solve my memory-use problem, but it’s good to know.
]

Evaluating POS Taggers: TreeTagger Cross-Validation

More on TreeTagger, which I’ve now (finally) gotten to train and cross-validate.

First, a note on an additional difficulty not included in the last post. I use MorphAdorner to create the lexicon from each chunk of training data (once that data has been cleaned up to TreeTagger’s liking). This works well, except that MorphAdorner has a quirk in its handling of punctuation for the lexicon. Specifically, MorphAdorner treats left and right quotes, which are tagged in the training data with ‘lq’ and ‘rq’, as punctuation to be passed through in the lexicon. So what appears in the training data like this (sic, incidentally, on the “ here):

“	lq	“

appears in the lexicon as:

“	“	“

I think there’s a way to fiddle with smart quotes in MorphAdorner (it would be nice to treat them all as dumb double or single quotes, though maybe MorphAdorner’s clever enough to use left and right quotes for context info), but I can’t find it at the moment. Anyway, this freaks TreeTagger out. So I’ve modified my fix-everything-up-for-TreeTagger script to change the second form into the first whenever left or right quotes appear in the lexicon. Meh.
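
The relevant bit of the fixup looks more or less like the sketch below; it assumes the lexicon lines really are just the three tab-separated fields shown above, and it only knows about double quotes.

QUOTE_TAGS = {"\u201c": "lq", "\u201d": "rq"}    # left and right double quotes

def fix_lexicon_quotes(in_path, out_path):
    # Where a smart quote has been passed through as its own tag, put the
    # 'lq'/'rq' tag back, matching the training-data form.
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2 and fields[0] in QUOTE_TAGS and fields[1] == fields[0]:
                fields[1] = QUOTE_TAGS[fields[0]]
            fout.write("\t".join(fields) + "\n")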

Results and Discussion

All that said, and two days of my life that I’ll never get back squandered, was it at least worth it? In a word: No. TreeTagger produces awful results in the cross-validation tests. See for yourself:

#       Words   Errors   Err Rate
0       382889  58934   .153919
1       382888  57586   .150399
2       382888  52527   .137186
3       382889  51714   .135062
4       382888  45794   .119601
5       382888  44436   .116054
6       382889  54037   .141129
7       382888  55332   .144512
8       382888  52674   .137570
9       382888  52441   .136961
Tot    3828883 525475   .137239

Recall that for MorphAdorner the average error rate in cross-validation was 2.3-2.9%, depending on the lexical data used. Here it’s 13.7%. This is obviously totally unusable. And a quick survey of the data suggests that the errors are neither obviously systematic nor trivial—there are nouns tagged as verbs and such all over the place.

[A side note: TreeTagger doesn’t show the disparity on the Shakespeare-heavy tranches, nos. 6-8, that MorphAdorner did. But the error rates are so high that it’s hard to say anything more about it. I hope to look at those chunks again in the Stanford case.]

It’s possible—likely, even—that I’ve done something wrong, given how bad the results are. TreeTagger’s authors claim to have achieved 96+% accuracy on general data using a smaller tagset. It’s also surely true that the tagset and cardinal/ordinal restrictions on TreeTagger’s training input limit its accuracy in the present case. Whatever. I’m sick of dealing with it.

[Update: Some of the errors are surely due to my not having reworked the training data to set a sentence-end tag the way TreeTagger expects. (I used ‘.’ instead—close, but of course not perfect.) But that would account for only a small fraction of the total errors, and to fix it at this point would require more work than I’m willing to spend on a solution that obviously won’t beat MorphAdorner.]

Out of curiosity, I’ll still run TreeTagger through the stock-training-data-and-reduced-tagset comparison of all the taggers that will be the last test in this roundup, but for now TreeTagger’s very likely out of the running, especially since I’d like to have NUPOS compatibility.

Evaluating POS Taggers: TreeTagger Training Troubles

As always, things aren’t as easy as it seems they should be. Specifically, TreeTagger—which looked like it was going to be very easy to retrain, hence also easy-ish to cross-validate—turns out to be mildly tricky. Tricky enough that I’m sick of dealing with it.

Why It Looks Easy

To train TreeTagger, you supply it with a lexicon (word, POS tag, lemma, plus additional POS-lemma pairs for words that can function as multiple parts of speech), a list of valid tags, a tab-delimited training file (again, word-POS-lemma), and a “sentence-end” tag (which replaces the POS tag for punctuation that ends a sentence; “SENT” by default). I have all of that data, except for the sentence-end tags (MorphAdorner tags punctuation as the mark itself, e.g., ‘!’ is tagged ‘!’ and so on, with no differentiation of sentence ends/breaks). But the SENT tag is configurable, and I could probably fudge by setting it to ‘.’.

Why It Isn’t

The main problem is that TreeTagger insists that the tags used in the supplied tagset match exactly those used in the training data and in the lexical data, which (training and lex tags) must in turn match each other. “Well, duh,” you say, “it can only know how to use tags you tell it about.” True, but here’s the thing: It chokes on extra tags in the tagset or the lexicon, not just in the training data. So you can’t use one tagset or lexicon for all the training; you have to custom-generate a lexicon and a list of tags for every training set you want to use. You’d want to custom-generate the lexicon anyway, though it would be nice to be able to test against the enhanced lexicon supplied with MorphAdorner.

And then there’s the fact that, per the documentation, TreeTagger shouldn’t be trained on cardinal and ordinal numbers that contain digits, so you need to strip them from the training and lexical data, but keep them in cases where they’re spelled out. Plus, as mentioned above, MorphAdorner doesn’t have a generic punctuation tag—it passes punctuation through, tagged as the punctuation itself. And it groups adjacent punctuation marks together, most noticeably blocks of periods ‘…’, ‘..’, ‘…..’, etc. But chunks of random punctuation are of course not valid NUPOS tags, so you don’t really want them in the training or tag sets.

So here’s what would be required to run cross-validation, as far as I can tell, above and beyond the data-chunking-and-stripping scripts that I already have: a script that iterates line by line over the training data (recall the format: “word-POS-lemma[-pos-lemma[-pos-lemma…]],” where the extra POS-lemma pairs are occasional features of which there may be arbitrarily many) and does three things. It removes lines that have a digit in the word and a POS tag of “crd” or “ord”, unless they also have another POS tag, in which case only the offending POS-lemma pair is removed. It removes any line with a POS tag not in the full NUPOS set (in order to catch the multiple-punctuation business, plus any oddness that might have slipped into the training data by mistake). And it adds the (now guaranteed to be valid) POS tags from each surviving line to an array (or probably a hash) of valid-and-existing tags in this particular training set, checking to make sure the tag it’s adding isn’t a duplicate. Then it regenerates the lexicon based on the resulting, reduced training set.
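
A rough sketch of that filter follows, leaving out the lexicon regeneration (which I hand back to MorphAdorner anyway) and assuming the full NUPOS tag list is available as a Python set; the column handling is as described above, not checked against anything official.

import re

def clean_for_treetagger(in_path, out_path, nupos_tags):
    # Each line: word, then one or more POS/lemma pairs, tab-delimited.
    # Returns the set of tags that survive, for the per-chunk tagset file.
    seen_tags = set()
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue                     # blank sentence separators, junk
            word, rest = fields[0], fields[1:]
            pairs = list(zip(rest[0::2], rest[1::2]))
            if re.search(r"\d", word):
                # Digit-bearing cardinals/ordinals lose their crd/ord pair...
                pairs = [p for p in pairs if p[0] not in ("crd", "ord")]
            # ...and any pair whose tag is outside the full NUPOS set goes too.
            pairs = [p for p in pairs if p[0] in nupos_tags]
            if not pairs:
                continue                     # nothing valid left on this line
            seen_tags.update(pos for pos, _ in pairs)
            fout.write("\t".join([word] + [x for p in pairs for x in p]) + "\n")
    return seen_tags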

And at that point it should be possible to train TreeTagger on MorphAdorner’s data, hence to cross-validate it. Maybe. Unless I’ve missed something. Which is likely.

I’m going to spend a few (more) hours banging my head against this, but only to the end of the day today, tops.

Evaluating POS Taggers: Training the Other Taggers

“Damn,” you may be saying, “is he still on about this stuff?” Yes, yes I am.

So … MorphAdorner is easy to retrain/validate, because I have access to the tagged, formatted training data. This is what I reported in my previous post on cross-validation.

Now I want to see how much work would be involved in trying to train the other taggers on MorphAdorner’s data. I’m interested in doing this because MorphAdorner is the only one of the bunch that’s trained on literary texts rather than general contemporary sources (often a standard corpus from the 1989 Wall Street Journal). If it’s easy enough, I’ll retrain any/all of the other taggers to see if they produce better results (and if so, how much better) before I compare them to MorphAdorner.

A related note: It will be especially nifty if the other taggers can use an arbitrary tagset. Or, well, not arbitrary. NUPOS. NUPOS is good, for reasons mentioned earlier. Plus, that would make direct use of the MorphAdorner training data much easier, and it would make a direct comparison of outputs easier as well. If this tagset agnosticism isn’t built into the other taggers, I’ll have to convert between tagsets.

The rest of this post, then, is my notes on what’s involved in training the other taggers, with an eye specifically toward using MorphAdorner’s data/tagset.

The MorphAdorner Training Data

The training data I have from MorphAdorner has a pretty simple four-column, tab-delimited format. It’s just word pos-tag lemma std-spelling. And sentences are followed by a blank line, which makes it easy to pick them out if necessary. A snippet:

She     pns31   she     she
had     vhd     have    had
always  av      always  always
wanted  vvn     want    wanted
to      pc-acp  to      to
do      vdi     do      do
[...]

So it should be pretty easy to transform it into whatever (text-based) form might be required for the other taggers.
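
For instance, reading the four-column data back out as sentences of (word, tag) pairs and writing some generic one-sentence-per-line form is only a few lines. The output layout and delimiter below are placeholders, not any particular tagger’s required format.

def sentences(path):
    # Yield sentences as lists of (word, POS) pairs from the four-column data;
    # blank lines mark sentence boundaries.
    sent = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                if sent:
                    yield sent
                sent = []
                continue
            sent.append((fields[0], fields[1]))
    if sent:
        yield sent

def write_word_tag(in_path, out_path, delim="_"):
    # One sentence per line, each token written as word<delim>TAG.
    with open(out_path, "w", encoding="utf-8") as out:
        for sent in sentences(in_path):
            out.write(" ".join(w + delim + t for w, t in sent) + "\n")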

LingPipe

LingPipe is a bit of a problem. To train it on a new corpus, I’d need to write a moderate amount of Java code, plus munge the training data a bit. There are examples provided for analogous cases (assuming proper input), and the documentation is good, but this would probably be a week-long project. I’d be willing to invest that amount of time if I had reason to believe the results would be superior, but here’s the thing: LingPipe uses a Hidden Markov Model, as does MorphAdorner. It’s not that much faster than MorphAdorner (about 2x, but that’s likely to drop over a larger corpus, since MorphAdorner has higher one-time startup overhead than LingPipe). And it’s licensed in a way that probably makes it unusable with nonredistributable corpora. So do I want to spend a week, however edifying it might be, evaluating whether or not LingPipe might be meaningfully superior to MorphAdorner, when there’s good reason to expect that it will not be? I’m going with no.

What I will do, though, is compare its default output (using the Brown training data) to MorphAdorner’s over the training corpus to get a sense of its accuracy using the built-in model(s). That’s just a matter of working up a set of translations between NUPOS and LingPipe’s tagsets, plus some XML-munging Perl-fu.

Stanford

Stanford looks easier, and it’s a more interesting case, since it uses a different algorithm (maximum entropy rather than Hidden Markov). Just need to transform the training data into the expected form and have at it. (Incidentally, maybe I can just set the tag delimiter to [tab]? Trivial, but investigate.) If this turns out to be as easy as expected (ha!), will run full cross-validation. All this probably using left3words only. I want no further part of the slow-as-molasses-in-January bidirectional version, though I suppose I should look into it for the sake of completeness.

As above, I’ll also compare the retrained results to those already obtained using default training data.

TreeTagger

This also looks very easy to train using MorphAdorner’s data, so expect the same cross-validation and default comparisons.

Action Items

OK, so Stanford and TreeTagger will get full retraining and cross-validation (as MorphAdorner did before them), plus comparison of their default output to retrained cases. LingPipe will get only the latter.

Will write up (briefly) what’s involved in doing the retraining as I get into it. More in the next couple of days as results are available.

Evaluating POS Taggers: MorphAdorner Cross-Validation

Here are the results of the cross-validation trials with MorphAdorner. Note to self: Never claim that anything will be trivial.

I ran a ten-fold cross-validation trial on the MorphAdorner training data. All fairly straightforward and naïve; I chopped the training data into ten equal-size chunks (with no attention to sentence or paragraph boundaries, which introduces edge cases, but they’re trivial compared to the sample size), trained the tagger on nine of them and ran it over the remaining one using the new training set, then repeated the process for each chunk. I did this in two forms: Once creating new lexical data as well as a new transformation matrix (columns labeled “1” below), and then a second time creating only a new transformation matrix, but using the supplied lexical information (labeled “2”). The true error rate on new texts is probably somewhere between the two, but (I think) likely to be closer to the second (better) case.
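
The chunking itself is only a few lines; here’s a sketch of the splitting logic, done by line so that the blank sentence separators stay put (see the update below on why that matters slightly). The training and tagging steps themselves are MorphAdorner invocations that I’m not reproducing here.

def ten_fold_splits(path, folds=10):
    # Chop the training file into equal-size chunks of lines and yield
    # (training_lines, testing_lines) for each fold in turn.
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    size = len(lines) // folds
    chunks = [lines[i * size:(i + 1) * size] for i in range(folds)]
    chunks[-1].extend(lines[folds * size:])      # leftovers go to the last chunk
    for held_out in range(folds):
        train = [l for i, c in enumerate(chunks) if i != held_out for l in c]
        yield train, chunks[held_out]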

Results

So … the results:

#     Wds     Err1   Rate1    Err2    Rate2   Diff12   Rate12
0    382889   8607  .022479   7049   .018410   2143   .005596
1    382888   8267  .021591   5907   .015427   3120   .008148
2    382888   8749  .022850   6698   .017493   2739   .007153
3    382889   8780  .022930   7322   .019123   2168   .005662
4    382888   8039  .020995   5784   .015106   2851   .007446
5    382888   8147  .021277   5835   .015239   3020   .007887
6    382889  19660  .051346  16727   .043686   3721   .009718
7    382888  20494  .053524  17723   .046287   3679   .009608
8    382888  17078  .044603  14441   .037715   3603   .009410
9    382888   7151  .018676   4512   .011784   4436   .011585
Tot 3828883 114972  .030027  91998   .024027  31480   .008221

Key

# = Chunk being evaluated
Wds = Number of words in that chunk
Err1 = Number of tagging errors in the testing chunk using lexical and matrix data derived exclusively from the other cross-validation chunks
Rate1 = Tagging error rate for the testing chunk using this training data (1)
Err2 = Number of tagging errors using a matrix derived from the other chunks, but stock lexical data from the MorphAdorner distribution
Rate2 = Error rate in the second case
Diff12 = Number of differences between the output files generated by cases 1 and 2
Rate12 = Rate of differences between output 1 and 2

[Update: In my original test (the data reported above), I didn’t preserve empty lines in the training and testing data. Rerunning the cross-validation with empty lines in place produces very slightly better results, with overall average error rates dropping from 3.0% to 2.9% (using limited lexical data, case 1) and from 2.4% to 2.3% (with full lexical data, case 2). Rates corrected in the analysis sections below. Corrected data (where blank lines now count as “words”):

#     Wds     Err1   Rate1    Err2    Rate2   Diff12   Rate12
0    399644   8599  .021516   7045   .017628   2142   .005359
1    399643   8140  .020368   5886   .014728   3015   .007544
2    399643   8803  .022027   6737   .016857   2769   .006928
3    399644   8806  .022034   7347   .018383   2173   .005437
4    399643   8070  .020193   5817   .014555   2843   .007113
5    399643   8817  .022062   5918   .014808   3676   .009198
6    399644  19806  .049559  16901   .042290   3727   .009325
7    399643  20246  .050660  17486   .043754   3645   .009120
8    399643  16997  .042530  14362   .035937   3597   .009000
9    399643   7150  .017890   4512   .011290   4438   .011104
Tot 3996433 115434  .028884  92011   .023023  32025   .008013

]

Notes

A couple of observations. In general, the tagging quality is quite good; it averages 97.1% accuracy over the full testing set even when it’s working without the enhanced lexicon supplied with the distribution, and rises to 97.7% with it.

There’s some serious variability, especially in chunks 6, 7, and 8. A quick inspection suggests that they’re mostly Shakespeare, which I guess is both reassuring and unsurprising. Reassuring because I’ll mostly be working with later stuff and with fiction rather than drama; and unsurprising particularly in the case of cross-validation, since much of the Shakespeare is necessarily excluded from the training data when it happens to occur in the chunk under consideration. Without the Shakespeare-heavy chunks, accuracy is around 97.8% and 98.4% for cases 1 and 2, respectively. That’s darn good.

I don’t know what, if anything, to make of the difference rate between the two testing samples, which is quite small (less than 1%) but not zero. I guess it’s good to know that different training inputs do in fact produce different outputs. Note also the difference between them is not identical to the sum of the differences between each output and the reference data, i.e., it’s not the case that they agree on everything but the errors. Each case gets some words right that the other one misses. No surprise there.

This is all very encouraging; accuracy at 97% and above is excellent. And this is comparing exact matches using a very large tagset; accuracy would be even better with less rigorous matching. As a practical matter, I probably won’t be able to do equivalent cross-validation for the other taggers (too time consuming, even if it’s technically possible), but I should at least be able to determine overall error rates using their default general-linguistic training data and the MorphAdorner reference corpus. I suspect they’d have to be pretty impressive to outweigh the other benefits of MorphAdorner. Time will tell. More after Thanksgiving.

[Oh, and for reference, the MONK Wiki has more information on MorphAdorner, including some good discussion of POS tagging in general, all by Martin AFAICT. Note that the last edit was in May 2007, so some things may have changed a bit.]

[Also, a check on speed: The above cross-validation runs, working on about 3.8 million words and processing all of them ten times, either as training or testing data, plus some associated calculating/processing, took about 45 minutes in sum on the same Athlon X2 3600+ as before (now with four gigs of RAM rather than two). Plenty speedy.]

Evaluating POS Taggers: Basic MorphAdorner Accuracy

With apologies for the delay (I was gone last week at the Modernist Studies meeting), here are the very earliest accuracy results for the part-of-speech taggers. Or, well, for one—and it’s far from complete (again, this blog = lab notes). More to follow in the next few days, though I surely won’t wrap things up until after Thanksgiving.

I have training data from MorphAdorner/MONK, a hand-tagged corpus of mostly nineteenth century, mostly British fiction that looks quite a bit like my testing corpus (but is not identical to it). It runs to just over 3.8 million words.

Running MorphAdorner back over this training data gives an error rate of about 1.9%. That number is exceedingly, but maybe not unexpectedly, good. It also requires a bit of specification.

First, MorphAdorner changes some of the training input data in ways that make a fully direct comparison difficult. Specifically—and this is the intended behavior, AFAIK—it separates punctuation from adjoining words. This is how the training data works, too, except in a few cases such as apostrophes in certain contractions and in (all?) plural possessives, or the period(s) in initials. So, for instance, days’ time (two tokens) in the training data becomes days ’ time (three tokens) in the output. This isn’t a big deal, and it happens to only about 0.2% of the input data. I’ve chosen to exclude these differences from my error calculations. (For the record, if you have a great deal of free time and want to check my math, there are 7366 such instances out of 3828883 tokens in the training set. Full data for all the accuracy testing will follow in a later post.)

Second, the 1.9% net error rate quoted above is the worst case scenario, since it counts any difference between training tags and computed tags as an error. MorphAdorner uses the NUPOS tagset (see both MorphAdorner’s info on their POS tagger and Martin Mueller’s very helpful paper on NUPOS [PDF].) NUPOS has 185 tags (updated from 163; see the MONK wiki page on NUPOS). That’s a lot of tags. The granularity it affords is nice, but I suspect one might often be satisfied if one’s tagger gets the answer right at a significantly higher level of granularity (say, is this word a noun or a verb?). It’s great that the extra detail is there, and Martin points out some of the benefits of both this approach and the specific features of NUPOS in his paper, but I’ll also be rerunning the accuracy tests with looser matching to see how MorphAdorner and the other taggers do under less demanding conditions. Also, some sort of grouping along these lines will be necessary in order to compare the taggers to one another, since they each use a different tagset (most of them meaningfully smaller than MorphAdorner’s).

Third, though, this test is really too easy on MorphAdorner, since it’s only being asked to retag exactly the data on which it was trained. I’ll run some cross-validation tests shortly. MorphAdorner is the only tagger for which this kind of literary cross-validation is trivially easy to do; the others could do it on their own (non-literary) training data, or I could look into using the MorphAdorner data to train them. Not sure how much work that would involve; it seems easy, though at the very least it would involve converting between tagsets. Depending on how MorphAdorner does with cross-validation and how the others do with their more general training sets, this may or may not be worth the effort. I’d like to do all of this, but I’m also anxious to get started on real experiments.

So … more results coming soon. Also: My Saturday night = HOTT.

Evaluating POS Taggers: Speed

[Update: Edited significantly on 8 November 2008 to include results from TreeTagger and MorphAdorner. Tables reworked and discussion and conclusions modified. For a full list and discussion of the packages under review, see “Evaluating POS Taggers: The Contenders“; the background info on Lingpipe and the Stanford tagger that previously was here has been moved to that post.]

[Update 2: Added results from Joyce rerun.]

[Update 3: See Bob Carpenter’s (of alias-i, the author and all-around guru of LingPipe) very helpful comment below for a list of things that have a potentially significant impact on speed, both for LingPipe and for the other taggers. These numbers are to be taken with a significant grain of salt in light thereof. Thanks, Bob!]

The part-of-speech taggers I’m evaluating are MorphAdorner, Lingpipe, TreeTagger, and the Stanford Log-Linear POS Tagger (one of these kids: not named like the others). See the link in the update note above for all the details.

Test setup

I’m using a dual-core Athlon 64 X2 3600+ system with two GB RAM running 64-bit Fedora 9. The machine is a desktop, and it’s a couple of years old; it’s not the fastest thing around, but neither is it especially slow. The JRE is Sun’s 64-bit package, version 1.6.0_10 (the latest release at the time of writing).

Tagging is an easily parallelized task, and this is a dual-core system, but I didn’t take any steps to make use of this fact; in general, the taggers used 100% CPU time on one core and a smaller amount of time (1%-30%) on the other. It would be great if there were a simple way to tell Java to use both cores fully, but if there is, I haven’t come across it. For larger jobs, I would write a simple batch job to start as many Java instances as there are cores (but note that total system memory requirements could be an issue in that case; hello, Newegg!).
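
What I’d probably do for a bigger job is a dumb batch script along these lines; a sketch only, shown here wrapping the TreeTagger command used below (the Java taggers would work the same way, modulo keeping an eye on total memory, and the output directory is assumed to exist).

import subprocess, sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

CORES = 2                                    # one tagger process per core

def tag_one(path):
    out = Path("output") / path.name
    with open(out, "w") as f:
        # Same invocation as the TreeTagger command below; swap in whichever
        # tagger is actually being benchmarked.
        subprocess.run(["./tree-tagger-english", str(path)], stdout=f, check=True)

if __name__ == "__main__":
    files = sorted(Path(sys.argv[1]).glob("*.txt"))
    with ThreadPoolExecutor(max_workers=CORES) as pool:
        list(pool.map(tag_one, files))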

The taggers were invoked using the following commands, with all variables defined as appropriate. (This section is pretty skippable unless you want to repeat my tests. The short version: I gave Java about a gig of RAM and looped over the input files with a simple for loop unless there was an explicit option to process an entire input directory. There are also some notes on specific options used with each package.)

Lingpipe

For Lingpipe running over individual files:

for i in $( find $inputDir -printf "%P\n" ); do
   /usr/bin/time -a -o $logFile \
      ./cmd_pos_en_general_brown.sh \
      -resultType=firstBest -inFile=$inputDir/$i \
      -outFile=$outputDir/$i >> $logFile 2>&1
done

Note that in this and all other commands, I’m timing the runs with /usr/bin/time. This could—and would—obviously be left out if I weren’t benchmarking.

For Lingpipe running over entire directories the command was the same, but without the for loop and with -*Dir in place of the corresponding -*File options.

The main thing to note is that I’m using the generic tagger trained on the Brown corpus (supplied with the Lingpipe distribution) and am using “first best” results rather than n-best or confidence-weighted listings for each word/token (which are roughly 1.5 and 2.5 times slower, respectively). The differences between the result types are explained in Lingpipe’s POS tutorial; the short version is that first-best gives you the single best guess at each word’s part of speech, which is all I’m looking for at the moment. I do think it would be interesting in the long run, though, to see what kinds of words and situations result in high-confidence output and which are confusing or marginal for the tagger.

Stanford

For Stanford, the command line is:

for i in $( find $inputDir -printf "%P\n" ); do
   /usr/bin/time -a -o $logFile java -mx1g \
      -classpath stanford-postagger.jar \
      edu.stanford.nlp.tagger.maxent.MaxentTagger \
      -model models/$modelType -textFile $inputDir/$i \
      > $outputDir/$i
done

$modelType is either bidirectional-wsj-0-18.tagger (“bidirectional”) or left3words-wsj-0-18.tagger (“left3”). As invoked, the tagger produces plain-text output, though it can be set to use XML (I think; I know this is true with XML input, but my brief attempt to get XML output didn’t work. Will investigate further if necessary).

MorphAdorner

The MorphAdorner command line to loop over individual files in a directory is:

for i in $( find $inputDir -printf "%P\n" ); do
   /usr/bin/time -a -o $logFile ./adornplaintext \
      $outputDir $inputDir/$i >> $logFile 2>&1
done

This is fine (I do like those for loops), but it could be better. MorphAdorner doesn’t appear to have a direct way to specify an input directory rather than individual files, but it’s happy to take wildcards, which has the same effect and is much faster than the loop version above (because it doesn’t have to load lexical data over and over again). This is then surely the preferred form:

/usr/bin/time -a -o $logFile ./adornplaintext $outputDir \
   $inputDir/\*.txt >> $logFile 2>&1

Note that I have—following a certain amount of headbanging, because I’m an idiot—escaped the wildcard. adornplaintext is a simple script to invoke the Java processor with the appropriate settings; it’s supplied with the distribution and uses the NCF (nineteenth-century British fiction, mostly) training data. The only change I made was to raise the max Java heap size to 1440m, since the default 720m threw errors on a couple of the larger texts.

TreeTagger

TreeTagger, unlike the other three, isn’t a Java program, so there’s no mucking about with Java parameters. Otherwise, it looks similar. (None of this is rocket science, of course; it’s all just here for documentation.)

for i in $( find $inputDir -printf "%P\n" ); do
   /usr/bin/time -a -o $logFile ./tree-tagger-english \
      $inputDir/$i > $outputDir/$i
done

Same story as most of the others with the lack of an input directory option, but the overhead penalty is much smaller with TreeTagger, so I made no attempt to work around it. tree-tagger-english is, again like many of the others, a small script that sets options and invokes the tagging binaries. I left the defaults alone.

Further Notes

The small archive of texts I’m using to benchmark is described in my previous post. Just for recall: Fourteen novels, 2.4 million words.

I ran each of the taggers over the archive three times and took the average time required to process each text (or directory). Stanford bidirectional was too slow for multiple runs, so it was run only once. There were no significant outlying times for any of the texts in any of the runs, i.e., there was no immediate reason to be suspicious of this simple time averaging.

Results

First, a list of the elapsed times, in seconds, for the various taggers and models.

(Apologies for lack of monospaced fonts in the tables, which seems to be a wordpress.com issue. I should also probably reorder tables 1 and 2, but that would be a pain in the hand-coded HTML ass.)

Key
LP = Lingpipe
SL = Stanford, left3words model
SB = Stanford, bidirectional model
TT = TreeTagger
MA = MorphAdorner

Table 1: Run times

All times in CPU seconds, measured from the output of /usr/bin/time, summing user and system CPU time, not wall clock time. Word counts from wc -w for consistency; most taggers measure tokens (including punctuation) rather than words, so their internal word counts are higher than the figures listed here and are not mutually consistent. Lower numbers are better.

Author Words LP SL SB TT MA
Alcott 185890 34 468 13582 9 132
Austen 118572 21 214 4319 9 105
Brontë 185455 32 495 17452 9 127
Conrad 38109 10 104 2824 2 85
Defoe 121515 20 205 4552 5 112
Dickens 184425 34 401 9587 9 128
Eliot 316158 56 807 29717* 14 170
Hawthorne 82910 18 280 8060 4 96
Joyce 264965 42 1791 95906* 12 177
Melville 212467 38 933 33330 10 139
Richardson 221252 41 414 11506 10 136
Scott 183104 34 740 27176 9 129
Stowe 180554 35 528 16264 9 129
Twain 110294 22 288 8778 5 111
Total: 2405670 437 7668 283054 112 1775
By dir: 2405670 323 - - - 593

“By dir” = tagger run over entire input directory rather than over individual files.

*Total times extrapolated from incomplete runs; tagger crashed with
Java out of memory error at 93% complete for Eliot and 43% for Joyce. The Eliot is close enough to use, I think, but the Joyce is dubious, especially since it’s such an outlier. See notes below. [Update: I just reran the Joyce using a gig and a half of memory for Java. Result: 227,708 seconds of CPU time. That’s more than two and a half days. Seriously. Whatever’s going on with this, it’s strange.]

Table 2: Speeds

Words tagged per CPU second. Higher is better.

Author LP SL SB TT MA
Alcott 5429 397 13.7 19692 1412
Austen 5775 554 27.5 19828 1125
Brontë 5864 375 10.6 19570 1459
Conrad 3862 366 13.5 16862 449
Defoe 6146 592 26.7 20654 1090
Dickens 5479 459 19.2 19774 1436
Eliot 5673 392 10.6 20924 1860
Hawthorne 4595 296 10.3 19341 862
Joyce 6286 148 2.8* 19695 1501
Melville 5535 228 6.4 19685 1529
Richardson 5439 535 19.2 20274 1627
Scott 5341 247 6.7 19990 1424
Stowe 5103 342 11.1 18775 1396
Twain 4903 383 12.6 18736 994
Average: 5507 314 8.5* 19786 1335
By dir: 7443 - - - 4058

*If the dubious Joyce result is excluded, the average speed rises to 11.4 words per second. I’ve used this higher value in the speed comparisons below.

Table 3: Speed ratios

Ratio of words tagged per CPU second. Read as: Column entry over row entry, e.g., TreeTagger is 2.9 times faster than Lingpipe. Tagging speed in words per second added for reference.

Note that I’ve used the “by directory” speeds for Lingpipe and MorphAdorner, and excluded the possibly anomalous Joyce result from the Stanford bidirectional speed.

Type TT LP MA SL
Speed (21414) (7443) (4058) (314)
LP (7443) 2.9
MA (4058) 5.3 1.8
SL (314) 68 24 13
SB (11.4) 1872 651 355 27

All of this in graphical form:

[Image: Speed.png]

 

Discussion

The TreeTagger, Lingpipe, and Morphadorner taggers are all very fast; they process entire novels in seconds. Their speed is pretty consistent, too, though this fact is hidden a bit in the case of MorphAdorner by its significant startup overhead (20-30 seconds). Looking at MorphAdorner’s speed over the entire testing directory, rather than file by file, gives not only a much higher average speed (4000 wps vs. 1300) but much more consistent speed results (not shown in the tables above).

[Update: As noted above, cf. Bob Carpenter’s comment below for some notes on speed.]

These taggers tend to be slowest on the shortest works, especially Conrad, probably because there’s a fixed amount of overhead to start up a Java instance and perform other initialization steps, and Conrad’s novel is short enough for that to be a significant amount of the total processing time. This isn’t an issue in runs over an entire directory rather than file by file, where there’s only one startup instance and the overhead penalty drops toward zero as the corpus size becomes arbitrarily large.

Both Lingpipe and TreeTagger have the option to produce probabilistic results, that is, to produce output that includes several candidate parts of speech and the probability that each is correct for a given token. In Lingpipe, there’s a moderate penalty for doing so, somewhere around doubling the run time; in TreeTagger, the penalty is almost vanishingly small, around 10%.

Stanford’s tagger is much slower, working through novels in minutes or hours, depending on the analytical model used. There’s more variation in speed, too; the left3 model shows about a 4x difference between Joyce (slow) and Defoe (fast). Even excluding Joyce, which is something of an outlier, there’s a 2.6x difference between Melville and Defoe. The bidirectional model is similarly variable, especially if the Joyce result is correct. There’s a 10x spread between fastest and slowest results, or about 4x excluding Joyce.

The difference in tagging speed on Joyce in particular is interesting to me. Ulysses is surely the strangest and least conventionally grammatical of the bunch, and it doesn’t entirely surprise me that it should take longer to tag than the others. In fact I’m a bit suspicious of the fact that the other taggers not only don’t take longer to process it, but in some cases actually chew through it faster than any other text. Are they just throwing up their metaphorical hands and labeling a bunch of stuff junk? But I may be anthropomorphizing the algorithms. Just because I would find it hard to sort out Joyce doesn’t necessarily mean it’s especially hard to compute. Will look at this in the next post, on accuracy.

Comparing the taggers and models by speed, the TreeTagger, Lingpipe and MorphAdorner packages are all within an order of magnitude of one another. They are in turn one to two orders of magnitude faster than Stanford left3 and hundreds to thousands of times faster than Stanford bidirectional. Stanford left3 is about 30 times faster than Stanford bidi.

Conclusions

Everything here is provisional until I look at accuracy, obviously.

First, note that speed is a threshold property, as far as I’m concerned. Faster is better, but the question isn’t really “which tagger is fastest?” but “is it fast enough to do x” or “how can I use this tagger, given its speed?” I think there are three potentially relevant thresholds:

  1. As a practical matter, can I run it once over the entire corpus?
  2. Could I retag the corpus more or less on whim, if it were necessary or useful?
  3. Could I retag the corpus in real time, on demand, if there turned out to be a reason to do so?

The first question is the most important one: I definitely have to be able to tag the entire corpus once. The others would be nice, especially #2, but I don’t yet have a compelling reason to require them.

With that in mind, there would have to be a pretty analytically significant difference in output quality to justify using the Stanford bidirectional tagger. Some quick calculations: Gutenberg has probably around 2500 usable English-language novels. Say they average 150,000 words each. Both of these are conservative estimates. That would mean 375 million words to tag. Assume, generously, that Stanford bidi averages 12 words per second. So a year of CPU time to tag the corpus. That could be cut down to something manageable with multiple, faster machines, but still, unless I get serious compute time, it would probably take a month or two. In principle, you only need to tag the dataset once, so this might be acceptable if it really did improve the analysis, but options 2 and 3 above would definitely be out with the bidi model, barring a major change in the amount of computing power at my disposal.
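
For the record, the arithmetic behind that year of CPU time:

novels, words_per_novel, words_per_second = 2500, 150_000, 12
total_words = novels * words_per_novel           # 375,000,000 words
cpu_days = total_words / words_per_second / 86_400
print(round(cpu_days))                           # ~362 CPU days, i.e. about a year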

Also: The system requirements of the bidirectional model may be significant. As noted in passing above, giving Java a gig of RAM wasn’t enough for the Joyce and Eliot texts, which crashed out with insufficient memory at 43% and 93% complete, respectively. They’re too bloody slow to retest at the moment with 1.5 GB allocated. In any case, I couldn’t currently go above that, since I only have 2 gigs in the system. True, memory is cheap(ish), but it’s neither free nor infinite. A related point to investigate if necessary: Would more memory or other tweaks significantly improve speed with either of the Stanford models?

The non-Stanford taggers could tag the entire corpus in a CPU day or less, and perhaps as little as an hour or two of real time with a small cluster. That would allow a lot of flexibility to rerun and to test things out. Stanford’s left3 model would take about two CPU weeks. That’s not ideal, but it’s certainly a possibility.

None of the taggers looks fast enough to do real-time reprocessing of the entire corpus given current computational power, but they could easily retag subsections in seconds to minutes, depending on size.

Finally, a brief foretaste on accuracy: Stanford’s own cross-validation indicates that the bidirectional model is 97.18% accurate, while left3 is 96.97%. It’s hard to imagine that 0.21% difference being significant, especially when both models drop to around 90% on unknown words. Haven’t yet seen or run similar tests on the others, nor have I done my own with Stanford. There’s reason to believe that MorphAdorner will be especially good on my corpus, given its literary training set, but that has yet to be verified; training other taggers on MorphAdorner’s data might also be possible.

Takeaway points on speed

Provisionally, then: MorphAdorner is probably the frontrunner: It’s fast enough, free, under current development, and geared toward literature. TreeTagger is fastest, but not by a qualitative margin. I’m also concerned that it’s not currently being developed and that the source code doesn’t appear to be available. Lingpipe is plenty fast, though I have some concerns about the license. Stanford left3 remains viable, but speed is a moderate strike against it. Stanford bidirectional is likely out of the running.