Here’s a quick summary of the POS taggers I’m looking at. This probably should have come before the initial results, but at the time there were only two candidates. Now it’s four. I’ll add others if I have a need or if people ask me to, but for the moment I’m satisfied with this list.
As noted in an earlier post, Lingpipe is a commercial product from Alias-i. It’s a suite of Java applications for natural language processing, and it’s released under an unusual quasi-open-source license. If I’m reading the license correctly, you can use the software for free so long as you make the output freely available and aren’t selling whatever it is you do with it. My guess is that my research would qualify on all fronts, but I worry a bit about how the output redistribution requirement might limit my ability to work with commercial text repositories.
The version I’m evaluating is 3.6.0, which is current as of early November 2008. It’s trained on the Brown corpus, which is made up of about a million words of American English published in 1961. There’s some fiction in the training set, but it’s a general-purpose corpus, not a literature-specific one. Lingpipe uses what I presume is the Brown tagset (compare Alias-i’s own list of tags) and produces XML output. It can accept XML, HTML, or plain text input. It uses a hidden Markov model for POS determination, though I’m in no way equipped to know if that’s a good, bad, or indifferent thing.
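For anyone as hazy on hidden Markov models as I am: the idea is that tags are hidden states, words are emissions, and the Viterbi algorithm picks the single most probable tag sequence. Here’s a toy sketch in Python; the probabilities and the simplified three-tag set are invented for illustration, not taken from the Brown corpus or from Lingpipe.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under a toy HMM."""
    # best[t] = (probability, path) for the best tag path ending in tag t
    best = {t: (start_p[t] * emit_p[t].get(words[0], 1e-6), [t]) for t in tags}
    for word in words[1:]:
        new_best = {}
        for t in tags:
            # Pick the previous tag that maximizes path prob * transition * emission.
            prob, prev = max(
                (best[s][0] * trans_p[s][t] * emit_p[t].get(word, 1e-6), s)
                for s in tags
            )
            new_best[t] = (prob, best[prev][1] + [t])
        best = new_best
    return max(best.values())[1]

# Invented toy model: determiner, noun, verb.
tags = ["DET", "NN", "VB"]
start_p = {"DET": 0.6, "NN": 0.3, "VB": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NN": 0.9, "VB": 0.05},
    "NN":  {"DET": 0.1,  "NN": 0.3, "VB": 0.6},
    "VB":  {"DET": 0.5,  "NN": 0.4, "VB": 0.1},
}
emit_p = {
    "DET": {"the": 0.9},
    "NN":  {"dog": 0.5, "barks": 0.1},
    "VB":  {"barks": 0.6},
}

print(viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p))
# -> ['DET', 'NN', 'VB']
```

A real tagger estimates those transition and emission probabilities from a training corpus (in Lingpipe’s case, Brown), which is why the training material matters so much for accuracy on literary texts.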
Note: NLTK, to which Lingpipe sometimes refers, would be another option to check out if necessary. It’s written in Python and is just barely enough of a pain to install that I haven’t yet bothered. A review on the Linguist list had generally high praise for it as a teaching tool, but also said it was slower and less feature-rich than other packages. (“Many of the features that make it so good for learning get in the way of its usefulness for real work. … [NLTK is] too slow for programs involving large amounts of data.”)
I’m evaluating the Stanford tagger, version 1.6, released 28 September 2008. The tagger is trained on a corpus from the Wall Street Journal (38 million words from 1988–89). It uses the Penn tagset (which is smaller than Brown, although that’s not necessarily a bad thing) and implements a maximum entropy tagging algorithm (ditto above on my inability to judge the relative merits of this approach). Takes plain text and XML input, produces equivalent output. The tagger offers two tagging models, left3words and bidirectional; the latter is said to be slower, but very slightly more accurate.
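As I understand it, the maximum entropy approach scores each candidate tag by summing the weights of the contextual features that fire, then normalizing with an exponential. Here’s a minimal sketch of that scoring step; the feature names and weights are invented for illustration and bear no relation to the Stanford tagger’s actual model.

```python
import math

def maxent_probs(features, tags, weights):
    """P(tag | context) = exp(sum of feature weights for that tag) / normalizer."""
    scores = {
        t: math.exp(sum(weights.get((f, t), 0.0) for f in features))
        for t in tags
    }
    z = sum(scores.values())
    return {t: s / z for t, s in scores.items()}

# Invented (feature, tag) weights; a trained model would have millions.
weights = {
    ("word=barks", "VBZ"): 1.5,
    ("suffix=-s",  "VBZ"): 0.8,
    ("suffix=-s",  "NNS"): 1.0,
    ("prev=dog",   "VBZ"): 0.7,
}
probs = maxent_probs(
    ["word=barks", "suffix=-s", "prev=dog"],
    ["VBZ", "NNS", "NN"],
    weights,
)
best = max(probs, key=probs.get)
print(best)  # -> VBZ
```

The appeal of this family of models, as far as I can tell, is that you can throw in arbitrary overlapping features (word identity, suffixes, surrounding words) without worrying about their independence, which an HMM can’t easily do.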
TreeTagger is a project of the Institute for Computational Linguistics at the University of Stuttgart. It’s released under a free, non-commercial license. The software was developed in the mid ’90s and doesn’t look to be currently maintained, though I could be wrong about that. There’s also no source code obviously available; the package is distributed as precompiled binaries for Linux, Mac, and Windows. This isn’t an immediate problem, but it might limit future flexibility. I also haven’t looked especially hard for the source code, nor have I attempted to contact the project developers.
I’m working with version 3.2, using the English chunker and parameters v. 3.1, all of which are the current releases. TreeTagger uses a probabilistic decision tree algorithm that the author claims is slightly better than Markov models. It uses the Penn tagset and is trained on the Penn treebank, a 4.5+ million word combination of the WSJ, Brown, ATIS, and Switchboard corpora. XML/SGML and plain text input is supported, but output is plain text only. It can be configured to produce either the single best tag for each word, or to list multiple possibilities and their respective probabilities. Lingpipe can do something similar; Stanford cannot, I think.
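TreeTagger’s plain-text output is easy to work with despite the lack of XML: one token per line, with the word, its Penn tag, and the lemma separated by tabs. A few lines of Python turn that into tuples for downstream use. The sample below is constructed to match that format, not taken from an actual TreeTagger run.

```python
# Constructed sample in TreeTagger's word<TAB>tag<TAB>lemma format.
sample_output = """\
The\tDT\tthe
dogs\tNNS\tdog
bark\tVBP\tbark
"""

def parse_treetagger(text):
    """Parse TreeTagger plain-text output into (word, tag, lemma) tuples."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            word, tag, lemma = line.split("\t")
            rows.append((word, tag, lemma))
    return rows

for word, tag, lemma in parse_treetagger(sample_output):
    print(word, tag, lemma)
```

When TreeTagger is configured to list multiple candidate tags with probabilities, the line format changes accordingly, so a parser like this would need adjusting for that mode.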
MorphAdorner is designed to prepare texts for analysis within the larger framework of tools available in MONK. What’s especially cool, for my purposes, is that it’s trained on specifically literary corpora, down to the level of separate nineteenth-century American and nineteenth-century British fiction collections, plus early modern English. So it’s likely to be significantly more accurate out of the box than the others. (Note for further investigation: How much work would be involved in using MorphAdorner’s training sets with the other packages, if necessary?) It also uses a custom tagset that’s “designed to accommodate written English from Chaucer on within one descriptive framework” (Martin Mueller, pers. comm.), and it’s generally equipped to work with literary texts in many genres from Middle English to the present. All of this distinguishes it from the other tools, which were developed for and by corpus linguists.
I’m evaluating a prerelease snapshot (labeled version 0.81) supplied to me by the developers on 6 November 2008. MorphAdorner is a command-line Java package, like Lingpipe and the Stanford tagger. It can handle XML or plain text input and produces equivalent output. (It would also be possible to write a new output routine to produce XML output from plain text input.) It uses a hidden Markov model, or so I infer from the Java class names (Pib, correct me if I’m wrong).
Next up …
I think that’s it for the descriptive bit. An update to the speed comparison post (to finish the Stanford bidirectional run and add TreeTagger and MorphAdorner results) is coming shortly.