Evaluating POS Taggers: The Contenders

Here’s a quick summary of the POS taggers I’m looking at. This probably should have come before the initial results, but at the time there were only two candidates. Now it’s four. I’ll add others if I have a need or if people ask me to, but for the moment I’m satisfied with this list.

Lingpipe

As noted in an earlier post, Lingpipe is a commercial product from Alias-i. It’s a suite of Java applications for natural language processing, and it’s released under an unusual quasi-open-source license. If I’m reading the license correctly, you can use the software for free so long as you make the output freely available and aren’t selling whatever it is you do with it. My guess is that my research would qualify on all fronts, but I worry a bit about how the output redistribution requirement might limit my ability to work with commercial text repositories.

The version I’m evaluating is 3.6.0, which is current as of early November 2008. It’s trained on the Brown corpus, which is made up of about a million words of American English published in 1961. There’s some fiction in the training set, but it’s a general-purpose corpus, not a literature-specific one. Lingpipe uses what I presume is the Brown tagset (compare Alias-i’s own list of tags) and produces XML output. It can accept XML, HTML, or plain text input. It uses a hidden Markov model for POS determination, though I’m in no way equipped to know if that’s a good, bad, or indifferent thing.

Note: NLTK, to which Lingpipe sometimes refers, would be another option to check out if necessary. It’s written in Python and is just barely enough of a pain to install that I haven’t yet bothered. A review on the Linguist list had generally high praise for it as a teaching tool, but also said it was slower and less feature-rich than other packages. (“Many of the features that make it so good for learning get in the way of its usefulness for real work. … [NLTK is] too slow for programs involving large amounts of data.”)
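(For the curious, here’s roughly what off-the-shelf tagging with NLTK looks like. The function and download names below are from recent NLTK releases rather than whatever was current when I looked at it, so treat this as an illustrative sketch, not something I’ve benchmarked.)

    import nltk

    # One-time model downloads; these package names are the ones used by
    # recent NLTK releases and may differ in older versions.
    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')

    tokens = nltk.word_tokenize("Call me Ishmael.")
    print(nltk.pos_tag(tokens))
    # -> a list of (token, Penn-tagset tag) pairs, e.g. ('Ishmael', 'NNP')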

Stanford

The Stanford POS tagger is part of the Stanford NLP group’s suite of Java-based NLP tools. It’s licensed under GPL v.2, i.e., it’s free software. W00t!

I’m evaluating version 1.6, released 28 September 2008. The tagger is trained on a corpus from the Wall Street Journal (38 million words from 1988-89). It uses the Penn tagset (which is smaller than Brown, although that’s not necessarily a bad thing) and implements a maximum entropy tagging algorithm (ditto above on my inability to judge the relative merits of this approach). It takes plain text and XML input and produces equivalent output. The tagger offers two tagging models, left3words and bidirectional; the latter is said to be slower but very slightly more accurate.
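A minimal sketch of how one might drive it from a script, for anyone following along. The jar name, memory flag, and model file name below are placeholders for whatever ships in the actual distribution, so check them against the tagger’s README before relying on this:

    import subprocess

    # Placeholder paths: substitute the jar and model files from the tagger
    # distribution you downloaded (left3words shown; swap in the bidirectional
    # model to trade speed for a small accuracy gain).
    STANFORD_JAR = "stanford-postagger.jar"
    MODEL = "models/left3words-wsj-0-18.tagger"

    result = subprocess.run(
        ["java", "-mx300m", "-cp", STANFORD_JAR,
         "edu.stanford.nlp.tagger.maxent.MaxentTagger",
         "-model", MODEL,
         "-textFile", "sample.txt"],
        capture_output=True, text=True, check=True)

    print(result.stdout[:500])  # tagged text, roughly word_TAG word_TAG ...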

TreeTagger

TreeTagger is a project of the Institute for Computational Linguistics at the University of Stuttgart. It’s released under a free, non-commercial license. The software was developed in the mid ’90s and doesn’t look to be currently maintained, though I could be wrong about that. There’s also no source code obviously available; the package is distributed as precompiled binaries for Linux, Mac, and Windows. This isn’t an immediate problem, but it might limit future flexibility. I also haven’t looked especially hard for the source code, nor have I attempted to contact the project developers.

I’m working with version 3.2, using the English chunker and parameter files v. 3.1, all of which are the current releases. TreeTagger uses a probabilistic decision tree algorithm that the author claims is slightly better than Markov models. It uses the Penn tagset and is trained on the Penn Treebank, a 4.5+ million word combination of the WSJ, Brown, ATIS, and Switchboard corpora. XML/SGML and plain text input are supported, but output is plain text only. It can be configured to produce either the single best tag for each word or to list multiple possibilities and their respective probabilities. Lingpipe can do something similar; Stanford cannot, I think.
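Since the output side is plain text only, here’s a small sketch of how one might read it back in. It assumes TreeTagger’s usual one-token-per-line layout (token, tag, lemma, tab-separated), which is what the documentation describes:

    # Read TreeTagger output, assuming one token per line: token \t tag \t lemma.
    def read_treetagger(path):
        tagged = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip("\n").split("\t")
                if len(parts) == 3:              # skip blank or malformed lines
                    tagged.append(tuple(parts))  # (token, tag, lemma)
        return tagged

    # e.g. read_treetagger("crusoe.tagged")[:5]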

MorphAdorner

MorphAdorner is produced at Northwestern and is part of the MONK project. The software hasn’t been formally released yet, but should be released soon, as the project draws to a close. The license is GPL v.2+.

MorphAdorner is designed to prepare texts for analysis within the larger framework of tools available in MONK. What’s especially cool, for my purposes, is that it’s trained on specifically literary corpora, down to the level of separate nineteenth-century American and nineteenth-century British fiction collections, plus early modern English. So it’s likely to be significantly more accurate out of the box than the others. (Note for further investigation: How much work would be involved in using MorphAdorner’s training sets with the other packages, if necessary?) It also uses a custom tagset that’s “designed to accommodate written English from Chaucer on within one descriptive framework” (Martin Mueller, pers. comm.) and is generally equipped to work with literary texts in many genres from Middle English to the present. All of this distinguishes it from the other tools, which were generally developed for and by corpus linguists.

I’m evaluating a prerelease snapshot (labeled version 0.81) supplied to me by the developers on 6 November 2008. MorphAdorner is a command-line Java package, like Lingpipe and the Stanford tagger. It can handle XML or plain text input and produces equivalent output. (It would also be possible to write a new output routine to produce XML output from plain text input.) It uses a hidden Markov model, or so I infer from the Java class names (Pib, correct me if I’m wrong).

Next up …

I think that’s it for the descriptive bit. An update to the speed comparison post (to finish the Stanford bidirectional run and add TreeTagger and MorphAdorner results) is coming shortly.

Evaluating POS Taggers: The Archive

This is the first in a short series of posts comparing part-of-speech taggers (at the moment, just Lingpipe and the Stanford NLP group’s tagger). I need to settle on one for my history and genre project, but/and I haven’t yet come across any kind of broad consensus concerning their relative merits. Note to self: Talk to the people over in linguistics.

So … this post is a description of the testing archive. I picked fourteen novels from Gutenberg; there’s nothing all that special about them, except that they’re fairly canonical (I’ve read all of them, which could be useful) and they feel more or less representative of English-language fiction in Gutenberg. (No strong claims on this front; a full characterization of Gutenberg will be part of the project proper.) They’re typical in length and are distributed reasonably well by British/US origin, author gender, and date of composition. The list, arranged chronologically (a rough word-count sketch follows it):

  • Defoe, Robinson Crusoe (1719), Gutenberg etext #521, 121,515 words. M/Brit.
  • Richardson, Pamela (1740), #6124, 221,252 words. M/Brit.
  • Austen, Sense and Sensibility (1811), #161, 118,572 words. F/Brit.
  • Scott, Ivanhoe (1819), #82, 183,104 words. M/Scottish.
  • Brontë, Jane Eyre (1847), #1260, 185,455 words. F/Brit.
  • Hawthorne, The Scarlet Letter (1850), #33, 82,912 words. M/Amer.
  • Melville, Moby-Dick (1851), #2489, 212,467 words. M/Amer.
  • Stowe, Uncle Tom’s Cabin (1852), #203, 180,554 words. F/Amer.
  • Dickens, Great Expectations (1861), #1400, 184,425 words. M/Brit.
  • Alcott, Little Women (1868/1869), #514, 185,890 words. F/Amer.
  • Eliot, Middlemarch (1871-72/1874), #145, 316,158 words. F/Brit.
  • Twain, Huckleberry Finn (1885), #76, 110,294 words. M/Amer.
  • Conrad, Heart of Darkness (1899/1902), #526, 38,109 words. M/Brit/Polish.
  • Joyce, Ulysses (1922), #4300, 264,965 words. M/Irish.
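(In case anyone wants to reproduce the rough word counts above: a simple whitespace split is close enough for this purpose, though it won’t match any particular tokenizer exactly, and it may not match how the figures above were computed. The directory name below is hypothetical.)

    import glob

    # Rough word counts by whitespace splitting; figures will differ slightly
    # from any real tokenizer's counts.
    total = 0
    for path in sorted(glob.glob("archive/*.txt")):   # hypothetical directory of prepared texts
        with open(path, encoding="ascii", errors="replace") as f:
            n = len(f.read().split())
        total += n
        print(f"{path}: {n:,} words")
    print(f"Total: {total:,} words")                  # should land near 2.4 million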

A few observations on the archive:

  • Size: 14 novels, 2.4 million words. This is a tiny fraction of the Gutenberg fiction holdings in English; there’s no way it can be fully representative, and I don’t claim otherwise. I’ve assembled it for benchmarking purposes only. Still, my hope is that it’s not radically unrepresentative.
  • Distribution:
    • 9 men, 5 women
    • 9 British Isles (including Conrad), 5 American
    • 2 18th century, 10 19th, 2 20th
    • Maybe more to the point, genres/periods include the early novel, epistolary novel, realism (heavily represented), Romanticism, allegory, and early and high modernism.
  • There’s a heavy skew toward men, Brits, and the nineteenth century. My guess is that this is true of the Gutenberg holdings overall; it’s also not hugely out of line with, say, the MONK project’s fiction archive. Actual numbers for Gutenberg as a whole will follow at some point in the future.
  • Middlemarch is the longest text by a fair margin; Heart of Darkness is an outlier at an order of magnitude shorter. There’s an uncanny cluster of lengths around 185,000 words. I assume this has much to do with the economics of book publishing in nineteenth-century Britain. Will be interesting to see if this is true in the full holdings.

My preparation of the texts was minimal and pretty loose—again, the goal is benchmarking, not final critical accuracy. The files are all plain ASCII text. I removed the Gutenberg headers and legal disclaimers, as well as any critical apparatus (pretty rare in Gutenberg to begin with) and editor’s introductions. Chapter heads, prefaces by the original author, etc. stayed in. An archive of the prepared texts is available (tar.gz, 5 MB), though there’s no reason to get them from me rather than from Gutenberg unless you want to repeat my trials.
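(If you’d rather script the header removal than do it by hand, something like the following is a reasonable first pass. This isn’t how I prepared the files above, just a sketch for anyone repeating the trials; the START/END marker wording varies across Gutenberg etexts, and some older files lack markers entirely, so the result still wants a manual once-over.)

    # Rough first pass at stripping Project Gutenberg boilerplate. Marker
    # wording varies across etexts (and very old ones may lack markers), so
    # check the results by hand.
    def strip_gutenberg_boilerplate(text):
        lines = text.splitlines()
        start, end = 0, len(lines)
        for i, line in enumerate(lines):
            upper = line.upper()
            if "START OF" in upper and "GUTENBERG" in upper:
                start = i + 1
            elif "END OF" in upper and "GUTENBERG" in upper:
                end = i
                break
        return "\n".join(lines[start:end])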

More shortly in two posts, one on speed and the other on statistics and accuracy.

POS Taggers

Does anyone have an opinion on the relative merits of the various part-of-speech taggers? I’ve used (and had decent luck with) Lingpipe, which seems pretty quick and very accurate in my limited tests. I also just read a post by Matthew Jockers about the Stanford Log-linear Part-Of-Speech Tagger (which is what got me thinking about this; I admit I was largely sucked in by the discussion of Xgrid, which I’d really like to try). And I thought the Cornell NLP folks had one, too, though I now can’t find any reference to it, so I may well be wrong. Plus there’s MONK/Northwestern’s MorphAdorner (code not yet generally available, though I don’t think it would be a problem to get it), and any number of commercial options (less attractive, for many reasons).

I surely just need to test a bunch of them in some semi-systematic way, but is there any existing consensus about what works best for literary material?