Evaluating POS Taggers: The Contenders

Here’s a quick summary of the POS taggers I’m looking at. This probably should have come before the initial results, but at the time there were only two candidates. Now it’s four. I’ll add others if I have a need or if people ask me to, but for the moment I’m satisfied with this list.

Lingpipe

As noted in an earlier post, Lingpipe is a commercial product from Alias-i. It’s a suite of Java applications for natural language processing, and it’s released under an unusual quasi-open-source license. If I’m reading the license correctly, you can use the software for free so long as you make the output freely available and aren’t selling whatever it is you do with it. My guess is that my research would qualify on all fronts, but I worry a bit about how the output redistribution requirement might limit my ability to work with commercial text repositories.

The version I’m evaluating is 3.6.0, which is current as of early November 2008. It’s trained on the Brown corpus, which is made up of about a million words of American English published in 1961. There’s some fiction in the training set, but it’s a general-purpose corpus, not a literature-specific one. Lingpipe uses what I presume is the Brown tagset (compare Alias-i’s own list of tags) and produces XML output. It can accept XML, HTML, or plain text input. It uses a hidden Markov model for POS determination, though I’m in no way equipped to know if that’s a good, bad, or indifferent thing.

Note: NLTK, to which Lingpipe sometimes refers, would be another option to check out if necessary. It’s written in Python and is just barely enough of a pain to install that I haven’t yet bothered. A review on the Linguist list had generally high praise for it as a teaching tool, but also said it was slower and less feature-rich than other packages. (“Many of the features that make it so good for learning get in the way of its usefulness for real work. … [NLTK is] too slow for programs involving large amounts of data.”)
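
For a sense of what NLTK offers, tagging there looks something like the sketch below, modeled on the examples in the NLTK book: train a simple unigram tagger on the Brown corpus with a default-tag backoff. It’s nowhere near state-of-the-art accuracy, but it shows the shape of the API (and assumes the Brown corpus data has already been downloaded).

```python
# Minimal NLTK sketch: a unigram tagger trained on the Brown corpus,
# backing off to a default "NN" tag for unseen words. Assumes the Brown
# data is installed, e.g. via nltk.download('brown'); APIs may differ
# slightly across NLTK versions.
import nltk
from nltk.corpus import brown

train_sents = brown.tagged_sents(categories='fiction')
backoff = nltk.DefaultTagger('NN')
tagger = nltk.UnigramTagger(train_sents, backoff=backoff)

# Whitespace tokenization keeps the example simple.
print(tagger.tag("Call me Ishmael .".split()))
```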

Stanford

The Stanford POS tagger is part of the Stanford NLP group’s suite of Java-based NLP tools. It’s licensed under GPL v.2, i.e., it’s free software. W00t!

I’m evaluating version 1.6, released 28 September 2008. The tagger is trained on a corpus from the Wall Street Journal (38 million words from 1988-89). It uses the Penn tagset (which is smaller than Brown, although that’s not necessarily a bad thing) and implements a maximum entropy tagging algorithm (ditto above on my inability to judge the relative merits of this approach). Takes plain text and XML input, produces equivalent output. The tagger offers two tagging models, left3words and bidirectional; the latter is said to be slower, but very slightly more accurate.
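
Since I’m driving all of these from scripts anyway, here’s roughly how I’d wrap the Stanford tagger from Python. The main class name is real; the jar and model filenames and the exact flags reflect my reading of the 1.6 distribution and should be treated as assumptions, since they vary between releases.

```python
# Hedged sketch: run the Stanford tagger as a subprocess over a plain-text
# file. The class name is real; jar/model paths and flags are assumptions
# based on the 1.6 distribution and may differ in other releases.
import subprocess

def stanford_tag(text_file, model="models/left3words-wsj-0-18.tagger",
                 jar="stanford-postagger.jar"):
    """Return the tagger's word/tag output for text_file as a string."""
    cmd = [
        "java", "-Xmx300m", "-cp", jar,
        "edu.stanford.nlp.tagger.maxent.MaxentTagger",
        "-model", model,          # swap in the bidirectional model to compare
        "-textFile", text_file,
    ]
    return subprocess.check_output(cmd).decode("utf-8")

if __name__ == "__main__":
    print(stanford_tag("sample.txt")[:500])
```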

TreeTagger

TreeTagger is a project of the Institute for Computational Linguistics at the University of Stuttgart. It’s released under a free, non-commercial license. The software was developed in the mid ’90s and doesn’t look to be currently maintained, though I could be wrong about that. There’s also no source code obviously available; the package is distributed as precompiled binaries for Linux, Mac, and Windows. This isn’t an immediate problem, but it might limit future flexibility. I also haven’t looked especially hard for the source code, nor have I attempted to contact the project developers.

I’m working with version 3.2, using the English chunker and parameters v. 3.1, all of which are the current releases. TreeTagger uses a probabilistic decision tree algorithm that the author claims is slightly better than Markov models. It uses the Penn tagset and is trained on the Penn treebank, a 4.5+ million word combination of the WSJ, Brown, ATIS, and Switchboard corpora. XML/SGML and plain text input are supported, but output is plain text only. It can be configured either to produce the single best tag for each word or to list multiple possibilities and their respective probabilities. Lingpipe can do something similar; Stanford cannot, I think.
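
One nice consequence of the plain-text output: it’s trivial to parse. TreeTagger emits one token per line, tab-separated as word, tag, lemma, so pulling it into a script looks like the sketch below (the output filename is hypothetical, and I’m assuming you’ve already run the English wrapper script, whose name varies by platform and install).

```python
# Hedged sketch: parse TreeTagger's tab-separated output
# (one token per line: word <TAB> POS tag <TAB> lemma).
# Assumes the English wrapper script has already been run; the
# input filename below is hypothetical.
from collections import Counter

def parse_treetagger(path):
    """Yield (word, tag, lemma) triples from a TreeTagger output file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 3:           # skip blank lines and SGML tags
                yield tuple(parts)

if __name__ == "__main__":
    tags = Counter(tag for _, tag, _ in parse_treetagger("crusoe.tagged"))
    print(tags.most_common(10))           # rough tag-frequency profile
```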

MorphAdorner

MorphAdorner is produced at Northwestern and is part of the MONK project. The software hasn’t been formally released yet, but should be released soon, as the project draws to a close. The license is GPL v.2+.

MorphAdorner is designed to prepare texts for analysis within the larger framework of tools available in MONK. What’s especially cool, for my purposes, is that it’s trained on specifically literary corpora, down to the level of separate nineteenth-century American and nineteenth-century British fiction collections, plus early modern English. So it’s likely to be significantly more accurate out of the box than the others. (Note for further investigation: How much work would be involved in using MorphAdorner’s training sets with the other packages, if necessary?) It also uses a custom tagset that’s “designed to accommodate written English from Chaucer on within one descriptive framework” (Martin Mueller, pers. comm.) and is generally equipped to work with literary texts in many genres from Middle English to the present. All of this distinguishes it from the other tools, which were generally developed for and by corpus linguists.

I’m evaluating a prerelease snapshot (labeled version 0.81) supplied to me by the developers on 6 November 2008. MorphAdorner is a command-line Java package, like Lingpipe and the Stanford tagger. It can handle XML or plain text input and produces equivalent output. (It would also be possible to write a new output routine to produce XML output from plain text input.) It uses a hidden Markov model, or so I infer from the Java class names (Pib, correct me if I’m wrong).
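
On that output-routine point: converting any tagger’s word/tag pairs into simple XML is only a few lines of work. The sketch below is a generic illustration of the idea, not MorphAdorner’s actual output classes.

```python
# Generic illustration (not MorphAdorner's own code): wrap (word, tag)
# pairs from any tagger in a minimal XML document.
from xml.sax.saxutils import escape

def pairs_to_xml(pairs):
    """Render an iterable of (word, tag) pairs as a simple XML string."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<text>"]
    for word, tag in pairs:
        lines.append('  <w pos="%s">%s</w>'
                     % (escape(tag, {'"': "&quot;"}), escape(word)))
    lines.append("</text>")
    return "\n".join(lines)

print(pairs_to_xml([("Call", "VB"), ("me", "PRP"), ("Ishmael", "NNP")]))
```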

Next up …

I think that’s it for the descriptive bit. An update to the speed comparison post (to finish the Stanford bidirectional run and add TreeTagger and MorphAdorner results) is coming shortly.

Evaluating POS Taggers: The Archive

This is the first in a short series of posts comparing part-of-speech taggers (at the moment, just Lingpipe and the Stanford NLP group’s tagger). I need to settle on one for my history and genre project, but/and I haven’t yet come across any kind of broad consensus concerning their relative merits. Note to self: Talk to the people over in linguistics.

So … this post is a description of the testing archive. I picked fourteen novels from Gutenberg; there’s nothing all that special about them, except that they’re fairly canonical (I’ve read all of them, which could be useful) and they feel more or less representative of English-language fiction in Gutenberg. (No strong claims on this front; a full characterization of Gutenberg will be part of the project proper.) They’re typical in length and are distributed reasonably well by British/US origin, author gender, and date of composition. The list, arranged chronologically:

  • Defoe, Robinson Crusoe (1719), Gutenberg etext #521, 121,515 words. M/Brit.
  • Richardson, Pamela (1740), #6124, 221,252 words. M/Brit.
  • Austen, Sense and Sensibility (1811), #161, 118,572 words. F/Brit.
  • Scott, Ivanhoe (1819), #82, 183,104 words. M/Scottish.
  • Brontë, Jane Eyre (1847), #1260, 185,455 words. F/Brit.
  • Hawthorne, The Scarlet Letter (1850), #33, 82,912 words. M/Amer.
  • Melville, Moby-Dick (1851), #2489, 212,467 words. M/Amer.
  • Stowe, Uncle Tom’s Cabin (1852), #203, 180,554 words. F/Amer.
  • Dickens, Great Expectations (1861), #1400, 184,425 words. M/Brit.
  • Alcott, Little Women (1868/1869), #514, 185,890 words. F/Amer.
  • Eliot, Middlemarch (1871-72/1874), #145, 316,158 words. F/Brit.
  • Twain, Huckleberry Finn (1885), #76, 110,294 words. M/Amer.
  • Conrad, Heart of Darkness (1899/1902), #526, 38,109 words. M/Brit/Polish.
  • Joyce, Ulysses (1922), #4300, 264,965 words. M/Irish.

A few observations on the archive:

  • Size: 14 novels, 2.4 million words. This is a tiny fraction of the Gutenberg fiction holdings in English; there’s no way it can be fully representative, and I don’t claim otherwise. I’ve assembled it for benchmarking purposes only. Still, my hope is that it’s not radically unrepresentative.
  • Distribution:
    • 9 men, 5 women
    • 9 British Isles (including Conrad), 5 American
    • 2 18th century, 10 19th, 2 20th
    • Maybe more to the point, genres/periods include the early novel, epistolary novel, realism (heavily represented), Romanticism, allegory, and early and high modernism.
  • There’s a heavy skew toward men, Brits, and the nineteenth century. My guess is that this is true of the Gutenberg holdings overall; it’s also not hugely out of line with, say, the MONK project’s fiction archive. Actual numbers for Gutenberg as a whole will follow at some point in the future.
  • Middlemarch is the longest text by a fair margin; Heart of Darkness is an outlier at an order of magnitude shorter. There’s an uncanny cluster of lengths around 185,000 words. I assume this has much to do with the economics of book publishing in nineteenth-century Britain. Will be interesting to see if this is true in the full holdings.

My preparation of the texts was minimal and pretty loose—again, the goal is benchmarking, not final critical accuracy. The files are all plain ASCII text. I removed the Gutenberg headers and legal disclaimers, as well as any critical apparatus (pretty rare in Gutenberg to begin with) and editor’s introductions. Chapter heads, prefaces by the original author, etc. stayed in. An archive of the prepared texts is available (tar.gz, 5 MB), though there’s no reason to get them from me rather than from Gutenberg unless you want to repeat my trials.
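
For the curious, the prep amounts to very little code. Something like the sketch below is what I mean: strip the Gutenberg boilerplate and do the crude whitespace word counts used above. The start/end marker strings are an assumption (they aren’t perfectly consistent across etexts), so a check by eye is still required.

```python
# Rough sketch of the prep described above: strip Gutenberg header/footer
# boilerplate and do a crude whitespace word count. The marker strings are
# an assumption and aren't consistent across all etexts, so spot-check the
# results by eye.
import glob
import re

START = re.compile(r"\*\*\*\s*START OF (THE|THIS) PROJECT GUTENBERG", re.I)
END = re.compile(r"\*\*\*\s*END OF (THE|THIS) PROJECT GUTENBERG", re.I)

def strip_boilerplate(text):
    """Return the text between the Gutenberg start/end markers,
    or the whole text if the markers aren't found."""
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if START.search(line):
            start = i + 1
        elif END.search(line):
            end = i
            break
    return "\n".join(lines[start:end])

for path in sorted(glob.glob("texts/*.txt")):
    with open(path, encoding="ascii", errors="replace") as f:
        body = strip_boilerplate(f.read())
    print(path, len(body.split()))        # crude whitespace word count
```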

More shortly in two posts, one on speed and the other on statistics and accuracy.

Google Books and Programmatic Full-Text Access

A great deal (read: “everything”) will depend on the details, but Google has just announced a settlement with book publishers that resolves their long-standing lawsuit. It includes a provision for “researchers to study the millions of volumes in the Book Search index. Academics will be able to apply through an institution to run computational queries through the index without actually reading individual books.”

More to follow. In the meantime, see Dan Cohen’s “First Impressions of the Google Books Settlement,” which hits the high points.

POS Taggers

Does anyone have an opinion on the relative merits of the various part-of-speech taggers? I’ve used (and had decent luck with) Lingpipe, which seems pretty quick and very accurate in my limited tests. I also just read a post by Matthew Jockers about the Stanford Log-linear Part-Of-Speech Tagger (which is what got me thinking about this; I admit I was largely sucked in by the discussion of Xgrid, which I’d really like to try). And I thought the Cornell NLP folks had one, too, though I now can’t find any reference to it, so I may well be wrong. Plus there’s MONK/Northwestern’s MorphAdorner (code not yet generally available, though I don’t think it would be a problem to get it), and any number of commercial options (less attractive, for many reasons).

I surely just need to test a bunch of them in some semi-systematic way, but is there any existing consensus about what works best for literary material?

How Not to Read a Million Books

A nice, talk-length summary of the MONK Project’s goals, methods, and initial use cases by John Unsworth, the co-PI. A useful place to direct people who wonder what digital literary studies might be about, if one doesn’t just want to dump them into Literary and Linguistic Computing or one of the many recent monographs/anthologies.

One note: John closes with a brief discussion of Brad Pasanek and D. Sculley’s recent piece, “Meaning and mining: The impact of implicit assumptions in data mining for the humanities,” in LLC, which is a kind of cautionary tale about the (fundamental, inescapable) role of interpretation in computationally assisted literary criticism. Pasanek and Sculley are right, of course, that computational results require close reading of their own, and there are probably people who sorely need this reminder—in fact there are probably a whole lot of humanities scholars who do—but I don’t think this group has much overlap with the set of people doing actual quantitative work. If there’s one thing we’ve learned from science studies—which is, after all, the sociological and theoretical study of fields that are grounded overwhelmingly in quantitative methods—it’s that experiments alone don’t tell us anything, much less give us unmediated access to objective truth. Pasanek and Sculley do a nice and valuable job of illustrating some specific issues in digital humanities, but I think the take-home message is “Remember Latour! (Or Kuhn! Or Fleck! Or Bloor! Or Shapin! Or …!)”

[Update: The original version of this post linked to the wrong Pasanek and Sculley article. I’ve corrected the link above; the erroneous one (well worth a read in its own right) was “Mining Millions of Metaphors.”]

Hathi, OCA, Gutenberg (and Local Stores)

Following Lisa’s recent comment on the Hathi Trust, I’ve been looking (briefly) into it as an alternative/supplement to OCA and to the much smaller Gutenberg archives. Some thoughts on their relative suitability for my project:

First, a note on usable sizes. Gutenberg has about 20,000 volumes, of which somewhere between 3,000 and 4,000 are novels in English (this off the top of my head from a while back; the numbers may be off a bit, but not by orders of magnitude). So Gutenberg, an archive that I think it’s fair to say skews toward fiction, is fifteen to twenty percent potentially usable material for my purposes. I haven’t yet looked closely at the specific historical distribution, but let’s assume the best case for my needs, i.e., that those usable volumes are distributed evenly across the period 1600–1920 (this won’t be true, of course, but I’m thinking of the limit case). So that’s on the order of 10 volumes/year. This is Not Good if I’m hoping to get useful statistics out of single-year bins, i.e., achieve one-year “resolution” of historical changes in literary texts.

OCA is larger: 545,232 “texts,” whatever that might mean. I’ll assume it’s similar to Gutenberg (looks that way on a brief inspection, though in either case the details will be murky), in which case OCA is 25-ish times larger. I’d be surprised if it’s as literature-heavy as Gutenberg, but let’s assume for the moment that it is. Again assuming uniform historical distribution, we’d expect 200-300 texts for any given year. That’s a lot more plausible, though still a bit low; in the real-case scenario (uneven distribution, likely lower concentration of literary texts), I’d expect not much better than 500 usable texts/year for the best years, and at least an order of magnitude less for poor ones (likely especially concentrated in earlier periods). Given uneven distribution, it might make sense to vary the historical bin size, i.e., to set a minimum number of volumes (say, 300 or 500) and group as many contiguous years together as necessary to achieve that sample size.
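
That last idea is straightforward to implement: walk the years in order and close a bin whenever it reaches the minimum count. A quick sketch, with per-year counts that are entirely made up for illustration:

```python
# Sketch of variable-width historical binning: group contiguous years until
# each bin holds at least min_size volumes. The per-year counts below are
# invented purely for illustration.

def bin_years(counts_by_year, min_size=300):
    """counts_by_year: {year: volume_count}.
    Returns a list of ((first_year, last_year), total_count) bins."""
    bins, years, total = [], [], 0
    for year in sorted(counts_by_year):
        years.append(year)
        total += counts_by_year[year]
        if total >= min_size:
            bins.append(((years[0], years[-1]), total))
            years, total = [], 0
    if years:                              # leftover partial bin at the end
        bins.append(((years[0], years[-1]), total))
    return bins

# Hypothetical distribution: sparse early years, denser later ones.
fake = {y: (5 if y < 1700 else 40 if y < 1800 else 250)
        for y in range(1600, 1921)}
print(bin_years(fake, min_size=300)[:5])
```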

(But NB re: OCA: A query for “subject:Literature” and media type “text” returns only c. 14,500 hits, which is much, much worse—only about 4x Gutenberg—and that’s including dubious “text” media like DjVu image files. On the other hand, I doubt that a subject search catches all the relevant texts. On the other other hand, it’s not like I’m going to go through 500k texts to classify them as literature or not; if the search doesn’t work, they may as well not exist. Further investigation obviously required.)
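
For what it’s worth, that kind of query can be scripted rather than run through the Web interface. The sketch below goes against what I understand to be the Internet Archive’s advanced-search endpoint; the URL and parameter names are assumptions on my part and may well change, so treat it as illustrative only.

```python
# Hedged sketch: count hits for a subject/mediatype query against what I
# believe is the Internet Archive's advanced-search endpoint. The URL and
# parameter names are assumptions and may change.
import json
import urllib.parse
import urllib.request

def oca_hit_count(query="subject:Literature AND mediatype:texts"):
    params = urllib.parse.urlencode({
        "q": query,
        "rows": 0,               # only the total count, not the records
        "output": "json",
    })
    url = "https://archive.org/advancedsearch.php?" + params
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["response"]["numFound"]

if __name__ == "__main__":
    print(oca_hit_count())
```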

Hathi is larger again: 2.2 million volumes. But there’s a catch: only 329,000 of those are public domain. So the public-domain content is on the order of OCA’s. And a very quick look at some Hathi texts isn’t promising in terms of OCR quality. (This, incidentally, is a comparative strong suit of Gutenberg; their editions may not be perfectly reliable for traditional literary scholarship, but they’re more than good enough for mining. Hathi, from what little I’ve seen, may not be.)

But this is all preliminary to another point: access to the OCA and Hathi collections isn’t (apparently) as easy as it is with Gutenberg. With Gutenberg, you just download the whole archive and do with it what you will. It’s short on metadata, which rules it out for a lot of purposes, at least in the absence of some major curatorial work (I’m working on some scripts to do a bit of this programmatically, e.g., by hitting the Library of Congress records to get an idea of first circulation dates). But if you can use what they have, it’s really easy to work with it on your own machine and in whatever form you like. I don’t yet know what’s involved in getting one’s hands on the OCA stuff; I assume they’re amenable, what with having “open” right in the name, and I doubt it would be hard to find out (in fact I’ll be doing exactly that soon), but there’s no ready one-stop way to make it happen. Still, Wget and some scripting love may be the answer.

Hathi is harder to evaluate at the moment, since they don’t even have unified search across the archive working yet (for now, you access the content via the library Web sites of participating institutions). Who knows how it’ll work in the long run? Can I slurp down at least the public domain content? Can I redistribute it, including whatever metadata comes with it? What if I’m not affiliated with a member institution? What about the copyrighted material? (I’m assuming no to this last one, even if I make friends at Michigan and Berkeley, etc.) It’s not that I distrust the Hathi folks—in fact I’m sure they want things to be as open as possible—but I do imagine they’ll have to be careful about copyright and other IP issues that might prevent their archive from being as useful to me as I’d like.

Which leads to one last piece of speculation: Hathi (or Google, for that matter) might offer an API through which to access some or all of their material. (I know Google offers a very limited version aimed at embedding snippets of texts in other contexts, but it seems grossly inadequate for full-text analysis.) This wouldn’t necessarily be bad, but unless it offers access to full texts (out of the question for Google, I think), it would likely be extremely constraining.

Note to Self: Teaching and The Programming Historian

While my class for next semester is more or less set as a combination of media studies and digital humanities, I need to decide how much programming and other technical background to teach in the future. To that end, I’ll be evaluating William Turkel and Alan MacEachern’s The Programming Historian as a pseudo-textbook.

As the authors say, it’s probably not suitable as a lone resource; it doesn’t include exercises, for instance, nor is it a programming reference work. But it’s a smart and well-organized walk through some typical usage scenarios, and it includes (some) suggested readings from Lutz’s Learning Python, which might make for a good complement. I’d be curious to know how others have approached this curricular problem. Google is also my friend; future reports as events warrant.

I suppose the other thing that would be worth thinking about would be some sort of class project (as opposed to, or in addition to, individual student projects). Which means, probably, having a text archive in place to begin with. On which, more in a minute …

Open Content Alliance

Had a very pleasant lunch today with Lisa Spiro, who’s also here at Rice. One of the things we talked about was the ever-present (and extremely frustrating) problem of assembling usefully large literary corpora for digital humanities projects. More specifically, for my project. I’ve been tinkering with the Gutenberg texts, and they’re not bad, really, but there aren’t that many of them, despite the fact that they now have something over 20,000 “books” (read: “catalog items”). That number is more like 3,000-5,000 if you’re looking at novels in English, and if you want meaningful statistics for whole texts (as opposed to chapters, etc.) in, say, single-year bands over the last 500 years, you need probably a couple orders of magnitude more. Commercial databases like Chadwyck-Healey aren’t much help even if you have access to them, since their numbers are similar. Google remains the holy grail, but I haven’t heard anything about success in getting them to allow greater scholarly access, nor would I expect it to be forthcoming soon (though I really, really hope I’m wrong, and I know there’s some exploratory work underway on that front).

Anyway, Lisa reminded me of the existence of the Open Content Alliance, which should have been stunningly obvious to me all along. I remember looking at it (or maybe it was just the Internet Archive) a couple of years ago and thinking “Meh, looks like cached Web pages and bad OCRs of a few thousand books.” That probably wasn’t a fair assessment even at the time, and it’s certainly not true now. I still need to do much more to assess its suitability, and it’s not immediately clear to me how I might pull most of their archive to process locally, nor what might be involved in getting it into usable shape for my purposes (I suspect none of this would be trivial), but it’s definitely high on my list of tasks. 535,000+ items is an intriguing number. Now if we could just find a way to import them into MONK …

Oh, and I should probably write up my DH project here at some point. I could pull something from an existing proposal, but I think it would be a useful exercise to go over it from scratch. We shall see.

Preliminary Syllabus: Intro to Digital Humanities

I’m working on my class for next semester, an introduction to digital humanities. It’s shaping up to be about a third media studies and the rest true DH, on the theory that you need to know something about why the medium of a work matters if you’re going to understand the changes wrought by digital texts.

But I don’t entirely like this split personality, which I think reflects a certain ambivalence in the field itself. Is what’s interesting about digital texts the fact that they take advantage of their “digitalness”? Or is it that we as scholars can do things with digital texts that are hard to do with dead-tree books? The former suggests we should lean on media studies, and that the most interesting objects will be born-digital works that may not have much to do with text in the conventional sense at all. The latter pushes us toward computation and text mining, suggests that what we want as objects of study are properly encoded/marked-up versions of more-or-less regular books, and lets us continue to ask more traditional literary-critical-historical questions.

I skew strongly toward the latter conception of the field, which better matches my own priorities and interests. But as a practical matter it’s far less developed than media studies, in part due to technical and professional limitations. So one way to teach a course like this would be to adopt Steve Ramsay’s approach: Stage a DH boot camp that teaches basic programming/scripting, database manipulation, text encoding, etc. I think that’s a fantastic thing to do, but it’s not something I wanted to work up from scratch at the moment. And my great hope is that someday soon it won’t be necessary; that tools like the MONK project will eliminate much of the low-level technical work that’s presently required to do meaningful corpus analysis. I fear, of course, that “soon” != soon enough, but I can hope, right? In the meantime, my students will be learning more about the theory and types of digital humanities projects than they will about Perl. Maybe I can send them to Steve for phase two.

DH Syllabus Draft.pdf

[Update: For reasons too bureaucratic to enumerate, the course title is now “Media Studies: Digital Humanities” (English 388).]