This is the first in a short series of posts comparing part-of-speech taggers (at the moment, just LingPipe and the Stanford NLP group’s tagger). I need to settle on one for my history and genre project, and I haven’t yet come across any kind of broad consensus concerning their relative merits. Note to self: Talk to the people over in linguistics.
So … this post is a description of the testing archive. I picked fourteen novels from Gutenberg; there’s nothing all that special about them, except that they’re fairly canonical (I’ve read all of them, which could be useful) and they feel more or less representative of English-language fiction in Gutenberg. (No strong claims on this front; a full characterization of Gutenberg will be part of the project proper.) They’re typical in length and are distributed reasonably well by British/US origin, author gender, and date of composition. The list, arranged chronologically:
- Defoe, Robinson Crusoe (1719), Gutenberg etext #521, 121,515 words. M/Brit.
- Richardson, Pamela (1740), #6124, 221,252 words. M/Brit.
- Austen, Sense and Sensibility (1811), #161, 118,572 words. F/Brit.
- Scott, Ivanhoe (1819), #82, 183,104 words. M/Scottish.
- Brontë, Jane Eyre (1847), #1260, 185,455 words. F/Brit.
- Hawthorne, The Scarlet Letter (1850), #33, 82,912 words. M/Amer.
- Melville, Moby-Dick (1851), #2489, 212,467 words. M/Amer.
- Stowe, Uncle Tom’s Cabin (1852), #203, 180,554 words. F/Amer.
- Dickens, Great Expectations (1861), #1400, 184,425 words. M/Brit.
- Alcott, Little Women (1868/1869), #514, 185,890 words. F/Amer.
- Eliot, Middlemarch (1871-72/1874), #145, 316,158 words. F/Brit.
- Twain, Huckleberry Finn (1885), #76, 110,294 words. M/Amer.
- Conrad, Heart of Darkness (1899/1902), #526, 38,109 words. M/Brit/Polish.
- Joyce, Ulysses (1922), #4300, 264,965 words. M/Irish.
A few observations on the archive:
- Size: 14 novels, 2.4 million words. This is a tiny fraction of the Gutenberg fiction holdings in English; there’s no way it can be fully representative, and I don’t claim otherwise. I’ve assembled it for benchmarking purposes only. Still, my hope is that it’s not radically unrepresentative.
- Distribution:
  - 9 men, 5 women
  - 9 British Isles (including Conrad), 5 American
  - 2 18th century, 10 19th, 2 20th (counting Heart of Darkness by its 1902 book publication)
- Maybe more to the point, genres/periods include the early novel, epistolary novel, realism (heavily represented), Romanticism, allegory, and early and high modernism.
- There’s a heavy skew toward men, Brits, and the nineteenth century. My guess is that this is true of the Gutenberg holdings overall; it’s also not hugely out of line with, say, the MONK project’s fiction archive. Actual numbers for Gutenberg as a whole will follow at some point in the future.
- Middlemarch is the longest text by a fair margin; Heart of Darkness is the outlier at the other end, at roughly a fifth of the median length. There’s an uncanny cluster of lengths around 185,000 words (five of the fourteen texts fall between 180,000 and 186,000). I assume this has much to do with the economics of book publishing in nineteenth-century Britain. Will be interesting to see if this is true in the full holdings.
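For anyone who wants to check the tallies above, here is a minimal sketch that recomputes them from the list of texts. The tuple layout, the `summarize` helper, and the two-bucket origin coding ("BI" for British Isles, "US" for American) are my own illustrative choices, and I've dated Heart of Darkness to 1902 so that the century counts line up.

```python
from collections import Counter
from statistics import median

# (author, title, year, words, gender, origin), copied from the list above.
# Assumptions: origin is collapsed to "BI" (British Isles) or "US";
# Heart of Darkness uses 1902 (book publication) rather than 1899.
ARCHIVE = [
    ("Defoe", "Robinson Crusoe", 1719, 121_515, "M", "BI"),
    ("Richardson", "Pamela", 1740, 221_252, "M", "BI"),
    ("Austen", "Sense and Sensibility", 1811, 118_572, "F", "BI"),
    ("Scott", "Ivanhoe", 1819, 183_104, "M", "BI"),
    ("Brontë", "Jane Eyre", 1847, 185_455, "F", "BI"),
    ("Hawthorne", "The Scarlet Letter", 1850, 82_912, "M", "US"),
    ("Melville", "Moby-Dick", 1851, 212_467, "M", "US"),
    ("Stowe", "Uncle Tom's Cabin", 1852, 180_554, "F", "US"),
    ("Dickens", "Great Expectations", 1861, 184_425, "M", "BI"),
    ("Alcott", "Little Women", 1868, 185_890, "F", "US"),
    ("Eliot", "Middlemarch", 1871, 316_158, "F", "BI"),
    ("Twain", "Huckleberry Finn", 1885, 110_294, "M", "US"),
    ("Conrad", "Heart of Darkness", 1902, 38_109, "M", "BI"),
    ("Joyce", "Ulysses", 1922, 264_965, "M", "BI"),
]


def century(year):
    """Return the ordinal century for a year (1719 -> 18, 1902 -> 20)."""
    return (year - 1) // 100 + 1


def summarize(archive):
    """Recompute the size, distribution, and length observations."""
    words = [row[3] for row in archive]
    return {
        "texts": len(archive),
        "total_words": sum(words),
        "median_words": median(words),
        "by_gender": Counter(row[4] for row in archive),
        "by_origin": Counter(row[5] for row in archive),
        "by_century": Counter(century(row[2]) for row in archive),
        "longest": max(archive, key=lambda row: row[3])[1],
        "shortest": min(archive, key=lambda row: row[3])[1],
        # Texts in the cluster around 185,000 words.
        "near_185k": sorted(
            row[1] for row in archive if 180_000 <= row[3] <= 186_000
        ),
    }


if __name__ == "__main__":
    for key, value in summarize(ARCHIVE).items():
        print(f"{key}: {value}")
```

The total comes to 2,405,672 words, which squares with the 2.4 million figure. The 180,000 to 186,000 window for the cluster is an arbitrary cutoff; widen or narrow it to taste.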
My preparation of the texts was minimal and pretty loose; again, the goal is benchmarking, not final critical accuracy. The files are all plain ASCII text. I removed the Gutenberg headers and legal disclaimers, as well as any critical apparatus (pretty rare in Gutenberg to begin with) and editors’ introductions. Chapter heads, prefaces by the original author, etc. stayed in. An archive of the prepared texts is available (tar.gz, 5 MB), though there’s no reason to get them from me rather than from Gutenberg unless you want to repeat my trials.
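If you'd rather start from the raw Gutenberg files, the header and disclaimer removal can be roughed out along these lines. This is a sketch, not a record of how I prepared the texts: it assumes each file carries the `*** START OF ... PROJECT GUTENBERG EBOOK` and `*** END OF ...` marker lines, which many but not all Gutenberg files do (older etexts use other boilerplate), and it does nothing about critical apparatus or editors' introductions, which still need removing by hand. The function names and the whitespace word count are illustrative, and that count won't necessarily match the figures in the list above.

```python
import re

# Assumed marker lines. Older etexts use different boilerplate
# (e.g. a "SMALL PRINT" block) that these patterns will not catch.
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.MULTILINE | re.IGNORECASE,
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.MULTILINE | re.IGNORECASE,
)


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines.

    If a marker is missing, that end of the text is left untouched.
    """
    start = START_RE.search(text)
    begin = start.end() if start else 0
    # Look for the END marker only after the START marker.
    end = END_RE.search(text, begin)
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()


def count_words(text):
    """Crude whitespace-delimited token count (an assumed method)."""
    return len(text.split())


if __name__ == "__main__":
    sample = (
        "Legal header\n"
        "*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n"
        "\n"
        "CHAPTER I\n"
        "\n"
        "Call me Ishmael.\n"
        "\n"
        "*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n"
        "License text\n"
    )
    body = strip_gutenberg_boilerplate(sample)
    print(body)
    print(count_words(body))
```

Chapter heads and authors' prefaces sit between the two markers, so they survive, which matches what I kept.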
More shortly in two posts, one on speed and the other on statistics and accuracy.