Had a very pleasant lunch today with Lisa Spiro, who’s also here at Rice. One of the things we talked about was the ever-present (and extremely frustrating) problem of assembling usefully large literary corpora for digital humanities projects. More specifically, for my project. I’ve been tinkering with the Gutenberg texts, and they’re not bad, really, but there aren’t that many of them, despite the fact that they now have something over 20,000 “books” (read: “catalog items”). That number is more like 3,000-5,000 if you’re looking at novels in English, and if you want meaningful statistics for whole texts (as opposed to chapters, etc.) in, say, single-year bands over the last 500 years, you probably need a couple of orders of magnitude more. Commercial databases like Chadwyck-Healey aren’t much help even if you have access to them, since their numbers are similar. Google remains the holy grail, but I haven’t heard anything about success in getting them to allow greater scholarly access, nor would I expect it to be forthcoming soon (though I really, really hope I’m wrong, and I know there’s some exploratory work underway on that front).
Anyway, Lisa reminded me of the existence of the Open Content Alliance, which should have been stunningly obvious to me all along. I remember looking at it (or maybe it was just the Internet Archive) a couple of years ago and thinking “Meh, looks like cached Web pages and bad OCRs of a few thousand books.” That probably wasn’t a fair assessment even at the time, and it’s certainly not true now. I still need to do much more to assess its suitability, and it’s not immediately clear to me how I might pull most of their archive to process locally, nor what might be involved in getting it into usable shape for my purposes (I suspect none of this would be trivial), but it’s definitely high on my list of tasks. 535,000+ items is an intriguing number. Now if we could just find a way to import them into MONK …
Oh, and I should probably write up my DH project here at some point. I could pull something from an existing proposal, but I think it would be a useful exercise to go over it from scratch. We shall see.
Hi Matt,
The recently announced Hathi Trust (see, for example, http://scholarlypublishing.org/jpwilkin/archives/16) may be another option.
I enjoyed lunch, too!
-Lisa