Following Lisa’s recent comment on the Hathi Trust, I’ve been looking (briefly) into it as an alternative/supplement to the OCA (Open Content Alliance) and to the much smaller Gutenberg archive. Some thoughts on their relative suitability for my project:
First, a note on usable sizes. Gutenberg has about 20,000 volumes, of which somewhere between 3,000 and 4,000 are novels in English (this is off the top of my head from a while back; the numbers may be off a bit, but not by orders of magnitude). So Gutenberg, an archive that I think it’s fair to say skews toward fiction, is fifteen to twenty percent potentially usable material for my purposes. I haven’t yet looked closely at the specific historical distribution, but let’s assume the best case for my needs, i.e., that those usable volumes are distributed evenly across the period 1600–1920 (this won’t be true, of course, but I’m thinking of the limit case). Call it 3,500 novels over 320 years: on the order of 10 volumes/year. This is Not Good if I’m hoping to get useful statistics out of single-year bins, i.e., achieve one-year “resolution” of historical changes in literary texts.
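For concreteness, here’s that back-of-envelope arithmetic as a quick Python sketch, with the figures above taken at face value:

```python
# Rough Gutenberg figures from above, taken at face value.
total_volumes = 20000        # approximate size of the Gutenberg archive
usable = 3500                # midpoint of the 3,000-4,000 novel estimate
years = 1920 - 1600          # the 1600-1920 window

print(f"usable fraction: {100.0 * usable / total_volumes:.0f}%")  # ~18%
print(f"volumes/year (uniform): {usable / years:.1f}")            # ~10.9
```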
OCA is larger: 545,232 “texts,” whatever that might mean. I’ll assume it’s similar to Gutenberg (looks that way on a brief inspection, though in either case the details will be murky), in which case OCA is 25-ish times larger. I’d be surprised if it’s as literature-heavy as Gutenberg, but let’s assume for the moment that it is. Again assuming uniform historical distribution, we’d expect 200–300 usable texts for any given year. That’s a lot more plausible, though still a bit low; in the realistic scenario (uneven distribution, likely lower concentration of literary texts), I’d expect not much better than 500 usable texts/year for the best years, and at least an order of magnitude less for poor ones (likely especially concentrated in earlier periods). Given uneven distribution, it might make sense to vary the historical bin size, i.e., to set a minimum number of volumes (say, 300 or 500) and group as many contiguous years together as necessary to achieve that sample size.
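That variable-bin idea is straightforward to sketch. Something like the following, with a hypothetical counts mapping (year → number of usable texts) standing in for real data:

```python
def variable_bins(counts, min_volumes=300):
    """Group contiguous years into bins holding at least min_volumes texts.

    counts: dict mapping year -> number of usable texts in that year.
    Returns a list of ((start_year, end_year), total_texts) pairs.
    """
    bins, start, running = [], None, 0
    for year in sorted(counts):
        if start is None:
            start = year
        running += counts[year]
        if running >= min_volumes:
            bins.append(((start, year), running))
            start, running = None, 0
    if start is not None:                 # leftover years at the end
        bins.append(((start, max(counts)), running))
    return bins

# Toy data: sparse seventeenth-century years, denser nineteenth-century ones.
counts = {1600: 2, 1601: 1, 1602: 5}
counts.update((y, 40) for y in range(1880, 1900))
for (lo, hi), n in variable_bins(counts, min_volumes=100):
    print(f"{lo}-{hi}: {n} texts")
```

One consequence worth keeping in mind: early bins will span decades while late ones span a year or two, so any per-bin statistic has to be read against its bin width.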
(But NB re: OCA: A query for “subject:Literature” and media type “text” returns only c. 14,500 hits, which is much, much worse (only about 4x Gutenberg’s usable count), and that’s including dubious “text” media like DjVu image files. On the other hand, I doubt that a subject search catches all the relevant texts. On the other other hand, it’s not like I’m going to go through 500k texts to classify them as literature or not; if the search doesn’t work, they may as well not exist. Further investigation obviously required.)
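For what it’s worth, that query is easy to reproduce (and refine) programmatically: the Archive exposes its search via an advancedsearch.php endpoint that can return JSON. A minimal sketch (parameter names from memory; worth double-checking against the Archive’s documentation):

```python
import json
import urllib.parse
import urllib.request

# Reproduce the subject:Literature + mediatype:texts query against the
# Internet Archive's advanced-search endpoint, asking for JSON back.
params = urllib.parse.urlencode({
    "q": "subject:Literature AND mediatype:texts",
    "fl[]": "identifier",   # just the item identifiers, for later fetching
    "rows": 50,
    "page": 1,
    "output": "json",
})
url = "https://archive.org/advancedsearch.php?" + params

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

print(data["response"]["numFound"], "matches")
for doc in data["response"]["docs"][:10]:
    print(doc["identifier"])
```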
Hathi is larger again: 2.2 million volumes. But there’s a catch – only 329,000 of those are public domain. So the public-domain content is on the order of OCA’s. And a very quick look at some Hathi texts suggests their OCR quality isn’t promising. (This, incidentally, is a comparative strong suit of Gutenberg; their editions may not be perfectly reliable for traditional literary scholarship, but they’re more than good enough for mining. Hathi, from what little I’ve seen, may not be.)
But this is all preliminary to another point, which is that access to the OCA and Hathi collections isn’t (apparently) as easy as Gutenberg’s. With Gutenberg, you just download the whole archive and do with it what you will. It’s short on metadata, which rules it out for a lot of purposes, at least in the absence of some major curatorial work. (I’m working on some scripts to do a bit of this programmatically, e.g., by hitting the Library of Congress records to get an idea of first circulation dates.) But if you can use what they have, it’s really easy to work with on your own machine and in whatever form you like. I don’t yet know what’s involved in getting one’s hands on the OCA stuff; I assume they’re amenable, what with having “open” right in the name, and I doubt it would be hard to find out (in fact I’ll be doing exactly that soon), but there’s no ready one-stop way to make it happen. Still, Wget and some scripting love may be the answer.
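In that spirit, here’s roughly what “Wget and some scripting love” might look like for OCA: feed it identifiers from a search like the one above and pull down each item’s plain-text OCR. (The <identifier>_djvu.txt filename is an assumption on my part; it seems to be the common pattern for the Archive’s OCR output, but I haven’t verified that it holds across the collection.)

```python
import os
import subprocess
import time

# Hypothetical identifiers, e.g. harvested via the search sketch above.
identifiers = ["someitem00exampleuoft"]   # placeholder, not a real item

os.makedirs("oca_texts", exist_ok=True)
for ident in identifiers:
    # Assumed (common, but possibly not universal) location of the OCR text.
    url = f"https://archive.org/download/{ident}/{ident}_djvu.txt"
    out = os.path.join("oca_texts", ident + ".txt")
    subprocess.run(["wget", "-q", "-O", out, url], check=False)
    time.sleep(2)   # be polite to the Archive's servers
```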
Hathi is harder to evaluate at the moment, since they don’t even have unified search across the archive working yet (for now, you access the content via the library Web sites of participating institutions). Who knows how it’ll work in the long run? Can I slurp down at least the public domain content? Can I redistribute it, including whatever metadata comes with it? What if I’m not affiliated with a member institution? What about the copyrighted material? (I’m assuming no to this last one, even if I make friends at Michigan and Berkeley, etc.) It’s not that I distrust the Hathi folks—in fact I’m sure they want things to be as open as possible—but I do imagine they’ll have to be careful about copyright and other IP issues that might prevent their archive from being as useful to me as I’d like.
Which leads to one last piece of speculation: Hathi (or Google, for that matter) might offer an API through which to access some or all of their material. (I know Google offers a very limited version aimed at embedding snippets of texts in other contexts, but it seems grossly inadequate for full-text analysis.) This wouldn’t necessarily be bad, but unless it offers access to full texts (out of the question for Google, I think), it would likely be extremely constraining.