I spent most of the day—a beautiful, sunny, perfect spring day that I’ll never get back—munging Gutenberg catalog data to see how their holdings stack up for a short-term project of mine. I suppose this built character, and I know from experience that it’s useful to spend time poking around in your data. Still …
A few numbers that stood out to me (mostly rounded for easier reading):
There are close to 32,000 total volumes in the Gutenberg catalog, of which almost 20,000 have Library of Congress subject codes. This is good, but not perfect. Nine months ago, the numbers were 29,000 and a little over 16,000. This tells me that pretty much all new additions are being cataloged with full(ish) metadata, but there’s not much progress being made on filling in old records (and the old stuff is often high-profile, since it was what people worked on first).
Of the c. 32K total volumes, about 26,500 are in English. Among English titles, 16,600 have LC codes, about the same rate as for all titles.
There are 3,500 titles in English with LC code PR (British literature) and 3,400 with code PS (American). There are another 3,200 P* titles in English, most of which are translations from other languages. So we’re looking at roughly 7,000 readily identifiable titles of British and American literature in English from Gutenberg at the moment. (Note that all these PR/PS numbers exclude about 120 volumes by authors unknown, various, or missing.)
If the currently untagged volumes contain literature in the same proportion as the tagged ones, that number (7,000) should rise to about 11,000 once everything is fully cataloged. But I’m not holding my breath for retrospective catalog work unless I do it myself by automating some queries against the LC servers. That’s an idea I’ve been kicking around for a while. Not sure if it’s worth the effort to increase the size of the relevant (to me) Gutenberg corpus by a bit more than 50%.
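For anyone who wants to check the extrapolation, here is a minimal sketch of the arithmetic. It uses only the rounded figures quoted above and assumes that untagged English volumes contain PR/PS literature at the same rate as tagged ones. The function and variable names are mine, purely for illustration.

```python
# Rounded figures quoted in the post; variable names are illustrative.
english_total = 26_500      # English-language volumes in the catalog
english_tagged = 16_600     # English volumes that carry LC codes
lit_tagged = 3_500 + 3_400  # English PR + PS titles among the tagged ones


def project_full_catalog(lit_tagged: int, tagged: int, total: int) -> float:
    """Scale the tagged literature count up to the whole English catalog.

    Assumes untagged volumes contain literature in the same proportion
    as tagged ones, which is a guess rather than a measured fact.
    """
    return lit_tagged * total / tagged


projected = project_full_catalog(lit_tagged, english_tagged, english_total)
gain = projected / lit_tagged - 1  # relative growth over the tagged count

print(f"projected PR/PS titles: {projected:,.0f}")
print(f"relative increase: {gain:.0%}")
```

The projection lands at roughly 11,000 titles. The relative increase depends on whether you start from the unrounded 6,900 or the rounded 7,000, so treat the percentage as a ballpark.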
Here’s a bit that might be more interesting. To what extent are Gutenberg’s literature holdings dominated by a small number of authors writing a lot of books? Well, of those 7,000 PR/PS entries in English, 5,800 belong to authors with more than one title to their names. Specifically, those 5,800 titles are the work of 726 different authors. So authors who have more than one title in Gutenberg have on average 8 titles apiece. The remaining 1,200 titles imply another 1,200 singleton authors. Overall, that means 7,000 titles by roughly 2,000 authors. Not as bad as I expected, really.
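The same breakdown is easy to compute from a list with one author name per title. This is a rough sketch, not the script I actually ran: the function name is made up and the sample list is invented toy data. The last two lines just re-derive the rounded figures above.

```python
from collections import Counter


def author_concentration(authors):
    """Summarize how titles are spread across authors.

    `authors` is an iterable with one author name per title.
    """
    per_author = Counter(authors)
    multi = {a: n for a, n in per_author.items() if n > 1}
    multi_titles = sum(multi.values())
    return {
        "titles": sum(per_author.values()),
        "authors": len(per_author),
        "multi_authors": len(multi),
        "multi_titles": multi_titles,
        "singletons": len(per_author) - len(multi),
        "mean_titles_per_multi_author": (
            multi_titles / len(multi) if multi else 0.0
        ),
    }


# Tiny made-up sample, one entry per title (not real catalog counts).
sample = ["Dickens", "Dickens", "Dickens", "Twain", "Twain", "Chopin"]
stats = author_concentration(sample)
print(stats)

# Sanity check on the rounded figures from the post.
print(5_800 / 726)            # mean titles per multi-title author
print(726 + (7_000 - 5_800))  # total authors, one title per singleton
```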
When you look at the list of works by multi-title authors, you see that there’s a fair amount of duplication and cruft. Not in the metadata (which are generally pretty good), but in Gutenberg’s “acquisitions” process: There are lots of cases where etext volumes reflect separate individual paper volumes (e.g., Clarissa, vols. 1-9, each as a separate etext title), or where a work has been digitized multiple times, possibly from different sources. Nothing wrong with that, of course, but if you get rid of it (this involves some judgment, so you do it by hand! fun!), you’re left with about 4,500 more-or-less distinct titles by those same 720-ish authors. Even that number overcounts a bit (because you’re conservative about purging the rolls of duplicates), but it’s reasonably close. For what it’s worth, this means that there are really more like 5,700 distinct cataloged PR/PS volumes in English at the moment (= 7,000 – 1,300 “dupes”).
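The final call on what counts as a duplicate has to be made by hand, but a script can at least shortlist the candidates. The sketch below is one possible approach, not what I did: it strips trailing volume markers and groups records by author plus normalized title. The regular expressions, the `(etext_id, author, title)` record shape, and the sample IDs are all assumptions for illustration, and real catalog titles are messier than this.

```python
import re
from collections import defaultdict

# Hypothetical patterns for trailing markers like " (of 9)" and
# ", Volume 1" / " Vol. 3" / " Part II". Expect false positives.
OF_N = re.compile(r"\s*\(of \d+\)\s*$", re.IGNORECASE)
VOLUME = re.compile(
    r"[\s,;:(\-]*\b(?:volume|vol\.?|part|book)\s*(?:\d+|[ivxlc]+)\)?\s*$",
    re.IGNORECASE,
)


def work_key(author, title):
    """Reduce an (author, title) pair to a crude 'same work' key."""
    stripped = VOLUME.sub("", OF_N.sub("", title))
    normalized = re.sub(r"[^a-z0-9]+", " ", stripped.lower()).strip()
    return (author.strip().lower(), normalized)


def candidate_duplicates(records):
    """Group etext IDs that share a key; only groups of 2+ are returned."""
    groups = defaultdict(list)
    for etext_id, author, title in records:
        groups[work_key(author, title)].append(etext_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Invented IDs; the titles only mimic the multi-volume pattern.
records = [
    (1, "Richardson, Samuel", "Clarissa, Volume 1 (of 9)"),
    (2, "Richardson, Samuel", "Clarissa, Volume 2 (of 9)"),
    (3, "Richardson, Samuel", "Pamela"),
    (4, "Austen, Jane", "Emma"),
    (5, "Austen, Jane", "Emma"),
]
for key, ids in candidate_duplicates(records).items():
    print(key, ids)
```

Anything the script groups together still needs a human look, since the patterns will happily merge titles that merely end in something volume-like.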
Other thoughts:
There’s a fair amount of science fiction and related genres in the catalog. I guess I knew this, and probably shouldn’t be surprised given the way the project works.
Date information for authors is good, at least if you restrict yourself to cases where an LC code exists (and metadata are thus in good shape). Birth and death dates are written into the creator records, so you have to parse them out, but it’s not hard. Still, would be nice if they were a separate entry in the catalog.
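Parsing the dates out might look something like the sketch below. It assumes creator strings of the form "Surname, Forename, 1689-1761", optionally with question marks or a missing death year. That format and the function name are assumptions on my part, and anything that doesn't fit simply comes back without dates.

```python
import re

# Assumed creator format: "Surname, Forename, 1689-1761". Question marks
# and an open-ended death year ("1950-") are tolerated.
LIFESPAN = re.compile(r",\s*(\d{3,4})\??\s*-\s*(?:(\d{3,4})\??)?\s*$")


def split_creator(creator):
    """Return (name, birth_year, death_year); years are None when absent."""
    match = LIFESPAN.search(creator)
    if match is None:
        return creator.strip(), None, None
    name = creator[: match.start()].strip()
    birth = int(match.group(1))
    death = int(match.group(2)) if match.group(2) else None
    return name, birth, death


# Illustrative inputs in the assumed format.
for raw in [
    "Richardson, Samuel, 1689-1761",
    "Doe, Jane, 1950-",
    "Anonymous",
]:
    print(split_creator(raw))
```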
Original publication info is nonexistent. Bummer, though I knew this already. Gutenberg is not a home to bibliographic scholarship.
I was a little surprised that the total numbers for British and American titles were just about even. I expected more British stuff. Will produce a little graph of holdings by author birthdate for each, just for kicks (and update the post accordingly). I expect (unsurprisingly) that the American stuff will skew recent compared to the British.
That’s it for now. This is all still vaguely allegory-related. More when it’s ready. The job market is keeping me busy.
Hey Matt,
Thanks for this analysis! This is very useful and something I have often wondered about. We really need a “master-metadata-manifest” of PG. I admire the volume of volumes in PG, but plain vanilla text has its disadvantages when it comes to metadata. Without XML encoding (or some other kind of enhanceable DB) to provide the details, we have to rely on one-offs such as yours here.
Now, when you get a minute, would you mind separating out the poetry from the drama from the prose. . . oh and please indicate fiction or non-fiction, and when you’re done with that, there’s that pony we talked about. . .
Yeah, plain text is a pain. For what it’s worth, the catalog is XML, and I suppose they might be talked into adding information beyond basic library records. At the other end of the scale, it would also be nice if there were a public interface to the cataloging process, à la Distributed Proofreaders, so that you could easily add info and correct errors.
I probably will look at sorting out fiction/drama/poetry at some point, plus primary/secondary lit (way more of the former, of course), author gender, etc. The big issue for me is deciding on the right way to store this info so that I can incorporate it programmatically in the future – I don’t want to be forced to either redo it by hand every time I return to the corpus or be limited to the current version of the catalog. Shouldn’t be terribly hard (ha! I’ve heard that before!), but I haven’t settled on anything.
The pony is a longer-range project.
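On the storage question, one option I could imagine (nothing settled) is a small sidecar file keyed by etext number that holds only the hand-made annotations and gets merged into whatever catalog version is current. The sketch below assumes etext numbers stay stable across catalog releases. The column names, IDs, and titles are invented for illustration.

```python
import csv
import io

# Hypothetical sidecar file: one row per etext ID, holding only hand-made
# annotations. The IDs and column names are invented for illustration.
SIDECAR_CSV = """etext_id,form,fiction,author_gender
101,prose,yes,m
102,poetry,no,f
"""


def load_annotations(handle):
    """Read the sidecar into {etext_id: {column: value}}."""
    annotations = {}
    for row in csv.DictReader(handle):
        etext_id = int(row.pop("etext_id"))
        annotations[etext_id] = dict(row)
    return annotations


def annotate(catalog, annotations):
    """Return catalog records with annotations merged in.

    `catalog` maps etext_id -> metadata dict from the current catalog;
    IDs with no annotation pass through unchanged.
    """
    return {
        etext_id: {**meta, **annotations.get(etext_id, {})}
        for etext_id, meta in catalog.items()
    }


annotations = load_annotations(io.StringIO(SIDECAR_CSV))
catalog = {  # stand-in for a freshly parsed catalog
    101: {"title": "Example Novel", "lc": "PR"},
    103: {"title": "Newly Added Title", "lc": "PS"},
}
merged = annotate(catalog, annotations)
print(merged[101])
print(merged[103])
```

Because the annotations live apart from the catalog, a new catalog release only requires rerunning the merge. New titles show up without annotations and can be filled in later.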