Gutenplots

As promised yesterday, here are a few plots of the distribution of literary titles in the Gutenberg corpus by the date of their authors’ birth. Producing these was as much a way for me to play with ggplot2 (written by my colleague Hadley Wickham in the statistics department here at Rice) as anything else, but the results are interesting, too.

(Note that in all of the following plots, the titles in question are from the Gutenberg catalog as of 22 March 2010. They include only volumes in English with Library of Congress subject codes PR [British literature] or PS [American lit] and with both a determinate author [no blanks, “Anonymous,” “Various,” etc.] and a supplied creator birth year. No further curation was performed. There are 3380 PS titles and 3145 PR titles that fit this description. These numbers are somewhat greater than those in yesterday’s post, because I didn’t do any manual de-duping. In any case, when I talk about “Gutenberg” below, be aware that I’m only addressing this specific, literary, English-language subset of the full 30,000+ volumes in the corpus.)

First up, histograms by decade (click to embiggen):

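[Figure: PR Hist Long.png (histogram of British/PR titles by decade of author birth)]
[Figure: PS Hist Long.png (histogram of American/PS titles by decade of author birth)]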

There’s a lot of whitespace in these because I’ve shown the full date range 1300-2000 in order to make direct comparisons between the British and American subsets easier.

No surprise that Gutenberg comprises primarily works by authors born in the nineteenth century. In both cases, there are large but not overwhelming spikes around the 1860s and ’70s. Those (birth) years produced a lot of prolific authors, including those who wrote stories and other multivolume works (we’re tallying volumes, not pages or words). It seems a little late, though, for authors born in these years (and presumably writing mostly in the very late nineteenth and early twentieth centuries) to be cranking out triple-deckers. Will look into this. I suspect it has more to do with a general upward trend in publishing volume over time, a trend that tails off in Gutenberg only because of copyright issues for authors born much later than 1880 or ’90. But I also can’t rule out some sort of other selection effect having to do with Gutenberg’s acquisitions process rather than the underlying literary production of the period. Should talk to Matt Jockers and Franco Moretti about this; they know big-picture numbers about the nineteenth century better than anyone else I know. In any case, the high numbers for the mid-late nineteenth century look to be “real,” by which I mean that there’s no obvious cataloging anomaly or small handful of over-represented authors to explain them away.

For more detail (and slightly niftier plotting), here are the counts for PR and PS volumes by year plotted against one another directly (same story, click to enlarge):

[Figure: All Full.png (PR and PS title counts by author birth year, full date range)]

The outliers (with counts above about 125) are the years:

  • 1564 (Shakespeare, labeled; Martin Mueller’s not kidding about the extent to which Shakespeare dominates our understanding of the early modern period)
  • 1803 (PR; Lytton, mostly, who has lots of multivolume works)
  • 1835 (PS; Twain)
  • 1862 (PS; Edith Wharton, O. Henry, Gilbert Parker, and others)
  • 1863 (PR; W.W. Jacobs, author of many a short story, among others)
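
For what it’s worth, pulling years like these out of a year/count table is a one-liner in R. What follows is only a minimal sketch, assuming a data frame laid out like the pr and ps tables used for the plots (V1 = birth year, V2 = title count). The name counts and the numbers in it are invented for illustration and are not the catalog figures.

```r
# Toy stand-in for a year/count table; these values are made up,
# not taken from the Gutenberg catalog.
counts <- data.frame(V1 = c(1564, 1700, 1803, 1850),
                     V2 = c(180, 12, 130, 90))

# Keep only the years whose title count exceeds the cutoff used in the text.
threshold <- 125
outliers <- counts[counts$V2 > threshold, ]
print(outliers$V1)
```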

How about a more focused version for the years 1700-2000, with smoothed means, to make a core comparison easier?

[Figure: All Detail Fit.png (PR and PS title counts by author birth year, 1700-2000, with smoothed fits)]

As predicted, the American lit is slightly more recent, on average, than the British. But the difference is small, and it’s mostly down to the presence of comparatively recent work by American (or at least PS-categorized) authors that has entered the public domain one way or another during a period when that wouldn’t happen automatically. Such recent works are totally absent from the British/PR list, which ends with authors born right at the turn of the last century (and not many of those, for obvious copyright-related reasons).

It would be nice to have dates of composition for the works themselves, but that’s not likely to happen without serious additional legwork. In the meantime, author birthdates aren’t all bad; if you make the debatable but not ridiculous assumption that most authors are largely formed in their early careers, you might do just as well grouping their works by “date of maturity” as you would by date of composition. (And you wouldn’t keep trying to shoehorn Henry James into modernism proper, for God’s sake!) Plus, you’d avoid the separate issue of publication dates that don’t line up with composition dates.

Finally, for my own future reference, the (ugly!) R/ggplot2 commands that generated these figures.

The fitted, annotated, detail scatterplot:

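library(ggplot2)
qplot(V1, V2, data=pr,
      xlab="Author Birth Year", ylab="Title Count",
      main="Gutenberg Titles by Author Birthdate (Detail, Fitted)",
      xlim=c(1700, 2000), ylim=c(0, 140)) +
  # Smoothed fit for PR; alpha=0 makes the confidence band transparent
  geom_smooth(data=pr, color="black", alpha=0) +
  # Overlay the PS points and their smoothed fit in red
  geom_point(data=ps, color="red") +
  geom_smooth(data=ps, color="red", alpha=0) +
  # This label sits outside the detail limits above, so it is dropped here;
  # presumably a leftover from the full-range plot
  annotate("text", x=1564, y=185, label="Shakespeare", size=4, alpha=0.4) +
  annotate("text", x=1955, y=55, label="PS\n(Amer)", color="red") +
  annotate("text", x=1745, y=40, label="PR\n(British)")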

pr and ps are two-column data frames of author birth years (V1) and corresponding counts of volumes for that year (V2), one year/count pair per row.

The histograms are similar but easier, involving variations on something like:

qplot(V1, data=pshist, geom="histogram", binwidth=10,
      main="American (PS) Gutenberg Titles by Author Birthdate",
      xlab="Author Birth Year", ylab="Title Count",
      xlim=c(1300, 2000), ylim=c(0, 700))

Where pshist is just an unsorted list of author birth years, one for each volume (in this case, each PS volume) in the catalog (so yes, lots of repeats, which is the point).
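
The two kinds of input are related: tallying the per-volume list of birth years gives the year/count pairs used for the scatterplot. Here is a minimal sketch of that step in base R. The names birth_years and year_counts and the sample values are hypothetical, and this is not necessarily how the original files were produced.

```r
# Hypothetical sample: one author birth year per volume (repeats expected).
birth_years <- c(1835, 1862, 1835, 1803, 1862, 1835)

# Tally volumes per birth year, then reshape into the two-column layout
# (V1 = year, V2 = count) that the scatterplot command expects.
tally <- table(birth_years)
year_counts <- data.frame(V1 = as.integer(names(tally)),
                          V2 = as.integer(tally))
print(year_counts)
```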

Some Gutenberg Numbers

I spent most of the day—a beautiful, sunny, perfect spring day that I’ll never get back—munging Gutenberg catalog data to see how their holdings stack up for a short-term project of mine. I suppose this built character, and I know from experience that it’s useful to spend time poking around in your data. Still …

A few numbers that stood out to me (mostly rounded for easier reading):

There are close to 32,000 total volumes in the Gutenberg catalog, of which almost 20,000 have Library of Congress subject codes. This is good, but not perfect. Nine months ago, the numbers were 29,000 and a little over 16,000. This tells me that pretty much all new additions are being cataloged with full(ish) metadata, but there’s not much progress being made on filling in old records (and the old stuff is often high-profile, since it was what people worked on first).

Of the c. 32K total volumes, about 26,500 are in English. Among English titles, 16,600 have LC codes, about the same rate as for all titles.

There are 3,500 titles in English with LC code PR (British literature) and 3,400 with code PS (American). There are another 3,200 P* titles in English, most of which are translations from other languages. So we’re looking at roughly 7,000 readily identifiable titles of British and American literature in English from Gutenberg at the moment. (Note that all these PR/PS numbers exclude about 120 volumes by authors unknown, various, or missing.)

If the currently untagged volumes contain literature in the same proportion as the tagged ones, we should expect that number (7,000) to increase to about 11,000 if everything were cataloged fully. But I’m not holding my breath for retrospective catalog work unless I do it myself by automating some queries against the LC servers. That’s an idea I’ve been kicking around for a while. Not sure if it’s worth the effort to increase the size of the relevant (to me) Gutenberg corpus by roughly 60%.
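
The projection is simple proportional scaling. Here is a quick sketch of the arithmetic using the rounded whole-catalog figures quoted above; the variable names are mine, and the key assumption (untagged volumes contain literature at the same rate as tagged ones) is the one stated in the text. It comes out a little above the round 11,000, an increase of somewhat more than half.

```r
total_volumes  <- 32000  # all volumes in the catalog (rounded)
tagged_volumes <- 20000  # volumes with LC subject codes (rounded)
tagged_lit     <- 7000   # PR/PS titles in English among the tagged volumes

# Assume untagged volumes contain literature at the same rate as tagged ones,
# so the literary count scales with the total/tagged ratio.
projected_lit <- tagged_lit * total_volumes / tagged_volumes
increase_pct  <- 100 * (projected_lit - tagged_lit) / tagged_lit

print(projected_lit)
print(increase_pct)
```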

Here’s a bit that might be more interesting. To what extent are Gutenberg’s literature holdings dominated by a small number of authors writing a lot of books? Well, of those 7,000 PR/PS entries in English, 5,800 belong to authors with more than one title to their names. Specifically, those 5,800 titles are the work of 726 different authors. So authors who have more than one title in Gutenberg have on average 8 titles apiece. It also implies that there are another 1,200 singleton authors. Overall, that means 7,000 titles by 2,000 authors. Not as bad as I expected, really.
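
The same numbers as a few lines of R, mostly as a sanity check on the rounding. The variable names are illustrative, and the inputs are the rounded figures from the paragraph above.

```r
total_titles  <- 7000  # PR/PS titles in English (rounded)
multi_titles  <- 5800  # titles by authors with more than one title
multi_authors <- 726   # distinct authors behind those titles

# Average holdings for authors with more than one title.
titles_per_multi_author <- multi_titles / multi_authors

# Every remaining title belongs to an author with exactly one title.
singleton_authors <- total_titles - multi_titles
total_authors <- multi_authors + singleton_authors

print(round(titles_per_multi_author, 1))
print(total_authors)
```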

When you look at the list of works by multi-title authors, you see that there’s a fair amount of duplication and cruft. Not in the metadata (which are generally pretty good), but in Gutenberg’s “acquisitions” process: There are lots of cases where etext volumes reflect separate individual paper volumes (e.g., Clarissa, vols. 1-9, each as a separate etext title), or where a work has been digitized multiple times, possibly from different sources. Nothing wrong with that, of course, but if you get rid of it (this involves some judgment, so you do it by hand! fun!), you’re left with about 4,500 more-or-less distinct titles by those same 720-ish authors. Even that number overcounts a bit (because you’re conservative about purging the rolls of duplicates), but it’s reasonably close. For what it’s worth, this means that there are really more like 5,700 distinct cataloged PR/PS volumes in English at the moment (= 7,000 – 1,300 “dupes”).

Other thoughts:

There’s a fair amount of science fiction and related genres in the catalog. I guess I knew this, and probably shouldn’t be surprised given the way the project works.

Date information for authors is good, at least if you restrict yourself to cases where an LC code exists (and metadata are thus in good shape). Birth and death dates are written into the creator records, so you have to parse them out, but it’s not hard. Still, would be nice if they were a separate entry in the catalog.
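
A rough sketch of that parsing in base R. The creator strings below are hypothetical, and the layout (name followed by a "birth-death" pair of four-digit years) is an assumption about the record format rather than a documented spec. Real records with open-ended, approximate, or BC dates would need more care.

```r
# Hypothetical creator strings; the "Name, birth-death" layout is assumed.
creators <- c("Twain, Mark, 1835-1910",
              "Wharton, Edith, 1862-1937",
              "Anonymous")

# Capture a four-digit birth year, a hyphen, and a four-digit death year.
pattern <- "([0-9]{4})-([0-9]{4})"
matches <- regmatches(creators, regexec(pattern, creators))

# regexec yields character(0) for non-matching records; map those to NA.
pick <- function(m, i) if (length(m) == 3) as.integer(m[i]) else NA_integer_
birth <- vapply(matches, pick, integer(1), i = 2)
death <- vapply(matches, pick, integer(1), i = 3)

print(data.frame(creator = creators, birth = birth, death = death))
```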

Original publication info is nonexistent. Bummer, though I knew this already. Gutenberg is not a home to bibliographic scholarship.

I was a little surprised that the total numbers for British and American titles were just about even. I expected more British stuff. Will produce a little graph of holdings by author birthdate for each, just for kicks (and update the post accordingly). I expect (unsurprisingly) that the American stuff will skew recent compared to the British.

That’s it for now. This is all still vaguely allegory-related. More when it’s ready. The job market is keeping me busy.