My 2011 MLA Session

For those attending MLA in Los Angeles this week, I’ll be taking part in a “digital roundtable” organized by the ACH. Details below. Lots of smart people and interesting projects. The session abstract:

The Association for Computers and the Humanities (ACH) is pleased to sponsor an electronic roundtable and demo session featuring new and renewed work in media and digital literary studies. Projects, groups, and initiatives highlighted in this session build on the editorial and archival roots of humanities scholarship to offer new, explicitly methodological and interpretive contributions to the digital literary scene, or to intervene in established patterns of scholarly communication and pedagogical practice. Each presenter will offer a very brief introduction to his or her work, setting it in the context of digital humanities research and praxis, before we open the floor for simultaneous demos and casual conversations with attendees at eight computer stations.

A complete session description, including a list of presenters and individual project abstracts, is available on the ACH site. MLA’s session description (less info but with up-to-date annotations) is available to MLA members.

Session details:

  • 193. New (and Renewed) Work in Digital Literary Studies
  • Friday, 7 January
  • 8:30–9:45 a.m., Plaza I, J. W. Marriott

    Books I Read in 2010

    As I did last year, here’s a list of the books I read for the first time in 2010. Just fiction; no criticism, theory, journals, etc.

    • Atwood, Margaret. Oryx and Crake.
    • Burgess, Anthony. A Clockwork Orange.
    • Camus, Albert. The Plague.
    • Čapek, Karel. R.U.R.
    • Davis, Kathryn. The Thin Place.
    • Donoghue, Emma. Room.
    • Fowles, John. The French Lieutenant’s Woman.
    • Gilb, Dagoberto. The Last Known Residence of Mickey Acuña.
    • Golding, William. Lord of the Flies. (OK, I read this in high school, but that doesn’t count. Ditto Animal Farm, which I also reread this year, though I’m reluctant to cop to it.)
    • Johnson, B.S. Albert Angelo.
    • Kerouac, Jack. On the Road. (An exception here; a serious reread for the book manuscript.)
    • Lee, Andrea. Lost Hearts in Italy.
    • Mantel, Hilary. Wolf Hall.
    • Markson, David. Wittgenstein’s Mistress.
    • Millet, Lydia. Everyone’s Pretty.
    • Mitchell, David. The Thousand Autumns of Jacob de Zoet.
    • Peace, David. Occupied City.
    • Petterson, Per. I Curse the River of Time.
    • Powell, Padgett. The Interrogative Mood.
    • Russo, Richard. Straight Man.
    • Saro-Wiwa, Ken. Sozaboy.
    • Williams, Joy. The Quick and the Dead.
    • Yu, Charles. How to Live Safely in a Science Fictional Universe.

    Oh, and I’m in the middle of Adrian Johns’ Piracy, which isn’t fiction, but which I’m totally reading for the plot. Does that count?

    Read bits of a few others (Parrot and Olivier in America, Super Sad True Love Story, The Pregnant Widow, Death of the Adversary) to which I hope to return.

    Should post some thoughts on these eventually. Or maybe something more formal for the new Post45 journal. We shall see.

    First up in 2011: Alexander Theroux or Péter Esterházy, I think.

    Finally and unrelated: I have awesome maps of nineteenth-century American fiction. More to come.

    What To Do With Too Much Text

    Below are the slides from my talk on text mining, “What To Do with Too Much Text, or, Data Mining for the Humanities and Social Sciences,” given at the Washington University Center for Political Economy a few days ago (8 Oct. 2010). For those who weren’t there, the talk was primarily a survey of approaches to (mostly) humanities-oriented text analysis with examples drawn from literary studies, history, psychology, and political science. For a fuller treatment of the opening “Motivations” section, see this post. You might also want to check out the theoretical underpinnings of my own allegory project, about which I said relatively little.

    The original slides are in Keynote and include embedded videos that don’t translate well to PowerPoint (and confuse SlideShare); rather than make a hash of things, I’ve put up a Quicktime version for people who don’t have access to Keynote. The Keynote file includes my (hopefully non-embarrassing) presenter notes, which may give a fuller sense of what I said at some points.

    Below are links to the projects and tools I mentioned (roughly in order of appearance).

    Projects and Works Cited

    Tools

    There are many, many text analysis and natural language processing tools available, many of them geared toward specific research domains. I mentioned only a comparative handful. This list is a long way from exhaustive.

    All projects are free and open source unless otherwise noted.

    Built Tools

    Good places to start; little or no programming required.

    • Wordle. Word clouds. Noncommercial use only, I believe.
    • WordHoard. Statistics, analytics, and visualizations of classic literature.
    • GeoDict. Extract named places from unstructured text.
    • Docuscope. A semi-publicly-available tool for text analysis backed by an extensive, hand-curated dictionary.
    • Casstools.org. Contrast Analysis of Semantic Similarity. Evaluate differential word associations in text corpora.
    • Voyeur Tools. Simple, Web-based text analytics. BYO text/corpus.
    • The MONK Project. Integrated, Web-based corpus analysis. Uses only texts from the (relatively large) included corpus.
    • SEASR. Packaged text analytics and development environment aimed at scholars in the humanities. Includes Zotero integration. SEASR pushes toward a full toolkit.
    • And one tool that I didn’t have a chance to mention: Mark Olsen’s ARTFL-associated PhiloLine/PAIR. Sequence alignment detection in textual corpora; the analogy is to similar work in genetics.

    Toolkits and Development Environments

    Most of these packages come with demos and tutorials that may be useful on their own, but they’re aimed at allowing you to create your own text-mining applications.

    • GATE. An advanced development environment for text analysis with included analysis routines.
    • LingPipe. Advanced, Java-based natural language processing (NLP) toolkit. Partially integrated with GATE, but also a stand-alone product. Open source, but free only if you make your output texts freely available.
    • NLTK. Well-documented, Python-based NLP toolkit. Used widely in teaching NLP.
    • MALLET. Java-based, command-line package for statistical NLP. Useful for topic modeling, among many other things.

    Statistics Packages

    These packages don’t necessarily have anything to do with natural language analysis, but they’re useful for general statistical work and visualization.

    • R. A platform for statistical computing. Baayen’s book on corpus linguistics with R is a useful introduction with a natural language focus.
    • SPSS. The long-serving standard for stats in the social sciences. Emphatically not free, but widely site-licensed.

    Hope this is of some use. Drop me a line (see the “About” page) if you spot any errors or want to chat about this work.

    Expanded List of Allegorical-Nonallegorical Pairs

    Background

    In an earlier post, I offered a brief list of paired allegorical and nonallegorical texts by single authors. The idea was to use these pairs to look for the distinguishing textual features of allegory by controlling for as many variables (such as authorial style, genre, national origin, gender, period of composition, etc.) as possible. Or in other words, the attempt was to get as close as possible to the unattainable ideal of a corpus of texts that differ only by the presence or absence of allegory.

    That short list was OK and was the basis of the second figure in my MLA paper on “Critical Text Mining.” But it (1) was too short for corpus work and (2) depended on my own assessment of allegoricalness, with attendant limitations of historical scope. I’ve always felt that the better option would be to build an expanded version of this pairwise list on the basis of settled scholarship in the field.

    The table below represents the groundwork for such a corpus of well-established allegorical-nonallegorical pairs. It’s still under development—there are obvious holes and issues—but it’s an outline of where I’m headed. What I really need now is feedback on the composition of this list.

    Issues and Notes

    A few notes, followed by a request for kind assistance:

    • All of the allegorical works are attested by one or more of the following major sources on allegory. Most are attested by several of them.
      • Copeland, Rita, and Peter Struck, eds. The Cambridge Companion to Allegory. Cambridge: Cambridge UP, 2010.
      • Fletcher, Angus. Allegory: The Theory of a Symbolic Mode. Ithaca: Cornell UP, 1964.
      • Honig, Edwin. Dark Conceit: The Making of Allegory. Hanover, NH: UP of New England, 1959.
      • Leeming, David Adams, and Kathleen Morgan Drowne. Encyclopedia of Allegorical Literature. Santa Barbara, CA: ABC-CLIO, 1996.
      • Tambling, Jeremy. Allegory. New York: Routledge, 2010.
    • From these sources, I’ve excluded works mentioned only in passing or discussed as ambiguous or difficult cases. So while there’s always room to argue about the allegoricalness of any entry, the texts presented here under the heading of “Allegory” are about as canonically allegorical as it’s possible to be.
    • The nonallegorical texts are another matter; I’ve selected them myself as potential pairings for the allegorical entries. So far I’ve limited these to works by the same author, but I’m not necessarily averse to well-paired nonallegorical entries by other authors (and I’m aware that such pairings will sometimes be required).

    There are two ways to use this list, and therefore two potentially conflicting goals when selecting pairs of texts:

    • Pairwise comparisons. In this case, I’ll evaluate each allegorical text only against its paired nonallegorical counterpart. For this purpose, it’s not especially important where the two texts fall on the imagined spectrum of allegoricalness, only that they be well separated from one another on it. But it is important that the two members of the pair are otherwise as similar as possible.
    • Corpus comparisons. On the other hand, I’ll also want to compare the features of the allegorical texts taken together against those of the collected nonallegorical texts. For this purpose what’s important is to avoid cases in which any of the allegorical or nonallegorical entries stray too far toward the opposite category, even if they’re significantly different from their pairmates. But it’s not so crucial that any one pair be especially well matched in content, style, etc.; the two corpora just need to be similar in overall composition.
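The difference between the two uses can be sketched in R. This is only an illustration, not the analysis from the project: the data frame, its column names, and the single numeric `feature` (imagine the rate of some word class per 1,000 words) are all hypothetical.

```r
# Hypothetical data: one row per text, with an invented numeric feature.
texts <- data.frame(
  pair     = c(1, 1, 2, 2, 3, 3),
  allegory = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  feature  = c(4.2, 3.1, 5.0, 4.4, 3.9, 2.7)
)

# Pairwise comparison: difference within each matched pair.
alleg    <- texts[texts$allegory, ]
nonalleg <- texts[!texts$allegory, ]
alleg    <- alleg[order(alleg$pair), ]
nonalleg <- nonalleg[order(nonalleg$pair), ]
pair_diff <- alleg$feature - nonalleg$feature

# Corpus comparison: pool each category and compare the group means.
corpus_means <- tapply(texts$feature, texts$allegory, mean)
corpus_diff  <- unname(corpus_means["TRUE"] - corpus_means["FALSE"])

print(pair_diff)
print(corpus_diff)
```

When every allegorical text has exactly one partner, the corpus difference is just the mean of the pairwise differences. The two views come apart once some entries lack a pairmate, which is where the conflicting selection goals start to matter.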

    Action Item

    So what I’m looking for is feedback on the suitability of the nonallegorical items that are currently listed below, plus suggestions for appropriate texts where none is given.

    The ideal case is to find a firmly nonallegorical text by the same author for each of the allegorical entries, but where that’s not possible, the next best solution is probably a text of similar origin, style, length, subject matter, form, and so forth. This will never be perfect, but the closer the match (while still maintaining good relative and absolute separation on the allegorical continuum), the better.

    I’d also love to know about potential issues or complications concerning any of these texts and pairings.

    Oh, and one other constraint: I need to be able to get my hands on electronic versions of whatever texts I’m going to use; this makes anything published after 1923 difficult (though not strictly impossible).

    Massive thanks in advance to any and all who care to comment. The comments section below is probably the easiest way to leave feedback, or you can email me by clicking the “About” link (over on the lefthand side).

    The Table: Allegorical and Nonallegorical Text Pairs Grouped by Era

    Author | Allegory | Nonallegory | Notes

    Ancient and classical
    Aeschylus | Prometheus Bound | Agamemnon | Disputed authorship of Prometheus Bound
    Aesop | Fables | ???
    Hesiod | Theogony | Works and Days
    Boethius | Consolation of Philosophy | De Musica | De Musica seems unsuitable
    Capella, Martianus | Marriage of Mercury and Philology | ???
    Ovid | Metamorphoses | Amores
    Prudentius | Psychomachia | Cathemerinon
    Virgil | Aeneid | Georgics
    Anon. | Bible (Genesis) | ??? | Very likely more interpretational trouble than it’s worth

    Medieval and Renaissance
    Alain de Lille | Complaint of Nature | Liber poenitentialis
    Lorris, Guillaume de | Romance of the Rose | ??? | Other medieval romance?
    Silvestris, Bernard | Cosmographia | ??? | Maybe commentary on Aeneid, but disputed authorship and different form
    Bale, John | King John | ??? | Another play from the era?
    Chaucer, Geoffrey | House of Fame | Troilus and Criseyde
    Chaucer, Geoffrey | Parliament of Fowles | Troilus and Criseyde
    Fletcher, Phineas | Purple Island | ??? | "Brittain’s Ida" (erotic poem)?
    Gower, John | Confessio Amantis | Vox Clamantis
    Hawes, Stephen | Passetyme of Pleasure | Comfort of Lovers
    Kempe, Margery | Book of Margery Kempe | ???
    Langland, William | Piers Plowman | ???
    Lydgate, John | Reson and Sensualitie | King Henry VI’s Triumphal …
    Shakespeare, William | Phoenix and the Turtle | ??? | Appropriate sonnets?
    Spenser, Edmund | Faerie Queene | Shepheardes Calender | Or Complaints
    Anon. | Castle of Perseverance | ???
    Anon. | Everyman | ???
    Anon. | Pearl | ???
    Alighieri, Dante | Divine Comedy | Vita Nuova
    Tasso, Torquato | Jerusalem Conquered | Aminta
    Calderón | Great Theater of the World | ??? | "Life Is a Dream" too allegorical?

    17th & 18th centuries
    La Fontaine, Jean de | Fables | Tales
    Bunyan, John | Holy War | Grace Abounding
    Bunyan, John | Life and Death of Mr Badman
    Bunyan, John | Pilgrim’s Progress
    Defoe, Daniel | Robinson Crusoe | Journal of the Plague Year
    Dryden, John | Absalom and Achitophel | Annus Mirabilis
    Milton, John | Comus | Samson Agonistes | Samson Agonistes too allegorical?
    Milton, John | Paradise Lost | ??? | Areopagitica? Genre/form mismatch.
    Pope, Alexander | Dunciad | Rape of the Lock
    Swift, Jonathan | Battle of the Books | Modest Proposal
    Swift, Jonathan | Gulliver’s Travels | Argument Against Abolishing Christianity
    Swift, Jonathan | Tale of a Tub

    19th century British
    Verne, Jules | Journey to the Center of the Earth | Twenty Thousand Leagues | Or "Around the World in 80 Days"
    Butler, Samuel | Erewhon | Way of All Flesh
    Conrad, Joseph | Heart of Darkness | Lord Jim
    Darwin, Erasmus | Temple of Nature | Botanic Garden
    Gissing, George | Nether World | New Grub Street
    Kipling, Rudyard | Below the Mill-Dam | Young Men at the Manor | Better pairing?
    Shelley, Mary | Frankenstein | Mathilda

    19th century American
    Baum, L. Frank | Wonderful Wizard of Oz | Queen Zixi of Ix
    Hawthorne, Nathaniel | Antique Ring | ??? | Suitable stories?
    Hawthorne, Nathaniel | Birthmark | ???
    Hawthorne, Nathaniel | Rappaccini’s Daughter | ???
    Hawthorne, Nathaniel | Scarlet Letter | House of the Seven Gables
    Melville, Herman | Confidence-Man | Israel Potter
    Melville, Herman | Mardi | Typee
    Melville, Herman | Moby-Dick | Omoo

    Modern
    Čapek, Karel | R.U.R. | ???
    Čapek, Karel | War with the Newts | ???
    Kafka, Franz | Castle | ???
    Kafka, Franz | Country Doctor | ???
    Kafka, Franz | Metamorphosis | Description of a Struggle
    Kafka, Franz | Trial | Amerika
    Camus, Albert | Plague | First Man
    Huxley, Aldous | Brave New World | Point Counter Point | "Crome Yellow" (and maybe "Antic Hay") are public domain
    Orwell, George | 1984 | Burmese Days
    Orwell, George | Animal Farm | Road to Wigan Pier
    Mann, Thomas | Mario and the Magician | Buddenbrooks
    Yeats, William Butler | Dialogue of Self and Soul | Second Coming
    Zamyatin, Yevgeny | We | Islanders
    Hurston, Zora Neale | Moses, Man of the Mountain | Their Eyes Were Watching God

    Contemporary
    Golding, William | Lord of the Flies | The Scorpion God | The Inheritors
    Lewis, C. S. | Lion, the Witch, and the Wardrobe | ???
    Rushdie, Salman | Midnight’s Children | Fury | Or Ground Beneath Her Feet or Moor’s Last Sigh
    Beckett, Samuel | Waiting for Godot | All That Fall | Suitable nonallegorical drama?
    Nabokov, Vladimir | Lolita | Ada
    Coetzee, J.M. | Waiting for the Barbarians | Boyhood | Or Youth/Summertime
    Barth, John | Giles Goat-Boy | Sot-Weed Factor
    Ellison, Ralph | Invisible Man | ???
    Faulkner, William | Fable | The Hamlet
    Ginsberg, Allen | Howl | Kaddish
    Kesey, Ken | One Flew over the Cuckoo’s Nest | Sometimes a Great Notion
    O’Connor, Flannery | Violent Bear It Away | ??? | Wise Blood too allegorical

    Publishing Stats from the UK

    A quick follow-on to my previous post on the number of novels published annually in the U.S. I’ve now seen roughly comparable figures for the UK from 1994 through 2008 (via Dan Cohen, with thanks for the pointer).

    The UK numbers come from Nielsen and aren’t broken down by category, but the overall picture is that there have been about half as many total English-language volumes published annually there as in the U.S. in recent years. I don’t know if Brits are bigger readers of fiction, proportionately, than Americans, but I’d say the large-scale assumption that the two markets for fiction are of the same general magnitude (within about a factor of two) is reasonable.

    What I’d still like to know is the portion of their annual output that’s in common. Are twenty percent of novels published in one country also published in the other? Fifty percent? Eighty? And are novels more (or less?) internationally “portable” than other kinds of books?

    Elson et al., “Extracting Social Networks from Literary Fiction” (2010)

    Just had the chance to read this intriguing paper on automated assessment of social networks in nineteenth-century British fiction, presented at this year’s ACL conference (and picked up on DH Now). I’m posting more for the link than anything else, but a couple of thoughts that are too long for Twitter …

    The paper’s take-away point is that (British nineteenth-century) fiction set in an urban environment doesn’t seem to show the diffuse social networks vis-à-vis rural fiction that one might expect following Bakhtin and others. Social networks in urban fiction turn out to be about the same size as those in rural fiction, and the connections between urban characters are if anything more robust than those between rural characters. The theory of chronotopes is said to take something of a hit here, though it’s by no means overturned.

    The social networks in question are measured by the quantity of direct discourse exchanged between any two characters in a text. The dialogue in question needs to be presented in quotes, the people speaking or being spoken about need to be named (in a way amenable to algorithmic named entity extraction), and there can’t be more than 300 words of non-dialogic exposition between entries in a single conversation. The authors also prune minor and fleeting characters from their networks in order to keep them manageable. The methodological details are pretty interesting; have a look at the paper for the full run-down.
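To make the conversation rule concrete, here is a toy sketch in R. It is emphatically not the authors' implementation: the input table of already-attributed quotations, the column names, and the choice to weight each link by total quoted words are my own assumptions. Only the 300-word gap comes from the paper as described above.

```r
# Hypothetical input: one row per attributed quotation, in text order.
quotes <- data.frame(
  start   = c(100, 160, 230, 900, 950, 2000),  # word offset where the quote begins
  end     = c(120, 200, 260, 930, 990, 2030),  # word offset where the quote ends
  speaker = c("A", "B", "A", "A", "C", "B"),
  stringsAsFactors = FALSE
)
max_gap <- 300  # words of non-dialogue allowed between quotes in one conversation

# Start a new conversation whenever the gap since the previous quote exceeds max_gap.
gap  <- c(0, quotes$start[-1] - quotes$end[-nrow(quotes)])
conv <- cumsum(gap > max_gap) + 1

# Link every pair of distinct speakers within a conversation,
# weighting the link by the quoted words in that conversation.
edges <- list()
for (id in unique(conv)) {
  q   <- quotes[conv == id, ]
  who <- sort(unique(q$speaker))
  if (length(who) < 2) next  # a lone speaker forms no link
  pairs <- t(combn(who, 2))
  edges[[length(edges) + 1]] <- data.frame(
    a = pairs[, 1], b = pairs[, 2],
    weight = sum(q$end - q$start),
    stringsAsFactors = FALSE
  )
}
edges   <- do.call(rbind, edges)
network <- aggregate(weight ~ a + b, data = edges, FUN = sum)
print(network)
```

Even in this toy version the final, isolated quotation contributes nothing to the network, which is the same kind of loss that worries me about fleeting characters.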

    This is compelling work and may be an important contribution to the way we think about urbanization in nineteenth-century fiction. There are a few tricky problems, though.

    1. Conversation seems like a pretty good proxy for social connectedness, but of course it’s a partial and imperfect gauge; there certainly could be others.
    2. The inability to detect and evaluate indirect discourse (in validation tests, the authors’ method missed about half of the relevant dialogic exchanges) might be especially important in urban settings. It’s possible (but by no means certain) that urban characters spend more time overhearing, summarizing, and recounting than speaking face-to-face. Or maybe urban novels emphasize indirect discourse as a means by which to convey some aspect of city life. The point is that there might be important differences between both the types of social networks presented through direct and indirect discourse and in the sheer quantity of indirect discourse in different types of fiction.
    3. And of course throwing out fleeting and minor characters, which might be expected to occur more often in urban settings, would tend to concentrate any measure of the resulting social network.

    Anyway, I’m fascinated by the work and don’t mean to pick nits. The paper is well worth a read. I look forward to seeing more from the group in the future.

    Book Revisions with LaTeX and Git

    As anticipated, a quiet summer around these parts as I revise my manuscript on the theory and mechanisms of midcentury fiction. A quick technical update and a couple of questions for those with experience using Git source control for writing projects.

    I spent a chunk of the day today getting my head around Git. I’d been thinking about using it for a while and was helped along by my decision to dump Word in favor of LaTeX a couple of months ago; Word’s binary blobs aren’t well suited to version control (though that’s the least of Word’s problems, really). I also use Dropbox, which does basic automatic versioning, so I hadn’t had much reason to mess with the complexity of Git until now. But Dropbox (reasonably enough) only keeps a finite number of old versions of a file, and it doesn’t let you flag any of them to let your future self know what changed in any given rev. And there are a lot of revs, since it creates a new one every time you save a file (there’s no notion of a commit). This is all totally reasonable for Dropbox, which is a dead simple tool that’s made my working life better in every way. But I wanted more control as I hack away at my very long, slightly disorganized, heavily commented, totally in flux mid-revision book.

    So … Git. What’s both cool and terrifying about Git is that it morphs the live files in your working directory as you switch from one branch or revision to another. See this concise explanation of the process from Ben Lynn. (Note to self: Do not switch branches while a file is open in your editor.) Git’s worth a look if you haven’t dealt with modern revision control systems before; much easier and niftier than my brief encounters with CVS years ago had led me to believe.

    Anyway, two questions for those more experienced with this stuff than I:

    1. I’m planning to use branches for the major edits to each chapter, so that I can easily go back and consult or restore the large sections that are inevitably hacked off along the way. Does this make sense? Are tags or clones more appropriate? Are branches overkill? Should I just trust my commented commits on a single trunk? What does your workflow for writing and revising with Git look like?
    2. Is there any reason not to combine Git and Dropbox? I’ve put my .git directory inside my current project directory, which already lives in my Dropbox folder. I can’t see any harm in this beyond a bit of redundancy, but I’d welcome any warnings from hard-won experience.

    Two last things:

    One, I’ll put the full manuscript on GitHub or similar once it’s no longer filled with embarrassing and/or libelous comments.

    Two, tomorrow’s project is to merge the massive changes between the existing chapter on William Gaddis and the much more compact version that’s been accepted by Contemporary Literature. This is a good problem to have, but trying to manage it is the proximate cause of all this version control business.

    Oh, and DH 2010 starts the day after tomorrow. Very sorry not to be in London, but I’ll have the #dh2010 firehose open next to TeXShop for the next few days.

    Job News

    I’ve accepted a two-year postdoctoral fellowship in American Culture Studies at Washington University in St. Louis for 2010-2012. The fellowship—in digital humanities and American culture—is designed to support quantitative textual analysis related to American cultural and literary studies. I couldn’t have asked for a better fit or a more welcoming environment, and I can’t wait to join my new colleagues in St. Louis.

    I’m extremely grateful to Rice and to the Mellon Foundation for the two years of support that are now drawing to a close. Particular thanks are due to my supervisor and mentor Caroline Levander, director of the Humanities Research Center at Rice, and to Lisa Spiro, director of Rice’s Digital Media Center and all-around DH wunderkind.

    Nothing much should change here on the blog, though it may be a quiet summer as I finish a manuscript. I’ll post updated contact information when it’s available.

    Gutenplots

    As promised yesterday, here are a few plots of the distribution of literary titles in the Gutenberg corpus by the date of their authors’ birth. Producing these was as much a way for me to play with ggplot2 (written by my colleague Hadley Wickham in the statistics department here at Rice) as anything else, but the results are interesting, too.

    (Note that in all of the following plots, the titles in question are from the Gutenberg catalog as of 22 March 2010. They include only volumes in English with Library of Congress subject codes PR [British literature] or PS [American lit] and with both a determinate author [no blanks, “Anonymous,” “Various,” etc.] and a supplied creator birth year. No further curation was performed. There are 3380 PS titles and 3145 PR titles that fit this description. These numbers are somewhat greater than those in yesterday’s post, because I didn’t do any manual de-duping. In any case, when I talk about “Gutenberg” below, be aware that I’m only addressing this specific, literary, English-language subset of the full 30,000+ volumes in the corpus.)
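For the record, the selection criteria in that note amount to a filter like the one below. This is a sketch on a made-up catalog extract; the column names and values are hypothetical and will not match whatever you parse out of the real Gutenberg catalog.

```r
# Hypothetical catalog extract; real column names and formats will differ.
catalog <- data.frame(
  title    = c("T1", "T2", "T3", "T4", "T5", "T6"),
  language = c("en", "en", "fr", "en", "en", "en"),
  lc_class = c("PR", "PS", "PR", "QA", "PS", "PR"),
  author   = c("Author A", "Author B", "Author C", "Author D", "Anonymous", "Various"),
  birth    = c(1803, 1835, 1802, 1850, NA, NA),
  stringsAsFactors = FALSE
)

excluded_authors <- c("", "Anonymous", "Various")

# English only, PR or PS only, a determinate author, and a birth year on record.
keep <- catalog$language == "en" &
  catalog$lc_class %in% c("PR", "PS") &
  !is.na(catalog$author) &
  !(catalog$author %in% excluded_authors) &
  !is.na(catalog$birth)

literary <- catalog[keep, ]
print(table(literary$lc_class))
```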

    First up, histograms by decade (click to embiggen):

    PR Hist Long.png
    PS Hist Long.png

    There’s a lot of whitespace in these because I’ve shown the full date range 1300-2000 in order to make direct comparisons between the British and American subsets easier.

    No surprise that Gutenberg comprises primarily works by authors born in the nineteenth century. In both cases, there are large but not overwhelming spikes around the 1860s and ’70s. Those (birth) years produced a lot of prolific authors, including those who wrote stories and other multivolume works (we’re tallying volumes, not pages or words). It seems a little late, though, for authors born in these years, and presumably writing mostly in the very late nineteenth and early twentieth centuries, to be cranking out triple-deckers. Will look into this. I suspect it has more to do with a general upward trend in publishing volume over time, a trend that tails off in Gutenberg only because of copyright issues for authors born much later than 1880 or ’90. But I also can’t rule out some sort of other selection effect having to do with Gutenberg’s acquisitions process rather than the underlying literary production of the period. Should talk to Matt Jockers and Franco Moretti about this; they know big-picture numbers about the nineteenth century better than anyone else I know. In any case, the high numbers for the mid-late nineteenth century look to be “real,” by which I mean that there’s no obvious cataloging anomaly or small handful of over-represented authors to explain them away.

    For more detail (and slightly niftier plotting), here are the counts for PR and PS volumes by year plotted against one another directly (same story, click to enlarge):

    All Full.png

    The outliers (with counts above about 125) are the years:

    • 1564 (Shakespeare, labeled; Martin Mueller’s not kidding about the extent to which Shakespeare dominates our understanding of the early modern period)
    • 1803 (PR; Lytton, mostly, who has lots of multivolume works)
    • 1835 (PS; Twain)
    • 1862 (PS; Edith Wharton, O. Henry, Gilbert Parker, and others)
    • 1863 (PR; W.W. Jacobs, author of many a short story, among others)

    How about a more focused version for the years 1700-2000, with smoothed means, to make a core comparison easier?

    All Detail Fit.png

    As predicted, the American lit is slightly more recent, on average, than the British. But the difference is small, and it’s mostly down to the presence of comparatively recent work by American (or at least PS-categorized) authors that has entered the public domain one way or another during a period when that wouldn’t happen automatically. Such recent works are totally absent from the British/PR list, which ends with authors born right at the turn of the last century (and not many of those, for obvious copyright-related reasons).

    It would be nice to have dates of composition for the works themselves, but that’s not likely to happen without serious additional legwork. In the meantime, author birthdates aren’t all bad; if you make the debatable but not ridiculous assumption that most authors are largely formed in their early careers, you might do just as well grouping their works by “date of maturity” as you would by date of composition. (And you wouldn’t keep trying to shoehorn Henry James into modernism proper, for God’s sake!) Plus, you’d avoid the separate issue of publication dates that don’t line up with composition dates.

    Finally, for my own future reference, the (ugly!) R/ggplot2 commands that generated these figures.

    The fitted, annotated, detail scatterplot:

    qplot(V1, V2, data=pr, xlab="Author Birth Year", ylab="Title Count", main="Gutenberg Titles by Author Birthdate (Detail, Fitted)", xlim=c(1700, 2000), ylim=c(0, 140)) +geom_smooth(data=pr, color="black", alpha=0) +geom_point(data=ps, color="red") +geom_smooth(data=ps, color="red", alpha=0) +annotate("text", x=1564, y=185, label="Shakespeare", size=4, alpha=0.4) +annotate("text", x=1955, y=55, label="PS\n(Amer)", color="red") +annotate("text", x=1745, y=40, label="PR\n(British)")

    pr and ps are hash-like lists of author birth years and corresponding counts of volumes for that year, one year/count pair per line.

    The histograms are similar but easier, involving variations on something like:

    qplot(V1, data=pshist, geom="histogram", binwidth=10, main="American (PS) Gutenberg Titles by Author Birthdate", xlab="Author Birth Year", ylab="Title Count", xlim=c(1300, 2000), ylim=c(0, 700))

    Where pshist is just an unsorted list of author birth years, one for each volume (in this case, each PS volume) in the catalog (so yes, lots of repeats, which is the point).
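The two data shapes are easy to convert between in base R. A small sketch, with a made-up vector of birth years standing in for pshist; the V1/V2 column names simply mirror the ones used in the plotting calls.

```r
# Hypothetical stand-in for pshist: one author birth year per volume.
birth_years <- c(1835, 1835, 1835, 1862, 1862, 1819)

# Collapse to year/count pairs, the shape used for the scatterplots.
counts <- as.data.frame(table(birth_years), stringsAsFactors = FALSE)
names(counts) <- c("V1", "V2")
counts$V1 <- as.integer(counts$V1)

# Decade bins, similar in spirit to a histogram with binwidth = 10
# (ggplot2 may place its bin edges differently).
decades <- table(floor(birth_years / 10) * 10)

print(counts)
print(decades)
```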

    Some Gutenberg Numbers

    I spent most of the day—a beautiful, sunny, perfect spring day that I’ll never get back—munging Gutenberg catalog data to see how their holdings stack up for a short-term project of mine. I suppose this built character, and I know from experience that it’s useful to spend time poking around in your data. Still …

    A few numbers that stood out to me (mostly rounded for easier reading):

    There are close to 32,000 total volumes in the Gutenberg catalog, of which almost 20,000 have Library of Congress subject codes. This is good, but not perfect. Nine months ago, the numbers were 29,000 and a little over 16,000. This tells me that pretty much all new additions are being cataloged with full(ish) metadata, but there’s not much progress being made on filling in old records (and the old stuff is often high-profile, since it was what people worked on first).

    Of the c. 32K total volumes, about 26,500 are in English. Among English titles, 16,600 have LC codes, about the same rate as for all titles.

    There are 3,500 titles in English with LC code PR (British literature) and 3,400 with code PS (American). There are another 3,200 P* titles in English, most of which are translations from other languages. So we’re looking at roughly 7,000 readily identifiable titles of British and American literature in English from Gutenberg at the moment. (Note that all these PR/PS numbers exclude about 120 volumes by authors unknown, various, or missing.)

    If the currently untagged volumes contain literature in the same proportion as the tagged ones, we should expect that number (7,000) to increase to 11,000 if everything were cataloged fully. But I’m not holding my breath for retrospective catalog work unless I do it myself by automating some queries against the LC servers. That’s an idea I’ve been kicking around for a while. Not sure if it’s worth the effort to increase the size of the relevant (to me) Gutenberg corpus by 50%.
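The back-of-envelope projection, spelled out in R using only the rounded figures quoted in this post. The single assumption is the one stated above: untagged volumes contain literature at the same rate as tagged ones.

```r
total_volumes  <- 32000  # all Gutenberg volumes (rounded)
tagged_volumes <- 20000  # volumes with LC subject codes (rounded)
tagged_lit     <- 7000   # tagged PR/PS titles in English (rounded)

# Share of tagged volumes that are PR/PS literature.
lit_rate <- tagged_lit / tagged_volumes

# Apply the same share to the whole catalog.
projected_lit <- lit_rate * total_volumes
growth        <- projected_lit / tagged_lit - 1

print(projected_lit)
print(growth)
```

With these rounded inputs the projection comes out a little above 11,000, and the implied gain is somewhat more than half again the current total; the round numbers in the text are loose.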

    Here’s a bit that might be more interesting. To what extent are Gutenberg’s literature holdings dominated by a small number of authors writing a lot of books? Well, of those 7,000 PR/PS entries in English, 5,800 belong to authors with more than one title to their names. Specifically, those 5,800 titles are the work of 726 different authors. So authors who have more than one title in Gutenberg have on average 8 titles apiece. It also implies that there are another 1,200 singleton authors. Overall, that means 7,000 titles by 2,000 authors. Not as bad as I expected, really.
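The arithmetic behind those author counts, in R, again using only the rounded numbers from the paragraph above:

```r
total_titles  <- 7000  # PR/PS titles in English (rounded)
multi_titles  <- 5800  # titles by authors with more than one title
multi_authors <- 726   # authors with more than one title

# Average holdings for a multi-title author.
titles_per_multi_author <- multi_titles / multi_authors

# Every remaining title belongs to an author with exactly one title.
singleton_authors <- total_titles - multi_titles

total_authors <- multi_authors + singleton_authors

print(round(titles_per_multi_author, 1))
print(singleton_authors)
print(total_authors)
```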

    When you look at the list of works by multi-title authors, you see that there’s a fair amount of duplication and cruft. Not in the metadata (which are generally pretty good), but in Gutenberg’s “acquisitions” process: There are lots of cases where etext volumes reflect separate individual paper volumes (e.g., Clarissa, vols. 1-9, each as a separate etext title), or where a work has been digitized multiple times, possibly from different sources. Nothing wrong with that, of course, but if you get rid of it (this involves some judgment, so you do it by hand! fun!), you’re left with about 4,500 more-or-less distinct titles by those same 720-ish authors. Even that number overcounts a bit (because you’re conservative about purging the rolls of duplicates), but it’s reasonably close. For what it’s worth, this means that there are really more like 5,700 distinct cataloged PR/PS volumes in English at the moment (= 7,000 – 1,300 “dupes”).
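I did the de-duping by hand, but a crude first pass could be automated along these lines. This is a sketch only: the titles are invented, and the volume-marker pattern is an assumption that would miss plenty of real cases (and wrongly merge some), so its output is a list of candidates for review, not a final answer.

```r
# Hypothetical titles; real catalog titles are messier than this.
titles <- c(
  "Clarissa, Volume 1",
  "Clarissa, Volume 2",
  "Clarissa, Vol. 3",
  "Pamela",
  "Pamela"
)

# Strip a trailing volume marker such as ", Volume 2" or ", Vol. 3".
volume_marker <- ",?\\s+vol(ume|\\.)?\\s*[0-9]+\\s*$"
base_title <- sub(volume_marker, "", titles, ignore.case = TRUE, perl = TRUE)

# Identical base titles are treated as one work.
distinct_works <- unique(base_title)
n_dupes <- length(titles) - length(distinct_works)

print(distinct_works)
print(n_dupes)
```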

    Other thoughts:

    There’s a fair amount of science fiction and related genres in the catalog. I guess I knew this, and probably shouldn’t be surprised given the way the project works.

    Date information for authors is good, at least if you restrict yourself to cases where an LC code exists (and metadata are thus in good shape). Birth and death dates are written into the creator records, so you have to parse them out, but it’s not hard. Still, would be nice if they were a separate entry in the catalog.
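Parsing the dates out looks roughly like this in R. The creator-string format shown here (name, comma, birth year, hyphen, death year) is an assumption for illustration; open-ended, approximate, or missing dates would need extra cases.

```r
# Hypothetical creator strings; the exact catalog format is an assumption.
creators <- c(
  "Twain, Mark, 1835-1910",
  "Wharton, Edith, 1862-1937",
  "Anonymous"
)

# Name, then ", ", then a four-digit birth year, a hyphen, and a four-digit death year.
pattern   <- "^(.*), ([0-9]{4})-([0-9]{4})$"
has_dates <- grepl(pattern, creators)

# Records without a recognizable date pair stay NA.
birth <- rep(NA_integer_, length(creators))
death <- rep(NA_integer_, length(creators))
birth[has_dates] <- as.integer(sub(pattern, "\\2", creators[has_dates]))
death[has_dates] <- as.integer(sub(pattern, "\\3", creators[has_dates]))

# Keep the bare name where dates were found; leave other strings untouched.
name <- creators
name[has_dates] <- sub(pattern, "\\1", creators[has_dates])

print(data.frame(name, birth, death))
```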

    Original publication info is nonexistent. Bummer, though I knew this already. Gutenberg is not a home to bibliographic scholarship.

    I was a little surprised that the total numbers for British and American titles were just about even. I expected more British stuff. Will produce a little graph of holdings by author birthdate for each, just for kicks (and update the post accordingly). I expect (unsurprisingly) that the American stuff will skew recent compared to the British.

    That’s it for now. This is all still vaguely allegory-related. More when it’s ready. The job market is keeping me busy.