Some Thoughts on DH and Canons

January 29th, 2011 § 7 Comments

Below is a draft of the talk I’m giving next week at Austin for the first of three DH symposia this semester sponsored by the Texas Institute for Literary and Textual Studies. The theme of this first meeting is “Access, Authority, and Identity”; my paper is an attempt to think through some of the implications of working beyond the canon (however construed) for straight literary and cultural scholarship and for DH alike. It’s also a nice excuse to show a little preview of the geolocation work I’ve been doing recently.

A prettier PDF version is also available.

Undermining Canons

I have a point from which to start: Canons exist, and we should do something about them.

I wouldn’t have thought this was a dicey claim until I was scolded recently by a senior colleague who told me that I was thirty years out of date for making it. The idea being that we’d had this fight a generation ago, and the canon had lost. But I was right and he, I’m sorry to say, was wrong. Ask any grad student reading for her comps or English professor who might confess to having skipped Hamlet. As I say, canons exist. Not, perhaps, in the Arnoldian–Bloomian sense of the canon, a single list of great books, and in any case certainly not the same list of dead white male authors that once defined the field. But in the more pluralist sense? Of books one really needs to have read to take part in the discipline? And of books many of us teach in common to our own students? Certainly. These are canons. They exist.

So why, a few decades after the question of canonicity as such was in any way current, do we still have these things? If we all agree that canons are bad, why haven’t we done away with them? Why do we merely tinker around the edges, adding a Morrison here and subtracting a Dryden there? Is this a problem? If so, what are we going to do about it? And more to the immediate point, what does any of this have to do with digital humanities?

The answer to the first question—“Why do we still have canons?”—is as simple to articulate as it is apparently difficult to solve. We don’t read any faster than we ever did, even as the quantity of text produced grows larger by the year. If we need to read books in order to extract information from them and if we need to have read things in common in order to talk about them, we’re going to spend most of our time dealing with a relatively small set of texts. The composition of that set will change over time, but it will never get any bigger. This is a canon. [Footnote: How many canons are there? The answer depends on how many people need to have read a given set of materials in order to constitute a field of study. This was once more or less everyone, but then the field was also very small when that was true. My best guess is that the number is a hundred or more at the very lowest end—and an order of magnitude or two more than that at the high end—which would give us a few dozen subfields in English, give or take. That strikes me as roughly accurate.]

Another way of putting this would be to say that we need to decide what to ignore. And the answer with which we’ve contented ourselves for generations is: “Pretty much everything ever written.” We don’t read much. What little we do read is deeply nonrepresentative of the full field of literary and cultural production. Our canons are assembled haphazardly, with a deep set of ingrained cultural biases that are largely invisible to us, and in ignorance of their alternatives. We’re doing little better, frankly, than we were with the dead-white-male bunch fifty or a hundred years ago, and we’re just as smug in our false sense of intellectual scope.

So canons, even in their current, mildly multiculturalist form, are an enormous problem, one that follows from our single working method, that is, from the need to perform always and only close reading as a means of cultural analysis. It’s probably clear where I’m going with this, at least to a group of DH folks. We need to do less close reading and more of anything and everything else that might help us extract information from and about texts as indicators of larger cultural issues. That includes bibliometrics and book historical work, data-mining and quantitative text analysis, economic study of the book trade and of other cultural industries, geospatial analysis, and so on. Moretti is an obvious model here, as is the work of people like Michael Witmore on early modern drama and Nicholas Dames on social structures in nineteenth-century fiction.

To show you one quick example of what I have in mind, here’s a map of the locations mentioned in thirty-seven American literary texts published in 1851:

Figure 1: Places named in 37 U.S. novels published in 1851

There are some squarely canonical works included in this collection, including Moby-Dick and House of the Seven Gables, but the large majority are obscure novels by the likes of T. S. Arthur and Sylvanus Cobb. I certainly haven’t read many of them, nor am I likely to spend months doing so. The corpus is drawn from the Wright American Fiction collection and represents about a third of the total American literary works published that year. [Footnote: Why only a third? Those are all the texts available in machine-readable format at the moment.] Place names were extracted using a tool called GeoDict, which looks for strings of text that match a large database of named locations. I had to do a bit of cleanup on the extracted places, mostly because many personal names and common adjectives are also the names of cities somewhere in the world. I erred on the conservative side, excluding any of those I found and requiring a leading preposition for cities and regions, so if anything, I’ve likely missed some valid places. But the results are fascinating. Two points of interest, just quickly:

  1. For one, there are a lot more international locations than one might have expected. True, many of them are in Britain and western Europe, but these are American novels, not British reprints, so even that fact might surprise us. And there are also multiple mentions of locations in South America, Africa, India, China, Russia, Australia, the Middle East, and so on. The imaginative landscape of American fiction in the mid-nineteenth century appears to be pretty diversely outward looking in a way that hasn’t received much attention.
  2. And then—point two—there’s the distinct cluster of named places in the American south. At some level this probably shouldn’t be surprising; we’re talking about books that appeared just a decade before the Civil War, and the South was certainly on people’s minds. But it doesn’t fit very well with the stories we currently tell about Romanticism and the American Renaissance, which are centered firmly in New England during the early 1850s and dominate our understanding of the period. Perhaps we need to at least consider the possibility that American regionalism took hold significantly earlier than we usually claim.
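For the technically curious, the cleanup rule described above (look names up in a database of places, but require a leading preposition for cities and regions) can be sketched in a few lines of Python. This is only an illustration of the idea, not GeoDict's actual code: the tiny `GAZETTEER`, the `PREPOSITIONS` list, and the choice to accept country names without a preposition are all assumptions made for the example, and multiword place names are ignored.

```python
import re

# Hypothetical miniature gazetteer; a real one holds a very large list of names.
GAZETTEER = {
    "boston": "city",
    "paris": "city",
    "virginia": "region",
    "china": "country",
}

# Prepositions accepted as a cue that the next word is a place (an assumption).
PREPOSITIONS = {"in", "at", "to", "from", "near", "of"}

# Kinds that must follow a preposition to count; countries are accepted bare
# (also an assumption made for this sketch).
NEEDS_PREPOSITION = {"city", "region"}


def extract_places(text):
    """Return gazetteer matches in order, applying the preposition filter."""
    tokens = re.findall(r"[A-Za-z]+", text)
    found = []
    for i, token in enumerate(tokens):
        kind = GAZETTEER.get(token.lower())
        if kind is None or not token[0].isupper():
            continue
        previous = tokens[i - 1].lower() if i > 0 else ""
        if kind in NEEDS_PREPOSITION and previous not in PREPOSITIONS:
            continue  # conservative: drop "Paris waited", keep "in Paris"
        found.append(token)
    return found


sample = "Virginia sailed from Boston to China, and Paris waited in Virginia."
print(extract_places(sample))  # ['Boston', 'China', 'Virginia']
```

The filter throws away the two personal names ("Virginia sailed," "Paris waited") at the cost of missing any genuine place that happens to appear without a preposition, which is the conservative trade-off described above.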

So as I say, I think this is a pretty interesting result, one that demonstrates a first step in the kind of analyses that remain literary and cultural but that don’t depend on close reading alone nor suffer the material limits such reading imposes. I think we should do more of this—not necessarily more geolocation extraction in mid-nineteenth-century American fiction (though what I just showed obviously doesn’t exhaust that little project), but certainly more algorithmic and quantitative analysis of piles of text much too large to tackle “directly.” (“Directly” gets scare quotes because it’s a deeply misleading synonym for close reading in this context.)

If we do that—shift more of our critical capacity to such projects—there will be a couple of important consequences. For one thing, we’ll almost certainly become worse readers. Our time is finite; the less of it we devote to an activity, the less we’ll develop our skill in that area. Exactly how much our reading suffers—and how much we should care—are matters of reasonable debate; they depend on both the extent of the shift and the shape of the skill–experience curve for close reading. My sense is that we’ll come out alright and that it’s a trade well worth making. We gain a lot by having available to us the kinds of evidence text mining (for example) provides, enough that the outcome will almost certainly be a net positive for the field. But I’m willing to admit that the proof will be in the practice and that the practice is, while promising, as yet pretty limited. The important point, though, is that the decay of close reading as such is a negative in itself only if we mistakenly equate literary and cultural analysis with their current working method.

Second—and maybe more important for those of us already engaged in digital projects of one sort or another—we’ll need to see a related reallocation of resources within DH itself. Over the last couple of decades, many of our most visible projects have been organized around canonical texts, authors, and cultural artifacts. They have been motivated by a desire to understand those (quite limited) objects more robustly and completely, on a model plainly derived from conventional humanities scholarship. That wasn’t a mistake, nor are those projects without significant value. They’ve contributed to our understanding of, for example, Rossetti and Whitman, Stowe and Dickinson, Shakespeare and Spenser. And they’ve helped legitimate digital work in the eyes of suspicious colleagues by showing how far we can extend our traditional scholarship with new technologies. They’ve provided scholars around the world—including those outside the centers of university power—with better access to rare materials and improved pedagogy by the same means. But we shouldn’t ignore the fact that they’ve also often been large, expensive undertakings built on the assumption that we already know which authors and texts are the proper ones to which to devote our scarce resources. And to the extent that they’ve succeeded, they’ve also reinforced the canonicity of their subjects by increasing the amount of critical attention paid to them.

What’s required for computational and quantitative work—the kind of work that undermines rather than reinforces canons—is more material, less elaborately developed. The Wright collection, on which the 1851 map that I showed a few minutes ago was based (Figure 1), is a partial example of the kind of resource that’s best suited to this next development in digital humanities research. It covers every known American literary text published in the U.S. between 1851 and 1875 and makes them available in machine-readable form with basic metadata. Google Books and the Hathi Trust aim for the same thing on a much larger scale. None of these projects is cheap. But on a per-volume basis, they’re not bad. And of course we got Google and Hathi for very little of our own money, considering the magnitude of the projects.

It will still cost a good deal to make use of what we might call these “bare” repositories. The time, money, and attention they demand will have to come from somewhere. My point, though, is that if (as seems likely) we can’t pull those resources from entirely new pools outside the discipline—that is to say, if we can’t just expand the discipline so as to do everything we already do, plus a great many new things—then we should be willing to make sacrifices not only in traditional or analog humanities, but also in the types of first-wave digital projects that made the name and reputation of DH. This will hurt, but it will also result in categorically better, more broadly based, more inclusive, and finally more useful humanities scholarship. It will do so by giving us our first real chance to break the grip of small, arbitrarily assembled canons on our thinking about large-scale cultural production. It’s an opportunity not to be missed and a chance to put our money—real and figurative—where our mouths have been for two generations. We’ve complained about canons for a long time. Now that we might do without them, are we willing to try? And to accept the trade-offs involved? I think we should be.

My 2011 MLA Session

January 6th, 2011 § Leave a Comment

For those attending MLA in Los Angeles this week, I’ll be taking part in a “digital roundtable” organized by the ACH. Details below. Lots of smart people and interesting projects. The session abstract:

The Association for Computers and the Humanities (ACH) is pleased to sponsor an electronic roundtable and demo session featuring new and renewed work in media and digital literary studies. Projects, groups, and initiatives highlighted in this session build on the editorial and archival roots of humanities scholarship to offer new, explicitly methodological and interpretive contributions to the digital literary scene, or to intervene in established patterns of scholarly communication and pedagogical practice. Each presenter will offer a very brief introduction to his or her work, setting it in the context of digital humanities research and praxis, before we open the floor for simultaneous demos and casual conversations with attendees at eight computer stations.

A complete session description, including a list of presenters and individual project abstracts, is available on the ACH site. MLA’s session description (less info but with up-to-date annotations) is available to MLA members.

Session details:

  • 193. New (and Renewed) Work in Digital Literary Studies
  • Friday, 7 January
  • 8:30–9:45 a.m., Plaza I, J. W. Marriott

    What To Do With Too Much Text

    October 10th, 2010 § 2 Comments

    Below are the slides from my talk on text mining, “What To Do with Too Much Text, or, Data Mining for the Humanities and Social Sciences,” given at the Washington University Center for Political Economy a few days ago (8 Oct. 2010). For those who weren’t there, the talk was primarily a survey of approaches to (mostly) humanities-oriented text analysis with examples drawn from literary studies, history, psychology, and political science. For a fuller treatment of the opening “Motivations” section, see this post. You might also want to check out the theoretical underpinnings of my own allegory project, about which I said relatively little.

    The original slides are in Keynote and include embedded videos that don’t translate well to PowerPoint (and confuse SlideShare); rather than make a hash of things, I’ve put up a Quicktime version for people who don’t have access to Keynote. The Keynote file includes my (hopefully non-embarrassing) presenter notes, which may give a fuller sense of what I said at some points.

    Below are links to the projects and tools I mentioned (roughly in order of appearance).

    Projects and Works Cited

    Tools

    There are many, many text analysis and natural language processing tools available, many of them geared toward specific research domains. I mentioned only a comparative handful. This list is a long way from exhaustive.

    All projects are free and open source unless otherwise noted.

    Built Tools

    Good places to start; little or no programming required.

    • Wordle. Word clouds. Noncommercial use only, I believe.
    • WordHoard. Statistics, analytics, and visualizations of classic literature.
    • GeoDict. Extract named places from unstructured text.
    • Docuscope. A semi-publicly-available tool for text analysis backed by an extensive, hand-curated dictionary.
    • Casstools.org. Contrast Analysis of Semantic Similarity. Evaluate differential word associations in text corpora.
    • Voyeur Tools. Simple, Web-based text analytics. BYO text/corpus.
    • The MONK Project. Integrated, Web-based corpus analysis. Uses only texts from the (relatively large) included corpus.
    • SEASR. Packaged text analytics and development environment aimed at scholars in the humanities. Includes Zotero integration. SEASR pushes toward a full toolkit.
    • And one tool that I didn’t have a chance to mention: Mark Olsen’s ARTFL-associated PhiloLine/PAIR. Sequence alignment detection in textual corpora; the analogy is to similar work in genetics.

    Toolkits and Development Environments

    Most of these packages come with demos and tutorials that may be useful on their own, but they’re aimed at allowing you to create your own text-mining applications.

    • GATE. An advanced development environment for text analysis with included analysis routines.
    • LingPipe. Advanced, Java-based natural language processing (NLP) toolkit. Partially integrated with GATE, but also a stand-alone product. Open source, but free only if you make your output texts freely available.
    • NLTK. Well-documented, Python-based NLP toolkit. Used widely in teaching NLP.
    • MALLET. Java-based, command-line package for statistical NLP. Useful for topic modeling, among many other things.

    Statistics Packages

    These packages don’t necessarily have anything to do with natural language analysis, but they’re useful for general statistical work and visualization.

    • R. A platform for statistical computing. Baayen’s book on corpus linguistics with R is a useful introduction with a natural language focus.
    • SPSS. The long-serving standard for stats in the social sciences. Emphatically not free, but widely site-licensed.

    Hope this is of some use. Drop me a line (see the “About” page) if you spot any errors or want to chat about this work.

    Expanded List of Allegorical-Nonallegorical Pairs

    September 30th, 2010 § 1 Comment

    Background

    In an earlier post, I offered a brief list of paired allegorical and nonallegorical texts by single authors. The idea was to use these pairs to look for the distinguishing textual features of allegory by controlling for as many variables (such as authorial style, genre, national origin, gender, period of composition, etc.) as possible. Or in other words, the attempt was to get as close as possible to the unattainable ideal of a corpus of texts that differ only by the presence or absence of allegory.

    That short list was OK and was the basis of the second figure in my MLA paper on “Critical Text Mining.” But it was both (1) too short for corpus work and (2) depended on my own assessment of allegoricalness, with attendant limitations of historical scope. I’ve always felt that the better option would be to build an expanded version of this pairwise list on the basis of settled scholarship in the field.

    The table below represents the groundwork for such a corpus of well-established allegorical-nonallegorical pairs. It’s still under development—there are obvious holes and issues—but it’s an outline of where I’m headed. What I really need now is feedback on the composition of this list.

    Issues and Notes

    A few notes, followed by a request for kind assistance:

    • All of the allegorical works are attested by one or more of the following major sources on allegory. Most are attested by several of them.
      • Copeland, Rita, and Peter Struck, eds. The Cambridge Companion to Allegory. Cambridge: Cambridge UP, 2010.
      • Fletcher, Angus. Allegory: The Theory of a Symbolic Mode. Ithaca: Cornell UP, 1964.
      • Honig, Edwin. Dark Conceit: The Making of Allegory. Hanover, NH: UP of New England, 1959.
      • Leeming, David Adams, and Kathleen Morgan Drowne. Encyclopedia of Allegorical Literature. Santa Barbara, CA: ABC-CLIO, 1996.
      • Tambling, Jeremy. Allegory. New York: Routledge, 2010.
    • From these sources, I’ve excluded works mentioned only in passing or discussed as ambiguous or difficult cases. So while there’s always room to argue about the allegoricalness of any entry, the texts presented here under the heading of “Allegory” are about as canonically allegorical as it’s possible to be.
    • The nonallegorical texts are another matter; I’ve selected them myself as potential pairings for the allegorical entries. So far I’ve limited these to works by the same author, but I’m not necessarily averse to well-paired nonallegorical entries by other authors (and I’m aware that such pairings will sometimes be required).

    There are two ways to use this list, and therefore two potentially conflicting goals when selecting pairs of texts:

    • Pairwise comparisons. In this case, I’ll evaluate each allegorical text only against its paired nonallegorical counterpart. For this purpose, it’s not especially important where the two texts fall on the imagined spectrum of allegoricalness, only that they be well separated from one another on it. But it is important that the two members of the pair are otherwise as similar as possible.
    • Corpus comparisons. On the other hand, I’ll also want to compare the features of the allegorical texts taken together against those of the collected nonallegorical texts. For this purpose what’s important is to avoid cases in which any of the allegorical or nonallegorical entries stray too far toward the opposite category, even if they’re significantly different from their pairmates. But it’s not so crucial that any one pair be especially well matched in content, style, etc.; the two corpora just need to be similar in overall composition.
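To make the tension between the two uses concrete, here is a small Python sketch. It assumes, purely for illustration, that each text has already been reduced to a single hypothetical "allegoricalness" score; the author labels and the numbers are placeholders, not measurements of any real text.

```python
# Placeholder values only: one made-up score per text.
# Each tuple is (allegorical text, nonallegorical pairmate).
pairs = {
    "author_a": (0.42, 0.30),
    "author_b": (0.25, 0.20),
    "author_c": (0.50, 0.45),
}


def pairwise_differences(pairs):
    """Allegorical minus nonallegorical score within each pair."""
    return {name: a - n for name, (a, n) in pairs.items()}


def corpora_separated(pairs):
    """True only if every allegorical score exceeds every nonallegorical one."""
    allegorical = [a for a, _ in pairs.values()]
    nonallegorical = [n for _, n in pairs.values()]
    return min(allegorical) > max(nonallegorical)


# Every pair is separated in the right direction ...
print(all(d > 0 for d in pairwise_differences(pairs).values()))
# ... yet the two corpora overlap, because author_b's allegorical text
# scores below the nonallegorical texts of the other two authors.
print(corpora_separated(pairs))
```

In this toy case the list would serve well for pairwise comparisons but poorly for corpus comparisons, which is the conflict between the two goals.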

    Action Item

    So what I’m looking for is feedback on the suitability of the nonallegorical items that are currently listed below, plus suggestions for appropriate texts where none is given.

    The ideal case is to find a firmly nonallegorical text by the same author for each of the allegorical entries, but where that’s not possible, the next best solution is probably a text of similar origin, style, length, subject matter, form, and so forth. This will never be perfect, but the closer the match—while still maintaining good relative and absolute separation on the allegorical continuum—the better.

    I’d also love to know about potential issues or complications concerning any of these texts and pairings.

    Oh, and one other constraint: I need to be able to get my hands on electronic versions of whatever texts I’m going to use; this makes anything published after 1923 difficult (though not strictly impossible).

    Massive thanks in advance to any and all who care to comment. The comments section below is probably the easiest way to leave feedback, or you can email me by clicking the “About” link (over on the lefthand side).

    The Table: Allegorical and Nonallegorical Text Pairs Grouped by Era

    Author Allegory Nonallegory Notes
    Ancient and classical
    Aeschylus Prometheus Bound Agamemnon Disputed authorship of Prometheus Bound
    Aesop Fables ???
    Hesiod Theogony Works and Days
    Boethius Consolation of Philosophy De Musica De Musica seems unsuitable
    Capella, Martianus Marriage of Mercury and Philology ???
    Ovid Metamorphoses Amores
    Prudentius Psychomachia Cathemerinon
    Virgil Aeneid Georgics
    Anon. Bible (Genesis) ??? Very likely more interpretational trouble than it’s worth
     
    Medieval and Renaissance
    Alain de Lille Complaint of Nature Liber poenitentialis
    Lorris, Guillaume de Romance of the Rose ??? Other medieval romance?
    Silvestris, Bernard Cosmographia ??? Maybe commentary on Aeneid, but disputed authorship and different form
    Bale, John King John ??? Another play from the era?
    Chaucer, Geoffrey House of Fame Troilus and Criseyde
    Chaucer, Geoffrey Parliament of Fowles Troilus and Criseyde
    Fletcher, Phineas Purple Island ??? "Brittain’s Ida" (erotic poem)?
    Gower, John Confessio Amantis Vox Clamantis
    Hawes, Stephen Passetyme of Pleasure Comfort of Lovers
    Kempe, Margery Book of Margery Kempe ???
    Langland, William Piers Plowman ???
    Lydgate, John Reson and Sensualitie King Henry VI’s Triumphal …
    Shakespeare, William Phoenix and the Turtle ??? Appropriate sonnets?
    Spenser, Edmund Faerie Queene Shepheardes Calender Or Complaints
    Anon. Castle of Perseverance ???
    Anon. Everyman ???
    Anon. Pearl ???
    Alighieri, Dante Divine Comedy Vita Nuova
    Tasso, Torquato Jerusalem Conquered Aminta
    Calderón Great Theater of the World ??? "Life Is a Dream" too allegorical?
     
    17th & 18th centuries
    La Fontaine, Jean de Fables Tales
    Bunyan, John Holy War Grace Abounding
    Bunyan, John Life and Death of Mr Badman
    Bunyan, John Pilgrim’s Progress
    Defoe, Daniel Robinson Crusoe Journal of the Plague Year
    Dryden, John Absalom and Achitophel Annus Mirabilis
    Milton, John Comus Samson Agonistes Samson Agonistes too allegorical?
    Milton, John Paradise Lost ??? Areopagitica? Genre/form mismatch.
    Pope, Alexander Dunciad Rape of the Lock
    Swift, Jonathan Battle of the Books Modest Proposal
    Swift, Jonathan Gulliver’s Travels Argument Against Abolishing Christianity
    Swift, Jonathan Tale of a Tub
     
    19th century British
    Verne, Jules Journey to the Center of the Earth Twenty Thousand Leagues Or "Around the World in 80 Days"
    Butler, Samuel Erewhon Way of All Flesh
    Conrad, Joseph Heart of Darkness Lord Jim
    Darwin, Erasmus Temple of Nature Botanic Garden
    Gissing, George Nether World New Grub Street
    Kipling, Rudyard Below the Mill-Dam Young Men at the Manor Better pairing?
    Shelley, Mary Frankenstein Mathilda
     
    19th century American
    Baum, L. Frank Wonderful Wizard of Oz Queen Zixi of Ix
    Hawthorne, Nathaniel Antique Ring ??? Suitable stories?
    Hawthorne, Nathaniel Birthmark ???
    Hawthorne, Nathaniel Rappaccini’s Daughter ???
    Hawthorne, Nathaniel Scarlet Letter House of the Seven Gables
    Melville, Herman Confidence-Man Israel Potter
    Melville, Herman Mardi Typee
    Melville, Herman Moby-Dick Omoo
     
    Modern
    Čapek, Karel R.U.R. ???
    Čapek, Karel War with the Newts ???
    Kafka, Franz Castle ???
    Kafka, Franz Country Doctor ???
    Kafka, Franz Metamorphosis Description of a Struggle
    Kafka, Franz Trial Amerika
    Camus, Albert Plague First Man
    Huxley, Aldous Brave New World Point Counter Point "Crome Yellow" (and maybe "Antic Hay") are public domain
    Orwell, George 1984 Burmese Days
    Orwell, George Animal Farm Road to Wigan Pier
    Mann, Thomas Mario and the Magician Buddenbrooks
    Yeats, William Butler Dialogue of Self and Soul Second Coming
    Zamyatin, Yevgeny We Islanders
    Hurston, Zora Neale Moses, Man of the Mountain Their Eyes Were Watching God
     
    Contemporary
    Golding, William Lord of the Flies The Scorpion God The Inheritors
    Lewis, C. S. Lion, the Witch, and the Wardrobe ???
    Rushdie, Salman Midnight’s Children Fury Or Ground Beneath Her Feet or Moor’s Last Sigh
    Beckett, Samuel Waiting for Godot All That Fall Suitable nonallegorical drama?
    Nabokov, Vladimir Lolita Ada
    Coetzee, J.M. Waiting for the Barbarians Boyhood Or Youth/Summertime
    Barth, John Giles Goat-Boy Sot-Weed Factor
    Ellison, Ralph Invisible Man ???
    Faulkner, William Fable The Hamlet
    Ginsberg, Allen Howl Kaddish
    Kesey, Ken One Flew over the Cuckoo’s Nest Sometimes a Great Notion
    O’Connor, Flannery Violent Bear It Away ??? Wise Blood too allegorical

    Publishing Stats from the UK

    September 29th, 2010 § Leave a Comment

    A quick follow-on to my previous post on the number of novels published annually in the U.S. I’ve now seen roughly comparable figures for the UK from 1994 through 2008 (via Dan Cohen, with thanks for the pointer).

    The UK numbers come from Nielsen and aren’t broken down by category, but the overall picture is that there have been about half as many total English-language volumes published annually there as in the U.S. in recent years. I don’t know if Brits are bigger readers of fiction, proportionately, than Americans, but I’d say the large-scale assumption that the two markets for fiction are of the same general magnitude (within about a factor of two) is reasonable.

    What I’d still like to know is the portion of their annual output that’s in common. Are twenty percent of novels published in one country also published in the other? Fifty percent? Eighty? And are novels more (or less?) internationally “portable” than other kinds of books?

    Elson et al., “Extracting Social Networks from Literary Fiction” (2010)

    September 20th, 2010 § 2 Comments

    Just had the chance to read this intriguing paper on automated assessment of social networks in nineteenth-century British fiction, presented at this year’s ACL conference (and picked up on DH Now). I’m posting more for the link than anything else, but a couple of thoughts that are too long for Twitter …

    The paper’s take-away point is that (British nineteenth-century) fiction set in an urban environment doesn’t seem to show the diffuse social networks vis-à-vis rural fiction that one might expect following Bakhtin and others. Social networks in urban fiction turn out to be about the same size as those in rural fiction, and the connections between urban characters are if anything more robust than those between rural characters. The theory of chronotopes is said to take something of a hit here, though it’s by no means overturned.

    The social networks in question are measured by the quantity of direct discourse exchanged between any two characters in a text. The dialogue in question needs to be presented in quotes, the people speaking or being spoken about need to be named (in a way amenable to algorithmic named entity extraction), and there can’t be more than 300 words of non-dialogic exposition between entries in a single conversation. The authors also prune minor and fleeting characters from their networks in order to keep them manageable. The methodological details are pretty interesting; have a look at the paper for the full run-down.
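As a rough illustration of the general shape of such a network (not the authors' code), here is a minimal Python sketch. It assumes the hard parts are already done: quoted speech has been found and attributed to named characters. The characters, word counts, and the `min_words` pruning threshold are all hypothetical, and the 300-word conversation window is not modeled.

```python
from collections import Counter

# Hypothetical input: (speaker, addressee, words_spoken) for each quoted
# utterance already attributed to named characters.
utterances = [
    ("A", "B", 40),
    ("B", "A", 55),
    ("A", "C", 30),
    ("C", "A", 10),
    ("A", "D", 4),
]


def build_network(utterances):
    """Undirected edge weights: total words exchanged by each pair."""
    edges = Counter()
    for speaker, addressee, words in utterances:
        edges[frozenset((speaker, addressee))] += words
    return edges


def prune_minor(edges, min_words):
    """Drop characters whose total dialogue falls below min_words."""
    totals = Counter()
    for pair, words in edges.items():
        for name in pair:
            totals[name] += words
    keep = {name for name, total in totals.items() if total >= min_words}
    return {pair: words for pair, words in edges.items() if pair <= keep}


network = build_network(utterances)
pruned = prune_minor(network, min_words=20)
print(len(network), len(pruned))  # 3 2
```

Here the fleeting character D drops out once pruning is applied, leaving only the two heavier edges.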

    This is compelling work and may be an important contribution to the way we think about urbanization in nineteenth-century fiction. There are a few tricky problems, though.

    1. Conversation seems like a pretty good proxy for social connectedness, but of course it’s a partial and imperfect gauge; there certainly could be others.
    2. The inability to detect and evaluate indirect discourse (in validation tests, the authors’ method missed about half of the relevant dialogic exchanges) might be especially important in urban settings. It’s possible (but by no means certain) that urban characters spend more time overhearing, summarizing, and recounting than speaking face-to-face. Or maybe urban novels emphasize indirect discourse as a means by which to convey some aspect of city life. The point is that there might be important differences both in the types of social networks presented through direct and indirect discourse and in the sheer quantity of indirect discourse in different types of fiction.
    3. And of course throwing out fleeting and minor characters, which might be expected to occur more often in urban settings, would tend to concentrate any measure of the resulting social network.

    Anyway, I’m fascinated by the work and don’t mean to pick nits. The paper is well worth a read. I look forward to seeing more from the group in the future.

    Book Revisions with LaTeX and Git

    July 5th, 2010 § 10 Comments

    As anticipated, a quiet summer around these parts as I revise my manuscript on the theory and mechanisms of midcentury fiction. A quick technical update and a couple of questions for those with experience using Git source control for writing projects.

    I spent a chunk of the day today getting my head around Git. I’d been thinking about using it for a while and was helped along by my decision to dump Word in favor of LaTeX a couple of months ago; Word’s binary blobs aren’t well suited to version control (though that’s the least of Word’s problems, really). I also use Dropbox, which does basic automatic versioning, so I hadn’t had much reason to mess with the complexity of Git until now. But Dropbox (reasonably enough) only keeps a finite number of old versions of a file, and it doesn’t let you flag any of them to let your future self know what changed in any given rev. And there are a lot of revs, since it creates a new one every time you save a file (there’s no notion of a commit). This is all totally reasonable for Dropbox, which is a dead simple tool that’s made my working life better in every way. But I wanted more control as I hack away at my very long, slightly disorganized, heavily commented, totally in flux mid-revision book.

    So … Git. What’s both cool and terrifying about Git is that it morphs the live files in your working directory as you switch from one branch or revision to another. See this concise explanation of the process from Ben Lynn. (Note to self: Do not switch branches while a file is open in your editor.) Git’s worth a look if you haven’t dealt with modern revision control systems before; much easier and niftier than my brief encounters with CVS years ago had led me to believe.

    Anyway, two questions for those more experienced with this stuff than I:

    1. I’m planning to use branches for the major edits to each chapter, so that I can easily go back and consult or restore the large sections that are inevitably hacked off along the way. Does this make sense? Are tags or clones more appropriate? Are branches overkill? Should I just trust my commented commits on a single trunk? What does your workflow for writing and revising with Git look like?
    2. Is there any reason not to combine Git and Dropbox? I’ve put my .git directory inside my current project directory, which already lives in my Dropbox folder. I can’t see any harm in this beyond a bit of redundancy, but I’d welcome any warnings from hard-won experience.

    Two last things:

    One, I’ll put the full manuscript on GitHub or similar once it’s no longer filled with embarrassing and/or libelous comments.

    Two, tomorrow’s project is to merge the massive changes between the existing chapter on William Gaddis and the much more compact version that’s been accepted by Contemporary Literature. This is a good problem to have, but trying to manage it is the proximate cause of all this version control business.

    Oh, and DH 2010 starts the day after tomorrow. Very sorry not to be in London, but I’ll have the #dh2010 firehose open next to TeXShop for the next few days.

    Gutenplots

    March 28th, 2010

    As promised yesterday, here are a few plots of the distribution of literary titles in the Gutenberg corpus by the date of their authors’ birth. Producing these was as much a way for me to play with ggplot2 (written by my colleague Hadley Wickham in the statistics department here at Rice) as anything else, but the results are interesting, too.

    (Note that in all of the following plots, the titles in question are from the Gutenberg catalog as of 22 March 2010. They include only volumes in English with Library of Congress subject codes PR [British literature] or PS [American lit] and with both a determinate author [no blanks, "Anonymous," "Various," etc.] and a supplied creator birth year. No further curation was performed. There are 3380 PS titles and 3145 PR titles that fit this description. These numbers are somewhat greater than those in yesterday’s post, because I didn’t do any manual de-duping. In any case, when I talk about “Gutenberg” below, be aware that I’m only addressing this specific, literary, English-language subset of the full 30,000+ volumes in the corpus.)
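    For reference, the selection rules in the note above can be sketched in base R. This is only an illustration under assumptions: the toy catalog rows and the column names (creator, language, lcc, birth) are hypothetical and are not the layout of the actual Gutenberg catalog.

```r
# Hypothetical toy catalog; rows and column names are invented for illustration.
catalog <- data.frame(
  title    = c("Hamlet", "Roughing It", "Beowulf", "Candide", "Stories"),
  creator  = c("Shakespeare, William", "Twain, Mark", "Anonymous", "Voltaire", "Various"),
  language = c("en", "en", "en", "fr", "en"),
  lcc      = c("PR", "PS", "PR", "PQ", "PS"),
  birth    = c(1564, 1835, NA, 1694, NA),
  stringsAsFactors = FALSE
)

# Creators treated as indeterminate (blank, "Anonymous", "Various").
excluded_creators <- c("", "Anonymous", "Various")

# Keep English volumes coded PR or PS with a determinate author and a birth year.
keep <- catalog$language == "en" &
  catalog$lcc %in% c("PR", "PS") &
  !(catalog$creator %in% excluded_creators) &
  !is.na(catalog$birth)

literary <- catalog[keep, ]

# Volume counts per subject code in the filtered subset.
print(table(literary$lcc))
```

    No de-duplication happens here, which matches the "no further curation" caveat.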

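    First up, histograms by decade:

    [Figure: PR Hist Long.png. Histogram of British (PR) titles by decade of author birth, 1300-2000]
    [Figure: PS Hist Long.png. Histogram of American (PS) titles by decade of author birth, 1300-2000]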

    There’s a lot of whitespace in these because I’ve shown the full date range 1300-2000 in order to make direct comparisons between the British and American subsets easier.

    No surprise that Gutenberg comprises primarily works by authors born in the nineteenth century. In both cases, there are large but not overwhelming spikes around the 1860s and ’70s. Those (birth) years produced a lot of prolific authors, including those who wrote stories and other multivolume works (we’re tallying volumes, not pages or words). It seems a little late, though, for authors born in these years (and presumably writing mostly in the very late nineteenth and early twentieth centuries) to be cranking out triple-deckers. Will look into this. I suspect it has more to do with a general upward trend in publishing volume over time, a trend that tails off in Gutenberg only because of copyright issues for authors born much later than 1880 or ’90. But I also can’t rule out some sort of other selection effect having to do with Gutenberg’s acquisitions process rather than the underlying literary production of the period. Should talk to Matt Jockers and Franco Moretti about this; they know big-picture numbers about the nineteenth century better than anyone else I know. In any case, the high numbers for the mid-late nineteenth century look to be “real,” by which I mean that there’s no obvious cataloging anomaly or small handful of over-represented authors to explain them away.

    For more detail (and slightly niftier plotting), here are the counts for PR and PS volumes by year plotted against one another directly:

    [Figure: All Full.png. PR and PS title counts by author birth year, full date range]

    The outliers (with counts above about 125) are the years:

    • 1564 (Shakespeare, labeled; Martin Mueller’s not kidding about the extent to which Shakespeare dominates our understanding of the early modern period)
    • 1803 (PR; Lytton, mostly, who has lots of multivolume works)
    • 1835 (PS; Twain)
    • 1862 (PS; Edith Wharton, O. Henry, Gilbert Parker, and others)
    • 1863 (PR; W.W. Jacobs, author of many a short story, among others)

    How about a more focused version for the years 1700-2000, with smoothed means, to make the core comparison easier?

    [Figure: All Detail Fit.png. Gutenberg Titles by Author Birthdate (Detail, Fitted), 1700-2000]

    As predicted, the American lit is slightly more recent, on average, than the British. But the difference is small, and it’s mostly down to the presence of comparatively recent work by American (or at least PS-categorized) authors that has entered the public domain one way or another during a period when that wouldn’t happen automatically. Such recent works are totally absent from the British/PR list, which ends with authors born right at the turn of the last century (and not many of those, for obvious copyright-related reasons).

    It would be nice to have dates of composition for the works themselves, but that’s not likely to happen without serious additional legwork. In the meantime, author birthdates aren’t all bad; if you make the debatable but not ridiculous assumption that most authors are largely formed in their early careers, you might do just as well grouping their works by “date of maturity” as you would by date of composition. (And you wouldn’t keep trying to shoehorn Henry James into modernism proper, for God’s sake!) Plus, you’d avoid the separate issue of publication dates that don’t line up with composition dates.

    Finally, for my own future reference, the (ugly!) R/ggplot2 commands that generated these figures.

    The fitted, annotated, detail scatterplot:

    qplot(V1, V2, data=pr,
          xlab="Author Birth Year", ylab="Title Count",
          main="Gutenberg Titles by Author Birthdate (Detail, Fitted)",
          xlim=c(1700, 2000), ylim=c(0, 140)) +
      # PR (British) points come from qplot itself; add a smoothed mean with an invisible band
      geom_smooth(data=pr, color="black", alpha=0) +
      # PS (American) points and smoothed mean, in red
      geom_point(data=ps, color="red") +
      geom_smooth(data=ps, color="red", alpha=0) +
      # Presumably carried over from the full-range plot; this label sits outside the limits set above
      annotate("text", x=1564, y=185, label="Shakespeare", size=4, alpha=0.4) +
      annotate("text", x=1955, y=55, label="PS\n(Amer)", color="red") +
      annotate("text", x=1745, y=40, label="PR\n(British)")

    pr and ps are two-column tables of author birth years (V1) and the corresponding counts of volumes for each year (V2), one year/count pair per line.

    The histograms are similar but easier, involving variations on something like:

    qplot(V1, data=pshist, geom="histogram", binwidth=10, main="American (PS) Gutenberg Titles by Author Birthdate", xlab="Author Birth Year", ylab="Title Count", xlim=c(1300, 2000), ylim=c(0, 700))

    Where pshist is just an unsorted list of author birth years, one for each volume (in this case, each PS volume) in the catalog (so yes, lots of repeats, which is the point).
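    The two input shapes are closely related: tallying the per-volume list gives the year/count pairs. A minimal base R sketch, assuming invented sample years and demo object names (pshist_demo, ps_demo) that only mirror the V1/V2 column names used in the commands above:

```r
# Hypothetical sample: one author birth year per volume (invented values).
birth_years <- c(1835, 1862, 1835, 1843, 1862, 1835)

# Histogram-style input: one row per volume, like pshist.
pshist_demo <- data.frame(V1 = birth_years)

# Scatterplot-style input: one year/count pair per row, like pr and ps.
counts <- table(birth_years)
ps_demo <- data.frame(
  V1 = as.integer(names(counts)),  # author birth year
  V2 = as.vector(counts)           # number of volumes for that year
)

print(ps_demo)
```

    The counts always sum to the number of rows in the per-volume list, which is a handy sanity check when both files are generated separately.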

    Some Gutenberg Numbers

    March 27th, 2010 § 3 Comments

    I spent most of the day—a beautiful, sunny, perfect spring day that I’ll never get back—munging Gutenberg catalog data to see how their holdings stack up for a short-term project of mine. I suppose this built character, and I know from experience that it’s useful to spend time poking around in your data. Still …

    A few numbers that stood out to me (mostly rounded for easier reading):

    There are close to 32,000 total volumes in the Gutenberg catalog, of which almost 20,000 have Library of Congress subject codes. This is good, but not perfect. Nine months ago, the numbers were 29,000 and a little over 16,000. This tells me that pretty much all new additions are being cataloged with full(ish) metadata, but there’s not much progress being made on filling in old records (and the old stuff is often high-profile, since it was what people worked on first).

    Of the c. 32K total volumes, about 26,500 are in English. Among English titles, 16,600 have LC codes, about the same rate as for all titles.

    There are 3,500 titles in English with LC code PR (British literature) and 3,400 with code PS (American). There are another 3,200 P* titles in English, most of which are translations from other languages. So we’re looking at roughly 7,000 readily identifiable titles of British and American literature in English from Gutenberg at the moment. (Note that all these PR/PS numbers exclude about 120 volumes by authors unknown, various, or missing.)

    If the currently untagged volumes contain literature in the same proportion as the tagged ones, we should expect that number (7,000) to increase to 11,000 if everything were cataloged fully. But I’m not holding my breath for retrospective catalog work unless I do it myself by automating some queries against the LC servers. That’s an idea I’ve been kicking around for a while. Not sure if it’s worth the effort to increase the size of the relevant (to me) Gutenberg corpus by a bit more than half.
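    The back-of-the-envelope projection works like this, using the rounded figures from this post. The variable names are just labels for illustration, and the only assumption is the one already stated: untagged volumes contain literature at the same rate as tagged ones.

```r
# Rounded figures quoted in this post.
english_total  <- 26500  # English-language volumes
english_tagged <- 16600  # English-language volumes with LC codes
lit_tagged     <- 7000   # English PR/PS titles among the tagged volumes

# Share of English volumes that currently carry an LC code.
tagged_share <- english_tagged / english_total

# If untagged volumes hold literature at the same rate, scale up by 1 / tagged_share.
lit_projected <- lit_tagged / tagged_share

# Relative growth of the literary subset under full cataloging.
growth <- lit_projected / lit_tagged - 1

print(round(lit_projected, -2))
print(round(100 * growth))
```

    With these inputs the projection lands a little above 11,000, so the rounded figure in the text is if anything conservative.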

    Here’s a bit that might be more interesting. To what extent are Gutenberg’s literature holdings dominated by a small number of authors writing a lot of books? Well, of those 7,000 PR/PS entries in English, 5,800 belong to authors with more than one title to their names. Specifically, those 5,800 titles are the work of 726 different authors. So authors who have more than one title in Gutenberg have on average 8 titles apiece. The remaining 1,200 titles each belong to a singleton author. Overall, that means 7,000 titles by roughly 2,000 authors. Not as bad as I expected, really.
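    Spelled out, the arithmetic behind those author figures is below. The numbers are the rounded ones quoted above; the variable names are illustrative only.

```r
# Rounded figures quoted in this post.
total_titles  <- 7000  # English PR/PS titles
multi_titles  <- 5800  # titles by authors with more than one title
multi_authors <- 726   # authors with more than one title

# Average number of titles per multi-title author.
avg_titles <- multi_titles / multi_authors

# Every remaining title belongs to an author with exactly one title.
single_authors <- total_titles - multi_titles

# Total distinct authors in the subset.
total_authors <- multi_authors + single_authors

print(round(avg_titles, 1))
print(single_authors)
print(total_authors)
```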

    When you look at the list of works by multi-title authors, you see that there’s a fair amount of duplication and cruft. Not in the metadata (which are generally pretty good), but in Gutenberg’s “acquisitions” process: There are lots of cases where etext volumes reflect separate individual paper volumes (e.g., Clarissa, vols. 1-9, each as a separate etext title), or where a work has been digitized multiple times, possibly from different sources. Nothing wrong with that, of course, but if you get rid of it (this involves some judgment, so you do it by hand! fun!), you’re left with about 4,500 more-or-less distinct titles by those same 720-ish authors. Even that number overcounts a bit (because you’re conservative about purging the rolls of duplicates), but it’s reasonably close. For what it’s worth, this means that there are really more like 5,700 distinct cataloged PR/PS volumes in English at the moment (= 7,000 – 1,300 “dupes”).
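    I did the purge by hand, but a crude first pass at flagging multivolume sets could look like the sketch below. To be clear, this is a stand-in heuristic and not the process described above: the sample rows and the volume-suffix pattern are assumptions, and real titles would need more careful handling. The last lines just restate the tally from the rounded figures.

```r
# Hypothetical catalog rows (invented for illustration).
titles <- data.frame(
  author = c("Richardson, Samuel", "Richardson, Samuel",
             "Richardson, Samuel", "Twain, Mark"),
  title  = c("Clarissa, Volume 1", "Clarissa, Volume 2",
             "Clarissa, Volume 3", "Roughing It"),
  stringsAsFactors = FALSE
)

# Strip a trailing volume designator such as ", Volume 2" or ", Vol. 2".
titles$work <- sub(",?\\s*vol(ume|\\.)?\\s*[0-9]+\\s*$", "", titles$title,
                   ignore.case = TRUE, perl = TRUE)

# Collapse rows that share an author and a normalized work title.
distinct_works <- unique(titles[, c("author", "work")])
flagged_dupes  <- nrow(titles) - nrow(distinct_works)
print(distinct_works)

# Tally from the rounded figures in the text.
dupes          <- 5800 - 4500  # multi-author titles removed as duplicates
distinct_total <- 7000 - dupes # distinct cataloged PR/PS volumes
print(distinct_total)
```

    Anything this flags would still want a human look, since separate digitizations of one work rarely share a tidy title pattern.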

    Other thoughts:

    There’s a fair amount of science fiction and related genres in the catalog. I guess I knew this, and probably shouldn’t be surprised given the way the project works.

    Date information for authors is good, at least if you restrict yourself to cases where an LC code exists (and metadata are thus in good shape). Birth and death dates are written into the creator records, so you have to parse them out, but it’s not hard. Still, would be nice if they were a separate entry in the catalog.
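    A minimal sketch of the parsing step, assuming creator strings shaped like "Surname, Forename, 1835-1910". That format, the sample strings, and the helper name parse_years are assumptions for illustration; the real records may differ.

```r
# Hypothetical creator strings (format assumed, values for illustration).
creators <- c("Twain, Mark, 1835-1910",
              "Wharton, Edith, 1862-1937",
              "Anonymous")

# Pull out every four-digit run; treat the first as birth and the second as death.
# Missing years come back as NA because indexing past the end of a vector yields NA.
parse_years <- function(creator) {
  yrs <- as.integer(regmatches(creator, gregexpr("[0-9]{4}", creator))[[1]])
  c(birth = yrs[1], death = yrs[2])
}

# One row per creator, with birth and death columns.
parsed <- t(sapply(creators, parse_years))
print(parsed)
```

    Records with no dates simply produce NA in both columns, which makes it easy to drop them before plotting.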

    Original publication info is nonexistent. Bummer, though I knew this already. Gutenberg is not a home to bibliographic scholarship.

    I was a little surprised that the total numbers for British and American titles were just about even. I expected more British stuff. Will produce a little graph of holdings by author birthdate for each, just for kicks (and update the post accordingly). I expect (unsurprisingly) that the American stuff will skew recent compared to the British.

    That’s it for now. This is all still vaguely allegory-related. More when it’s ready. The job market is keeping me busy.

    Translation Numbers

    December 27th, 2009 § 2 Comments

    I came across an interesting summary of books translated in 2009 hosted on the blog “Three Percent” at the U of R (w00t!). A resource new to me.

    Headline numbers: 348 total new, first-time translations of fiction and poetry into English published in the U.S. this year. The blog reports that translations make up around 3% of the total publications in the States, and only about 0.7% of literary titles. Not much information on methodology that I could see (on a very cursory look), but I assume the list comes from Books in Print or similar. In any case, I’m grateful to have an answer to one of the questions that’s been on my to-do list for a while.

    Next question: How do these numbers compare to those for other countries and to the size of various publishing markets? If a country has a large domestic literary market, do more of its books (proportionately speaking) make it into U.S. translation?

    Where Am I?

    You are currently browsing the Digital Humanities category at Work Product.
