Literature and Economics at Chicago

I’m giving a talk next Friday (5/22) on literature and economic geography as part of Richard Jean So and Hoyt Long’s Cultural Analytics conference at Chicago. (Talking econ at Chicago. That’s not terrifying at all!) The list of speakers is really impressive, present company excluded. If you’re in or near Chicago, hope to see you there.

My talk will be closely related to my recent lecture at Kansas, video of which is available on YouTube (and embedded below). There’s also some enlightening discussion on Facebook; you might need to be friends with Richard So to see it, but you should be friends with him anyway …

Looking forward to seeing folks in Chicago!

PSA: e-Book Publishing Stats

I just read Dan Cohen’s thoughts on the future of e-books. Dan thinks the current “plateau” in e-book sales is either a temporary pause or an artifact of bad sales data, and speculates that digital books will be the (heavily) dominant medium of literary consumption sooner rather than later. I’m strongly inclined to agree, and Dan’s piece is (as always) well worth a read if you’re interested in smart speculation about media, publishing, libraries, and readership.

I’m writing this up for the blog, rather than (just) tweeting it, because Dan’s piece led me to an informative and intriguing report by Author Earnings. I haven’t examined their methods in detail, but they claim, among other things, that 30% of purchased e-books in the US don’t have ISBNs, meaning they aren’t included in Bowker’s publishing reports (about which I’ve previously written, trying to figure out how many new novels are published in the US every year). Anyway, the AE report is worth a look if you’re at least abstractly interested in the economics of the changing publishing industry.


Literary Attention Lag

I gave a short talk on geography and memory at this year’s MLA in Vancouver (session info). I didn’t work from a script, but here’s the core material and a few key slides.

So the problem I was trying to address was this: How is geographic attention in literary fiction related to the distribution of population at the time the fiction is published? And what do the details of the relation between them tell us about literary memory? These are questions I just barely touched in my ALH article on the literary geography of the Civil-War period last year, and I thought they were worth a bit more consideration.

To review, we know that there’s a moderate correlation between the population of a geographic location and the amount of literary attention paid to it (measured by the number of times that place is mentioned in books). New York City is used in American literature more frequently than is Richmond, for instance. (This is all using a corpus of about a thousand volumes of U.S. fiction published between 1850 and 1875, but I strongly suspect the correlation holds elsewhere; I’ll be able to say more definitively and share results in a month or two.)

But there is, in at least some instances, a temporal component involved as well. After all, population isn’t a stable feature of cities. Witness the cases of New Orleans and Chicago:

Population, 1820-1900

Populations of New Orleans and Chicago, 1820-1900

Literary mentions, 1850-1875

Mentions of New Orleans and Chicago, 1850-1875

In short, those cities were about the same size in 1860, but New Orleans — the older of the two by far — was used much more often in fiction at the time. It appears to have taken a while for Chicago to catch on in the literary imagination.

I wondered, then, whether this was a generalizable trend and, if so, whether I could quantify and explain it. I considered four informal hypotheses about the temporal relationship between population and literary-geographic representation (if I were feeling a little grand, I’d refer to these as reduced models of literary-geographic memory).

  1. National or deep. Not all the way to deep time in Wai Chee Dimock’s sense, but maybe closer to Sacvan Bercovitch’s model of Puritan inheritance. Literature in the nineteenth century represents the nation as it was in the eighteenth.
  2. Formative-psychological. Authors (and readers?) represent the world as it existed during their formative years, for whatever value of “formative” we might choose. Presumably their childhood or school years.
  3. Presentist. We find in books largely the world as it is at the time they were written. We see evidence of this in the rapidly shifting topical content of many texts, especially the dross that we don’t tend to study in English departments but that dominates the quantitative output of any period.
  4. Predictive. Literature looks beyond the present to anticipate or shape cultural features not yet fully realized. I don’t think this is as crazy as it might sound. Critics pretty consistently emphasize the transformational power of books in terms that aren’t strictly personal or metaphorical, and we often bristle, rightly, at the notion that literature merely “reflects” the world. The Romantics among us might say that authors are charged with diagnosing or symptomatizing features of the world that will be obvious in the future, but are hidden now.

For what it’s worth, I’d say that (3) and (2) strike me as most likely or broadly relevant, in that order, followed by (1) and, somewhat distantly for literature en masse, (4).

To (begin to) address the problem of literary-cultural lag/memory/prediction, I collected population data from census records for 23 cities that were relatively well represented in the literary corpus and of comparatively significant size at some point before 1900. They ranged from New York and Philadelphia to Newport (RI), Salem (MA), San Francisco, Detroit, Vicksburg and so on. I did a bit of hand correction on the data to account for changing municipal boundaries and to agglomerate urban areas (metro St. Louis, or Albany and Saratoga Springs, or Buffalo and Niagara Falls; in the second and third cases, the latter place was smaller but more frequently used in fiction).

Anyway, with that data in hand, I plotted total literary mentions (1850-1875) against decennial census counts and ran a simple linear regression on each one. Individually, this produced plots like this (using 1850 census data):


The r² value in this case is 0.46, meaning that a city’s 1850 population appears to account for a little less than half the observed variation in literary attention to it over the next two decades. Repeat for every decade with census data to 1990 and you get this:

Literary attention vs. Population, 1790-1990
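Each panel above is the same computation; here's a minimal Python sketch of one such fit, with illustrative numbers standing in for the 23-city data: regress log10(mentions) on log10(census population) and report r².

```python
import numpy as np

# Illustrative values only; the real analysis uses the hand-corrected
# census counts and corpus mention totals for 23 cities.
population_1850 = np.array([515_000, 170_000, 137_000, 46_000, 21_000])
mentions = np.array([1_100, 280, 240, 60, 35])  # totals, 1850-1875

x = np.log10(population_1850)
y = np.log10(mentions)

# Simple linear regression on the logged values
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

# Coefficient of determination
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
```

Running this once per census year, holding the literary counts fixed, produces the grid of fits shown above.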

That’s pretty and all, but it’s a little hard to see the trends in the r² values, which are the thing that would help to quantify the degree of correlation between population and literary attention over time. So let’s pull out the r² values and plot them:

r-squared values over time with Gaussian fit

Now this is pretty interesting (he says, of his own work). Note again that the literary data is the same in every case; the only thing that’s changing is the census-year population. So the position of the largest r² tells us which decade’s population distribution most closely predicts the allocation of literary-geographic attention between 1850 and 1875. The maximum observed r² is in the 1830 data. The fit line here (which is a simple Gaussian, by the way, a fact that’s also kind of nifty and unexpected, since it’s a pretty good fit and symmetrical forward and backward in time) has its max in 1832.
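The peak-locating step can be sketched in a few lines of Python; the r² values below are synthetic stand-ins (generated to peak near 1832), since the real ones come from the per-decade regressions:

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, a, mu, sigma):
    """Simple unnormalized Gaussian: amplitude a, center mu, width sigma."""
    return a * np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Census years 1790-1990 and synthetic r^2 values with a little noise
years = np.arange(1790, 2000, 10).astype(float)
rng = np.random.default_rng(0)
r2 = gaussian(years, 0.46, 1832, 40) + rng.normal(0, 0.01, years.size)

# Fit the Gaussian and read off the year of maximum fit quality
params, _ = curve_fit(gaussian, years, r2, p0=[0.5, 1850.0, 50.0])
peak_year = params[1]
```

The fitted center (`mu`) is the estimate of the decade whose population best predicts the corpus's geographic attention.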

The average book in the literary corpus was published in 1862 and the average age of the author at publication was 42. So it looks like lag peaks at around 30 years and corresponds to the author’s … “experience,” maybe we’d call it? … at age 12. I’d say this is a piece of evidence in favor of the formative-psychological hypothesis, and then I’d wave my hands vigorously indeed.

I expect to do some more exploration in the months ahead. Having literary data forward to 1990 will be a big help. A few things I’ll be looking into:

  • International comparison. How does lag change, if at all, in other national contexts? The U.S. was (and is) pretty young. Maybe longer-established nations have different dynamics. And how about changes in U.S. representation of foreign cities and vice versa? My guess is that lag is longer the less an author or culture knows about a foreign place.
  • Does lag change over time? Is it shorter today than it was 150 years ago? My guess: yes, but not radically.
  • Is the falloff in fit quality always symmetrical in time, and am I capturing all the relevant dynamics? The near-symmetry in the current data is surprising to me; I would have expected better backward fit than forward. Could be an artifact of the United States’ youth at the time; several of the cities in question didn’t exist for much more than a decade or two before the literature represented in the corpus was written. I wonder if part of this, too, is down to offsetting effects of memory (skewing fit better backward in time) and relative population stability (skewing things forward).
  • Other ways to get at the same question. A comparison of topical content against textual media presumed to be faster moving (newspapers, journals, etc.) would be instructive. How much more conservative is fiction than non-fiction?

Finally, three data notes:

  • Full data is available from the data page. And the code used for analysis and plotting can be had as an IPython notebook.
  • Careful readers will have noticed that the fits are log-log, i.e., I’ve used the (base 10) logarithms of both the mention counts and the population figures. This is what you’d expect to do for data like these that follow a power-law distribution.
  • I’ve dropped non-existent cities from the computed regressions (though not the visualizations) as appropriate before 1850 (by which time all the cities have population tallies). I think this is defensible, but you could argue for keeping them and using zero population instead. If I’d done that, the fit quality for 1840 and earlier would have been lower, pushing support toward the presentist hypothesis. But that would also be misleading, since it would amount to treating those cities as if they did exist, but were very small, which isn’t true. That’s one of the reasons to include cities like Salem and Nantucket and Newport, which really did exist but were small(ish) from the earliest days of the republic. Anyway, an interpretive choice.

A Bit of Position-Taking on Surface Reading

There’s a new piece by Jeffrey Williams in the Chronicle on surface reading and “the new modesty” in literary studies. Came to my attention via Ted Underwood, who had a kind of ambivalent response to it on Twitter.

I was going to reply there, but 140 characters weren’t quite enough, and I’m asked about this pretty often, so I thought I’d set down my short thoughts in a more permanent way.

I like and respect Marcus and Best’s work, which I find subtle and illuminating, though most of it falls somewhat outside my own field. And I guess I understand why some people are fed up with ideologically committed, theoretically oriented, hermeneutically inflected literary scholarship. When that stuff is bad, it’s pretty bad. Then again, just about anything can be (and often is) bad. I don’t see any special monopoly on badness there.

I also understand how it’s possible to look at (some) digital humanities research and think that it shares some sort of imagined turn away from depth and detail in favor of “direct” observation of “obvious” features. People who have no experience with the sciences tend to imagine that such things exist and that they’re different from what literary people work with. They aren’t, though that’s an argument for another time. (I have a little on it in passing in my forthcoming Comparative Literature review, FWIW.) In any case, it’s true that you sometimes hear people talking about a desire for “empirical” or “descriptive” research in DH, though they’re in the minority and I’m not one of them.

It’s hopeless, of course, to try to tell other people how to frame their work or ultimately to control how people receive your own. But I’ll say that my own reasons for pursuing computational literary research have nothing to do with (naïve, illusory) empiricism or a desire for critical modesty or a disenchantment with symptomatic, culturally committed criticism. Quite the opposite. Computers help me marshal evidence for large-scale cultural claims. That’s why I’m interested in them: they help me do better the kind of big, not especially modest, fundamentally symptomatic and suspicious critical work that brought me to the field in the first place.

But then, I would say that. I was Fred Jameson’s student and I was his student for a reason.


Books I Read in 2014

Here’s the new (to me) fiction I read this year. As always, I like seeing other people’s lists, so I figure I ought to contribute my own. Archived lists back to 2009 are also available.

  • Aw, Tash. Five Star Billionaire (2013). Did less for me than I’d hoped, but I think I just faulted it for failing to be the sweeping social drama I wanted it to be.
  • Braak, Chris. The Translated Man (2007). Pretty fun steampunk piece. Can’t remember how I found it – a blog somewhere, I think.
  • Catton, Eleanor. The Luminaries (2013). Really well done, but (just?) an entertainment.
  • Hustvedt, Siri. The Blazing World (2014). For stories of women and art, I preferred Messud.
  • Lepucki, Edan. California (2014). Read this right after The Bone Clocks for maximum depression value. Should have been 30 pages shorter or 100 pages longer — the ending doesn’t quite work.
  • Marcus, Ben. The Age of Wire and String (1995). The lone full-on experimental text on the list. Didn’t enjoy it as much as I expected to, because I’m a hypocrite.
  • Martin, Valerie. The Ghost of the Mary Celeste (2014).
  • Mengestu, Dinaw. All Our Names (2014). Disappointing. Guess I wanted more and tighter politics, less domestic drama.
  • Messud, Claire. The Woman Upstairs (2013). Enjoyed this a lot, probably more than anything else on the year.
  • Mitchell, David. The Bone Clocks (2014). I love Mitchell, who’s almost good enough to pull off the book’s bizarre mashup of Black Swan Green, the innermost novella of Cloud Atlas, and interdimensional Manichaean sci-fi. Almost.
  • Murakami, Haruki. 1Q84 (2011). I also like Murakami, but 1,000 pages of close to literally nothing happening is a lot to ask.
  • Offill, Jenny. Dept. of Speculation (2014). Good, narratively interesting, but ultimately underdrawn in substance.
  • Osborne, John. Look Back in Anger (1956). A quick glance at Osborne, whom I’d never read.
  • Tartt, Donna. The Goldfinch (2013). More disaster/suffering porn. Didn’t like it.
  • Waldman, Adelle. The Love Affairs of Nathaniel P. (2013).
  • Weir, Andy. The Martian (2011). Picked up in an airport book rack for a flight with a dead Kindle. Fun to read, sociologically and symptomatically interesting.
  • Wolitzer, Meg. The Interestings (2013). Not really. (Ooh, sick burn!)

Also picked up and put down … let’s see … Hotel World by Ali Smith, Ugly Girls by Lindsay Hunter, We Are All Completely Beside Ourselves by Karen Joy Fowler, and a couple of others.

Sixteen books and one play in sum, a little better than usual. Helps to be on leave. But not a year full of great reads. Was briefly enamored of Offill’s book, but its genuinely cool schtick got a little flat over just 100 pages. The Woman Upstairs was probably my favorite, and even that one wasn’t something I fell in love with. Nothing on the list that I’d especially want to teach or that struck me as something I should spend more time thinking about.

On the whole, it seemed as though I’d read a lot of these things before; well-executed, straight-ahead fiction. Which I suppose is mostly a defect in me, picking things from the pages of the New Yorker and the LRB and the Times and such. I know their deal; it’s not like those outlets went unexpectedly conservative this year. I read a lot of things out of vague professional obligation. The books I had the most fun with — Dept. of Speculation, The Translated Man, The Martian — were either experimental or genre fiction. Maybe there’s a lesson here. Maybe I should learn it.

So, here’s to a better 2015. Leading off (in the absence of the aforementioned lesson) with maybe Lily King’s Euphoria or Hilary Mantel’s Assassination of Margaret Thatcher or Marlon James’s Brief History of Seven Killings or Phil Klay’s Redeployment. Or Emily St. John Mandel’s Station Eleven, if I want to continue the apocalyptic theme from Mitchell and Lepucki …

New Minor in Computing and Digital Technologies at Notre Dame

I’m pleased to announce a new collaborative undergraduate minor in Computing and Digital Technologies at the University of Notre Dame. Beginning next fall, students will be able to pursue a combination of tailored, rigorous instruction in computer programming and closely related coursework in the humanities, arts, and social sciences. There are six tracks within the minor, from UI design to cognitive psychology to digital humanities and more.

It’s an interesting model, one that’s intended to allow our best and most ambitious students to undertake serious research before graduation and to gain the skills they need for success at the highest levels once they leave campus. I’ll be closely involved, serving on the advisory board for the minor, teaching CDT classes in the digital humanities track, and bringing strong students into my research group. We’re seeing more of these kinds of programs elsewhere, including Columbia’s “Computing in Context” courses and Stanford’s “CS+X” majors. There’s been talk here — though not yet any concrete plans — of eventually expanding CDT to a full major and of offering a BA in computer science through Arts and Letters. In the meantime, there may also be teaching opportunities in the program for qualified grad students.

That last point reminds me: if you have outstanding students looking to do grad work in DH, I hope you’ll consider pointing them toward ND!

In any case, exciting times. Looking forward to getting under way in August.

NovelTM Grant and Project

Finally for the day, another announcement that’s been slightly delayed: I’m really pleased to be part of the SSHRC-funded, McGill-led NovelTM: Text Mining the Novel project. It’s a long-term effort “to produce the first large-scale cross-cultural study of the novel according to quantitative methods” (quoth the About page). A super-impressive group of people are attached – just have a look at the list of team members!

Our first all-hands project meeting is coming up this week. Looking forward to getting started on things that will keep me busy for years to come. Updates and preliminary results here in the months ahead.

A Generalist Talk on Digital Humanities

Since I’m apparently in self-promotion mode … This past weekend, I gave a talk in the Notre Dame College of Arts and Letters’ Saturday Scholars series. These are public lectures aimed at curious folks who are in town for football games. It was a lot of fun and, apart from spilling water on my laptop because I’m a doofus and a klutz, I think it went well. Video is embedded below; there was also a write-up in the student newspaper.

ACLS Digital Innovation Fellowship

I somehow failed to post about this when it was announced last summer, but I’ve received an ACLS Digital Innovation Fellowship for the 2014-15 academic year to work on the project “Literary Geography at Scale.”

Things are going well so far; I’ll be updating the site here with reports as the research moves along. Eventually, there will be a full site to access and visualize the data (think Google Ngrams for geographic data). In the meantime, here’s the project abstract:

Literary Geography at Scale uses natural language processing algorithms and automated geocoding to extract geographic information from nearly eleven million digitized volumes held by the HathiTrust Digital Library. The project extends existing computationally assisted work on American and international literary geography to new regions, new historical periods – including the present day – and to a vastly larger collection of texts. It also provides scholars in the humanities and social sciences with an enormous yet accessible trove of geographic information. Because the HathiTrust corpus includes books published over many centuries in a variety of languages and across nearly all disciplines, the derived data is potentially useful to researchers in a range of humanities and computational fields. Literary Geography at Scale is one of the largest humanities text-mining projects to date and the first truly large-scale study of 20th and 21st century literature.


Visualizing Uncertainty with Probability Clouds

I’ve come up with a visualization of data uncertainty that seems really obviously useful, but that I’ve never seen before. So I guess some combination of three things must be true:

  1. I am a genius. Deeply unlikely, given that I misspelled “genius” the first time I typed it here.
  2. There’s something wrong with the “new” method that makes it less useful than I think and/or total bunk.
  3. People do use this, and I just haven’t seen it before. Totally possible, given the number of statistical visualizations in most literary studies papers.

Anyway, the idea is to use probability clouds to show a density region around a given line of best fit through the data.[1] I think this avoids some visual-rhetorical pitfalls in the usual ways of showing trends and uncertainty in data, but/and I’d be grateful for thoughts on its value.

Here’s the context and an example: I’m working on a manuscript at the moment for which I need to visualize a bit of data. Nothing fancy; this is one of the basic figures:

Demo 0 data

Yeah, the axes aren’t labeled, etc. The point is, there are two series that are pretty noisy but seem to be doing different things over time (along the x axis).

OK, so to get a handle on the trend, let’s insert a linear fit for each series:

Demo 1 line

Neat! But the fit lines are a little misleadingly precise. I don’t think we want to say that the “true” value of series 2 in 1820 is exactly 0.15, or that the true values cross in exactly 1872. So let’s add a confidence interval at the usual 95% level:

Demo 2 line se

Better, but this manages to be somehow both too precise and not precise enough. Beyond the line of best fit, which still suggests false precision at the center, the shaded 95% confidence region comes to an abrupt end (too precise) and doesn’t have any internal differentiation (not precise enough). The true value, if we want to think of it that way, isn’t equally likely to fall anywhere within the shaded region; it’s probably somewhere near the middle. But there’s also a smallish chance (5%, to be exact) that it falls outside the shaded region entirely.

So why not indicate those facts visually, while getting rid of the fit line entirely? Here’s what this might look like:

Demo 3 cloud

This seems a lot better. It doesn’t draw your eye misleadingly to the fit line or to the edges of an arbitrarily bounded region, but it does suggest where the real fit might be. And it does that while making plain the fuzziness of the whole business. It would be even better in color, too. I like it. Am I missing something?

On the technical side, this is built up by brute force in R with ggplot. The relevant code is:


library(ggplot2)

se_limit     = 0.99  # Largest confidence level to show; valid range 0 to 1
se_regions   = 100   # Number of regions in uncertainty cloud. 100 is a lot;
                     #   a little slow, but produces a very smooth cloud.
se_alpha_max = 0.5   # How dark to make the center of the uncertainty cloud.
                     #   0.5 = 50% grey.
line_type    = 0     # A ggplot2 linetype for the fit line; 0 = none, 1 = solid

p = qplot(x, y, data = data)  # Use real data, of course!
for (i in 1:se_regions) {  # This loop builds up the uncertainty density shading
  p = p + geom_smooth(method = "lm", linetype = line_type, fill = "black",
                      level = i * se_limit / se_regions,
                      alpha = se_alpha_max / se_regions)
}
p  # Show the finished plot

That’s it. As you can see, it’s just brute force building up overlapping alpha layers at different confidence levels. I once looked at the denstrip package, but couldn’t make it do the same thing. But I’m dumb, so …
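For anyone working outside R, here's a rough matplotlib analogue of the same brute-force idea, with hypothetical data (none of it from the post): stack many translucent confidence bands so the shading fades with distance from the fit.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical noisy series with a gentle upward trend
rng = np.random.default_rng(1)
x = np.linspace(1800, 1900, 60)
y = 0.001 * (x - 1800) + rng.normal(0, 0.05, x.size)

# Ordinary least-squares fit
res = stats.linregress(x, y)
y_hat = res.intercept + res.slope * x

# Standard error of the mean prediction at each x
resid = y - y_hat
s = np.sqrt(np.sum(resid ** 2) / (x.size - 2))
se_mean = s * np.sqrt(1 / x.size +
                      (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))

fig, ax = plt.subplots()
ax.scatter(x, y, s=10)

# Overlay many faint confidence bands; their overlap forms the cloud
n_regions = 50
for i in range(1, n_regions + 1):
    level = 0.99 * i / n_regions
    t = stats.t.ppf(0.5 + level / 2, df=x.size - 2)
    ax.fill_between(x, y_hat - t * se_mean, y_hat + t * se_mean,
                    color="black", alpha=0.5 / n_regions, linewidth=0)

fig.savefig("cloud.png")
```

Same logic as the R loop: each band is slightly wider than the last, so the center, where all the bands overlap, comes out darkest.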

Update: I knew I couldn’t be the first to have thought of this! Doug Duhaime points me to visually-weighted regression, apparently first suggested by Solomon Hsiang in 2012. There’s R code (but I guess not yet a formal package) to do this at Felix Schönbrodt’s site.

Here’s a version using Felix Schönbrodt’s vwReg(). Not all cleaned up to match the above, but you get the idea:

Demo 4 vwreg

[1] If you’ve learned any undergrad-level physical chemistry, you can probably see where this idea came from. Here’s a bog-standard textbook visualization of the electron probability density of a 2p atomic orbital:

(source; back to the post body)