Cultural Analytics Symposium at Notre Dame

On May 26 and 27, Notre Dame is hosting Cultural Analytics 2017, a symposium devoted to new research in the fields of computational and data-intensive cultural studies. Combining methods and insights from computer science and the quantitative social sciences with questions central to the interpretive humanities, the event explores some of the most compelling contemporary interdisciplinary work in a rigorous, collegial environment.

The symposium is free and open to the public. For details including registration, schedule, and the full lineup of intimidatingly great speakers, see the Cultural Analytics 2017 site. Hope to see you there!

NEH Grant for Textual Geographies Project

I’m pleased to announce that the Textual Geographies Project has been awarded a $325,000 Digital Humanities Implementation Grant from the National Endowment for the Humanities. I’m hugely grateful for the NEH’s generous support and for previous startup funding from the ACLS and from the Notre Dame Office of Research.

I’m excited to work with project partners at Notre Dame, at the HathiTrust Research Center, and around the world. The grant will support further development of a Web-based front end for the enormous amount of textual-geographic data that the project has already generated, as well as ongoing improvements to the data collection process, new research using that data, and several events to engage scholars and members of the public who are interested in geography, history, literature, and the algorithmic study of culture. I’ll also be hiring a project postdoc for the 2017-19 academic years.

More information on all these fronts in the months ahead!

Postdoc in Computational Textual Geography


Update: The position has been filled. I’m very pleased that Dan Sinykin will be joining our group next year as a postdoctoral fellow.

I’m seeking a postdoctoral fellow for a two-year appointment to work on aspects of the Textual Geographies project and to collaborate on research of mutual interest in my lab in the Department of English at Notre Dame.

The ideal candidate will have demonstrated expertise in literary or cultural studies, machine learning or natural language processing, and geographic or spatial analysis, as well as a willingness to work in new areas. The fellow will contribute to the ongoing work of the Textual Geographies project, an NEH-funded collaboration between literary scholars, historians, geographers, and computer scientists to map and analyze geographic references in more than ten million digitized volumes held by the HathiTrust Digital Library. Areas of current investigation include machine learning for toponym disambiguation, named entity recognition in book-length texts, visualization of uncertainty in geospatial data sets, and cultural and economic analysis of large-scale, multinational literary geography. We welcome applications from candidates whose research interests might expand the range of our existing projects, as well as from those whose expertise builds on our present strengths.

Interdisciplinary collaboration with other groups at Notre Dame is possible. The fellow will also have access to the Text Mining the Novel project, which has helped to underwrite the position.


Details and application via Interfolio (free). Letters not required for initial stage. Review begins immediately and continues until position is filled. Salary $50,000/year plus research stipend. Initial appointment for one year, renewable for a second year subject to satisfactory progress. Teaching possible but not required.



Revolution: The Event in Postwar Fiction

My book, Revolution: The Event in Postwar Fiction, is out with Johns Hopkins University Press. OK, it’s been out since October, but still. I’m really excited about it.

Below is the description from JHUP’s site, and I have a related post, “5 Things You Might Not Know About Fifties Fiction,” on their blog as well. In brief, the book is about how one set of literary and cultural forms displaces another, especially as that process played out in the United States after World War II. Want to know why fifties fiction is full of rambling allegories and why no one writes like Jack Kerouac? Or what those facts have to do with the French Revolution or the invention of quantum mechanics? You’ve come to the right place.

There’s a preview of the book available via Google. Should you be so inclined, you can buy the thing directly from the press, via Amazon, or wherever fine literary-critical monographs are sold. Want to review it? The press has your hook-up.

Here’s a fuller description of the project:

Socially, politically, and artistically, the 1950s make up an odd interlude between the first half of the twentieth century — still tied to the problems and orders of the Victorian era and Gilded Age — and the pervasive transformations of the later sixties. In Revolution, Matthew Wilkens argues that postwar fiction functions as a fascinating model of revolutionary change. Uniting literary criticism, cultural analysis, political theory, and science studies, Revolution reimagines the years after World War II as at once distinct from the decades surrounding them and part of a larger-scale series of rare, revolutionary moments stretching across centuries.

Focusing on the odd mix of allegory, encyclopedism, and failure that characterizes fifties fiction, Wilkens examines a range of literature written during similar times of crisis, in the process engaging theoretical perspectives from Walter Benjamin and Fredric Jameson to Bruno Latour and Alain Badiou alongside readings of major novels by Ralph Ellison, William Gaddis, Doris Lessing, Jack Kerouac, Thomas Pynchon, and others.

Revolution links the forces that shaped postwar fiction to the dynamics of revolutionary events in other eras and social domains. Like physicists at the turn of the twentieth century or the French peasantry of 1789, midcentury writers confronted a world that did not fit their existing models. Pressed to adapt but lacking any obvious alternative, their work became sprawling and figurative, accumulating unrelated details and reusing older forms to ambiguous new ends. While the imperatives of the postmodern eventually gave order to this chaos, Wilkens explains that the same forces are again at work in today’s fracturing literary market.

As I say, I’m super happy to have the book out in the world. I owe thanks to many, many people for their help along the way. Now, on to the next one!

Computational Approaches to Genre in CA

New year, catch-up news. I have an article in CA: Journal of Cultural Analytics on computational approaches to genre detection in twentieth-century fiction. The piece came out back in November, but, well, it’s been a busy year.

The big finding — beyond what I happen to think is a nifty way of considering genre — is that certain highly canonical, male-authored novels of the mid-late twentieth century (by the likes of Updike, Bellow, Vonnegut, DeLillo, etc.) resemble one another about as closely as do mid-century hard-boiled detective stories. That is, very closely indeed. There are a couple of conclusions one might draw from this; my preferred interpretation is that the functional definition of literary fiction in the postwar period (and probably everywhere else) remains much too narrow. But there are other possibilities as well …

CA, by the way, has had some really great work of late. Andrew Piper’s article on “fictionality” is especially worth a read; Piper shows that it’s not just possible but really pretty easy to separate fiction from nonfiction using a basic set of lexical features.

Masterclass and Lecture at Edinburgh

I’m giving a two-and-a-half-day masterclass on quantitative methods for humanities researchers at the University of Edinburgh, 19-21 September 2016. There’s a rough syllabus available now, with more materials to be added as the event draws nearer. If you’re in Scotland and want to attend, there may be (literally) a place or two left; details at the Digital Humanities Network Scotland.

There will also be a public lecture on the evening of Wednesday, September 21, featuring a response and discussion with the ever-excellent Jonathan Hope (Strathclyde).

I’m grateful to Maria Filippakopoulou for organizing the visit and to the Edinburgh Fund of the University of Edinburgh for providing financial support.

Come Work with Me!

Update (20 August 2016)

I wasn’t able to hire anyone for this post, but will rerun the search this fall. More information forthcoming soon. In the meantime, if you happen to know anyone suitable — especially with a strong background in NLP and an interest in humanities problems — please let me know so that I can get in touch. Thanks!

Original post

I’m hiring a postdoc for next year (2016-17) to work on literature, geography, and computational methods. Wide latitude in training and background; interest in working on a very large geographic dataset a big plus. Full details and application via Interfolio. Review begins next week. The position will remain open until filled.

Many thanks to the Text Mining the Novel Project for helping to underwrite the post.

Literature and Economics at Chicago

I’m giving a talk next Friday (5/22) on literature and economic geography as part of Richard Jean So and Hoyt Long’s Cultural Analytics conference at Chicago. (Talking econ at Chicago. That’s not terrifying at all!) The list of speakers is really impressive, present company excluded. If you’re in or near Chicago, hope to see you there.

My talk will be closely related to my recent lecture at Kansas, video of which is available on YouTube (and embedded below). There’s also some enlightening discussion on Facebook; you might need to be friends with Richard So to see it, but you should be friends with him anyway …

Looking forward to seeing folks in Chicago!

PSA: e-Book Publishing Stats

I just read Dan Cohen’s thoughts on the future of e-books. Dan thinks the current “plateau” in e-book sales is either a temporary pause or an artifact of bad sales data, and speculates that digital books will be the (heavily) dominant medium of literary consumption sooner rather than later. I’m strongly inclined to agree, and Dan’s piece is (as always) well worth a read if you’re interested in smart speculation about media, publishing, libraries, and readership.

I’m writing this up for the blog, rather than (just) tweeting it, because Dan’s piece led me to an informative and intriguing report by Author Earnings. I haven’t examined their methods in detail, but they claim, among other things, that 30% of purchased e-books in the US don’t have ISBNs, meaning they aren’t included in Bowker’s publishing reports (about which I’ve previously written, trying to figure out how many new novels are published in the US every year). Anyway, the AE report is worth a look if you’re at least abstractly interested in the economics of the changing publishing industry.

Literary Attention Lag

I gave a short talk on geography and memory at this year’s MLA in Vancouver (session info). I didn’t work from a script, but here’s the core material and a few key slides.

So the problem I was trying to address was this: How is geographic attention in literary fiction related to the distribution of population at the time the fiction is published? And what do the details of the relation between them tell us about literary memory? These are questions I just barely touched in my ALH article on the literary geography of the Civil War period last year, and I thought they were worth a bit more consideration.

To review, we know that there’s a moderate correlation between the population of a geographic location and the amount of literary attention paid to it (measured by the number of times that place is mentioned in books). New York City is used in American literature more frequently than is Richmond, for instance. (This is all using a corpus of about a thousand volumes of U.S. fiction published between 1850 and 1875, but I strongly suspect the correlation holds elsewhere; I’ll be able to say more definitively and share results in a month or two.)

But there is, in at least some instances, a temporal component involved as well. After all, population isn’t a stable feature of cities. Witness the cases of New Orleans and Chicago:

Population, 1820-1900

Populations of New Orleans and Chicago, 1820-1900

Literary mentions, 1850-1875

Mentions of New Orleans and Chicago, 1850-1875

In short, those cities were about the same size in 1860, but New Orleans — the older of the two by far — was used much more often in fiction at the time. It appears to have taken a while for Chicago to catch on in the literary imagination.

I wondered, then, whether this was a generalizable trend and, if so, whether I could quantify and explain it. I considered four informal hypotheses about the temporal relationship between population and literary-geographic representation (if I were feeling a little grand, I’d refer to these as reduced models of literary-geographic memory).

  1. National or deep. Not all the way to deep time in Wai Chee Dimock’s sense, but maybe closer to Sacvan Bercovitch’s model of Puritan inheritance. Literature in the nineteenth century represents the nation as it was in the eighteenth.
  2. Formative-psychological. Authors (and readers?) represent the world as it existed during their formative years, for whatever value of “formative” we might choose. Presumably their childhood or school years.
  3. Presentist. We find in books largely the world as it is at the time they were written. We see evidence of this in the rapidly shifting topical content of many texts, especially the dross that we don’t tend to study in English departments but that dominates the quantitative output of any period.
  4. Predictive. Literature looks beyond the present to anticipate or shape cultural features not yet fully realized. I don’t think this is as crazy as it might sound. Critics pretty consistently emphasize the transformational power of books in terms that aren’t strictly personal or metaphorical, and we often bristle, rightly, at the notion that literature merely “reflects” the world. The Romantics among us might say that authors are charged with diagnosing or symptomatizing features of the world that will be obvious in the future, but are hidden now.

For what it’s worth, I’d say that (3) and (2) strike me as most likely or broadly relevant, in that order, followed by (1) and, somewhat distantly for literature en masse, (4).

To (begin to) address the problem of literary-cultural lag/memory/prediction, I collected population data from census records for 23 cities that were relatively well represented in the literary corpus and of comparatively significant size at some point before 1900. They ranged from New York and Philadelphia to Newport (RI), Salem (MA), San Francisco, Detroit, Vicksburg, and so on. I did a bit of hand correction on the data to account for changing municipal boundaries and to agglomerate urban areas (metro St. Louis, or Albany and Saratoga Springs, or Buffalo and Niagara Falls; in the second and third cases, the latter place was smaller but more frequently used in fiction).

Anyway, with that data in hand, I plotted total literary mentions (1850-1875) against decennial census counts and ran a simple linear regression on each one. Individually, this produced plots like this (using 1850 census data):


The r² value in this case is 0.46, meaning that a city’s 1850 population appears to account for a little less than half the observed variation in literary attention to it over the next two decades. Repeat for every decade with census data to 1990 and you get this:

Literary attention vs. Population, 1790-1990
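For concreteness, a single per-decade fit of this kind can be sketched as a log-log linear regression, taking the squared correlation as r². The city populations and mention counts below are made-up stand-ins for illustration, not the project’s actual data:

```python
# Sketch of one per-decade fit: regress log10(literary mentions) on
# log10(census population) and report r^2.
# The populations and mention counts are illustrative only.
import numpy as np
from scipy import stats

pop_1850 = np.array([515547, 121376, 136881, 29963, 20515])  # hypothetical city populations
mentions = np.array([3200, 610, 540, 95, 120])               # hypothetical mention counts, 1850-1875

slope, intercept, r, p, stderr = stats.linregress(np.log10(pop_1850), np.log10(mentions))
print(f"r^2 = {r**2:.2f}")  # share of variance in log mentions tracked by log population
```

Repeating the same regression once per census year yields one r² value per decade.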

That’s pretty and all, but it’s a little hard to see the trends in the r² values, which are the thing that would help to quantify the degree of correlation between population and literary attention over time. So let’s pull out the r² values and plot them:

r-squared values over time with Gaussian fit

Now this is pretty interesting (he says, of his own work). Note again that the literary data is the same in every case; the only thing that’s changing is the census-year population. So the position of the largest r² tells us which decade’s population distribution most closely predicts the allocation of literary-geographic attention between 1850 and 1875. The maximum observed r² is in the 1830 data. The fit line here (a simple Gaussian, by the way, a fact that’s also kind of nifty and unexpected, since it’s a pretty good fit and symmetrical forward and backward in time) has its maximum in 1832.
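Locating that peak amounts to a least-squares Gaussian fit over the r-squared series. Here is a minimal sketch with scipy; the r-squared values are simulated to peak near 1830, standing in for the real series:

```python
# Sketch: fit a Gaussian to an r-squared series (one value per census year)
# to locate the decade whose population best predicts literary attention.
# The r2 series below is simulated, not the project's actual data.
import numpy as np
from scipy.optimize import curve_fit

years = np.arange(1790, 2000, 10).astype(float)
rng = np.random.default_rng(0)
# simulated r-squared values: Gaussian centered on 1832 plus small noise
r2 = 0.46 * np.exp(-((years - 1832.0) / 40.0) ** 2) + rng.normal(0, 0.01, years.size)

def gaussian(x, a, mu, sigma):
    return a * np.exp(-((x - mu) / sigma) ** 2)

params, cov = curve_fit(gaussian, years, r2, p0=[0.5, 1860.0, 50.0])
a, mu, sigma = params
print(f"fitted peak year: {mu:.0f}")
```

The fitted `mu` is the census year whose population distribution best matches the literary-geographic attention of the corpus.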

The average book in the literary corpus was published in 1862 and the average age of the author at publication was 42. So it looks like lag peaks at around 30 years and corresponds to the author’s … “experience,” maybe we’d call it? … at age 12. I’d say this is a piece of evidence in favor of the formative-psychological hypothesis, and then I’d wave my hands vigorously indeed.

I expect to do some more exploration in the months ahead. Having literary data forward to 1990 will be a big help. A few things I’ll be looking into:

  • International comparison. How does lag change, if at all, in other national contexts? The U.S. was (and is) pretty young. Maybe longer-established nations have different dynamics. And how about changes in U.S. representation of foreign cities and vice versa? My guess is that lag is longer the less an author or culture knows about a foreign place.
  • Does lag change over time? Is it shorter today than it was 150 years ago? My guess: yes, but not radically.
  • Is the falloff in fit quality always symmetrical in time, and am I capturing all the relevant dynamics? The near-symmetry in the current data is surprising to me; I would have expected better backward fit than forward. Could be an artifact of the United States’ youth at the time; several of the cities in question didn’t exist for much more than a decade or two before the literature represented in the corpus was written. I wonder if part of this, too, is down to offsetting effects of memory (skewing fit better backward in time) and relative population stability (skewing things forward).
  • Other ways to get at the same question. A comparison of topical content against textual media presumed to be faster moving (newspapers, journals, etc.) would be instructive. How much more conservative is fiction than non-fiction?

Finally, three data notes:

  • Full data is available from the data page. And the code used for analysis and plotting can be had as an IPython notebook.
  • Careful readers will have noticed that the fits are log-linear, i.e., I’ve used the (base 10) logarithms of the values for mentions and population. This is what you’d expect to do for data like these that follow a power-law distribution.
  • I’ve dropped non-existent cities from the computed regressions (though not the visualizations) as appropriate before 1850 (by which time all the cities have population tallies). I think this is defensible, but you could argue for keeping them and using zero population instead. If I’d done that, the fit quality for 1840 and earlier would have been lower, pushing support toward the presentist hypothesis. But that would also be misleading, since it would amount to treating those cities as if they did exist, but were very small, which isn’t true. That’s one of the reasons to include cities like Salem and Nantucket and Newport, which really were existent but small(ish) from the earliest days of the republic. Anyway, an interpretive choice.