NovelTM Grant and Project

Finally for the day, another announcement that’s been slightly delayed: I’m really pleased to be part of the SSHRC-funded, McGill-led NovelTM: Text Mining the Novel project. It’s a long-term effort “to produce the first large-scale cross-cultural study of the novel according to quantitative methods” (quoth the About page). A super-impressive group of people are attached – just have a look at the list of team members!

Our first all-hands project meeting is coming up this week. Looking forward to getting started on things that will keep me busy for years to come. Updates and preliminary results here in the months ahead.

A Generalist Talk on Digital Humanities

Since I’m apparently in self-promotion mode … This past weekend, I gave a talk in the Notre Dame College of Arts and Letters’ Saturday Scholars series. These are public lectures aimed at curious folks who are in town for football games. It was a lot of fun and, apart from spilling water on my laptop because I’m a doofus and a klutz, I think it went well. Video is embedded below; there was also a write-up in the student newspaper.

ACLS Digital Innovation Fellowship

I somehow failed to post about this when it was announced last summer, but I’ve received an ACLS Digital Innovation Fellowship for the 2014-15 academic year to work on the project “Literary Geography at Scale.”

Things are going well so far; I’ll be updating the site here with reports as the research moves along. Eventually, there will be a full site to access and visualize the data (think Google Ngrams for geographic data). In the meantime, here’s the project abstract:

Literary Geography at Scale uses natural language processing algorithms and automated geocoding to extract geographic information from nearly eleven million digitized volumes held by the HathiTrust Digital Library. The project extends existing computationally assisted work on American and international literary geography to new regions, new historical periods – including the present day – and to a vastly larger collection of texts. It also provides scholars in the humanities and social sciences with an enormous yet accessible trove of geographic information. Because the HathiTrust corpus includes books published over many centuries in a variety of languages and across nearly all disciplines, the derived data is potentially useful to researchers in a range of humanities and computational fields. Literary Geography at Scale is one of the largest humanities text-mining projects to date and the first truly large-scale study of 20th and 21st century literature.


Visualizing Uncertainty with Probability Clouds

I’ve come up with a visualization of data uncertainty that seems really obviously useful, but that I’ve never seen before. So I guess some combination of three things must be true:

  1. I am a genius. Deeply unlikely, given that I misspelled “genius” the first time I typed it here.
  2. There’s something wrong with the “new” method that makes it less useful than I think and/or total bunk.
  3. People do use this, and I just haven’t seen it before. Totally possible, given the number of statistical visualizations in most literary studies papers.

Anyway, the idea is to use probability clouds to show a density region around a given line of best fit through the data.[1] I think this avoids some visual-rhetorical pitfalls in the usual ways of showing trends and uncertainty in data, but/and I’d be grateful for thoughts on its value.

Here’s the context and an example: I’m working on a manuscript at the moment for which I need to visualize a bit of data. Nothing fancy; this is one of the basic figures:

Demo 0 data

Yeah, the axes aren’t labeled, etc. The point is, there are two series that are pretty noisy but seem to be doing different things over time (along the x axis).

OK, so to get a handle on the trend, let’s insert a linear fit for each series:

Demo 1 line

Neat! But the fit lines are a little misleadingly precise. I don’t think we want to say that the “true” value of series 2 in 1820 is exactly 0.15, or that the true values cross in exactly 1872. So let’s add a confidence interval at the usual 95% level:

Demo 2 line se

Better, but this manages to be somehow both too precise and not precise enough. Beyond the line of best fit, which still suggests false precision at the center, the shaded 95% confidence region comes to an abrupt end (too precise) and doesn’t have any internal differentiation (not precise enough). The true value, if we want to think of it that way, isn’t equally likely to fall anywhere within the shaded region; it’s probably somewhere near the middle. But there’s also a smallish chance (5%, to be exact) that it falls outside the shaded region entirely.
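To make the "no internal differentiation" point a bit more concrete, here's a rough sketch in base R of how confidence bands at different levels nest inside one another. The data are simulated, and the names (demo, fit, band_halfwidth, conf_levels) are made up for illustration; none of this comes from the manuscript.

```r
# Simulated stand-in data: a noisy upward trend over one century (an assumption).
set.seed(42)
demo <- data.frame(x = 1800:1899)
demo$y <- 0.10 + 0.001 * (demo$x - 1800) + rnorm(nrow(demo), sd = 0.03)

fit <- lm(y ~ x, data = demo)

# Half-width of the confidence band around the fitted value at one x position.
band_halfwidth <- function(level, at_x = 1820) {
  ci <- predict(fit, newdata = data.frame(x = at_x),
                interval = "confidence", level = level)
  unname((ci[, "upr"] - ci[, "lwr"]) / 2)
}

conf_levels <- c(0.50, 0.80, 0.95, 0.99)
halfwidths  <- sapply(conf_levels, band_halfwidth)
print(data.frame(level = conf_levels, halfwidth = round(halfwidths, 4)))
```

Each higher level gives a wider band that contains all the narrower ones, so a single shaded 95% region is really just one slice through a whole family of nested regions.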

So why not indicate those facts visually, while getting rid of the fit line entirely? Here’s what this might look like:

Demo 3 cloud

This seems a lot better. It doesn’t draw your eye misleadingly to the fit line or to the edges of an arbitrarily bounded region, but it does suggest where the real fit might be. And it does that while making plain the fuzziness of the whole business. It would be even better in color, too. I like it. Am I missing something?

On the technical side, this is built up by brute force in R with ggplot. The relevant code is:


library(ggplot2)

se_limit     = 0.99  # Largest confidence level to show; must be between 0 and 1
se_regions   = 100   # Number of regions in uncertainty cloud. 100 is a lot;
                     #   a little slow, but produces very smooth cloud.
se_alpha_max = 0.5   # How dark to make region at center of uncertainty cloud.
                     #   Nominally 0.5 = 50% grey; each layer gets
                     #   se_alpha_max/se_regions of this.
line_type    = 0     # A ggplot2 linetype for fit line; 0 = none, 1 = solid

# Use real data, of course! 'data' needs columns x and y.
p = ggplot(data, aes(x, y)) + geom_point()
for(i in 1:se_regions) { # This loop generates the uncertainty density shading
	p = p + geom_smooth(method = "lm", linetype = line_type, fill = "black",
	                    level = i*se_limit/se_regions, alpha = se_alpha_max/se_regions)
}
p # Show the finished plot

That’s it. As you can see, it’s just brute force building up overlapping alpha layers at different confidence levels. I once looked at the denstrip package, but couldn’t make it do the same thing. But I’m dumb, so …
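For the curious, here's a back-of-the-envelope sketch of how dark the stacked layers actually get. It assumes the graphics device composites semi-transparent layers in the usual way, so that k identical layers of opacity a combine to 1 - (1 - a)^k; the function name stacked_opacity is mine, not part of ggplot2.

```r
se_regions   <- 100
se_alpha_max <- 0.5
layer_alpha  <- se_alpha_max / se_regions  # opacity of each individual band

# Combined opacity where k of the bands overlap (assumes standard alpha compositing).
stacked_opacity <- function(k, alpha = layer_alpha) 1 - (1 - alpha)^k

center  <- stacked_opacity(se_regions)      # on the fit line, inside every band
halfway <- stacked_opacity(se_regions / 2)  # inside only the 50 widest bands

print(round(c(center = center, halfway = halfway), 3))
```

Under that assumption the center of the cloud comes out at a bit under 40% grey rather than the nominal 50%, because opacities of overlapping layers don't simply add. Bump se_alpha_max up if you want a darker core.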

Update: I knew I couldn’t be the first to have thought of this! Doug Duhaime points me to visually-weighted regression, apparently first suggested by Solomon Hsiang in 2012. There’s R code (but I guess not yet a formal package) to do this at Felix Schönbrodt’s site.

Here’s a version using Felix Schönbrodt’s vwReg(). Not all cleaned up to match the above, but you get the idea:

Demo 4 vwreg

[1] If you’ve learned any undergrad-level physical chemistry, you can probably see where this idea came from. Here’s a bog-standard textbook visualization of the electron probability density of a 2p atomic orbital:

(source; back to the post body)

Bamman, Underwood, and Smith, “A Bayesian Mixed Effects Model of Literary Character” (2014)

Too long for Twitter, a pointer to a new article:

  • Bamman, David, Ted Underwood, and Noah A. Smith, “A Bayesian Mixed Effects Model of Literary Character,” Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (2014): 370-79.
    NB. The link here is to a synopsis of the work and related info; you’ll want the authors’ PDF for details.

The new work is related to Bamman, O’Connor, and Smith’s “Learning Latent Personas of Film Characters” (ACL 2013; PDF), which modeled character types in Wikipedia film summaries. I mention the new piece here mostly because it’s cool, but also because it addresses the biggest issue that came up in my grad seminar when we discussed the film personas work, namely the confounding influence of plot summaries. Isn’t it the case, my students wanted to know, that what you might be finding in the Wikipedia data is a set of conventions about describing and summarizing films, rather than (or, much more likely, in addition to) something about film characterization proper? And, given that Wikipedia has pretty strong gender/race/class/age/nationality/etc./etc./etc. biases in its authorship, doesn’t that limit what you can infer about the underlying film narratives? Wouldn’t you, in short, really rather work with the films themselves (whether as scripts or, in some ideal world, as full media objects)?

The new paper is an important step in that direction. It’s based on a corpus of 15,000+ eighteenth- and nineteenth-century novels (via the HathiTrust corpus), from which the authors have inferred arbitrary numbers of character types (what they call “personas”). For details of the (very elegant and generalizable) method, see the paper. Note in particular that they’ve modeled author identity as an explicit parameter and that it would be relatively easy to do the same thing with date of publication, author nationality, gender, narrative point of view, and so on.

The new paper finds that the author-effects model — as expected — performs especially well in discriminating character types within a single author’s works, though less well than the older method (which doesn’t control for author effects) in discriminating characters between authors. Neither method does especially well on the most difficult cases, differentiating similar character types in historically divergent texts.

Anyway, nifty work with a lot of promise for future development.

Two Events at Stanford

I’m giving a couple of talks at Stanford next week. Announcements from the Lit Lab and CESTA:

On Monday, May 19th, 2014 at 10am, The Literary Lab will host Matt Wilkens, an Assistant Professor of English at the University of Notre Dame. His talk, entitled, “Computational Methods, Literary Attention, and the Geographic Imagination,” will focus on his recent work that combines Digital and Spatial Humanities research as he investigates the literary representation of place in American Literature.

For those interested in the role of Digital Humanities within humanities disciplines, Matt will also be leading a seminar/discussion on the institutional place of Digital Humanities, particularly focusing on its role in the classroom. This event, “Digital Humanities and New Institutional Structures” will take place on Tuesday, May 20th at 12pm in CESTA (the Fourth Floor of Wallenberg Hall, Building 160), Room 433A. Lunch will be provided.

Digital Americanists at ALA 2014

From the Digital Americanists site, which has full details:

Visualizing Non-Linearity: Faulkner and the Challenges of Narrative Mapping
Session 1-A. Thursday, May 22, 2014, 9:00 – 10:20 am

  1. Julie Napolin, The New School
  2. Worthy Martin, University of Virginia
  3. Johannes Burgers, Queensborough Community College

Digital Flânerie and Americans in Paris
Session 2-A. Thursday, May 22, 2014, 10:30-11:50 am

  1. “Mapping Movement, or, Walking with Hemingway,” Laura McGrath, Michigan State University
  2. “Parisian Remainder,” Steven Ambrose, Michigan State University
  3. “Sedentary City,” Anna Green, Michigan State University
  4. “Locating The Imaginary: Literary Mapping and Propositional Space,” Sarah Panuska, Michigan State University

Matthew Wilkens: Geospatial Cultural Analysis and Literary Production

An interview with the DH group at Chicago in advance of my talk there this Friday. Looking forward!

digital humanities blog @UChicago

The distribution of US city-level locations, revealing a preponderance of literary–geographic occurrences in what we would now call the Northeast corridor between Washington, DC, and Boston, but also sizable numbers throughout the South, Midwest, Texas, and California.

Matthew Wilkens, Assistant Professor of English at Notre Dame University, will be speaking at the Digital Humanities Forum on March 7 about Geospatial Cultural Analysis and its intersection with Literary Production. Specifically, Wilkens’ research asks: Using computational analysis, how can we define and assess the geographic imagination of American fiction around the Civil War, and how did the geographic investments of American literature change across that sociopolitical event?

We spoke to him about his choice to use a quantitative methodology, the challenges that were consequently faced, and the overall future for the Digital Humanities. This is what he had to say:

What brought you to Digital Humanities methodologies?

I guess it was…


Talk at Chicago, March 7, 2014

I’m giving a talk at the University of Chicago Digital Humanities Forum in a couple of weeks. Details at that link and reproduced here. Looking forward to the event and hope to see some of the many cool DH folks in Chicago there.

Date: March 7, 2014
Location: Regenstein Library 122
Time: 12:00-2:00 pm

Abstract: Scholars have long understood that there is a close relationship between literary production and the large-scale cultural contexts in which books are written. But it’s difficult to pin down the many ways in which this relationship might work, especially once we expand our interest from individual texts to systems of production and reception. In this talk, Wilkens offers a computationally assisted analysis of changes in geographic usage within more than a thousand works of nineteenth-century American fiction, arguing that literary-spatial attention around the Civil War was at once more diverse and more stable than has been previously shown. He examines correlations between literary attention and changes in demographic factors that offer preliminary insights into the driving forces behind a range of shifts in literary output. Wilkens also discusses the future of the project, which will soon expand to include millions of books from the early modern period to the present day.

DH Grad Syllabus

The syllabus for my current digital humanities grad seminar is now available. It’ll evolve a bit over the semester, mostly by gaining specific exercises and answers.

I tried to take my own advice from the last time I taught the class as I put together this version; there’s more (and more formal) programming and machine learning, different treatments of the intro to DH and of visualization, more GIS, and (much) less media studies. But if you think there are things I’ve missed, I’d be curious to know. Or, well, I know there are a lot of things I’ve been forced to leave out. Since time remains stubbornly finite, if you think something should be added, what might be cut to make room for it?