Bamman, Underwood, and Smith, “A Bayesian Mixed Effects Model of Literary Character” (2014)

Too long for Twitter, a pointer to a new article:

  • Bamman, David, Ted Underwood, and Noah A. Smith. “A Bayesian Mixed Effects Model of Literary Character.” Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (2014): 370–79.
    NB. The link here is to a synopsis of the work and related info; you’ll want the authors’ PDF for details.

The new work is related to Bamman, O’Connor, and Smith’s “Learning Latent Personas of Film Characters” (ACL 2013; PDF), which modeled character types in Wikipedia film summaries. I mention the new piece here mostly because it’s cool, but also because it addresses the biggest issue that came up in my grad seminar when we discussed the film personas work, namely the confounding influence of plot summaries. Isn’t it the case, my students wanted to know, that what you might be finding in the Wikipedia data is a set of conventions about describing and summarizing films, rather than (or, much more likely, in addition to) something about film characterization proper? And, given that Wikipedia has pretty strong gender/race/class/age/nationality/etc./etc./etc. biases in its authorship, doesn’t that limit what you can infer about the underlying film narratives? Wouldn’t you, in short, really rather work with the films themselves (whether as scripts or, in some ideal world, as full media objects)?

The new paper is an important step in that direction. It’s based on a corpus of 15,000+ eighteenth- and nineteenth-century novels (via the HathiTrust corpus), from which the authors have inferred arbitrary numbers of character types (what they call “personas”). For details of the (very elegant and generalizable) method, see the paper. Note in particular that they’ve modeled author identity as an explicit parameter and that it would be relatively easy to do the same thing with date of publication, author nationality, gender, narrative point of view, and so on.

The new paper finds that the author-effects model — as expected — performs especially well in discriminating character types within a single author’s works, though less well than the older method (which doesn’t control for author effects) in discriminating characters between authors. Neither method does especially well on the most difficult cases, differentiating similar character types in historically divergent texts.

Anyway, nifty work with a lot of promise for future development.

Two Events at Stanford

I’m giving a couple of talks at Stanford next week. Announcements from the Lit Lab and CESTA:

On Monday, May 19th, 2014 at 10am, the Literary Lab will host Matt Wilkens, an Assistant Professor of English at the University of Notre Dame. His talk, entitled “Computational Methods, Literary Attention, and the Geographic Imagination,” will focus on his recent work combining Digital and Spatial Humanities research as he investigates the literary representation of place in American literature.

For those interested in the role of Digital Humanities within humanities disciplines, Matt will also be leading a seminar/discussion on the institutional place of Digital Humanities, particularly focusing on its role in the classroom. This event, “Digital Humanities and New Institutional Structures,” will take place on Tuesday, May 20th at 12pm in CESTA (the Fourth Floor of Wallenberg Hall, Building 160), Room 433A. Lunch will be provided.

Digital Americanists at ALA 2014

From the Digital Americanists site, which has full details:

Visualizing Non-Linearity: Faulkner and the Challenges of Narrative Mapping
Session 1-A. Thursday, May 22, 2014, 9:00-10:20 am

  1. Julie Napolin, The New School
  2. Worthy Martin, University of Virginia
  3. Johannes Burgers, Queensborough Community College

Digital Flânerie and Americans in Paris
Session 2-A. Thursday, May 22, 2014, 10:30-11:50 am

  1. “Mapping Movement, or, Walking with Hemingway,” Laura McGrath, Michigan State University
  2. “Parisian Remainder,” Steven Ambrose, Michigan State University
  3. “Sedentary City,” Anna Green, Michigan State University
  4. “Locating The Imaginary: Literary Mapping and Propositional Space,” Sarah Panuska, Michigan State University

Matthew Wilkens: Geospatial Cultural Analysis and Literary Production

An interview with the DH group at Chicago in advance of my talk there this Friday. Looking forward!

digital humanities blog @UChicago

The distribution of US city-level locations, revealing a preponderance of literary–geographic occurrences in what we would now call the Northeast corridor between Washington, DC, and Boston, but also sizable numbers throughout the South, Midwest, Texas, and California.

Matthew Wilkens, Assistant Professor of English at the University of Notre Dame, will be speaking at the Digital Humanities Forum on March 7 about Geospatial Cultural Analysis and its intersection with Literary Production. Specifically, Wilkens’ research asks: Using computational analysis, how can we define and assess the geographic imagination of American fiction around the Civil War, and how did the geographic investments of American literature change across that sociopolitical event?

We spoke to him about his choice to use a quantitative methodology, the challenges that were consequently faced, and the overall future for the Digital Humanities. This is what he had to say:

What brought you to Digital Humanities methodologies?

I guess it was…


Talk at Chicago, March 7, 2014

I’m giving a talk at the University of Chicago Digital Humanities Forum in a couple of weeks. Details at that link and reproduced here. Looking forward to the event and hope to see some of the many cool DH folks in Chicago there.

Date: March 7, 2014
Location: Regenstein Library 122
Time: 12:00-2:00 pm

Abstract: Scholars have long understood that there is a close relationship between literary production and the large-scale cultural contexts in which books are written. But it’s difficult to pin down the many ways in which this relationship might work, especially once we expand our interest from individual texts to systems of production and reception. In this talk, Wilkens offers a computationally assisted analysis of changes in geographic usage within more than a thousand works of nineteenth-century American fiction, arguing that literary-spatial attention around the Civil War was at once more diverse and more stable than has been previously shown. He examines correlations between literary attention and changes in demographic factors that offer preliminary insights into the driving forces behind a range of shifts in literary output. Wilkens also discusses the future of the project, which will soon expand to include millions of books from the early modern period to the present day.

DH Grad Syllabus

The syllabus for my current digital humanities grad seminar is now available. It’ll evolve a bit over the semester, mostly by gaining specific exercises and answers.

I tried to take my own advice from the last time I taught the class as I put together this version; there’s more (and more formal) programming and machine learning, different treatments of the intro to DH and of visualization, more GIS, and (much) less media studies. But if you think there are things I’ve missed, I’d be curious to know. Or, well, I know there are a lot of things I’ve been forced to leave out. Since time remains stubbornly finite, if you think something should be added, what might be cut to make room for it?

Books I Read in 2013

A list of the new (to me) fiction I read this year. Criticism, theory, and rereads for work excluded. (See also lists from 2012, 2011, 2010, and 2009.)

  • Bennett, Ronan. The Catastrophist (2001). A possible text for the Congo class, though I’d probably go with something by Mabanckou instead.
  • Coetzee, J. M. The Childhood of Jesus (2013). Very good, as always.
  • Fountain, Ben. Billy Lynn’s Long Halftime Walk (2012). Wanted to like this one, as the consensus best novel of the recent wars, but it left me kind of cold.
  • Klosterman, Chuck. The Visible Man (2011). Interesting premise: Try to think through all the implications of selective invisibility.
  • Kushner, Rachel. The Flamethrowers (2013). Not sure it’s as good as everyone says, but it is good.
  • Ledgard, J. M. Submergence (2013). Two-thirds of a really good book. The African sections are wonderful; the oceanography bits, not so much. Has the same problem Richard Powers does in writing about scientists — can’t get over a fawning love of science itself that finds expression as insufferably polymathic scientists.
  • Mantel, Hilary. Bring Up the Bodies (2012). I’m a shameless fan. The last Wolf Hall book is coming soon, right? Please?
  • Nutting, Alissa. Tampa (2013). The comparisons to Lolita are entirely unearned, though I suppose one could do worse than “not as good as Nabokov.”
  • Pava, Sergio De La. A Naked Singularity (2012). Best book I’ve read in a long time.
  • Pynchon, Thomas. Bleeding Edge (2013). Really enjoyed this; as with Coetzee, I’m a sucker for everything Pynchon writes.
  • Saunders, George. Tenth of December (2013). Pretty much as good as everyone says, though I still never know what to do with short stories.
  • Winterbach, Ingrid. The Book of Happenstance (2011). An interesting, patient novel, translated from Afrikaans.

A dozen books in all. Once again, not setting any records, but an enjoyable year. I’m on leave next fall, so may do a bit better in 2014. In the meantime, I’ve just started Tash Aw’s Five Star Billionaire.

New Article in ALH

My article, “The Geographic Imagination of Civil War-Era American Fiction,” is in the latest issue of American Literary History (which happens to be the 100th issue of the journal). The easiest way to get it is probably via Muse (direct link, paywall), though it’s also available from Oxford (publisher of ALH, temporarily free to all). If your institution doesn’t subscribe to either of those outlets, drop me a line and I’ll send you a PDF offprint. I’m really pleased to see the piece in print, especially in an issue with so many people whose work I admire.

The article presents some of my recent work on geolocation extraction in a form that’s more complete than has been possible in the talks I’ve given over the last year or so. There’s more coming on a number of fronts: geographic attention as a function of demographic and economic factors, a wider historical scope, a (much) larger corpus, some marginally related studies of language use in the nineteenth century (with my students Bryan Santin and Dan Murphy), and more. Looking forward to sharing these projects in the months ahead.

Multilingual NER

Last week I finished a fellowship proposal to fund work on geolocation extraction across the whole of the HathiTrust corpus. It’s a big project and I’m excited to start working on it in the coming months.

One thing that came up in the course of polishing the proposal—but that didn’t make it into the finished product—is how volumes in languages other than English might be handled. The short version is that the multilingual nature of the HathiTrust corpus opens up a lot of interesting ground for comparative analysis without posing any particular technical challenges.

In slightly more detail: There are a fair number of HathiTrust volumes in languages other than English; the majority of HT’s holdings are English-language texts, but even 10 or 20% of nearly 11 million books is a lot. Fortunately, this is less of an issue than it might appear. You won’t get good performance running a named entity recognizer trained on English data over non-English texts, but all you need to do is substitute a language-appropriate NER model, of which there are many, especially for the European languages that make up the large bulk of HT’s non-English holdings. And it’s not hard at all to identify the language in which a volume is written, whether from metadata records or by examining its content (stopword frequency is especially quick and easy). In fact, you can do that all the way down to the page level, so it’s possible to treat volumes with mixed-language content in a fine-grained way.
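To make the stopword idea concrete, here’s a minimal sketch of frequency-based language identification. The stopword lists are tiny illustrative samples of my own (real work would use full lists, e.g. from an NLP toolkit), but the logic — count hits against each language’s function words and take the best scorer — is the whole trick:

```python
# Toy stopword lists for illustration; production lists would be much longer.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "it", "at"},
    "fr": {"le", "la", "les", "de", "et", "un", "une", "est", "dans"},
    "de": {"der", "die", "das", "und", "ein", "ist", "nicht", "zu", "im"},
}

def identify_language(text):
    """Guess a text's language by counting stopword occurrences."""
    tokens = text.lower().split()
    scores = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in STOPWORDS.items()
    }
    # The language whose stopwords appear most often wins.
    return max(scores, key=scores.get)

print(identify_language("the cat sat in the garden and looked at it"))  # en
print(identify_language("le chat est dans le jardin et il regarde"))    # fr
```

Because the method needs only a handful of tokens to produce a usable signal, it scales down to the page level, which is what makes fine-grained handling of mixed-language volumes feasible.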

About the only difference between English and other languages is that I won’t be able to supply as much of my own genre- and period-specific training data for non-English texts, so performance on non-English volumes published before about 1900 may be a bit lower than for volumes in those languages published in the twentieth century (since the available models are trained almost exclusively on contemporary sources). On the other hand, NER is easier in a lot of languages other than English because they’re more strongly inflected and/or rule bound, so this may not be much of a problem. And in any case, the bulk of the holdings in all languages are post-1900. When it comes time to match extracted locations with specific geographic data via Google’s geocoding API, handling non-English strings is just a matter of supplying the correct language setting with the API request.
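For the geocoding step, the language handling really is just one request parameter. A sketch of how such a request might be built against Google’s geocoding endpoint (the `YOUR_API_KEY` placeholder is obviously hypothetical; consult the API docs for current parameter details):

```python
import urllib.parse

def build_geocode_url(place, lang, key="YOUR_API_KEY"):
    """Build a geocoding request URL for a place name, with a language hint."""
    base = "https://maps.googleapis.com/maps/api/geocode/json"
    # The 'language' parameter tells the API how to interpret and return names.
    params = {"address": place, "language": lang, "key": key}
    return base + "?" + urllib.parse.urlencode(params)

print(build_geocode_url("München", "de"))
```

Non-English place strings are URL-encoded automatically, so the same pipeline serves every language in the corpus.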

Anyway, fun stuff and a really exciting opportunity …