A quick post to show some recent research on named places in nineteenth-century American fiction. I’m interested in the range and distribution of places mentioned in these books as potential indicators of cultural investments in, for example, internationalism and regionalism. I’m also curious about the extent to which large-scale changes (both cultural and formal) are observable in the overall literary production of this (or any) period. The mapping work I’ve done so far doesn’t come close to answering those questions, but it’s part of the larger inquiry.
The Maps
The maps below were generated using a modest corpus of American novels (about 300 in total) drawn from the Wright American Fiction Project at Indiana by way of the MONK project. They show the named locations used in those books; points correspond to everything from small towns through regions, nations and continents. Methodological details and (significant) caveats follow.
1851. 37 volumes (~2.5M words), with data cleanup.
1852. 44 volumes (~3.0M words), minimal cleanup.
1874. 38 volumes (~3.1M words), minimal cleanup.
The Method
Texts were taken from MONK in XML (TEI-A) format with hand-curated metadata. Location names were identified and extracted using Pete Warden’s simple gazetteering script GeoDict, backed by MaxMind’s free world cities database. [Note that there’s currently a bug in the database population script for GeoDict. Pete tells me it’ll be fixed in the next release of his general-purpose Data Science Toolkit, into which GeoDict has now been folded. But for now, you probably don’t want to use GeoDict as-is for your own work.] I tweaked GeoDict to identify places more liberally than usual, which results (predictably) in fewer missed places but more false positives. The locations for 1851 were reviewed pretty carefully by hand; I haven’t done the same yet for the other years. Maps were generated in Flash using Modest Maps with code cribbed shamelessly from the awesome FlowingData Walmart project. This means that it should be relatively easy to turn the static maps above into a time-animated series, but I haven’t done that yet.
Discussion
As I pointed out in my talk on canons, the international scope and regional clustering of places in 1851 strike me as interesting. See the talk for (slightly) more discussion. Moving forward to 1874 (and bearing in mind that we’re looking at dirty data best compared with the similarly dirty 1852), the density of named places in the American west increases after the Civil War, and it looks as though a distinct cluster of places in the south central U.S. is beginning to emerge.
The changes from 1852 to 1874 are (1) intriguing, (2) but also mostly as expected, and (3) more limited in scope than one might have imagined, given that they sit a decade on either side of the periodizing event of American history. I think an important question raised by a lot of work in corpus analysis (the present research included) concerns exactly what constitutes a “major” shift in form or content.
I’m going to avoid saying anything more here because I don’t want to build too much argument on top of a dataset that I know is still full of errors, but I wanted to put the maps up for anyone to puzzle through. If you have thoughts about what’s going on here, I’d love to hear them.
Caveats
A couple of notes and caveats on errors:
- Errors in the data are of several kinds. There are missed locations, i.e., named places that occur in the underlying text but are not flagged as such. Some places that existed in the nineteenth century don’t exist now. Some colloquial names aren’t in the database. And of course a book can be set in, say, New York City and yet fail to use the city’s name often or at all, possibly preferring street addresses or localisms like “the Village.” Also, GeoDict as configured identifies all country and continent names with no restrictions, but requires cities and regions (e.g., U.S. states) either to be paired with a larger geographic region (“Brooklyn, New York,” not “Brooklyn”) or preceded by “in” or “at” as indicators of place. You pretty much have to do this to keep the false positive rate manageable.
- But there are still false positives. There’s a city somewhere in the world named for just about any common English name, adjective, military rank, etc. “George,” for instance, is a city in South Africa. “George, South Africa,” if it ever occurred in a text, would be identified correctly. But “In George she had found a true friend” produces a false positive. When I clean the data, I eliminate almost all proper names of this kind and investigate anything else that looks suspicious. Note that the cluster of places in southern Africa visible in the (uncleaned) 1852 and 1874 maps is almost certainly attributable to this kind of error. Travis Brown tells me he’s seen the same thing in his own geocoding experiments.
- Then there are ambiguous locations, usually clear in context but not obvious to GeoDict. “Cambridge” is the most frequent example. Some study suggests that most American novels in the corpus mean the city in Massachusetts, but that’s surely not true of every instance. Most other ambiguities are much more easily resolved, but they still require human attention.
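The context rule described in the first two caveats can be sketched in a few lines of Python. This is only an illustration of the rule as stated, not GeoDict’s actual code: the tiny gazetteers and the `find_places` function are hypothetical stand-ins for the MaxMind-backed lookup.

```python
import re

# Hypothetical miniature gazetteers (assumptions for illustration only);
# the real lookup was backed by a database of world cities.
COUNTRIES = {"France", "Mexico", "South Africa"}
REGIONS = {"New York", "Massachusetts", "South Africa"}
CITIES = {"Brooklyn", "Cambridge", "George"}


def find_places(text):
    """Return (name, kind) pairs accepted under the context rules above."""
    found = []

    # Countries (and continents) are accepted anywhere, with no context required.
    for country in sorted(COUNTRIES):
        for _ in re.finditer(r"\b" + re.escape(country) + r"\b", text):
            found.append((country, "country"))

    region_alt = "|".join(re.escape(r) for r in sorted(REGIONS))
    for city in sorted(CITIES):
        # Rule 1: the city is paired with a larger region ("Brooklyn, New York").
        paired = re.compile(
            r"\b" + re.escape(city) + r",\s+(?:" + region_alt + r")\b"
        )
        # Rule 2: the city is preceded by "in" or "at" as an indicator of place.
        cued = re.compile(
            r"\b(?:in|at)\s+" + re.escape(city) + r"\b", re.IGNORECASE
        )
        if paired.search(text) or cued.search(text):
            found.append((city, "city"))

    return found


if __name__ == "__main__":
    print(find_places("George, South Africa"))
    print(find_places("In George she had found a true friend."))
    print(find_places("George was tired."))
```

The second call shows why false positives survive the rule: “In George” satisfies the preposition cue even though George is a person. The third shows the corresponding cost, since a bare city name with no cue is never flagged.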
Matthew,
Very interesting mapping experiment.
Regarding the issue you raise about ambiguous place names, this problem was addressed (inter alia) in my book
Leidner, J. L. (2008), Toponym Resolution in Text (Boca Raton, FL: Universal Publishers)
http://amzn.to/h8YgW8
See also this blog pointer for regular news in this area (including publications and software pointers):
http://jochenleidner.posterous.com/?sort=&search=+toponym+resolution
Best regards
Jochen
Jochen, thanks for the pointers – much appreciated!
Great work Matt. This stuff has so much promise.
(I’d love to learn how to do this kind of thing in R. Did you run into any examples?)
Working on a project with Kate Hayles, I tried to collect all the place names in Danielewski’s Only Revolutions and I ran into similar problems trying to pick out US city names in the text. I used Stanford’s NER to get the initial list and then pretty much looked up the cities by hand using geonames. (Re: Cambridge) I really don’t know what one can do with city names that occur in *almost every state*… (here’s the map if you’re curious.. http://is.gd/idcuRz)
I’m thinking there’s some sort of Bayesian approach that, given enough computation power, could get some really good results. For example, if you’ve read one book yourself and are confident about the places identified — that should give you some good prior probabilities for placenames in subsequent books by the same author. And you might extend it to look at books from the same publisher in the same decade, etc… and if you had a real random sample — one might get much closer to trusting conclusions about trends…
p.s. Congratulations on the job!
Thanks, Allen! Let’s see …
1. Examples in R: I’ve come across a few in my reading, but nothing specific that I can now recall. Will let you know, though, if I find something interesting in the future. I know that mapping packages for R do exist, and it seems like it would be straightforward (?) to perform the database lookups from within R as well. But python/mysql/Google Refine/Flash (pretty time-series animations but a pain to work with) are OK for me for now.
2. City name disambiguation: I’d thought of a much dumber approach that would just default to the largest city with that name. Or refine a bit by finding the most recently used nation/region/etc. in context and working outward by distance. That second one is non-trivial and the first is obviously dicey, even more so for historical texts where current population may have little enough to do with historical pop. I like the Bayesian approach in principle, but I’d worry about the amount of training data/effort required.
3. Nice map of Danielewski! I liked House of Leaves, but haven’t read Only Revolutions. Worth the time?
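For what it’s worth, the “dumb” default from point 2 might look something like the sketch below. Everything in it is an assumption for illustration: the `Candidate` record, the `resolve` function, and especially the population numbers, which are placeholders rather than real figures. The region refinement is also simplified to an exact match on the most recently seen region; it does not do the working-outward-by-distance step.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    region: str
    population: int  # placeholder values below, NOT real population figures


# Hypothetical candidate list for one ambiguous name.
CANDIDATES = [
    Candidate("Cambridge", "Massachusetts", 100),
    Candidate("Cambridge", "England", 200),
    Candidate("Cambridge", "Ohio", 10),
]


def resolve(
    name: str,
    candidates: List[Candidate],
    recent_region: Optional[str] = None,
) -> Optional[Candidate]:
    """Pick one candidate for an ambiguous place name.

    If a recently mentioned region is supplied and some candidate lies in it,
    restrict the choice to those candidates; otherwise fall back to the
    largest city with that name.
    """
    matches = [c for c in candidates if c.name == name]
    if not matches:
        return None
    if recent_region is not None:
        in_region = [c for c in matches if c.region == recent_region]
        if in_region:
            matches = in_region
    return max(matches, key=lambda c: c.population)


if __name__ == "__main__":
    print(resolve("Cambridge", CANDIDATES))
    print(resolve("Cambridge", CANDIDATES, recent_region="Massachusetts"))
```

With these placeholder numbers the bare default picks the English city, which is exactly the dicey behavior for a corpus of American novels; supplying the recent region overrides it.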
The largest-city-with-previous-region approach sounds good. It does seem like machine learning of some description might prove useful though. And having some training data would be useful, if only to assess the quality of whatever method you’re using.
Adding location data wouldn’t be that bad an assignment for a student alongside reading a novel. I still think Unsworth’s Bestsellers course is great since it gets students to read all these unknown early-mid 20th century bestselling novels and write up commentaries http://www3.isrl.illinois.edu/~unsworth/courses/bestsellers/
Agreed on both fronts. Will certainly need some training/known data for validation, and as long as one has that, it couldn’t hurt to try some ML stuff. Will post more on this when I get to it.
And I like the idea of students tagging locations as part of a course. The trick, I suppose, is to make that work part of the proper intellectual content of the class (and, of course, to have electronic copies of the texts if you want tagged locations in situ rather than just a list of places; could be tricky for the mostly contemporary material I tend to teach).
Very interesting project Matt; I immediately wanted to start playing with this myself (mostly on individual works rather than a large corpus), just to see what’s there to be seen.
I was wondering if I could ask you to elaborate just a bit more about your method and workflow. I got geodict up and running; I thought I’d feed the results to a python script and then do the plotting in Processing. How did you handle country names? Am I correct in inferring from your maps that you plot country names in the middle of the country (I’m looking at Mexico, Canada, Australia, China, India…)? Where did that lat/long data come from? More generally, I’m interested in how you moved from the geodict output to the MaxMind lat/long data.
Thanks!
Hi Chris,
Sure thing, though you’ll probably want to note two things:
1. There’s currently a bug in GeoDict (it’s not importing the set of named places correctly). Avoid using it for any real work until it’s fixed (soon, one hopes). See https://github.com/petewarden/dstk/issues#issue/7
2. GeoDict itself has been rolled into Pete’s Data Science Toolkit; see http://www.datasciencetoolkit.org. No longer necessary to run your own database server unless you want to, though I don’t know how well the API would hold up for millions of words of text. Will be looking into this myself soon.
But to answer your questions …
GeoDict will output to multiple formats including CSV and XML that include lat/long data pulled from the database when it does its original location lookup. Output format is command line selectable. I wrote a mildly custom output formatter based on the existing XML output. No further queries to the Geo IP database required, happily.
As for country names, I relied on the coordinates supplied in the database, which seem to be pretty much geographically central for each country (and continent). If memory serves, I tweaked a few by hand (moved “Canada” out of the Hudson Bay and “Africa” a little further east, for example).
But again the caveat: the standalone GeoDict is now deprecated and the new version in DSTK is buggy until further notice. Pete says he’ll have a fix in the next release of DSTK, but I don’t know exactly when that’ll be. Development is active, so I wouldn’t expect too long a wait.
Let me know if you have any other questions or if I’ve failed to address the ones you asked – I’m happy to answer whatever I can. And keep me posted on your project!
Thanks for the extra tips.
I definitely should’ve RTFM’d for GeoDict; I had pulled just the place names and thought it would be necessary to mine the MaxMind data csv. Silly me.
In the meantime, I played a bit with the list of places I had already pulled from Ulysses using the (buggy) GeoDict; I used Yahoo PlaceFinder API to get lat/long. It does a good job of just taking whatever input you give it (“Dublin”, “England”… even errors like “La” or “Or”) and making a pretty solid guess.
I cribbed copiously from this thread to plot the points via Processing.
The ugly results on Ulysses look like this.
Something isn’t quite right; that point out in the Pacific to the West of S. America? That should be Chile… Using transparency, it would be nice to show the relative number of mentions of, say, Dublin (versus the one mention of New York, e.g.).
Enough for now though. Thanks for the inspiration to tinker a bit!
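The transparency idea above could be prototyped before any plotting by turning mention counts into alpha values. This is a rough sketch under stated assumptions: `mention_alphas` is a hypothetical helper, the log scaling and the 0.15 floor are arbitrary choices, and the resulting numbers would still need to be handed to whatever draws the points.

```python
import math
from collections import Counter


def mention_alphas(mentions, min_alpha=0.15, max_alpha=1.0):
    """Map each place name to an opacity based on how often it is mentioned.

    `mentions` is a flat list of place names, one entry per mention.
    Counts are log-scaled so that one heavily mentioned city does not
    push every other place down to near-invisibility.
    """
    counts = Counter(mentions)
    if not counts:
        return {}
    top = max(counts.values())
    alphas = {}
    for place, n in counts.items():
        scale = math.log1p(n) / math.log1p(top)
        alphas[place] = min_alpha + (max_alpha - min_alpha) * scale
    return alphas


if __name__ == "__main__":
    # Made-up counts for illustration, not tallies from any actual text.
    sample = ["Dublin"] * 50 + ["New York"]
    for place, alpha in mention_alphas(sample).items():
        print(place, round(alpha, 2))
```

The most-mentioned place always gets the maximum opacity, and a place mentioned once stays faint but visible because of the floor.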
The twitpic isn’t loading for me, but sounds cool! And yeah, the Yahoo API is a good point of comparison; GeoDict was apparently developed to emulate it, but with fewer false positives (and more false negatives).
Lovely maps. I wish I had something substantive to say about the technical issues, but I don’t. My two cents about Only Revolutions is that it’s more interesting to think about than to read.