A quick post to show some recent research on named places in nineteenth-century American fiction. I’m interested in the range and distribution of places mentioned in these books as potential indicators of cultural investments in, for example, internationalism and regionalism. I’m also curious about the extent to which large-scale changes (both cultural and formal) are observable in the overall literary production of this (or any) period. The mapping work I’ve done so far doesn’t come close to answering those questions, but it’s part of the larger inquiry.
The maps below were generated using a modest corpus of American novels (about 300 in total) drawn from the Wright American Fiction Project at Indiana by way of the MONK project. They show the named locations used in those books; points correspond to everything from small towns through regions, nations and continents. Methodological details and (significant) caveats follow.
Texts were taken from MONK in XML (TEI-A) format with hand-curated metadata. Location names were identified and extracted using Pete Warden’s simple gazetteering script GeoDict, backed by MaxMind’s free world cities database. [Note that there’s currently a bug in the database population script for GeoDict. Pete tells me it’ll be fixed in the next release of his general-purpose Data Science Toolkit, into which GeoDict has now been folded. But for now, you probably don’t want to use GeoDict as-is for your own work.] I tweaked GeoDict to identify places more liberally than usual, which results (predictably) in fewer missed places but more false positives. The locations for 1851 were reviewed pretty carefully by hand; I haven’t done the same yet for the other years. Maps were generated in Flash using Modest Maps with code cribbed shamelessly from the awesome FlowingData Walmart project. This means that it should be relatively easy to turn the static maps above into a time-animated series, but I haven’t done that yet.
As I pointed out in my talk on canons, the international scope and regional clustering of places in 1851 strike me as interesting. See the talk for (slightly) more discussion. Moving forward to 1874 (and bearing in mind that we’re looking at dirty data best compared with the similarly dirty 1852), the density of named places in the American west increases after the Civil War, and it looks as though a distinct cluster of places in the south central U.S. is beginning to emerge.
The changes from 1852 to 1874 are (1) intriguing, but also (2) mostly as expected, and (3) more limited in scope than one might have imagined, given that the two years sit roughly a decade on either side of the periodizing event of American history. I think an important question raised by a lot of work in corpus analysis (the present research included) concerns exactly what constitutes a “major” shift in form or content.
I’m going to avoid saying anything more here because I don’t want to build too much argument on top of a dataset that I know is still full of errors, but I wanted to put the maps up for anyone to puzzle through. If you have thoughts about what’s going on here, I’d love to hear them.
A couple of notes and caveats on errors:
- Errors in the data are of several kinds. There are missed locations, i.e., named places that occur in the underlying text but are not flagged as such. Some places that existed in the nineteenth century don’t exist now. Some colloquial names aren’t in the database. And of course a book can be set in, say, New York City and yet fail to use the city’s name often or at all, possibly preferring street addresses or localisms like “the Village.” Also, GeoDict as configured identifies all country and continent names with no restrictions, but requires cities and regions (e.g., U.S. states) either to be paired with a larger geographic region (“Brooklyn, New York,” not “Brooklyn”) or preceded by “in” or “at” as indicators of place. You pretty much have to do this to keep the false positive rate manageable.
- But there are still false positives. There’s a city somewhere in the world named for just about any common English name, adjective, military rank, etc. “George,” for instance, is a city in South Africa. “George, South Africa,” if it ever occurred in a text, would be identified correctly. But “In George she had found a true friend” produces a false positive. When I clean the data, I eliminate almost all proper names of this kind and investigate anything else that looks suspicious. Note that the cluster of places in southern Africa visible in the (uncleaned) 1852 and 1874 maps is almost certainly attributable to this kind of error. Travis Brown tells me he’s seen the same thing in his own geocoding experiments.
- Then there are ambiguous locations, usually clear in context but not obvious to GeoDict. “Cambridge” is the most frequent example. A little digging suggests that most of the American novels in the corpus mean the city in Massachusetts, but that’s surely not true of every instance. Most other ambiguities are much more easily resolved, but they still require human attention.
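To make the matching rule and the cleaning step described above a little more concrete, here is a minimal Python sketch. It is not GeoDict’s actual code. The toy gazetteer, the personal-name list, and the function names `find_places` and `clean` are all assumptions made up for illustration, and the sketch skips things a real gazetteer has to handle, such as overlapping names and ambiguous ones like “Cambridge.”

```python
import re

# Toy gazetteer for illustration only; a real run would load a full
# cities database. Every name below is an assumption, not GeoDict data.
COUNTRIES = {"France", "South Africa"}
CONTINENTS = {"Europe", "Asia"}
REGIONS = {"New York", "Massachusetts"}
CITIES = {"Brooklyn", "Cambridge", "George"}

# Hypothetical hand-built list of personal names dropped during cleaning.
PERSONAL_NAMES = {"George"}


def find_places(text):
    """Return the set of gazetteer names flagged in text."""
    found = set()

    # Countries and continents match anywhere, with no context required.
    for name in COUNTRIES | CONTINENTS:
        if re.search(rf"\b{re.escape(name)}\b", text):
            found.add(name)

    # Cities and regions need a cue: a following larger place
    # ("Brooklyn, New York") or a preceding "in" / "at".
    larger = "|".join(re.escape(n) for n in REGIONS | COUNTRIES | CONTINENTS)
    for name in CITIES | REGIONS:
        esc = re.escape(name)
        paired = rf"\b{esc}, (?:{larger})\b"
        prefixed = rf"\b(?:[Ii]n|[Aa]t) {esc}\b"
        if re.search(paired, text) or re.search(prefixed, text):
            found.add(name)

    return found


def clean(places):
    """Drop anything on the personal-name list (blunt, but simple)."""
    return places - PERSONAL_NAMES


print(sorted(find_places("She sailed from Brooklyn, New York to France.")))
# ['Brooklyn', 'France']
print(sorted(find_places("Brooklyn was quiet.")))
# []
print(sorted(find_places("In George she had found a true friend.")))
# ['George']
print(sorted(clean(find_places("In George she had found a true friend."))))
# []
```

Note what the examples show. In the first sentence “New York” itself goes unflagged, because it is neither preceded by “in” or “at” nor followed by a larger place, which is exactly the missed-location cost of the stricter rule. The name filter is just as blunt in the other direction: it would also throw out a genuine “George, South Africa,” which is why suspicious cases still need a human look.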