New Article in ALH

My article, “The Geographic Imagination of Civil War-Era American Fiction,” is in the latest issue of American Literary History (which happens to be the 100th issue of the journal). The easiest way to get it is probably via Muse (direct link, paywall), though it’s also available from Oxford (publisher of ALH, temporarily free to all). If your institution doesn’t subscribe to either of those outlets, drop me a line and I’ll send you a PDF offprint. I’m really pleased to see the piece in print, especially in an issue with so many people whose work I admire.

The article presents some of my recent work on geolocation extraction in a form that’s more complete than has been possible in the talks I’ve given over the last year or so. There’s more coming on a number of fronts: geographic attention as a function of demographic and economic factors, a wider historical scope, a (much) larger corpus, some marginally related studies of language use in the nineteenth century (with my students Bryan Santin and Dan Murphy), and more. Looking forward to sharing these projects in the months ahead.

Racial Dotmap

A few days back, I tweeted about the Racial Dotmap, a really cool GIS project by Dustin Cable of the Weldon Cooper Center for Public Service at UVa. The map shows the distribution (down to the block level) of US population by race according to the 2010 census. There’s a fuller explanation on the Cooper Center’s site.

The map is fascinating stuff — I lost most of a morning browsing around it. Really, you should check it out. To give you an idea of what you’ll find, here are a couple of screen grabs:

The eastern US (click for live version):

[Image: Racial Dotmap screenshot of the eastern US]

South Bend, Indiana (with Notre Dame). Not clickable, alas, but you can find it from the main map:
[Image: Racial Dotmap screenshot of South Bend, Indiana]

One of the things that’s especially appealing about the project is how open it is. The code is posted on GitHub and the underlying data comes from the National Historical Geographic Information System. That fact, along with a suggestion by Nathan Yau of FlowingData, made me wonder how much effort would be involved in creating a version of the map that would allow users to move between historical censuses. It would be really helpful to have an analogous picture for the nineteenth century as I work on the evolution of literary geography during that period.

If I were cooler than I am, this would be where I’d reveal that I had, in fact, created such a thing. I am not that cool. But I wanted to flag the possibility for future use by me or my students or anyone else who might be so inclined. I’m thinking of at least looking into this as a group project for the next iteration of my DH seminar.

I can imagine two big difficulties straight away:

  1. You’d need to have historical geo data, particularly block- or tract-level shapefiles. I have no idea how much the census blocks have changed over time nor whether such historical shapefiles exist. Seems like they should, but …
  2. You’d need the historical census info to be tabulated and available in a way that allows it to be dropped into the existing code or translated into an analogous form. I haven’t looked at that data, so I don’t know how much work would be involved.

Anyway, the Racial Dotmap is a great project to which I hope to be able to return in the future. In the meantime, enjoy!

Update: Mostly for my own future reference, see also MetroTrends’ Poverty and Race in America, Then and Now, which focuses on people below the poverty line and has a graphical slider to compare geographic distributions by race from 1980 through 2010. Click through for the full site.

[Image: screenshot of MetroTrends’ Poverty and Race in America, Then and Now]

Geolocation Correction at Uses of Scale

I’ve just posted a writeup and some data on hand-corrected geolocation extraction over at the Uses of Scale site (associated with the Mellon grant Ted Underwood and I are running). The idea is to share as much as possible of the tediously achieved process stuff that’s required for computational research but that isn’t itself “achieved” results. In addition to my post on geography, there’s also information (from Ted and others) on OCR correction and on removing running headers from scanned texts. Not always sexy, but we hope it’ll help others do related work without having to start entirely from scratch. And I suppose we’re also selfishly hoping for feedback and improvements from anyone else who might have experience dealing with related issues.

Population Growth and Literary Attention

I just posted an item about the literary uses of Chicago and New Orleans on the new Scalable Reading group blog (to which Martin Mueller, Ted Underwood, and Steve Ramsay are also contributors). A brief preview:

There’s a lot of jitter in the New Orleans numbers, but a couple of things seem clear:

  1. Through most of the period 1851–75, there’s much more literary attention paid to New Orleans than to Chicago.
  2. Interest in Chicago picks up meaningfully after about 1870.
  3. Interest in New Orleans wanes a bit around the same time, but only to the extent that the two cities occur at about equal rates in the last few years of the corpus.

[And in sum:] I’m sure there’s some novelty-driven interest in emerging cities and demographic changes, but at least in the case of Chicago and New Orleans, it doesn’t appear to be the dominant factor driving literary attention.

This is also a chance to put in a plug for Scalable Reading, both the blog and the concept. Well worth a read, I think, my own contribution notwithstanding.

Named Localities

Following on my last post about choropleth maps and regional densities, here are a couple of quick figures showing specific named locations at the city level and below (‘bare’ mentions of nations and regions/states alone are excluded) in the same nineteenth-century literary corpus, scaled by number of occurrences:

[Figure: named localities, all years, linear scale]

The biggies are New York, D.C., Boston, London, Paris, etc. Compare this to the log version, which seemed more useful in the density case:

[Figure: named localities, all years, log scale]

Looks to me like the log version is less clear for this type of figure.

A few notes:

1. These figures include all the texts from 1851-75; still working on year-by-year figures and an animation. Won’t be hard.

2. A couple of things to check out in the near future. (a.) How does the density of named localities compare to that of named regions and nations? Consider Africa in particular, where there’s decent national density in some cases, but perhaps less geographic specificity. (b.) I need to produce a state-level density map that subtracts some measure of population from the number of named location mentions to get a sense of which states received a disproportionate share of literary attention.

3. These maps were produced using the ‘maps’ package in R. Really simple to use. Method cribbed from Nathan Yau’s Visualize This.

4. The top few cities:

Place                       Count
New York, NY, USA            9183
Washington D.C., DC, USA     4179
Boston, MA, USA              3951
Paris, France                3312
London, UK                   3279
Rome, Italy                  2154
Philadelphia, PA, USA        2058
New Orleans, LA, USA         1580
Richmond, VA, USA            1152
Jerusalem, Israel             925
Charleston, SC, USA           885
Baltimore, MD, USA            709
San Francisco, CA, USA        682
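
A rough sketch of the state-level comparison floated in note 2(b), in Python rather than R. It takes the most literal reading of "subtracts some measure of population": each state's share of all mentions minus its share of total population, so a positive number means more literary attention than population alone would predict. The function name and the toy figures for states "A", "B", and "C" are invented for illustration; they are not corpus counts or census numbers, and share difference is only one of several plausible measures.

```python
def attention_surplus(mentions, population):
    """Return each state's share of mentions minus its share of population."""
    total_mentions = sum(mentions.values())
    total_population = sum(population.values())
    surplus = {}
    for state, count in mentions.items():
        mention_share = count / total_mentions
        population_share = population[state] / total_population
        surplus[state] = mention_share - population_share
    return surplus


# Toy, made-up inputs (NOT real data): three hypothetical states.
toy_mentions = {"A": 60, "B": 30, "C": 10}
toy_population = {"A": 40, "B": 40, "C": 20}

surplus = attention_surplus(toy_mentions, toy_population)
for state in sorted(surplus, key=surplus.get, reverse=True):
    print(f"{state}: {surplus[state]:+.2f}")
```

Because both sets of shares sum to one, the surpluses always sum to zero, which makes the result easy to read as over- versus under-representation.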

Density of Locations in U.S. Fiction around the Civil War

I’ve been working recently on different visualizations of the geolocation information I’ve discussed on a couple of previous occasions. (See posts on the corpus, on method and accuracy, and on an earlier style of mapping.)

Here’s the latest: Below are Google Fusion Tables intensity maps of the distribution of named places in my corpus (1098 volumes of U.S. fiction dating from 1851-75; good but not final data, so don’t get carried away just yet), aggregated by nation and by U.S. state.

Countries Linear

Named locations aggregated by nation, linear density scale.
(WordPress.com doesn’t allow embedded iframes; click on this (or any) map to see the live version, which includes raw counts per territory on mouseover.)

This first figure mostly shows that the large majority of named places in books written around the Civil War are located in the United States. But (a.) there’s a fair amount of international distribution and (b.) there’s more variation in that international distribution than the shading here reveals. (FWIW, the distribution looks power-law-like, but I haven’t checked yet.)

For better comparative resolution, we can use log-scaled density shading. Note that this of course flattens the difference between high and low densities, which is why I’ve included both figures.

Countries Log

Named locations aggregated by nation, log density scale.
Click for live version.

The log scale brings out a bit better the comparatively high concentrations of named places in western Europe, the Middle East, Russia (who knew?), China, India, Canada, Mexico, Brazil, and Australia. (If I’m remembering right, Greenland is all Melville. But don’t quote me on that.)

What about the distribution within the United States? Ask and ye shall receive:

States Linear

Named locations aggregated by state, linear density scale.
Click for live version.

New York, Virginia, and Massachusetts stand out; PA, CA, TX, and LA also have pretty decent numbers. A lot of flattening in this visualization, though, so …

The log version:

States Log

Named locations aggregated by state, log density scale.
Click for live version.

Interesting how this shows more clearly the notable density in the south and midwest.

More to come, especially time-resolved series (which should be really useful) and city/POI-level maps.

Two notes in passing:

1. Fusion Tables (the tool) and fusion tables (the output) are really cool. They’re dead simple; the charts here took about 15 minutes to create once I’d dumped the relevant data from MySQL. Great for testing and prototyping. But there are limits on what they can do and they’re not terribly flexible outside the things they’re built to do. I had to generate the log counts in Excel, for instance, because you can’t perform computations on aggregated data. (The aggregation itself was totally painless, though, as was the export-import.)

2. I’ll probably need a different package for the city-level mapping, because fusion tables intensity maps will only show 250 data points at a time. Even in my reduced and cleaned data, I have about 1700 unique locations. I’m also thinking about exactly how to represent both number of instances (marker size, I think) and time-evolution (maybe something like the Outbreak-style Walmart map from FlowingData, though for my sanity’s sake I’d like to avoid Flash).
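
For anyone who would rather skip the Excel step mentioned in the first note, here is a minimal Python sketch of log-scaling aggregated counts before they go into a shaded map. It is not the spreadsheet formula actually used for the figures above. The function name, the base-10 choice, and the +1 offset (so that territories with zero mentions stay defined) are all assumptions, and the counts for "X", "Y", and "Z" are toy values.

```python
import math


def log_scaled(counts):
    """Map raw counts to log10(count + 1); the +1 keeps zero counts defined."""
    return {place: math.log10(count + 1) for place, count in counts.items()}


# Toy, made-up counts (NOT corpus data) spanning several orders of magnitude.
toy_counts = {"X": 999, "Y": 9, "Z": 0}
scaled = log_scaled(toy_counts)

for place, value in scaled.items():
    print(f"{place}: raw={toy_counts[place]}, log-scaled={value:.2f}")
```

In the toy data, X has 111 times as many mentions as Y, but only three times the log-scaled value. That is the flattening effect noted earlier: low-count territories become easier to tell apart at the cost of shrinking the gap between high and low densities.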

[Update: It would obviously also be interesting to compare these densities–and their evolution over time–to census data from the period. This is In The Works.]

Maps of American Fiction

A quick post to show some recent research on named places in nineteenth-century American fiction. I’m interested in the range and distribution of places mentioned in these books as potential indicators of cultural investments in, for example, internationalism and regionalism. I’m also curious about the extent to which large-scale changes (both cultural and formal) are observable in the overall literary production of this (or any) period. The mapping work I’ve done so far doesn’t come close to answering those questions, but it’s part of the larger inquiry.

The Maps

The maps below were generated using a modest corpus of American novels (about 300 in total) drawn from the Wright American Fiction Project at Indiana by way of the MONK project. They show the named locations used in those books; points correspond to everything from small towns through regions, nations and continents. Methodological details and (significant) caveats follow.

1851
1851. 37 volumes (~2.5M words), with data cleanup.

1852
1852. 44 volumes (~3.0M words), minimal cleanup.

1874
1874. 38 volumes (~3.1M words), minimal cleanup.

The Method

Texts were taken from MONK in XML (TEI-A) format with hand-curated metadata. Location names were identified and extracted using Pete Warden’s simple gazetteering script GeoDict, backed by MaxMind’s free world cities database. [Note that there’s currently a bug in the database population script for GeoDict. Pete tells me it’ll be fixed in the next release of his general-purpose Data Science Toolkit, into which GeoDict has now been folded. But for now, you probably don’t want to use GeoDict as-is for your own work.] I tweaked GeoDict to identify places more liberally than usual, which results (predictably) in fewer missed places but more false positives. The locations for 1851 were reviewed pretty carefully by hand; I haven’t done the same yet for the other years. Maps were generated in Flash using Modest Maps with code cribbed shamelessly from the awesome FlowingData Walmart project. This means that it should be relatively easy to turn the static maps above into a time-animated series, but I haven’t done that yet.

Discussion

As I pointed out in my talk on canons, the international scope and regional clustering of places in 1851 strike me as interesting. See the talk for (slightly) more discussion. Moving forward to 1874 (and bearing in mind that we’re looking at dirty data best compared with the similarly dirty 1852), the density of named places in the American west increases after the Civil War and it looks as though a distinct cluster of places in the south central U.S. is beginning to emerge.

The changes from 1852 to 1874 are (1) intriguing, but also (2) mostly as expected and (3) more limited in scope than one might have imagined, given that they sit a decade on either side of the periodizing event of American history. I think an important question raised by a lot of work in corpus analysis (the present research included) concerns exactly what constitutes a “major” shift in form or content.

I’m going to avoid saying anything more here because I don’t want to build too much argument on top of a dataset that I know is still full of errors, but I wanted to put the maps up for anyone to puzzle through. If you have thoughts about what’s going on here, I’d love to hear them.

Caveats

A couple of notes and caveats on errors:

  • Errors in the data are of several kinds. There are missed locations, i.e., named places that occur in the underlying text but are not flagged as such. Some places that existed in the nineteenth century don’t exist now. Some colloquial names aren’t in the database. And of course a book can be set in, say, New York City and yet fail to use the city’s name often or at all, possibly preferring street addresses or localisms like “the Village.” Also, GeoDict as configured identifies all country and continent names with no restrictions, but requires cities and regions (e.g., U.S. states) either to be paired with a larger geographic region (“Brooklyn, New York,” not “Brooklyn”) or preceded by “in” or “at” as indicators of place. You pretty much have to do this to keep the false positive rate manageable.
  • But there are still false positives. There’s a city somewhere in the world named for just about any common English name, adjective, military rank, etc. “George,” for instance, is a city in South Africa. “George, South Africa,” if it ever occurred in a text, would be identified correctly. But “In George she had found a true friend” produces a false positive. When I clean the data, I eliminate almost all proper names of this kind and investigate anything else that looks suspicious. Note that the cluster of places in southern Africa visible in the (uncleaned) 1852 and 1874 maps is almost certainly attributable to this kind of error. Travis Brown tells me he’s seen the same thing in his own geocoding experiments.
  • Then there are ambiguous locations, usually clear in context but not obvious to GeoDict. “Cambridge” is the most frequent example. Some study suggests that most American novels in the corpus mean the city in Massachusetts, but that’s surely not true of every instance. Most other ambiguities are much more easily resolved, but they still require human attention.
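
To make the matching rules in the first two caveats concrete, here is a toy Python sketch of that kind of heuristic. It is not GeoDict's code and borrows nothing from its API. The tiny gazetteer, the function name, and the regular expressions are all invented for illustration. Countries match anywhere. Cities match only when followed by a known larger region or preceded by "in" or "at".

```python
import re

# Toy gazetteer, purely illustrative; a real lookup table is far larger.
COUNTRIES = {"France", "South Africa"}
CITIES = {"George": {"South Africa"}, "Brooklyn": {"New York"}}


def find_places(text):
    """Return the sorted place names that the toy rules flag in text."""
    found = []
    # Rule 1: country names are accepted with no restrictions.
    for country in COUNTRIES:
        if re.search(r"\b" + re.escape(country) + r"\b", text):
            found.append(country)
    # Rule 2: cities need a paired larger region or a preceding "in"/"at".
    for city, regions in CITIES.items():
        name = re.escape(city)
        paired = any(
            re.search(r"\b" + name + r",\s*" + re.escape(region) + r"\b", text)
            for region in regions
        )
        cued = re.search(r"\b(?:[Ii]n|[Aa]t)\s+" + name + r"\b", text)
        if paired or cued:
            found.append(city)
    return sorted(found)


print(find_places("He lived in Brooklyn, New York."))
print(find_places("Brooklyn was quiet."))
print(find_places("In George she had found a true friend."))
```

The last call flags "George" even though it is a person's name, which is the false positive described above. The bare "Brooklyn" in the second call is skipped, which is the missed-location cost of keeping the rules strict.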