A Wee Debate in Post45 Contemporaries
January 2nd, 2012
Earlier this year, Andy Hoberek published a piece of mine called “Contemporary Fiction by the Numbers” in his Contemporaries section of Post45. There’s now a response up from Jeremy Rosen and a reply from me. The substance of the thing concerns the best uses of computational methods in literary and cultural studies.
Mostly, though, it’s good to have another excuse to link to Post45 in general and Contemporaries in particular. They’re on my own required reading list.
Books I Read in 2011
January 2nd, 2012
As I did last year and the year before, here’s a list of books I read for the first time in 2011. Mostly confined to fiction, but including two popular-academic books that I (uncharacteristically) read from cover to cover.
- Adichie, Chimamanda Ngozi. Half of a Yellow Sun (2008).
- Aira, César. The Literary Conference (2010).
- Calvino, Italo. Invisible Cities (1978).
- Carson, Anne. Autobiography of Red (1998).
- DeLillo, Don. Libra (1988). [Ducks head in shame.]
- Egan, Jennifer. A Visit from the Goon Squad (2010).
- Graeber, David. Debt: The First 5,000 Years (2011).
- Johns, Adrian. Piracy: The Intellectual Property Wars from Gutenberg to Gates (2010).
- McCarthy, Tom. Remainder (2007).
- Miéville, China. The City and the City (2009).
- Millet, Lydia. Oh Pure and Radiant Heart (2005).
- O’Brien, Tim. In the Lake of the Woods (1994).
- Sayles, John. A Moment in the Sun (2011).
- Vollmann, William. Europe Central (2005).
- Wallace, David Foster. The Pale King (2011).
Not a record-breaking effort, I’d say, but a pretty fun year. I didn’t get to either Theroux or Esterházy as I’d hoped, but there’s always next year, right? Same goes for Dickens — I picked up and put down Our Mutual Friend a couple of times and keep meaning to go back to it. Oh, and I’m maybe twenty pages into Arthur Phillips’ The Tragedy of Arthur, which seems nifty so far. I’ve gotten a couple of other recommendations, but am always happy to have more …
Two Interesting Job Openings
December 26th, 2011
I’ve recently received word of two intriguing DH jobs that might be of interest to some readers:
- The three-year Mark Steinberg Weil Early Career Fellowship in Digital Humanities at Washington University in St. Louis. I was at WashU last year and worked closely with many of the folks involved in this program. It’s a terrific place with great people — really, one of the best experiences of my academic life. I can’t endorse it highly enough. And this newly created fellowship is generous indeed.
- A Research Assistant Professorship to serve as Associate Director of the Center for Digital Humanities at South Carolina. An interesting research/admin hybrid at an important DH center.
Named Localities
September 26th, 2011
Following on my last post about choropleth maps and regional densities, here are a couple of quick figures showing specific named locations at the city level and below (‘bare’ mentions of nations and regions/states alone are excluded) in the same nineteenth-century literary corpus, scaled by number of occurrences:

The biggies are New York, D.C., Boston, London, Paris, etc. Compare this to the log version, which seemed more useful in the density case:

Looks to me like the log version is less clear for this type of figure.
A few notes:
1. These figures include all the texts from 1851-75; still working on year-by-year figures and an animation. Won’t be hard.
2. A couple of things to check out in the near future. (a.) How does the density of named localities compare to that of named regions and nations? Consider Africa in particular, where there’s decent national density in some cases, but perhaps less geographic specificity. (b.) I need to produce a state-level density map that subtracts some measure of population from the number of named location mentions to get a sense of which states received a disproportionate share of literary attention.
3. These maps were produced using the ‘maps’ package in R. Really simple to use. Method cribbed from Nathan Yau’s Visualize This.
4. The top few cities:
| Place | Count |
| --- | ---: |
| New York, NY, USA | 9183 |
| Washington D.C., DC, USA | 4179 |
| Boston, MA, USA | 3951 |
| Paris, France | 3312 |
| London, UK | 3279 |
| Rome, Italy | 2154 |
| Philadelphia, PA, USA | 2058 |
| New Orleans, LA, USA | 1580 |
| Richmond, VA, USA | 1152 |
| Jerusalem, Israel | 925 |
| Charleston, SC, USA | 885 |
| Baltimore, MD, USA | 709 |
| San Francisco, CA, USA | 682 |
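To make concrete how much a log scale compresses these counts, here is a minimal sketch in Python. It is not the R code behind the maps; the helper name `log_scale` and the choice of base 10 are assumptions made for illustration, and the counts are copied from the table above.

```python
import math

# Counts copied from the table above (a few rows only).
counts = {
    "New York": 9183,
    "Washington D.C.": 4179,
    "Boston": 3951,
    "San Francisco": 682,
}


def log_scale(count: int) -> float:
    """Return log10 of a positive count (hypothetical helper; base 10 is an assumption)."""
    if count <= 0:
        raise ValueError("log scaling needs a positive count")
    return math.log10(count)


# Compare the largest and smallest entries on each scale.
linear_ratio = counts["New York"] / counts["San Francisco"]
log_ratio = log_scale(counts["New York"]) / log_scale(counts["San Francisco"])

print(f"linear ratio: {linear_ratio:.1f}")
print(f"log10 ratio:  {log_ratio:.2f}")
```

On the linear scale New York outweighs San Francisco by roughly 13 to 1; after the log transform the gap shrinks to about 1.4 to 1. That compression helps when shading whole regions, but it makes individual markers on a city-level map much harder to tell apart.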
Density of Locations in U.S. Fiction around the Civil War
September 12th, 2011
I’ve been working recently on different visualizations of the geolocation information I’ve discussed on a couple of previous occasions. (See posts on the corpus, on method and accuracy, and on an earlier style of mapping.)
Here’s the latest: Below are Google Fusion Tables intensity maps of the distribution of named places in my corpus (1098 volumes of U.S. fiction dating from 1851-75; good but not final data, so don’t get carried away just yet), aggregated by nation and by U.S. state.

Named locations aggregated by nation, linear density scale.
(WordPress.com doesn’t allow embedded iframes; click on this (or any) map to see the live version, which includes raw counts per territory on mouseover.)
This first figure mostly shows that the large majority of named places in books written around the Civil War are located in the United States. But (a.) there’s a fair amount of international distribution and (b.) there’s more variation in that international distribution than the shading here reveals. (FWIW, the distribution looks power-law-like, but I haven’t checked yet.)
For better comparative resolution, we can use log-scaled density shading. Note that this of course flattens the difference between high and low densities, which is why I’ve included both figures.

Named locations aggregated by nation, log density scale.
Click for live version.
The log scale brings out a bit better the comparatively high concentrations of named places in western Europe, the Middle East, Russia (who knew?), China, India, Canada, Mexico, Brazil, and Australia. (If I’m remembering right, Greenland is all Melville. But don’t quote me on that.)
What about the distribution within the United States? Ask and ye shall receive:

Named locations aggregated by state, linear density scale.
Click for live version.
New York, Virginia, and Massachusetts stand out; PA, CA, TX, and LA also have pretty decent numbers. A lot of flattening in this visualization, though, so …
The log version:

Named locations aggregated by state, log density scale.
Click for live version.
Interesting how this shows more clearly the notable density in the south and midwest.
More to come, especially time-resolved series (which should be really useful) and city/POI-level maps.
Two notes in passing:
1. Fusion Tables (the tool) and fusion tables (the output) are really cool. They’re dead simple; the charts here took about 15 minutes to create once I’d dumped the relevant data from MySQL. Great for testing and prototyping. But there are limits on what they can do and they’re not terribly flexible outside the things they’re built to do. I had to generate the log counts in Excel, for instance, because you can’t perform computations on aggregated data. (The aggregation itself was totally painless, though, as was the export-import.)
2. I’ll probably need a different package for the city-level mapping, because Fusion Tables intensity maps will only show 250 data points at a time. Even in my reduced and cleaned data, I have about 1700 unique locations. I’m also thinking about exactly how to represent both number of instances (marker size, I think) and time-evolution (maybe something like the Outbreak-style Walmart map from FlowingData, though for my sanity I’d like to avoid Flash).
[Update: It would obviously also be interesting to compare these densities, and their evolution over time, to census data from the period. This is in the works.]
Toponym Resolution Accuracy
July 8th, 2011
I just finished a study on the accuracy of automated location identification in nineteenth-century literary texts using the Stanford NLP package (for named entity extraction) and Google’s geocoding API (for associating location names with lat/lon and other GIS data). The full results will go in the article I’m currently writing, but here’s a quick preview of this piece.
Out of the box, the combination of Stanford NER + Google has precision of about 0.40 and recall of 0.73 on my data (U.S. novels published between 1851 and 1875). Precision is the fraction of identified places that are correct; recall is the fraction of actual places in the source text that are identified correctly. You could get great recall—and terrible precision—by identifying everything in the source text as a location; likewise you’d have terrific precision—but awful recall—by limiting the locations you identify to those that are easy and unambiguous, e.g., “Boston.” You can combine (well, take the harmonic mean of) precision and recall to get an overall sense of accuracy via an F measure; in this case F1 (which weighs P and R equally) is 0.52.
What those numbers mean is that the method succeeds in finding most of the named places, but it also finds a lot of extraneous strings that it takes for places but that really aren’t. Fortunately, many of its errors aren’t of the kind you might expect. The expected kind is ambiguity between real places: the location of “Springfield” in a text, for instance, is hard to resolve without more information. There are some of these ambiguity problems, of course, but many more errors come from text strings that ought not to have been identified as locations at all. Some of these are more or less ambiguous (“Charlotte” or “Providence,” for instance, both of which show up pretty often in nineteenth-century texts, almost always as a personal name and divine care, respectively). But many such false locations are (even more) straightforward: “New Jerusalem,” “Conrad,” “Caroline,” etc. (I saw something similar in my previous work with GeoDict.)
Because these sorts of errors are pretty easily identified out of context, it’s not terribly hard to clean up the results by hand (quickly!), striking recognized locations that likely aren’t used as real places. At the same time, there are a few commonly used pseudo-places that the NER package finds but Google doesn’t identify (“the South,” “Far East,” and so on). These are trivial to correct.
Applying such hand cleanup raises precision to 0.59 and recall to 0.84 (the latter mostly due to “South,” “North,” etc.—we’re talking about the lit of the Civil War, after all). The revised F1 score is 0.69. That’s not bad, really (though one would always like these numbers to be higher). Compare, for instance, Jochen Leidner’s evaluation of toponym resolution methods, which found lower numbers using more sophisticated techniques on locations mentioned in newspaper articles. Note in particular that even humans often don’t agree on what constitutes a named location (“Boston lawyer”: adjective or place?) nor on the identity of the referent (Leidner cites inter-annotator agreement of roughly 80-90% depending on the corpus).
So, long story short: the combination of Stanford NER and Google geolocation performs (surprisingly?) well by contemporary standards. But keep in mind that even in the best case here (precision of 0.59 after hand cleanup), around 40% of the identified results will be spurious.
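For anyone who wants to check the arithmetic, here is a minimal sketch in Python of the definitions used above. It is not the evaluation code from the study: the function names are my own for illustration, the toy place sets are made up, and comparing sets of names ignores how often each place occurs.

```python
def precision_recall(predicted: set, actual: set) -> tuple:
    """Precision and recall for a set of predicted places against a hand-checked set."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the balanced F measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up toy example: two false positives, one missed pseudo-place.
toy_predicted = {"Boston", "Richmond", "Charlotte", "Providence"}
toy_actual = {"Boston", "Richmond", "the South"}
toy_p, toy_r = precision_recall(toy_predicted, toy_actual)

# Precision and recall figures reported in this post.
out_of_box = f1_score(0.40, 0.73)
after_cleanup = f1_score(0.59, 0.84)

print(f"toy example:    P={toy_p:.2f} R={toy_r:.2f}")
print(f"out of the box: F1={out_of_box:.2f}")
print(f"after cleanup:  F1={after_cleanup:.2f}")
```

The two reported pairs round to 0.52 and 0.69, matching the F1 values quoted above.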
Bowker Publishing Stats for 2010
June 13th, 2011
I overlooked last month’s announcement from Bowker concerning the number of books published in 2009 and 2010. Condensed version: fiction is flat at a little under 50,000 new titles; literature dropped off a lot (~30%, to 8k from 11k), though, if memory serves, “literature” is a catch-all for anthologies and books about literature. All novels fall under fiction, even when they’re categorized as “literary fiction.” Poetry and drama were off, too.
But (and this may explain much of the drop/flatness) “non-traditional” publication was way, way up. Like, into the millions up. Bowker reports about 316k new traditional titles across all categories for 2010, against almost 2.8 million non-traditional (mostly POD reprints of public domain works). Until c. 2006, the ratios were reversed at about 10:1 traditional:non-traditional. My guess would be that there’s also, buried in that landslide of reprints, a small but very non-trivial number of books that might in the past have been published traditionally, but now are sold direct via Amazon and author sites without the intervention of a regular publisher (note the presence of significant numbers from Lulu, AuthorHouse, XLibris, etc.).
Take-away point: There’s a lot of new fiction out there. I’ll assume most of it is awful, but then most of it has always been awful. It’s only that the sea of words is a lot bigger now.
Literary Production around the Civil War
June 13th, 2011
One more histogram, possibly of general interest. Below is a plot showing the number of literary titles by American authors published in the U.S. each year between 1850 and 1875 (via Lyle Wright’s 1957 bibliography as represented in Indiana’s holdings, black bars) along with the number of those titles held in fully edited form in the Wright American Fiction archive from Indiana and in the MONK project.
Note that this isn’t a stacked bar plot; you’re seeing three distinct histograms superimposed on one another. So if you’re looking just at the black bars, you’re seeing a comprehensive survey of American literary production around the Civil War.
Publication of literary texts drops off in the run-up to the Civil War and in its early years, then bounces back pretty quickly, even before the war is over. There are about 100 new books each year on average through the period.
Two notes for my own purposes. (1.) IU’s coverage of fully edited texts is around 40% of the total period output. That’s pretty good. Just as importantly, it hits that level roughly evenly for each year. No need to worry about serious variations from year to year or about individual years with very low representation (though be careful with, e.g., 1860–61). (2.) I like what MONK did with its 300-text subset, clustering texts as far on either side of the war as possible. Even if you were only working with MONK, you’d still have a decent chance of picking out ante-/post-bellum features.
More Wright American Fiction
June 10th, 2011
With the kind assistance of several folks at Indiana, I’ve now gotten my hands on IU’s full holdings of the digitized Wright American Fiction collection. This is the literary corpus spanning 1850-1875 from which the MONK texts that I used for my initial mapping project were drawn. But MONK chose to limit the size of their Wright-based corpus to around 300 volumes for reasons of balance across their several datasets.
IU has an additional c. 900 Wright texts that have been fully edited and XML encoded (plus 1300 more that have been OCR’ed and XML encoded but not hand edited). This means my depth and temporal coverage in the period around the Civil War just got way better.
More info and results to come as I work my way through this stuff. In the meantime, here’s a plot of the temporal distribution by original publication date of the texts in the two corpora:



