I just finished a study on the accuracy of automated location identification in nineteenth-century literary texts using the Stanford NLP package (for named entity extraction) and Google’s geocoding API (for associating location names with lat/lon and other GIS data). The full results will go in the article I’m currently writing, but here’s a quick preview.
Out of the box, the combination of Stanford NER + Google has a precision of about 0.40 and a recall of 0.73 on my data (U.S. novels published between 1851 and 1875). Precision is the fraction of identified places that are correct; recall is the fraction of actual places in the source text that are identified correctly. You could get great recall (and terrible precision) by identifying everything in the source text as a location; likewise you’d have terrific precision (but awful recall) by limiting the locations you identify to those that are easy and unambiguous, e.g., “Boston.” You can combine (well, take the harmonic mean of) precision and recall to get an overall sense of accuracy via an F measure; in this case F1 (which weights P and R equally) is 0.52.
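For anyone who wants to check the arithmetic, here is a minimal sketch of the three measures in Python. The function names and the toy place sets are my own inventions for illustration; only the 0.40 and 0.73 figures come from the study.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two sets of place names."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of P and R; beta=1 weights them equally (F1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy example, invented for illustration (not data from the study).
gold = {"Boston", "New York", "Richmond", "the South"}
predicted = {"Boston", "New York", "Richmond", "Charlotte", "Providence"}
p, r = precision_recall(predicted, gold)
print(p, r)  # 0.6 0.75

# The out-of-the-box figures reported above.
print(round(f_measure(0.40, 0.73), 2))  # 0.52
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the weak precision drags F1 well below the simple average of P and R.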
What those numbers mean is that the method succeeds in finding most of the named places, but it also flags a lot of extraneous strings that it thinks are places but really aren’t. Fortunately, many of its errors aren’t of the kind you might expect. The kind you might expect is toponym ambiguity: the location of “Springfield” in a text, for instance, is hard to resolve without more information. There are some of these ambiguity problems, of course, but many more errors come from text strings that ought not to have been identified as locations at all. Some of these are more or less ambiguous (“Charlotte” or “Providence,” for instance, both of which show up pretty often in nineteenth-century texts, almost always as a personal name and as divine care, respectively). But many such false locations are (even more) straightforward: “New Jerusalem,” “Conrad,” “Caroline,” etc. (I saw something similar in my previous work with GeoDict.)
Because these sorts of errors are pretty easily identified out of context, it’s not terribly hard to clean up (quickly!) the results by hand, striking recognized locations that likely aren’t used as real places. At the same time, there are a few commonly used pseudo-places that the NER package finds but Google doesn’t identify (“the South,” “Far East,” and so on). These are trivial to correct by hand.
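That cleanup can be pictured as two small hand-built lists applied to the pipeline output. What follows is only a sketch under assumptions: the names `NOT_PLACES`, `PSEUDO_PLACES`, and `clean_locations` are hypothetical, the list entries are just the examples mentioned in this post, and the geocoder results are placeholder strings rather than real API responses.

```python
# Hypothetical hand-built lists; entries are the examples named in this post.
NOT_PLACES = {"Charlotte", "Providence", "New Jerusalem", "Conrad", "Caroline"}
PSEUDO_PLACES = {"the South", "Far East"}


def clean_locations(ner_locations, geocoded):
    """Apply hand cleanup to NER output.

    ner_locations: strings the NER step tagged as locations.
    geocoded: dict of name -> geocoder result, for names the geocoder resolved.
    Returns (name, result) pairs; result is None for pseudo-places that
    still need a location assigned by hand.
    """
    cleaned = []
    for name in ner_locations:
        if name in NOT_PLACES:
            continue  # struck by hand: likely not used as a real place
        if name in geocoded:
            cleaned.append((name, geocoded[name]))
        elif name in PSEUDO_PLACES:
            cleaned.append((name, None))  # kept, even though the geocoder missed it
    return cleaned


# Placeholder strings stand in for whatever the geocoder returns.
ner_output = ["Boston", "Charlotte", "the South", "Conrad"]
geocoder_hits = {"Boston": "<result>", "Charlotte": "<result>", "Conrad": "<result>"}
print(clean_locations(ner_output, geocoder_hits))
# [('Boston', '<result>'), ('the South', None)]
```

Striking names out of context like this will occasionally remove a real place (a novel genuinely set in Charlotte, say), which is the price of doing the cleanup quickly.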
Applying such hand cleanup raises precision to 0.59 and recall to 0.84 (the latter mostly due to “South,” “North,” etc.; we’re talking about the lit of the Civil War, after all). The revised F1 score is 0.69. That’s not bad, really (though one would always like these numbers to be higher). Compare, for instance, Jochen Leidner’s evaluation of toponym resolution methods, which found lower numbers using more sophisticated techniques on locations mentioned in newspaper articles. Note in particular that even humans often don’t agree on what constitutes a named location (“Boston lawyer”: adjective or place?) or on the identity of the referent (Leidner cites inter-annotator agreement of roughly 80-90% depending on the corpus).
So long story short: the combination of Stanford NER and Google geocoding performs (surprisingly?) well by contemporary standards. But keep in mind that even in the best case, after hand cleanup, around 40% of the identified results will be spurious.