I just finished a study on the accuracy of automated location identification in nineteenth-century literary texts using the Stanford NLP package (for named entity extraction) and Google’s geocoding API (for associating location names with lat/lon and other GIS data). The full results will go in the article I’m currently writing, but here’s a quick preview of this part of the work.
Out of the box, the combination of Stanford NER + Google has precision of about 0.40 and recall of 0.73 on my data (U.S. novels published between 1851 and 1875). Precision is the fraction of identified places that are correct; recall is the fraction of actual places in the source text that are identified correctly. You could get great recall—and terrible precision—by identifying everything in the source text as a location; likewise you’d have terrific precision—but awful recall—by limiting the locations you identify to those that are easy and unambiguous, e.g., “Boston.” You can combine (well, take the harmonic mean of) precision and recall to get an overall sense of accuracy via an F measure; in this case F1 (which weighs P and R equally) is 0.52.
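(For concreteness, here’s the arithmetic in a couple of lines of Python; nothing below comes from my actual evaluation scripts, it’s just the harmonic-mean formula applied to the numbers above and to the cleaned-up numbers reported further down.)

```python
def f_measure(precision, recall):
    """Balanced F measure (F1): the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(0.40, 0.73), 2))  # out of the box: 0.52
print(round(f_measure(0.59, 0.84), 2))  # after hand cleanup (see below): 0.69
```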
What those numbers mean is that the method succeeds in finding most of the named places, but it also finds a lot of other extraneous stuff that it thinks are places but really aren’t. Fortunately, many of its errors aren’t of the kind you might expect. You might expect ambiguity problems (which “Springfield” a text means is hard to resolve without more information), and there are some of those, of course, but many more errors come from text strings that ought not to have been identified as locations at all. Some of these are more or less ambiguous (“Charlotte” or “Providence,” for instance, which show up pretty often in nineteenth-century texts, almost always as a personal name and as divine care, respectively). But many such false locations are even more straightforward: “New Jerusalem,” “Conrad,” “Caroline,” etc. (I saw something similar in my previous work with GeoDict.)
Because these sorts of errors are pretty easily identified out of context, it’s not terribly hard to clean up (quickly!) the results by hand, striking recognized locations that likely aren’t used as real places. At the same time, there are a few commonly used pseudo-places that the NER package finds but Google doesn’t identify (“the South,” “Far East,” and so on). These are trivial to correct.
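In code terms, that cleanup amounts to little more than a pair of hand-maintained lists applied before (or instead of) the geocoding call. Here’s a rough sketch; the list contents, names, and coordinates are placeholders for illustration, not my actual cleanup data.

```python
# Rough sketch of the hand-cleanup step: a stoplist of strings the tagger marks
# as locations but that (in these novels) almost never are, plus a few region
# names the tagger finds that the geocoder won't resolve on its own.
STOPLIST = {"Conrad", "Caroline", "Charlotte", "Providence", "New Jerusalem"}
MANUAL_PLACES = {
    "the South": (33.0, -84.0),  # placeholder coordinates, not real assignments
    "the North": (42.0, -73.0),
}

def clean_locations(ner_hits, geocode):
    """Filter NER output through the hand-built lists before geocoding."""
    cleaned = {}
    for name in ner_hits:
        if name in STOPLIST:
            continue                              # strike likely false positives
        elif name in MANUAL_PLACES:
            cleaned[name] = MANUAL_PLACES[name]   # pseudo-places the API misses
        else:
            cleaned[name] = geocode(name)         # fall back to the geocoding API
    return cleaned
```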
Applying such hand cleanup raises precision to 0.59 and recall to 0.84 (the latter mostly due to “South,” “North,” etc.: we’re talking about the lit of the Civil War, after all). The revised F1 score is 0.69. That’s not bad, really (though one would always like these numbers to be higher). Compare, for instance, Jochen Leidner’s evaluation of toponym resolution methods, which found lower numbers using more sophisticated techniques on locations mentioned in newspaper articles. Note in particular that even humans often don’t agree on what constitutes a named location (“Boston lawyer”: adjective or place?) or on the identity of the referent (Leidner cites inter-annotator agreement of roughly 80-90%, depending on the corpus).
So long story short: the combination of Stanford NER and Google geolocation performs (surprisingly?) well by contemporary standards. But keep in mind that even in the best case, around 40% of the identified results will be spurious.
The best way to improve performance would be to label some data from the domain of 19th-century novels. You have a serious domain mismatch with the newswire-type data that Stanford’s models were trained on.
There are several data annotation tools out there. Using our own tool from LingPipe, I’ve found this kind of data pretty easy to annotate: at least 4K-5K tokens/hour. And with a set of 50K labeled tokens, performance should be MUCH better. We’ve also had good luck annotating named-entity data on Mechanical Turk (Jenny Finkel, who was one of the main contributors to the Stanford NE work, recently collected a 1M-token corpus over Wikipedia data, which we’re currently adjudicating before releasing).
A second line of attack would be to improve the dictionary resources to match what you want, for instance by adding “South” as a location. Stanford’s tool uses conditional random fields (CRFs), so you could add dictionary information as features.
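To give a rough idea of what that dictionary information can look like, here’s a generic sketch of a gazetteer-membership feature for a CRF tagger. This isn’t Stanford’s actual feature template (and the gazetteer here is invented); it’s just the general pattern of exposing dictionary membership as an observable feature.

```python
# Generic gazetteer feature for a CRF-style tagger; the word list is invented.
GAZETTEER = {"south", "north", "far east", "boston", "springfield"}

def token_features(tokens, i):
    """Per-token features; most CRFsuite-style toolkits accept dicts like this."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "in_gazetteer": word.lower() in GAZETTEER,  # the dictionary feature
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<s>",
    }
```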
Finally, training a system just to recognize locations might also help. Presumably you’re using a generic person/location/organization setup.
Another thing you could do is use a tool that gives you probabilities on the output, so you can trade off precision and recall. I don’t think Stanford’s lets you do that, but our CRFs and HMMs do, as do some of Mallet’s.
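To illustrate the shape of that trade-off, here’s a toy sketch; the confidence scores and the gold-standard set are invented, and the point is just that raising the threshold buys precision at the cost of recall.

```python
# Toy example of thresholding tagger confidence to trade recall for precision.
tagged = [("Boston", 0.97), ("Charlotte", 0.55), ("New Jerusalem", 0.40),
          ("Springfield", 0.81), ("Conrad", 0.30), ("Salem", 0.45)]
gold = {"Boston", "Springfield", "Salem"}   # invented gold standard

def precision_recall(tagged, gold, threshold):
    kept = {name for name, score in tagged if score >= threshold}
    tp = len(kept & gold)
    precision = tp / len(kept) if kept else 1.0
    recall = tp / len(gold)
    return precision, recall

for threshold in (0.2, 0.5, 0.8):
    p, r = precision_recall(tagged, gold, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```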
Thanks, Bob (and apologies for the long delay while I was moving) – these are useful suggestions all. Inter-annotator agreement is always going to be an issue (adjectival uses of place names are especially hard to classify), but that’s no reason not to collect the best possible domain-specific training data.
As for annotation tools, I’ll need to investigate. So far I’ve used only Stanford’s packaged data, so I haven’t looked into them.
Thanks again – I appreciate the advice!