Exercise: Entity Extraction

Summary

Use the Stanford NLP tools to identify named entities in a short text, then write a Python program to parse the NLP output.

Details

The text in question is a brief pamphlet, Growler’s Income Tax (1864), by the prolific mid-nineteenth-century writer T.S. Arthur. It’s a defense of the then-new income tax, instituted in 1861 to fund the Union’s war effort. As you’ll see, the text is pretty straightforward, but it’s kind of nifty (or infuriating, I guess) how similar its arguments about taxation are to those you might hear today. Go ahead, read it now. It’s short.

Anyway, tax policy isn’t really the point. Your task is to identify the named entities in the text algorithmically and to extract them for further processing. To do this, you’ll use the Stanford NLP toolset. One possibility would be to use the full CoreNLP package, which can do much more than identify named entities in a single pass over a given text (for example, part-of-speech tagging, coreference resolution, and sentiment analysis). But the simplest approach is to use just the NER tool.

The NER tool, like the full CoreNLP package, is a Java program. That means you’ll need to have Java on your computer. And not just the Java runtime for your browser; you need the full Java Development Kit. Make sure you get the proper version for your system (OS and 32/64-bit). You can check to see if you have Java installed by opening a terminal and typing java -version. If you get something like these lines, you’re good:

java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

Follow the instructions included with the NER download to process the input text. Note that there are several output formats supported. I’d suggest using -outputFormat inlineXML. You’ll need to pass that option explicitly, because the default format is slashTags. Save your output to an appropriately named file.
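
If you’d rather drive the NER step from Python than type the command by hand, something like the following sketch might work. The jar name, classifier path, memory setting, and file names here are assumptions based on a typical NER download; check the README that ships with your copy, since the classpath in particular differs between releases.

```python
import subprocess


def build_ner_command(text_file,
                      jar="stanford-ner.jar",
                      classifier="classifiers/english.all.3class.distsim.crf.ser.gz"):
    """Return the argument list for one run of the Stanford NER classifier.

    The jar and classifier defaults are assumptions; adjust them to match
    the layout of your own download.
    """
    return [
        "java", "-mx600m",
        "-cp", jar,
        "edu.stanford.nlp.ie.crf.CRFClassifier",
        "-loadClassifier", classifier,
        "-textFile", text_file,
        "-outputFormat", "inlineXML",
    ]


def run_ner(text_file, output_file, **kwargs):
    """Run NER on text_file and save the tagged text to output_file."""
    with open(output_file, "w", encoding="utf-8") as out:
        # check=True raises an error if Java exits with a failure code.
        subprocess.run(build_ner_command(text_file, **kwargs),
                       stdout=out, check=True)
```

Calling run_ner("growler.txt", "growler-ner.xml") from the NER directory would then write the tagged text to growler-ner.xml (both file names are made up for illustration).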

Now write some Python code that reads the NER output file and builds a list of unique entities in the output, each entity’s type, and a count of how many times each entity occurs. If you want to challenge yourself in a small way, make sure that the string “Philadelphia, Pennsylvania” is counted as a single location entity, not two separate locations.
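
Here’s one possible starting point, assuming you used the inlineXML format, so that entities appear as spans like <LOCATION>Boston</LOCATION>. It’s a rough sketch, not the only way to do it: it uses regular expressions rather than an XML parser, because the output has no single root element. The function name and the merging rule are my own choices.

```python
import re
from collections import Counter

# Matches one tagged span such as <LOCATION>Boston</LOCATION>.
TAG_RE = re.compile(r"<([A-Z]+)>(.*?)</\1>", re.DOTALL)

# Matches the boundary between two LOCATION spans that are separated
# only by a comma and whitespace.
SPLIT_LOCATION_RE = re.compile(r"</LOCATION>(,\s+)<LOCATION>")


def count_entities(tagged_text):
    """Return a Counter mapping (entity, type) pairs to occurrence counts."""
    # Join "City</LOCATION>, <LOCATION>State" into a single LOCATION span.
    merged = SPLIT_LOCATION_RE.sub(r"\1", tagged_text)
    counts = Counter()
    for match in TAG_RE.finditer(merged):
        entity_type = match.group(1).capitalize()  # LOCATION -> Location
        entity = " ".join(match.group(2).split())  # collapse line breaks
        counts[(entity, entity_type)] += 1
    return counts
```

Be aware that the merging rule is crude: it will also glue together a comma-separated list of distinct places, so check its results against the text.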

Your program should print a summary of this data to the terminal and write a CSV file named entities.csv that contains the same information. Your terminal output should look roughly like this:

Entity		Type		Count
------		----		-----
Boston		Location	2
John Smith	Person		1
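
A table like this can be produced in a few lines. The sketch below assumes you already have a dictionary (or Counter) that maps (entity, type) pairs to counts; the function name is made up. It pads columns with spaces instead of tabs, since tabs stop lining up once entity names vary much in length.

```python
def format_summary(counts):
    """Return the entity table as a string, one row per (entity, type) pair."""
    rows = [("Entity", "Type", "Count"), ("------", "----", "-----")]
    for (entity, entity_type), n in sorted(counts.items()):
        rows.append((entity, entity_type, str(n)))
    # Pad each column to the width of its widest cell so the rows line up.
    widths = [max(len(row[i]) for row in rows) for i in range(3)]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    )


print(format_summary({("Boston", "Location"): 2, ("John Smith", "Person"): 1}))
```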

Your CSV file should have the same structure, but no fake underlining and no tabs, with fields separated by commas (compare the CSV output of the Beautiful Soup exercise). For another challenge, make sure your output can accommodate entity strings that include commas (probably by making sure all entity strings are enclosed by quotation marks or other markers that indicate the beginning and end of strings).
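
Python’s built-in csv module takes care of the quoting for you. Here’s a minimal sketch, again assuming a dictionary that maps (entity, type) pairs to counts; the function name is hypothetical. Passing csv.QUOTE_ALL wraps every field in double quotes, so an entity like “Philadelphia, Pennsylvania” stays in one column.

```python
import csv


def write_entities_csv(counts, path="entities.csv"):
    """Write (entity, type, count) rows to a CSV file with a header row."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, quoting=csv.QUOTE_ALL)
        writer.writerow(["Entity", "Type", "Count"])
        for (entity, entity_type), n in sorted(counts.items()):
            writer.writerow([entity, entity_type, n])
```

The default quoting mode (csv.QUOTE_MINIMAL) would also work, since it quotes only the fields that contain commas or quotation marks.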

Submit

Submit three files via Sakai: your Python code, your NER output (in whatever format, probably ill-formed XML), and your CSV entity output.

Consider

A few things to think about before class:

  1. How well or poorly do the entities extracted from the text square with your sense of what the text is about, whom it involves, and where it occurs (or with what areas it’s concerned)?
  2. How accurate is the NER process?
  3. How might you try to improve NER accuracy?
