What To Do With Too Much Text

Below are the slides from my talk on text mining, “What To Do with Too Much Text, or, Data Mining for the Humanities and Social Sciences,” given at the Washington University Center for Political Economy a few days ago (8 Oct. 2010). For those who weren’t there, the talk was primarily a survey of approaches to (mostly) humanities-oriented text analysis with examples drawn from literary studies, history, psychology, and political science. For a fuller treatment of the opening “Motivations” section, see this post. You might also want to check out the theoretical underpinnings of my own allegory project, about which I said relatively little.

The original slides are in Keynote and include embedded videos that don’t translate well to PowerPoint (and confuse SlideShare); rather than make a hash of things, I’ve put up a QuickTime version for people who don’t have access to Keynote. The Keynote file includes my (hopefully non-embarrassing) presenter notes, which may give a fuller sense of what I said at some points.

Below are links to the projects and tools I mentioned (roughly in order of appearance).

Projects and Works Cited


There are many, many text analysis and natural language processing tools available, many of them geared toward specific research domains. I mentioned only a comparative handful. This list is a long way from exhaustive.

All projects are free and open source unless otherwise noted.

Built Tools

Good places to start; little or no programming required.

  • Wordle. Word clouds. Noncommercial use only, I believe.
  • WordHoard. Statistics, analytics, and visualizations of classic literature.
  • GeoDict. Extract named places from unstructured text.
  • Docuscope. A semi-publicly-available tool for text analysis backed by an extensive, hand-curated dictionary.
  • Casstools.org. Contrast Analysis of Semantic Similarity. Evaluate differential word associations in text corpora.
  • Voyeur Tools. Simple, Web-based text analytics. BYO text/corpus.
  • The MONK Project. Integrated, Web-based corpus analysis. Uses only texts from the (relatively large) included corpus.
  • SEASR. Packaged text analytics and development environment aimed at scholars in the humanities. Includes Zotero integration. SEASR pushes toward a full toolkit.
  • And one tool that I didn’t have a chance to mention: Mark Olsen’s ARTFL-associated PhiloLine/PAIR. Sequence alignment detection in textual corpora; the analogy is to similar work in genetics.
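
For readers curious what "sequence alignment detection" means in practice, here is a toy sketch in Python of the general idea: find runs of words that two texts share. It is not how PhiloLine/PAIR works internally; the function names, the four-word window, and the sample sentences are all my own assumptions for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def shared_ngrams(text_a, text_b, n=4):
    """Return the set of word n-grams that occur in both texts."""
    def ngrams(tokens):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return ngrams(tokenize(text_a)) & ngrams(tokenize(text_b))

# Hypothetical sample passages, chosen only to show an overlap.
a = "To be, or not to be, that is the question."
b = "He asked whether to be or not to be was really the question."

for gram in sorted(shared_ngrams(a, b)):
    print(" ".join(gram))
```

Real alignment tools also have to cope with spelling variation, insertions, and very large corpora, none of which this sketch attempts.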

Toolkits and Development Environments

Most of these packages come with demos and tutorials that may be useful on their own, but they’re aimed at allowing you to create your own text-mining applications.

  • GATE. An advanced development environment for text analysis with included analysis routines.
  • LingPipe. Advanced, Java-based natural language processing (NLP) toolkit. Partially integrated with GATE, but also a stand-alone product. Open source, but free only if you make your output texts freely available.
  • NLTK. Well-documented, Python-based NLP toolkit. Used widely in teaching NLP.
  • MALLET. Java-based, command-line package for statistical NLP. Useful for topic modeling, among many other things.
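
To give a sense of the smallest possible "text-mining application," here is a minimal sketch in plain Python (standard library only, none of the toolkits above) that counts the most frequent words in a text after dropping a few common ones. The stop-word list, the function name, and the sample sentence are assumptions made up for illustration; the toolkits listed here do this kind of thing far more carefully.

```python
import re
from collections import Counter

# A tiny, illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it"}

def top_words(text, k=3):
    """Return the k most frequent non-stop-words as (word, count) pairs."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(k)

# Hypothetical sample text.
sample = "The whale, the whale! A ship is chasing the white whale, and the ship is slow."
print(top_words(sample, k=2))
```

Counts like these are the raw material for word clouds and for many of the statistical methods mentioned in the talk.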

Statistics Packages

These packages don’t necessarily have anything to do with natural language analysis, but they’re useful for general statistical work and visualization.

  • R. A platform for statistical computing. Baayen’s book on corpus linguistics with R is a useful introduction with a natural language focus.
  • SPSS. The long-serving standard for stats in the social sciences. Emphatically not free, but widely site-licensed.

Hope this is of some use. Drop me a line (see the “About” page) if you spot any errors or want to chat about this work.

2 thoughts on “What To Do With Too Much Text”

  1. Hello Matthew,

    I have been thinking a lot about this topic lately, and I find the idea of textual information as nested layers simultaneously intriguing and off-putting. What I find fascinating about it is the flexibility of the system and the ability to zoom in on specific elements of a text, like a play in a corpus or a line of a scene, without fully removing that element from the rest of the data as a whole. However, by zeroing in on certain features I fear that there is both a loss of information and a conscious intervention upon the data set, thereby potentially biasing the outcome.

    I suppose my main question is about what you think about too much information in a text. Is there a need or desire for a visualization tool that can incorporate every (or most) elements in a meaningful diagram or do we still need to continue collecting data through our various tools until we understand particular features better?

    I apologize if this is vague and a little unorganized. My most recent post at allistrue.org was about this which is partly why it has been on my mind. And congratulations on the job placement at Notre Dame. Best of luck to you!

    Thanks, Mike

    • Hi Mike,

      Two thoughts:

      1. Yes, zeroing in on certain features is an intervention in the data. It will always affect the outcome. But of course we do this all the time, no matter our working method. So I think the question isn’t /whether/ we do it, it’s whether we do it well or poorly, which is to say whether or not we can convince other people that what we’ve done is justified/reasonable/interesting/etc.

      2. I doubt very much that we’ll ever settle on a single tool to handle every possible visualization or analytical task. Some tools try to do a lot; others are very specialized. Each has its merits. But I think it’s just literally impossible that any one tool will cover every desirable use case. And I think a visualization that tried to incorporate every element of a text or corpus would very likely be useless. That said, I’m very sympathetic to the issue of “scalable” reading and analysis; it’s valuable to be able to move quickly between specific source material and analytical abstraction.
