Below are the slides from my talk on text mining, “What To Do with Too Much Text, or, Data Mining for the Humanities and Social Sciences,” given at the Washington University Center for Political Economy a few days ago (8 Oct. 2010). For those who weren’t there, the talk was primarily a survey of approaches to (mostly) humanities-oriented text analysis with examples drawn from literary studies, history, psychology, and political science. For a fuller treatment of the opening “Motivations” section, see this post. You might also want to check out the theoretical underpinnings of my own allegory project, about which I said relatively little.
The original slides are in Keynote and include embedded videos that don’t translate well to PowerPoint (and confuse SlideShare); rather than make a hash of things, I’ve put up a Quicktime version for people who don’t have access to Keynote. The Keynote file includes my (hopefully non-embarrassing) presenter notes, which may give a fuller sense of what I said at some points.
- The original Keynote presentation (23 MB)
- The Quicktime version (42 MB). Just slides, not a video of the talk. Click to advance through the stack.
- The plain HTML version. It lacks animations and videos, but it's a lot faster to load and doesn't require any other software.
Below are links to the projects and tools I mentioned (roughly in order of appearance).
Projects and Works Cited
- R.R. Bowker, U.S. publishing industry statistics.
- Monroe, Colaresi, and Quinn. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.” Political Analysis 16.4 (2008).
- Dan Cohen, “Searching for the Victorians.”
- Matt Jockers’ work on the geography of Irish-American literature.
- Jockers’ clustering work with Shakespeare and novel genres.
- Michael Witmore’s similar clustering studies using Docuscope. See also this draft version of Witmore and Hope’s forthcoming piece in Shakespeare Quarterly.
- John Burrows’ work on clustering novels and plays. See also many of the works cited in Burrows’ chapter.
- Elson, D. K., N. Dames, and K. R. McKeown. “Extracting Social Networks from Literary Fiction.” Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala, Sweden, 2010. 138-147. (PDF). See also my brief comments on this paper.
- Holtzman et al. on semantic measures of media bias (PDF). See also casstools.org.
- Cameron Blevins’ work on topic modeling and the diary of Martha Ballard.
Tools
There are many, many text analysis and natural language processing tools available, many of them geared toward specific research domains. I mentioned only a comparative handful. This list is a long way from exhaustive.
All projects are free and open source unless otherwise noted.
Built Tools
Good places to start; little or no programming required. (For a rough sense of the kind of counting several of these automate, see the short sketch after this list.)
- Wordle. Word clouds. Noncommercial use only, I believe.
- WordHoard. Statistics, analytics, and visualizations of classic literature.
- GeoDict. Extract named places from unstructured text.
- Docuscope. A semi-publicly-available tool for text analysis backed by an extensive, hand-curated dictionary.
- Casstools.org. Contrast Analysis of Semantic Similarity. Evaluate differential word associations in text corpora.
- Voyeur Tools. Simple, Web-based text analytics. BYO text/corpus.
- The MONK Project. Integrated, Web-based corpus analysis. Uses only texts from the (relatively large) included corpus.
- SEASR. Packaged text analytics and development environment aimed at scholars in the humanities. Includes Zotero integration. SEASR pushes toward a full toolkit.
- And one tool that I didn’t have a chance to mention: Mark Olson’s ARTFL-associated PhiloLine/PAIR. Sequence alignment detection in textual corpora; the analogy is to similar work in genetics.
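If you're curious what the simplest of these tools are doing under the hood, the word-frequency counting behind something like Wordle or Voyeur takes only a few lines of Python. This is purely an illustrative sketch, not code from any of the projects above; the file name is a placeholder.

```python
# Rough sketch of the word counting that word-cloud and frequency tools
# automate. Purely illustrative; "sample.txt" is a placeholder file name.
import re
from collections import Counter

with open("sample.txt", encoding="utf-8") as f:
    text = f.read().lower()

# Crude tokenization: keep runs of letters and apostrophes.
words = re.findall(r"[a-z']+", text)

# The twenty most frequent words; a word cloud is essentially this list
# with font size mapped to count.
for word, count in Counter(words).most_common(20):
    print(f"{count:6d}  {word}")
```

Real tools add stopword filtering, lemmatization, statistics, and visualization on top of this, which is where the packages in the next section come in.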
Toolkits and Development Environments
Most of these packages come with demos and tutorials that may be useful on their own, but they’re aimed at allowing you to create your own text-mining applications.
- GATE. An advanced development environment for text analysis with included analysis routines.
- LingPipe. Advanced, Java-based natural language processing (NLP) toolkit. Partially integrated with GATE, but also a stand-alone product. Open source, but free only if you make your output texts freely available.
- NLTK. Well-documented, Python-based NLP toolkit. Used widely in teaching NLP; see the brief sketch after this list for a taste of what it looks like in practice.
- MALLET. Java-based, command-line package for statistical NLP. Useful for topic modeling, among many other things.
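To give a flavor of working with one of these toolkits, here is a minimal NLTK sketch: tokenize a text, drop stopwords, and look at frequent content words and part-of-speech tags. It assumes NLTK is installed and the listed data packages have been downloaded; the file name is a placeholder, and none of this comes from the projects cited above.

```python
# Minimal NLTK sketch: tokenization, stopword filtering, frequency counts,
# and part-of-speech tagging. Assumes `pip install nltk` plus the one-time
# data downloads below; "novel.txt" is a placeholder file name.
import nltk
from nltk.corpus import stopwords

nltk.download("punkt")                       # sentence/word tokenizer models
nltk.download("stopwords")                   # common-word lists
nltk.download("averaged_perceptron_tagger")  # default POS tagger model

with open("novel.txt", encoding="utf-8") as f:
    text = f.read()

# Lowercased word tokens, alphabetic only
tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]

# Remove high-frequency function words
stops = set(stopwords.words("english"))
content_words = [t for t in tokens if t not in stops]

# Twenty most frequent content words
fdist = nltk.FreqDist(content_words)
print(fdist.most_common(20))

# Part-of-speech tags for the first sentence, as a taste of the NLP side
first_sentence = nltk.sent_tokenize(text)[0]
print(nltk.pos_tag(nltk.word_tokenize(first_sentence)))
```

The other toolkits work at a similar or lower level; MALLET in particular is driven from the command line rather than from Python.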
Statistics Packages
These packages don’t necessarily have anything to do with natural language analysis, but they’re useful for general statistical work and visualization.
- R. A platform for statistical computing. Baayen’s book on corpus linguistics with R is a useful introduction with a natural language focus.
- SPSS. The long-serving standard for stats in the social sciences. Emphatically not free, but widely site-licensed.
Hope this is of some use. Drop me a line (see the “About” page) if you spot any errors or want to chat about this work.
Hello Matthew,
I have been thinking a lot about this topic lately, and I find the idea of textual information as nested layers simultaneously intriguing and off-putting. What fascinates me about it is the flexibility of the system and the ability to zoom in on specific elements of a text, like a play in a corpus or a line of a scene, without entirely removing that element from the data as a whole. However, by zeroing in on certain features I fear that there is both a loss of information and a conscious intervention in the data set, thereby potentially biasing the outcome.
I suppose my main question is what you think about too much information in a text. Is there a need or desire for a visualization tool that can incorporate every element (or most elements) in a meaningful diagram, or do we still need to continue collecting data through our various tools until we understand particular features better?
I apologize if this is vague and a little disorganized. My most recent post at allistrue.org was about this, which is partly why it has been on my mind. And congratulations on the job placement at Notre Dame. Best of luck to you!
Thanks, Mike
Hi Mike,
Two thoughts:
1. Yes, zeroing in on certain features is an intervention in the data. It will always affect the outcome. But of course we do this all the time, no matter our working method. So I think the question isn’t /whether/ we do it, it’s whether we do it well or poorly, which is to say whether or not we can convince other people that what we’ve done is justified/reasonable/interesting/etc.
2. I doubt very much that we’ll ever settle on a single tool to handle every possible visualization or analytical task. Some tools try to do a lot; others are very specialized. Each has its merits. But I think it’s effectively impossible for any one tool to cover every desirable use case. And a visualization that tried to incorporate every element of a text or corpus would very likely be useless. That said, I’m very sympathetic to the issue of “scalable” reading and analysis; it’s valuable to be able to move quickly between specific source material and analytical abstraction.