What To Do With Too Much Text
October 10, 2010
Below are the slides from my talk on text mining, “What To Do with Too Much Text, or, Data Mining for the Humanities and Social Sciences,” given at the Washington University Center for Political Economy a few days ago (8 Oct. 2010). For those who weren’t there, the talk was primarily a survey of approaches to (mostly) humanities-oriented text analysis with examples drawn from literary studies, history, psychology, and political science. For a fuller treatment of the opening “Motivations” section, see this post. You might also want to check out the theoretical underpinnings of my own allegory project, about which I said relatively little.
The original slides are in Keynote and include embedded videos that don’t translate well to PowerPoint (and confuse SlideShare); rather than make a hash of things, I’ve put up a Quicktime version for people who don’t have access to Keynote. The Keynote file includes my (hopefully non-embarrassing) presenter notes, which may give a fuller sense of what I said at some points.
- The original Keynote presentation (23 MB)
- The Quicktime version (42 MB). Just slides, not a video of the talk. Click to advance through the stack.
- Plain HTML; lacks animations and videos, but it’s a lot faster to load and doesn’t require any other software.
Below are links to the projects and tools I mentioned (roughly in order of appearance).
Projects and Works Cited
- R.R. Bowker, U.S. publishing industry statistics.
- Monroe, Colaresi, and Quinn. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.” Political Analysis 16.4 (2008).
- Dan Cohen, “Searching for the Victorians.”
- Matt Jockers’ work on the geography of Irish-American literature.
- Jockers’ clustering work with Shakespeare and novel genres.
- Michael Witmore’s similar clustering studies using Docuscope. See also this draft version of Witmore and Hope’s forthcoming piece in Shakespeare Quarterly.
- John Burrows’ work on clustering novels and plays. See also many of the works cited in Burrows’ chapter.
- Elson, D. K., N. Dames, and K. R. McKeown. “Extracting Social Networks from Literary Fiction.” Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala, Sweden, 2010. 138-147. (PDF). See also my brief comments on this paper.
- Holtzman et al. on semantic measures of media bias (PDF). See also casstools.org.
- Cameron Blevins’ work on topic modeling and the diary of Martha Ballard.
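The core of the “Fightin’ Words” method cited above is simple enough to sketch: each word is scored by a z-scaled log-odds ratio between two corpora, smoothed with a Dirichlet prior built from the pooled counts. Here’s a toy Python version, under my own naming and a single uninformative `prior_scale` parameter (the paper develops several more careful variants):

```python
import math
from collections import Counter

def fightin_words(corpus_a, corpus_b, prior_scale=0.1):
    """Score words by the smoothed log-odds-ratio z-score of
    Monroe, Colaresi, and Quinn (2008). Positive scores mark words
    characteristic of corpus_a; negative, of corpus_b.

    corpus_a / corpus_b: lists of already-tokenized words.
    prior_scale: strength of the Dirichlet prior, built here
    (as a simplification) from the pooled counts.
    """
    counts_a, counts_b = Counter(corpus_a), Counter(corpus_b)
    pooled = counts_a + counts_b
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    prior = {w: prior_scale * c for w, c in pooled.items()}
    a0 = sum(prior.values())  # total prior mass
    scores = {}
    for w in pooled:
        ya, yb, aw = counts_a[w], counts_b[w], prior[w]
        # log-odds of w in each corpus, with the prior folded in
        delta = (math.log((ya + aw) / (n_a + a0 - ya - aw))
                 - math.log((yb + aw) / (n_b + a0 - yb - aw)))
        # approximate variance of the log-odds-ratio
        var = 1.0 / (ya + aw) + 1.0 / (yb + aw)
        scores[w] = delta / math.sqrt(var)
    return scores
```

On toy input like `fightin_words(["war", "war", "tax"], ["peace", "peace", "tax"])`, “war” scores positive, “peace” negative, and the shared word “tax” lands near zero — which is the whole point of the method: shared vocabulary is damped rather than surfacing as noise.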
There are many, many text analysis and natural language processing tools available, a good number of them geared toward specific research domains. I mentioned only a comparative handful, so this list is a long way from exhaustive.
All projects are free and open source unless otherwise noted.
Good places to start; little or no programming required.
- Wordle. Word clouds. Noncommercial use only, I believe.
- WordHoard. Statistics, analytics, and visualizations of classic literature.
- GeoDict. Extract named places from unstructured text.
- Docuscope. A semi-publicly-available tool for text analysis backed by an extensive, hand-curated dictionary.
- Casstools.org. Contrast Analysis of Semantic Similarity. Evaluate differential word associations in text corpora.
- Voyeur Tools. Simple, Web-based text analytics. BYO text/corpus.
- The MONK Project. Integrated, Web-based corpus analysis. Uses only texts from the (relatively large) included corpus.
- SEASR. Packaged text analytics and development environment aimed at scholars in the humanities. Includes Zotero integration. SEASR pushes toward a full toolkit.
- And one tool that I didn’t have a chance to mention: Mark Olson’s ARTFL-associated PhiloLine/PAIR. Sequence alignment detection in textual corpora; the analogy is to similar work in genetics.
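For what it’s worth, the first step behind a word-cloud tool like Wordle — counting word frequencies after dropping common function words — takes only a few lines of plain Python. This is my own minimal sketch, not Wordle’s actual pipeline, and the stoplist here is a token placeholder for the much longer lists real tools use:

```python
import re
from collections import Counter

# Tiny placeholder stoplist; real tools use hundreds of function words.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}

def word_frequencies(text, top_n=10):
    """Return the top_n most frequent non-stopword words in text,
    as (word, count) pairs -- the raw material of a word cloud."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(top_n)
```

A word cloud is then just this frequency table rendered with font size proportional to count.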
Toolkits and Development Environments
Most of these packages come with demos and tutorials that may be useful on their own, but they’re aimed at allowing you to create your own text-mining applications.
- GATE. An advanced development environment for text analysis with included analysis routines.
- LingPipe. Advanced, Java-based natural language processing (NLP) toolkit. Partially integrated with GATE, but also a stand-alone product. Open source, but free only if you make your output texts freely available.
- NLTK. Well-documented, Python-based NLP toolkit. Used widely in teaching NLP.
- MALLET. Java-based, command-line package for statistical NLP. Useful for topic modeling, among many other things.
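As a toy illustration of the kind of routine these toolkits package up, here is bag-of-words cosine similarity in plain Python — the basic distance measure underlying much of the clustering work mentioned earlier. The names are my own, and a real application would add tokenization, stoplists, and frequency normalization on top:

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two tokenized documents,
    treated as bag-of-words count vectors. Ranges from 0
    (no shared vocabulary) to 1 (identical distributions)."""
    va, vb = Counter(doc_a), Counter(doc_b)
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Feed pairwise similarities like these into any standard clustering algorithm and you have the skeleton of the experiments Jockers, Witmore, and Burrows run at scale.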
These packages don’t necessarily have anything to do with natural language analysis, but they’re useful for general statistical work and visualization.
- R. A platform for statistical computing. Baayen’s book on corpus linguistics with R is a useful introduction with a natural language focus.
- SPSS. The long-serving standard for stats in the social sciences. Emphatically not free, but widely site-licensed.
Hope this is of some use. Drop me a line (see the “About” page) if you spot any errors or want to chat about this work.