Quantitative Methods Masterclass

This is the syllabus for a masterclass on quantitative methods in humanities research. The first iteration of the class is organized by Maria Filippakopoulou and will be held September 19-21, 2016, at the University of Edinburgh.

The course surveys a few useful methods and notable results in quantitative humanities research. It runs for two and a half days in total and assumes no prior training in programming or statistical analysis, though the aim is to cover material that is of interest to experienced practitioners and newcomers alike. It is not an introduction to programming, nor a reading group, nor a tools workshop, but it incorporates a little of each of those things.

Participants who complete the class will leave with a sense of what’s possible in quantitative humanities research, how such methods might apply to their own work, and how to continue their methodological training.

Basic outline

A bird’s-eye view of the proceedings …

  • Day 1 (afternoon, about 3 hours). Overview lecture, discussion of goals, and optional software install.
  • Day 2 (full day, morning and afternoon sessions of 3 hours each). Morning: Computational thinking, text statistics, basic natural language processing, visualization. Afternoon: Named entity recognition, geocoding, mapping.
  • Day 3 (full day). Morning: Unsupervised and supervised machine learning, including word vector embedding and text classification. Afternoon: Network analysis and visualization (abbreviated session).

We’ll spend most of the time in each session talking about the ideas behind these methods, discussing the kinds of research problems to which they are well suited (with examples), and looking at code or tools to implement them.

Readings and code

Copies of the suggested readings, as well as the code used during the class, are available on GitHub.

The primary readings are three, by Wilkens, Underwood, and So and Long; each is cited under the relevant session in the details.

Links here are to the original sources, which may require a subscription or institutional access. Copies are also included in the GitHub repo.

These suggested readings are meant to provide examples of what can be done in the humanities (especially literary studies) using some of the computational techniques the class will explore. They aren’t required; you’ll be able to follow along just fine without having read them. But I hope that many participants will review them in advance and I’ll make regular reference to them as examples.

Oh, and it couldn’t hurt to have read Franco Moretti’s short book, Graphs, Maps, Trees, which remains one of the best articulations of the rationale for pursuing quantitative humanities research.

Nearly all of the code that we examine will be in Python (version 3). I don’t assume that participants are familiar with Python, nor that they know how to code in any other language. If you want a worthy introduction to programming, you might try John Guttag’s Introduction to Computation and Programming Using Python (new edition forthcoming August 2016) and the accompanying edX course.

Classroom code will be available from the GitHub repo for those who want to experiment on their own. I’ll use the Anaconda Python distribution (with Python 3.5) and Jupyter notebooks (browser-based documents that allow a nice mix of code, text, and visualizations). I’ll also show some packages (including the Stanford CoreNLP toolkit and Gephi network visualization tool) that require a Java Development Kit (JDK). Again, you don’t need to install or understand any of this to benefit from the course; you can follow along as I explain things on my computer. But if you’re comfortable installing development tools, you may get something extra out of working in parallel on your own machine.

Details

Day 1, Monday, September 19

Afternoon

Welcome and overview lecture. After introductions, we’ll spend about an hour discussing the goals and methods of the masterclass, as well as the state of quantitative humanities research today. The meeting will conclude with an optional install session, during which participants can get help setting up a Python environment and other software on their own computers.

Day 2, Tuesday, September 20

Morning

Introduction to statistical thinking and computational methods. We won’t have nearly enough time to cover statistics as such, but there are important differences between the ways in which we humanists usually think about the world and the way our objects look as collections of statistically distributed instances of observable phenomena. We’ll spend some time analyzing those differences and considering their methodological consequences. We will also introduce the rudiments of computational thinking, including variables, iteration, flow control, and abstraction. We’ll end by examining simple programs that transform texts into collections of numbers, that identify functional units such as parts of speech, and that allow us to visualize a range of textual features in a small corpus.
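To give a flavor of what "transforming texts into collections of numbers" can look like, here is a minimal sketch that uses only the Python standard library. It is not the classroom code; the sample sentence, the function names, and the very simple tokenization rule are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequencies(tokens):
    """Map each distinct word to its share of all tokens in the text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Hypothetical one-sentence "corpus", used only for illustration.
sample = "It was the best of times, it was the worst of times."
tokens = tokenize(sample)
freqs = relative_frequencies(tokens)

# Iterate over the three most common words and print count and frequency.
for word, n in Counter(tokens).most_common(3):
    print(word, n, round(freqs[word], 3))
```

Even this small example touches variables, iteration, and abstraction (the two helper functions). A real project would likely use a more careful tokenizer and a much larger corpus.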

Afternoon

Named entities, geocoding, and maps. Geospatial analysis of texts and other cultural objects has garnered much interest in recent years. We’ll examine a process by which to identify the places named in a collection of texts, pair those locations with detailed geographic information, and display the results on a map. Related discussion of geographic information systems (GIS) and statistical analysis of social change over time. Suggested reading: Wilkens.
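The three steps described above (find the places, attach coordinates, produce something a map can display) can be sketched in a few lines. Note the substitutions: this sketch replaces statistical named entity recognition with a plain lookup against a tiny hand-made gazetteer, and it replaces a geocoding service with approximate hard-coded coordinates. The gazetteer, sample sentence, and function names are all hypothetical.

```python
import json
import re

# Hypothetical mini-gazetteer: place name -> (latitude, longitude).
# Coordinates are approximate and given only for illustration.
GAZETTEER = {
    "Edinburgh": (55.95, -3.19),
    "Glasgow": (55.86, -4.25),
    "London": (51.51, -0.13),
}


def find_places(text, gazetteer):
    """Count how often each gazetteer name appears as a whole word in the text."""
    counts = {}
    for name in gazetteer:
        hits = re.findall(r"\b" + re.escape(name) + r"\b", text)
        if hits:
            counts[name] = len(hits)
    return counts


def to_geojson(counts, gazetteer):
    """Build a GeoJSON FeatureCollection with one point per place found."""
    features = []
    for name, n in counts.items():
        lat, lon = gazetteer[name]
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"name": name, "mentions": n},
        })
    return {"type": "FeatureCollection", "features": features}


# Hypothetical sample sentence.
sample = "She left Edinburgh for London, but Edinburgh stayed with her."
places = find_places(sample, GAZETTEER)
geojson = to_geojson(places, GAZETTEER)
print(json.dumps(geojson, indent=2))
```

The resulting GeoJSON can be loaded into most mapping tools. A gazetteer lookup misses unlisted places and cannot tell apart places that share a name, which is one reason trained entity recognizers are worth the extra setup.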

Day 3, Wednesday, September 21

Morning

Supervised and unsupervised machine learning. We’ll cover what’s meant by the term “machine learning” and the differences between the two classes of learning tasks. Demonstrations of supervised and unsupervised approaches to labeling textual genres. Time permitting, a discussion of advanced unsupervised techniques such as word vector embedding. Suggested reading: Underwood.
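The difference between the two classes of learning tasks can be illustrated with a toy sketch, assuming word counts as features and cosine similarity as the measure of closeness. The supervised part is a nearest-centroid classifier trained on labeled texts; the unsupervised part is a small k-means-style clustering with a naive initialization (the first k documents). The texts, labels, stopword list, and function names are hypothetical, and this is not the method used in the suggested reading.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "at", "for", "on", "about", "to"}


def vectorize(text):
    """Represent a text as counts of its non-stopword tokens."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Sum the vectors; cosine ignores scale, so the sum stands in for the mean."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def train(labeled):
    """Supervised: build one centroid per label from (text, label) pairs."""
    groups = {}
    for text, label in labeled:
        groups.setdefault(label, []).append(vectorize(text))
    return {label: centroid(vs) for label, vs in groups.items()}


def classify(text, centroids):
    """Assign the label whose centroid is most similar to the text."""
    v = vectorize(text)
    return max(centroids, key=lambda label: cosine(v, centroids[label]))


def cluster(texts, k=2, rounds=5):
    """Unsupervised: group texts into k clusters without using any labels."""
    vectors = [vectorize(t) for t in texts]
    centers = vectors[:k]  # naive initialization: the first k documents
    assignment = [0] * len(vectors)
    for _ in range(rounds):
        assignment = [max(range(k), key=lambda i: cosine(v, centers[i]))
                      for v in vectors]
        for i in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == i]
            if members:
                centers[i] = centroid(members)
    return assignment


# Hypothetical toy data.
labeled = [
    ("the detective found the clue at the crime scene", "mystery"),
    ("the starship left orbit for a distant planet", "scifi"),
    ("the inspector questioned the suspect about the crime", "mystery"),
    ("the robot crew landed on the alien planet", "scifi"),
]
centroids = train(labeled)
predicted = classify("a clue led the detective to the suspect", centroids)
groups = cluster([text for text, _ in labeled])
print(predicted, groups)
```

The classifier needs the genre labels to learn from; the clustering function sees only the texts and returns group numbers that we must interpret ourselves.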

Afternoon

An abbreviated session (1.5 hours) on network analysis and visualization. We’ll introduce basic network concepts and combine them with text analysis to produce a visualization of the relationships within a group of texts and to calculate measures that help us identify especially important features of that network. Suggested reading: So and Long.
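As a rough illustration of combining text analysis with network concepts, the sketch below links names that appear in the same passage and then computes degree centrality, one simple measure of a node's importance. It assumes the names have already been extracted from each passage; the passages, names, and function names are hypothetical, and the session itself may use other tools (such as Gephi) and other measures.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical passages, each reduced to the names mentioned in it.
passages = [
    ["Anna", "Boris", "Clara"],
    ["Anna", "Dmitri"],
    ["Boris", "Clara"],
    ["Anna", "Clara"],
]


def build_network(passages):
    """Link every pair of names sharing a passage; weight = passages shared."""
    weights = defaultdict(int)
    for names in passages:
        for a, b in combinations(sorted(set(names)), 2):
            weights[(a, b)] += 1
    return dict(weights)


def degree_centrality(edges):
    """Fraction of the other nodes to which each node is directly linked."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


edges = build_network(passages)
centrality = degree_centrality(edges)

# List nodes from most to least connected.
for node, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 2))
```

Degree centrality ignores edge weights and counts only direct links; weighted or path-based measures would draw on the `edges` weights as well.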

Public lecture

Wednesday’s session will be followed by an evening public lecture, including a response from the excellent Prof. Jonathan Hope (Strathclyde).