Multilingual NER

Last week I finished a fellowship proposal to fund work on geolocation extraction across the whole of the HathiTrust corpus. It’s a big project and I’m excited to start working on it in the coming months.

One thing that came up in the course of polishing the proposal—but that didn’t make it into the finished product—is how volumes in languages other than English might be handled. The short version is that the multilingual nature of the HathiTrust corpus opens up a lot of interesting ground for comparative analysis without posing any particular technical challenges.

In slightly more detail: there are a fair number of HathiTrust volumes in languages other than English. The majority of HT’s holdings are English-language texts, but even 10 or 20% of nearly 11 million books is a lot. Fortunately, handling those volumes is less of an issue than it might appear. You won’t get good performance running a named entity recognizer trained on English data over non-English texts, but all you need to do is substitute a language-appropriate NER model, of which there are many, especially for the European languages that make up the great bulk of HT’s non-English holdings. And it’s not hard at all to identify the language in which a volume is written, whether from metadata records or by examining its content (stopword frequency is especially quick and easy). In fact, you can do that all the way down to the page level, so it’s possible to treat volumes with mixed-language content in a fine-grained way.
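For concreteness, here’s a minimal sketch of the stopword-frequency approach. The tiny stopword lists, the threshold, and the sample page are illustrative stand-ins rather than anything I’d actually use; a real version would pull fuller lists (NLTK’s, say) and tune the cutoff against known pages.

```python
# Minimal sketch: guess a page's language by counting stopword hits.
# The stopword sets and the min_hits threshold are illustrative only.
import re

STOPWORDS = {
    "english": {"the", "and", "of", "to", "in", "that", "is", "was"},
    "french":  {"le", "la", "les", "et", "de", "que", "est", "dans"},
    "german":  {"der", "die", "und", "das", "nicht", "ist", "ein", "zu"},
}

def guess_language(text, min_hits=3):
    """Return the language whose stopwords appear most often, or None."""
    tokens = re.findall(r"[a-zäöüéèêàçß]+", text.lower())
    scores = {
        lang: sum(1 for tok in tokens if tok in words)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else None

# Route each page to a language-appropriate NER model based on the guess.
page = "Der alte Mann ging langsam durch die Stadt und sah das Meer."
print(guess_language(page))  # -> 'german'
```

Because the check runs page by page, mixed-language volumes fall out of it for free: each page just gets handed to whichever model its stopwords point to.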

About the only difference between English and other languages is that I won’t be able to supply as much of my own genre- and period-specific training data for non-English texts, so performance on non-English volumes published before about 1900 may be a bit lower than for volumes in those languages published in the twentieth century (the available models are trained almost exclusively on contemporary sources). On the other hand, NER is easier in many languages than it is in English, because they’re more strongly inflected and/or rule-bound, so this may not be much of a problem. And in any case, the bulk of the holdings in all languages are post-1900. When it comes time to match extracted locations to specific geographic data via Google’s geocoding API, handling non-English strings is just a matter of supplying the correct language setting with the API request.
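To make that last point concrete, here’s a rough sketch of what such a request might look like against the JSON endpoint of Google’s Geocoding API, using the requests library. The API key and place name are placeholders, and the exact parameters would depend on the real pipeline:

```python
# Sketch: geocode a non-English place name by passing the `language`
# parameter to Google's Geocoding API. YOUR_API_KEY is a placeholder.
import requests

def geocode(place, language="de", api_key="YOUR_API_KEY"):
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/geocode/json",
        params={"address": place, "language": language, "key": api_key},
    )
    resp.raise_for_status()
    results = resp.json().get("results", [])
    if not results:
        return None
    loc = results[0]["geometry"]["location"]
    return loc["lat"], loc["lng"]

print(geocode("München", language="de"))  # -> roughly (48.14, 11.58)
```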

Anyway, fun stuff and a really exciting opportunity …

HTRC UnCamp Keynote

I’m giving a keynote address at the upcoming HathiTrust Research Center UnCamp (September 8-9 at UIUC). My talk aside, the event looks really cool. I attended last year and learned a lot about both the technical details of using the HTRC’s resources and the longer-range plans of the center. Highly recommended if you’re anywhere nearby (or even if you’re not).

More information, including registration details, is available at the link above. Registration closes August 31. My talk is at 8:30 am (Central time) on Monday, September 9. I don’t know whether it’ll be streamed or otherwise made available at some point. I’ll be talking about the newest results from the literary geography and demographics work, including some full-on statistical modeling of the relationships between geographic attention and multiple socioeconomic variables. Which reminds me that I should put at least some of the prettier pictures up on the blog sometime …

[Update: Abstracts and slides for my talk and for Christopher Warren’s (on the “Six Degrees of Francis Bacon” project) are now available at the conference site linked above.]

Racial Dotmap

A few days back, I tweeted about the Racial Dotmap, a really cool GIS project by Dustin Cable of the Weldon Cooper Center for Public Service at UVa. The map shows the distribution (down to the block level) of US population by race according to the 2010 census. There’s a fuller explanation on the Cooper Center’s site.

The map is fascinating stuff — I lost most of a morning browsing around it. Really, you should check it out. To give you an idea of what you’ll find, here are a couple of screen grabs:

The eastern US (click for live version):

[Screenshot: the eastern US]

South Bend, Indiana (with Notre Dame). Not clickable, alas, but you can find it from the main map:
[Screenshot: South Bend, Indiana]

One of the things that’s especially appealing about the project is how open it is. The code is posted on GitHub and the underlying data comes from the National Historical Geographic Information System. That fact, along with a suggestion by Nathan Yau of FlowingData, made me wonder how much effort would be involved in creating a version of the map that would allow users to move between historical censuses. It would be really helpful to have an analogous picture for the nineteenth century as I work on the evolution of literary geography during that period.

If I were cooler than I am, this would be where I’d reveal that I had, in fact, created such a thing. I am not that cool. But I wanted to flag the possibility for future use by me or my students or anyone else who might be so inclined. I’m thinking of at least looking into this as a group project for the next iteration of my DH seminar.

I can imagine two big difficulties straight away:

  1. You’d need historical geo data, particularly block- or tract-level shapefiles. I have no idea how much census block boundaries have changed over time, nor whether such historical shapefiles exist. Seems like they should, but …
  2. You’d need the historical census info to be tabulated and available in a way that allows it to be dropped into the existing code or translated into an analogous form. I haven’t looked at that data, so I don’t know how much work would be involved. (The dot-generation step itself is simple enough; see the sketch after this list.)
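To be clear, what follows isn’t the Dotmap’s actual pipeline (its real code is on GitHub); it’s just a sketch of the core dot-density step any historical version would need, assuming geopandas/shapely, a shapefile of tract polygons, and a joined population column. The file and column names are hypothetical, and I’ve left out the by-race coloring that makes the real map interesting; you’d repeat the step once per racial category.

```python
# Sketch of the dot-density step: scatter one random point per
# PEOPLE_PER_DOT residents inside each census polygon. Not the Dotmap's
# actual code; file and column names are hypothetical.
import random

import geopandas as gpd
from shapely.geometry import Point

PEOPLE_PER_DOT = 25

def random_points_in(polygon, n):
    """Rejection-sample n random points inside a polygon."""
    minx, miny, maxx, maxy = polygon.bounds
    points = []
    while len(points) < n:
        p = Point(random.uniform(minx, maxx), random.uniform(miny, maxy))
        if polygon.contains(p):
            points.append(p)
    return points

# Hypothetical inputs: tract polygons with a 'population' column.
tracts = gpd.read_file("historical_tracts_1880.shp")
dots = []
for _, tract in tracts.iterrows():
    n_dots = int(tract["population"] // PEOPLE_PER_DOT)
    dots.extend(random_points_in(tract.geometry, n_dots))

gpd.GeoDataFrame(geometry=dots, crs=tracts.crs).to_file("dots_1880.shp")
```

Everything hard lives upstream of this step: getting the historical boundaries and the matching population tables in the first place, which is exactly what the two difficulties above are about.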

Anyway, the Racial Dotmap is a great project to which I hope to be able to return in the future. In the meantime, enjoy!

Update: Mostly for my own future reference, see also MetroTrends’ Poverty and Race in America, Then and Now, which focuses on people below the poverty line and has a graphical slider to compare geographic distributions by race from 1980 through 2010. Click through for the full site.

[Screenshot: Poverty and Race in America, Then and Now]

Video of My Talk on Geolocation at Illinois

I gave a talk on my recent work — titled “Where Was the American Renaissance: Computation, Space, and Literary History in the Civil War Era” — as part of the Uses of Scale planning meeting at Illinois earlier this month. Ted Underwood — convener of the meeting and driving force behind the Uses of Scale project — has posted a video of the event, which includes my talk as well as Ted’s extended intro and a follow-up round table discussion on future directions in literary studies.

The event was lovely; my thanks to Ted for the invitation, to the attendees for some very useful discussion, and to the Mellon Foundation and the University of Illinois for funding the Uses of Scale project, with which I’ve been involved as a co-PI over the past year.

Matthew Sag at Notre Dame, Friday 4/12/13

Matthew Sag, Associate Professor of Law at Loyola University Chicago, will be visiting Notre Dame this Friday (4/12) to give a lunchtime talk on copyright, text analysis, and the legal issues involved in digital humanities research. (Practical details below.)

Professor Sag has written widely on intellectual property law and was the lead author of an influential amicus brief in the recent HathiTrust case that cleared the way for “nonconsumptive” computational use of large digital archives. He’s an important thinker doing work in an area of law that touches more of us in the humanities every day.

All are welcome; hope you can join us. Light lunch will be served. Please do feel free to pass along word to anyone who might be interested!

Professor Sag’s visit is sponsored by the Notre Dame Working Group on Computational Methods in the Humanities and Sciences with generous support from the Office of the Provost.

Details …

  • Who: Matthew Sag (Loyola University Chicago School of Law)
  • What: A talk on — and discussion of — copyright and humanities research
  • Where: LaFortune Gold Room (3rd floor; campus map)
  • When: Friday, April 12, 11:45 am – 1:00 pm

For more information, contact Matt Wilkens (mwilkens@nd.edu).

Nathan Jensen on “Big” Data

An interesting post from Nathan Jensen, a political scientist at Wash U, on the practicalities of working with non-public datasets (via @Ted_Underwood). Worth a read; here are the two main takeaways:

… theory is even more important when using “big data”. You can only really harness the richness of complicated micro data if you have clear micro theories.

Barriers to entry can create rents for a researcher, but they also make it much more difficult to replicate your results. This means that journal reviewers and grant reviewers can hold this against you, and the ultimate impact of your work might be lower. This isn’t a suggestion. It is a warning.

That second point’s a big one.

Kuhn on the comparative difficulty of the disciplines

Noted for my own future use:

Unlike the engineer, and many doctors, and most theologians, the scientist need not choose problems because they urgently need solution and without regard for the tools available to solve them. In this respect, also, the contrast between natural scientists and many social scientists proves instructive. The latter often tend, as the former almost never do, to defend their choice of a research problem—e.g., the effects of racial discrimination or the causes of the business cycle—chiefly in terms of the social importance of achieving a solution. Which group would one then expect to solve problems at a more rapid rate? (Kuhn, Structure of Scientific Revolutions, 164).

How much has this changed as funding in the sciences has moved away from basic research?