Last week I finished a fellowship proposal to fund work on geolocation extraction across the whole of the HathiTrust corpus. It’s a big project and I’m excited to start working on it in the coming months.
One thing that came up in the course of polishing the proposal—but that didn’t make it into the finished product—is how volumes in languages other than English might be handled. The short version is that the multilingual nature of the HathiTrust corpus opens up a lot of interesting ground for comparative analysis without posing any particular technical challenges.
In slightly more detail: there are a fair number of HathiTrust volumes in languages other than English. The majority of HT’s holdings are English-language texts, but even 10 or 20% of nearly 11 million books is a lot. Fortunately, this is less of an issue than it might appear. You won’t get good performance running a named entity recognizer trained on English data over non-English texts, but all you need to do is substitute a language-appropriate NER model, of which there are many, especially for the European languages that make up the bulk of HT’s non-English holdings. And it’s not hard to identify the language in which a volume is written, whether from metadata records or by examining its content (stopword frequency is especially quick and easy). In fact, language identification works all the way down to the page level, so volumes with mixed-language content can be treated in a fine-grained way.
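The stopword-frequency check mentioned above can be sketched in a few lines. This is a minimal illustration, not a production identifier: the stopword lists here are tiny samples (real lists run to hundreds of words, and a library would do this more robustly), and the scoring is a simple count of stopword hits per candidate language.

```python
# Minimal sketch of stopword-based language identification.
# The stopword sets below are tiny illustrative samples only.

STOPWORDS = {
    "en": {"the", "of", "and", "to", "in", "that", "is", "was"},
    "fr": {"le", "la", "les", "de", "et", "un", "une", "que"},
    "de": {"der", "die", "das", "und", "ein", "nicht", "ist", "zu"},
}

def guess_language(text):
    """Return the language whose stopwords cover the most tokens."""
    tokens = text.lower().split()
    if not tokens:
        return None
    scores = {
        lang: sum(1 for t in tokens if t in words)
        for lang, words in STOPWORDS.items()
    }
    return max(scores, key=scores.get)

print(guess_language("die Königin und der König"))        # "de"
print(guess_language("la maison de la reine et le roi"))  # "fr"
```

Because the check is cheap, it can be run per page rather than per volume, which is what makes the fine-grained handling of mixed-language volumes practical.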
About the only difference between English and other languages is that I won’t be able to supply as much of my own genre- and period-specific training data for non-English texts, so performance on non-English volumes published before about 1900 may be a bit lower than on twentieth-century volumes in those languages (the available models are trained almost exclusively on contemporary sources). On the other hand, NER is easier in many languages than in English because they’re more strongly inflected and/or rule-bound, so this may not be much of a problem. And in any case, the bulk of the holdings in all languages are post-1900. When it comes time to match extracted locations to specific geographic data via Google’s geocoding API, handling non-English strings is just a matter of supplying the correct language setting with the API request.
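Supplying that language setting amounts to one extra query parameter. The sketch below just builds the request URL (no request is sent, and the placeholder key is hypothetical); the `address`, `language`, and `key` parameters follow the Geocoding API’s documented query interface.

```python
# Sketch: attach a language setting to a Google Geocoding API request.
# Only constructs the URL; no network call is made here.
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"

def geocode_url(place, language, api_key="YOUR_KEY"):
    """Return a geocoding request URL for a place name in a given language."""
    params = {"address": place, "language": language, "key": api_key}
    return GEOCODE_ENDPOINT + "?" + urlencode(params)

# A German-language place string gets language="de" on the request:
print(geocode_url("München", language="de"))
```

The same extraction pipeline can then be reused across languages, with only the `language` value varying per volume (or per page).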
Anyway, fun stuff and a really exciting opportunity …
5 thoughts on “Multilingual NER”
I’d love to know what you’re using for NER of place names and what training data you’re using. Do you have “period specific training data” for English before 1900?
Indeed we do, and will be posting it publicly in a few weeks. About half a million tokens (so far) from a cross section of American fiction published between 1790 and 1875. I’ll put a note on the blog when it’s ready.
Thanks. My own project deals with older texts and I’ve been struggling with the creation of enough useful training data for NER. So, I’d love to see the format of your training data.
Sure thing. As a preview, the format is one token per line, each followed by its tag:

[token]	[tag]

Where [tag] is one of PERS, ORG, LOC, or O (for other), so a three-class model. The tokenized text is produced from plain-text sources using Stanford’s tokenizer.
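Assuming the usual one-token-per-line, tab-separated layout that Stanford’s NER tools consume (with blank lines as sentence boundaries — an assumption, since the comment doesn’t specify the separator), a reader for such data is short:

```python
# Sketch: parse token/tag training data in a one-token-per-line TSV
# layout (token, a tab, then PERS/ORG/LOC/O), with blank lines
# assumed to separate sentences.

def read_ner_data(lines):
    """Group (token, tag) pairs into sentences."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:            # blank line = sentence boundary
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.split("\t")
        current.append((token, tag))
    if current:                 # flush the final sentence
        sentences.append(current)
    return sentences

sample = ["Boston\tLOC", "in\tO", "1850\tO", "", "Melville\tPERS"]
print(read_ner_data(sample))
```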
Hope it’s useful for you when it’s ready. My students are working on the final version now.
Thanks. I appreciate the help.