A few quick follow-ups on the series of tagger comparison posts.
One of the limitations (nigh unto embarrassments) of the comparison series was the limited number of packages I examined. This was due to a combination of limited time and early unfamiliarity with the options, but it’s clear in retrospect that there are a few more that should have found a place in the roundup. It’s my hope that I’ll get a chance to look at some of these more closely in the future, but it will probably be a while before that’s a realistic possibility. In the meantime, some notes and links:
OpenNLP is a suite of Java-based, open source (LGPL) tools for natural language processing. Tom Morton, the project’s maintainer and lead developer, passed along some impressive numbers for speed (in line with what I saw for LingPipe and MorphAdorner) and accuracy (98.35% on the Brown corpus, 96.82% on WSJ). It’s threadsafe and has what appear to be modest memory requirements. I haven’t had a chance to test it myself, but I hope to in the future. In the meantime, it certainly seems worth a close look for anyone doing work like mine.
I’ve mentioned NLTK in the past, so will just reiterate that it looks especially useful to those who, like me, are new to NLP (though it’s certainly not limited to that audience). Bob Carpenter also mentions that they have a book on NLP coming out soon with O’Reilly; the full text is already available under a CC license on their site.
And some further links of interest:
- MALLET (Machine Learning for Language Toolkit). Quoth their page: “MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.”
- MinorThird: “MinorThird is a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text.” It looks like annotation and visualization are particular emphases:
Minorthird’s toolkit of learning methods is integrated tightly with the tools for manually and programmatically annotating text. Additionally, Minorthird differs from existing NLP and learning toolkits in a number of ways:
- Unlike many NLP packages (eg GATE, Alembic) it combines tools for annotating and visualizing text with state-of-the art learning methods.
- Unlike many other learning packages, it contains methods to visualize both training data and the performance of classifiers, which facilitates debugging. Unlike other learning packages less tightly integrated with text manipulation tools, it is possible to track and visualize the transformation of text data into machine learning data.
- Unlike many packages (including WEKA), it is open-source, and available for both commercial and research purposes.
- Unlike any open-source learning systems I know of, it is architected to support active learning and on-line learning, which should facilitate integration of learning methods into agents.
There are doubtless others that I’ve overlooked, but these are enough to keep me busy for the time being.
Helmut Schmid, the developer of TreeTagger, wrote to let me know that TreeTagger remains under active development (and to give me a few pointers on how best to avoid some of the difficulties I had with it). Good to know; I’ll update the earlier posts accordingly.
The AMALGAM Project
The AMALGAM Project is an attempt (more rigorously worked out) to do the type of tagset mapping that I performed in the bag-of-tags trials. The “multitagged” corpus they’ve produced is pretty small (180 sentences), but I was pleased to see that they concluded more or less what I did: It’s hard to map one tagset onto another (see, e.g., “A comparative evaluation of modern English corpus grammatical annotation schemes” [PDF]). Still, an interesting project, and as I say, undertaken in more depth than my own preliminary trials.
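To make the mapping difficulty concrete, here is a minimal sketch in Python. It is not the AMALGAM method, nor the code used in the bag-of-tags trials. The table `BROWN_TO_PENN` is a small, hand-picked, simplified fragment chosen for illustration, and the helper names `map_tags` and `ambiguous_targets` are hypothetical. The point is only that when two source tags collapse into one target tag, the mapping loses information and cannot be cleanly reversed.

```python
from collections import defaultdict

# Illustrative fragment only: a few Brown-style tags and the Penn-style
# tags they are commonly folded into. A real mapping is far larger and
# involves many judgment calls.
BROWN_TO_PENN = {
    "CS": "IN",  # subordinating conjunction (Penn folds these into IN)
    "IN": "IN",  # preposition
    "AT": "DT",  # article
    "DT": "DT",  # determiner
    "NN": "NN",  # singular common noun
    "TO": "TO",  # infinitival "to"
}


def map_tags(tagged, mapping, unknown="UNK"):
    """Rewrite (word, tag) pairs using the mapping; unmapped tags become `unknown`."""
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged]


def ambiguous_targets(mapping):
    """Return target tags that more than one source tag maps onto."""
    inverse = defaultdict(set)
    for source, target in mapping.items():
        inverse[target].add(source)
    return {t: sorted(s) for t, s in inverse.items() if len(s) > 1}


if __name__ == "__main__":
    sentence = [("the", "AT"), ("dog", "NN"), ("because", "CS")]
    print(map_tags(sentence, BROWN_TO_PENN))
    print(ambiguous_targets(BROWN_TO_PENN))
```

Going from the finer-grained tags to the coarser ones is mechanical, but `ambiguous_targets` shows where the reverse direction would need context (or a human) to decide, which is roughly where any attempt at a one-to-one tagset mapping runs into trouble.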