Contemporary Novel

I’m working on a contemporary American novel syllabus for the fall semester. More details when it’s at least fully drafted, but in the meantime, I have a question.

There’s one slot that I’m not yet happy with; does anyone have a suggestion for an especially interesting/compelling/quirky/forward-looking novel by an American woman published after 2000? I have a couple of candidates, but nothing that seems perfect. Thoughts?

[Oh and yes, I’m aware that the last time I asked a question (about POS taggers) on this blog, I spent three months answering it. I’m aiming for that not to happen this time around.]

Evaluating POS Taggers: Coda

A few quick follow-ups on the series of tagger comparison posts.

Other Taggers

One of the limitations (nigh unto embarrassments) of the comparison series was the limited number of packages I examined. This was due to a combination of limited time and early unfamiliarity with the options, but it’s clear in retrospect that there are a few more that should have found a place in the roundup. It’s my hope that I’ll get a chance to look at some of these more closely in the future, but it will probably be a while before that’s a realistic possibility. In the meantime, some notes and links:

OpenNLP

OpenNLP is a suite of Java-based, open source (LGPL) tools for natural language processing. Tom Morton, the project’s maintainer and lead developer, passed along some impressive numbers for speed (in line with what I saw for LingPipe and MorphAdorner) and accuracy (98.35% on the Brown corpus, 96.82% on WSJ). It’s threadsafe and has what appear to be modest memory requirements. I haven’t had a chance to test it myself, but I hope to in the future. In the meantime, it certainly seems worth a close look for anyone doing work like mine.

NLTK

I’ve mentioned NLTK in the past, so will just reiterate that it looks especially useful to those who, like me, are new to NLP (though it’s certainly not limited to that audience). Bob Carpenter also mentions that they have a book on NLP coming out soon with O’Reilly; the full text is already available under a CC license on their site.

Others

And some further links of interest:

  • MALLET (Machine Learning for Language Toolkit). Quoth their page: “MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.”
  • MinorThird: “MinorThird is a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text.” It looks like annotation and visualization are particular emphases:

    Minorthird’s toolkit of learning methods is integrated tightly with the tools for manually and programmatically annotating text. Additionally, Minorthird differs from existing NLP and learning toolkits in a number of ways:

    • Unlike many NLP packages (eg GATE, Alembic) it combines tools for annotating and visualizing text with state-of-the art learning methods.
    • Unlike many other learning packages, it contains methods to visualize both training data and the performance of classifiers, which facilitates debugging. Unlike other learning packages less tightly integrated with text manipulation tools, it is possible to track and visualize the transformation of text data into machine learning data.
    • Unlike many packages (including WEKA), it is open-source, and available for both commercial and research purposes.
    • Unlike any open-source learning systems I know of, it is architected to support active learning and on-line learning, which should facilitate integration of learning methods into agents.

There are doubtless others that I’ve overlooked, but these are enough to keep me busy for the time being.

TreeTagger

Helmut Schmid, the developer of TreeTagger, wrote to let me know that TreeTagger remains under active development (and to give me a few pointers on how best to avoid some of the difficulties I had with it). Good to know; I’ll update the earlier posts accordingly.

The AMALGAM Project

The AMALGAM Project is an attempt (more rigorously worked out) to do the type of tagset mapping that I performed in the bag-of-tags trials. The “multitagged” corpus they’ve produced is pretty small (180 sentences), but/and I guess I was pleased to see that they concluded more or less what I did: It’s hard to map one tagset onto another (see, e.g., “A comparative evaluation of modern English corpus grammatical annotation schemes” [PDF]). Still, an interesting project, and as I say, undertaken in more depth than my own preliminary trials.

Evaluating POS Taggers: Conclusions

OK, I’m as done as I care to be with the evaluation stage of this tagging business, which has taken the better part of three months of intermittent work. This for a project that I thought would take a week or two. There’s a lesson here, surely, about research work in general, my coding skills in particular, and the (de)merits of being a postdoc.

In short: I’m going to use MorphAdorner for my future work. The good news overall, though, is that several of the other taggers would also be adequate for my needs, if necessary.

Here’s a summary of the considerations that influenced my decision:

Accuracy

This is probably the most important issue, but it’s also the most difficult for me to assess. The underlying algorithms that each tagger implements make a difference, but I’m really not qualified to evaluate the relative merits of hidden Markov models vs. decision trees, for example, nor the quality of the code in each package.

What I do have is cross-validation results, and the deeply inconclusive bag-of-tags trials I’ve described previously. My own cross-validation tests tend to confirm what the projects themselves claim, namely that they’re about 97% accurate on average. Or at least that’s what I saw for MorphAdorner (97.1%) and LingPipe (97.0%); I wasn’t able to retrain the Stanford tagger on my reference data, so I can’t do anything other than accept Stanford’s reported cross-validation numbers on non-literary data, which are 96.9% (for the usably speedy left3words model) and 97.1% (for the very slow bidirectional model). TreeTagger fared very poorly in my cross-validation tests, though there may well have been problems on my end that explain that fact. Still, I’d be reluctant to use it for real work when trained (by me) on something other than its stock (non-literary) corpus; otherwise, TreeTagger’s self-reported cross-validation accuracy is 96.4-96.8%.
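
(For anyone new to the procedure: k-fold cross-validation just means splitting the reference data into k pieces, training on all but one, testing on the holdout, and rotating. A minimal Python sketch of the idea follows; train_fn and tag_fn stand in for whichever package is under test — they’re placeholders of mine, not any package’s actual API.)

def cross_validate(sentences, train_fn, tag_fn, k=10):
    """Average per-token tagging accuracy over k folds.

    sentences: list of [(token, gold_tag), ...] lists.
    train_fn:  builds a model from a list of tagged sentences (placeholder).
    tag_fn:    tags a list of tokens with a model (placeholder).
    """
    fold = len(sentences) // k
    scores = []
    for i in range(k):
        held_out = sentences[i * fold:(i + 1) * fold]
        training = sentences[:i * fold] + sentences[(i + 1) * fold:]
        model = train_fn(training)
        correct = total = 0
        for sent in held_out:
            tokens = [tok for tok, _ in sent]
            gold = [tag for _, tag in sent]
            pred = tag_fn(model, tokens)
            correct += sum(g == p for g, p in zip(gold, pred))
            total += len(gold)
        scores.append(correct / total)
    return sum(scores) / len(scores)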

As I suggested in my last post, I don’t think the bag-of-tags trials told me much about the relative accuracy of the various taggers, except concerning the larger point that it’s (fundamentally) hard to translate between tagsets. That’s an important thing to know, but I don’t take MorphAdorner’s superiority in those tests as an indication that it’s necessarily more accurate than the others in the general case. I do, however, now understand better MorphAdorner’s performance characteristics on a reduced version of its tagset (it’s about 99% accurate in sum on the larger part-of-speech classes).

There are a priori reasons to think that MorphAdorner might do better out of the box on a literary corpus, since it’s trained on such data and uses a tagset specifically geared to literary-historical usage, but those are issues better addressed separately below; I wouldn’t say that I’ve managed to provide a posteriori support for them here.

Tagset

This matters much more than I imagined at the outset, since it determines not only the level of detail you can investigate, but also the kinds of information that are preserved or lost in a tagger’s output.

Out of the box, then, I think MorphAdorner and NUPOS win for literary work, with LingPipe/Brown a reasonably close second. Stanford and TreeTagger use the significantly smaller Penn tagset, which seems less suitable for my needs. One of the things I learned from the bag-of-tags work was that there isn’t any apparent benefit to working with a reduced tagset from the beginning; there’s no evidence in the trials I’ve run that the increased number of tokens in each of a smaller number of classes in such a tagset provides greater big-picture accuracy as compensation for reduced distributional information. I’d want to run the reverse trial of MorphAdorner over the Brown training corpus to make this claim with more confidence, but for now, that’s how things look.

The good news, of course, is that you can in principle retrain any of the taggers on a reference corpus tagged in any tagset, so you can switch back and forth between them. That is, provided you have access to the appropriate training data. Now, I do have access to MorphAdorner’s training data, but I don’t have the right to redistribute it, and I’m not sure what would be involved in doing so, were it necessary. And there’s a certain amount of work and computational horsepower involved in performing the retraining. Assuming I want to use NUPOS (and I do; see Martin Mueller’s NUPOS primer/rationale), MorphAdorner is the easiest way to get it.

Training Data

MorphAdorner is trained on a strictly literary corpus, both British and American, that spans the early modern period through the twentieth century. LingPipe uses the Brown corpus via NLTK, which has a good deal of fiction, but is certainly not exclusively literary. Stanford and TreeTagger use the Wall Street Journal (Penn treebank corpus). If each of the training corpora has been tagged with equal accuracy, one would expect MorphAdorner’s corpus to be best suited to arbitrary literary work, though it has the drawback of not being freely redistributable. I’m not sure if this is likely ever to change, as I’m told that although the works themselves are long out of copyright, the texts were originally derived at least in part from commercial sources like Chadwyck-Healey. It’s something to be aware of, but so long as the compiled model can be passed along (and it obviously can be, since it’s included with the base distribution), it will be possible for others to replicate my work. I’m not sure what, if any, issues I’d encounter if I were to reuse the training data with another tagger, though I don’t see why they’d be much different from those involving MorphAdorner itself.

MorphAdorner’s tokenizer and lemmatizer are also intended to deal accurately and efficiently with the vagaries of early modern orthography, which is certainly a plus.

In any case, one of the advantages of using MorphAdorner is that I don’t have to think about this stuff, nor do I have to work on retraining another tagger, nor do I have to worry about trying to pick up and replicate any improvements to, or refinements of, MorphAdorner’s training data that might happen down the road.

Speed

I didn’t imagine at the outset that speed would fall so far down my list of considerations, but I think this is the right place for it in my own usage scenario. As I mentioned in an earlier post, speed is a qualitative threshold issue for me. Quoting that post:

Faster is better, but the question isn’t really “which tagger is fastest?” but “is it fast enough to do x” or “how can I use this tagger, given its speed?” I think there are three potentially relevant thresholds:

  1. As a practical matter, can I run it once over the entire corpus?
  2. Could I retag the corpus more or less on whim, if it were necessary or useful?
  3. Could I retag the corpus in real time, on demand, if there turned out to be a reason to do so?

The first question is the most important one: I definitely have to be able to tag the entire corpus once. The others would be nice, especially #2, but I don’t yet have a compelling reason to require them.

To recap my earlier conclusions, TreeTagger, LingPipe, and MorphAdorner are all fast enough to meet thresholds 1 and 2. Stanford, using the (slightly less accurate) left3words model, meets threshold 1 (tag everything once) and might meet number 2 (tag it again) in a pinch. Stanford bidirectional would be a real stretch to run over a large corpus even once on moderate (read: affordable and accessible to a humanist) hardware. None of the taggers is fast enough on my hardware for full on-demand work, though it’s worth recalling that I made no real attempts to optimize for speed (Bob Carpenter at Alias-i reports 100x speedups of LingPipe are possible with some tweaking). But this on-demand business is a theoretical rather than an immediately practical issue for my work, so I don’t attach much weight to it.
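
The threshold arithmetic itself is trivial. Here’s a sketch, using the size of the reference corpus from this series and two purely illustrative throughput rates (the faster one roughly in line with the LingPipe speeds reported elsewhere in this series; the slower one is invented for contrast):

corpus_tokens = 3_850_873   # tokens in the reference corpus used in this series

# Purely illustrative rates (tokens/second), not measured figures.
for label, rate in [("fast tagger", 29_000), ("slow tagger", 300)]:
    hours = corpus_tokens / rate / 3600.0
    print(f"{label}: {hours:.2f} hours per full pass over the corpus")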

License, Source Code, and Cost

Here it makes sense to break things down case by case:

MorphAdorner: Not yet generally available, but forthcoming when Phil Burns finishes the documentation, probably mid-late February of this year (2009). To be released under a modified NCSA license, freely redistributable with attribution. The same can’t be said of the raw training data, I think, but the compiled models will be included. No cost, all source code available. Under active development.

LingPipe: The only commercial offering of the bunch. Open source and free to use, provided you make all tagged output available in turn. That wouldn’t be a problem for me at the moment, when I’m looking to work with freely available texts (Gutenberg, etc.), but could be a limitation later if/when I use copyrighted corpora. Exceptions to the redistribution requirement are available for sale from Alias-i; they can be modest to pricey in the context of grant-challenged academic humanities, though Bob has suggested that there may be flexibility in their licensing for academics. In any case, I don’t doubt that I could make it work in my own case, but I have at least minor reservations about the impact of using commercial tools on the subsequent adoption of my methods. The ideal case would be for anyone who’s interested to pick up my toolset with the fewest possible encumbrances. That’s not to say there aren’t issues with the other packages’ licensing terms (and probably more importantly with copyright issues involving my working corpora), nor that I object to Alias-i’s business model (which I think is an eminently reasonable compromise between openness and the need to feed themselves), but it’s a consideration. Under very active development, and with outstanding support from the lead developer (the aforementioned Bob Carpenter).

Stanford: Like the other academic packages, open source and free software. Licensed under GPL2. Under active development.

TreeTagger: Distributed free for “research, teaching, and evaluation purposes.” No right to redistribute. No source code available and not under active development, as far as I can tell. [Update: Helmut Schmid writes to tell me that TreeTagger is indeed still under development.]

Other Considerations

There were a few other minor concerns and thoughts.

Threadsafeness can be an issue for the Java-based taggers (that is, all but TreeTagger). LingPipe is threadsafe. Stanford is not. MorphAdorner, I don’t know. This isn’t an immediate concern, since I have enough memory to throw at two separate JVM instances and only two cores to work with, but it would be a nice thing to have in the future.

Input/output encodings and formats. All three of the Java-based taggers can handle Unicode text (which is good), and they can take input data in either plain text or XML format. MorphAdorner and Stanford by default give back the same format you put in; LingPipe (again, by default) gives you XML output either way. Doesn’t make much difference to me, and it’s easy to write a simple output filter for any of the packages (TreeTagger possibly excepted) that gives you what you want.
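
By way of illustration, here’s the sort of filter I have in mind, sketched in Python. The input format (one XML <w> element per token with a pos attribute) is a stand-in of my own, not any particular package’s actual schema; the point is just that flattening tagged output to token/tag lines is a few lines of work.

import re
import sys

# Hypothetical input:  <w pos="nn">whale</w>  -- one element per token.
W_ELEMENT = re.compile(r'<w[^>]*\bpos="([^"]+)"[^>]*>([^<]+)</w>')

for line in sys.stdin:
    for pos, token in W_ELEMENT.findall(line):
        print(f"{token}\t{pos}")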

Finally, Bob suggests having a look at NLTK, which I mentioned in an earlier post but didn’t really do anything with. Certainly something to keep in mind for the future, especially as it has a kind of “welcome to NLP work, please allow me to show you around and make things easier for you” vibe. It’s Python-based and GPL2 licensed. Will investigate as time allows.

And that, finally, is that. Back to proper literary work for a bit—polishing off the Coetzee article and talk I mentioned a while ago—then to book manuscript revisions. But the computational work will continue through the spring and summer. With results, eventually, I swear!

Evaluating POS Taggers: LingPipe Bag of Tags Accuracy and General Thoughts on Tagset Translation

Gah, this is all nearing completion. Will have a wrap-up of the whole series later tonight; I, for one, await my conclusions with bated breath.

Before I can finish the overall evaluation, here are the results of my last trial, an iteration of the bag-of-tags accuracy tests I’ve been doing, this time with LingPipe. Note, though, that the section below on tagset conversion and this bag-of-tags approach is probably more interesting than the specific LingPipe results (and there are nice summary graphs of the whole shebang down there, too!).

LingPipe Results

For reference, the list of basic tags and a LingPipe-to-MorphAdorner (i.e., Brown-to-NUPOS) translation table are available. Graphs are below; the problematic translations from Brown to NUPOS are as follows (a sketch of how these adjustments were applied follows the list):

  • ‘abx’ = pre-quantifier, e.g., “both.” These are usually ‘dt’ in the reference data, but about 30% ‘av’. Adjusted as such in the figures below.
  • ‘ap’ = post-determiner, e.g., “many, most, next, last, other, few,” etc. These are complicated; they’re predominantly ‘dt’ (34%), ‘jj’ (22%), ‘nn’ (13%), and ‘nu’ (29%) in the reference data (the other 3% being various other tags). But of course it’s hard to know exactly how much confidence to place in such estimates, absent a line-by-line comparison of all 24,000+ cases. Figures below are nevertheless adjusted according to these percentages.
  • ‘ap$’ = possessive post-determiner. There aren’t very many of these, and they’re mostly attached to tokenization errors. Ignored entirely.
  • ‘tl’ = words in titles. This is supposed to be a tag-modifying suffix to indicate that the token occurs in a title (see also ‘-hl’ for headlines and ‘-nc’ for citations), but LingPipe uses the ‘tl’ tag alone. Split 50/50 between nouns and punctuation, since those dominate the tokens thus tagged, but this is a kludge.
  • ‘to’ = the word “to.” Translated as ‘pc’ = participle, but is also sometimes (~43%) ‘pp’ = preposition in the reference data. Adjusted below.
  • LingPipe doesn’t use the ‘fw’ (foreign word) or ‘sy’ (symbol) tags.
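
Since the adjustments above may sound more mysterious than they are, here’s a minimal Python sketch of the procedure: most tags translate one-to-one, and the problem cases are split across several reduced tags in the proportions estimated from the reference data. The one-to-one mapping shown is abbreviated, and the split proportions are the rough figures from the list above.

# One-to-one mappings (abbreviated; the full translation table is linked above).
ONE_TO_ONE = {"cc": "cc", "nn": "nn", "vb": "vb"}

# Problem tags split across reduced tags, per the estimates above.
SPLIT = {
    "abx": {"dt": 0.70, "av": 0.30},                          # "usually dt, ~30% av"
    "ap":  {"dt": 0.34, "jj": 0.22, "nn": 0.13, "nu": 0.29},  # remaining few % ignored
    "to":  {"pc": 0.57, "pp": 0.43},
    "tl":  {"nn": 0.50, "pu": 0.50},                          # the 50/50 kludge
}

def adjusted_counts(brown_tag_counts):
    """Translate raw Brown tag counts into (fractional) reduced-tag counts."""
    out = {}
    for tag, n in brown_tag_counts.items():
        shares = SPLIT.get(tag, {ONE_TO_ONE.get(tag, "zz"): 1.0})
        for reduced, share in shares.items():
            out[reduced] = out.get(reduced, 0.0) + n * share
    return out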

Data

 

Table 1: POS frequency in reference data and LingPipe’s output

 

POS	Ref	Test	Ref %	Test %	Diff	Err %
av*	213765	228285	 5.551	 5.718	 0.167	  3.0
cc	243720	231708	 6.329	 5.804	-0.525	 -8.3
dt*	313671	310143	 8.145	 7.769	-0.377	 -4.6
fw	  4117		 0.107		-0.107	
jj*	210224	203683	 5.459	 5.102	-0.357	 -6.5
nn*	565304	596960	14.680	14.954	 0.274	  1.9
np	 91933	118115	 2.387	 2.959	 0.571	 23.9
nu*	 24440	 38856	 0.635	 0.973	 0.339	 53.4
pc*	 54518	 35098	 1.416	 0.879	-0.537	-37.9
pp*	323212	356411	 8.393	 8.928	 0.535	  6.4
pr	422442	430172	10.970	10.776	-0.194	 -1.8
pu*	632749	605152	16.431	15.159	-1.273	 -7.7
sy	   318		 0.008		-0.008	
uh	 19492	 35471	 0.506	 0.889	 0.382	 75.5
vb	666095	664957	17.297	16.657	-0.640	 -3.7
wh	 40162	 70998	 1.043	 1.778	 0.736	 70.5
xx	 24544	 23825	 0.637	 0.597	-0.041	 -6.4
zz	   167	 42272	 0.004	 1.059	 1.055	
Tot	3850873	3992106	

* Tag counts to which adjustments have been applied (see above)

Legend
POS = Part of speech (see this previous post or this list for explanations)
Ref = Number of occurrences in reference data
Test = Number of occurrences in output
Ref % = Percentage of reference data tagged with this POS
Test % = Percentage of output tagged with this POS
Diff = Difference in percentage points between Ref % and Test %
Err % = Percent error in output frequency relative to reference data
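
For concreteness, here’s the arithmetic behind the Diff and Err % columns, worked through (in Python) for the ‘cc’ row. As far as I can reconstruct from the table, Err % is computed on the relative frequencies rather than the raw counts, which normalizes for the slightly different token totals in the reference data and the output.

ref_count, test_count = 243720, 231708      # 'cc' occurrences
ref_total, test_total = 3850873, 3992106    # total tokens

ref_pct = 100 * ref_count / ref_total       # 6.329
test_pct = 100 * test_count / test_total    # 5.804
diff = test_pct - ref_pct                   # -0.525 percentage points
err_pct = 100 * diff / ref_pct              # -8.3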

Pictures

And then the graphs (click for large versions).

Figure 1: Percentage point errors in POS frequency relative to reference data

 

LP Errors.png

 

Figure 2: Percent error by POS type relative to reference data

 

TT Errors Pct.png

 

Discussion of LingPipe Results

This is about what we’ve seen with the other taggers that use a base tagset other than NUPOS; it’s a bit better than either Stanford or TreeTagger, a fact that stands out more clearly in the summary comparison graphs below, but there are just too many difficulties converting between any two tagsets to say much more. One could certainly point out some of the obvious features in the present case—LingPipe has a thing for numbers, proper nouns, wh-words, and interjections, plus an aversion to punctuation, verbs, and participles—but I think the only genuinely interesting feature is LingPipe’s willingness to tag things as unknown. I’ve left this out of Figure 2 because it badly skews the scale, but notice in the data above that there are just 167 ‘zz’ tags in the reference corpus, but 42,000+ instances of ‘nil’ (=’zz’) in the LingPipe output.

We didn’t see anything like this with Stanford or TreeTagger, but it might be useful. (Of course, it might also be a mess.) I can imagine situations in which it would be better to know that the tagger has no confidence in its output rather than pushing ahead with garbage results. This is one of the reasons that taggers with the option of producing confidence-based output are (potentially) useful, since they would allow one to isolate borderline cases. LingPipe and TreeTagger have such an option; Stanford and MorphAdorner do not, to the best of my knowledge.
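
As a sketch of what isolating borderline cases might look like: given tagger output normalized to (token, tag, confidence) triples — a format I’m inventing for illustration, not any package’s native output — one could split confident tags from doubtful ones and review the latter separately.

def split_by_confidence(tagged, threshold=0.9):
    """Separate confident (token, tag, confidence) triples from borderline ones."""
    confident, borderline = [], []
    for token, tag, conf in tagged:
        (confident if conf >= threshold else borderline).append((token, tag, conf))
    return confident, borderline

# e.g., keep the sure cases, flag the rest for inspection (values invented)
sure, unsure = split_by_confidence([("the", "dt", 0.999), ("fardels", "nn", 0.41)])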

Thoughts on Converting between Tagsets

First, some graphs that collate the results of all the bag-of-tags trials. They’re the same as the ones I’ve been using so far, but now with all the numbers presented together for easier comparison.

As always, click each graph for the full-size image.

Figure 3: Percentage point errors by POS type (Summary)

 

Sum Errors.png

 

Figure 4: Error % by POS type (Summary)

 

Note: Y-axis intentionally truncated at +100%.

Sum Errors Pct.png

 

Figure 5: Weighted error percentages by tagger and POS type (Summary)

 

This is the one I like best, since it makes plain the relative importance of each POS type; large errors on rare tags generally matter less than modest errors on common tags, though the details will depend on one’s application.

Sum Errors Bubbles.png

 

Confirming what you see above: MorphAdorner does well over the reference data, which looks like its own training corpus. (Recall that the MorphAdorner numbers are taken from its cross-validation output and without the benefit of its in-built lexicon, so it’s not just running directly over material it already knows. Apologies for the personification in the preceding sentence.) LingPipe is marginally better than Stanford or TreeTagger (this would be more obvious if the log scale weren’t compressing things in the 1-10% error range), but all three (using different training data and different tagsets) lag MorphAdorner significantly (by an eyeballed order of magnitude, more or less).
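
I won’t swear this is exactly how the weighting in Figure 5 was computed, but one simple scheme consistent with the description above is to multiply each tag’s percent error by that tag’s share of the reference data. A sketch under that assumption, with two illustrative values from the LingPipe table:

def weighted_errors(ref_pct, err_pct):
    """Weight each tag's percent error by its reference frequency (both in %)."""
    return {pos: abs(err_pct[pos]) * ref_pct[pos] / 100.0 for pos in ref_pct}

# Values from the LingPipe table above: a 53.4% error on the rare 'nu' tag
# contributes on the same order as a 3.0% error on the common 'av' tag.
print(weighted_errors({"av": 5.551, "nu": 0.635}, {"av": 3.0, "nu": 53.4}))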

So … what have I learned from these attempts to measure accuracy across tagsets? Less than I’d hoped, at least in the direct sense. These trials were motivated by an interest in whether or not taggers trained on non-literary corpora would produce results similar to MorphAdorner’s (which is trained exclusively on literature). The problem was that they all use different tagsets out of the box, and I was somewhere between unwilling and unable to retrain all of them on a common one. My thinking was that I’d be able to smooth out their various quirks by picking a minimal set of parts of speech that they’d all recognize, and then mapping their full tagsets down to this basic one.

The problem is that the various tagsets don’t agree on what should be treated as a special case (wh-words, predeterminers, “to,” etc.), and the special cases don’t map consistently to individual parts of speech. The numbers I’ve presented in each of the recent posts on the topic have tried to apply appropriate statistical fixups, but they’re hacks and (informed) guesses at best. In any case, I think what I’m really seeing is that taggers are reasonably good at reproducing the characteristics of their own training input (which we knew already, based on ~97% cross-validation accuracy). So MorphAdorner does well (generally ~99% accuracy over the reduced tagset, i.e., distinguishing nouns from verbs from other major parts of speech) on data that resembles the stuff on which it was trained; the others do less well on that material, since it differs from their training data not just by genre, but also (and more importantly, I think) by tagset.

(An aside: LingPipe is trained on the Brown corpus, which contains a significant amount of fiction and “belles lettres” [Brown’s term, not mine]. Stanford and TreeTagger use the Penn treebank corpus, i.e., all Wall Street Journal, all the time. So there’s a priori reason to believe that LingPipe should do better than either of those two on literary fiction. I like the Brown tagset better than Penn, too, since it deals more elegantly with negatives, possessives, etc.)

For the sake of comparison, I looked into running a bag-of-tags evaluation of MorphAdorner over the Brown corpus to see if the accuracy numbers would turn out more like those for the other taggers when faced with “foreign” data. My strong hunch is that they would, but it was going to be more trouble than it was worth to nail it down adequately. Perhaps another time.

Takeaway lessons? Mostly, be careful about direct comparisons of the output of different taggers. If I see somewhere that Irish fiction of the 1830s contains 15% nouns, and know that I’ve seen 17% in British Victorian novels, I probably can’t draw any meaningful conclusions from that fact without access to the underlying texts and/or a lot more information about the tools used. It also means that if I settle on one package, then later change course, I’ll almost certainly need to rerun any previous statistics gathered with the original setup if I’m going to compare them with confidence to new results.

More broadly speaking—and this looks ahead to the overall conclusions in the next post—this all highlights the fact that both tagsets and training data matter a lot. The algorithms used by each of the taggers do differ from one another, even when they use similar techniques, and they make different trade-offs concerning accuracy and speed. But the differences introduced by those underlying algorithmic changes—on the order of 1%, max—are small compared to the ones that result from trying to move between tagsets (and, presumably, between literary and non-literary corpora, though the numbers I’ve presented here don’t throw direct light on that point).

This concludes the bag-of-tags portion of tonight’s program. Stay tuned for the grand finale after the intermission.

Evaluating POS Taggers: TreeTagger Bag of Tags Accuracy

This will be brief-ish, since the issues are the same as those addressed re: the Stanford tagger in my last post, and the results are worse.

I’ve again used out-of-the-box settings; like Stanford, TreeTagger uses a version of the Penn tagset. A translation table is available, as is a list of basic tags I’m using for comparison.

As with Stanford, there are a couple of reasons to expect that the results will be worse than those seen with MorphAdorner. There’s the tokenizer again (TreeTagger breaks up things that are single tokens in the reference data), and there’s the non-lit training set. Plus the incompatibility between the tagsets. As before:

  • New: TreeTagger has a funky ‘IN/that’ tag, which might be translated as either ‘pp’ or ‘cs’ (where ‘cs’, subordinating conjunction, is already rolled into ‘cc’, conjunction, in my reduced tagset). I’ve used ‘pp’, which should therefore be overcounted, while ‘cc’ is undercounted.
  • TreeTagger/Penn has a ‘to’ tag for the word “to”; the reference data (MorphAdorner’s training corpus) has no such thing, using ‘pc’ and ‘pp’ as appropriate instead.
  • TreeTagger uses ‘pdt’ for “pre-determiner,” i.e., words that come before (and qualify) a determiner. MorphAdorner lacks this tag, using ‘dt’ or ‘jj’ as appropriate.
  • Easier: TreeTagger uses ‘pos’ for possessive suffixes, while MorphAdorner doesn’t break them off from the base token, and contains modified versions of the base tags that indicate possessiveness. But since I’m not looking at possessives as an individual class, I can just ignore these, since the base tokens will be tagged on their own anyway.
  • Also easy-ish: TreeTagger doesn’t use MorphAdorner’s ‘xx’ (negation) tag. It turns out that almost everything MorphAdorner tags ‘xx’, TreeTagger considers an adverb, so one could lump ‘xx’ and ‘av’ together, were one so inclined.

Data

 

Table 1: POS frequency in reference data and TreeTagger’s output

 

POS	Ref	Test	Ref %	Test %	Diff	Err %
av	213765	226125	 5.551	 5.830	 0.279	  5.0
cc*	243720	167227	 6.329	 4.312	-2.017	-31.9
dt*	313671	292794	 8.145	 7.549	-0.596	 -7.3
fw	  4117	   519	 0.107	 0.013	-0.094	-87.5
jj*	210224	262980	 5.459	 6.781	 1.322	 24.2
nn	565304	642627	14.680	16.570	 1.890	 12.9
np	 91933	162270	 2.387	 4.184	 1.797	 75.3
nu	 24440	 17668	 0.635	 0.456	-0.179	-28.2
pc*	 54518	 91877	 1.416	 2.369	 0.953	 67.3
pp*	323212	371449	 8.393	 9.577	 1.184	 14.1
pr	422442	386668	10.970	 9.970	-1.000	 -9.1
pu	632749	555115	16.431	14.313	-2.118	-12.9
sy	   318	   100	 0.008	 0.003	-0.006	-68.8
uh	 19492	  6063	 0.506	 0.156	-0.350	-69.1
vb	666095	650441	17.297	16.771	-0.526	 -3.0
wh	 40162	 44428	 1.043	 1.146	 0.103	  9.8
xx	 24544	     0	 0.637	     0	-0.637	-100
zz	   167	    13	 0.004	 0.000	-0.004	-92.3
Tot	3850873	3878364

* Tag counts for which there is reason to expect systematic errors

Legend
POS = Part of speech (see this previous post or this list for explanations)
Ref = Number of occurrences in reference data
Test = Number of occurrences in output
Ref % = Percentage of reference data tagged with this POS
Test % = Percentage of output tagged with this POS
Diff = Difference in percentage points between Ref % and Test %
Err % = Percent error in output frequency relative to reference data

Pictures

And then the graphs (click for large versions).

Note: These graphs are corrected for the xx/av problem discussed above; ‘xx’ tags in the reference data have been rolled into ‘av’ here.

Figure 1: Percentage point errors in POS frequency relative to reference data

 

TT Errors.png

 

Figure 2: Percent error by POS type relative to reference data

 

TT Errors Pct.png

Note on Figure 2:
The bubble charts I’ve used in previous posts are a pain to create; this is much easier and, while not quite as useful for comparing weightings, is good enough for now, especially since the weightings don’t change between taggers (they’re based on tag frequency in the reference data). Note, too, that there’s no log scale involved this time.

Discussion

Ignoring the problematic tags (cc, dt, jj, pc, and pp), things are still pretty bad. Nouns, common and proper alike, are significantly overcounted, verbs and pronouns are undercounted. Rarer tokens (foreign words, symbols, interjections) are a mess, but that’s to be expected. Overall, the error rates are in the neighborhood of Stanford, but a bit worse.

The same caveats about the limits of translating between tagsets apply here as in the Stanford case, but again, it’s hard to see how any of this could be construed as better than MorphAdorner.

Takeaway point: TreeTagger looks to be out, too.

Evaluating POS Taggers: Stanford Bag of Tags Accuracy

Following on from the MorphAdorner bag-o-tags post, here’s the same treatment for the Stanford tagger.

I’ve used out-of-the-box settings, which means the left3words tagger trained on the usual WSJ corpus and employing the Penn Treebank tagset. Translations from this set (as it exists in Stanford’s output data) to my (very) reduced tagset are also available.

There are a couple of reasons to expect that the results will be worse than those seen with MorphAdorner. One is the tokenizer. Another is the different (non-lit) training set. A third is incompatibility between the tagsets. This last point is unfortunate, but there’s not really any easy way to get around it. It crops up in a few of the tags, and works in both directions:

  • Stanford/Penn has a ‘to’ tag for the word “to”; the reference data (MorphAdorner’s training corpus) has no such thing, using ‘pc’ and ‘pp’ as appropriate instead.
  • Stanford uses ‘pdt’ for “pre-determiner,” i.e., words that come before (and qualify) a determiner. MorphAdorner lacks this tag, using ‘dt’ or ‘jj’ as appropriate.
  • Easier: Stanford uses ‘pos’ for possessive suffixes, while MorphAdorner doesn’t break them off from the base token, and contains modified versions of the base tags that indicate possessiveness. But since I’m not looking at possessives as an individual class, I can just ignore these, since the base tokens will be tagged on their own anyway.
  • Also easy-ish: Stanford doesn’t use MorphAdorner’s ‘xx’ (negation) tag. It turns out that almost everything MorphAdorner tags ‘xx’, Stanford considers an adverb, so one could lump ‘xx’ and ‘av’ together, were one so inclined.

Data

 

Table 1: POS frequency in reference data and Stanford output

 

POS	Ref.	Test	Ref %	Test %	Test Dif	Test Err %
av	213765	222465	 5.551	 5.754	 0.20264	  3.650
cc	243720	165546	 6.329	 4.282	-2.04736	-32.349
dt*	313671	299797	 8.145	 7.754	-0.39166	 -4.808
fw	  4117	  4094	 0.107	 0.106	-0.00103	 -0.959
jj*	210224	235856	 5.459	 6.100	 0.64093	 11.741
nn	565304	583214	14.680	15.084	 0.40405	  2.752
np	 91933	156650	 2.387	 4.052	 1.66419	 69.709
nu	 24440	 18006	 0.635	 0.466	-0.16896	-26.623
pc*	 54518	 92888	 1.416	 2.402	 0.98668	 69.694
pp*	323212	375382	 8.393	 9.709	 1.31547	 15.673
pr	422442	385059	10.970	 9.959	-1.01107	 -9.217
pu	632749	641420	16.431	16.589	 0.15804	  0.962
sy	   318	   118	 0.008	 0.003	-0.00521	-63.043
uh	 19492	  2239	 0.506	 0.058	-0.44826	-88.560
vb	666095	641714	17.297	16.597	-0.70029	 -4.049
wh	 40162	 41807	 1.043	 1.081	 0.03834	  3.676
xx	 24544	     0	 0.637	 0.000	-0.63736	-100.00
zz	   167	   201	 0.004	 0.005	 0.00086	 19.874
--	
Tot	3850873	3866456		

* Tag counts for which there is reason to expect systematic errors

Legend
POS = Part of speech (see this previous post or this list for explanations)
Ref. = Number of occurrences in reference data
Test = Number of occurrences in output
Ref % = Percentage of reference data tagged with this POS
Test % = Percentage of output tagged with this POS
Test Dif = Difference in percentage points between Ref % and Test %
Test Err % = Percent error in output frequency relative to reference data

Pictures

And then the graphs (click for large versions). Note: These graphs are corrected for the xx/av problem discussed above; ‘xx’ tags in the reference data have been rolled into ‘av’ here.

Figure 1: Percentage point errors in POS frequency relative to reference data

 

ST Errors.png

 

Figure 2: Percent error by POS type relative to reference data

 

ST Errors Bubble.png

Notes on Figure 2:

  1. As before, the size of each bubble represents the relative frequency of that POS tag in the reference data. Bigger bubbles mean more frequent tags.
  2. The x-axis is logarithmic, so the differences are both more (‘uh’, ‘sy’) and less (‘pu’, ‘nn’) dramatic than they appear.

Discussion

First, there are a few POS types that we know will be off, since there’s not a straightforward conversion for them between the NUPOS and Penn tagsets. These are: dt, jj, pc, and pp. Adjectives (jj) are the only ones that are a bit of a disappointment, since that’s a class in which I have some interest. On the plus side, though, this is a problem connected to the ‘pdt’ tag in Penn, and there are only about 4,000 occurrences of it in Stanford’s output, compared to 235,000+ occurrences of ‘jj.’ Even if half the ‘pdt’ tags should be ‘jj’ rather than ‘dt’, that’s still less than a 1% contribution to the overall error rate for ‘jj’.

That said, what sticks out here? Well, the numbers are a lot worse than those for MorphAdorner, for which the worst cases (proper nouns, nouns, and punctuation) were 0.1 – 0.2 percentage points off, compared to 1 – 2 here. So we have results that are about an order of magnitude worse. And in MorphAdorner’s case, the nouns overall may not have been as bad as they look, since proper nouns were undercounted, while common nouns were overcounted by a roughly offsetting amount. For Stanford, though, both common nouns and proper nouns are overcounted, so you can’t get rid of the error by lumping them together.

Similarly, relative error percentages for most tag types are much higher. In Figure 2, the main cluster of values is between 1% and about 50%; for MorphAdorner, it was between 0.1% and 1.0%. Nouns, verbs, adjectives, adverbs, and pronouns—all major components of the data set—are off by 3% to 10% or more.

The real question, though, is what to make of all this. How many of the errors are merely “errors,” i.e., differences between the way Stanford does things and the way MorphAdorner does them? What I’m interested in, ultimately, is a tagger that’s reliable across a historically diverse corpus; I don’t especially care if it undercounts nouns, for instance, so long as it undercounts them all the time, everywhere, to the same extent. But in the absence of literary reference data not linked to MorphAdorner, and without the ability to train the Stanford tagger on MorphAdorner’s reference corpus, it’s hard for me to assess accuracy other than by standard cross-validation and this bag-of-tags method.

Takeaway point: I don’t see any compelling reason to keep Stanford in the running at this point.

Evaluating POS Taggers: LingPipe Cross-Validation

Below are the promised cross-validation results for LingPipe. They’re produced by LingPipe’s own test suite rather than by my own (cruder) methods, but there’s no reason to think these numbers aren’t directly comparable to my earlier results for MorphAdorner.

So, without delay, the out-of-the-box numbers:

Accuracy: 96.9%
Accuracy on unknown tokens: 71%

That’s using 5-grams; compiled models are about 6 MB. With a cache and 8-grams (producing 16 MB models), things are about the same:

Beams  Acc   Unk   Speed (tokens/s)*
10    .961  .683   29K/s
14    .963  .694   28K/s
20    .967  .699   27K/s
28    .970  .699   18K/s
40    .970  .699    5K/s

* See note below on speed; my own are a bit lower, because my machine is slower.

Note, as Bob did in an email to me, that overall accuracy in this case is very slightly higher, but that it actually does a little bit worse on unknown tokens.

For reference, recall that MorphAdorner (with the potential benefit of running over training data produced in conjunction with one of its predecessors, hence likely to do a bit better on tokenization; Martin or Phil, correct me if I’m wrong about this) was 97.1% accurate when restricted to a lexicon derived solely from the cross-validation data. Unfortunately I don’t have figures for MorphAdorner’s performance on unknown tokens.

Takeaway point: This looks to me, for practical purposes, like a dead heat as far as accuracy goes.

Next up, a comparison of overall bag of tags statistics.

[Note: The numbers above are from Bob’s report to me. I’ve tried to rerun them on my machine, but have had trouble getting the cross-validation run to finish. It keeps dying at apparently random spots after many successful fold iterations with a Java error (not out of memory, but something about a variable being out of range). So I don’t have full numbers from my own trial. But the process goes more than far enough to suggest that the numbers above are reasonable and repeatable; I see very similar accuracy figures for each fold over many folds and with different fold sizes. My speeds are lower (about half the numbers given above, consistent with what I’ve seen for LingPipe on this computer in the past) since my machine is slower, but show a similar pattern: consistent through 14 or 20 beams, slowing at 20 or 28, and down by 3x or 4x at 40, with accuracy leveling off at 20 or 28 beams (which looks to be the speed/accuracy sweet spot). In any case, I’m satisfied with the way things stand and don’t see much reason to look into this further.]

Evaluating POS Taggers: MorphAdorner Bag of Tags Accuracy

In my last post, I outlined an approach to assessing tagger accuracy by measuring the relative frequencies of a reduced set of POS tags in a tagger’s output and comparing them to the frequencies in the training data I have on hand. Here’s the first set of results, from MorphAdorner.

A couple of notes. First, the procedure for MorphAdorner differs slightly from the one I’ll use for the other taggers. In order to mitigate some of the advantage it has thanks to working with its own training data, I’ll be using the output of its cross-validation runs, rather than that of a stock run over the full corpus. That means that it’s not making use of the extended lexicon built into the default distribution, nor of any other fixups that may have been applied by hand after the default training (to both of which the other taggers of course also lack access). I think MorphAdorner still has an advantage, but this should reduce it slightly (in cross-validation, lack of the extended lexicon reduced overall accuracy from 97.7% to 97.1%, both of which were down from 98.1% on a run of the default distribution back over the training data).

A listing of the reduced tagset is included in the last post. You can also download two versions of the translation table from MorphAdorner-ese (i.e., NUPOS) to this reduced set. The first covers only the 238 tags currently defined by NUPOS, and is derived directly from a dump of the tag statistics and help information in MONK. The second includes all 261 tags actually used in the MorphAdorner training data. Two points to note: (1.) These translations use the syntactic part of speech assigned to each tag, not the word class to which it belongs. So when a noun is used as an adjective, it’s translated to ‘jj’ not ‘nn’. (2.) Of the 23 tags that appear in the training data but not in the standard tagset, almost all are punctuation. In fact the only ones that I haven’t translated as punctuation are various iterations of asterisks, to which I’ve assigned ‘zz’ (unknown-other), though I could see the case for making them either punctuation or symbols. It doesn’t seem to have made much difference, as ‘zz’ remains underrepresented in the output, while ‘sy’ and ‘pu’ are overrepresented. Go figure.

Results

First, the raw numbers:

Table 1: POS frequency in MorphAdorner training data and cross-validation output

 

POS	Train	Cross	 Tr %	 Cr %	 Cr Dif		Cr Err %
av	213765	212662	 5.551	 5.512	-0.03909	 -0.516
cc	243720	240704	 6.329	 6.239	-0.09015	 -1.237
dt	313671	313111	 8.145	 8.116	-0.02993	 -0.179
fw	  4117	  4076	 0.107	 0.106	-0.00126	 -0.996
jj	210224	210454	 5.459	 5.455	-0.00437	  0.109
nn	565304	571418	14.680	14.811	 0.13069	  1.082
np	 91933	 84632	 2.387	 2.194	-0.19375	 -7.942
nu	 24440	 24776	 0.635	 0.642	 0.00751	  1.375
pc	 54518	 54200	 1.416	 1.405	-0.01092	 -0.583
pp	323212	325560	 8.393	 8.438	 0.04498	  0.726
pr	422442	422557	10.970	10.952	-0.01778	  0.027
pu	632749	640074	16.431	16.590	 0.15877	  1.158
sy	   318	  1282	 0.008	 0.033	 0.02497	303.145
uh	 19492	 20836	 0.506	 0.540	 0.03388	  6.895
vb	666095	666822	17.297	17.283	-0.01388	  0.109
wh	 40162	 40316	 1.043	 1.045	 0.00202	  0.383
xx	 24544	 24596	 0.637	 0.638	 0.00014	  0.212
zz	   167	    97	 0.004	 0.003	-0.00182	-41.916
--	    --	    --
Tot	3850873	3858173		

Legend
POS = Part of speech (see previous post for explanations)
Train = Number of occurrences in training data
Cross = Number of occurrences in cross-validation output
Tr % = Percentage of training data tagged with this POS
Cr % = Percentage of cross-validation data tagged with this POS
Cr Dif = Difference in percentage points between Tr % and Cr %
Cr Err % = Percent error in cross-validation frequency relative to training data

That’s a bit hard to read, and is better represented graphically (click for full-size images):

Figure 1: Percentage point errors in POS frequency relative to the training data

 

MA Errors.png

 

Figure 2: Percent error by POS type relative to the training data

 

MA Errors Bubble.png

Notes on Figure 2:

  1. The size of each bubble represents the relative frequency of that POS tag in the training data. Bigger bubbles mean more frequent tags.
  2. The x-axis is logarithmic, so the differences are both more (‘zz’, ‘sy’) and less (‘pr’, ‘nn’) dramatic than they appear.

Discussion

Just a couple of points, really.

  1. Things look pretty good overall. With the exception of a few (relatively low-frequency) outliers, most tags are recognized with around 99% accuracy or better. So I could have pretty good confidence were I to compare, say, adjective frequency in different contexts.
  2. It looks, though, like there’s systematic undercounting of proper nouns, conjunctions and, to a lesser extent, adverbs and determiners. Proper nouns will probably always be hard, since they’ll have a disproportionate share of unknown (i.e., not present in the training data) tokens (on which accuracy generally drops to 70-90%, perhaps a bit better with proper nouns since you can maybe crib from orthography?).
  3. Nouns are overcounted, reducing the proper-noun problem if I were to lump the two together. Punctuation is overcounted, but I find it hard to imagine a situation in which I would care. Prepositions are slightly overcounted.
  4. Symbols (303% error) and unknown/other tokens (42%) are totally unreliable. They don’t contribute much to the overall error rate, because there aren’t very many of them (hence the nearly invisible bubbles in Figure 2), which is likely also why they’re so unreliable in the first place. But you wouldn’t want to make much of any variation you might see in those frequencies. Same goes for interjections, which hover around 7% error rate. [Note: I’ll look into the ‘sy’ and ‘zz’ cases, since it could be something systematically askew, but also might just be small-sample effects.]

Anyway, this is useful information on its own, and I hope it will be even more so once I have analogous data for the other taggers. That data should be ready in the next day or two, so more to follow shortly.

Evaluating POS Taggers: Accuracy and Bags of Tags

As I mentioned in my last post, one of the things that’s tricky about comparing the results of different taggers—or even, for instance, of comparing MorphAdorner to its own training data—is that the tokens involved can change, since each tagger tokenizes input data differently. Even if you’re not going to hold that against a given tagger directly (i.e., you’re not going to count token mismatches as errors in themselves), you still have the problem that the number of POS tags in the output will then sometimes differ from the number in the input, even for single tokens, which in turn will produce mismatches/errors (except where, as with MorphAdorner, single tokens can take multiple tags, so that it’s possible in principle for different tokenizations to produce equivalent series of tags).

(NB. Just to be clear, yes, you want tokenization to be as accurate as possible in general, but this is only a direct problem when comparing the output of a tagger to known-good data, as I’m doing at the moment. In the usual case, you’ll just accept whatever tokenization the package produces. There are valid arguments to be made for different treatments of some kinds of tokens, so accuracy is, as always, a matter of conformance to expectations rather than one of abstract correctness.)

One way to get around this problem, at least partially, is to consider the results not token-by-token, but in aggregate over an entire corpus. So the question is then not “Is this word tagged correctly as an adjective?” (since two taggers may not agree on what constitutes a word), but “How does the percentage of all words tagged as adjectives by this tagger compare to that of the training set?” This isn’t the same thing as cross-validation, though one could ask the question during a cross-validation procedure. Instead, what I’ll be doing is comparing the relative frequencies of various parts of speech assigned by each tagger using that tagger’s stock training data to the relative frequencies observed in the training corpus. So there’s no need to train the taggers on MorphAdorner’s data, which is good, since I’ve already seen that I can’t manage such a thing for Stanford’s tagger on my hardware, and TreeTagger produced abysmal results (probably my fault, but I’ve run out of patience with it in any case).

The only other complication is that of course the taggers use different tagsets by default, so I’ll need to create a translation table for each one. I’m interested in pretty coarse-grained POS categories, specifically the following 18:

Table 1: POS Tags Used for Bag-of-Tags Accuracy Comparisons

av      adverb
cc      conjunction
dt      determiner
fw      foreign
jj      adjective
nn      noun
np      noun-proper
nu      number
pr      pronoun
pc      participle
pp      preposition
pu      punctuation
wh      wh-word
sy      symbol
uh      interjection
vb      verb
xx      negation
zz      unknown-other

Note in particular that I’m not dividing up verbs into classes like modals or “being” verbs, nor distinguishing by tense or mood. The down side of this is loss of detail, and if I were doing actual research work, I’d keep more classes. But it guarantees that every tag from every tagger will slot easily into one or another of these categories, and it’ll give me a feel for how they stack up at the most basic level.
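
To make the mechanics concrete, here’s a minimal Python sketch of the comparison itself: map each tagger’s native tags down to the reduced set, count, and compare relative frequencies. The Penn mappings shown are only a handful of illustrative entries; the actual translation tables I used are linked from the individual results posts.

from collections import Counter

# A few illustrative Penn-to-reduced mappings; a real table covers the full tagset.
PENN_TO_REDUCED = {
    "NN": "nn", "NNS": "nn", "NNP": "np", "NNPS": "np",
    "JJ": "jj", "JJR": "jj", "JJS": "jj",
    "RB": "av", "RBR": "av", "RBS": "av",
    "VB": "vb", "VBD": "vb", "VBG": "vb",
    "VBN": "vb", "VBP": "vb", "VBZ": "vb",
}

def reduced_frequencies(tags):
    """Relative frequency (%) of each reduced tag in a sequence of Penn tags."""
    counts = Counter(PENN_TO_REDUCED.get(t, "zz") for t in tags)
    total = sum(counts.values())
    return {pos: 100.0 * n / total for pos, n in counts.items()}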

Also, this will help answer a question I have about MorphAdorner’s accuracy, specifically how its “big picture” accuracy (determining whether a word is used as an adjective or an adverb or a noun, say) compares to the very fine-grained accuracy over its full tagset (which we’ve already seen is around 97% in cross-validation). One of the reasons to care about this is that more or less every tagger claims to be (and tests as) about 97% accurate, but typically over smaller tagsets and on non-literary data. Does this mean that MorphAdorner will do better when it’s not being asked to discriminate so finely? Or is it possible that having been trained on such fine distinctions, it suffers on coarser ones? The second possibility doesn’t seem absurd to me; more categories means fewer examples of each one in the training data, and it’s possible that MorphAdorner will therefore do worse than taggers trained on fewer, broader categories. Now, even if that’s true (and I don’t know yet what the outcome will be), it’s still possible that the difference would be small and/or that the advantages of the large tagset would outweigh whatever drop in coarse-grained accuracy might be observed. But it’s an interesting question, and one that I’d like to be able to answer.

Plus, it bears directly on the research in service of which I’m doing this whole evaluation of taggers: I want to examine POS frequency distribution in large corpora across historical periods. So I need to have a sense of how accurate the taggers are not on individual tokens, but in sum. True, accuracy in sum largely depends on getting individual tokens right, and yes, I’ll probably care at some point about things like bi- and tri-grams that are much more sensitive to the accuracy of individual tags. But still, big-picture, bag-of-tags accuracy matters to me.

Coming next, then, the first results of this reduction procedure with MorphAdorner. Including graphs! It’ll be a hoot.

Evaluating POS Taggers: More Info on the Training Data

With the help of Bob Carpenter at Alias-i, I now have cross-validation data for LingPipe using the MorphAdorner training set. Before I get to the numbers (which are good) in another post, a little background on the training data itself. (Incidentally, one of the real advantages of all this evaluative work has been to give me a much better sense of the data and tagsets I’m working with than I would have had otherwise.)

NUPOS, the tagset developed by Martin Mueller and used by MorphAdorner, is large: it’s up to 238 tags at the moment, compared to 60 or 80 in most other mainstream tagsets. This has a lot of potential benefits for literary work, especially if one wants to examine texts from earlier periods. (For the rationale behind NUPOS, see Martin’s post on the MONK wiki, to which I’ve linked in the past. Note that the list of tags there isn’t fully up to date as of mid-January 2009; if you have access to MONK development builds, there’s a better list at http://scribe.at.northwestern.edu:8090/monk/servlet?op=pos.) The short version: It might be nice to be able to deal with “thou shouldst” and to easily find cases of nouns used as adjectives, to name only two instances. But it has some costs, too, particularly in the system requirements for training a tagger. This is especially true because of the way NUPOS handles contractions and certain other compounds, which it treats as single tokens with compound POS tags. I think I mentioned something about this earlier, but it wasn’t clear to me then exactly what was going on. So … a couple of examples:

I'll     pns11|vmb   i|will    i'll
...
there's  pc-acp|vbz  there|be  there's
...
You've   pn22|vhb    you|have  you've

There are about 22,000 such cases (out of 3.8+ M total tokens) in the training data. As I said, NUPOS and MorphAdorner treat these as single tokens with two parts of speech, but most other taggers split them into two tokens. Martin has a nice argument for why the NUPOS/MorphAdorner method is the right one, but it makes things a bit tricky for me at the moment. First, it means there are 428 unique tags in the MorphAdorner training data (almost doubling the size of the already large NUPOS tagset), which makes for potentially enormous matrices in the training and cross-validation process of other taggers. I’ve already seen that it makes training the Stanford tagger on this data impossible on my machine, and it made it slightly harder for me to repeat Bob’s cross-validation runs with LingPipe than it might have been with a smaller tagset.
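
For what it’s worth, the compound cases are easy to count mechanically. A sketch, assuming the training data’s whitespace-separated columns of token, POS tag(s), lemma(s), and standardized spelling shown above, with compound tokens carrying ‘|’-joined tags:

def count_compounds(path):
    """Count tokens with compound (pipe-joined) POS tags in the training data."""
    compounds = total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if len(fields) < 2:   # skip blank or otherwise odd lines
                continue
            total += 1
            if "|" in fields[1]:  # e.g. 'pns11|vmb' for "I'll"
                compounds += 1
    return compounds, total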

Now, I don’t suspect that such things are really impossible (in fact I’m rerunning the LingPipe cross-validation process with some tweaked settings right now), but it does make things more resource intensive.

And then there are tokenization issues, which make direct comparison of tokens and parts of speech between taggers pretty tough. Every tagger wants to tokenize incoming data differently, and they don’t necessarily preserve MorphAdorner’s existing tokenization (heck, even MorphAdorner doesn’t respect MorphAdorner’s existing tokenization; see all the instances of ‘…’ turned into ‘..’ and ‘.’). I suppose it would be possible to write a pretty trivial tokenizer for each of the packages that says “hey, every token in the input data is on a separate line, just go with that.” But mucking about with tokenizers is beyond my current investment level.
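
The trivial tokenizer in question really would be trivial; something like the following sketch, which each package would then need wrapped in its own tokenizer interface (the wrapping, not the logic, being the part I don’t want to invest in):

def one_token_per_line(path):
    """Yield one token per line from pre-tokenized data, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            token = line.strip()
            if token:
                yield token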

Finally, as Bob pointed out elsewhere (I’ve forgotten now whether it was in an email or a blog comment), tokenization has an obvious impact on tagging accuracy; it’s hard to get parts of speech right (where “right” = “matches the training data”) if the tagger thinks the data consists of different tokens than does the trainer. In the end, the obvious effect is to skew accuracy results toward taggers/tokenizers more closely related to the ones used to help generate the training data in the first place. Martin tells me that the MorphAdorner training data is distantly descended from material tagged by CLAWS and so is, as far as I can tell, probably not closely related to the other taggers I’m evaluating. Takeaway point: It’s no surprise that using MorphAdorner’s training data will tend to produce accuracy evaluations that favor MorphAdorner.

What’s next? A couple of posts later today, one on LingPipe cross-validation results and another on raw accuracy using a sharply reduced tagset and treating the whole dataset as a big bag of tags (i.e., what is the overall frequency of adjectives, etc.). Good times. More soon.