POS Frequencies in the MONK Corpus, with Additional Musings

This post is on the work I presented at DH ’09, plus some thoughts on what’s next for my project. It’s related to this earlier post on preliminary part-of-speech frequencies across the entire MONK corpus, but includes new material and figures based on some data pruning and collection as mentioned in this post (details below).

A word, first, on why I’m working on this. I don’t really care, of course, about the relative frequencies of various parts of speech across time, any more than chemists care about, say, the absorption spectra of molecules. What I’m looking for are useful diagnostics of things that I do care about but that are hard to measure directly (like, say, changes in the use of allegory across historical time or, more broadly, in rhetorical cues of literary periodization).

My hypothesis is that allegory should be more prominent and widespread in the short intervals between literary-historical periods than during the periods themselves. Since we also suspect that allegorical writing should be “simpler” on its face than non-allegorical writing (because it needs to sustain an already complicated set of rhetorical mappings over large parts of its narrative), it makes sense (in the absence of a direct measure of “allegoricalness”) to look for markers of comparative narrative simplicity/complexity as proxies for allegory itself. I think part-of-speech frequency might be one such measure. In any case if I’m right about allegory and periodization and if I’m also right about specific POS frequencies as indicators of allegory, then we should expect certain POS frequencies to exhibit significant (in the statistical sense) fluctuations around periodizing moments and events. (I wish there were fewer ifs in that last sentence; I’ll say a bit below about how one could eliminate them.)

So … what do we see in the MONK case? Recall that the results from the full dataset looked like this:

POS Frequencies, Full MONK Corpus

But that’s messy and not of much use. It doesn’t focus on the few POS types that I think might be relevant (nouns, verbs, adjectives, adverbs); it includes a bunch of texts that aren’t narrative fiction (drama, sermons, etc.); and it’s especially noisy because I didn’t make any attempt to control for years in which very few texts (or authors) were published. (Note that the POS types listed are the reduced set of so-called “word classes” from NUPOS.)

Here’s what we get if we limit the POSs (PsOS?) in question, exclude texts that aren’t narrative fiction, and group together the counts from nearby years with low quantities of text:

POS Frequencies, Reduced and Consolidated MONK Corpus

And here’s the same figure with the descriptive types (adjectives and adverbs) added together:

POS Frequencies, Reduced and Consolidated MONK Corpus (Adj + Adv)

[Some data details, skippable if you don’t care. First, note that the x-axes in all three figures need to be fixed up; they’re categorical bins ordered by year label rather than a properly scaled time axis. I’ll fix this soon, but it doesn’t make much difference in the results. You can download the raw POS counts for the full corpus (not sorted by year of publication), as well as those restricted to texts with genre = fiction. These are interesting, I guess, but more useful are the same figures split out by year of publication, both for the whole corpus, and just for fiction (presented as frequencies rather than counts). Finally, there are the fiction-only, year-consolidated numbers (back to counts for these, because I’m lazy). The table of translations between the full NUPOS tags and the (very reduced) word classes presented here is also available.]

So what does this all mean? The first thing to notice is that there’s no straightforward confirmation of my hypotheses in these figures. There’s some meaningful fluctuation in noun and verb frequency over the first half of the nineteenth century—which I think might be an interesting indication of the kind of writing that was dominant at the time (see the noun and verb frequency section of this post)—but no corresponding movement in the combined frequency of adjectives and adverbs. This might mean several things: I might be wrong about the correlation between such frequencies and periodizing events, or I might not be looking at the right POS types, or (quite likely, regardless of other factors) I might not have low enough noise levels to distinguish what one would expect to be fairly small variations in POS frequency.

Where to go from here? A few directions:

I’ll keep working on a bigger corpus. The fiction holdings from MONK are only about 1000 novels, spread (unevenly) over 120+ (or 150+) years. So we’re looking at eight or fewer books on average in any one year, and that’s just not very much if we want good statistics.

There are a couple of ways to go about doing this. Gutenberg has around 10,000 works of fiction in English, so it’s an order of magnitude larger. There are issues with their cataloging and bibliographic quality, but I think they’re addressable and I’m at work on them now. The Open Content Alliance has hundreds of thousands of high-quality digitizations from research libraries, though there are some cataloging issues and I’m not sure about final text quality (OCA relies on straight OCR, where Gutenberg’s texts are hand-corrected). Still, OCA (or Google Books, depending on what happens with the proposed settlement, or Hathi) would offer the largest possible corpus for the foreseeable future. I’ve been talking to Tim Cole at UIUC about the OCA holdings and will report more as things come together.

But I think it’s also worth asking whether POS frequencies are the right way to go; I started down that path on a hunch, and it would be nice to have some promising data before I put too much more effort into pursuing it. What I need, really, are some exploratory descriptive statistics comparing known allegorical and nonallegorical texts. One of the reasons I’ve held off on doing that is that it seems like a big project. The time span I have in mind (several centuries), plus the range of styles, genres, national origins, genders, etc., suggests that the test corpus would need to be large (on the order of hundreds of books, say) if it’s not to be dominated by any one author/nation/gender/period/subject/etc. But how much reading and thinking would I have to do to identify, with high confidence, at least 100 major works of allegorical fiction and another 100 of comparable nonallegorical fiction? And would even that be enough? A daunting prospect, though it’s something that I’m probably going to have to do at some point.

But I got an interesting suggestion from Jan Rybicki (who works in authorship attribution, not coincidentally) at DH. Maybe it would suffice, at least preliminarily, to pick a handful of individual authors who wrote both allegorical and nonallegorical works reasonably close together in time, and to look for statistical distinctions between them. Since I’d be dealing with the same author, many of the problems about variations in period, national origin, gender, and so forth would go away, or at least be minimized. I suspect this wouldn’t do very well for finding distinctive keywords, which I imagine would be too closely tied to the specific content of each work (which is a problem that the larger training set is intended to overcome), but it might turn up interesting lower-level phenomena like (just off the top of my head) differences in character n-grams or sentence length. It would take some work to slice and dice the texts in every conceivably relevant statistical way, but I’m going to need to do that anyway and it’s hardly prohibitive.
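
To make Rybicki’s suggestion concrete, here’s the kind of quick, low-level comparison I have in mind: a rough Python sketch with made-up file names and two arbitrary features (mean sentence length and a character trigram profile), not a finished pipeline.

    # Sketch: compare two works by the same author on a couple of low-level
    # features. File names and feature choices are illustrative only.
    import re
    from collections import Counter

    def sentences(text):
        # Crude sentence split; good enough for a first-pass comparison.
        return [s for s in re.split(r'[.!?]+', text) if s.strip()]

    def mean_sentence_length(text):
        sents = sentences(text)
        return sum(len(s.split()) for s in sents) / len(sents)

    def char_trigram_profile(text, top_n=200):
        text = re.sub(r'\s+', ' ', text.lower())
        grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
        total = sum(grams.values())
        return {g: c / total for g, c in grams.most_common(top_n)}

    def profile_distance(p, q):
        # Simple L1 distance over the union of top trigrams.
        keys = set(p) | set(q)
        return sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

    allegorical = open('pilgrims_progress.txt').read()      # hypothetical files
    nonallegorical = open('grace_abounding.txt').read()

    print(mean_sentence_length(allegorical), mean_sentence_length(nonallegorical))
    print(profile_distance(char_trigram_profile(allegorical),
                           char_trigram_profile(nonallegorical)))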

So that’s one easy, immediate thing to do. In the longer run, what I really want is to see what people in the field have understood to be allegorical and what not, which would have the great advantage, at least as a reference point, of eliminating some of the problems of individual selection bias. One way to do that would be to mine JSTOR, looking, for example, for collocates of “allegor*” or (more ambitiously) trying to do sentiment analysis on actual judgments of allegoricalness. I suspect the latter is out of the question at the moment (as I understand it, the current state of the art is something like determining whether customer product reviews are positive or negative, which seems much, much easier than determining whether an arbitrary scholarly article considers any one of the several texts it discusses to be allegorical). But the former—finding terms that go along with allegory in the professional literature, seeing how the frequency of the term itself and of specific allegorical works and authors changes over (critical) time, and so on—might be both easy and helpful; at the very least, it would be immensely interesting to me. So that’s something to do soon, too, depending on the details of JSTOR access. (JSTOR is one of the partners for the Digging into Data Challenge and they’ve offered limited access to their collection through a program they’re calling “data for research,” so I know they’re amenable to sharing their corpus in at least some circumstances. I was told at THATCamp by Loretta Auvil that SEASR is working with them, too.)
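
For the collocation idea, even a crude window-based count would be a start. This is only a sketch: the input file is hypothetical, the window size is arbitrary, and real JSTOR material would arrive in whatever form their data-for-research program provides.

    # Sketch: count words that co-occur with "allegor*" within a small window,
    # over a plain-text dump of critical articles. The file name and window
    # size are placeholders.
    import re
    from collections import Counter

    WINDOW = 5
    text = open('jstor_sample.txt').read().lower()    # hypothetical dump
    tokens = re.findall(r'[a-z]+', text)

    collocates = Counter()
    for i, tok in enumerate(tokens):
        if tok.startswith('allegor'):
            lo, hi = max(0, i - WINDOW), i + WINDOW + 1
            collocates.update(t for t in tokens[lo:hi] if not t.startswith('allegor'))

    for word, count in collocates.most_common(25):
        print(word, count)

Filtering out stopwords and normalizing by overall word frequency would be the obvious next steps.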

[Incidentally, SEASR is something I’ve been meaning to check out more closely for a long time now. The idea of packaged but flexible data sources, analytics, and visualizations could be really powerful and could save me a ton of time.]

Finally (I had no idea I was going to go on so long), there are a couple of things I should read: Patrick Juola’s “Measuring Linguistic Complexity” (J Quant Ling 5:3 [1998], 206-13)—which might have some pointers on distinguishing complex nonallegorical works from simpler allegorical ones—plus newer work that cites it. And Colin Martindale’s The Clockwork Muse, which has been sitting on my shelf for a while and which was (re)described to me at DH as “brilliant and infuriating and wacky.” Sign me up.

Some POS Frequency Factoids

I’ll be posting a couple of times in the next few days about DH ’09, THATCamp, and the state of my project. First, though, a handful of (mildly) interesting plots concerning part-of-speech frequency correlations from the MONK corpus.

MONK contains about 1,000 novels and novel-like works spread over the eighteenth, nineteenth, and twentieth centuries. (The full corpus is larger and covers a longer timespan; it includes drama, witchcraft narratives, some nonfiction, etc.) I’ve counted occurrences of the major POS types across just the narrative fiction, divided them up by year of publication, and then grouped together a few nearby years in which few or no books were included. In the end, there’s coverage from 1742 through 1905, with all years (or groups of years) containing at least 500,000 words by four or more authors and no group spanning more than five years. This is the same dataset from which I’ll construct some POS frequency vs. time graphs in a later post (where I’ll also link to the raw counts).
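
The consolidation rule amounts to something like the sketch below. It’s a simplified reimplementation rather than the script I actually used; the greedy merging and the per-year record format are assumptions.

    # Sketch of the year-consolidation rule: walk the years in order, merging
    # each year into the current bin until the bin has at least 500,000 words
    # and four distinct authors, then close it. A bin that had to grow past
    # five years would get flagged for manual attention.
    def consolidate(years, min_words=500_000, min_authors=4, max_span=5):
        # years: list of dicts like
        #   {'year': 1742, 'words': 120000, 'authors': {'Fielding'}}
        # sorted by year.
        bins, current = [], None
        for y in years:
            if current is None:
                current = {'start': y['year'], 'end': y['year'],
                           'words': 0, 'authors': set()}
            current['end'] = y['year']
            current['words'] += y['words']
            current['authors'] |= set(y['authors'])
            if current['words'] >= min_words and len(current['authors']) >= min_authors:
                if current['end'] - current['start'] + 1 > max_span:
                    print('warning: bin %d-%d spans more than %d years'
                          % (current['start'], current['end'], max_span))
                bins.append(current)
                current = None
        if current is not None:
            bins.append(current)    # trailing partial bin, if any
        return bins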

First, two cases that are easy to anticipate and serve as a kind of check that things aren’t too far off:

Adjective frequency vs. noun frequency

Adverb frequency vs. verb frequency

About what you’d expect: a decent positive correlation between the frequency of nouns or verbs and the frequency of words that modify them. Slightly weaker correlation in the adverb case, presumably because adverbs don’t always modify verbs.
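
For anyone who prefers numbers to eyeballed scatterplots, the correlations themselves are easy to compute. The values below are dummies; in practice each list would hold one frequency per consolidated year bin.

    # Pearson correlations between per-bin POS frequencies (dummy values).
    import numpy as np

    freq = {
        'nn': [0.146, 0.149, 0.151, 0.144, 0.148],
        'jj': [0.054, 0.056, 0.057, 0.053, 0.055],
        'vb': [0.174, 0.171, 0.169, 0.175, 0.172],
        'av': [0.055, 0.054, 0.053, 0.056, 0.055],
    }

    def r(a, b):
        return np.corrcoef(freq[a], freq[b])[0, 1]

    for pair in [('jj', 'nn'), ('av', 'vb'), ('nn', 'vb'), ('jj', 'av')]:
        print(pair, round(r(*pair), 3))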

Then there’s an interesting case that I think I can explain, but wouldn’t have predicted:

Noun frequency vs. verb frequency

Noun and verb frequency are inversely correlated. This makes sense, I suppose, if you think of novels as tending toward portraiture or action (and for all I know it may be a well-known phenomenon). But I expected to see more nouns imply more verbs, since all those extra subjects and objects would need things to do. In any case, I learned something here from my few minutes with GGobi.

Finally, one that leaves me at a loss:

Adjective frequency vs. adverb frequency

How can adjectives and adverbs be apparently uncorrelated? Shouldn’t there be flowery novels rich in both of them and plain ones rich in neither? I’ll investigate, but in the meantime I’d love to be told that this, too, is already accounted for.

Last note: GGobi is really nifty, even if it doesn’t produce beautiful figures out of the box (see above).

MorphAdorner Release

The first public release of MorphAdorner—version 0.9, released April 3, 2009—is now available. There’s full documentation, too. Congratulations and many thanks to Phil Burns – this is great news.

I discussed MorphAdorner as part of my series of posts on part-of-speech taggers a couple of months back, and will be using it for much of my upcoming work.

My understanding is that Phil intends to leave MorphAdorner mostly as-is for the time being, unless it’s taken up by another project; MONK has been funding current development, I think, and it (MONK) is winding down. Which reminds me: A public version of the MONK workbench, with a bevy of analytical tools and access to several thousand texts across four-plus centuries, should be available soon. Will post here when it’s up, though I’m not involved in making that happen.

Evaluating POS Taggers: Coda

A few quick follow-ups on the series of tagger comparison posts.

Other Taggers

One of the limitations (nigh unto embarrassments) of the comparison series was the limited number of packages I examined. This was due to a combination of limited time and early unfamiliarity with the options, but it’s clear in retrospect that there are a few more that should have found a place in the roundup. It’s my hope that I’ll get a chance to look at some of these more closely in the future, but it will probably be a while before that’s a realistic possibility. In the meantime, some notes and links:

OpenNLP

OpenNLP is a suite of Java-based, open source (LGPL) tools for natural language processing. Tom Morton, the project’s maintainer and lead developer, passed along some impressive numbers for speed (in line with what I saw for LingPipe and MorphAdorner) and accuracy (98.35% on the Brown corpus, 96.82% on WSJ). It’s threadsafe and has what appear to be modest memory requirements. I haven’t had a chance to test it myself, but I hope to in the future. In the meantime, it certainly seems worth a close look for anyone doing work like mine.

NLTK

I’ve mentioned NLTK in the past, so will just reiterate that it looks especially useful to those who, like me, are new to NLP (though it’s certainly not limited to that audience). Bob Carpenter also mentions that they have a book on NLP coming out soon with O’Reilly; the full text is already available under a CC license on their site.

Others

And some further links of interest:

  • MALLET (Machine Learning for Language Toolkit). Quoth their page: “MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.”
  • MinorThird: “MinorThird is a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text.” It looks like annotation and visualization are particular emphases:

    Minorthird’s toolkit of learning methods is integrated tightly with the tools for manually and programmatically annotating text. Additionally, Minorthird differs from existing NLP and learning toolkits in a number of ways:

    • Unlike many NLP packages (eg GATE, Alembic) it combines tools for annotating and visualizing text with state-of-the art learning methods.
    • Unlike many other learning packages, it contains methods to visualize both training data and the performance of classifiers, which facilitates debugging. Unlike other learning packages less tightly integrated with text manipulation tools, it is possible to track and visualize the transformation of text data into machine learning data.
    • Unlike many packages (including WEKA), it is open-source, and available for both commercial and research purposes.
    • Unlike any open-source learning systems I know of, it is architected to support active learning and on-line learning, which should facilitate integration of learning methods into agents.

There are doubtless others that I’ve overlooked, but these are enough to keep me busy for the time being.

TreeTagger

Helmut Schmid, the developer of TreeTagger, wrote to let me know that TreeTagger remains under active development (and to give me a few pointers on how best to avoid some of the difficulties I had with it). Good to know; I’ll update the earlier posts accordingly.

The AMALGAM Project

The AMALGAM Project is an attempt (more rigorously worked out) to do the type of tagset mapping that I performed in the bag-of-tags trials. The “multitagged” corpus they’ve produced is pretty small (180 sentences), but/and I guess I was pleased to see that they concluded more or less what I did: It’s hard to map one tagset onto another (see, e.g., “A comparative evaluation of modern English corpus grammatical annotation schemes” [PDF]). Still, an interesting project, and as I say, undertaken in more depth than my own preliminary trials.

Evaluating POS Taggers: Conclusions

OK, I’m as done as I care to be with the evaluation stage of this tagging business, which has taken the better part of three months of intermittent work. This for a project that I thought would take a week or two. There’s a lesson here, surely, about research work in general, my coding skills in particular, and the (de)merits of being a postdoc.

In short: I’m going to use MorphAdorner for my future work. The good news overall, though, is that several of the other taggers would also be adequate for my needs, if necessary.

Here’s a summary of the considerations that influenced my decision:

Accuracy

This is probably the most important issue, but it’s also the most difficult for me to assess. The underlying algorithms that each tagger implements make a difference, but I’m really not qualified to evaluate the relative merits of hidden Markov models vs. decision trees, for example, nor the quality of the code in each package.

What I do have is cross-validation results, and the deeply inconclusive bag-of-tags trials I’ve described previously. My own cross-validation tests tend to confirm what the projects themselves claim, namely that they’re about 97% accurate on average. Or at least that’s what I saw for MorphAdorner (97.1%) and LingPipe (97.0%); I wasn’t able to retrain the Stanford tagger on my reference data, so I can’t do anything other than accept Stanford’s reported cross-validation numbers on non-literary data, which are 96.9% (for the usably speedy left3words model) and 97.1% (for the very slow bidirectional model). TreeTagger fared very poorly in my cross-validation tests, though there may well have been problems on my end that explain that fact. Still, I’d be reluctant to use it for real work when trained (by me) on something other than its stock (non-literary) corpus; otherwise, TreeTagger’s self-reported cross-validation accuracy is 96.4-96.8%.
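
For reference, the kind of k-fold cross-validation I mean looks roughly like the sketch below, which uses NLTK’s simple n-gram taggers over the Brown fiction category as a stand-in corpus. None of the taggers compared in this post is actually being retrained here, and the fold count and backoff chain are arbitrary choices.

    # k-fold cross-validation of a simple backoff tagger over Brown fiction.
    import nltk
    from nltk.corpus import brown

    sents = list(brown.tagged_sents(categories='fiction'))
    k = 10
    fold = len(sents) // k
    scores = []

    for i in range(k):
        test = sents[i * fold:(i + 1) * fold]
        train = sents[:i * fold] + sents[(i + 1) * fold:]
        default = nltk.DefaultTagger('NN')
        unigram = nltk.UnigramTagger(train, backoff=default)
        bigram = nltk.BigramTagger(train, backoff=unigram)
        scores.append(bigram.evaluate(test))

    print(sum(scores) / k)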

As I suggested in my last post, I don’t think the bag-of-tags trials told me much about the relative accuracy of the various taggers, except concerning the larger point that it’s (fundamentally) hard to translate between tagsets. That’s an important thing to know, but I don’t take MorphAdorner’s superiority in those tests as an indication that it’s necessarily more accurate than the others in the general case. I do, however, now understand better MorphAdorner’s performance characteristics on a reduced version of its tagset (it’s about 99% accurate in sum on the larger part-of-speech classes).
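
That reduced-tagset figure amounts to collapsing both the reference tags and the tagger’s output to the word classes before scoring, along the lines of the sketch below. The few translation entries shown are placeholders, not the real NUPOS-to-word-class table.

    # Accuracy after collapsing full tags to reduced word classes.
    REDUCE = {'n1': 'nn', 'n2': 'nn', 'vvd': 'vb', 'vvg': 'vb',
              'j': 'jj', 'av': 'av'}       # placeholder entries only

    def reduced_accuracy(gold, predicted):
        # gold, predicted: parallel lists of (token, tag) pairs
        hits = sum(1 for (_, g), (_, p) in zip(gold, predicted)
                   if REDUCE.get(g, g) == REDUCE.get(p, p))
        return hits / len(gold)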

There are a priori reasons to think that MorphAdorner might do better out of the box on a literary corpus, since it’s trained on such data and uses a tagset specifically geared to literary-historical usage, but those are issues better addressed separately below; I wouldn’t say that I’ve managed to provide a posteriori support for them here.

Tagset

This matters much more than I imagined at the outset, since it determines not only the level of detail you can investigate, but also the kinds of information that are preserved or lost in a tagger’s output.

Out of the box, then, I think MorphAdorner and NUPOS win for literary work, with LingPipe/Brown a reasonably close second. Stanford and TreeTagger use the significantly smaller Penn tagset, which seems less suitable for my needs. One of the things I learned from the bag-of-tags work was that there isn’t any apparent benefit to working with a reduced tagset from the beginning; there’s no evidence in the trials I’ve run that the increased number of tokens in each of a smaller number of classes in such a tagset provides greater big-picture accuracy as compensation for reduced distributional information. I’d want to run the reverse trial of MorphAdorner over the Brown training corpus to make this claim with more confidence, but for now, that’s how things look.

The good news, of course, is that you can in principle retrain any of the taggers on a reference corpus tagged in any tagset, so you can switch back and forth between them. That is, provided you have access to the appropriate training data. Now, I do have access to MorphAdorner’s training data, but I don’t have the right to redistribute it, and I’m not sure what would be involved in doing so, were it necessary. And there’s a certain amount of work and computational horsepower involved in performing the retraining. Assuming I want to use NUPOS (and I do; see Martin Mueller’s NUPOS primer/rationale), MorphAdorner is the easiest way to get it.

Training Data

MorphAdorner is trained on a strictly literary corpus, both British and American, that spans the early modern period through the twentieth century. LingPipe uses the Brown corpus via NLTK, which has a good deal of fiction, but is certainly not exclusively literary. Stanford and TreeTagger use the Wall Street Journal (Penn treebank corpus). If each of the training corpora has been tagged with equal accuracy, one would expect MorphAdorner’s corpus to be best suited to arbitrary literary work, though it has the drawback of not being freely redistributable. I’m not sure if this is likely ever to change, as I’m told that although the works themselves are long out of copyright, the texts were originally derived at least in part from commercial sources like Chadwyck-Healey. It’s something to be aware of, but so long as the compiled model can be passed along (and it obviously can be, since it’s included with the base distribution), it will be possible for others to replicate my work. I’m not sure what, if any, issues I’d encounter if I were to reuse the training data with another tagger, though I don’t see why they’d be much different from those involving MorphAdorner itself.

MorphAdorner’s tokenizer and lemmatizer are also intended to deal accurately and efficiently with the vagaries of early modern orthography, which is certainly a plus.

In any case, one of the advantages of using MorphAdorner is that I don’t have to think about this stuff, nor do I have to work on retraining another tagger, nor do I have to worry about trying to pick up and replicate any improvements to, or refinements of, MorphAdorner’s training data that might happen down the road.

Speed

I didn’t imagine at the outset that speed would fall so far down my list of considerations, but I think this is the right place for it in my own usage scenario. As I mentioned in an earlier post, speed is a qualitative threshold issue for me. Quoting that post:

Faster is better, but the question isn’t really “which tagger is fastest?” but “is it fast enough to do x” or “how can I use this tagger, given its speed?” I think there are three potentially relevant thresholds:

  1. As a practical matter, can I run it once over the entire corpus?
  2. Could I retag the corpus more or less on whim, if it were necessary or useful?
  3. Could I retag the corpus in real time, on demand, if there turned out to be a reason to do so?

The first question is the most important one: I definitely have to be able to tag the entire corpus once. The others would be nice, especially #2, but I don’t yet have a compelling reason to require them.

To recap my earlier conclusions, TreeTagger, LingPipe, and MorphAdorner are all fast enough to meet thresholds 1 and 2. Stanford, using the (slightly less accurate) left3words model, meets threshold 1 (tag everything once) and might meet number 2 (tag it again) in a pinch. Stanford bidirectional would be a real stretch to run over a large corpus even once on moderate (read: affordable and accessible to a humanist) hardware. None of the taggers is fast enough on my hardware for full on-demand work, though it’s worth recalling that I made no real attempts to optimize for speed (Bob Carpenter at Alias-i reports 100x speedups of LingPipe are possible with some tweaking). But this on-demand business is a theoretical rather than an immediately practical issue for my work, so I don’t attach much weight to it.
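
For a rough sense of what threshold 1 demands, assume something like 1,000 novels at 100,000 words apiece and a tagging rate in the 25K tokens/s range (both loose assumptions):

    tokens = 1_000 * 100_000            # ~1e8 tokens for a MONK-sized fiction corpus
    rate = 25_000                       # tokens per second, LingPipe-ish ballpark
    hours = tokens / rate / 3600
    print(round(hours, 1), 'hours')     # a bit over an hour of wall-clock tagging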

License, Source Code, and Cost

Here it makes sense to break things down case by case:

MorphAdorner: Not yet generally available, but forthcoming when Phil Burns finishes the documentation, probably mid-late February of this year (2009). To be released under a modified NCSA license, freely redistributable with attribution. The same can’t be said of the raw training data, I think, but the compiled models will be included. No cost, all source code available. Under active development.

LingPipe: The only commercial offering of the bunch. Open source and free to use, provided you make all tagged output available in turn. That wouldn’t be a problem for me at the moment, when I’m looking to work with freely available texts (Gutenberg, etc.), but could be a limitation later if/when I use copyrighted corpora. Exceptions to the redistribution requirement are available for sale from Alias-i; they can be modest to pricey in the context of grant-challenged academic humanities, though Bob has suggested that there may be flexibility in their licensing for academics. In any case, I don’t doubt that I could make it work in my own case, but I have at least minor reservations about the impact of using commercial tools on the subsequent adoption of my methods. The ideal case would be for anyone who’s interested to pick up my toolset with the fewest possible encumbrances. That’s not to say there aren’t issues with the other packages’ licensing terms (and probably more importantly with copyright issues involving my working corpora), nor that I object to Alias-i’s business model (which I think is an eminently reasonable compromise between openness and the need to feed themselves), but it’s a consideration. Under very active development, and with outstanding support from the lead developer (the aforementioned Bob Carpenter).

Stanford: Like the other academic packages, open source and free software. Licensed under GPL2. Under active development.

TreeTagger: Distributed free for “research, teaching, and evaluation purposes.” No right to redistribute. No source code available and not under active development, as far as I can tell. [Update: Helmut Schmid writes to tell me that TreeTagger is indeed still under development.]

Other Considerations

There were a few other minor concerns and thoughts.

Threadsafeness can be an issue for the Java-based taggers (that is, all but TreeTagger). LingPipe is threadsafe. Stanford is not. MorphAdorner, I don’t know. This isn’t an immediate concern, since I have enough memory to throw at two separate JVM instances and only two cores to work with, but it would be a nice thing to have in the future.

Input/output encodings and formats. All three of the Java-based taggers can handle Unicode text (which is good), and they can take input data in either plain text or XML format. MorphAdorner and Stanford by default give you back the same format you put in; LingPipe (again, by default) gives you XML output either way. Doesn’t make much difference to me, and it’s easy to write a simple output filter for any of the packages (TreeTagger possibly excepted) that gives you what you want.
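
By “simple output filter” I mean something on the order of the sketch below, which reads tab-separated token/tag lines on stdin and emits token_TAG pairs; the assumption that the token sits in column one and the tag in column two would need adjusting to each package’s actual output layout.

    # Trivial output filter: token<TAB>tag lines in, token_TAG out.
    import sys

    for line in sys.stdin:
        parts = line.rstrip('\n').split('\t')
        if len(parts) >= 2:
            print('%s_%s' % (parts[0], parts[1]))
        else:
            print(parts[0])    # pass blank or sentence-break lines through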

Finally, Bob suggests having a look at NLTK, which I mentioned in an earlier post but didn’t really do anything with. Certainly something to keep in mind for the future, especially as it has a kind of “welcome to NLP work, please allow me to show you around and make things easier for you” vibe. It’s Python-based and GPL2 licensed. Will investigate as time allows.

And that, finally, is that. Back to proper literary work for a bit—polishing off the Coetzee article and talk I mentioned a while ago—then to book manuscript revisions. But the computational work will continue through the spring and summer. With results, eventually, I swear!

Evaluating POS Taggers: LingPipe Bag of Tags Accuracy and General Thoughts on Tagset Translation

Gah, this is all nearing completion. Will have a wrap-up of the whole series later tonight; I, for one, await my conclusions with bated breath.

Before I can finish the overall evaluation, here are the results of my last trial, an iteration of the bag-of-tags accuracy tests I’ve been doing, this time with LingPipe. Note, though, that the section below on tagset conversion and this bag-of-tags approach is probably more interesting than the specific LingPipe results (and there are nice summary graphs of the whole shebang down there, too!).

LingPipe Results

For reference, the list of basic tags and a LingPipe-to-MorphAdorner (i.e., Brown-to-NUPOS) translation table are available. Graphs are below, problematic translations from Brown to NUPOS as follow:

  • ‘abx’ = pre-quantifier, e.g., “both.” These are usually ‘dt’ in the reference data, but about 30% ‘av’. Adjusted as such in the figures below.
  • ‘ap’ = post-determiner, e.g., “many, most, next, last, other, few,” etc. These are complicated; they’re predominantly ‘dt’ (34%), ‘jj’ (22%), ‘nn’ (13%), and ‘nu’ (29%) in the reference data (the other 3% being various other tags). But of course it’s hard to know exactly how much confidence to place in such estimates, absent a line-by-line comparison of all 24,000+ cases. Figures below are nevertheless adjusted according to these percentages; see the sketch just after this list for how the adjustment works.
  • ‘ap$’ = possessive post-determiner. There aren’t very many of these, and they’re mostly attached to tokenization errors. Ignored entirely.
  • ‘tl’ = words in titles. This is supposed to be a tag-modifying suffix to indicate that the token occurs in a title (see also ‘-hl’ for headlines and ‘-nc’ for citations), but LingPipe uses the ‘tl’ tag alone. Split 50/50 between nouns and punctuation, since those dominate the tokens thus tagged, but this is a kludge.
  • ‘to’ = the word “to.” Translated as ‘pc’ = particle, but is also sometimes (~43%) ‘pp’ = preposition in the reference data. Adjusted below.
  • LingPipe doesn’t use the ‘fw’ (foreign word) or ‘sy’ (symbol) tags
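
The adjustments above amount to redistributing each problematic tag’s counts proportionally across the reduced classes, roughly as in the sketch below; only two tags are shown, and the percentages just mirror (or round off) the estimates listed in this post.

    # Proportional redistribution of ambiguous Brown tags across reduced classes.
    SPLITS = {
        'abx': {'dt': 0.70, 'av': 0.30},
        'ap':  {'dt': 0.34, 'jj': 0.22, 'nn': 0.13, 'nu': 0.29},  # remainder ignored
    }

    def redistribute(raw_counts):
        # raw_counts: dict mapping Brown tags to token counts. Tags without an
        # entry in SPLITS still need the ordinary Brown-to-NUPOS translation.
        adjusted = {}
        for tag, count in raw_counts.items():
            for target, share in SPLITS.get(tag, {tag: 1.0}).items():
                adjusted[target] = adjusted.get(target, 0) + count * share
        return adjusted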

Data

 

Table 1: POS frequency in reference data and LingPipe’s output

 

POS	Ref	Test	Ref %	Test %	Diff	Err %
av*	213765	228285	 5.551	 5.718	 0.167	  3.0
cc	243720	231708	 6.329	 5.804	-0.525	 -8.3
dt*	313671	310143	 8.145	 7.769	-0.377	 -4.6
fw	  4117		 0.107		-0.107	
jj*	210224	203683	 5.459	 5.102	-0.357	 -6.5
nn*	565304	596960	14.680	14.954	 0.274	  1.9
np	 91933	118115	 2.387	 2.959	 0.571	 23.9
nu*	 24440	 38856	 0.635	 0.973	 0.339	 53.4
pc*	 54518	 35098	 1.416	 0.879	-0.537	-37.9
pp*	323212	356411	 8.393	 8.928	 0.535	  6.4
pr	422442	430172	10.970	10.776	-0.194	 -1.8
pu*	632749	605152	16.431	15.159	-1.273	 -7.7
sy	   318		 0.008		-0.008	
uh	 19492	 35471	 0.506	 0.889	 0.382	 75.5
vb	666095	664957	17.297	16.657	-0.640	 -3.7
wh	 40162	 70998	 1.043	 1.778	 0.736	 70.5
xx	 24544	 23825	 0.637	 0.597	-0.041	 -6.4
zz	   167	 42272	 0.004	 1.059	 1.055	
Tot	3850873	3992106	

* Tag counts to which adjustments have been applied (see above)

Legend
POS = Part of speech (see this previous post or this list for explanations)
Ref = Number of occurrences in reference data
Test = Number of occurrences in output
Ref % = Percentage of reference data tagged with this POS
Test % = Percentage of output tagged with this POS
Diff = Difference in percentage points between Ref % and Test %
Err % = Percent error in output frequency relative to reference data
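
(For the record, the Diff and Err % columns follow directly from the legend; the ‘av’ row, for instance, works out like this:)

    ref_pct, test_pct = 5.551, 5.718
    diff = test_pct - ref_pct                        # 0.167 percentage points
    err_pct = 100 * (test_pct - ref_pct) / ref_pct   # about 3.0 percent
    print(round(diff, 3), round(err_pct, 1))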

Pictures

And then the graphs (click for large versions).

Figure 1: Percentage point errors in POS frequency relative to reference data

 

LP Errors.png

 

Figure 2: Percent error by POS type relative to reference data

 

TT Errors Pct.png

 

Discussion of LingPipe Results

This is about what we’ve seen with the other taggers that use a base tagset other than NUPOS; it’s a bit better than either Stanford or TreeTagger, a fact that stands out more clearly in the summary comparison graphs below, but there are just too many difficulties converting between any two tagsets to say much more. One could certainly point out some of the obvious features in the present case—LingPipe has a thing for numbers, proper nouns, wh-words, and interjections, plus an aversion to punctuation, verbs, and participles—but I think the only genuinely interesting feature is LingPipe’s willingness to tag things as unknown. I’ve left this out of Figure 2 because it badly skews the scale, but notice in the data above that there are just 167 ‘zz’ tags in the reference corpus, but 42,000+ instances of ‘nil’ (=’zz’) in the LingPipe output.

We didn’t see anything like this with Stanford or TreeTagger, but it might be useful. (Of course, it might also be a mess.) I can imagine situations in which it would be better to know that the tagger has no confidence in its output rather than pushing ahead with garbage results. This is one of the reasons that taggers with the option of producing confidence-based output are (potentially) useful, since they would allow one to isolate borderline cases. LingPipe and TreeTagger have such an option; Stanford and MorphAdorner do not, to the best of my knowledge.

Thoughts on Converting between Tagsets

First, some graphs that collate the results of all the bag-of-tags trials. They’re the same as the ones I’ve been using so far, but now with all the numbers presented together for easier comparison.

As always, click each graph for the full-size image.

Figure 3: Percentage point errors by POS type (Summary)

 

Sum Errors.png

 

Figure 4: Error % by POS type (Summary)

 

Note: Y-axis intentionally truncated at +100%.

Sum Errors Pct.png

 

Figure 5: Weighted error percentages by tagger and POS type (Summary)

 

This is the one I like best, since it makes plain the relative importance of each POS type; large errors on rare tags generally matter less than modest errors on common tags, though the details will depend on one’s application.

Sum Errors Bubbles.png

 

Confirming what you see above: MorphAdorner does well over the reference data, which looks like its own training corpus. (Recall that the MorphAdorner numbers are taken from its cross-validation output and without the benefit of its in-built lexicon, so it’s not just running directly over material it already knows. Apologies for the personification in the preceding sentence.) LingPipe is marginally better than Stanford or TreeTagger (this would be more obvious if the log scale weren’t compressing things in the 1-10% error range), but all three (using different training data and different tagsets) lag MorphAdorner significantly (by an eyeballed order of magnitude, more or less).

So … what have I learned from these attempts to measure accuracy across tagsets? Less than I’d hoped, at least in the direct sense. These trials were motivated by an interest in whether or not taggers trained on non-literary corpora would produce results similar to MorphAdorner’s (which is trained exclusively on literature). The problem was that they all use different tagsets out of the box, and I was somewhere between unwilling and unable to retrain all of them on a common one. My thinking was that I’d be able to smooth out their various quirks by picking a minimal set of parts of speech that they’d all recognize, and then mapping their full tagsets down to this basic one.

The problem is that the various tagsets don’t agree on what should be treated as a special case (wh-words, predeterminers, “to,” etc.), and the special cases don’t map consistently to individual parts of speech. The numbers I’ve presented in each of the recent posts on the topic have tried to apply appropriate statistical fixups, but they’re hacks and (informed) guesses at best. In any case, I think what I’m really seeing is that taggers are reasonably good at reproducing the characteristics of their own training input (which we knew already, based on ~97% cross-validation accuracy). So MorphAdorner does well (generally ~99% accuracy over the reduced tagset, i.e., distinguishing nouns from verbs from other major parts of speech) on data that resembles the stuff on which it was trained; the others do less well on that material, since it differs from their training data not just by genre, but also (and more importantly, I think) by tagset.

(An aside: LingPipe is trained on the Brown corpus, which contains a significant amount of fiction and “belles lettres” [Brown’s term, not mine]. Stanford and TreeTagger use the Penn treebank corpus, i.e., all Wall Street Journal, all the time. So there’s a priori reason to believe that LingPipe should do better than either of those two on literary fiction. I like the Brown tagset better than Penn, too, since it deals more elegantly with negatives, possessives, etc.)

For the sake of comparison, I looked into running a bag-of-tags evaluation of MorphAdorner over the Brown corpus to see if the accuracy numbers would turn out more like those for the other taggers when faced with “foreign” data. My strong hunch is that they would, but it was going to be more trouble than it was worth to nail it down adequately. Perhaps another time.

Takeaway lessons? Mostly, be careful about direct comparisons of the output of different taggers. If I see somewhere that Irish fiction of the 1830s contains 15% nouns, and know that I’ve seen 17% in British Victorian novels, I probably can’t draw any meaningful conclusions from that fact without access to the underlying texts and/or a lot more information about the tools used. It also means that if I settle on one package, then later change course, I’ll almost certainly need to rerun any previous statistics gathered with the original setup if I’m going to compare them with confidence to new results.

More broadly speaking—and this looks ahead to the overall conclusions in the next post—this all highlights the fact that both tagsets and training data matter a lot. The algorithms used by each of the taggers do differ from one another, even when they use similar techniques, and they make different trade-offs concerning accuracy and speed. But the differences introduced by those underlying algorithmic changes—on the order of 1%, max—are small compared to the ones that result from trying to move between tagsets (and, presumably, between literary and non-literary corpora, though the numbers I’ve presented here don’t throw direct light on that point).

This concludes the bag-of-tags portion of tonight’s program. Stay tuned for the grand finale after the intermission.

Evaluating POS Taggers: TreeTagger Bag of Tags Accuracy

This will be brief-ish, since the issues are the same as those addressed re: the Stanford tagger in my last post, and the results are worse.

I’ve again used out-of-the-box settings; like Stanford, TreeTagger uses a version of the Penn tagset. A translation table is available, as is a list of basic tags I’m using for comparison.

As with Stanford, there are a couple of reasons to expect that the results will be worse than those seen with MorphAdorner. There’s the tokenizer again (TreeTagger breaks up things that are single tokens in the reference data), and there’s the non-lit training set. Plus the incompatibility between the tagsets. As before:

  • New: TreeTagger has a funky ‘IN/that’ tag, which might be translated as either ‘pp’ or ‘cs’ (where ‘cs’, subordinating conjunction, is already rolled into ‘cc’, conjunction, in my reduced tagset). I’ve used ‘pp’, which should therefore be overcounted, while ‘cc’ is undercounted.
  • TreeTagger/Penn has a ‘to’ tag for the word “to”; the reference data (MorphAdorner’s training corpus) has no such thing, using ‘pc’ and ‘pp’ as appropriate instead.
  • TreeTagger uses ‘pdt’ for “pre-determiner,” i.e., words that come before (and qualify) a determiner. MorphAdorner lacks this tag, using ‘dt’ or ‘jj’ as appropriate.
  • Easier: TreeTagger uses ‘pos’ for possessive suffixes, while MorphAdorner doesn’t break them off from the base token, and contains modified versions of the base tags that indicate possessiveness. But since I’m not looking at possessives as an individual class, I can just ignore these, since the base tokens will be tagged on their own anyway.
  • Also easy-ish: TreeTagger doesn’t use MorphAdorner’s ‘xx’ (negation) tag. It turns out that almost everything MorphAdorner tags ‘xx’, TreeTagger considers an adverb, so one could lump ‘xx’ and ‘av’ together, were one so inclined.

Data

 

Table 1: POS frequency in reference data and TreeTagger’s output

 

POS	Ref	Test	Ref %	Test %	Diff	Err %
av	213765	226125	 5.551	 5.830	 0.279	  5.0
cc*	243720	167227	 6.329	 4.312	-2.017	-31.9
dt*	313671	292794	 8.145	 7.549	-0.596	 -7.3
fw	  4117	   519	 0.107	 0.013	-0.094	-87.5
jj*	210224	262980	 5.459	 6.781	 1.322	 24.2
nn	565304	642627	14.680	16.570	 1.890	 12.9
np	 91933	162270	 2.387	 4.184	 1.797	 75.3
nu	 24440	 17668	 0.635	 0.456	-0.179	-28.2
pc*	 54518	 91877	 1.416	 2.369	 0.953	 67.3
pp*	323212	371449	 8.393	 9.577	 1.184	 14.1
pr	422442	386668	10.970	 9.970	-1.000	 -9.1
pu	632749	555115	16.431	14.313	-2.118	-12.9
sy	   318	   100	 0.008	 0.003	-0.006	-68.8
uh	 19492	  6063	 0.506	 0.156	-0.350	-69.1
vb	666095	650441	17.297	16.771	-0.526	 -3.0
wh	 40162	 44428	 1.043	 1.146	 0.103	  9.8
xx	 24544	     0	 0.637	     0	-0.637	-100
zz	   167	    13	 0.004	 0.000	-0.004	-92.3
Tot	3850873	3878364

* Tag counts for which there is reason to expect systematic errors

Legend
POS = Part of speech (see this previous post or this list for explanations)
Ref = Number of occurrences in reference data
Test = Number of occurrences in output
Ref % = Percentage of reference data tagged with this POS
Test % = Percentage of output tagged with this POS
Diff = Difference in percentage points between Ref % and Test %
Err % = Percent error in output frequency relative to reference data

Pictures

And then the graphs (click for large versions).

Note: These graphs are corrected for the xx/av problem discussed above; ‘xx’ tags in the reference data have been rolled into ‘av’ here.

Figure 1: Percentage point errors in POS frequency relative to reference data

 

TT Errors.png

 

Figure 2: Percent error by POS type relative to reference data

 

TT Errors Pct.png

Note on Figure 2:
The bubble charts I’ve used in previous posts are a pain to create; this is much easier and, while not quite as useful for comparing weightings, is good enough for now, especially since the weightings don’t change between taggers (they’re based on tag frequency in the reference data). Note, too, that there’s no log scale involved this time.
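
(For what it’s worth, a bubble-style version needn’t be too much work either; here’s a rough matplotlib sketch, not how the originals were produced, using a few approximate values from Table 1 above.)

    # Rough bubble chart: x = percent error, bubble area scaled by the tag's
    # frequency in the reference data. Values approximated from Table 1 above.
    import matplotlib.pyplot as plt

    tags = ['nn', 'vb', 'jj', 'av']
    ref_freq = [14.7, 17.3, 5.5, 5.6]      # % of reference data
    err_pct = [12.9, -3.0, 24.2, 5.0]      # TreeTagger error % per tag

    plt.scatter(err_pct, range(len(tags)), s=[50 * f for f in ref_freq], alpha=0.5)
    plt.yticks(range(len(tags)), tags)
    plt.xlabel('Error % relative to reference data')
    plt.axvline(0, color='grey', linewidth=1)
    plt.show()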

Discussion

Ignoring the problematic tags (cc, dt, jj, pc, and pp), things are still pretty bad. Nouns, common and proper alike, are significantly overcounted, verbs and pronouns are undercounted. Rarer tokens (foreign words, symbols, interjections) are a mess, but that’s to be expected. Overall, the error rates are in the neighborhood of Stanford, but a bit worse.

The same caveats concerning the limits of translating between tagsets are in place here as were true in the Stanford case, but again, it’s hard to see how any of this could be construed as better than MorphAdorner.

Takeaway point: TreeTagger looks to be out, too.

Evaluating POS Taggers: Stanford Bag of Tags Accuracy

Following on from the MorphAdorner bag-o-tags post, here’s the same treatment for the Stanford tagger.

I’ve used out-of-the-box settings, which means the left3words tagger trained on the usual WSJ corpus and employing the Penn Treebank tagset. Translations from this set (as it exists in Stanford’s output data) to my (very) reduced tagset are also available.

There are a couple of reasons to expect that the results will be worse than those seen with MorphAdorner. One is the tokenizer. Another is the different (non-lit) training set. A third is incompatibility between the tagsets. This last point is unfortunate, but there’s not really any easy way to get around it. It crops up in a few of the tags, and works in both directions:

  • Stanford/Penn has a ‘to’ tag for the word “to”; the reference data (MorphAdorner’s training corpus) has no such thing, using ‘pc’ and ‘pp’ as appropriate instead.
  • Stanford uses ‘pdt’ for “pre-determiner,” i.e., words that come before (and qualify) a determiner. MorphAdorner lacks this tag, using ‘dt’ or ‘jj’ as appropriate.
  • Easier: Stanford uses ‘pos’ for possessive suffixes, while MorphAdorner doesn’t break them off from the base token, and contains modified versions of the base tags that indicate possessiveness. But since I’m not looking at possessives as an individual class, I can just ignore these, since the base tokens will be tagged on their own anyway.
  • Also easy-ish: Stanford doesn’t use MorphAdorner’s ‘xx’ (negation) tag. It turns out that almost everything MorphAdorner tags ‘xx’, Stanford considers an adverb, so one could lump ‘xx’ and ‘av’ together, were one so inclined.

Data

 

Table 1: POS frequency in reference data and Stanford output

 

POS	Ref.	Test	Ref %	Test %	Test Dif	Test Err %
av	213765	222465	 5.551	 5.754	 0.20264	  3.650
cc	243720	165546	 6.329	 4.282	-2.04736	-32.349
dt*	313671	299797	 8.145	 7.754	-0.39166	 -4.808
fw	  4117	  4094	 0.107	 0.106	-0.00103	 -0.959
jj*	210224	235856	 5.459	 6.100	 0.64093	 11.741
nn	565304	583214	14.680	15.084	 0.40405	  2.752
np	 91933	156650	 2.387	 4.052	 1.66419	 69.709
nu	 24440	 18006	 0.635	 0.466	-0.16896	-26.623
pc*	 54518	 92888	 1.416	 2.402	 0.98668	 69.694
pp*	323212	375382	 8.393	 9.709	 1.31547	 15.673
pr	422442	385059	10.970	 9.959	-1.01107	 -9.217
pu	632749	641420	16.431	16.589	 0.15804	  0.962
sy	   318	   118	 0.008	 0.003	-0.00521	-63.043
uh	 19492	  2239	 0.506	 0.058	-0.44826	-88.560
vb	666095	641714	17.297	16.597	-0.70029	 -4.049
wh	 40162	 41807	 1.043	 1.081	 0.03834	  3.676
xx	 24544	     0	 0.637	 0.000	-0.63736	-100.00
zz	   167	   201	 0.004	 0.005	 0.00086	 19.874
Tot	3850873	3866456		

* Tag counts for which there is reason to expect systematic errors

Legend
POS = Part of speech (see this previous post or this list for explanations)
Ref. = Number of occurrences in reference data
Test = Number of occurrences in output
Ref % = Percentage of reference data tagged with this POS
Test % = Percentage of output tagged with this POS
Test Dif = Difference in percentage points between Ref % and Test %
Test Err % = Percent error in output frequency relative to reference data

Pictures

And then the graphs (click for large versions). Note: These graphs are corrected for the xx/av problem discussed above; ‘xx’ tags in the reference data have been rolled into ‘av’ here.

Figure 1: Percentage point errors in POS frequency relative to reference data

 

ST Errors.png

 

Figure 2: Percent error by POS type relative to reference data

 

ST Errors Bubble.png

Notes on Figure 2:

  1. As before, the size of each bubble represents the relative frequency of that POS tag in the reference data. Bigger bubbles mean more frequent tags.
  2. The x-axis is logarithmic, so the differences are both more (‘uh’, ‘sy’) and less (‘pu’, ‘nn’) dramatic than they appear.

Discussion

First, there are a few POS types that we know will be off, since there’s not a straightforward conversion for them between the NUPOS and Penn tagsets. These are: dt, jj, pc, and pp. Adjectives (jj) are the only ones that are a bit of a disappointment, since that’s a class in which I have some interest. On the plus side, though, this is a problem connected to the ‘pdt’ tag in Penn, and there are only about 4,000 occurrences of it in Stanford’s output, compared to 235,000+ occurrences of ‘jj.’ Even if half the ‘pdt’ tags should be ‘jj’ rather than ‘dt’, that’s still less than a 1% contribution to the overall error rate for ‘jj’.
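
(A quick check on the arithmetic behind that claim, using the counts from Table 1:)

    pdt_total = 4_000          # ~4,000 'pdt' tags in Stanford's output
    jj_total = 235_856         # 'jj' count from Table 1
    shift = (pdt_total / 2) / jj_total * 100
    print(round(shift, 2), '%')   # about 0.85%, i.e., under one percent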

That said, what sticks out here? Well, the numbers are a lot worse than those for MorphAdorner, for which the worst cases (proper nouns, nouns, and punctuation) were 0.1 – 0.2 percentage points off, compared to 1 – 2 here. So we have results that are about an order of magnitude worse. And in MorphAdorner’s case, the nouns overall may not have been as bad as they look, since proper nouns were undercounted, while common nouns were overcounted by a roughly offsetting amount. For Stanford, though, both common nouns and proper nouns are overcounted, so you can’t get rid of the error by lumping them together.

Similarly, relative error percentages for most tag types are much higher. In Figure 2, the main cluster of values is between 1% and about 50%; for MorphAdorner, it was between 0.1% and 1%. Nouns, verbs, adjectives, adverbs, and pronouns—all major components of the data set—are off by 3% to 10% or more.

The real question, though, is what to make of all this. How many of the errors are merely “errors,” i.e., differences between the way Stanford does things and the way MorphAdorner does them? What I’m interested in, ultimately, is a tagger that’s reliable across a historically diverse corpus; I don’t especially care if it undercounts nouns, for instance, so long as it undercounts them all the time, everywhere, to the same extent. But in the absence of literary reference data not linked to MorphAdorner, and without the ability to train the Stanford tagger on MorphAdorner’s reference corpus, it’s hard for me to assess accuracy other than by standard cross-validation and this bag of tags method.

Takeaway point: I don’t see any compelling reason to keep Stanford in the running at this point.

Evaluating POS Taggers: LingPipe Cross-Validation

Below are the promised cross-validation results for LingPipe. They’re produced by LingPipe’s own test suite rather than by my own (cruder) methods, but there’s no reason to suspect that these numbers aren’t directly comparable to my earlier results for MorphAdorner.

So, without delay, the out-of-the-box numbers:

Accuracy: 96.9%
Accuracy on unknown tokens: 71%

That’s using 5-grams; compiled models are about 6 MB. With a cache and 8-grams (producing 16 MB models), things are about the same:

Beams  Acc   Unk   Speed (tokens/s)*
10    .961  .683   29K/s
14    .963  .694   28K/s
20    .967  .699   27K/s
28    .970  .699   18K/s
40    .970  .699    5K/s

* See note below on speed; my own numbers are a bit lower, because my machine is slower.

Note, as Bob did in an email to me, that overall accuracy in this case is very slightly higher, but that it actually does a little bit worse on unknown tokens.

For reference, recall that MorphAdorner (with the potential benefit of running over training data produced in conjunction with one of its predecessors, hence likely to do a bit better on tokenization; Martin or Phil, correct me if I’m wrong about this) was 97.1% accurate when restricted to a lexicon derived solely from the cross-validation data. Unfortunately I don’t have figures for MorphAdorner’s performance on unknown tokens.

Takeaway point: This looks to me, for practical purposes, like a dead heat as far as accuracy goes.

Next up, a comparison of overall bag of tags statistics.

[Note: The numbers above are from Bob’s report to me. I’ve tried to rerun them on my machine, but have had trouble getting the cross-validation run to finish. It keeps dying at apparently random spots after many successful fold iterations with a Java error (not out of memory, but something about a variable being out of range). So I don’t have full numbers from my own trial. But the process goes more than far enough to suggest that the numbers above are reasonable and repeatable; I see very similar accuracy figures for each fold over many folds and with different fold sizes. My speeds are lower (about half the numbers given above, consistent with what I’ve seen for LingPipe on this computer in the past) since my machine is slower, but show a similar pattern: consistent through 14 or 20 beams, slowing at 20 or 28, and down by 3x or 4x at 40, with accuracy leveling off at 20 or 28 beams (which looks to be the speed/accuracy sweet spot). In any case, I’m satisfied with the way things stand and don’t see much reason to look into this further.]