Some POS Frequency Factoids
July 5, 2009
I’ll be posting a couple of times in the next few days about DH ’09, THATCamp, and the state of my project. First, though, a handful of (mildly) interesting plots concerning part-of-speech frequency correlations from the MONK corpus.
MONK contains about 1,000 novels and novel-like works spread over the eighteenth, nineteenth, and twentieth centuries. (The full corpus is larger and covers a longer timespan; it includes drama, witchcraft narratives, some nonfiction, etc.) I’ve counted occurrences of the major POS types across just the narrative fiction, divided them up by year of publication, and then grouped together a few nearby years in which few or no books were included. In the end, there’s coverage from 1742 through 1905, with all years (or groups of years) containing at least 500,000 words by four or more authors and no group spanning more than five years. This is the same dataset from which I’ll construct some POS frequency vs. time graphs in a later post (where I’ll also link to the raw counts).
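For readers who want to try a similar binning on their own corpus, here is a minimal sketch of one way to do it. It is not the procedure actually used for the MONK counts, which the post does not spell out. The function name `group_years`, the input shape (a dict mapping each year to a word count and a set of author names), the greedy left-to-right strategy, and the reading of "span" as last year minus first year plus one are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class YearGroup:
    """A run of nearby publication years pooled into one data point."""
    years: list = field(default_factory=list)
    words: int = 0
    authors: set = field(default_factory=set)


def group_years(per_year, min_words=500_000, min_authors=4, max_span=5):
    """Greedily pool consecutive years until a group is big enough.

    per_year maps year -> (word_count, set_of_author_names); this input
    shape is hypothetical. A group is emitted as soon as it reaches
    min_words and min_authors. If the next year would stretch the open
    group beyond max_span years, the open group is dropped as
    under-threshold (a crude choice; a real pass might rebalance instead).
    """
    groups = []
    current = YearGroup()
    for year in sorted(per_year):
        words, authors = per_year[year]
        # Adding this year would make the open group span too many years.
        if current.years and year - current.years[0] + 1 > max_span:
            current = YearGroup()
        current.years.append(year)
        current.words += words
        current.authors |= set(authors)
        if current.words >= min_words and len(current.authors) >= min_authors:
            groups.append(current)
            current = YearGroup()
    # Any trailing open group never met the thresholds, so it is discarded.
    return groups
```

Calling `group_years` on per-year totals returns only the groups that satisfy all three constraints; years that cannot be pooled into a qualifying group are simply left out, which is one plausible reading of "coverage from 1742 through 1905" rather than a documented rule.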
First, two cases that are easy to anticipate and serve as a kind of check that things aren’t too far off:
About what you’d expect: a decent positive correlation between the frequency of nouns or verbs and the frequency of words that modify them. Slightly weaker correlation in the adverb case, presumably because adverbs don’t always modify verbs.
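To put a number on a correlation like this, one option is Pearson's r computed over the per-group relative frequencies. The post itself only shows scatterplots, so treating "correlation" as Pearson's r is an assumption, and the helper names `relative_frequency` and `pearson` below are hypothetical. A minimal sketch:

```python
import math


def relative_frequency(tag_counts, total_words):
    """Convert raw per-group tag counts into per-group proportions."""
    return [count / total for count, total in zip(tag_counts, total_words)]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

Given per-group noun and adjective counts plus total word counts, `pearson(relative_frequency(nouns, totals), relative_frequency(adjectives, totals))` would yield a value between -1 and 1, with values near zero corresponding to a shapeless scatterplot.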
Then there’s an interesting case that I think I can explain, but wouldn’t have predicted:
Noun and verb frequency are inversely correlated. This makes sense, I suppose, if you think of novels as tending toward either portraiture or action (and for all I know it may be a well-known phenomenon). But I expected to see more nouns imply more verbs, since you’d need more things for those subjects and objects to do. In any case, I learned something here from my few minutes with GGobi.
Finally, one that leaves me at a loss:
How can adjectives and adverbs be apparently uncorrelated? Shouldn’t there be flowery novels rich in both of them and plain ones rich in neither? I’ll investigate, but in the meantime I’d love to be told that this, too, is already accounted for.
Last note: GGobi is really nifty, even if it doesn’t produce beautiful figures out of the box (see above).