In an earlier post, I offered a brief list of paired allegorical and nonallegorical texts by single authors. The idea was to use these pairs to look for the distinguishing textual features of allegory by controlling for as many variables (such as authorial style, genre, national origin, gender, period of composition, etc.) as possible. Or in other words, the attempt was to get as close as possible to the unattainable ideal of a corpus of texts that differ only by the presence or absence of allegory.

That short list was OK and was the basis of the second figure in my MLA paper on “Critical Text Mining.” But it was both (1) too short for corpus work and (2) dependent on my own assessment of allegoricalness, with attendant limitations of historical scope. I’ve always felt that the better option would be to build an expanded version of this pairwise list on the basis of settled scholarship in the field.

The table below represents the groundwork for such a corpus of well-established allegorical-nonallegorical pairs. It’s still under development—there are obvious holes and issues—but it’s an outline of where I’m headed. What I really need now is feedback on the composition of this list.

Issues and Notes

A few notes, followed by a request for kind assistance:

  • All of the allegorical works are attested by one or more of the following major sources on allegory. Most are attested by several of them.
  • From these sources, I’ve excluded works mentioned only in passing or discussed as ambiguous or difficult cases. So while there’s always room to argue about the allegoricalness of any entry, the texts presented here under the heading of “Allegory” are about as canonically allegorical as it’s possible to be.
  • The nonallegorical texts are another matter; I’ve selected them myself as potential pairings for the allegorical entries. So far I’ve limited these to works by the same author, but I’m not necessarily averse to well-paired nonallegorical entries by other authors (and I’m aware that such pairings will sometimes be required).

There are two ways to use this list, and therefore two potentially conflicting goals when selecting pairs of texts:

  • Pairwise comparisons. In this case, I’ll evaluate each allegorical text only against its paired nonallegorical counterpart. For this purpose, it’s not especially important where the two texts fall on the imagined spectrum of allegoricalness, only that they be well separated from one another on it. But it is important that the two members of the pair are otherwise as similar as possible.
  • Corpus comparisons. On the other hand, I’ll also want to compare the features of the allegorical texts taken together against those of the collected nonallegorical texts. For this purpose what’s important is to avoid cases in which any of the allegorical or nonallegorical entries stray too far toward the opposite category, even if they’re significantly different from their pairmates. But it’s not so crucial that any one pair be especially well matched in content, style, etc.; the two corpora just need to be similar in overall composition.

Action Item

So what I’m looking for is feedback on the suitability of the nonallegorical items that are currently listed below, plus suggestions for appropriate texts where none is given.

The ideal case is to find a firmly nonallegorical text by the same author for each of the allegorical entries, but where that’s not possible, the next best solution is probably a text of similar origin, style, length, subject matter, form, and so forth. This will never be perfect, but the closer the match (while still maintaining good relative and absolute separation on the allegorical continuum), the better.

I’d also love to know about potential issues or complications concerning any of these texts and pairings.

Oh, and one other constraint: I need to be able to get my hands on electronic versions of whatever texts I’m going to use; this makes anything published after 1923 difficult (though not strictly impossible).

Massive thanks in advance to any and all who care to comment. The comments section below is probably the easiest way to leave feedback, or you can email me by clicking the “About” link (over on the lefthand side).

The Table: Allegorical and Nonallegorical Text Pairs Grouped by Era

Author | Allegory | Nonallegory | Notes
Ancient and classical
Aeschylus | Prometheus Bound | Agamemnon | Disputed authorship of Prometheus Bound
Aesop | Fables | ???
Hesiod | Theogony | Works and Days
Boethius | Consolation of Philosophy | De Musica | De Musica seems unsuitable
Capella, Martianus | Marriage of Mercury and Philology | ???
Ovid | Metamorphoses | Amores
Prudentius | Psychomachia | Cathemerinon
Virgil | Aeneid | Georgics
Anon. | Bible (Genesis) | ??? | Very likely more interpretational trouble than it’s worth
Medieval and Renaissance
Alain de Lille | Complaint of Nature | Liber poenitentialis
Lorris, Guillaume de | Romance of the Rose | ??? | Other medieval romance?
Silvestris, Bernard | Cosmographia | ??? | Maybe commentary on Aeneid, but disputed authorship and different form
Bale, John | King John | ??? | Another play from the era?
Chaucer, Geoffrey | House of Fame | Troilus and Criseyde
Chaucer, Geoffrey | Parliament of Fowles | Troilus and Criseyde
Fletcher, Phineas | Purple Island | ??? | "Brittain’s Ida" (erotic poem)?
Gower, John | Confessio Amantis | Vox Clamantis
Hawes, Stephen | Passetyme of Pleasure | Comfort of Lovers
Kempe, Margery | Book of Margery Kempe | ???
Langland, William | Piers Plowman | ???
Lydgate, John | Reson and Sensualitie | King Henry VI’s Triumphal …
Shakespeare, William | Phoenix and the Turtle | ??? | Appropriate sonnets?
Spenser, Edmund | Faerie Queene | Shepheardes Calender | Or Complaints
Anon. | Castle of Perseverance | ???
Anon. | Everyman | ???
Anon. | Pearl | ???
Alighieri, Dante | Divine Comedy | Vita Nuova
Tasso, Torquato | Jerusalem Conquered | Aminta
Calderón | Great Theater of the World | ??? | "Life Is a Dream" too allegorical?
17th & 18th centuries
La Fontaine, Jean de | Fables | Tales
Bunyan, John | Holy War | Grace Abounding
Bunyan, John | Life and Death of Mr Badman
Bunyan, John | Pilgrim’s Progress
Defoe, Daniel | Robinson Crusoe | Journal of the Plague Year
Dryden, John | Absalom and Achitophel | Annus Mirabilis
Milton, John | Comus | Samson Agonistes | Samson Agonistes too allegorical?
Milton, John | Paradise Lost | ??? | Areopagitica? Genre/form mismatch.
Pope, Alexander | Dunciad | Rape of the Lock
Swift, Jonathan | Battle of the Books | Modest Proposal
Swift, Jonathan | Gulliver’s Travels | Argument Against Abolishing Christianity
Swift, Jonathan | Tale of a Tub
19th century British
Verne, Jules | Journey to the Center of the Earth | Twenty Thousand Leagues | Or "Around the World in 80 Days"
Butler, Samuel | Erewhon | Way of All Flesh
Conrad, Joseph | Heart of Darkness | Lord Jim
Darwin, Erasmus | Temple of Nature | Botanic Garden
Gissing, George | Nether World | New Grub Street
Kipling, Rudyard | Below the Mill-Dam | Young Men at the Manor | Better pairing?
Shelley, Mary | Frankenstein | Mathilda
19th century American
Baum, L. Frank | Wonderful Wizard of Oz | Queen Zixi of Ix
Hawthorne, Nathaniel | Antique Ring | ??? | Suitable stories?
Hawthorne, Nathaniel | Birthmark | ???
Hawthorne, Nathaniel | Rappaccini’s Daughter | ???
Hawthorne, Nathaniel | Scarlet Letter | House of the Seven Gables
Melville, Herman | Confidence-Man | Israel Potter
Melville, Herman | Mardi | Typee
Melville, Herman | Moby-Dick | Omoo
Čapek, Karel | R.U.R. | ???
Čapek, Karel | War with the Newts | ???
Kafka, Franz | Castle | ???
Kafka, Franz | Country Doctor | ???
Kafka, Franz | Metamorphosis | Description of a Struggle
Kafka, Franz | Trial | Amerika
Camus, Albert | Plague | First Man
Huxley, Aldous | Brave New World | Point Counter Point | "Crome Yellow" (and maybe "Antic Hay") are public domain
Orwell, George | 1984 | Burmese Days
Orwell, George | Animal Farm | Road to Wigan Pier
Mann, Thomas | Mario and the Magician | Buddenbrooks
Yeats, William Butler | Dialogue of Self and Soul | Second Coming
Zamyatin, Yevgeny | We | Islanders
Hurston, Zora Neale | Moses, Man of the Mountain | Their Eyes Were Watching God
Golding, William | Lord of the Flies | The Scorpion God | The Inheritors
Lewis, C. S. | Lion, the Witch, and the Wardrobe | ???
Rushdie, Salman | Midnight’s Children | Fury | Or Ground Beneath Her Feet or Moor’s Last Sigh
Beckett, Samuel | Waiting for Godot | All That Fall | Suitable nonallegorical drama?
Nabokov, Vladimir | Lolita | Ada
Coetzee, J.M. | Waiting for the Barbarians | Boyhood | Or Youth/Summertime
Barth, John | Giles Goat-Boy | Sot-Weed Factor
Ellison, Ralph | Invisible Man | ???
Faulkner, William | Fable | The Hamlet
Ginsberg, Allen | Howl | Kaddish
Kesey, Ken | One Flew over the Cuckoo’s Nest | Sometimes a Great Notion
O’Connor, Flannery | Violent Bear It Away | ??? | Wise Blood too allegorical

Allegory in Single Authors

I’ve been following up a suggestion from Jan Rybicki about discovering statistically distinguishing features of allegorical and non-allegorical writing by comparing individual works by single authors rather than (or preliminary to) large corpora. This has some downsides (I don’t expect it to be much good for detecting characteristic terms/lemmata, for instance, which will be dominated by the specific content of the individual texts), but it might be a useful quick and dirty way to get a better feel for where to direct my attention.

Results to come in the next week or so, but in the interim I’m interested in thoughts on especially useful pairings. What I’m looking for are pairs of works by an author, one of which is decidedly allegorical, the other of which is not. Some examples are below.

Note that there are practical constraints: I need books that are available in full-text electronic form, which rules out most things published after 1923. And I’d like to use works that are reasonably familiar, if only so other literary folks can evaluate for themselves whether or not I’ve classified them correctly. Roughly matching word counts couldn’t hurt, but aren’t terribly important, since I’m mostly looking at frequency-regularized Dunning log likelihoods (and because length itself might be a marker of allegoricalness, though I don’t expect to answer that question with so small a sample). The more of these pairs, the better, but the point is that this isn’t true corpus work, so I’m not feeling like I need hundreds (that’s for later!).
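For readers who haven't met the statistic, here is a minimal sketch of a Dunning-style log likelihood (G2) comparison between two tokenized texts. It uses the plain two-term form of the statistic; the frequency regularization mentioned above isn't specified here, so the sketch doesn't attempt it. The function names and the signing convention (positive means overrepresented in the first text) are assumptions made for illustration, and both texts are assumed to be non-empty.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """G2 for a word seen a times in text A (c tokens) and b times in text B (d tokens)."""
    # Expected counts if the word were equally frequent in both texts.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def distinctive_words(tokens_a, tokens_b, top=10):
    """Rank words by signed G2; positive scores lean toward text A (illustrative convention)."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    c, d = len(tokens_a), len(tokens_b)
    scored = []
    for word in set(counts_a) | set(counts_b):
        a, b = counts_a[word], counts_b[word]
        g2 = log_likelihood(a, b, c, d)
        signed = g2 if a / c >= b / d else -g2
        scored.append((word, signed))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top]
```

Because the expected counts are scaled by each text's length, the two texts don't need matching word counts, which is the point made above.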

Some suggestions thus far:

Author | Allegorical | Nonallegorical
Alcott | Little Women (1868) | Work (1873)
Bunyan | Pilgrim’s Progress (1678) | Grace Abounding (1666)
Defoe | Robinson Crusoe (1719) | Journal of the Plague Year (1722)
Defoe | | Moll Flanders (1722)
Dickens | Christmas Carol (1843) | Bleak House (1853)
Dickens | | Martin Chuzzlewit (1844)
Eliot | Adam Bede (1859) | Middlemarch (1871)
Eliot | Silas Marner (1861) | Mill on the Floss (1860)
Melville | Confidence Man (1857) | Israel Potter (1856)
Melville | Moby-Dick (1851)
Orwell* | Animal Farm (1945) | Burmese Days (1934)
Orwell* | 1984 (1949) | Road to Wigan Pier (1937)
Shelley | Frankenstein (1818) | Mathilda (1819)

* God bless wonky Australian copyright

A couple of comments: There’s a regrettable skew toward the middle of the nineteenth century here, but my sample will probably always be nineteenth-century rich due to both the historical development of novel writing and the realities of copyright law.

Poetry would be an interesting addition. I’m not sure to what extent the vagaries of rhyme, meter, etc. would impact the comparisons, but I’d like to find out. So … what should I use over against Paradise Lost, for instance? Also, did Bunyan ever write something non-allegorical? (I can’t think of anything.) (Grace Abounding; thanks to Suzanne Keen, by way of the narrative list.) What about Langland? (Ditto, no hope.)

I’d like to have decent national and gender balance, which seems OK in the tiny sample I’ve given here. More variety would always be better.

I’ve tried to avoid overt Bildungsromane, on the theory that they’re always at least a little allegorical, even when they’re not. Alcott’s the exception because, you know, Little Women.

Thoughts and suggestions for changes, deletions, or insertions?

POS Frequencies in the MONK Corpus, with Additional Musings

This post is on the work I presented at DH ’09, plus some thoughts on what’s next for my project. It’s related to this earlier post on preliminary part-of-speech frequencies across the entire MONK corpus, but includes new material and figures based on some data pruning and collection as mentioned in this post (details below).

A word, first, on why I’m working on this. I don’t really care, of course, about the relative frequencies of various parts of speech across time, any more than chemists care about, say, the absorption spectra of molecules. What I’m looking for are useful diagnostics of things that I do care about but that are hard to measure directly (like, say, changes in the use of allegory across historical time or, more broadly, in rhetorical cues of literary periodization).

My hypothesis is that allegory should be more prominent and widespread in the short intervals between literary-historical periods than during the periods themselves. Since we also suspect that allegorical writing should be “simpler” on its face than non-allegorical writing (because it needs to sustain an already complicated set of rhetorical mappings over large parts of its narrative), it makes sense (in the absence of a direct measure of “allegoricalness”) to look for markers of comparative narrative simplicity/complexity as proxies for allegory itself. I think part-of-speech frequency might be one such measure. In any case if I’m right about allegory and periodization and if I’m also right about specific POS frequencies as indicators of allegory, then we should expect certain POS frequencies to exhibit significant (in the statistical sense) fluctuations around periodizing moments and events. (I wish there were fewer ifs in that last sentence; I’ll say a bit below about how one could eliminate them.)

So … what do we see in the MONK case? Recall that the results from the full dataset looked like this:

[Figure: POS Frequencies, Full MONK Corpus]

But that’s messy and not of much use. It doesn’t focus on the few POS types that I think might be relevant (nouns, verbs, adjectives, adverbs); it includes a bunch of texts that aren’t narrative fiction (drama, sermons, etc.); and it’s especially noisy because I didn’t make any attempt to control for years in which very few texts (or authors) were published. (Note that the POS types listed are the reduced set of so-called “word classes” from NUPOS.)

Here’s what we get if we limit the POSs (PsOS?) in question, exclude texts that aren’t narrative fiction, and group together the counts from nearby years with low quantities of text:

[Figure: POS Frequencies, Reduced and Consolidated MONK Corpus]

And here’s the same figure with the descriptive types (adjectives and adverbs) added together:

[Figure: POS Frequencies, Reduced and Consolidated MONK Corpus (Adj + Adv)]

[Some data details, skippable if you don’t care. First, note that the x axes in all three figures need to be fixed up; they’re just bins by year label, rather than proper independent variables. I’ll fix this soon, but it doesn’t make much difference in the results. You can download the raw POS counts for the full corpus (not sorted by year of publication), as well as those restricted to texts with genre = fiction. These are interesting, I guess, but more useful are the same figures split out by year of publication, both for the whole corpus, and just for fiction (presented as frequencies rather than counts). Finally, there are the fiction-only, year-consolidated numbers (back to counts for these, because I’m lazy). The table of translations between the full NUPOS tags and the (very reduced) word classes presented here is also available.]
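The consolidation step (grouping together nearby years with little text, then converting counts to frequencies) could be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure behind the figures: the input shape (a dict of per-year counts keyed by word class), the `min_tokens` threshold, and the rule of folding a sparse tail into the last bin are all hypothetical choices. Note too that the frequencies here are relative to the word classes passed in, not to all tokens in the corpus.

```python
def consolidate_years(counts_by_year, min_tokens):
    """Merge consecutive years into bins holding at least min_tokens tokens.

    counts_by_year maps year -> {word_class: count} (assumed input shape).
    Returns a list of (years, counts) tuples.
    """
    bins = []
    current_years, current_counts = [], {}
    for year in sorted(counts_by_year):
        current_years.append(year)
        for pos, n in counts_by_year[year].items():
            current_counts[pos] = current_counts.get(pos, 0) + n
        if sum(current_counts.values()) >= min_tokens:
            bins.append((current_years, current_counts))
            current_years, current_counts = [], {}
    # Any sparse years left at the end are folded into the last full bin.
    if current_years:
        if bins:
            last_years, last_counts = bins[-1]
            last_years.extend(current_years)
            for pos, n in current_counts.items():
                last_counts[pos] = last_counts.get(pos, 0) + n
        else:
            bins.append((current_years, current_counts))
    return bins


def relative_frequencies(counts):
    """Convert raw counts to shares of the total across the given word classes."""
    total = sum(counts.values())
    return {pos: n / total for pos, n in counts.items()}
```

A token threshold is only one way to define "low quantities of text"; binning by number of texts or authors per year would work the same way with a different stopping condition.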

So what does this all mean? The first thing to notice is that there’s no straightforward confirmation of my hypotheses in these figures. There’s some meaningful fluctuation in noun and verb frequency over the first half of the nineteenth century—which I think might be an interesting indication of the kind of writing that was dominant at the time (see the noun and verb frequency section of this post)—but no corresponding movement in the combined frequency of adjectives and adverbs. This might mean several things: I might be wrong about the correlation between such frequencies and periodizing events, or I might not be looking at the right POS types, or (quite likely, regardless of other factors) I might not have low enough noise levels to distinguish what one would expect to be fairly small variations in POS frequency.

Where to go from here? A few directions:

I’ll keep working on a bigger corpus. The fiction holdings from MONK are only about 1000 novels, spread (unevenly) over 120+ (or 150+) years. So we’re looking at eight or fewer books on average in any one year, and that’s just not very much if we want good statistics.

There are a couple of ways to go about doing this. Gutenberg has around 10,000 works of fiction in English, so it’s an order of magnitude larger. There are issues with their cataloging and bibliographic quality, but I think they’re addressable and I’m at work on them now. The Open Content Alliance has hundreds of thousands of high-quality digitizations from research libraries, though there are some cataloging issues and I’m not sure about final text quality (which relies on straight OCR rather than hand-correction as does Gutenberg). Still, OCA (or Google Books, depending on what happens with the proposed settlement, or Hathi) would offer the largest possible corpus for the foreseeable future. I’ve been talking to Tim Cole at UIUC about the OCA holdings and will report more as things come together.

But I think it’s also worth asking whether or not POS frequencies are the right way to go; I started down that path on a hunch, and it would be nice to have some promising data before I put too much more effort into pursuing it. What I need, really, are some exploratory descriptive statistics comparing known allegorical and nonallegorical texts. One of the reasons I’ve held off on doing that was because it seems like a big project. The time span I have in mind (several centuries), plus the range of styles, genres, national origins, genders, etc. suggest that the test corpus would need to be large (on the order of hundreds of books, say) if it’s not to be dominated by any one author/nation/gender/period/subject/etc. But how much reading and thinking would I have to do to identify, with high confidence, at least 100 major works of allegorical fiction and another 100 of comparable nonallegorical fiction? And would even that be enough? A daunting prospect, though it’s something that I’m probably going to have to do at some point.

But I got an interesting suggestion from Jan Rybicki (who works in authorship attribution, not coincidentally) at DH. Maybe it would suffice, at least preliminarily, to pick a handful of individual authors who wrote both allegorical and nonallegorical works reasonably close together in time, and to look for statistical distinctions between them. Since I’d be dealing with the same author, many of the problems about variations in period, national origin, gender, and so forth would go away, or at least be minimized. I suspect this wouldn’t do very well for finding distinctive keywords, which I imagine would be too closely tied to the specific content of each work (which is a problem that the larger training set is intended to overcome), but it might turn up interesting lower-level phenomena like (just off the top of my head) differences in character n-grams or sentence length. It would take some work to slice and dice the texts in every conceivably relevant statistical way, but I’m going to need to do that anyway and it’s hardly prohibitive.
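As a rough illustration of the lower-level features mentioned above, here is a minimal sketch that computes mean sentence length and a character n-gram profile for a text, plus a simple distance between two profiles. Everything in it is an assumption made for the example: the naive sentence splitter (it breaks on runs of `.`, `!`, and `?`, so abbreviations will fool it), the choice of trigrams, and the absolute-difference distance. Texts are assumed to be non-empty.

```python
import re
from collections import Counter


def mean_sentence_length(text):
    """Average number of whitespace-separated words per sentence (naive splitter)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    return sum(lengths) / len(lengths)


def char_ngram_profile(text, n=3):
    """Relative frequencies of character n-grams after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    grams = Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))
    total = sum(grams.values())
    return {gram: count / total for gram, count in grams.items()}


def profile_distance(p, q):
    """Sum of absolute differences between two n-gram profiles (0 means identical)."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```

For a single-author pair, one would compute these for the allegorical and the nonallegorical text and compare; whether any difference is meaningful would still need checking against the variation between two nonallegorical works by the same author.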

So that’s one easy, immediate thing to do. In the longer run, what I really want is to see what people in the field have understood to be allegorical and what not, which would have the great advantage, at least as a reference point, of eliminating some of the problems of individual selection bias. One way to do that would be to mine JSTOR, looking, for example, for collocates of “allegor*” or (more ambitiously) trying to do sentiment analysis on actual judgments of allegoricalness. I suspect the latter is out of the question at the moment (as I understand it, the current state of the art is something like determining whether or not customer product reviews are positive or negative, which seems much, much easier than determining whether or not an arbitrary scholarly article considers any one of the several texts it discusses to be allegorical or not). But the former—finding terms that go along with allegory in the professional literature, seeing how the frequency of the term itself and of specific allegorical works and authors changes over (critical) time, and so on—might be both easy and helpful; at the very least, it would be immensely interesting to me. So that’s something to do soon, too, depending on the details of JSTOR access. (JSTOR is one of the partners for the Digging into Data Challenge and they’ve offered limited access to their collection through a program they’re calling “data for research,” so I know they’re amenable to sharing their corpus in at least some circumstances. I was told at THATCamp by Loretta Auvil that SEASR is working with them, too.)
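The collocate search could look something like the sketch below, which counts words appearing within a fixed window of any token matching “allegor*”. It assumes the article text is already in hand as a plain string; how one actually gets text out of JSTOR is not addressed. The tokenizer, the window size, and the `stopwords` parameter are illustrative choices rather than anything prescribed above.

```python
import re
from collections import Counter


def collocates(text, pattern=r"allegor\w*", window=5, stopwords=frozenset()):
    """Count words occurring within `window` tokens of any token matching `pattern`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    node = re.compile(pattern)
    counts = Counter()
    for i, token in enumerate(tokens):
        if not node.fullmatch(token):
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            neighbor = tokens[j]
            # Skip the node word itself, other allegor* forms, and stopwords.
            if j == i or neighbor in stopwords or node.fullmatch(neighbor):
                continue
            counts[neighbor] += 1
    return counts
```

Raw counts like these favor common words, so in practice one would probably score each collocate against its frequency in the rest of the corpus before reading much into the list.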

[Incidentally, SEASR is something I’ve been meaning to check out more closely for a long time now. The idea of packaged but flexible data sources, analytics, and visualizations could be really powerful and could save me a ton of time.]

Finally (I had no idea I was going to go on so long), there are a couple of things I should read: Patrick Juola’s “Measuring Linguistic Complexity” (J Quant Ling 5:3 [1998], 206-13)—which might have some pointers on distinguishing complex nonallegorical works from simpler allegorical ones—plus newer work that cites it. And Colin Martindale’s The Clockwork Muse, which has been sitting on my shelf for a while and which was (re)described to me at DH as “brilliant and infuriating and wacky.” Sign me up.

The Allegory Project

Since I’m likely to end up referring on occasion to my previous work, and because everything I’m doing now is connected to it in one way or another, I thought I should put up a brief summary of what it’s about and what conclusions it reached. This is just for the gist—I’ll probably end up fleshing things out a bit here in the future. You can also see a couple of articles: The NLH piece I mentioned a few days ago (“Toward a Benjaminian Theory of Dialectical Allegory,” NLH 37.2 [2006], 285-298) and two non-MUSE-available pieces, “Narrating the Sublime Event” (Theory@Buffalo 11 [2007], 143-166) and “Events as Dual and Narrative Entities in Deleuze and Badiou” (Subject Matters 2 [2005], 25-34).

So … in my dissertation (now a book manuscript), I developed a thesis about the relationship between allegory and the event. Specifically, I claimed that we should expect to see allegory play a prominent role in the mechanisms by which revolutionary ideas and movements take hold and propagate through their relevant communities (or situations and subjects, if we’re feeling Badiouian). What that means in practical terms for a literary scholar/theorist is that we’d expect to see an uptick in the quantity and perceived importance or centrality of allegorical literature produced during moments of transition between comparatively stable aesthetic and cultural regimes. One can probably hear Kuhn and Latour rattling around in the background here, if only by analogy from the scientific case; it’s also an attempt to flesh out a problem in Badiou’s theory of the event, which is good on the “what” but not so strong on the “how” of the matter.

I then have a more or less detailed case study of American late modernism, which I think illustrates this phenomenon pretty well; American fiction in the fifties and sixties is, in fact, shot through with allegory in a way that neither high modernism nor literary postmodernism proper can (nor would want to) match. That’s nice enough, and it identifies and explains an overlooked aspect of late modernism (which is otherwise usually just understood as an imperfect “prefiguring” of postmodernist literary production, or as a kind of last gasp of modernism proper). But the claim is much larger and more general than that; we should in principle be able to see a similar phenomenon in all kinds of other transformative moments, both within literary history and out in the rest of the world of events, be they political, aesthetic, scientific, whatever.

So now my new work aims to push this claim further and wider, specifically by analyzing the role of allegory and other tropological language in the primary literature of the natural sciences around moments of evental change (that’s the science project), and by examining a much larger historical sweep within literature proper (which is the digital humanities project). More on the details and status of both of those efforts on another occasion.