Critical Text Mining, or Reading Differently

The short, polemical version of my thesis is this: It can be done better. The more nuanced and accurate version is that close reading—our only real working method—has significant intellectual costs, and that we’re now positioned to reduce some of those costs if we’re willing to take seriously the opportunities afforded by digital texts.

Let me elaborate …

Here’s how nearly all of us work: We find a cultural object that we like, or dislike, or that strikes us as important. We study it closely. We note passages or images or structural features that reveal something about the work as a whole. And then we make an argument about the relationship between our object and some larger phenomenon: American naturalism, maybe, or late capital, or the Victorian public sphere. Sometimes—but not often—our argument is causal: that Uncle Tom’s Cabin, for instance, helped set in motion the events of the Civil War, or that Dickens’ style was a direct product of the economic conditions of publishing in the nineteenth century. More often, though, we treat our objects symptomatically, and address to them questions of the form “What must have been the case for this object to have been produced as it was?” or “What hidden features of its situation of production does this object reveal to us?”

This is all to say that we are not—and have not been for some decades, at the least—primarily aesthetic critics. Our immediate objects are aesthetic, yes, but our concerns and investments are sociocultural, political, economic, and so forth. This is a good thing, because it keeps our field from being utterly marginalized as a matter of belletrism or art appreciation. More to the point, there’s broad agreement in the profession that these social and cultural questions are by far the most important and interesting ones we can address. As I say, this is all to the good.

But there’s a certain tension between our large-scale ambitions and the techniques we use to pursue them. After all, we’ve been reading a few important texts with exceeding care since Aristotle taught us how to achieve the effects of tragedy by referring to illustrative passages from Sophocles and Euripides. (NB. For convenience, I’m dropping the general “object” and “analyze” for “book” and “read,” but understand this as a part-for-whole figure.) This makes sense: If you want to teach people to write well, you show them examples of good writing to imitate. And since most writing is dross, you can ignore the majority of it. Nor is the matter confined to pedagogy. If your aim is to say “this is good, this is art” you’ll work the same way; you don’t need to have read all the bad stuff to understand the good. In a related vein, if your task is to understand a text that you already know is important—the Bible, say—it pays to read that book closely, even if your devotion means you can’t read much else.

Two models, then, from which we derive our dominant working method: aestheticism and biblical hermeneutics. There’s nothing wrong with either of them, of course, but we should notice two things:

  1. They assume either that there aren’t very many books to read, or that we can get away with reading a smallish subset of a larger literary field. In other words, that no necessary problem follows from not reading everything.
  2. Neither one (aestheticism, hermeneutics) looks much like the kind of cultural criticism that I claimed now rightly dominates our field.

The first of these—the assumption that important books are scarce—is the reason we still have canons. If you should and can read everything, you don’t need a canon. But if you shouldn’t (because some books are unworthy, or politically suspect, or blasphemous—three ways of saying the same thing) or can’t (“because they are too menny”), then you have to pick a few to read, assuming it’s important to read in the first place. And the books you pick from any sufficiently large pool of candidates will be at some level arbitrary and nonrepresentative, if only because you’ll have read so little of the source material in the first place. This is why the canon wars of decades past were at once necessary and absurd. Necessary because it was in fact important to stop reading Dryden and start reading Morrison (synecdoches both, of course). But absurd because the idea that rearranging the canon is in any way egalitarian—it just picks new winners—or has anything to do with eliminating canonicity as such is entirely misguided. So long as we depend on close reading, we will always work on a group of texts that comes within a rounding error of nothing when compared to the full field of literary production.

[Footnote: Some quick figures and calculations. There are about 50,000 new novels published in the U.S. every year. There are 26,000 tenured and tenure-track faculty members in U.S. English departments (of which 7,000 are at R1 schools). Assume that ten percent of any English department’s faculty work on truly contemporary American fiction. Rounding a bit, that means twenty novels per TT faculty member per year, assuming absolutely no overlap. If you want to have just four other people with whom to discuss your work, you’ll need to publish on 100 new novels every year, just to keep up with the pace of literary production in the United States. Even that seems optimistic; consider the case of all English-language novels published before 1900. There are no more than 100,000 such titles, and the number isn’t growing. If 5,000 TT faculty work on them, that means each faculty member is responsible for just 20 (or 100, if we want to have overlapping coverage as above) novels over her entire career (not annually). And yet we know from experience that we haven’t dealt with anywhere near all of the novels published before 1900—not even close—over all of literary-critical history, much less during each academic generation. And in any case, this addresses the problem only at the level of the profession as a whole; it does nothing to provide each individual researcher with meaningful knowledge about the full range of relevant literary and cultural production. (Sources: R.R. Bowker, MLA.)]
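(For those who want to check the footnote's arithmetic, it runs as follows; the inputs are the rough Bowker/MLA figures just cited, and the ten percent share is the assumption stated above.)

```python
# The footnote's arithmetic, spelled out (rough Bowker/MLA figures).
new_novels_per_year = 50_000
tt_faculty = 26_000
contemporary_share = 0.10  # assumed share working on contemporary US fiction

contemporary_faculty = tt_faculty * contemporary_share    # 2,600
per_faculty = new_novels_per_year / contemporary_faculty  # ~20 novels/year each
with_five_readers = per_faculty * 5                       # ~100 if each book needs 5 readers
print(round(per_faculty), round(with_five_readers))       # 19 96
```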

Again, this wouldn’t matter so much if our aims were aesthetic or exegetical. But they aren’t. And so we have a real problem. We want to be able to claim that the books we study are representative (the word I used earlier was symptomatic) of the culture that produced them, so that by analyzing those books we are also necessarily analyzing that culture. True, this always works at an initial level: Any book is indeed a product—and a part—of a cultural situation. But it’s a minuscule part, not because books are unimportant (that’s a separate question), but because it represents so small a fraction of that situation’s cultural and symbolic output. So we need a second level of representativity, and the usual way of providing it is to argue that the book in question is especially illuminating, that it reveals something about the situation of its production that is either particularly important and typical (thus that the book is more central than others) or especially difficult to discern by other means (hence that the book is uniquely diagnostic). Either way, though, it’s hard to make this claim without already knowing, on independent grounds, a good deal about the features and configuration of the situation you’d like to say the book allows you to diagnose.

Another way of putting this is to say that literary studies as currently practiced has a problem with at least perceived selection bias. If you want to argue, for instance—and I’m borrowing from my own work here for an example that will eat up most of the rest of my talk—that allegory is a form well-suited to dealing with the problems of transitional cultural forms, and therefore that it (allegory) should play a prominent role in the literature of a transitional moment like late modernism, you’ll want to show that late modernist literature is indeed especially allegorical. You can do that by offering readings of a broad swath of important late modernist fiction—six or eight books, maybe—and pointing to the significant, underappreciated role that allegory plays in them, and then comparing these results to the standard account of modernism and postmodernism proper, from which allegory is essentially absent. But you can already imagine the difficulties of this approach. What about The Bell Jar? What about other books written shortly after the war that aren’t allegories? What about instances of allegory before and after late modernism? What about Pilgrim’s Progress? Some of these questions are easier to answer than others, but they all amount to the same thing: Allegory wasn’t invented in the years following the second world war. If it’s been around for centuries, and if we know about examples of it both in and out of that period, and if you’ve read very little, statistically speaking, of the relevant literature, how can you be sure that the phenomenon you claim to have identified isn’t simply an artifact of your own haphazard selection of texts?

The long answer involves the meaning of cultural dominance, the practical and theoretical relationship between literary production and social-economic conditions, the evolution of established forms, a reasonable knowledge of literary history, and a certain amount of hand-waving. But the short answer is that you don’t. You don’t know that any meaningful claim about the large-scale processes of cultural production is right in the sense that it describes the operations of that production by reference to the bulk of its output. And you never will, if that production occurs on an industrial scale and if you need to engage with each artifact individually and at length.

So what should you do instead, if this fact bothers you and if you don’t intend either to give up entirely or to return to small-scale, low-stakes aestheticism? One approach—an approach less insane than it sounds, since it’s really only what you were implicitly claiming in the first place—is to try to figure out how to measure allegory in millions of texts across hundreds of years. If you can do that, you’ll be able to see how the frequency and degree of allegorical fiction have changed over the last several centuries, and you’ll observe the degree to which those fluctuations correspond to our existing sense of literary and cultural periodization. If the initial claim was correct, and allegory is closely tied to eventful or revolutionary moments, you’ll expect to find something like this cartoon:

Figure 1. Allegory as a function of time (cartoon).

The best-case scenario will be that most spikes in allegoricalness line up with major transitions in literary history as we now understand them (so that we have confidence that the method works and that the theoretical claim is plausibly supported), but also that a small number of them don’t correspond (so that we have new and interesting problems in literary periodization to investigate).

The problem, of course, is that we don’t know exactly what it is about a text that makes it allegorical. This is due in part to a defect in the existing conventional criticism, namely that we understand allegory much less well than we generally assume. But it’s also true that allegory is probably not reducible to anything so obvious as the frequency of intuitable keywords or the length of the text. What we need, then, is a reasonably large corpus that consists of known allegorical and nonallegorical works on which we can try out any proposed criteria. Better yet, we could go looking (pseudo-inductively) within that corpus for markers that work more or less well to separate allegory from nonallegory. But now we run up against a second difficulty, namely that there’s no ready list of allegorical books. (Note in passing that we’re therefore on different ground from Moretti and Jockers’ or Witmore’s work on genre determination, which relies on established bibliographies or the long-standing consensus around Shakespeare’s plays.) One could try to come up with—and to defend—such a list by hand, and it will probably be necessary to do so eventually. But that’s an extremely labor-intensive way to go. We’re looking for a sampling that covers three or four hundred years, across national origins, textual subgenres, author genders, subject matter, and so forth. If we don’t want the distinctions between allegorical and nonallegorical works to be dominated by the vagaries of selection bias (after all, that’s the problem we’re trying to get around), the corpus will need to include at least hundreds of books, maybe thousands. As I say, a lot of work.
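To make the pseudo-inductive marker hunt concrete, here is a minimal sketch of the kind of procedure I have in mind, assuming scikit-learn; the feature set and every number in it are invented stand-ins, not real measurements:

```python
# A sketch of the pseudo-inductive marker search described above.
# All feature values are invented; real inputs would be per-text
# measurements from a corpus of known (non)allegorical works.
import numpy as np
from sklearn.linear_model import LogisticRegression

features = ["nouns", "verbs", "adjectives", "adverbs", "conjunctions", "prepositions"]
X = np.array([  # rates per 1,000 words (invented)
    [210, 185, 55, 48, 30,  95],   # known allegory
    [205, 190, 50, 52, 28,  92],   # known allegory
    [240, 150, 75, 60, 38, 110],   # known nonallegory
    [235, 155, 72, 58, 40, 108],   # known nonallegory
], dtype=float)
y = np.array([1, 1, 0, 0])  # 1 = allegorical, 0 = not

clf = LogisticRegression().fit(X, y)
for name, coef in sorted(zip(features, clf.coef_[0]), key=lambda p: -abs(p[1])):
    print(f"{name:13s} {coef:+.3f}")  # sign and size flag candidate markers
```

With a real corpus, the features whose weights stay large and stable under cross-validation would be the candidate markers worth defending critically.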

[Footnote: Apropos our unsettled understanding of allegory, my own general-purpose definition is as follows: An allegorical text is one in which we can (and should) identify a pervasive set of figurative mappings between elements of the (independently legible) plot and elements of a coherent second plot, typically more significant (for us, the readers) than the first. The normative imperative is slippery here, of course, but I don’t see much way around it, and of course the whole thing is fundamentally context-dependent; allegory is as much about reading practices as about writing.]

What we’d like to have is a set of texts that differ only in that they are or are not allegorical, holding constant as many other variables as possible (author, subject, date of composition, and so on). That’s not likely to happen directly, but what if we compare works by the same author that fall on either side of the allegorical divide? That doesn’t do much to help with subject matter, so it’s not likely to help a great deal with keyword identification, but it controls for authorial style, gender, national origin, date of composition (give or take), and, if we’re selective, subgenre. If we can get a handle on the differences between a reasonable number of such pairs, we’ll have a start on identifying a more comprehensive set of differentiating features for allegory.

Preliminary work along these lines suggests that there’s a strong positive correlation between allegorical texts and those that are rich in verbs and poor in many of the other major parts of speech, including nouns, descriptive words, conjunctions, and prepositions. This makes decent sense; allegories tend to feature comparatively simple surface-level plots and focus on action over description, because the more complicated and specific the first-order narrative, the more difficult it is to maintain the mapping between levels that is the substance of allegory. The part-of-speech correlation is nowhere near perfect, of course, and analyzing it is a problem in an awkward number of variables, even before you add other potentially differentiating criteria such as keywords, common word frequencies, and grammatical constructions. What’s required—and what’s currently underway—is an iterative process by which the corpus of known allegorical and nonallegorical texts can be built up in pieces and with increasing confidence. This relies both on computational methods to extract information about the texts and to weigh the various differentiating factors, and on what the information retrieval people in computer science call “domain knowledge,” namely an informed critical assessment concerning the nature and function of allegory in a wide array of texts.

Figure 2. Allegory and parts of speech in single authors, demonstrating the number of pairings in which each part of speech was a significant differentiating factor. Blue = positive association with allegory, red = negative. Method of comparison is Dunning log likelihood.
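Since the caption names the method, here is a minimal sketch of the Dunning log-likelihood (G²) statistic in its common simplified form; the counts in the usage example are invented, not drawn from any of the pairings in the figure:

```python
import math

def dunning_g2(a, b, c, d):
    """Dunning log-likelihood (G^2) for a feature counted a times in a
    text of c total words and b times in a text of d total words.
    This is the common simplified two-term form; some implementations
    add terms for non-occurrences of the feature."""
    e1 = c * (a + b) / (c + d)  # expected count in text 1 under the null
    e2 = d * (a + b) / (c + d)  # expected count in text 2 under the null
    g2 = 0.0
    if a:
        g2 += a * math.log(a / e1)
    if b:
        g2 += b * math.log(b / e2)
    return 2 * g2

# Invented counts: verbs in an allegorical vs. a nonallegorical text by
# the same author. Values above ~3.84 are significant at p < .05 (1 df).
print(dunning_g2(9500, 8200, 50000, 52000))
```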

This is slow going, in a way: Applying descriptive statistics and information retrieval techniques to literary text analysis isn’t something that’s widely practiced or about which there’s much prior work to consult. And the less said about copyright and twentieth-century texts, the better. But because the process builds outward from close readings of a few dozen novels, it’s much faster than dealing with hundreds, thousands, or millions of novels individually. More importantly, once it reaches a critical mass past which we are confident of our ability to classify allegorical and nonallegorical works algorithmically, it can scale to arbitrarily large corpora with minimal effort. And that’s really exciting. Recall what’s happened here. We started with a problem that we simply couldn’t address, namely how to say something informed about the enormous range of literature that we can’t read. We had a conventional theoretical and literary-historical argument that was impacted in an important way by this problem. By introducing computational methods, we now have the ability to overcome this limitation, and to do it in a way that extends and augments (rather than replaces) our existing analytical techniques.

[Footnote: It’s not quite right to say that statistical-computational work in literary studies is entirely new, and I don’t want to be accused of forgetting several decades of progress. But I also don’t want to speak the name of authorship attribution, and I don’t see how you can do the former without the latter, at least not in a short paper whose main concerns lie elsewhere. See Martin Mueller’s excellent analysis of the problem, which does this work so I don’t have to.]

Even if you don’t care much about allegory in particular, this example should appeal to you. The idea isn’t, of course, either to replace close reading entirely or to replace ourselves with computers as the ones doing the close reading, but instead to offer a new kind of evidence that’s especially well suited to supporting the kinds of extracanonical cultural, historical, and sociological claims that have come to occupy a central place in the discipline over the last few decades. Insofar as computational methods like the ones I’ve described today advance this cause—and because we know only too well the limitations of our existing method—there’s little doubt that digitally assisted criticism can and must play a much more important role in both the immediate and long-range future of literary studies. We have only to begin doing it.

Postscript

The focus to this point has been on the movement “outward” or “upward” from individual texts toward large-scale social and historical issues understood through variations in large literary corpora. Computational methods are well suited to this type of scaling, since computers do simple things like part-of-speech recognition quickly and uncomplainingly. But the process can and does work in reverse, something I suggested elliptically through the interplay of machine learning and domain knowledge. To be more explicit, though, we can imagine a case in which computationally derived information about the prevailing level of allegorical usage in a given period would allow us to identify, for instance, the anomalous allegorical bent of an author not usually understood to be an allegorist. This in turn might point us to a new reading of that author’s works, or of her relationship to the dominant modes of her age. Or our attention might be drawn to an author’s work that scores especially highly or anomalously in allegorical terms, but that has never before been given critical attention.

More concretely and in a different area, what if statistical analyses of Shakespeare’s texts suggest that Othello looks much like a comedy? This certainly doesn’t mean that Othello is a comedy, but it might give us new reasons to return to a well-known text and to ask of it different questions than those we posed in the past. The results of our new line of inquiry may or may not be interesting; only time and analysis will tell. But new ways of approaching Shakespeare are probably a good thing, as are procedures that allow us to pluck interesting works from obscurity for guided critical attention. And both of these are examples of the “inward” or “downward” orientation that computational work also serves.


[Note, 24 May 2010: This is a lightly revised version of my talk for the 2009 MLA conference in Philadelphia.]

How Many New Novels are Published Each Year?

In my recent talks, I’ve been saying things like “there are tens or hundreds of thousands of new novels published every year, and I just can’t read all of them.” Matt Kirschenbaum says this demonstrates a deplorable lack of initiative in our younger scholars, and he’s probably right. But is my count reasonable? I pretty much made it up, so I thought I should check.

But how do you do that? Google could probably tell you how many volumes are in their metadata database, along with their years of publication, but how many of those are novels or other works of prose fiction? Wikipedia claims to know the totals for “books” broken down by country, though their numbers are oldish and taken from diverse sources.

If we can live with U.S.-only numbers, and if we’re mostly interested in English-language fiction, we can consult R.R. Bowker’s publishing statistics (they’re the people who run Books in Print). From them we learn that there were 407,000 books published in 2007 (the last year for which final numbers are available), a total that includes 123,000 “on-demand, short run, and other unclassified” titles. Of the 274,000 classified titles, 43,000 are “fiction,” a category that includes “strictly adult novels (including graphic novels) and short story collections.” (There are separate categories for anthologies, literary criticism, poetry, drama, etc. Oh, and “adult” is opposed to “juvenile,” not a synonym for porn.) If the same ratio holds for the unclassified category, we’d have another 19,000 novel-like entries, for a total of 62,000.
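For what it’s worth, the extrapolation is simple enough to write down; the figures are the Bowker numbers just cited:

```python
# Extrapolating fiction's share of classified titles to the
# unclassified remainder (2007 Bowker figures cited above).
classified_total = 274_000
classified_fiction = 43_000
unclassified = 123_000

fiction_share = classified_fiction / classified_total  # ~0.157
extra_fiction = fiction_share * unclassified           # ~19,000
print(round(classified_fiction + extra_fiction))       # ~62,000
```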

The U.S. isn’t the only (predominantly) English-language book market in the world, of course; Britain’s is about the same size, Canada and Australia are significant, and there are many English-language novels published elsewhere. But there’s also redundancy in some of the titles shared between markets, and a portion of the new titles are only new editions or bindings of previously released texts. (As an aside, I wonder how many of the books published annually ever exist in more than one edition? I’d bet it’s a much smaller number than our scholarly experience with canonical-ish texts would suggest. I also wonder how many new U.S. titles are in languages other than English.) Accounting for all of those factors is more work than I want to do at the moment, though I’d love to hear what other people know about them.

In the meantime let’s assume, conservatively, that the global total is on the order of twice the U.S. number. In that case it seems pretty safe to say there are around 100,000 new English-language works of long-form prose fiction published globally each year. That’s a ballpark number, but I don’t see any reason to believe that it’s off by more than a factor of about two, and it’s certainly of the right order of magnitude. Conclusion: I can go on using my line about the number of books I’m not reading.

[Update, 29 September 2010: See also this follow-up post on numbers from the UK. And note that Bowker has since released figures that include 2009; the major story there is that “nontraditional” volumes (reprints of public-domain classics and print-on-demand, mostly) have exploded in the last few years, now far outnumbering (by about 3:1) the mostly flat traditional volumes. Sales are another matter, of course.]

[Update, 30 July 2012: I see there’s been some bit rot at Bowker. 2011 numbers (with prior year figures) are now available. In general, this info is released annually via a press release in May or June for the previous year. If a specific Bowker link is dead, search their site for something like “publishing industry” or “publishing output” and the most recently past year (2011, etc.)]

Allegory in Single Authors

I’ve been following up a suggestion from Jan Rybicki about discovering statistically distinguishing features of allegorical and non-allegorical writing by comparing individual works by single authors rather than (or as a preliminary to) comparing large corpora. This has some downsides (I don’t expect it to be much good for detecting characteristic terms/lemmata, for instance, which will be dominated by the specific content of the individual texts), but it might be a useful quick-and-dirty way to get a better feel for where to direct my attention.

Results to come in the next week or so, but in the interim I’m interested in thoughts on especially useful pairings. What I’m looking for are pairs of works by an author, one of which is decidedly allegorical, the other of which is not. Some examples are below.

Note that there are practical constraints: I need books that are available in full-text electronic form, which rules out most things published after 1923. And I’d like to use works that are reasonably familiar, if only so other literary folks can evaluate for themselves whether or not I’ve classified them correctly. Roughly matching word counts couldn’t hurt, but aren’t terribly important, since I’m mostly looking at frequency-regularized Dunning log likelihoods (and because length itself might be a marker of allegoricalness, though I don’t expect to answer that question with so small a sample). The more of these pairs, the better, but the point is that this isn’t true corpus work, so I’m not feeling like I need hundreds (that’s for later!).

Some suggestions thus far:

Author     Allegorical                                Nonallegorical
Alcott     Little Women (1868)                        Work (1873)
Bunyan     Pilgrim’s Progress (1678)                  Grace Abounding (1666)
Defoe      Robinson Crusoe (1719)                     Journal of the Plague Year (1722), Moll Flanders (1722)
Dickens    Christmas Carol (1843)                     Bleak House (1853), Martin Chuzzlewit (1844)
Eliot      Adam Bede (1859), Silas Marner (1861)      Middlemarch (1871), Mill on the Floss (1860)
Melville   Confidence Man (1857), Moby-Dick (1851)    Israel Potter (1856)
Orwell*    Animal Farm (1945), 1984 (1949)            Burmese Days (1934), Road to Wigan Pier (1937)
Shelley    Frankenstein (1818)                        Mathilda (1819)

* God bless wonky Australian copyright

A couple of comments: There’s a regrettable skew toward the middle of the nineteenth century here, but my sample will probably always be nineteenth-century rich due to both the historical development of novel writing and the realities of copyright law.

Poetry would be an interesting addition. I’m not sure to what extent the vagaries of rhyme, meter, etc. would impact the comparisons, but I’d like to find out. So … what should I use over against Paradise Lost, for instance? Also, did Bunyan ever write something non-allegorical? (I couldn’t think of anything myself, but the answer turns out to be Grace Abounding; thanks to Suzanne Keen, by way of the narrative list.) What about Langland? (There, I think, there’s no hope of a non-allegorical counterpart.)

I’d like to have decent national and gender balance, which seems OK in the tiny sample I’ve given here. More variety would always be better.

I’ve tried to avoid overt Bildungsromane, on the theory that they’re always at least a little allegorical, even when they’re not. Alcott’s the exception because, you know, Little Women.

Thoughts and suggestions for changes, deletions, or insertions?

Followups on the GBS Settlement

There have been some very smart comments on (and around) my previous post on the Google Book Search settlement. If you’re interested, you might want to see the comments section of that post, plus two good posts by Eric Kansa, one before and one after the recent GBS conference at Berkeley.

Most of my thoughts on the points Eric and others raise appear in the comments section of my last post (linked above). But I think maybe the gut-level difference is related to this passage from Eric’s second post:

The Google Books corpus is unique and not likely to be replicated (especially because of the risk of future lawsuits around orphan-works). This gives Google exclusive control over an unrivaled information source that no competitor can ever approximate.

If Eric’s right about this, then it’s critical to get as much public access as possible built into the settlement now, because we won’t have another shot at it. For reasons laid out in my previous post, though, I’m less pessimistic about the prospects for future competition. I think the settlement will make it easier for others to enter this space by providing both a template for negotiations with the authors and publishers and a strong antitrust incentive for the rightsholders to grant equal access.

Scanning is a big-ish project, no doubt, but not prohibitively so (witness the Open Content Alliance, as well as Microsoft’s former efforts, stopped more by fear of legal action than by lack of funds). This is especially true if it turns out there’s significant money to be made by doing it (and the objection, after all, is that the scanned corpus is an immensely valuable resource on which Google will be sitting). Plus, scanning will only get cheaper with time.

My own ideal case would be a combination of meaningful copyright reform (to clarify that scanning for indexical use doesn’t require permission from a rightsholder) and something like Dan Cohen’s proposal for a government- (or Ivy League-) funded book-scanning “moon shot” to benefit society at large. Barring this (extremely unlikely, I think) outcome, by all means, let there be as many public-friendly provisions tacked onto the GBS settlement as possible. My point, though, is that even as it stands now, the settlement provides enough benefits to enough people that I’d rather have it go forward than not, and I’m optimistic that many of its shortcomings can and will be addressed (by competition, by legislation, by technological advances) in the short to medium term.

The alternative, just to be clear, is really bad: maybe no book search at all, from anyone (thanks to the unresolved legal questions), and certainly no search of anything outside the (fossilized) public domain. No research corpus. No free public terminals with millions of in-copyright books at libraries. And this situation would endure indefinitely, backed up by the very real example of a messy, expensive, status-quo-reinforcing failure.

Google and EPUBs

Google just announced that they’re making a million+ public-domain books downloadable in EPUB format. This is an improvement over the old situation, where you could download PDFs (sans OCRed text) of those books or read them in plain text online (one physical page at a time), but not download a small, well-OCRed text copy.

I’d be delighted if they went all the way to true plain text downloads. (And then let me download all the public-domain stuff in bulk. And gave me a pony.) But this is a nice improvement. In other news, I’d also be delighted if my Kindle supported EPUB natively.

Why I’m in Favor of the Google Book Search Settlement

When Google announced their book-scanning project five years ago, most academics I talked to about it were pretty happy. These days a lot of that enthusiasm seems, if not to have disappeared, then at least to have been tempered by serious doubts. I share some of these, but on the whole the settlement is a profoundly good thing. I support it, and I hope my colleagues will, too.

About the settlement

First, two notes: One on the underlying legal issue and one on what’s at stake. The publishers and authors (via the AAP and the Authors Guild, respectively) sued Google for alleged “massive copyright infringement” shortly after Google began scanning books from several prominent libraries. The theory is that because Google makes a copy of every book they scan, they require the rightsholders’ permission to assemble their book search database. Google says the process is covered by the fair use exception in copyright law and is no different from their Web search business, which also copies texts in order to index them. How this would be decided in court is unknown, mostly because the legal definition of fair use is extremely and deliberately vague.

But it’s clear that Google has a lot more to lose than do the publishers and authors if the case were to go against them. If the publishers were to lose, Google could index their stuff without further permission. But it’s hard to see how they’d be hurt by that, since it would only help people find books and wouldn’t change the strong basic copyright protections they already enjoy. Google still wouldn’t be able to sell or give away in-copyright books, for instance. Google, on the other hand, could be destroyed if they were to lose. They’d be on the hook for God knows how much in damages, of course (willful infringement of copyright carries maximum statutory damages of $150,000 per instance). But—and this is much more important—because there’s no fundamental difference between the copyright protections for Web pages and those for books, a decision in favor of the publishers would effectively outlaw search as it currently exists. Would a court dare do that? I have no idea, but Google obviously took the threat seriously enough to settle rather than to fight, especially since Web search is everything to them, whereas books are a comparative hobby. I wish Google had chosen to go to trial, because I think (and hope) they would have won, thereby clarifying and solidifying fair use rights in computational contexts, but it’s neither my money nor my business that’s at stake, and I understand why they chose to settle.

This dispute about fair use is interesting in its own right, but it’s not in itself the main objection to the settlement from most of my academic friends. (Most academics, though certainly not all, are in favor of more liberal fair use rights, and would therefore usually side with Google on copyright issues.) They’re concerned instead about a missed opportunity for real reform, and about the perceived market power the settlement would grant to Google and the rightsholders. How so? Not over works that are already in the public domain; these are free to copy and redistribute already, and there’s nothing in the settlement that would (or could) change that. Anyone else could create a competing database of public domain works (see the Open Content Alliance, for instance). And it’s not about current books, whether in or out of print, which the rightsholders are free to dispose of as they wish—they can be bought, sold, and licensed according to the whims of the publishers and authors. Again, nothing in the settlement could possibly change this, since to do so would involve rewriting American copyright law. The issue, then, is over so-called “orphan” works, books for which an appropriate rightsholder cannot be established or contacted.

Here’s how things stand now with respect to orphan works: They’re simply off limits for anything beyond ordinary fair use. They can’t be reissued, corrected, or adapted. You can’t assign them in a college course, because no one can produce a new edition and you can’t make copies of your own or your library’s (rare) copy. You can’t use an orphan sound or video clip in a new song or film. And, absent a real answer to the fair use question raised by Google’s scanning project, you can’t include them in a search tool, because you can’t get a rightsholder’s approval to do so. It would be an exaggeration to say orphan works may as well not exist—they still do sit in libraries and archives—but they’re a lot less useful than either public domain or current works.

The settlement would establish a “rights registry,” a clearinghouse tasked with identifying and tracking rightsholders (if any) and copyright status for all books. As a practical matter, “all” would mean “those scanned by Google,” at least at first. Google would pay $34.5 million to establish this registry, which would then operate as a non-profit and work on behalf of rightsholders, distributing whatever funds it collects to the appropriate parties. In exchange for setting up this registry and paying a chunk of cash ($125 million in all), the publishers and authors drop their copyright infringement claims (so Google can go on scanning). Maybe more importantly, as far as my uncomfortable academic friends are concerned, Google gets the right to scan, process, and sell orphan works, even though their proper rightsholders can’t be determined, and they get indemnity from lawsuits if they make honest mistakes about the copyright status of a work (and sell it or offer it for free when they shouldn’t, for instance). Rightsholders can opt out of this arrangement at any time, though of course they’ll then lose the benefits of being available through Google.

Some objections

This all looks pretty win-win. Google gets to do what they do, maybe opening up a big new market in the process, and they remove a significant legal cloud hanging over them. Publishers and authors get a pile of cash, a new outlet for their goods, and they get to sell a bunch of old stuff that’s currently out of print. Users win because they get a search and information resource that they wouldn’t otherwise have had.

The concern, though, is that Google is the only would-be scanner to benefit directly from the settlement. The settlement leaves unanswered the fair use question about book scanning. It leaves unchanged the status of orphan works, but allows Google alone (at least at first) to make use of them. And it gives two private, for-profit entities (the Authors Guild and the Association of American Publishers) control over the rights registry.

Wouldn’t it be better, these friends of mine say, to resolve these issues legislatively, so that the law would be clear and everyone would stand on level ground? Couldn’t we create a limited right to use orphan works, to store “non-consumptive” copies of texts for computational use, and set up a public rights registry? Wouldn’t that provide better and fairer competition in the marketplace? Absent those changes, don’t we risk creating a situation in which there are only two (cooperating) players (Google and the rightsholders) in the marketplace? Would any other company be able to negotiate an equivalent agreement with the rightsholders? Especially since those rightsholders wouldn’t have any incentive to help set up a competitive market for their products? Would any other company have the resources to scan millions of books, especially after Google has a head start on both the technical and the business sides? Isn’t this our one big chance to get scanning done right? Aren’t we missing a great opportunity to reform a badly out-of-whack U.S. copyright regime? And won’t libraries be almost required by their patrons to subscribe to Google’s digital products, available only at monopoly prices?

My answers

I share many of these concerns. But I still think we’ll be much, much better off with the settlement than without it. Here’s why:

Copyright reform

We do need copyright reform, including provisions for orphan works. But I don’t think we’ll ever get it, especially in the absence of the settlement. When has Congress ever scaled back any part of copyright protection? Is there any reason to think it will do so now or in the foreseeable future? And even if it were to, how long would we have to wait, given our current political priorities, while making no progress on things like book search and computational analysis rights in the interim?

Our current copyright regime—which allows for effectively endless copyright protection without any provision for an evolving public domain—is totally out of alignment with the social cost/benefit analysis that authorizes U.S. copyright law. I don’t think there’s any chance that’s going to change, but if the settlement is approved, there will at least be large, powerful, monied interests (cf. Microsoft, Amazon, and Yahoo, all of which recently [re-]joined the Open Content Alliance) lobbying to create specific provisions relaxing aspects of copyright control like those affecting orphan works and computational use. This differs from the current situation in which all the money and influence is on the other side. And they’ll have a legislator-friendly argument, namely that they’re just trying to compete in the marketplace on terms equal to Google’s. So far, they haven’t had to make this push, because no one has been making much money there. The settlement will change those incentives.

[Note in passing that the Berne Convention is always going to pose problems, since it’s built around absurdly strong European-style (“moral”) copyright provisions that prohibit things like registration requirements. The U.S. has never, of course, been especially keen on international agreements, but copyright protection is one of its long-standing hobby horses. It seems unlikely that the U.S. government would push for serious changes to Berne.]

An open market

There’s no reason to believe other entities won’t be able to enter the marketplace. The settlement provides only non-exclusive licenses to Google, and will serve as a ready-made template for a legal agreement between the rightsholders and any future scanners. Moreover, there would surely be serious antitrust scrutiny if the rightsholders were to withhold similar terms from others who wanted to enter the market. And why would they, really? More outlets means more differentiated products and more opportunities to sell their goods. Plus, with the registry already in place and both scanning and storage getting cheaper by the day, the barriers to entry are falling with time, not rising.

The status quo

What’s the alternative? If the settlement isn’t approved, no one can go ahead with any scanning projects. Not even those limited to the public domain (which, as noted, is less relevant by the day, because nothing new will ever fall into it); it would only take one mistaken scan of a protected work to expose a scanner to bankrupting litigation. Our current copyright system, written exclusively for content creators without even a nod to the public interest, will go on unchanged. And the public, academics and normal people alike, will have lost a terrifically promising resource, one assembled at significant cost and risk (if not with strictly altruistic motives) by a private company at almost no expense to us.

Library costs

Finally, libraries will, as always, have a choice to make about how they spend their subscription money, including whether or not to buy extended access to Google’s offerings. But they’ll already have free access (albeit at a single “terminal,” whatever that will mean in practice) to all of Google’s digital holdings. If prices are too high and they choose not to subscribe, they’ll still be better off than they were to begin with, since they’ll have one terminal with millions of in-copyright books, rather than none, as they do now. And how different is this situation from the one that holds with respect to commercial presses and journal publishers? Those publishers are already effective monopolies, and no one (alas!) seems to be suggesting legislation to change that fact. Do you think Google will be better or worse? How much do you pay for Google’s services now? Plus, if I’m right and other companies or not-for-profits enter the market, any monopoly concern disappears.

Summary

My argument here isn’t so different from the one progressives are now making about health care reform: The current situation is really, really bad. This plan makes things a lot better, with minimal downsides. I’d like real copyright reform as much as I’d like single-payer healthcare, but I think they’re about equally likely. So let’s not let the perfect be the enemy of the good.

Now, there’s a chance that a defeat for the settlement would be galvanizing in its own way, and that it would give rise to serious copyright reform. My own feeling is that if Eldred v. Ashcroft didn’t do it, nothing will. Maybe I’m wrong, but I’d much rather have Google Book Search and all it entails, plus the settlement-provided computational research corpus, a useful and well-funded rights registry (a significant public good), the plausible prospect of a thriving marketplace for digital texts and products based on them, and the first-ever relaxation of at least a few copyright protections, than torpedo the settlement in the hope of getting a marginally better legislative result that’s a huge long shot.

POS Frequencies in the MONK Corpus, with Additional Musings

This post is on the work I presented at DH ’09, plus some thoughts on what’s next for my project. It’s related to this earlier post on preliminary part-of-speech frequencies across the entire MONK corpus, but includes new material and figures based on some data pruning and collection as mentioned in this post (details below).

A word, first, on why I’m working on this. I don’t really care, of course, about the relative frequencies of various parts of speech across time, any more than chemists care about, say, the absorption spectra of molecules. What I’m looking for are useful diagnostics of things that I do care about but that are hard to measure directly (like, say, changes in the use of allegory across historical time or, more broadly, in rhetorical cues of literary periodization).

My hypothesis is that allegory should be more prominent and widespread in the short intervals between literary-historical periods than during the periods themselves. Since we also suspect that allegorical writing should be “simpler” on its face than non-allegorical writing (because it needs to sustain an already complicated set of rhetorical mappings over large parts of its narrative), it makes sense (in the absence of a direct measure of “allegoricalness”) to look for markers of comparative narrative simplicity/complexity as proxies for allegory itself. I think part-of-speech frequency might be one such measure. In any case if I’m right about allegory and periodization and if I’m also right about specific POS frequencies as indicators of allegory, then we should expect certain POS frequencies to exhibit significant (in the statistical sense) fluctuations around periodizing moments and events. (I wish there were fewer ifs in that last sentence; I’ll say a bit below about how one could eliminate them.)
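As a concrete illustration of the kind of measurement involved (though using NLTK’s stock tagger and its coarse “universal” tags, not the MONK/NUPOS pipeline), here is a minimal sketch of per-text POS frequency extraction; the Austen text is just one NLTK happens to ship with:

```python
# A minimal sketch of per-text POS frequency measurement using NLTK.
# Tags here are the coarse "universal" set, not NUPOS word classes.
import nltk
from collections import Counter

for pkg in ("gutenberg", "averaged_perceptron_tagger", "universal_tagset"):
    nltk.download(pkg, quiet=True)

words = list(nltk.corpus.gutenberg.words("austen-emma.txt"))[:20000]
counts = Counter(tag for _, tag in nltk.pos_tag(words, tagset="universal"))
total = sum(counts.values())
for pos in ("NOUN", "VERB", "ADJ", "ADV"):
    print(f"{pos}: {counts[pos] / total:.3f}")  # relative frequency
```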

So … what do we see in the MONK case? Recall that the results from the full dataset looked like this:

POS Frequencies, Full MONK Corpus

But that’s messy and not of much use. It doesn’t focus on the few POS types that I think might be relevant (nouns, verbs, adjectives, adverbs); it includes a bunch of texts that aren’t narrative fiction (drama, sermons, etc.); and it’s especially noisy because I didn’t make any attempt to control for years in which very few texts (or authors) were published. (Note that the POS types listed are the reduced set of so-called “word classes” from NUPOS.)

Here’s what we get if we limit the POSs (PsOS?) in question, exclude texts that aren’t narrative fiction, and group together the counts from nearby years with low quantities of text:

POS Frequencies, Reduced and Consolidated MONK Corpus

And here’s the same figure with the descriptive types (adjectives and adverbs) added together:

POS Frequencies, Reduced and Consolidated MONK Corpus (Adj + Adv)

[Some data details, skippable if you don’t care. First, note that the x axes in all three figures need to be fixed up; they’re just bins by year label, rather than proper independent variables. I’ll fix this soon, but it doesn’t make much difference in the results. You can download the raw POS counts for the full corpus (not sorted by year of publication), as well as those restricted to texts with genre = fiction. These are interesting, I guess, but more useful are the same figures split out by year of publication, both for the whole corpus, and just for fiction (presented as frequencies rather than counts). Finally, there are the fiction-only, year-consolidated numbers (back to counts for these, because I’m lazy). The table of translations between the full NUPOS tags and the (very reduced) word classes presented here is also available.]

So what does this all mean? The first thing to notice is that there’s no straightforward confirmation of my hypotheses in these figures. There’s some meaningful fluctuation in noun and verb frequency over the first half of the nineteenth century—which I think might be an interesting indication of the kind of writing that was dominant at the time (see the noun and verb frequency section of this post)—but no corresponding movement in the combined frequency of adjectives and adverbs. This might mean several things: I might be wrong about the correlation between such frequencies and periodizing events, or I might not be looking at the right POS types, or (quite likely, regardless of other factors) I might not have low enough noise levels to distinguish what one would expect to be fairly small variations in POS frequency.

Where to go from here? A few directions:

I’ll keep working on a bigger corpus. The fiction holdings from MONK are only about 1000 novels, spread (unevenly) over 120+ (or 150+) years. So we’re looking at eight or fewer books on average in any one year, and that’s just not very much if we want good statistics.

There are a couple of ways to go about doing this. Gutenberg has around 10,000 works of fiction in English, so it’s an order of magnitude larger. There are issues with their cataloging and bibliographic quality, but I think they’re addressable and I’m at work on them now. The Open Content Alliance has hundreds of thousands of high-quality digitizations from research libraries, though there are some cataloging issues and I’m not sure about final text quality (their texts rely on straight OCR rather than the hand-correction Gutenberg uses). Still, OCA (or Google Books, depending on what happens with the proposed settlement, or Hathi) would offer the largest possible corpus for the foreseeable future. I’ve been talking to Tim Cole at UIUC about the OCA holdings and will report more as things come together.

But I think it’s also worth asking whether or not POS frequencies are the right way to go; I started down that path on a hunch, and it would be nice to have some promising data before I put too much more effort into pursuing it. What I need, really, are some exploratory descriptive statistics comparing known allegorical and nonallegorical texts. One of the reasons I’ve held off on doing that is that it seems like a big project. The time span I have in mind (several centuries), plus the range of styles, genres, national origins, genders, etc., suggests that the test corpus would need to be large (on the order of hundreds of books, say) if it’s not to be dominated by any one author/nation/gender/period/subject/etc. But how much reading and thinking would I have to do to identify, with high confidence, at least 100 major works of allegorical fiction and another 100 of comparable nonallegorical fiction? And would even that be enough? A daunting prospect, though it’s something that I’m probably going to have to do at some point.

But I got an interesting suggestion from Jan Rybicki (who works in authorship attribution, not coincidentally) at DH. Maybe it would suffice, at least preliminarily, to pick a handful of individual authors who wrote both allegorical and nonallegorical works reasonably close together in time, and to look for statistical distinctions between them. Since I’d be dealing with the same author, many of the problems about variations in period, national origin, gender, and so forth would go away, or at least be minimized. I suspect this wouldn’t do very well for finding distinctive keywords, which I imagine would be too closely tied to the specific content of each work (which is a problem that the larger training set is intended to overcome), but it might turn up interesting lower-level phenomena like (just off the top of my head) differences in character n-grams or sentence length. It would take some work to slice and dice the texts in every conceivably relevant statistical way, but I’m going to need to do that anyway and it’s hardly prohibitive.
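A quick sketch of what two of those lower-level comparisons might look like; the file names are hypothetical stand-ins for any single-author pair:

```python
# Character-trigram profiles and mean sentence length for two texts.
# File names are hypothetical stand-ins for a single-author pair.
import re
from collections import Counter

def char_trigrams(text):
    text = re.sub(r"\s+", " ", text.lower())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def mean_sentence_length(text):
    lengths = [len(s.split()) for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(lengths) / len(lengths)

for label, path in (("allegorical", "animal_farm.txt"),
                    ("nonallegorical", "burmese_days.txt")):
    text = open(path, encoding="utf-8").read()
    top = [g for g, _ in char_trigrams(text).most_common(5)]
    print(f"{label}: top trigrams {top}, "
          f"mean sentence length {mean_sentence_length(text):.1f} words")
```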

So that’s one easy, immediate thing to do. In the longer run, what I really want is to see what people in the field have understood to be allegorical and what not, which would have the great advantage, at least as a reference point, of eliminating some of the problems of individual selection bias. One way to do that would be to mine JSTOR, looking, for example, for collocates of “allegor*” or (more ambitiously) trying to do sentiment analysis on actual judgments of allegoricalness. I suspect the latter is out of the question at the moment (as I understand it, the current state of the art is something like determining whether or not customer product reviews are positive or negative, which seems much, much easier than determining whether or not an arbitrary scholarly article considers any one of the several texts it discusses to be allegorical or not). But the former—finding terms that go along with allegory in the professional literature, seeing how the frequency of the term itself and of specific allegorical works and authors changes over (critical) time, and so on—might be both easy and helpful; at the very least, it would be immensely interesting to me. So that’s something to do soon, too, depending on the details of JSTOR access. (JSTOR is one of the partners for the Digging into Data Challenge and they’ve offered limited access to their collection through a program they’re calling “data for research,” so I know they’re amenable to sharing their corpus in at least some circumstances. I was told at THATCamp by Loretta Auvil that SEASR is working with them, too.)
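The collocate half of that idea is mechanically simple, whatever the JSTOR access details turn out to be. A sketch, assuming the articles arrive as plain-text strings (the `articles` list is a placeholder):

```python
# Count words appearing within +/- 5 tokens of any "allegor*" token.
import re
from collections import Counter

ALLEGOR = re.compile(r"allegor", re.IGNORECASE)

def collocates(text, window=5):
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = Counter()
    for i, tok in enumerate(tokens):
        if ALLEGOR.match(tok):
            context = tokens[max(0, i - window):i + window + 1]
            hits.update(t for t in context if not ALLEGOR.match(t))
    return hits

articles = []  # placeholder: plain-text article bodies from JSTOR
totals = Counter()
for body in articles:
    totals.update(collocates(body))
print(totals.most_common(20))  # candidate collocates of "allegor*"
```

Common stop words would dominate the raw counts, of course; in practice one would filter them or rank candidates by a Dunning-style comparison against a reference corpus.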

[Incidentally, SEASR is something I’ve been meaning to check out more closely for a long time now. The idea of packaged but flexible data sources, analytics, and visualizations could be really powerful and could save me a ton of time.]

Finally (I had no idea I was going to go on so long), there are a couple of things I should read: Patrick Juola’s “Measuring Linguistic Complexity” (J Quant Ling 5:3 [1998], 206-13)—which might have some pointers on distinguishing complex nonallegorical works from simpler allegorical ones—plus newer work that cites it. And Colin Martindale’s The Clockwork Muse, which has been sitting on my shelf for a while and which was (re)described to me at DH as “brilliant and infuriating and wacky.” Sign me up.