Translation Numbers

I came across an interesting summary of books translated in 2009 hosted on the blog “Three Percent” at the U of R (w00t!). A resource new to me.

Headline numbers: 348 total new, first-time translations of fiction and poetry into English published in the U.S. this year. The blog reports that translations make up around 3% of the total publications in the States, and only about 0.7% of literary titles. Not much information on methodology that I could see (on a very cursory look), but I assume the list comes from Books in Print or similar. In any case, I’m grateful to have an answer to one of the questions that’s been on my to-do list for a while.

Next question: How do these numbers compare to those for other countries and to the size of various publishing markets? If a country has a large domestic literary market, do more of its books (proportionately speaking) make it into U.S. translation?

Books I Read in 2009

In the spirit of year-end lists, and for my own future reference, here are the books I read for the first time this year. Most of them, anyway – I didn’t keep a running list and my memory is imperfect. Also: Just primary literature, no scholarship (too many, too complicated, too fragmented).

  • John Barth, Giles Goat-Boy
  • Mark Z. Danielewski, House of Leaves
  • Junot Díaz, The Brief Wondrous Life of Oscar Wao
  • Dave Eggers, What Is the What
  • Jonathan Safran Foer, Extremely Loud and Incredibly Close
  • Rivka Galchen, Atmospheric Disturbances
  • Dagoberto Gilb, The Last Known Residence of Mickey Acuña
  • Ivan Goncharov, Oblomov
  • Uzodinma Iweala, Beasts of No Nation
  • Jonathan Littell, The Kindly Ones
  • David Markson, The Last Novel
  • Nick Montfort, Book and Volume
  • Marisha Pessl, Special Topics in Calamity Physics
  • Kim Stanley Robinson, Red Mars
  • Emily Short, Bronze
  • Richard Yates, Revolutionary Road

Picked up and put down Sacred Games, a couple of Jonathan Lethem novels, Let the Great World Spin, and some other things I’m sure I’ve forgotten. Will update as I remember more.

First up in 2010: The Interrogative Mood.

Shakespearean Clustering

Michael Witmore has a new post up at Wine Dark Sea on further clustering results using Docuscope on Shakespeare’s plays. I don’t have much to add, but comments are disabled on the site, and I do have a question: In his earlier work using principal components, he found that Othello clustered with the comedies. Using the new method reported today (based on “language action types”), that’s not the case. Or is it? When Witmore “standardizes” the texts, Othello returns to the comedies (it’s closest to Twelfth Night and Measure for Measure). So my question is: What is “standardization,” and why should it have so great a negative effect on clustering accuracy? (Othello isn’t the only play that changes places under standardization; as Witmore observes, the standardized results are much less eerily perfect than the nonstandardized ones.)
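For what it’s worth, in most clustering work “standardization” means z-scoring each feature: subtract its mean across all texts and divide by its standard deviation, so that high-count features no longer dominate the distance calculations. A toy sketch of the effect (the feature counts below are invented for illustration, not Witmore’s Docuscope data):

```python
import math

# Hypothetical per-play feature counts (e.g., Docuscope category tallies).
# These numbers are invented to illustrate the effect of standardization.
plays = {
    "Othello":             [200, 10, 1.0],
    "Macbeth":             [198, 12, 3.0],
    "Twelfth Night":       [150, 11, 1.1],
    "Measure for Measure": [152, 12, 1.2],
}

def standardize(vectors):
    """Z-score each feature (column): subtract its mean across all texts
    and divide by its (population) standard deviation."""
    cols = list(zip(*vectors.values()))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c))
            for c, m in zip(cols, means)]
    return {k: [(x - m) / s for x, m, s in zip(v, means, stds)]
            for k, v in vectors.items()}

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(vectors, target):
    return min((k for k in vectors if k != target),
               key=lambda k: euclid(vectors[target], vectors[k]))

# Raw counts: the large first feature swamps everything else.
raw_nn = nearest(plays, "Othello")               # "Macbeth"
# Standardized: the formerly tiny third feature now counts equally.
std_nn = nearest(standardize(plays), "Othello")  # "Twelfth Night"
print(raw_nn, std_nn)
```

The point being that standardization gives low-frequency features the same say as high-frequency ones, which can easily move a borderline text from one cluster to another; whether that’s a gain or a loss in accuracy depends on whether those rare features carry signal or noise.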

Critical Text Mining, or Reading Differently

The short, polemical version of my thesis is this: It can be done better. The more nuanced and accurate version is that close reading—our only real working method—has significant intellectual costs, and that we’re now positioned to reduce some of those costs if we’re willing to take seriously the opportunities afforded by digital texts.

Let me elaborate …

Here’s how nearly all of us work: We find a cultural object that we like, or dislike, or that strikes us as important. We study it closely. We note passages or images or structural features that reveal something about the work as a whole. And then we make an argument about the relationship between our object and some larger phenomenon: American naturalism, maybe, or late capital, or the Victorian public sphere. Sometimes—but not often—our argument is causal: that Uncle Tom’s Cabin, for instance, helped set in motion the events of the Civil War, or that Dickens’ style was a direct product of the economic conditions of publishing in the nineteenth century. More often, though, we treat our objects symptomatically, and address to them questions of the form “What must have been the case for this object to have been produced as it was?” or “What hidden features of its situation of production does this object reveal to us?”

This is all to say that we are not—and have not been for some decades, at the least—primarily aesthetic critics. Our immediate objects are aesthetic, yes, but our concerns and investments are sociocultural, political, economic, and so forth. This is a good thing, because it keeps our field from being utterly marginalized as a matter of belle-lettrism or art appreciation. More to the point, there’s broad agreement in the profession that these social and cultural questions are by far the most important and interesting ones we can address. As I say, this is all to the good.

But there’s a certain tension between our large-scale ambitions and the techniques we use to pursue them. After all, we’ve been reading a few important texts with exceeding care since Aristotle taught us how to achieve the effects of tragedy by referring to illustrative passages from Sophocles and Euripides. (NB. For convenience, I’m dropping the general “object” and “analyze” for “book” and “read,” but understand this as a part-for-whole figure.) This makes sense: If you want to teach people to write well, you show them examples of good writing to imitate. And since most writing is dross, you can ignore the majority of it. Nor is the matter confined to pedagogy. If your aim is to say “this is good, this is art,” you’ll work the same way; you don’t need to have read all the bad stuff to understand the good. In a related vein, if your task is to understand a text that you already know is important—the Bible, say—it pays to read that book closely, even if your devotion means you can’t read much else.

Two models, then, from which we derive our dominant working method: aestheticism and biblical hermeneutics. There’s nothing wrong with either of them, of course, but we should notice two things:

  1. They assume either that there aren’t very many books to read, or that we can get away with reading a smallish subset of a larger literary field. In other words, that there’s no necessary problem that follows from not reading everything.
  2. Neither one (aestheticism, hermeneutics) looks much like the kind of cultural criticism that I claimed now rightly dominates our field.

The first of these—the assumption that important books are scarce—is the reason we still have canons. If you should and can read everything, you don’t need a canon. But if you shouldn’t (because some books are unworthy, or politically suspect, or blasphemous—three ways of saying the same thing) or can’t (“because they are too menny”), then you have to pick a few to read, assuming it’s important to read in the first place. And the books you pick from any sufficiently large pool of candidates will be at some level arbitrary and nonrepresentative, if only because you’ll have read so little of the source material in the first place. This is why the canon wars of decades past were at once necessary and absurd. Necessary because it was in fact important to stop reading Dryden and start reading Morrison (synecdoches both, of course). But absurd because the idea that rearranging the canon is in any way egalitarian—it just picks new winners—or has anything to do with eliminating canonicity as such is entirely misguided. So long as we depend on close reading, we will always work on a group of texts that comes within a rounding error of nothing when compared to the full field of literary production.

[Footnote: Some quick figures and calculations. There are about 50,000 new novels published in the U.S. every year. There are 26,000 tenured and tenure-track faculty members in U.S. English departments (of which 7,000 are at R1 schools). Assume that ten percent of any English department’s faculty work on truly contemporary American fiction. Rounding a bit, that means twenty novels per TT faculty member per year, assuming absolutely no overlap. If you want to have just four other people with whom to discuss your work, you’ll need to publish on 100 new novels every year, just to keep up with the pace of literary production in the United States. Even that seems optimistic; consider the case of all English-language novels published before 1900. There are no more than 100,000 such titles, and the number isn’t growing. If 5,000 TT faculty work on them, that means each faculty member is responsible for just 20 (or 100, if we want to have overlapping coverage as above) novels over her entire career (not annually). And yet we know from experience that we haven’t dealt with anywhere near all of the novels published before 1900—not even close—over all of literary-critical history, much less during each academic generation. And in any case, this addresses the problem only at the level of the profession as a whole; it does nothing to provide each individual researcher with meaningful knowledge about the full range of relevant literary and cultural production. (Sources: R.R. Bowker, MLA.)]
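The footnote’s back-of-envelope arithmetic, in runnable form (same round figures as above, including the assumed ten-percent share):

```python
# Coverage arithmetic from the footnote above.
novels_per_year = 50_000   # new U.S. novels annually (Bowker)
tt_faculty = 26_000        # tenured/tenure-track English faculty (MLA)
contemp_share = 0.10       # assumed share working on contemporary U.S. fiction

contemp_faculty = tt_faculty * contemp_share       # 2,600 people
per_faculty = novels_per_year / contemp_faculty    # ~19.2, i.e. ~20 novels/year each
with_overlap = per_faculty * 5                     # ~96, i.e. ~100/year if 5 readers per novel

pre1900_titles = 100_000   # English-language novels before 1900 (upper bound)
pre1900_faculty = 5_000
pre1900_each = pre1900_titles / pre1900_faculty    # 20 per career, with no overlap
print(per_faculty, with_overlap, pre1900_each)
```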

Again, this wouldn’t matter so much if our aims were aesthetic or exegetical. But they aren’t. And so we have a real problem. We want to be able to claim that the books we study are representative (the word I used earlier was symptomatic) of the culture that produced them, so that by analyzing those books we are also necessarily analyzing that culture. True, this always works at an initial level: Any book is indeed a product—and a part—of a cultural situation. But it’s a minuscule part, not because books are unimportant (that’s a separate question), but because it represents so small a fraction of that situation’s cultural and symbolic output. So we need a second level of representativity, and the usual way of providing it is to argue that the book in question is especially illuminating, that it reveals something about the situation of its production that is either particularly important and typical (thus that the book is more central than others) or especially difficult to discern by other means (hence that the book is uniquely diagnostic). Either way, though, it’s hard to make this claim without already knowing, on independent grounds, a good deal about the features and configuration of the situation you’d like to say the book allows you to diagnose.

Another way of putting this is to say that literary studies as currently practiced has a problem with at least perceived selection bias. If you want to argue, for instance—and I’m borrowing from my own work here for an example that will eat up most of the rest of my talk—that allegory is a form well-suited to dealing with the problems of transitional cultural forms, and therefore that it (allegory) should play a prominent role in the literature of a transitional moment like late modernism, you’ll want to show that late modernist literature is indeed especially allegorical. You can do that by offering readings of a broad swath of important late modernist fiction—six or eight books, maybe—and pointing to the significant, underappreciated role that allegory plays in them, and then comparing these results to the standard account of modernism and postmodernism proper, from which allegory is essentially absent. But you can already imagine the difficulties of this approach. What about The Bell Jar? What about other books written shortly after the war that aren’t allegories? What about instances of allegory before and after late modernism? What about Pilgrim’s Progress? Some of these questions are easier to answer than others, but they all amount to the same thing: Allegory wasn’t invented in the years following the second world war. If it’s been around for centuries, and if we know about examples of it both in and out of that period, and if you’ve read very little, statistically speaking, of the relevant literature, how can you be sure that the phenomenon you claim to have identified isn’t simply an artifact of your own haphazard selection of texts?

The long answer involves the meaning of cultural dominance, the practical and theoretical relationship between literary production and social-economic conditions, the evolution of established forms, a reasonable knowledge of literary history, and a certain amount of hand-waving. But the short answer is that you don’t. You don’t know that any meaningful claim about the large-scale processes of cultural production is right in the sense that it describes the operations of that production by reference to the bulk of its output. And you never will, if that production occurs on an industrial scale and if you need to engage with each artifact individually and at length.

So what should you do instead, if this fact bothers you and if you don’t intend either to give up entirely or to return to small-scale, low-stakes aestheticism? One approach—an approach less insane than it sounds, since it’s really only what you were implicitly claiming in the first place—is to try to figure out how to measure allegory in millions of texts across hundreds of years. If you can do that, you’ll be able to see how the frequency and degree of allegorical fiction have changed over the last several centuries, and you’ll observe the degree to which those fluctuations correspond to our existing sense of literary and cultural periodization. If the initial claim was correct, and allegory is closely tied to eventful or revolutionary moments, you’d expect to find something like this cartoon:

Figure 1. Allegory as a function of time (cartoon).

The best-case scenario is that most spikes in allegoricalness line up with major transitions in literary history as we now understand them (so that we have confidence that the method works and that the theoretical claim is plausibly supported), but also that a small number of them don’t correspond (so that we have new and interesting problems in literary periodization to investigate).

The problem, of course, is that we don’t know exactly what it is about a text that makes it allegorical. This is due in part to a defect in the existing conventional criticism, namely that we understand allegory much less well than we generally assume. But it’s also true that allegory is probably not reducible to anything so obvious as the frequency of intuitable keywords or the length of the text. What we need, then, is a reasonably large corpus that consists of known allegorical and nonallegorical works on which we can try out any proposed criteria. Better yet, we could go looking (pseudo-inductively) within that corpus for markers that work more or less well to separate allegory from nonallegory. But now we run up against a second difficulty, namely that there’s no ready list of allegorical books. (Note in passing that we’re therefore on different ground from Moretti and Jockers’ or Witmore’s work on genre determination, which relies on established bibliographies or the long-standing consensus around Shakespeare’s plays.) One could try to come up with—and to defend—such a list by hand, and it will probably be necessary to do so eventually. But that’s an extremely labor-intensive way to go. We’re looking for a sampling that covers three or four hundred years, across national origins, textual subgenres, author genders, subject matter, and so forth. If we don’t want the distinctions between allegorical and nonallegorical works to be dominated by the vagaries of selection bias (after all, that’s the problem we’re trying to get around), the corpus will need to include at least hundreds of books, maybe thousands. As I say, a lot of work.

[Footnote: Apropos our unsettled understanding of allegory, my own general-purpose definition is as follows: An allegorical text is one in which we can (and should) identify a pervasive set of figurative mappings between elements of the (independently legible) plot and elements of a coherent second plot, typically more significant (for us, the readers) than the first. The normative imperative is slippery here, of course, but I don’t see much way around it, and of course the whole thing is fundamentally context-dependent; allegory is as much about reading practices as about writing.]

What we’d like to have is a set of texts that differ only in that they are or are not allegorical, holding constant as many other variables as possible (author, subject, date of composition, and so on). That’s not likely to happen directly, but what if we compare works by the same author that fall on either side of the allegorical divide? That doesn’t do much to help with subject matter, so it’s not likely to help a great deal with keyword identification, but it controls for authorial style, gender, national origin, date of composition (give or take), and, if we’re selective, subgenre. If we can get a handle on the differences between a reasonable number of such pairs, we’ll have a start on identifying a more comprehensive set of differentiating features for allegory.

Preliminary work along these lines suggests that allegorical texts tend to be rich in verbs and poor in many of the other major parts of speech, including nouns, descriptive words, conjunctions, and prepositions. This makes decent sense: allegories tend to feature comparatively simple surface-level plots and to favor action over description, because the more complicated and specific the first-order narrative, the more difficult it is to maintain the mapping between levels that is the substance of allegory. The parts-of-speech correlation is nowhere near perfect, of course, and analyzing it is a problem in an awkward number of variables, even before you add other potentially differentiating criteria such as keywords, common word frequencies, and grammatical constructions. What’s required—and what’s currently underway—is an iterative process by which the corpus of known allegorical and nonallegorical texts can be built up in pieces and with increasing confidence. This relies both on computational methods, which extract information about the texts and weigh the various differentiating factors, and on what the information retrieval people in computer science call “domain knowledge”: an informed critical assessment of the nature and function of allegory in a wide array of texts.
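For the curious, the statistic itself is easy to state. Here’s a generic implementation of Dunning’s log likelihood (G²) for a single feature, with invented counts standing in for real part-of-speech tallies; this is the bare statistic, not my working pipeline:

```python
import math

def dunning_g2(count_a, total_a, count_b, total_b):
    """Dunning log likelihood (G-squared) for one feature observed
    count_a times among total_a tokens in text A and count_b times
    among total_b tokens in text B."""
    expected_a = total_a * (count_a + count_b) / (total_a + total_b)
    expected_b = total_b * (count_a + count_b) / (total_a + total_b)
    g2 = 0.0
    if count_a:  # 0 * log(0) is taken as 0
        g2 += count_a * math.log(count_a / expected_a)
    if count_b:
        g2 += count_b * math.log(count_b / expected_b)
    return 2 * g2

# Invented example: verb tokens in an allegorical vs. a nonallegorical
# text by the same author.
g2 = dunning_g2(12_000, 80_000, 10_000, 85_000)
print(round(g2, 1))  # → 323.5; values above ~3.84 are significant at p < .05 (chi-squared, 1 d.f.)
# The direction of the association comes from comparing the raw rates:
# 12000/80000 > 10000/85000, so verbs are overrepresented in the first text.
```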

Figure 2. Allegory and parts of speech in single authors, demonstrating the number of pairings in which each part of speech was a significant differentiating factor. Blue = positive association with allegory, red = negative. Method of comparison is Dunning log likelihood.

This is slow going, in a way: Applying descriptive statistics and information retrieval techniques to literary text analysis isn’t something that’s widely practiced or about which there’s much prior work to consult. And the less said about copyright and twentieth-century texts, the better. But because the process builds outward from close readings of a few dozen novels, it’s much faster than dealing with hundreds, thousands, or millions of novels individually. More importantly, once it reaches a critical mass past which we are confident of our ability to classify allegorical and nonallegorical works algorithmically, it can scale to arbitrarily large corpora with minimal effort. And that’s really exciting. Recall what’s happened here. We started with a problem that we simply couldn’t address, namely how to say something informed about the enormous range of literature that we can’t read. We had a conventional theoretical and literary-historical argument that was significantly hampered by this problem. By introducing computational methods, we now have the ability to overcome this limitation, and to do it in a way that extends and augments (rather than replaces) our existing analytical techniques.

[Footnote: It’s not quite right to say that statistical-computational work in literary studies is entirely new, and I don’t want to be accused of forgetting several decades of progress. But I also don’t want to speak the name of authorship attribution, and I don’t see how you can do the former without the latter, at least not in a short paper whose main concerns lie elsewhere. See Martin Mueller’s excellent analysis of the problem, which does this work so I don’t have to.]

Even if you don’t care much about allegory in particular, this example should appeal to you. The idea isn’t, of course, either to replace close reading entirely or to replace ourselves with computers as the ones doing the close reading, but instead to offer a new kind of evidence that’s especially well suited to supporting the kinds of extracanonical cultural, historical, and sociological claims that have come to occupy a central place in the discipline over the last few decades. Insofar as computational methods like the ones I’ve described today advance this cause—and because we know only too well the limitations of our existing method—there’s little doubt that digitally assisted criticism can and must play a much more important role in both the immediate and long-range future of literary studies. We have only to begin doing it.

Postscript

The focus to this point has been on the movement “outward” or “upward” from individual texts toward large-scale social and historical issues understood through variations in large literary corpora. Computational methods are well suited to this type of scaling, since computers do simple things like part-of-speech recognition quickly and uncomplainingly. But the process can and does work in reverse, something I suggested elliptically through the interplay of machine learning and domain knowledge. To be more explicit, though, we can imagine a case in which computationally derived information about the prevailing level of allegorical usage in a given period would allow us to identify, for instance, the anomalous allegorical bent of an author not usually understood to be an allegorist. This in turn might point us to a new reading of that author’s works, or of her relationship to the dominant modes of her age. Or our attention might be drawn to an author’s work that scores especially highly or anomalously in allegorical terms, but that has never before been given critical attention.

More concretely and in a different area, what if statistical analyses of Shakespeare’s texts suggest that Othello looks much like a comedy? This certainly doesn’t mean that Othello is a comedy, but it might give us new reasons to return to a well-known text and to ask of it different questions than those we posed in the past. The results of our new line of inquiry may or may not be interesting; only time and analysis will tell. But new ways of approaching Shakespeare are probably a good thing, as are procedures that allow us to pluck interesting works from obscurity for guided critical attention. And both of these are examples of the “inward” or “downward” orientation that computational work also serves.


[Note, 24 May 2010: This is a lightly revised version of my talk for the 2009 MLA conference in Philadelphia.]

How Many New Novels are Published Each Year?

In my recent talks, I’ve been saying things like “there are tens or hundreds of thousands of new novels published every year, and I just can’t read all of them.” Matt Kirschenbaum says this demonstrates a deplorable lack of initiative in our younger scholars, and he’s probably right. But is my count reasonable? I pretty much made it up, so I thought I should check.

But how do you do that? Google could probably tell you how many volumes are in their metadata database, along with their years of publication, but how many of those are novels or other works of prose fiction? Wikipedia claims to know the totals for “books” broken down by country, though their numbers are oldish and taken from diverse sources.

If we can live with U.S.-only numbers, and if we’re mostly interested in English-language fiction, we can consult R.R. Bowker’s publishing statistics (they’re the people who run Books in Print). From them we learn that there were 407,000 books published in 2007 (the last year for which final numbers are available), a total that includes 123,000 “on-demand, short run, and other unclassified” titles. Of the 274,000 classified titles, 43,000 are “fiction,” a category that includes “strictly adult novels (including graphic novels) and short story collections.” (There are separate categories for anthologies, literary criticism, poetry, drama, etc. Oh, and “adult” is opposed to “juvenile,” not a synonym for porn.) If the same ratio holds for the unclassified category, we’d have another 19,000 novel-like entries, for a total of 62,000.
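The arithmetic, for anyone who wants to fiddle with the assumptions:

```python
# Bowker's 2007 figures as cited above.
classified = 274_000     # classified titles
unclassified = 123_000   # on-demand, short-run, and other unclassified titles
fiction = 43_000         # classified "fiction" titles

ratio = fiction / classified                      # ~15.7% of classified output
est_unclassified_fiction = ratio * unclassified   # ~19,000 more novel-like titles
total_us = fiction + est_unclassified_fiction     # ~62,000
print(round(est_unclassified_fiction), round(total_us))
```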

The U.S. isn’t the only (predominantly) English-language book market in the world, of course; Britain’s is about the same size, Canada and Australia are significant, and there are many English-language novels published elsewhere. But there’s also redundancy in some of the titles shared between markets, and a portion of the new titles are only new editions or bindings of previously released texts. (As an aside, I wonder how many of the books published annually ever exist in more than one edition? I’d bet it’s a much smaller number than our scholarly experience with canonical-ish texts would suggest. I also wonder how many new U.S. titles are in languages other than English.) Accounting for all of those factors is more work than I want to do at the moment, though I’d love to hear what other people know about them.

In the meantime let’s assume, conservatively, that the global total is on the order of twice the U.S. number. In that case it seems pretty safe to say there are around 100,000 new English-language works of long-form prose fiction published globally each year. That’s a ballpark number, but I don’t see any reason to believe that it’s off by more than a factor of about two, and it’s certainly of the right order of magnitude. Conclusion: I can go on using my line about the number of books I’m not reading.

[Update, 29 September 2010: See also this follow-up post on numbers from the UK. And note that Bowker has since released figures that include 2009; the major story there is that “nontraditional” volumes (reprints of public-domain classics and print-on-demand, mostly) have exploded in the last few years, now far outnumbering (by about 3:1) the mostly flat traditional volumes. Sales are another matter, of course.]

[Update, 30 July 2012: I see there’s been some bit rot at Bowker. 2011 numbers (with prior year figures) are now available. In general, this info is released annually via a press release in May or June for the previous year. If a specific Bowker link is dead, search their site for something like “publishing industry” or “publishing output” and the most recently past year (2011, etc.)]

Allegory in Single Authors

I’ve been following up a suggestion from Jan Rybicki about discovering statistically distinguishing features of allegorical and non-allegorical writing by comparing individual works by single authors rather than (or preliminary to) large corpora. This has some downsides (I don’t expect it to be much good for detecting characteristic terms/lemmata, for instance, which will be dominated by the specific content of the individual texts), but it might be a useful quick and dirty way to get a better feel for where to direct my attention.

Results to come in the next week or so, but in the interim I’m interested in thoughts on especially useful pairings. What I’m looking for are pairs of works by an author, one of which is decidedly allegorical, the other of which is not. Some examples are below.

Note that there are practical constraints: I need books that are available in full-text electronic form, which rules out most things published after 1923. And I’d like to use works that are reasonably familiar, if only so other literary folks can evaluate for themselves whether or not I’ve classified them correctly. Roughly matching word counts couldn’t hurt, but aren’t terribly important, since I’m mostly looking at frequency-regularized Dunning log likelihoods (and because length itself might be a marker of allegoricalness, though I don’t expect to answer that question with so small a sample). The more of these pairs, the better, but the point is that this isn’t true corpus work, so I’m not feeling like I need hundreds (that’s for later!).

Some suggestions thus far:

Author   | Allegorical                             | Nonallegorical
Alcott   | Little Women (1868)                     | Work (1873)
Bunyan   | Pilgrim’s Progress (1678)               | Grace Abounding (1666)
Defoe    | Robinson Crusoe (1719)                  | Journal of the Plague Year (1722), Moll Flanders (1722)
Dickens  | Christmas Carol (1843)                  | Bleak House (1853), Martin Chuzzlewit (1844)
Eliot    | Adam Bede (1859), Silas Marner (1861)   | Middlemarch (1871), Mill on the Floss (1860)
Melville | Confidence Man (1857), Moby-Dick (1851) | Israel Potter (1856)
Orwell*  | Animal Farm (1945), 1984 (1949)         | Burmese Days (1934), Road to Wigan Pier (1937)
Shelley  | Frankenstein (1818)                     | Mathilda (1819)
* God bless wonky Australian copyright

A couple of comments: There’s a regrettable skew toward the middle of the nineteenth century here, but my sample will probably always be nineteenth-century rich due to both the historical development of novel writing and the realities of copyright law.

Poetry would be an interesting addition. I’m not sure to what extent the vagaries of rhyme, meter, etc. would impact the comparisons, but I’d like to find out. So … what should I use over against Paradise Lost, for instance? Also, did Bunyan ever write something non-allegorical? (I can’t think of anything.) (Grace Abounding; thanks to Suzanne Keen, by way of the narrative list.) What about Langland? (Ditto, no hope.)

I’d like to have decent national and gender balance, which seems OK in the tiny sample I’ve given here. More variety would always be better.

I’ve tried to avoid overt Bildungsromane, on the theory that they’re always at least a little allegorical, even when they’re not. Alcott’s the exception because, you know, Little Women.

Thoughts and suggestions for changes, deletions, or insertions?

Followups on the GBS Settlement

There have been some very smart comments on (and around) my previous post on the Google Book Search settlement. If you’re interested, you might want to see the comments section of that post, plus two good posts by Eric Kansa, one before and one after the recent GBS conference at Berkeley.

Most of my thoughts on the points Eric and others raise appear in the comments section of my last post (linked above). But I think maybe the gut-level difference is related to this passage from Eric’s second post:

The Google Books corpus is unique and not likely to be replicated (especially because of the risk of future lawsuits around orphan-works). This gives Google exclusive control over an unrivaled information source that no competitor can ever approximate.

If Eric’s right about this, then it’s critical to get as much public access as possible built into the settlement now, because we won’t have another shot at it. For reasons laid out in my previous post, though, I’m less pessimistic about the prospects for future competition. I think the settlement will make it easier for others to enter this space by providing both a template for negotiations with the authors and publishers and a strong antitrust incentive for the rightsholders to grant equal access.

Scanning is a big-ish project, no doubt, but not prohibitively so (witness the Open Content Alliance, as well as Microsoft’s former efforts, stopped more by fear of legal action than by lack of funds). This is especially true if it turns out there’s significant money to be made by doing it (and the objection, after all, is that the scanned corpus is an immensely valuable resource on which Google will be sitting). Plus, scanning will only get cheaper with time.

My own ideal case would be a combination of meaningful copyright reform (to clarify that scanning for indexical use doesn’t require permission from a rightsholder) and something like Dan Cohen’s proposal for a government- (or Ivy League-)funded book-scanning “moon shot” to benefit society at large. Barring this (extremely unlikely, I think) outcome, by all means, let there be as many public-friendly provisions tacked onto the GBS settlement as possible. My point, though, is that even as it stands now, the settlement provides enough benefits to enough people that I’d rather have it go forward than not, and I’m optimistic that many of its shortcomings can and will be addressed (by competition, by legislation, by technological advances) in the short to medium term.

The alternative, just to be clear, is really bad: maybe no book search at all, from anyone (thanks to the unresolved legal questions), and certainly no search of anything outside the (fossilized) public domain. No research corpus. No free public terminals with millions of in-copyright books at libraries. And this situation would endure indefinitely, backed up by the very real example of a messy, expensive, status-quo-reinforcing failure.

Google and EPUBs

Google just announced that they’re making a million+ public-domain books downloadable in EPUB format. This is an improvement over the old situation, where you could download PDFs (sans OCRed text) of those books or read them in plain text online (one physical page at a time), but not download a small, well-OCRed text copy.

I’d be delighted if they went all the way to true plain text downloads. (And then let me download all the public-domain stuff in bulk. And gave me a pony.) But this is a nice improvement. In other news, I’d also be delighted if my Kindle supported EPUB natively.

Supplemental Readings: Contemporary Edition

This semester’s “Contemporary U.S. Novel” syllabus has six primary texts:

  • David Foster Wallace, Infinite Jest (1996, 1104 pp.)
  • Barbara Kingsolver, The Poisonwood Bible (1998, 576 pp.)
  • Colson Whitehead, John Henry Days (2001, 389 pp.)
  • Jonathan Safran Foer, Extremely Loud and Incredibly Close (2005, 368 pp.)
  • Junot Díaz, The Brief Wondrous Life of Oscar Wao (2007, 352 pp.)
  • Rivka Galchen, Atmospheric Disturbances (2008, 256 pp.)

Six texts aren’t a whole lot to cover a decade, especially when there’s no consensus concerning what’s important. If you’re one of my students and you’re looking for reading that will extend what we’re covering in class, here are some suggestions. All of these are texts that I considered putting on some version of the syllabus; not all of them are American and not all are from the last decade (but very few are more than twenty years old):

  • Octavia Butler, Parable of the Sower
  • J. M. Coetzee, Disgrace
  • Edwidge Danticat, The Farming of Bones
  • Don DeLillo, Falling Man
  • Louise Erdrich, Tracks
  • Jonathan Safran Foer, Everything Is Illuminated
  • William Gaddis, Carpenter’s Gothic
  • Dagoberto Gilb, The Last Known Residence of Mickey Acuña
  • Gish Jen, Mona in the Promised Land
  • Nathaniel Mackey, Bedouin Hornbook
  • David Mitchell, Cloud Atlas
  • Toni Morrison, Beloved and Song of Solomon and A Mercy
  • Thomas Pynchon, Inherent Vice
  • Marilynne Robinson, Home and Gilead
  • Arundhati Roy, The God of Small Things
  • Salman Rushdie, Midnight’s Children and The Satanic Verses
  • Leslie Marmon Silko, Almanac of the Dead
  • John Edgar Wideman, Fanon and The Cattle Killing

Even this list is much too short, but it’ll point you in some interesting directions.