## Visualizing Uncertainty with Probability Clouds

August 20, 2014

I’ve come up with a visualization of data uncertainty that seems really obviously useful, but that I’ve never seen before. So I guess some combination of three things must be true:

- I am a genius. Deeply unlikely, given that I misspelled “genius” the first time I typed it here.
- There’s something wrong with the “new” method that makes it less useful than I think and/or total bunk.
- People do use this, and I just haven’t seen it before. Totally possible, given the number of statistical visualizations in most literary studies papers.

Anyway, the idea is to use probability clouds to show a density region around a given line of best fit through the data.[1] I think this avoids some visual-rhetorical pitfalls in the usual ways of showing trends and uncertainty in data, but/and I’d be grateful for thoughts on its value.

Here’s the context and an example: I’m working on a manuscript at the moment for which I need to visualize a bit of data. Nothing fancy; this is one of the basic figures:

Yeah, the axes aren’t labeled, etc. The point is, there are two series that are pretty noisy but seem to be doing different things over time (along the x axis).

OK, so to get a handle on the trend, let’s insert a linear fit for each series:

Neat! But the fit lines are a little misleadingly precise. I don’t think we want to say that the “true” value of series 2 in 1820 is exactly 0.15, or that the true values cross in exactly 1872. So let’s add a confidence interval at the usual 95% level:

Better, but this manages to be somehow both too precise and not precise enough. Beyond the line of best fit, which still suggests false precision at the center, the shaded 95% confidence region comes to an abrupt end (too precise) and doesn’t have any internal differentiation (not precise enough). The true value, if we want to think of it that way, isn’t equally likely to fall anywhere within the shaded region; it’s *probably* somewhere near the middle. But there’s also a smallish chance (5%, to be exact) that it falls outside the shaded region entirely.
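To make the "abrupt end" point concrete, here is a minimal sketch in base R. It uses simulated data, not the series from my manuscript, and the variable and function names are made up for illustration. It computes the half-width of the confidence band around a linear fit at several confidence levels. The band widens smoothly as the level rises, so the 95% cutoff is just one slice through a continuum.

```r
# Simulated stand-in data; NOT the series shown in the figures.
set.seed(42)
x <- 1800:1899
y <- 0.1 + 0.001 * (x - 1800) + rnorm(length(x), sd = 0.05)
fit <- lm(y ~ x)

# Half-width of the confidence band for the fitted mean at one x value.
# (Hypothetical helper; it just wraps stats::predict.)
band_half_width <- function(model, at, level) {
  ci <- predict(model, newdata = data.frame(x = at),
                interval = "confidence", level = level)
  unname((ci[, "upr"] - ci[, "lwr"]) / 2)
}

conf_levels <- c(0.50, 0.80, 0.95, 0.99)
widths <- sapply(conf_levels,
                 function(l) band_half_width(fit, at = 1820, level = l))

# One row per confidence level; half_width grows with level.
print(data.frame(level = conf_levels, half_width = round(widths, 4)))
```

Nothing special happens at 0.95. It's a convention, which is why drawing a hard edge there overstates what we know.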

So why not indicate those facts visually, while getting rid of the fit line entirely? Here’s what this might look like:

This seems a lot better. It doesn’t draw your eye misleadingly to the fit line or to the edges of an arbitrarily bounded region, but it does suggest where the real fit might be. And it does that while making plain the fuzziness of the whole business. It would be even better in color, too. I like it. Am I missing something?

On the technical side, this is built up by brute force in R with ggplot. The relevant code is:

    library(ggplot2)

    se_limit     = 0.99  # Largest confidence level to show (geom_smooth's
                         # `level`); valid range 0 to 1
    se_regions   = 100   # Number of regions in uncertainty cloud. 100 is a lot;
                         # a little slow, but produces very smooth cloud.
    se_alpha_max = 0.5   # How dark to make region at center of uncertainty cloud.
                         # 0.5 = 50% grey.
    line_type    = 0     # A ggplot2 linetype for fit line; 0 = none, 1 = solid

    p = qplot(x, y, data=data)  # Use real data, of course!
    for(i in 1:se_regions) {    # This loop generates the uncertainty density shading
      p = p + geom_smooth(method = "lm", linetype = line_type, fill = "black",
                          level = i*se_limit/se_regions,
                          alpha = se_alpha_max/(se_regions))
    }
    p  # Show the finished plot

That’s it. As you can see, it’s just brute force building up overlapping alpha layers at different confidence levels. I once looked at the denstrip package, but couldn’t make it do the same thing. But I’m dumb, so …
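One detail of the layering is worth a quick check. Assuming the graphics device uses ordinary alpha compositing (I believe most do, but treat that as an assumption), n fully overlapping layers that each have opacity a combine to 1 - (1 - a)^n, not n * a. So dividing `se_alpha_max` by `se_regions` gives a center somewhat lighter than the nominal 50% grey. The sketch below works out how much lighter, and what per-layer alpha would hit the target exactly. The two helper functions are hypothetical names of my own, not part of ggplot2.

```r
# Combined opacity of n fully overlapping layers, each with opacity a,
# assuming standard "over" alpha compositing.
stacked_opacity <- function(a, n) 1 - (1 - a)^n

# Per-layer opacity needed for n overlapping layers to reach a target opacity.
layer_alpha_for <- function(target, n) 1 - (1 - target)^(1 / n)

se_regions   <- 100
se_alpha_max <- 0.5

# Opacity at the center of the cloud with alpha = se_alpha_max / se_regions
center_as_coded <- stacked_opacity(se_alpha_max / se_regions, se_regions)

# Per-layer alpha that would give exactly se_alpha_max at the center
alpha_corrected <- layer_alpha_for(se_alpha_max, se_regions)

print(round(center_as_coded, 3))  # about 0.39 rather than 0.5
print(round(alpha_corrected, 5))  # slightly larger than 0.5 / 100
```

If the exact darkness at the center matters, passing `alpha = alpha_corrected` inside the loop should do it. For a figure like this one the difference is mostly cosmetic.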

**Update:** I *knew* I couldn’t be the first to have thought of this! Doug Duhaime points me to visually-weighted regression, apparently first suggested by Solomon Hsiang in 2012. There’s R code (but I guess not yet a formal package) to do this at Felix Schönbrodt’s site.

Here’s a version using Felix Schönbrodt’s vwReg(). Not all cleaned up to match the above, but you get the idea:

[1] If you’ve learned any undergrad-level physical chemistry, you can probably see where this idea came from. Here’s a bog-standard textbook visualization of the electron probability density of a 2p atomic orbital:

## Bamman, Underwood, and Smith, “A Bayesian Mixed Effects Model of Literary Character” (2014)

June 18, 2014

Too long for Twitter, a pointer to a new article:

- Bamman, David, Ted Underwood, and Noah A. Smith, “A Bayesian Mixed Effects Model of Literary Character,” *Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics* (2014): 370-79.

NB. The link here is to a synopsis of the work and related info; you’ll want the authors’ PDF for details.

The new work is related to Bamman, O’Connor, and Smith’s “Learning Latent Personas of Film Characters” (ACL 2013; PDF), which modeled character types in Wikipedia film summaries. I mention the new piece here mostly because it’s cool, but also because it addresses the biggest issue that came up in my grad seminar when we discussed the film personas work, namely the confounding influence of plot *summaries*. Isn’t it the case, my students wanted to know, that what you might be finding in the Wikipedia data is a set of conventions about describing and summarizing films, rather than (or, much more likely, in addition to) something about film characterization proper? And, given that Wikipedia has pretty strong gender/race/class/age/nationality/etc./etc./etc. biases in its authorship, doesn’t that limit what you can infer about the underlying film narratives? Wouldn’t you, in short, really rather work with the films themselves (whether as scripts or, in some ideal world, as full media objects)?

The new paper is an important step in that direction. It’s based on a corpus of 15,000+ eighteenth- and nineteenth-century novels (via the HathiTrust corpus), from which the authors have inferred arbitrary numbers of character types (what they call “personas”). For details of the (very elegant and generalizable) method, see the paper. Note in particular that they’ve modeled author identity as an explicit parameter and that it would be relatively easy to do the same thing with date of publication, author nationality, gender, narrative point of view, and so on.

The new paper finds that the author-effects model — as expected — performs especially well in discriminating character types within a single author’s works, though less well than the older method (which doesn’t control for author effects) in discriminating characters between authors. Neither method does especially well on the most difficult cases, differentiating similar character types in historically divergent texts.

Anyway, nifty work with a lot of promise for future development.

## Digital Americanists at ALA 2014

March 13, 2014

From the Digital Americanists site, which has full details:

Visualizing Non-Linearity: Faulkner and the Challenges of Narrative Mapping

Session 1-A. Thursday, May 22, 2014, 9:00 – 10:20 am

- Julie Napolin, The New School
- Worthy Martin, University of Virginia
- Johannes Burgers, Queensborough Community College

Digital Flânerie and Americans in Paris

Session 2-A. Thursday, May 22, 2014, 10:30 – 11:50 am

- “Mapping Movement, or, Walking with Hemingway,” Laura McGrath, Michigan State University
- “Parisian Remainder,” Steven Ambrose, Michigan State University
- “Sedentary City,” Anna Green, Michigan State University
- “Locating The Imaginary: Literary Mapping and Propositional Space,” Sarah Panuska, Michigan State University