Opinion, Berkeley Blogs

Novel data: promise and perils

By Claude Fischer

“Big Data” and “Digital Humanities” are two of the hot terms – “with a bullet,” as they used to say on the pop music charts – in the academy these days. The terms label a variety of projects: preserving large archives by digitizing them, and crunching vast amounts of raw data to address topics in the humanities, such as visualizing the economic interconnections of ancient China, mapping the lines of influence among abstract artists, and finding out who authored the anonymous Federalist papers (although that was answered 50 years ago here).

An article in the summer issue of Social Science History by Marc Egnal is a nice example of both the kinds of discoveries that might be found and the kinds of pitfalls that might be encountered while tramping through the Big Data jungle. Egnal seeks to describe in numbers the thematic evolution of the American novel by drawing on Google’s “Ngram” program, a publicly available resource that tallies the words that have appeared in millions of books from before 1800 through 2008. We’ll see what a fertile terrain of findings it offers – and how easily one can get tripped up exploring them.


Google has scanned millions of books and identified just about every word in every one of those books. The user enters a word or phrase into the Ngram Viewer and it produces a graph. The graph shows how often that term appeared each year from before 1800 to after 2000, as a percentage of all the words in scanned books for that year. Jean-Baptiste Michel et al. introduced the technology and its possibilities in Science in 2010. Here is an example of mine: In American books published around 1900, “gentleman” appeared about once in every 10,000 words, which was 130 times more often than “guy” appeared. In American books published around 2000, “gentleman” appeared much less often, only once every 90,000 words and only about one-third as often as “guy.” Maybe guys have replaced gentlemen – at least in American books. Fun! And you can get much fancier than that (see here and here). I have dipped into Ngrams for this blog a couple of times (here, here) and for academic work, too.
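For readers who want to check the arithmetic behind that comparison, here is a minimal sketch in Python. It is not Google's code and does not query the Ngram Viewer. The helper name `per_word_rate` is made up for this illustration, the inputs are only the rounded figures quoted above, and “130 times more often” is read as “130 times as often.”

```python
from fractions import Fraction


def per_word_rate(one_in_every: int) -> Fraction:
    """Share of all words for a term that appears once in every N words."""
    return Fraction(1, one_in_every)


# Rounded figures quoted in the text (approximations, not raw Ngram data).
gentleman_1900 = per_word_rate(10_000)
gentleman_2000 = per_word_rate(90_000)

# Around 1900, "gentleman" appeared about 130 times as often as "guy".
guy_1900 = gentleman_1900 / 130

# Around 2000, "gentleman" appeared about one-third as often as "guy".
guy_2000 = gentleman_2000 * 3

# Change in each word's share of all words between the two dates.
gentleman_change = gentleman_2000 / gentleman_1900
guy_change = guy_2000 / guy_1900

print(f"'guy' around 1900: once in every {1 / guy_1900} words")
print(f"'guy' around 2000: once in every {1 / guy_2000} words")
print(f"'gentleman' share, 2000 relative to 1900: {float(gentleman_change):.2f}")
print(f"'guy' share, 2000 relative to 1900: {float(guy_change):.1f}")
```

On these rounded numbers, “gentleman” fell to about one-ninth of its earlier share while “guy” rose roughly 43-fold. Since the inputs are approximate, the outputs should be read as orders of magnitude, not precise estimates.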

In his article on the novel, Egnal uses Ngram to track words that he believes indicate central themes in four eras of American fiction. This accounting, he suggests, shows more concretely than literary critics’ analyses the topical shifts from one period to another. In the “Sentimental Era, 1789-1860,” for example, words like seduce and faithful appeared much more often, proportionally, than they did later on. Religious words, too, were relatively most common then – although it may be of interest that “God”’s Ngram low point was over 70 years ago.

(An aside: One of the authors of the Science article announcing Ngram is noted psychologist Steven Pinker, a “New Atheist” writer. That may explain the snarky penultimate sentence of the article: "'God' is not dead but needs a new publicist." Yet, between 1971, the year of Pinker’s college graduation, and 2008, “God” appeared, proportionally, 50% more often, “heaven” 170% more often, and “Jesus” 250% more often. God language seems to be making a comeback. Also: “atheism” peaked in 1810, dwindled away, got a brief spurt in the 1960s, and faded again. Maybe atheism needs a new publicist.)

Egnal sees a rejection of bourgeois values during the “Post-Modern Era, 1960-on,” in, for instance, the explosive increase in four-letter words (readers can check that out for themselves as a homework assignment) and the growing importance of women (“women” overtakes “men” around 1980; see graph below). Perhaps surprisingly, words having to do with children, such as nurturing and childhood, also rose, even as the birth rate was dropping in the 1960s and 1970s.

Ngram: Men v. Women

A key descriptor of “Post-Modern” writing is narcissism. Psychologist Jean Twenge and colleagues also use Ngram to argue that individualism and self-absorption have been on the rise since 1960 (see here and critiques here and here). Egnal shows that the words I and he were about equally frequent in American books until about the 1970s; then “I” rose while “he” fell, a finding perhaps suggesting an increase in self-involvement, at least by American authors. This finding provides a good opportunity to see how the procedural details matter and can trip us up.

Egnal looked at “I” and “he” but neglected “she” and “you.” Using additional tools available in Ngram, I found that around 1970 “I” appeared at the start of sentences – thus clearly as the subject of the action – about 70% as often as “He” or “She”; in 2000, that ratio was about the same. No change. But the ratio of “I” at the start of a sentence to “You” at the start of a sentence dropped from around 5.5:1 to 3.5:1 – maybe a sign that you-absorption rather than I-absorption has been on the rise. Then, if one starts playing with “us” and “we” and “they”… well, things get more complicated. Which brings us to the deeper issues.


Even as scholars continue exploring Big Data, we see some of the perils in premature conclusions. A few are apparent in this particular application of Ngrams to what its proponents call “culturomics.”

There are procedural issues. For instance, which books become the basis of inferring something about the culture — or about the writers, or about the readers? The mix of books that were published changed sharply over time. Early in American history, books were expensive to make, expensive to buy, and readership was limited. Didactic books – e.g., how to keep accounts; farmers’ almanacs; religious books – were especially common. (Even a lot of sensationalist crime material appeared surrounded by religious text warning the reader against following in the criminal’s footsteps.) Later, publishers churned out cheap romance and adventure novels. The mix of words in the books changed accordingly.

Similarly, we can ask which books have survived to be scanned. Egnal admits that many if not most of the earlier books in the Google collection are not even novels, and many cheap novels never made it into the archives. (I could not confirm whether the novel shown above, The Texan Scout, appears in the Google database.) Google goes to university libraries for the books of yesteryear. How many of those libraries have, for example, collected and kept books from earlier times with crude obscenities? So, did the American novel change, did American society change, or did the mix of books published and surviving change?

Second: Words change, in various ways. Over time, there are simply more different words and phrases. Michel et al. “estimated the number of words in the English lexicon as . . . 597,000 in 1950, and 1,022,000 in 2000.” Any particular word has to compete with a growing number of new ones in the denominator of the Ngram calculations. Also, new words crowd out the old. One study found that the apparent decline of Americans’ vocabulary skills reflects in part the fact that some of the words used in vocabulary tests have just become less common. Moreover, word meanings shift over time. The word spiritual used to be associated with the occult, as in spiritualism. The word text as a verb – as in, text me the address – is apparently so new that Ngram does not (yet) recognize it as a verb, instead counting all instances of “text,” even in 2008, as nouns.
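The denominator point can be made concrete with a toy example. The sketch below uses entirely hypothetical word counts (none of them come from Ngram) and a made-up helper, `share`, to show how a word's proportional frequency can fall even when writers use it exactly as often as before, simply because new words have joined the total.

```python
from collections import Counter


def share(counts: Counter, word: str) -> float:
    """A word's count as a fraction of all words counted."""
    total = sum(counts.values())
    return counts[word] / total


# Hypothetical tallies for an earlier year (illustrative numbers only).
earlier = Counter({"gentleman": 50, "letter": 150, "other": 800})

# A later year: every old word is used exactly as often,
# but newly coined words are added on top.
later = Counter(earlier)
later.update({"email": 100, "texting": 100})

print(f"absolute count, earlier vs. later: "
      f"{earlier['gentleman']} vs. {later['gentleman']}")
print(f"share of all words, earlier: {share(earlier, 'gentleman'):.4f}")
print(f"share of all words, later:   {share(later, 'gentleman'):.4f}")
```

In this made-up case the absolute count is unchanged, yet the share drops from 50 in 1,000 words to 50 in 1,200. A falling Ngram line, in other words, need not mean that a word was abandoned.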

At a yet deeper level, we have to think hard (as, for example, here) about the connection between the words writers use and American culture, whether it is the literary culture or the wider culture. Are book writers like weather vanes, showing in the words they use the direction society is moving, perhaps revealing the Zeitgeist or at least fluctuations in customer demand? Or, are writers actually agents of change, using words to move the culture? Or, are writers in a separate world of literature with its own shifting fads and fashions, little connected to the wider society?

The word "vampire" appeared 10 times as often, proportionally, in American books in 2008 as it did in 1950. What shall we make of that?

Cross-posted from Claude Fischer’s blog, Made in America: Notes on American life from American history.