Opinion, Berkeley Blogs

Word counts and what counts

By Claude Fischer


A post back in June on “digital humanities” discussed the promises and perils of turning to “Big Data” to answer questions about American history. I focused there on a study that looked specifically at the history of American literature. A paper in Psychological Science this August uses the same tool – Google’s Ngram function, which counts a word’s occurrences in the company’s sample of over 1 million books published in the U.S. and calculates the percentage of all words it represents – to make broad claims about historical changes in American character.

(Source: LC)


Patricia Greenfield, an eminent UCLA psychologist who has conducted terrific research on cognitive development, changes in cognitive skills, and cultural differences in thinking, much of it based on her work in rural Mexico (mentioned in this 2012 post), uses Ngram to argue that there was a major shift in America over two centuries from a communal to a self-centered culture. Ngram word counts in American books from 1800 to 2000 show, she claims, that Americans changed from being group-oriented and sharing to being individualistic and self-absorbed. Maybe. But there are a lot of issues to consider before accepting the claim. These concerns show, once again, the pitfalls in using such statistical methods ahistorically.

Missing texts

Greenfield borrows from 19th-century European gemeinschaft-gesellschaft (community-society) theory for her argument. One problem, just to mention in passing, is that she skips over about 50 years of important sociological, anthropological, and historical research that raised serious doubts about the loss-of-community argument. That scholarship indicated that it is neither an accurate description of the rural premodern and urban modern worlds nor a useful way of thinking about social change. Most experts are quite skeptical. But that is a topic for another time (see Ch. 1 of Made in America).

Word choice

Greenfield first points out that the population of the United States shifted from overwhelmingly rural to overwhelmingly urban between 1800 and 2000, a shift that, in her account, drove the cultural changes she expects to see. She then makes what seems to be a plausible choice of words as measures of communal and of individualistic culture. Using Ngram, she reports that there was a decline in American books in the relative frequency of words like “obliged,” “give,” and “authority” and an increase in the relative frequency of words like “individual,” “choose,” and “get” – consistent with the claim of a shift from group- to self-interest.


(Source: LC)

Inevitably, the particular word choice matters. (Anybody can do this research at Ngram. Join in.) Let’s take the decline over the centuries in the percentage of times American books use the word “give,” a sign for Greenfield of a less giving, less communal society. But: Give what? If you use Ngram to count the phrase “give a blow,” you’ll see that it has indeed declined between 1800 and 2000, but the phrases “give a gift” and “give charity” have actually risen. The word “provide,” which is a synonym for “give,” has risen in relative frequency about as much as “give” has fallen; if you add the two words together in Ngram, the combined line is flat – no net change.
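
For readers who want to see the arithmetic behind such comparisons, here is a minimal Python sketch of the two quantities discussed above: a word's share of all words in a given year, and the pooled share of two near-synonyms. The function names and the toy counts are invented for illustration; they are not Google's data or its code, and the real Ngram tool may compute things differently.

```python
from typing import Dict, Iterable


def relative_frequency(word_count: int, total_words: int) -> float:
    """Return a word's share of all words in a year's books, in percent."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 100.0 * word_count / total_words


def combined_frequency(counts: Dict[str, int], words: Iterable[str],
                       total_words: int) -> float:
    """Pool the counts of several words, then take their joint share."""
    pooled = sum(counts.get(word, 0) for word in words)
    return relative_frequency(pooled, total_words)


# Invented toy numbers, chosen only to show how one word can fall
# while the pooled line stays flat. They are NOT real Ngram counts.
TOY_CORPUS = {
    1800: {"total": 1_000_000, "counts": {"give": 900, "provide": 100}},
    2000: {"total": 2_000_000, "counts": {"give": 1_200, "provide": 800}},
}

for year, data in TOY_CORPUS.items():
    give = relative_frequency(data["counts"]["give"], data["total"])
    both = combined_frequency(data["counts"], ["give", "provide"], data["total"])
    print(f"{year}: give={give:.3f}%  give+provide={both:.3f}%")
```

In the toy numbers, the share of “give” falls between the two years while the pooled share does not move, which is the pattern described above for the real “give” and “provide” lines.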

Words have histories; some old ones fade, new ones emerge, word culture changes. A simple example: the frequency of “thou” plus “thee” drops by 97% between 1800 and 2000, but “you” appears 34% more often, so that, in the end, there is no difference between 1800 and 2000 in the relative frequency of the three words added together. Moreover, often word uses and word meanings change in psychological valence. For example, “awful” went from a positive to a negative meaning (here).
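
The “thou”/“thee”/“you” figures can be checked with a little algebra. If one group of words falls by 97%, another rises by 34%, and their sum is unchanged, then the loss of the first must equal the gain of the second, which pins down how common the first group was relative to the second at the start. The short Python sketch below works that out; the function name is mine, and because the percentages above are rounded, the result is only approximate.

```python
def implied_start_ratio(drop: float, rise: float) -> float:
    """Starting ratio A/B implied when A falls by `drop`, B rises by `rise`
    (both as fractions), and A + B ends where it began.

    Flat total means drop * A == rise * B, so A / B == rise / drop.
    """
    if not 0 < drop <= 1:
        raise ValueError("drop must be a fraction in (0, 1]")
    if rise < 0:
        raise ValueError("rise must be non-negative")
    return rise / drop


# Figures from the text: "thou" + "thee" down 97%, "you" up 34%.
ratio = implied_start_ratio(drop=0.97, rise=0.34)

# Sanity check: scale "you" to 1.0 in 1800 and confirm the total is flat.
you_1800 = 1.0
thou_thee_1800 = ratio * you_1800
total_1800 = thou_thee_1800 + you_1800
total_2000 = thou_thee_1800 * (1 - 0.97) + you_1800 * (1 + 0.34)

print(f"thou+thee per 'you' in 1800: {ratio:.2f}")
print(f"totals match: {abs(total_1800 - total_2000) < 1e-9}")
```

In other words, the reported numbers imply that “thou” and “thee” together appeared roughly a third as often as “you” in 1800, which is why a 97% collapse in the old forms can be fully offset by a 34% rise in the new one.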

More to the topic at hand, the relative frequency of “donate,” a new-ish word (O.E.D.), has gone up among Google’s books, while the more archaic “bequeath” has gone down. Use of “altruism/tic” takes off only in the 20th century, but then again it was only invented in the 1800s, by a sociologist. Without a real system for determining measures that measure the same thing over generations, word counting can become cherry-picking (a 20th-century phrase).

How do words matter? Or books matter?

Even if the right words were systematically determined, Greenfield’s analysis is missing an even bigger piece: an explanation of why and how words in published books tell us anything about the general culture. Is it because writers pick up signals from the culture and then re-broadcast those signals in their word choices? Scholars of book publishing (and publishers, too, I suspect) would scoff.

As an example of how implausible this assumption is, consider that using either “city” or “cities” became 40% proportionally less common in American books between 1800 and 2000, while using any of “farm/s,” “rural,” and “countryside” became 67% more common, exactly the opposite of what Greenfield might expect given the shift in population.

A better assumption would be that the book-buying public in different historical eras is interested in different themes; it financially rewards authors and publishers who use the right themes of the day and the words that fit those themes. (I pointed out in the earlier blog post that “vampire” has taken off recently.) We have to consider who is buying books when and what they care about.

Historians know that the audiences and the marketing of books changed greatly even in just the 19th century, even before competition from mass media like magazines, the penny press, film, and television. A couple of centuries ago, for example, books were, more often than now, didactic texts, instruction manuals, almanacs, and the like, and many were pirated editions of British books. The Google Ngram program helps illustrate the point.

(Source: LC)


You can call up the actual books for, say, 1800-1815 in which Ngram found “give.” The first five that popped up when I did this include two editions of a British minister’s essay on why the British Bible Society should give out prayer books along with bibles; two editions of a book by the Lord High Chancellor of England titled “Religion and Policy and the Countenance and Assistance Each Should Give the Other”; and an English handbook for architects and draftsmen. In contrast, the first five books using “give” that appear in Google’s list for 1997-2000 – all originally published here – are: “Don’t Give It Away,” an inspirational book for young women; a book with the subtitle, “Practicing the Authority Jesus Gave Us”; a child’s picture book; a book on the “Christian Patriotism” of Patrick (Give me Liberty…) Henry; and a collection of inspirational sayings. We can see how radically different the audiences and producers are. Are the word counts of these books comparable indicators?

There certainly remains much of value in the use of Ngram – if disciplined. For example, the word “slavery” roughly triples in its rate of appearance between 1830s books and early-1860s books; it then plummets by 1870 – seemingly mirroring Americans’ obsession with the issue.

In the end

But what about Greenfield’s basic argument, the hypothesized shift from group- to self-orientation? As I said before, that’s a big topic. Here is a small part of the discussion: Greenfield assumes that these traits of group- and self-orientation come in neat packages that are polar opposites. Is that necessarily so? Modern Americans might well be more consciously self-aware, more engaged in self-improvement, more into choosing than Americans of earlier eras, but at the same time also more concerned about others. Back to Ngram: Put in “oneself” and “other people.” Both terms increased substantially from the early 19th century to the late 20th century. Make of it what you will.

Cross-posted from Claude Fischer’s blog, Made in America: Notes on American life from American history.