Computer Uses Ordinary Journalism to Forecast the Future

At the end of War and Peace Tolstoy compares belief in free will to medieval cosmologies where the Sun revolved around the Earth. To know the true cosmos, he writes, people had to give up their certainty that the ground beneath them was still, “and to recognize a motion we do not feel.” So to understand history, people have to recognize their individuality counts for nothing—that the real causes of our actions are to be found in vast social patterns of which we are a tiny and unwitting part. That’s a difficult viewpoint to hold onto for a species that focusses so much brainpower on individual personalities and relationships. But maybe computers can help. That’s certainly the impression I took from this mind-blowing paper in the online journal First Monday. It describes a technique for finding hidden social motion by analyzing the texts of news stories over time. This “window into national consciousness,” it claims, predicted upheavals in Tunisia and Egypt and approximated the location of Osama bin Laden within 200 kilometers.

Author Kalev H. Leetaru’s approach uses huge databases of news stories from around the world (translated into English on a daily basis for American and British intelligence purposes) and performs on them the sort of analysis I would have thought required a human. The system, he writes, engages in “tone mining,” to take the measure of national mood; geocoding, to deduce the location of new subjects from the location of stories about them; and network analysis, to show who is reading (or viewing or listening) about whom.

Leetaru says this work is an extension of the concept of “culturomics” (which I have written about before). Culturomics, as launched in a paper last October, is retrospective, measuring changes in how often published words are used over decades or centuries. Leetaru’s “Culturomics 2.0,” by contrast, works on the near-real-time evidence of the news cycle, and it assigns meaning to the frequency changes. For instance, it tallies negative words (“terrible,” “awful”) and positive ones (“good,” “nice”) to score news stories’ sentiments.
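To make the word-tallying idea concrete, here is a minimal sketch in Python. It is not Leetaru’s code: the two tiny word lists reuse the four example words above and stand in for the much larger tone dictionaries a real system would need, and the function name `tone_score` and its normalization by article length are my own assumptions.

```python
import re

# Hypothetical toy lexicons built from the four example words in the text.
# A real tone dictionary would hold thousands of entries.
POSITIVE = {"good", "nice"}
NEGATIVE = {"terrible", "awful"}


def tone_score(text: str) -> float:
    """Return (positive count - negative count) / total words.

    The result ranges from -1.0 (every word negative) to 1.0 (every word
    positive). Empty text scores 0.0.
    """
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    return (positive - negative) / len(words)
```

Dividing by the word count keeps long articles from dominating the score simply because they contain more words of every kind.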

Human analysts have been doing this kind of thing for governments for decades (among the many things I learned from Leetaru’s paper is that more than 80 percent of the “actionable intelligence” that the Cold War West got about the Soviet Union came out of this kind of work done on newspaper articles, conference proceedings, news broadcasts, technical reports and similar non-secret sources). That computer algorithms can do this sort of work (and are being used by corporations to monitor their brands) is interesting, but the big news in the paper is this: Leetaru says a computer’s score of the emotional tone of journalism and other open sources in a nation can predict when conflict is likeliest to occur there.

For example, his system analyzed the 52,438 articles in the British Summary of World Broadcasts, in any language, from January 1979 until March 2011 that mentioned an Egyptian city (in other words, the set included both Egyptian sources and foreigners’ views of the country). The computer’s score for the aggregate emotional tone of the articles showed a plunge toward negativity in January 2011. The drop was equalled only by January 1991 (the beginning of the first Iraq War) and nearly equalled in March 2003 (the start of the U.S. invasion of Iraq). An analysis of Egypt-only and Arabic-only sources from the same database showed the same pattern, but with a less extreme swing downward, which Leetaru attributes to censorship.
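One simple way to spot a plunge like that in a series of monthly tone scores is to flag any month that falls far below the long-run average. The sketch below does this with a plain z-score. The function name `flag_low_tone` and the default cutoff of two standard deviations are my own assumptions for illustration, not something the paper specifies.

```python
from statistics import mean, stdev


def flag_low_tone(monthly_tone: dict[str, float], threshold: float = 2.0) -> list[str]:
    """Return the months whose tone sits more than `threshold` sample
    standard deviations below the mean of the whole series.

    `monthly_tone` maps a month label to that month's average tone score.
    """
    values = list(monthly_tone.values())
    if len(values) < 2:
        return []  # a standard deviation needs at least two points
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # a perfectly flat series has no outliers
    return [
        month
        for month, value in monthly_tone.items()
        if (value - mu) / sigma < -threshold
    ]
```

A check like this only says that a month is unusual against its own history. As Leetaru notes below, it cannot say what the unusual month will lead to.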

“While such a surge in negativity about Egypt would not have automatically indicated that the government would be overthrown,” Leetaru writes, “it would at the very least have suggested to policy–makers and intelligence analysts that there was increased potential for unrest.” An additional indicator, he adds, is that the 13,061 stories in the database that mentioned Hosni Mubarak showed the most negative tone in three decades, in the weeks before the Egyptian revolution began.

Interestingly, despite the Internet’s rep for unequalled reaction-time, a cross-check with a database of web-only news showed that the tone there followed the mainstream non-American journalistic outlets by about a month. In turn, articles in The New York Times lagged behind the web sources.

More surprising, to me anyway, was Leetaru’s attempt to see if geocoding of news sources could be used to find a prominent person. To do this, he crunched all the articles in the Summary of World Broadcasts that mentioned “bin Laden” between January 1979 and April 2011, coding every geographic reference. Northern Pakistan is the most frequently mentioned geographic area in the articles, the analysis found. And two cities there, Islamabad and Peshawar, were among the five most-mentioned non-Western cities in the texts. Hence, Leetaru writes, “global news content would have suggested Northern Pakistan in a 200 kilometer radius around Islamabad and Peshawar” as the place to hunt for bin Laden.
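The counting step in that analysis can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the three-name `PLACES` set is a hypothetical stand-in for a full gazetteer, and real geocoding also has to resolve ambiguous names and convert places to coordinates, which this sketch skips.

```python
import re
from collections import Counter

# Hypothetical mini-gazetteer. A real one would list many thousands of places.
PLACES = {"islamabad", "peshawar", "kabul"}


def count_place_mentions(articles: list[str]) -> Counter:
    """Count whole-word mentions of each known place across all articles."""
    counts: Counter = Counter()
    for text in articles:
        lowered = text.lower()
        for place in PLACES:
            pattern = r"\b" + re.escape(place) + r"\b"
            hits = len(re.findall(pattern, lowered))
            if hits:
                counts[place] += hits
    return counts
```

Calling `count_place_mentions(articles).most_common(5)` would then give the five most-mentioned places, which is the kind of ranking the paragraph above describes.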

Well, not too many points for being right—this analysis, like the one on Egypt, was done retrospectively to test the system. I hope if similar indicators crop up in the future, Leetaru will be willing to make some forecasts, just to see if the project works in real-time conditions. For the moment, though, there’s no denying that it’s a fascinating set of results.

Every time I look at this Tolstoyan approach to human behavior, I’m struck by its eeriness. It is hard to wrap my mind around the notion that the real causes and effects of our actions are hiding in plain sight all around us, traceable in the ups and downs of the stock market, or the rise and fall of hemlines. It is especially hard to envision what the chain of causes could be that links adjectives chosen by journalists with some individual’s decision to set himself on fire. It all has an air of haruspicy, somehow.

Still, if ever humanity can find a way to describe society’s motions that we do not feel (which, of course, will have to also include a description of the effects of the description), politics will never be the same.