The uses of Web-based lexical profiling for news and business decisions
.
Every morning, Stephane Levy pores over his "cron", an automatically generated email that monitors a vast network of sites affiliated with Weborama, an agency specializing in behavioral analysis on the Internet. Working together with Isabelle Cabrera, a linguist lifted from the French computer lab Inria, and Rodolphe Rodrigues, a PhD in physics, Weborama's lexical profiling manager, he has created a unique Web analysis tool. With real-time monitoring of 200,000 websites yielding 8 million unique visitors a day, "Le Lab" feeds an enormous database of usage patterns, behaviors and habits. The "Lab" performs a real-time analysis of page content and request structure. A true gold mine.
.
One morning last spring, Stephane Levy (himself a computer science engineer with a math degree) looked at the previous night's log analysis and noted a new entry in the lexicon at the core of the system. The word "eeepc" had been flagged by the system because of a spike in the Internet search "noise". The system decided to enter "eeepc" into the 350,000-word lexicon. The ultra-portable netbook wave had not yet reached the Gallic shores and, even in an Internet company such as Weborama, no one had heard of the "eeepc". With one Google search, everybody did. A few weeks later, the term became mainstream.
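To make the idea concrete, here is a minimal sketch of how a spike flag of this kind might work. It is not Weborama's code: the function name `flag_new_terms`, the thresholds `min_count` and `ratio`, and all the counts are invented for illustration. The sketch simply compares tonight's count for each unknown term against its recent baseline and flags the ones that jump.

```python
from collections import Counter


def flag_new_terms(lexicon, baseline_counts, tonight_counts,
                   min_count=50, ratio=10.0):
    """Return terms missing from the lexicon whose count spiked tonight.

    min_count and ratio are hypothetical thresholds, not Weborama's.
    """
    flagged = []
    for term, count in tonight_counts.items():
        if term in lexicon:
            continue  # already known, nothing to flag
        baseline = baseline_counts.get(term, 0)
        # Add 1 to the baseline so a never-seen term does not divide by zero.
        if count >= min_count and count / (baseline + 1) >= ratio:
            flagged.append(term)
    return sorted(flagged)


# Toy data, invented for the example.
lexicon = {"laptop", "ordinateur"}
baseline = Counter({"laptop": 900, "eeepc": 2, "zune": 40})
tonight = Counter({"laptop": 950, "eeepc": 400, "zune": 45})

new_terms = flag_new_terms(lexicon, baseline, tonight)
lexicon.update(new_terms)  # the flagged word enters the lexicon
print(new_terms)
```

A ratio test like this is only one of many possible choices; a real system would presumably smooth the baseline over several days and filter out typos and spam.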
.
The experience repeats itself in various ways. Suddenly, something pops up from the Internet background noise. Here, we're not looking at a ranking of the most searched terms; this is about occurrences and co-occurrences, associations of words that become meaningful -- they call them clusters. A cluster is far more precious than isolated terms. For instance, a cluster analysis shows that a growing number of people connect cosmetic products to health hazards and cancer.
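For readers who want to see what counting co-occurrences means in practice, here is a minimal sketch. It assumes each search request has already been split into words; the function name `cooccurrence_counts` and the sample queries are invented, and the real clustering done by "Le Lab" is certainly more elaborate than counting pairs.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count how often each pair of words appears in the same document."""
    pairs = Counter()
    for words in documents:
        # Sort the distinct words so each pair has a single canonical order.
        for a, b in combinations(sorted(set(words)), 2):
            pairs[(a, b)] += 1
    return pairs


# Toy queries, invented for the example.
queries = [
    ["cosmetics", "cancer", "risk"],
    ["cosmetics", "cancer"],
    ["cosmetics", "price"],
    ["cosmetics", "cancer", "paraben"],
]

pairs = cooccurrence_counts(queries)
for pair, count in pairs.most_common(3):
    print(pair, count)
```

Pairs that keep turning up together, here "cancer" and "cosmetics", are the raw material from which a cluster can be built.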
.
When Stephane Levy, along with his brother Alain (Weborama's founder and CEO), discussed lexical profiling, two things struck me:
-    the potential use as an advanced probe of "what will be next" in terms of newsworthiness,
-    and the ability to draw a quantitative distinction -- fact-based, no guesswork -- between the media noise (what you read, hear and watch on your favorite sources) and the "true" noise: what people are really talking about, on a much broader scale, outside of the media.
.
Before going further, let's set the record straight. As a journalist, I'm convinced that our biggest collective mistake is to have allowed our newsrooms to become dominated by a top-down culture: we decide what is and isn't newsworthy today for our readers, hoping they will agree. In the last few years, the Internet, and its offspring the blog culture, has corrected this error. Even the Wall Street Journal now offers an online community to its white-collar (and now very nervous) audience. Yes, it helps to have readers yelling at you, but it will be much more useful to know, in real time, or even before they know it, what is on the masses’ minds.
.
Hence "quant journalism". I wouldn't go as far as Chris Anderson, Wired magazine's chief editor, who told me last May: "If I had the opportunity to run this magazine the same [quant] way as a hedge fund, I would do it without hesitation" (see Monday Note #38). But Chris is right on one key point: we definitely should feed a measure of "quant" data into the day-to-day management of our news organizations. Don't get me wrong: I'm not advocating front-page content decisions based on what the "crowd" wants. I truly believe a news organization has a built-in duty of education. On occasion, the organization ought to provide information that will perform poorly in ratings, but is a component of democratic awareness. For example, an analysis of the fiscal policies of the two American presidential candidates won't be a hit; the same is true for an account of key decisions at the European Council. Still, these news items must be delivered to the public simply because they are an important component of the political process. Having said that, the duty of the editorial team is clear. It must "stay on top" of what is or will soon be on the public's mind. Or it, the team, will soon be out of the public's mind.
.
Let's return to the gap between "offline" (media-originated) noise and "online" noise. The massive-sample analytics techniques we just discussed are priceless. Take for instance the Beijing Olympics. European media were filled with human rights stories, repression in Tibet, etc. That's the "offline" noise. On the "online" seismograph: zip, nada (almost). People were solely (and massively) passionate about sports performances and medals. It doesn't necessarily mean news outlets should have lowered the pitch on political issues, but it is nonetheless actionable knowledge. You really want to help? Keep the human rights stories for a time when readers' minds are more open.
.
Just for the fun of it: Weborama's logs show that clusters of words associated with French president Sarkozy have nothing to do with politics or policies. Actually, the Internet crowd discusses his wife (the whispering singer and former model Carla Bruni) and other idiosyncrasies. For this article, I asked Stephane Levy to run a search on "Edvige", a highly controversial police database that will record sexual and religious proclivities, among other things. The press (left-leaning, especially) went ballistic. Verdict of "Le Lab": no online noise at all. The word had not even entered the lexicon. Does it mean that the media should keep quiet? You be the judge.
.
Now, let's zoom out, and look at applications for brands and business. Let's say you are Orange. As my co-writer Jean-Louis reported last week, you are facing irked iPhone users because you are "throttling" your 3G network. Question: should you go for an expensive communication campaign or just wait for the storm to fade as you slowly upgrade your cell network? That is a multimillion-euro question. Judging by the media noise, you had better act fast and summon your communication agency for a brief. Now, have a look at a cookie analysis of millions of visitors to 200,000 sites. Chances are you'll see the anger is fading fast -- faster than a classic media analysis leads you to think.
.
Same thing if you are in the entertainment business: think about the buzz generated by a TV series targeted at a 20-year-old audience. You can't expect a clear idea of its performance from traditional media run by middle-aged people.
.
Such tools are still in their infancy (Weborama's "Lab" is merely eight months old). Again -- sorry to repeat myself -- I don't think a news organization should be run solely by such methods. Unfortunately, it turns out that mainstream media are, by and large, out of touch with their audience (just look at newspaper circulation figures and the audience of the evening news if you have any doubt). More than ever, we need numbers, we need probes to understand fast-moving, fluid, widespread audiences. --FF
.
.
