In their fight against Google, traditional media firmly believe the search engine needs them to refine (and monetize) its algorithm. Let's explore the facts.
The European press has got itself into a bitter battle against Google. In a nutshell, legacy media want money from the search engine: first, for the snippets of news it grabs and feeds into its Google News service; second, on a broader basis, for all the referencing Google builds with news media material. In Germany, the Bundestag is working on a bill to force all news aggregators to pay their toll; in France, the executive is pushing for a negotiated solution before year-end. Italy is more or less following the same path. (For a detailed and balanced background, see this Eric Pfanner story in the International Herald Tribune.)
In the controversy, one argument keeps rearing its head. According to the proponents of a "Google Tax", media content greatly improves the contextualization of advertising. Therefore, the search engine giant ought to pay for such value. Financially speaking, without media articles Google would not perform as well as it does, hence the European media's hunt for a piece of the pie.
Last week, digging for facts, I spoke with several people possessing deep knowledge of Google's inner mechanics; they ranged from Search Engine Marketing specialists to a Stanford Computer Science professor who taught Larry Page and Sergey Brin back in the mid-'90s.
First of all, pretending to know Google is indeed... pretentious. In order to outwit both competitors and manipulators (a.k.a. Search Engine Optimization gurus), the search engine keeps tweaking its secret sauce. For the August-September period alone, Google made no fewer than 65 alterations to its algorithm (list here). And that's only the known part of the changes; in fact, Google allocates large resources to countering people who try to game its algorithm with an endless stream of tricks.
Maintaining such a moving target also preserves Google's lead: along with its distributed computing capabilities (called MapReduce), its proprietary data storage system BigTable and its immense infrastructure, Google's PageRank algorithm is at the core of the search engine's competitive edge. Allowing anyone to catch up, even a little, is strategically inconceivable.
Coming back to the press issue, let's consider both quantitative and qualitative approaches. In the Google universe -- currently about 40 billion indexed pages -- content coming from media amounts to a small fraction. It is said to be a low single-digit percentage. To put things in perspective, on average, an online newspaper adds between 20,000 and 100,000 new URLs per year. Collectively, the scale roughly looks like millions of news articles versus a web growing by billions of pages each year.
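For readers who want to redo the back-of-the-envelope math, here is a minimal Python sketch. The 40 billion indexed pages and the 20,000-100,000 URLs per newspaper per year come from the figures above; the number of outlets (ASSUMED_OUTLETS) is purely a hypothetical placeholder of mine, not a sourced number, so treat the output as an order-of-magnitude illustration only.

```python
# Figures quoted in the article.
INDEXED_PAGES = 40_000_000_000                # about 40 billion indexed pages
URLS_PER_OUTLET_PER_YEAR = (20_000, 100_000)  # low and high yearly output per newspaper

# ASSUMPTION: a hypothetical count of online newspapers, for illustration only.
ASSUMED_OUTLETS = 1_000


def yearly_news_urls(outlets, per_outlet_range=URLS_PER_OUTLET_PER_YEAR):
    """Return the (low, high) number of new news URLs per year for `outlets` newspapers."""
    low, high = per_outlet_range
    return outlets * low, outlets * high


def share_of_index(urls, indexed_pages=INDEXED_PAGES):
    """Return `urls` as a fraction of the total number of indexed pages."""
    return urls / indexed_pages


low, high = yearly_news_urls(ASSUMED_OUTLETS)
print(f"New news URLs per year: {low:,} to {high:,}")
print(f"Share of the index: {share_of_index(low):.3%} to {share_of_index(high):.3%}")
```

Under that assumed outlet count, one year of output lands in the tens of millions of URLs, well under one percent of the index; changing ASSUMED_OUTLETS lets you test how sensitive the conclusion is to that guess.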
Now, let's consider the nature of searches. Using Google Trends for the last three months, the charts below rank the most searched terms in the United States, France and Germany:
Do the test yourself by going to the actual page: you'll notice that, except for large dominant American news topics ("Hurricane Sandy" or "presidential debate"), very few search results bring back content coming from mainstream media. As Google rewards fresh content -- as well as sharp SEO tactics -- "web native" media and specialized web sites perform much better than their elder "migrants", that is, web versions of traditional media.
What about monetization? How does media content contribute to Google's bottom line? Again, let's look at the independent rankings of the most expensive keywords, those that can bring $50 per click to Google -- through its opaque pay-per-click bidding system. For instance, here is a recent Wordstream ranking (example keywords in parentheses):
Insurance ("buy car insurance online" and "auto insurance price quotes")
Loans ("consolidate graduate student loans" and "cheapest homeowner loans")
Mortgage ("refinanced second mortgages" and "remortgage with bad credit")
Attorney ("personal injury attorney" and "dui defense attorney")
Credit ("home equity line of credit" and "bad credit home buyer")
Lawyer ("personal injury lawyer", "criminal defense lawyer")
Donate ("car donation centers", "donating a used car")
Degree ("criminal justice degrees online", "psychology bachelors degree online")
Hosting ("hosting ms exchange", "managed web hosting solution")
Claim ("personal injury claim", "accident claims no win no fee")
Conference Call ("best conference call service", "conference calls toll free")
Trading ("cheap online trading", "stock trades online")
Software ("crm software programs", "help desk software cheap")
Recovery ("raid server data recovery", "hard drive recovery laptop")
Transfer ("zero apr balance transfer", "credit card balance transfer zero interest")
Gas/Electricity ("business electricity price comparison", "switch gas and electricity suppliers")
Classes ("criminal justice online classes", "online classes business administration")
Rehab ("alcohol rehab centers", "crack rehab centers")
Treatment ("mesothelioma treatment options", "drug treatment centers")
Cord Blood ("cordblood bank", "store umbilical cord blood")
(In my research, several Search Engine Marketing specialists came up with similar results.)
You see where I'm heading. By construction, traditional media do not bring money to the classification above. In addition, as an insider said to me this week, no one is putting ads against keywords such as "war in Syria" or against the 3.2 billion results of a "Hurricane Sandy" query. Indeed, on the curve of ad words value, news slides to the long tail.
Then, why is Google so interested in news content? Why has it been maintaining Google News for the past ten years, in so many languages, without making a dime from it (there are no ads on the service)?
The answer pertains to the notion of Google's general internet "footprint". Being number one in search is fine, but not sufficient. In its goal to own the semantic universe, taking over "territories" is critical. In that context, a "territory" could be a semantic environment that is seen as critical to everyone's daily life, or one with high monetization potential.
Here are two recent examples of monetization potential as viewed by Google: Flights and Insurance. Having (easily) determined that flight schedules were among the most sought-after information on the web, Google dipped into its deep cash reserve and, for $700m, acquired ITA Software in July 2010. ITA was the world's largest airline search company, powering sites such as Expedia or TripAdvisor. Unsurprisingly, the search giant launched Google Flight Search in September 2011.
In passing, Google showed its ability to kill any price comparator of its choosing. As for Insurance, the most expensive keyword, Google recently launched its own insurance comparison service in the United Kingdom... just after launching a similar system for credit cards and bank services.
Over the last ten years, Google has become the search tool of choice for patents and, with Google Scholar, for scientific papers. This came after shopping, books, Hotel Finder, etc.
Aside from this strategy of making Google the main -- if not the only -- entry point to the web, the search engine is working hard on its next transition: going from a search engine to a knowledge engine.
Early this year, Google created Knowledge Graph, a system that connects search terms to what are known as entities (names, places, events, things) -- millions of them. This is Google's next quantum leap. Again, you might think news-related corpora could constitute the most abundant trove of information to be fed into the Knowledge Graph. Unfortunately, this is not the case. At the core of the Knowledge Graph resides Metaweb, acquired by Google in July 2010. One of its key assets was a database of 12 million entities (now 23m) called Freebase. This database is fed by sources (listed here), ranging from the International Federation of Association Football (FIFA) to the Library of Congress, Eurostat or the India Times. (The only French source on the list is the movie database AlloCine.)
Out of about 230 sources, there are fewer than 10 media outlets. Why? Again, volume and, perhaps even more important, the ability to properly structure data. While the New York Times has about 14,000 topics, most newspapers only have hundreds of those, and a similar number of named entities in their database. (As a comparison, web-native media are much more skilled at indexing: the Huffington Post assigns between 12 and 20 keywords to each story.) By building upon acquisitions such as Metaweb's Freebase, Google now has about half a billion entries of all kinds.
Legacy media must deal with a harsh reality: despite their role in promoting and defending democracy, in lifting the veil on things that mean much for society, or in propagating new ideas, when it comes to data, news media compete in the junior leagues. And for Google, the most data-driven company in the world, having newspaper articles in its search system is no more than small cool stuff.