A closer look at Google's patent for its news retrieval algorithm reveals a greater-than-expected emphasis on quality over quantity. Can this bias stay reliable over time?
Ten years after its launch, Google News' raw numbers are staggering: 50,000 sources scanned, 72 editions in 30 languages. Google's crippled communication machine, plagued by bureaucracy and paranoia, has never been able to come up with tangible facts about its benefits for the news media it feeds on. Its official blog merely mentions "6 billion visits per month" sent to news sites, and Google News claims to connect "1 billion unique users a week to news content" (to put things in perspective, NYT.com or the Huffington Post are cruising at about 40 million UVs per month). Assuming the clicks are sent to a relatively fresh news page bearing higher value advertising, the six billion visits can translate into about $400 million per year in ad revenue. (This is based on revenue of $5 to $6 per 1,000 pages, i.e. a few dollars in CPM per single ad, depending on format, type of selling, etc.) That's a very rough estimate. Again: Google should settle the matter and come up with accurate figures for its largest markets. (On the same subject, see a previous Monday Note: The press, Google, its algorithm, their scale.)
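For readers who want to check the arithmetic, here is a minimal sketch of the estimate. It uses only the figures quoted above, plus one assumption of mine: each visit is counted as a single page view (the `pages_per_visit` parameter is hypothetical and not a Google figure).

```python
# Back-of-the-envelope check of the ad revenue estimate.
VISITS_PER_MONTH = 6_000_000_000  # "6 billion visits per month", as quoted
MONTHS_PER_YEAR = 12


def yearly_revenue(revenue_per_thousand_pages, pages_per_visit=1.0):
    """Yearly ad revenue in dollars.

    pages_per_visit is an assumption: one page viewed per visit sent.
    """
    yearly_pages = VISITS_PER_MONTH * MONTHS_PER_YEAR * pages_per_visit
    return yearly_pages / 1000 * revenue_per_thousand_pages


low = yearly_revenue(5.0)   # $5 per 1,000 pages
high = yearly_revenue(6.0)  # $6 per 1,000 pages
print(f"${low / 1e6:,.0f}M to ${high / 1e6:,.0f}M per year")
```

The range brackets the $400 million mentioned above. Change `pages_per_visit` and the total moves in proportion, which is one more reason to treat the number as a rough order of magnitude.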
But how exactly does Google News work? What kind of media does its algorithm favor most? Last week, the search giant updated its patent filing with a new document detailing the thirteen metrics it uses to retrieve and rank articles and sources for its news service. (Computerworld unearthed the filing; it's here.)
What follows is a summary of those metrics, listed in the order shown in the patent filing, along with a subjective appreciation of their reliability, vulnerability to cheating, relevancy, etc.
#1. Volume of production from a news source:
A first metric in determining the quality of a news source may include the number of articles produced by the news source during a given time period [week or month]. [This metric] may be determined by counting the number of non-duplicate articles produced by the news source over the time period [or] counting the number of original sentences produced by the news source.
This metric clearly favors production capacity. It benefits big media companies deploying large staffs. But the system can also be cheated by content farms (Google already addressed these questions), and new automated content creation systems are gaining traction; many of them could now easily pass the Turing Test.
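To make the "non-duplicate" idea concrete, here is a minimal sketch of how such a count could work. The patent excerpt does not say how duplicates are detected, so the approach below (normalize whitespace and case, then compare hashes) is purely my assumption, and the function names are hypothetical.

```python
import hashlib
import re


def normalize(text):
    """Collapse whitespace and lowercase so trivial variations match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def count_non_duplicates(articles):
    """Count distinct article bodies published by one source in a period.

    Assumption: two articles are duplicates when their normalized text
    is identical. A real system would need fuzzier matching.
    """
    seen = set()
    for body in articles:
        digest = hashlib.sha256(normalize(body).encode("utf-8")).hexdigest()
        seen.add(digest)
    return len(seen)


week_of_articles = [
    "Markets fell sharply today.",
    "markets  fell sharply today. ",  # same story, reposted
    "Rain is expected over the weekend.",
]
print(count_non_duplicates(week_of_articles))  # prints 2
```

Exact-match hashing is trivial to defeat by changing a single word, which illustrates why a pure volume metric is attractive to content farms.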
#2. Length of articles. Plain and simple: the longer the story (on average), the higher the source ranks. This is bad news for aggregators whose digital serfs cut, paste, compile and mangle abstracts of news stories that real media outlets produce at great expense.
#3. "The importance of coverage by the news source". To put it another way, this matches the volume of coverage by the news source against the general volume of text generated by a topic. Again, it rewards large resource allocation to a given event. (In New York Times parlance, such effort is called "flooding the zone".)
#4. The "Breaking News Score":
This metric may measure the ability of the news source to publish a story soon after an important event has occurred. This metric may average the "breaking score" of each non-duplicate article from the news source, where, for example, the breaking score is a number that is a high value if the article was published soon after the news event happened and a low value if the article was published after much time had elapsed since the news story broke.
Beware slow moving newsrooms: On this metric, you'll be competing against more agile, maybe less scrupulous staffs that "publish first, verify later". This requires a smart arbitrage by the news producers. Once the first headline has been pushed, they'll have to decide what's best: Immediately filing a follow-up or waiting a bit and moving a longer, more value-added story that will rank better in metrics #2 and #3? It depends on elements such as the size of the "cluster" (the number of stories pertaining to a given event).
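The patent excerpt only says the score is high shortly after the event and low much later, then averaged over a source's non-duplicate articles. Here is a minimal sketch of one way to do that. The exponential decay and the two-hour half-life are my assumptions, not values from the filing.

```python
from datetime import datetime, timedelta


def breaking_score(event_time, published_time, half_life_hours=2.0):
    """Score in (0, 1]: 1.0 at the moment of the event, halving every
    half_life_hours. The decay shape and half-life are assumptions."""
    delay = (published_time - event_time).total_seconds() / 3600
    delay = max(delay, 0.0)  # guard against clock skew
    return 0.5 ** (delay / half_life_hours)


def source_breaking_score(pairs):
    """Average the breaking score over (event_time, published_time) pairs."""
    scores = [breaking_score(event, published) for event, published in pairs]
    return sum(scores) / len(scores) if scores else 0.0


event = datetime(2013, 2, 11, 9, 0)
pairs = [
    (event, event),                       # published immediately
    (event, event + timedelta(hours=2)),  # published two hours later
]
print(source_breaking_score(pairs))  # prints 0.75
```

With any decay of this kind, a newsroom that holds a story for a longer, better version pays a measurable price on this metric and must recover it on length and coverage.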
#5. Usage Patterns:
Links going from the news search engine's web page to individual articles may be monitored for usage (e.g., clicks). News sources that are selected often are detected and a value proportional to observed usage is assigned. Well known sites, such as CNN, tend to be preferred to less popular sites (...). The traffic measured may be normalized by the number of opportunities readers had of visiting the link to avoid biasing the measure due to the ranking preferences of the news search engine.
This metric is at the core of Google's business: assessing the popularity of a website thanks to the various PageRank components, including the number of links that point to it.
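The normalization described in the excerpt amounts to a click-through rate: clicks divided by the number of times the link was shown. A minimal sketch follows, with invented sample numbers and a hypothetical function name.

```python
def normalized_usage(clicks, impressions):
    """Clicks per opportunity to click.

    Dividing by impressions keeps a source from scoring well merely
    because the search engine already ranks it high and shows it often.
    """
    if impressions == 0:
        return 0.0
    return clicks / impressions


# Invented numbers, for illustration only.
well_known = normalized_usage(clicks=900, impressions=10_000)
smaller = normalized_usage(clicks=50, impressions=400)
print(well_known, smaller)  # prints 0.09 0.125
```

In this made-up case the smaller site wins on the normalized measure even though it collects far fewer raw clicks.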
#6. The "Human opinion of the news source":
Users in general may be polled to identify the newspapers (or magazines) that the users enjoy reading (or have visited). Alternatively or in addition, users of the news search engine may be polled to determine the news web sites that the users enjoy visiting.
Here, things get interesting. Google clearly states it will use third party surveys to detect the public's preference among various media -- not only their websites, but also their "historic" media assets. According to the patent filing, the evaluation could also include the number of Pulitzer Prizes the organization collected and the age of the publication. That's for the known part. What lies behind the notion of "Human opinion" is a true "quality index" for news sources that is not necessarily correlated to their digital presence. Such factors clearly favor legacy media.
#7. Audience and traffic. Not surprisingly, Google relies on stats coming from Nielsen Netratings and the like.
#8. Staff size. The bigger a newsroom is (as detected in bylines) the higher the value will be. This metric has the merit of rewarding large investments in news gathering. But it might become more imprecise as "large" digital newsrooms tend now to be staffed with news repackagers bearing little added value.
#9. Number of news bureaus. It's another way to favor large organizations -- even though their footprint tends to shrink both nationally and abroad.
#10. Number of "original named entities". That's one of the most interesting metrics. A "named entity is the name of a person, place or organization". It's the primary tool for semantic analysis.
If a news source generates a news story that contains a named entity that other articles within the same cluster (hence on the same topic) do not contain, this may be an indication that the news source is capable of original reporting.
Of course, some cheaters insert misspelled entities to create "false" original entities and fool the system (Google took care of it). But this metric is a good way to reward original source-finding.
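Here is a minimal sketch of the idea: within a cluster, keep the entities that appear in one source's story and nowhere else. It assumes the entities have already been extracted (a real system would run some form of named entity recognition first); the source names and entities are invented.

```python
from collections import Counter


def original_entities(cluster):
    """Map each source to the named entities only it mentions.

    cluster: dict of source name -> set of entities found in that
    source's article on the topic (extraction is assumed done).
    """
    counts = Counter(entity for entities in cluster.values() for entity in entities)
    return {
        source: {entity for entity in entities if counts[entity] == 1}
        for source, entities in cluster.items()
    }


# Invented cluster, for illustration only.
cluster = {
    "Daily A": {"Acme Corp", "Jane Doe", "Springfield"},
    "Daily B": {"Acme Corp", "Springfield"},
    "Wire C": {"Acme Corp", "John Roe"},
}
result = original_entities(cluster)
print(sorted(result["Daily A"]))  # prints ['Jane Doe']
```

Note that this naive version would credit a misspelled "Acme Crop" as original, which is precisely the trick mentioned above; the sketch makes no attempt to guard against it.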
#11. The "breadth" of the news source. It pertains to the ability of a news organization to cover a wide range of topics.
#12. The global reach of the news sources. Again, it favors large media that are viewed, linked, quoted, "liked", tweeted from abroad.
This metric may measure the number of countries from which the news site receives network traffic. In one implementation consistent with the principles of the invention, this metric may be measured by considering the countries from which known visitors to the news web site are coming (e.g., based at least in part on the Internet Protocol (IP) addresses of those users that click on the links from the search site to articles by the news source being measured). The corresponding IP addresses may be mapped to the originating countries based on a table of known IP block to country mappings.
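The lookup described in the excerpt can be sketched with Python's standard `ipaddress` module. The table below is a placeholder: it uses the address ranges reserved for documentation, and the country codes attached to them are invented. A real table of IP blocks would be far larger.

```python
import ipaddress

# Placeholder table: documentation-only ranges with invented countries.
IP_BLOCKS = [
    (ipaddress.ip_network("192.0.2.0/24"), "US"),
    (ipaddress.ip_network("198.51.100.0/24"), "FR"),
    (ipaddress.ip_network("203.0.113.0/24"), "JP"),
]


def country_for(ip):
    """Return the country code for an IP address, or None if unknown."""
    address = ipaddress.ip_address(ip)
    for network, country in IP_BLOCKS:
        if address in network:
            return country
    return None


def countries_reached(click_ips):
    """Set of countries seen among the IPs that clicked through."""
    countries = (country_for(ip) for ip in click_ips)
    return {country for country in countries if country is not None}


clicks = ["192.0.2.10", "192.0.2.99", "203.0.113.5", "10.0.0.1"]
print(len(countries_reached(clicks)))  # prints 2
```

The metric is simply the size of that set. The linear scan is fine for three blocks; a production table would call for a sorted structure or a prefix tree.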
#13. Writing style. In the Google world, this means statistical analysis of contents against a huge language model to assess "spelling correctness, grammar and reading levels".
What conclusions can we draw? This enumeration clearly shows Google intends to favor legacy media (print or broadcast news) over pure players, aggregators or digital native organizations. All the features recently added, such as Editor's pick, reinforce this bias. The reason might be that legacy media are less prone to tricking the algorithm. For once, a known technological weakness becomes an advantage.