Google News: The Secret Sauce

 

A closer look at Google’s patent for its news retrieval algorithm reveals a greater than expected emphasis on quality over quantity. Can this bias stay reliable over time?

Ten years after its launch, Google News’ raw numbers are staggering: 50,000 sources scanned, 72 editions in 30 languages. Google’s crippled communication machine, plagued by bureaucracy and paranoia, has never been able to come up with tangible facts about its benefits for the news media it feeds on. Its official blog merely mentions “6 billion visits per month” sent to news sites, and Google News claims to connect “1 billion unique users a week to news content” (to put things in perspective, NYT.com and the Huffington Post are cruising at about 40 million UVs per month). Assuming the clicks are sent to a relatively fresh news page bearing higher-value advertising, the six billion visits can translate into about $400 million per year in ad revenue. (This is based on $5 to $6 in revenue per 1,000 pages, i.e. a few dollars in CPM per single ad, depending on format, type of selling, etc.) That’s a very rough estimate. Again: Google should settle the matter and come up with accurate figures for its largest markets. (On the same subject, see a previous Monday Note: The press, Google, its algorithm, their scale.)
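For readers who want to check the arithmetic behind the $400 million figure, here is a minimal sketch. It assumes that every one of the six billion monthly visits produces exactly one monetized page view, which is a simplification of mine; the function name is purely illustrative.

```python
# Back-of-the-envelope check of the ad revenue estimate.
VISITS_PER_MONTH = 6_000_000_000       # "6 billion visits per month" (Google's figure)
REVENUE_PER_1000_PAGES = (5.0, 6.0)    # dollars, the range assumed in the column


def annual_ad_revenue(visits_per_month: int, revenue_per_thousand: float) -> float:
    """Yearly revenue, assuming one monetized page view per visit."""
    return visits_per_month * 12 * revenue_per_thousand / 1000


low, high = (annual_ad_revenue(VISITS_PER_MONTH, r) for r in REVENUE_PER_1000_PAGES)
print(f"${low / 1e6:.0f}M to ${high / 1e6:.0f}M per year")  # $360M to $432M per year
```

The midpoint of that range sits just under $400 million, hence the rounded number.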

But how exactly does Google News work? What kind of media does its algorithm favor most? Last week, the search giant updated its patent filing with a new document detailing the thirteen metrics it uses to retrieve and rank articles and sources for its news service. (Computerworld unearthed the filing.)

What follows is a summary of those metrics, listed in the order shown in the patent filing, along with a subjective appreciation of their reliability, vulnerability to cheating, relevancy, etc.

#1. Volume of production from a news source:

A first metric in determining the quality of a news source may include the number of articles produced by the news source during a given time period [week or month]. [This metric] may be determined by counting the number of non-duplicate articles produced by the news source over the time period [or] counting the number of original sentences produced by the news source.

This metric clearly favors production capacity. It benefits big media companies deploying large staffs. But the system can also be cheated by content farms (Google already addressed these questions); new automated content creation systems are gaining traction, and many of them could now easily pass the Turing Test.
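To make the counting concrete, here is a rough sketch of metric #1. The patent excerpt does not say how duplicates are detected, so the rule used here (two articles are duplicates when their text matches after lowercasing and collapsing whitespace) is an assumption of mine, as are the function name and the sample data.

```python
from datetime import date


def count_non_duplicates(articles, start, end):
    """Count distinct articles published between start and end (inclusive).

    `articles` is a list of (publication_date, text) pairs. Duplicate detection
    is a naive normalization, an assumption and not the patent's method.
    """
    seen = set()
    for published, text in articles:
        if start <= published <= end:
            seen.add(" ".join(text.lower().split()))
    return len(seen)


sample = [
    (date(2013, 2, 1), "Google updates patent"),
    (date(2013, 2, 2), "google  updates PATENT"),   # same story, reposted
    (date(2013, 2, 3), "Another story"),
    (date(2013, 3, 5), "Published outside the window"),
]
print(count_non_duplicates(sample, date(2013, 2, 1), date(2013, 2, 28)))  # 2
```

A real system would need fuzzier matching to catch lightly rewritten copies, which is precisely where content farms try to slip through.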

#2. Length of articles. Plain and simple: the longer the story (on average), the higher the source ranks. This is bad news for aggregators whose digital serfs cut, paste, compile and mangle abstracts of news stories that real media outlets produce at great expense.

#3. “The importance of coverage by the news source”. To put it another way, this matches the volume of coverage by the news source against the general volume of text generated by a topic. Again, it rewards large resource allocation to a given event. (In New York Times parlance, such effort is called “flooding the zone”.)

#4. The “Breaking News Score”:   

This metric may measure the ability of the news source to publish a story soon after an important event has occurred. This metric may average the “breaking score” of each non-duplicate article from the news source, where, for example, the breaking score is a number that is a high value if the article was published soon after the news event happened and a low value if the article was published after much time had elapsed since the news story broke.

Beware slow moving newsrooms: On this metric, you’ll be competing against more agile, maybe less scrupulous staffs that “publish first, verify later”. This requires a smart arbitrage by the news producers. Once the first headline has been pushed, they’ll have to decide what’s best: Immediately filing a follow-up or waiting a bit and moving a longer, more value-added story that will rank better in metrics #2 and #3? It depends on elements such as the size of the “cluster” (the number of stories pertaining to a given event).
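The filing only says that the breaking score is high for an article published soon after the event and low otherwise, and that the source’s score averages those values. One way to picture it, as a sketch only: the exponential decay and the two-hour half-life below are my own assumptions, not anything stated in the patent.

```python
from datetime import datetime, timedelta

HALF_LIFE_HOURS = 2.0  # hypothetical: the score halves every two hours of delay


def breaking_score(event_time: datetime, published_time: datetime) -> float:
    """1.0 for an article published at the moment of the event, decaying toward 0."""
    delay_hours = max((published_time - event_time).total_seconds() / 3600, 0.0)
    return 0.5 ** (delay_hours / HALF_LIFE_HOURS)


def source_breaking_score(articles) -> float:
    """Average the breaking score over a source's (event_time, published_time) pairs."""
    scores = [breaking_score(event, published) for event, published in articles]
    return sum(scores) / len(scores) if scores else 0.0


event = datetime(2013, 2, 24, 12, 0)
fast_and_slow = [(event, event), (event, event + timedelta(hours=2))]
print(source_breaking_score(fast_and_slow))  # 0.75
```

Because the score is an average, a newsroom that files late on every story drags its own number down, whatever the quality of the follow-up.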

#5. Usage Patterns:

Links going from the news search engine’s web page to individual articles may be monitored for usage (e.g., clicks). News sources that are selected often are detected and a value proportional to observed usage is assigned. Well known sites, such as CNN, tend to be preferred to less popular sites (…). The traffic measured may be normalized by the number of opportunities readers had of visiting the link to avoid biasing the measure due to the ranking preferences of the news search engine.

This metric is at the core of Google’s business: assessing the popularity of a website thanks to the various PageRank components, including the number of links that point to it.
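The normalization described in the filing amounts to dividing clicks by the number of times a link was actually shown, so that a source is not rewarded merely for being ranked first. A minimal sketch, with made-up source names and numbers:

```python
def normalized_usage(clicks: dict, impressions: dict) -> dict:
    """Clicks per opportunity for each source; sources never shown are skipped."""
    return {
        source: clicks.get(source, 0) / shown
        for source, shown in impressions.items()
        if shown > 0
    }


# Hypothetical figures: the big site gets more clicks, but was shown far more often.
clicks = {"big-site": 900, "small-site": 50}
impressions = {"big-site": 10_000, "small-site": 200}
print(normalized_usage(clicks, impressions))
# {'big-site': 0.09, 'small-site': 0.25}
```

In this invented example the smaller site comes out ahead once exposure is taken into account.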

#6. The “Human opinion of the news source”:

Users in general may be polled to identify the newspapers (or magazines) that the users enjoy reading (or have visited). Alternatively or in addition, users of the news search engine may be polled to determine the news web sites that the users enjoy visiting. 

Here, things get interesting. Google clearly states it will use third-party surveys to detect the public’s preference among various media, looking not only at their websites but also at their “historic” media assets. According to the patent filing, the evaluation could also include the number of Pulitzer Prizes the organization collected and the age of the publication. That’s for the known part. What lies behind the notion of “Human opinion” is a true “quality index” for news sources that is not necessarily correlated to their digital presence. Such factors clearly favor legacy media.

#7. Audience and traffic. Not surprisingly, Google relies on stats coming from Nielsen Netratings and the like.

#8. Staff size. The bigger a newsroom is (as detected in bylines) the higher the value will be. This metric has the merit of rewarding large investments in news gathering. But it might become more imprecise as “large” digital newsrooms tend now to be staffed with news repackagers bearing little added value.

#9. Number of news bureaus. It’s another way to favor large organizations, even though their footprint tends to shrink both nationally and abroad.

#10. Number of “original named entities”. That’s one of the most interesting metrics. A “named entity is the name of a person, place or organization”. It’s the primary tool for semantic analysis.

If a news source generates a news story that contains a named entity that other articles within the same cluster (hence on the same topic) do not contain, this may be an indication that the news source is capable of original reporting.

Of course, some cheaters insert misspelled entities to create “false” original entities and fool the system (Google took care of it). But this metric is a good way to reward original source-finding.
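In set terms, an “original named entity” is one that appears in a source’s story and in no other story of the same cluster. The sketch below assumes the entities have already been extracted and normalized (the hard part, which the excerpt does not detail); the source names and entities are invented.

```python
def original_entities(cluster: dict) -> dict:
    """For each source, return the entities no other source in the cluster mentions.

    `cluster` maps a source name to the set of named entities found in its story.
    """
    result = {}
    for source, entities in cluster.items():
        others = set().union(
            *(found for name, found in cluster.items() if name != source)
        )
        result[source] = entities - others
    return result


cluster = {
    "wire-copy": {"Google", "Computerworld"},
    "aggregator": {"Google", "Computerworld"},
    "reporter": {"Google", "Computerworld", "Jane Doe"},  # a source nobody else quoted
}
print(original_entities(cluster)["reporter"])  # {'Jane Doe'}
```

A misspelled name would also count as “original” under this naive comparison, which is exactly the loophole mentioned above.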

#11. The “breadth” of the news source. It pertains to the ability of a news organization to cover a wide range of topics.

#12. The global reach of the news sources. Again, it favors large media that are viewed, linked, quoted, “liked”, tweeted from abroad.

This metric may measure the number of countries from which the news site receives network traffic. In one implementation consistent with the principles of the invention, this metric may be measured by considering the countries from which known visitors to the news web site are coming (e.g., based at least in part on the Internet Protocol (IP) addresses of those users that click on the links from the search site to articles by the news source being measured). The corresponding IP addresses may be mapped to the originating countries based on a table of known IP block to country mappings.
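The mechanism described here (look up each clicking IP address in a table of IP blocks, then count distinct countries) can be sketched with Python’s standard `ipaddress` module. The table below is entirely made up: it uses reserved documentation address ranges and arbitrary country codes, and a real table would hold many thousands of blocks.

```python
import ipaddress

# Hypothetical "IP block to country" table (documentation ranges, invented countries).
IP_BLOCKS = [
    (ipaddress.ip_network("192.0.2.0/24"), "US"),
    (ipaddress.ip_network("198.51.100.0/24"), "FR"),
    (ipaddress.ip_network("203.0.113.0/24"), "JP"),
]


def country_of(ip: str):
    """Return the country code of the block containing `ip`, or None if unknown."""
    address = ipaddress.ip_address(ip)
    for network, country in IP_BLOCKS:
        if address in network:
            return country
    return None


def global_reach(click_ips) -> int:
    """Number of distinct known countries among the visitors who clicked."""
    return len({c for c in map(country_of, click_ips) if c is not None})


clicks = ["192.0.2.10", "192.0.2.99", "203.0.113.5", "10.0.0.1"]
print(global_reach(clicks))  # 2
```

Addresses that match no block are simply ignored, so the count reflects only visitors whose origin is known.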

#13. Writing style. In the Google world, this means statistical analysis of contents against a huge language model to assess “spelling correctness, grammar and reading levels”.

What conclusions can we draw? This enumeration clearly shows Google intends to favor legacy media (print or broadcast news) over pure players, aggregators or digital native organizations. All the features recently added, such as Editor’s pick, reinforce this bias. The reason might be that legacy media are less prone to tricking the algorithm. For once, a known technological weakness becomes an advantage.

frederic.filloux@mondaynote.com


10 Comments

  1. Walt French
    Posted February 24, 2013 at 10:57 pm | Permalink

    This recipe is so multi-faceted that the emphasis on any part of it could produce anything from an indigestible fast-food pizza-burrito to a really comforting, rich cassoulet.
    .
    Which, I suppose, is why Google published it. Allows us to project on it whatever we want to get.
    .
    Bon appétit!

  2. Fafnir
    Posted February 25, 2013 at 12:08 am | Permalink

    Those metrics are necessary but will not grasp the essential.

  3. Posted February 25, 2013 at 2:52 pm | Permalink

    Don’t forget that today Google = Google news — Just type “Argo oscar” in Google today and the first 3 results are (officially) news results, but all the rest are news items as well. This is where IMHO the largest part of the 6 billion clicks comes from. We are not really talking about Google news traffic, but traffic that goes from Google to news sites.

    So all these new criteria are actually there because the page rank doesn’t really work so well anymore in the real-time world of news.

  4. Posted February 25, 2013 at 10:55 pm | Permalink

    “This enumeration clearly shows Google intends to favor legacy media (print or broadcast news) over pure players, aggregators or digital native organizations.”

    Actually, the 2003 version of the patent did. The 2012 version is a continuation patent, and while it contains almost exactly the same description section, the important part of the patent is the claims section, and they’ve changed substantially. I hate to be the bearer of bad news for you, but this misinformation is being bounced from Computerworld to The Nation, to Forbes, to CBS Marketwatch, to the Guardian, and here as well.

    Any “bias” that might favor traditional media in Google’s News Ranking signals is gone, and that is clearly reflected in the claims section of the newest version of the patent, filed last year (and granted last December).

    See: http://www.seobythesea.com/2013/02/traditional-news-agency-is-dead/


