

Google News: The Secret Sauce

 

A closer look at Google’s patent for its news retrieval algorithm reveals a greater than expected emphasis on quality over quantity. Can this bias stay reliable over time?

Ten years after its launch, Google News’ raw numbers are staggering: 50,000 sources scanned, 72 editions in 30 languages. Google’s crippled communication machine, plagued by bureaucracy and paranoia, has never been able to come up with tangible facts about its benefits for the news media it feeds on. Its official blog merely mentions “6 billion visits per month” sent to news sites, and Google News claims to connect “1 billion unique users a week to news content” (to put things in perspective, NYT.com or the Huffington Post are cruising at about 40 million UVs per month). Assuming the clicks are sent to a relatively fresh news page bearing higher value advertising, the six billion monthly visits can translate into about $400 million per year in ad revenue. (This is based on a $5 to $6 revenue per 1,000 pages, i.e. a few dollars in CPM per single ad, depending on format, type of selling, etc.) That’s a very rough estimate. Again: Google should settle the matter and come up with accurate figures for its largest markets. (On the same subject, see a previous Monday Note: The press, Google, its algorithm, their scale.)
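For readers who want to check the arithmetic, here is a minimal sketch of the estimate above. It assumes one page view per visit, which is my simplification, not a figure from Google; the function name is made up for illustration.

```python
# Back-of-the-envelope estimate using the rough figures quoted above.
VISITS_PER_MONTH = 6_000_000_000  # "6 billion visits per month"
MONTHS_PER_YEAR = 12


def annual_ad_revenue(visits_per_month, revenue_per_1000_pages):
    """Yearly ad revenue in dollars, assuming one page view per visit."""
    yearly_visits = visits_per_month * MONTHS_PER_YEAR
    return yearly_visits / 1000 * revenue_per_1000_pages


low = annual_ad_revenue(VISITS_PER_MONTH, 5.0)   # $5 per 1,000 pages
high = annual_ad_revenue(VISITS_PER_MONTH, 6.0)  # $6 per 1,000 pages
print(f"${low / 1e6:.0f}M to ${high / 1e6:.0f}M per year")
```

The range brackets the "about $400 million" figure; any change in the pages-per-visit assumption moves it proportionally.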

But how exactly does Google News work? What kind of media does its algorithm favor most? Last week, the search giant updated its patent filing with a new document detailing the thirteen metrics it uses to retrieve and rank articles and sources for its news service. (Computerworld unearthed the filing; it’s here.)

What follows is a summary of those metrics, listed in the order shown in the patent filing, along with a subjective appreciation of their reliability, vulnerability to cheating, relevancy, etc.

#1. Volume of production from a news source:

A first metric in determining the quality of a news source may include the number of articles produced by the news source during a given time period [week or month]. [This metric] may be determined by counting the number of non-duplicate articles produced by the news source over the time period [or] counting the number of original sentences produced by the news source.

This metric clearly favors production capacity. It benefits big media companies deploying large staffs. But the system can also be cheated by content farms (Google already addressed these questions); new automated content creation systems are gaining traction, and many of them could now easily pass the Turing Test.
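To make metric #1 concrete, here is a rough sketch of counting non-duplicate articles. The patent does not say how duplicates are detected; hashing a normalized copy of the text is my assumption and only catches exact copies, whereas a real system would presumably need near-duplicate detection.

```python
import hashlib
import re


def fingerprint(text):
    """Hash of the text after lowercasing and collapsing whitespace (assumed rule)."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def count_non_duplicates(articles):
    """Number of distinct articles a source produced over the period."""
    return len({fingerprint(article) for article in articles})


week_of_articles = [
    "Google updates its news patent.",
    "google  updates its news patent.",  # same story, trivially reformatted
    "French press signs deal with Google.",
]
print(count_non_duplicates(week_of_articles))
```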

#2. Length of articles. Plain and simple: the longer the story (on average), the higher the source ranks. This is bad news for aggregators whose digital serfs cut, paste, compile and mangle abstracts of news stories that real media outlets produce at great expense.

#3. “The importance of coverage by the news source”. To put it another way, this matches the volume of coverage by the news source against the general volume of text generated by a topic. Again, it rewards large resource allocation to a given event. (In New York Times parlance, such effort is called “flooding the zone”.)

#4. The “Breaking News Score”:   

This metric may measure the ability of the news source to publish a story soon after an important event has occurred. This metric may average the “breaking score” of each non-duplicate article from the news source, where, for example, the breaking score is a number that is a high value if the article was published soon after the news event happened and a low value if the article was published after much time had elapsed since the news story broke.

Beware slow moving newsrooms: On this metric, you’ll be competing against more agile, maybe less scrupulous staffs that “publish first, verify later”. This requires a smart arbitrage by the news producers. Once the first headline has been pushed, they’ll have to decide what’s best: Immediately filing a follow-up or waiting a bit and moving a longer, more value-added story that will rank better in metrics #2 and #3? It depends on elements such as the size of the “cluster” (the number of stories pertaining to a given event).
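The breaking score of metric #4 can be sketched as follows. The patent only says the score is high soon after the event and low later; the exponential decay and the two-hour half-life below are my own assumptions, chosen purely for illustration.

```python
from datetime import datetime

HALF_LIFE_HOURS = 2.0  # hypothetical: the score halves for every two hours of delay


def breaking_score(event_time, published_time):
    """1.0 when published at the moment of the event, decaying toward 0 afterwards."""
    delay_hours = max((published_time - event_time).total_seconds() / 3600, 0.0)
    return 0.5 ** (delay_hours / HALF_LIFE_HOURS)


def source_breaking_score(articles):
    """Average breaking score over (event_time, published_time) pairs
    for a source's non-duplicate articles."""
    if not articles:
        return 0.0
    return sum(breaking_score(e, p) for e, p in articles) / len(articles)


event = datetime(2013, 2, 1, 12, 0)
articles = [
    (event, datetime(2013, 2, 1, 12, 0)),  # published immediately
    (event, datetime(2013, 2, 1, 14, 0)),  # published two hours later
]
print(source_breaking_score(articles))
```

Under these assumptions, a newsroom that files one instant headline and one follow-up two hours later averages 0.75.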

#5. Usage Patterns:

Links going from the news search engine’s web page to individual articles may be monitored for usage (e.g., clicks). News sources that are selected often are detected and a value proportional to observed usage is assigned. Well known sites, such as CNN, tend to be preferred to less popular sites (…). The traffic measured may be normalized by the number of opportunities readers had of visiting the link to avoid biasing the measure due to the ranking preferences of the news search engine.

This metric is at the core of Google’s business: assessing the popularity of a website thanks to the various PageRank components, including the number of links that point to it.
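The normalization described in the patent excerpt for metric #5 amounts to a click-through rate. A minimal sketch, with invented source names and numbers:

```python
def normalized_usage(impressions, clicks):
    """Clicks divided by the number of times a source's links were shown.

    Dividing by impressions keeps a source from scoring well merely
    because the search engine already ranks it near the top.
    """
    return {
        source: clicks.get(source, 0) / shown
        for source, shown in impressions.items()
        if shown > 0
    }


# Hypothetical figures, for illustration only.
impressions = {"cnn.com": 10_000, "smallpaper.example": 200}
clicks = {"cnn.com": 500, "smallpaper.example": 20}

rates = normalized_usage(impressions, clicks)
print(rates)
```

In this made-up example the small site earns fewer raw clicks but a higher normalized score, which is the bias the normalization is meant to remove.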

#6. The “Human opinion of the news source”:

Users in general may be polled to identify the newspapers (or magazines) that the users enjoy reading (or have visited). Alternatively or in addition, users of the news search engine may be polled to determine the news web sites that the users enjoy visiting. 

Here, things get interesting. Google clearly states it will use third party surveys to detect the public’s preference among various media: not only their websites, but also their “historic” media assets. According to the patent filing, the evaluation could also include the number of Pulitzer Prizes the organization collected and the age of the publication. That’s for the known part. What lies behind the notion of “Human opinion” is a true “quality index” for news sources that is not necessarily correlated to their digital presence. Such factors clearly favor legacy media.

#7. Audience and traffic. Not surprisingly, Google relies on stats coming from Nielsen Netratings and the like.

#8. Staff size. The bigger a newsroom is (as detected in bylines) the higher the value will be. This metric has the merit of rewarding large investments in news gathering. But it might become more imprecise as “large” digital newsrooms tend now to be staffed with news repackagers bearing little added value.

#9. Number of news bureaus. It’s another way to favor large organizations, even though their footprint tends to shrink both nationally and abroad.

#10. Number of “original named entities”. That’s one of the most interesting metrics. A “named entity is the name of a person, place or organization”. It’s the primary tool for semantic analysis.

If a news source generates a news story that contains a named entity that other articles within the same cluster (hence on the same topic) do not contain, this may be an indication that the news source is capable of original reporting.

Of course, some cheaters insert misspelled entities to create “false” original entities and fool the system (Google took care of it). But this metric is a good way to reward original source-finding.
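The idea behind metric #10 boils down to a set difference within a cluster. The sketch below assumes the named entities have already been extracted (the hard part, which I skip) and ignores the misspelling trick mentioned above; source names are invented.

```python
def original_entities(cluster):
    """Map each source to the entities no other story in the cluster contains.

    cluster: dict of source name -> set of named entities in its story.
    """
    result = {}
    for source, entities in cluster.items():
        others = set().union(
            *(found for name, found in cluster.items() if name != source)
        )
        result[source] = entities - others
    return result


# Hypothetical cluster of three stories on the same event.
cluster = {
    "paper_a": {"Google", "Eric Schmidt", "Paris"},
    "paper_b": {"Google", "Eric Schmidt", "Marc Schwartz"},
    "aggregator": {"Google", "Paris"},
}
print(original_entities(cluster))
```

Here only paper_b brings a name nobody else has, which is the signal of original reporting the patent describes.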

#11. The “breadth” of the news source. It pertains to the ability of a news organization to cover a wide range of topics.

#12. The global reach of the news sources. Again, it favors large media who are viewed, linked, quoted, “liked”, tweeted from abroad.

This metric may measure the number of countries from which the news site receives network traffic. In one implementation consistent with the principles of the invention, this metric may be measured by considering the countries from which known visitors to the news web site are coming (e.g., based at least in part on the Internet Protocol (IP) addresses of those users that click on the links from the search site to articles by the news source being measured). The corresponding IP addresses may be mapped to the originating countries based on a table of known IP block to country mappings.
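The IP-to-country lookup described in the excerpt can be sketched with Python's standard `ipaddress` module. The block table below is invented and uses address ranges reserved for documentation; a real table of "known IP block to country mappings" would be far larger.

```python
import ipaddress

# Hypothetical mapping of IP blocks to countries (documentation-only ranges).
IP_BLOCKS = [
    (ipaddress.ip_network("192.0.2.0/24"), "FR"),
    (ipaddress.ip_network("198.51.100.0/24"), "US"),
    (ipaddress.ip_network("203.0.113.0/24"), "JP"),
]


def country_of(ip):
    """Country code for an IP address, or None if no known block contains it."""
    address = ipaddress.ip_address(ip)
    for network, country in IP_BLOCKS:
        if address in network:
            return country
    return None


def global_reach(click_ips):
    """Number of distinct countries among the visitors who clicked through."""
    countries = {country_of(ip) for ip in click_ips}
    countries.discard(None)  # ignore addresses we cannot place
    return len(countries)


clicks = ["192.0.2.10", "192.0.2.77", "198.51.100.5", "10.0.0.1"]
print(global_reach(clicks))
```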

#13. Writing style. In the Google world, this means statistical analysis of contents against a huge language model to assess “spelling correctness, grammar and reading levels”.

What conclusions can we draw? This enumeration clearly shows Google intends to favor legacy media (print or broadcast news) over pure players, aggregators or digital native organizations. All the features recently added, such as Editor’s pick, reinforce this bias. The reason might be that legacy media are less prone to tricking the algorithm. For once, a known technological weakness becomes an advantage.

frederic.filloux@mondaynote.com

The Google Fund for the French Press

 

At the last minute, ending three months of tense negotiations, Google and the French Press hammered out a deal. More than yet another form of subsidy, this could mark the beginning of a genuine cooperation.

Thursday night, at 11:00pm Paris time, Marc Schwartz, the mediator appointed by the French government, got a call from the Elysée Palace: Google’s chairman Eric Schmidt was en route to meet President François Hollande the next day in Paris. They both intended to sign the agreement between Google and the French press on Friday at 6:15pm. Schwartz and Nathalie Collin, the chief representative for the French Press, were just out of a series of conference calls between Paris and Mountain View: Eric Schmidt and Google’s CEO Larry Page had green-lighted the deal. At 3 am on Friday, the final draft of the memorandum was sent to Mountain View. But at 11:00am everything had to be redone: Google had made unacceptable changes, causing Schwartz and Collin to consider calling off the signing ceremony at the Elysée. Another set of conference calls ensued. The final-final draft, unanimously approved by the members of the IPG association (General and Political Information), was printed at 5:30pm, just in time for the gathering at the Elysée half an hour later.

The French President François Hollande was in a hurry, too: That very evening, he was bound to fly to Mali, where French troops are waging a small but uncertain war to contain Al-Qaeda’s expansion in Africa. Never shy of political calculations, François Hollande seized the occasion to be seen as the one who forced Google to back down. As for Google’s chairman, co-signing the agreement along with the French President was great PR. As a result, negotiators from the Press were kept in the dark until Eric Schmidt’s plane landed in Paris on Friday afternoon, before he headed to the Elysée. Both men underlined what they called “a world premiere”, a “historical deal”…

This agreement ends — temporarily — three months of difficult negotiations. Now comes the hard part.

According to Google’s Eric Schmidt, the deal is built on two stages:

“First, Google has agreed to create a €60 million Digital Publishing Innovation Fund to help support transformative digital publishing initiatives for French readers. Second, Google will deepen our partnership with French publishers to help increase their online revenues using our advertising technology.”

As always, the devil lurks in the details, most of which will have to be ironed out over the next two months.

The €60m ($82m) fund will be provided by Google over a three-year period; it will be dedicated to new-media projects. About 150 websites belonging to members of the IPG association will be eligible to submit projects. The fund will be managed by a board of directors that will include representatives from the Press, from Google as well as independent experts. Specific rules are designed to prevent conflicts of interest. The fund will most likely be chaired by Marc Schwartz, the mediator, who is also a partner at the global audit firm Mazars (all parties praised him for his mediation and want him to take the job).

Turning to the commercial part of the pact, it is less publicized but at least as important as the fund itself. In a nutshell, using a wide array of tools ranging from advertising platforms to content distribution systems, Google wants to increase its business with the Press in France and elsewhere in Europe. Until now, publishers have been reluctant to use such tools because they don’t want to increase their reliance on a company they see as cold-blooded and ruthless.

Moving forward, the biggest challenge will be overcoming an extraordinarily high level of distrust on both sides. Google views the Press (especially the French one) as only too eager to “milk” it, and unwilling to genuinely cooperate in order to build and share value from the internet. The engineering-dominated, data-driven culture of the search engine is light-years away from the convoluted “political” approach of legacy media that don’t understand or look down on the peculiar culture of tech companies.

Dealing with Google requires a mastery of two critical elements: technology (with the associated economics), and the legal aspect. Contractually speaking, it means transparency and enforceability. Let me explain.

Google is a black box. For good and bad reasons, it fiercely protects the algorithms that are key to squeezing money from the internet, sometimes one cent at a time — literally. If Google consents to a cut of, say, advertising revenue derived from a set of contents, the partner can’t really ascertain whether the cut truly reflects the underlying value of the asset jointly created – or not. Understandably, it bothers most of Google’s business partners: they are simply asked to be happy with the monthly payment they get from Google, no questions asked. Specialized lawyers I spoke with told me there are ways to prevent such opacity. While it’s futile to hope Google will lift the veil on its algorithms, inserting an audit clause in every contract can be effective; in practical terms, it means an independent auditor can be appointed to verify specific financial records pertaining to a business deal.

Another key element: From a European perspective, a contract with Google is virtually impossible to enforce. The main reason: Google won’t give up on the Governing Law of a contract that is to be “Litigated exclusively in the Federal or States Courts of Santa Clara County, California”. In other words: Forget about suing Google if things go sour. Your expensive law firm based in Paris, Madrid, or Milan will try to find a correspondent in Silicon Valley, only to be confronted with polite rebuttals: For years now, Google has been parceling out multiple pieces of litigation among local law firms simply to make them unable to litigate against it. Your brave European lawyer will end up finding someone who will ask several hundred thousand dollars only to prepare but not litigate the case. The only way to prevent this is to put an arbitration clause in every contract. Instead of going before a court of law, the parties agree to settle the matter through a private tribunal. Attorneys say it offers multiple advantages: It’s faster, much cheaper, the terms of the settlement are confidential, and it carries the same enforceability as a Court order.

Google (and all the internet giants for that matter) usually refuses an arbitration clause as well as the audit provision mentioned earlier. Which brings us to a critical element: In order to develop commercial relations with the Press, Google will have to find ways to accept collective bargaining instead of segmenting negotiations one company at a time. Ideally, the next round of discussions should come up with a general framework for all commercial dealings. That would be key to restoring some trust between the parties. For Google, it means giving up some amount of tactical as well as strategic advantage… that is part of its long-term vision. As stated by Eric Schmidt in his upcoming book “The New Digital Age” (the Wall Street Journal had access to the galleys):

“[Tech companies] will also have to hire more lawyers. Litigation will always outpace genuine legal reform, as any of the technology giants fighting perpetual legal battles over intellectual property, patents, privacy and other issues would attest.”

European media are warned: they must seriously raise their legal game if they want to partner with Google — and the agreement signed last Friday in Paris could help.

Having said that, I personally believe it could be immensely beneficial for digital media to partner with Google as much as possible. This company spends roughly two billion dollars a year refining its algorithms and improving its infrastructure. Thousands of engineers work on it. Contrast this with digital media: Small audiences, insufficient stickiness, low monetization plague both web sites and mobile apps; the advertising model for digital information is mostly a failure — and that’s not Google’s fault. The Press should find a way to capture some of Google’s technical firepower and concentrate on what it does best: producing original, high quality contents, a business that Google is unwilling (and probably culturally unable) to engage in. Unlike Apple or Amazon, Google is relatively easy to work with (once the legal hurdles are cleared).

Overall, this deal is a good one. First of all, both sides are relieved to avoid a law (see last week’s Monday Note, Google vs. the press: avoiding the lose-lose scenario). A law declaring that snippets and links are to be paid for would have been a serious step backward.

Second, it’s a departure from the notion of “blind subsidies” that have been plaguing the French Press for decades. Three months ago, the discussion started with irreconcilable positions: publishers were seeking absurd amounts of money (€70m per year, the equivalent of IPG members’ total ad revenue) and Google was focused on a conversion into business solutions. Now, all the people I talked to this weekend seem genuinely supportive of building projects, boosting innovation and also taking advantage of Google’s extraordinary engineering capabilities. The level of cynicism often displayed by the Press is receding.

Third, Google is changing. The fact that Eric Schmidt and Larry Page jumped in at the last minute to untangle the deal shows a shift of perception towards media. This agreement could be seen as a template for future negotiations between two worlds that still barely understand each other.

frederic.filloux@mondaynote.com

Google vs. the press: avoiding the lose-lose scenario

 

Google and the French press have been negotiating for almost three months now. If there is no agreement within ten days, the government is determined to intervene and pass a law instead. This would mean serious damage for both parties. 

An update about the new corporate tax system. Read this story in Forbes by the author of the report quoted below 

Since last November, about twice a week and for several hours, representatives from Google and the French press have been meeting behind closed doors. To ease up tensions, an experienced mediator has been appointed by the government. But mistrust and incomprehension still plague the discussions, and the clock is ticking.

In the currently stalled process, the whole negotiation revolves around cash changing hands. Early on, representatives of media companies were asking Google to pay €70m ($93m) per year for five years. This would be “compensation” for “abusively” indexing and linking their contents and for collecting 20-word snippets (see a previous Monday Note: The press, Google, its algorithm, their scale.) For perspective, this €70m amount is roughly equivalent to the 2012 digital revenue of the newspapers and newsmagazines that constitute the IPG association (General and Political Information).

When the discussion came to structuring and labeling such a cash transfer, IPG representatives dismissively left the question to Google: “Dress it up!”, they said. Unsurprisingly, Google wasn’t ecstatic with this rather blunt approach. Still, the search engine feels this might be the right time to hammer out a deal with the press, instead of perpetuating a latent hostility that could later explode and cost much more. At least, this is how Google’s European team seems to feel. (In its hyper-centralized power structure, management in Mountain View seems slow to warm up to the idea.)

In Europe, bashing Google is more popular than ever. Not just Google, but all the US-based internet giants, widely accused of killing old businesses (such as Virgin Megastore, a retail chain that also made every possible mistake). But the actual core issue is tax avoidance. Most of these companies hired the best tax lawyers money can buy and devised complex schemes to avoid paying corporate taxes in EU countries, especially the UK, Germany, France, Spain, Italy… The French Digital Advisory Board (set up by Nicolas Sarkozy and generally business-friendly) estimated last year that Google, Amazon, Apple’s iTunes and Facebook had a combined revenue of €2.5bn to €3bn but each paid only on average €4m in corporate taxes instead of €500m (a rough 20% to 25% tax rate estimate). At a time of fiscal austerity, most governments see this (entirely legal) tax avoidance as politically unacceptable. In such a context, Google is the target of choice. In the UK for instance, Google made £2.5bn (€3bn or $4bn) in 2011, but paid only £6m (€7.1m or $9.5m) in corporate taxes. To add insult to injury, in an interview with The Independent, Google’s chairman Eric Schmidt defended his company’s tax strategy in the worst possible manner:

“I am very proud of the structure that we set up. We did it based on the incentives that the governments offered us to operate. It’s called capitalism. We are proudly capitalistic. I’m not confused about this.”

Ok. Got it. Very helpful.

Coming back to the current negotiation about the value of the click, the question was quickly handed over to Google’s spreadsheet jockeys who came up with the required “dressing up”. If the media accepted the use of the full range of Google products, additional value would be created for the company. Then, a certain amount could be derived from said value. That’s the basis for a deal reached last year with the Belgian press (the agreement is shrouded in a stringent confidentiality clause).

Unfortunately, the French press began to eliminate most of the eggs in the basket, one after the other, leaving almost nothing to “vectorize” the transfer of cash. Almost three months into the discussion, we are stuck with antagonistic positions. The IPG representatives are basically saying: We don’t want to subordinate ourselves further to Google by adopting opaque tools that we can find elsewhere. Google retorts: We don’t want to be considered as another deep-pocketed “fund” that the French press will tap into forever without any return for our businesses; plus, we strongly dispute any notion of “damages” to be paid for linking to media sites. Hence the gap between the amount of cash asked by one side and what is (reluctantly) acceptable on the other.

However, I think both parties vastly underestimate what they’ll lose if they don’t settle quickly.

The government tax howitzer is loaded with two shells. The first one is a bill (drafted by none other than IPG’s counsel, see PDF here), which introduces the disingenuous notion of “ancillary copyright”. Applied to the snippets Google harvests by the thousands every day, it creates some kind of legal ground to tax it the hard way. This montage is adapted from the music industry in which the ancillary copyright levy ranges from 4% to 7% of the revenue generated by a sector or a company. A rate of 7% on the revenue officially declared by Google in France (€138m) would translate into less than €10m, which is pocket change for a company that in fact generates about €1.5 billion from its French operations.
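The levy arithmetic can be checked in a few lines. This is only a sketch using the figures quoted above, in millions of euros; the function name is mine.

```python
DECLARED_REVENUE_M = 138    # revenue officially declared by Google in France
ESTIMATED_REVENUE_M = 1500  # estimated actual revenue from French operations


def levy(revenue_m, rate_percent):
    """Ancillary copyright levy, in millions of euros."""
    return revenue_m * rate_percent / 100


low_levy = levy(DECLARED_REVENUE_M, 4)   # bottom of the 4% to 7% range
high_levy = levy(DECLARED_REVENUE_M, 7)  # top of the range
share = high_levy / ESTIMATED_REVENUE_M  # levy as a fraction of estimated revenue

print(f"Levy: €{low_levy:.2f}m to €{high_levy:.2f}m")
print(f"Top-rate levy as a share of estimated revenue: {share:.2%}")
```

Even at the top rate, the levy stays well under one percent of what Google is estimated to earn in France.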

That’s where the second shell could land. Last Friday, the Ministry of Finance released a report on the tax policy applied to the digital economy titled “Mission d’expertise sur la fiscalité de l’économie numérique” (PDF here). It’s a 200-page opus, supported by no fewer than 600 footnotes. Its authors, Pierre Collin and Nicolas Colin, are members of the French public elite (one from the highest jurisdiction, le Conseil d’Etat, the other from the equivalent of the General Accounting Office; Nicolas Colin is also a former tech entrepreneur and a writer). The Collin & Colin Report, as it’s now dubbed, is based on a set of doctrines that also come to the surface in the United States (as demonstrated by the multiple references in the report).

To sum up:
- The core of the digital economy is now the huge amount of data created by users. The report categorizes different types of data: “Collected Data” are gathered through cookies, whether the user allows it or not. Such datasets include consumer behaviors, affiliations, personal information, recommendations, search patterns, purchase history, etc. “Submitted Data” are entered knowingly through search boxes, forms, timelines or feeds in the case of Facebook or Twitter. And finally, “Inferred Data” are byproducts of various processing, analytics, etc.
- These troves of monetized data are created by the free “work” of users.
- The location of such data collection is independent from the place where the underlying computer code is executed: I create a tangible value for Amazon or Google with my clicks performed in Paris, while the clicks are processed in a server farm located in the Netherlands or in the United States, and most of the profits land in a tax shelter.
- The location of the value thus created by the “free work” of users is currently dissociated from the location of the tax collection. In fact, it escapes any taxation.

Again, I’m quickly summing up a lengthy analysis, but the conclusion of the Collin & Colin report is obvious: Sooner or later, the value created and the various taxes associated with it will have to be reconciled. For Google, the consequences would be severe: Instead of €138m of official revenue admitted in France, the tax base would grow to €1.5bn revenue and about €500m profit; that could translate into €150m in corporate tax alone instead of the mere €5.5m currently paid by Google. (And I’m not counting the 20% VAT that would also apply.)
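To see what these numbers imply, here is a small sketch that derives the ratios behind them. The inputs are the estimates quoted above (in millions of euros); the derived margin and tax rate are my own back-calculation, not figures from the report.

```python
ESTIMATED_REVENUE_M = 1500  # estimated French revenue
ESTIMATED_PROFIT_M = 500    # estimated French profit
PROJECTED_TAX_M = 150       # corporate tax under the reconciled tax base
CURRENT_TAX_M = 5.5         # corporate tax currently paid

implied_margin = ESTIMATED_PROFIT_M / ESTIMATED_REVENUE_M  # profit / revenue
implied_tax_rate = PROJECTED_TAX_M / ESTIMATED_PROFIT_M    # tax / profit
tax_multiple = PROJECTED_TAX_M / CURRENT_TAX_M             # new bill / old bill

print(f"Implied profit margin: {implied_margin:.0%}")
print(f"Implied corporate tax rate: {implied_tax_rate:.0%}")
print(f"Tax bill multiplied by about {tax_multiple:.0f}")
```

In other words, the scenario assumes a margin of roughly a third, a tax rate of about 30%, and a corporate tax bill some 27 times larger than today's.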

Of course, this intellectual construction will be extremely difficult to translate into enforceable legislation. But the French authorities intend to rally other countries and furiously lobby the EU Commission to come around to their view. It might take years, but it could dramatically impact Google’s economics in many countries.

More immediately, for Google, a parliamentary debate over the Ancillary Copyright will open a Pandora’s box. From the Right to the Left, encouraged by François Hollande’s administration, lawmakers will outbid each other in trashing the search engine and beyond that, every large internet company.

As for members of the press, “They will lose too”, a senior official tells me. First, because of the complications in setting up the machinery the Ancillary Copyright Act would require, they will have to wait about two years before getting any dividends. Second, the governments (the present one as well as the past Sarkozy administration) have always been displeased with what they see as the French press’s “addiction to subsidies”; they intend to drastically reduce the €1.5bn in public aid. If the press gets its way through a law, according to several administration officials, the Ministry of Finance will feel relieved of its obligations towards media companies that don’t innovate much despite large influxes of public money. Conversely, if the parties are able to strike a decent business deal on their own, the French Press will quickly get some “compensation” from Google and might still keep most of its taxpayer subsidies.

As for the search giant, it will indeed have to endure a small stab but, for a while, will be spared the chronic pain of a long and costly legislative fight, and the contagion that goes with it: The French bill would be dissected by neighboring governments who will be only too glad to adapt and improve it.

frederic.filloux@mondaynote.com   

Next week: When dealing with Google, better use a long spoon; Why European media should rethink their approach to the search giant.