Could Google and publishers one day understand each other? Frankly, I doubt it. Two weeks ago I was in Hyderabad for the joint assembly of the World Association of Newspapers and the World Editors Forum (1).
There, Google-bashing was the life of the party. As I reported in last week’s Monday Note (see The Misdirected Revolt of the Dinosaurs), the climax was the “debate” between WAN’s president Gavin O’Reilly and Google’s top lawyer Dave Drummond. One comes from Alpha Centauri, the other from, say, Pandora. For those who want to get to the bottom of the argument, the publishers’ statement is here and Google’s defense is here.
In a nutshell, publishers keep complaining about Google’s relentless copyright violations. Tireless Google robots crawl the internet, indexing and displaying snippets in Google News, without paying a red cent for the content they post. As a result, said Gavin O’Reilly, “Google makes tons of money on our back”.
Dave Drummond’s reply: “We send online news publishers of all types a billion clicks a month from Google News and more than 3 billion additional visits from Search and other Google services. That’s about 100,000 business opportunities – to serve ads or offer subscriptions – every minute. And we don’t charge anything for that!” He added that Google practices were fully compliant with the Fair Use principle.
Fair Use is “tired rhetoric”, snapped O’Reilly.
At this point the discussion gets technical. And interesting. At stake is a crucial evolution of copyright, from a binary form (authorized vs. forbidden) to a fuzzier concept (use is allowed, but restrictions apply). This evolution of copyright is tied to Creative Commons licensing (devised by law professor Lawrence Lessig), which defines a sort of shape-adjustable notion of intellectual property.
Here is the first catch: how do you translate an intellectual construct such as flexible copyright into a computer protocol? In Hyderabad, publishers re-ignited a nerdy quarrel over the best way to protect their news material. That’s the robots.txt vs. ACAP issue. Non-techies, please stay with me, I’ll explain in plain English here (and will pursue the subject in French next January on Slate.fr).
Robots.txt is a protocol dating back to 1994 (four years before Google was incorporated), the early days of the internet. It works like this:
Say I’m an online publisher. In the tree structure of my site, I decide to open selected branches (directories) to search engines’ robot crawlers. The result of the crawling can be regurgitated by aggregators such as Google News. But, for reasons such as copyright restrictions on material I don’t own, parts of my site need to be kept out of Google’s sight.
For total prevention of unwanted crawling, I’ll just insert two lines of code at the root of my site:
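A minimal sketch of such a file, using the protocol’s two standard directives (this is the standard robots.txt syntax, not any particular site’s file):

```text
User-agent: *
Disallow: /
```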
The first line carries the name of the robot I want to exclude (“*” means all) and the second line specifies the directories I want to protect. For example:
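A sketch of what such a file could look like (the directory paths below are illustrative, not Le Monde’s actual robots.txt):

```text
User-agent: Googlebot
Disallow: /sport/football/
Disallow: /sport/rugby/
```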
Here, the site of the French paper Le Monde prevents Google’s indexing robot from crawling the sports directories devoted to football and rugby.
It’s as simple as that. To get an idea of the various protection policies implemented by news sites, just append “robots.txt” to the URL. Example: http://www.timesonline.co.uk/robots.txt. There you see the list of all the robots the London Times wants to “disallow”. Interestingly enough, even though Rupert Murdoch is at the forefront of an anti-Google crusade, his notorious British media property is not excluding Google at all; same for The Australian, another historic Murdoch property, which is rather robot-tolerant (see for yourself). I love such duplicity (sorry, pragmatism). Actually, the fight is about a MySpace-related advertising contract.
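For readers who want to check such rules programmatically rather than read the raw file, Python’s standard library ships a robots.txt parser. A small sketch, using illustrative rules in the spirit of the Le Monde example (not any paper’s actual file):

```python
# Check crawl permissions with Python's standard-library robots.txt parser.
from urllib import robotparser

# Illustrative rules: Googlebot is barred from two sports directories,
# every other robot is barred from the archives.
rules = [
    "User-agent: Googlebot",
    "Disallow: /sport/football/",
    "Disallow: /sport/rugby/",
    "",
    "User-agent: *",
    "Disallow: /archives/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Googlebot is kept away from the rugby pages but free to index the rest.
print(rp.can_fetch("Googlebot", "/sport/rugby/finale.html"))  # False
print(rp.can_fetch("Googlebot", "/politique/article.html"))   # True
```

In real use you would point `RobotFileParser.set_url()` at a live file such as http://www.timesonline.co.uk/robots.txt and call `read()` before querying it.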
Against our clunky but straightforward robots.txt protocol stands a much more modern one: ACAP. It stands for Automated Content Access Protocol and was created in 2006. More importantly, it is backed by 150 publishers and by the WAN.
Here we are.
ACAP and robots.txt look similar: lines of simple code, placed in the right spot, that define the bot(s) and the directories to be excluded. Except that ACAP is way more sophisticated. Specifically, it can:
- specify how many lines of an article the robot is allowed to suck in
- assign a specific abstract (snippet) to be taken by the bot
- define at what time the bot can crawl which part of the site, for instance “0700-1230 GMT”
- set the rate at which it crawls
- block links to a part of the site
- assign a term limit to the validity of the abstract
- decide which countries (by IP address) are allowed to see what (here comes the balkanization of the internet: bad idea)
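For flavor, here is a hypothetical sketch of what such fine-grained permissions could look like. The ACAP-* field convention comes from the protocol, but the specific field names and qualifiers below are my illustration, not verbatim ACAP 1.1 syntax:

```text
ACAP-crawler: *
ACAP-disallow-crawl: /premium/
ACAP-allow-crawl: /news/
# Illustrative qualifiers: cap the snippet at three lines,
# and let it expire after seven days.
ACAP-allow-present-snippet: /news/ max-length=3-lines time-limit=7-days
```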
But here is the second catch: Google loftily ignores ACAP; the company’s position is that the robots.txt protocol does enough to protect content. Hence the ire of WAN’s president.
I asked François Bourdoncle, CEO of the French search engine Exalead, for his view of the discord. In 2007, Exalead became the technical partner of the publishing consortium that wanted a better system than robots.txt (Exalead built the prototype pro bono). If we consider that the best protocol is the most widely adopted one, ACAP is toast: its version 1.1 has been adopted by 1,250 publishers, compared to the 20,000 sources that feed Google News.
François Bourdoncle offers the best analogy for the antagonism between online media and Google: “It is the craftsmen of the information world vs. the industrialists”, he says. On one side, the publishers: each manages thousands of documents on its website and has signed complicated copyright contracts, with clauses defining every nuance of authors’ protection. On the other side, the likes of Google, for whom the unit of measurement is the billion documents. There is no room for finesse here. The problem is one of massive processing, one that can only be handled by powerful algorithms. I mean: the Google way. Publishers want to define the number of lines a bot can draw out of a story? Google will say: I want to be the only one who decides how my search results (and Google News) actually look; if site x wants abstracts limited to 3 lines and site y agrees to 9, that’ll be a mess. When the Googleplex geeks decide it’s time, they’ll upgrade the robots.txt protocol to bring it closer to ACAP, and keep the widely adopted protocol their own.
Fact is, Google is playing bad politics here. It is stunning to see such a deployment of raw brainpower so badly messing up its relationship with a partner as important as the media industry. Here are some measures Google should consider to lower the tension:
- Robots.txt is an old thing. OK, it does the job, after a fashion, but Google should adopt ACAP pronto.
- Alternatively, it should work out something close to ACAP, along with the publishers. Contrary to what the WAN says, this won’t change the deteriorating economics of online news, but it would be a welcome symbolic gesture.
- Google should organize, ASAP, a serious gathering at the Googleplex to listen to the publishers’ positions on copyright, but also on traffic, revenue sharing and pay walls. Every major news organization in the world has plenty of smart people managing big news sites who don’t carry an anti-Google bias. They should be asked to come up with real proposals and be allowed to expect real answers.
The worst mistake Google could make at this stage is to keep ignoring publishers’ claims. Every news organization gets it: Google now rules the online publishing world. But with dominance come obligations. Displaying magnanimity could be a good tactical move, because a new factor has come up. It’s called Microsoft Bing (the search engine), and it is waiting to capitalize on the ailing publishing world’s anger. The Googleplex engineers should integrate that into their master algorithm.
(1) I received many emails regarding last week’s mention of Mirror.co.uk’s strategy for dealing with Google. Readers challenge Mirror’s Associate Editor Matt Kelly, sometimes in a well-documented way (thanks for the contributions). I’ll address the issue in January.