Could Google and publishers one day understand each other? Frankly, I doubt it. Two weeks ago I was in Hyderabad for the joint assembly of the World Association of Newspapers and the World Editors Forum (1).
There, Google-bashing was the life of the party. As I wrote in last week’s Monday Note (see The Misdirected Revolt of the Dinosaurs), the climax was the “debate” between WAN’s president Gavin O’Reilly and Google’s top lawyer Dave Drummond. One comes from Alpha Centauri, the other from, say, Pandora. For those who want to get to the bottom of the argument, the publisher’s statement is here and the defense by Google’s top lawyer is here.
In a nutshell, publishers keep complaining about Google’s relentless copyright violations. Tireless Google robots crawl the internet, indexing and displaying snippets in Google News, without paying a red cent for the content they post. As a result, said Gavin O’Reilly, “Google makes tons of money on our back”.
Dave Drummond’s reply: “We send online news publishers of all types a billion clicks a month from Google News and more than 3 billion additional visits from Search and other Google services. That’s about 100,000 business opportunities – to serve ads or offer subscriptions – every minute. And we don’t charge anything for that!” He added that Google’s practices were fully compliant with the Fair Use principle.
Fair Use is “A tired rhetoric”, snapped O’Reilly.
At this point the discussion gets technical. And interesting. At stake is a crucial evolution of copyright, from a binary form (authorized or forbidden) to a fuzzier concept (use is allowed but restrictions apply). This evolution of copyright is tied to Creative Commons (co-founded by law professor Lawrence Lessig), which defines a sort of shape-adjustable notion of intellectual property.
Here is the first catch: how do you translate an intellectual construct such as flexible copyright into a computer protocol? In Hyderabad, publishers reignited a nerdy quarrel over the best way to protect their news material. That’s the Robots.txt vs. ACAP issue. Non-techies, please stay with me, I’ll do this in plain English here (and I’ll continue in French next January on Slate.fr).
Robots.txt is a 1994 protocol, dating from the early internet days (four years before Google was incorporated). It works like this:
Say I’m an online publisher. In the tree structure of my site, I decide to open selected branches (directories) to search engines’ robot crawlers. The result of the crawling can be regurgitated by aggregators such as Google News. But, for reasons such as copyright restrictions on material I don’t own, parts of my site need to be kept out of Google’s sight.
For total prevention of unwanted crawling, I’ll just put two lines in a plain text file named robots.txt at the root of my site:
User-agent: *
Disallow: /
The first line carries the name of the robot I want to exclude (“*” means all) and the second line specifies the directories I want to protect (“/” means the whole site). For example:
User-agent: Googlebot
Disallow: /sport-foot-ligue1/
Disallow: /sport-football/
Disallow: /sport-rugby-top14/
Disallow: /sport-rugby/
Here, the site of the French paper Le Monde prevents Google’s indexing robot from crawling its sports directories about football and rugby.
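To see how a well-behaved crawler would read such rules, here is a minimal sketch in Python using the standard library's urllib.robotparser module. The rules are the two examples quoted above; the article paths and the "OtherBot" crawler name are made up for illustration, and the helper build_parser is my own, not part of any protocol.

```python
from urllib.robotparser import RobotFileParser

# The Le Monde excerpt quoted above, as a crawler would receive it.
LE_MONDE_RULES = """\
User-agent: Googlebot
Disallow: /sport-foot-ligue1/
Disallow: /sport-football/
Disallow: /sport-rugby-top14/
Disallow: /sport-rugby/
"""

# The two-line "keep everybody out" file.
BLANKET_RULES = """\
User-agent: *
Disallow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


le_monde = build_parser(LE_MONDE_RULES)
blanket = build_parser(BLANKET_RULES)

# Hypothetical paths: only Googlebot is kept out of the sports directories.
print(le_monde.can_fetch("Googlebot", "/sport-rugby/article.html"))
print(le_monde.can_fetch("Googlebot", "/politique/article.html"))
print(le_monde.can_fetch("OtherBot", "/sport-rugby/article.html"))

# With the blanket rule, every robot is refused every path.
print(blanket.can_fetch("Googlebot", "/politique/article.html"))
```

Note that the check happens on the crawler's side: the file only states the publisher's wishes, and it is up to each robot to consult it before fetching a page.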
It’s as simple as that. To get an idea of the various protection policies implemented by news sites, just append “/robots.txt” to the site’s URL. Example: http://www.timesonline.co.uk/robots.txt. There you see the list of all the robots the London Times wants to “disallow”. Interestingly enough, even though Rupert Murdoch is at the forefront of an anti-Google crusade, his notorious British media property is not excluding Google at all; the same goes for The Australian, another historic Murdoch property, which is rather robot-tolerant (see for yourself). I love such duplicity, sorry, pragmatism. (Actually, the fight is about a MySpace-related advertising contract).
Facing our clunky but straightforward robots.txt protocol, here is a much more modern one: ACAP. It stands for Automated Content Access Protocol, and was created in 2006. But more importantly, it is backed by 150 publishers and by the WAN.
Here we are.
ACAP and Robots.txt look similar: lines of simple code, put in the right place, that name the bot(s) and the directories to be excluded. Except that ACAP is way more sophisticated. Specifically, it can:
- tell the robot how many lines of an article it is allowed to suck in
- assign a specific abstract (snippet) to be taken by the bot
- set the time at which the bot can crawl a given part of the site, for instance “0700-1230 GMT”
- set the rate at which it crawls
- block links to a part of the site
- assign a term limit for the validity of the abstract
- decide which country (by IP address) should be allowed to see what (here comes the balkanization of the internet: bad idea)
… etc.
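To give a feel for what this extra granularity means for whoever writes the crawler, here is a rough sketch in Python. It is not ACAP syntax: the UsagePolicy class, its field names, the "/news/" directory and the 3-line limit are all hypothetical, chosen only to model two of the permissions listed above (snippet length and crawl time window).

```python
from dataclasses import dataclass
from datetime import time
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class UsagePolicy:
    """Hypothetical per-directory usage terms; NOT real ACAP syntax."""
    path_prefix: str
    max_snippet_lines: Optional[int] = None            # None: no limit
    crawl_window: Optional[Tuple[time, time]] = None   # GMT; None: any time

    def may_crawl(self, path: str, now: time) -> bool:
        """True if a cooperating bot may crawl `path` at GMT time `now`."""
        if not path.startswith(self.path_prefix):
            return True   # this policy says nothing about other directories
        if self.crawl_window is None:
            return True
        start, end = self.crawl_window
        return start <= now <= end

    def snippet(self, article_lines: List[str]) -> List[str]:
        """Cut an article down to the number of lines the publisher permits."""
        if self.max_snippet_lines is None:
            return list(article_lines)
        return list(article_lines[: self.max_snippet_lines])


# Invented example: a /news/ directory, 3-line snippets, crawlable 0700-1230 GMT.
policy = UsagePolicy("/news/", max_snippet_lines=3,
                     crawl_window=(time(7, 0), time(12, 30)))

article = ["line 1", "line 2", "line 3", "line 4", "line 5"]
print(policy.may_crawl("/news/story.html", time(8, 0)))    # inside the window
print(policy.may_crawl("/news/story.html", time(13, 0)))   # outside the window
print(policy.snippet(article))                             # first 3 lines only
```

Every additional field is one more per-site rule that the crawler has to look up and honor before it displays anything, which is where the sophistication comes at a cost.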
Which one is better? ACAP, in theory. It dramatically increases the granularity of the terms of use of any given contract. To get a full and, I think, balanced perspective, go to this detailed article on Search Engine Land.
But here is the second catch: Google loftily ignores ACAP; the company’s position is that the Robots.txt protocol does enough to protect content. Hence the WAN president’s ire.
I asked François Bourdoncle, CEO of the French search engine Exalead, for his view of the discord. In 2007, Exalead became the technical partner for the publishing consortium that wanted a better system than Robots.txt (Exalead did the prototype pro bono). If we consider that the best protocol is the one that is most widely adopted, ACAP is toast: its version 1.1 has been adopted by 1,250 publishers, compared to the 20,000 sources that went on Google News.
François Bourdoncle offers the best analogy to describe the antagonism between the online media and Google: “It is the craftsmen of the information world vs. industrialists”, he says. On one hand, you have the publishers: they manage thousands of documents on each of their web sites. They signed complicated copyright contracts, with clauses defining the nuances in authors’ protection. On the other hand, you have the likes of Google. For them, the unit of measurement is a billion documents. There is no room for finesse here. The problem is one of massive processing, one that can only be dealt with by powerful algorithms. I mean: the Google way. Publishers want to be able to define the number of lines a bot can draw out from a story? Google will say: I want to be the only one who defines what my search or crawl results (in Google News) actually look like; if site x wants abstracts limited to 3 lines and site y agrees to 9, that’ll be a mess. When the Googleplex geeks decide it’s time, they’ll upgrade the Robots.txt protocol to bring it closer to ACAP, and keep the widely adopted protocol their own.
Fact is, Google is playing bad politics here. It is stunning to see such a deployment of raw brainpower so badly messing up its relationship with a partner as important as the media industry. Here are some measures that Google should consider to lower the tension:
- Robots.txt is an old thing. OK, it does the job after a fashion, but Google should adopt ACAP pronto.
- Alternatively, it should work out something close to it, along with the publishers. Contrary to what the WAN says, this won’t change the deteriorating economics of online news, but it would be a welcome symbolic gesture.
- Google should organize, ASAP, a serious gathering at the Googleplex to listen to the publishers’ position on copyright, but also on traffic and revenue sharing and pay walls. In every major news organization in the world, there are plenty of smart people managing big news sites who don’t carry an anti-Google bias. They should be asked to come up with real proposals and be allowed to expect real answers.
The worst mistake Google can make at this stage is to continue to ignore publishers’ claims. Every news organization got it: Google now rules the online publishing world. But with dominance come obligations. Displaying magnanimity could be a good tactical move. Because a new factor has come up. It’s called Microsoft Bing (the search engine), and it is waiting to capitalize on the ailing publishing world’s anger. The Googleplex engineers should integrate that in their master algorithm.
(1) I received many emails regarding last week’s mention of Mirror.co.uk’s strategy for dealing with Google. Readers challenge the Mirror’s Associate Editor Matt Kelly, sometimes in a well-documented way (thanks for the contributions). I’ll address the issue in January.
8 Comments
There is absolutely nothing preventing publishers from implementing their own custom solutions to the problems posed by Google. They simply build their own robot that automatically partitions their site into a robots.txt crawlable section according to their desired ACAP rules. They can limit crawler activity by appropriate HTTP response status codes. When a user is directed to the automatically crawled section by Google, the server simply detects that they’re not a robot, and redirects them to the full content which has now been protected from pillaging.
If publishers’ IT departments are not competent enough to perform this transition, then my firm is more than happy to provide necessary technical consultation.
As I understand it, and as has been explained to me by a web developer, robots.txt is more of a gentlemen’s agreement than a true software protocol. Putting “googlebot disallow” in your robots.txt will not really block the googlebot (or any other bot for that matter) on a software level, it’s just common courtesy that this particular bot does not go ahead and crawl your website. Software engineers can further enlighten us on this issue, but that’s what I’ve been told. So things are not as simple as they sound.
ACAP is all about restrictions and control and is not in the free spirit that both the Internet and Google thrive on. It will never be adopted.
Yes, Oliver, robots.txt is nothing more than a “gentlemen’s agreement.” It is not binding and could not be binding since it was defined by engineers, not a legislature. Nothing in robots.txt can extend or expand on the rights granted by copyright. The same holds true for ACAP. Both are merely advisory formats, defined by non-legislative bodies. You cannot be sued for ignoring or misinterpreting either a robots.txt or ACAP file.
Now, imagine, for a moment, a world in which ACAP was respected in the same way that many services respect robots.txt. Imagine, for instance, not only news sites but also every spammer, advertiser, blogger and charlatan on the web was free to define the “abstract” that would be shown by search engines. (Note: The science of constructing correct “abstracts” from content is one that has been the subject of thousands of academic studies…) I think you’ll see that it would not be a good thing…
Imagine further that ACAP, by putting limits on the length of time an abstract was valid, the times of day that crawlers could work, etc. was able to impact the design and implementation of *every* search engine on the web (not just the big ones). What you’ll realize is that it would be massively harder for new entrants to the search field to build systems that might one day compete with the existing big ones. And, it would be harder for any search engine to innovate in how it serves the needs of its users. Do you really want to make it that much harder for new information discovery systems to be built and improved? Are you willing to allow the news publishers to bend the entire web in order to serve their purposes?
The system we have today works for most publishers even if the news publishers haven’t yet figured out how to work within it. The web would not be the wondrous tool that it is today without information discovery tools such as search engines. How much are we willing to break in order to compensate for the failure of the old-line news publishers to build compelling products?
Frédéric,
Excellent summary and analysis of the issues.
Your recommendations for Google are in the spirit of the wisdom of Wired’s founding editor Kevin Kelly who has, according to @wiredUK this month, concluded that technology is an unruly child. “There’s no bad children, just bad friends,” he says. “So it’s about training the technology, rearing it as if it were children.”
Google technology (and those who make and run it) needs to grow up. And, it’s clear, that’s not going to happen unless those who’ve been around for a while longer, such as WAN & the other ACAP affiliates, exercise some “tough love”.
François
Google-bashing is also a sport in France. Watch the video of a meeting between Google and French publishers that took place in Paris on Dec. 4.
http://www.lavoixdudodo.info/2009/12/08/google-vs-presse-francaise-le-clash-a-bien-eu-lieu/
The point about the internet is that it indeed is one big “gentleman’s agreement”. The only problem is that some gentlemen get envious of another’s ability to monetize this fact.
Consider the simple fact that for any of us to read this blog, the content has to be handed off by countless machines that none of us own and none of us directly financially contribute to the operation of. Yes, there are financial arrangements for “transit”, and more gentlemanly “peering”, however ultimately everything “works”.
If you provide content on the internet and you are worried about it being used freely, you should either lock it up behind a pay-wall or via encryption, or simply pester governments enough into creating laws for a new class of property, a new form of theft, and a new idea of punishment. It looks like media companies, with their immense wealth, are taking the latter route and essentially making the world slightly less free without considering the philosophical ramifications of the laws and regulations.
If your content is valuable enough, the price should be reflected in your transit costs. End of story! A “media strike” where publishers refuse content to Google is a perfectly valid response; however, I think that the media producers are too jealous of each other for this to happen.
This is an interesting problem. A problem that should be solved at the level of the internet, not at the myopic, technically-illiterate level of government.
For an excellent introduction to the true nature of the internet, please read the following article on ArsTechnica http://arstechnica.com/old/content/2008/09/peering-and-transit.ars/ After you have warmed up to that, perhaps you’ll pick up a book on the Border Gateway Protocol and impress your friends with the nuances of our modern age at cocktail parties.
Thanks for another interesting article Frédéric.