This February 10, Les Echos launches its business news aggrefilter. For the French business media group, this is a way to gain critical working knowledge of the semantic web. Here is how we did it. And why.
The site is called Les Echos 360 and is separate from our flagship site LesEchos.fr, the digital version of the French business daily Les Echos. As the newly coined word aggrefilter indicates, it is an aggregation and filtering system. It is to be the kernel from which many digital products and extensions we have in mind will spring.
My idea to build an aggrefilter goes back to… 2007. That year, in San Francisco, I met Dan Farber, at the time editor-in-chief of CNet (now at CBS Interactive, his blog here) – and actual father of the aggrefilter term. Dan told me: ‘You should have a look at Techmeme. It’s an “aggrefilter” that collects technology news and ranks them based on their importance to the news cycle’. I briefly explored the idea of building such an aggrefilter, but found it too hard to do from scratch: off-the-shelf aggrefilter software didn’t exist yet. The task required someone like Techmeme founder Gabe Rivera – who holds a PhD in computer science. I shelved the idea for a while.
A year ago, as the head of digital at Les Echos, I reopened the case and pitched the idea to a couple of French computer scientists specialized in text-mining, a field that had vastly improved since I first looked at it. We decided to give the idea a shot. Why?
I believe a great media brand bearing a large set of positive attributes (reliability, scope, depth of coverage) needs to generate an editorial footprint that goes far beyond its own production. It’s a matter of critical mass. In the case of Les Echos, we need to be the very core of business information, both for the general public and for corporations. Readers trust the content we produce; therefore, they should trust the reading recommendations we make through our aggregation of relevant web sites. This isn’t an obvious move for journalists who, understandably, aren’t necessarily keen to send traffic to third party web sites. (Interestingly enough, someone at the New York Times told me that a heated debate flared up within the newsroom a few years ago: To what extent should NYT.com direct readers to its competitors? Apparently, market studies settled the issue by showing that readers of the NYT online actually tended to also like it for being a reliable prescriber.)
In the business field, unlike Google News that crawls an unlimited trove of sources, my original idea was to extract good business stories from both algorithmically and manually selected sources. More importantly, the idea was to bring to the surface, to effectively curate specialized sources — niche web sites and blogs — usually lost in the noise. Near-real-time information also seemed essential, hence the need for an automated gathering process, Techmeme-like. (Techmeme is now supplemented by Mediagazer, one of my favorite readings.)
Where do we go from here?
Initially, we turned to the newsroom, asking beat reporters for a list of reliable sources they regularly monitored. The idea was to build a qualified corpus based on suggestions from our in-house specialists. Techmeme and Mediagazer call it their “leaderboard” (see theirs for tech and media). Perhaps we didn’t have the right pitch, or we were misunderstood, but all we got was a lukewarm reception. Our partner, the French startup Syllabs, came up with a different solution, based on Twitter analysis.
We used our reporters’ 72 most active Twitter accounts to extract URLs embedded in their tweets. This first pass yielded about 5,000 URLs, but most turned out to be useless because, most of the time, reporters linked their tweets to their own or their colleagues’ newsroom stories. Then, Syllabs engineers had another idea: they data-mined tweets from people followed by our staff. This yielded 872,000 URLs. After that, another filtering pass identified the true curators, the people who found original sources around the web. Retweets were also counted, as they indicate a vote of relevance/confidence. After further statistical analysis of tweet components, the 872,000 URLs were boiled down to fewer than 400 original sources that were to become the basis of Les Echos 360’s Leaderboard (we are now down to 160 sources).
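To make the source-mining step more concrete, here is a minimal sketch in Python. It is not Syllabs' actual pipeline, whose statistical analysis is not described here. It assumes each tweet is a dict with a "text" field and an optional "retweets" count, and the names extract_domains, rank_sources and own_domains are hypothetical. It pulls URLs out of tweets, discards links to our own sites, and scores each remaining domain, counting every retweet as an extra vote.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Rough URL matcher; real tweets would also need short-link expansion.
URL_PATTERN = re.compile(r"https?://[^\s]+")


def extract_domains(text):
    """Return the host names of all URLs found in a tweet's text."""
    domains = []
    for url in URL_PATTERN.findall(text):
        # Strip punctuation that often trails a URL at the end of a sentence.
        host = urlparse(url.rstrip(".,;:!?)")).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host:
            domains.append(host)
    return domains


def rank_sources(tweets, own_domains, top_n=400):
    """Score outside domains; each retweet counts as one extra vote.

    The weighting (1 + retweets) is an assumption made for illustration.
    """
    scores = Counter()
    for tweet in tweets:
        weight = 1 + tweet.get("retweets", 0)
        # A domain is counted once per tweet, even if linked twice.
        for domain in set(extract_domains(tweet["text"])):
            if domain not in own_domains:
                scores[domain] += weight
    return scores.most_common(top_n)
```

Called with a list of tweets and own_domains={"lesechos.fr"}, rank_sources returns (domain, score) pairs, highest score first. A heavily retweeted link to our own newsroom is ignored, which is exactly why the first pass over reporters' tweets proved so unproductive.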
Building a corpus of sources is one thing, but ranking articles with respect to their weight in the news cycle is yet another story. Every hour, 1,500 to 2,000 news pieces go through a filtering process that defines their semantic footprint (with its associated taxonomy). Then, they are aggregated in “clusters”. Eventually, clusters are ranked according to a statistical analysis of their “signal” in the general news-flow. Each “clustering” run (collection + ranking) produces 400-500 clusters, a process that more than occasionally overloads our computers.
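The cluster-then-rank idea can be illustrated with a toy Python sketch. It is a deliberately crude stand-in, not the engine behind Les Echos 360: the "semantic footprint" is reduced to the set of longer words in a headline, similarity is plain Jaccard overlap, and the "signal" is simply the sum of a hypothetical source_quality value over the articles in a cluster. The function names and the 0.3 threshold are assumptions chosen for illustration.

```python
import re


def tokens(title):
    """Crude footprint: lowercase words longer than three characters."""
    return {w for w in re.findall(r"[a-z0-9]+", title.lower()) if len(w) > 3}


def jaccard(a, b):
    """Share of words two footprints have in common (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_articles(articles, threshold=0.3):
    """Greedy single pass: join the first similar cluster, else start one."""
    clusters = []
    for article in articles:
        words = tokens(article["title"])
        for cluster in clusters:
            if jaccard(words, cluster["words"]) >= threshold:
                cluster["items"].append(article)
                cluster["words"] |= words
                break
        else:
            # No existing cluster was close enough.
            clusters.append({"words": set(words), "items": [article]})
    return clusters


def rank_clusters(clusters):
    """Order clusters by signal: summed source quality of their articles."""
    def signal(cluster):
        return sum(item.get("source_quality", 1.0) for item in cluster["items"])
    return sorted(clusters, key=signal, reverse=True)
```

Feeding it a list of dicts with a "title" key (and optionally "source_quality") and calling rank_clusters(cluster_articles(articles)) puts the story covered by the most, or the best, sources on top. Even this naive version compares every article with every existing cluster, which hints at why a full run over a few thousand pieces can strain the hardware.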
Despite continuous revisions to its 19,000 lines of code, the system is far from perfect. As expected. In fact, it needs two kinds of tuning. The first is maintaining a wide enough spectrum of sources to properly reflect the diversity of topics we want to cover. With a caveat: profusion doesn’t necessarily create quality. Crawling the long tail of potentially good sources continues to prove difficult. The second is finding the right balance between all parameters: update frequency, the “quality index” of sources – and many other criteria I won’t disclose here. This I compare to the mixing console inside a recording studio. Finding the right sound is tricky.
It took years for Techmeme to refine its algorithm. It might take a while for Les Echos 360, which is why we are launching the site in beta (a notion not widely shared in the media sector). Unsurprisingly, a continuous news-flow is an extremely difficult moving target. As for Techmeme and Mediagazer, despite refinements in Gabe Rivera’s work, their algorithm is “rectified” by more than a dozen editors (who even rewrite headlines to make them more explicit and punchier). A much lighter crew will monitor Les Echos 360 through a back-office that will allow us to change cluster rankings and to eliminate parasitic items.
For Les Echos’ digital division, this aggrefilter is a proof of concept, a way to learn a set of technologies we consider essential for the company’s future. The digital news business will be increasingly driven by semantic processes; these will allow publishers to extract much more value from news items, whether they are produced in-house or aggregated/filtered. That is especially true for a business news provider: the more specialized the corpus, the higher the need for advanced processing. Fortunately, it is much easier to fine-tune an aggrefilter for a specific field (logistics, clean-tech, M&A, legal affairs…) than for wider and muddier streams of general news. This new site is just the tip of the iceberg. We built this engine to address a wide array of vertical, business-to-business, needs. It aims to be a source of tangible revenue.