Building a business news aggrefilter

 

This February 10, Les Echos launches its business news aggrefilter. For the French business media group, this is a way to gain critical working knowledge of the semantic web. Here is how we did it. And why.

The site is called Les Echos 360 and is separate from our flagship site LesEchos.fr, the digital version of the French business daily Les Echos. As the newly coined word aggrefilter indicates, it is an aggregation and filtering system. It is to be the kernel from which many digital products and extensions we have in mind will spring.

My idea to build an aggrefilter goes back to… 2007. That year, in San Francisco, I met Dan Farber, at the time editor-in-chief of CNet (now at CBS Interactive, his blog here) – and actual father of the aggrefilter term. Dan told me: ‘You should have a look at Techmeme. It’s an “aggrefilter” that collects technology news and ranks them based on their importance to the news cycle’. I briefly explored the idea of building such an aggrefilter, but found it too hard to do from scratch, and off-the-shelf aggrefilter software didn’t exist yet. The task required someone like Techmeme founder Gabe Rivera – who holds a PhD in computer science. I shelved the idea for a while.

A year ago, as the head of digital at Les Echos, I reopened the case and pitched the idea to a couple of French computer scientists specialized in text-mining, a field that had vastly improved since I first looked at it. We decided to give the idea a shot. Why?

I believe a great media brand bearing a large set of positive attributes (reliability, scope, depth of coverage) needs to generate an editorial footprint that goes far beyond its own production. It’s a matter of critical mass. In the case of Les Echos, we need to be the very core of business information, both for the general public and for corporations. Readers trust the content we produce, therefore they should trust the reading recommendations we make through our aggregation of relevant web sites. This isn’t an obvious move for journalists who, understandably, aren’t necessarily keen to send traffic to third party web sites. (Interestingly enough, someone at the New York Times told me that a heated debate flared up within the newsroom a few years ago: To what extent should NYT.com direct readers to its competitors? Apparently, market studies settled the issue by showing that readers of the NYT online actually tended to also like it for being a reliable prescriber.)

Unlike Google News, which crawls an unlimited trove of sources, my original idea was to extract good business stories from a set of sources selected both algorithmically and manually. More importantly, the idea was to bring to the surface, and effectively curate, specialized sources (niche web sites and blogs) usually lost in the noise. Near-real-time information also seemed essential, hence the need for an automated gathering process, Techmeme-like. (Techmeme is now supplemented by Mediagazer, one of my favorite readings.)

Where do we go from here?

Initially, we turned to the newsroom, asking beat reporters for a list of reliable sources they regularly monitored. The idea was to build a qualified corpus based on suggestions from our in-house specialists. Techmeme and Mediagazer call it their “leaderboard” (see theirs for tech and media). Perhaps we didn’t have the right pitch, or we were misunderstood, but all we got was a lukewarm reception. Our partner, the French startup Syllabs, came up with a different solution, based on Twitter analysis.

We used our reporters’ 72 most active Twitter accounts to extract URLs embedded in their tweets. This first pass yielded about 5,000 URLs, but most turned out to be useless because, most of the time, reporters linked their tweets to their own or their colleagues’ newsroom stories. Then, Syllabs engineers had another idea: they data-mined tweets from people followed by our staff. This yielded 872,000 URLs. After that, another filtering pass identified the true curators, the people who found original sources around the web. Retweets were also counted, as they indicate a vote of relevance/confidence. After further statistical analysis of tweet components, the 872,000 URLs were boiled down to fewer than 400 original sources that were to become the basis of Les Echos 360’s Leaderboard (we are now down to 160 sources).
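To make the filtering idea concrete, here is a minimal Python sketch of how such a pass might look. It is not Syllabs’ actual method: the tweet format (a dict with `text` and `retweets`), the `OWN_DOMAINS` set and the scoring rule (one point per link plus one per retweet) are all assumptions made for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://\S+")

# Hypothetical: domains belonging to the newsroom itself, to be ignored.
OWN_DOMAINS = {"lesechos.fr"}


def domain_of(url):
    """Return the host of a URL, lowercased and without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def score_sources(tweets, own_domains=OWN_DOMAINS):
    """Score each linked domain; a retweet counts as one extra vote (assumed rule)."""
    scores = Counter()
    for tweet in tweets:
        for url in URL_RE.findall(tweet["text"]):
            # Drop punctuation that often trails a URL inside a sentence.
            dom = domain_of(url.rstrip(".,;:!?)"))
            if not dom or dom in own_domains:
                continue  # skip self-promotion links
            scores[dom] += 1 + tweet.get("retweets", 0)
    return scores


tweets = [
    {"text": "Our story https://www.lesechos.fr/a", "retweets": 10},
    {"text": "Great read: https://nicheblog.example/post.", "retweets": 3},
    {"text": "See https://nicheblog.example/b and https://other.example/x", "retweets": 0},
]
leaderboard = score_sources(tweets).most_common()
```

Sorting the counter with `most_common()` gives a rough candidate leaderboard; a real pass would also need to expand shortened links and weigh who is tweeting, which this sketch leaves out.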

Building a corpus of sources is one thing, but ranking articles with respect to their weight in the news cycle is yet another story. Every hour, 1,500 to 2,000 news pieces go through a filtering process that defines their semantic footprint (with its associated taxonomy). Then, they are aggregated in “clusters”. Eventually, clusters are ranked according to a statistical analysis of their “signal” in the general news-flow. Each “Clustering” (collection + ranking) yields 400-500 clusters, and the process more than occasionally overloads our computers.
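The cluster-then-rank idea can be illustrated with a deliberately naive Python sketch. It stands in for the real pipeline rather than describing it: the “footprint” here is just a bag of headline words instead of a semantic taxonomy, and the Jaccard similarity, the `threshold` value and the `source_weight` field are all assumptions.

```python
import re

# Tiny illustrative stopword list; a real system would use a proper one.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "on", "for", "and", "its", "is"}


def footprint(title):
    """Crude stand-in for a semantic footprint: the set of significant words."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return {w for w in words if w not in STOPWORDS}


def jaccard(a, b):
    """Share of words two footprints have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(articles, threshold=0.4):
    """Greedy single pass: attach each article to the first close-enough cluster."""
    clusters = []
    for art in articles:
        fp = footprint(art["title"])
        for c in clusters:
            if jaccard(fp, c["footprint"]) >= threshold:
                c["articles"].append(art)
                c["footprint"] |= fp
                break
        else:
            clusters.append({"footprint": set(fp), "articles": [art]})
    return clusters


def rank(clusters):
    """Order clusters by 'signal': here, the summed weight of their sources."""
    def signal(c):
        return sum(a.get("source_weight", 1.0) for a in c["articles"])
    return sorted(clusters, key=signal, reverse=True)


articles = [
    {"title": "Apple posts record quarterly profit", "source_weight": 1.0},
    {"title": "Apple record profit beats forecasts", "source_weight": 1.0},
    {"title": "ECB holds interest rates steady", "source_weight": 2.0},
    {"title": "Apple quarterly profit hits record", "source_weight": 1.0},
]
ranked = rank(cluster(articles))
```

In this toy run the three Apple stories merge into one cluster that outranks the single ECB story, even though the latter comes from a heavier source. The greedy pass is order-dependent, which is one reason a production system would need something sturdier.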

Despite continuous revisions to its 19,000 lines of code, the system is far from perfect. As expected. In fact it needs two sets of tunings. The first is maintaining a wide enough spectrum of sources to properly reflect the diversity of topics we want to cover. With a caveat: profusion doesn’t necessarily create quality. Crawling the long tail of potentially good sources continues to prove difficult. The second is finding the right balance between all parameters: update frequency, the “quality index” of sources – and many other criteria I won’t disclose here. This I compare to the mixing console inside a recording studio. Finding the right sound is tricky.
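The mixing-console analogy translates naturally into a weighted score. The Python sketch below is purely illustrative: only update frequency and the source “quality index” come from the description above, while `cluster_size`, the default weights and the 0-to-1 normalization are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Mix:
    """Hypothetical 'faders'; the real parameter list is not disclosed."""
    source_quality: float = 0.5
    update_frequency: float = 0.3
    cluster_size: float = 0.2  # assumed extra parameter


def score(item, mix):
    """Weighted sum of features assumed to be normalized to the 0..1 range."""
    return (mix.source_quality * item["source_quality"]
            + mix.update_frequency * item["update_frequency"]
            + mix.cluster_size * item["cluster_size"])


trusted_but_slow = {"source_quality": 0.9, "update_frequency": 0.2, "cluster_size": 0.1}
busy_but_average = {"source_quality": 0.3, "update_frequency": 0.9, "cluster_size": 0.5}

default_mix = Mix()
frequency_heavy = Mix(source_quality=0.2, update_frequency=0.6, cluster_size=0.2)
```

With the default mix the trusted source comes out slightly ahead; push the update-frequency fader up and the ranking flips. That sensitivity is what makes the balance hard to find.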

It took years for Techmeme to refine its algorithm. It might take a while for Les Echos 360, which is why we are launching the site in beta (a notion not widely shared in the media sector). No surprise: a continuous news-flow is an extremely difficult moving target. As for Techmeme and Mediagazer, despite refinements in Gabe Rivera’s work, their algorithm is “rectified” by more than a dozen editors (who even rewrite headlines to make them more explicit and punchier). A much lighter crew will monitor Les Echos 360 through a back-office that will allow us to change cluster rankings and to eliminate parasitic items.

For Les Echos’ digital division, this aggrefilter is a proof of concept, a way to learn a set of technologies we consider essential for the company’s future. The digital news business will be increasingly driven by semantic processes; these will allow publishers to extract much more value from news items, whether they are produced in-house or aggregated/filtered. That is especially true for a business news provider: the more specialized the corpus, the higher the need for advanced processing. Fortunately, it is much easier to fine-tune an aggrefilter for a specific field (logistics, clean-tech, M&A, legal affairs…) than for wider and muddier streams of general news. This new site is just the tip of the iceberg. We built this engine to address a wide array of vertical, business-to-business, needs. It aims to be a source of tangible revenue.

frederic.filloux@mondaynote.com

@filloux 

 

Aggregators: the good ones vs. the looters

News aggregators have grown into all shapes and forms. Some truly help the producers of original content, but others amount to mere electronic ransacking.

My daily media routine starts on Techmeme. It is a pure aggregator, or rather an aggrefilter, a term coined by Dan Farber, at the time editor-in-chief of CNet, who recommended it. This little site combines a simple concept and sophisticated execution. As shown in its “Leaderboard”, it crawls a hundred sources and applies a clever algorithm using 600 parameters. More importantly, it adds a human editing layer. In this Read Write Web interview, Techmeme’s founder Gabe Rivera recently discussed his views on the importance of human editing and how it allowed him to fine-tune his site’s content. The result is one of the most useful ways of monitoring the tech sector. And, since Gabe Rivera also launched Mediagazer last year, I use it to watch the media space. (Another iteration of the concept, Memeorandum, aggregates political news; for reasons I don’t quite understand yet, it doesn’t work as well as the two others.)

Techmeme and Mediagazer benefit the news outlets they mention. Story excerpts are short enough to avoid being self-sufficient and the hierarchical structure works. (Self-sufficient excerpts result in the aggregator not sending back traffic to the source; I’ll come to that later.) These twin sites are definitely among the best of their kind, resulting in a sound six-person business, not the next Google News but doing OK financially.

In fact, in their very own fields, Techmeme and Mediagazer are more useful than Google News. By crawling through so many sources, with the sole help of a powerful (but aging) algorithm, Google News ends up lacking finesse, precision and selectiveness. It’s a pure product of the engineering culture the search giant is built on, where obsessive hardcore binary thinking sweeps away words like “nuance”, “refinement”, “gradation”.

At the other end of the aggregator spectrum, we have The Huffington Post, one of the smartest digital news machines ever and, at the same time, the mother of all internet news impostures.

In France, where true journalism is in a state of exhaustion, everybody wants to make “Un Huffington Post à la Française”. The dream hardly comes from the best and the brightest. No, the fantasy agitates click-freaks building “traffic machines” on the generous losses their investors are willing to put up with. So, in spite of the red ink, why do they yearn for their Huffington Post so much? One word: Numbers. As recalled in a Newsonomics story, in one year, the HuffPo doubled its audience. And now, the HuffPo is nibbling at the NYTimes.com’s ankle: 13m unique visitors/month (Nielsen) vs. 19m for the Times. The HuffPo is a privately-held company with abundant funding and therefore does not release financial numbers. Revenues are said to be in the $15m range, and profitability is “near”… this according to fascinated bloggers who kissed the HuffPo CEO Eric Hippeau’s ring.