Big Data

Google’s looming hegemony

 

If we factor in Google's geospatial applications + its unique data processing infrastructure + Android tracking, etc., we're seeing the potential for absolute power over the economy.

Large utility companies worry about Google. Why? Unlike those who mock Google for being a “one-trick pony” with 99% of its revenue coming from AdWords, they connect the dots. Right before our eyes, the search giant is weaving a web of services and applications aimed at collecting more and more data about everyone and every activity. This accumulation of exabytes (and the ability to process such almost inconceivable volumes) is bound to impact sectors ranging from power generation to transportation and telecommunications.

Consider the following trends. At every level, Western countries are crumbling under their debt load. Nations, states, counties and municipalities are becoming unable to support the investment necessary to modernize — sometimes even to maintain — critical infrastructures. Globally, tax-raising capabilities are diminishing.

In a report about infrastructure in 2030 (a 500-page PDF here), the OECD makes the following predictions (emphasis mine):

Through to 2030, annual infrastructure investment requirements for electricity, road and rail transport, telecommunications and water are likely to average around 3.5% of world gross domestic product (GDP).

For OECD countries as a whole, investment requirements in electricity transmission and distribution are expected to more than double through to 2025/30, in road construction almost to double, and to increase by almost 50% in the water supply and treatment sector. (…)

At present, governments are not well placed to meet these growing, increasingly complex challenges. The traditional sources of finance, i.e. government budgets, will come under significant pressure over the coming decades in most OECD countries – due to aging populations, growing demands for social expenditures, security, etc. – and so too will their financing through general and local taxation, as electorates become increasingly reluctant to pay higher taxes.

What's the solution? The private sector will play a growing role through Public-Private Partnerships (PPPs). In these arrangements, a private company (or, more likely, a consortium of such) builds a bridge, a motorway or a railroad for a city, region or state, at no expense to the taxpayer. It then reimburses itself from the project's cash flow. Examples abound. In France, the elegant €320m ($413m) Millau viaduct was built — and financed — by Eiffage, a €14 billion revenue construction group. In exchange for financing the viaduct, Eiffage was granted a 78-year toll concession with an expected internal rate of return ranging from 9.2% to 17.3%. Across the world, a growing number of projects are built using this type of mechanism.

How can a company commit hundreds of millions of euros, dollars or pounds with an acceptable level of risk over several decades? The answer lies in data analysis and predictive models. Companies engineer credible cash-flow projections using reams of data on operations, usage patterns and component life cycles.
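As a minimal sketch of what such a projection boils down to, here is how forecast cash flows translate into an internal rate of return over a 78-year concession. Every figure (traffic, toll level, margin, growth) is a made-up assumption for illustration, not Eiffage's actual numbers:

```python
# Toy toll-concession model. Every figure is a made-up assumption, not
# Eiffage's actual numbers: it only illustrates how usage forecasts turn
# into an internal rate of return (IRR) over a 78-year concession.

def irr(cash_flows, lo=-0.99, hi=1.0, tol=1e-6):
    """Internal rate of return by bisection: the discount rate at which the
    net present value (NPV) of the cash flows is zero."""
    def npv(rate):
        return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(mid) > 0:
            lo = mid          # NPV still positive: the break-even rate is higher
        else:
            hi = mid
    return (lo + hi) / 2

construction_cost = 320e6                          # upfront investment, year 0 (euros)
daily_vehicles, toll, growth = 13_000, 8.0, 0.015  # assumed traffic, toll per crossing, yearly growth
concession_years = 78

cash_flows = [-construction_cost]
for year in range(1, concession_years + 1):
    revenue = daily_vehicles * 365 * toll * (1 + growth) ** year
    cash_flows.append(revenue * 0.55)              # assumed 55% operating margin

print(f"Projected IRR over the concession: {irr(cash_flows):.1%}")
```

The better the usage data feeding the traffic and growth assumptions, the tighter the range around that single percentage, which is precisely where large datasets pay off.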

What does all this have to do with Google?

Take a transportation company building and managing networks of buses, subways or commuter trains in large metropolitan areas. Over the years, analysis of tickets or passes yields tons of data on customer flows, timings, train loads, etc. This is of the essence when assessing the market's potential for a new project.

Now consider how Google aggregates the data it collects today — and what it will collect in the future. It's a known fact that cellphones send geolocation data back to Mountain View (or Cupertino). Bouncing from one cell tower to another, or catching the signal of a geolocated wifi transmitter, Android phone users can be tracked in real time even when the GPS function is turned off. Overlay this (compounded and anonymized) dataset onto information-rich maps, including indoor ones, and you get very high-definition profiles of who goes or stays where, at any time.
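A toy sketch of that kind of aggregation, with invented device ids, places and timestamps, shows how anonymized pings become an occupancy profile by place and hour:

```python
# Toy aggregation of anonymized location pings into an occupancy profile.
# Device ids, places and timestamps are invented; the point is only to show
# how raw sightings become "how many distinct people are where, at what hour".
from collections import defaultdict
from datetime import datetime

pings = [  # (anonymized device id, place, ISO timestamp)
    ("a3f1", "Victoria Station", "2012-10-01T08:05:00"),
    ("a3f1", "Canary Wharf",     "2012-10-01T09:10:00"),
    ("b7c2", "Victoria Station", "2012-10-01T08:20:00"),
    ("c9d4", "Victoria Station", "2012-10-02T08:15:00"),
]

occupancy = defaultdict(set)  # (place, hour of day) -> set of distinct devices
for device, place, ts in pings:
    hour = datetime.fromisoformat(ts).hour
    occupancy[(place, hour)].add(device)

for (place, hour), devices in sorted(occupancy.items()):
    print(f"{place} @ {hour:02d}:00 -> {len(devices)} distinct devices")
```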

Let's push it a bit further. Imagine a big city such as London, operating 500,000 security cameras, the bulk of the 1.85 million CCTVs deployed in the UK — one for every 32 citizens. 20,000 of them are in the subway system. The London Tube is the perfect candidate for partial or total privatization, as it bleeds money and screams for renovations. In fact, as several people working at the intersection of geo applications and big data projects told me, Google would be well placed to provide the most helpful datasets. In addition to the circulation data coming from cellphones, Google would use facial recognition technology. As these algorithms are already able to differentiate a woman from a man, they will soon be able to identify (anonymously) ethnicities, ages, etc. Am I exaggerating? Probably not. Mercedes-Benz already has a database of 1.5 million visual representations of pedestrians to be fed into the software of its future self-driving cars. This is a type of application in which, by the way, Google possesses a strong lead with its fleet of driverless Priuses crisscrossing Northern California and Nevada.

Coming back to the London Tube and its unhappy travelers: we have traffic data, to some degree broken down into demographic clusters; why not then add shopping data (also geo-tagged) derived from search and ad patterns, Street View-related information… Why not also supplement all of the above with smart electrical grid analysis that could refine predictive models even further (every fraction of a percentage point counts…)?
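To make the layering concrete, here is a hedged sketch of how several such layers (phone-derived footfall, geo-tagged purchases, grid load) could feed one ridership prediction through an ordinary least-squares fit. Every number and feature name is invented for illustration:

```python
# Toy multi-layer predictive model: fit station ridership from several
# independent data layers with ordinary least squares. Every number and
# feature name is invented for illustration.
import numpy as np

# Each row is one station-day: [phone-derived footfall, geo-tagged purchases, grid load (MWh)]
layers = np.array([
    [52_000, 3_100, 410.0],
    [48_500, 2_800, 395.0],
    [61_200, 3_900, 450.0],
    [39_700, 2_100, 360.0],
    [57_300, 3_500, 430.0],
])
ridership = np.array([81_000, 76_500, 95_200, 62_300, 88_900])  # observed daily riders

# Ordinary least squares with an intercept column appended.
X = np.column_stack([layers, np.ones(len(layers))])
coeffs, *_ = np.linalg.lstsq(X, ridership, rcond=None)

tomorrow = np.array([55_000, 3_300, 420.0, 1.0])  # tomorrow's layer readings (plus intercept)
print(f"Predicted ridership: {tomorrow @ coeffs:,.0f}")
```

The point of adding a layer is simply that it shrinks the prediction error; each extra source of data refines the model a little more.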

The value of such models is much greater than the sum of their parts. While public transportation operators or utility companies are already good at collecting and analyzing their own data, Google will soon be in the best position to provide powerful predictive models that aggregate and connect many layers of information. In addition, its unparalleled infrastructure and proprietary algorithms provide a unique ability to process these ever-growing datasets. That's why many large companies around the world are concerned about Google's ability to soon insert itself into their business.

frederic.filloux@mondaynote.com

 

Facebook’s Gen Y Nightmare

 

Generation Y will — paradoxically — pay a high price for giving up its privacy to Facebook.

Taos, New Mexico, Fall 2012. At 18, Tina Porter has been on Facebook for four years. Duly briefed by her parents, a teacher and a therapist, she takes great care not to post content — remarks on her wall, photos, videos — that could expose her in an unwanted manner.

Still. Spending about 30 hours a month on the social network, she has become as transparent as a pane of glass. It will impact the cost of her health insurance, her ability to get a loan, and her chances of finding a job.

Denver, Colorado, spring 2018. Tina is now 24. She's finishing her law degree at Colorado State University. She's gone through a lot: experimenting with substances, being pulled over for speeding a couple of times, relying on pills to regain some sleep after being dumped by her boyfriend. While Tina has had her share of downs, she also has her ups. Living in Denver, she never misses an opportunity to go hiking, mountain biking or skiing — except when she has to spend 48 gruesome hours in the dark, alone with a severe migraine. But she remains fit, and she likes to record her sports performances on health sites — all connected to Facebook — and compare them with friends.

Seattle, winter 2020. In a meeting room overlooking the foggy Puget Sound, Alan Parsons, head of human resources at the Wilson, McKenzie & Whitman law firm, holds his monthly review of the next important hires. Parsons is with Marcus Chen, a senior associate at Narrative Data Inc.; both are poring over a selection of resumés. Narrative Data was created in 2015 by a group of MIT graduates. Still headquartered in Cambridge, Massachusetts, the startup now helps hundreds of corporations pick the right talent.

Narrative Data doesn't track core competencies. The firm is more into character and personality analysis; it assesses the ability to sustain stress and to make the right decisions under pressure. To achieve this, Narrative Data is staffed with linguists, mathematicians, statisticians, psychologists, sociologists and neuroscientists. What they basically do is data-mine the social internet: blogs, forums, Twitter, and of course Facebook. Over the years, they've drawn a map of behaviors based on the language people use. Thanks to Narrative Data's algorithm, everyone aged above 20 can have his or her life unfolded like a gigantic electronic papyrus scroll. HR people and recruiters love it. So do insurance companies and banks.

Of course, in 2015 no one will be dumb enough to write on his Facebook wall something like “Gee, bad week ahead, I'm heading to my third chemotherapy session”. But Narrative Data is able to pinpoint anyone's health problems by weaving together language patterns. For instance, it pores over health forums where people talk, openly but anonymously, about their conditions. By analyzing millions of words, Narrative Data has mapped what it calls Health Clusters, data aggregates that reveal health conditions with remarkable accuracy. The Cambridge company is even working on a black program able to “de-anonymize” health forum members by cross-matching language patterns with Facebook pages. But the project raises too many privacy issues to be rolled out — yet.

Tina Porter's resumé popped up thanks to LinkedIn Expert, the social network's high-end professional service. LinkedIn, too, has developed its own technology to data-mine resumés for specific competencies. Tina's research on trade disputes between Korea and the United States caught everyone's interest at Wilson, McKenzie. That's why her “3D Resumé” — a Narrative Data trademark — sits at the top of the pile displayed on a large screen in the meeting room.

Narrative’s Marcus Chen does the pitch:
“Tina Porter, 26. She’s what you need for the transpacific trade issues you just mentioned, Alan. Her dissertation speaks for itself, she even learned Korean…”
He pauses.
“But?…” asks the HR guy.
“She's afflicted with acute migraines. They occur at least a couple of times a month. She's good at concealing it, but our data shows it could be a problem,” Chen says.
“How the hell do you know that?”
“Well, she falls into this particular Health Cluster. In her Facebook babbling, she sometimes refers to a spike in her olfactory sensitivity — a known precursor to a migraine crisis. In addition, each time, for a period of several days, we see a slight drop in the number of words she uses in her posts, her vocabulary shrinks a bit, and her tweets, usually sharp, become less frequent and more nebulous. That's an obvious pattern for people suffering from serious migraines. In addition, the Zeo Sleeping Manager website and the stress management site HeartMath — both now connected with Facebook — suggest she suffers from insomnia. In other words, Alan, we think you can't take Ms. Porter into the firm. Our Predictive Workforce Expenditure Model shows that she will cost you at least 15% more in lost productivity. Not to mention the patterns in her Facebook entries suggesting a 75% chance of her becoming pregnant in the next 18 months, again according to our models.”
“Not exactly a disease, from what I know. But OK, let's move on.”

I stop here. You might think I'm over the top with this little tale. But the (hopefully) fictitious Narrative Data Inc. could be the offspring of existing large consumer research firms combined with semantic and data-mining experts such as Recorded Future. This Gothenburg (Sweden)-based company — with a branch in… Cambridge, Mass. — provides real-time analysis of about 150,000 sources (news services, social networks, blogs, government web sites). The firm takes pride in its ability to predict a vast array of events (see this Wired story).
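As for the linguistic signal Marcus Chen describes in the tale, posts getting shorter and vocabulary shrinking around an episode, it is trivially easy to compute once the posts are in hand. A toy sketch on invented posts:

```python
# Toy version of the "language shrinkage" signal from the tale above: flag
# the weeks where a user's posts get markedly shorter and less varied than
# usual. The posts are invented; a real system would use far richer features.
posts_by_week = {
    "2020-W01": ["Great hike up Mount Evans, the trail was perfect and the views unreal",
                 "Long study session on Korean trade law, fascinating case files"],
    "2020-W02": ["tired", "quiet day", "meh"],
    "2020-W03": ["Back on the bike, 60km along the river with the whole crew"],
}

def week_stats(posts):
    words = [w.lower() for post in posts for w in post.split()]
    return len(words) / max(len(posts), 1), len(set(words))  # avg words/post, vocabulary size

all_posts = [post for posts in posts_by_week.values() for post in posts]
baseline_length, baseline_vocab = week_stats(all_posts)

for week, posts in posts_by_week.items():
    avg_length, vocab = week_stats(posts)
    if avg_length < 0.5 * baseline_length and vocab < 0.5 * baseline_vocab:
        print(f"{week}: possible episode (avg {avg_length:.1f} words/post, vocabulary {vocab})")
```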

Regarding the “de-anonymizing” of the web: two years ago in Paris, I met a mathematician working on pattern detection models. He focused on locating individuals simply through their cell phone habits. Even if a person buys a cell phone with a fake ID and uses it with great care, his or her real identity can be recovered in a matter of weeks, based on past behavior. (As for Facebook, it recently launched a snitching program aimed at getting rid of pseudonyms — cool.)

Expanding such capabilities is only a matter of refining algorithms, setting up the right data hoses and lining up the processing power required to deal with petabytes of unstructured data. Not an issue anymore. Moore’s Law is definitely on the Inquisitors’ side.

frederic.filloux@mondaynote.com

The value is in the reader’s Big Data

 

Why the right use of Big Data can change the economics of digital publishing. 

Digital publishing is vastly undervalued. Advertising has yet to fulfill its promises — it is nosediving on the web and it has failed on mobile (read JLG's previous column Mobile Advertising: The $20 billion Opportunity Mirage). Readers come, and often go, as many digital publications are unable to retain them beyond a few dozen articles and about thirty minutes per month. Most big names in the digital news business are stuck with single-digit ARPUs. People do flock to digital, but cash doesn't follow — at least, not in the amounts required to sustain the production of quality information. Hence the crux of the situation: if publishers are unable to extract significantly more money per user than they do now, most of them will simply die. As a result, the bulk of the population — with the notable exception of the educated wealthy — will rely on high-audience websites merely acting as echo chambers for shallow commodity news snippets.

The solution, the largest untapped value, resides right before publishers' eyes: reader profiles and content, all matched against the “noise” of the internet.

Extracting such value is a Big Data problem. But, before we go any further, what is Big Data? The simplest answer: data sets too large to be ingested and analyzed by conventional database management tools. At first, I was suspicious; this sounded like a marketing concept devised by large IT players struggling to rejuvenate their aging brands. I changed my mind when I met people with hands-on experience, from large corporations down to a 20-staff startup. They work on tangible things, collecting data streams from fleets of cars or airplanes, processing them in real time and, in some cases, matching them against other contexts. Patterns emerge and, soon, manufacturers predict what is likely to break in a car, find ways to refine the maintenance cycle of a jet engine, or realize which software modification is needed to increase the braking performance of a luxury sedan.
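A hedged, minimal sketch of that kind of fleet-level pattern detection, with invented vehicle ids and sensor readings: flag the vehicles whose brake-wear reading drifts well away from the rest of the fleet.

```python
# Toy fleet-level check of the predictive-maintenance kind: flag vehicles
# whose brake-wear sensor reading drifts well away from the rest of the
# fleet. Vehicle ids, readings and the threshold are invented.
from statistics import median

brake_wear = {   # vehicle id -> latest sensor reading (arbitrary units)
    "car-001": 0.42, "car-002": 0.39, "car-003": 0.44,
    "car-004": 0.41, "car-005": 0.78, "car-006": 0.40,
}

fleet_median = median(brake_wear.values())
for vehicle, reading in sorted(brake_wear.items()):
    if reading > 1.5 * fleet_median:   # crude rule: 50% above the fleet median
        print(f"{vehicle}: reading {reading} vs fleet median {fleet_median:.2f} -> schedule inspection")
```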

Phone carriers and large retail chains have been using such techniques for quite a while and have adjusted their marketing as a result. Just for fun, read this New York Times Magazine piece depicting, among other things, the predictive pregnancy model developed by Target (a large US supermarket chain). Through powerful data mining, the aptly named Target Corporation is able to pinpoint customers reaching their third month of pregnancy, a pivotal moment in their consuming habits. Or look at Google Flu Trends, which provides better tracking of flu outbreaks than any government agency.

Now, let’s narrow the scope back to the subject of today’s column and see how these technologies could be used to extract more value from digital news.

The internet already provides the necessary tools to see who is visiting a website, what he or she likes, etc. The idea is to know the user with greater precision and to anticipate his or her needs.

Let's draw an analogy with Facebook. By carefully analyzing the “content” produced by its users — statements, photos, links, interactions among friends, “likes”, “pokes”, etc. — the social network has been able to develop spectacular predictive models. It is able to detect a change in someone's status (single, married, engaged, etc.) even if the person never mentions it explicitly. Similarly, Facebook is able to predict with great accuracy the probability that two people exchanging casually on the network will become romantically involved. The same applies to a change in someone's financial situation or to health incidents. Without anyone saying it outright, semantic analysis correlated with millions of similar behaviors will detect who is newly out of a job, depressed, bipolar, broke, high, elated, pregnant, or engaged. Unbeknownst to them, online behavior makes people completely transparent. For Facebook, this could translate into an unbearable level of intrusiveness, such as showing embarrassing ads or making silly recommendations that are seen by everyone.

Applied to news content, the same techniques could help refine what is known about readers. For instance, a website could detect someone's job change by matching his or her reading patterns against millions of other monthly site visits. Based on this, if Mrs. Laura Smith is spotted, with a 70% probability, to have been promoted to marketing manager in a San Diego-based biotech startup (five items), she can be served targeted advertising, especially if she also appears to be an active hiker (sixth item). More importantly, over time, the website could slightly tailor itself: of course, Mrs. Smith will see more biotech stories in the business section than the average reader, but the Art & Leisure section will select more content likely to fit her taste, and the Travel section will look more like an outdoor magazine than a guide for compulsive urbanites. Progressively, the content Mrs. Smith gets will become both more useful and more engaging.
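A minimal sketch, with an invented reader profile and invented articles, of what that tailoring step could look like: each article is scored against the topic weights the site has inferred about Mrs. Smith, and the best matches surface first.

```python
# Toy content-tailoring step: rank articles by overlap with the topic weights
# the site has inferred about one reader. Profile and articles are invented.
reader_profile = {"biotech": 0.7, "hiking": 0.5, "san diego": 0.3, "finance": 0.1}

articles = {
    "Biotech startups raise record funding in San Diego": {"biotech": 0.9, "san diego": 0.8},
    "Ten classic hikes in the Rockies":                    {"hiking": 0.9, "travel": 0.6},
    "Opera season opens with a new Ring cycle":            {"culture": 0.9},
}

def score(article_topics):
    # Dot product between the article's topics and the reader's inferred interests.
    return sum(weight * reader_profile.get(topic, 0.0)
               for topic, weight in article_topics.items())

for title, topics in sorted(articles.items(), key=lambda kv: score(kv[1]), reverse=True):
    print(f"{score(topics):.2f}  {title}")
```

In practice the ranking would only nudge a fraction of the page, for the serendipity reasons discussed below.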

The economic consequences are obvious. Advertising — or, better, advertorial content branded as such (users are sick of banners) — will be sold at a much higher price by the website, and more relevant content will induce Mrs. Smith to read more pages per month. (Ad targeting companies are doing this, but in such a crude and saturating way that it is now backfiring.) And since Mrs. Smith makes more money, her growing interest in the website could make her a good candidate for a premium subscription; she'll then be served a tailor-made offer at the right time.

Unlike Facebook, which openly soaks up its users' intimacy under the pretext that they are willing to give up their privacy in exchange for a great service (a good deal for now, a terrible one in the future), news publishers will be more careful. First, readers will be served ads and content that they alone will see — not their 435 Facebook “friends”. This is a big difference, one that requires a sophisticated level of customization. Also, when it comes to reading, preserving serendipity is essential. By this I mean no one will enjoy a 100% tailor-made site; inevitably, it will feel a bit creepy and cause the reader to go elsewhere to find refreshing stuff.

Even with this sketchy description, you get my point: by compiling and analyzing millions of behavioral data points, it is possible to make a news service far more attractive for the reader — and much more profitable for the publisher.

How far-reaching is this? In the news sector, Big Data is still in its infancy. But as Moore's Law keeps working, making the required large amounts of computing power more affordable, it will become more accessible to publishers. Twenty years ago, only the NSA was able to handle large sets of data with its stadium-size private data centers. Now publishers can work with small companies that outsource CPU time and storage to Amazon Web Services and use Hadoop, the open source counterpart of Google's distributed processing software, to pore over millions of records. That's why Big Data is booming and provides news companies with new opportunities to improve their business model.
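To give a flavor of what poring over those records looks like, here is a minimal MapReduce-style sketch of the kind of job one would hand to Hadoop: count page views per reader and section from a click log. The log lines and field names are invented, and the map, shuffle and reduce steps are simulated in plain Python so the sketch runs on its own.

```python
# Toy MapReduce-style job of the kind one would run on Hadoop: count page
# views per (reader, section) from a click log. The log lines and field
# names are invented; the map / shuffle / reduce phases are simulated here
# in plain Python so the sketch is self-contained.
from itertools import groupby

log = [
    "2012-10-01T08:05,reader-42,business,article-901",
    "2012-10-01T09:12,reader-42,business,article-917",
    "2012-10-01T10:03,reader-42,travel,article-233",
    "2012-10-01T11:47,reader-77,culture,article-640",
]

def mapper(line):
    _, reader, section, _ = line.split(",")
    yield f"{reader}:{section}", 1           # emit one count per page view

def reducer(key, counts):
    yield key, sum(counts)                   # sum the counts for one key

# Shuffle phase: sort mapped pairs so identical keys sit next to each other,
# as Hadoop would do before calling the reducer.
mapped = sorted(pair for line in log for pair in mapper(line))
for key, group in groupby(mapped, key=lambda kv: kv[0]):
    for k, total in reducer(key, (count for _, count in group)):
        print(f"{k}\t{total}")
```

On a real cluster, the mapper and reducer are essentially the only parts a publisher writes; Hadoop handles the splitting, shuffling and fault tolerance across machines.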

frederic.filloux@mondaynote.com