By François Bourdoncle, Co-founder and CTO of Exalead
“I had the honor to be the moderator of a very stimulating round table at WWW2012. The participants were:
- François BANCILHON, Founder & CEO of Data Publica, a company that assembles data sets from both public data and open data, and then sells these data sets to companies to help them build innovative applications. Data Publica describes itself as a “Data Vendor”: in the domain of Open Data, it aims to be what “Software Vendors” are in the domain of Software.
- Sébastien LEFEBVRE, Founder & CEO of Mesagraph, a company specializing in making sense of Twitter posts to develop innovative social TV viewing experiences.
- Denis WEISS, CIO for Industry at Groupe La Poste (the French Postal Service), who talked about examples of cutting-edge applications of Big Data to monitor their core business, namely, the distribution of mail.
The focus of this roundtable was on the Industrial Applications of Open Data and Big Data, and at least four new trends and open questions emerged during our discussion.
Trend #1: The Internet of Things is already there!
When you think about it, it’s not so long ago that the “Internet of Things” (a vast collection of small devices seamlessly connected to the Net) was still just a concept in research papers. And before you know it, it’s there, and like Monsieur Jourdain, people don’t fully understand it. You may find it debatable to call your smartphone a “Thing”, even though it is a “Thing” that sends lots and lots of information to many servers worldwide. Either way, you would be amazed by the number of anonymous devices that are already fully connected. For example, La Poste has worked with Exalead on connecting to the Net the opto-electronic machines that it uses to filter and sort our mail. It then uses all the information gathered to build a full-fledged business intelligence tool, used to monitor the system operationally. Another example: did you know that high-end car manufacturers have turned their vehicles into “Things” that keep sending monitoring information to central servers to ensure better service and maintenance? One has to understand that every such “Thing” creates huge logs of, literally, hundreds of billions of records: that’s more records than the entire Web has pages!
Trend #2: What is (are) the right business model(s) for data?
Data, data, data. Data is the new frontier these days. Big Data, Open Data, DaaS (Data as a Service), you name it. Data is like Software in that it is very scalable: one invests heavily to create a data set, and then sells it by the million, with zero or very small marginal cost.
Well, at least that’s how the theory goes. But in fairness, it’s hard to say that anybody has cracked the right business model for data. For instance, one interesting question remains: to be scalable, a data set needs to be reusable by many applications and developers. But then, the value of such a data set is probably very low, unless it’s absolutely needed to build everybody’s application and you have exclusivity, which is likely to be a very rare case, especially with Open Data.
At the other end of the spectrum, using the Big Data artillery to build a very specific data set can yield a very exclusive “product” that can only be used by one or maybe a handful of non-competing companies. Such a data set can be very expensive (to build and to buy), and can also create a lot of value for the company that uses it. But this is a business model entirely different from the intrinsically scalable one of the Software industry (especially SaaS). At least until someone cracks it. Any volunteers?
Trend #3: Adding a Social layer to traditional activities
Well, that is also a very interesting trend: using social networks like Twitter to provide real-time “voice of the customer” applications. For instance, Mesagraph is working with broadcasters to build iPad applications connected to TV programs so that you can comment and interact with other viewers in real time, while you’re watching the show. That is truly revolutionary: finally, a way for viewers to connect back to the broadcasters. Consumers quite obviously stand to benefit, but at the same time, think of the implications in terms of advertising. Real-time advertising, even. Fine-grained audience segmentation. This is an entirely new field with all sorts of promises and challenges.
Another very interesting application that was presented at WWW2012 is the use of tweets to monitor the Netflix media streaming service, by detecting tweets containing phrases like “is out” (come on, guys, you can do better than that!). Even with very simple heuristics, about 90% of outages were correctly detected. Truly awesome, IMHO.
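To give a feel for how simple such a heuristic can be, here is a minimal sketch in Python. It is not the method presented at WWW2012: the phrase list, the five-minute window, and the threshold are all assumptions made up for illustration, and the function name `flag_outage_windows` is hypothetical. The idea is only to count matching tweets per time window and flag the windows where the count gets suspiciously high.

```python
from collections import Counter

# Hypothetical phrase list; the talk only cites "is out" as an example.
OUTAGE_PHRASES = ("is out", "is down")


def flag_outage_windows(tweets, window_seconds=300, threshold=3):
    """Return start times of windows holding at least `threshold` matching tweets.

    `tweets` is an iterable of (unix_timestamp, text) pairs.
    The window size and threshold are illustrative assumptions.
    """
    counts = Counter()
    for timestamp, text in tweets:
        lowered = text.lower()
        if any(phrase in lowered for phrase in OUTAGE_PHRASES):
            # Round the timestamp down to the start of its window.
            window_start = (int(timestamp) // window_seconds) * window_seconds
            counts[window_start] += 1
    return sorted(start for start, n in counts.items() if n >= threshold)


# Invented sample data, for illustration only.
sample = [
    (1000, "Netflix is out again?!"),
    (1010, "netflix is down for me"),
    (1100, "Is Netflix out or is it my wifi"),  # no listed phrase, so ignored
    (1150, "ugh, Netflix is out"),
    (2000, "Loving this new show on Netflix"),
]
print(flag_outage_windows(sample))  # → [900]
```

The third tweet shows the obvious weakness of plain substring matching: it clearly reports a problem, yet it slips through. A real system would need a richer phrase list, or some semantic analysis, to catch it.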
Trend #4: The New Frontier of Business Intelligence & Semantics at petabyte scales
The Internet of Things is making petabyte scales a reality today (a petabyte is 1,000 terabytes, or 1,000,000 gigabytes). A copy of the entire Web amounts to several petabytes. So Big Data technologies are needed to handle such a vast amount of data, and one has to perform some form of Business Intelligence to make sense of it.
There are two major breakthroughs that help meet this challenge. On the one hand, RAM-based databases, where data is organized in “columns” as opposed to “rows”, allow for very fast processing of large quantities of data (as long as this data fits in RAM, that is). Slicing and dicing couldn’t be any faster or easier. On the other hand, search engines, which are “columnar” in essence, are evolving to handle many more kinds of data (semantic, numeric, etc.), are becoming more and more transactional (“ACID”, in barbarian terms) and can process even larger data sets, since they do not require that entire data sets fit in RAM.
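The difference between “rows” and “columns” is easier to see on a toy example. The Python sketch below is only an illustration of the layout idea, not of how any actual in-memory database or search engine is implemented; the records (mail-sorting machines, hours, letter counts) are invented, the helper names `to_columns` and `sum_by` are hypothetical, and every record is assumed to carry the same fields.

```python
# Row layout: one complete record per entry (invented sample data).
rows = [
    {"machine": "A", "hour": 8, "letters": 1200},
    {"machine": "B", "hour": 8, "letters": 900},
    {"machine": "A", "hour": 9, "letters": 1500},
    {"machine": "B", "hour": 9, "letters": 1100},
]


def to_columns(records):
    """Pivot a list of records into one list of values per field."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_by(columns, key, measure):
    """Total `measure` for each distinct value of `key` (a simple slice-and-dice)."""
    totals = {}
    for k, v in zip(columns[key], columns[measure]):
        totals[k] = totals.get(k, 0) + v
    return totals


columns = to_columns(rows)

# Aggregating one field only reads that field's list, never the other columns.
print(sum(columns["letters"]))                 # → 4700
print(sum_by(columns, "machine", "letters"))   # → {'A': 2700, 'B': 2000}
```

In the column layout, an aggregate over one field scans a single list and never touches the others, which is what makes slicing and dicing so cheap when everything sits in RAM.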
You get to choose your favorite. But one thing is clear: semantic treatment of textual data will be a major requirement for next-generation Business Intelligence platforms. That is the next frontier for Big Data. And search engines are uniquely positioned to win this race.