O'Reilly on the Future of Massive Data Analysis
20 Nov 2008

There's a post by Joseph Hellerstein worth a read over on O'Reilly Radar: The Commoditization of Massive Data Analysis. It's focused more on the enterprise than on smaller businesses, but that's just a consequence of the target audience.
His primary point is becoming especially pertinent to web companies and smaller developers: The convergence of dropping hardware prices and machine-readable APIs is making the storage and processing of vast amounts of information practical.
"We are at the beginning of what I call The Industrial Revolution of Data. We're not quite there yet, since most of the digital information available today is still individually 'handmade': prose on web pages, data entered into forms, videos and music edited and uploaded to servers. But we are starting to see the rise of automatic data generation 'factories' such as software logs, UPC scanners, RFID, GPS transceivers, video and audio feeds."
It's already reasonable for a site on a commodity web host to store every user and search interaction, or a database of tens of millions of data points, and in the future it will only get easier. The question is, what tools will we use to make sense of all of this?
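It's worth seeing how little code that takes. Here's a minimal sketch of recording every interaction as a line of JSON in an append-only log; the log_interaction helper and its field names are hypothetical, but anything in this spirit works:

    import json
    import time

    LOG_PATH = "interactions.log"  # hypothetical append-only log file

    def log_interaction(user_id, action, detail):
        """Append one user interaction as a JSON line.

        At a few hundred bytes per event, even millions of
        interactions a day fit comfortably on commodity disk.
        """
        event = {
            "ts": time.time(),   # Unix timestamp of the event
            "user": user_id,     # whatever identifier the site already has
            "action": action,    # e.g. "search", "click", "pageview"
            "detail": detail,    # e.g. the query string or URL
        }
        with open(LOG_PATH, "a") as f:
            f.write(json.dumps(event) + "\n")

    # Usage: record a search as it happens
    log_interaction("user-42", "search", "cheap web hosting")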
His analysis reduces the field to SQL (via Oracle) and MapReduce (via Hadoop), but once we look beyond the enterprise, tools like Erlang (or functional programming in general) and the emerging CouchDB show promise, not to mention some of the cloud computing entries from Amazon and others.
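For readers who haven't met MapReduce, here's a toy, single-machine sketch of the idea in Python, counting search queries from a log like the one above. Hadoop's real API and its distribution across machines differ, of course; the map and reduce phases are just the essence of the model:

    import json
    from collections import defaultdict

    def map_phase(lines):
        """Map: emit a (key, 1) pair for each search query in the log."""
        for line in lines:
            event = json.loads(line)
            if event["action"] == "search":
                yield event["detail"], 1

    def reduce_phase(pairs):
        """Shuffle + reduce: group pairs by key, then sum the counts."""
        counts = defaultdict(int)
        for key, value in pairs:
            counts[key] += value
        return dict(counts)

    # Usage: count how often each query was searched
    with open("interactions.log") as f:
        query_counts = reduce_phase(map_phase(f))
    print(query_counts)

The appeal of the model is that both phases are embarrassingly parallel: a framework like Hadoop can run the map over shards of the log on many machines and merge the reduced results, with no change to the per-record logic.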
On the visualization side of things, tools like Processing and the Prefuse Toolkit are seeing quick uptake, as are more focused commercial tools like FusionCharts.
Whatever the toolchain turns out to be, those of us with an interest in understanding information have the opportunity to be at the forefront of the change, and if we don't gain expertise in the available options early, we risk being left behind.