How do I get started? A General Solution to Discovery in Big Data
Source: http://www.flickr.com/photos/41829005@N02/6162370327/
I've used the "spreadsheet" as a metaphor for an epiphany -- in this case, combining enabling technologies (cheap PC processing, high-resolution displays and cheap memory) to provide a new metaphor for problem solving. Spreadsheet visual programming is a perfect metaphor for financial analysis because the rows and columns of financial ledgers map crisply to rows and columns on a computer screen. The final essential piece of the "PC Data" revolution arrived when a macro language was built into Lotus 1-2-3 that hadn't been built into VisiCalc. This single feature guaranteed the hegemony of 1-2-3 and spreadsheets, as the macro language made them capable of solving problems outside the domains envisioned by the first spreadsheet's developers.
Before spreadsheets, if you had a problem you could either lay it out on paper, or have a programmer write a specific program to perform the analysis you wanted. "Exploration" and "Discovery" were limited to what you could describe to a developer to program. Life before spreadsheets was brutish and short…
Source: http://appraisalnewsonline.typepad.com/photos/uncategorized/2008/01/08/matrix_data.jpg
So here we are today, at the dawn of the Big Data era. The core toolset is emerging (MapReduce via the Hadoop family of products) and word is spreading that remarkable solutions might be found in data that we formerly thought of as "disposable." The old problem is back, though -- if you (as a manager or executive) want solutions, you'd better go find a programmer. There are steps being taken to bring us spreadsheets for Big Data -- Datameer in particular is bringing spreadsheets to Big Data. Or, more properly, bringing Big Data to spreadsheets. They may move Big Data forward, but there's an impedance mismatch here -- if Big Data naturally fit in the rows and columns of spreadsheets, it would already have made the jump and be found there. If Big Data describes a world beyond rows and columns, then the spreadsheet metaphor will end up fitting Big Data like a bad suit. Sure, we'll have our familiar rows and columns, but like Mozart played on a kazoo, something in the essential nature of the data will be lost.
The answer for Big Data is a spreadsheet conceptually, but with a richer representational metaphor than rows and columns. We want fundamental insights from big data, so our building blocks should match the topologies that we're studying. Here's a first take at what "rows and columns" for Big Data might look like:
- Predictive Modeling -- stripped of scale, are there linear relationships in the data that offer explanatory or predictive value?
- Clustering Partition -- is the data uniformly distributed or clustered, and what can we learn from the clusters?
- N-Dimensional Visualization -- US Supreme Court Justice Potter Stewart famously wrote that he couldn't define hard-core pornography, but "I know it when I see it." Are there visual representations of Big Data that provide insight?
- Outlier Analysis -- does the data follow a predictable distribution (normal, exponential, Poisson, etc.)? If we can fit the data to control charts, what do the outliers on those charts mean?
- A/B Analysis -- the data may be noisy, but can we use it to measure the performance of key variables against each other?
- Markov Chains -- You know the score this far into the game, and your customers' web interactions foreshadow their interests going forward. Where are we heading, and when do we get there?
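To make two of these building blocks a little more concrete, here's a minimal sketch in plain Python of Outlier Analysis (a simple control chart) and Markov Chains (transition probabilities estimated from click paths). Everything in it is an assumption for illustration: the function names, the three-sigma limits, the first-order chain and the made-up sample data. None of it is the architecture described in this post, and at Big Data scale the same counting would run as a distributed job rather than in a single process.

```python
from collections import Counter, defaultdict
from statistics import mean, stdev


def control_chart_outliers(baseline, observations, sigmas=3.0):
    """Return (index, value) pairs from observations outside the control limits.

    The limits are the baseline mean +/- `sigmas` standard deviations, which
    assumes the in-control data is roughly normally distributed.
    """
    center = mean(baseline)
    spread = stdev(baseline)
    lower, upper = center - sigmas * spread, center + sigmas * spread
    return [(i, x) for i, x in enumerate(observations) if x < lower or x > upper]


def transition_probabilities(sessions):
    """Estimate first-order Markov transition probabilities from page sequences."""
    counts = defaultdict(Counter)
    for pages in sessions:
        # Count each (current page -> next page) step in the session.
        for current, following in zip(pages, pages[1:]):
            counts[current][following] += 1
    # Normalize the counts for each state so its outgoing probabilities sum to 1.
    return {
        state: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
        for state, nexts in counts.items()
    }


# Made-up daily order counts: a quiet baseline period, then new observations.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
observed = [101, 99, 140, 100]
print(control_chart_outliers(baseline, observed))

# Made-up click paths through a hypothetical web site.
sessions = [
    ["home", "search", "product", "cart"],
    ["home", "product", "cart"],
    ["home", "search", "search", "product"],
]
model = transition_probabilities(sessions)
print(model["home"])
```

With this sample data the baseline has a mean of 100 and a standard deviation of 2, so the control limits are 94 and 106 and only the value 140 is flagged. The transition table says that two out of three sessions went from "home" to "search", which is the kind of "where are we heading" question the Markov Chains bullet asks.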
These are our rows and columns, and in my next post I'll describe the architecture I'm pursuing to explore them, an architecture built around:
- HDFS for general data storage
- HBase for data management
- Hadoop for unstructured data analysis
- ZooKeeper for task management
- Solr for structured "free text" search
- Thrift for access to external development languages and platforms
- Massive_record to provide ORM-access to all that HBase data
- jQuery for unobtrusive JavaScript and core visual presentation
- SIMILE for advanced visual presentation
- Tableau for advanced visual presentation
- Node.js to serve up all that JavaScript
That's a lot to describe and it'll take some posting to do it, but the ultimate objective never changes -- to provide a sandbox that managers can play with and coax Big Data into giving up its secrets.