Spreadsheets for the New Millennium -- Getting There...
I've written some posts about my hopes for a next generation of computing, about the rise of "Spreadsheets for the New Millennium," here: part 1, part 2, and part 3. Well, it's been a couple of years since I wrote about Spreadsheets, and it's now a decade since our current computational generation began, so let's see what we've got:
Google published their breakthrough paper in late 2004, based on work that had been ongoing since at least 2002. Now 2002 was a great year (though with Nickelback topping the charts we might want to reconsider just how great), but that's more than 10 years ago now, and computing has changed a lot since then. Here are some things that were unknown in 2002 but are commonplace now:
- SSDs — My current Mac has a 768GB SSD, and computer disks have since gone the way of… CRT displays
- Flat-screen displays — HP used to sell monitors that my friend Julie Funk (correctly) called "2 men and a small boy" monitors — because that's what it took to carry one. Nobody misses them now — gone and forgotten
- Multicore processors — I'm still waiting for faster living from my GPU, but Moore's Law still lives on in multicore
- "10Gig-E" networks — I'm old enough to still remember IBM token-ring networks. Now "E" has been replaced by "Gig-E", which is itself headed for the "10Gig-E" boneyard.
- GigaRAM — I did some work for Oracle back in the early 2000s that showed the largest Oracle transactional DB running on about 1TB of memory. That was a lot then, but you can buy machines with a TB of RAM now. Memory is the new disk, and disk is the new tape…
There are still more innovations, but even with what I've listed so far I believe we can safely say that we're not living in the same computational world that Brin & Page found in the early days of Google. "Big Data" has had lots of wins in the technology domain and even some that have reached general public recognition (such as IBM's Jeopardy-playing Watson and Amazon's "customers who bought … also bought" recommendations). Expectations for results from data have risen, and it's time for some new approaches. Technologies from then just don't meet the needs of now …
Hadoop was a terrific advance (basically a Dennis Machine for the rest of us), but by today’s standards it’s clumsy, slow and inefficient. Hadoop brought us parallel computing and Big Data, but has done it through a disk-y solution model that really doesn't "feed the bulldog" now:
- Everything gets written to disk (the new tape), including all the interim steps — and there are lots of interim steps
- Hadoop doesn't really handle intermediate results, so you have to chain lots of jobs together to perform your analysis, which makes the first problem even worse.
- I've written beautiful MapReduce code with Ruby and Hadoop Streaming, but nobody is willing to pay the streaming performance penalty, so we're stuck with the bane of Java MapReduce code: the API is rudimentary, it's difficult to test, and it's hard to confirm your results. Hadoop has spawned a pile of add-on tools such as Hive and Pig that make this easier, but the API problems are fundamental:
- You have to write and test piles of code to perform even modest tasks
- You have to generate volumes of "boilerplate" code
- Hadoop doesn't do anything out of the box. Tackling even modest problems takes a herculean effort in code and configuration.
This brings us to the biggest problem of the now-passing MapReduce era — most haystacks DO NOT have any needles in them! The "Big Data" era is still only just beginning, but if you're looking for needles then lighter, more interactive approaches are already a better way to find them.
The great news is that solutions are emerging that increasingly deliver my long-dreamed-of Spreadsheets for the New Millennium. One of my favorites among these new approaches is Apache Spark and the work evolving out of the Berkeley Data Analytics Stack.
Spark is a nice framework for general-purpose in-memory distributed analysis. I've sung the praises of in-memory before (Life Beyond Hadoop), and in-memory is a silver bullet for real-time computation. Spark is also familiar: you can deploy Spark as a cluster and submit jobs to it, much as you would with Hadoop. Spark also offers Spark SQL (formerly Shark), which brings advances beyond Hive to the Spark environment.
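To make that concrete, here is a minimal sketch of a standalone Spark job in Scala; the class name and HDFS paths are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD operations on older Spark versions

// A self-contained Spark job: read a text file, count words, write the results.
// Every intermediate step stays in memory rather than round-tripping through disk.
object NeedleCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("NeedleCount")
    val sc   = new SparkContext(conf)

    val counts = sc.textFile("hdfs:///data/haystack.txt")    // hypothetical input
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///data/haystack-counts")     // hypothetical output
    sc.stop()
  }
}
```

Package that as a jar and hand it to spark-submit, much as you would submit a Hadoop job to a cluster.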
Many of the major Hadoop vendors have embraced Spark, and it's a strong Hadoop replacement because it tackles the fundamental issues that have plagued Hadoop in the 2010s:
- Hadoop has a single point of failure (the NameNode) — fixed in Hadoop 2, and not an issue for Spark
- Hadoop lacks acceleration features — Spark is in-memory and parallelized and fast
- Hadoop provides neither data integrity nor data provenance — Spark's RDDs (resilient distributed datasets) can be regenerated from their lineage (provenance), and legacy data management can be augmented by tools like Loom in the Hadoop ecosystem
- HDFS stores three copies of all data (basically by brute force) — Spark RDDs are cleaner and faster
- Hive is slow — Spark SQL (with caching) is fast, routinely 10x to 100x faster (a sketch follows this list)
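To illustrate that last point, here is a rough sketch against the Spark 1.x SQLContext API; it assumes an existing SparkContext named sc, and the JSON file, table, and column names are invented for the example:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Load a (hypothetical) JSON event log and expose it as a SQL table.
val events = sqlContext.jsonFile("hdfs:///data/events.json")
events.registerTempTable("events")

// Pin the table in memory: the first query pays the disk cost, later ones don't.
sqlContext.cacheTable("events")

val topUsers = sqlContext.sql(
  "SELECT user, COUNT(*) AS hits FROM events GROUP BY user ORDER BY hits DESC LIMIT 10")
topUsers.collect().foreach(println)
```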
Spark supports both batch and (unlike Hadoop) streaming analysis, so you can use a single framework for real-time exploration as well as batch processing. Spark also introduces a nice Scala-based functional programming model, which offers a gentle on-ramp to the map and reduce patterns at the heart of Hadoop's MapReduce.
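Here is a hedged sketch of what a single framework with a common API can look like in practice: one counting function runs over yesterday's logs in batch and over a socket stream in 10-second micro-batches. The host, port, paths, and log format are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._   // pair-RDD operations on older Spark versions
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchAndStream {
  // One piece of logic, shared by the batch RDD and each streaming micro-batch.
  def errorCounts(lines: RDD[String]): RDD[(String, Int)] =
    lines.filter(_.contains("ERROR"))
         .map(line => (line.split(" ")(0), 1))   // key on the first field
         .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("BatchAndStream"), Seconds(10))
    val sc  = ssc.sparkContext

    // Batch: run over yesterday's logs (hypothetical paths).
    errorCounts(sc.textFile("hdfs:///logs/yesterday"))
      .saveAsTextFile("hdfs:///reports/yesterday-errors")

    // Streaming: apply the same function to each 10-second micro-batch.
    ssc.socketTextStream("localhost", 9999)
       .foreachRDD(rdd => errorCounts(rdd).collect().foreach(println))

    ssc.start()
    ssc.awaitTermination()
  }
}
```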
So Spark is:
- An in-memory cluster computing framework
- Built in Scala, so it runs on the JVM and is compatible with existing Java code and libraries
- 10-100 times faster than Hadoop Map/Reduce because it runs in memory and avoids Hadoop-y disk I/O
- A riff on the Scala collections API, applied to large distributed datasets (see the sketch after this list)
- Batch and stream processing in a single framework with a common API
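The collections-API point is easiest to see side by side; this sketch assumes a SparkContext named sc, as you get for free in the spark-shell:

```scala
// Plain Scala collection: map and group locally.
val words       = List("spark", "hadoop", "spark", "hive")
val localCounts = words.map(w => (w, 1)).groupBy(_._1).mapValues(_.map(_._2).sum)

// Spark RDD: the same verbs, distributed across the cluster.
val rdd       = sc.parallelize(words)
val rddCounts = rdd.map(w => (w, 1)).reduceByKey(_ + _).collect().toMap
```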
Spark really looks like the next step forward. It also satisfies Thomas Kuhn's Structure of Scientific Revolutions requirements for a genuine next step, in that it preserves many of the existing Big Data approaches while simultaneously moving beyond them. Spark has native language bindings for Scala, Python, and Java, and it offers some interesting advances, including a native graph processing library called GraphX and a machine learning library called MLlib (filling the role Mahout fills for Hadoop); a small MLlib sketch follows.
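For a taste of MLlib, here is a hedged sketch of k-means clustering; the feature file, its format, and the parameter choices (5 clusters, 20 iterations) are all invented for the example, and a SparkContext named sc is assumed:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse whitespace-separated numeric features into MLlib vectors and cache them.
val features = sc.textFile("hdfs:///data/features.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// Cluster into 5 groups with up to 20 iterations, then report the cost.
val model = KMeans.train(features, 5, 20)
println(s"Within-cluster sum of squares: ${model.computeCost(features)}")
```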
These are all valuable steps beyond the Toy Elephant, and they give us a great way to find needles while controlling "Needle-less Haystack" risks and costs. So here is our core scenario:
- You have a haystack
- You think there might be a needle (or needles!) in it
- You want to staff and fund a project to find needles — even if you don't know where they are or exactly how to find them
So — do you:
Staff a big project with lots of resources, writing lots of boilerplate code that you'll run slowly in batch mode -- all while praying that magical "needle" answers appear?
or
Start experimenting in real time, with most of your filters and reducers pre-written for you, producing new knowledge and results in one-tenth to one-hundredth the time!
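That second path looks roughly like this in the spark-shell, where sc comes pre-built; the log paths and "needle" predicates are placeholders:

```scala
// Load the haystack once and keep it in memory across experiments.
val haystack = sc.textFile("hdfs:///logs/2014/*").cache()

// First guess at a needle...
haystack.filter(_.contains("timeout")).count()

// ...then refine and re-run in seconds, not in another batch job.
haystack.filter(l => l.contains("timeout") && l.contains("payments")).take(20).foreach(println)
```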
With Spark and Spark SQL, to paraphrase William Gibson, "the future is already here, and it's about to get a lot more evenly distributed!" More on rolling with Spark and Shark / Spark SQL in future postings…