Sunday, September 7, 2014

I Saw Sparks

I've long been a follower of Joel Spolsky and his writings on software development, and some of them (e.g. Can Your Programming Language Do This?) are practically QED for their topics. I think I can go him one better on another similarly terrific piece of his: Smart and Gets Things Done.

You can't argue with "smart" — as a software engineer, manager or executive you have to expect that your discipline and skill set will turn over 99% (the remaining 1% being vi and UNIX commands from the '80s) every 3-4 years. If you're not really smart and really dedicated you can't possibly keep up past a single product cycle.

"Gets Things Done" is similarly dispositive — even the brainiest developer won't get products out if they

  • Stay locked in their Microsoft-mandated individual offices, never talk to anyone and stay alive only because you keep sliding pizzas under their door
  • Are so spectacularly abrasive that they make the rest of your team retreat into their own Microsoft-mandated offices, leading to endless slips in the product schedule (and ever-increasing pizza bills!)

Interpersonal skills tend to be undervalued in Technology, but are essential to getting things done. Even more than pure skill, "Gets Things Done" is a testimony to human grace: it takes a lot of humility to get products out the door. You might be a Putnam Fellow, but without some give and take all that brilliant code will never make it off your machine!

To Spolsky's pair I'd add one more category, one more thing that I look for when I'm hiring or building teams: "Sparks." Sparks are those odd nuggets that pop up on a resume, seemingly unrelated to anything, that indicate the kind of rare gifts that make our world the wonder it is. I once interviewed (and hired!) a fantastic software engineer, a former graduate EE, whose "spark" was that she'd done research work on (and helped write the book on) chinchillas! She was qualified in all the Spolsky ways, but to me the chinchilla book was the clincher. Few are those who do EE research on chinchillas, but only the rarest write the book following that research.

Smart?
Check!
Gets Things Done?
Check!
Chinchillas? Chinchillas??? Chinchillas!

HIRED!

Sparks have brought me some of the best things in my life: my wife Kate and some great friends and co-workers (among them a founding Menudo member, another of the greatest musicians of all time, several rare inventors and scientists, and artists and more…)

Sparks are also one of the things I look for in software development efforts, and thus (for my own efforts, and for my work) I tend to stay away from development approaches and tools that require teams and time that only Cecil B. DeMille could master.

This might be fine for some, but I think that's just too many people to expect the apex of human expression and genius to appear. We don't know what we're doing, but that doesn't stop us from trying, and genius is sometimes the result. As I've written before, the great breakthrough that was Lotus 1-2-3 came from its macro capability, the magic decoder ring that gave spreadsheet users the ability to do things that its inventors might never have dreamed of!

This is what makes Spark such a big win for Big Data: it's light and interactive, and it rewards people who might have that spark of insight, even if they can't afford a 10-geek programming team. In the balance of this post we'll get Spark started, and in my next post we'll go deeper into the wonders that Spark can do.

First, we're going to want to update our Java runtime and JDK environments. There are options in this space now, but as a former Oracle employee (and still an Oracle rooter and fan) I'll head directly over to Larry's site for what we need:

And we're set. I'm running on a Macintosh and I've chosen Java 8 (finally! closures!!), but the Macintosh ships with its own version of Java, so you're going to want to add a little magic to your ~/.bash_profile to make OS X recognize the latest Java. Once that's in, you can run
$ java -version

java version "1.8.0_20"
Java(TM) SE Runtime Environment (build 1.8.0_20-b26)
Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)

from your terminal to confirm that we're all ready with Java.
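
For the curious, the ~/.bash_profile magic mentioned above might look something like the sketch below. This is an assumption on my part, not the exact lines from any particular machine: it leans on /usr/libexec/java_home, the helper that OS X provides for locating installed JDKs, and the variable name jdk8_home is purely illustrative.

```shell
# Hypothetical ~/.bash_profile fragment (a sketch, not the post's exact lines).
# /usr/libexec/java_home is the OS X helper that prints the home directory
# of an installed JDK; "-v 1.8" asks it for a Java 8 install.
if [ -x /usr/libexec/java_home ]; then
    jdk8_home="$(/usr/libexec/java_home -v 1.8 2>/dev/null)"
    if [ -n "$jdk8_home" ]; then
        # Point JAVA_HOME at the JDK that was found and put its tools first on the PATH
        export JAVA_HOME="$jdk8_home"
        export PATH="$JAVA_HOME/bin:$PATH"
    fi
    # jdk8_home is only a scratch variable; don't leave it in the login shell
    unset jdk8_home
fi
```

Open a new terminal (or run source ~/.bash_profile) before checking java -version again. On a machine without the helper the fragment quietly does nothing, so it should be safe to carry between systems.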

Next comes an installation of (or update to) Hadoop. I know I've spent most of my past four Big Data posts moaning about Hadoop's batch-y, non-interactive style, but for data that really is embarrassingly parallel it's the tool you need. Cloudera has taken a lot of the adventure out of Hadoop installations, but for my Mac I'm grateful to Yi Wang's Terrific Tech Note on Hadoop for Mac OS X. I've installed Hadoop 2.4.1, and the tech note covers the installation and does a nice job of getting started with the core-site.xml, hdfs-site.xml, and yarn-site.xml setup as well. Again, once you can run

$ hadoop jar ./hadoop-mapreduce-examples-2.4.1.jar wordcount LICENSE.txt out

from the MapReduce Examples folder you're set. Now, for Sparks and the show we've all been waiting for: