Spark 1.1 live - from Kitty Hawk to Infinity (and beyond...)
"The credit belongs to the man who is actually in the arena … who at the worst, if he fails, at least fails while daring greatly, so that his place shall never be with those cold and timid souls who neither know victory nor defeat.”
~ Theodore Roosevelt
It's not fair to be too hard on technological pioneers; the path to great progress is often marked with fine innovations that are trumpeted as "better than sliced bread", even if hindsight shows them to be merely VHS ("better than Beta"), a humble step on the road to DVDs and then digital video.
So it has been with Big Data technologies; Big Data has done great things for my Stanford classmate Omid Kordestani at Google; even if Google doesn't use MapReduce anymore, it was still a milestone on our path, not just to the "Internet of Things" but to the hopefully-coming "Internet of Us."
So it's not surprising that Big Data is taking a pounding these days, exemplified by machine learning's Michael Jordan decrying the Delusions of Big Data. This is par for the course; even as advanced analytics becomes too big to simply dismiss, the techniques are still subject to the ills that technology, like flesh, is heir to. Welcome to the human condition.
Jordan notes:
- some results will be white noise
- some results will be overhyped, but
- some genuinely valuable results may be overlooked
These are all true - this is the imperfect world we inhabit. I still see great possibilities in big data, and my take on Jordan's comments falls somewhere between physicist Niels Bohr:
"The opposite of a great truth is also true."
and an unknown writer (possibly Vonnegut), who opined:
"A pioneer in any field, if he stays long enough in the field, becomes an impediment to progress in that field…"
Progress changes everything. We must try to imagine the mindset of a Henry Ford, advancing manufacturing processes to put automobiles in the hands of all of his employees, even if he lacked the gasoline to power them, gas stations to fill them, or even paved roads to drive them on. The first models were technological marvels of their age, but that doesn't mean we can't laugh at them now.
So it is with the advances of big data technologies. I might reasonably agree with both Jordan and Michael Stonebraker that Hadoop, the darling of the first Data Age, is not just a yellow elephant but has some of the characteristics of a white elephant as well.
I've written about the foibles of Hadoop before. Hadoop is (and continues to advance as) a terrific technology for working with embarrassingly parallel data, but in a real-time world its drawbacks are like a manual crank on a car: it may work, but it's not what everybody (anybody?) would choose going forward. Here's what's wrong:
- Limited acceleration options
- Poor data provenance
- Disk (not RAM) based triply-redundant storage — bulky and slow
- Slow (Hive) support for SQL — the data query language that everybody knows
Fortunately, the next step in this technological evolution, Spark, has reached version 1.1.0 since I last wrote, and it can solve all of these problems, so let's go get it. Spark can be downloaded from the Spark download site:
Once we've downloaded the latest Spark tar file, we can un-tar it and set it up:
$ cd ~/Downloads
$ curl -O http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0.tgz
$ cd /usr/local
$ sudo tar xvf ~/Downloads/spark-1.1.0.tgz
$ cd spark-1.1.0
Got it! Now let's try running Spark 1.1.0…
$ ./bin/run-example SparkPi 10
Failed to find Spark examples assembly in /Users/jkrepko/src/spark-1.1.0/lib or /Users/jkrepko/src/spark-1.1.0/examples/target
Whoops — spoke too soon. Let's build Spark, starting with Hadoop and including Scala and any of the other tools we'll need.
First, let's install Hadoop 2.4.1 by downloading our choice of Hadoop version from our chosen download mirror.
Once the Hadoop 2.4.1 download is complete, we untar it into /usr/local and symlink it:
$ cd /usr/local
$ sudo tar xvf $HOME/Downloads/hadoop-2.4.1.tar
$ sudo ln -s hadoop-2.4.1 hadoop
Now let's check our user and group so we can set the ownership of the installed files:
$ ls -ld $HOME
which for me gives
drwxr-xr-x+ 127 jkrepko staff 4318 Oct 20 09:43 /Users/jkrepko
Let's set that same ownership on our Hadoop install and we can roll on from here:
$ sudo chown -R jkrepko:staff hadoop-2.4.1 hadoop
We can then check the changes with
$ ls -ld hadoop*
which for me gives
lrwxr-xr-x 1 jkrepko staff 12 Oct 21 10:04 hadoop -> hadoop-2.4.1
drwxr-xr-x@ 12 jkrepko staff 408 Jun 21 00:38 hadoop-2.4.1
We'll want to update our ~/.bashrc file to make sure our HADOOP_HOME and other key globals are set correctly:
export HADOOP_PREFIX="/usr/local/hadoop"
export HADOOP_HOME="${HADOOP_PREFIX}"
export HADOOP_COMMON_HOME="${HADOOP_PREFIX}"
export HADOOP_CONF_DIR="${HADOOP_PREFIX}/etc/hadoop"
export HADOOP_HDFS_HOME="${HADOOP_PREFIX}"
export HADOOP_MAPRED_HOME="${HADOOP_PREFIX}"
export HADOOP_YARN_HOME="${HADOOP_PREFIX}"
export "PATH=${PATH}:${HADOOP_PREFIX}/bin:${HADOOP_PREFIX}/sbin"
export SCALA_HOME=/usr/local/bin/scala
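To confirm the new settings have taken effect, we can source the file and spot-check a value or two (a quick sanity check; your output will vary with your setup):
$ source ~/.bashrc
$ echo $HADOOP_HOME
/usr/local/hadoop
$ hadoop version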
Now that Hadoop is installed, we can walk through the .sh and .xml files to ensure that our Hadoop installation is configured correctly. These are all routine Hadoop configurations. We'll start with hadoop-env.sh — comment out the first HADOOP_OPTS, and add the following line:
vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh
## export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
Next up are our updates to core-site.xml
vi /usr/local/hadoop/etc/hadoop/core-site.xml
Here we'll add the following lines to the configuration:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/Cellar/hadoop/hdfs/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
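The hadoop.tmp.dir above points at a directory that probably doesn't exist yet, so it's worth creating it and taking ownership before we format HDFS later on (the path matches the config above; substitute your own user and group for jkrepko:staff):
$ sudo mkdir -p /usr/local/Cellar/hadoop/hdfs/tmp
$ sudo chown -R jkrepko:staff /usr/local/Cellar/hadoop/hdfs/tmp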
Next up is our mapred-site.xml.
vi /usr/local/hadoop/etc/hadoop/mapred-site.xml
Immediately following installation this file may not exist yet; we can either copy and edit the mapred-site.xml.template file (a copy command is shown after the snippet below), or simply create mapred-site.xml with the following contents:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9010</value>
  </property>
</configuration>
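If you'd rather start from the template mentioned above, copying it into place and then editing it works just as well:
$ cd /usr/local/hadoop/etc/hadoop
$ cp mapred-site.xml.template mapred-site.xml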
Our final configuration file is hdfs-site.xml — let's edit it as well:
$ vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Add the following inside the <configuration> element:
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
Finally, to start and stop Hadoop, let's add the following aliases to our ~/.profile or ~/.bashrc file:
$ vi ~/.profile
alias hstart="$HADOOP_HOME/sbin/start-dfs.sh;$HADOOP_HOME/sbin/start-yarn.sh"
alias hstop="$HADOOP_HOME/sbin/stop-yarn.sh;$HADOOP_HOME/sbin/stop-dfs.sh"
And source the file to make hstart and hstop active
$ source ~/.profile
Before we can run Hadoop, we first need to format HDFS (in Hadoop 2.x, hdfs namenode -format is the non-deprecated form of the same command) using
$ hadoop namenode -format
This yields a lot of configuration messages ending in
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at jkrepko-2.local/10.0.0.153
************************************************************/
Just as housekeeping, if you haven't done so already you must make your ssh keys available. I already have keys (which can otherwise be generated with ssh-keygen), so I just need to add:
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
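If you don't already have a key pair, one can be generated first; for a local pseudo-distributed setup a passphrase-less key is the usual shortcut (adjust to your own security preferences):
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
On a Mac you'll also need Remote Login enabled under System Preferences > Sharing before ssh localhost will connect.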
I can then confirm that ssh is working with:
$ ssh localhost
$ exit
We can now start Hadoop with
$ hstart
Let's see how our Hadoop system is running by entering
http://localhost:50070
Bravo! Hadoop 2.4.1 is up and running. Port 50070 gives us a basic heartbeat.
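We can also check from the command line: jps (which ships with the JDK) lists the running Java processes, and we should see the NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager daemons among them:
$ jps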
We've started Hadoop, and we can stop it with
$ hstop
Now that Hadoop is installed, we can build Spark, but first we have to set up our Maven properties. Add this to ~/.bashrc:
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
Now we can build Spark with its bundled sbt build tool. First, let's make sure we have version 2.10 or later of Scala installed; on a Macintosh (my base machine here) this is a simple Homebrew install command:
$ brew install scala
Now we can run the Spark build utility:
$ SPARK_HADOOP_VERSION=2.4.1 sbt/sbt assembly
NOTE: SPARK_HADOOP_VERSION is deprecated, please use -Dhadoop.version=2.4.1
The build succeeded, but with the deprecation warning we might do better in the future with something more like:
$ sbt/sbt assembly -Dhadoop.version=2.4.1
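Since we already exported MAVEN_OPTS, building with Maven instead of sbt is another option. Something along these lines should produce an equivalent assembly (the hadoop-2.4 profile and flags follow Spark's build documentation for this release, so double-check them against your version):
$ mvn -Phadoop-2.4 -Dhadoop.version=2.4.1 -DskipTests clean package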
We're now LIVE on Spark 1.1.0! Before we start, let's turn the logging down a bit. We can do this by copying the conf/log4j.properties.template file to conf/log4j.properties and editing that copy. The template's default setting is:
log4j.rootCategory=INFO, console
Let's lower the log level so that we only see WARN messages and above. Here we change the rootCategory as such:
log4j.rootCategory=WARN, console
Now we're live and can run some examples! Time for some Pi…
$ ./bin/run-example SparkPi 10
Pi is roughly 3.140184
Not bad, but let's bump up the precision a bit:
$ ./bin/run-example SparkPi 100
Pi is roughly 3.14157
Mmmmmnnnn, Mmmmmnnnn good!
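For a quick interactive check beyond the bundled examples, the Spark shell works nicely too. Here's a minimal sketch that parallelizes a small range of numbers and counts the evens (the names are just for illustration):
$ ./bin/spark-shell
scala> val data = sc.parallelize(1 to 1000)
scala> data.filter(_ % 2 == 0).count()
which should come back with 500.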
There are lots of other great emerging Spark examples, but we're up and running here and we'll stop for now.
It's a long road from Kitty Hawk to the (sadly missed) Concorde or the 787, and we won't get there in just one step. In my next post I'll lay out the toolkit we have today that should take Big Data from the sandy shores of North Carolina and a team of crazy bike guys (who should never have beaten Samuel Langley to first flight, but did anyway!) to Lindbergh crossing the Atlantic, and maybe even to the DC-3 — the airplane that brought air travel (big data?) to everyone.