Monday, Feb 14, 2011

Casi Casi ... Cassandra

I've written a couple of times about the "N+1 Queries" problem and I've suggested that it's a bane to relational databases. But there's a way out of it -- let me tell you about it.

But first let's wallow a bit in it. I'm in Twitter, I've written a tweet and I'm ready for it to be sent out to all of my (countless) followers... Here's what my code for that broadcast might look like:

All fine so far -- that's a Rubyish take on the twittery world we all live in. I can send out my breathless message about what I had for breakfast, and then Twitter picks it up and broadcasts the message from me (as well as all the messages from the other tweeters):

So here we're going to run a query for each of the X tweeters, and then, for each of them, another query for each of their Y followers.
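As a rough illustration of that pattern (not the listing from the original post), here is a minimal Ruby sketch. `FakeDB`, `followers_of`, the table names, and the sample users are all hypothetical stand-ins: an in-memory object that simply counts how many times it is asked to run a SELECT.

```ruby
# Hypothetical in-memory stand-in for a relational database.
# Every call to #select counts as one SELECT statement.
class FakeDB
  attr_reader :query_count

  def initialize(tables)
    @tables = tables
    @query_count = 0
  end

  # Return the rows of +table+ for which the block is true.
  def select(table)
    @query_count += 1
    @tables.fetch(table).select { |row| yield(row) }
  end
end

# Look up a tweeter by name, then fetch each follower one query at a time.
def followers_of(db, name)
  me    = db.select(:users)   { |u| u[:name] == name }.first
  links = db.select(:follows) { |f| f[:user_id] == me[:id] }
  # One more SELECT per follower: this loop is the "N" in N+1.
  links.map do |link|
    db.select(:users) { |u| u[:id] == link[:follower_id] }.first
  end
end

# Made-up sample data for the sketch.
db = FakeDB.new({
  users: [
    { id: 1, name: "mudcat" },
    { id: 2, name: "alice" },
    { id: 3, name: "bob" }
  ],
  follows: [
    { user_id: 1, follower_id: 2 },
    { user_id: 1, follower_id: 3 }
  ]
})

names = followers_of(db, "mudcat").map { |u| u[:name] }
puts names.inspect
puts "SELECTs issued: #{db.query_count}"
```

With two followers this issues four SELECTs: one to find the tweeter, one for the follow rows, and one per follower. The count grows with every new follower.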

Code smell! Fail Whale!!!

(particularly when you consider Dare Obasanjo's take on Twitter combinatorics)

The problem here is relational: we need a SELECT to find me, and then a new SELECT to get the info on each of my followers. This "N+1 SELECTs" problem is a simplified version of a real problem, one where relational databases stagger and where column-oriented databases are much closer to what we're looking for. Column-oriented databases are designed to be fast at grabbing all of the attributes (columns) associated with a given entity. To understand why this is vital for Twitter or any other social application, consider the one-to-manys: Twitter has many tweeters, who have many followers, who themselves have many followers... and so on.
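To make the "grab all the columns for one entity" idea concrete, here is a small Ruby sketch under stated assumptions. `FakeColumnStore` and `follower_names` are hypothetical, the store is just an in-memory hash rather than a real database, and the layout (one row per user, one column per follower, with the follower's name denormalized into the column value) is one possible design, not a prescribed one.

```ruby
# Hypothetical wide-row store: the row key is a user id, and each
# follower is kept as one column in that user's row.
class FakeColumnStore
  attr_reader :read_count

  def initialize
    @rows = Hash.new { |hash, key| hash[key] = {} }
    @read_count = 0
  end

  # Add (or overwrite) columns in the row stored under +key+.
  def insert(key, columns)
    @rows[key].merge!(columns)
  end

  # One keyed read returns every column stored under +key+.
  def get(key)
    @read_count += 1
    @rows.fetch(key, {}).dup
  end
end

# All follower names come back from a single read of the user's row.
def follower_names(store, user_id)
  store.get(user_id).values
end

store = FakeColumnStore.new
store.insert("5", { "follower:2" => "alice" })
store.insert("5", { "follower:3" => "bob" })

puts follower_names(store, "5").inspect
puts "reads issued: #{store.read_count}"
```

However many followers the row holds, this sketch reads it once, by key. The price is that follower names are copied into the row and have to be kept up to date on the write side.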

Let's think, though, about the code that gets generated when I tweet. If we're using a relational database we'll follow a SELECT for each of my followers with a SELECT for each of their followers -- so we've got a polynomial number of SELECTs grinding away for each tweet, and as I get more popular the disks whirr and the lights dim every time I tweet about anything.

So to save the power grid let's try a little Twitter application, but this time using the column-oriented data store Cassandra to handle our users and tweets.

I'll run this from the same Amazon Cloud instance that I've used for my previous postings. So, in my terminal connected to Amazon, I enter:

sudo gem install cassandra

I've already put Java on my base instance, so I'm just about good to go! A single-line command, and it really does run...

Now, let's start Twitter and get tweeting. We'll use the Ruby interpreter IRB on Amazon to enter our users and their tweets:

root@ip-10-245-133-190:/var/www/apps# irb

We're rolling -- first we'll enter our requirements: rubygems to run our additional toys, cassandra to link to the data store we just installed, and SimpleUUID to identify our tweeters:

Now we'll start Twitter in Cassandra, and put in some users and screen names (I've mostly left the Cassandra responses out for brevity here):

Great so far -- we have user 5, "mudcat," and we've given him a tweet. Let's give him someone to tweet to:

And there we are -- we have a reasonable data model for Twitter, backed by the Cassandra data store. Let's review what we've got here:

Cassandra works as a kind of multidimensional hash, and the data it contains can be referenced as:

  • A keyspace
  • A column family
  • An optional super column
  • A column, and
  • A key
    

Source: http://ksivakarthikeyan.blogspot.com/

Here's what these all mean:

The keyspace is the highest, most abstract level of organization. Our Cassandra conf/storage-conf.xml file contains our keyspace definitions at startup.

The column family is the chunk of data that corresponds to a particular key. In Cassandra each column family is stored in a separate file on disk, so data that is frequently accessed together should be kept in the same column family for fastest access. Column families are also defined at startup.

A super column is a named list, containing standard columns stored in recency order.

A column is a tuple, a key-value pair with a key (name) and a value.

A key is the permanent name of the record, and keys are defined on the fly.

With this structure we're basically defining a schema, and I'd like to claim it's original, but this one was taken from Twissandra by Eric Florenzano.
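One way to picture the "multidimensional hash" is as plain nested Ruby hashes. The sketch below is only an illustration, not the Cassandra client API. The column family names (`Users`, `Statuses`, `UserRelationships`), the tweet text, and the `"t1"` column name (standing in for a time-based UUID) are assumptions made for the example, and `lookup` is a hypothetical helper.

```ruby
# Keyspace -> column family -> key -> column,
# or keyspace -> column family -> key -> super column -> column.
TWITTER = {
  "Users" => {                              # column family
    "5" => { "screen_name" => "mudcat" }    # key => columns
  },
  "Statuses" => {                           # column family
    "1" => { "user_id" => "5", "text" => "placeholder tweet text" }
  },
  "UserRelationships" => {                  # column family holding super columns
    "5" => {
      "user_timeline" => { "t1" => "1" }    # super column => columns
    }
  }
}.freeze

# Walk the nested hash one level at a time. Each step is a keyed
# lookup, and a missing name raises KeyError instead of returning nil.
def lookup(keyspace, column_family, key, *path)
  row = keyspace.fetch(column_family).fetch(key)
  path.reduce(row) { |node, name| node.fetch(name) }
end

puts lookup(TWITTER, "Users", "5", "screen_name")
puts lookup(TWITTER, "UserRelationships", "5", "user_timeline").inspect
```

Calling `lookup` with no extra path returns the whole row, which is the "all the columns for one key" access described earlier. Adding a super column name and a column name drills down one level each.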

The great thing about Cassandra is that it evolved to solve real-world problems, and that while it may be free-form it is NOT exactly schema-less. Cassandra may fall in the "NoSQL" class with Hadoop, but the use cases that apply to it could scarcely be more different. Runtime lookups can be handled really well in Cassandra, thanks to its low-latency organization and strict definition. Asynchronous analytics, which can tolerate high latency and demand greater flexibility, are a better fit for systems like Hadoop.

Cassandra generally offers terrific performance. There is a tradeoff in eventual consistency, something that perhaps I'll take up in my next blog post.