Cassandra

Eben Hewitt

Mentioned 6

A guide to Apache Cassandra that covers such topics as writing, updating, and reading Cassandra data; adding or removing nodes from the cluster; using the JMX interface to monitor a cluster's usage; and tuning memory settings and data storage for better performance.

Mentioned in questions and answers.

For a bit of background - this question deals with a project that runs on a single small EC2 instance and is about to migrate to a medium one. The main components are Django, MySQL and a large number of custom analysis tools written in Python and Java, which do the heavy lifting. The same machine is running Apache as well.

The data model looks like the following - a large amount of real time data comes in streamed from various networked sensors, and ideally, I'd like to establish a long-poll approach rather than the current poll every 15 minutes approach (a limitation of computing stats and writing into the database itself). Once the data comes in, I store the raw version in MySQL, let the analysis tools loose on this data, and store statistics in another few tables. All of this is rendered using Django.

Relational features I would need -

  • Order by [SliceRange in Cassandra's API seems to satisfy this]
  • Group by
  • Many-to-many relations between multiple tables [Cassandra SuperColumns seem to do well for one-to-many]
  • Sphinx on this gives me a nice full-text engine, so that's a necessity too. [On Cassandra, the Lucandra project seems to satisfy this need]
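
As a rough illustration of the "order by" item: within a single Cassandra row, columns are kept sorted by column name, so a slice over a range of names comes back already ordered. The sketch below only models that idea in plain Python. `SortedRow`, its `slice` method, and the sample sensor readings are hypothetical names and data for illustration, not the Thrift SliceRange API or any client library.

```python
import bisect


class SortedRow:
    """Toy model of one wide row whose columns stay sorted by name (not a client API)."""

    def __init__(self):
        self._names = []   # column names, always kept in sorted order
        self._values = {}  # column name -> value

    def insert(self, name, value):
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start, finish, count=100, reversed_=False):
        """Return up to `count` (name, value) pairs with start <= name <= finish."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, finish)
        names = self._names[lo:hi]
        if reversed_:
            names = names[::-1]
        return [(n, self._values[n]) for n in names[:count]]


# Sensor readings keyed by timestamp; insertion order is deliberately shuffled.
row = SortedRow()
for ts, reading in [(1300, 21.5), (1100, 20.9), (1200, 21.1), (1400, 22.0)]:
    row.insert(ts, reading)

latest_two = row.slice(1100, 1400, count=2, reversed_=True)
print(latest_two)  # [(1400, 22.0), (1300, 21.5)]
```

The point of the model is that the ordering is paid for at write time, so an "order by timestamp, limit 2" style read is just a bounded walk over already-sorted names.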

My major problem is that data reads are extremely slow (and writes aren't that hot either). I don't want to throw a lot of money and hardware at it right now, and I'd prefer something that can scale easily with time. Vertically scaling MySQL is not trivial in that sense (or cheap).

So essentially, after having read a lot about NoSQL and experimented with things like MongoDB, Cassandra and Voldemort, my questions are:

  • On a medium EC2 instance, would I gain any benefits in reads/writes by shifting to something like Cassandra? This article (pdf) definitely seems to suggest that. Currently, I'd say a few hundred writes per minute would be the norm. For reads - since the data changes every 5 minutes or so, cache invalidation has to happen pretty quickly. At some point, it should be able to handle a large number of concurrent users as well. The app performance currently gets killed on MySQL doing some joins on large tables even if indexes are created - something on the order of 32k rows takes more than a minute to render. (This may be an artifact of EC2 virtualized I/O as well.) Size of tables is around 4-5 million rows, and there are about 5 such tables.

  • Everyone talks about using Cassandra on multiple nodes, given the CAP theorem and eventual consistency. But, for a project that is just beginning to grow, does it make sense to deploy a one-node Cassandra server? Are there any caveats? For instance, can it replace MySQL as a backend for Django? [Is this recommended?]

  • If I do shift, I'm guessing I'll have to rewrite parts of the app to do a lot more "administrivia" since I'd have to do multiple lookups to fetch rows.

  • Would it make any sense to just use MySQL as a key-value store rather than a relational engine, and go with that? That way I could utilize the large number of stable APIs available, as well as a stable engine (and go relational as needed). (Bret Taylor's post from FriendFeed on this - http://bret.appspot.com/entry/how-friendfeed-uses-mysql)
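
To make the last question a little more concrete, here is a minimal sketch of using a SQL engine as a key-value store, loosely in the spirit of that idea: each entity is stored as an opaque JSON blob under its key, and a separate lookup table is maintained by hand. Everything here is an assumption for illustration: `sqlite3` stands in for MySQL so the example runs anywhere, and the table names (`entities`, `index_sensor`), the helper functions, and the sample data are made up.

```python
import json
import sqlite3

# sqlite3 stands in for MySQL here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entities (id TEXT PRIMARY KEY, body TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE index_sensor ("
    "sensor TEXT, entity_id TEXT, PRIMARY KEY (sensor, entity_id))"
)


def put(entity_id, entity):
    """Store the whole entity as a JSON blob and update the lookup table by hand."""
    with conn:  # commits both statements together
        conn.execute(
            "INSERT OR REPLACE INTO entities VALUES (?, ?)",
            (entity_id, json.dumps(entity)),
        )
        conn.execute(
            "INSERT OR REPLACE INTO index_sensor VALUES (?, ?)",
            (entity["sensor"], entity_id),
        )


def get(entity_id):
    """Fetch one entity by key, or None if the key is unknown."""
    row = conn.execute(
        "SELECT body FROM entities WHERE id = ?", (entity_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


def ids_for_sensor(sensor):
    """Use the hand-maintained index instead of a join on the entity table."""
    rows = conn.execute(
        "SELECT entity_id FROM index_sensor WHERE sensor = ? ORDER BY entity_id",
        (sensor,),
    )
    return [r[0] for r in rows]


put("r1", {"sensor": "s-17", "value": 20.9})
put("r2", {"sensor": "s-17", "value": 21.4})
put("r3", {"sensor": "s-42", "value": 3.2})
print(get("r2"), ids_for_sensor("s-17"))
```

The trade-off is the same "administrivia" mentioned above: the application, not the database, keeps the index table consistent (this sketch, for example, does not clean up old index rows if an entity's sensor changes).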

Any insights from people who've done a shift would be greatly appreciated!

Thanks.

If you're a relational database developer (as I am), I'd suggest/point out:

  • Get some experience working with Cassandra before you commit to its use on a production system... especially if that production system has a hard deadline for completion. Maybe use it as the backend for something unimportant first.
  • It's proving more challenging than I'd anticipated to do simple things that I take for granted about data manipulation using SQL engines. In particular, indexing data and sorting result sets is non-trivial.
  • Data modelling has proven challenging as well. As a relational database developer you come to the table with a lot of baggage... you need to be willing to learn how to model data very differently.

These things said, I strongly recommend building something in Cassandra. If you're like me, then doing so will challenge your understanding of data storage and make you rethink a relational-database-fits-all-situations outlook that I didn't even realize I held.

Some good resources I've found include:

Why is using HBase a better choice than using Cassandra with Hadoop?

Can anyone please give a detailed explanation on this?

Thanks

I don't think either is better than the other, and it's not a case of just one or the other. These are very different systems, each with its own strengths and weaknesses, so it really depends on your use cases. They can definitely be used to complement one another in the same infrastructure.

To explain the difference better I'd like to borrow a picture from Cassandra: The Definitive Guide, where they go over the CAP theorem. What they say is basically that for any distributed system, you have to find a balance between consistency, availability and partition tolerance, and you can only realistically satisfy 2 of these properties. From that you can see that:

  • Cassandra satisfies the Availability and Partition Tolerance properties.
  • HBase satisfies the Consistency and Partition Tolerance properties.

[Figure: CAP theorem diagram]

When it comes to Hadoop, HBase is built on top of HDFS, which makes it pretty convenient to use if you already have a Hadoop stack. It is also supported by Cloudera, which is a standard enterprise distribution for Hadoop.

But Cassandra also has more integration with Hadoop, namely DataStax Brisk, which is gaining popularity. You can also now natively stream data from the output of a Hadoop job into a Cassandra cluster using a Cassandra-provided output format (BulkOutputFormat, for example). We are past the point where Cassandra was just a standalone project.

In my experience, I've found that Cassandra is awesome for random reads, and not so much for scans.

To add a little color to the picture, I've been using both at my job in the same infrastructure, and HBase has a very different purpose than Cassandra. I've used Cassandra mostly for very fast real-time lookups, while I've used HBase more for heavy ETL batch jobs with less strict latency requirements.

This is a question that would truly be worthy of a blog post, so instead of going on and on I'd like to point you to an article which sums up a lot of the key differences between the 2 systems. Bottom line is, there is no superior solution IMHO, and you should really think about your use cases to see which system is better suited.

I would like to learn Cassandra.

Unfortunately, the few tutorial posts I could find either refer to an old Cassandra version (prior to 1.0) and/or require a somewhat complicated setup, like installing twissandra.

So, I wonder if anyone knows a resource to learn Cassandra without having to install anything, except Cassandra itself, of course.

My setup:

  • Windows 7 (should not matter, right?)
  • Cassandra 1.2.0 (installed using the binary installer from DataStax)
  • OpsCenter (courtesy of DataStax)

I am pretty comfortable with MongoDB and have some experience with MySQL, though it seems that Cassandra is like neither of them.

Anyone?

I'd recommend the book "Cassandra: The Definitive Guide" by Eben Hewitt: http://www.amazon.com/Cassandra-Definitive-Guide-Eben-Hewitt/dp/1449390412

It contains the basics for the database and also for no-SQL modeling.

I also found this resource quite useful while understanding Cassandra configuration parameters: http://www.ecyrd.com/cassandracalculator/

Of course, you won't be able to survive without http://www.datastax.com/docs

I'm planning to start a project with NoSQL for data storage. I was trying to find information about Cassandra on Google, but I've found only very basic info. Does anyone know where I can find a good source of knowledge about Cassandra (planning the data structure, working with data, maybe migrating from MySQL, etc.)?

The best source of information is the Cassandra wiki at http://wiki.apache.org/cassandra/.

There's also an O'Reilly book, Cassandra: The Definitive Guide, but this is for Cassandra 0.7, so is a bit out of date now.

DataStax has comprehensive Cassandra documentation at http://www.datastax.com/docs/1.0/index.

"Cassandra High Performance Cookbook" is a decent book. O'Reilly's "Definitive Guide" is actually even worse than a 0.7 book; it's a mix of 0.6, 0.7, and stuff that was cut from 0.7 before release.

Let's assume I have a keyspace with a column family that stores user objects and the key of these objects is the username.

How can I use Hector to get a list of users sorted by username?

I tried to use a RangeSlicesQuery; paging works fine with this query, but the results are not sorted in any way.

I'm an absolute Cassandra beginner, can anyone point me to a simple example that shows how to sort a column family by key? Please ask if you need more details on my efforts.

Edit:

The result was not sorted because I used the default RandomPartitioner instead of the OrderPreservingPartitioner in cassandra.yaml.

Probably it's better not to rely on the sorting by key but to use a secondary index.

Quoting Cassandra - The Definitive Guide

Column names are stored in sorted order according to the value of compare_with. Rows, on the other hand, are stored in an order defined by the partitioner (for example, with RandomPartitioner, they are in random order, etc.)

I guess you are using RandomPartitioner which

... return data in an essentially random order.

You should probably use OrderPreservingPartitioner (OPP) where

Rows are therefore stored by key order, aligning the physical structure of the data with your sort order.

Be aware of the inefficiency of OPP.
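
To see why a range query over row keys looks unsorted under RandomPartitioner, the sketch below orders the same usernames two ways: by an MD5-derived token (RandomPartitioner places rows by an MD5 hash of the key) and by the key itself (what an order-preserving partitioner would do). It is only an approximation for illustration: `random_partitioner_token` is a made-up helper, and the exact token arithmetic Cassandra uses may differ in detail.

```python
import hashlib


def random_partitioner_token(key):
    """MD5-based token, approximating how RandomPartitioner places a row key."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


usernames = ["alice", "bob", "carol", "dave", "erin"]

# Order in which a range scan would walk the rows under each scheme.
hashed_order = sorted(usernames, key=random_partitioner_token)
key_order = sorted(usernames)

print("token order:", hashed_order)
print("key order:  ", key_order)
```

The token order is stable from run to run, which is why paging works, but it has no relationship to the alphabetical order of the usernames.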


(edit on Mar 07, 2014)
Important:

This answer is very old now.

The partitioner is a system-wide setting, which you set in cassandra.yaml. See this doc. Again, OPP is highly discouraged. This document is for version 1.1, where you can see that OPP is already deprecated, and it has likely been removed from the latest version. If you do want to use OPP, you may want to revisit the architecture.

NoSQL databases & particularly Cassandra have created a lot of buzz with their high scalability promises at cheaper costs.

There is a lot of buzz regarding Cassandra's adoption by social networking majors like Facebook, Twitter, and Digg. But the fact is that Facebook is no longer considering Cassandra for its recent projects, and Facebook never relied on Cassandra enough to ditch MySQL, even though it is still struggling hard with MySQL where Cassandra could have been a good fit for its models.

Even Twitter stepped back from its plans to move to a Cassandra cluster.

Also, Digg hasn't been very successful with its Cassandra implementation (though it is not clear who is to blame for this).

With this, no big players are left who are proud to be playing with Cassandra..!!

It is still in the alpha stage and has a small community, so should Cassandra be considered for production environments for big projects? For a social networking site, which database solution, MySQL or Cassandra, would be:

  1. easier to build on, maintain and administer
  2. offers good performance
  3. cheaper solution
  4. future proof (in terms of scalability, reliability, etc)
  5. less human administration required.

Amongst all above I majorly doubt its reliability.... Am I risking my data with Cassandra!!???

any other advice you can give ?

Not sure if I can convince you. But, I am working on a project that uses Cassandra. Cassandra is not the complete solution but it is very fast and it is good for grouped information.

We have offloaded all the read-write-intensive data to Cassandra, and the data that is less in demand and does need relational integrity is still in MySQL (on top of which there is Memcached). I guess Facebook must also have an amalgam of MySQL, Cassandra, and Memcached. At least that's what I guess.

To answer your questions (based on my short experience with Cassandra and MySQL):

  1. MySQL is traditional and you can build on top of it easily. Cassandra (or any new NoSQL approach) needs to be assimilated, and sometimes you find the terminology conflicting. So, MySQL wins here.
  2. Performance-wise, Cassandra wins (read-write performance).
  3. If you are talking about hardware, I am unsure. But I guess, hardware-wise, one MySQL master + four slaves is about the same as 4 Cassandra nodes. But I honestly do not know.
  4. Scalability: Cassandra. Reliability: MySQL. If you read the Cassandra docs, they say it's eventually consistent. But I have not tested the reliability of Cassandra. By the way, by mentioning "eventual consistency" under reliability, I do not mean that it's unreliable. I mean that at any given time, you might not be sure whether a node is up to date and has all the updates.
  5. There are a lot of automated tools for DB management and alteration for MySQL, but not so many for Cassandra. So, MySQL wins here. But I guess tools for Cassandra will be available soon.
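
The eventual-consistency caveat in item 4 can be made concrete with a little arithmetic. Assuming each row is stored on N replicas, a write waits for W of them to acknowledge, and a read consults R of them, a read is guaranteed to touch at least one replica holding the latest write whenever R + W > N. The helper names below are made up for illustration, and the consistency-level names in the comments are only meant as a rough mapping.

```python
def read_sees_latest_write(n_replicas, w_acks, r_replicas):
    """True when every read set must overlap every write set (R + W > N)."""
    return r_replicas + w_acks > n_replicas


def quorum(n_replicas):
    """Smallest majority of n_replicas."""
    return n_replicas // 2 + 1


n = 3
print(read_sees_latest_write(n, 1, 1))                  # False: write ONE, read ONE may hit a stale replica
print(read_sees_latest_write(n, quorum(n), quorum(n)))  # True: 2 + 2 > 3
print(read_sees_latest_write(n, n, 1))                  # True: write ALL, read ONE
```

So "you might not be sure if a node is up to date" describes the low-W, low-R corner of this trade-off; raising W or R buys stronger reads at the cost of latency and availability.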

We are just starting with Cassandra, so I hope someone will point out anything above that they find incorrect. I would be glad to retest and rectify, if necessary.


When I started I did not find much documentation, but it now looks like the Apache Cassandra page has quite a few articles listed. Refer: