For a bit of background - this question deals with a project that runs on a single small EC2 instance and is about to migrate to a medium one. The main components are Django, MySQL and a large number of custom analysis tools written in Python and Java, which do the heavy lifting. The same machine is running Apache as well.
The data model looks like the following - a large amount of real-time data comes in streamed from various networked sensors, and ideally, I'd like to establish a long-poll approach rather than the current poll-every-15-minutes approach (a limitation of computing stats and writing into the database itself). Once the data comes in, I store the raw version in MySQL, let the analysis tools loose on this data, and store statistics in another few tables. All of this is rendered using Django.
Relational features I would need -
My major problem is that data reads are extremely slow (and writes aren't that hot either). I don't want to throw a lot of money and hardware at it right now, and I'd prefer something that can scale easily with time. Vertically scaling MySQL is not trivial in that sense (or cheap).
So essentially, after having read a lot about NoSQL and experimented with things like MongoDB, Cassandra and Voldemort, my questions are:
On a medium EC2 instance, would I gain any benefits in reads/writes by shifting to something like Cassandra? This article (pdf) definitely seems to suggest that. Currently, I'd say a few hundred writes per minute would be the norm. For reads - since the data changes every 5 minutes or so, cache invalidation has to happen pretty quickly. At some point, it should be able to handle a large number of concurrent users as well. The app performance currently gets killed on MySQL doing some joins on large tables even if indexes are created - a result on the order of 32k rows takes more than a minute to render. (This may be an artifact of EC2 virtualized I/O as well.) Each table holds around 4-5 million rows, and there are about 5 such tables.
Everyone talks about using Cassandra on multiple nodes, given the CAP theorem and eventual consistency. But, for a project that is just beginning to grow, does it make sense to deploy a one-node Cassandra server? Are there any caveats? For instance, can it replace MySQL as a backend for Django? [Is this recommended?]
If I do shift, I'm guessing I'll have to rewrite parts of the app to do a lot more "administrivia" since I'd have to do multiple lookups to fetch rows.
Would it make any sense to just use MySQL as a key-value store rather than a relational engine, and go with that? That way I could use the large number of stable APIs available, as well as a stable engine (and go relational as needed). (Bret Taylor's post from FriendFeed on this - http://bret.appspot.com/entry/how-friendfeed-uses-mysql)
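To make the "MySQL as a key-value store" idea a bit more concrete, here is a minimal sketch of the general pattern: entities stored as opaque serialized blobs keyed by id, plus a separate index table that the application maintains itself. Everything named here (the entities and index_sensor tables, the sensor_id field, the helper functions, JSON as the serialization) is an assumption made for illustration, and sqlite3 from the Python standard library stands in for MySQL so that the example runs without a server.

```python
import json
import sqlite3
import uuid


def open_store(path=":memory:"):
    """Create the two hypothetical tables: a blob store and one index."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entities ("
        "id TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS index_sensor ("
        "sensor_id TEXT NOT NULL, entity_id TEXT NOT NULL, "
        "PRIMARY KEY (sensor_id, entity_id))"
    )
    return conn


def put_entity(conn, entity):
    """Store the entity as a JSON blob and update the index table."""
    entity_id = entity.get("id") or uuid.uuid4().hex
    entity = dict(entity, id=entity_id)
    with conn:  # one transaction for the blob and its index row
        conn.execute(
            "INSERT OR REPLACE INTO entities (id, body) VALUES (?, ?)",
            (entity_id, json.dumps(entity)),
        )
        if "sensor_id" in entity:
            conn.execute(
                "INSERT OR REPLACE INTO index_sensor (sensor_id, entity_id) "
                "VALUES (?, ?)",
                (entity["sensor_id"], entity_id),
            )
    return entity_id


def get_entity(conn, entity_id):
    """Primary-key lookup; returns None when the id is unknown."""
    row = conn.execute(
        "SELECT body FROM entities WHERE id = ?", (entity_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


def find_by_sensor(conn, sensor_id):
    """Two-step lookup: read ids from the index, then fetch each blob."""
    ids = [
        r[0]
        for r in conn.execute(
            "SELECT entity_id FROM index_sensor "
            "WHERE sensor_id = ? ORDER BY entity_id",
            (sensor_id,),
        )
    ]
    entities = (get_entity(conn, i) for i in ids)
    # Index rows are never deleted here, so re-check the field to drop
    # stale matches left behind by an updated entity.
    return [e for e in entities if e and e.get("sensor_id") == sensor_id]
```

Note that every non-primary-key query becomes two steps (index, then blobs) done in application code, which is the same kind of extra lookup work that moving to a column store would involve.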
Any insights from people who've done a shift would be greatly appreciated!
If you're a relational database developer (as I am), I'd suggest/point out:
That said, I strongly recommend building something in Cassandra. If you're like me, then doing so will challenge your understanding of data storage and make you rethink a relational-database-fits-all-situations outlook that I didn't even realize I held.
Some good resources I've found include:
Why is using HBase a better choice than using Cassandra?
Can anyone please give a detailed explanation on this?
I don't think either is better than the other; it's not a matter of one or the other. These are very different systems, each with its strengths and weaknesses, so it really depends on your use cases. They can definitely be used to complement one another in the same infrastructure.
To explain the difference better I'd like to borrow a picture from Cassandra: the Definitive Guide, where they go over the CAP theorem. What they say is basically for any distributed system, you have to find a balance between consistency, availability and partition tolerance, and you can only realistically satisfy 2 of these properties. From that you can see that:
When it comes to Hadoop, HBase is built on top of HDFS, which makes it pretty convenient to use if you already have a Hadoop stack. It is also supported by Cloudera, which is a standard enterprise distribution for Hadoop.
But Cassandra also has more integration with Hadoop, namely DataStax Brisk, which is gaining popularity. You can also now natively stream data from the output of a Hadoop job into a Cassandra cluster using a Cassandra-provided output format (BulkOutputFormat, for example); we are no longer at the point where Cassandra was just a standalone project.
In my experience, I've found that Cassandra is awesome for random reads, and not so much for scans.
To add a little color to the picture, I've been using both at my job in the same infrastructure, and HBase serves a very different purpose than Cassandra. I've used Cassandra mostly for very fast real-time lookups, while I've used HBase more for heavy ETL batch jobs with less strict latency requirements.
This is a question that would truly be worthy of a blog post, so instead of going on and on I'd like to point you to an article which sums up a lot of the key differences between the 2 systems. Bottom line is, there is no superior solution IMHO, and you should really think about your use cases to see which system is better suited.
I would like to learn Cassandra.
Unfortunately, the few tutorial posts I could find either refer to an old Cassandra version (prior to 1.0) and/or require a somewhat complicated setup, like installing twissandra.
So, I wonder if anyone knows a resource to learn Cassandra without having to install anything, except Cassandra itself, of course.
I am pretty comfortable with MongoDB and have some experience with MySQL, though it seems that Cassandra is like neither of them.
I'd recommend the book "Cassandra: The Definitive Guide" by Eben Hewitt: http://www.amazon.com/Cassandra-Definitive-Guide-Eben-Hewitt/dp/1449390412
It covers the basics of the database and also of NoSQL data modeling.
I also found this resource quite useful for understanding Cassandra configuration parameters: http://www.ecyrd.com/cassandracalculator/
Of course, you won't be able to survive without http://www.datastax.com/docs
I'm planning to start a project with NoSQL for data storage. I was trying to find information about Cassandra on Google, but I've only found very basic info. Does anyone know where I can find a good source of knowledge about Cassandra (planning the data structure, working with data, maybe migrating from MySQL, etc.)?
The best source of information is the Cassandra wiki at http://wiki.apache.org/cassandra/.
There's also an O'Reilly book, Cassandra: The Definitive Guide, but this is for Cassandra 0.7, so is a bit out of date now.
Let's assume I have a keyspace with a column family that stores user objects and the key of these objects is the username.
How can I use Hector to get a list of users sorted by username?
I tried a RangeSlicesQuery; paging works fine with this query, but the results are not sorted in any way.
I'm an absolute Cassandra beginner. Can anyone point me to a simple example that shows how to sort a column family by key? Please ask if you need more details on my efforts.
The result was not sorted because I used the default RandomPartitioner instead of the OrderPreservingPartitioner in cassandra.yaml.
Probably it's better not to rely on the sorting by key but to use a secondary index.
Quoting Cassandra - The Definitive Guide
Column names are stored in sorted order according to the value of compare_with. Rows, on the other hand, are stored in an order defined by the partitioner (for example, with RandomPartitioner, they are in random order, etc.)
I guess you are using RandomPartitioner, which will
... return data in an essentially random order.
You should probably use OrderPreservingPartitioner (OPP), where
Rows are therefore stored by key order, aligning the physical structure of the data with your sort order.
Be aware of the inefficiency of OPP.
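As a rough illustration of why the partitioner decides the row order, the sketch below models a range scan as "rows sorted by token". It is a simplified stand-in and not Cassandra's actual code: the token here is just the MD5 digest of the key read as an unsigned integer, and scan_order and random_partitioner_token are hypothetical helper names.

```python
import hashlib


def random_partitioner_token(key: str) -> int:
    """Simplified token: the MD5 digest of the key as an unsigned integer."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def scan_order(keys, partitioner):
    """Return the order in which a range scan would visit the row keys."""
    if partitioner == "random":
        # Rows are ordered by hash token, so the key order looks scrambled.
        return sorted(keys, key=random_partitioner_token)
    if partitioner == "order_preserving":
        # The token is the key itself, so rows come back in key order.
        return sorted(keys)
    raise ValueError(f"unknown partitioner: {partitioner!r}")


if __name__ == "__main__":
    users = ["carol", "alice", "bob", "dave"]
    print(scan_order(users, "random"))
    print(scan_order(users, "order_preserving"))
```

With hashed tokens the order is stable from one scan to the next, which is why paging still works, but it has nothing to do with the usernames. Hashing is also what spreads rows evenly across nodes, and an order-preserving partitioner gives that up.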
(edit on Mar 07, 2014)
This answer is very old now.
It is a system-wide setting. You can set it in cassandra.yaml. See this doc. Again, OPP is highly discouraged. This document is for version 1.1, and you can see that OPP is already deprecated there. It has likely been removed from the latest version. If you do want to use OPP, you may want to revisit the architecture.
NoSQL databases & particularly Cassandra have created a lot of buzz with their high scalability promises at cheaper costs.
There is a lot of buzz around Cassandra's adoption by social networking majors like Facebook, Twitter and Digg. But the fact is that Facebook is no longer really considering Cassandra for its recent projects, and it never completely relied on Cassandra or ditched MySQL, even though it is still struggling hard with MySQL in places where Cassandra could have been a good fit for its models.
Even Twitter stepped back from its plans to move to a Cassandra cluster.
Also, Digg hasn't been very successful with its Cassandra implementation (though it is not clear who is to blame for this).
With this, no big players are left who are proud to be playing with Cassandra!
It is still in the alpha stage and has a small community, so should Cassandra be considered for production environments for big projects? For a social networking site, which database solution amongst MySQL & Cassandra would be:
Amongst all of the above, I mostly doubt its reliability. Am I risking my data with Cassandra?
Any other advice you can give?
Not sure if I can convince you, but I am working on a project that uses Cassandra. Cassandra is not the complete solution, but it is very fast and it is good for grouped information.
We have offloaded all the read/write-intensive data to Cassandra, and the data that is less in demand and does need relational integrity is still in MySQL (on top of which there is Memcached). I guess Facebook must also have an amalgam of MySQL, Cassandra and Memcached, but that is only my guess.
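For what "Memcached on top of MySQL" means in practice, here is a minimal read-through (cache-aside) sketch. It uses a small in-process dictionary with expiry times as a stand-in for Memcached, and load_from_db is a hypothetical callable representing the MySQL query; none of these names come from our actual setup.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for Memcached: values expire after a TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._items = {}

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: behave like a cache miss
            return None
        return value

    def set(self, key, value):
        self._items[key] = (value, self._clock() + self._ttl)

    def delete(self, key):
        # Call this after a write to invalidate the cached copy early.
        self._items.pop(key, None)


def read_through(cache, key, load_from_db):
    """Serve from the cache; on a miss, hit the database and cache the row.

    A value of None is treated as a miss, so None results are not cached.
    """
    value = cache.get(key)
    if value is None:
        value = load_from_db(key)
        cache.set(key, value)
    return value
```

The TTL bounds how stale a cached read can get, and calling delete after a write tightens that further for the keys you know have changed.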
To answer your questions (based on my short experience with Cassandra and MySQL):
We are just starting with Cassandra, so I hope someone will point out anything above that they find incorrect. I would be glad to retest and rectify it, if necessary.
When I started I did not find much documentation, but it now looks like the Apache Cassandra page has quite a few articles listed. Refer: