
Cassandra interview questions

Top Cassandra frequently asked interview questions

Redis, CouchDB or Cassandra? [closed]

What are the strengths and weaknesses of the various NoSQL databases available?

In particular, it seems like Redis is weak when it comes to distributing write load over multiple servers. Is that the case? Is it a big problem? How big does a service have to grow before that could be a significant problem?


Source: (StackOverflow)

How to choose between Cassandra, Membase, Hadoop, MongoDB, RDBMS etc.? [closed]

Is there a paper or blog post on when to use Cassandra, Membase, Hadoop, or a plain old relational database? Is there a paper discussing the strengths/weaknesses of each, and the scenarios in which each of these technologies should be chosen?

I am thinking of writing a new web service that will get about a million hits per day, with data spanning a few terabytes.


Source: (StackOverflow)


Difference between partition key, composite key and clustering key in Cassandra?

I have been reading articles around the net to understand the differences between the following key types, but it just seems hard for me to grasp. Examples would definitely help my understanding.

  • primary key
  • partition key
  • composite key
  • clustering key
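
A minimal sketch of how the four terms relate, using CQL3 through the DataStax Java driver. The keyspace, table, and column names are made up for illustration, and it assumes a reachable CQL3-capable node.

    // Made-up schema to illustrate primary / partition / composite / clustering keys.
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class KeyKindsSketch {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("demo");   // assumes a "demo" keyspace already exists

            // PRIMARY KEY ((sensor_id, day), reported_at, reading_id)
            //   (sensor_id, day)        -> composite PARTITION key: decides which node owns the row
            //   reported_at, reading_id -> CLUSTERING keys: sort data within that partition
            //   all four together       -> the PRIMARY key (must uniquely identify a row)
            session.execute(
                "CREATE TABLE IF NOT EXISTS readings ("
              + "  sensor_id   text,"
              + "  day         text,"
              + "  reported_at timestamp,"
              + "  reading_id  uuid,"
              + "  value       double,"
              + "  PRIMARY KEY ((sensor_id, day), reported_at, reading_id))");

            cluster.close();   // shutdown() in older driver versions
        }
    }

In other words, the primary key is everything inside PRIMARY KEY (...); its first component is the partition key (composite here because it has two columns), and the remaining columns are clustering keys that order rows within each partition.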

Source: (StackOverflow)

MongoDB vs. Cassandra [closed]

I am evaluating what might be the best migration option.

Currently, I am on a sharded MySQL setup (horizontal partitioning), with most of my data stored in JSON blobs. I do not have any complex SQL queries (I already migrated away from them when I partitioned my DB).

Right now, it seems like both MongoDB and Cassandra would be likely options. My situation:

  • Lots of reads in every query, less frequent writes
  • Not worried about "massive" scalability
  • More concerned about simple setup, maintenance and code
  • Minimize hardware/server cost

Source: (StackOverflow)

Large scale data processing: HBase vs Cassandra [closed]

I have nearly settled on Cassandra after my research into large-scale data storage solutions. But it is generally said that HBase is the better solution for large-scale data processing and analysis.

Given that both are key/value stores and both can run on a Hadoop layer (Cassandra only recently), what makes the HBase/Hadoop combination a better candidate when processing and analysis are required on large data?

I also found good details about both at http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/

but I'm still looking for concrete advantages of HBase.

Meanwhile, I am more convinced by Cassandra because of its simplicity in adding nodes, its seamless replication, and its lack of a single point of failure. It also offers secondary indexes, which is a nice plus.


Source: (StackOverflow)

What does "Document-oriented" vs. Key-Value mean when talking about MongoDB vs Cassandra?

What does going with a document-based NoSQL option buy you over a KV store, and vice versa?
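
As a rough illustration of the contrast, here is a sketch with purely hypothetical Java interfaces (neither is any real product's API): a key/value store only understands keys and opaque values, while a document store understands the structure inside the value and can query or update individual fields.

    // Hypothetical interfaces, only to make the KV-vs-document contrast concrete.
    import java.util.List;
    import java.util.Map;

    interface KeyValueStore {
        // The store understands nothing but the key; the value is an opaque blob.
        byte[] get(String key);
        void put(String key, byte[] value);
    }

    interface DocumentStore {
        // The store understands the document's structure (JSON-like maps here),
        // so it can index, query, and update individual fields server-side.
        Map<String, Object> get(String id);
        List<Map<String, Object>> findByField(String field, Object value);
        void updateField(String id, String field, Object value);
    }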


Source: (StackOverflow)

What should I choose: MongoDB/Cassandra/Redis/CouchDB? [closed]

We're developing a really big project, and I was wondering if anyone can give me some advice about which DB backend we should pick.

Our system is composed of 1,100 electronic devices that send a signal to a central server, and the server then stores the signal info (each signal is about 35 bytes long). However, these devices will each be sending about 3 signals per minute, so if we do the numbers, that's 4,752,000 new records per day on the database, and a total of 142,560,000 new records per month.
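
To make the arithmetic explicit, a back-of-the-envelope sketch using only the figures above (1,100 devices, 3 signals per device per minute, 35 bytes per signal, and a 30-day month):

    // Back-of-the-envelope ingest and raw-volume numbers for the figures above.
    public class SignalMath {
        public static void main(String[] args) {
            long devices = 1_100;
            long signalsPerDevicePerMinute = 3;
            long bytesPerSignal = 35;

            long recordsPerDay = devices * signalsPerDevicePerMinute * 60 * 24;  // 4,752,000
            long recordsPerMonth = recordsPerDay * 30;                           // 142,560,000
            long rawBytesPerDay = recordsPerDay * bytesPerSignal;                // 166,320,000 bytes (~159 MiB), before storage overhead

            System.out.printf("records/day     = %,d%n", recordsPerDay);
            System.out.printf("records/month   = %,d%n", recordsPerMonth);
            System.out.printf("raw payload/day = %,d bytes%n", rawBytesPerDay);
        }
    }

That works out to roughly 55 writes per second on average.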

We need a DB backend that is lightning fast and reliable. Of course, we need to do some complex data mining on that DB. We're doing some research on MongoDB/Cassandra/Redis/CouchDB; however, the documentation websites are still at an early stage.

Any help? Ideas?

Thanks a lot!


Source: (StackOverflow)

When NOT to use Cassandra?

There has been a lot of talk related to Cassandra lately.

Twitter, Digg, Facebook, etc. all use it.

When does it make sense to:

  • use Cassandra,
  • not use Cassandra, and
  • use an RDBMS instead of Cassandra.

Source: (StackOverflow)

What's The Best Practice In Designing A Cassandra Data Model? [closed]

And what are the pitfalls to avoid? Are there any deal breakers for you? E.g., I've heard that exporting/importing Cassandra data is very difficult, which makes me wonder whether that will hinder syncing production data to the development environment.

BTW, it's very hard to find good tutorials on Cassandra; the only one I have, http://arin.me/code/wtf-is-a-supercolumn-cassandra-data-model, is still pretty basic.

Thanks.


Source: (StackOverflow)

Cassandra server throws java.lang.AssertionError: DecoratedKey(...) != DecoratedKey

I'm currently experimenting with Cassandra.

On the client-side (with Hector) I look up a few keys like this:

ColumnFamilyResult<String, String> result = template.queryColumns(Arrays.asList("key1","key2","key3"));
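
For context, the template above is presumably built along these lines (a sketch: the cluster name and host are assumptions, while the keyspace and column family names are taken from the server log below):

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.cassandra.service.template.ColumnFamilyTemplate;
    import me.prettyprint.cassandra.service.template.ThriftColumnFamilyTemplate;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;

    // Assumed Hector setup behind the queryColumns(...) call above.
    Cluster cluster = HFactory.getOrCreateCluster("TestCluster", "localhost:9160");
    Keyspace keyspace = HFactory.createKeyspace("CassandraPolepos", cluster);
    ColumnFamilyTemplate<String, String> template =
            new ThriftColumnFamilyTemplate<String, String>(keyspace, "ComplexObjects",
                    StringSerializer.get(), StringSerializer.get());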

Most of the time it seems to work. But other times I get a timeout exception on the client:

Caused by: me.prettyprint.hector.api.exceptions.HTimedOutException: TimedOutException()
    at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:35)
    at me.prettyprint.cassandra.service.template.ThriftColumnFamilyTemplate$1.execute(ThriftColumnFamilyTemplate.java:100)
    at me.prettyprint.cassandra.service.template.ThriftColumnFamilyTemplate$1.execute(ThriftColumnFamilyTemplate.java:88)
    at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:103)
    at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:258)
    at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:97)
    at me.prettyprint.cassandra.service.template.ThriftColumnFamilyTemplate.sliceInternal(ThriftColumnFamilyTemplate.java:88)
    at me.prettyprint.cassandra.service.template.ThriftColumnFamilyTemplate.doExecuteSlice(ThriftColumnFamilyTemplate.java:46)
    at me.prettyprint.cassandra.service.template.ColumnFamilyTemplate.queryColumns(ColumnFamilyTemplate.java:113)
    at info.gamlor.experiments.Cassandra.readObjectByKey(ComplexCassandra.java:255)

Caused by: TimedOutException()
    at org.apache.cassandra.thrift.Cassandra$get_slice_result.read(Cassandra.java:7772)
    at org.apache.cassandra.thrift.Cassandra$Client.recv_get_slice(Cassandra.java:570)
    at org.apache.cassandra.thrift.Cassandra$Client.get_slice(Cassandra.java:542)
    at me.prettyprint.cassandra.service.template.ThriftColumnFamilyTemplate$1.execute(ThriftColumnFamilyTemplate.java:95)

And on the server this exception shows up:

ERROR 11:33:55,312 Exception in thread Thread[ReadStage:91,5,main]
java.lang.AssertionError: DecoratedKey(4948402862350542345439897754126541659, 6932) != DecoratedKey(132475956107784875457507977471906551877, 726f6f74) in C:\temp\cassandra\lib\cassandra\data\CassandraPolepos\ComplexObjects\CassandraPolepos-ComplexObjects-hd-2-Data.db
        at org.apache.cassandra.db.columniterator.SSTableSliceIterator.<init>(SSTableSliceIterator.java:58)
        at org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:66)
        at org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:78)
        at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:256)
        at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:63)
        at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1331)
        at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1193)
        at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1128)
        at org.apache.cassandra.db.Table.getRow(Table.java:378)
        at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:69)
        at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:816)
        at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1250)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)

Sometimes the key values in the DecoratedKey(...) part take up pages.

Does anyone have a hint about what I'm doing wrong, or how to investigate this issue?

Thanks.


Source: (StackOverflow)

What is the difference between Cassandra and CouchDB?

I'm looking at both projects and I can't really see the difference.

From the Cassandra site:

Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store...Cassandra is eventually consistent. Like BigTable, Cassandra provides a ColumnFamily-based data model richer than typical key/value systems.

From the CouchDB site:

Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API.

That said, I see specific differences between the projects: access methods, implementation languages, etc. But to give AN EXAMPLE: when you talk about Solr or Sphinx, you know both are indexers with big differences, but in the end both are indexers.

Can I say here that Cassandra and CouchDB are non-relational databases and that, in some cases, one can replace the other?


Source: (StackOverflow)

Explain Merkle Trees for use in Eventual Consistency

Merkle trees are used as an anti-entropy mechanism in several distributed, replicated key/value stores, Cassandra among them.

No doubt an anti-entropy mechanism is A Good Thing; transient failures just happen in production. I'm just not sure I understand why Merkle trees are the popular approach.

  • Sending a complete Merkle tree to a peer involves sending the local key-space to that peer, along with hashes of each key value, stored in the lowest levels of the tree.

  • Diffing a Merkle tree sent from a peer requires having a Merkle tree of your own.

Since both peers must already have a sorted key / value-hash space on hand, why not do a linear merge to detect discrepancies?

I'm just not convinced that the tree structure provides any kind of savings when you factor in upkeep costs, and the fact that linear passes over the tree leaves are already being done just to serialize the representation over the wire.

To ground this out, a straw-man alternative might be to have nodes exchange arrays of hash digests, which are incrementally updated and bucketed by modulo ring-position.

What am I missing?
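
For reference, a minimal sketch of the structure being discussed (a toy binary hash tree, not any particular store's implementation). The usual argument for it is that two mostly-in-sync peers can compare root hashes first and only descend into subtrees whose hashes differ, exchanging O(log n) digests instead of the full key/hash list; whether that saving outweighs the upkeep cost is exactly the question above.

    // Toy binary Merkle tree over a sorted list of per-key value hashes.
    // Assumes both peers build their trees over the same key ranges.
    import java.security.MessageDigest;
    import java.util.List;

    final class MerkleNode {
        final byte[] hash;
        final MerkleNode left, right;   // null for leaves

        private MerkleNode(byte[] hash, MerkleNode left, MerkleNode right) {
            this.hash = hash; this.left = left; this.right = right;
        }

        // Leaves hold the hash of one key's value; interior nodes hash their two children.
        static MerkleNode build(List<byte[]> sortedValueHashes, int from, int to) throws Exception {
            if (to - from == 1) {
                return new MerkleNode(sortedValueHashes.get(from), null, null);
            }
            int mid = (from + to) / 2;
            MerkleNode left = build(sortedValueHashes, from, mid);
            MerkleNode right = build(sortedValueHashes, mid, to);
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            md.update(left.hash);
            md.update(right.hash);
            return new MerkleNode(md.digest(), left, right);
        }

        // Descend only where hashes differ; identical subtrees cost a single comparison.
        static int countDivergentLeaves(MerkleNode a, MerkleNode b) {
            if (MessageDigest.isEqual(a.hash, b.hash)) return 0;
            if (a.left == null || b.left == null) return 1;   // reached a differing leaf range
            return countDivergentLeaves(a.left, b.left)
                 + countDivergentLeaves(a.right, b.right);
        }
    }

Two identical replicas pay one root comparison; a single divergent key costs a walk down one path of depth log2(n).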


Source: (StackOverflow)

Difference between Document-based and Key/Value-based databases?

I know there are three different popular types of NoSQL databases.

  • Key/Value: Redis, Tokyo Cabinet, Memcached
  • ColumnFamily: Cassandra, HBase
  • Document: MongoDB, CouchDB

I have read long blog posts about them without understanding very much.

I know relational databases and have gotten the hang of document-based databases like MongoDB/CouchDB.

Could someone tell me what the major differences are between those and the first two types on the list?
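
As a rough mental model of the first two types (hypothetical Java shapes, not any store's real API), the difference is mostly in what the store itself "understands" about your data:

    // Hypothetical in-memory shapes, only to show what each kind of store models.
    import java.util.HashMap;
    import java.util.Map;
    import java.util.SortedMap;
    import java.util.TreeMap;

    public class StorageModels {
        public static void main(String[] args) {
            // Key/value (Redis, Tokyo Cabinet, Memcached): one key maps to one opaque value.
            Map<String, byte[]> keyValue = new HashMap<String, byte[]>();
            keyValue.put("user:42", "{\"name\":\"Ada\",\"city\":\"London\"}".getBytes());

            // Column family (Cassandra, HBase): row key -> sorted map of column name -> value.
            // Columns within a row are sorted and can be sliced by range, and different
            // rows in the same family need not have the same columns.
            Map<String, SortedMap<String, byte[]>> columnFamily =
                    new HashMap<String, SortedMap<String, byte[]>>();
            SortedMap<String, byte[]> row = new TreeMap<String, byte[]>();
            row.put("name", "Ada".getBytes());
            row.put("city", "London".getBytes());
            columnFamily.put("user:42", row);
        }
    }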


Source: (StackOverflow)

Switching from MySQL to Cassandra - Pros/Cons?

For a bit of background: this question deals with a project running on a single small EC2 instance that is about to migrate to a medium one. The main components are Django, MySQL, and a large number of custom analysis tools written in Python and Java, which do the heavy lifting. The same machine is running Apache as well.

The data model looks like the following: a large amount of real-time data comes in, streamed from various networked sensors, and ideally I'd like to establish a long-poll approach rather than the current poll-every-15-minutes approach (a limitation of computing stats and writing into the database itself). Once the data comes in, I store the raw version in MySQL, let the analysis tools loose on it, and store statistics in another few tables. All of this is rendered using Django.

Relational features I would need:

  • Order by [SliceRange in Cassandra's API seems to satisfy this; see the sketch after this list]
  • Group by
  • Many-to-many relations between multiple tables [Cassandra SuperColumns seem to do well for one-to-many]
  • Sphinx on this gives me a nice full-text engine, so that's a necessity too. [On Cassandra, the Lucandra project seems to satisfy this need]
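
For reference on the first bullet, an ordered column slice in Hector (which wraps Thrift's SliceRange) looks roughly like this. This is only a sketch: the keyspace variable, column family, key, and range bounds are made up; the point is just that columns come back sorted by the column comparator, which is what stands in for ORDER BY within a row.

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.beans.ColumnSlice;
    import me.prettyprint.hector.api.beans.HColumn;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.query.QueryResult;
    import me.prettyprint.hector.api.query.SliceQuery;

    // 'keyspace' is assumed to be an already-initialized Hector Keyspace.
    SliceQuery<String, String, String> query = HFactory.createSliceQuery(
            keyspace, StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
    query.setColumnFamily("SensorReadings");
    query.setKey("sensor-17");
    query.setRange("2010-06-01", "2010-06-30", false, 100);   // start, finish, reversed, count
    QueryResult<ColumnSlice<String, String>> result = query.execute();
    for (HColumn<String, String> col : result.get().getColumns()) {
        System.out.println(col.getName() + " = " + col.getValue());   // arrives in comparator order
    }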

My major problem is that data reads are extremely slow (and writes aren't that hot either). I don't want to throw a lot of money and hardware at it right now, and I'd prefer something that can scale easily over time. Vertically scaling MySQL is not trivial in that sense (or cheap).

So essentially, after having read a lot about NoSQL and experimented with things like MongoDB, Cassandra, and Voldemort, my questions are:

  • On a medium EC2 instance, would I gain any benefits in reads/writes by shifting to something like Cassandra? This article (pdf) definitely seems to suggest that. Currently, I'd say a few hundred writes per minute would be the norm. For reads, since the data changes every 5 minutes or so, cache invalidation has to happen pretty quickly. At some point, it should be able to handle a large number of concurrent users as well. The app's performance currently gets killed by MySQL doing joins on large tables even when indexes are created; something on the order of 32k rows takes more than a minute to render. (This may be an artifact of EC2's virtualized I/O as well.) The tables are around 4-5 million rows each, and there are about 5 such tables.

  • Everyone talks about using Cassandra on multiple nodes, given the CAP theorem and eventual consistency. But for a project that is just beginning to grow, does it make sense to deploy a one-node Cassandra server? Are there any caveats? For instance, can it replace MySQL as a backend for Django? [Is this recommended?]

  • If I do shift, I'm guessing I'll have to rewrite parts of the app to do a lot more "administrivia" since I'd have to do multiple lookups to fetch rows.

  • Would it make any sense to just use MySQL as a key/value store rather than as a relational engine, and go with that? That way I could use the large number of stable APIs available, as well as a stable engine, and go relational as needed (see the sketch after this list). (Bret Taylor's post from FriendFeed on this: http://bret.appspot.com/entry/how-friendfeed-uses-mysql)
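
On the last bullet, the FriendFeed-style "MySQL as a key/value store" idea boils down to a table of opaque blobs keyed by id, with any indexes maintained separately. A sketch in that spirit (table layout and connection details are made up; it assumes the MySQL JDBC driver is on the classpath):

    // Sketch of MySQL used as a key/value store: one table of opaque serialized blobs.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    public class MySqlAsKeyValue {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost/appdata", "app", "secret");   // made-up connection details

            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS entities ("
                         + " id BINARY(16) NOT NULL PRIMARY KEY,"
                         + " updated TIMESTAMP NOT NULL,"
                         + " body MEDIUMBLOB NOT NULL)");                 // opaque serialized record
            }

            try (PreparedStatement ps = conn.prepareStatement(
                    "REPLACE INTO entities (id, updated, body) VALUES (?, NOW(), ?)")) {
                ps.setBytes(1, new byte[16]);                             // entity id, e.g. a UUID
                ps.setBytes(2, "{\"type\":\"reading\",\"value\":42}".getBytes());
                ps.executeUpdate();
            }
            conn.close();
        }
    }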

Any insights from people who've done a shift would be greatly appreciated!

Thanks.


Source: (StackOverflow)

Cassandra Client Java APIs [closed]

I have recently started working with the Cassandra database. Now I am in the process of evaluating which Cassandra client we should go forward with.

I have seen various posts on Stack Overflow about which client to use for Cassandra, but none has a very definitive answer.

My team has asked me to do some research on this and come up with pros and cons for each Cassandra client API in Java.

As I mentioned, I recently got involved with Cassandra, so I don't have much idea of why certain people choose the Pelops client, why certain people go with Astyanax, and so on for the other clients.

I know a bit about each of the Cassandra clients, by which I mean I am able to make them work and start reading from and writing to a Cassandra database.

Below is the information I have so far.

CASSANDRA APIS

  • Hector (Production-Ready)
    The most stable of the Java APIs, ready for prime-time.

  • Astyanax (The Up and Comer)
    A clean Java API from Netflix. It isn't as widely used as Hector, but it is solid.

  • Kundera (The NoSQL ORM)
    JPA compliant, this is handy when you want to interact with Cassandra via objects.
    This constrains you somewhat in that you won't be able to have a dynamic number of columns/names, etc. But it does allow you to port over ORMs, or centralize storage onto Cassandra for more traditional uses.

  • Pelops
    I've only used Pelops briefly. It was a straightforward API, but it didn't seem to have much momentum behind it.

  • PlayORM (ORM without the constraints?)
    I just heard about this. It looks like it is trying to solve the impedance mismatch between traditional JPA-based ORMs and NoSQL by introducing JQL. It looks promising.

  • Thrift (Avoid Me!)
    This is the "low-level" API.

Below are our priorities in deciding on a Cassandra client:

  • First priorities: low latency overhead, an asynchronous API, and reliability/stability for a production environment
    (e.g., a more user-friendly API that can be provided in the DAL that wraps the client).
  • Connection pooling and partition awareness are other good features to have.
  • The ability to detect any new nodes that get added.
  • Good support as well (as pointed out by Dean below).

Can anyone provide some thoughts on this? Pros and cons for each Cassandra client, and which client can fulfill my requirements, would be a great help as well.

Based on my research so far, I believe I will mainly be deciding between the Astyanax client and the new DataStax client that uses the binary protocol. But I don't have solid information to back up my research and present it to my team.

Any comparison between the Astyanax client and the new DataStax client (which uses the new binary protocol) would be a great help.

It will be of great help to me in my research, and I will gain a lot of knowledge from different people who have used different clients in the past.
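
To make the second candidate concrete, connecting and querying over the CQL binary protocol with the DataStax Java driver looks roughly like this. It is only a sketch: contact point, keyspace, table, and column names are made up, and the exact async and shutdown APIs vary between driver versions, so check the driver documentation.

    // Sketch of the DataStax Java driver (CQL binary protocol); all names are made up.
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class DatastaxDriverSketch {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")   // the driver discovers the remaining nodes
                    .build();
            Session session = cluster.connect("demo");

            // Synchronous query
            ResultSet rs = session.execute("SELECT user_id, name FROM users WHERE user_id = 42");
            for (Row row : rs) {
                System.out.println(row.getString("name"));
            }

            // Asynchronous query (one of the priorities listed above)
            ResultSetFuture future = session.executeAsync("SELECT count(*) FROM users");
            long count = future.getUninterruptibly().one().getLong(0);
            System.out.println("users: " + count);

            cluster.close();   // shutdown() in older driver versions
        }
    }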


Source: (StackOverflow)