Tuesday, July 9, 2013

Getting Started with Apache Cassandra

Apache Cassandra is “an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon’s Dynamo and its data model on Google’s Bigtable” (source: “Cassandra: The Definitive Guide,” O’Reilly Media, 2010, p. 14).

Cassandra is built to store large amounts of data across many machines arranged in a ring. In other words, it scales horizontally rather than vertically.

Data Model

Cassandra is based on a key-value model, organized around the following concepts:

  • Column is a key-value pair.

  • Column Family is a set of key-value pairs (columns in Cassandra’s terminology), sorted by their keys. Each such set of columns forms a row, which is referenced by its row key.

  • Super Column: the value of a key-value pair can itself be a sequence of key-value pairs. In this case, the outer column is called a super column.

  • Columns and Super Columns can equally be used within Column Families

  • Columns and Super Columns are stored sorted by name within their Column Families.
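
One way to picture the concepts above is as nested maps. The following is only a conceptual sketch in plain Python, not Cassandra code: the row key "jsmith" and the column names and values are made up for illustration, and the helper sorted_columns is hypothetical.

```python
# Conceptual model only: plain Python dicts standing in for Cassandra structures.

def sorted_columns(row):
    """Return a row's columns as (name, value) pairs ordered by column name."""
    return sorted(row.items())

# A standard column family: row key -> {column name -> value}
users = {
    "jsmith": {"last": "Smith", "first": "John", "age": "38"},
}

# A super column family: row key -> {super column name -> {column name -> value}}
address_book = {
    "jsmith": {
        "work": {"street": "1 Main St", "city": "Springfield"},
        "home": {"street": "9 Elm St", "city": "Shelbyville"},
    },
}

# Columns come back ordered by name, regardless of insertion order.
print(sorted_columns(users["jsmith"]))
# Super columns are ordered by name as well.
print([name for name, _ in sorted_columns(address_book["jsmith"])])
```

The extra level of nesting in address_book is all that distinguishes a super column from a normal one in this picture.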

For a better understanding of Cassandra’s data model, refer to this article by Maxim Grinev.

 

Installation

  1. Download the latest Cassandra version from here (I got version 1.2.6).
  2. Extract the archive (I extracted it to C:\apache-cassandra-1.2.6 )
  3. If Java is not installed on your machine, install it.
  4. Add environment variables:
    1. Right-click the My Computer icon on your desktop or Start menu and choose Properties.
    2. Click the Advanced tab (or Advanced System Settings), then click Environment Variables.
    3. Under System Variables, click New (adjust for your own directories):
      1. Variable Name: CASSANDRA_HOME
      2. Variable Value: C:\apache-cassandra-1.2.6
      3. Click OK.
    4. Under System Variables, click New again (adjust for your own directories):
      1. Variable Name: JAVA_HOME
      2. Variable Value: C:\Program Files\Java\jre7
      3. Click OK.
  5. Now open a command window and navigate to the bin directory inside your Cassandra directory (C:\apache-cassandra-1.2.6\bin in my case).
  6. Launch Cassandra by executing the command cassandra -f (the “-f” causes it to run in the foreground). You will see a lot of log messages. If everything goes well, the output will end with something like this:

[Screenshot: console output of cassandra -f]

We now have a running Cassandra server that is listening for incoming connections on port 9160.
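
If you want to confirm from a script that something is actually listening on that port, a plain TCP connection attempt is enough. This is a minimal sketch in Python using only the standard library. It assumes Cassandra runs on the local machine with the default port 9160 mentioned above, and the helper name port_is_open is made up for this example. It only checks that the port accepts connections, not that the service behind it is Cassandra.

```python
import socket

def port_is_open(host="127.0.0.1", port=9160, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

if __name__ == "__main__":
    print("Port 9160 reachable:", port_is_open())
```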

Once Cassandra is up and running on your machine, you can connect to the running instance using the Cassandra command-line interface, launched by running “cassandra-cli.bat” from the Cassandra “bin” directory.

Commands

  • show api version; to show the current API version.
  • describe cluster; to show a description of the current cluster.

  • create keyspace TestKS; to create a keyspace. The name has to be unique.
  • use TestKS; to switch to the keyspace TestKS.
  • create column family TestCF; to create the column family TestCF within the current keyspace.
    • No other schema definition is required; the column family is a collection of name/value pairs.
  • set TestCF[ascii('TestKey')][ascii('column1')]=ascii('TestValue'); to insert the value TestValue into the column named column1 of the row with key TestKey, within the column family TestCF. You can add with ttl = x at the end of the set command to make the column delete itself x seconds after insertion.
    • By default Cassandra treats data as byte arrays, but you can convert values to other types such as long, int, and integer. Also, timeuuid() generates a new UUID. For full information about the set command, type help set;
  • get TestCF[ascii('TestKey')]; to retrieve the columns stored under the row key TestKey within the column family TestCF.
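
To illustrate what the with ttl = x option means for a reader of the data, here is a small conceptual sketch in Python. It is not how Cassandra implements expiry internally. The class ExpiringColumn and its clock parameter are hypothetical names introduced for this example, and the clock is injectable only so the behaviour can be shown without waiting.

```python
import time

class ExpiringColumn:
    """A value that reads as missing once ttl seconds have passed since it was written."""

    def __init__(self, value, ttl=None, clock=time.time):
        self._clock = clock
        self._value = value
        # No ttl means the column never expires.
        self._expires_at = None if ttl is None else clock() + ttl

    def get(self):
        if self._expires_at is not None and self._clock() >= self._expires_at:
            return None  # expired: behaves as if the column had been deleted
        return self._value

# Usage with a fake clock so the example runs instantly.
now = [100.0]
column = ExpiringColumn("TestValue", ttl=10, clock=lambda: now[0])
print(column.get())   # still readable right after the write
now[0] = 110.0        # ten seconds later
print(column.get())   # reads as missing
```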

  • del TestCF[ascii('TestKey')][ascii('column2')]; rows and columns can be deleted with the del (delete) command by specifying the row key and, to delete a single column, the column name.

  • list TestCF; to list the data inside a column family.

  • drop column family TestCF; to remove a column family.
  • drop keyspace TestKS; to remove a keyspace.

==> You can insert into super columns much like inserting into normal columns. They can be read with get, written with set, and deleted with del. The super column versions of these commands use an extra ['xxx'] to represent the extra sub-index level.

  • assume TestCF comparator as ascii; tells the command-line interface how to decode and display the results of get and list requests. It can be used in the same way to set the validator and keys. By default, columns with no metadata are displayed in hex format, because row keys, column names, and column values are byte arrays. After using assume for the comparator and the validator, column names and values are displayed as readable text rather than hex code.
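
The difference between the default hex display and the display after assume comes down to how the same bytes are rendered. The short Python sketch below shows the idea with the column name and value used earlier in this post. The helper names as_hex and as_ascii are made up for the example, and this only mimics the display, it does not talk to Cassandra.

```python
def as_hex(raw: bytes) -> str:
    """Render raw bytes as a hex string, as a client does with no type information."""
    return raw.hex()

def as_ascii(raw: bytes) -> str:
    """Render the same bytes as text, as a client does once told they are ASCII."""
    return raw.decode("ascii")

name = "column1".encode("ascii")
value = "TestValue".encode("ascii")

print(as_hex(name), as_hex(value))      # 636f6c756d6e31 5465737456616c7565
print(as_ascii(name), as_ascii(value))  # column1 TestValue
```

The stored bytes are identical in both cases; only the interpretation applied by the client changes.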

  • Type Enforcement: Cassandra is designed to store and retrieve simple byte arrays, but it also has support for built-in types. When creating or updating a column family, the user can supply column metadata that instructs the Cassandra CLI client on how to display data and helps the server enforce types during insertion operations.
    create column family User with comparator = UTF8Type;
    update column family User with
            column_metadata =
            [
            {column_name: first, validation_class: AsciiType},
            {column_name: last, validation_class: AsciiType},
            {column_name: age, validation_class: IntegerType, index_type: KEYS}
            ];
  • Querying data:

set User[ascii('jsmith')][ascii('first')] = ascii('John');
set User[ascii('jsmith')][ascii('last')] = ascii('Smith');
set User[ascii('jsmith')][ascii('age')] = '38';

get User where age = '38';
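
The where clause above works because the age column was declared with index_type: KEYS. As a rough mental model, a secondary index maps each indexed value to the row keys that contain it. The Python sketch below illustrates that idea only. It is an assumption-laden simplification, not Cassandra's actual implementation, and the names rows, age_index, set_column, and get_where_age are hypothetical.

```python
from collections import defaultdict

rows = {}                     # row key -> {column name -> value}
age_index = defaultdict(set)  # indexed age value -> set of row keys

def set_column(row_key, name, value):
    """Write a column and keep the index on 'age' in step with the data."""
    row = rows.setdefault(row_key, {})
    if name == "age":
        old = row.get("age")
        if old is not None:
            age_index[old].discard(row_key)  # drop the stale index entry
        age_index[value].add(row_key)
    row[name] = value

def get_where_age(value):
    """Return all rows whose 'age' column equals value, via the index."""
    return {key: rows[key] for key in sorted(age_index.get(value, ()))}

set_column("jsmith", "first", "John")
set_column("jsmith", "last", "Smith")
set_column("jsmith", "age", 38)
print(get_where_age(38))
```

Without such an index, answering the same query would mean scanning every row in the column family.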

  • Update is the same as set.

set User[ascii('jsmith')][ascii('first')] = ascii('Jack');

  • Quit the client:

quit;

 

In this post we have only touched the tip of the Cassandra iceberg. Later posts will dig deeper into it and show how to write .NET programs against it.