Software Development: junio 2014

Currently i'm intern at Flytecomm working on data engineering and as big data trainee we want to upgrade our system using new technologies as NoSQL with Cassandra for time series.
The most important to start is to know how cassandra stores data and then choose the right partitioner depending on the queries that we need, we will use CQL3 to get token values.

We want to share our test using token function on range queries based in tokens.

A partitioner determines how data is distributed across the nodes in the cluster (including replicas). Basically, a partitioner is a hash function for computing the token (it's hash) of a partition key. Each row of data is uniquely identified by a partition key and distributed across the cluster by the value of the token.

Tested:
Murmur3Partitioner
ByteOrderedPartitioner
More info about partitioners

Cassandra version 2.0.7
- Schema

create keyspace ksptest
with replication = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

use ksptest;

create table data1(
    key1 int,
    key2 int,
    id int,
    column1 timestamp,
    column2 timestamp,
    PRIMARY KEY((key1, key2), id));


INSERT INTO ksptest.data1( id,key1,key2,column1,column2)
VALUES(1,2014,1,'2014-01-01 16:50:45 -0500','2014-01-01 18:50:45 -0500');
INSERT INTO ksptest.data1( id,key1,key2,column1,column2)
VALUES(2,2014,1,'2014-01-01 17:50:45 -0500','2014-01-01 19:50:45 -0500');
INSERT INTO ksptest.data1( id,key1,key2,column1,column2)
VALUES(3,2014,1,'2014-01-01 18:50:45 -0500','2014-01-01 20:50:45 -0500');
INSERT INTO ksptest.data1( id,key1,key2,column1,column2)
VALUES(4,2014,2,'2014-02-01 16:50:45 -0500','2014-02-01 18:50:45 -0500');
INSERT INTO ksptest.data1( id,key1,key2,column1,column2)
VALUES(5,2014,2,'2014-02-01 17:50:45 -0500','2014-02-01 19:50:45 -0500');
INSERT INTO ksptest.data1( id,key1,key2,column1,column2)
VALUES(6,2014,2,'2014-02-01 18:50:45 -0500','2014-02-01 20:50:45 -0500');
INSERT INTO ksptest.data1( id,key1,key2,column1,column2)
VALUES(7,2014,3,'2014-03-01 16:50:45 -0500','2014-03-01 17:50:45 -0500');
INSERT INTO ksptest.data1( id,key1,key2,column1,column2)
VALUES(8,2014,3,'2014-03-01 17:50:45 -0500','2014-03-01 18:50:45 -0500');
INSERT INTO ksptest.data1( id,key1,key2,column1,column2)
VALUES(9,2014,3,'2014-03-01 18:50:45 -0500','2014-03-01 19:50:45 -0500');
INSERT INTO ksptest.data1( id,key1,key2,column1,column2)
VALUES(10,2014,3,'2014-03-01 19:50:45 -0500','2014-03-01 20:50:45 -0500');

- Using Murmur3Partitioner

SELECT token(key1,key2),key1,key2 FROM ksptest.data1 where token(key1,key2) = token(2014,1);

SELECT token(key1,key2),key1,key2 FROM ksptest.data1 where token(key1,key2) = token(2014,2);

SELECT token(key1,key2),key1,key2 FROM ksptest.data1 where token(key1,key2) = token(2014,3);

TOKENS

token(key1, key2) | key1 | key2
----------------------+------+------+----
-2662374876872028068 | 2014 | 1
(3 rows)

token(key1, key2) | key1 | key2
----------------------+------+------+----
-8469758598453416143 | 2014 | 2
(3 rows)

token(key1, key2) | key1 | key2
---------------------+------+------+----
8957232040621060434 | 2014 | 3
(4 rows)

Querying token ranges

- Okey the first logic query should be get all from 1 to 3 row key and get 10 rows

SELECT token(key1,key2),key1,key2,id FROM ksptest.data1 where
token(key1,key2) >= token(2014,1) and token(key1,key2) <= token(2014,3);

token(key1, key2) | key1 | key2 | id
----------------------+------+------+----
-2662374876872028068 | 2014 | 1 | 1
-2662374876872028068 | 2014 | 1 | 2
-2662374876872028068 | 2014 | 1 | 3
8957232040621060434 | 2014 | 3 | 7
8957232040621060434 | 2014 | 3 | 8
8957232040621060434 | 2014 | 3 | 9
8957232040621060434 | 2014 | 3 | 10
(7 rows)

Why we get 7 rows?

Because we are querying tokens so we are searching a hash(token)

Sorted tokens
-8469758598453416143 2 (3 rows)
-2662374876872028068 1 (3 rows)<--
8957232040621060434 3 (4 rows)<--

So if we query from 2 to 3 we will get all row keys and that is because our tokens are created randomly by the partitioner and the range query go through all nodes requested only

- Using ByteOrderedPartitioner

SELECT token(key1,key2),key1,key2 FROM ksptest.data1 where token(key1,key2) = token(2014,1);

SELECT token(key1,key2),key1,key2 FROM ksptest.data1 where token(key1,key2) = token(2014,2);

SELECT token(key1,key2),key1,key2 FROM ksptest.data1 where token(key1,key2) = token(2014,3);

TOKENS

token(key1, key2) | key1 | key2
--------------------------------+------+------
0x0004000007de0000040000000100 | 2014 | 1
(3 rows)

token(key1, key2) | key1 | key2
--------------------------------+------+------
0x0004000007de0000040000000200 | 2014 | 2
(3 rows)

token(key1, key2) | key1 | key2
--------------------------------+------+------
0x0004000007de0000040000000300 | 2014 | 3
(4 rows)

Querying token ranges

Same query to get all rows from 1 to 3 and should get 10 rows

SELECT token(key1,key2),key1,key2,id FROM ksptest.data1 where
token(key1,key2) >= token(2014,1) and token(key1,key2) <= token(2014,3);

token(key1, key2) | key1 | key2 | id
--------------------------------+------+------+----
0x0004000007de0000040000000100 | 2014 | 1 | 1
0x0004000007de0000040000000100 | 2014 | 1 | 2
0x0004000007de0000040000000100 | 2014 | 1 | 3
0x0004000007de0000040000000200 | 2014 | 2 | 4
0x0004000007de0000040000000200 | 2014 | 2 | 5
0x0004000007de0000040000000200 | 2014 | 2 | 6
0x0004000007de0000040000000300 | 2014 | 3 | 7
0x0004000007de0000040000000300 | 2014 | 3 | 8
0x0004000007de0000040000000300 | 2014 | 3 | 9
0x0004000007de0000040000000300 | 2014 | 3 | 10
(10 rows)

Conclusion

We can now understand how token function works and how return data but choose ByteOrderedPartitioner means a complex work to manage a cluster and there are very few good reasons to use it.

Here you can find an analysis about partitioners:

Apache Cassandra: The Case Against The ByteOrderedPartitioner

Software Development

Datos personales

sábado, 7 de junio de 2014

Cassandra: Partitioners and Tokens