Cassandra

Engine

Features

Consistent hashing
Replication factor, replicas of the data across the cluster
Consistency level controlled for each query
Up to 2 billion key-value pairs in a row

Replication factor = 3
Consistency level = QUORUM
Clients talks to any node, the node hashes the partition key and finds the location of the data
Data is read from all the replicas waiting for responses until we reach a quorum

Acknowledged when we write to both the commit log (append only) and the memtable
When the memtable becomes full it’s flushed into an SSTable
Periodically SSTables are merged

Check if the key is in the in-memory row cache
Query the bloomfilters of the existing SSTables to find the record, if it doesn’t exist then skip the SSTable
If the bloomfilter says that there may be data check the in-memory key cache
On miss get the data from the SSTable and merge it with the data in the memtable, write the key to the in-memory key-cache and merged result to the in-memory row cache

Data modeling

Goals

spread data evenly around the cluster
minimize the number of partitions read
keep partitions manageable

Process

Identify initial entities and relationships
Key attributes (map to PK columns)
Equality search attributes (map to the beginning of the PK)
Inequality search attributes (map to clustering columns)
Other attributes
- Static attributes are shared within a given partition

primary key = partition key + clustering columns

Legend:

K Partition key
C Clustering key and their ordering (ascending or descending)
S Static columns, fixed and shared per partition

Validation

Is data evenly spread?
1 partition per read?
Are writes (overwrites) possible?
How large are the partitions? Let’s assume that each partition should have at most 1M cells, $n_{cells} = n_{rows} * (n_{cols} - n_{K} - n_{S}) + n_{S} < 1M$
How much data duplication?

Examples

Store books by ISBN

Attribute	Special
isbn	K
title
author
genre
publisher

Is data evenly spread? Yes
1 partition per read? Yes
Are writes (overwrites) possible? Yes
How large are the partitions? $1 * (5 - 1 - 0) + 0 < 1M$
How much data duplication? 0

Register a user uniquely identified by an email/password, we also want their fullname. They will be accessed by email and password or by UUID

Attribute	Special
email	K
password	C
fullname
uuid

Q1: find users by login info

Q3: find users by email (to guarantee uniqueness)

Is data evenly spread? Yes
1 partition per read? Yes
Are writes (overwrites) possible? Yes
How large are the partitions? $1 * (4 - 1 - 0) + 0 < 1M$
How much data duplication? 0

Attribute	Special
uuid	K
fullname

Q2: get users by UUID

Is data evenly spread? Yes
1 partition per read? Yes
Are writes (overwrites) possible? Yes
How large are the partitions? $1 * (2 - 1 - 0) + 0 < 1M$
How much data duplication? 0

Find books a logged in user has read sorted by title and author

Attribute	Special
uuid	K
title	C
author	C
fullname	S
ISBN
genre
publisher

Is data evenly spread? Yes
1 partition per read? Yes
Are writes (overwrites) possible? Yes
How large are the partitions? (up to 200k book reads per user)

$$ \begin{align*} n_{books} * (7 - 1 - 1) + 1 & < 1M \\\\ n_{books} & < \frac{1M}{5} - 1 \\\\ n_{books} & < 200k \end{align*} $$

How much data duplication? 0

Interaction of every user in the website

Attribute	Special
uuid	K
time	C (desc)
element
type

Is data evenly spread? Yes
1 partition per read? Yes
Are writes (overwrites) possible? Yes
How large are the partitions? (up to 333k book reads per user, 333k actions may be low number of actions to store therefore we should store actions by bucket)

$$ \begin{align*} n_{actions} * (4 - 1 - 0) + 0 & < 1M \\\\ n_{actions} & < 333K \end{align*} $$

How much data duplication? 0

Attribute	Special
uuid	K
month	K
time	C (desc)
element
type

Is data evenly spread? Yes
1 partition per read? Yes
Are writes (overwrites) possible? Yes
How large are the partitions? (up to 333k book reads per user)

1 year  = 333k / 365 / 24 = 38 actions / h
1 month = 333k / 30 / 24  = 462 actions / h (most realistic case)
1 week  = 333k / 7 / 24   = 1984 actions / h

$$ \begin{align*} n_{actions} * (5 - 2 - 0) + 0 & < 1M \\\\ n_{actions} & < 333K \end{align*} $$

How much data duplication? 0

References

Engine

Features

Data modeling

Goals

Process

Validation

Examples

References

See Also

Partitioning

Preparation for a Software Engineer interview

Back of the envelope calculations