CassandraDB
CassandraDB, Facebook/Meta's solution to the storage and search of Inbox messages and my favorite "Java Low Latency Boiler", was released in 2008. These are notes on the 2009 paper from Facebook engineers.
Data Model
- Single-row operations are atomic
- Structured values with column-family groupings, like BigTable
- Sortable super and simple column families
- Time-based sorting (data model sketched below)
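To make the shape of the model concrete, here is a minimal in-memory sketch; the class names, the `sort_by_time` flag and the Inbox example are my own illustration, not the paper's API.

```python
import time

# Toy illustration of the data model: a row maps column-family names to
# column families; each column is (name, value, timestamp); a super column
# family would add one more level of nesting. All names here are invented.

class Column:
    def __init__(self, name, value, timestamp=None):
        self.name = name
        self.value = value
        self.timestamp = timestamp if timestamp is not None else time.time()

class ColumnFamily:
    def __init__(self, sort_by_time=False):
        self.columns = {}              # column name -> Column
        self.sort_by_time = sort_by_time

    def insert(self, column):
        self.columns[column.name] = column

    def sorted_columns(self):
        key = (lambda c: c.timestamp) if self.sort_by_time else (lambda c: c.name)
        return sorted(self.columns.values(), key=key)

# One row: all operations on a single row are atomic per replica.
row = {"Inbox": ColumnFamily(sort_by_time=True)}
row["Inbox"].insert(Column("msg-17", "hello", timestamp=3))
row["Inbox"].insert(Column("msg-02", "earlier", timestamp=1))
print([c.name for c in row["Inbox"].sorted_columns()])  # time-sorted: msg-02 first
```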
Outline
Writes and reads are routed to a node, where replica determination occurs. Reads can be configured so that the request is routed to the closest replica, or handled with a read-based quorum. Writes are handled with a write-based quorum and can be configured to be synchronous or asynchronous (for heavy write loads).
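The quorum idea in miniature, not Cassandra's actual code path; `Replica`, `coordinate_write` and `coordinate_read` are invented names, and a real read would reconcile versions across replicas rather than return the first response.

```python
# With N replicas, a write is acknowledged once a write quorum has stored it
# and a read returns once a read quorum has answered; quorums of size
# N//2 + 1 guarantee that the read and write sets overlap.

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}

    def store(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)

def quorum(n):
    return n // 2 + 1

def coordinate_write(replicas, key, value):
    acks = 0
    for r in replicas:
        r.store(key, value)               # stand-in for a network call
        acks += 1
        if acks >= quorum(len(replicas)):
            return True                   # quorum reached, ack the client
    return False

def coordinate_read(replicas, key):
    responses = [r.read(key) for r in replicas[:quorum(len(replicas))]]
    return responses[0]                   # real code reconciles versions here

replicas = [Replica(f"node{i}") for i in range(3)]
coordinate_write(replicas, "user:42", "inbox-v1")
print(coordinate_read(replicas, "user:42"))   # inbox-v1
```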
Partitioning
- Uses order-preserving consistent hashing (sketched after this list)
- Same problems as other systems regarding non-uniform data and load distribution
- Cassandra opts to solve this by analysing load information and having lightly loaded nodes move their position on the ring, rather than having each node cover multiple disjoint points in the hash space as Dynamo does.
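Roughly how I picture the ring; `identity_hash` is a crude stand-in for a real order-preserving hash (keys that compare lower land lower on the ring, which is what enables range scans), and the node names and positions are made up.

```python
import bisect

# Order-preserving consistent hashing sketch: each node owns a position on a
# ring and a key is served by the first node clockwise from the key's hash.

def identity_hash(key, ring_size=2**32):
    # crude order-preserving mapping: the first four bytes of the key
    padded = (key.encode() + b"\x00" * 4)[:4]
    return int.from_bytes(padded, "big") % ring_size

class Ring:
    def __init__(self):
        self.positions = []     # sorted node positions on the ring
        self.nodes = {}         # position -> node name

    def add_node(self, name, position):
        bisect.insort(self.positions, position)
        self.nodes[position] = name

    def coordinator(self, key):
        h = identity_hash(key)
        idx = bisect.bisect_right(self.positions, h) % len(self.positions)
        return self.nodes[self.positions[idx]]   # wraps around past the end

ring = Ring()
ring.add_node("A", 0x50000000)
ring.add_node("B", 0x70000000)
ring.add_node("C", 0xF0000000)
print(ring.coordinator("Alice"), ring.coordinator("alice"), ring.coordinator("zach"))  # A B C
# Rebalancing in Cassandra means a lightly loaded node moves its position,
# rather than taking on many virtual positions as Dynamo does.
```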
Replication
Replication is governed by a replication factor, N. The coordinator for a key is responsible for replicating it to the other N-1 replicas.
Provides replication policies - rack-unaware, rack-aware and datacenter-aware.
Nodes contact a leader to determine the range of the hashspace they're responsible for. The leader tries to maintain the invariant that no node is responsible for more than N-1 ranges of the hashspace/ring.
Every node knows the range in the hashspace that all other nodes are responsible for.
For geographic diversification, a preference list is constructed like with Dynamo.
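The paper only describes the policies at a high level, so the selection logic below is my own guess at how a preference list of N replicas could be built: walk the ring clockwise from the key, and in the rack-aware case prefer nodes from racks not yet represented.

```python
def replicas_for(ring, node_rack, key_pos, n=3, rack_aware=False):
    """ring: sorted list of (position, node); node_rack: node -> rack name."""
    nodes = [node for _, node in ring]
    start = next((i for i, (pos, _) in enumerate(ring) if pos >= key_pos), 0)
    walk = nodes[start:] + nodes[:start]       # coordinator first, then clockwise

    if not rack_aware:
        return walk[:n]

    chosen, racks = [], set()
    for node in walk:                          # first pass: one node per rack
        if node_rack[node] not in racks:
            chosen.append(node)
            racks.add(node_rack[node])
        if len(chosen) == n:
            return chosen
    for node in walk:                          # second pass: fill up if short
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == n:
            break
    return chosen

ring = [(10, "A"), (40, "B"), (70, "C"), (90, "D")]
racks = {"A": "r1", "B": "r1", "C": "r2", "D": "r3"}
print(replicas_for(ring, racks, key_pos=35, n=3, rack_aware=True))   # ['B', 'C', 'D']
print(replicas_for(ring, racks, key_pos=95, n=3, rack_aware=False))  # ['A', 'B', 'C']
```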
Cluster Membership
Uses Scuttlebutt, a gossip-based anti-entropy protocol. Gossip is used to disseminate system state as well as cluster membership.
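Not the real Scuttlebutt protocol, just a toy illustration of the anti-entropy idea it builds on: peers exchange version digests and push only the entries the other side is missing or has at an older version. Class and key names are invented.

```python
class GossipNode:
    def __init__(self, name):
        self.name = name
        self.state = {}                                  # key -> (value, version)

    def update_local(self, key, value, version):
        self.state[key] = (value, version)

    def digest(self):
        return {k: ver for k, (_, ver) in self.state.items()}

    def deltas_for(self, peer_digest):
        # everything the peer is missing or has at an older version
        return {k: (v, ver) for k, (v, ver) in self.state.items()
                if ver > peer_digest.get(k, -1)}

    def apply(self, deltas):
        for k, (v, ver) in deltas.items():
            if ver > self.state.get(k, (None, -1))[1]:
                self.state[k] = (v, ver)

def gossip_round(a, b):
    b.apply(a.deltas_for(b.digest()))
    a.apply(b.deltas_for(a.digest()))

n1, n2 = GossipNode("n1"), GossipNode("n2")
n1.update_local("n1:load", 0.7, version=3)
n2.update_local("n2:status", "UP", version=1)
gossip_round(n1, n2)
print(n2.state["n1:load"], n1.state["n2:status"])   # (0.7, 3) ('UP', 1)
```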
Failure Detection
Detects failures to avoid trying to communicate with unreachable nodes. Uses Accrual Failure Detection to provide a suspicion level for monitored nodes.
This suspicion level has its scale adjusted to reflect network and load conditions at the node being monitored.
This suspicion level is calculated at the node level using a sliding window of inter-arrival times of gossip messages from other nodes in the cluster.
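A worked sketch of the accrual idea, assuming (as the paper does for its approximation) that gossip inter-arrival times are roughly exponentially distributed, which gives phi = elapsed / (mean * ln 10); the window size and the example heartbeats are arbitrary.

```python
import math
from collections import deque

class AccrualDetector:
    def __init__(self, window=100):
        self.intervals = deque(maxlen=window)   # sliding window of inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # P(next heartbeat still pending) ~ exp(-elapsed / mean)
        # phi = -log10 of that probability = elapsed / (mean * ln 10)
        return elapsed / (mean * math.log(10))

d = AccrualDetector()
for t in range(10):              # heartbeats arriving every second
    d.heartbeat(float(t))
print(round(d.phi(10.0), 2))     # ~0.43: on schedule, low suspicion
print(round(d.phi(20.0), 2))     # ~4.78: long overdue, high suspicion
```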
Local Persistence
- Writes go to a commit log for durability, then to an in-memory data structure once the log write succeeds. A separate disk is used for the commit log to exploit its sequential write pattern. (The whole write/read path is sketched after this list.)
- The in-memory structure is written to disk once it crosses a size threshold; each dump also generates an index for lookup by key.
- A merge process runs in the background to compact the multiple written files.
- Reads query the in-memory structure before disk. Files are checked newest to oldest (to exploit temporal locality).
- On-disk and in-memory bloom filters avoid looking up keys in files that don't contain them.
- Column indices give better random access to the right disk chunk for a given column; one is output at every 256K chunk boundary.
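Putting the bullets above together as a toy write/read path; every name, the flush threshold and the bloom filter parameters are invented for illustration, and the commit log and data files live in memory here instead of on disk.

```python
import hashlib

class BloomFilter:
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes, self.bitset = bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.bitset |= 1 << p

    def might_contain(self, key):
        return all(self.bitset >> p & 1 for p in self._positions(key))

class Store:
    def __init__(self, flush_threshold=3):
        self.commit_log = []             # stand-in for the sequential commit-log disk
        self.memtable = {}               # the in-memory structure
        self.files = []                  # immutable (bloom, sorted data) pairs, newest last
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))    # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        bloom = BloomFilter()
        for k in self.memtable:
            bloom.add(k)
        self.files.append((bloom, dict(sorted(self.memtable.items()))))
        self.memtable = {}               # a background merge would later compact self.files

    def read(self, key):
        if key in self.memtable:                    # memory before disk
            return self.memtable[key]
        for bloom, data in reversed(self.files):    # newest file first
            if bloom.might_contain(key) and key in data:
                return data[key]
        return None

s = Store()
for i in range(5):
    s.write(f"k{i}", f"v{i}")
print(s.read("k1"), s.read("k4"), s.read("missing"))   # v1 v4 None
```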
Implementation Details
- Modularised structure
- SEDA-style architecture
- NIO, with UDP for system control messages and TCP for requests and replication
- Rolling commit log with a bit vector in its header indicating which column families from that log segment have been persisted to disk (so old segments can be recycled).
- Writes to the commit log can be buffered under fast-sync mode, meaning potential data loss if the machine is lost.
- Low use of locks: files written to disk are immutable and only ever read, so no read locks are needed.
- Data on disk is block-based, with each block containing at most 128 keys and demarcated by a block index holding the relative offset of each key within the block and the size of its data (sketched after this list).
- The index of blocks and offsets is also maintained in memory for fast access.
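A sketch of what a 128-key block with a relative-offset index might look like; the actual on-disk encoding isn't in these notes, so the layout below is a guess for illustration.

```python
BLOCK_SIZE = 128    # at most 128 keys per block

def build_blocks(sorted_items):
    """sorted_items: list of (key, bytes value), already sorted by key."""
    blocks = []
    for start in range(0, len(sorted_items), BLOCK_SIZE):
        chunk = sorted_items[start:start + BLOCK_SIZE]
        index, payload, offset = {}, b"", 0
        for key, value in chunk:
            index[key] = (offset, len(value))   # relative offset within the block, data size
            payload += value
            offset += len(value)
        blocks.append({"first_key": chunk[0][0], "index": index, "data": payload})
    return blocks

def lookup(blocks, key):
    for block in blocks:                        # an in-memory block index would be searched here
        if key in block["index"]:
            off, size = block["index"][key]
            return block["data"][off:off + size]
    return None

items = sorted((f"key{i:04d}", f"value{i}".encode()) for i in range(300))
blocks = build_blocks(items)
print(len(blocks), lookup(blocks, "key0123"))   # 3 b'value123'
```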
Links and References
Cassandra - A Decentralized Structured Storage System, 2009 - Avinash Lakshman (Facebook), Prashant Malik (Facebook)