DynamoDB, Lessons and Development
What is DynamoDB and what lessons did the Amazon team learn from developing it? Notes from the 2022 paper from Amazon's engineering team.
DynamoDB launched in 2012 as a culmination of the work and lessons learned from the Dynamo system, but with a different architecture from Dynamo. You can read more about Dynamo and the 2007 paper in my previous post.
Fundamental System Properties
- Fully managed - provisioning, failure handling, encryption, backups, etc.
- Multi-tenant - the same physical machines are used for multiple clients.
- Boundless scale with hardware, horizontal scaling, partitioning and re-partitioning.
- Predictable latencies even with larger data sets.
Summary of Lessons Learnt
- Physically partitioning tables according to traffic patterns gives a better customer experience.
- Use continuous verification of data at rest to protect against hardware failures and bugs - better durability.
- Importance of operational discipline and procedure.
- Importance of tooling, formal methods and model-checking techniques.
- Prioritizing predictability over efficiency improves stability - you get fewer cascading outliers across the system.
Architecture
A primary key schema is defined when the table is created. The key is either a single partition key, or a composite of a partition key and a sort key; the full primary key must be unique. The partition key is hashed, and the hash output, along with the sort key if specified, determines where an item's data lives.
Secondary indexes are also supported for lookups on non-key attributes.
Conditions can be specified along with write operations and must hold true for the operation to succeed.
Supports ACID transactions.
Supports point-in-time restore - snapshot + event replay.
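As an illustration of the composite key and conditional writes, here is a hedged example using boto3 (the AWS SDK for Python); the table name, attribute names and region are assumptions made for this sketch, not anything from the paper.

```python
# A hedged example of a conditional write against a table with a composite
# primary key, using boto3. The "orders" table and its attributes are
# invented for illustration.

import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.put_item(
    TableName="orders",
    Item={
        "customer_id": {"S": "c-123"},        # partition key (hashed to pick a partition)
        "order_id": {"S": "2022-09-01#42"},   # sort key (orders items within the partition)
        "total": {"N": "99"},
    },
    # The write only succeeds if no item with this primary key already exists.
    ConditionExpression="attribute_not_exists(customer_id) AND attribute_not_exists(order_id)",
)
```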
Replication
Data is partitioned horizontally in a disjoint and contiguous manner, with replicas in different data centres. Replicas form a replication group that uses Multi-Paxos consensus (leader election with leader lease renewal). Only the leader serves writes.
On write requests, the leader persists the write to a write-ahead log and sends it to the peers. The peers ack the write once they've persisted to their local write ahead log. Upon quorum, the write is acked to the application.
Peers can detect failure and trigger a leader election. Peers can also request the leader's state from other peers if they can't contact the leader - this lets a peer distinguish between the leader being down and its own connection being down, which reduces some classes of gray failures.
Replication groups store the write-ahead log and a B-tree of the key-value data. Some nodes in a replication group can also be set up to only persist recent write-ahead log entries - these are log replicas that don't hold the key-value data.
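A minimal sketch of the quorum-acknowledged write path described above. The names Replica and ReplicationGroup are invented, and real replicas persist to disk and communicate over the network; this just shows the "ack to the application only after a quorum has persisted the entry" rule.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    wal: list = field(default_factory=list)  # write-ahead log entries

    def append(self, entry) -> bool:
        # Persist the entry to the local WAL and ack it.
        self.wal.append(entry)
        return True


@dataclass
class ReplicationGroup:
    leader: Replica
    followers: list

    def write(self, entry) -> bool:
        # The leader persists locally first, then sends the entry to its peers.
        acks = 1 if self.leader.append(entry) else 0
        for peer in self.followers:
            if peer.append(entry):
                acks += 1
        # Quorum = a majority of the replication group.
        quorum = (1 + len(self.followers)) // 2 + 1
        return acks >= quorum  # only now is the write acked to the application


group = ReplicationGroup(Replica("leader"), [Replica("f1"), Replica("f2")])
assert group.write({"key": "k1", "value": "v1"})
```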
Autoadmin
The autoadmin service monitors system health and handles control plane requests. It also handles things like recovery.
Admission Control
DynamoDB uses an admission control system to prevent nodes from being overloaded and to minimise multi-tenant interference.
Nodes are allocated throughput in proportion to their capacity.
Each partition is also allocated some throughput. This means a node can be prevented from exceeding its defined capacity based on the partitions it is responsible for.
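A toy sketch of that accounting, with invented function names and numbers, just to make the proportional allocation concrete; it is not DynamoDB's admission control logic.

```python
def partition_allocation(table_provisioned_rcu: float, num_partitions: int) -> float:
    # Provisioned throughput divided evenly across the table's partitions.
    return table_provisioned_rcu / num_partitions


def node_can_host(node_capacity_rcu: float, hosted_allocations: list[float],
                  new_allocation: float) -> bool:
    # The node stays within its defined capacity based on the partitions it hosts.
    return sum(hosted_allocations) + new_allocation <= node_capacity_rcu


alloc = partition_allocation(table_provisioned_rcu=1000, num_partitions=4)  # 250 RCU each
print(node_can_host(node_capacity_rcu=3000, hosted_allocations=[250, 500], new_allocation=alloc))
```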
Splitting
When a table or partition is accessed a lot or grows large, it gets split - each child partition is allocated a share of the parent's throughput on a pro-rata basis.
This causes problems because uniform access is unlikely - after a split, the hot portion of the data may end up in a single child partition, but because the allocated throughput is divided evenly across the children, that one hot partition ends up with less throughput allocated to it than before.
The two most commonly faced challenges for applications were hot partitions and throughput dilution. Hot partitions arose in applications that had traffic going consistently towards a few items of their tables; the hot items could belong to a stable set of partitions or could hop around to different partitions over time. Throughput dilution was common for tables whose partitions were split for size: splitting a partition for size would cause the throughput of the partition to be divided equally among the newly created child partitions, and hence the per-partition throughput would decrease.
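The dilution is just arithmetic; a toy example with invented numbers:

```python
# A partition provisioned at 1000 WCU is split for size into two children,
# each getting 500 WCU, so a hot key that lands in one child can now drive
# at most 500 WCU even though the table's total provisioning is unchanged.

parent_provisioned_wcu = 1000
children = 2

per_child_wcu = parent_provisioned_wcu / children  # 500 WCU each
hot_key_ceiling_before = parent_provisioned_wcu    # 1000 WCU
hot_key_ceiling_after = per_child_wcu              # 500 WCU after the split

print(hot_key_ceiling_before, hot_key_ceiling_after)
```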
Bursting
The engineers at Amazon observed low covariance between partitions on a node - not all would use their allocated capacity at the same time. There was also non-uniform access to partitions.
Bursting was introduced to allow temporary spikes in usage, but only if there is unused node-level capacity. Admission control occurs at the node and partition level using a token-based system.
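The mechanics can be pictured with a pair of token buckets. This is a minimal sketch under simplified assumptions (the refill rates and the TokenBucket/admit names are invented), not DynamoDB's implementation: a request beyond the partition's allocation is only admitted while there are unused node-level tokens.

```python
import time


class TokenBucket:
    """Refills at a steady rate up to a maximum burst size."""

    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def try_consume(self, n: float) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


partition_bucket = TokenBucket(rate_per_s=100, burst=300)   # the partition's allocation
node_bucket = TokenBucket(rate_per_s=1000, burst=1000)      # whole-node capacity


def admit(cost: float) -> bool:
    if partition_bucket.try_consume(cost):
        # Within the partition's allocation: also draw down node-level tokens.
        node_bucket.try_consume(cost)
        return True
    # Burst above the partition's allocation: only allowed while the node
    # still has unused capacity.
    return node_bucket.try_consume(cost)
```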
Adaptive Capacity
Provisioned and consumed capacity is monitored at the table level. If a table is throttled but its table-level throughput has not been exceeded, the affected partition's capacity gets boosted. Boosted partitions can be moved to other nodes automatically via autoadmin.
Global Admission Control
Adaptive capacity was replaced with Global Admission Control (GAC). A GAC service tracks table-level consumption in terms of tokens and handles token replenishment - in effect it tracks consumption against each table's configured token bucket.
The services that consume GAC tokens (the request routers) maintain local token buckets, which are replenished at intervals or when they run out of tokens.
The GAC instance uses the information provided by the client to estimate global token consumption, and vends the client its share of the tokens available for the next time unit.
Partition level token buckets were also left in place for defense in depth.
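A minimal sketch of that client/service interaction, with invented names (GacService, GacClient, request_tokens) and a deliberately simplified vending policy; it only shows the shape of "local bucket, periodically topped up from a central tracker".

```python
import time


class GacService:
    """Toy stand-in for a GAC instance. A real one would estimate global
    consumption from what clients report and vend each client its share of
    the tokens available for the next time unit."""

    def __init__(self, table_tokens_per_s: float, num_routers: int):
        self.rate = table_tokens_per_s
        self.num_routers = num_routers

    def request_tokens(self, table: str, consumed_since_last_sync: float) -> float:
        # Extremely simplified: ignore the report and split the table's
        # per-second allowance evenly across routers.
        return self.rate / self.num_routers


class GacClient:
    """A request router's local bucket, topped up from the GAC service."""

    def __init__(self, gac: GacService, table: str, refresh_interval_s: float = 1.0):
        self.gac = gac
        self.table = table
        self.refresh_interval_s = refresh_interval_s
        self.local_tokens = 0.0
        self.consumed_since_sync = 0.0
        self.last_sync = float("-inf")

    def _sync(self) -> None:
        # Report recent consumption and receive a fresh share of tokens.
        self.local_tokens += self.gac.request_tokens(self.table, self.consumed_since_sync)
        self.consumed_since_sync = 0.0
        self.last_sync = time.monotonic()

    def admit(self, cost: float) -> bool:
        stale = time.monotonic() - self.last_sync > self.refresh_interval_s
        if self.local_tokens < cost or stale:
            self._sync()
        if self.local_tokens >= cost:
            self.local_tokens -= cost
            self.consumed_since_sync += cost
            return True
        return False


router = GacClient(GacService(table_tokens_per_s=1000, num_routers=4), table="orders")
print(router.admit(10))  # True: the router synced and received its share of tokens
```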
Proactive Partition Balancing
Nodes can end up hosting replicas from different users with very different traffic patterns. Because it was observed empirically that not all partitions utilise their full capacity at the same time, each node is deliberately packed with enough replicas that their combined allocation can exceed the node's provisioned capacity.
Based on throughput and storage, partitions are therefore proactively balanced across nodes.
In case the throughput is beyond a threshold percentage of the maximum capacity of the node, it reports to the autoadmin service a list of candidate partition replicas to move from the current node. The autoadmin finds a new storage node for the partition in the same or another Availability Zone that doesn't have a replica of this partition.
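A toy sketch of the reporting side of that loop, with invented names and thresholds; the real system's selection policy is not described in this much detail.

```python
from dataclasses import dataclass


@dataclass
class HostedReplica:
    partition_id: str
    observed_rcu: float


def candidates_to_move(replicas: list[HostedReplica],
                       node_capacity_rcu: float,
                       threshold: float = 0.8) -> list[str]:
    total = sum(r.observed_rcu for r in replicas)
    if total <= threshold * node_capacity_rcu:
        return []  # node is healthy; nothing to report to autoadmin
    # Report the hottest replicas first; autoadmin would pick new nodes for
    # them in an availability zone that doesn't already host a replica of
    # the same partition.
    overload = total - threshold * node_capacity_rcu
    moved, chosen = 0.0, []
    for r in sorted(replicas, key=lambda r: r.observed_rcu, reverse=True):
        if moved >= overload:
            break
        chosen.append(r.partition_id)
        moved += r.observed_rcu
    return chosen


print(candidates_to_move(
    [HostedReplica("p1", 900), HostedReplica("p2", 300), HostedReplica("p3", 100)],
    node_capacity_rcu=1000,
))
```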
Splitting Based on Empirical Key Distribution
Instead of splitting hot partitions and tables in half by key range, the empirical distribution of observed keys is used to pick the split point.
Splitting is avoided for single hot items, or where the data is being accessed sequentially, because splitting cannot spread that kind of load.
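A minimal sketch of the idea, not the production algorithm: pick a split key that divides the observed traffic, rather than the key range, roughly in half. The helper name and traffic numbers are invented.

```python
from collections import Counter


def choose_split_key(access_counts: Counter) -> str | None:
    total = sum(access_counts.values())
    if len(access_counts) < 2:
        return None  # a single hot item: splitting cannot spread the load
    keys = sorted(access_counts)
    running = 0
    for key in keys[:-1]:            # never split after the last key
        running += access_counts[key]
        if running >= total / 2:
            return key               # keys <= this go to the left child
    return keys[-2]                  # fallback: isolate the heaviest trailing key


traffic = Counter({"a": 5, "b": 5, "m": 400, "n": 380, "z": 10})
print(choose_split_key(traffic))  # splits near the hot keys, not mid key-range
```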
Durability and Correctness
- Triple replicas with write-ahead logs which are integrity checked (checksums and sequence numbers) before being archived to S3
- Checksums on all log entries, messages etc. with integrity validation on data transfer between any two nodes
- Detection of silent errors and bit rot using a scrub process - comparison of live replicas with one built from the history of the write-ahead logs
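A toy illustration of the first kind of check, with an invented log-entry layout: verify checksums and sequence continuity before archiving a batch of write-ahead log entries.

```python
import zlib
from dataclasses import dataclass


@dataclass
class LogEntry:
    sequence: int
    payload: bytes
    checksum: int

    @classmethod
    def create(cls, sequence: int, payload: bytes) -> "LogEntry":
        return cls(sequence, payload, zlib.crc32(payload))


def verify_before_archive(entries: list[LogEntry], expected_start: int) -> bool:
    next_seq = expected_start
    for entry in entries:
        if entry.sequence != next_seq:
            return False                              # gap or reordering in the log
        if zlib.crc32(entry.payload) != entry.checksum:
            return False                              # corrupted entry (bit rot, bad transfer)
        next_seq += 1
    return True


log = [LogEntry.create(i, f"write-{i}".encode()) for i in range(10, 15)]
assert verify_before_archive(log, expected_start=10)
```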
Gray Failures
Gray network failures can happen because of communication issues between a leader and follower, issues with outbound or inbound communication of a node, or front-end routers facing communication issues with the leader even though the leader and followers can communicate with each other. Gray failures can disrupt availability because there might be a false positive in failure detection, or no failure detection at all.
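One mitigation, mentioned in the replication section above, is for a follower to cross-check with its peers before triggering a leader election. A minimal sketch of that decision, with invented names:

```python
def should_trigger_election(i_can_reach_leader: bool,
                            peers_can_reach_leader: list[bool]) -> bool:
    if i_can_reach_leader:
        return False
    if any(peers_can_reach_leader):
        # Peers still see a healthy leader: this is likely my own (gray)
        # connectivity problem, so don't disrupt availability with a
        # needless election.
        return False
    return True


print(should_trigger_election(False, [True, True]))    # False: gray failure on my side
print(should_trigger_election(False, [False, False]))  # True: the leader looks down
```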
Operations and Deployments
- Component-level upgrade and downgrade tests
- Rollback tested as part of every deployment
- Read-write deployments for message/protocol changes - deploy the ability to read the new format first, then deploy the ability to write it
- Controlled rollouts with automatic rollback based on alarm thresholds
- Canary applications to proactively detect user-facing issues
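A minimal sketch of the read-write deployment pattern, with invented message formats: readers that understand both versions are rolled out everywhere first, and only then is writing the new format switched on.

```python
CAN_WRITE_NEW_FORMAT = False  # flipped only in the second deployment phase


def serialize(item: dict) -> dict:
    if CAN_WRITE_NEW_FORMAT:
        return {"version": 2, "payload": item}
    return {"item": item}                      # legacy format, readable everywhere


def deserialize(message: dict) -> dict:
    # Phase one deploys this reader everywhere: it accepts both formats,
    # so enabling the new writer later cannot break older-looking messages.
    if message.get("version") == 2:
        return message["payload"]
    return message["item"]


print(deserialize(serialize({"key": "k1"})))
```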
Links and References
Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service, 2022 - Mostafa Elhemali, Niall Gallagher, Nicholas Gordon, Joseph Idziorek, Richard Krog, Colin Lazier, Erben Mo, Akhilesh Mritunjai, Somu Perianayagam, Tim Rath, Swami Sivasubramanian, James Christopher Sorenson III, Sroaj Sosothikul, Doug Terry, Akshat Vig