Distributed Systems Series — Part 3.5: Replication, Consistency & Consensus
Progress Without Perfect Agreement
The CAP Theorem established an uncomfortable truth: during a network partition, a distributed system cannot simultaneously provide both strong consistency and availability. During normal operation, things are better — but even then, Replication means data exists on multiple nodes that can disagree temporarily.
If a system required every replica to agree before accepting a write or serving a read, any single replica failure would make the entire system unavailable. That is too fragile for production systems that must tolerate constant partial failures.
Quorums are the engineering answer to this problem. Instead of requiring unanimous agreement, a quorum-based system requires only enough agreement — a minimum number of replicas participating in an operation for it to be considered successful. This is the practical mechanism that allows leaderless distributed systems to make progress despite ongoing failures.
Quorums are the compromise between the impossible (perfect agreement) and the unacceptable (no agreement at all).
The Quorum Formula
In a system with N replicas, a write requires W replicas to acknowledge it before it is considered successful, and a read queries R replicas and returns the most recent value among the responses.
The fundamental rule that makes quorums work:
If W + R > N, the write set and the read set must overlap.
At least one replica that acknowledged the write will always respond to the read. That replica has the most recent value. The read therefore always sees at least one current copy of the data.
This is the mathematical guarantee that prevents lost updates in quorum-based systems. Let’s make it concrete.
A Worked Example: Three Nodes, One Failure
Consider a cluster of N=3 replicas: Node A, Node B, and Node C. Configure W=2 and R=2. Check: W + R = 4 > N = 3, so the quorum rule holds.
Normal operation — a write then a read:
- A client writes value X=42 to the cluster
- The write goes to all three nodes. Node A acknowledges. Node B acknowledges. That is W=2 — the write is confirmed to the client. Node C is slow but the write does not wait for it
- Node C eventually receives the write (or will catch up via read repair)
- A client reads X. The read queries all three nodes. Node A returns 42. Node B returns 42. Node C returns the old value (it hasn’t caught up). That is R=2 responses — the read returns 42, the most recent value, because at least two of the three responses agree
Failure scenario — Node C is down:
- A client writes X=99. Node A acknowledges. Node B acknowledges. W=2 achieved — write succeeds. Node C is unreachable and misses the write entirely
- A client reads X. Queries Node A (returns 99), Node B (returns 99). R=2 achieved — read returns 99 correctly. Node C is not consulted because it is down
- Node C recovers. It still has the old value. The next read that touches Node C will detect the discrepancy and trigger read repair, bringing Node C up to date
Why W + R > N matters: if W=1 and R=1 (W + R = 2, not > N=3), a write could go to Node A only, and the subsequent read could go to Node B only — which never received the write. The read returns stale data with no indication that anything is wrong. The quorum overlap prevents this: when W + R > N, there is always at least one node in the intersection of the write set and the read set.
Quorum Configurations and Their Trade-offs
Different W and R configurations produce different trade-offs between read performance, write performance, and consistency strength. With N=3:
W=2, R=2 — the balanced configuration. Tolerates one node failure for both reads and writes. Write latency waits for two acknowledgements. Read latency waits for two responses. The most common production default for systems that need reasonable consistency without sacrificing too much availability. Used as Cassandra’s QUORUM consistency level in single-datacenter deployments.
W=3, R=1 — writes are maximally durable, reads are fast. Every node has every write before it is confirmed. Any single node can answer a read correctly. The cost: write latency waits for all three nodes — if any node is slow or unreachable, the write fails. This configuration sacrifices write availability for read simplicity. Appropriate when reads are far more frequent than writes and write availability is not critical.
W=1, R=3 — writes are fast, reads are expensive. Write confirms as soon as one node acknowledges. But reads must query all three nodes to guarantee seeing the most recent write. Appropriate when write throughput is the primary concern and read latency is acceptable. Vulnerable to data loss if the single node that acknowledged a write fails before the write propagates.
W=1, R=1 — maximum availability, minimum consistency. No overlap guarantee. The system accepts writes and serves reads as long as any single node is reachable. Eventual consistency in its weakest form. Appropriate for use cases where staleness is acceptable and availability is paramount — caches, counters, analytics pipelines.
Strict Quorums vs Sloppy Quorums
The quorum configurations above assume a strict quorum: a write must be acknowledged by W replicas from the designated set of N replicas for that data. If fewer than W replicas are reachable, the write fails.
During a network partition, strict quorums can make a system unavailable even when many nodes are still running — if the W required replicas happen to be on the unreachable side of the partition, the write fails regardless of how many other nodes are available.
Sloppy quorums address this by relaxing the requirement: if the designated W replicas are unavailable, the write is accepted by any W available nodes in the cluster, even if those nodes would not normally be responsible for that data. The write is stored with a hint indicating which node it was originally destined for. When the original node recovers, the hint is used to deliver the write — this is hinted handoff.
Sloppy quorums significantly increase write availability during partitions. A write that would fail under a strict quorum — because only one of the three designated replicas is reachable — succeeds under a sloppy quorum by being stored on two available non-designated nodes temporarily.
The cost: sloppy quorums do not satisfy the W + R > N overlap guarantee during the period when writes are on non-designated nodes. A read targeting the designated replicas may not find the write that was accepted by non-designated nodes. Sloppy quorums trade the consistency guarantee of strict quorums for higher availability during partitions, accepting that convergence will happen after the partition heals via hinted handoff delivery.
Amazon Dynamo uses sloppy quorums by default. Cassandra supports both — LOCAL_QUORUM uses strict quorums within a datacenter, while the combination of LOCAL_QUORUM writes and cross-datacenter replication involves elements of sloppy quorum behaviour during datacenter-level partitions.
Read Repair: How Quorum Systems Heal
In a quorum-based system, replicas frequently fall out of sync temporarily — a write reaches W replicas but not all N, a replica was down during a write and missed it, or replication is lagging under high load. Read repair is the primary mechanism by which these inconsistencies are detected and corrected without a background reconciliation process.
When a client performs a read that queries R replicas, the system compares the responses. If some replicas return a value with a lower version than others, the system identifies the stale replicas and sends them the up-to-date value as part of the read response. The stale replicas update their copies. The client receives the most recent value.
Read repair is elegant because it is zero-cost in terms of additional operations — it happens naturally as part of reads that would occur anyway. The more reads a system receives, the more frequently replicas are checked and healed. Heavily read data stays consistent. Rarely read data may fall behind, which is why some systems implement background anti-entropy processes that periodically compare and synchronise replicas regardless of read traffic.
Cassandra’s read repair happens in two modes. Synchronous read repair occurs during the read request itself — if the coordinator detects inconsistency, it repairs before returning to the client, at the cost of additional read latency. Background read repair occurs asynchronously after the response is returned — the client gets a fast response and the repair happens in the background. Which mode is used depends on the read consistency level and Cassandra’s configuration.
Why Quorums Do Not Automatically Provide Strong Consistency
This is the most important and most frequently misunderstood point about quorum-based systems. Meeting the W + R > N condition does not guarantee linearisability — the strong consistency model described in Post 3.3.
There are two specific scenarios where quorum-based reads can return stale data even when W + R > N:
Concurrent writes: two clients write different values to the same key at the same time. Both writes achieve quorum — W=2 on different sets of replicas. Now the cluster has Node A with value X, Node B with value Y, and Node C with either X or Y depending on timing. The write quorums overlapped in terms of count, but the two writes targeted overlapping but different sets. The system has accepted two conflicting writes. A subsequent read achieving R=2 may see X from Node A and Y from Node B — both are quorum members. Without careful version tracking and conflict resolution, this produces an incorrect result.
Failures during write propagation: a write achieves quorum — W=2 of 3 nodes acknowledge. But before the third node receives the write, a read hits all three nodes. Two of the three have the write (returning the new value), one does not (returning the old value). If R=2, the read correctly returns the new value. But if the two nodes that received the write happen to be unavailable for the read — and the one node that missed the write is available along with one that did have it — the result is ambiguous. With careful version tracking the read still returns the correct value; without it, it might not.
Providing linearisability in a quorum-based system requires additional mechanisms: typically a leader that serialises all writes, or a consensus protocol that ensures writes are applied in a total order across all replicas. This is exactly why Cassandra’s QUORUM consistency level does not provide linearisability — it provides a strong form of eventual consistency, but concurrent writes to the same key can still produce conflicts that require resolution. Cassandra’s SERIAL and LOCAL_SERIAL consistency levels, which use Paxos for lightweight transactions, provide linearisable operations at higher cost.
Quorums in Production Systems
Cassandra’s consistency levels expose the quorum configuration directly to applications. ONE means W=1 or R=1 — any single replica suffices. QUORUM means a majority of all replicas in the cluster. LOCAL_QUORUM means a majority of replicas in the local datacenter only — critical for multi-datacenter deployments where cross-datacenter quorums would add hundreds of milliseconds of latency. ALL means every replica must respond — maximum consistency, minimum availability. Cassandra allows different consistency levels per operation, enabling applications to use strong consistency for critical writes and weaker consistency for high-throughput reads.
Amazon DynamoDB abstracts quorum configuration from users but uses quorum-based replication internally. DynamoDB’s “eventually consistent reads” use a quorum of one replica — fast but potentially stale. “Strongly consistent reads” use a quorum that guarantees the most recent write is returned, at the cost of higher latency. The internal quorum parameters are not user-configurable, but the consistency level per read operation is.
Riak exposes n_val (the replication factor N), w (write quorum W), and r (read quorum R) as configurable parameters at the bucket level. Riak defaults to n_val=3, w=2, r=2 — the balanced configuration. Applications can override per-operation for specific consistency requirements.
etcd and ZooKeeper, despite using quorums for their internal consensus protocol (Raft and ZAB respectively), are not quorum-based in the leaderless sense. Their quorums are used by the consensus algorithm to elect leaders and commit log entries — a fundamentally different use of the quorum concept than what Cassandra and Dynamo implement. All reads and writes in etcd still go through the leader; the quorum is used by the leader to commit entries to the replicated log.
Choosing Your Quorum Configuration
Quorum configuration is not a set-once architectural decision — it can and should vary per operation based on the consistency and availability requirements of each specific use case.
For writes where data loss is unacceptable — financial transactions, inventory updates, user account changes — use W=majority (QUORUM in Cassandra). The write does not complete until most replicas have it, minimising the window for data loss on failure.
For reads that require the most recent value — reading account balances before debiting, checking inventory before committing a purchase — use R=majority (QUORUM). Combined with W=majority writes, this guarantees the read sees the most recent write.
For high-throughput reads where some staleness is acceptable — product catalogue reads, content feeds, analytics queries — use R=ONE or R=LOCAL_ONE. The performance difference is significant; the consistency difference is acceptable for these use cases.
In multi-datacenter deployments, prefer LOCAL_QUORUM over QUORUM for both reads and writes unless cross-datacenter consistency is genuinely required. Cross-datacenter quorums add hundreds of milliseconds of latency and fail when the cross-datacenter link is partitioned. LOCAL_QUORUM provides strong consistency within each datacenter while allowing each datacenter to operate independently during inter-datacenter partitions.
Key Takeaways
- Quorums allow distributed systems to make progress without requiring full agreement — they define a minimum number of replicas that must participate for an operation to succeed
- The rule W + R > N guarantees that the write set and read set overlap — at least one replica that acknowledged the write will always respond to the read, preventing lost updates
- Different W and R configurations trade read performance, write performance, and consistency strength — W=2, R=2 with N=3 is the standard balanced default
- Sloppy quorums increase write availability during partitions by accepting writes on non-designated nodes with hinted handoff for later delivery — at the cost of temporarily weakening the overlap guarantee
- Read repair is the primary healing mechanism — when a read detects stale replicas, it sends them the up-to-date value, keeping replicas converged without a dedicated reconciliation process
- Quorums do not automatically provide strong consistency — concurrent writes and failure scenarios can still produce stale reads even when W + R > N; linearisability requires additional coordination such as a leader or consensus protocol
- Quorum configuration should vary per operation — use majority quorums for critical writes and reads, relax to single-replica reads for high-throughput eventually consistent use cases
Frequently Asked Questions (FAQ)
What is a quorum in distributed systems?
A quorum is the minimum number of replicas that must participate in an operation — a write or a read — for it to be considered successful. In a cluster of N replicas with write quorum W and read quorum R, requiring W + R > N guarantees that the set of replicas that acknowledged a write always overlaps with the set that responds to a read. This overlap ensures that a read always sees at least one up-to-date copy of the data, preventing lost updates without requiring every replica to participate in every operation.
What does W + R > N mean?
It is the mathematical condition that guarantees quorum overlap. W is the number of replicas that must acknowledge a write. R is the number of replicas that must respond to a read. N is the total number of replicas. When W + R > N, the write set and the read set must share at least one member — that shared member has the most recent write and will return it on the read. When W + R ≤ N, a write and a read could go to completely separate sets of replicas, allowing the read to miss the write entirely.
Do quorums guarantee strong consistency?
No. Satisfying W + R > N provides a strong form of eventual consistency but not linearisability. Concurrent writes to the same key can each achieve quorum on overlapping but different replica sets, resulting in conflicts. Failures during write propagation can create windows where reads see different values depending on which replicas respond. Linearisability requires that all writes be ordered — typically through a leader or consensus protocol — which quorum-based leaderless systems do not provide by default.
What is the difference between a strict quorum and a sloppy quorum?
A strict quorum requires that the W (or R) replicas participating in an operation are specifically the designated replicas for that data. If fewer than W designated replicas are reachable, the operation fails. A sloppy quorum allows any W available nodes in the cluster to accept a write when the designated replicas are unavailable, storing the write with a hint for later delivery to the correct replica via hinted handoff. Sloppy quorums increase availability during partitions at the cost of temporarily weakening the consistency guarantee.
What is read repair?
Read repair is the mechanism by which quorum-based systems detect and correct replica inconsistencies during reads. When a client reads from R replicas and the system detects that some replicas returned a value with a lower version than others, it sends the up-to-date value to the stale replicas as part of the read response. Read repair is zero-cost in terms of additional operations — it happens naturally during reads — and is the primary mechanism by which frequently read data stays consistent across replicas.
What is LOCAL_QUORUM in Cassandra and when should I use it?
LOCAL_QUORUM is a Cassandra consistency level that requires a majority of replicas in the local datacenter to acknowledge a write or respond to a read, without waiting for replicas in remote datacenters. In multi-datacenter deployments, using QUORUM (which requires a majority across all datacenters) adds cross-datacenter network latency to every operation and fails when the cross-datacenter link is partitioned. LOCAL_QUORUM provides strong consistency within each datacenter while allowing each datacenter to operate independently during inter-datacenter network problems. Use LOCAL_QUORUM for most operations in multi-datacenter deployments; reserve cross-datacenter QUORUM for operations where global consistency is genuinely required.
How do I choose between W=2,R=2 and W=3,R=1 for N=3?
W=2, R=2 (the balanced configuration) tolerates one replica failure for both reads and writes — the system stays available as long as two of three replicas are reachable. It is the right default for most workloads. W=3, R=1 maximises durability — every replica has every write before it is confirmed — and allows reads to be served by any single replica. But it sacrifices write availability: if any of the three replicas is unreachable, writes fail. Choose W=3, R=1 only when reads are far more frequent than writes and write availability is not critical — for example, reference data that changes rarely but is read constantly.
Continue the Series
Series home: Distributed Systems — Concepts, Design & Real-World Engineering
Part 3 — Replication, Consistency & Consensus
- 3.1 — Why Replication Is Necessary
- 3.2 — Replication Models: Leader-Based, Multi-Leader and Leaderless
- 3.3 — Consistency Models: Strong, Eventual, Causal and Session
- 3.4 — The CAP Theorem Correctly Understood
- 3.5 — Quorums and Voting in Distributed Systems
- 3.6 — Why Consensus Is Hard in Distributed Systems
- 3.7 — Paxos vs Raft: Consensus Algorithms Compared
- 3.8 — Performance Trade-offs in Replicated Systems
- 3.9 — Engineering Guidelines for Replication, Consistency and Consensus
Previous: ← 3.4 — The CAP Theorem Correctly Understood
Next: 3.6 — Why Consensus Is Hard in Distributed Systems →
Not read Parts 1 or 2 yet?