Distributed Systems Series — Part 3.6: Replication, Consistency & Consensus
The Problem That Took Decades to Formalise
Everything in Part 3 so far has been building toward this question. Replication creates multiple copies of data. Consistency models define what clients observe. CAP establishes the limits partitions impose. Quorums allow progress without unanimous agreement.
But quorums are not sufficient for some of the most critical operations in distributed systems. When a cluster must elect exactly one leader, when a database must commit a transaction that spans multiple nodes, when a configuration change must take effect simultaneously across an entire fleet — these require something stronger than quorum overlap. They require that multiple nodes agree on a single definitive value, permanently, even in the presence of failures.
This is the consensus problem. And it is one of the hardest problems in computer science — not because clever solutions have not been found, but because the environment in which distributed systems operate makes guaranteeing agreement fundamentally difficult in ways that are not obvious until you try to build it correctly.
Understanding why consensus is hard is a prerequisite for understanding how Paxos and Raft work — which is what Post 3.7 covers. This post builds that foundation.
What Consensus Actually Requires
A consensus algorithm must satisfy three properties simultaneously. They sound straightforward. In a distributed system, achieving all three together is the challenge.
Agreement: all correct nodes decide on the same value. If Node A decides the value is X, then every other node that reaches a decision must also decide X. No two correct nodes can decide different values. This is the safety property — it prevents split-brain, where two parts of a system believe they have authoritative and different answers.
Validity: the decided value must be a value that was actually proposed by some node. A consensus algorithm cannot invent a value from nowhere. If all nodes propose X and the algorithm decides Y, that is a correctness violation. This seems obvious but is important for formal proofs.
Termination: every correct node eventually decides on some value. The algorithm must make progress. A system that is permanently unable to reach a decision — stuck forever waiting for agreement — has failed even if it never decides on the wrong value. This is the liveness property.
The tension is between agreement and termination. To guarantee agreement under all conditions, a system must sometimes refuse to make a decision — it is safer to pause than to risk deciding incorrectly. But refusing to decide violates termination. This fundamental tension is what the FLP impossibility theorem formalises.
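The three properties can be stated as checks over a finished run. The sketch below is illustrative, not an algorithm — the `proposed` and `decided` dictionaries are hypothetical inputs standing in for what a real protocol would produce:

```python
def check_agreement(decided: dict) -> bool:
    """Agreement: all correct nodes that have decided chose the same value."""
    values = {v for v in decided.values() if v is not None}
    return len(values) <= 1

def check_validity(proposed: dict, decided: dict) -> bool:
    """Validity: every decided value was actually proposed by some node."""
    return all(v in proposed.values() for v in decided.values() if v is not None)

# Example run: three nodes propose, two have decided so far.
proposed = {"A": "X", "B": "X", "C": "Y"}
decided  = {"A": "X", "B": "X", "C": None}   # C has not yet terminated

print(check_agreement(decided))            # True: no two decided values differ
print(check_validity(proposed, decided))   # True: "X" was proposed by A and B
```

Termination is the property these checks cannot capture: it is a statement about the future (`C` must eventually move from `None` to a value), which is exactly why it is the property FLP shows cannot be guaranteed.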
The Two Generals Problem: Why Agreement Over Unreliable Channels Is Impossible
Before the formal theory, there is an intuition that captures the core difficulty perfectly. The Two Generals Problem, described in the 1970s, asks: can two generals coordinate an attack when their only communication channel is an unreliable messenger who may be captured?
General A sends a message to General B: “Attack at dawn.” The messenger may or may not arrive. General B receives the message and wants to confirm. He sends an acknowledgement back. But that acknowledgement may also not arrive. So General A sends a confirmation of the acknowledgement. Which may also not arrive. And so on, infinitely.
The generals can never be certain that both sides have confirmed the plan. Whatever message is sent last — whether it arrives or not — leaves the sender uncertain. If the last message does not arrive, the two sides are not synchronised. There is no finite sequence of messages that can guarantee agreement when the channel is unreliable.
This is not a solvable engineering problem. It is a mathematical impossibility. No protocol, no matter how clever, can guarantee that two parties agree when messages can be lost. The generals must either attack without certainty — risking uncoordinated action — or never attack at all.
Distributed consensus faces the same problem. Nodes communicate over a network where messages can be lost, delayed, or duplicated. Any message that could be the final confirming message in a consensus protocol might not arrive. The receiving node cannot distinguish “the message was lost” from “there was no message to send.” This uncertainty is irreducible.
The FLP Impossibility: What It Actually Says
In 1985, Fischer, Lynch, and Paterson published the paper that bears their initials — the FLP impossibility result. The theorem states:
In an asynchronous distributed system, there is no deterministic consensus algorithm that is guaranteed to terminate if even one process may fail.
This requires careful reading. “Asynchronous” here means specifically that there is no bound on message delay — a message might take arbitrarily long to arrive, and there is no way to distinguish a slow message from a lost one or a slow node from a crashed one. “Guaranteed to terminate” means the algorithm must always eventually reach a decision. “Even one process may fail” means the impossibility holds even in the best case where only a single node is faulty.
What FLP does not say is equally important. It does not say consensus is impossible — systems like etcd, ZooKeeper, and Raft reach consensus in practice constantly. It does not say consensus algorithms are useless. It says that no algorithm can guarantee both safety (agreement) and liveness (termination) in a purely asynchronous system where failures can occur.
The practical implication: every real consensus algorithm must make additional assumptions beyond pure asynchrony. Typically this means assuming that the network will eventually stabilise — messages will eventually be delivered, nodes will eventually recover, and the system will eventually have a period of stability during which the algorithm can make progress. This is the partially synchronous model introduced in Post 1.3 — the network is not reliably synchronous, but it is not purely asynchronous either. It may misbehave for long stretches, but it eventually enters stable periods during which timing bounds hold and progress is possible.
Paxos and Raft both operate under this assumption. They guarantee safety (agreement, validity) always — they will never decide on conflicting values. They guarantee liveness (termination) only when the network is sufficiently stable. During periods of instability — high message loss, frequent leader failures — they may pause and wait rather than risk deciding incorrectly. This is the right trade-off: brief unavailability is recoverable, split-brain is not.
The Three Root Causes of Difficulty
No global clock. As established in Post 1.5, distributed systems have no shared authoritative clock. Nodes cannot determine with certainty which of two events happened first, whether a timeout means failure or slowness, or whether a message arriving “late” invalidates a decision already made. Ordering events — which is fundamental to consensus — requires coordination rather than observation.
Unreliable networks. Messages can be lost, delayed, duplicated, or delivered out of order. Crucially, silence is ambiguous: a node that stops receiving messages from another node cannot determine whether the other node has crashed, whether the network between them has partitioned, or whether messages are simply delayed. This ambiguity means failure detection is always probabilistic, never certain. A timeout is a guess, not a fact.
Partial failures. Nodes fail independently and in ways that are difficult to observe from the outside. A node may crash partway through writing a value — having broadcast a proposal to some nodes but not others. A node may recover with stale state — not knowing what decisions were made while it was down. A node may be slow enough to appear failed to timeouts while still processing messages and potentially acting on them. These partial failures create complex states that consensus algorithms must handle correctly without knowing which failure scenario has actually occurred.
Why Timeouts Are Necessary but Dangerous
Since silence is ambiguous — a node that stops responding might be crashed, slow, or partitioned — consensus algorithms must use timeouts to make progress. Without timeouts, a single slow node could prevent the entire system from ever reaching a decision; FLP proves that in a purely asynchronous system, even one potentially faulty node is enough to block any deterministic algorithm forever.
Timeouts break this deadlock: if a node does not respond within a configured time, treat it as failed and proceed without it. Leader election in Raft works this way — a follower that does not receive a heartbeat from the leader within the election timeout assumes the leader has failed and starts an election.
But timeouts introduce their own failure mode: false suspicion. A node that is merely slow — paused by garbage collection, stalled on a disk read, experiencing network congestion — will be suspected as failed and trigger an election. If the old leader is still alive and still believes itself to be the leader, there is a window during which two nodes believe they are the current leader. This is the split-brain risk that consensus algorithms work carefully to prevent.
Raft’s approach: a new leader is only valid if it has received votes from a majority of nodes. The old leader, now isolated, cannot assemble a majority, so nothing it accepts can commit; once it reconnects and observes that a newer election has taken place, it steps down. But the window between the election starting and the old leader recognising it has lost its majority is real, and during that window the system must be designed to handle the ambiguity safely.
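The mechanics above can be sketched in a few lines. This is a simplified illustration of Raft-style timeout-driven suspicion, not a real implementation — actual Raft also tracks terms, log indices, and persistent state, and the interval values here are illustrative assumptions:

```python
import random

HEARTBEAT_INTERVAL = 0.05          # leader sends a heartbeat every 50 ms (assumed)
ELECTION_TIMEOUT = (0.15, 0.30)    # randomised per node to avoid split votes

def election_timeout(rng: random.Random) -> float:
    # Randomisation makes it unlikely that two followers time out at the
    # same instant and split the vote between competing candidates.
    return rng.uniform(*ELECTION_TIMEOUT)

def should_start_election(now: float, last_heartbeat: float, timeout: float) -> bool:
    # Silence is ambiguous: the leader may be dead, slow, or partitioned away.
    # The follower can only guess, so it starts an election after the timeout.
    return now - last_heartbeat > timeout

def can_become_leader(votes: int, cluster_size: int) -> bool:
    # A candidate needs a strict majority. An isolated old leader can never
    # gather one, which is what closes the split-brain window.
    return votes > cluster_size // 2

print(can_become_leader(3, 5))  # True
print(can_become_leader(2, 5))  # False: a minority partition cannot elect
```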
The 2013 etcd split-brain incident illustrated this precisely. A network partition separated the etcd cluster, and the timing of timeouts and elections caused a window where two nodes believed they were the leader. The Raft implementation at the time had a subtle bug in how it handled this transition, and some writes were accepted by both leaders before the conflict was detected. The fix required a more careful implementation of leader checks before write acceptance — illustrating that even well-understood consensus algorithms require extremely careful implementation to handle the timeout-based failure detection correctly.
Crash-Stop vs Byzantine Failures
Consensus algorithms like Paxos and Raft — which Post 3.7 will cover in depth — operate under the crash-stop failure model: a node either behaves correctly or it crashes and stops responding entirely. A node that has crashed does not send incorrect messages, does not participate in elections it has lost, and does not pretend to be a leader when it is not.
This assumption significantly simplifies the consensus problem. With crash-stop failures, the only challenge is figuring out which nodes have crashed and which are slow. A crashed node, once detected, can simply be ignored.
Byzantine failures are a different and harder class. A Byzantine node can behave arbitrarily — sending contradictory messages to different peers, lying about its state, colluding with other Byzantine nodes to subvert the protocol. Byzantine failures model adversarial conditions: hardware bit-flips that produce incorrect data, software bugs that corrupt messages, and — in blockchain systems — malicious participants trying to double-spend.
Byzantine Fault Tolerant (BFT) consensus algorithms — PBFT, Tendermint, HotStuff — can tolerate Byzantine failures but require that fewer than one-third of nodes be Byzantine. They also require more communication rounds and are significantly more expensive than crash-stop algorithms. Paxos and Raft require only a majority of nodes to be correct, which is a weaker and cheaper assumption.
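The cost difference shows up directly in cluster sizing. For tolerating `f` faulty nodes, crash-stop consensus needs a majority of correct nodes (`n ≥ 2f + 1`), while BFT consensus needs fewer than one-third faulty (`n ≥ 3f + 1`). A minimal arithmetic sketch:

```python
def min_nodes_crash_stop(f: int) -> int:
    # Paxos/Raft: a majority must be correct, so n >= 2f + 1.
    return 2 * f + 1

def min_nodes_byzantine(f: int) -> int:
    # PBFT-style: fewer than one-third may be Byzantine, so n >= 3f + 1.
    return 3 * f + 1

for f in (1, 2):
    print(f, min_nodes_crash_stop(f), min_nodes_byzantine(f))
# f=1: 3 nodes suffice for crash-stop, 4 for Byzantine
# f=2: 5 vs 7 — the extra replicas are part of BFT's higher price
```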
Most enterprise distributed systems — databases, coordination services, stream processors — use crash-stop assumptions because Byzantine failures in controlled datacenter environments are rare and the cost of BFT consensus is prohibitive. Blockchain systems and Byzantine-resilient payment networks use BFT consensus precisely because they operate in adversarial environments where participants cannot be trusted.
Consensus vs Quorums: The Critical Distinction
Post 3.5 explained how quorums work. Consensus goes further. Understanding the distinction is essential for understanding why consensus is needed in addition to quorums.
A quorum-based system answers: “do I have enough agreement right now to proceed?” A write quorum ensures that W replicas acknowledged a write. A read quorum ensures that R replicas responded. The W + R > N overlap guarantees that the read and write sets share at least one member. But quorums do not order operations globally. Two concurrent writes can each achieve quorum on overlapping replica sets and still conflict. A quorum-based system tolerates this and resolves conflicts afterward.
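The gap between overlap and ordering is easy to show concretely. In the sketch below (replica names and write sets are illustrative), two concurrent writes each reach a write quorum with `W = 3` out of `N = 5`, yet still conflict:

```python
N, W, R = 5, 3, 3
assert W + R > N   # every read quorum intersects every write quorum

replicas = {"r1", "r2", "r3", "r4", "r5"}
write_a = {"r1", "r2", "r3"}   # concurrent write A reaches quorum (3 acks)
write_b = {"r3", "r4", "r5"}   # concurrent write B also reaches quorum

print(len(write_a) >= W and len(write_b) >= W)   # True: both writes "succeeded"
print(write_a & write_b)                          # {'r3'} holds both versions
# The overlap guarantees the conflict will be *seen* on a later read quorum;
# it does not decide which write wins. Imposing a single, permanent order on
# such operations is exactly what consensus adds on top of quorums.
```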
Consensus answers: “can we make a decision that is permanent, ordered, and agreed upon by all?” A consensus algorithm does not just require that enough nodes agree now — it requires that the decided value cannot be changed, that all correct nodes eventually agree on the same value, and that the order of decisions is consistent across all nodes.
This distinction has a practical consequence: consensus is needed when decisions must be irrevocable and globally ordered. Leader election requires consensus — there must be exactly one leader, and all nodes must agree on who it is. Transaction commit in a distributed database requires consensus — a committed transaction cannot be rolled back, and all participants must agree that it committed. Log replication in Raft requires consensus — the order of log entries must be the same on all replicas.
Quorums are sufficient for storage — accepting and serving data with tunable consistency. Consensus is required for coordination — making decisions that the entire system must agree on.
Why Consensus Must Stay Off the Data Path
Consensus is expensive. A single consensus round requires multiple message exchanges between a majority of nodes. In a five-node cluster, a write requires the leader to send the write to all four followers, wait for three of them to acknowledge, and then commit — a minimum of two network round trips. In a geographically distributed cluster, two round trips across datacenters adds hundreds of milliseconds to every write.
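A back-of-envelope calculation makes the cost tangible. The round-trip figure below is an illustrative assumption for a geo-distributed five-node cluster, not a measurement:

```python
cross_region_rtt_ms = 80   # assumed round trip between datacenters

# Client -> leader, then leader -> follower majority and back:
client_to_leader = cross_region_rtt_ms   # 1 round trip
replication = cross_region_rtt_ms        # leader waits for 3 of 4 follower acks

write_latency_ms = client_to_leader + replication
print(write_latency_ms)   # 160 ms minimum per write — before any disk fsync

# With one write in flight at a time, throughput is capped correspondingly:
max_serial_writes_per_sec = 1000 / write_latency_ms
print(max_serial_writes_per_sec)   # 6.25 writes/sec without pipelining or batching
```

Real systems pipeline and batch aggressively, but the latency floor per individual consensus-committed write remains.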
This cost makes consensus entirely unsuitable as the mechanism for every database write in a high-throughput system. Google Spanner commits approximately 1 million transactions per second across its global deployment. If every write required a full cross-region Paxos round, the latency would be seconds and the throughput would be orders of magnitude lower.
The solution that every large-scale system uses: consensus on the control plane, not the data plane.
The control plane handles the infrequent, critical decisions: which node is the leader, what is the current cluster membership, what is the current configuration. These change rarely — a leader election happens once after a failure, not once per transaction. Running consensus for these decisions is acceptable because they happen infrequently.
The data plane handles the frequent operations: reading and writing data, processing transactions, serving queries. These use replication — the leader replicates writes to followers — but the replication is within the structure established by consensus on the control plane. Kafka’s partition leader, elected via ZooKeeper (or KRaft), handles millions of writes per second through simple log replication without running a consensus round per message. Raft in etcd handles thousands of writes per second using log replication once a leader is established — the consensus is in the log commit, but the log commit itself is a sequential append, not a fresh round of full consensus.
The engineering principle: use consensus to establish structure, use replication to operate within it.
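The amortisation can be sketched as a toy replicated log — this is a simplified model for illustration, not etcd's or Kafka's actual implementation. Once a leader is established (by consensus), each commit is a sequential append acknowledged by a majority, with no fresh election or proposal round per entry:

```python
class ReplicatedLog:
    def __init__(self, cluster_size: int):
        self.cluster_size = cluster_size
        self.entries: list[str] = []
        self.acks: list[int] = []    # acks[i] = replicas that stored entry i
        self.commit_index = -1       # highest entry known safe on a majority

    def leader_append(self, entry: str) -> int:
        # The established leader assigns the order by simple append.
        self.entries.append(entry)
        self.acks.append(1)          # the leader itself holds the entry
        return len(self.entries) - 1

    def follower_ack(self, index: int) -> None:
        self.acks[index] += 1
        # Advance the commit index, in order, once a majority holds each entry.
        while (self.commit_index + 1 < len(self.entries)
               and self.acks[self.commit_index + 1] > self.cluster_size // 2):
            self.commit_index += 1

log = ReplicatedLog(cluster_size=5)
i = log.leader_append("x=1")
log.follower_ack(i)
log.follower_ack(i)        # two follower acks -> 3 of 5 hold the entry
print(log.commit_index)    # 0: committed with no new consensus round
```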
Key Takeaways
- Consensus requires three properties simultaneously — agreement (all correct nodes decide the same value), validity (the decided value was proposed by some node), and termination (every correct node eventually decides) — and the tension between agreement and termination is what makes it hard
- The Two Generals Problem shows that agreement over an unreliable channel is impossible — no finite protocol can guarantee both sides have confirmed a plan when messages can be lost
- The FLP impossibility theorem proves that in a purely asynchronous system, no deterministic algorithm can guarantee consensus termination if even one node may fail — real systems escape this by assuming partial synchrony, not pure asynchrony
- Timeouts are necessary to make progress but introduce false suspicion — a slow node is indistinguishable from a crashed node, which can trigger leader elections that create split-brain windows that consensus algorithms must handle carefully
- Paxos and Raft assume crash-stop failures — nodes either behave correctly or crash and stop — which is simpler and cheaper than Byzantine fault tolerance, which handles nodes that behave arbitrarily or maliciously
- Consensus is required when decisions must be permanent and globally ordered — leader election, transaction commit, log replication — while quorums are sufficient for storage operations with tunable consistency
- Consensus must stay off the data path — put it on the control plane for infrequent structural decisions, use replication for frequent data operations, or throughput collapses
Frequently Asked Questions (FAQ)
What is consensus in distributed systems?
Consensus is the process by which multiple nodes in a distributed system agree on a single definitive value or decision, even in the presence of node failures and unreliable networks. A consensus algorithm must guarantee that all correct nodes decide on the same value (agreement), that the decided value was actually proposed (validity), and that all correct nodes eventually reach a decision (termination). Consensus is required for operations that must be irrevocable and globally ordered — leader election, distributed transaction commit, and replicated log entries.
What is the FLP impossibility and what does it mean in practice?
The FLP impossibility theorem, proven by Fischer, Lynch, and Paterson in 1985, states that in a purely asynchronous distributed system — one where message delays are unbounded — no deterministic consensus algorithm can guarantee both agreement and termination if even one node may fail. In practice, this means that consensus algorithms must make additional assumptions beyond pure asynchrony. Paxos and Raft assume partial synchrony — the network is unreliable but eventually stabilises — and guarantee safety (agreement) always while guaranteeing liveness (termination) only during periods of sufficient stability.
What is the Two Generals Problem?
The Two Generals Problem is an illustration of why agreement over an unreliable communication channel is mathematically impossible. Two generals must coordinate an attack, but their only communication is through messengers who may be captured. No finite sequence of messages can guarantee both generals are synchronised, because the last confirming message might not arrive. This intuition directly maps to distributed consensus: no protocol can guarantee that two nodes have confirmed agreement when the network can lose messages.
What is the difference between crash-stop and Byzantine failures?
Crash-stop failures assume that a faulty node simply crashes and stops responding — it does not send incorrect messages or behave maliciously. Paxos and Raft are designed for crash-stop failures, which simplifies the consensus problem significantly. Byzantine failures assume that a faulty node can behave arbitrarily — sending contradictory messages, lying about its state, or actively trying to subvert the protocol. Byzantine Fault Tolerant algorithms like PBFT handle Byzantine failures but require more communication rounds, tolerate fewer faults (less than one-third), and are more expensive to run.
Why can’t systems just use quorums instead of consensus?
Quorums allow a system to make progress when enough replicas agree, but they do not provide global ordering or permanent decisions. Two concurrent writes can each achieve quorum on overlapping replica sets and produce conflicts. Quorum-based systems tolerate this and resolve conflicts afterward — acceptable for storage with tunable consistency. Consensus is required when a decision must be permanent and all nodes must agree on the same answer — leader election cannot have two winners, a committed transaction cannot be rolled back, and the log entry order must be identical on all replicas.
Why should consensus stay off the data path?
A single consensus round requires multiple message exchanges across a majority of nodes, adding at least two network round trips to every operation. In geographically distributed systems, this adds hundreds of milliseconds per write. Running consensus on every data write would make high-throughput systems impractical. The correct architecture puts consensus on the control plane — electing leaders, managing membership, coordinating configuration changes — where decisions are infrequent. Data operations use replication within the structure that consensus has established, without running a new consensus round per operation.
Continue the Series
Series home: Distributed Systems — Concepts, Design & Real-World Engineering
Part 3 — Replication, Consistency & Consensus
- 3.1 — Why Replication Is Necessary
- 3.2 — Replication Models: Leader-Based, Multi-Leader and Leaderless
- 3.3 — Consistency Models: Strong, Eventual, Causal and Session
- 3.4 — The CAP Theorem Correctly Understood
- 3.5 — Quorums and Voting in Distributed Systems
- 3.6 — Why Consensus Is Hard in Distributed Systems
- 3.7 — Paxos vs Raft: Consensus Algorithms Compared
- 3.8 — Performance Trade-offs in Replicated Systems
- 3.9 — Engineering Guidelines for Replication, Consistency and Consensus
Previous: ← 3.5 — Quorums and Voting in Distributed Systems
Next: 3.7 — Paxos vs Raft: Consensus Algorithms Compared →
Not read Parts 1 or 2 yet?