Distributed Systems Series — Part 3.7: Replication, Consistency & Consensus
The Same Problem, Two Very Different Solutions
Post 3.6 established why consensus is hard: no global clock, unreliable networks, partial failures, and the FLP impossibility result all make guaranteed agreement difficult. Despite these obstacles, distributed systems achieve consensus in practice constantly — electing leaders, committing transactions, replicating logs.
Two algorithms dominate the landscape: Paxos and Raft. They solve the same problem — getting multiple nodes to agree on a sequence of values despite failures — and they are equally correct when implemented properly. But they optimise for entirely different things.
Paxos optimises for theoretical minimality. It makes the fewest assumptions necessary to guarantee correctness, which makes it extremely powerful and extremely hard to understand and implement correctly. Raft optimises for understandability. It makes Paxos-equivalent guarantees while decomposing the problem into pieces that engineers can reason about individually, debug under pressure, and implement without subtle correctness bugs.
This is not a competition with a winner. Both algorithms power production systems at global scale. The choice between them — or their variants — depends on what your team needs to understand and operate. This post explains both well enough to make that choice deliberately.
What Both Algorithms Are Solving: Replicated State Machines
Single-decree consensus — agreeing on one value once — is a theoretical building block. Real systems need something stronger: they need to agree on a sequence of values, in the same order, across all replicas. This is the replicated state machine problem.
A replicated state machine applies a sequence of commands — log entries — to produce a state. If every replica applies the same commands in the same order, every replica ends up in the same state. This is how etcd stores Kubernetes cluster state, how CockroachDB replicates database transactions, and how Kafka (in KRaft mode) replicates partition metadata.
The consensus algorithm’s job is to ensure that all replicas agree on the content and order of log entries. A log entry that is committed — agreed upon by a quorum — will be present in the log of every correct replica, at the same position, forever. This is the safety guarantee both Paxos and Raft provide.
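The idea can be sketched in a few lines of Python (the `Replica` class and the `("set", key, value)` command format are illustrative, not taken from any real system):

```python
# A minimal sketch of a replicated state machine: replicas that apply the
# same committed log entries in the same order reach the same state.

class Replica:
    def __init__(self):
        self.log = []     # committed entries, in agreed order
        self.state = {}   # the materialised state machine

    def apply(self, entry):
        """Apply one committed log entry to the state machine."""
        self.log.append(entry)
        op, key, value = entry
        if op == "set":
            self.state[key] = value

# The consensus algorithm's job ends here: it delivers this sequence,
# identically ordered, to every replica.
committed_log = [("set", "x", 1), ("set", "y", 2), ("set", "x", 3)]

a, b = Replica(), Replica()
for entry in committed_log:
    a.apply(entry)
    b.apply(entry)

assert a.state == b.state == {"x": 3, "y": 2}
```

Everything interesting in Paxos and Raft happens before `apply` is ever called: agreeing on what `committed_log` contains.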
Paxos: How It Actually Works
Leslie Lamport devised Paxos in the late 1980s and circulated the paper informally for years; it was finally published in 1998 as “The Part-Time Parliament”. The algorithm was presented as a fable about the parliament of a fictional Greek island, which contributed to the confusion about what it actually does. Lamport later wrote “Paxos Made Simple” in 2001 to clarify. Engineers still find it hard.
Single-decree Paxos — agreeing on one value — has two phases.
Phase 1: Prepare and Promise. A proposer selects a proposal number N — a unique, monotonically increasing identifier. It sends a Prepare(N) message to a quorum of acceptors. Each acceptor that receives Prepare(N) responds with a promise: it will not accept any proposal with a number less than N. If the acceptor has already accepted a value in a previous round, it includes that value in its promise response. The proposer collects promises from a quorum of acceptors. If any acceptor returned a previously accepted value, the proposer must use that value — it cannot propose a new one. This is the mechanism that prevents conflicting decisions: if a value was already accepted in a previous round, every future round will use the same value.
Phase 2: Accept and Accepted. The proposer sends an Accept(N, V) message to a quorum of acceptors, where V is either the value returned by acceptors in Phase 1 (if any) or a new value the proposer wants to commit. Each acceptor that has not promised a higher-numbered proposal in the meantime accepts the value and replies with Accepted(N, V). Once the proposer receives Accepted from a quorum, the value V is committed — it is the decided value for this instance of Paxos.
Why this prevents conflicting decisions: the promise mechanism in Phase 1 ensures that any value accepted in a previous round will be discovered by any new proposer before it proposes a value. A new proposer cannot ignore a previously accepted value — it must use it. This means once a quorum has accepted a value, every future round will either accept the same value or fail to form a quorum, but it can never commit a different value.
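The acceptor state and the two phases above can be sketched as follows (the class and function names are illustrative; a real implementation would persist acceptor state to disk and run these exchanges over a network):

```python
# Sketch of single-decree Paxos: Prepare/Promise (Phase 1) and
# Accept/Accepted (Phase 2), with the rule that a proposer must adopt
# any previously accepted value it learns about.

class Acceptor:
    def __init__(self):
        self.promised_n = -1    # highest proposal number promised
        self.accepted_n = -1    # number of the last accepted proposal
        self.accepted_v = None  # value of the last accepted proposal

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised_n:
            self.promised_n = n
            # The promise carries any previously accepted value.
            return ("promise", self.accepted_n, self.accepted_v)
        return ("reject",)

    def accept(self, n, v):
        """Phase 2: accept unless a higher-numbered Prepare was promised."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_v = v
            return ("accepted", n, v)
        return ("reject",)

def propose(acceptors, n, v):
    """Run both phases; return the value committed, or None on failure."""
    promises = [a.prepare(n) for a in acceptors]
    quorum = [p for p in promises if p[0] == "promise"]
    if len(quorum) <= len(acceptors) // 2:
        return None  # no quorum of promises
    # If any acceptor already accepted a value, the proposer MUST adopt
    # the highest-numbered one instead of its own value v.
    prior = max(quorum, key=lambda p: p[1])
    value = prior[2] if prior[1] >= 0 else v
    acks = [a.accept(n, value) for a in acceptors]
    if sum(r[0] == "accepted" for r in acks) > len(acceptors) // 2:
        return value
    return None

acceptors = [Acceptor() for _ in range(3)]
assert propose(acceptors, 1, "A") == "A"  # first proposal commits "A"
assert propose(acceptors, 2, "B") == "A"  # later proposer must keep "A"
```

The last line is the safety guarantee in miniature: the second proposer wanted to commit "B", but Phase 1 forced it to rediscover and re-propose "A".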
What makes Paxos hard: single-decree Paxos decides one value. Real systems need Multi-Paxos — running Paxos repeatedly to commit a sequence of log entries. The original Paxos paper does not describe Multi-Paxos in detail. Engineers building real systems have had to figure out how to handle leader election (to avoid running Phase 1 on every log entry), log gaps (when some entries are decided out of order), membership changes (adding or removing nodes from the quorum), and snapshots (compressing the log when it grows too large). Each of these requires careful design not covered by the original algorithm.
Google’s “Paxos Made Live” paper, which described the Paxos implementation behind the Chubby lock service, noted explicitly that the published Paxos algorithm must be significantly extended before it can be used in a real system, and that the gap between the described algorithm and a working implementation is large. The Google Spanner team built their own Paxos variant. ZooKeeper’s ZAB protocol (ZooKeeper Atomic Broadcast) is a Paxos-inspired protocol with different crash recovery semantics. Every major Paxos deployment has diverged from the original in ways that are difficult to reason about from the paper alone.
Raft: Consensus Designed for Humans
Diego Ongaro and John Ousterhout introduced Raft in 2014 with an explicit design goal: understandability. The paper’s title is “In Search of an Understandable Consensus Algorithm.” They measured understandability empirically: in a user study, students who were taught both algorithms scored significantly higher on quiz questions about Raft than on comparable questions about Paxos. This was not a marketing claim. It was an empirical result.
Raft decomposes consensus into three relatively independent sub-problems, each with clear rules and clear failure handling.
Leader Election
Raft uses a strong leader model — all log entries flow through the leader. At any given time, exactly one node is the leader. Followers receive log entries from the leader and replicate them. Clients send requests to the leader.
Nodes operate in terms — monotonically increasing numbers that act as logical clocks. Each term begins with an election. If a follower does not receive a heartbeat from the leader within a randomised election timeout — each node independently chooses a timeout from a range such as 150–300 ms — it assumes the leader has failed, increments its term, and starts an election by sending RequestVote messages to all other nodes.
The randomised timeout is the key insight that prevents split votes. If all followers had the same timeout, they would all start elections simultaneously and split votes among themselves, potentially making no progress. Random timeouts ensure that one follower usually times out first, starts an election, and wins before others start competing.
A candidate wins an election if it receives votes from a majority of nodes. Each node votes for at most one candidate per term, and only if the candidate’s log is at least as up-to-date as the voter’s log (the log completeness check). The log completeness check ensures that a new leader always has all committed entries — a node with a stale log cannot become leader, even if it times out first.
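The vote-granting rule, including the log completeness check, can be sketched as follows (the dictionary layout of the voter’s persistent state is illustrative, and the 150–300 ms range mirrors the example values above):

```python
# Sketch of Raft's RequestVote handling: one vote per term, granted only
# if the candidate's log is at least as up-to-date as the voter's.
import random

def election_timeout():
    """Each node independently picks a randomised timeout (split-vote avoidance)."""
    return random.uniform(0.150, 0.300)  # seconds

def grant_vote(voter, term, candidate_id, cand_last_term, cand_last_index):
    """Decide whether this voter grants its vote for the given term."""
    if term < voter["current_term"]:
        return False  # candidate's term is stale
    if term > voter["current_term"]:
        voter["current_term"] = term  # step into the newer term
        voter["voted_for"] = None
    if voter["voted_for"] not in (None, candidate_id):
        return False  # already voted for someone else this term
    # Log completeness check: compare (last term, last index) lexicographically —
    # a higher last term wins; equal terms compare log length.
    mine = (voter["last_log_term"], voter["last_log_index"])
    theirs = (cand_last_term, cand_last_index)
    if theirs >= mine:
        voter["voted_for"] = candidate_id
        return True
    return False

voter = {"current_term": 3, "voted_for": None,
         "last_log_term": 3, "last_log_index": 7}
# A candidate with a stale log is refused, even in a newer term:
assert not grant_vote(voter, 4, "n1", cand_last_term=2, cand_last_index=9)
# A candidate whose log is at least as up-to-date gets the vote:
assert grant_vote(voter, 4, "n2", cand_last_term=3, cand_last_index=7)
```

Note that the first candidate is refused despite having the longer log: term, not length, dominates the up-to-date comparison, because a longer log of uncommitted stale-term entries must not win.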
Log Replication
Once a leader is elected, it begins accepting client requests. Each request is appended to the leader’s log as a new entry with the current term number. The leader sends AppendEntries RPCs to all followers, containing the new log entries. Followers append the entries to their logs and respond. When the leader receives acknowledgements from a majority of nodes, the entry is committed — it will never be removed from any node’s log. The leader then applies the entry to its state machine and responds to the client.
The log matching property guarantees two things: if two log entries in different nodes have the same index and term, they store the same command; and if two logs agree on an entry at some index, they agree on all entries before that index. This property, enforced by the AppendEntries consistency check, means that once entries are committed across a quorum, all logs are identical up to the committed point.
When a new leader is elected after a failure, its log may be ahead of some followers (it may have entries that were not yet replicated to a majority) and some followers may have diverged (they may have entries from a previous leader that were not committed). Raft resolves this deterministically: the new leader finds the last point where its log and each follower’s log agree, then overwrites the follower’s log from that point forward with the leader’s entries. Uncommitted entries on followers are safely overwritten because they were never committed — no client was told they succeeded.
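The AppendEntries consistency check and conflict resolution described above can be sketched as follows (the `(term, command)` tuple format and the 1-based `prev_index` are illustrative simplifications):

```python
# Sketch of the AppendEntries consistency check: a follower rejects
# entries unless its log matches the leader's at (prev_index, prev_term);
# on a match, any conflicting suffix is overwritten with the leader's entries.

def append_entries(follower_log, prev_index, prev_term, entries):
    """Return (success, new_log). prev_index is 1-based; 0 means 'before the log'."""
    if prev_index > 0:
        if len(follower_log) < prev_index:
            return False, follower_log  # gap: leader retries with a lower prev_index
        if follower_log[prev_index - 1][0] != prev_term:
            return False, follower_log  # term conflict at prev_index
    # Consistency check passed: truncate the conflicting (uncommitted)
    # suffix, then append the leader's entries.
    new_log = follower_log[:prev_index] + list(entries)
    return True, new_log

# A follower holding a divergent uncommitted entry from an old leader (term 2):
follower = [(1, "a"), (1, "b"), (2, "x")]
# The new leader (term 3) replicates its entry after index 2 (term 1);
# the follower's stale (2, "x") is safely overwritten — it was never committed.
ok, follower = append_entries(follower, prev_index=2, prev_term=1,
                              entries=[(3, "c")])
assert ok and follower == [(1, "a"), (1, "b"), (3, "c")]
```

This is the log matching property in action: once the check passes at `prev_index`, the follower's log is guaranteed identical to the leader's up to that point, so overwriting everything after it is safe.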
Safety Rules
Raft’s safety property — “if a log entry has been committed in a given term, that entry will be present in the logs of all leaders for all higher-numbered terms” — is enforced by two mechanisms working together. The election restriction (a candidate cannot win unless its log is at least as up-to-date as those of a majority of nodes) ensures new leaders have all committed entries. The leader append-only rule (a leader never overwrites or deletes entries in its own log; it only resolves conflicts in followers’ logs) ensures committed entries are never lost.
Paxos vs Raft: A Direct Comparison
| Dimension | Paxos | Raft |
|---|---|---|
| Primary design goal | Theoretical minimality and correctness | Understandability and operational clarity |
| Leader model | Implicit — any node can propose | Explicit — one leader per term, all writes through leader |
| Log gaps | Possible — entries can be committed out of order | Not allowed — entries must be committed in order |
| Implementation complexity | Very high — paper underspecifies many cases | Lower — paper specifies all cases explicitly |
| Membership changes | Not specified in original paper | Specified — joint consensus or single-server changes |
| Election mechanism | Not specified — must be designed separately | Specified — randomised timeouts, term-based voting |
| Operational reasoning | Difficult — role boundaries unclear | Clear — leader, follower, candidate roles are explicit |
| Production adoption | Indirect via variants (Chubby, Spanner, ZAB) | Direct — etcd, Consul, CockroachDB, TiKV |
Multi-Paxos: What Real Systems Actually Use
The Paxos described above — single-decree Paxos — decides one value. Real systems need to decide a sequence of values (log entries). Multi-Paxos is the extension that handles this, but it is not described precisely in Lamport’s papers.
The standard Multi-Paxos optimisation: elect one node as the distinguished proposer (the leader) for a period of time. Once a leader is established, it can skip Phase 1 for subsequent log entries — it already has promises from a quorum, so it can go directly to Phase 2 for each new entry. This makes the common case efficient: one network round trip to commit each log entry once a leader is stable.
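The stable-leader optimisation can be sketched as follows (the per-slot acceptor layout is illustrative, and recovery of previously accepted slots after a leader change is omitted):

```python
# Sketch of Multi-Paxos's stable-leader optimisation: one Prepare covers
# every log slot, so each subsequent entry needs only Phase 2.

class SlotAcceptor:
    """Acceptor whose single promise covers all log slots."""
    def __init__(self):
        self.promised_n = -1
        self.accepted = {}  # slot -> (n, value)

    def prepare(self, n):
        if n > self.promised_n:
            self.promised_n = n
            return True  # (recovery of already-accepted slots omitted here)
        return False

    def accept(self, slot, n, v):
        if n >= self.promised_n:
            self.accepted[slot] = (n, v)
            return True
        return False

acceptors = [SlotAcceptor() for _ in range(3)]
leader_n = 1

# Phase 1 runs once, when the leader establishes itself...
assert sum(a.prepare(leader_n) for a in acceptors) >= 2

# ...then each log entry costs a single Phase 2 round trip.
for slot, cmd in enumerate(["set x=1", "set y=2", "del x"]):
    acks = sum(a.accept(slot, leader_n, cmd) for a in acceptors)
    assert acks >= 2  # quorum accepted: this slot is committed
```

Without the standing promise, every one of those three entries would pay for both phases — two round trips instead of one.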
Leader election, log gaps, snapshots, and membership changes in Multi-Paxos are all left to implementors. Google’s “Paxos Made Live” paper and the ZooKeeper Atomic Broadcast (ZAB) paper each describe their approaches, which differ significantly. This is why Paxos-based systems tend to have bespoke implementations that are hard to compare or verify.
Raft can be seen as a well-specified version of Multi-Paxos with opinionated choices for all the underspecified parts: randomised timeouts for leader election, sequential log entry commitment (no gaps), explicit joint consensus for membership changes, and log compaction via snapshots. The choices Raft makes are not the only correct choices — but they are written down, which makes implementations comparable and bugs findable.
Raft in Production Systems
etcd is the most widely deployed Raft implementation. Every Kubernetes cluster stores all cluster state — pods, services, deployments, secrets — in etcd. Kubernetes’ correctness depends entirely on etcd’s Raft implementation maintaining consistent state across the etcd cluster. etcd’s Raft implementation follows the original paper closely, with extensions for snapshots and membership changes.
CockroachDB uses a Raft variant for its range replication. CockroachDB shards data into ranges, each replicated across multiple nodes using Raft. The system runs thousands of simultaneous Raft groups — one per range — coordinated across the cluster. CockroachDB uses joint consensus (a two-phase approach to membership changes) to safely add or remove nodes from Raft groups without risking availability gaps during the transition.
TiKV (the storage engine underlying TiDB) uses Multi-Raft — running many Raft groups simultaneously, one per data region, with a placement driver that manages region distribution across nodes. This is the same pattern as CockroachDB, independently implemented. The Multi-Raft pattern is now the standard approach for building globally distributed databases on top of Raft.
Consul uses Raft for its server cluster consensus — storing service registry state, health check results, and key-value data. Consul’s Raft implementation is provided by the HashiCorp Raft library, which is also used by Vault and Nomad. This shared library approach means that multiple HashiCorp products benefit from the same Raft correctness fixes and improvements.
Why Correctness That Humans Cannot Reason About Is Fragile
The engineering case for Raft over Paxos is not about theoretical correctness — both are correct. It is about operational resilience in practice.
When a consensus cluster is misbehaving at 3am — leader elections are happening too frequently, a node is not catching up after recovery, writes are being rejected — the engineer on call needs to understand what the algorithm should be doing well enough to diagnose what it is actually doing. With Raft, the leader is always known, the term is always visible, the log index is always inspectable. The algorithm’s state is observable because the algorithm’s design makes its state explicit.
With a Paxos variant that has diverged from the original paper through implementation necessity, the on-call engineer may be looking at a system where the relationship between the algorithm’s abstract state and the observable system state is unclear. This is not a hypothetical concern — it is the reason Google’s engineering teams have documented repeatedly that building and operating Paxos-based systems required deep expertise that was hard to transfer.
Raft’s explicit leader, clear term progression, and sequential log make the algorithm’s state directly observable. A Raft cluster’s health is diagnosable from its metrics: current leader, current term, log index on each node, commit index. This observability is not accidental — it is a direct consequence of designing the algorithm for understandability.
Correctness that humans cannot reason about is fragile. In production, reasoning under pressure matters as much as theoretical guarantees.
Key Takeaways
- Paxos and Raft solve the same problem — replicated state machine consensus — and provide equivalent safety guarantees when implemented correctly; they differ in understandability, operational clarity, and implementation completeness
- Paxos works through a two-phase prepare/promise/accept/accepted protocol that prevents conflicting decisions, but leaves leader election, log gaps, membership changes, and snapshots unspecified — every production Paxos deployment has had to design these separately
- Raft decomposes consensus into three explicitly specified sub-problems — leader election with randomised timeouts, log replication with the log matching property, and safety rules enforced by the election restriction — making all cases deterministic and comparable across implementations
- Multi-Paxos is what real systems actually use — single-decree Paxos is a theoretical building block; Multi-Paxos adds leader optimisation for sequential log entry commitment but is underspecified in the original papers
- Raft dominates new production systems — etcd, CockroachDB, TiKV, and Consul all use Raft directly; Paxos lives on through variants in Google Chubby, Spanner, and ZooKeeper’s ZAB protocol
- The operational case for Raft is observability — Raft’s explicit leader, term, and log index make cluster health directly diagnosable; Paxos variants often require deep implementation-specific knowledge to debug under pressure
- Correctness that humans cannot reason about is fragile — in production distributed systems, the ability to diagnose and recover from failures quickly matters as much as the theoretical guarantees of the algorithm
Frequently Asked Questions (FAQ)
What is the difference between Paxos and Raft?
Both Paxos and Raft are consensus algorithms that allow a cluster of nodes to agree on a sequence of values despite node failures and unreliable networks. They provide equivalent safety guarantees. The key difference is design philosophy: Paxos optimises for theoretical minimality — making the fewest assumptions needed for correctness — which makes it powerful but hard to understand and implement. Raft optimises for understandability — explicitly specifying all cases including leader election, log replication, and membership changes — which makes it easier to implement correctly and debug in production.
Is Raft safer than Paxos?
No. Both provide the same safety property — committed values are never lost and all correct nodes eventually agree. Raft is not more correct than Paxos; it is more completely specified. The practical benefit is that a complete specification reduces the chance of implementation bugs, because implementors do not have to design the underspecified parts themselves. A correct Paxos implementation and a correct Raft implementation provide identical safety guarantees.
Why is Paxos considered hard to understand?
Several reasons. Paxos was originally described as a fable about a Greek parliament, which obscured its algorithmic structure. It separates the roles of proposers, acceptors, and learners in a way that does not map cleanly to how engineers think about servers. It specifies only single-decree Paxos — agreeing on one value — leaving the more practically relevant Multi-Paxos (agreeing on a sequence) underspecified. And it does not specify leader election, log management, snapshots, or membership changes at all, which are the hardest parts of building a real consensus system.
What is Multi-Paxos?
Multi-Paxos extends single-decree Paxos to decide a sequence of values — a replicated log. The key optimisation: elect one node as the distinguished leader for a period. Once a leader is established with promises from a quorum, it can skip Phase 1 for subsequent log entries and go directly to Phase 2, reducing the common-case write to one network round trip. Multi-Paxos is what Google Chubby, Google Spanner, and Apache ZooKeeper (via ZAB) implement, each with their own designs for the parts Paxos leaves unspecified.
Why do most new systems use Raft instead of Paxos?
Because Raft specifies everything needed to build a complete implementation — leader election, log replication, safety rules, membership changes, and log compaction — in a single coherent document. Engineers building a new consensus system on Raft can follow the specification closely and produce a correct implementation. Engineers building on Paxos must design the underspecified parts themselves, which is where most consensus implementation bugs originate. The operational benefit is equally important: Raft’s explicit leader and sequential log make cluster health directly observable and diagnosable under pressure.
Do large-scale systems still use Paxos?
Yes, through variants. Google Chubby uses a Multi-Paxos variant. Google Spanner uses Paxos to replicate data across datacenters, with TrueTime providing globally consistent timestamps. Apache ZooKeeper uses ZAB, a Paxos-inspired protocol with different crash recovery semantics. These systems were built before Raft existed or for specific requirements that their Paxos variants address. They are maintained by teams with deep expertise in their specific implementations. New systems choosing a consensus algorithm today typically choose Raft unless they have a specific reason to use a Paxos variant.
What is the write path in Raft?
A client sends a write request to the Raft leader. The leader appends the entry to its log with the current term number. The leader sends AppendEntries RPCs to all followers containing the new entry. Followers append the entry to their logs and respond. When the leader receives acknowledgements from a majority of nodes (including itself), the entry is committed. The leader applies the entry to its state machine, responds to the client, and sends the updated commit index to followers on the next AppendEntries. The minimum latency is one network round trip from leader to followers and back — in a single datacenter, typically one to five milliseconds.
Continue the Series
Series home: Distributed Systems — Concepts, Design & Real-World Engineering
Part 3 — Replication, Consistency & Consensus
- 3.1 — Why Replication Is Necessary
- 3.2 — Replication Models: Leader-Based, Multi-Leader and Leaderless
- 3.3 — Consistency Models: Strong, Eventual, Causal and Session
- 3.4 — The CAP Theorem Correctly Understood
- 3.5 — Quorums and Voting in Distributed Systems
- 3.6 — Why Consensus Is Hard in Distributed Systems
- 3.7 — Paxos vs Raft: Consensus Algorithms Compared
- 3.8 — Performance Trade-offs in Replicated Systems
- 3.9 — Engineering Guidelines for Replication, Consistency and Consensus
Previous: ← 3.6 — Why Consensus Is Hard in Distributed Systems
Next: 3.8 — Performance Trade-offs in Replicated Systems →
Not read Parts 1 or 2 yet?