Replication, Consistency and Consensus in Distributed Systems



The Hardest Problems in Distributed Systems

Part 1 established that nodes fail, networks are unreliable, and clocks cannot be trusted. Part 2 showed how distributed systems cope with those realities through communication patterns, retry strategies, coordination primitives, and logical time. Part 3 takes on the hardest question that remains: how do distributed systems store and replicate data correctly when the environment they operate in guarantees that replicas will disagree?

This is the most technically deep part of the series. The problems here — keeping copies of data consistent, defining what consistency actually guarantees, navigating the limits that network partitions impose, and achieving agreement across nodes that can fail at any moment — are the problems that took decades of research to formalise and that continue to drive incorrect implementations in production systems today.

Part 3 covers four interconnected topics that must be understood together.

Replication is where it starts. Storing data on a single node is a liability in a distributed system — one failure means total unavailability. Replication distributes copies across nodes to improve availability, fault tolerance, and durability. But replication immediately raises questions that have no simple answers: which node accepts writes, how do copies stay synchronised, what happens when replicas disagree, and what does a reader see when different replicas hold different values? The three replication models — leader-based, multi-leader, and leaderless — each answer these questions differently and make fundamentally different trade-offs between consistency, availability, and write throughput.

Consistency models define the contract between a replicated system and its clients. They answer the question a client actually cares about: when I read data, what value will I see? The spectrum runs from linearisability — every read returns the most recent write, the system appears to have a single copy — through causal consistency, read-your-writes, and monotonic reads, down to eventual consistency, where replicas will converge eventually with no timing guarantee. Choosing the right model is one of the most consequential architectural decisions in a distributed system, because it determines the replication design, the failure behaviour, and the class of bugs the system will produce under stress.

The CAP theorem formalises the fundamental limit that network partitions impose on replicated systems. During a partition, a system must choose between returning consistent data and remaining available. This is widely cited and widely misunderstood — the CP-vs-AP database taxonomy oversimplifies real system behaviour, and the PACELC model provides a more practically useful framework by also addressing the latency-vs-consistency trade-off during normal operation. Understanding CAP correctly means using it as a per-operation design prompt rather than a global system classification.

Consensus is the strongest form of the coordination problem. Where quorums allow progress without full agreement, consensus requires a permanent, globally ordered decision — exactly what leader election, distributed transaction commit, and replicated log entries demand. The FLP impossibility theorem proves this cannot be guaranteed in a purely asynchronous system, which is why every real consensus algorithm — including Paxos and Raft — assumes partial synchrony. Raft dominates production systems because it specifies every case that Paxos leaves underspecified, making implementations comparable, debuggable, and operationally observable in ways that Paxos variants are not.

Part 3 closes with a performance analysis of what replication and consensus actually cost in production — in milliseconds, in operations per second, in the speed-of-light constraints of geographic distribution — and engineering guidelines that translate everything above into decisions engineers can make immediately.


Part 3 Posts — Full Reading List

3.1 — Why Replication Is Necessary

Replication is not an optimisation — it is a mandatory response to the fundamental unreliability of distributed systems. Covers the five motivations (availability, fault tolerance, durability, performance, geographic distribution), the replication vs backups distinction and why conflating them is dangerous, the new failure modes replication introduces (split-brain, stale reads, write amplification, replication lag), and why replication converts hardware failures into coordination problems. Closes with the insight that replication is the door — consistency and consensus are what you find behind it.

Key concepts: availability, fault tolerance, durability, replication lag, split-brain, replication vs backups, write amplification, geographic distribution

Read this if: you want to understand why replication exists and what problems it introduces before examining how it is structured

3.2 — Replication Models: Leader-Based, Multi-Leader and Leaderless

The structural organisation of replicas determines failure behaviour, consistency guarantees, and performance limits before any algorithm is chosen. Covers synchronous vs asynchronous vs semi-synchronous replication and their durability trade-offs, leader-based replication (write ordering, replication lag consequences, leader failover, the Kafka partition leader model), multi-leader replication (geographic write availability, inevitable conflicts, last-write-wins dangers, CRDTs, application-level merging), and leaderless replication (quorum configuration, read repair, hinted handoff, sloppy quorums, vector clock conflict detection). Includes how Google Spanner, CockroachDB, and YugabyteDB use hybrid approaches in production.
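
To make the last-write-wins danger concrete, here is a minimal Go sketch; the versioned type and lwwMerge function are illustrative, not any database's actual API. Two replicas accept concurrent writes, clock skew inverts their apparent order, and the merge silently discards the causally later write:

```go
package main

import (
	"fmt"
	"time"
)

// versioned is a hypothetical replica value tagged with the writer's
// wall-clock timestamp (illustrative, not any database's actual schema).
type versioned struct {
	value string
	ts    time.Time
}

// lwwMerge resolves a conflict by keeping the write with the later
// timestamp. This is last-write-wins: simple, convergent, silently lossy.
func lwwMerge(a, b versioned) versioned {
	if b.ts.After(a.ts) {
		return b
	}
	return a
}

func main() {
	base := time.Now()
	// Replica A accepts a write; its clock is skewed 50ms into the future.
	writeA := versioned{value: "A's update", ts: base.Add(50 * time.Millisecond)}
	// Replica B accepts a causally later write, but its clock reads earlier.
	writeB := versioned{value: "B's later update", ts: base.Add(10 * time.Millisecond)}

	// B's update is discarded without any error: the write is simply gone.
	fmt.Println("survivor:", lwwMerge(writeA, writeB).value)
}
```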

Key concepts: synchronous replication, asynchronous replication, leader-based replication, multi-leader replication, leaderless replication, quorum configuration, read repair, hinted handoff, last-write-wins, CRDTs, replication lag

Read this if: you want to understand how the three major replication models differ in their failure behaviour and consistency trade-offs

3.3 — Consistency Models: Strong, Eventual, Causal and Session

Consistency models are contracts between a replicated system and its clients that define what clients can expect to observe when reading data. Covers the full spectrum — linearisability (the precise technical definition of strong consistency, used by etcd, ZooKeeper, and Google Spanner), causal consistency (causally related writes in order, used by MongoDB causally consistent sessions), read-your-writes (the minimum viable guarantee for user-facing applications, documented in Facebook’s TAO system), monotonic reads, and eventual consistency (the Amazon Dynamo design decision explained). Includes a comparison table of all models with production examples and a decision framework for choosing the right model per operation.
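
A minimal sketch of the bookkeeping that read-your-writes requires, using assumed names (session, lastWrite, applied) rather than any real client library: the client records the version of its own most recent write and reads only from replicas that have caught up to it:

```go
package main

import "fmt"

// replica is a hypothetical read endpoint; applied is the highest write
// version it has replicated so far.
type replica struct {
	applied int64
	data    map[string]string
}

// session holds the one piece of client-side state read-your-writes
// needs: the version of this client's own most recent write.
type session struct{ lastWrite int64 }

// readYourWrites serves the read only from a replica that has caught up
// to the client's last write; stale replicas are skipped.
func readYourWrites(s *session, replicas []*replica, key string) (string, bool) {
	for _, r := range replicas {
		if r.applied >= s.lastWrite {
			return r.data[key], true
		}
	}
	return "", false // no sufficiently fresh replica is reachable
}

func main() {
	stale := &replica{applied: 5, data: map[string]string{"profile": "old bio"}}
	fresh := &replica{applied: 7, data: map[string]string{"profile": "new bio"}}
	s := &session{lastWrite: 7} // this client just wrote version 7

	if v, ok := readYourWrites(s, []*replica{stale, fresh}, "profile"); ok {
		fmt.Println("read:", v) // "new bio": never older than the client's own write
	}
}
```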

Key concepts: linearisability, strong consistency, eventual consistency, causal consistency, read-your-writes, monotonic reads, session consistency, consistency spectrum, DynamoDB, Cassandra, MongoDB, Google Spanner

Read this if: you want to understand what each consistency model actually guarantees, what it costs, and how to choose the right one for each operation in your system

3.4 — The CAP Theorem Correctly Understood

The CAP theorem is widely cited and widely misunderstood. Covers the precise definitions of consistency (linearisability, not eventual consistency), availability (every request receives a non-error response), and partition tolerance (why it is not optional in any real distributed system), what the theorem actually states and does not state, the two-generals scenario showing exactly what a system must choose during a partition, why the CP-vs-AP database taxonomy is harmful (most systems make different trade-offs per operation), PACELC as the more practically useful extension that covers the latency-vs-consistency trade-off during normal operation, and how to apply CAP as a per-operation design prompt rather than a system classification. Includes the 2012 AWS US-East-1 outage as a production-scale CAP demonstration.
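
The per-operation framing is easy to make concrete. In this illustrative sketch (the node type and both read functions are hypothetical, not a real database API), the same partitioned node serves a CP-style read, which refuses to answer, and an AP-style read, which answers from a possibly stale local copy:

```go
package main

import (
	"errors"
	"fmt"
)

// node is a hypothetical replica's view during a network partition.
type node struct {
	partitioned bool
	local       string // the local copy, possibly stale
}

var errUnavailable = errors.New("cannot reach quorum: refusing to answer")

// cpRead chooses consistency: if the node cannot coordinate with a
// quorum, it returns an error rather than risk returning stale data.
func cpRead(n *node) (string, error) {
	if n.partitioned {
		return "", errUnavailable
	}
	return n.local, nil
}

// apRead chooses availability: it always answers from the local copy,
// accepting that the answer may be stale while the partition lasts.
func apRead(n *node) string {
	return n.local
}

func main() {
	n := &node{partitioned: true, local: "balance=100"}

	if _, err := cpRead(n); err != nil {
		fmt.Println("CP-style read:", err)
	}
	fmt.Println("AP-style read:", apRead(n), "(possibly stale)")
}
```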

Key concepts: CAP theorem, linearisability, partition tolerance, consistency vs availability, PACELC, CP systems, AP systems, network partition, Brewer’s conjecture, Gilbert and Lynch proof

Read this if: you want to understand what CAP actually says — not the slogan — and how to use it correctly in system design decisions

3.5 — Quorums and Voting in Distributed Systems

Quorums allow distributed systems to make progress without requiring unanimous agreement — the fundamental mechanism behind leaderless replication and many coordination protocols. Covers the W+R>N overlap guarantee with a step-by-step three-node worked example (normal operation, one node failure, recovery via read repair), common quorum configurations and their trade-offs (W=2 R=2 balanced, W=3 R=1 for read-heavy workloads where reads must be cheap, W=1 R=3 for write-heavy workloads where writes must be fast), strict vs sloppy quorums and why sloppy quorums increase availability during partitions at the cost of weakening the overlap guarantee, hinted handoff as the delivery mechanism for sloppy quorum writes, read repair as the healing mechanism, and why quorums do not automatically provide linearisability. Includes Cassandra’s LOCAL_QUORUM vs QUORUM in multi-datacenter deployments with concrete configuration guidance.
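
The overlap rule itself is a single inequality. A minimal sketch, using illustrative names rather than any database's configuration API, that checks W+R>N for the three configurations above:

```go
package main

import "fmt"

// quorumOverlap reports whether every read quorum must intersect every
// write quorum: with N replicas, any W-set and any R-set share at least
// one node exactly when W+R > N.
func quorumOverlap(n, w, r int) bool {
	return w+r > n
}

func main() {
	const n = 3
	configs := []struct{ w, r int }{
		{2, 2}, // balanced: reads are guaranteed to see the latest acknowledged write
		{3, 1}, // cheap reads: every write must reach all three replicas
		{1, 1}, // fast but unsafe: a read can miss the only up-to-date node
	}
	for _, c := range configs {
		fmt.Printf("N=%d W=%d R=%d -> overlap guaranteed: %v\n",
			n, c.w, c.r, quorumOverlap(n, c.w, c.r))
	}
}
```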

Key concepts: quorum, W+R>N rule, write quorum, read quorum, sloppy quorum, read repair, hinted handoff, Cassandra consistency levels, LOCAL_QUORUM, linearisability limits of quorums

Read this if: you want to understand how leaderless systems maintain consistency during failures and how to configure quorums for your specific availability and consistency requirements

3.6 — Why Consensus Is Hard in Distributed Systems

Consensus — getting multiple nodes to agree on a single permanent value — is required for leader election, transaction commit, and replicated log entries. It is also fundamentally hard in ways that are not obvious until you try to build it correctly. Covers the three formal properties (agreement, validity, termination) and the tension between them, the Two Generals Problem as the intuitive demonstration that agreement over unreliable channels is impossible, the FLP impossibility theorem and what it actually means (not that consensus is impossible but that it cannot be guaranteed to terminate in a purely asynchronous system), why real systems escape FLP through partial synchrony assumptions, crash-stop vs Byzantine failures and why they require fundamentally different algorithms, why timeouts are necessary but produce false suspicion, and why consensus must stay off the data path to preserve throughput. Includes the 2013 etcd split-brain incident as a documented production example.
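
The timeout dilemma is simple to demonstrate. In this sketch the probed node is alive but slow (the delay is simulated, and this is not a real failure detector), so whether it is suspected depends entirely on the timeout chosen:

```go
package main

import (
	"fmt"
	"time"
)

// pingNode simulates a health probe against a node that is alive but
// slow; the delay stands in for a GC pause, congestion, or an overloaded host.
func pingNode(reply chan<- struct{}, delay time.Duration) {
	time.Sleep(delay)
	reply <- struct{}{}
}

// suspect is the only failure detector an asynchronous system allows:
// a timeout. It cannot distinguish "crashed" from "slower than my patience".
func suspect(timeout, actualDelay time.Duration) bool {
	reply := make(chan struct{}, 1)
	go pingNode(reply, actualDelay)
	select {
	case <-reply:
		return false // the node answered in time
	case <-time.After(timeout):
		return true // suspected, possibly wrongly
	}
}

func main() {
	// The node is alive either way; it simply takes 150ms to respond.
	fmt.Println("100ms timeout, suspected:", suspect(100*time.Millisecond, 150*time.Millisecond))
	fmt.Println("200ms timeout, suspected:", suspect(200*time.Millisecond, 150*time.Millisecond))
}
```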

Key concepts: consensus, agreement, validity, termination, Two Generals Problem, FLP impossibility, partial synchrony, crash-stop failures, Byzantine failures, Byzantine fault tolerance, PBFT, timeouts and false suspicion, control plane vs data plane

Read this if: you want to understand why consensus is genuinely hard — not just complex to implement but mathematically constrained — before studying how Paxos and Raft achieve it

3.7 — Paxos vs Raft: Consensus Algorithms Compared

Paxos and Raft solve the same problem and provide equivalent safety guarantees. They optimise for entirely different things: Paxos for theoretical minimality, Raft for human understandability. Covers the replicated state machine problem (what both algorithms actually solve), Paxos’s two-phase prepare/promise/accept/accepted protocol, why Multi-Paxos is underspecified and every production implementation has diverged from the paper, Raft’s three sub-problems (leader election with randomised timeouts, log replication with the log matching property, safety rules), the write path in Raft step by step, batching and pipelining optimisations, and production deployments — etcd in Kubernetes, CockroachDB’s Multi-Raft, TiKV, Consul. Includes a direct eight-dimension comparison table.
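
The log matching property reduces to one check on the follower's side. A minimal sketch of that consistency check, using 1-based indexing as in the Raft paper; this is illustrative, not etcd's implementation:

```go
package main

import "fmt"

// entry is one slot in the replicated log: the term in which the leader
// created it, plus the state machine command it carries.
type entry struct {
	term int
	cmd  string
}

// acceptAppend is the follower's side of Raft's consistency check: new
// entries are accepted only if the follower's log already contains an
// entry at prevIndex created in prevTerm.
func acceptAppend(log []entry, prevIndex, prevTerm int) bool {
	if prevIndex == 0 {
		return true // appending from the very start of the log
	}
	if prevIndex > len(log) {
		return false // follower is missing entries; the leader must back up
	}
	return log[prevIndex-1].term == prevTerm
}

func main() {
	follower := []entry{{term: 1, cmd: "x=1"}, {term: 1, cmd: "x=2"}, {term: 2, cmd: "x=3"}}

	// The current leader appends after index 3, which it wrote in term 2: accepted.
	fmt.Println(acceptAppend(follower, 3, 2)) // true

	// A stale leader claims index 3 was written in term 1: rejected. This
	// is how divergent logs are detected and then repaired.
	fmt.Println(acceptAppend(follower, 3, 1)) // false
}
```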

Key concepts: Paxos, Raft, Multi-Paxos, replicated state machine, leader election, log replication, log matching property, randomised timeouts, election restriction, etcd, CockroachDB, TiKV, consensus algorithm comparison

Read this if: you want to understand how both consensus algorithms work, why Raft dominates production systems, and what the write path looks like in a Raft cluster

3.8 — Performance Trade-offs in Replicated Systems

A system can be correct, consistent, and fault-tolerant — and still fail users due to poor performance. Every correctness guarantee has a latency cost, every durability guarantee has a throughput cost. Covers write latency as max(replica response times) not the average, tail latency amplification through fan-out and why Google’s “The Tail at Scale” paper changed how engineers think about p99, write throughput bounded by the leader and how Multi-Raft addresses this, reads scaling linearly with replicas while writes do not, consensus requiring multiple network round trips and how batching and pipelining reduce this, the speed of light problem in geographic distribution with real latency numbers, backpressure preventing retry storms under overload, and hedged requests as the standard tail latency mitigation. Includes Google Spanner’s 100-200ms cross-region write latency as a concrete physics-bounded example.
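
Hedged requests are straightforward to sketch. In this illustrative example (latencies are simulated), a second copy of the request goes out only if the first replica has not answered within the hedge delay, capping the occasional tail outlier:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// queryReplica simulates a read whose latency is usually a few
// milliseconds but occasionally lands in the slow tail.
func queryReplica(name string) string {
	delay := time.Duration(rand.Intn(10)) * time.Millisecond
	if rand.Intn(10) == 0 {
		delay = 500 * time.Millisecond // the tail outlier
	}
	time.Sleep(delay)
	return name
}

// hedgedRead sends the request to one replica and, if no answer arrives
// within hedgeAfter, sends a second copy to another replica, returning
// whichever responds first.
func hedgedRead(primary, backup string, hedgeAfter time.Duration) string {
	result := make(chan string, 2) // buffered so the losing goroutine can still send
	go func() { result <- queryReplica(primary) }()
	select {
	case r := <-result:
		return r
	case <-time.After(hedgeAfter):
		go func() { result <- queryReplica(backup) }()
		return <-result
	}
}

func main() {
	fmt.Println("served by:", hedgedRead("replica-1", "replica-2", 20*time.Millisecond))
}
```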

Key concepts: write latency, tail latency, p99, hedged requests, write scalability, read scalability, leader bottleneck, Multi-Raft, batching, pipelining, geographic latency, backpressure, speed of light constraint

Read this if: you want to understand what replication and consensus actually cost in production — in milliseconds and operations per second — and how production systems manage those costs

3.9 — Engineering Guidelines: Replication, Consistency and Consensus

Nine practical engineering principles that translate Part 3 theory into decisions engineers can make immediately — in design reviews, architecture discussions, and incident analysis. Covers: design for failure first (failure scenario analysis as a design input), use the weakest consistency that preserves invariants (the consistency decision ladder from linearisability to eventual), separate control plane from data plane (the most important architectural principle for consensus-backed systems), tail latency dominates (design around p99 not p50), plan for leader bottlenecks before they occur (sharding strategy timing), prefer understandability over theoretical optimality (the operational case for Raft), make trade-offs explicit and documented (undocumented consistency is a correctness bug), measure reality not assumptions (load testing, fault injection, chaos engineering), and design for evolution not permanence. Closes with a complete design review checklist covering failure design, consistency model, replication, consensus placement, performance, and observability.

Key concepts: engineering guidelines, consistency decision framework, control plane vs data plane, design review checklist, trade-off documentation, fault injection, chaos engineering, leader bottleneck planning, consistency model selection

Read this if: you want a practical synthesis of Part 3 that you can apply immediately — a reference to return to during design reviews and architecture decisions


What Part 3 Prepares You For

Part 3 answers the question of how distributed systems store and replicate data correctly. By the end of Part 3, the design decisions that previously seemed like arbitrary technology choices — why Cassandra uses eventual consistency, why etcd uses Raft, why Spanner sacrifices write latency for global consistency — become comprehensible as deliberate trade-offs within the framework Part 3 establishes.

Part 4 builds directly on this foundation. Fault tolerance and high availability require understanding what replication provides (redundancy, the ability to survive individual node failures) and what it does not provide (protection against correlated failures, the ability to detect which replica is authoritative after a partition heals). Failure detection — knowing when a node has failed vs when it is merely slow — is, at the operational layer, the same problem that consensus algorithms address through timeouts and elections. Observability in a replicated system requires understanding replication lag, consistency model behaviour under failure, and what Raft’s term and log index metrics actually tell you about cluster health.

Part 3 establishes how systems agree on data. Part 4 establishes how systems survive losing the nodes that hold it.


Continue the Series

Series home: Distributed Systems — Concepts, Design & Real-World Engineering

Previous: Part 2 — Communication & Coordination

Next: Part 4 — Fault Tolerance & High Availability

Part 4 examines what happens when components fail — how systems detect failures, recover from them, and maintain availability despite ongoing partial failures. The operational layer that makes everything in Parts 1 through 3 work reliably in production.