Replication Models in Distributed Systems: Leader-Based vs Leaderless Explained


Distributed Systems Series — Part 3.2: Replication, Consistency & Consensus

How Replicas Are Organised Determines Everything

Post 3.1 established why replication is unavoidable. Post 3.3 established what clients observe depending on consistency guarantees. Post 3.4 established the fundamental limit that partitions impose.

This post answers the structural question that sits beneath all of those: how should replicas actually be organised?

There is no neutral answer. The replication model you choose quietly determines who is allowed to accept writes, how conflicts are handled, what happens during failures and network partitions, and whether strong consistency is even theoretically achievable. These decisions are made before any consensus algorithm, before any quorum calculation, before any consistency level configuration. They are made at the architectural level.

Replication models are not implementation details. They are architectural commitments — and changing them later requires rearchitecting the system.

Synchronous vs Asynchronous Replication

Before examining the three major structural models, there is a more fundamental question that cuts across all of them: when a write arrives, does the system wait for replicas to acknowledge it before confirming success to the client?

Synchronous replication means the leader (or primary node) does not acknowledge a write to the client until every follower designated as synchronous has confirmed it has durably stored the write. The client waits. The write latency is higher — it includes a network round trip to at least one replica. But the durability guarantee is stronger: if the leader fails immediately after acknowledging the write, at least one replica has the data and can take over without data loss.

PostgreSQL’s synchronous standby mode operates this way. A write to the primary is not acknowledged until the synchronous standby confirms receipt. This prevents data loss on primary failure at the cost of write latency that depends on network speed to the standby — which can be significant in cross-datacenter deployments.

Asynchronous replication means the leader acknowledges the write to the client immediately, then propagates it to followers in the background. Write latency is lower — the client does not wait for replicas. But there is a window between the acknowledgement and the replica receiving the write during which a leader failure could lose the write entirely. The client was told the write succeeded. It did not.

MySQL’s default replication is asynchronous. A write acknowledged by the primary may not yet have reached any replica. If the primary fails in that window, the write is lost — and the new primary elected from the replicas will not have it. Applications that cannot tolerate any data loss must use semi-synchronous or synchronous replication and accept the latency cost.

Semi-synchronous replication is a compromise: the leader waits for at least one replica to acknowledge receipt before confirming to the client, but does not wait for all replicas. This provides a meaningful durability guarantee — at least one copy survives a leader failure — without the full latency cost of waiting for all replicas. MySQL semi-synchronous replication implements this approach, and PostgreSQL configured with a single synchronous standby behaves equivalently.
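The difference between the three modes comes down to when the leader may acknowledge the client. A minimal sketch of that decision, where the mode names and threshold logic are illustrative rather than any particular database's API:

```python
# Sketch: when may the leader tell the client "write succeeded"?
# Illustrative only -- real systems expose this via configuration,
# not a function like this.

def may_acknowledge(mode, replicas_acked, total_replicas):
    """Return True once the client can be told the write succeeded."""
    if mode == "asynchronous":
        return True                              # ack immediately, replicate later
    if mode == "semi-synchronous":
        return replicas_acked >= 1               # wait for at least one replica
    if mode == "synchronous":
        return replicas_acked == total_replicas  # wait for all sync replicas
    raise ValueError(f"unknown mode: {mode}")
```

The asynchronous branch returning True with zero acknowledgements is exactly the data-loss window described above: the client has been told the write succeeded before any replica holds a copy.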

The synchronous vs asynchronous choice interacts with every replication model below. A leader-based system with asynchronous replication can lose writes on failover. A leaderless system with synchronous quorum writes cannot. The structural model and the replication timing mode together determine the system’s actual durability and consistency behaviour.

Leader-Based Replication

Leader-based replication — also called primary-replica or primary-standby replication — is the oldest and most widely deployed replication model. One replica is designated as the leader. All writes go to the leader. Followers replicate data from the leader. Reads may go to the leader for strong consistency, or to followers for higher read throughput at the cost of potentially stale data.

The mental model is intuitive: there is one authoritative source of truth, and everything else is a copy. This intuition maps cleanly onto how engineers typically think about databases, which is a large part of why this model is so widespread. PostgreSQL, MySQL, MongoDB (in replica set mode), Kafka (per partition), and etcd all use leader-based replication as their primary replication mechanism.

Why leader-based replication produces clean consistency: because all writes go through a single node, there is a natural total ordering of writes. The leader applies write one, then write two, then write three — and followers apply them in the same order. There are no concurrent writes to the same key on different nodes, so there are no conflicts to resolve. Strong consistency is achievable: route all reads to the leader and every client sees the most recent write.
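The total-ordering argument fits in a few lines: followers that replay the leader's log in the leader's order always converge on the same state, with no conflicts to resolve. A minimal sketch with an illustrative key-value log:

```python
# Sketch: every follower applies the leader's log in the same order,
# so every follower converges on the same state. Illustrative code.

def apply_log(log):
    """Replay an ordered list of (key, value) writes into a state dict."""
    state = {}
    for key, value in log:  # the same order on every replica
        state[key] = value
    return state

# The leader ordered these writes; "x" = 2 overwrote "x" = 1 everywhere.
leader_log = [("x", 1), ("x", 2), ("y", 9)]
```

Any two followers given `leader_log` end in identical states, which is the property multi-leader systems give up.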

Replication lag and its consequences: when replication is asynchronous, followers are always slightly behind the leader. The delay — replication lag — is usually milliseconds but can grow to seconds or minutes during periods of high write load or follower degradation. A client that writes to the leader and then reads from a follower may not see their own write. This is the most common replication-related bug in production systems using leader-based replication with read replicas.

The scenario is familiar to any engineer who has debugged it: a user submits a form, the write goes to the primary, the page reloads and the read goes to a replica that has not yet received the write, and the user sees the old value. The user assumes their submission failed and submits again. The fix — ensuring that reads immediately following writes go to the primary, or using read-your-writes consistency — requires understanding the replication model well enough to know the problem exists.
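One common fix for this bug can be sketched as a routing rule: pin a client's reads to the primary for a grace period after that client writes. The function names and the lag threshold below are illustrative assumptions, not a specific library's API:

```python
import time

# Sketch of read-your-writes routing: after a client writes, send that
# client's reads to the primary for longer than the expected replication
# lag. Threshold and names are illustrative.

MAX_EXPECTED_LAG = 1.0  # seconds; tune to observed replica lag

last_write_at = {}  # client_id -> timestamp of that client's latest write

def record_write(client_id, now=None):
    last_write_at[client_id] = time.time() if now is None else now

def choose_read_target(client_id, now=None):
    now = time.time() if now is None else now
    wrote_recently = now - last_write_at.get(client_id, 0.0) < MAX_EXPECTED_LAG
    return "primary" if wrote_recently else "replica"
```

This trades a little read load on the primary for the guarantee that a client never misses its own recent write; clients that have not written recently still get cheap replica reads.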

Leader failure and failover: when the leader fails, writes must stop until a new leader is elected or the leader recovers. This is the most operationally complex aspect of leader-based replication. Automatic failover — where the system elects a new leader without human intervention — requires a leader election protocol, which introduces the split-brain risk covered in Post 2.4. Manual failover is safer but slower. Neither is free.

Kafka’s partition leader model illustrates leader-based replication at large scale. Each Kafka topic partition has one leader broker and zero or more follower brokers. All reads and writes for a partition go through the leader. Followers replicate from the leader and are tracked in an in-sync replica set. When a leader broker fails, Kafka elects a new leader from the in-sync replicas — a process that takes seconds and during which that partition is unavailable. At scale, with thousands of partitions, leader elections happen continuously as brokers fail and recover, and Kafka’s design minimises their impact through careful partition distribution and fast election.

The leader as bottleneck: all writes flow through one node. Write throughput is bounded by what a single node can handle. This is the fundamental scalability limit of leader-based replication. Systems that need to scale writes beyond a single node must either shard data (multiple leaders, each owning a partition of the data) or move to a different replication model.

Multi-Leader Replication

Multi-leader replication — also called multi-master replication — allows multiple replicas to accept writes simultaneously. Each leader handles writes from clients in its region or zone, and writes are propagated asynchronously to other leaders in the background. When the propagated writes arrive at a replica that has already accepted a conflicting write, a conflict must be detected and resolved.

The motivation is clear: in a globally distributed system, forcing all writes through a single leader in one region means users in other regions pay a cross-region round-trip latency penalty on every write. Multi-leader replication eliminates that penalty — writes go to the nearest leader and are confirmed immediately, with cross-region propagation happening asynchronously afterward.

The conflict problem: multi-leader replication makes conflicts inevitable, not merely possible. If two clients in different regions concurrently update the same record, both writes succeed locally on their respective leaders. When the writes propagate to each other, both leaders now have two versions of the same record. The system must decide which version wins.

Last-write-wins (LWW) is the simplest resolution strategy: the write with the higher timestamp wins, and the other is discarded. It is also the most dangerous. As established in Post 1.5, clocks in distributed systems cannot be trusted for ordering. A write with a higher timestamp may not actually have happened later — it may simply have come from a node with a clock running slightly ahead. LWW silently discards data that clients believed was successfully written. For any system where data loss is unacceptable, LWW is the wrong strategy.
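The failure mode is easy to demonstrate with illustrative timestamps: replica B's clock runs ahead, so its write wins even though A's write genuinely happened later in real time:

```python
# Sketch of why LWW is dangerous: the replica with the fast clock wins
# regardless of real-time order. Timestamps are illustrative.

def lww_merge(a, b):
    """Keep the version with the higher timestamp; silently discard the other."""
    return a if a["ts"] >= b["ts"] else b

# Real order of events: B wrote first, then A wrote.
write_b = {"value": "from B", "ts": 1005}  # B's clock runs 10s ahead
write_a = {"value": "from A", "ts": 1000}  # A's clock is correct

winner = lww_merge(write_a, write_b)
# winner is write_b: the genuinely newer write from A is silently lost.
```

No error is raised and no log line is produced; the discarded write simply vanishes, which is why LWW data loss is so hard to detect in production.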

Conflict-free replicated data types (CRDTs) are a more principled approach: data structures designed so that concurrent updates can always be merged automatically without conflicts. Counters that only increment (each replica counts its own increments, and states merge by taking the per-replica maximum), sets where elements can be added but never removed (states merge by union), and similar merge-by-design structures all have this property. CRDTs work well for specific use cases — collaborative text editors, distributed counters, shopping carts — but cannot handle arbitrary data mutations.
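The grow-only counter (G-Counter) is the simplest CRDT and makes the merge property concrete. A sketch, not any library's API: each replica increments only its own slot, and merging takes the per-replica maximum, so concurrent increments are never lost.

```python
# Sketch of a G-Counter CRDT. State is a dict of replica_id -> count.
# Illustrative code, not a production implementation.

def increment(state, replica_id, amount=1):
    """A replica increments only its own slot."""
    state = dict(state)
    state[replica_id] = state.get(replica_id, 0) + amount
    return state

def merge(a, b):
    # Per-replica max is commutative, associative, and idempotent --
    # the three properties that make the merge conflict-free.
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}

def value(state):
    return sum(state.values())

# Two replicas increment concurrently, then exchange states.
a = increment({}, "r1")      # replica r1 adds 1
b = increment({}, "r2", 2)   # replica r2 adds 2
merged = merge(a, b)
# value(merged) is 3 in either merge order -- both increments survive.
```

Note what a naive "merge by summing totals" would do instead: repeated merges would double-count, which is why the per-replica structure matters.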

Application-level merging — exposing both conflicting versions to the application and letting it decide — is the most flexible approach but also the most complex to implement correctly. Amazon’s original Dynamo design used this approach for shopping carts, presenting both cart versions to the client and merging by taking the union of items.

Where multi-leader replication is used: CouchDB’s replication model is explicitly multi-leader with application-level conflict resolution. Active-active database deployments across multiple datacenters — where each datacenter accepts writes locally — use multi-leader replication. Collaborative editing tools handle concurrent edits with CRDTs or operational transformation (Google Docs uses operational transformation). Calendar applications that sync across devices typically use multi-leader replication with last-write-wins or per-field merge strategies.

The operational reality: most engineers who have worked with multi-leader replication in production describe it as significantly harder to operate than leader-based replication. Conflicts that seemed unlikely in testing appear regularly under production load. Conflict resolution logic that seemed simple contains edge cases. Debugging a data inconsistency across multiple leaders requires correlating logs from multiple systems across multiple regions. The performance benefits are real; so is the operational complexity.

Leaderless Replication

Leaderless replication eliminates the concept of a leader entirely. Clients send writes to multiple replicas directly, and reads query multiple replicas directly. Consistency is achieved not through a single authority but through overlapping quorums — requiring that enough replicas acknowledge a write and enough replicas respond to a read that at least one must have the most recent value.

Truth is determined by numbers, not authority.

The Amazon Dynamo paper introduced the most influential leaderless design in 2007. Cassandra, Riak, and Voldemort are all inspired by Dynamo’s architecture. The design choices in Dynamo were explicit and deliberate: Amazon prioritised availability and write performance over strong consistency for specific use cases — shopping carts, session data — where the cost of unavailability exceeded the cost of occasional inconsistency.

How quorums work: in a cluster of N replicas, a write requires W replicas to acknowledge it, and a read queries R replicas and takes the most recent value. When W + R > N, the write set and the read set must overlap — at least one replica that received the write will respond to the read. This guarantees that a read always sees at least one up-to-date copy of the data.

Common configurations: with N=3, W=2, R=2 provides a good balance — any two replicas must acknowledge a write, and any read queries two replicas, so there is always at least one overlap. With W=3, R=1, writes are slow but maximally durable — every replica must acknowledge before the client is told the write succeeded — and reads are fast and cheap, since any single replica is up to date. With W=1, R=3, writes are fast but reads are expensive — every replica must respond before the read returns.
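The overlap guarantee can be checked exhaustively for small clusters. This sketch uses brute-force enumeration rather than the algebraic argument, which makes the W + R > N boundary easy to see:

```python
from itertools import combinations

# Sketch: verify exhaustively that every possible W-replica write set
# intersects every possible R-replica read set. Illustrative code.

def quorums_overlap(n, w, r):
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))

# W + R > N guarantees overlap; W + R <= N does not.
assert quorums_overlap(3, 2, 2)       # 2 + 2 > 3: sets always intersect
assert not quorums_overlap(3, 1, 2)   # 1 + 2 == 3: disjoint sets exist
```

With N=3, W=1, R=2, the write can land on replica 0 while the read queries replicas 1 and 2, returning stale data with no warning — the exact misconfiguration the W + R > N rule exists to prevent.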

Handling stale replicas — read repair and hinted handoff: when a replica is temporarily down and misses some writes, it falls behind. Two mechanisms bring it back up to date. Read repair: when a client reads from multiple replicas and detects that one replica returned a stale value, it sends the up-to-date value back to the stale replica. This happens as a side effect of normal reads, so frequently read keys stay fresh without a separate repair process. Hinted handoff: when a write cannot reach its intended replica because the replica is down, another replica stores the write temporarily with a hint about where it should go. When the original replica recovers, the hint is delivered and the write is applied.
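Read repair can be sketched in a few lines. Versions are plain integers here for illustration — real systems use version vectors or timestamps — and replicas are modelled as in-memory dicts:

```python
# Sketch of read repair: query several replicas, take the value with the
# highest version, and push it back to any replica that returned a stale
# copy. Replicas are dicts mapping key -> (version, value). Illustrative.

def read_with_repair(replicas, key):
    responses = [(i, r.get(key, (0, None))) for i, r in enumerate(replicas)]
    # Pick the response with the highest version number.
    _, (latest_version, latest_value) = max(responses, key=lambda p: p[1][0])
    # Repair: push the fresh value back to every stale replica.
    for i, (version, _) in responses:
        if version < latest_version:
            replicas[i][key] = (latest_version, latest_value)
    return latest_value

# Replica 1 is stale and replica 2 missed the write entirely;
# one read returns the fresh value and repairs both.
replicas = [{"k": (2, "new")}, {"k": (1, "old")}, {}]
```

After a single `read_with_repair(replicas, "k")`, all three replicas hold the version-2 value — the read itself did the repair.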

The sloppy quorum: what happens when a network partition means the client cannot reach enough replicas to form a quorum? In a strict quorum system, the write fails. In a sloppy quorum — used by Cassandra and Dynamo — the write is accepted by any available replica, even one that would not normally be responsible for that data, and delivered to the correct replica later via hinted handoff. Sloppy quorums increase availability at the cost of making the quorum guarantee weaker during the partition.

Version vectors for conflict detection: when concurrent writes happen on different replicas, leaderless systems need a way to detect that two values are in conflict rather than that one simply supersedes the other. Version vectors — one counter per replica — track which replica has seen which writes and in what order. When a read returns two values with version vectors that cannot be ordered (neither is strictly greater than the other), the values are concurrent conflicts that must be resolved. This is the same mechanism described in Post 2.5 on Vector Clocks.
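The comparison rule can be sketched directly. Vectors are dicts of replica ID to counter (illustrative representation); a missing entry counts as zero:

```python
# Sketch of version-vector comparison. Two values conflict exactly when
# neither vector dominates the other. Illustrative code.

def compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent'."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happened before b; b supersedes a
    if b_le_a:
        return "after"       # b happened before a; a supersedes b
    return "concurrent"      # neither dominates: a genuine conflict

# Writes accepted independently on two replicas cannot be ordered:
# compare({"r1": 2, "r2": 0}, {"r1": 1, "r2": 1}) -> "concurrent"
```

The "concurrent" result is the signal that conflict resolution — LWW, CRDT merge, or application-level merging — must now run; a system that treats it as "before" or "after" silently loses a write.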

Where leaderless replication is used: Cassandra is the most widely deployed leaderless database. Its tunable consistency levels — ONE, LOCAL_QUORUM, QUORUM, ALL — allow applications to choose per-request how many replicas must respond, trading consistency for latency. Amazon DynamoDB takes its name and lineage from the original Dynamo design, though its internal replication details are abstracted from users and have diverged from the paper. Riak is a leaderless key-value store designed explicitly around the Dynamo model.

Real Systems Are Hybrids

Production distributed systems rarely implement a single replication model in pure form. The models described above are conceptual anchors — real systems adapt, combine, and layer them based on specific requirements.

Google Spanner uses leader-based replication within each Paxos group: each shard of data has a leader, elected through Paxos consensus, that handles writes for that shard. But no single node leads the entire database — transactions spanning multiple groups coordinate through two-phase commit across the per-group leaders. This gives Spanner the write-ordering simplicity of leader-based replication at the shard level with the availability and scalability of distributed coordination at the system level.

CockroachDB and YugabyteDB take a similar approach: Raft consensus groups manage individual data ranges, with a leader per range elected through Raft. The overall system has no single leader — ranges are distributed across nodes and each node leads some ranges and follows others. This is effectively sharded leader-based replication with automatic leader distribution.

MongoDB’s replica sets use leader-based replication, but sharded MongoDB clusters use multiple replica sets — each shard is a replica set with its own leader. The overall cluster is a hybrid: leader-based at the shard level, distributed coordination at the cluster level.

Understanding these hybrid architectures requires understanding all three base models. The hybrid is only comprehensible if you understand what each component is doing and why.

Choosing a Replication Model

The choice of replication model should follow from the system’s failure story — what the system must do when components fail — not from technology familiarity or trend.

Choose leader-based replication when: write ordering and strong consistency are important, write volume fits within a single node’s capacity, and the operational simplicity of a single write authority is valuable. Leader-based replication fails mysteriously less often than the alternatives and is easier to reason about under failure. PostgreSQL streaming replication, MySQL primary-replica, and Kafka partition leaders are all good defaults for these requirements.

Choose multi-leader replication when: users in multiple geographic regions must write with low latency and cross-region write latency is unacceptable. Accept that conflicts will happen and invest in conflict detection and resolution before choosing this model — not after discovering conflicts in production. Multi-leader replication is complex to operate correctly. Choose it only when the performance requirement genuinely demands it.

Choose leaderless replication when: availability during node failures is the primary requirement, write throughput must scale beyond a single node, and the application can tolerate tunable consistency rather than requiring strong consistency globally. Cassandra’s leaderless design makes it exceptionally resilient to node failures at the cost of requiring careful consistency level configuration to avoid stale reads.

Simpler models fail less mysteriously. A leader-based system that loses its leader produces a clear, diagnosable failure: writes stop. A leaderless system with misconfigured quorums produces subtle data inconsistencies that surface as application bugs days or weeks after the underlying cause. Operational clarity has real value.

Key Takeaways

  1. Replication models are architectural commitments that determine who accepts writes, how conflicts are handled, and what happens during failures — they must be chosen deliberately, not by default
  2. Synchronous replication provides stronger durability guarantees at higher write latency — asynchronous replication provides lower latency but creates a window where acknowledged writes can be lost on leader failure
  3. Leader-based replication provides natural write ordering and strong consistency at the cost of write scalability and leader failover complexity — it is the right default for most systems
  4. Multi-leader replication enables low-latency writes across geographic regions at the cost of inevitable conflicts that must be detected and resolved — last-write-wins is dangerous, CRDTs and application-level merging are safer
  5. Leaderless replication maximises availability and write throughput by distributing writes across replicas with quorum coordination — it moves complexity from the server to the client and from hardware failure to logical correctness
  6. Real production systems are hybrids — Spanner, CockroachDB, and YugabyteDB combine leader-based replication at the shard level with distributed coordination at the system level
  7. Simpler replication models fail less mysteriously — operational clarity is a real engineering value, not just a preference

Frequently Asked Questions (FAQ)

What are replication models in distributed systems?

Replication models define how data is copied and coordinated across multiple nodes. They specify who can accept writes, how writes propagate to replicas, how replicas handle failures, and how conflicts are detected and resolved. The three primary models are leader-based replication (one node accepts all writes), multi-leader replication (multiple nodes accept writes concurrently), and leaderless replication (writes go to multiple nodes with quorum coordination). The choice of model is an architectural commitment that determines the system’s consistency, availability, and failure behaviour.

What is the difference between synchronous and asynchronous replication?

In synchronous replication, the primary node waits for at least one replica to acknowledge a write before confirming success to the client — higher latency, stronger durability guarantee, no data loss on primary failure. In asynchronous replication, the primary confirms the write immediately and replicates in the background — lower latency, but a failure window exists during which acknowledged writes can be lost if the primary fails before replication completes. Semi-synchronous replication waits for one replica while propagating to others asynchronously, providing a practical middle ground.

What is replication lag and why does it matter?

Replication lag is the delay between a write being applied on the primary and that write appearing on replica nodes. In asynchronous replication, this delay can range from milliseconds to seconds or more under high load. The practical consequence is that a client who writes to the primary and then reads from a replica may not see their own write — the replica has not caught up yet. This produces the common production bug where a user submits a change and the page reload appears to show it was not saved. Read-your-writes consistency, implemented by routing post-write reads to the primary or using write tokens, addresses this.

What is a quorum in leaderless replication?

A quorum is the minimum number of replicas that must agree before a read or write is considered successful. In a cluster of N replicas, requiring W replicas to acknowledge a write and R replicas to respond to a read — where W + R > N — guarantees that the write set and read set overlap. At least one replica that received the write will always respond to the read, ensuring the read sees the most recent value. Common configurations: N=3, W=2, R=2 provides a balanced trade-off between availability and consistency.

Why is last-write-wins dangerous in multi-leader replication?

Last-write-wins resolves conflicts by keeping the write with the higher timestamp and discarding the other. It is dangerous because clocks in distributed systems cannot be trusted for ordering — a write with a higher timestamp may not actually have happened later, it may have come from a node with a clock running slightly ahead. LWW silently discards writes that clients were told had succeeded. For any system where data loss is unacceptable, LWW produces correctness violations that are difficult to detect and impossible to recover from.

When should I use leaderless replication?

Use leaderless replication when availability during node failures is the primary requirement and write throughput must scale beyond what a single node can handle. Cassandra’s leaderless design makes it exceptionally resilient — the system continues accepting writes even when multiple nodes are down, as long as a quorum remains reachable. The cost is that consistency requires careful configuration (choosing the right consistency level per operation) and that subtle consistency bugs are harder to diagnose than the clear failure modes of leader-based systems.


Continue the Series

Series home: Distributed Systems — Concepts, Design & Real-World Engineering

Part 3 — Replication, Consistency & Consensus

Previous: ←3.1 — Why Replication Is Necessary

Next: 3.3 — Consistency Models: Strong, Eventual, Causal and Session →

