Performance Trade-offs in Distributed Systems: Replication vs Consensus

Distributed Systems Series — Part 3.8: Replication, Consistency & Consensus

Correct Systems Can Still Be Slow

Part 3 has covered why replication is necessary, how consistency models define client guarantees, what CAP means in practice, how quorums work, why consensus is hard, and how Paxos and Raft achieve it. Every post has focused on correctness — making sure the system behaves as specified despite failures.

Now we face a harder truth: a system can be correct, consistent, and fault-tolerant — and still fail users due to poor performance.

Performance is where distributed systems expose their trade-offs most brutally. Every correctness guarantee has a latency cost. Every durability guarantee has a throughput cost. Every availability guarantee has a coordination cost. Understanding these costs — not abstractly but concretely, in milliseconds and operations per second — is what allows engineers to make design decisions that are both correct and fast enough to be useful.

This post examines the performance characteristics of replication and consensus, where the costs come from, and how production systems manage them.

Replication Adds Latency — Always

A write to a single node is bounded only by the speed of that node’s storage. A write to a replicated system must propagate to multiple nodes before it is considered durable. That propagation takes time.

The fundamental law of replicated write latency:

Write latency ≈ max(replica response times)

Not the average. Not the median. The maximum — because the write is not complete until the slowest required replica acknowledges it. In synchronous replication requiring W replicas to acknowledge before confirming to the client, the write latency is determined by the slowest of those W replicas, not the fastest.
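The max-of-required-acknowledgements rule can be sketched in a few lines. This is a toy model, assuming the write completes when the W-th acknowledgement arrives; the function name and the latency values are illustrative:

```python
def replicated_write_latency(replica_latencies_ms, w):
    """Latency of a write that needs acknowledgement from W replicas.

    The write completes when the W-th acknowledgement arrives, so its
    latency is the W-th smallest replica response time: the maximum over
    the replicas actually required, not the average.
    """
    return sorted(replica_latencies_ms)[w - 1]

# Three replicas respond in 2ms, 3ms, and 40ms (one is slow).
latencies = [2.0, 3.0, 40.0]
print(replicated_write_latency(latencies, w=2))  # 3.0 -- second ack completes the write
print(replicated_write_latency(latencies, w=3))  # 40.0 -- must wait for the slowest
```

With W=3 the single slow replica dictates the latency of the whole write, which is the effect the formula above describes.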

In a single datacenter with fast, homogeneous hardware, this maximum is small — typically 1 to 5 milliseconds for a synchronous write to a three-node cluster. Replicas are nearby, network latency is sub-millisecond, and hardware is similar. The performance cost of replication is modest.

Across datacenters in the same region — US-East-1a to US-East-1b — network latency is typically 1 to 3 milliseconds. Still manageable for synchronous replication.

Across geographic regions — US-East to EU-West — network round-trip time is typically 80 to 100 milliseconds. A synchronous write requiring acknowledgement from a replica in a different region adds this latency to every write. A system that commits writes synchronously across US-East and EU-West cannot acknowledge a write in less than approximately 80 milliseconds, regardless of how fast the local storage is. This is not an engineering failure — it is the speed of light.

Google Spanner operates globally with synchronous replication across datacenters. Its write latency for cross-region transactions is approximately 100 to 200 milliseconds. This is acceptable for Spanner’s use cases — financial transactions, inventory management — where correctness matters more than write throughput. It would be completely unacceptable for a low-latency API serving user requests.

Tail Latency Is the Real Enemy

Engineers optimising distributed systems instinctively focus on typical latency: the mean, or the median (p50). Both hide the performance problem that actually matters in replicated systems.

In 2013, Google engineers published “The Tail at Scale,” which documented a pattern that every engineer working with distributed systems should understand. When a request fans out to multiple components — as every replicated read or write does — the end-to-end latency is determined by the slowest component in the fan-out. If each of ten components has a 1% chance of taking more than 1 second, the probability that at least one of them takes more than 1 second is approximately 10%. A 1% slow tail becomes a 10% slow aggregate.

In a replicated system waiting for W replicas to acknowledge a write, the write latency is the maximum of those W response times. If a write must be acknowledged by two specific replicas, the chance of hitting at least one slow replica roughly doubles compared to a single-node write; with three, it roughly triples. Replication multiplies tail risk.
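The fan-out arithmetic from "The Tail at Scale" is easy to verify directly. A minimal sketch, assuming independent components that each exceed the latency threshold with the same probability:

```python
def slow_request_probability(p_slow_single, fan_out):
    """Probability that at least one of `fan_out` independent components
    exceeds the latency threshold, given each does so with probability
    `p_slow_single`."""
    return 1 - (1 - p_slow_single) ** fan_out

# A 1% slow tail per component with a fan-out of ten:
print(round(slow_request_probability(0.01, 10), 4))  # 0.0956 -- roughly 10%

# Waiting for more replicas multiplies tail exposure:
for w in (1, 2, 3):
    print(w, round(slow_request_probability(0.01, w), 4))  # 0.01, 0.0199, 0.0297
```

The 1% per-component tail becoming a roughly 10% aggregate tail is exactly the pattern the paper documents.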

The sources of tail latency in replicated systems are well-understood: garbage collection pauses (a JVM replica pausing for GC adds hundreds of milliseconds to every write that requires its acknowledgement), noisy neighbours on shared hardware, network congestion causing packet retransmits, and disk flushes that take longer than expected under high write load.

The practical implication: design for p99 and p99.9, not p50. A system that has 5ms p50 write latency and 500ms p99 write latency will produce a terrible user experience — 1 in 100 writes takes half a second. Users experience the tail, not the median.

Google’s “The Tail at Scale” introduced two techniques that reduce tail latency in replicated systems. Hedged requests: send the same request to two replicas simultaneously and use whichever responds first. The second response is discarded. This roughly halves p99 latency at the cost of doubling read load — an acceptable trade-off when tail latency matters more than resource efficiency. Tied requests: send the request to one replica, and if it has not responded within a short threshold (say, the p95 response time), send the same request to a second replica. The second replica is cancelled if the first one responds. This provides most of the tail latency benefit of hedging at lower additional load.
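A hedged read can be sketched with asyncio. This is an illustrative sketch, not the paper's implementation; `fetch` and the simulated replica delays are invented for the demo:

```python
import asyncio

async def hedged_request(fetch, replicas):
    """Send the same request to two replicas at once, return the first
    response, and cancel the straggler. `fetch(replica)` is a hypothetical
    coroutine that performs the actual read."""
    tasks = [asyncio.create_task(fetch(r)) for r in replicas[:2]]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # discard the slower response
    return done.pop().result()

# Demo with simulated replica latencies: replica "b" answers first.
async def fetch(replica):
    delay = {"a": 0.2, "b": 0.01}[replica]
    await asyncio.sleep(delay)
    return f"value from {replica}"

print(asyncio.run(hedged_request(fetch, ["a", "b"])))  # value from b
```

A tied-request variant would start the second `fetch` only after a short timeout (around the p95 response time) instead of immediately, trading a little tail-latency benefit for much less duplicate load.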

Write Scalability Is Bounded by the Leader

Leader-based replication — used by Raft, Multi-Paxos, PostgreSQL streaming replication, and Kafka partition leaders — has a fundamental throughput ceiling: all writes flow through one node. The leader must receive the write, append it to the log, replicate it to followers, wait for quorum acknowledgement, and commit. The write throughput of the entire cluster is bounded by how fast one node can do this.

In practice, a well-tuned Raft cluster on modern hardware can sustain approximately 10,000 to 100,000 writes per second, depending on payload size, disk speed, and whether writes are batched. etcd’s documented benchmark results show approximately 10,000 writes per second for small key-value operations on commodity hardware. This is more than sufficient for control-plane workloads — cluster metadata, configuration, leader election — which is what etcd is designed for.

It is not sufficient for data-plane workloads at scale. A database receiving 1 million writes per second cannot route all writes through a single Raft leader. The solution is horizontal partitioning: shard the data across many independent Raft groups, each with its own leader. CockroachDB and TiKV both use this Multi-Raft approach — thousands of Raft groups running simultaneously, each handling writes for a portion of the key space. The write throughput of the overall system scales with the number of shards, not with the capabilities of a single node.
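The routing step of a Multi-Raft design can be sketched as a key-to-group mapping. Hash partitioning is shown purely for brevity; as noted, CockroachDB and TiKV actually split the key space into contiguous ranges, and `raft_group_for_key` is a hypothetical helper:

```python
import hashlib

def raft_group_for_key(key: str, num_groups: int) -> int:
    """Map a key to one of `num_groups` independent Raft groups.

    Each group has its own leader, so aggregate write throughput scales
    with the number of groups rather than with one node's capacity.
    """
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_groups

# Different keys land on different groups, so their writes proceed in parallel.
for key in ["user:1001", "user:1002", "order:77", "order:78"]:
    print(key, "-> raft-group", raft_group_for_key(key, num_groups=1024))
```

The essential property is that the mapping is deterministic: every node routes a given key to the same group, so each group's leader serialises only its own slice of the key space.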

Read scalability is a different story. With leader-based replication, reads can be served by any replica — not just the leader — if the application tolerates stale reads. Adding replicas increases read throughput proportionally: ten replicas can serve ten times the read load of one replica. This asymmetry — reads scale with replicas, writes are bounded by the leader — is one of the most important practical insights in distributed systems performance design.

Systems with heavily read-skewed workloads exploit this aggressively. MySQL read replicas, PostgreSQL hot standbys, and Redis replicas all serve read traffic from followers while routing writes to the primary. The primary’s write throughput is not improved, but the read throughput scales essentially without limit — limited only by the cost of adding and maintaining additional replicas.

Consensus Adds Multiple Network Round Trips

A single Raft write requires at minimum two network round trips from the client’s perspective. The client sends the write to the leader. The leader replicates to followers. The followers acknowledge. The leader commits. The client receives the response. In a single datacenter with 1ms network round-trip times, this adds 2 to 4ms to every write beyond the storage latency.

The number of phases in consensus is not arbitrary — it reflects the minimum number of message exchanges required to guarantee safety under the failure model. Paxos requires two phases (prepare/promise and accept/accepted) for the first value proposed under a new ballot. Multi-Paxos and Raft optimise the common case to a single replication round per log entry once a leader is established. But the leader establishment itself costs an election round, and network round trips cannot be eliminated — only minimised.

Two optimisations make Raft systems fast in practice despite this overhead.

Batching: instead of committing each write individually through a full replication round, the leader accumulates multiple writes and commits them as a single log entry. If 100 client writes arrive within a 1ms window, the leader can replicate all 100 in a single AppendEntries RPC and commit them together. The per-write overhead of the consensus round is amortised across all writes in the batch. etcd, CockroachDB, and TiKV all use batching extensively. The trade-off is latency — a write that arrives alone may wait up to the batch timeout before being committed. For high-throughput workloads, this is acceptable. For low-latency workloads expecting sub-millisecond commit times, batching adds perceptible delay.
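A minimal batcher illustrates the amortisation. This is a sketch, not any real system's implementation; the `replicate` callback stands in for one AppendEntries round, and real implementations also flush on a timer, which is omitted here:

```python
class WriteBatcher:
    """Accumulate writes and flush them as one log entry, amortising the
    consensus round across the whole batch."""

    def __init__(self, replicate, max_batch=100):
        self.replicate = replicate  # one call = one replication round
        self.max_batch = max_batch
        self.pending = []

    def submit(self, write):
        self.pending.append(write)
        if len(self.pending) >= self.max_batch:
            self.flush()

    def flush(self):
        if self.pending:
            self.replicate(self.pending)  # one round commits N writes
            self.pending = []

rounds = []
batcher = WriteBatcher(replicate=rounds.append, max_batch=100)
for i in range(250):
    batcher.submit(f"write-{i}")
batcher.flush()  # a real system would flush partial batches on a timeout

print(len(rounds))               # 3 -- 250 writes cost 3 replication rounds
print([len(b) for b in rounds])  # [100, 100, 50]
```

Without batching, the same 250 writes would cost 250 replication rounds; the final partial batch also shows where the latency trade-off lives, since a lone write waits for the flush.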

Pipelining: instead of waiting for round N to complete before sending round N+1, the leader sends multiple AppendEntries RPCs without waiting for acknowledgements, maintaining a pipeline of in-flight replication rounds. This keeps the network saturated and hides the per-round latency. The leader tracks which entries are committed by monitoring acknowledgements, committing entries as their quorum is reached. Pipelining significantly increases throughput at the cost of slightly increased implementation complexity — the leader must handle out-of-order acknowledgements and ensure commits happen in log order.
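The commit-ordering bookkeeping that pipelining requires can be sketched as follows. This models only the leader's accounting, assuming a quorum of two and omitting RPCs, retries, and term handling:

```python
class PipelinedLeader:
    """Track in-flight replication rounds and advance the commit index in
    log order, even when follower acknowledgements arrive out of order."""

    def __init__(self, quorum=2):
        self.quorum = quorum
        self.acks = {}        # log index -> acknowledgement count
        self.commit_index = 0

    def send(self, index):
        # The leader's own local append counts toward the quorum.
        self.acks[index] = 1

    def on_ack(self, index):
        self.acks[index] += 1
        # Commit strictly in log order: advance only while the *next*
        # entry has reached quorum, regardless of later acks.
        while self.acks.get(self.commit_index + 1, 0) >= self.quorum:
            self.commit_index += 1

leader = PipelinedLeader(quorum=2)
for i in (1, 2, 3):
    leader.send(i)            # three rounds in flight at once
leader.on_ack(3)              # out-of-order ack: entry 3 cannot commit yet
print(leader.commit_index)    # 0
leader.on_ack(1)
print(leader.commit_index)    # 1 -- entry 1 commits; entry 2 still lacks quorum
leader.on_ack(2)
print(leader.commit_index)    # 3 -- entries 2 and 3 commit together
```

The out-of-order acknowledgement for entry 3 is exactly the complexity mentioned above: it must be remembered but cannot advance the commit index until the gap before it closes.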

The Throughput-Consistency Trade-off

Stronger consistency directly reduces throughput. The relationship is not incidental — it is structural.

Linearisable writes must be totally ordered. A single leader serialises all writes — they are applied one at a time, in order. No two writes can be processed concurrently on the leader, because concurrent processing would break the total ordering that linearisability requires. This serialisation is the ceiling on write throughput.

Eventual consistency removes this ceiling. In a leaderless system where every node accepts writes, all nodes can accept writes simultaneously. Writes are not serialised — they are accepted independently and reconciled later. The aggregate write throughput scales with the number of nodes. Cassandra’s write throughput scales nearly linearly with cluster size because writes are distributed across nodes without central coordination.

The cost is that application logic must handle conflicts, stale reads, and the complexity of eventual convergence. The throughput gain is real; so is the operational complexity it transfers to the application.

Most production systems use a spectrum between these extremes. They use strong consistency for operations where correctness is non-negotiable — configuration changes, leader elections, financial transactions — and eventual consistency for operations where throughput matters more than immediate consistency — analytics ingestion, cache updates, social feed writes. The art is knowing which operations belong in which category.

Backpressure and Overload in Replicated Systems

As established in Post 2.2, retries without backpressure produce retry storms. The same dynamic applies to replication under overload.

When a leader receives more writes than it can replicate and commit, its replication queue grows. Followers fall behind — replication lag increases. Clients waiting for acknowledgements begin timing out. If clients retry on timeout, the leader’s write queue grows faster. The replication lag grows faster. Eventually the system is processing retries rather than new writes, throughput collapses, and the cluster may lose quorum if followers fall far enough behind.

Backpressure in replicated systems means the leader must signal to clients — or to load balancers upstream — when it is approaching its capacity. This can take several forms: explicit rate limiting, increasing write latency (which clients interpret as a slow-down signal), or returning specific error codes that clients are designed to back off on. etcd implements this through a flow control mechanism that limits the rate at which entries are proposed to the Raft log. CockroachDB’s admission control system limits the rate of incoming writes based on the health of the Raft replication pipeline.
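A queue-depth admission check illustrates the simplest form of this signal. It is a sketch, not etcd's or CockroachDB's mechanism; production systems use richer overload signals than raw queue depth:

```python
class AdmissionController:
    """Reject new writes once the replication queue is too deep, so
    clients back off instead of amplifying overload into a retry storm."""

    def __init__(self, max_queue_depth=1000):
        self.max_queue_depth = max_queue_depth
        self.queue = []

    def submit(self, write):
        if len(self.queue) >= self.max_queue_depth:
            return False  # overload signal: client should back off and retry later
        self.queue.append(write)
        return True

    def replicated(self):
        # Called when a queued entry finishes its replication round.
        if self.queue:
            self.queue.pop(0)

ctrl = AdmissionController(max_queue_depth=3)
results = [ctrl.submit(f"w{i}") for i in range(5)]
print(results)  # [True, True, True, False, False] -- writes four and five rejected
ctrl.replicated()               # one entry commits, freeing capacity
print(ctrl.submit("w5-retry"))  # True -- the retried write is now admitted
```

The key property is that rejection happens before the queue grows unboundedly: a rejected write costs the cluster almost nothing, while an accepted-then-timed-out write costs a full replication attempt plus a client retry.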

Without backpressure, overload is catastrophic in consensus systems. With it, the system degrades gracefully — some writes are delayed or rejected, but the cluster remains healthy and recovers when load reduces.

Geographic Distribution: The Speed of Light Problem

Every performance trade-off described so far is manageable within a single datacenter or region. Geographic distribution makes all of them dramatically harder.

The fundamental constraint is physics. The speed of light in fibre optic cable is approximately 200,000 kilometres per second. New York to London is approximately 5,500 kilometres one way, so a round trip covers about 11,000 kilometres of fibre — a minimum of 55 milliseconds. Real-world round-trip times are 70 to 100 milliseconds due to routing, switching, and protocol overhead. No engineering optimisation can reduce this below the physical limit.
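The latency floor is a one-line calculation using the 200,000 km/s figure above. The 5,500 km figure is the New York to London distance from the text; the 50 km intra-region distance is an illustrative assumption:

```python
SPEED_OF_LIGHT_IN_FIBRE_KM_S = 200_000  # roughly two-thirds of c in a vacuum

def min_rtt_ms(one_way_km):
    """Physical lower bound on round-trip time over fibre, ignoring
    routing, switching, and protocol overhead."""
    return 2 * one_way_km * 1000 / SPEED_OF_LIGHT_IN_FIBRE_KM_S

print(min_rtt_ms(5_500))  # 55.0 ms -- New York to London, before any overhead
print(min_rtt_ms(50))     # 0.5 ms  -- replicas 50 km apart within a region
```

The two orders of magnitude between these numbers is why synchronous commit is routine within a region and expensive across continents.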

A consensus system requiring synchronous cross-region acknowledgement adds this latency to every write. A three-region active-active deployment — US-East, EU-West, AP-Southeast — requiring a write to be acknowledged by a majority of regions before committing will have a minimum write latency of approximately 100ms for the US-East to EU-West round trip. This is the cost of global durability.

Production systems that need both global durability and low write latency use one of three approaches.

Regional leaders with async cross-region replication: writes are synchronously committed within a region (1-5ms) and asynchronously replicated to other regions. Reads from remote regions may be stale. This is the approach used by most globally distributed relational databases in their default configuration.

Sharded global state: data is partitioned by geography, and each shard’s leader is co-located with the users who primarily write that shard. A European user’s data is owned by the EU-West region and committed there with local latency. Cross-region reads are possible but cross-region writes are rare. This is how Shopify and Stripe structure their global database deployments.

Local region consensus with global metadata: each region runs its own consensus cluster for data operations. A global consensus cluster handles metadata — which regions exist, which shards are assigned where — but is not in the critical path of data writes. This is the architecture underlying systems like Cosmos DB and CockroachDB’s multi-region deployments.

Practical Performance Guidelines

Measure tail latency, not averages. Your users experience p99, not p50. A 5ms average with a 500ms p99 is a poorly performing system. Instrument your replication and consensus paths for p99 and p99.9 latency, not just mean latency.

Use consensus sparingly — keep it off the data path. As established in Post 3.6, consensus is expensive. Every write that requires a full consensus round pays the multi-round-trip cost. Keep consensus on the control plane. Use replication for data — it is cheaper, faster, and scales better.

Design for read scalability separately from write scalability. Reads and writes have fundamentally different scaling properties in leader-based systems. Exploit this: add replicas to scale reads, shard to scale writes. Do not conflate the two problems.

Use batching and pipelining in consensus-backed systems. These optimisations reduce the per-operation cost of consensus significantly under high load. Most production consensus libraries implement them — ensure they are enabled and tuned for your workload.

Account for the speed of light in geographic deployments. Cross-region latency is bounded by physics. Design your consistency requirements around what geographic latency you are willing to pay, not around an idealised assumption that global consensus is fast.

Implement backpressure before you need it. A replicated system without backpressure will experience catastrophic throughput collapse under overload. Build the signal for overload into the system before the first production incident reveals its absence.

Key Takeaways

  1. Write latency in replicated systems is determined by the slowest required replica — the maximum of W response times, not the average — which is why tail latency matters more than averages in distributed systems
  2. Tail latency multiplies with fan-out — waiting for W replicas means W chances of hitting a slow replica; hedged requests and tied requests are the standard techniques for reducing tail latency at the cost of additional load
  3. Write throughput in leader-based systems is bounded by the leader’s capacity — reads scale linearly with replicas, writes scale only through sharding; this asymmetry is one of the most important practical insights in distributed systems performance
  4. Consensus adds multiple network round trips per write — batching and pipelining amortise this cost under high throughput, but low-latency workloads still pay the baseline per-round cost
  5. Stronger consistency directly reduces throughput — linearisable writes must be serialised through a leader, capping concurrency; eventual consistency removes this ceiling at the cost of conflict resolution complexity
  6. Geographic distribution introduces physics-bounded latency — cross-region round trips take 70 to 100ms regardless of engineering, which is why global synchronous consensus is expensive and most systems use regional leaders with asynchronous cross-region replication
  7. Backpressure is essential in replicated systems — without it, overload produces retry storms that collapse the replication pipeline; with it, the system degrades gracefully and recovers when load reduces

Frequently Asked Questions (FAQ)

Does replication always improve performance?

No. Replication improves availability and fault tolerance but always adds write latency — every write must propagate to W replicas before being confirmed. Replication improves read performance when reads are served from replicas rather than the primary, because read load is distributed. But it never improves write performance — it always makes writes slower than a single-node write. The trade-off is durability and availability in exchange for write latency.

What is tail latency and why does it matter in distributed systems?

Tail latency is the latency at the high percentiles — p99, p99.9 — rather than the median. In distributed systems, tail latency matters more than average latency because replication fan-out means every write waits for the slowest required replica. If the p99 response time of a single replica is 100ms, the p99 write latency waiting for two replicas is approximately the probability that at least one of them takes more than 100ms — which is roughly double the single-replica probability. Google’s “The Tail at Scale” paper documented this effect and introduced hedged requests as a mitigation.

Why do consensus systems have write throughput limits?

Because consensus requires total ordering of writes — all writes must be applied in the same order on all replicas — and total ordering requires serialisation. In leader-based consensus, all writes flow through the leader, which applies them sequentially to the replicated log. The leader’s throughput is the ceiling for the entire cluster. Batching and pipelining improve this ceiling significantly by amortising the per-write overhead of consensus rounds, but they cannot eliminate the fundamental serialisation constraint.

What is the difference between write scalability and read scalability in replicated systems?

Read scalability and write scalability are fundamentally different in leader-based replication. Reads can be served by any replica — adding replicas proportionally increases read capacity. A cluster of ten replicas can serve roughly ten times the read load of one replica. Writes must go through the leader — the leader’s capacity is the ceiling regardless of how many replicas exist. To scale writes, you must shard data across multiple independent leaders, each handling writes for a subset of the key space. This is why CockroachDB and TiKV use many Raft groups rather than one.

Why is global synchronous consensus expensive?

Because of physics. Light travels at a finite speed, and real-world network round-trip times between geographic regions are bounded below by the speed of light in fibre. US-East to EU-West is approximately 80 to 100ms round-trip time. A consensus protocol requiring acknowledgement from a replica in a remote region adds this latency to every write, regardless of how fast the local hardware is. This is why most globally distributed systems use regional consensus with asynchronous cross-region replication rather than synchronous global consensus.

What is backpressure and why is it important in replicated systems?

Backpressure is the mechanism by which a system signals to producers that it is approaching capacity and they should slow down. In replicated systems under overload, the leader’s replication queue grows, followers fall behind, clients begin timing out and retrying, and retries increase load further — a positive feedback loop that collapses throughput. Backpressure breaks this loop by limiting the rate of incoming writes before the queue grows out of control, allowing the system to degrade gracefully under overload rather than collapsing catastrophically.


Continue the Series

Series home: Distributed Systems — Concepts, Design & Real-World Engineering

Part 3 — Replication, Consistency & Consensus

Previous: ← 3.7 — Paxos vs Raft: Consensus Algorithms Compared

Next: 3.9 — Engineering Guidelines for Replication and Consistency →

