Distributed Systems Series — Part 5.2: Scalability & Performance
Why Latency at Scale Is a Different Problem
Post 5.1 established what scalability means and identified Amdahl’s Law as the mathematical ceiling on parallelism. This post addresses the latency dimension of scalability — specifically why latency behaviour at scale is fundamentally different from latency at low traffic, and why the metric that matters most is not the one that appears most prominently in dashboards.
At low traffic, the average latency of a service is a reasonable proxy for user experience. Most requests complete in roughly the same time. The variance is small. Optimising average latency improves the experience for most users most of the time.
At scale, average latency is actively misleading. A service with 5ms average latency and 500ms p99 latency is failing 1 in 100 of its users with a half-second response. That 1% does not appear in the average. It does not appear in the median. It is visible only at the high percentiles — p95, p99, p99.9 — and it is at the high percentiles that user experience is determined and SLO budgets are consumed.
The reason high percentiles matter more at scale is fan-out amplification: the phenomenon where a small tail probability at one component becomes a large probability of slow end-to-end response when requests touch many components. This is the central insight of Google’s “The Tail at Scale” paper, and it is the reason every production system at significant scale instruments and optimises p99 rather than average latency.
Fan-Out Amplification: The Mathematics of Tail Latency
The mathematics are straightforward but the implications are non-obvious until you work through them.
Suppose a single service component has a p99 latency of 100ms — meaning 1% of requests to that component take 100ms or more. For a request that touches only this one component, there is a 1% probability of a slow response. This is the baseline.
Now suppose a request touches 10 independent components in parallel — a fan-out of 10. The end-to-end latency is determined by the slowest component to respond, because the request cannot complete until all parallel calls return. The probability that at least one of the 10 components takes 100ms or more is:
P(at least one slow) = 1 – P(all fast) = 1 – (0.99)^10 ≈ 9.6%
A 1% tail probability at each component produces approximately a 10% probability of a slow end-to-end response. The tail has amplified by a factor of 10.
This amplification grows rapidly with fan-out. At a fan-out of 50 components, the probability rises to approximately 40%. At a fan-out of 100 components — not unusual in a large microservices system where a user-facing request touches dozens of services, each making their own downstream calls — the probability approaches 63%. A 1% slow tail at each component produces a 63% probability of a slow end-to-end response.
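The amplification numbers above can be reproduced with a few lines of Python — a minimal sketch of the formula, not tied to any particular system:

```python
def p_slow_end_to_end(p_component: float, fan_out: int) -> float:
    """Probability that at least one of `fan_out` parallel calls is slow,
    given each call is independently slow with probability p_component."""
    return 1 - (1 - p_component) ** fan_out

# A 1% per-component tail amplifies with fan-out:
for n in (1, 10, 50, 100):
    print(f"fan-out {n:3d}: {p_slow_end_to_end(0.01, n):.1%}")
# fan-out   1: 1.0%
# fan-out  10: 9.6%
# fan-out  50: 39.5%
# fan-out 100: 63.4%
```

The independence assumption matters: correlated slowness (a shared database, a shared host) makes the picture worse in ways this simple model does not capture.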
This is not a failure of any individual component. Each component is meeting its 99th percentile SLO. The system-level latency behaviour is a structural consequence of fan-out architecture combined with individual component tail latency. Fixing it requires either reducing fan-out depth, reducing individual component tail latency, or applying specific tail latency mitigation techniques.
The implication for SLO design: a system that fans out to N components cannot meet a p99 end-to-end latency SLO without each component meeting a significantly better percentile. If the end-to-end p99 target is 200ms across a fan-out of 10 components, each component needs p99.9 (not p99) latency better than 200ms — because the p99 end-to-end is approximately the p99.9 per-component when N=10.
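Inverting the same formula gives the per-component requirement directly. A sketch:

```python
def required_component_tail(p_end_to_end: float, fan_out: int) -> float:
    """Per-component slow probability needed so that the end-to-end slow
    probability stays at or below p_end_to_end across `fan_out`
    independent parallel calls: solve 1 - (1-p)^N = P for p."""
    return 1 - (1 - p_end_to_end) ** (1 / fan_out)

# End-to-end p99 target (1% slow) across a fan-out of 10:
p = required_component_tail(0.01, 10)
print(f"per-component slow probability: {p:.4%}")  # ~0.1% -> roughly p99.9
```

The result for N=10 lands at approximately 0.1%, which is exactly the "per-component p99.9" claim above.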
The Sources of Tail Latency
Understanding why tail latency exists — the root causes that produce the slow 1% — is necessary for addressing it. The sources are well-documented in production systems and fall into several categories.
Garbage collection pauses. JVM-based services (Java, Scala, Kotlin) experience stop-the-world GC pauses during which all threads halt and the GC collects unreachable objects. A GC pause of 200ms adds 200ms to every request that is in flight during the pause. These pauses are not predictable in timing — they occur when the GC determines that collection is necessary, which may be during a peak traffic window. Go’s GC is concurrent and produces much shorter pause times, but is not entirely pause-free. GC tuning — heap sizing, GC algorithm selection, pause time targets — directly affects tail latency.
Noisy neighbours. In cloud environments where multiple tenants share physical hardware, a workload running on the same physical host can saturate shared resources — CPU caches, memory bandwidth, network interface, disk I/O — and degrade the performance of co-located workloads. A virtual machine that experiences noisy neighbour interference may see latency spikes for the duration of the interference. This is non-deterministic and difficult to predict, which is why tail latency in cloud environments is inherently higher and more variable than in dedicated hardware environments.
Disk flush stalls. Services that write to local disk — logs, write-ahead logs, checkpoints — experience latency stalls when the operating system’s page cache flushes dirty pages to disk. This can block the writing thread for tens to hundreds of milliseconds during the flush. The flush timing is determined by OS kernel parameters and disk I/O queue depth, not by the application. Services that are sensitive to tail latency must either avoid synchronous disk writes on the critical path or use asynchronous write-ahead logging with explicit flush control.
Network jitter. Network packet loss triggers TCP retransmission, which adds one retransmission timeout (RTO) to the affected connection — typically 200ms to 1 second in default TCP configurations. A single dropped packet on a connection adds hundreds of milliseconds of latency to that connection. In high-throughput systems where many connections are active simultaneously, the probability of at least one connection experiencing retransmission at any given moment is non-trivial. Kernel parameter tuning — specifically the TCP initial RTO and the minimum RTO — reduces retransmission latency for intra-datacenter connections where packet loss is rare.
Lock contention. Services that use mutexes or other synchronisation primitives experience tail latency when high-priority requests arrive while the mutex is held by a low-priority operation. The high-priority request waits for the mutex holder to release, adding the holder’s remaining execution time to the high-priority request’s latency. This is the distributed analogue of the priority inversion problem. Reducing lock contention — through lock-free data structures, shorter critical sections, or per-shard locking instead of global locking — directly reduces tail latency caused by contention.
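One of the mitigations named above — per-shard locking instead of a single global lock — can be sketched as follows. This is an illustrative Python example (the `ShardedCounter` class and shard count are invented for this post, not from any library); the same structure applies to any shared mutable state:

```python
import threading

class ShardedCounter:
    """Counts sharded across N locks so that writers to different keys
    rarely contend on the same mutex, shortening lock wait times."""
    def __init__(self, num_shards: int = 16):
        self._locks = [threading.Lock() for _ in range(num_shards)]
        self._counts = [dict() for _ in range(num_shards)]

    def _shard(self, key: str) -> int:
        return hash(key) % len(self._locks)

    def incr(self, key: str) -> None:
        i = self._shard(key)
        with self._locks[i]:   # critical section is short and per-shard
            self._counts[i][key] = self._counts[i].get(key, 0) + 1

    def get(self, key: str) -> int:
        i = self._shard(key)
        with self._locks[i]:
            return self._counts[i].get(key, 0)
```

A request blocked behind a lock now waits only for operations on its own shard, which bounds the contention-induced tail far more tightly than a global lock does.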
Background tasks and maintenance operations. Many systems run background tasks — cache eviction, index compaction, log rotation, metrics aggregation — that compete with foreground request processing for CPU and I/O. When a background task saturates the disk I/O queue, foreground requests that require disk access experience increased latency. Scheduling background tasks during low-traffic periods and applying I/O priority to ensure foreground tasks preempt background tasks reduces tail latency caused by resource competition.
Hedged Requests: The Standard Tail Latency Mitigation
Google’s “The Tail at Scale” paper introduced two techniques that have become the standard approach to tail latency mitigation in high-scale systems: hedged requests and tied requests.
Hedged requests send the same request to two (or more) replicas simultaneously and use whichever response arrives first, cancelling the outstanding request to the slower replica. This cuts the tail latency approximately in half — instead of waiting for the single replica to respond, the system uses the faster of two replicas. The cost is doubled read load on the downstream service, which is acceptable when tail latency improvement is worth the additional load.
The implementation detail that makes hedged requests practical: do not send the second request immediately. Send the first request, and only issue the second request if the first has not responded within a threshold — typically the p95 response time. This captures most of the tail latency benefit while limiting the additional load to the requests that are actually slow (approximately 5% of requests), rather than doubling load on all requests.
Tied requests extend hedged requests with cancellation coordination. When the second request is sent to the backup replica, a cancellation token is piggybacked on both requests. When either replica begins processing the request, it sends a cancellation to the other. This prevents the slower replica from wasting resources completing work that will be discarded. In systems where request processing is expensive, tied requests significantly reduce the wasted work compared to simple hedging.
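A minimal asyncio sketch of the hedged-request pattern with the delayed second request. The cancellation here is client-side only — the tied-request piggybacked cancellation between replicas is not modelled — and `call_replica` plus the replica representation are placeholders, not a real client API:

```python
import asyncio

async def hedged_request(call_replica, replicas, hedge_delay: float):
    """Send the request to the first replica; if it has not responded
    within hedge_delay (typically the p95 latency), send a backup
    request to a second replica and use whichever finishes first."""
    first = asyncio.create_task(call_replica(replicas[0]))
    try:
        # shield() keeps `first` running even if the timeout fires
        return await asyncio.wait_for(asyncio.shield(first), timeout=hedge_delay)
    except asyncio.TimeoutError:
        backup = asyncio.create_task(call_replica(replicas[1]))
        done, pending = await asyncio.wait(
            {first, backup}, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()          # abandon the slower replica's request
        return done.pop().result()

async def demo():
    async def call(replica):
        await asyncio.sleep(replica["latency"])
        return replica["name"]
    replicas = [{"name": "slow", "latency": 0.5},
                {"name": "fast", "latency": 0.01}]
    return await hedged_request(call, replicas, hedge_delay=0.05)

print(asyncio.run(demo()))  # prints "fast" — the hedged backup wins
```

Note that only requests slower than the hedge delay ever trigger a second call, which is what keeps the extra load near 5% rather than 100%.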
Google applies hedged requests broadly across their internal infrastructure — BigTable reads, GFS chunk reads, Colossus file reads. The pattern is so effective that it has become standard in high-scale read-heavy systems. It is less applicable to writes (where both replicas completing the write would require deduplication) but hedged reads are implementable in almost any read-heavy system with replicated data.
Latency Percentiles: What to Instrument and Why
The four latency percentiles that matter in production are p50, p95, p99, and p99.9. Each answers a different question and serves a different purpose.
p50 (median) is the latency experienced by the middle of the distribution — half of requests are faster, half are slower. It is useful for understanding typical performance and for detecting broad regressions that shift the entire distribution. It is not useful for understanding tail behaviour and should not be the primary latency alert metric.
p95 is the latency experienced by 95% of requests — 5% of requests are slower. It is the first meaningful tail indicator and is useful for detecting performance regressions before they reach the extreme tail. Circuit breaker thresholds and hedged request thresholds are often calibrated against p95.
p99 is the latency experienced by 99% of requests — 1% of requests are slower. This is the standard production SLO metric for most user-facing services. It captures the tail behaviour that most users will never experience directly but that significantly affects user experience in fan-out architectures. Most production alerting on latency should be based on p99.
p99.9 is the latency experienced by 99.9% of requests — 0.1% of requests are slower. For high-traffic services handling millions of requests per minute, 0.1% represents thousands of slow requests per minute. p99.9 is the right metric for high-scale systems where even rare tail events affect large numbers of users, and for fan-out architectures where p99.9 per-component is required to meet p99 end-to-end SLOs.
A critical implementation note: latency percentiles must be calculated from histograms, not from averages of averages. Averaging p99 values across instances or time windows produces a number that is neither mathematically correct nor meaningful as a latency percentile. The correct approach is to aggregate histogram buckets across instances and time windows and calculate percentiles from the aggregated histogram. Prometheus histograms and the `histogram_quantile` function implement this correctly. Statsd-style timers that emit pre-calculated percentiles cannot be correctly aggregated.
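The bucket-merge approach can be sketched in a few lines. This is a simplified version that returns the matching bucket's upper edge rather than interpolating within the bucket as Prometheus's `histogram_quantile` does; the bucket edges and counts are made-up example values:

```python
def percentile_from_histogram(bucket_bounds, bucket_counts, q):
    """Estimate the q-quantile (0 < q < 1) from histogram data:
    bucket_counts[i] is the number of requests with latency in
    (previous edge, bucket_bounds[i]].  Returns the upper edge of the
    bucket containing the q-quantile."""
    target = q * sum(bucket_counts)
    cumulative = 0
    for bound, count in zip(bucket_bounds, bucket_counts):
        cumulative += count
        if cumulative >= target:
            return bound
    return bucket_bounds[-1]

# Merge bucket counts from two instances, then take the percentile:
bounds = [10, 50, 100, 250, 500, 1000]    # bucket upper edges, ms
instance_a = [800, 150, 30, 15, 4, 1]     # 1000 requests
instance_b = [100, 50, 30, 15, 4, 1]      # 200 requests
merged = [a + b for a, b in zip(instance_a, instance_b)]
print(percentile_from_histogram(bounds, merged, 0.99))  # → 250
```

With these example numbers, the per-instance p99 buckets are 250ms for instance A and 500ms for instance B; averaging those gives 375ms, while the correctly aggregated p99 is 250ms — a concrete illustration of why averaged percentiles are meaningless.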
Latency SLO Design for Scaled Systems
An SLO for latency is not just a threshold — it is a contract that must be achievable under the fan-out conditions the system operates in. Designing a latency SLO without accounting for fan-out produces an SLO that cannot be met without addressing the underlying tail latency amplification.
The correct SLO design process: define the end-to-end latency target that represents acceptable user experience. Decompose the request path into its constituent fan-out components — which services are called in parallel, what the expected fan-out depth is. Calculate the per-component latency target required to meet the end-to-end SLO with the observed fan-out. Use the formula P(slow end-to-end) = 1 – (1-p)^N where p is the per-component slow probability and N is the fan-out depth to determine what per-component percentile is required.
As an example: an end-to-end p99 latency target of 200ms across a fan-out of 10 components requires that each component has at most a 0.1% probability of exceeding 200ms — which means each component must meet a p99.9 latency of 200ms, not merely a p99. Specifying a per-component p99 SLO of 200ms and expecting the end-to-end p99 to also be 200ms is a mathematical impossibility with a fan-out of 10.
The SRE error budget framework from Post 4.2 applies directly to latency SLOs. A latency SLO violation — more than 1% of requests exceeding the p99 threshold — consumes error budget. SLO burn rate alerts fire when latency violations are consuming the budget at an unsustainable rate. This connects latency management directly to availability management: latency failures are availability failures in disguise, because a request that takes 10 seconds is functionally unavailable to the user waiting for it.
Latency Budgets: Allocating Time Across the Request Path
A latency budget is the allocation of end-to-end latency across the components in a request path — how much time each component is allowed to consume. Without an explicit latency budget, individual teams optimise their component’s latency in isolation while the end-to-end latency grows because no one owns the sum.
The latency budget is set at the end-to-end target and divided among components based on their contribution to the request. For an end-to-end target of 200ms, a typical allocation might be: 10ms for DNS resolution and connection establishment, 20ms for the API gateway, 50ms for the primary service business logic, 80ms for database queries, 20ms for cache lookups, 20ms for serialisation and network overhead. Each component team is then responsible for meeting their allocation.
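The allocation above can be made explicit and machine-checked. A sketch — the component names follow the example allocation, and the `over_budget` helper is illustrative, not from any tracing library:

```python
# The example allocation from the text, in ms:
END_TO_END_TARGET_MS = 200
budget_ms = {
    "dns_and_connect": 10,
    "api_gateway": 20,
    "business_logic": 50,
    "database": 80,
    "cache": 20,
    "serialisation_and_network": 20,
}
# The per-component allocations must sum to the end-to-end target:
assert sum(budget_ms.values()) == END_TO_END_TARGET_MS

def over_budget(span_times_ms: dict) -> dict:
    """Given measured per-component times from a trace, return the
    components that exceeded their allocation and by how much (ms)."""
    return {name: t - budget_ms[name]
            for name, t in span_times_ms.items()
            if t > budget_ms.get(name, 0)}

# A trace where the database blew its allocation:
print(over_budget({"database": 130, "cache": 15}))  # {'database': 50}
```

In practice this check runs against trace span durations, turning SLO violations into an immediate answer to "which team's allocation was exceeded".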
Distributed tracing from Post 4.8 is the instrumentation that makes latency budget tracking operational. Each span in a trace records the time consumed by one component. The trace timeline shows, for a specific request, which component consumed how much of the latency budget. When the end-to-end SLO is violated, the trace immediately shows which component exceeded its allocation — focusing investigation on the right place rather than requiring guesswork.
Latency vs Throughput: The Fundamental Trade-off
Latency and throughput trade off against each other under load — pushing utilisation higher to increase throughput drives queuing delay, and therefore latency, up. This is one of the most important and most frequently misunderstood relationships in distributed systems performance.
At low utilisation, a system can respond quickly to each request because resources are available immediately. As utilisation increases, requests must wait for resources that are occupied serving previous requests. Queuing theory (specifically Little’s Law and the M/M/1 queue model) formalises this: as utilisation approaches 100%, average latency approaches infinity. The relationship is not linear — latency increases slowly at moderate utilisation and explosively at high utilisation.
The practical implication: a system cannot simultaneously maximise throughput and minimise latency. Operating at 90% utilisation maximises resource efficiency but produces high and variable latency. Operating at 50% utilisation wastes half the provisioned capacity but provides consistent low latency. The operating point — the utilisation level the system is designed to sustain — is a product decision that engineering must make explicit: how much latency headroom is required, and what is the acceptable utilisation ceiling?
Most production systems targeting consistent low latency operate at 60-70% utilisation. Systems that can tolerate higher latency variance (batch processing, background jobs) can operate at 80-90%. Systems requiring ultra-consistent low latency (financial trading, real-time gaming) may target 40-50% utilisation. The unused capacity is not waste — it is the latency headroom budget.
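The utilisation-latency relationship can be made concrete with the M/M/1 mean time-in-system formula, W = S / (1 − ρ), where S is the mean service time and ρ the utilisation. A sketch (real systems are not M/M/1, so treat this as shape, not prediction):

```python
def mm1_latency(service_time_ms: float, utilisation: float) -> float:
    """Mean time in system for an M/M/1 queue: W = S / (1 - rho).
    Latency grows explosively as utilisation approaches 1."""
    if not 0 <= utilisation < 1:
        raise ValueError("utilisation must be in [0, 1)")
    return service_time_ms / (1 - utilisation)

# For a 10ms service time:
for rho in (0.5, 0.7, 0.9, 0.95, 0.99):
    print(f"utilisation {rho:.0%}: mean latency {mm1_latency(10, rho):.1f} ms")
# 20.0, 33.3, 100.0, 200.0, 1000.0 ms respectively
```

Going from 50% to 90% utilisation multiplies mean latency by 5; going from 90% to 99% multiplies it by another 10. That curve is the quantitative case for the 60-70% operating point.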
Load shedding from Post 4.6 is the mechanism that enforces the utilisation ceiling: when the system approaches its target utilisation threshold, excess requests are rejected rather than queued, preventing the latency explosion that occurs when queues grow without bound. Without load shedding, a burst of traffic above the design capacity produces latency degradation for all requests in the system. With load shedding, the excess is rejected at the edge and the requests that are admitted continue to receive acceptable latency.
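One common enforcement mechanism is a concurrency limit at admission. A sketch — the `LoadShedder` class and the limit value are illustrative, not from any particular library:

```python
import threading

class LoadShedder:
    """Reject requests once in-flight concurrency exceeds the ceiling,
    instead of queuing them and letting latency grow without bound."""
    def __init__(self, max_in_flight: int):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self._lock:
            if self._in_flight >= self._max:
                return False   # shed: caller should return 503 immediately
            self._in_flight += 1
            return True

    def release(self) -> None:
        with self._lock:
            self._in_flight -= 1
```

The ceiling is typically derived from the utilisation target via Little's Law (in-flight requests ≈ arrival rate × latency), so that admitted requests stay on the flat part of the queuing curve.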
Key Takeaways
- Average latency is misleading at scale — tail latency (p99, p99.9) determines user experience and SLO compliance; a 5ms average with a 500ms p99 means 1 in 100 users waits 500ms, which is unacceptable at any scale
- Fan-out amplification converts small component tail probabilities into large end-to-end slow probabilities — at a fan-out of 100 components each with 1% slow probability, approximately 63% of end-to-end requests experience at least one slow component
- The sources of tail latency are specific and addressable — GC pauses, noisy neighbours, disk flush stalls, network jitter, lock contention, and background task interference each require different mitigation strategies
- Hedged requests are the standard tail latency mitigation — send the request to two replicas after a p95 delay threshold and use whichever responds first; this halves tail latency at the cost of ~5% additional read load on the downstream service
- Latency percentiles must be calculated from histograms, not from averages of averages — aggregating pre-calculated percentiles across instances or time windows produces mathematically incorrect results that cannot be used as SLO compliance metrics
- SLO design must account for fan-out — an end-to-end p99 SLO requires per-component p99.9 or better when fan-out depth is 10; specifying p99 per component and expecting p99 end-to-end is mathematically impossible with non-trivial fan-out
- Latency and throughput trade off under load — consistent low latency requires operating below the utilisation ceiling where queuing latency becomes significant; the unused capacity is latency headroom, not waste
Frequently Asked Questions (FAQ)
What is tail latency in distributed systems?
Tail latency is the latency at the high percentiles of the request latency distribution — p95, p99, p99.9 — as opposed to the median or average. In distributed systems, tail latency matters more than average latency because of fan-out amplification: when a request touches N components in parallel, the end-to-end latency is determined by the slowest component to respond. A 1% slow tail at each of 10 components produces approximately a 10% probability of a slow end-to-end response. At 100 components, the probability approaches 63%. The tail dominates user experience in any system with significant fan-out.
What are hedged requests and when should I use them?
Hedged requests send the same read request to two replicas simultaneously and use whichever responds first, cancelling the outstanding request to the slower replica. They reduce tail latency by approximately half at the cost of doubled read load on the downstream service. The practical implementation sends the second request only after a threshold delay — typically p95 response time — so that only the slowest ~5% of requests trigger the hedge. Hedged requests are appropriate for read-heavy paths in replicated systems where tail latency significantly impacts user experience. They are not directly applicable to writes without additional deduplication logic.
Why should I use p99 instead of average latency for alerting?
Average latency hides tail behaviour. A service with 5ms average latency and 500ms p99 latency has 1 in 100 requests taking 500ms — but the average of 4ms fast requests and one 500ms request is still approximately 9ms, which appears healthy in average-based monitoring. p99 latency directly measures the experience of the worst 1% of users, which in a fan-out architecture becomes the experience of a much larger fraction of end-to-end requests. SLO-based alerting on p99 fires when the tail is actually degrading, not when the average is slightly elevated by a few extreme outliers.
How do I calculate latency percentiles correctly across multiple instances?
Latency percentiles must be calculated from histogram bucket counts, not from averaging pre-calculated percentiles. If Instance A reports p99=100ms and Instance B reports p99=200ms, the aggregate p99 is not 150ms — it depends on the relative request volumes and the full distribution shape. The correct approach is to collect histogram bucket counts from all instances (how many requests fell in each latency bucket), sum the bucket counts across instances, and calculate percentiles from the aggregated histogram. Prometheus histograms with `histogram_quantile` implement this correctly. Tools that report percentiles as pre-calculated numbers cannot be correctly aggregated without the underlying histogram data.
What is a latency budget and how do I use one?
A latency budget allocates the end-to-end latency target across the components in a request path — defining how much time each component is permitted to consume. For a 200ms end-to-end target, the budget might allocate 50ms to the primary service, 80ms to the database, 20ms to the cache, and so on. Each component team owns their allocation. Distributed tracing makes the budget operational — each trace span shows which component consumed how much time, immediately identifying which component exceeded its allocation when the end-to-end SLO is violated. Without explicit latency budgets, teams optimise their components in isolation while no one owns the end-to-end sum.
What is the relationship between latency and throughput?
Latency and throughput trade off under load through queuing effects. At low utilisation, resources are available immediately and requests complete quickly. As utilisation increases toward 100%, requests must wait for occupied resources and average latency rises — slowly at moderate utilisation and explosively as utilisation approaches the capacity ceiling. This relationship means a system cannot simultaneously maximise throughput (high utilisation) and minimise latency (low queue depth). Most latency-sensitive production systems target 60-70% utilisation — accepting resource “inefficiency” to maintain the headroom that prevents latency spikes under load. Load shedding enforces this ceiling by rejecting excess traffic before it causes queuing-induced latency degradation.
Continue the Series
Series home: Distributed Systems — Concepts, Design & Real-World Engineering
Part 5 — Scalability & Performance
- 5.1 — What Scalability Really Means in Distributed Systems
- 5.2 — Latency and Tail Latency at Scale
- 5.3 — Partitioning and Sharding in Distributed Systems
- 5.4 — Load Balancing Strategies in Distributed Systems
- 5.5 — Caching Trade-offs in Distributed Systems
Previous: ← 5.1 — What Scalability Really Means in Distributed Systems
Next: 5.3 — Partitioning and Sharding in Distributed Systems →
Related posts from earlier in the series:
- 3.8 — Performance Trade-offs in Replicated Systems — Tail latency in the replication context, hedged requests introduction
- 4.2 — Fault Tolerance vs High Availability — SLO error budget framework that latency SLOs feed into
- 4.6 — Designing for High Availability — Load shedding that enforces the utilisation ceiling
- 4.8 — Observability — Distributed tracing that makes latency budgets operational
- 4.7 — Fault Isolation and Bulkheads — Isolating slow components to prevent tail latency propagation