Distributed Systems Series — Part 4.6: Fault Tolerance & High Availability
High Availability Is a System-Level Property
The availability nines defined in Post 4.2 — 99.9%, 99.99%, 99.999% — are measurements of an outcome. This post is about the architecture that produces that outcome.
High availability cannot be achieved by adding redundancy to one layer of a system while leaving other layers as single points of failure. A system with three compute replicas behind a single load balancer is as available as that load balancer — when it fails, all three replicas become unreachable. A system with redundant compute and load balancers but a single database primary that takes ten minutes to fail over cannot achieve 99.99% availability no matter how many application servers it runs. A system that handles compute failures automatically but has a single-node coordination service is one etcd failure away from the entire control plane becoming unavailable.
High availability is a system-level property that emerges from redundancy and coordination at every layer simultaneously. The diagram above shows the correct mental model: five layers, each independently redundant, each failing independently. A single non-redundant component at any layer collapses the availability guarantee of the entire stack above it.
This post covers the architectural patterns that produce high availability in production — load shedding, graceful degradation, health-check-driven routing, multi-region deployment, and control plane HA — and closes with a design review checklist engineers can apply immediately.
Failure Domains: The Foundation of HA Architecture
A failure domain is a set of components that can fail together due to a shared dependency — a shared power circuit, a shared network switch, a shared availability zone, a shared software version. High availability requires spreading components across failure domains that are independent — components that cannot fail for the same reason at the same time.
As established in Post 4.3, AWS availability zones are physically separate datacenters within a region, connected by high-bandwidth low-latency links. Two instances in different AZs share the same regional network infrastructure and regional control plane — they are not fully independent failure domains for regional-level events. Two instances in different regions share only the global internet — they are genuinely independent failure domains for most failure scenarios.
The practical implication: for 99.9% availability, multi-AZ deployment within a single region is typically sufficient. For 99.99% availability, multi-AZ is necessary but may not be sufficient for all failure scenarios — regional events can affect all AZs simultaneously. For 99.999% availability, multi-region deployment with automatic failover is required, combined with the elimination of all single points of failure at every layer.
Load Shedding: Surviving Overload Without Collapsing
Load shedding is the most important high availability pattern that engineers consistently overlook. It is the mechanism that prevents an overloaded system from collapsing entirely — by deliberately dropping excess traffic at the edges of the system to protect core functionality for the traffic it can actually serve.
Without load shedding, an overloaded system behaves catastrophically. Incoming requests queue faster than they are processed. Queue depth grows. Memory pressure increases. Response latency climbs. Retries amplify the load. The system spends more time managing queue overhead than processing requests. Throughput collapses. The system becomes unavailable not because of hardware failure but because it attempted to serve more traffic than it could handle, and the attempt degraded it below the threshold where any requests could succeed.
With load shedding, the system measures its current capacity — through CPU utilisation, active request count, queue depth, or response latency — and begins rejecting new requests when capacity is approached. Rejected requests receive a 429 (Too Many Requests) or 503 (Service Unavailable) response immediately, without consuming processing resources. The existing requests in flight continue to be served. The system degrades to a reduced throughput level rather than collapsing entirely.
Google’s production systems implement load shedding through a mechanism called adaptive throttling, described in the Google SRE book. Each client tracks its own request success rate and voluntarily reduces its request rate when the success rate drops — a form of client-side load shedding that distributes the throttling decision. Server-side load shedding is implemented in gRPC through its built-in flow control mechanism — servers that are overloaded signal backpressure to clients through connection-level flow control, causing clients to slow their request rate automatically.
The implementation decision: load shedding should be based on a meaningful capacity signal, not arbitrary rate limits. CPU utilisation above 80% is a reasonable trigger for stateless compute. Queue depth exceeding N×processing-time-target is a reasonable trigger for asynchronous processors. Response latency p99 exceeding SLO is a reasonable trigger for latency-sensitive services. The shedding threshold should be set conservatively — it is better to start shedding at 75% capacity and maintain good performance for the traffic that gets through than to start at 95% and collapse before shedding begins.
Graceful Degradation: Serving Less Rather Than Nothing
Graceful degradation is the architectural pattern where a system continues serving a reduced but correct subset of its functionality when some dependencies are unavailable — rather than failing completely because a non-critical dependency has failed.
Netflix’s design philosophy is the canonical production example. Netflix’s homepage is assembled from dozens of microservices — recommendations, trending content, personalised rows, account information, recently watched. When one of these services is unavailable, a naive implementation would fail the entire homepage request. Netflix instead degrades gracefully: the homepage renders with whatever services are available, showing cached or default content where a failed service would have contributed. Users see a slightly less personalised homepage rather than an error page.
Graceful degradation requires explicit design work — it does not happen automatically. Each feature must be categorised as critical (the system cannot function without it) or non-critical (the system can function with reduced capability without it). For non-critical dependencies, the system must have a defined degraded behaviour — a cached response, a default value, an empty state, or simply the omission of that feature from the response.
The implementation pattern: wrap every non-critical dependency call in a circuit breaker with a defined fallback. The fallback is the degraded behaviour — a cached response from the last successful call, a static default, or an empty result. When the circuit opens, the fallback activates automatically. Users experience reduced functionality rather than a failure. The system’s core functionality — the critical path — continues without interruption.
Stripe’s payment infrastructure implements a specific form of graceful degradation called read-your-writes degradation. When Stripe’s primary database is unavailable during a regional failover, Stripe continues accepting payment requests against a replica — accepting that some very recent writes may not be visible in reads, but ensuring payment acceptance continues without interruption. For a payment processor, accepting a payment against a slightly stale replica is safer than refusing payment acceptance entirely.
Health-Check-Driven Routing: Removing Degraded Instances Before Users Notice
Health-check-driven routing removes degraded instances from the load balancer pool before clients experience failures — not after. This is the distinction between reactive failure handling (detecting failures from client errors) and proactive failure handling (detecting degraded instances from health signals and routing around them before they fail client requests).
As established in Post 1.4 and Post 4.1, a node can be alive but failing — passing a basic TCP health check while serving incorrect results for application requests. This is the gray failure scenario. Routing traffic to a gray-failing node because it passes liveness checks while failing application-level correctness is one of the most common HA anti-patterns in production.
Two levels of health checking are required for correct routing decisions. Liveness checks determine whether the process is running — a TCP connection can be established, or an HTTP endpoint returns any response. A node that fails liveness is dead and should be immediately removed from the pool and restarted. Readiness checks determine whether the process can serve traffic correctly — the database connection pool is healthy, the cache is warm, the downstream dependencies are reachable. A node that fails readiness is alive but not ready to serve traffic — it should be removed from the load balancer pool without being restarted.
Kubernetes implements this distinction through liveness and readiness probes. A pod failing liveness is restarted. A pod failing readiness is removed from the Service endpoints (and therefore from all load balancer routing) but is not restarted — it remains running, and when it recovers and passes readiness again, it is re-added to the pool automatically. This distinction prevents the common failure mode where a pod is repeatedly killed and restarted because it is misconfigured or its dependency is unhealthy — when the correct response is to stop routing to it and let it recover.
Application-level readiness checks should exercise the actual dependencies that are required to serve traffic. A database-backed service’s readiness check should verify that the database connection pool is healthy and that a test query succeeds. An external-API-backed service should verify that the external API is reachable. A cache-dependent service should verify that the cache is connected. Readiness checks that only verify process liveness provide no protection against gray failures.
Multi-Region High Availability
Multi-region HA is the architecture required for the highest availability tiers — 99.99% and above — and for systems where a regional cloud provider outage would be an unacceptable business risk.
Three deployment models exist for multi-region HA, each with different trade-offs. Active-passive multi-region runs the full stack in one region (active) and a warm standby in a second region (passive). Traffic is routed to the active region under normal conditions and fails over to the passive region on regional failure. The passive region’s database is kept synchronised through asynchronous or semi-synchronous replication. Failover is typically triggered by a combination of health checks and manual decision — the potential for data loss during failover (if replication was asynchronous) means automatic failover may not be appropriate for all systems.
Active-active multi-region runs the full stack in multiple regions simultaneously, with traffic distributed across all active regions by GeoDNS or a global load balancer. Each region serves a portion of normal traffic. When one region fails, its traffic is absorbed by the remaining regions. Write conflicts must be handled — as covered in Post 4.3, this requires either last-write-wins (dangerous), CRDTs (limited applicability), or application-level merge logic. This is the pattern used by Netflix, Cloudflare, and most large-scale consumer applications.
Regional isolation with global coordination assigns each user’s data to a specific region — a European user’s data is owned by the EU region, an American user’s data by the US region. Each region operates independently for its data. A global coordination layer (a small consensus cluster) handles cross-region metadata — which region owns which user’s data, which regions are currently healthy. This is Shopify’s architecture for merchant data — each merchant is assigned to a primary region and all their data operations are handled within that region with no cross-region write coordination required during normal operation.
Control Plane High Availability: The Most Overlooked Layer
The control plane — the coordination service, service discovery system, and configuration management layer — is the most commonly overlooked HA requirement. Engineers invest heavily in making the data plane highly available (redundant compute, replicated storage, multi-region deployment) while leaving the control plane as a single point of failure.
A control plane failure does not immediately take down a running system — services that are already running and have already discovered each other continue to function. But a control plane failure prevents the system from adapting to failures. New service instances cannot register. Failed instances cannot be removed from routing. Configuration changes cannot be applied. Leader elections cannot complete. The system is frozen in its last known state and becomes progressively more degraded as failures occur that it cannot respond to.
etcd — Kubernetes’ control plane datastore — must be deployed as a minimum three-node cluster across three availability zones. A single-node etcd is not an acceptable production configuration at any availability level. A three-node etcd cluster across three AZs tolerates the complete loss of one AZ without losing control plane availability. A five-node etcd cluster tolerates two simultaneous AZ failures — appropriate for systems requiring 99.99% or above.
ZooKeeper, Consul, and all coordination services have the same requirement: the control plane must be deployed with the same failure domain discipline as the data plane — minimum three nodes, spread across independent failure domains, with quorum-based leader election that does not require human intervention to complete.
HA Design Review Checklist
Use this checklist when reviewing the availability design of any distributed system component.
Failure domain analysis
- Are all components deployed across at least two independent failure domains (AZs or regions)?
- Does any single AZ or region failure take down the system entirely?
- Are software deployments staggered across failure domains to prevent correlated software failures?
- Is there a correlated failure scenario that defeats the redundancy design?
Load balancing and routing
- Is the load balancer itself redundant — does it have its own HA configuration?
- Are health checks configured to reflect actual readiness, not just process liveness?
- Is there a mechanism to remove degraded instances from routing before clients experience failures?
- Is GeoDNS or global load balancing configured for multi-region routing and automatic failover?
Compute layer
- Is the minimum replica count sufficient to absorb the loss of one failure domain while remaining within latency SLOs?
- Are pod anti-affinity rules configured to prevent all replicas from being scheduled in the same AZ?
- Is load shedding implemented to prevent overload collapse during traffic spikes?
- Are non-critical dependencies wrapped in circuit breakers with defined fallback behaviours?
Storage layer
- Is the database HA configuration synchronous or asynchronous — and is the RPO implication of asynchronous replication explicitly accepted?
- Is automated database failover tested in production, not just in staging?
- Is the failover time within the RTO target?
- Are read replicas available to absorb read traffic during primary failover?
Control plane
- Is the coordination service (etcd, ZooKeeper, Consul) deployed with a minimum of three nodes across three failure domains?
- Does a control plane failure prevent normal data plane operations from continuing?
- Is control plane health monitored and alerted on separately from data plane health?
- Is control plane failover automatic or does it require human intervention?
Operational validation
- Has each failure scenario been tested in production or a production-equivalent environment?
- Is the failover procedure documented and practiced — not just designed?
- Are availability metrics (uptime, error rate, latency p99) measured and alerted on continuously?
- Is there a defined error budget and a process for halting feature deployments when it is exhausted?
Key Takeaways
- High availability is a system-level property — redundancy at one layer is negated by a single point of failure at any other layer; the availability guarantee of the entire stack is bounded by its least redundant component
- Load shedding prevents overload collapse — by deliberately dropping excess traffic at the system edge when capacity is approached, the system degrades to reduced throughput rather than collapsing entirely under load
- Graceful degradation serves less rather than nothing — non-critical dependencies should have defined fallback behaviours so that their failure produces reduced functionality rather than a complete outage
- Health-check-driven routing requires readiness checks, not just liveness checks — routing to a node that passes liveness but fails readiness produces gray failures that are invisible to infrastructure monitoring but visible to users
- Multi-region HA requires choosing between active-passive (simpler, potential RPO during failover), active-active (zero failover gap, write conflict complexity), and regional isolation (independent regions, global metadata coordination)
- Control plane HA is the most overlooked layer — a single-node coordination service is a single point of failure that freezes the system’s ability to adapt to failures even when the data plane remains redundant
- HA design is not complete until it has been tested under production failure conditions — a design that has never been failed over has unknown actual availability, not the availability its architecture implies
Frequently Asked Questions (FAQ)
What is load shedding in distributed systems?
Load shedding is the practice of deliberately rejecting excess incoming requests when a system approaches its capacity limit, in order to protect existing in-flight requests and prevent total system collapse. Without load shedding, an overloaded system queues requests faster than it processes them, queue depth grows, latency increases, retries amplify load, and the system becomes completely unavailable. With load shedding, excess requests receive immediate rejection responses (429 or 503) while the system continues processing the traffic it can actually serve. Load shedding is triggered by capacity signals — CPU utilisation, queue depth, active request count, or p99 latency — rather than arbitrary rate limits.
What is graceful degradation in distributed systems?
Graceful degradation is the architectural pattern where a system continues serving reduced but correct functionality when some dependencies are unavailable, rather than failing completely. It requires categorising dependencies as critical (required for any functionality) or non-critical (required only for specific features), and defining a fallback behaviour for each non-critical dependency — a cached response, a default value, an empty state, or feature omission. Netflix’s homepage rendering, which continues with cached or default content when individual recommendation services are unavailable, is the canonical production example.
What is the difference between liveness and readiness checks?
A liveness check determines whether a process is running — whether a TCP connection can be established or an HTTP endpoint returns any response. A node failing liveness is dead and should be restarted. A readiness check determines whether a process is ready to serve traffic correctly — whether its database connection pool is healthy, its cache is connected, and its dependencies are reachable. A node failing readiness is alive but not ready to serve traffic — it should be removed from the load balancer pool without being restarted, and re-added automatically when it recovers. In Kubernetes, liveness probe failure triggers restart while readiness probe failure triggers pool removal. Routing to a node that passes liveness but fails readiness produces gray failures — the node is reachable but cannot serve requests correctly.
What is the difference between active-passive and active-active multi-region HA?
Active-passive multi-region runs the full stack in one region (active) with a warm standby in a second region (passive). Traffic normally goes to the active region and fails over to the passive on regional failure — introducing an RTO equal to the failover time and potentially a non-zero RPO if replication was asynchronous. Active-active multi-region runs the full stack in multiple regions simultaneously, with traffic distributed across all. When one region fails, surviving regions absorb its traffic immediately — RTO is effectively zero. The cost is write conflict complexity: all active regions accept writes, making concurrent writes to the same data possible, requiring explicit conflict resolution.
Why is control plane HA critical and often overlooked?
The control plane — etcd, ZooKeeper, Consul, service discovery, configuration management — manages the metadata that allows the data plane to adapt to failures. A control plane failure does not immediately take down running services, which is why it is often overlooked. But it prevents the system from responding to new failures: service instances cannot register, failed instances cannot be removed from routing, leader elections cannot complete, configuration changes cannot be applied. The system is frozen and becomes progressively more degraded as data plane failures occur that it cannot respond to. etcd must be deployed as a minimum three-node cluster across three AZs — a single-node etcd is not a production-acceptable configuration at any availability target.
How do you validate that a high availability design actually works?
HA design validation requires testing under actual failure conditions — not in design documents or staging environments, but in production or production-equivalent environments. Simulated instance failures verify that load balancers correctly remove failed instances and distribute traffic to survivors. AZ-level outage drills verify that cross-AZ redundancy works correctly under realistic conditions. Database failover tests verify that automated promotion completes within the RTO target and that no data is lost beyond the RPO target. Chaos engineering — continuous automated failure injection — ensures that failure handling remains correct as the system evolves. A design that has never been tested under failure has unknown actual availability regardless of its architecture.
Continue the Series
Series home: Distributed Systems — Concepts, Design & Real-World Engineering
Part 4 — Fault Tolerance & High Availability Overview
- 4.1 — Failure Taxonomy: How Distributed Systems Fail
- 4.2 — Fault Tolerance vs High Availability: Understanding the Difference
- 4.3 — Redundancy Patterns in Distributed Systems
- 4.4 — Failure Detection: Heartbeats, Timeouts and the Phi Accrual Detector
- 4.5 — Recovery and Self-Healing Systems
- 4.6 — Designing for High Availability: Patterns and Trade-offs
- 4.7 — Fault Isolation and Bulkheads
- 4.8 — Observability and Diagnosing Distributed Failures
- 4.9 — Chaos Engineering and Resilience Culture
Previous: ← 4.5 — Recovery and Self-Healing Systems
Next: 4.7 — Fault Isolation and Bulkheads →
Related posts from earlier in the series:
- 4.2 — Fault Tolerance vs High Availability — availability nines, MTTR, MTBF, and error budgets
- 4.3 — Redundancy Patterns — active-passive, active-active, and quorum-based redundancy in depth
- 2.2 — Reliability and Retries — circuit breakers and the fallback patterns that enable graceful degradation
- 3.4 — The CAP Theorem Correctly Understood — the consistency vs availability trade-off that multi-region HA must navigate
- 1.4 — Node & Failure Model — slow nodes vs crashed nodes and why readiness checks matter
9 thoughts on “Designing for High Availability: Patterns and Trade-offs in Distributed Systems”
Comments are closed.