Distributed Systems Series — Part 4.7: Fault Tolerance & High Availability
Failures Are Inevitable — Outages Are Not
Every large distributed system experiences component failures continuously. Nodes crash, networks degrade, downstream services slow, disks fill, processes run out of memory. The engineering discipline is not preventing these failures — that is impossible at scale — but containing them. A failure in one component should stay in that component. It should not propagate to unrelated services, exhaust shared resources, or bring down functionality that has nothing to do with the original fault.
The 2019 Cloudflare outage demonstrated what happens without this containment. A single CPU-intensive regular expression bug in a new WAF rule caused CPU utilisation to spike on Cloudflare’s servers globally. The CPU spike starved every other process running on those servers — the proxy, the DNS resolver, the control plane — of the cycles they needed to function. A bug in one component with a blast radius that encompassed the entire system. The result was a global traffic drop affecting all Cloudflare customers for approximately 27 minutes.
The 2021 Fastly outage was structurally similar. A single customer triggered a latent software bug by changing their configuration. The bug caused 85% of Fastly’s network to return errors simultaneously — because a single configuration change could affect a globally shared system without isolation between customers or between configuration changes and serving traffic.
Both incidents share the same root cause: insufficient isolation between failure domains. A failure in one part of the system had an unrestricted blast radius that encompassed the entire system. Fault isolation — the architectural practice of containing failures within defined boundaries — is what limits blast radius. Bulkheads are the primary mechanism.
Blast Radius: The Metric That Drives Isolation Design
Blast radius is the scope of impact when a component fails — how many users are affected, how many services are degraded, how much functionality is lost. Minimising blast radius is the primary goal of fault isolation design.
Blast radius has two dimensions. Depth: how far down the dependency chain does a failure propagate? A failure that stays within one service is shallow; one that cascades through five layers of dependencies is deep. Width: how many independent users, tenants, or feature areas are affected? A failure that affects one tenant in a multi-tenant system is narrow; one that affects all tenants because they share infrastructure is wide.
Every isolation boundary in the system design is a blast radius boundary — a line that a failure cannot cross automatically. Thread pool isolation creates blast radius boundaries between service calls. Process isolation creates boundaries between services. AZ isolation creates boundaries between infrastructure failure domains. Tenant isolation creates boundaries between customers. The goal is to design enough boundaries that no single failure can produce an unacceptable blast radius.
The practical question for every shared resource in your system: if this resource is exhausted or fails, what is the blast radius? If the answer is “the entire system” or “all customers,” that resource needs a bulkhead.
The Bulkhead Pattern: Origin and Core Mechanism
The term bulkhead comes from naval architecture. Ships are divided into watertight compartments separated by bulkheads — structural walls that prevent flooding from spreading between compartments. If one compartment floods, the bulkheads contain the flooding to that compartment and the ship remains afloat. Without bulkheads, a single hull breach sinks the ship.
In distributed systems, bulkheads isolate resource pools so that the exhaustion of resources serving one component cannot starve resources needed by other components. The diagram above shows the mechanism concretely: without bulkheads, a slow recommendation service exhausts a shared thread pool and starves authentication and payment services that have nothing to do with the recommendation failure. With bulkheads, each service has its own thread pool — the recommendation pool exhausts, recommendation requests fail fast, but authentication and payment pools are completely unaffected.
Bulkheads work in combination with the graceful degradation pattern established in Post 4.6. Bulkheads contain the failure to a specific resource pool. Graceful degradation defines what the system does when that pool is exhausted — serve a cached response, return a default, or fail fast with an informative error. Together they ensure that a failure in one component produces a defined, limited degradation rather than an unbounded cascade.
Thread Pool Isolation: The Primary Bulkhead Mechanism
Thread pool isolation is the most widely deployed bulkhead implementation. Each downstream dependency gets its own dedicated thread pool. Requests to that dependency use threads from its pool. If the dependency slows or fails and its pool exhausts, requests to that dependency fail fast — but the thread pools serving other dependencies are completely unaffected.
Netflix’s Hystrix library, released in 2012, popularised thread pool isolation in the Java microservices ecosystem. Hystrix was created specifically because Netflix experienced the cascade failure pattern — one slow dependency exhausting a shared thread pool and taking down unrelated functionality. Hystrix wrapped each downstream call in a command object that executed in a dedicated thread pool with configurable size, timeout, and fallback. When the pool was exhausted, new requests immediately invoked the fallback rather than blocking. Netflix ran Hystrix on all their API calls for years and credited it with significantly improving their service resilience.
Hystrix entered maintenance mode in 2018 and is no longer actively developed. Its successor in the Java ecosystem is Resilience4j, which implements the same bulkhead concept in two distinct variants that solve different problems.
Thread pool bulkhead (Resilience4j) assigns a fixed number of threads to each downstream call, matching the Hystrix model. Requests that arrive when the pool is full immediately invoke the fallback. The thread pool size is the isolation guarantee — at most N threads can be simultaneously blocked waiting for the downstream service. The cost is thread overhead: each thread pool consumes memory and scheduling overhead, and idle threads in one pool cannot be used by another pool even if that pool is at capacity.
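As a concrete illustration, here is a minimal Python sketch of the thread-pool variant. This is not the Resilience4j API itself; the class, names, and sizes are illustrative. At most `max_threads` calls can be in flight, a caller that finds the pool full takes the fallback immediately, and a caller waiting on a slow call can be unblocked by a timeout (in Python the worker thread itself cannot be interrupted, only the caller is released).

```python
# Illustrative thread-pool bulkhead sketch, loosely modelled on the
# Resilience4j ThreadPoolBulkhead. Not a production implementation.
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

class ThreadPoolBulkhead:
    def __init__(self, max_threads):
        self._pool = ThreadPoolExecutor(max_workers=max_threads)
        self._slots = threading.Semaphore(max_threads)  # tracks free threads

    def submit(self, call, fallback, timeout=None):
        # Fail fast: no free thread means this downstream is saturated.
        if not self._slots.acquire(blocking=False):
            return fallback()

        def guarded():
            try:
                return call()
            finally:
                self._slots.release()  # free the slot when the call ends

        future = self._pool.submit(guarded)
        try:
            return future.result(timeout=timeout)
        except FutureTimeout:
            # Timeout isolation for the caller; the worker thread still
            # runs to completion and then releases its slot.
            return fallback()
```

The isolation guarantee is exactly the semaphore count: at most `max_threads` threads can be blocked on this downstream at once, and requests beyond that budget degrade via the fallback instead of queuing.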
Semaphore bulkhead (Resilience4j) limits the number of concurrent calls to a downstream service without allocating dedicated threads. Instead, a semaphore counter tracks concurrent calls. When the count reaches the configured limit, new requests immediately invoke the fallback. Semaphore bulkheads have lower overhead than thread pool bulkheads — they do not allocate threads, which makes them suitable for high-throughput scenarios where thread allocation overhead is measurable. The trade-off: semaphore bulkheads do not provide timeout isolation. A request that is waiting for a semaphore cannot be interrupted by a timeout — the timeout applies to the downstream call itself, not to waiting for the semaphore. For CPU-bound or memory-intensive calls where thread isolation overhead matters, semaphore bulkheads are the right choice. For I/O-bound calls where timeout isolation is critical, thread pool bulkheads are the right choice.
Connection Pool Isolation
Connection pool isolation applies the bulkhead principle to database and external service connections. Without isolation, all requests to all downstream services share a single connection pool. A downstream service that is slow to respond holds connections open, depleting the pool. When the pool is exhausted, all requests to all downstream services fail — not just requests to the slow service.
The correct design maintains separate connection pools per downstream dependency. Each database, each external API, each third-party service gets its own pool with its own size limit. Pool exhaustion for one downstream service produces failures for that service only — other pools are unaffected.
PostgreSQL’s connection handling makes this concrete. PostgreSQL enforces a server-wide connection limit (max_connections, 100 by default). If one application service opens 80 connections to a database, a second service that needs connections to the same database is starved: it cannot get connections because the server’s limit is nearly exhausted. The correct architecture uses connection pooling middleware (PgBouncer, pgpool-II) that maintains separate pools per application service and enforces per-service connection limits, preventing one service from monopolising the database’s connection capacity.
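A hypothetical PgBouncer fragment sketching this layout (the service names, pool sizes, and connection details are illustrative, not a recommended configuration):

```ini
; Each service connects through its own logical database entry with its
; own pool_size, so neither can monopolise the server's connections.
[databases]
orders_svc  = host=127.0.0.1 port=5432 dbname=appdb pool_size=30
reports_svc = host=127.0.0.1 port=5432 dbname=appdb pool_size=10

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
```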
In microservices architectures, connection pool isolation at the HTTP client level prevents a similar failure mode. If all outbound HTTP calls share a single HTTP client with a single connection pool, a downstream service that accepts connections but responds slowly will hold all connections open until the pool is exhausted. Separate HTTP clients per downstream service — each with its own connection pool, timeout, and circuit breaker — contain this failure mode to the specific downstream service that is misbehaving.
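The same idea can be sketched without any particular HTTP library. This is a toy model, not a real client: each downstream gets its own connection budget and timeout, so exhausting one budget cannot touch another. The service names and limits are illustrative.

```python
# Toy model of per-downstream client isolation: one connection budget
# and timeout per dependency, never shared across dependencies.
import threading

class DownstreamClient:
    def __init__(self, name, max_connections, timeout_s):
        self.name = name
        self.timeout_s = timeout_s
        self._connections = threading.Semaphore(max_connections)

    def request(self, send, fallback):
        # `send` stands in for the actual network call.
        if not self._connections.acquire(blocking=False):
            return fallback()  # only this downstream's budget is spent
        try:
            return send(self.timeout_s)
        finally:
            self._connections.release()

# One client per downstream dependency, never a shared pool.
clients = {
    "payments":        DownstreamClient("payments", 50, 2.0),
    "recommendations": DownstreamClient("recommendations", 20, 0.5),
}
```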
Service Mesh Bulkheads: Istio and Envoy
Service mesh infrastructure — Istio, Linkerd, Envoy — implements bulkheads at the network layer without requiring application code changes. This is significant because it means bulkhead protection applies uniformly to all services in the mesh, not just services that have explicitly integrated a bulkhead library.
Istio’s DestinationRule resource configures connection pool limits per upstream service. The connectionPool settings define the maximum number of TCP connections (tcp.maxConnections) and the maximum number of pending HTTP requests (http.http1MaxPendingRequests). When these limits are reached, Istio’s Envoy sidecar rejects new connections immediately rather than queuing them — applying the bulkhead pattern at the proxy level before the application receives the request.
Istio’s outlier detection implements circuit breaking at the infrastructure level, automatically ejecting from the load balancing pool any upstream host whose error rate exceeds a configured threshold. When a host is ejected, no traffic is routed to it for a configurable ejection interval. This is bulkheading at the instance level within a service: a specific misbehaving instance is isolated from the pool without affecting other instances of the same service.
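A hypothetical DestinationRule combining the connection pool bulkhead with outlier detection (the host name and thresholds are illustrative):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: recommendations-bulkhead
spec:
  host: recommendations.prod.svc.cluster.local   # illustrative host
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100          # bulkhead: cap TCP connections
      http:
        http1MaxPendingRequests: 50  # bulkhead: cap queued requests
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutive5xxErrors: 5        # eject after 5 consecutive 5xx
      interval: 30s                  # scan interval
      baseEjectionTime: 60s          # minimum ejection duration
      maxEjectionPercent: 50         # never eject more than half the pool
```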
Locality-weighted load balancing in Istio implements geographic bulkheading — traffic from a given region is preferentially routed to upstream instances in the same region, with fallback to other regions only when local instances are unhealthy. This contains the blast radius of a regional upstream degradation to regional traffic rather than spreading it globally through the load balancer.
Process and Service Isolation
Thread pool and connection pool isolation contain failures within a service. Process isolation and service isolation contain failures between services.
Process isolation runs different services in separate processes, preventing a crash in one service from crashing others. Container-based deployment (Docker, Kubernetes) implements process isolation by default — each container runs an independent process namespace. A service that crashes due to an unhandled exception, an OOM kill, or a segmentation fault is contained to its container. Other containers on the same node continue running unaffected. The container runtime (containerd, CRI-O) provides the isolation boundary.
Service isolation separates services into independent deployable units with independent resource allocation and independent failure domains. In a monolithic architecture, one code path that consumes excessive CPU or memory degrades all other code paths. In a microservices architecture, one service that consumes its CPU limit does not affect other services’ CPU allocations — Kubernetes resource limits enforce the isolation boundary.
Kubernetes resource limits and requests implement service-level bulkheads. Resource requests reserve guaranteed capacity. Resource limits cap maximum consumption. A container that exceeds its memory limit is OOM-killed — contained to that container. A container that exceeds its CPU limit is throttled — contained to that container. Without limits, one container can consume all available node resources and starve every other container on the node. Resource limits are mandatory bulkheads, not optional optimisations.
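A minimal Pod spec sketch showing the requests/limits bulkhead (the image name and sizes are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: checkout
spec:
  containers:
  - name: checkout
    image: example.com/checkout:1.0   # illustrative image
    resources:
      requests:                 # guaranteed, scheduler-reserved capacity
        cpu: "250m"
        memory: "256Mi"
      limits:                   # hard cap: OOM-kill (memory) / throttle (CPU)
        cpu: "500m"
        memory: "512Mi"
```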
Tenant Isolation in Multi-Tenant Systems
Multi-tenant SaaS systems face a specific blast radius problem: one tenant’s workload consuming shared resources and degrading performance for all other tenants. This is the “noisy neighbour” problem, and it requires tenant-level bulkheads to contain.
Tenant isolation strategies exist at multiple layers. Request-level rate limiting caps the number of requests per tenant per time window — preventing one tenant from consuming an unfair share of compute capacity. Per-tenant connection pool limits prevent one tenant’s database-heavy workload from exhausting connection capacity for other tenants. Per-tenant resource quotas in Kubernetes (using ResourceQuota objects per namespace) limit the compute and storage resources available to one tenant’s workloads. Per-tenant queues prevent one tenant’s large batch job from blocking other tenants’ real-time requests in a shared processing system.
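A per-tenant token-bucket limiter, the first of these strategies, can be sketched in a few lines of Python (the rates, burst sizes, and tenant IDs are illustrative):

```python
# Illustrative per-tenant token-bucket rate limiter: each tenant draws
# from its own bucket, so one tenant burning through its budget cannot
# consume another tenant's capacity.
import time

class TokenBucket:
    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        # Refill proportionally to elapsed time, capped at the burst size.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # tenant over budget: reject (e.g. with HTTP 429)

class TenantRateLimiter:
    def __init__(self, rate_per_s, burst):
        self.rate, self.burst = rate_per_s, burst
        self.buckets = {}  # one bucket per tenant: the bulkhead boundary

    def allow(self, tenant_id):
        bucket = self.buckets.setdefault(
            tenant_id, TokenBucket(self.rate, self.burst))
        return bucket.allow()
```

Rejections cost only a dictionary lookup and a counter update, which mirrors the property described for Stripe below: a rate-limited request consumes essentially no backend resources.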
Stripe’s rate limiting architecture is the canonical production example. Every Stripe API call is rate-limited per API key, per endpoint, and per time window. Rate limit responses (HTTP 429) are returned immediately — they do not consume backend processing resources. This bulkheads one merchant’s API usage from affecting other merchants’ API availability, even when all merchants share the same backend infrastructure.
Data Plane and Control Plane Isolation
As established in Post 4.6, the control plane and data plane must be isolated from each other. A control plane failure should not affect data plane serving, and a data plane overload should not affect control plane operations.
This isolation requires separate resource allocation for control plane components. etcd, the Kubernetes API server, and the scheduler should run on dedicated nodes with dedicated CPU and memory allocations — not on the same nodes as workload containers. A workload container that exhausts node memory and triggers an OOM kill should not be able to OOM-kill the kubelet or the etcd process on the same node.
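On worker nodes, the kubelet's own share of this reservation can be expressed in its KubeletConfiguration, so workload containers can never claim the capacity the node's system daemons need (the values are illustrative, not recommendations):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:          # reserved for OS-level daemons
  cpu: "500m"
  memory: "1Gi"
kubeReserved:            # reserved for kubelet and container runtime
  cpu: "500m"
  memory: "1Gi"
evictionHard:            # evict pods before the node itself runs dry
  memory.available: "500Mi"
```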
Network isolation between control plane and data plane traffic prevents a data plane traffic spike from saturating the network paths used by control plane heartbeats and leader election messages. Kubernetes recommends dedicated network interfaces for control plane communication in high-availability production deployments — a separate network that data plane traffic cannot reach, preventing data plane congestion from causing control plane timeouts and spurious leader elections.
Testing Fault Isolation
Bulkheads that are not tested under failure conditions may not work as designed. Thread pool sizes that were set without load testing may be too small (causing excessive fallback activation under normal load) or too large (providing insufficient isolation under peak load). Connection pool limits that were configured without measuring actual connection usage may be wrong. Tenant rate limits that were set without measuring tenant traffic patterns may be too aggressive or too lenient.
Fault isolation testing follows the same principles as chaos engineering — covered in Post 4.9 — but with a specific focus on isolation boundary validation. Inject latency into one downstream service and verify that other downstream services continue to serve requests at normal latency. Exhaust one tenant’s rate limit and verify that other tenants are unaffected. Kill one container and verify that other containers on the same node continue running without performance degradation.
The key metrics to monitor during isolation testing: request latency and success rate per service, thread pool utilisation and rejection rate per pool, connection pool utilisation and wait time per pool, tenant-level request rate and error rate. If injecting a failure into Service A causes degradation in Service B’s metrics, the isolation boundary between them is insufficient.
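A minimal sketch of such a test in Python: saturate one service's dedicated pool with hanging calls, then assert that a call through another service's pool still completes quickly. The pool sizes and timings are illustrative; a real test would inject the fault into live infrastructure and read these numbers from metrics.

```python
# Isolation-boundary test sketch: a fault injected into service A's
# bulkhead must not degrade service B's latency.
import threading
import time
from concurrent.futures import ThreadPoolExecutor

pool_a = ThreadPoolExecutor(max_workers=2)  # bulkhead for service A
pool_b = ThreadPoolExecutor(max_workers=2)  # bulkhead for service B

release = threading.Event()

def slow_call():
    release.wait()          # injected fault: service A hangs indefinitely

def fast_call():
    return "ok"

# Inject the fault: fill service A's pool with hanging calls.
for _ in range(2):
    pool_a.submit(slow_call)

# Verify the boundary: service B still serves within its budget.
start = time.monotonic()
result = pool_b.submit(fast_call).result(timeout=1.0)
elapsed = time.monotonic() - start

release.set()  # clean up the injected fault
```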
Key Takeaways
- Blast radius is the primary metric for fault isolation design — every shared resource in the system should have an explicit blast radius assessment, and any shared resource whose exhaustion affects the entire system needs a bulkhead
- Thread pool bulkheads (Hystrix, Resilience4j) isolate downstream service calls into dedicated thread pools — pool exhaustion for one downstream service produces fast failures for that service only, leaving other service pools unaffected
- Semaphore bulkheads limit concurrent calls with lower overhead than thread pool bulkheads — appropriate for high-throughput CPU-bound calls where thread allocation overhead is measurable, but they do not provide timeout isolation
- Service mesh infrastructure (Istio, Envoy) implements connection pool limits and outlier detection at the network layer — providing uniform bulkhead protection across all services without application code changes
- Kubernetes resource limits are mandatory service-level bulkheads — a container without memory and CPU limits can exhaust node resources and starve every other container on the node
- Multi-tenant systems require tenant-level bulkheads — per-tenant rate limits, connection pool limits, and resource quotas prevent the noisy neighbour problem from allowing one tenant’s workload to degrade all others
- Control plane and data plane must be isolated from each other — a data plane overload that exhausts control plane resources prevents the system from responding to failures, turning a degradation into a complete outage
Frequently Asked Questions (FAQ)
What is the bulkhead pattern in distributed systems?
The bulkhead pattern isolates resources — thread pools, connection pools, CPU, memory — into separate compartments per dependency or service, preventing the resource exhaustion caused by one failing component from affecting other components. The name comes from naval architecture, where watertight bulkheads prevent a single hull breach from flooding the entire ship. In distributed systems, bulkheads prevent a single slow or failing downstream service from exhausting all available threads or connections and cascading to unrelated services. Common implementations include Resilience4j, Hystrix (now in maintenance mode), and service mesh infrastructure such as Istio.
What is blast radius in distributed systems?
Blast radius is the scope of impact when a component fails — how many users are affected, how many services degrade, how much functionality is lost. A failure with a small blast radius affects only the failed component and its direct dependents. A failure with a large blast radius cascades through shared resources to affect unrelated components and all users. Fault isolation design is fundamentally blast radius minimisation — adding isolation boundaries that prevent failures from propagating beyond their origin. Every shared resource without a bulkhead is a potential blast radius amplifier.
What is the difference between thread pool and semaphore bulkheads?
Thread pool bulkheads allocate a dedicated set of threads for each downstream service. Requests that arrive when the pool is full invoke the fallback immediately. Thread pool bulkheads provide both concurrency isolation and timeout isolation — a thread waiting for a slow downstream call can be interrupted by a timeout. Semaphore bulkheads use a counter to limit concurrent calls without allocating dedicated threads. They have lower overhead (no thread allocation) but do not provide timeout isolation — a request waiting for a semaphore cannot be interrupted. Thread pool bulkheads are appropriate for I/O-bound calls where timeout isolation is critical. Semaphore bulkheads are appropriate for high-throughput CPU-bound calls where thread overhead is measurable.
How does Istio implement bulkheads?
Istio implements bulkheads at the network proxy layer through its Envoy sidecar, without requiring application code changes. The DestinationRule resource configures connection pool limits per upstream service — maximum TCP connections and maximum pending HTTP requests. When limits are reached, Envoy rejects new connections immediately rather than queuing them. Istio’s outlier detection implements circuit breaking per upstream instance — automatically ejecting instances that exceed an error threshold from the load balancing pool for a configurable interval. This provides instance-level blast radius containment within a service without application-level circuit breaker code.
Why are Kubernetes resource limits required for fault isolation?
Without resource limits, a container can consume all available CPU and memory on a node, starving every other container on that node of the resources they need to function. A memory leak in one container can trigger node-level OOM kills that affect unrelated containers. A CPU-intensive workload can throttle every other process on the node. Kubernetes resource limits enforce isolation boundaries at the container level — a container exceeding its memory limit is OOM-killed (contained to that container), a container exceeding its CPU limit is throttled (contained to that container). Resource limits are mandatory bulkheads, not optional performance hints.
How do bulkheads relate to circuit breakers?
Bulkheads and circuit breakers are complementary mechanisms that address different aspects of failure containment. Bulkheads limit resource consumption — they ensure that a failing dependency cannot exhaust more than its allocated thread pool or connection pool. Circuit breakers detect failure rates — they stop sending requests to a failing dependency when the error rate exceeds a threshold, giving it time to recover. Bulkheads contain the resource impact of a slow service. Circuit breakers contain the call volume to a failed service. Used together, they prevent both resource exhaustion (bulkhead) and retry amplification (circuit breaker) during downstream service failures.
Continue the Series
Series home: Distributed Systems — Concepts, Design & Real-World Engineering
Part 4 — Fault Tolerance & High Availability Overview
- 4.1 — Failure Taxonomy: How Distributed Systems Fail
- 4.2 — Fault Tolerance vs High Availability: Understanding the Difference
- 4.3 — Redundancy Patterns in Distributed Systems
- 4.4 — Failure Detection: Heartbeats, Timeouts and the Phi Accrual Detector
- 4.5 — Recovery and Self-Healing Systems
- 4.6 — Designing for High Availability: Patterns and Trade-offs
- 4.7 — Fault Isolation and Bulkheads
- 4.8 — Observability and Diagnosing Distributed Failures
- 4.9 — Chaos Engineering and Resilience Culture
Previous: ← 4.6 — Designing for High Availability: Patterns and Trade-offs
Next: 4.8 — Observability and Diagnosing Distributed Failures →
Related posts from earlier in the series:
- 2.2 — Reliability and Retries — Circuit breakers that complement bulkheads for failure containment
- 4.6 — Designing for High Availability — Graceful degradation and the fallback patterns that bulkheads enable
- 4.1 — Failure Taxonomy — The cascade failure patterns that bulkheads are designed to prevent
- 1.4 — Node & Failure Model — Slow nodes as the failure mode that bulkheads contain