When Correct Systems Still Fail
Part 1 established that nodes fail, networks partition, and clocks drift. Part 2 showed how systems communicate and coordinate despite those realities. Part 3 showed how systems store and replicate data correctly — with consistency guarantees, quorum-based availability, and consensus algorithms that survive individual node failures.
But correct, consistent, well-replicated systems still fail in production. Entire availability zones go dark. Hardware fails in correlated waves. A configuration change brings down a fleet. A traffic spike overwhelms a service that was handling normal load perfectly. A software bug triggers a cascading failure across dependent systems. A partition heals but the system cannot determine which side was authoritative.
These failures are not gaps in the algorithms. They are the operational reality of running distributed systems at scale. Part 4 addresses them directly — not through theory but through the engineering disciplines that production systems actually use to detect failures, survive them, and maintain availability despite ongoing partial failures.
Part 4 covers two closely related but distinct problems that engineers frequently conflate.
Fault tolerance is the ability of a system to continue operating correctly when individual components fail. A fault-tolerant system absorbs failures — it does not prevent them. It uses redundancy to mask the impact of component failures, isolation to prevent failures from propagating, and recovery mechanisms to restore correct operation after a failure has occurred. A fault-tolerant system that loses one of three replicas continues serving correct results from the remaining two. It does not stop. It does not degrade. It absorbs the fault.
High availability is a measurement — the percentage of time a system is available to serve requests. A system with 99.99% availability is unavailable for at most about 52 minutes per year. High availability is achieved through fault tolerance, but fault tolerance does not automatically produce high availability. A system that tolerates individual node failures but takes ten minutes to detect the failure and elect a new leader is fault-tolerant but not highly available — users experience ten minutes of unavailability for each failure. High availability requires not just the ability to survive failures but the ability to detect them quickly, respond automatically, and restore service within a time budget that users do not notice.
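The arithmetic behind the nines is worth internalising. A quick sketch (the function name is ours, not a standard API):

```python
# Convert an availability target into an annual downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_budget_minutes(availability: float) -> float:
    """Maximum minutes of unavailability per year at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability)

for target in (0.999, 0.9999, 0.99999):
    print(f"{target:.3%} -> {downtime_budget_minutes(target):.1f} min/year")
```

At 99.99% the budget works out to about 52.6 minutes per year; at 99.999% it shrinks to barely five minutes, which is why each additional nine changes the engineering approach rather than merely tightening it.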
Understanding this distinction is the starting point for Part 4. Every design decision — how aggressively to tune failure detection timeouts, how many redundant components to provision, whether to use active-active or active-passive redundancy, when to trigger automatic failover versus waiting for human confirmation — depends on which property you are optimising for and what the acceptable trade-off between them is.
Part 4 covers the complete fault tolerance and high availability engineering stack, from the vocabulary of failure modes through to the cultural practices that keep resilience from degrading as systems evolve.
Failure taxonomy establishes the vocabulary. Not all failures are equal — a crash-stop failure, a crash-recovery failure, a Byzantine failure, a correlated failure, and a gray failure each require different detection strategies and different recovery mechanisms. Engineers who cannot distinguish between failure types cannot design appropriate responses to them.
Failure detection is the operational form of the problem consensus algorithms address theoretically. Detecting that a node has failed — rather than merely slowed — requires heartbeats, timeouts, and probabilistic detectors like the phi accrual failure detector used by Cassandra and Akka. Every threshold is a trade-off: aggressive detection reduces unavailability but increases false positives and unnecessary failovers; conservative detection avoids false positives but extends the unavailability window when real failures occur.
Redundancy patterns are the structural mechanisms that make fault tolerance possible — active-active, active-passive, N+1, and geographic redundancy each offer different trade-offs between cost, complexity, and the failure scenarios they protect against.
Recovery and self-healing mechanisms return systems to correct operation after failures without human intervention — because at scale, failures happen faster than humans can respond to them manually.
High availability architecture patterns — load shedding, graceful degradation, bulkheads, and circuit breakers at the infrastructure level — are the design patterns that translate fault tolerance mechanisms into measurable availability.
Observability and diagnosis are the operational foundation that makes everything else maintainable. A system that fails in ways engineers cannot observe, measure, and understand will eventually fail in ways that take too long to recover from.
Chaos engineering is the practice of continuously validating resilience by deliberately injecting failures in production — ensuring that failure handling remains correct as systems evolve and that the team’s understanding of failure modes stays current.
Fault Tolerance vs High Availability: The Core Distinction
Because this distinction is the foundation for all of Part 4, it deserves more precise treatment before the posts themselves.
Fault tolerance is a binary property in theory: a system either continues operating correctly in the presence of a specified class of failures, or it does not. In practice, fault tolerance is parameterised — a system tolerates up to F failures out of N components. A three-node Raft cluster tolerates one node failure. A five-node cluster tolerates two. Beyond the tolerance threshold, correctness is no longer guaranteed.
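For majority-quorum systems like Raft, the tolerance threshold follows directly from cluster size. A minimal sketch of that relationship:

```python
# Majority-quorum fault tolerance: a cluster of n nodes tolerates
# f = (n - 1) // 2 failures while a majority can still be formed.

def tolerated_failures(n: int) -> int:
    """Maximum simultaneous node failures a majority quorum survives."""
    return (n - 1) // 2

def majority(n: int) -> int:
    """Smallest node count that constitutes a majority of n."""
    return n // 2 + 1

assert tolerated_failures(3) == 1 and majority(3) == 2
assert tolerated_failures(5) == 2 and majority(5) == 3
```

Note that even-sized clusters buy nothing here: a four-node cluster tolerates one failure, the same as three nodes, while adding a fourth correlated failure domain.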
High availability is a continuous metric: uptime as a percentage, typically expressed in “nines.” The difference between 99.9% and 99.99% availability is the difference between 8.7 hours of unavailability per year and 52 minutes. For a payment system processing millions of transactions, this difference is measured in revenue and user trust, not just engineering pride.
The two properties interact but are not the same. A system can be fault-tolerant but not highly available — if failure detection is slow, failover is manual, or recovery requires human intervention. A system can appear highly available but not be fault-tolerant — if it achieves uptime by hiding failures rather than surviving them, eventually a failure that cannot be hidden will cause a complete outage.
The goal is both: systems that genuinely survive component failures and do so quickly enough that users do not notice. Part 4 shows how production systems achieve this combination through careful design at every layer of the stack.
Part 4 Posts — Full Reading List
4.1 — Failure Taxonomy: How Distributed Systems Fail
Not all failures are equal and not all failures require the same response. Covers the complete taxonomy of distributed system failures — crash-stop, crash-recovery, omission failures, timing failures, Byzantine failures, correlated failures, and gray failures (the hardest class: systems that are partially functioning, passing health checks, but producing incorrect results). Establishes the vocabulary that every subsequent Part 4 post depends on and grounds each failure type in documented production incidents.
Key concepts: crash-stop failure, crash-recovery, omission failure, timing failure, Byzantine failure, correlated failure, gray failure, failure taxonomy, failure modes
4.2 — Fault Tolerance vs High Availability: Understanding the Difference
Fault tolerance and high availability are frequently used interchangeably. They are not the same property. Covers the precise definitions of both, how they interact, why a system can be fault-tolerant but not highly available and vice versa, the availability nines table and what each level actually means in operational terms, and how to design for both simultaneously. Includes real examples of systems that achieved one without the other and the production consequences.
Key concepts: fault tolerance, high availability, availability nines, SLA, SLO, SLI, MTTR, MTBF, active-active vs active-passive, failure vs unavailability
4.3 — Redundancy Patterns in Distributed Systems
Redundancy is the structural foundation of both fault tolerance and high availability — the system continues operating because there are multiple copies of each component. Covers active-active redundancy (all components serving traffic simultaneously, highest availability, highest complexity), active-passive redundancy (standby components activated on failure, simpler but slower failover), N+1 redundancy (one spare per group), geographic redundancy (multi-region and multi-datacenter patterns), and the cost and complexity trade-offs of each. Includes how AWS, Google Cloud, and Azure structure their redundancy at the infrastructure level.
Key concepts: active-active redundancy, active-passive redundancy, N+1 redundancy, geographic redundancy, hot standby, warm standby, cold standby, redundancy cost, correlated failure risk
4.4 — Failure Detection: Heartbeats, Timeouts and the Phi Accrual Detector
Detecting that a node has failed — rather than merely slowed — is the operational form of the ambiguity problem Part 1 identified. Covers heartbeat-based failure detection and its limitations, timeout tuning and the fundamental trade-off between sensitivity (fast detection) and specificity (avoiding false positives), the phi accrual failure detector used by Cassandra and Akka (adaptive threshold based on historical heartbeat intervals), gossip protocols for failure propagation at scale, and how Kubernetes liveness and readiness probes implement failure detection at the container level. Includes documented cases where aggressive timeout tuning caused cascading unnecessary failovers.
Key concepts: heartbeat, timeout, failure detector, phi accrual failure detector, gossip protocol, false positive, false negative, Cassandra failure detection, Kubernetes probes, adaptive thresholds
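The adaptive-threshold idea behind the phi accrual detector can be sketched in a few lines. This is a simplified variant assuming exponentially distributed heartbeat intervals (the production detectors in Cassandra and Akka fit a normal distribution over a sliding window, but the shape of the idea is the same); the class name is ours:

```python
import math

class PhiAccrualDetector:
    """Simplified phi accrual failure detector.

    phi = -log10(P(a heartbeat arrives later than the time already elapsed)).
    Higher phi means higher confidence the peer has failed; callers pick a
    suspicion threshold (e.g. phi > 8) instead of a fixed timeout.
    """

    def __init__(self):
        self.intervals = []        # observed gaps between heartbeats
        self.last_heartbeat = None

    def heartbeat(self, now: float) -> None:
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # Exponential assumption: P(later) = exp(-elapsed / mean),
        # so phi = -log10(P(later)) = elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))
```

The payoff is that the threshold adapts to observed network behaviour: on a jittery link the mean interval grows and phi rises more slowly, where a fixed timeout would fire spuriously.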
4.5 — Recovery and Self-Healing Systems
At scale, failures happen faster than humans can respond to them manually. Self-healing systems detect failures and restore correct operation automatically — without paging an engineer at 3am for every node failure. Covers automatic restart and rescheduling (Kubernetes pod recovery), automatic leader re-election (Raft and ZooKeeper election on leader failure), data re-replication after node loss (Cassandra and HDFS automatic re-replication), circuit breaker recovery (the half-open state and recovery probing), and the limits of self-healing — the failure scenarios that require human intervention and how to design the boundary between automatic and manual recovery correctly.
Key concepts: self-healing, automatic recovery, leader re-election, data re-replication, Kubernetes pod recovery, circuit breaker recovery, recovery time objective (RTO), recovery point objective (RPO), human-in-the-loop recovery
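The half-open recovery probe mentioned above is the heart of circuit breaker self-healing. A minimal sketch, with hypothetical names (libraries like Resilience4j add sliding windows, partial-failure rates, and metrics on top of this state machine):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: closed -> open -> half-open -> closed.

    After `failure_threshold` consecutive failures the breaker opens and
    fails fast. Once `reset_timeout` elapses it moves to half-open and lets
    a single probe through: success closes it, failure re-opens it.
    """

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.state = "closed"
        self.opened_at = None

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"   # allow one recovery probe
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"        # probe failed or threshold reached
                self.opened_at = self.clock()
            raise
        self.failures = 0                  # any success resets the count
        self.state = "closed"
        return result
```

Injecting the clock makes the recovery behaviour testable without real waiting, which matters: untested recovery paths are exactly the kind of resilience code that rots silently.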
4.6 — Designing for High Availability: Patterns and Trade-offs
High availability is not a feature that can be added after a system is designed — it is an architectural property that must be designed in from the start. Covers the core HA design patterns: load shedding (dropping excess traffic rather than collapsing under it), graceful degradation (serving reduced functionality when dependencies are unavailable rather than failing completely), health-check-driven routing (removing degraded instances from load balancer pools before clients experience failures), multi-region active-active deployment patterns, database HA with synchronous replication trade-offs, and the HA design review checklist. Includes documented HA architectures from Netflix, Stripe, and AWS.
Key concepts: high availability architecture, load shedding, graceful degradation, health-check routing, multi-region HA, active-active deployment, HA database patterns, availability budget, dependency HA requirements
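Load shedding, the first pattern above, can be as simple as a bounded concurrency limit that rejects rather than queues. A sketch under that assumption (the class name is ours; real shedders typically also prioritise by request class):

```python
import threading

class LoadShedder:
    """Reject work beyond a concurrency limit instead of queueing it.

    Bounded rejection keeps latency flat for admitted requests; unbounded
    queues turn a traffic spike into latency collapse for everyone.
    """

    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def submit(self, fn):
        if not self._slots.acquire(blocking=False):   # shed, never wait
            raise RuntimeError("overloaded: request shed")
        try:
            return fn()
        finally:
            self._slots.release()
```

The design choice is the non-blocking acquire: the caller gets an immediate, explicit rejection it can surface as a 503 with a retry hint, rather than a request that silently ages in a queue.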
4.7 — Fault Isolation and Bulkheads
The goal of fault isolation is to prevent a failure in one component from propagating to others. The bulkhead pattern — borrowed from naval architecture where watertight compartments prevent a single hull breach from sinking the entire ship — applies this principle to distributed systems. Covers thread pool isolation (separate pools per downstream dependency so one slow dependency cannot exhaust shared resources), process isolation (separate deployments for independent failure domains), data isolation (separate databases for separate services), geographic isolation (availability zone and region isolation), and how Hystrix, Resilience4j, and Istio implement bulkheads at different layers of the stack.
Key concepts: fault isolation, bulkhead pattern, thread pool isolation, process isolation, data isolation, geographic isolation, blast radius, failure domain, Hystrix, Resilience4j, Istio
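Thread pool isolation, the first bulkhead above, can be sketched as one bounded executor per downstream dependency. Names and pool sizes here are illustrative; production bulkheads (Resilience4j, for example) also cap the wait queue so a slow dependency cannot build unbounded backlog in its own compartment:

```python
from concurrent.futures import ThreadPoolExecutor

class Bulkhead:
    """One bounded thread pool per downstream dependency.

    A slow or failing dependency can saturate only its own pool; calls to
    every other dependency keep their full share of threads.
    """

    def __init__(self, pool_sizes: dict):
        # e.g. {"payments": 10, "recommendations": 4}
        self._pools = {name: ThreadPoolExecutor(max_workers=size)
                       for name, size in pool_sizes.items()}

    def submit(self, dependency: str, fn, *args):
        return self._pools[dependency].submit(fn, *args)
```

Sizing each compartment is the hard part: too small and you shed load under normal traffic; too large and the bulkhead no longer bounds the blast radius.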
4.8 — Observability and Diagnosing Distributed Failures
A system that fails in ways engineers cannot observe, measure, and understand will eventually fail in ways that take too long to recover from. Observability — the three pillars of metrics, logs, and distributed traces — is the operational foundation that makes fault tolerance and high availability maintainable over time. Covers the distinction between monitoring (known failure modes) and observability (unknown failure modes), structured logging and correlation IDs for tracing failures across service boundaries, distributed tracing with OpenTelemetry, SLI/SLO/SLA and error budget frameworks, alerting on symptoms rather than causes, and how Honeycomb, Datadog, and Grafana implement observability at scale. Includes the three questions every observability system must answer during an incident.
Key concepts: observability, metrics, logs, distributed traces, structured logging, correlation IDs, OpenTelemetry, SLI, SLO, SLA, error budget, alerting philosophy, Honeycomb, Datadog, Grafana
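Correlation IDs are simpler than they sound: generate (or propagate) one ID per request and stamp it on every structured log line. A minimal sketch using Python's `contextvars`; the function names and field layout are ours:

```python
import json
import logging
import uuid
from contextvars import ContextVar

# The correlation ID travels with the request context, so every log line
# emitted while handling a request carries the same ID, and one failure
# can be traced across service boundaries by grepping for it.
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")

def log_event(event: str, **fields) -> str:
    """Emit one structured (JSON) log line stamped with the correlation ID."""
    record = {"event": event, "correlation_id": correlation_id.get(), **fields}
    line = json.dumps(record)
    logging.getLogger("svc").info(line)
    return line

def handle_request(payload: bytes) -> None:
    # In practice the ID is propagated from an inbound header
    # (e.g. W3C traceparent) rather than generated fresh at every hop.
    correlation_id.set(str(uuid.uuid4()))
    log_event("request.received", size=len(payload))
```

Because `ContextVar` is async-safe, the same pattern works unchanged under `asyncio`, where thread-local storage would leak IDs between interleaved requests.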
4.9 — Chaos Engineering and Resilience Culture
Systems that are designed for resilience but never tested under failure conditions degrade silently — failure handling code rots, recovery mechanisms stop working, and the team loses confidence in the system’s behaviour under stress. Chaos engineering — deliberately injecting failures in production — is the practice that keeps resilience real. Covers the principles of chaos engineering (steady-state hypothesis, minimal blast radius, running in production), Netflix’s Chaos Monkey and the Simian Army, the progression from controlled game days to continuous automated chaos, chaos engineering tooling (Chaos Monkey, Gremlin, Chaos Mesh), the organisational dimension of resilience culture, and how to start a chaos engineering practice without causing incidents. Closes with the engineering guidelines that synthesise all of Part 4 into decisions engineers can apply immediately.
Key concepts: chaos engineering, Chaos Monkey, Simian Army, steady-state hypothesis, game day, fault injection, resilience culture, blameless postmortem, Gremlin, Chaos Mesh, continuous chaos
What Part 4 Prepares You For
Parts 1 through 3 answered the question of how distributed systems work correctly — how they model their environment, communicate under uncertainty, store data consistently, and achieve agreement despite failures. Part 4 answers the equally important question of how they continue working correctly when components fail in production — not in the controlled conditions of a well-behaved failure, but in the messy, partial, ambiguous failures that actually occur at scale.
Part 5 builds on the operational foundation Part 4 establishes. Scalability — the ability of a system to handle growing load — depends on fault tolerance and high availability as prerequisites. A system that cannot survive individual node failures cannot be safely scaled horizontally. A system that does not have observability into its failure modes cannot safely autoscale. The patterns Part 4 establishes — redundancy, isolation, graceful degradation, observability — are the operational foundation that makes the scalability patterns of Part 5 safe to apply.
Continue the Series
Series home: Distributed Systems — Concepts, Design & Real-World Engineering
Previous: Part 3 — Replication, Consistency & Consensus
Next: Part 5 — Scalability & Performance
Part 5 examines how systems scale beyond a single node without collapsing under load — partitioning strategies, caching trade-offs, load balancing, autoscaling, geo-distribution and cost and capacity planning.