Distributed Systems Series — Part 4.1: Fault Tolerance & High Availability
Why Failure Vocabulary Matters Before Failure Mechanisms
Part 4 covers how distributed systems survive failures. But before designing survival mechanisms — redundancy, failure detection, circuit breakers, chaos engineering — engineers must be precise about what kinds of failures they are designing for. A retry strategy that handles crash failures correctly may be completely ineffective against omission failures. A consensus algorithm that tolerates crash-recovery failures may produce split-brain under Byzantine failures. A health check that detects a crashed node may completely miss a gray failure.
Designing the wrong resilience mechanism for the actual failure mode is one of the most common and expensive mistakes in distributed systems engineering. The vocabulary introduced in this post is not academic taxonomy — it is the prerequisite for every design decision in Part 4.
This post builds directly on the node model established in Part 1.4 — crash-stop, crash-recovery, slow nodes, partial failure — and extends it into the complete failure taxonomy used by distributed systems literature, production platform design, and formal algorithm analysis. The foundation comes from four canonical references: Tanenbaum and van Steen’s Distributed Systems, Coulouris et al.’s Distributed Systems: Concepts and Design, Kleppmann’s Designing Data-Intensive Applications, and Roberto Vitillo’s Understanding Distributed Systems.
Three Terms Engineers Conflate — Fault, Error, and Failure
Before classifying failures, the distinction between three related terms must be precise. Distributed systems literature — specifically Laprie’s dependability framework adopted by Tanenbaum, Coulouris, and most formal treatments — defines them as follows:
A fault is the underlying defect: a hardware component failing, a software bug, a configuration error. A fault is a cause. It may exist in a system for a long time before producing any observable effect.
An error is the incorrect internal state caused by an activated fault. The system’s state diverges from what is specified. An error is not yet visible to external observers — it is an internal condition that may or may not propagate further.
A failure is the externally observable incorrect behaviour that results when an error propagates to the system’s output. A failure is what users, clients, and monitoring systems see.
The propagation chain: a disk sector silently corrupts (fault) → a replica stores incorrect data in memory (error) → a client receives the wrong value from a read (failure).
This distinction matters for fault-tolerant system design because fault tolerance operates at different layers. Fault prevention stops faults from occurring — quality hardware, code review, formal verification. Fault removal eliminates faults before activation — testing, debugging, patching. Fault tolerance contains the propagation from fault to error to failure — redundancy, error detection, recovery mechanisms. A system may contain latent faults without ever producing a failure if the error propagation is contained.
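To make the containment idea concrete, here is a minimal Go sketch — illustrative only, not tied to any particular storage engine — in which a checksum verified at read time turns a silent corruption (fault) into a detectable error that never reaches the client as a failure:

```go
package main

import (
	"errors"
	"fmt"
	"hash/crc32"
)

// record pairs stored bytes with the checksum written alongside them.
// A corrupted sector (fault) yields a mismatching checksum (detectable error).
type record struct {
	data     []byte
	checksum uint32
}

var errCorrupt = errors.New("checksum mismatch: latent error detected")

// verifiedRead contains the error before it propagates to the caller:
// a bad checksum triggers a read from another replica instead of returning
// wrong data (which would be an externally visible failure).
func verifiedRead(primary, backup record) ([]byte, error) {
	for _, r := range []record{primary, backup} {
		if crc32.ChecksumIEEE(r.data) == r.checksum {
			return r.data, nil
		}
	}
	return nil, errCorrupt // surfaced as an explicit error, not silent wrong data
}

func main() {
	good := record{data: []byte("value"), checksum: crc32.ChecksumIEEE([]byte("value"))}
	bad := record{data: []byte("valXe"), checksum: good.checksum} // simulated bit flip
	v, err := verifiedRead(bad, good)
	fmt.Println(string(v), err) // "value" <nil>: the fault never became a failure
}
```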
This is why Kleppmann’s treatment in Designing Data-Intensive Applications emphasises that systems must be designed to tolerate faults, not just prevent them. At scale, faults are inevitable. The engineering question is whether they propagate into failures that users observe.
The Four Classical Failure Models
Distributed systems literature classifies failures into four primary models, ordered from simplest to most complex. Each model is a superset of the previous — Byzantine failures include all crash, omission, and timing failure behaviours, plus arbitrary behaviour beyond them.
Crash-Stop Failures
A crash-stop failure occurs when a node stops executing and does not recover or send any further messages. From other nodes’ perspective, the failed node simply disappears — it stops responding to all requests, stops sending heartbeats, and takes no further action. It does not send incorrect messages. It does not recover with stale state. It stops.
Crash-stop is the simplest failure model and the one that produces the cleanest algorithm design. If a node has crash-stopped, other nodes can safely ignore it and proceed with the remaining nodes. Raft and Paxos are both designed under this assumption — they guarantee correctness when nodes crash and stop, requiring only a quorum of surviving nodes to make progress.
In production, crash-stop is approximated but rarely exact. A Kubernetes container killed by the Linux OOM killer behaves as a crash-stop failure at the instant of the kill — the container process terminates, and the kubelet reports it as failed — though if the restart policy brings the container back, the episode as a whole becomes crash-recovery. A virtual machine that loses power and is never restarted is a closer approximation. The key property is permanent cessation — the node does not return.
Crash-Recovery Failures
A crash-recovery failure occurs when a node crashes, restarts, and rejoins the system — potentially with stale or incomplete state. This is the model that describes production systems most accurately. Processes restart after OOM kills. Kubernetes pods reschedule after node failures. Services restart after deployment errors. Disks recover after brief outages.
Crash-recovery is significantly harder to handle than crash-stop because the recovered node may not know what happened while it was down. It may have been elected to a role it no longer holds. It may hold state that was superseded while it was unavailable. It may attempt to resume operations based on a view of the world that is now incorrect.
As established in Post 1.4, restarting a node is frequently harder than losing it. A crashed node is absent. A recovered node is present but potentially acting on outdated assumptions. ZooKeeper, etcd, and most production coordination services handle crash-recovery explicitly — a recovered node must catch up on all writes that occurred during its absence before it can participate in decisions.
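A minimal sketch of that catch-up step, with `fetchSince` standing in for whatever snapshot or log-transfer RPC the real coordination service uses — the types here are illustrative, not any system's actual API:

```go
package main

import "fmt"

// entry is a replicated log entry; index is its position in the log.
type entry struct {
	index int
	op    string
}

type node struct {
	lastApplied int
	state       []string
}

// catchUp sketches what a recovered replica must do before it may vote or
// serve reads: fetch and apply every entry committed while it was down.
// fetchSince stands in for the real catch-up mechanism, which in ZooKeeper
// or etcd is a snapshot or diff transfer from the current leader.
func (n *node) catchUp(leaderCommit int, fetchSince func(from int) []entry) {
	if n.lastApplied >= leaderCommit {
		return // nothing was missed; safe to rejoin immediately
	}
	for _, e := range fetchSince(n.lastApplied + 1) {
		n.state = append(n.state, e.op)
		n.lastApplied = e.index
	}
}

func main() {
	n := &node{lastApplied: 3}
	leaderLog := []entry{{4, "set x=1"}, {5, "set y=2"}}
	n.catchUp(5, func(from int) []entry { return leaderLog })
	fmt.Println(n.lastApplied, n.state) // 5 [set x=1 set y=2]
}
```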
Omission Failures
An omission failure occurs when a node fails to send or receive messages it should have. The node is running — it is not crashed — but messages are silently dropped either by the node itself or by the network between nodes.
Omission failures come in three forms. A send omission occurs when a node fails to transmit a message it should send — perhaps due to a full send buffer, a software bug that skips the send path, or a network interface dropping packets silently. A receive omission occurs when a node fails to process an incoming message — perhaps due to an overloaded receive queue. A channel omission occurs when the network drops packets in transit.
Channel omissions are by far the most common in practice. As established in Post 1.3, networks drop packets due to congestion, router failures, misconfigurations, and transient hardware issues. TCP handles many channel omissions transparently through retransmission, but at the distributed systems layer — where message loss means a service call that never arrives — omission failures are the baseline assumption. Most distributed protocols are designed assuming omission failures as the standard network behaviour, not as exceptional events.
The danger of omission failures is that they are indistinguishable from slow responses and from crash-stop failures. A node that stops receiving messages cannot determine whether the sender crashed, whether the network is dropping packets, or whether the messages are merely delayed. This is the fundamental ambiguity that makes failure detection probabilistic rather than deterministic.
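The ambiguity has one practical consequence worth showing in code. In the hedged Go sketch below (the `call` function is a stand-in for any remote call), the only information a timeout gives the caller is "no reply within the deadline" — which is why the retry is safe only when the operation is idempotent, since the first attempt may still have executed on the server:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// call stands in for any RPC to a remote node; from the caller's point of
// view the reply may arrive late, arrive after the deadline, or never arrive.
func call(ctx context.Context, req string) (string, error) {
	select {
	case <-time.After(300 * time.Millisecond): // simulated slow or lost reply
		return "ok:" + req, nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// callWithRetry shows the only safe response to a suspected omission: time
// out and retry an idempotent operation, because the caller cannot tell a
// dropped request from a dropped reply from a slow or crashed peer.
func callWithRetry(req string, attempts int, timeout time.Duration) (string, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		resp, err := call(ctx, req)
		cancel()
		if err == nil {
			return resp, nil
		}
		lastErr = err
	}
	return "", errors.Join(errors.New("peer crashed, slow, or messages lost - cannot tell which"), lastErr)
}

func main() {
	_, err := callWithRetry("put k=v (idempotent)", 2, 100*time.Millisecond)
	fmt.Println(err)
}
```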
Timing Failures
A timing failure occurs when a node responds outside its expected time bounds — either too late (the response arrives after a timeout has fired) or too early (the response arrives before the requesting system is ready to process it, which is rare in practice). Timing failures are particularly relevant in systems with strict latency SLAs and in asynchronous distributed systems where timeouts are the primary mechanism for detecting failures.
The model assumed by Raft, Paxos, and most production consensus algorithms is partial synchrony — the network is neither reliably synchronous (message delays always bounded) nor purely asynchronous (delays unbounded forever); it misbehaves for a while and eventually stabilises. Timing failures — responses that arrive after timeouts have already fired — drive the false suspicion problem that Post 3.6 identified as one of the root causes of unnecessary leader elections and cascading failovers.
A garbage collection pause is a timing failure at the node level. The node is running, its processes are alive, but it stops responding for hundreds of milliseconds while the JVM reclaims memory. From the perspective of other nodes waiting for heartbeats, the paused node looks identical to a crashed node. When the GC pause ends and the node resumes, it may discover that it has been declared failed, its leadership role has been revoked, and another node has taken over — all while it was merely paused.
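A naive timeout detector makes the problem easy to see. The sketch below — illustrative numbers, not a tuning recommendation — declares a peer dead after a fixed silence, which is exactly what a long GC pause triggers:

```go
package main

import (
	"fmt"
	"time"
)

// suspect implements the naive timeout-based detector that timing failures
// defeat: if no heartbeat has arrived within `timeout`, declare the peer dead.
func suspect(lastHeartbeat time.Time, timeout time.Duration, now time.Time) bool {
	return now.Sub(lastHeartbeat) > timeout
}

func main() {
	timeout := 150 * time.Millisecond
	last := time.Now()

	// A 400 ms stop-the-world GC pause on the peer: the process is alive but
	// sends no heartbeats, which to this detector is indistinguishable from a crash.
	now := last.Add(400 * time.Millisecond)
	fmt.Println("suspected dead during GC pause:", suspect(last, timeout, now)) // true

	// When the pause ends the peer resumes - but by then another node may
	// already have taken over its role, which is why timing failures drive
	// false suspicion and unnecessary failovers.
}
```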
Gray Failures: The Hardest Failure Class in Production
Classical distributed systems literature stops at the models above plus Byzantine failures, covered below. Production systems have added a category that does not fit cleanly into any of them: the gray failure.
A gray failure occurs when a component is partially functioning — it passes health checks, responds to pings, appears alive to monitoring systems — but is producing incorrect or degraded results for a subset of requests. It is not crashed (crash failure). It is not completely silent (omission failure). It is not behaving maliciously (Byzantine failure). It is responding, but incorrectly, inconsistently, or too slowly for some requests while handling others normally.
Gray failures are the hardest failure class in production because standard failure detection mechanisms are designed to detect absence — a node that stops responding. They are not designed to detect partial incorrectness in a node that continues responding. A health check endpoint that returns 200 OK tells you the process is alive. It tells you nothing about whether the database connection pool is exhausted, whether the node is serving stale data from a corrupted cache, or whether 10% of requests are silently timing out internally while the remaining 90% succeed.
Microsoft Research documented gray failures extensively in a 2017 paper studying Azure production incidents. They found that a significant fraction of production failures that caused user-visible impact were gray failures — partially functioning components that standard monitoring missed entirely until the impact became severe enough to surface through user complaints or business metric degradation.
Gray failures require different detection strategies than crash failures: application-level health checks that exercise the actual functionality rather than just the process liveness, synthetic transaction monitoring that measures end-to-end correctness, and anomaly detection on request success rates and latency distributions rather than binary up/down status. We will cover these in depth in Post 4.4 and Post 4.8.
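As a sketch of what "exercise the actual functionality" means, here is a minimal Go readiness handler. The `/ready` path, the `dependencyCheck` type, and the stubbed database ping are illustrative assumptions, not a prescribed interface:

```go
package main

import (
	"context"
	"net/http"
	"time"
)

// dependencyCheck is any probe that exercises real functionality - a database
// ping, a read of a canary key, a call through the connection pool - rather
// than merely confirming the process is alive.
type dependencyCheck func(ctx context.Context) error

// readyHandler returns 200 only when the checks pass; a gray-failing node
// (process up, dependency broken) answers 503 instead of lying with 200.
func readyHandler(checks ...dependencyCheck) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 500*time.Millisecond)
		defer cancel()
		for _, check := range checks {
			if err := check(ctx); err != nil {
				http.Error(w, "degraded: "+err.Error(), http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	pingDatabase := func(ctx context.Context) error { return nil } // stand-in for db.PingContext(ctx)

	// Shallow liveness: only proves the process can answer HTTP at all.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Deep readiness: proves the node can actually do useful work right now.
	http.Handle("/ready", readyHandler(pingDatabase))
	_ = http.ListenAndServe(":8080", nil)
}
```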
Byzantine Failures: The Most General Failure Model
A Byzantine failure occurs when a node behaves arbitrarily — sending conflicting messages to different nodes, producing incorrect outputs, lying about its state, or acting in ways that actively subvert the system’s correctness. The term comes from the Byzantine Generals Problem introduced by Lamport, Shostak, and Pease in 1982, which formalised the impossibility of reaching agreement when some nodes may be actively deceptive.
Byzantine failures are the superset that contains all other failure models. A node that crashes has stopped behaving — that is Byzantine behaviour (the degenerate case of sending no messages). A node that drops messages is exhibiting omission behaviour — also a special case of Byzantine. The defining characteristic of full Byzantine failure is that the node may send syntactically correct but semantically wrong or contradictory messages to different observers, making it impossible for honest nodes to distinguish the Byzantine node from a correct one through direct observation.
In practice, Byzantine failures arise from three sources: hardware faults that produce silent data corruption (bit flips in memory, incorrect computation in CPUs), software bugs that produce inconsistent outputs under specific conditions, and malicious actors in adversarial environments (blockchain validators, untrusted participants in multi-party computation).
Byzantine fault-tolerant algorithms — PBFT, Tendermint, HotStuff — can tolerate up to f Byzantine nodes in a cluster of 3f+1 nodes. They need more replicas than crash-tolerant algorithms — 3f+1 versus the 2f+1 required to tolerate f crash failures — and significantly more communication rounds, making them expensive. Production distributed systems — databases, coordination services, stream processors — almost universally use crash-stop assumptions because Byzantine failures in controlled datacenter environments are rare and the cost of BFT consensus is prohibitive. Blockchain and payment systems use BFT specifically because they operate in adversarial environments where participants cannot be trusted.
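The arithmetic behind those replica counts, as a small sketch:

```go
package main

import "fmt"

// clusterSizes contrasts the replica counts behind the two models:
// crash-tolerant consensus (Raft, Paxos) needs 2f+1 nodes with a quorum of
// f+1, while Byzantine-tolerant consensus (PBFT, HotStuff) needs 3f+1 nodes
// with a quorum of 2f+1, because up to f of the voters may be lying.
func clusterSizes(f int) (crashNodes, crashQuorum, bftNodes, bftQuorum int) {
	return 2*f + 1, f + 1, 3*f + 1, 2*f + 1
}

func main() {
	for f := 1; f <= 3; f++ {
		cn, cq, bn, bq := clusterSizes(f)
		fmt.Printf("tolerate f=%d: crash model %d nodes (quorum %d), BFT %d nodes (quorum %d)\n",
			f, cn, cq, bn, bq)
	}
	// f=1: 3 nodes versus 4; f=3: 7 nodes versus 10 - plus extra message
	// rounds, which is why datacenter systems rarely pay for BFT.
}
```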
Two Additional Failure Categories That Matter in Production
Correlated Failures
Correlated failures occur when multiple independent components fail simultaneously due to a shared underlying cause. This is the failure class that breaks the assumptions of most fault-tolerance designs.
A three-node Raft cluster tolerates one node failure. This guarantee assumes failures are independent — the probability of two nodes failing simultaneously is negligible. When failures are correlated — all three nodes are in the same physical rack that loses power, all three run the same software version that contains a bug triggered by a specific input, all three are in the same availability zone that experiences a network partition — the independence assumption breaks and the cluster’s fault tolerance guarantee does not hold.
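A back-of-envelope calculation shows how much the independence assumption is worth. The probabilities below are illustrative, not measured:

```go
package main

import "fmt"

// quorumLossIndependent is the probability that at least 2 of 3 replicas
// fail when each fails independently with probability p over some window:
// P = 3*p^2*(1-p) + p^3.
func quorumLossIndependent(p float64) float64 {
	return 3*p*p*(1-p) + p*p*p
}

func main() {
	p := 0.01 // assumed per-node failure probability over the window

	// Independent replicas: losing quorum requires two rare events to coincide.
	fmt.Printf("independent replicas: %.6f\n", quorumLossIndependent(p)) // ~0.000298

	// Same rack: a single rack power failure (assume probability 0.005) takes
	// all three replicas down at once, so quorum loss is at least that likely
	// no matter how reliable the individual nodes are.
	rackFailure := 0.005
	fmt.Printf("correlated (shared rack): >= %.6f\n", rackFailure)
	// The shared cause dominates: correlation, not node quality, sets the floor.
}
```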
Correlated failures explain why production systems use availability zones and regions rather than just replicating within a single datacenter, why deployment practices stagger rollouts across nodes rather than updating all at once, and why Kubernetes anti-affinity rules spread pods across failure domains. The goal is not just redundancy but independent redundancy — copies that cannot fail for the same reason at the same time.
Network Partitions
A network partition occurs when nodes are alive and functioning correctly but cannot communicate with each other, splitting the cluster into isolated groups. Partitions are not a failure of the nodes themselves — they are a failure of the communication channel between groups of nodes.
Network partitions are the failure class that the CAP theorem addresses directly, as covered in Post 3.4. During a partition, each isolated group must decide independently whether to continue operating (sacrificing consistency) or to stop accepting writes (sacrificing availability). The 2012 AWS US-East-1 outage demonstrated partitions at cloud scale — the network split isolated groups of nodes that continued operating independently, producing state divergence that required careful reconciliation when the partition healed.
Partitions are more common than most engineers expect. They occur due to router failures, misconfigured firewalls, BGP route flapping, cloud networking issues, and traffic spikes that exhaust network buffer capacity. In any system that communicates over a real network and runs for years in production, partitions will occur. The question is not whether to design for them but how to define the system’s behaviour when they do.
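A sketch of the consistency-preserving choice: the node accepts a write only if it can still reach a majority, so the minority side of a partition stops accepting writes rather than diverging. The function is illustrative, not any database's actual admission logic:

```go
package main

import (
	"errors"
	"fmt"
)

// acceptWrite sketches the availability-versus-consistency decision each side
// of a partition must make. This node chooses consistency: it only accepts a
// write if it, plus the peers it can still reach, form a majority.
func acceptWrite(reachablePeers, clusterSize int) error {
	quorum := clusterSize/2 + 1
	if reachablePeers+1 < quorum { // +1 counts this node itself
		return errors.New("no quorum: refusing write to preserve consistency")
	}
	return nil
}

func main() {
	// A 5-node cluster split 3/2 by a partition.
	fmt.Println("majority side:", acceptWrite(2, 5)) // <nil>: 3 nodes reachable, quorum held
	fmt.Println("minority side:", acceptWrite(1, 5)) // error: 2 nodes cannot form a quorum
	// An AP system would accept writes on both sides and reconcile the divergence later.
}
```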
How Failure Models Determine Algorithm Choice
The failure model assumed by an algorithm is not a footnote — it is a correctness precondition. Using an algorithm in a failure environment more severe than it assumes produces correctness violations that only appear under production conditions, not in testing.
| Algorithm | Failure model assumed | Breaks under |
|---|---|---|
| Raft | Crash-stop / crash-recovery | Byzantine failures, correlated failures exceeding quorum |
| Paxos | Crash-stop / crash-recovery | Byzantine failures, correlated failures exceeding quorum |
| PBFT / Tendermint | Byzantine (up to f out of 3f+1) | More than f Byzantine nodes, network partition without quorum |
| Gossip protocols | Omission and timing failures | Persistent partitions, Byzantine failures |
| Two-phase commit | Crash-recovery with stable storage | Coordinator crash during phase 2 (blocking), Byzantine failures |
| Phi accrual detector | Timing failures with statistical heartbeat model | Byzantine failures, systematic clock skew |
The practical implication: before choosing a replication protocol, consensus algorithm, or failure detector, identify the failure classes your system must tolerate. Most production systems in controlled datacenter environments can safely assume crash-recovery failures as the worst case. Systems operating across untrusted networks or with adversarial participants require Byzantine fault tolerance. Systems with correlated failure risk — shared hardware, shared software, shared network path — require redundancy across independent failure domains, not just redundant replicas.
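One row in the table deserves a concrete illustration before Post 4.4 treats it fully: the phi accrual detector replaces a hard timeout with a suspicion level derived from the observed heartbeat distribution. A minimal sketch, assuming normally distributed heartbeat gaps and illustrative numbers:

```go
package main

import (
	"fmt"
	"math"
)

// phi sketches the core of the phi accrual idea: instead of a binary
// alive/dead verdict at a fixed timeout, output a suspicion level based on
// how unlikely the current heartbeat gap is, given the observed distribution
// of past gaps (modelled here as a normal distribution).
func phi(timeSinceLastMs, meanGapMs, stdDevGapMs float64) float64 {
	// Probability that a heartbeat arrives even later than the current gap.
	z := (timeSinceLastMs - meanGapMs) / (stdDevGapMs * math.Sqrt2)
	pLater := 0.5 * math.Erfc(z)
	return -math.Log10(pLater)
}

func main() {
	// Heartbeats normally arrive every ~100 ms with ~20 ms of jitter.
	fmt.Printf("gap 120ms: phi=%.2f\n", phi(120, 100, 20)) // mild suspicion
	fmt.Printf("gap 200ms: phi=%.2f\n", phi(200, 100, 20)) // strong suspicion
	fmt.Printf("gap 400ms: phi=%.2f\n", phi(400, 100, 20)) // near-certain failure
	// A threshold on phi replaces the hard timeout; Post 4.4 covers the full
	// detector, including how the gap statistics are maintained over time.
}
```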
The 2021 Facebook Outage: Multiple Failure Classes in One Incident
Real production outages rarely involve a single clean failure class. The 2021 Facebook BGP withdrawal incident illustrates how multiple failure types interact to produce systemic outages.
The initial trigger was a configuration fault — a faulty command issued during routine backbone maintenance disconnected Facebook’s data centres from each other, and the authoritative DNS servers responded by withdrawing their BGP route announcements, making the entire Facebook network unreachable from the public internet.
This immediately produced network partition failures — with the backbone down, Facebook’s internal services could not reach each other, and the DNS infrastructure they relied on for discovery was unreachable as well. Each service group became an isolated island, unable to communicate with its dependencies.
The partition produced omission failures throughout the internal service mesh — services that called other services received no response. These omission failures were indistinguishable from crash failures to the calling services, which triggered retry storms and circuit breaker trips across the entire platform.
The configuration fault then became a crash-recovery problem — when Facebook engineers attempted to restore access to the routers to fix the BGP configuration, they could not reach them remotely because the network they needed to reach them through was the same network that was down. Physical access was required, introducing hours of recovery time.
Gray failures appeared at the edges — some services partially recovered but served degraded or incorrect responses because their dependencies were partially available. These were harder to detect and resolve than the clean outage of the fully unreachable services.
The incident lasted approximately six hours and removed Facebook, Instagram, and WhatsApp from the internet simultaneously. No single failure class caused it. The initial configuration fault triggered a cascade through partition, omission, crash-recovery, and gray failure modes simultaneously. Understanding failure taxonomy is precisely what allows engineers to decompose such incidents clearly — and to design systems where failure classes are isolated rather than compounded.
What This Means for System Design
Every resilience mechanism in Part 4 is designed for a specific failure class. Understanding which failure classes your system faces determines which mechanisms are necessary and which are insufficient.
Mapping failure class to mechanism:
- Crash-stop failures only — a controlled environment where nodes fail cleanly and do not restart: simple leader election with a quorum is sufficient.
- Crash-recovery failures — the production norm: epoch-based leadership, fencing tokens, and state reconciliation on recovery are needed (see the fencing-token sketch after this list).
- Omission failures — unavoidable in any real network: retries with idempotency, circuit breakers, and timeout tuning are necessary.
- Gray failures — partial functioning that passes health checks: application-level readiness probes, synthetic monitoring, and anomaly detection are required.
- Correlated failures — shared infrastructure, shared software versions: redundancy across independent failure domains is the only mitigation.
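A minimal sketch of the fencing-token check referenced above — the `store` type and its `write` method are illustrative, not an API from any particular system:

```go
package main

import (
	"errors"
	"fmt"
)

// store remembers the highest epoch (fencing token) it has seen and rejects
// writes from any client presenting an older one - typically a leader that
// crashed, recovered, and still believes it holds a lease it has since lost.
type store struct {
	highestToken int64
	data         map[string]string
}

var errFenced = errors.New("stale fencing token: writer lost leadership while down")

func (s *store) write(token int64, key, value string) error {
	if token < s.highestToken {
		return errFenced
	}
	s.highestToken = token
	s.data[key] = value
	return nil
}

func main() {
	s := &store{data: map[string]string{}}
	fmt.Println(s.write(2, "config", "v2")) // current leader, token 2: accepted
	fmt.Println(s.write(1, "config", "v1")) // recovered old leader, token 1: rejected
}
```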
Byzantine fault tolerance is required only in adversarial environments. For most production distributed systems in controlled datacenter environments, crash-recovery with correlated failure mitigation is the appropriate target.
Key Takeaways
- Fault, error, and failure are distinct — a fault is a defect, an error is the incorrect internal state it causes, a failure is the externally observable incorrect behaviour; fault-tolerant design focuses on containing error propagation, not just preventing faults
- The four classical failure models — crash-stop, crash-recovery, omission, and timing — are ordered by severity; each is a superset of the previous, and Byzantine failures are the most general model that includes all others
- Gray failures — partial functioning that passes health checks — are the hardest failure class in production and require application-level monitoring that measures correctness rather than just process liveness
- Correlated failures break the independence assumption of most fault-tolerance designs; redundancy across independent failure domains (racks, availability zones, regions) is required, not just redundant replicas on shared infrastructure
- Network partitions are not node failures — isolated nodes continue operating, producing state divergence that must be reconciled when the partition heals; they are the failure class CAP addresses directly
- Failure model assumptions are correctness preconditions for algorithms — using Raft in an environment with Byzantine failures produces correctness violations; identifying your system’s failure classes before choosing algorithms is a prerequisite for correct design
- Real production outages are typically composites of multiple failure classes — the Facebook 2021 BGP incident combined configuration faults, network partitions, omission failures, crash-recovery problems, and gray failures simultaneously
Frequently Asked Questions (FAQ)
What is a failure taxonomy in distributed systems?
A failure taxonomy is a classification of the different ways components in a distributed system can fail. Classical distributed systems literature — Tanenbaum and van Steen, Coulouris et al., Kleppmann — categorises failures into crash-stop, crash-recovery, omission, timing, and Byzantine failure models. Modern production systems add gray failures and correlated failures. The taxonomy matters because different failure classes require different detection strategies and different resilience mechanisms — designing for the wrong failure class produces systems that fail in the ways they were designed to tolerate while being completely unprepared for the failures they actually experience.
What is the difference between a fault, an error, and a failure?
A fault is the underlying defect — a hardware component failing, a software bug, a configuration error. An error is the incorrect internal state the fault causes — a replica holding stale data, a process with corrupted memory. A failure is the externally observable incorrect behaviour that results when an error propagates — a client receiving wrong data, a service returning an error. A system may contain faults without failing if the error propagation is contained by fault tolerance mechanisms. This distinction, from Laprie’s dependability framework adopted by Tanenbaum and Coulouris, is foundational for fault-tolerant system design.
What is a gray failure in distributed systems?
A gray failure is a partial functioning failure — a component that passes health checks and appears alive to monitoring systems but is producing incorrect, inconsistent, or degraded results for some requests. Unlike crash failures (the node stops) or Byzantine failures (the node behaves arbitrarily), gray failures are subtle: the process is running, the port is responding, but application-level correctness is compromised. Gray failures are common in production — a database connection pool that is exhausted for some threads while others succeed, a replica serving stale data for some keys, a service that handles 90% of requests correctly and silently fails 10%. They require application-level health checking and anomaly detection to detect.
What is a Byzantine failure and when does it matter?
A Byzantine failure occurs when a node behaves arbitrarily — sending conflicting messages to different peers, producing incorrect outputs, or actively subverting the protocol. It is the most general failure model: a Byzantine node can exhibit any behaviour, including mimicking crash failures or omission failures, while also sending malicious or contradictory messages. Byzantine fault tolerance is required in adversarial environments — blockchain validators, payment networks with untrusted participants, military systems. Most enterprise distributed systems use crash-recovery assumptions because Byzantine failures in controlled datacenter environments are rare and BFT algorithms (PBFT, Tendermint) are significantly more expensive to run than crash-tolerant algorithms (Raft, Paxos).
What are correlated failures and why do they break fault tolerance?
Correlated failures occur when multiple independent components fail simultaneously due to a shared underlying cause — a shared power supply, a shared software version with a triggered bug, a shared network path that fails. Most fault-tolerance designs assume failures are independent: the probability of two out of three replicas failing simultaneously is low. When failures are correlated, this independence assumption breaks and the fault tolerance guarantee does not hold. A three-node cluster in the same availability zone offers no protection against an AZ-level failure. Correlated failure mitigation requires redundancy across genuinely independent failure domains — separate racks, separate availability zones, separate regions — not just redundant replicas on shared infrastructure.
How do I know which failure model to design for?
Start by identifying the failure classes your deployment environment actually produces. Controlled datacenter environments with trusted hardware and software produce crash-recovery and omission failures as the baseline — design for these. Any production system running for years at scale will experience network partitions — design for these explicitly. Shared infrastructure (same rack, same AZ, same software version) introduces correlated failure risk — design redundancy across independent failure domains. If your system operates in adversarial environments with untrusted participants, add Byzantine fault tolerance. Gray failure detection should be universal — add application-level health checking that measures correctness rather than just process liveness regardless of your other failure model assumptions.
Continue the Series
Series home: Distributed Systems — Concepts, Design & Real-World Engineering
Part 4 — Fault Tolerance & High Availability Overview
- 4.1 — Failure Taxonomy: How Distributed Systems Fail
- 4.2 — Fault Tolerance vs High Availability: Understanding the Difference
- 4.3 — Redundancy Patterns in Distributed Systems
- 4.4 — Failure Detection: Heartbeats, Timeouts and the Phi Accrual Detector
- 4.5 — Recovery and Self-Healing Systems
- 4.6 — Designing for High Availability: Patterns and Trade-offs
- 4.7 — Fault Isolation and Bulkheads
- 4.8 — Observability and Diagnosing Distributed Failures
- 4.9 — Chaos Engineering and Resilience Culture
Previous: ← 3.9 — Engineering Guidelines: Replication, Consistency and Consensus
Next: 4.2 — Fault Tolerance vs High Availability: Understanding the Difference →
Foundation posts this builds on: