Distributed Systems Series — Part 4.2: Fault Tolerance & High Availability
Two Goals That Sound the Same and Are Not
Fault tolerance and high availability are the two most frequently conflated concepts in distributed systems engineering. Engineers use them interchangeably in architecture discussions, design documents, and system reviews. This conflation is not just imprecise — it produces concrete architectural mistakes that only surface under production failure conditions.
A system designed for high availability but not fault tolerance may remain responsive during failures while silently returning incorrect data. A system designed for fault tolerance but not high availability may preserve correctness during failures while becoming unavailable for minutes at a time. Neither is acceptable for most production systems — but the two properties require fundamentally different design approaches, and the first step to achieving both is understanding exactly how they differ.
Post 4.1 established the failure taxonomy — the different classes of failures a distributed system can experience. This post establishes how fault tolerance and high availability respond to those failures differently, what each property actually measures, and how production systems are designed to achieve both simultaneously.
Precise Definitions
High availability is a measurement — the proportion of time a system is operational and able to serve user requests. It is expressed as a percentage and defines the user-visible reliability contract. A system with 99.9% availability is unavailable for no more than 8.7 hours per year. A system with 99.99% availability is unavailable for no more than 52 minutes per year. High availability answers the question: can users reach the system and get a response?
Fault tolerance is a correctness property — the ability of a system to continue operating correctly when components fail. As Coulouris et al. define it in Distributed Systems: Concepts and Design: “the ability of a system to continue operating properly in the event of the failure of some of its components.” Fault tolerance answers a different question: does the system produce correct results under failure?
The critical distinction is that high availability is about responding, and fault tolerance is about responding correctly. A system can achieve one without the other. Understanding this asymmetry is the foundation of Part 4.
The Availability Nines: What Each Level Means in Practice
Availability targets are expressed in nines — the number of nines in the availability percentage. Each additional nine reduces acceptable downtime by roughly an order of magnitude and increases the engineering investment required to achieve it.
| Availability | Downtime per year | Downtime per month | Downtime per week | Typical use case |
|---|---|---|---|---|
| 99% (two nines) | 87.6 hours | 7.3 hours | 1.7 hours | Internal tools, dev environments |
| 99.9% (three nines) | 8.7 hours | 43.8 minutes | 10.1 minutes | Standard SaaS, non-critical services |
| 99.95% | 4.4 hours | 21.9 minutes | 5 minutes | Many managed cloud service SLAs |
| 99.99% (four nines) | 52.6 minutes | 4.4 minutes | 1 minute | Payment processing, financial APIs |
| 99.999% (five nines) | 5.3 minutes | 26.3 seconds | 6 seconds | Emergency services, aviation, trading |
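The downtime column of the table above follows directly from the availability target. A minimal sketch of the arithmetic (the function name is an assumption for illustration):

```python
# Reproduces the downtime-per-year column of the table above.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability_pct):
    """Maximum downtime per year implied by an availability percentage."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR

for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    if minutes >= 60:
        print(f"{target}% -> {minutes / 60:.1f} hours/year")
    else:
        print(f"{target}% -> {minutes:.1f} minutes/year")
```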
Each step up in nines requires dramatically more investment. Moving from 99.9% to 99.99% typically requires active-active multi-region deployment, automated failover tested and validated in production, and a monitoring and incident response capability that can detect and remediate issues within two to three minutes. Moving to 99.999% requires eliminating all planned maintenance windows, investing in hardware redundancy at every layer, and achieving mean time to recovery (MTTR) measured in seconds rather than minutes.
Two operational metrics determine where a system sits in the availability nines. Mean time between failures (MTBF) measures how long a system runs between failures — improving MTBF means making the system more reliable, reducing fault frequency. Mean time to recovery (MTTR) measures how long the system takes to recover from a failure — improving MTTR means making recovery faster, regardless of how often failures occur. Availability is fundamentally a function of both: a system that fails rarely but takes an hour to recover may achieve the same availability as a system that fails frequently but recovers in seconds.
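The relationship in this paragraph is usually written as Availability = MTBF / (MTBF + MTTR). A small sketch of the trade-off just described, with illustrative numbers:

```python
def availability(mtbf, mttr):
    """Availability as a fraction: MTBF / (MTBF + MTTR), same time unit."""
    return mtbf / (mtbf + mttr)

# A system that fails monthly but takes an hour to recover (720 h MTBF,
# 1 h MTTR) has exactly the same availability as one that fails daily but
# recovers in two minutes (24 h MTBF, 1/30 h MTTR): both reduce to 720/721.
rare_slow = availability(mtbf=720, mttr=1)
frequent_fast = availability(mtbf=24, mttr=1 / 30)
```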
This is where fault tolerance and high availability connect most directly. Fault tolerance — the ability to continue operating correctly during failure — reduces MTTR by allowing the system to self-heal without human intervention. Automated leader election, automatic replica promotion, and self-healing orchestration all improve availability by reducing recovery time, even without reducing failure frequency.
The SRE Error Budget Framework
Google’s Site Reliability Engineering practice introduced the error budget as the operational framework for managing availability targets. An error budget is the amount of unavailability a service is allowed to have in a given period — the complement of its availability target.
A service with a 99.9% availability target has an error budget of 0.1% — 8.7 hours per year, or 43.8 minutes per month. Every minute of downtime consumes that budget. When the budget is exhausted, the SRE framework halts feature deployments until the next budget period — because new feature deployments are statistically the largest source of reliability incidents, and a service that has already consumed its error budget cannot afford additional deployment risk.
The error budget framework creates a direct, quantified connection between reliability engineering investment and business outcomes. It makes the cost of each availability nines level explicit: moving from 99.9% to 99.99% reduces the error budget from 43.8 minutes per month to 4.4 minutes per month, which requires investment in automated failover, chaos engineering validation, and incident response speed that might not be justified for an internal tool but is essential for a payment API.
Fault tolerance contributes to error budget preservation by reducing the recovery time for each failure. A system that takes ten minutes to elect a new leader after a primary failure consumes ten minutes of error budget per failure event. A system with automated leader election that recovers in fifteen seconds consumes fifteen seconds per failure event. For a system that experiences leader failures once per month, this difference is the margin between 99.9% and 99.99% availability.
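The arithmetic behind that margin can be checked directly, assuming one leader failure per month as in the paragraph above:

```python
# Monthly error budget at each availability target (365-day year / 12 months).
MONTH_MINUTES = 365 * 24 * 60 / 12  # 43,800

budget_three_nines = 0.001 * MONTH_MINUTES    # 43.8 minutes per month
budget_four_nines = 0.0001 * MONTH_MINUTES    # 4.38 minutes per month

manual_recovery = 10.0        # minutes consumed per leader failure
automated_recovery = 15 / 60  # 0.25 minutes with automated leader election

# Ten-minute recoveries fit a three-nines budget but exhaust four nines;
# fifteen-second recoveries leave the four-nines budget almost untouched.
```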
Fault Tolerance Without High Availability
A system can be fault tolerant — preserving correctness under failure — while providing limited availability during the recovery period.
The clearest example is a Raft-based replicated database during leader election. When the leader fails, Raft requires that the remaining followers hold an election. The election process — timeout detection, vote solicitation, majority acknowledgement — takes several seconds in a well-tuned cluster and can take much longer in a cluster with conservative timeout settings, repeated split votes, or network instability. During this window, the cluster does not accept writes. It is not unavailable due to a bug or a misconfiguration — it is unavailable by design, because accepting writes without a confirmed leader would risk split-brain and data loss.
The system is fault tolerant: it will elect a new leader, preserve all committed writes, and resume correct operation. But it is not continuously available: users attempting writes during the election window receive errors or timeouts. ZooKeeper, etcd, and most Paxos-based systems exhibit the same behaviour — they choose correctness over availability during partition and failure events, exactly as the CAP theorem predicts.
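The behaviour can be sketched with a toy model of the write path around a leader failure. This is an illustrative simplification, not the real Raft protocol — real Raft also tracks terms, log matching, and split votes — and the class and method names are assumptions:

```python
import random

class ToyCluster:
    """Toy model of the write path around a leader failure (illustrative
    only; real Raft also handles terms, log matching, and split votes)."""

    def __init__(self, nodes):
        self.nodes = set(nodes)
        self.leader = min(self.nodes)
        self.log = []  # committed writes survive leader changes

    def write(self, value):
        if self.leader is None:
            raise RuntimeError("no leader: election in progress")
        self.log.append(value)

    def fail_leader(self):
        self.nodes.discard(self.leader)
        self.leader = None  # the election window begins; writes are rejected

    def elect(self, rng):
        # Randomized election timeouts break ties: the first node to time
        # out becomes the candidate and (uncontested here) wins the vote.
        timeouts = {n: rng.uniform(150, 300) for n in self.nodes}
        self.leader = min(timeouts, key=timeouts.get)

cluster = ToyCluster(["n1", "n2", "n3"])
cluster.write("a")
cluster.fail_leader()
rejected = False
try:
    cluster.write("b")  # fails: fault tolerant, but not available right now
except RuntimeError:
    rejected = True
cluster.elect(random.Random(1))
cluster.write("b")      # service resumes; the earlier write "a" was preserved
```

The key property the sketch captures: the rejection during the election window is deliberate, and every write accepted before the failure is still in the log afterwards.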
Two-phase commit demonstrates this property more starkly. In 2PC, a coordinator sends prepare messages to all participants. If the coordinator crashes after some participants have voted yes but before sending commit, the system enters a blocking state — participants have locked resources and cannot proceed without knowing the coordinator’s decision. The system is fault tolerant in the sense that it will not commit an incorrect result, but it is completely unavailable until the coordinator recovers or a new coordinator is elected through a separate recovery protocol. This is the fundamental limitation of 2PC that three-phase commit and Paxos-based transaction protocols were designed to address.
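The blocking state can be shown with a simplified participant state machine — a sketch of the idea, not a full 2PC implementation (names are assumptions):

```python
from enum import Enum

class State(Enum):
    INIT = "init"
    PREPARED = "prepared"    # voted yes, locks held, awaiting the decision
    COMMITTED = "committed"
    ABORTED = "aborted"

class Participant:
    """Simplified 2PC participant: after voting yes it must hold its locks
    until the coordinator's decision arrives."""

    def __init__(self):
        self.state = State.INIT

    def prepare(self):
        self.state = State.PREPARED  # resources are locked from here on
        return "yes"

    def decide(self, decision):
        # The step that never arrives if the coordinator crashes.
        self.state = State.COMMITTED if decision == "commit" else State.ABORTED

p1, p2 = Participant(), Participant()
votes = [p1.prepare(), p2.prepare()]  # both participants vote yes
# The coordinator crashes here, before sending commit or abort. Neither
# participant can safely proceed: nothing incorrect is committed (fault
# tolerance), but both block, locks held, until the coordinator returns.
```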
High Availability Without Fault Tolerance
A system can remain highly available — responding to every request — while providing incorrect or inconsistent results. This is the more dangerous of the two failure modes because it is less visible. Users do not see errors. Monitoring systems may report green. The system appears healthy while serving wrong data.
Amazon’s DynamoDB in its default eventually consistent mode demonstrates this trade-off explicitly. DynamoDB continues accepting reads and writes during network partitions and node failures. It remains highly available — every request receives a response. But a read immediately after a write may return the old value if it is served by a replica that has not yet received the write. The system is prioritising availability over consistency, exactly as its AP classification in the CAP theorem predicts. For many DynamoDB use cases — shopping carts, session data, preference storage — this is the correct trade-off. For operations where stale reads produce incorrect business outcomes — inventory checks before purchase, balance reads before debit — the application must use strongly consistent reads and accept the higher latency cost.
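The stale-read window can be illustrated with a toy model of asynchronous replication. This is not DynamoDB's implementation — the `consistent` flag merely mirrors the idea behind its strongly consistent read option, and all names here are assumptions:

```python
class EventuallyConsistentStore:
    """Toy model of asynchronous replication (illustrative only): writes
    land on the primary and reach the replica only when replicate() runs."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []  # writes not yet applied to the replica

    def put(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))

    def get(self, key, consistent=False):
        # consistent=True mimics a strongly consistent read, served from the
        # primary at higher cost. The default read may return stale data.
        store = self.primary if consistent else self.replica
        return store.get(key)

    def replicate(self):
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

db = EventuallyConsistentStore()
db.put("stock", 5)
stale = db.get("stock")                   # None: the replica has not caught up
fresh = db.get("stock", consistent=True)  # 5: read from the primary
db.replicate()
caught_up = db.get("stock")               # 5: replicas converge eventually
```

Every request gets a response, so the store is available throughout — but the default read after the write returns stale data, which is exactly the trade-off described above.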
A more dangerous example of HA without FT is silent data corruption. If a database replica develops a subtle bug that produces incorrect query results for specific data patterns, and the load balancer continues routing queries to that replica because its health check returns 200 OK, the system is highly available while serving wrong data to some users. This is a gray failure — one of the failure classes introduced in Post 4.1. The system appears healthy to infrastructure monitoring while being incorrectly available at the application layer.
This failure mode appears in distributed caches when a cache node serves stale data after an invalidation that did not propagate correctly, in CDN edge nodes that serve cached content after the origin has been updated, and in multi-leader replication scenarios where conflict resolution produces an incorrect winner that propagates to all replicas.
The Relationship: Fault Tolerance Enables High Availability
The correct mental model is that fault tolerance is the mechanism and high availability is the outcome. A system achieves high availability through fault tolerance — by detecting failures quickly, recovering automatically, and preserving correct state through the recovery process. The two properties are not alternatives. They are layers: fault tolerance is the engineering foundation, high availability is the user-visible result.
This relationship is captured in the design principle that Kleppmann articulates in Designing Data-Intensive Applications: a reliable system is one that continues to work correctly even when things go wrong. Reliability — which encompasses both correctness and availability — is achieved through fault tolerance mechanisms that allow the system to tolerate faults before they become failures.
The design order follows from this relationship: correctness first, then durability, then availability. A system that achieves high availability by sacrificing correctness is not a reliable system — it is a system that hides failures rather than surviving them. When the hidden failures accumulate or are discovered, the recovery cost is higher than it would have been had correctness been maintained throughout.
Techniques for Each Property
Because fault tolerance and high availability are different properties, they require different techniques.
High availability is primarily achieved through infrastructure and traffic management. Load balancing distributes traffic across multiple instances so no single instance is a single point of failure. Health checks and automatic failover detect unhealthy instances and remove them from the serving pool before clients experience failures. Active-active deployment runs redundant instances in all regions simultaneously so that traffic can be immediately rerouted when one region fails. Multi-region routing with latency-based or health-based policies directs users to the nearest healthy region automatically.
These mechanisms reduce downtime by reducing MTTR — failures are detected faster and traffic is rerouted faster. But they do not inherently protect against split-brain, inconsistent replicas, or data corruption. A load balancer that routes traffic away from a crashed primary and toward a replica that has not yet received the latest writes has preserved availability while sacrificing consistency.
Fault tolerance requires coordination mechanisms that preserve correctness. Consensus protocols — Raft, Paxos — ensure that only one node acts as leader at any time and that all committed writes are durably replicated before being acknowledged. Write-ahead logging ensures that committed writes survive node crashes by persisting them to durable storage before acknowledging to clients. Quorum-based replication ensures that writes are acknowledged only when a majority of replicas have recorded them, so any subsequent read quorum is guaranteed to include at least one node with the latest write. Deterministic state machines ensure that all replicas apply the same sequence of operations and arrive at the same state, so any replica can serve as a consistent source of truth.
These mechanisms increase correctness guarantees but typically reduce availability during failure events — they require coordination that takes time and may pause operations while a quorum is established or a new leader is elected.
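The quorum overlap guarantee mentioned above can be verified exhaustively for a small cluster. A sketch, with illustrative values of N, W, and R:

```python
import itertools

# With N replicas, write quorum W and read quorum R, the condition W + R > N
# guarantees every read quorum shares at least one node with every write
# quorum, so a quorum read always sees the latest acknowledged write.
N = 5
nodes = range(N)

def quorums_always_overlap(w, r):
    """Check every write-quorum/read-quorum pair for a shared node."""
    return all(
        set(write_q) & set(read_q)
        for write_q in itertools.combinations(nodes, w)
        for read_q in itertools.combinations(nodes, r)
    )

overlap_holds = quorums_always_overlap(3, 3)  # 3 + 3 > 5: always overlaps
overlap_fails = quorums_always_overlap(2, 2)  # 2 + 2 <= 5: disjoint quorums exist
```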
Common Architecture Mistakes
Treating multi-AZ deployment as fault tolerance. Spreading instances across availability zones improves resilience against infrastructure failures but does not ensure consistent state across replicas, safe leader election, or protection against split-brain. A multi-AZ deployment with asynchronous replication is a high availability architecture. It becomes a fault-tolerant architecture only when combined with synchronous replication, consensus-based leader election, and fencing to prevent stale leaders from accepting writes.
Treating retries as fault tolerance. As established in Post 2.2, retries improve availability by increasing the probability that a transient failure is survived. But retries can harm correctness by causing duplicate writes, inconsistent state transitions, and amplified load that turns a partial degradation into a complete outage. Retries must be combined with idempotency to provide fault tolerance — not substituted for it.
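The combination this paragraph calls for, retries plus server-side idempotency, can be sketched as follows. The service, the key scheme, and all names are assumptions for illustration:

```python
class PaymentService:
    """Illustrative server-side idempotency store: duplicate requests
    carrying the same key replay the stored result instead of re-executing."""

    def __init__(self):
        self.processed = {}  # idempotency key -> stored result
        self.charges = []    # actual side effects

    def charge(self, idempotency_key, amount):
        if idempotency_key in self.processed:
            return self.processed[idempotency_key]  # replay, no second charge
        self.charges.append(amount)                 # side effect happens once
        self.processed[idempotency_key] = "ok"
        return "ok"

svc = PaymentService()
first = svc.charge("order-123", 42)   # response lost in transit (timeout)
second = svc.charge("order-123", 42)  # client retries with the same key
# The retry improved availability; the idempotency key preserved correctness.
```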
Assuming HA implies FT. This is the most common and most dangerous mistake. A highly available system that returns responses to every request is not fault tolerant if those responses are sometimes incorrect. The 2017 AWS S3 availability incident in US-EAST-1 demonstrated this: S3’s availability SLA was compromised during the incident, but the more significant concern for many customers was data integrity — whether the eventual recovery preserved all committed writes correctly. High availability and data durability are separate guarantees that must be designed for separately.
Designing for Both: The Correct Approach
Systems that achieve both fault tolerance and high availability — Google Spanner, CockroachDB multi-region, Amazon Aurora Global — combine the techniques from both domains deliberately.
The design sequence that production systems follow: first, ensure correctness under failure through consensus, write-ahead logging, and quorum-based replication. Second, ensure durability — committed writes survive any single node failure. Third, minimise recovery time through automated leader election, rapid failure detection, and self-healing orchestration. Fourth, deploy redundancy across independent failure domains for availability during infrastructure failures. The sequence matters: availability built on top of an incorrect or non-durable foundation produces systems that appear reliable under normal conditions and fail catastrophically under stress.
The key insight from Vitillo’s Understanding Distributed Systems is that reliability emerges from the combination of fault tolerance and high availability — neither alone is sufficient. A correct system that is frequently unavailable fails its users. An always-available system that sometimes returns wrong data fails its users. The goal is a system that is both correct and available, achieved by designing for correctness first and then minimising the recovery time that correctness mechanisms require.
Key Takeaways
- High availability and fault tolerance are different properties — availability measures whether the system responds, fault tolerance measures whether it responds correctly; a system can achieve one without the other
- Availability is quantified in nines — each additional nine reduces acceptable downtime by an order of magnitude and requires dramatically more engineering investment in automated recovery and reduced MTTR
- MTBF and MTTR are the operational levers — availability improves by reducing failure frequency (MTBF) or recovery time (MTTR); fault tolerance primarily improves MTTR by enabling automated self-healing
- The SRE error budget framework quantifies the connection between reliability investment and business outcomes — fault tolerance contributes by reducing the recovery time consumed per failure event
- Fault tolerance without high availability produces systems that are correct but pause during recovery — Raft during leader election, ZooKeeper during partition
- High availability without fault tolerance produces systems that respond but may respond incorrectly — eventually consistent databases during partition, caches serving stale data, gray failures that pass health checks
- The correct design order is correctness first, then durability, then availability — availability built on an incorrect foundation produces systems that fail catastrophically rather than gracefully
Frequently Asked Questions (FAQ)
What is the difference between fault tolerance and high availability?
High availability measures whether a system responds to user requests — it is an uptime percentage. Fault tolerance is a correctness property — the ability to continue operating correctly when components fail. A highly available system responds to every request but may return incorrect data during failures. A fault-tolerant system preserves correctness but may be temporarily unavailable during failure recovery. Both properties are required for a genuinely reliable distributed system, and they require different design techniques to achieve.
What does 99.99% availability mean in practice?
99.99% availability allows a maximum of 52.6 minutes of downtime per year, or approximately 4.4 minutes per month. Achieving this requires automated failure detection and failover within minutes, active-active deployment across multiple failure domains, no planned maintenance windows that cause visible downtime, and a mean time to recovery (MTTR) measured in seconds rather than minutes. Each additional nine of availability reduces the acceptable downtime budget by roughly an order of magnitude and requires significantly more engineering investment in automated recovery and monitoring.
What is MTTR and MTBF and how do they relate to availability?
MTBF (mean time between failures) measures how long a system runs between failure events — improving it means reducing failure frequency through more reliable components and better software quality. MTTR (mean time to recovery) measures how long recovery takes after a failure — improving it means making recovery faster through automation, better tooling, and fault-tolerant design. Availability is fundamentally determined by both: Availability = MTBF / (MTBF + MTTR). Fault tolerance primarily improves MTTR by enabling automated self-healing, which improves availability even without reducing failure frequency.
Can a system be fault tolerant but not highly available?
Yes. A Raft-based replicated database during leader election is fault tolerant — it preserves all committed writes and will resume correct operation after election — but not continuously available, as it does not accept writes during the election window. Two-phase commit with a crashed coordinator is fault tolerant — it will not commit an incorrect result — but completely unavailable until the coordinator recovers. Systems that prioritise correctness over availability during partition events are deliberately fault tolerant at the cost of reduced availability, which is the correct trade-off for systems where data loss or inconsistency is unacceptable.
Can a system be highly available but not fault tolerant?
Yes. An eventually consistent database that continues accepting reads and writes during network partitions is highly available — every request receives a response — but not fault tolerant in the strict sense, because reads may return stale data and writes may conflict. A CDN edge node serving cached content after the origin has been updated is highly available but serving incorrect data. A replica with a gray failure — passing health checks but returning incorrect results for some queries — is available but not correct. These systems prioritise availability over correctness during failure events, which is the right trade-off for some use cases and the wrong trade-off for others.
What is an SRE error budget?
An error budget is the amount of unavailability a service is permitted within a given period — the complement of its availability target. A service with a 99.9% availability SLA has an error budget of 0.1%, or 43.8 minutes per month. Every minute of downtime consumes the budget. When the budget is exhausted, Google’s SRE framework halts feature deployments until the budget resets — because deployments are statistically the largest source of reliability incidents. The error budget creates a quantified, business-aligned framework for reliability investment decisions: adding another nine of availability means reducing the monthly error budget from 43.8 minutes to 4.4 minutes, which has a specific, calculable cost in engineering investment and operational capability.
Continue the Series
Series home: Distributed Systems — Concepts, Design & Real-World Engineering
Part 4 — Fault Tolerance & High Availability Overview
- 4.1 — Failure Taxonomy: How Distributed Systems Fail
- 4.2 — Fault Tolerance vs High Availability: Understanding the Difference
- 4.3 — Redundancy Patterns in Distributed Systems
- 4.4 — Failure Detection: Heartbeats, Timeouts and the Phi Accrual Detector
- 4.5 — Recovery and Self-Healing Systems
- 4.6 — Designing for High Availability: Patterns and Trade-offs
- 4.7 — Fault Isolation and Bulkheads
- 4.8 — Observability and Diagnosing Distributed Failures
- 4.9 — Chaos Engineering and Resilience Culture
Previous: ← 4.1 — Failure Taxonomy: How Distributed Systems Fail
Next: 4.3 — Redundancy Patterns in Distributed Systems →
Related posts from earlier in the series:
- 2.2 — Reliability and Retries in Distributed Systems — why retries improve availability but require idempotency for fault tolerance
- 3.4 — The CAP Theorem Correctly Understood — the formal framework for the consistency vs availability trade-off during partitions
- 3.3 — Consistency Models — how different consistency guarantees map to the FT vs HA spectrum
- 3.7 — Paxos vs Raft — the consensus algorithms that provide fault tolerance at the cost of brief availability gaps during leader election