Distributed Systems Series — Part 1.4: Foundations
“Partial failures are the norm, not the exception, in distributed systems.”
A Node Can Be Running and Still Be Failing
When distributed systems fail, engineers look at the network first. Post 1.3 established why that instinct is correct — networks are unreliable, messages are lost, and silence is ambiguous. But equally often, the real problem lies inside the nodes themselves.
A node in a distributed system can be a server, a container, a virtual machine, or a process. The uncomfortable truth that makes distributed systems hard is this: a node can be running, responding to health checks, and holding its leadership position — while producing incorrect results, processing no useful work, or degrading the entire system around it.
Understanding the node failure model — the assumptions a distributed system makes about how its components fail — is as foundational as understanding the network model. Every algorithm’s correctness guarantee rests on specific node failure assumptions. When reality violates those assumptions, the guarantee collapses. This post establishes the node failure vocabulary that Parts 3 and 4 build on directly.
Why Node Failures Are Hard to Reason About
In a single-machine system, failure is usually obvious. The process crashes, logs stop, alerts fire, and an on-call engineer gets paged. In a distributed system, failure is frequently partial, slow, uneven, and ambiguous.
Some nodes work normally while others do not. Some requests succeed while others fail. Some replicas are current while others are stale. From the outside — from the perspective of other nodes and monitoring systems — the cause is not always clear. A node that appears healthy to a TCP health check may be serving incorrect data. A node that appears unreachable may be processing requests correctly. A node that appears to be the leader may have been replaced by a new leader while it was paused.
This is why distributed systems require an explicit node failure model — a set of assumptions about how processes fail that algorithms can be designed against and that engineers can reason about systematically.
Crash-Stop vs Crash-Recovery: The Real-World Gap
The simplest node failure model is crash-stop. A node fails and permanently stops responding — it never sends another message, never rejoins the cluster, never makes another decision. From other nodes’ perspective, a crash-stopped node simply disappears. Algorithms designed for crash-stop can safely ignore a node once it stops responding — there is no risk that it will wake up and take action based on stale state.
Crash-stop is theoretically clean. It is rarely what production systems exhibit.
The realistic model is crash-recovery. A node crashes, restarts, and rejoins the cluster — potentially with stale state from before its failure. A Kubernetes pod killed by the OOM killer and rescheduled on a new node exhibits crash-recovery behaviour. A database process that crashes and is restarted by systemd exhibits crash-recovery. A VM that loses power and reboots exhibits crash-recovery.
The critical difference: a recovered node may not know what happened while it was down. It may attempt to reassert a leadership role it no longer holds. It may replay messages that were already processed. It may attempt to commit a write that another node has already overridden. Crash-recovery requires algorithms to handle a node that believes it is entitled to a role it no longer legitimately occupies — which is significantly more complex than simply ignoring a permanently absent node.
Most production distributed system bugs occur when engineers design for crash-stop behaviour while operating in a crash-recovery environment. The gap between the assumed model and production reality is where subtle correctness violations live.
Restarting a node is frequently harder than losing it.
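The stale-role hazard of crash-recovery can be made concrete. Below is a minimal illustrative sketch — the `TermStore` and `Node` names are hypothetical, not a real consensus API — showing the discipline crash-recovery demands: a node must check the authoritative term at the moment it acts, not the term it remembers from before a pause or restart.

```python
# Hypothetical sketch of term re-validation under crash-recovery.
# TermStore stands in for durable, cluster-agreed state (e.g. a consensus log).

class TermStore:
    def __init__(self):
        self.current_term = 0
        self.leader_id = None

    def elect(self, node_id):
        self.current_term += 1
        self.leader_id = node_id
        return self.current_term

class Node:
    def __init__(self, node_id, store):
        self.node_id = node_id
        self.store = store
        self.term = None  # the term under which this node believes it leads

    def become_leader(self):
        self.term = self.store.elect(self.node_id)

    def write(self, value, log):
        # Crash-recovery discipline: consult the authoritative term at the
        # point of action, not the term remembered from before a pause.
        if self.store.current_term != self.term:
            raise RuntimeError("stale leader: lost leadership while paused")
        log.append((self.term, value))

store = TermStore()
a, b = Node("a", store), Node("b", store)
log = []

a.become_leader()
a.write("x", log)      # succeeds under term 1
b.become_leader()      # a pauses; b is elected under term 2
try:
    a.write("y", log)  # a resumes, still believing it leads under term 1
except RuntimeError:
    pass               # the stale write is rejected, not silently applied
```

Under crash-stop this check would be unnecessary — a stopped node never writes again. Crash-recovery makes it load-bearing.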
The Most Dangerous Failure: The Slow Node
A crashed node is obvious — it stops responding entirely. A slow node is not obvious, and that is precisely what makes it dangerous.
A slow node responds, but late. It processes requests, but slowly. It holds locks past their intended expiry. It sends heartbeats, but with increasing delays. From the perspective of other nodes in the cluster, a slow node is indistinguishable from a crashed node during the window when its responses are delayed beyond the timeout threshold — but it is still alive and still capable of acting on stale assumptions.
This creates the worst of both worlds. The system cannot safely ignore the slow node (it might recover and resume activity on stale state) and cannot safely continue without it (it holds locks or leadership roles that others need). The ambiguity spreads through the system, triggering retries that amplify load, false failover that creates split-brain windows, and cascading timeouts that convert one slow node into a cluster-wide degradation.
GC Pauses and the GitHub Incident of 2012
In 2012, GitHub experienced a significant outage caused not by a crashed node but by a slow one. A MySQL primary database instance underwent an extended stall — behaving exactly like a stop-the-world pause of the kind a garbage collector imposes on a JVM-based service, a period during which the process runs but does no useful work. During this stall, the node was technically running. It responded to basic health checks. It held its leadership position in the replication topology. From the outside, it appeared alive.
The problem was that while the primary was paused, it could not process writes or respond to replication requests. Replicas, detecting the absence of updates, began to suspect the primary had failed — but the failure detectors were not calibrated aggressively enough to trigger a failover quickly. The system entered the worst possible state: a primary that appeared healthy but was doing no useful work, and replicas that suspected failure but had not yet acted on it.
When the GC pause ended and the primary resumed, it briefly tried to reassert its leadership role — creating a split-brain window where two nodes simultaneously believed they were authoritative for the same data.
This is the slow node failure mode in production. The node did not crash. It did not send an error. It simply stopped being useful while the system waited, uncertain what to do. The solution is not better health checks — it is designing leadership to be time-bounded through lease-based mechanisms and fencing tokens, so that a resumed node cannot act on a role it held before its pause. Fencing tokens and lease-based leadership are covered in Post 2.4. Failure detection mechanisms — including the phi accrual detector that Cassandra and Akka use to adaptively detect slow vs failed nodes — are covered in Post 4.4.
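The fencing-token idea can be sketched briefly ahead of the full treatment in Post 2.4. This is an illustrative model, not a real lock-service API — the names `LockService` and `FencedStorage` are invented. The essential point it demonstrates: the resource, not the lock holder, enforces staleness.

```python
# Illustrative fencing-token sketch (hypothetical names, not a real API).
# Tokens increase monotonically; the storage layer rejects anything older
# than the highest token it has already seen.

class LockService:
    """Issues leases with monotonically increasing fencing tokens."""
    def __init__(self):
        self.token = 0

    def acquire(self):
        self.token += 1
        return self.token

class FencedStorage:
    """Rejects any write carrying a token older than the highest seen."""
    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            raise PermissionError(f"fenced: token {token} is stale")
        self.highest_token = token
        self.data[key] = value

locks = LockService()
storage = FencedStorage()

t1 = locks.acquire()            # client 1 holds the lock, token 1
storage.write(t1, "k", "v1")
# client 1 pauses (e.g. a long GC); its lease expires and client 2 acquires:
t2 = locks.acquire()            # token 2
storage.write(t2, "k", "v2")
# client 1 resumes and retries with its old token — the resource fences it:
try:
    storage.write(t1, "k", "v1-retry")
except PermissionError:
    pass
assert storage.data["k"] == "v2"
```

Note that the paused client never needed to detect its own lease expiry — which it cannot reliably do — because the write path rejects it regardless.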
Partial Failure: The Defining Characteristic
Partial failure is what separates distributed systems from single-node systems as a design challenge. A distributed system can be simultaneously correct in some parts and broken in others — and this is its normal operating state at scale, not an exceptional condition.
One replica is stale while another is current. One service instance is healthy while another is overloaded. One region is reachable while another is unreachable. One shard is processing writes correctly while another is experiencing high latency. The system as a whole is neither fully operational nor fully down — it occupies a continuous spectrum of partial health.
This breaks the binary thinking that single-machine experience instils. Distributed systems require probabilistic reasoning: what fraction of nodes are healthy, what fraction of requests are succeeding, what is the probability that a given operation will encounter a failed component. Engineers who treat distributed system health as a binary property — up or down — consistently design systems that degrade more severely than necessary when partial failures occur.
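The probabilistic framing is worth quantifying with back-of-envelope arithmetic (the numbers below are illustrative, assuming independent failures): even with each node healthy 99.9% of the time, a request that fans out to 100 nodes has nearly a 10% chance of touching at least one unhealthy node.

```python
# Probability that a fan-out request touches at least one unhealthy node,
# assuming each node is independently healthy with the given probability.
def p_hits_failure(per_node_health: float, fanout: int) -> float:
    return 1.0 - per_node_health ** fanout

print(round(p_hits_failure(0.999, 1), 4))    # single node: 0.001
print(round(p_hits_failure(0.999, 100), 4))  # 100-node fan-out: 0.0952
```

This is why "the cluster is up" is not a meaningful statement at scale — some fraction of operations is always encountering a degraded component.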
Gray Failures: The Hardest Failure Class
Classical distributed systems literature distinguishes crash-stop, crash-recovery, and Byzantine failures. Production systems have added a fourth class that is arguably the hardest to detect and handle: the gray failure.
A gray failure occurs when a component is partially functioning — it passes health checks, responds to pings, appears alive to monitoring systems — but is producing incorrect or degraded results for a subset of requests. A database replica that returns wrong query results for specific data patterns passes all TCP health checks while silently corrupting reads. A service instance with an exhausted connection pool accepts new HTTP connections while failing all requests that require database access. A cache node that has hit a memory limit accepts writes while silently evicting reads that subsequent requests depend on.
Gray failures are invisible to heartbeat-based failure detection. They require application-level observability — error rates per endpoint, p99 latency per dependency, synthetic transactions that verify actual correctness. Microsoft Research documented gray failures as a significant fraction of Azure production incidents in a 2017 study, finding that standard monitoring missed them entirely until the impact became severe enough to surface through business metric degradation. The full treatment of gray failure detection is in Post 4.1 and observability in Post 4.8.
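The gap between a liveness ping and a synthetic transaction can be sketched directly. This is an illustrative model — `run_query` is a hypothetical dependency handle, not a real driver API — showing what the synthetic check verifies that the ping cannot: the actual result and the actual latency of a round trip.

```python
# Illustrative sketch: correctness-verifying synthetic check vs bare ping.
import time

def liveness_ping(connect) -> bool:
    # Passes as long as the process accepts a connection — a gray-failing
    # replica passes this while returning wrong results.
    try:
        connect()
        return True
    except OSError:
        return False

def synthetic_check(run_query, max_latency_s=0.5) -> bool:
    # Writes a known value, reads it back, and verifies both the result
    # and the latency — the checks a heartbeat cannot perform.
    probe = f"probe-{time.time_ns()}"
    start = time.monotonic()
    try:
        run_query("INSERT", probe)
        result = run_query("SELECT", probe)
    except Exception:
        return False
    latency = time.monotonic() - start
    return result == probe and latency <= max_latency_s

# A healthy fake dependency passes; a gray-failing one (wrong results,
# connections still accepted) passes the ping but fails the synthetic check.
store = {}
def healthy_query(op, value):
    if op == "INSERT":
        store["probe"] = value
        return None
    return store["probe"]

def gray_query(op, value):
    return "wrong-result"      # accepts the call, corrupts the answer

assert synthetic_check(healthy_query)
assert liveness_ping(lambda: None)   # the gray node still "looks alive"
assert not synthetic_check(gray_query)
```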
Liveness Checks vs Readiness Checks
Health checks are the primary mechanism by which distributed systems detect node failures. But the type of health check determines what is actually being measured — and the distinction between liveness and readiness is one of the most important and most commonly conflated in production systems.
A liveness check answers: is this process running? A TCP health check that opens a connection and receives a response confirms liveness. A simple HTTP endpoint that returns 200 OK regardless of application state confirms liveness. Liveness checks detect crash-stop failures — a node that has crashed entirely will not respond. They do not detect gray failures, slow nodes, or nodes that are alive but unable to serve useful traffic.
A readiness check answers: is this process ready to serve traffic correctly? A readiness check exercises the actual dependencies that the node needs to serve requests — it verifies that the database connection pool is healthy, that the cache is connected, that the downstream services are reachable and responding within acceptable latency. A node that is alive but has an exhausted database connection pool passes liveness but fails readiness. Routing traffic to it produces errors for every request.
Kubernetes implements this distinction through liveness probes (which trigger container restart on failure) and readiness probes (which remove the container from Service endpoints on failure without restarting it). The correct pattern: route traffic only to nodes that pass readiness, not just liveness. Load balancers that use only liveness checks will route traffic to gray-failing nodes and deliver errors to users while reporting the cluster as healthy. Health-check-driven routing is covered in Post 4.6.
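The split between the two handlers can be sketched as follows. The `App` object and its attributes are hypothetical stand-ins, not a real framework API — the point is only that the two endpoints answer different questions.

```python
# Minimal sketch of the liveness/readiness split (hypothetical app object).

class App:
    def __init__(self, idle_db_connections: int, cache_up: bool):
        self.idle_db_connections = idle_db_connections
        self.cache_up = cache_up

def handle_livez(app) -> tuple[int, str]:
    # Liveness: the process is running and can answer. Nothing more.
    return 200, "ok"

def handle_readyz(app) -> tuple[int, str]:
    # Readiness: every dependency needed to serve real traffic is usable.
    if app.idle_db_connections == 0:
        return 503, "db pool exhausted"
    if not app.cache_up:
        return 503, "cache unreachable"
    return 200, "ready"

healthy = App(idle_db_connections=5, cache_up=True)
exhausted = App(idle_db_connections=0, cache_up=True)

assert handle_livez(exhausted)[0] == 200    # alive...
assert handle_readyz(exhausted)[0] == 503   # ...but must not receive traffic
assert handle_readyz(healthy)[0] == 200
```

A load balancer driven by `handle_livez` alone would keep routing to the exhausted instance — the gray failure described above.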
Failure Detection Is Probabilistic, Not Certain
Because silence is ambiguous — a node that stops responding might be crashed, might be slow, or might be partitioned — failure detection in distributed systems is inherently probabilistic. Failure detectors do not detect failure. They detect absence of response within a configured time window, and they output a suspicion rather than a certainty.
Fixed-timeout failure detection is the simplest approach: if a heartbeat has not arrived within T seconds, suspect failure. Simple to implement, easy to understand, and wrong in a specific way — fixed timeouts do not adapt to network conditions. A timeout calibrated for average network latency produces false positives during high-load periods when latency spikes. A timeout calibrated for worst-case latency is too slow to detect genuine failures quickly.
The phi accrual failure detector, used by Cassandra and Akka, replaces binary timeout detection with a continuous suspicion level φ. Rather than asking “has the heartbeat timed out?”, it asks “how statistically surprising is the absence of a heartbeat given the historical heartbeat arrival pattern?” As the observed absence becomes more statistically improbable, φ rises — allowing the system to become more lenient during high-latency periods (when delays are expected) and more sensitive during stable periods. This adaptive approach produces fewer false positives without sacrificing detection speed for genuine failures. The full treatment is in Post 4.4.
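The core of the idea fits in a short sketch. This is a simplified model — it assumes heartbeat intervals are normally distributed, where production implementations such as Cassandra's use more refined estimates — but it shows the essential move: φ is derived from how improbable the current silence is under the observed arrival history, not from a fixed threshold.

```python
# Simplified phi accrual detector sketch (normal-distribution assumption;
# real implementations differ in the interval model and windowing).
import math
import statistics

class PhiAccrualDetector:
    def __init__(self, threshold: float = 8.0, window: int = 100):
        self.threshold = threshold
        self.window = window
        self.intervals = []       # sliding window of inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now: float) -> None:
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
            if len(self.intervals) > self.window:
                self.intervals.pop(0)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        if len(self.intervals) < 2:
            return 0.0
        mean = statistics.fmean(self.intervals)
        stdev = statistics.stdev(self.intervals) or 0.1  # floor for steady streams
        elapsed = now - self.last_heartbeat
        # P(silence lasts at least this long) under a normal model:
        p_later = 1.0 - 0.5 * (1.0 + math.erf((elapsed - mean) / (stdev * math.sqrt(2))))
        p_later = max(p_later, 1e-12)  # avoid log(0) once silence is extreme
        return -math.log10(p_later)

    def suspect(self, now: float) -> bool:
        return self.phi(now) >= self.threshold

d = PhiAccrualDetector(threshold=8.0)
for t in range(20):
    d.heartbeat(float(t))      # steady 1-second heartbeats
assert not d.suspect(20.2)     # slightly late: phi stays low
assert d.suspect(30.0)         # 10 seconds of silence: phi far above threshold
```

Because φ is computed from the observed interval distribution, the same silence that is alarming after a history of tight 1-second heartbeats is unremarkable after a history of jittery ones — the adaptivity the prose describes.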
What the Node Failure Model Requires Engineers to Design For
A distributed system built on an accurate node failure model — one that treats crash-recovery as the baseline, slow nodes as more dangerous than crashed ones, and failure detection as probabilistic — requires four non-negotiable design properties.
Work must be replayable or idempotent — because a recovered node may reprocess messages it handled before its failure, duplicate execution must produce correct results. Leadership must be time-bounded through leases — because a node that pauses and resumes must not be able to act on a role it held before the pause without re-establishing its authority. State must be recoverable from durable storage — because crash-recovery requires nodes to reconstruct their state after restart without relying on in-memory state that was lost. Correctness must not depend on any single node behaving continuously — because partial failure means some nodes will always be unavailable, degraded, or slow at any given moment in a large cluster.
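The first property, idempotent work, is the one most directly expressible in code. A minimal sketch — in production the processed-id table would live in durable storage, not process memory, and the names here are illustrative:

```python
# Illustrative idempotent-processing sketch: deduplication by request id.

class IdempotentProcessor:
    def __init__(self):
        self.processed = {}   # request_id -> result (stands in for a durable table)
        self.balance = 0

    def apply_credit(self, request_id: str, amount: int) -> int:
        # A recovered node replaying its inbox re-submits the same request id;
        # the duplicate returns the recorded result instead of re-executing.
        if request_id in self.processed:
            return self.processed[request_id]
        self.balance += amount
        self.processed[request_id] = self.balance
        return self.balance

p = IdempotentProcessor()
p.apply_credit("req-1", 100)   # first delivery: balance becomes 100
p.apply_credit("req-1", 100)   # replay after crash-recovery: no double credit
assert p.balance == 100
```

The same pattern is what makes at-least-once delivery safe: duplicates are expected, and the dedup key turns re-execution into a lookup.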
These properties are implemented by the mechanisms covered in Part 2 (idempotency, coordination, distributed locks) and Part 4 (failure detection, recovery, HA architecture). They are not optimisations. They are what makes a distributed system correct under the node failure conditions it will inevitably encounter.
Key Takeaways
- A node can be running and still be failing — passing health checks while processing no useful work, holding stale leadership, or returning incorrect results is the normal failure mode in production distributed systems
- Crash-recovery is the real-world node failure model — nodes crash and rejoin with potentially stale state, requiring algorithms to handle resumed nodes that believe they hold roles they no longer legitimately occupy
- Slow nodes are more dangerous than crashed nodes — a crashed node is absent and safe to ignore, a slow node is present and capable of taking action on stale assumptions, creating split-brain windows and cascading failures
- Gray failures are the hardest failure class — partially functioning nodes that pass liveness checks while producing incorrect results require application-level observability, not infrastructure monitoring, to detect
- Liveness checks and readiness checks measure different things — liveness confirms the process is running, readiness confirms it can serve traffic correctly; routing to liveness-passing but readiness-failing nodes produces silent application errors
- Failure detection is probabilistic, not certain — failure detectors detect absence of response, not failure itself, and must be designed to produce calibrated suspicions rather than binary verdicts
- Correctness must not depend on any single node behaving continuously — work must be idempotent, leadership must be time-bounded, and state must be recoverable because partial failure is the permanent operating condition of a distributed system at scale
Frequently Asked Questions (FAQ)
What is the node failure model in distributed systems?
The node failure model is the set of assumptions a distributed system makes about how its component processes fail. Three standard models exist. Crash-stop: a node fails permanently and sends no further messages — the cleanest model for algorithm design but rare in production. Crash-recovery: a node fails and later restarts with potentially stale state — the model that describes production systems accurately. Byzantine: a node behaves arbitrarily or maliciously — the most general model, required for adversarial environments but rarely assumed in enterprise distributed systems. The correctness guarantee of every distributed system algorithm depends on which failure model it assumes. Using an algorithm designed for crash-stop in a crash-recovery environment produces subtle bugs during node restarts.
Why are slow nodes more dangerous than crashed nodes in distributed systems?
A crashed node is absent — it stops sending messages, stops holding locks, and other nodes can safely proceed without it once failure detection triggers. A slow node is present — it continues sending heartbeats (with delays), continues holding locks past their intended expiry, and may resume activity based on stale assumptions after its slowness resolves. This creates the split-brain risk: the slow node believes it is the leader while a new leader has been elected to replace it. The 2012 GitHub MySQL incident is the canonical production example — a primary that had stalled without crashing resumed and briefly asserted authority over data that a new primary had already taken ownership of.
What is the difference between a liveness check and a readiness check?
A liveness check answers whether a process is running — a TCP connection that succeeds or an HTTP endpoint that returns 200 confirms liveness. Liveness checks detect crash-stop failures but miss gray failures and nodes that are alive but unable to serve traffic. A readiness check answers whether a process can serve traffic correctly — it verifies database connections, cache connectivity, and downstream service health. A node with an exhausted database connection pool passes liveness but fails readiness. Routing traffic to it produces errors for every request. Kubernetes implements both: liveness probe failure triggers container restart, readiness probe failure removes the container from Service endpoints without restarting.
What is a gray failure and why is it hard to detect?
A gray failure is partial functioning — a node passes infrastructure health checks (TCP liveness, process running) while producing incorrect or degraded results for application requests. Examples: a database replica returning wrong results for specific query patterns, a service with an exhausted thread pool accepting connections but failing all requests, a cache node silently evicting data it should retain. Gray failures are invisible to heartbeat-based failure detection because the process is alive and responding. Detection requires application-level observability — error rates per endpoint, p99 latency per dependency, and synthetic transactions that verify correctness rather than just process liveness.
What is the phi accrual failure detector?
The phi accrual failure detector replaces binary timeout-based failure detection with a continuous suspicion level φ. Rather than asking “has the heartbeat timed out?” against a fixed threshold, it asks “how statistically surprising is the absence of a heartbeat given the historical arrival pattern?” It maintains a sliding window of recent heartbeat intervals, models their distribution, and calculates φ as a function of how long the current silence has lasted relative to that distribution. When φ reaches a configured threshold (typically 8 in Cassandra, 10 in Akka), the node is suspected failed. The adaptive threshold automatically becomes more lenient during high-latency periods and more sensitive during stable periods, reducing false positives without slowing detection of genuine failures.
Why does crash-recovery create more bugs than crash-stop?
Because crash-recovery produces a node that believes it is entitled to a role it may no longer hold. A node that crashes and restarts does not automatically know what happened while it was down — it may not know that a new leader was elected, that its lock lease expired, or that its uncommitted writes were overridden. If it resumes without first re-establishing its authority (re-acquiring its lease, re-syncing from the current leader, replaying missed log entries), it can take actions that conflict with the decisions made during its absence. Algorithms designed for crash-stop can safely ignore a node once it stops responding. Algorithms for crash-recovery must explicitly handle resumed nodes that attempt to rejoin with stale assumptions.
The failure mode that humbled me most in production
Of everything in this post, the slow node is the concept I return to most. Not because it is the most common failure — crashed services are more common. But because it is the failure mode that looks most like normal operation and causes the most confusion when it occurs.
At IDFC First Bank, we run financial transaction systems where a split-brain window — even a brief one — has consequences that go beyond data inconsistency. A payment that is processed twice because two nodes simultaneously believed they were authoritative for that account is not a data quality problem. It is a regulatory event. The GitHub GC pause incident in this post is not something I read about and filed away as an edge case. It is the class of failure I think about when designing any system that involves leadership or locks.
The thing that changed my thinking most was understanding that the solution is not better failure detection — it is designing leadership so that a resumed node cannot act without first re-establishing its authority. Fencing tokens. Lease expiry at the resource, not just at the lock service. The node does not know its lease has expired. The resource must enforce it. That distinction — between trusting the lock holder to know its lease is gone and making the resource reject stale writes regardless — is the difference between a system that is safe and one that is safe until a 40-second GC pause happens at the worst possible moment.
The readiness vs liveness distinction is the other one I wish I had internalised earlier. I have reviewed systems where the health check returned 200 unconditionally — a liveness check with a readiness label. Traffic was routed to instances with exhausted database connection pools, degraded downstream dependencies, and full request queues. The monitoring reported the cluster as healthy. The users were experiencing errors. The gap between what the health check measured and what the service could actually do was producing a gray failure that was invisible until someone looked at per-endpoint error rates rather than infrastructure health dashboards.
Next is Post 1.5 on the time model — the third and final foundational constraint. After networks and nodes, time is the constraint that surprises engineers most, because it violates assumptions that feel obviously true until they are not. Read it with the slow node scenario from this post in mind. The GC pause that creates a split-brain window is also a time problem — the paused node’s local clock kept advancing while it was doing no useful work, and when it resumed it had no reliable way to know how much time had passed or what had changed while it was paused.
Series home: Distributed Systems — Concepts, Design & Real-World Engineering
Part 1 — Foundations
- 1.1 — What Is a Distributed System (Really)?
- 1.2 — System Models: How Distributed Systems See the World
- 1.3 — Network Model: Latency, Loss and Partitions
- 1.4 — Node & Failure Model: Crashes, Slow Nodes and Partial Failure
- 1.5 — Time Model: Why Ordering Is Harder Than It Looks
Previous: ← 1.3 — Network Model: Latency, Loss and Partitions
Next: 1.5 — Time Model: Why Ordering Is Harder Than It Looks →
Posts that build directly on this foundation:
- 2.4 — Coordination and Distributed Locks — Fencing tokens and lease-based leadership that prevent the GC pause split-brain scenario
- 4.1 — Failure Taxonomy — The complete classification including gray failures, Byzantine failures, and correlated failures
- 4.4 — Failure Detection — Phi accrual detector, heartbeats, SWIM protocol, and production tuning values
- 4.6 — Designing for High Availability — Liveness vs readiness probes and health-check-driven routing in production
- 4.8 — Observability — Application-level monitoring required to detect gray failures that infrastructure monitoring misses
Once you have completed all five Foundation posts, Part 2 shows how distributed systems cope with these node failure realities in practice →