Crash Recovery Archives - Rahul Suryawanshi

Recovery and Self-Healing Systems in Distributed Systems

April 7, 2026 · April 1, 2026 by Rahul Suryawanshi

Distributed Systems Series, Part 4.5: Fault Tolerance & High Availability

Detection Is Not Recovery

Post 4.4 established how distributed systems detect failures: through heartbeats, timeouts, phi accrual detectors, and gossip protocols. Detection is the prerequisite, but detecting that a node has failed solves nothing by itself. The system must then do something about …

Failure Taxonomy: How Distributed Systems Fail

April 7, 2026 · April 1, 2026 by Rahul Suryawanshi

Distributed Systems Series, Part 4.1: Fault Tolerance & High Availability

Why Failure Vocabulary Matters Before Failure Mechanisms

Part 4 covers how distributed systems survive failures. But before designing survival mechanisms (redundancy, failure detection, circuit breakers, chaos engineering), engineers must be precise about what kinds of failures they are designing for. A retry …

Node & Failure Model – Crashes, Slow Nodes and Partial Failure

April 9, 2026 · January 8, 2026 by Rahul Suryawanshi

In distributed systems, nodes don’t just crash. They slow down, restart, and fail partially — often while appearing healthy.