The Eight Fallacies of Distributed Computing: What Every Engineer Gets Wrong


Thirty Years of Getting Distributed Systems Wrong in the Same Ways

In 1994, L. Peter Deutsch — a Sun Fellow at Sun Microsystems, best known as the creator of Ghostscript — circulated a list of seven false assumptions that programmers new to distributed systems invariably make. Around 1997, James Gosling, the creator of Java, added an eighth. The list has been called the Eight Fallacies of Distributed Computing ever since.

Thirty years later, every assumption on that list is still being made. In production. By experienced engineers. At companies with sophisticated engineering cultures.

The reason is not ignorance. Engineers who have read the fallacies still violate them, because the assumptions feel correct in the context where code is written — local development, fast networks, controlled environments — and only reveal themselves as false under production conditions, at scale, under failure. By the time the assumption breaks, it has been encoded into the architecture. Fixing it requires rework, not just a patch.

This post covers all eight original fallacies plus the three modern additions that Mark Richards and Neal Ford identified in 2020 as essential for contemporary distributed systems. For each fallacy, you will find why it feels true, a production incident where it broke at scale, and the engineering mechanism that addresses it — with links to the series posts that cover each mechanism in depth.

The goal is not a list you read and forget. It is a framework you use in design reviews to catch assumptions before they become incidents.

The Origin: Sun Microsystems and the Networked Computing Problem

The fallacies did not emerge from academic research. They emerged from operational experience. Deutsch, Bill Joy, Dave Lyon, James Gosling, and their colleagues at Sun Microsystems were building the Java platform and the network infrastructure underneath it during the early 1990s — a period when distributed computing was transitioning from research labs to production systems at scale.

They observed a consistent pattern: engineers who understood single-machine programming would approach distributed systems by extending their existing mental models. They would write remote procedure calls the way they wrote local function calls. They would assume messages arrived the way in-process method calls arrived. They would assume the network behaved the way shared memory behaved. And their systems would fail in confusing, hard-to-diagnose ways that reflected exactly these assumptions.

Deutsch’s list was an attempt to name the assumptions explicitly — to give engineers vocabulary for the ways their intuition was wrong before the production system made it obvious.

In 2020, Mark Richards and Neal Ford extended the list in Fundamentals of Software Architecture, adding three fallacies that modern microservices and cloud-native architectures introduce beyond what Deutsch’s original list covered. Those three are covered at the end of this post.

Fallacy 1 — The Network Is Reliable

Why it feels true: In local development, function calls succeed. TCP makes network communication look like a reliable pipe. A connection either exists or it does not. When it exists, data arrives.

Why it is false: TCP provides a reliable abstraction within a session, but sessions break. Routers fail. Switches restart. Cables are cut. Network interfaces become saturated. Cloud virtual network fabrics have their own failure modes — packet loss, asymmetric routing, intermittent connectivity between specific hosts that appears and disappears. More subtly, a connection can be open while packets are silently dropped — TCP’s retransmission mechanism handles this below the application layer, but the retransmission adds latency that the application observes as a slow response rather than a clean failure.

On April 21, 2011, an AWS US-East-1 outage began with a network configuration change that caused abnormal traffic patterns across the EBS (Elastic Block Store) control plane. The configuration change was an attempt to upgrade network capacity. The traffic patterns overwhelmed the control plane’s ability to process requests, triggering cascading failures across the entire EBS fleet. Systems that had assumed network calls to their storage layer were reliable discovered that silence — calls that never returned — was indistinguishable from failure.

Applications that had no timeout on storage calls blocked indefinitely. Applications that had timeouts but no retry logic failed permanently on transient failures. Applications that had retries but no idempotency keys created duplicate records when their timed-out calls eventually completed after the retry had already succeeded.

The engineering response: Treat every network call as potentially failing, silently, at any time. Every outbound call needs a timeout. Every timed-out call needs a retry strategy — exponential backoff with full jitter as covered in Post 2.2. Every retried call needs an idempotency key to prevent duplicate side effects. Circuit breakers stop sending requests to a consistently failing downstream service, protecting it from retry storms. The network model in Post 1.3 establishes the complete picture of what reliable means in a distributed network.
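The timeout-retry-idempotency combination described above can be sketched in a few lines. This is an illustrative sketch, not any specific library’s API; the function and parameter names are invented:

```python
import random
import time

def call_with_retries(op, *, attempts=5, base=0.1, cap=2.0, idempotency_key=None):
    """Retry `op` on transient network failure, with exponential backoff
    and full jitter. The idempotency key is passed through so the server
    can deduplicate a retry whose original attempt actually succeeded."""
    for attempt in range(attempts):
        try:
            return op(idempotency_key)
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # retries exhausted: surface the failure
            # Full jitter: sleep a uniform amount in [0, min(cap, base * 2^attempt)]
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

A circuit breaker would wrap this whole function, refusing to start new attempts once the downstream service is consistently failing.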

Fallacy 2 — Latency Is Zero

Why it feels true: In-process function calls complete in nanoseconds. When code is refactored to call a service instead of a local function, the call still looks like a function call. The developer writes user = getUserById(id) whether the user data lives in the same process or in a service three hops away.

Why it is false: A network round trip within a single AWS availability zone takes approximately 0.5 to 1 millisecond. A round trip between availability zones in the same region takes 1 to 3 milliseconds. A round trip between US-East and EU-West takes 80 to 100 milliseconds. These numbers are not implementation details — they are physics. You cannot make a network call faster than the speed of light allows.

The compounding effect is where the fallacy causes real damage. A microservices architecture where a single user-facing request triggers 10 sequential downstream service calls — each adding 5ms of network latency — produces 50ms of irreducible overhead before any computation occurs. This is the N+1 query problem at the service level: a design that works acceptably with one call becomes unusable when that one call becomes a loop.

Consider a product listing page that fetches each product’s inventory status with a separate service call in a loop. With 20 products on the page, that is 20 sequential 5ms calls — 100ms of network latency before any rendering begins. Profiling a production system and discovering that 80% of response time is network latency rather than computation is a common finding in systems built without accounting for this fallacy.

The engineering response: Network calls are expensive and must be treated as such in architecture decisions. Batch related calls into single requests rather than looping. Use asynchronous communication where response latency is not blocking — Post 2.1 covers the communication pattern choices in depth. Cache data that is read frequently and changes infrequently to eliminate the network round trip entirely. Co-locate services that communicate frequently to minimise the latency of their communication. Design API contracts that return all the data a client needs in a single call rather than requiring sequential round trips.
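The arithmetic behind the batching advice is worth making explicit. A minimal model, where the round-trip time and call count are the only assumptions:

```python
def network_overhead_ms(rtt_ms: float, calls: int, batched: bool) -> float:
    """Irreducible network latency before any computation happens.
    Sequential calls pay one round trip each; a batched endpoint pays one total."""
    return rtt_ms if batched else rtt_ms * calls

# The product-listing example above: 20 per-product calls at 5 ms each.
sequential = network_overhead_ms(5.0, 20, batched=False)  # 100 ms
batched = network_overhead_ms(5.0, 20, batched=True)      # 5 ms
```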

Fallacy 3 — Bandwidth Is Infinite

Why it feels true: Modern networks have gigabit or multi-gigabit capacity. Sending a JSON payload that is a few kilobytes feels like sending nothing.

Why it is false: Bandwidth limits exist at every layer. Individual network interfaces have finite capacity. Shared network segments become congested under load. Cloud providers charge for outbound data transfer — AWS charges $0.09 per GB for data transferred out of US regions, scaling to petabyte volumes in large deployments. At 100 TB per month of outbound traffic, bandwidth cost alone exceeds $9,000 monthly. This is not hypothetical: companies running data-intensive services on AWS have discovered that network egress is a significant fraction of their cloud infrastructure cost.

Within a distributed system, replication traffic competes with application traffic for bandwidth. A database cluster replicating synchronously to three replicas uses three times the write bandwidth of a single-node database. A Kafka cluster with a replication factor of three uses three times the write bandwidth for its log. Under high write load, replication traffic can saturate the network interface before the application’s own traffic becomes a concern.

Verbosity in serialisation formats compounds this. A system that serialises every inter-service message as verbose JSON when it could use Protocol Buffers or Avro uses three to ten times more bandwidth for the same data. At moderate scale, the difference is invisible. At high scale, it becomes a capacity ceiling.

The engineering response: Measure network bandwidth utilisation as a first-class capacity metric — saturation is one of the four golden signals covered in Post 4.8. Use efficient serialisation formats — Protocol Buffers and Avro reduce payload size significantly compared to JSON for structured data. Apply compression on high-volume paths where CPU is cheaper than bandwidth. Design data placement to minimise cross-region data transfer, which carries both latency cost and financial cost. The geographic distribution costs covered in Post 3.8 make this concrete with real numbers.
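As a sketch of the compression point: verbose JSON is dominated by field names that repeat in every record, which is exactly what a general-purpose compressor removes. The record shape below is invented for illustration:

```python
import json
import zlib

def compress_payload(records):
    """Serialise to JSON, then compress with zlib: trades cheap CPU
    for bandwidth on high-volume paths."""
    raw = json.dumps(records).encode("utf-8")
    return raw, zlib.compress(raw, level=6)

records = [{"product_id": i, "status": "in_stock", "warehouse": "us-east-1"}
           for i in range(500)]
raw, packed = compress_payload(records)
# `packed` is a fraction of `raw` because the field names repeat 500 times;
# a schema-based format like Protocol Buffers avoids sending them at all.
```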

Fallacy 4 — The Network Is Secure

Why it feels true: The internal network is private. Services communicate behind a firewall. External attackers cannot reach the internal network. There is no reason to encrypt traffic between internal services — it only adds overhead.

Why it is false: The assumption that the internal network is a trusted environment was always optimistic. It is now demonstrably false. The 2020 SolarWinds supply chain attack compromised the internal networks of thousands of organisations, including US government agencies, by injecting malicious code into a trusted software update. Once inside, the attackers moved laterally across internal networks that had assumed internal traffic was safe. Unencrypted service-to-service communication within the “trusted” perimeter was readable to any compromised host on the same network segment.

Insider threats, compromised credentials, container escape vulnerabilities, and misconfigured network policies all create scenarios where a malicious actor can observe or manipulate internal network traffic. In multi-tenant environments — any cloud environment where multiple customers’ workloads run on shared infrastructure — the “internal network” may not be exclusive to your workloads.

The modern security model acknowledges this explicitly: zero-trust networking assumes that every network path is potentially hostile, regardless of whether it is internal or external. Every service call must be authenticated and authorised. Every payload must be encrypted in transit.

The engineering response: Implement mutual TLS (mTLS) for service-to-service communication — both sides present certificates, ensuring both that the caller is who it claims to be and that the service is authentic. Service meshes (Istio, Linkerd) implement mTLS transparently at the infrastructure level without requiring application code changes. Apply the principle of least privilege to network policies — services should be able to communicate only with the services they legitimately need to reach, not with the entire internal network. Rotate credentials automatically. Treat the internal network as hostile by default, not as trusted by assumption. The bulkhead and isolation patterns in Post 4.7 apply equally to security isolation as to fault isolation.
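At the application level, when a service mesh is not doing this for you, a server-side mutual-TLS context built with Python’s standard ssl module looks roughly like the sketch below. The certificate file paths are placeholders:

```python
import ssl

def mtls_server_context(certfile=None, keyfile=None, ca_bundle=None):
    """Server context that presents its own certificate AND demands a valid
    certificate from every client -- the 'mutual' in mutual TLS."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.verify_mode = ssl.CERT_REQUIRED  # reject clients without a certificate
    if certfile:
        ctx.load_cert_chain(certfile, keyfile)  # this service's identity
    if ca_bundle:
        ctx.load_verify_locations(ca_bundle)    # CA that signs client certs
    return ctx
```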

Fallacy 5 — Topology Doesn’t Change

Why it feels true: In a manually managed environment, server addresses are stable. A configuration file lists the IP addresses of the databases and services. Those addresses do not change unless someone changes the configuration deliberately.

Why it is false: Modern infrastructure is in continuous flux. Kubernetes pods are ephemeral — they are created and destroyed constantly, and each new pod receives a new IP address. Cloud instances are replaced automatically during maintenance events, scaling events, and spot instance interruptions. Services scale horizontally, creating and destroying instances dynamically based on load. Canary deployments, blue-green deployments, and rolling updates change which instances are serving traffic on a continuous basis.

In a Kubernetes cluster with a hundred services running hundreds of pods, the set of IP addresses that constitute “the payment service” changes multiple times per hour — pods are rescheduled after node failures, scaled up under load, and replaced during deployments. A system that hardcodes service addresses in configuration and does not use service discovery will route requests to pods that no longer exist and fail to reach pods that were just created.

Network topology changes more broadly — BGP route updates, cloud provider network maintenance, VPC peering modifications, DNS TTL expiry — all change the path that packets take between services. A route that was low-latency yesterday may be routed through a higher-latency path today due to a peering change. Firewall rule changes can silently block traffic that was flowing the day before.

The engineering response: Use service discovery rather than static configuration — Post 2.3 covers DNS-based and registry-based service discovery, health-aware routing, and how Consul and Kubernetes Services implement dynamic address resolution. Design retry logic to handle transient routing failures that occur during topology changes. Use health-check-driven routing covered in Post 4.6 to automatically route away from instances that become unreachable during topology changes.
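Conceptually, what Consul or a Kubernetes Service maintains is a health-aware mapping from service name to live addresses. A toy in-memory version, with invented names and addresses, makes the contract concrete:

```python
class ServiceRegistry:
    """Name -> live addresses, with round-robin resolution. Health checking
    is modelled as explicit deregistration when an instance stops responding."""

    def __init__(self):
        self._healthy = {}  # service name -> list of addresses

    def register(self, service, address):
        self._healthy.setdefault(service, []).append(address)

    def deregister(self, service, address):
        # Driven by failed health checks or pod termination in real systems.
        self._healthy[service].remove(address)

    def resolve(self, service):
        addrs = self._healthy.get(service) or []
        if not addrs:
            raise LookupError(f"no healthy instances of {service!r}")
        addr = addrs.pop(0)
        addrs.append(addr)  # rotate for round-robin spreading
        return addr
```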

Fallacy 6 — There Is One Administrator

Why it feels true: In a small team or a single-organisation environment, one person or team controls all the infrastructure. They know the complete state of the system. A change to any component is coordinated through them.

Why it is false: Production distributed systems at scale involve multiple teams with independent ownership of different components, third-party services with their own operational policies, cloud provider infrastructure that the engineering team does not control, and regulatory requirements that impose operational constraints on specific components. When a dependency’s administrator makes a change — a cloud provider upgrades their managed database version, a third-party API deprecates an endpoint, a security team changes network policies — your system must handle the change correctly even though you were not the one who made it.

This fallacy is also visible within organisations. A microservices architecture where each service is owned by a different team means that the “payment service” team and the “fraud detection service” team each make independent operational decisions — deployment schedules, configuration changes, capacity planning — that affect each other. When the fraud detection team deploys a configuration change that increases response time by 200ms, the payment service team experiences a latency spike they did not cause and may not immediately understand.

Multiple administrators also means multiple policies. Different cloud regions may have different network security policies. Different teams may have different deployment windows. Different regulatory regimes may require different data handling. A system that assumes a single coherent administrative policy will break in systems where multiple independent administrators each enforce their own policies.

The engineering response: Design for administrative independence rather than administrative coordination. Infrastructure as code (IaC) — Terraform, Pulumi, CloudFormation — makes infrastructure state explicit, versionable, and reviewable rather than dependent on manual coordination. Circuit breakers and graceful degradation patterns covered in Post 2.2 and Post 4.6 protect your service from the operational decisions of dependencies you do not control. Chaos engineering as covered in Post 4.9 validates that your system handles dependency changes correctly without requiring coordination with the teams that manage those dependencies.
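A circuit breaker that protects you from a dependency you do not control can be sketched minimally. The thresholds and timings below are illustrative, not recommendations:

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures, fail fast while open,
    and allow a single probe call after `reset_after` seconds (half-open)."""

    def __init__(self, threshold=3, reset_after=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None

    def call(self, op):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None   # half-open: let one probe through
        try:
            result = op()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0           # any success closes the circuit
        return result
```

Failing fast matters here because it converts a dependency’s latency spike into an immediate, handleable error instead of a thread-pool-exhausting hang.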

Fallacy 7 — Transport Cost Is Zero

Why it feels true: Sending a message between services in development costs nothing. The abstraction of a function call hides the actual overhead of serialisation, network transmission, deserialisation, and infrastructure utilisation.

Why it is false: Transport cost is real at multiple levels. At the financial level, cloud providers charge for data transfer — particularly outbound transfer from a cloud region to the internet, and cross-region transfer within a cloud provider’s network. AWS’s data transfer pricing means that a system moving 1 petabyte of data per month across regions pays tens of thousands of dollars in data transfer costs alone, before compute or storage. At the compute level, TLS encryption and decryption adds CPU overhead — at high throughput, the CPU cost of TLS handshakes and per-packet encryption is measurable and must be accounted for in capacity planning. At the latency level, serialisation and deserialisation of large payloads adds time to every request.

The hidden costs of network infrastructure — hardware, bandwidth provisioning, redundancy, monitoring — are real operational costs that appear in infrastructure budgets even when individual message transmission appears free. Designing a system that produces excessive unnecessary traffic because transport feels free eventually produces a system that is expensive to operate.

The engineering response: Account for transport cost explicitly in system design. Choose serialisation formats that are compact and efficient for your payload shapes — Protocol Buffers for structured data, Avro for schema-evolved data streams. Design data architectures that minimise cross-region data movement — the data gravity concept is that computation is cheaper to move to data than data to computation. The performance trade-offs and geographic distribution costs covered in Post 3.8 quantify these costs with real production numbers. Include network egress in infrastructure cost models from the beginning of system design rather than discovering it as a surprise in the monthly cloud bill.
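Including egress in the cost model can start as back-of-envelope arithmetic, using the flat US-region rate quoted earlier (real AWS pricing is tiered, so treat this as an upper-bound sketch):

```python
def egress_cost_usd(gb_per_month: float, usd_per_gb: float = 0.09) -> float:
    """Monthly egress cost at a flat per-GB rate."""
    return gb_per_month * usd_per_gb

# 100 TB/month = 100,000 GB -> about $9,000 at $0.09/GB, before compute or storage.
monthly = egress_cost_usd(100_000)
```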

Fallacy 8 — The Network Is Homogeneous

Why it feels true: In a controlled datacenter or a single cloud region, all services communicate over the same type of network. The characteristics of the network feel uniform.

Why it is false: A packet from a service in AWS us-east-1 to a mobile user in rural India travels through AWS’s data centre network, through multiple internet exchange points, through submarine cables, through local ISP infrastructure, through cellular tower radio links, and finally over a 4G or 3G connection to the device. Each segment has different latency characteristics, different packet loss rates, different MTU (maximum transmission unit) sizes, and different reliability profiles. Treating this path as homogeneous — configuring a single timeout value that works for the datacenter segment and applying it to the entire path — produces timeouts that are either too aggressive for mobile connections (causing false failures) or too lenient for datacenter connections (causing slow failure detection).

Within cloud infrastructure, network heterogeneity exists at the hardware level. Different instance types have different network interface card capabilities. Different availability zones may be connected by different physical infrastructure. Spot instances may be on different network segments from on-demand instances. A system that assumes uniform network characteristics across all its compute nodes will produce non-deterministic performance and failure behaviour as requests are routed to instances on different network segments.

The emergence of edge computing makes this fallacy more acute. A system that serves both edge nodes (tens of milliseconds from users) and centralised cloud compute (hundreds of milliseconds from users) cannot use a single latency model or a single timeout configuration. The network characteristics of edge-to-cloud communication differ fundamentally from cloud-to-cloud communication.

The engineering response: Design for network heterogeneity explicitly. Use adaptive timeout strategies that reflect the actual characteristics of each network path rather than a single global value. Implement different retry policies for mobile clients (which expect intermittent connectivity) and internal service calls (which expect consistently low latency). Use the observability tooling from Post 4.8 to measure actual latency distributions per communication path and use those measurements to calibrate timeouts and circuit breaker thresholds. The failure detection tuning covered in Post 4.4 addresses exactly this — the phi accrual detector adapts to the statistical distribution of the specific network path it monitors.
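One way to make timeouts path-aware is to derive them from the measured latency distribution of each path rather than a single global constant. A sketch, where the multiplier and floor are arbitrary tuning choices:

```python
import statistics

def adaptive_timeout_ms(latencies_ms, multiplier=3.0, floor_ms=10.0):
    """Timeout for one specific network path: p99 of recently observed
    latencies on that path, times a safety multiplier, with a floor."""
    p99 = statistics.quantiles(latencies_ms, n=100)[98]
    return max(floor_ms, p99 * multiplier)

# A datacenter path and a mobile path get very different timeouts from
# the same rule -- no single constant fits both.
```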

Three Modern Additions: What 30 Years of Production Taught Us

In 2020, Mark Richards and Neal Ford identified three additional fallacies in Fundamentals of Software Architecture that Deutsch’s original list did not cover — because the architectural patterns they address (microservices, distributed transactions, complex multi-service systems) were not prevalent in 1994.

Fallacy 9 — Versioning Is Simple

When a service changes its API, all callers must be updated simultaneously — or the service must maintain backward compatibility. In a monolith, “all callers” means all the code in one repository. In a microservices architecture, “all callers” means independent services owned by different teams on different deployment schedules.

Engineers frequently assume that deploying a new API version is a simple matter of updating the endpoint and documenting the change. In practice, coordinating API version upgrades across 50 microservices owned by 15 teams, each with their own deployment schedule, takes months. During that window, the service must simultaneously support the old API for callers that have not yet upgraded and the new API for callers that have. Version negotiation, backward-compatible schema evolution, and deprecation strategies are engineering disciplines in their own right — not simple operational procedures.

The engineering response: Design for API evolution from the start. Use additive changes (new optional fields, new endpoints) rather than breaking changes wherever possible. Implement API versioning in the URL or headers so old and new clients can coexist. Apply consumer-driven contract testing to detect compatibility breaks before deployment. Plan for months-long deprecation windows rather than immediate cutover.
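The additive-change discipline shows up in code as a “tolerant reader”: required fields present in every version, defaults for fields added later, unknown fields ignored. The field names below are invented for illustration:

```python
def parse_user(payload):
    """Tolerant reader for a user record: required fields from v1, defaults
    for fields added in v2, unknown fields silently ignored. This is what
    lets old and new clients hit the same handler during a long migration."""
    return {
        "id": payload["id"],                       # required in every version
        "name": payload.get("name", ""),           # v1 field
        "locale": payload.get("locale", "en-US"),  # added in v2, optional
    }
```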

Fallacy 10 — Compensating Updates Always Work

Distributed transactions — operations that must succeed or fail atomically across multiple services — cannot use the two-phase commit protocol at scale without prohibitive performance cost. The standard alternative is the Saga pattern: break the distributed transaction into a sequence of local transactions, each with a compensating transaction that undoes its effect if a later step fails.

The fallacy is assuming that compensating transactions always succeed. In practice, compensating a payment reversal requires the payment processor to accept the reversal request — which may fail if the processor is unavailable, if the reversal window has passed, or if the original transaction is in a state that cannot be reversed. Compensating an inventory reservation requires the inventory service to release the reservation — which may fail if the inventory service has itself changed state since the reservation was made.

When a compensating transaction fails, the system is in a partially compensated state that requires manual intervention to resolve. Systems designed around the assumption that Saga compensation always works discover, in production, that they have no process for handling the cases where it does not.

The engineering response: Design Sagas with explicit handling for compensation failure. Log every step of a Saga and its compensation status to a durable store. Implement a monitoring process that detects partially compensated Sagas and either retries the compensation or escalates to human review. Use idempotent compensating transactions so that retrying a failed compensation does not produce additional side effects.
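A Saga runner that takes compensation failure seriously journals every outcome to a durable store (a plain list stands in for it here) so that partially compensated Sagas are detectable. The step names are illustrative:

```python
def run_saga(steps, journal):
    """Run (name, action, compensate) triples in order. On failure, run the
    compensations for completed steps in reverse, journaling every outcome
    so a partially compensated Saga is visible to a reconciliation process."""
    done = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            journal.append((name, "failed"))
            for prev_name, prev_comp in reversed(done):
                try:
                    prev_comp()
                    journal.append((prev_name, "compensated"))
                except Exception:
                    # The fallacy in action: compensation itself can fail.
                    # Record it durably and escalate, don't assume success.
                    journal.append((prev_name, "compensation_failed"))
            return False
        journal.append((name, "done"))
        done.append((name, compensate))
    return True
```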

Fallacy 11 — Observability Is Optional

In a monolith, a production incident can be diagnosed by reading logs from one process and attaching a debugger. In a distributed system, a user-visible failure may have originated several hops upstream from where the error is reported — in a service that does not know a user request was involved, producing a log entry that looks like a routine timeout rather than a production incident.

Engineers building distributed systems frequently treat observability as an operational concern to be added after the system is built — logs, metrics, and traces bolted on when something goes wrong. The fallacy is assuming this can wait: a distributed system without observability built in from the start is not diagnosable when it fails. By the time an incident occurs, the information needed to diagnose it was never recorded. Retrofitting observability into a distributed system after the fact is significantly more expensive than building it in from the start.

This fallacy intersects directly with the gray failure taxonomy from Post 4.1 — partial functioning that passes infrastructure health checks while producing incorrect application results. Gray failures are invisible without application-level observability. Infrastructure monitoring does not detect them. Only logs that record application-level outcomes and traces that follow requests across service boundaries reveal them.

The engineering response: Treat observability as a first-class engineering requirement, not an operational afterthought. Instrument with OpenTelemetry from day one. Propagate trace IDs through every service call. Emit structured logs with trace ID on every log event. Instrument the four golden signals — latency, traffic, errors, saturation — for every service before it goes to production. The complete treatment is in Post 4.8.
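The trace-ID-on-every-log-event rule is cheap to enforce with a context variable. In production, OpenTelemetry manages this context and the propagation headers for you, but the mechanics look roughly like this sketch:

```python
import contextvars
import json
import uuid

trace_id_var = contextvars.ContextVar("trace_id", default=None)

def start_request():
    """At the edge: mint a trace id for this request. A downstream service
    would instead read it from an incoming header and set it here."""
    trace_id_var.set(uuid.uuid4().hex)

def log_event(event, **fields):
    """Structured log line carrying the current trace id, so this event can
    be joined with every other event from the same user request."""
    return json.dumps({"event": event, "trace_id": trace_id_var.get(), **fields})
```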

Using the Fallacies as a Design Review Framework

The highest practical value of the eleven fallacies is as a design review checklist. Before any distributed system component or architecture goes to production, run it against each fallacy with this question: if this assumption is violated, what is the blast radius and what is the system’s response?

For each of the eight network fallacies, the question takes the form: what happens to this component when the network is unreliable, slow, congested, insecure, topologically different from development, administratively outside your control, costly to use, or physically heterogeneous? If the answer to any of these is “the component fails in an unhandled way,” that is the design gap to address before production.

For the three modern fallacies, the question takes the form: when this API needs to change, what is the migration path? When this distributed operation needs to be compensated, what happens if the compensation fails? If this component has a gray failure, how will you know?

A design that can answer all eleven questions with specific mechanisms — not just “we’ll handle it” but the actual retry strategy, the actual service discovery mechanism, the actual mTLS implementation, the actual Saga compensation failure process — is a design that has internalised these fallacies rather than merely read them.

Key Takeaways

  1. The eight fallacies were first articulated by L. Peter Deutsch at Sun Microsystems in 1994 (seven fallacies), with James Gosling adding the eighth around 1997 — they describe assumptions that engineers invariably make when transitioning from single-machine to distributed systems
  2. The fallacies feel true because they are true in local development environments — they only reveal themselves as false under production conditions, at scale, under failure, which is exactly when they are most expensive to discover
  3. Every fallacy has a specific engineering mechanism that addresses it — retries with idempotency for network unreliability, batching and caching for latency, efficient serialisation for bandwidth, mTLS for security, service discovery for topology changes, IaC for multiple administrators, data locality for transport cost, adaptive timeouts for heterogeneity
  4. Mark Richards and Neal Ford identified three modern additions in 2020 — versioning is simple (it is not), compensating updates always work (they do not), and observability is optional (it is not) — each reflecting the specific failure modes of microservices architectures that Deutsch’s list did not anticipate
  5. The fallacies are most valuable as a design review framework — for each fallacy, ask what the blast radius is if the assumption is violated and whether the system has a specific mechanism to handle it
  6. Thirty years after the original list, every fallacy on it is still being made in production — not by inexperienced engineers but by experienced ones who have read the list and still encode the assumptions into architectures because they feel correct during design and only reveal themselves at runtime

Frequently Asked Questions (FAQ)

What are the eight fallacies of distributed computing?

The eight fallacies of distributed computing are false assumptions that engineers routinely make when building distributed systems: (1) the network is reliable, (2) latency is zero, (3) bandwidth is infinite, (4) the network is secure, (5) topology doesn’t change, (6) there is one administrator, (7) transport cost is zero, (8) the network is homogeneous. Originally identified by L. Peter Deutsch and colleagues at Sun Microsystems in 1994 (seven fallacies), with James Gosling adding the eighth around 1997. They describe assumptions that feel true in local development environments but fail under production conditions at scale.

Who created the eight fallacies of distributed computing?

L. Peter Deutsch, a Sun Fellow at Sun Microsystems, created the original list of seven fallacies in 1994, incorporating earlier work by Bill Joy and Dave Lyon on “The Fallacies of Networked Computing.” James Gosling, the inventor of Java and another Sun Fellow, added the eighth fallacy (the network is homogeneous) around 1997. In 2020, Mark Richards and Neal Ford added three more modern fallacies in their book Fundamentals of Software Architecture: versioning is simple, compensating updates always work, and observability is optional.

Why are the eight fallacies still relevant in cloud-native systems?

Cloud-native and microservices architectures make the eight fallacies more relevant, not less. Kubernetes pods are ephemeral — topology changes constantly (Fallacy 5). Microservices multiply the number of network calls, amplifying the cost of latency fallacies (Fallacy 2) and bandwidth fallacies (Fallacy 3). Multi-team service ownership makes the single administrator assumption (Fallacy 6) more obviously false. Cloud provider charging models make transport cost real and measurable (Fallacy 7). The fallacies were identified before cloud computing existed — every cloud-native pattern that addresses them (service discovery, circuit breakers, retries, mTLS) exists specifically because the network behaves as the fallacies describe.

What is the most dangerous fallacy in production?

The most dangerous fallacy operationally is Fallacy 1 — the network is reliable — because it produces the widest variety of failure modes: hangs when calls never return, data corruption when retries create duplicates, cascading failures when retry storms amplify load, and split-brain scenarios when network partitions are not handled correctly. The most dangerous fallacy strategically is Fallacy 11 (observability is optional) — because without observability built in from the start, a distributed system cannot be diagnosed when it fails, and retrofitting observability after the fact is significantly more expensive than building it in.

How do the eight fallacies relate to the CAP theorem?

The CAP theorem and the eight fallacies address different aspects of distributed systems. The CAP theorem formalises the fundamental limit that network partitions impose: during a partition, a system must choose between consistency (correct results) and availability (responding to requests). The eight fallacies describe the assumptions engineers make that prevent them from designing correctly for CAP’s implications — particularly Fallacy 1 (networks are reliable, so partitions don’t happen) and Fallacy 4 (the network is secure, so partition events are not adversarial). Understanding the fallacies is the prerequisite for understanding why the CAP constraint is real and inescapable.

What engineering mechanisms address the network reliability fallacy?

Three mechanisms work together to address network unreliability. Timeouts prevent calls from hanging indefinitely when the network fails silently — every outbound call must have a configured timeout. Retries with exponential backoff and jitter allow transient failures to be survived — but retries without idempotency keys create duplicate side effects when the original call succeeded and the response was lost. Circuit breakers stop sending requests to a consistently failing downstream service, preventing retry storms from amplifying a partial failure into a complete outage. These three mechanisms form a coordinated defence: timeout detects the failure, retry recovers from transient failures, circuit breaker prevents cascade.


Continue Learning

Series home: Distributed Systems — Concepts, Design & Real-World Engineering

The eight fallacies explain why every mechanism in this series exists. Each fallacy corresponds directly to the content of specific posts.

Start with Part 1 — Foundations to build the complete mental model, or jump to any part using the series home page.


About the Author

Rahul Suryawanshi is a Senior Engineering Manager with experience building and operating large-scale distributed systems across cloud-native platforms. He has led engineering teams through the challenges of consistency trade-offs, operational reliability and platform scalability that this series explores — not as academic exercises but as production engineering decisions with real consequences.

This series reflects what he wished existed when navigating these problems in production: a comprehensive, progressive resource written from an engineering leadership perspective rather than a textbook or paper collection.
