Distributed Systems Series — Part 1.3: Foundations
The Network Does Not Behave the Way You Think It Does
If Post 1.2 established that system models are the assumptions distributed systems are built on, this post examines the most consequential of those assumptions: what the network actually does to messages in transit.
Most software bugs are logical mistakes — the code does not do what the programmer intended. Most distributed system outages begin somewhere else entirely. They begin with the network.
The network model defines what engineers are allowed to assume about communication between nodes. Getting this model wrong — or leaving it implicit — is the root cause of an enormous fraction of production incidents. The eight fallacies of distributed computing, first articulated at Sun Microsystems in 1994 (L. Peter Deutsch listed seven; James Gosling later added the eighth), begin with two network assumptions that engineers still violate in 2025: the network is reliable, and latency is zero. Both are false. This post explains precisely why, and what the correct model requires engineers to design for instead.
The Most Dangerous Assumption: The Network Is Reliable
Many systems are built — often unintentionally — on a single assumption: that messages sent over a network arrive correctly, once, in order, within a predictable time. In reality, a message sent from one node to another may experience any of four distinct failure modes:
It may be delayed — arriving later than expected, potentially after a timeout has already fired on the sender. It may be lost — never arriving at the receiver, with no notification to the sender. It may be duplicated — arriving more than once, typically because a retry was issued before the original delivery was confirmed. It may arrive out of order — reaching the receiver after a message that was sent later.
The hardest consequence of these failure modes: from the sender’s perspective, a delayed message and a lost message are indistinguishable. Silence is always ambiguous. A node that stops receiving responses cannot determine whether the other node has crashed, whether the network between them has partitioned, or whether messages are simply delayed. This ambiguity is irreducible — it is not a fixable engineering problem but a fundamental property of communication over unreliable channels.
Latency: The Hidden Enemy Is Variability, Not Speed
Latency is not just how long something takes. It is how unpredictable that time is.
A service call may complete in 5ms on 99 out of 100 requests and take 2 seconds on the hundredth. On the thousandth it may never return at all. Engineers designing for average latency build systems that work correctly most of the time and fail in confusing ways the rest of the time. Systems must be designed for tail latency — the slowest 1%, 0.1%, or 0.01% of requests — because at scale, rare events become frequent ones.
Consider a cluster of 1,000 nodes where each node experiences a slow response 0.1% of the time. In any given second, approximately one node is slow. A request that fans out to all 1,000 nodes — as distributed database queries, MapReduce jobs, and search index queries frequently do — will hit at least one slow node on roughly 63% of requests (1 − 0.999^1000 ≈ 0.63). The tail is not the rare case. The tail is the majority experience for distributed fan-out workloads. This is the tail latency amplification problem covered in depth in Post 3.8 and Part 5.
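The arithmetic behind this claim is worth checking directly. A minimal sketch (the function name is illustrative):

```python
# Probability that a fan-out request hits at least one slow node,
# assuming each of n nodes is independently slow with probability p.
def p_at_least_one_slow(n: int, p: float) -> float:
    return 1 - (1 - p) ** n

# 1,000-node fan-out, 0.1% per-node slow tail:
print(f"{p_at_least_one_slow(1000, 0.001):.1%}")  # → 63.2%
```

The same formula reproduces the figure in the FAQ below: 100 components with a 1% slow tail also give roughly 63%, because both work out to approximately 1 − e⁻¹.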
The sources of tail latency in production are well-documented: garbage collection pauses that halt a JVM process for hundreds of milliseconds, noisy neighbours on shared cloud hardware that saturate CPU or network interfaces, disk flush stalls when the OS writes dirty pages to storage, and network congestion causing TCP retransmission timeouts. None of these are exceptional. All of them are operational norms at scale.
Why Timeouts Exist — and What They Actually Solve
Timeouts are necessary because waiting indefinitely is not an option. A thread that blocks forever on a network call eventually exhausts the thread pool. A connection that never closes eventually exhausts the connection pool. Timeouts are the mechanism that converts unbounded waiting into bounded failure.
But timeouts do not solve the problem they appear to solve. When a timeout fires, the calling node does not know what happened on the other side. Five distinct scenarios produce identical silence from the caller’s perspective:
- The request was lost in transit — the receiver never processed it
- The receiver processed the request and crashed before sending a response
- The receiver is processing the request and will respond, eventually
- The response was sent but lost in transit
- A network partition separated the caller from the receiver
Retrying after a timeout is safe only if the operation is idempotent — applying it multiple times produces the same result as applying it once. Without idempotency, a retry that follows a succeeded-but-unreported request produces duplicate side effects: double charges, duplicate orders, double inventory decrements. Retrying without idempotency is one of the most common sources of data corruption in distributed systems. The full treatment of safe retry design is in Post 2.2.
Timeouts also introduce their own failure mode at the system level: retry storms. When many callers simultaneously hit a slow downstream service, all experience timeouts simultaneously, all retry simultaneously, and the downstream service — already struggling — receives a second wave of traffic equal to the first. The retry storm converts a degradation into a collapse. Exponential backoff with full jitter is the standard mitigation, covered in Post 2.2.
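The mitigation can be sketched in a few lines. Each retry sleeps a random duration drawn from an exponentially growing window, which decorrelates callers so that a wave of simultaneous timeouts does not become a wave of simultaneous retries (the `base` and `cap` values are illustrative):

```python
import random

def backoff_with_full_jitter(attempt: int,
                             base: float = 0.1,
                             cap: float = 10.0) -> float:
    """Seconds to sleep before retry number `attempt` (0-indexed).

    Full jitter: uniform over [0, min(cap, base * 2**attempt)].
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Attempt 0 waits up to 100ms, attempt 5 up to 3.2s, attempt 10 caps at 10s.
```

Without the jitter (plain exponential backoff), retries stay synchronised: every caller that timed out together retries together, just at longer intervals. The randomness is what breaks the synchronisation.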
Message Loss and Duplication: Normal Behaviours, Not Edge Cases
Networks do not guarantee delivery. A message may never arrive, arrive twice, or arrive after a retry already succeeded and the system has moved on. This is not exceptional behaviour — it is the baseline that production distributed systems are designed around.
Three mechanisms address this. At-least-once delivery acknowledges that preventing message loss requires accepting the possibility of duplication — the queue or broker retransmits until it receives an acknowledgement. Idempotent operations are those that produce the same result whether applied once or multiple times, making at-least-once delivery safe. Deduplication mechanisms (idempotency keys, sequence numbers, content hashes) detect and discard duplicate messages at the receiver, implementing effectively-once processing on top of at-least-once delivery.
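One of these mechanisms, per-sender sequence numbers, can be sketched as follows. This simplified version assumes in-order delivery per sender; a valid but reordered message would also be dropped here, so real systems often track a window of recent sequence numbers instead:

```python
# Receiver-side deduplication keyed on per-sender sequence numbers.
# Each sender stamps its messages with a monotonically increasing
# number; the receiver drops anything it has already seen.
last_seen: dict[str, int] = {}

def deliver(sender: str, seq: int, payload: str) -> bool:
    """Process the message once; return False for duplicates."""
    if seq <= last_seen.get(sender, -1):
        return False                 # duplicate (or stale): discard
    last_seen[sender] = seq
    # ... apply the payload's side effects exactly once here ...
    return True

assert deliver("node-a", 0, "debit 100") is True
assert deliver("node-a", 0, "debit 100") is False   # retransmission, dropped
assert deliver("node-a", 1, "credit 50") is True
```

This is the shape of effectively-once processing: the broker retransmits freely (at-least-once), and the receiver's dedup check ensures the side effects happen once.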
Systems that assume exactly-once delivery without implementing these mechanisms are relying on luck — specifically, the luck that the network never drops a packet, a node never crashes mid-operation, and no retry ever runs. At scale, none of these hold.
Network Partitions: When the System Splits
A network partition occurs when nodes cannot communicate with each other despite being individually healthy and running. From inside the system, each isolated group believes the other is slow or down. Both sides may continue making decisions independently, accepting writes, electing leaders, or updating state — diverging from each other without knowing it.
Partitions are not rare edge cases to be handled eventually. They happen because of router failures, misconfigured firewall rules, cloud networking disruptions, traffic spikes that exhaust network buffer capacity, and BGP route changes. In any distributed system running for years in production, partitions will occur. The engineering question is not whether to design for them but what the system should do when they occur.
This is precisely the question the CAP theorem answers — and it answers it by forcing a choice. During a partition, a system must decide: continue serving requests with potentially inconsistent data (choosing availability), or stop serving requests until the partition heals and consistency can be restored (choosing consistency). There is no option that provides both. The full treatment of the CAP theorem and the design decisions it forces is in Post 3.4.
The 2012 AWS US-East-1 Outage: Partitions at Production Scale
On June 29, 2012, a storm caused widespread power failures and network-level disruptions in AWS’s US-East-1 region, splitting the region into isolated groups of nodes. What made this incident a textbook partition scenario was the behaviour of Elastic Load Balancers: many ELB instances lost contact with the control plane — the system responsible for health-checking and routing decisions — while continuing to serve traffic. From the outside, some services appeared functional. From the inside, the two sides of the partition were making independent decisions with no shared state.
Netflix, running heavily on US-East-1 at the time, had anticipated this class of failure through their chaos engineering practice — deliberately injecting failures in production to train systems to degrade gracefully. Services that assumed network reliability failed completely. Services designed around partition tolerance degraded partially and recovered automatically without human intervention.
The incident became one of the clearest real-world demonstrations of CAP in production. Systems that prioritised consistency stopped serving traffic entirely while waiting for the partition to heal. Systems that prioritised availability continued operating with potentially stale data. Neither choice was wrong — but only the engineers who had consciously made the choice, and designed for its consequences, were prepared for what happened. The chaos engineering discipline that prepared Netflix for this failure is covered in Post 4.9.
Synchronous, Asynchronous, and Partially Synchronous Networks
As established in Post 1.2, network behaviour is modelled along a spectrum of timing assumptions.
A synchronous network has a known upper bound on message delay — every message arrives within T milliseconds or not at all. No real production network behaves this way reliably. Algorithms designed for synchronous networks fail or produce incorrect results when the network does not meet the timing bound.
An asynchronous network has no timing guarantees whatsoever. Messages may be arbitrarily delayed or lost. In this model, the FLP impossibility theorem applies: no deterministic algorithm can guarantee that consensus terminates if even one node may fail. The asynchronous model is the most pessimistic and the closest to production reality during adverse conditions.
A partially synchronous network is unreliable most of the time but eventually stabilises — timing bounds hold eventually, even if not continuously. This is the model that describes real production networks most accurately: normally well-behaved, occasionally slow, rarely partitioned, always recovering. Raft and Paxos both operate under this assumption — they guarantee safety always (no incorrect decisions are made) and liveness only during periods of sufficient network stability (elections eventually complete, writes eventually commit). This is the correct trade-off: a brief pause in progress is recoverable, an incorrect decision produces split-brain or data corruption that may not be.
Why Network Behaviour Breaks Single-Machine Intuition
Engineers trained on single-machine systems carry intuitions that actively mislead them in distributed environments. On a single machine: slow usually means broken, fast usually means healthy, and an unresponsive process has crashed. In a distributed system: slow might be normal variance, silence might mean delay rather than loss, and a node that appears unresponsive to one node may be fully operational for another.
The most consequential manifestation of this: Service A calls Service B and receives no response. From Service A’s perspective, five distinct scenarios are indistinguishable:
- Service B crashed before processing the request
- Service B is slow and will respond eventually
- The request was lost in transit — Service B never received it
- Service B processed the request but the response was lost
- A network partition separated A from B — both are healthy but cannot communicate
Designing distributed systems means designing for this ambiguity rather than assuming it away. Every API, every message handler, every retry strategy, and every timeout configuration is a decision about how the system behaves when it cannot know the true state of its dependencies.
What the Network Model Requires Engineers to Design For
A distributed system built on an accurate network model — one that treats network unreliability as the baseline rather than the exception — has four non-negotiable properties.
- Retries must be safe: every operation that may be retried must either be idempotent by nature or made idempotent through deduplication
- Operations must tolerate duplication: at the receiver, duplicate messages must be detected and handled without producing duplicate side effects
- Timeouts must be calibrated to actual latency distributions: a timeout set to the average response time will fire constantly during normal operation; a timeout set to the p99.9 response time will fire only when something is genuinely wrong
- Designs must not assume perfect communication: correctness must not depend on every message arriving exactly once, in order, within a bounded time
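The timeout-calibration point can be made concrete with a nearest-rank percentile over observed latencies (the sample distribution below is hypothetical):

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile of observed latencies; p in (0, 100]."""
    ordered = sorted(samples)
    rank = round(p / 100 * len(ordered))
    return ordered[max(0, min(len(ordered) - 1, rank - 1))]

# Hypothetical distribution: mostly 5ms, with a small slow tail.
latencies = [5.0] * 990 + [50.0] * 9 + [2000.0]

avg = sum(latencies) / len(latencies)        # 7.4ms: useless as a timeout
timeout_ms = percentile(latencies, 99.9)     # 50ms: fires only on real outliers
```

A timeout near the 7.4ms average would fire on healthy requests; the p99.9 value fires only on the genuine outlier, which is exactly the behaviour the property above asks for. Production systems compute this from live histograms rather than raw sample lists, but the calibration logic is the same.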
The mechanisms that implement these properties — idempotency keys, circuit breakers, exponential backoff, service discovery, health-check-driven routing — form the subject of Part 2. They are not optimisations added after a system works. They are the mechanisms that make a distributed system work at all.
Key Takeaways
- Networks are unreliable by nature — messages can be delayed, lost, duplicated, or delivered out of order, and from the sender’s perspective a delayed message and a lost message produce identical silence
- Latency variability is more dangerous than latency itself — the tail (p99, p99.9) determines user experience and system stability at scale, not the average, because rare events become frequent at high request volumes
- Timeouts surface ambiguity rather than resolving it — when a timeout fires, the caller cannot determine whether the request failed, succeeded, or is still in progress, making retries safe only when operations are idempotent
- Message loss and duplication are normal operational conditions, not edge cases — systems must be designed around at-least-once delivery with idempotent operations rather than assuming exactly-once delivery
- Network partitions are inevitable in production — routers fail, misconfigurations occur, cloud networks have disruptions, and any distributed system running for years will experience partitions that require explicit design decisions
- The partially synchronous network model describes production systems most accurately — timing is unreliable but eventually stabilises, which is why algorithms like Raft guarantee safety always and liveness only during stable periods
- Distributed system correctness must not depend on perfect communication — every design decision about retries, timeouts, and consistency must account for messages that may be lost, delayed, duplicated, or delivered out of order
Frequently Asked Questions (FAQ)
What is a network model in distributed systems?
A network model is the set of assumptions a distributed system makes about how messages behave in transit — whether they can be delayed, lost, duplicated, or delivered out of order, and whether the network can partition. Three standard models exist: synchronous (bounded message delay), asynchronous (no timing guarantees), and partially synchronous (unreliable but eventually stabilising). The choice of network model determines which algorithms are correct and which guarantees are achievable. Most production distributed systems assume partial synchrony — which is why algorithms like Raft guarantee safety always but liveness only when the network is sufficiently stable.
What is the difference between message loss, delay, and duplication?
These are three distinct failure modes with different implications for system design. A lost message never arrives at the receiver — the sender receives no acknowledgement and cannot determine whether to retry. A delayed message arrives eventually but later than expected — potentially after a timeout has fired on the sender, causing a retry that produces a duplicate delivery. A duplicated message arrives more than once, typically because a retry was issued before the original delivery was confirmed. From the sender’s perspective, a lost message and a delayed message produce identical silence, which is why distributed system design requires idempotent operations and deduplication mechanisms rather than relying on exactly-once delivery.
Are network partitions the same as node failures?
No, and the distinction matters significantly for system design. A network partition means nodes are alive and individually healthy but cannot communicate with each other due to a network-level failure — a router failure, misconfigured firewall, or cloud networking disruption. A node failure means the node itself has crashed or stopped. The critical difference: a partitioned node may continue processing requests and accepting writes independently, leading to diverged state on both sides of the partition. A crashed node cannot process anything. This is why partition tolerance requires different design decisions than fault tolerance — the system must handle two groups that each believe they are the authoritative partition.
Why can the sender not know what happened to a message?
Because silence is ambiguous in an asynchronous network. A lack of response from the receiver could mean the message was lost in transit and never received, the receiver is slow and still processing, the receiver crashed after processing but before sending a response, the response was sent but lost on the way back, or a network partition separated sender from receiver. The sender has no way to distinguish between these scenarios without additional coordination — which itself requires network communication that may also fail. This irreducible ambiguity is why timeout-based failure detection always produces probabilistic suspicions rather than certainty, and why the partially synchronous model is the most honest description of real networks.
Why is tail latency more important than average latency at scale?
Because tail latency determines system behaviour in the conditions that actually matter — high load, fan-out requests, and degraded dependencies. Average latency hides the distribution. A service with 5ms average latency and 500ms p99 latency serves 1 in 100 requests with a half-second response — invisible in the average but highly visible to users and to systems that fan out to many instances. At N=100 components each with 1% slow tail, approximately 63% of fan-out requests hit at least one slow component. At scale, the tail is not the exception — it becomes the typical experience for distributed workloads.
What does the 2012 AWS US-East-1 outage teach us about network partitions?
Three specific lessons. First, partitions occur at any scale — even AWS’s mature, redundant infrastructure experienced a significant partition during the June 2012 storm. Second, the consequences of a partition depend entirely on the design decision made in advance: systems that prioritised consistency stopped serving traffic; systems that prioritised availability continued with potentially stale data. Neither outcome was accidental — the outcome was determined by the architectural decisions made before the incident. Third, chaos engineering — Netflix’s practice of deliberately injecting failures — is what prepared Netflix’s systems to degrade gracefully rather than collapse completely. Resilience is validated under realistic failure conditions, not assumed from design intent.
The outage that changed how I think about network assumptions
The 2012 AWS US-East-1 incident was not the first major cloud outage I was aware of, but it was the one that made the network model tangible in a way that reading about it had not. The specific detail that stayed with me was not the scale — it was that the ELB instances kept serving traffic while disconnected from the control plane. From the outside, they looked healthy. From the inside, they were operating on a view of the world that had stopped being accurate the moment the partition formed.
That failure mode — a component that appears healthy while operating on stale assumptions — is the failure mode that appears most often in production distributed systems. Not crashed services. Not obvious errors. Components that are alive, that are responding, that are doing work — but doing it based on information that is no longer true. Network partitions make this visible at infrastructure scale. At the application level, it shows up as stale cache reads, expired lock holders that do not know their lease is gone, and replicas that have fallen behind and are serving outdated data without any indication that they are doing so.
At IDFC First Bank, the network model question that comes up most often is not “will the network drop a packet” — it is “what does the system do when a downstream service stops responding?” The five scenarios that produce identical silence from the caller’s perspective are not theoretical. They are the daily operational reality of a microservices platform. A payment service calling a fraud check service that stops responding — did the fraud check fail, succeed, crash, or is it just slow? The timeout fires and the answer is still ambiguous. What the system does next — whether it retries, whether the retry is idempotent, whether it has a fallback — is what determines whether the user sees an error or a seamless experience.
That is the practical value of understanding the network model. Not that it solves the problem, but that it reframes it correctly. The question is not “how do I prevent the network from failing?” The question is “what does my system do when the network behaves the way networks behave?” The mechanisms in Part 2 are the answers to that question. They make more sense after this post than they would have before it.
Series home: Distributed Systems — Concepts, Design & Real-World Engineering
Part 1 — Foundations
- 1.1 — What Is a Distributed System (Really)?
- 1.2 — System Models: How Distributed Systems See the World
- 1.3 — Network Model: Latency, Loss and Partitions ← you are here
- 1.4 — Node & Failure Model: Crashes, Slow Nodes and Partial Failure →
- 1.5 — Time Model: Why Ordering Is Harder Than It Looks
Previous: ← 1.2 — System Models: How Distributed Systems See the World
Next: 1.4 — Node & Failure Model: Crashes, Slow Nodes and Partial Failure →
Posts that build directly on this foundation:
- 2.2 — Reliability and Retries — Idempotency, circuit breakers and safe retry design for unreliable networks
- 3.4 — The CAP Theorem — The consistency vs availability choice that network partitions force
- 4.9 — Chaos Engineering — The practice that prepared Netflix for the 2012 AWS partition
- The Eight Fallacies of Distributed Computing — Fallacy 1 (network is reliable) and Fallacy 2 (latency is zero) are the direct subject of this post
Once you have completed all five Foundation posts, Part 2 shows how distributed systems cope with these network realities in practice →