Distributed Systems Series — Part 1.1: Foundations
A Distributed System is a set of independent components that coordinate over unreliable communication to appear as a single coherent system. This definition — coordination under uncertainty, not just multiple machines — is what makes distributed systems fundamentally harder than single-node systems and why most of the failures that happen in production do not look like the failures engineers design for.
A Distributed System is not defined by how many machines it uses, but by the fact that independent components must coordinate over unreliable communication while still making progress. The moment coordination depends on a network, uncertainty becomes part of the system’s behavior.
In practice, Distributed Systems power nearly everything we build today — cloud platforms, payment systems, e-commerce, ride-sharing, streaming services and enterprise SaaS platforms. Yet despite their ubiquity, they remain one of the most misunderstood areas of software engineering.
Why Distributed Systems Are Fundamentally Hard
Distributed systems are hard because failure isn’t obvious. The system often looks healthy right up until it isn’t. Let’s break down why.
1. Partial Failure Is the Default
In a single-node system, failure is usually obvious: the system is either up or down. In contrast, Distributed Systems fail partially.
- One service responds, another times out
- One replica is up to date, another is stale
- One request succeeds, another silently disappears
As a result, engineers must design systems that continue operating even when some components are unhealthy, unreachable or slow.
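As a small illustration of designing for partial failure, consider a page render whose dependency times out. This is a toy sketch with in-memory stand-ins (`fetch_product` and `fetch_recommendations` are hypothetical names, not a real API); the point is that a failing dependency degrades the result instead of failing it:

```python
# Hypothetical service calls; in a real system these would be network requests.
def fetch_recommendations():
    raise TimeoutError("recommendation service did not respond")

def fetch_product():
    return {"id": 42, "name": "widget"}

def render_page():
    """Degrade gracefully: a slow or failing dependency must not fail the page."""
    page = {"product": fetch_product()}
    try:
        page["recommendations"] = fetch_recommendations()
    except TimeoutError:
        page["recommendations"] = []  # serve a partial page instead of an error
    return page

print(render_page())
```

The design choice here is that the product data is essential while recommendations are optional, so only the former is allowed to fail the request.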
2. Communication Is Inherently Unreliable
Once components communicate over a network, guarantees disappear. Messages can be:
- Delayed
- Lost
- Duplicated
- Delivered out of order
Consequently, remote calls are not local calls. Treating them as such is one of the most common — and costly — design mistakes.
Retries, timeouts, and idempotency are not optimisations — they are survival mechanisms. The mechanisms that make them safe are covered in Post 2.2.
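The interplay of retries and idempotency can be sketched in a few lines. This is an illustrative toy, not the mechanism from Post 2.2: `transfer`, `call_with_retries`, and the in-memory `processed` map are invented stand-ins for a real server-side idempotency store:

```python
import time

processed = {}  # server-side record: idempotency key -> result

def transfer(idempotency_key, amount):
    """Server: applying the same request twice must have the same effect as once."""
    if idempotency_key in processed:
        return processed[idempotency_key]  # duplicate delivery: return prior result
    result = {"status": "ok", "amount": amount}
    processed[idempotency_key] = result
    return result

def call_with_retries(key, amount, attempts=3):
    """Client: retry on failure; safe only because the server deduplicates by key."""
    for attempt in range(attempts):
        try:
            return transfer(key, amount)
        except ConnectionError:
            time.sleep(0.01 * 2 ** attempt)  # exponential backoff between attempts
    raise RuntimeError("all retries exhausted")

# The same key sent twice (e.g. a retried duplicate) is applied exactly once.
r1 = call_with_retries("txn-123", 100)
r2 = call_with_retries("txn-123", 100)
assert r1 == r2 and len(processed) == 1
```

Without the idempotency key, the retry loop alone would be dangerous: a request that succeeded but whose response was lost would be applied twice.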
3. Time Is Not What You Think It Is
In Distributed Systems, there is no single, authoritative clock.
- Physical clocks drift
- Synchronization is imperfect
- Network delays obscure event ordering
When two events appear to happen “at the same time,” what that really means is we don’t know which happened first — and we might never know.
This is why Distributed Systems rely on logical clocks and causal ordering rather than trusting timestamps alone — covered in Post 1.5 and Post 2.5.
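A minimal Lamport clock shows the shape of the idea: counters, not wall time, establish order. This is a toy sketch for intuition, not the full treatment in Post 1.5:

```python
class LamportClock:
    """Logical clock: event counters, not wall time, establish a causal order."""
    def __init__(self):
        self.time = 0

    def tick(self):  # local event
        self.time += 1
        return self.time

    def send(self):  # attach the current time to an outgoing message
        return self.tick()

    def receive(self, msg_time):  # advance past the sender's clock
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
a.tick()          # A: local event at time 1
t = a.send()      # A: sends a message at time 2
b.receive(t)      # B: max(0, 2) + 1 = 3
assert b.time > t # the receive is ordered after the send, without any wall clock
```

Note what this buys: whatever the machines' physical clocks say, a message receive is always ordered after its send.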
4. Consistency and Availability Are Trade-offs
When networks partition, and they eventually will, systems are forced to make a trade-off:
They can return correct results, even if that means becoming temporarily unavailable,
or they can remain available while serving data that may be inconsistent.
This tension is formalised by the CAP theorem. More importantly, it appears in almost every real-world system design decision.
There is no universally correct choice — only trade-offs aligned with business and user expectations.
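The trade-off can be made concrete with a toy replica that is cut off from its peers. Everything here (`Replica`, the `mode` flag, the partition state) is an invented illustration of the two choices, not a real database API:

```python
class Replica:
    """A replica partitioned from its peers must choose: answer (AP) or refuse (CP)."""
    def __init__(self, mode):
        self.mode = mode                # "AP" or "CP"
        self.value = "last-known-balance"
        self.partitioned = True         # cannot reach a quorum of peers

    def read(self):
        if not self.partitioned:
            return self.value
        if self.mode == "AP":
            return self.value           # stay available; the value may be stale
        raise RuntimeError("unavailable: cannot confirm latest value during partition")

assert Replica("AP").read() == "last-known-balance"
try:
    Replica("CP").read()
except RuntimeError:
    pass  # the CP replica chooses unavailability over a possibly-stale answer
```

Neither branch is wrong in general; which one you want depends on whether a stale answer or a refused request is the cheaper failure for the business.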
5. Complexity Grows Non-Linearly as Systems Scale
As systems scale:
- Nodes increase
- Interactions multiply
- Failure scenarios explode
A bug or delay that is rare on a single machine becomes inevitable at scale. A failure that happens “once a month” on one node may occur every few minutes across a large fleet.
Distributed Systems don’t just scale capacity — they scale uncertainty.
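The "once a month becomes every few minutes" effect is just probability at work. Assuming independent failures with per-node probability p over some window, the chance that at least one of n nodes fails in that window is 1 − (1 − p)^n:

```python
def fleet_failure_probability(p, n):
    """Probability that at least one of n nodes fails, given per-node probability p,
    assuming failures are independent."""
    return 1 - (1 - p) ** n

# A "rare" 0.1% per-window failure becomes likely across a large fleet.
print(fleet_failure_probability(0.001, 1))     # one node: still rare
print(fleet_failure_probability(0.001, 1000))  # 1000 nodes: roughly a coin flip or worse
```

In reality failures are often correlated (shared racks, shared deploys), which can make the picture worse, not better.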
A Real-World Example: A Banking Transaction
Consider a simple money transfer.
In a single-node system:
- Debit account A
- Credit account B
- Done
In a distributed system:
- Multiple services are involved (accounts, ledger, fraud checks, notifications)
- Some services may be slow or temporarily unavailable
- Messages may arrive late or be processed more than once
The hardest problem isn’t handling obvious failures — it’s handling ambiguity.
- Did the debit succeed?
- Was the confirmation delayed or lost?
- Is retrying safe or dangerous?
Distributed Systems must make progress without always knowing the true state of the world.
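One safe way to act under that ambiguity is to record outcomes authoritatively and query them before retrying. The sketch below is illustrative (`ledger`, `apply_debit`, and `query_outcome` are invented names standing in for a durable transaction store):

```python
ledger = {}  # authoritative record: txn_id -> debit entry

def apply_debit(txn_id, amount):
    ledger[txn_id] = {"amount": amount}
    return "ok"

def query_outcome(txn_id):
    """Resolve ambiguity by asking the authoritative store, not by guessing."""
    return "applied" if txn_id in ledger else "unknown"

# The debit was applied, but the confirmation was lost in transit.
apply_debit("txn-9", 50)
confirmation_received = False

# Instead of blindly retrying (risking a duplicate) or giving up (risking a
# missed transfer), the client checks the recorded outcome first.
if not confirmation_received:
    status = query_outcome("txn-9")
assert status == "applied"  # the debit did happen; no retry is needed
```

The client never learns the truth from the failed response itself; it learns it by consulting state that survived the failure.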
How Amazon Learned This the Hard Way
Amazon’s early e-commerce platform was a single monolithic system. As it scaled, the team discovered exactly what the banking example above describes: partial failures became the norm, not the exception. A slow database call could hold up an entire page render. One team’s deployment could break another team’s service. Recovery was unpredictable.
Werner Vogels, Amazon’s CTO, later described the core insight that reshaped their architecture: “Failures are a given and everything will eventually fail over time.” This led Amazon to decompose their platform into independent services — what we now call microservices — each owning its own data and communicating over the network.
The result was not simplicity. It was a deliberately Distributed System, designed around the assumption that the network is unreliable, that partial failure is normal and that no single component should be able to bring down the whole. Every trade-off we will cover in this series — Consistency vs Availability, Retries vs Idempotency, Coordination vs Independence — was a direct consequence of that architectural decision.
Amazon did not build a Distributed System because it sounded elegant. They built one because a single node could no longer absorb the uncertainty of their scale.
What This Means for Engineers
Distributed Systems force engineers to abandon comforting assumptions:
- The network is not reliable
- Time is not consistent
- Failure is not exceptional
- Recovery matters more than prevention
Reliable systems are not those that never fail, but those that fail in predictable ways and recover quickly.
Key Takeaways
- A Distributed System is defined by coordination under uncertainty, not by the number of machines — The moment coordination depends on a network, ambiguity becomes a permanent operating condition
- Partial failures are the default, not the exception — One service responding while another times out, one replica current while another is stale, is the normal state of a large distributed system at scale
- Network communication is fundamentally unreliable — Messages can be delayed, lost, duplicated or delivered out of order, making remote calls categorically different from local function calls
- There is no global clock in a distributed system — Physical clocks drift, NTP adjustments can move time backward and timestamps cannot reliably establish which of two distributed events happened first
- Consistency and Availability are genuine trade-offs during network partitions — No design can provide both simultaneously, only deliberate choices about which to prioritise in each scenario
- Complexity grows non-linearly with scale — A failure mode that occurs once a month on one node occurs every few minutes across a large fleet, turning rare events into operational norms
- Recovery and Observability are first-class design concerns, not afterthoughts — Reliable Distributed Systems are not systems that never fail but systems that fail predictably and recover quickly
Frequently Asked Questions (FAQ)
Is a Distributed System just a system with many machines?
No. It’s a system where components coordinate to behave like a single logical system, despite failures and network unpredictability.
Can a system running in a single data center be distributed?
Yes. Distribution is about independence and failure boundaries, not geographic location.
What is a Distributed System in simple terms?
A distributed system is a group of independent computers that work together and appear as a single system to users.
Are microservices a Distributed System?
Yes, because services run independently and communicate over a network.
Does a Distributed System require machines in multiple locations?
No. Distribution is about failure boundaries, not physical distance.
What is the difference between microservices and a Distributed System?
Microservices is an architectural style for decomposing a system into independently deployable services. A distributed system is any system where components coordinate over a network — which means every microservices architecture is a distributed system. The reverse is not true: a distributed system does not require microservices. A distributed database, a Hadoop cluster and a CDN are all distributed systems without using a microservices architecture.
What I wish engineers understood before their first distributed systems incident
The definition at the top of this post — coordination under uncertainty, not just multiple machines — took me longer to internalise than it should have. In the early years of my career, I thought distributed systems were hard because they were technically complex. The algorithms, the protocols, the trade-offs. Those are hard. But that is not the real reason.
The real reason is simpler and more uncomfortable: distributed systems force you to design for a world where you cannot know the true state of things. Did the payment succeed before the timeout? Was the message received or just delayed? Is that node crashed or just slow? In a single-machine system, these questions have answers. In a distributed system, they frequently do not — and your design must be correct either way.
At IDFC First Bank, we run financial transaction systems where that uncertainty has real consequences. A payment that is ambiguously failed is not just a user experience problem. It is a potential duplicate charge if you retry incorrectly, a missed settlement if you do not retry at all and a compliance question if you cannot explain what happened. The banking transaction example in this post is not hypothetical. It is the class of problem we solve every day.
What building those systems taught me is exactly what this post describes: recovery matters more than prevention. You cannot prevent the network from dropping packets. You cannot prevent nodes from pausing at the worst moment. You cannot prevent clocks from drifting. What you can do is design systems that handle those realities predictably — that fail in ways you anticipated, recover automatically when they can, and surface clear signals when they cannot.
That is what this series is about. Start here. Read sequentially through Part 1. The five posts establish the mental model that everything else depends on. Engineers who skip the foundations and jump straight to replication or consensus consistently find those topics harder than they need to be — not because the concepts are complex, but because the problem those concepts are solving is not yet clear.
Series home: Distributed Systems — Concepts, Design & Real-World Engineering
Part 1 — Foundations
- 1.1 — What Is a Distributed System (Really)
- 1.2 — System Models: How Distributed Systems See the World
- 1.3 — Network Model: Latency, Loss and Partitions
- 1.4 — Node & Failure Model: Crashes, Slow Nodes and Partial Failure
- 1.5 — Time Model: Why Ordering Is Harder Than It Looks
Next: 1.2 — System Models: How Distributed Systems See the World →
Once you have worked through all five Foundation posts, Part 2 begins with how distributed systems cope with these realities in practice →