Distributed Systems Series — Part 1.1: Foundations
A Distributed System is a set of independent components that coordinate over unreliable communication to appear as a single coherent system. This definition — coordination under uncertainty, not just multiple machines — is what makes distributed systems fundamentally harder than single-node systems and why most of the failures that happen in production do not look like the failures engineers design for.
A Distributed System is not defined by how many machines it uses, but by the fact that independent components must coordinate over unreliable communication while still making progress. The moment coordination depends on a network, uncertainty becomes part of the system’s behavior.
In practice, Distributed Systems power nearly everything we build today — cloud platforms, payment systems, e-commerce, ride-sharing, streaming services and enterprise SaaS platforms. Yet despite their ubiquity, they remain one of the most misunderstood areas of software engineering.
Why Distributed Systems Are Fundamentally Hard
Distributed systems are hard because failure isn’t obvious. The system often looks healthy right up until it isn’t. Let’s break down why.
1. Partial Failure Is the Default
In a single-node system, failure is usually obvious: the system is either up or down. In contrast, Distributed Systems fail partially.
- One service responds, another times out
- One replica is up to date, another is stale
- One request succeeds, another silently disappears
As a result, engineers must design systems that continue operating even when some components are unhealthy, unreachable or slow.
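As a small illustration of designing for partial failure, consider a page render whose dependency times out. This is a toy sketch with in-memory stand-ins (`fetch_product` and `fetch_recommendations` are hypothetical names, not a real API); the point is that a failing dependency degrades the result instead of failing it:

```python
# Hypothetical service calls; in a real system these would be network requests.
def fetch_recommendations():
    raise TimeoutError("recommendation service did not respond")

def fetch_product():
    return {"id": 42, "name": "widget"}

def render_page():
    """Degrade gracefully: a slow or failing dependency must not fail the page."""
    page = {"product": fetch_product()}
    try:
        page["recommendations"] = fetch_recommendations()
    except TimeoutError:
        page["recommendations"] = []  # serve a partial page instead of an error
    return page

print(render_page())
```

The design choice here is that the product data is essential while recommendations are optional, so only the former is allowed to fail the request.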
2. Communication Is Inherently Unreliable
Once components communicate over a network, guarantees disappear. Messages can be:
- Delayed
- Lost
- Duplicated
- Delivered out of order
Consequently, remote calls are not local calls. Treating them as such is one of the most common — and costly — design mistakes.
Retries, timeouts, and idempotency are not optimisations — they are survival mechanisms. The mechanisms that make them safe are covered in Post 2.2.
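The interplay of retries and idempotency can be sketched in a few lines. This is an illustrative toy, not the mechanism from Post 2.2: `transfer`, `call_with_retries`, and the in-memory `processed` map are invented stand-ins for a real server-side idempotency store:

```python
import time

processed = {}  # server-side record: idempotency key -> result

def transfer(idempotency_key, amount):
    """Server: applying the same request twice must have the same effect as once."""
    if idempotency_key in processed:
        return processed[idempotency_key]  # duplicate delivery: return prior result
    result = {"status": "ok", "amount": amount}
    processed[idempotency_key] = result
    return result

def call_with_retries(key, amount, attempts=3):
    """Client: retry on failure; safe only because the server deduplicates by key."""
    for attempt in range(attempts):
        try:
            return transfer(key, amount)
        except ConnectionError:
            time.sleep(0.01 * 2 ** attempt)  # exponential backoff between attempts
    raise RuntimeError("all retries exhausted")

# The same key sent twice (e.g. a retried duplicate) is applied exactly once.
r1 = call_with_retries("txn-123", 100)
r2 = call_with_retries("txn-123", 100)
assert r1 == r2 and len(processed) == 1
```

Without the idempotency key, the retry loop alone would be dangerous: a request that succeeded but whose response was lost would be applied twice.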
3. Time Is Not What You Think It Is
In Distributed Systems, there is no single, authoritative clock.
- Physical clocks drift
- Synchronization is imperfect
- Network delays obscure event ordering
When two events appear to happen “at the same time,” what that really means is we don’t know which happened first — and we might never know.
This is why Distributed Systems rely on logical clocks and causal ordering rather than trusting timestamps alone — covered in Post 1.5 and Post 2.5.
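A minimal Lamport clock shows the shape of the idea: counters, not wall time, establish order. This is a toy sketch for intuition, not the full treatment in Post 1.5:

```python
class LamportClock:
    """Logical clock: event counters, not wall time, establish a causal order."""
    def __init__(self):
        self.time = 0

    def tick(self):  # local event
        self.time += 1
        return self.time

    def send(self):  # attach the current time to an outgoing message
        return self.tick()

    def receive(self, msg_time):  # advance past the sender's clock
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
a.tick()          # A: local event at time 1
t = a.send()      # A: sends a message at time 2
b.receive(t)      # B: max(0, 2) + 1 = 3
assert b.time > t # the receive is ordered after the send, without any wall clock
```

Note what this buys: whatever the machines' physical clocks say, a message receive is always ordered after its send.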
4. Consistency and Availability Are Trade-offs
When networks partition, and they eventually will, systems are forced to make a trade-off:
They can return correct results, even if that means becoming temporarily unavailable,
or they can remain available while serving data that may be inconsistent.
This tension is formalised by the CAP theorem. More importantly, it appears in almost every real-world system design decision.
There is no universally correct choice — only trade-offs aligned with business and user expectations.
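The trade-off can be made concrete with a toy replica that is cut off from its peers. Everything here (`Replica`, the `mode` flag, the partition state) is an invented illustration of the two choices, not a real database API:

```python
class Replica:
    """A replica partitioned from its peers must choose: answer (AP) or refuse (CP)."""
    def __init__(self, mode):
        self.mode = mode                # "AP" or "CP"
        self.value = "last-known-balance"
        self.partitioned = True         # cannot reach a quorum of peers

    def read(self):
        if not self.partitioned:
            return self.value
        if self.mode == "AP":
            return self.value           # stay available; the value may be stale
        raise RuntimeError("unavailable: cannot confirm latest value during partition")

assert Replica("AP").read() == "last-known-balance"
try:
    Replica("CP").read()
except RuntimeError:
    pass  # the CP replica chooses unavailability over a possibly-stale answer
```

Neither branch is wrong in general; which one you want depends on whether a stale answer or a refused request is the cheaper failure for the business.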
5. Complexity Grows Non-Linearly as Systems Scale
As systems scale:
- Nodes increase
- Interactions multiply
- Failure scenarios explode
A bug or delay that is rare on a single machine becomes inevitable at scale. A failure that happens “once a month” on one node may occur every few minutes across a large fleet.
Distributed Systems don’t just scale capacity — they scale uncertainty.
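The "once a month becomes every few minutes" effect is just probability at work. Assuming independent failures with per-node probability p over some window, the chance that at least one of n nodes fails in that window is 1 − (1 − p)^n:

```python
def fleet_failure_probability(p, n):
    """Probability that at least one of n nodes fails, given per-node probability p,
    assuming failures are independent."""
    return 1 - (1 - p) ** n

# A "rare" 0.1% per-window failure becomes likely across a large fleet.
print(fleet_failure_probability(0.001, 1))     # one node: still rare
print(fleet_failure_probability(0.001, 1000))  # 1000 nodes: roughly a coin flip or worse
```

In reality failures are often correlated (shared racks, shared deploys), which can make the picture worse, not better.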
A Real-World Example: A Banking Transaction
Consider a simple money transfer.
In a single-node system:
- Debit account A
- Credit account B
- Done
In a distributed system:
- Multiple services are involved (accounts, ledger, fraud checks, notifications)
- Some services may be slow or temporarily unavailable
- Messages may arrive late or be processed more than once
The hardest problem isn’t handling obvious failures — it’s handling ambiguity.
- Did the debit succeed?
- Was the confirmation delayed or lost?
- Is retrying safe or dangerous?
Distributed Systems must make progress without always knowing the true state of the world.
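One safe way to act under that ambiguity is to record outcomes authoritatively and query them before retrying. The sketch below is illustrative (`ledger`, `apply_debit`, and `query_outcome` are invented names standing in for a durable transaction store):

```python
ledger = {}  # authoritative record: txn_id -> debit entry

def apply_debit(txn_id, amount):
    ledger[txn_id] = {"amount": amount}
    return "ok"

def query_outcome(txn_id):
    """Resolve ambiguity by asking the authoritative store, not by guessing."""
    return "applied" if txn_id in ledger else "unknown"

# The debit was applied, but the confirmation was lost in transit.
apply_debit("txn-9", 50)
confirmation_received = False

# Instead of blindly retrying (risking a duplicate) or giving up (risking a
# missed transfer), the client checks the recorded outcome first.
if not confirmation_received:
    status = query_outcome("txn-9")
assert status == "applied"  # the debit did happen; no retry is needed
```

The client never learns the truth from the failed response itself; it learns it by consulting state that survived the failure.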
How Amazon Learned This the Hard Way
Amazon’s early e-commerce platform was a single monolithic system. As it scaled, the team discovered exactly what the banking example above describes: partial failures became the norm, not the exception. A slow database call could hold up an entire page render. One team’s deployment could break another team’s service. Recovery was unpredictable.
Werner Vogels, Amazon’s CTO, later described the core insight that reshaped their architecture: “Failures are a given and everything will eventually fail over time.” This led Amazon to decompose their platform into independent services — what we now call microservices — each owning its own data and communicating over the network.
The result was not simplicity. It was a deliberately Distributed System, designed around the assumption that the network is unreliable, that partial failure is normal and that no single component should be able to bring down the whole. Every trade-off we will cover in this series — Consistency vs Availability, Retries vs Idempotency, Coordination vs Independence — was a direct consequence of that architectural decision.
Amazon did not build a Distributed System because it sounded elegant. They built one because a single node could no longer absorb the uncertainty of their scale.
What This Means for Engineers
Distributed Systems force engineers to abandon comforting assumptions:
- The network is not reliable
- Time is not consistent
- Failure is not exceptional
- Recovery matters more than prevention
Reliable systems are not those that never fail, but those that fail in predictable ways and recover quickly.
Key Takeaways
- A Distributed System is defined by coordination under uncertainty, not by the number of machines — The moment coordination depends on a network, ambiguity becomes a permanent operating condition
- Partial failures are the default, not the exception — One service responding while another times out, one replica current while another is stale, is the normal state of a large distributed system at scale
- Network communication is fundamentally unreliable — Messages can be delayed, lost, duplicated or delivered out of order, making remote calls categorically different from local function calls
- There is no global clock in a distributed system — Physical clocks drift, NTP adjustments can move time backward and timestamps cannot reliably establish which of two distributed events happened first
- Consistency and Availability are genuine trade-offs during network partitions — No design can provide both simultaneously, only deliberate choices about which to prioritise in each scenario
- Complexity grows non-linearly with scale — A failure mode that occurs once a month on one node occurs every few minutes across a large fleet, turning rare events into operational norms
- Recovery and Observability are first-class design concerns, not afterthoughts — Reliable Distributed Systems are not systems that never fail but systems that fail predictably and recover quickly
Frequently Asked Questions (FAQ)
Is a Distributed System just a system with many machines?
No. It’s a system where components coordinate to behave like a single logical system, despite failures and network unpredictability.
Can a system running in a single data center be distributed?
Yes. Distribution is about independence and failure boundaries, not geographic location.
What is a Distributed System in simple terms?
A distributed system is a group of independent computers that work together and appear as a single system to users.
Are microservices a Distributed System?
Yes, because services run independently and communicate over a network.
Does a Distributed System require machines in multiple locations?
No. Distribution is about failure boundaries, not physical distance.
What is the difference between microservices and a Distributed System?
Microservices is an architectural style for decomposing a system into independently deployable services. A distributed system is any system where components coordinate over a network — which means every microservices architecture is a distributed system. The reverse is not true: a distributed system does not require microservices. A distributed database, a Hadoop cluster and a CDN are all distributed systems without using a microservices architecture.
What I wish engineers understood before their first distributed systems incident
The definition at the top of this post — coordination under uncertainty, not just multiple machines — took me longer to internalise than it should have. In the early years of my career, I thought distributed systems were hard because they were technically complex. The algorithms, the protocols, the trade-offs. Those are hard. But that is not the real reason.
The real reason is simpler and more uncomfortable: distributed systems force you to design for a world where you cannot know the true state of things. Did the payment succeed before the timeout? Was the message received or just delayed? Is that node crashed or just slow? In a single-machine system, these questions have answers. In a distributed system, they frequently do not — and your design must be correct either way.
At IDFC First Bank, we run financial transaction systems where that uncertainty has real consequences. A payment that is ambiguously failed is not just a user experience problem. It is a potential duplicate charge if you retry incorrectly, a missed settlement if you do not retry at all and a compliance question if you cannot explain what happened. The banking transaction example in this post is not hypothetical. It is the class of problem we solve every day.
What building those systems taught me is exactly what this post describes: recovery matters more than prevention. You cannot prevent the network from dropping packets. You cannot prevent nodes from pausing at the worst moment. You cannot prevent clocks from drifting. What you can do is design systems that handle those realities predictably — that fail in ways you anticipated, recover automatically when they can, and surface clear signals when they cannot.
That is what this series is about. Start here. Read sequentially through Part 1. The five posts establish the mental model that everything else depends on. Engineers who skip the foundations and jump straight to replication or consensus consistently find those topics harder than they need to be — not because the concepts are complex, but because the problem those concepts are solving is not yet clear.
Series home: Distributed Systems — Concepts, Design & Real-World Engineering
Part 1 — Foundations
- 1.1 — What Is a Distributed System (Really)
- 1.2 — System Models: How Distributed Systems See the World
- 1.3 — Network Model: Latency, Loss and Partitions
- 1.4 — Node & Failure Model: Crashes, Slow Nodes and Partial Failure
- 1.5 — Time Model: Why Ordering Is Harder Than It Looks
Next: 1.2 — System Models: How Distributed Systems See the World →
Once you have worked through all five Foundation posts, Part 2 begins with how distributed systems cope with these realities in practice →