Communication Fundamentals in Distributed Systems

Distributed Systems Series — Part 2.1: Communication & Coordination

Everything in Distributed Systems Starts with Communication

Communication fundamentals in distributed systems are the foundation on which every higher-level property — reliability, consistency, fault tolerance, scalability — is built. Before services can coordinate, replicate data, or agree on decisions, they must first exchange information. Before retries can be made safe, before service discovery can route correctly, before a leader can be elected, components must communicate. And communication in distributed systems is fundamentally different from communication inside a single process.

Once a system crosses a network boundary, the assumptions that feel natural in local code break down. As Post 1.3 established, messages can be delayed, lost, duplicated, or delivered out of order. A function call that either succeeds or fails in microseconds becomes a remote call that may hang, time out, partially succeed, or succeed on the receiver while the response is lost in transit. The sender and receiver execute independently, in different failure domains, on different hardware, under different load conditions.

This post establishes the communication vocabulary and patterns that every subsequent post in this series builds on.

Communication Is Message Passing

At its core, all communication in distributed systems is message passing. One component sends a message. Another component receives it. Every higher-level abstraction — RPC, REST, gRPC, event streams, message queues — is built on top of this basic mechanism.

Unlike shared-memory systems, distributed components cannot directly read or write each other’s state. All interaction happens through messages that travel across a network. This has three structural consequences that shape every design decision in distributed systems.

  1. Communication is slower than local memory access by orders of magnitude — a local function call completes in nanoseconds; a network round trip takes milliseconds.
  2. Messages can be delayed, lost, duplicated, or reordered — the network provides no delivery guarantee beyond best effort.
  3. The sender and receiver execute independently — the sender cannot know what state the receiver is in when the message arrives, or whether it arrives at all.

These consequences are not engineering problems waiting to be solved. They are permanent properties of network communication that every system design must acknowledge and accommodate.

The Three Communication Patterns and Their Failure Characteristics

In practice, the choice of communication pattern is not abstract. Every pattern carries specific failure characteristics that directly shape how a system behaves when things go wrong. The three dominant patterns — synchronous RPC, asynchronous messaging, and event streaming — fail in fundamentally different ways. The engineering decision is about which failure characteristics the system is best positioned to handle, given its consistency requirements, latency budget, and operational maturity.

Synchronous RPC: REST and gRPC

In synchronous RPC, the caller sends a request and blocks, waiting for a response. This is the most intuitive model and the most common starting point for service-to-service communication.

REST over HTTP is the dominant form — human-readable, firewall-friendly, widely supported, and easy to debug with standard tooling. gRPC uses Protocol Buffers over HTTP/2, offering lower latency through binary encoding, built-in bidirectional streaming, and a strongly typed schema enforced at compile time. The trade-off: gRPC requires schema management and is harder to inspect without specialised tooling. REST is the right default for external APIs and teams with mixed tooling. gRPC is appropriate for internal service-to-service communication where performance matters and both sides control the schema.

The failure characteristics of synchronous RPC are the hardest to reason about. The caller is blocked for the entire duration of the call — a slow downstream service directly slows every caller. A timeout leaves the caller in uncertainty about whether the operation completed — the receiver may have processed the request and sent a response that was lost in transit. Retries risk duplicate execution unless the operation is idempotent. A crashed downstream service fails fast; a slow one holds connections open and can cascade failure upward through the call chain.
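
To make that uncertainty concrete, here is a minimal Python sketch using the requests library; the endpoint, payload, and return values are illustrative assumptions, not a real API. The key distinction: a connect timeout means the request never reached the server and is safe to retry, while a read timeout means the server may have processed the request, so a blind retry is only safe if the operation is idempotent.

    import requests

    RESERVE_URL = "https://inventory.internal.example/reserve"   # hypothetical endpoint

    def reserve_stock(item_id, quantity):
        try:
            resp = requests.post(
                RESERVE_URL,
                json={"item_id": item_id, "quantity": quantity},
                timeout=(1.0, 5.0),   # (connect, read): never wait indefinitely
            )
            resp.raise_for_status()
            return "succeeded"
        except requests.exceptions.ConnectTimeout:
            # The request never reached the server: a definite failure,
            # safe to retry without idempotency concerns.
            return "failed"
        except requests.exceptions.ReadTimeout:
            # The connection was established but no response arrived in time.
            # The operation may or may not have completed on the server.
            return "unknown"

The "unknown" branch is the one local code never has to write, and it is where most of the hard decisions in this series live.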

Synchronous RPC is appropriate when an immediate response is required to continue processing — authenticating a request before serving it, looking up a price before displaying it, validating inventory before accepting an order. It is the wrong choice when the downstream operation is slow, unreliable, or does not need to complete before the caller can continue.

Asynchronous Messaging: Message Queues

In asynchronous messaging, the caller places a message on a queue and continues immediately. A consumer reads from the queue and processes the message independently. The producer and consumer are decoupled in both time and availability — neither needs to be running for the other to make progress.

RabbitMQ, Amazon SQS, and Azure Service Bus are among the most widely deployed implementations. The failure characteristics are different from synchronous RPC but not simpler. Messages can be delivered more than once — consumers must be idempotent. Ordering is generally not guaranteed: standard SQS queues, for example, make no ordering promise, and FIFO variants guarantee order only within a message group. A queue that grows faster than consumers can drain becomes a backpressure problem that is invisible until it becomes a crisis. Poison messages — messages that consistently fail processing — can stall a queue indefinitely without a dead-letter queue strategy in place.
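
The sketch below shows one shape consumer-side idempotency can take with Amazon SQS via boto3. The queue URL and handle_order are hypothetical, the in-memory set stands in for a durable deduplication store, and production code would typically key deduplication on a business-level identifier rather than the transport's message ID. A dead-letter queue would be configured on the queue's redrive policy rather than in code.

    import boto3

    QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/orders"   # hypothetical

    def handle_order(body):
        ...   # application logic (hypothetical)

    sqs = boto3.client("sqs")
    processed = set()   # stand-in for a durable deduplication store

    while True:
        batch = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in batch.get("Messages", []):
            receipt = msg["ReceiptHandle"]
            if msg["MessageId"] in processed:
                # At-least-once delivery: duplicates are expected, not exceptional.
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=receipt)
                continue
            handle_order(msg["Body"])
            processed.add(msg["MessageId"])
            # Delete only after successful processing; a crash before this
            # point causes redelivery, which the check above absorbs.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=receipt)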

Asynchronous messaging is appropriate when the producer and consumer do not need to be available simultaneously, when operations can tolerate eventual processing, and when absorbing traffic spikes without propagating load to downstream services is a requirement. The 2019 AWS SQS retry storm — covered in depth in Post 2.2 — is the canonical production example of what happens when async messaging retry logic is implemented incorrectly.

Event Streaming: Kafka and Kinesis

Event streaming treats communication as a durable, ordered log of events. Producers append events to a log. Consumers read from the log at their own pace, maintaining their own position (offset). Unlike a message queue where each message is delivered to one consumer and removed on acknowledgement, the log is retained — multiple independent consumers can read the same events, replay from any point in history, and reprocess without affecting other consumers.

Apache Kafka is the dominant implementation. Amazon Kinesis and Confluent Cloud are managed alternatives. The failure characteristics of event streaming introduce their own class of problems. Consumer lag — falling behind the producer — accumulates silently until it becomes a crisis, making consumer lag monitoring a production necessity rather than an optional metric. Offset management is the consumer’s responsibility; a bug in offset commit logic can cause reprocessing or silent data loss. Ordering is guaranteed within a partition but not across partitions, making partition key design a correctness decision, not just a performance one. Retention policies mean events expire — consumers that fall too far behind lose access to history.
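
The sketch below illustrates the commit-after-processing discipline with the confluent-kafka Python client; the broker address, group ID, topic name, and project_order function are assumptions for illustration. Committing after processing trades silent data loss for possible reprocessing, which is the safer side of the trade when the consumer is idempotent.

    from confluent_kafka import Consumer

    def project_order(event_bytes):
        ...   # application logic (hypothetical), must be idempotent

    consumer = Consumer({
        "bootstrap.servers": "broker:9092",   # hypothetical address
        "group.id": "order-projector",
        "enable.auto.commit": False,          # offsets are committed manually below
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["orders"])            # topic name is an assumption

    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            continue                          # real code would log and inspect this
        project_order(msg.value())
        # Commit only after processing: a crash here means reprocessing,
        # which idempotent logic absorbs. Committing before processing
        # would risk silent data loss instead.
        consumer.commit(message=msg, asynchronous=False)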

Event streaming is appropriate when multiple independent consumers need the same events, when replay and reprocessing are requirements, or when a durable audit trail of system events is needed. It is the wrong choice when each message needs exactly one processor and deletion on acknowledgement is the correct behaviour.

Remote Calls Are Not Local Calls

This deserves emphasis beyond the three-pattern taxonomy, because it is the assumption that breaks more distributed systems than any other.

In local code, engineers rely on guarantees that feel natural: function calls are fast, memory is consistent, failures are immediate and obvious. In distributed systems, none of these hold. A remote call can succeed on the server while failing to deliver the response to the client. It can time out while the server is still processing the request. It can be retried and executed more than once. Worse, from the caller’s perspective, many distinct outcomes produce identical silence.

A timeout does not mean failure. It means uncertainty.

This single insight changes how every communication failure must be handled. A caller that treats a timeout as a definitive failure may abandon an operation that the server completed successfully — leaving the system in a partially applied state. A caller that treats a timeout as a signal to retry without idempotency may execute the same operation twice — producing duplicate side effects. The correct response to a timeout is to acknowledge the uncertainty and choose a recovery strategy that is safe regardless of whether the server succeeded or failed.
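
One common recovery strategy is to attach a client-generated idempotency key and reuse it across retries, so the server can collapse duplicates. The sketch below assumes a hypothetical payments endpoint that honours an Idempotency-Key header, a convention popularised by Stripe's API; the server-side deduplication keyed on that header is the part doing the real work.

    import time
    import uuid
    import requests

    CHARGE_URL = "https://payments.internal.example/charge"   # hypothetical endpoint

    def charge(amount_cents, attempts=3):
        # One key for every attempt, so the server can recognise retries
        # of the same logical operation and execute it at most once.
        idempotency_key = str(uuid.uuid4())
        for attempt in range(attempts):
            try:
                resp = requests.post(
                    CHARGE_URL,
                    json={"amount_cents": amount_cents},
                    headers={"Idempotency-Key": idempotency_key},
                    timeout=(1.0, 5.0),
                )
                resp.raise_for_status()
                return resp.json()
            except requests.exceptions.RequestException:
                time.sleep(2 ** attempt)   # exponential backoff between attempts
        raise RuntimeError("charge outcome unknown after retries")

With this shape, a retry after a timeout is safe regardless of whether the server succeeded or failed on the first attempt.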

The Eight Fallacies of Distributed Computing

In the 1990s, engineers at Sun Microsystems compiled a list of assumptions that developers consistently made about distributed systems — all of which were false. L. Peter Deutsch is credited with the first seven; James Gosling added the eighth. They remain the most concise description of why remote calls are not local calls. The full treatment, including three modern additions, is at The Eight Fallacies of Distributed Computing.

The network is reliable. Messages are delayed, lost, and duplicated. Every retry mechanism in this series exists because this fallacy is violated in production every day.

Latency is zero. A cross-region service call adds 80-100ms of network latency before a single byte of computation occurs. Systems designed assuming zero latency produce N+1 query patterns, synchronous fan-out to dozens of services, and response time budgets that are physically impossible to meet.

Bandwidth is infinite. Large payloads, high-frequency polling, and uncompressed JSON serialisation all consume bandwidth that has real cost and real throughput limits. At petabyte scale, bandwidth becomes one of the largest infrastructure cost line items.

The network is secure. Internal service-to-service traffic is not protected by default. The 2020 SolarWinds attack compromised internal networks, exposing east-west traffic to attackers precisely because those networks had been assumed trustworthy.

Topology does not change. In Kubernetes, pod IPs change on every reschedule. In auto-scaling groups, instances are created and destroyed continuously. Any system that depends on stable addresses breaks constantly in production. This fallacy is the direct motivation for service discovery in Post 2.3.

There is one administrator. In a microservices organisation, dozens of teams deploy independently. No central authority serialises all changes. This is why distributed coordination protocols — covered in Post 2.4 and Post 2.6 — exist.

Transport cost is zero. TLS handshakes consume CPU. Network egress is billed per gigabyte. Cross-region data transfer costs scale with volume. These are not edge costs — at scale they become dominant operational expenses.

The network is homogeneous. A mobile client on 3G behaves completely differently from a service-to-service call within a single AWS region. A single timeout configuration calibrated for internal calls will either be too aggressive for mobile clients or too lenient for internal ones. Treating the network as uniform produces systems that are correct only for the subset of clients they were actually designed for.

The Cascade Problem: When One Slow Service Brings Down Many

The most dangerous failure mode in synchronous communication is not a crashed service — it is a slow one. A crashed service fails fast. Callers time out quickly, fall back, or route around it. A slow service holds connections open, exhausting the caller’s thread pool or connection pool while appearing healthy to infrastructure monitoring.

A typical microservices call chain illustrates how this propagates:

User Request → API Gateway → Order Service → Inventory Service → Database

If the Database becomes slow, Inventory Service calls take longer. If Inventory Service calls take longer, Order Service threads are held waiting. If Order Service threads are held waiting, API Gateway connections begin to queue. If API Gateway connections queue, user requests begin timing out — not because the API Gateway failed, but because a database on the other end of a four-hop call chain degraded. Each layer amplified the problem rather than absorbing it, because each layer assumed its downstream would respond within bounded time.

The 2021 Facebook BGP outage demonstrated this at global scale. A configuration change made Facebook’s internal services unable to reach their own infrastructure. Internal services that depended on other internal services for authentication, configuration, and coordination began failing as their dependencies became unreachable. Because many of these dependencies were synchronous, failure propagated rapidly. Services that could have continued operating in degraded mode instead failed completely because they had no fallback for a synchronous call that stopped responding.

Three design decisions prevent this cascade. Timeouts on every outbound call — never wait indefinitely; a timeout that fires early and returns an error is better than a thread held open for thirty seconds. Bulkheads — isolate thread pools or connection pools per downstream dependency so that a slow Inventory Service exhausts its own dedicated pool rather than the shared pool used for everything else. Circuit breakers — when a downstream service begins failing consistently, stop calling it for a period, return a fast failure immediately, and give it time to recover without additional load. The full treatment of bulkheads and circuit breakers is in Post 4.7.
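
As a minimal illustration of the circuit breaker mechanic (a sketch, not a production implementation; mature libraries such as resilience4j or Polly add concurrency control, metrics, and richer state transitions), the idea looks like this:

    import time

    class CircuitBreaker:
        """Minimal circuit breaker: closed until repeated failures, then open
        (fail fast) for a cooldown, then half-open to allow a single probe."""

        def __init__(self, failure_threshold=5, cooldown_seconds=30.0):
            self.failure_threshold = failure_threshold
            self.cooldown_seconds = cooldown_seconds
            self.failures = 0
            self.opened_at = None   # None means the circuit is closed

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.cooldown_seconds:
                    # Open: return a fast failure instead of holding a thread
                    # on a dependency that is already struggling.
                    raise RuntimeError("circuit open: failing fast")
                # Cooldown elapsed: half-open, let this one probe call through.
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()
                raise
            self.failures = 0
            self.opened_at = None   # a successful probe closes the circuit
            return result

A caller wraps each outbound call, for example breaker.call(reserve_stock, item_id, quantity), and gets a fast failure instead of a hung thread while the downstream recovers.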

Key Takeaways

  1. All communication in distributed systems is message passing — every abstraction (RPC, REST, gRPC, queues, streams) is built on top of this, and every guarantee offered by those abstractions must survive delayed, lost, and duplicated messages
  2. A timeout is not a failure — it is uncertainty about whether the server succeeded or failed, and every retry strategy and error handling decision must be safe under both possibilities
  3. Synchronous RPC, asynchronous messaging, and event streaming each have distinct failure characteristics — choose based on which failures your system is best positioned to tolerate, not which API feels most familiar
  4. Slow downstream services are more dangerous than crashed ones in synchronous call chains — crashed services fail fast, slow services hold resources open and cascade failure upward through every layer that was waiting for them
  5. The eight fallacies of distributed computing describe the assumptions that cause production systems to fail — the network is not reliable, latency is not zero, topology changes constantly, and there is no single administrator
  6. Timeouts, bulkheads, and circuit breakers are the three mechanisms that prevent communication failures from cascading — timeout bounds the blast radius per call, bulkheads bound it per dependency, circuit breakers prevent continued load on a recovering service
  7. Asynchronous systems are more resilient to availability failures but introduce ordering, duplication, backpressure, and observability challenges that synchronous systems avoid — the choice is which complexity is preferable given the use case

Frequently Asked Questions (FAQ)

What are communication fundamentals in distributed systems?

Communication fundamentals cover how distributed components exchange information across a network — the patterns they use (synchronous RPC, asynchronous messaging, event streaming), the guarantees each pattern provides, and the failure characteristics each introduces. Because all distributed system properties — reliability, consistency, fault tolerance, scalability — are built on top of communication, understanding what communication can and cannot guarantee is the prerequisite for understanding every higher-level distributed systems concept. The three unavoidable realities are that messages can be lost, delayed, or duplicated; that the sender cannot know whether the receiver processed the message; and that a timeout indicates uncertainty, not failure.

What is the difference between REST and gRPC?

Both are synchronous RPC patterns but differ in encoding and transport. REST typically uses JSON over HTTP/1.1 — human-readable, firewall-friendly, and easy to debug with standard tooling. gRPC uses Protocol Buffers over HTTP/2 — binary encoded, more compact, lower latency, supports server-side and bidirectional streaming, and enforces a strongly typed schema at compile time. REST is the right default for external APIs and teams with mixed tooling. gRPC is appropriate for internal service-to-service communication where performance matters, both sides control the schema, and the observability cost of binary encoding is acceptable.

When should I use a message queue instead of a direct service call?

Use a message queue when the producer and consumer do not need to be available simultaneously, when the operation does not need to complete before the caller can continue, or when absorbing traffic spikes without propagating load to downstream services is a requirement. If you need an immediate response to continue processing — authenticating a request, validating a price, checking inventory before committing an order — a synchronous call is more appropriate. The deciding factor is whether the caller’s correctness depends on knowing the outcome of the downstream operation before proceeding.

What is the difference between a message queue and an event stream?

A message queue delivers each message to one consumer and removes it once acknowledged — each message is processed exactly once by one consumer. An event stream retains messages as a durable ordered log that multiple independent consumers can read at their own pace, replay from any historical point, and reprocess without affecting other consumers. Use a queue when each message needs exactly one processor. Use a stream when multiple consumers need the same events, when replay and audit history are requirements, or when consumer lag visibility is operationally important.

What is a cascading failure and how does synchronous communication cause it?

A cascading failure occurs when a failure in one component propagates to other components, causing them to fail in turn, typically amplifying at each layer. In synchronous communication, the most common cause is a slow downstream service holding connections or threads open in the calling service. The calling service’s thread pool or connection pool is exhausted while it waits. New requests to the calling service queue or fail. The failure propagates upward through every service in the call chain, even though each upstream service is individually healthy. The 2021 Facebook BGP outage is one of the largest documented production cascading failures — a configuration change made internal dependencies unreachable, and synchronous coupling meant the failure propagated through the entire platform rapidly.

Is exactly-once delivery possible in distributed systems?

In the general case, no — not without significant coordination overhead. Most systems provide either at-most-once delivery (the message may be lost but will not be duplicated) or at-least-once delivery (the message will be delivered but may arrive more than once). Kafka’s transactional API offers exactly-once semantics within controlled boundaries — specifically, within a single Kafka cluster for producer-to-consumer flows — but these come with throughput and complexity costs. The practical engineering response is to design consumers to be idempotent — safe to execute multiple times with the same result — rather than relying on the delivery layer to guarantee exactly-once behaviour, which is more fragile and harder to verify.

Why does asynchronous communication make distributed systems harder to debug?

Because the producer and consumer are decoupled in time, a failure in the consumer may not surface until minutes or hours after the producer ran. Tracing a request across an asynchronous boundary requires distributed tracing instrumentation — correlation IDs and trace context that follow the message from producer through queue or stream to consumer. Without this instrumentation, a failed async operation appears as silence rather than an error, making root cause analysis significantly harder. Post 4.8 covers the observability infrastructure — OpenTelemetry, distributed tracing, structured logging — required to make asynchronous systems debuggable in production.
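
A hand-rolled sketch of the correlation ID idea follows; in practice OpenTelemetry's context propagation does this work, and the queue client and handle function here are hypothetical stand-ins.

    import json
    import logging
    import uuid

    log = logging.getLogger("orders")

    def publish(queue, payload, correlation_id=None):
        # Attach the ID to the message itself so it survives the time gap
        # between producer and consumer.
        envelope = {
            "correlation_id": correlation_id or str(uuid.uuid4()),
            "payload": payload,
        }
        queue.send(json.dumps(envelope))   # 'queue' is a hypothetical client

    def consume(raw_message):
        envelope = json.loads(raw_message)
        # Logging the ID on every consumer-side line ties a failure here
        # back to the producer-side request, even hours later.
        log.info("processing message", extra={"correlation_id": envelope["correlation_id"]})
        handle(envelope["payload"])

    def handle(payload):
        ...   # application logic (hypothetical)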


Continue the Series

Series home: Distributed Systems — Concepts, Design & Real-World Engineering

Part 2 — Communication & Coordination

Previous: ← 2.0 — From Constraints to Communication

Next: 2.2 — Reliability and Retries in Distributed Systems

Haven’t read Part 1 yet? Start with 1.1 — What Is a Distributed System (Really)?
