Communication and Coordination in Distributed Systems



From Constraints to Coping Mechanisms

Part 1 established three realities that every distributed system operates within: networks are unreliable, nodes fail independently and partially, and there is no global clock. These are not problems to be solved — they are permanent features of the environment. Part 2 asks the practical question that follows: given these constraints, how do distributed systems actually function?

The answer starts with communication. Before distributed systems can coordinate, replicate data or agree on decisions, components must exchange information. And exchanging information over an unreliable network — where messages can be lost, delayed, duplicated or reordered — is fundamentally different from calling a local function. This difference is the source of most production incidents in distributed systems that were not explicitly designed around it.

Part 2 covers the full communication and coordination stack — from the basic mechanics of message passing through to the production-grade coordination services that power Kubernetes, Kafka and every major cloud platform.

Communication is where the journey starts. Remote procedure calls look like local function calls but behave completely differently — they can hang indefinitely, succeed on the server while timing out on the client, and be retried into duplicate execution. The Eight Fallacies of Distributed Computing describe the false assumptions engineers make about networks, and every communication pattern in this part exists because one or more of those fallacies was violated in production.

Reliability is the first coping mechanism. When communication fails, systems must decide whether to retry and how. Retrying naively amplifies failure — a retry storm can turn a brief degradation into a full outage. Retrying correctly — with exponential backoff, full jitter, a retry budget and idempotency keys — turns transient failures into brief delays. Circuit breakers add the final layer: stopping retries entirely when a downstream service is persistently degraded, giving it recovery time rather than continued load.
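As a concrete sketch, these pieces fit together in a few lines of Python. The names here (`TransientError`, `call_with_retries`, the base and cap values) are illustrative assumptions for this post, not any specific library's API:

```python
import random

class TransientError(Exception):
    """A failure worth retrying: a timeout, connection reset, or 503."""

def backoff_with_full_jitter(attempt, base=0.1, cap=30.0):
    # Full jitter: a uniform draw from [0, min(cap, base * 2^attempt)].
    # The randomness spreads retrying clients out in time, avoiding the
    # synchronised thundering herd that plain exponential backoff allows.
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_retries(operation, idempotency_key, max_attempts=5):
    # The idempotency key lets the server deduplicate a retry of a request
    # that actually succeeded but whose response was lost in transit, so
    # the retry can never execute the side effect twice.
    for attempt in range(max_attempts):
        try:
            return operation(idempotency_key)
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the failure to the caller
            backoff_with_full_jitter(attempt)  # a real client would sleep this long
```

Here `max_attempts` acts as a crude per-call retry budget; production systems usually also cap retries as a fraction of total traffic so a widespread outage cannot multiply load.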

Service discovery solves the naming problem that static configuration cannot. In a dynamic system where containers restart with new IP addresses, services scale horizontally and deployments roll continuously, hardcoded addresses are stale before the ink is dry. Service registries, health-aware routing and the client-side vs server-side discovery distinction are how production systems navigate this.

Coordination is where the hardest problems live. Distributed locks sound simple — let one node hold the lock, others wait. In practice, a node can hold a lock past its expiry due to a GC pause and not know it. Fencing tokens, leases with automatic expiry and quorum-based leader election are the mechanisms that make distributed coordination correct rather than merely optimistic. Split-brain — two nodes simultaneously believing they are the leader — is one of the most dangerous failure modes in distributed systems and preventing it requires understanding quorums at exactly the level Part 2 establishes.
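The fencing-token idea can be shown with a toy lock service and store in Python. `LockService`, `FencedStore` and `StaleTokenError` are hypothetical names for this sketch: the lock service hands out a strictly increasing token with every grant, and the storage layer rejects any write carrying a token older than the newest one it has seen.

```python
class StaleTokenError(Exception):
    """Raised when a write arrives with a fencing token that has been superseded."""

class LockService:
    # Grants the lock by issuing a strictly increasing fencing token.
    # A real service would also track lease expiry; the token alone is
    # what makes an expired holder harmless.
    def __init__(self):
        self._token = 0

    def acquire(self):
        self._token += 1
        return self._token

class FencedStore:
    # Storage that checks fencing tokens. A node paused by GC past its
    # lease expiry still holds its old token, so its late write is
    # rejected here instead of silently corrupting newer state.
    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        if token < self.highest_token:
            raise StaleTokenError(f"token {token} superseded by {self.highest_token}")
        self.highest_token = token
        self.value = value
```

The essential point is that correctness lives in the storage layer's check, not in the lock holder's belief that it still holds the lock.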

Logical time provides the ordering guarantees that physical clocks cannot. Lamport clocks establish causality — which event could have influenced which other event — without trusting timestamps. Vector clocks detect concurrent writes, enabling conflict resolution in replicated systems like DynamoDB and Riak. These are not academic constructs — they are the mechanism by which real distributed databases avoid silent data corruption.
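A Lamport clock is small enough to show in full. This Python sketch follows the standard rules: increment on every local event or send, and on receipt take the maximum of the local and received timestamps plus one.

```python
class LamportClock:
    # Guarantee: if event a happens-before event b, a's timestamp is
    # strictly smaller than b's. The converse does not hold: comparable
    # timestamps say nothing about causality, which is the gap vector
    # clocks fill.
    def __init__(self):
        self.time = 0

    def tick(self):
        # A local event, including sending a message (the timestamp
        # travels with the message).
        self.time += 1
        return self.time

    def receive(self, sender_time):
        # Merge rule: jump past both our own history and the sender's,
        # so the receive event is ordered after the send event.
        self.time = max(self.time, sender_time) + 1
        return self.time
```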

Coordination services — ZooKeeper, etcd, and Consul — centralise all of this complexity so application teams do not have to implement it themselves. They provide strong consistency, fault-tolerant replication, and safe primitives for leader election and distributed locking, built on consensus algorithms that Part 3 will examine in depth.

Part 2 closes with engineering guidelines — ten practical principles and a design review checklist that synthesises everything above into decisions engineers can make immediately.


Part 2 Posts — Full Reading List

2.0 — From Constraints to Communication

The bridge between Part 1 and Part 2. Establishes why all higher-level distributed systems behaviour — retries, coordination, consensus — is a direct response to the three constraints Part 1 identified. Introduces the Part 2 framework: communication mechanisms are not features, they are coping mechanisms for operating under fundamental uncertainty.

Key concepts: network constraints, node constraints, time constraints, coping mechanisms, communication as the foundation of coordination

Read this if: you want to understand the conceptual thread connecting Part 1 foundations to Part 2 mechanisms before diving into individual topics

2.1 — Communication Fundamentals in Distributed Systems

All communication in distributed systems is message passing — everything else is an abstraction built on top of it. Covers the difference between synchronous RPC, asynchronous messaging and event streaming; the failure characteristics of each pattern; the Eight Fallacies of Distributed Computing; and the cascade problem — how a single slow service propagates failure up a synchronous call chain. Includes the documented consequences of the fallacies in production systems and why Netflix built Hystrix to break cascades.

Key concepts: message passing, RPC vs REST vs gRPC, synchronous vs asynchronous communication, Eight Fallacies of Distributed Computing, cascade failure, bulkheads, circuit breakers

Read this if: you want to understand the fundamental difference between local and remote calls, and why treating them the same is the most common source of distributed system failures

2.2 — Reliability and Retries in Distributed Systems

Retries are not a recovery mechanism — they are a design decision with system-wide consequences. Covers at-most-once vs at-least-once delivery semantics, why exactly-once is not achievable without significant coordination, exponential backoff with the full jitter formula, why jitter is not optional (the thundering herd problem), idempotency keys as the production standard for safe retries (Stripe’s implementation as the canonical example), circuit breaker states (closed, open, half-open), and how timeouts, retries, and circuit breakers form a coordinated defence layer. Includes the 2019 AWS SQS retry storm as a documented production example.

Key concepts: at-most-once, at-least-once, exactly-once semantics, exponential backoff, jitter, thundering herd, idempotency keys, circuit breaker pattern, retry storm, delivery semantics

Read this if: you want to understand how to design retry behaviour that reduces failure impact rather than amplifying it
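The three circuit breaker states the post describes can be sketched as a small state machine in Python. This is an illustration of the pattern, not Hystrix's or any library's actual implementation; the thresholds and the injected clock are arbitrary assumptions:

```python
import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected without being attempted."""

class CircuitBreaker:
    # CLOSED: calls pass through and failures are counted.
    # OPEN: calls fail fast, giving the downstream service recovery time.
    # HALF_OPEN: after a cooldown, one trial call decides the next state.
    def __init__(self, failure_threshold=5, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "HALF_OPEN"  # cooldown elapsed: allow one trial
            else:
                raise CircuitOpenError("failing fast; downstream still recovering")
        try:
            result = operation()
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()

    def _record_success(self):
        self.state = "CLOSED"
        self.failures = 0
```

Injecting the clock is deliberate: breaker behaviour depends on elapsed time, and a test should not have to sleep to exercise the OPEN-to-HALF_OPEN transition.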

2.3 — Naming and Service Discovery in Distributed Systems

Static configuration cannot keep pace with the rate of change in dynamic distributed systems. Covers the naming vs discovery distinction (stable identifiers vs resolving them to current healthy locations), why DNS is insufficient for service discovery (TTL caching, no health awareness, ephemeral port limitations), service registry lifecycle (registration, health checking, deregistration), client-side vs server-side discovery patterns, and how Consul, Kubernetes Services, and AWS Cloud Map implement discovery in production. Includes the liveness vs readiness distinction and how misconfigured health checks route traffic to instances that cannot serve it.

Key concepts: service registry, service discovery, DNS limitations, client-side discovery, server-side discovery, health checks, liveness vs readiness, Consul, Kubernetes Services, Netflix Eureka

Read this if: you want to understand how services find each other in a system where addresses change constantly and instances fail without warning
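The registry lifecycle can be reduced to a toy Python sketch: instances heartbeat, and resolution filters out anything whose heartbeat is older than a TTL. The class name and TTL value are assumptions for illustration, not Consul's or Eureka's API:

```python
import random
import time

class ServiceRegistry:
    # Client-side discovery in miniature: the client asks the registry for
    # a healthy instance and balances across them itself. Instances that
    # stop heartbeating (crash, partition) age out after the TTL, so
    # traffic stops reaching them without any explicit deregistration.
    def __init__(self, ttl=15.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._last_seen = {}  # (service, address) -> time of last heartbeat

    def heartbeat(self, service, address):
        # Registration and health checking collapsed into one call:
        # the first heartbeat registers, subsequent ones refresh the TTL.
        self._last_seen[(service, address)] = self.clock()

    def resolve(self, service):
        now = self.clock()
        healthy = [addr for (svc, addr), seen in self._last_seen.items()
                   if svc == service and now - seen <= self.ttl]
        if not healthy:
            raise LookupError(f"no healthy instances of {service!r}")
        return random.choice(healthy)  # trivial client-side load balancing
```

Note what even this toy exposes: resolution is always potentially stale, since an instance can die the moment after its last heartbeat, which is why callers still need timeouts and retries.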

2.4 — Coordination and Distributed Locks

A distributed lock sounds like a local mutex — one node holds it, others wait. In practice, a node paused by GC can hold a lock past its lease expiry, not know the lease expired, and attempt a write that corrupts state another node has already updated. Covers the four properties of a correct distributed lock (mutual exclusion, deadlock freedom, fault tolerance, fencing), fencing tokens as the mechanism that makes stale lock holders safe to ignore, leader election requirements (safety vs liveness), split-brain prevention through quorum-based election, and how Apache Kafka uses leader election at scale with both ZooKeeper and KRaft. Includes a step-by-step walkthrough of the fencing token scenario.

Key concepts: distributed locks, leases, fencing tokens, leader election, split-brain, quorum-based election, mutual exclusion, deadlock freedom, crash-recovery, Kafka leader election

Read this if: you want to understand why distributed coordination is harder than it looks and what mechanisms make it correct
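The quorum arithmetic behind split-brain prevention fits in one function. This Python sketch counts votes for a single election round; in a real protocol such as Raft, votes are additionally scoped to a term and each node votes at most once per term:

```python
def elect_leader(votes, cluster_size):
    # votes maps voter -> candidate for one election round.
    # A quorum is a strict majority: any two quorums of the same cluster
    # overlap in at least one node, so two candidates can never both win
    # the same round. That overlap is what rules out split-brain.
    quorum = cluster_size // 2 + 1
    tally = {}
    for voter, candidate in votes.items():
        tally[candidate] = tally.get(candidate, 0) + 1
    for candidate, count in tally.items():
        if count >= quorum:
            return candidate
    # No quorum: safety is preserved by electing nobody; liveness is
    # deferred to a fresh round (the safety-over-liveness choice).
    return None
```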

2.5 — Logical Clocks and Time in Distributed Systems

Physical time cannot be trusted for ordering events across distributed nodes — clocks drift, NTP adjustments move time backward, and GC pauses freeze a process while the wall clock advances. Logical time solves this by tracking causality rather than measuring time. Covers Lamport clocks (the happens-before relationship, how the counter works, what it guarantees and does not guarantee), vector clocks (one counter per node, detecting concurrent writes, the concurrency detection guarantee), and how Amazon DynamoDB and Riak use vector clocks in production to detect and resolve conflicting writes across replicas. Includes the Cassandra Last Write Wins data loss incident caused by clock skew as a production grounding example.

Key concepts: Lamport clocks, vector clocks, happens-before relationship, causality, concurrent writes, conflict detection, logical time vs physical time, DynamoDB vector clocks

Read this if: you want to understand how distributed systems reason about event ordering without trusting clocks, and why this matters for replicated databases
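Vector clock comparison, the operation that distinguishes ordered writes from concurrent ones, can be sketched directly in Python. The function names are illustrative:

```python
def vc_increment(clock, node):
    # Each node keeps one counter per node and bumps its own entry
    # whenever it performs a write.
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated

def vc_compare(a, b):
    # Returns "before", "after", "equal", or "concurrent". Two clocks are
    # concurrent when neither dominates the other entry-wise, exactly the
    # case a replicated store must surface as a conflict for resolution
    # rather than silently picking a winner.
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"
```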

2.6 — Coordination Services: ZooKeeper, etcd and Consul

Re-implementing distributed coordination correctly inside every application that needs it would be expensive, error-prone, and almost certainly wrong in subtle ways. Coordination services — ZooKeeper, etcd, and Consul — centralise this complexity so application teams use proven primitives rather than building them. Covers what all coordination services provide (strong consistency, linearisable writes, quorum-based fault tolerance, availability trade-off during partitions), ZooKeeper’s hierarchical namespace with ephemeral znodes and watches, etcd’s flat key-value model with Raft and revision-based watches, Consul’s integrated service discovery and multi-datacenter support, and a direct comparison table for choosing between them. Includes production examples from Kubernetes, Kafka KRaft, HashiCorp Vault, and Cloudflare.

Key concepts: coordination services, ZooKeeper znodes, ephemeral nodes, watches, etcd Raft, Consul service registry, strong consistency, linearisable writes, quorum replication, ZAB protocol, when to use each

Read this if: you want to understand what coordination services provide, how the three major ones differ, and which to choose for your architecture
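Ephemeral-node semantics, the primitive ZooKeeper builds leader election on, can be illustrated with a toy Python store in which keys created under a session vanish when that session expires. This is a model of the behaviour, not ZooKeeper's client API:

```python
class EphemeralStore:
    # Keys created with a session_id live only as long as that session.
    # When the session expires (client crash, or a partition outlasting
    # the session timeout), the service deletes the keys itself, which is
    # how a dead leader's election key disappears with no cleanup code
    # on the client ever running.
    def __init__(self):
        self._data = {}  # key -> (value, owning session_id or None)

    def create(self, key, value, session_id=None):
        # session_id=None creates a persistent key instead.
        self._data[key] = (value, session_id)

    def get(self, key):
        value, _owner = self._data[key]  # KeyError if absent
        return value

    def expire_session(self, session_id):
        # Server-side reaction to a lost session: drop its ephemeral keys.
        self._data = {k: (v, owner) for k, (v, owner) in self._data.items()
                      if owner != session_id}
```

Watches complete the picture in the real systems: other nodes subscribe to the election key and are notified when it disappears, triggering a new election.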

2.7 — Engineering Guidelines: Communication and Coordination

Ten practical engineering principles distilled from everything in Part 2, grounded in real production failure patterns. Covers: assume communication will fail (design implication), remote calls are not local calls (cascade prevention), retries must be designed not bolted on (the system-level view), never rely on wall-clock time for correctness (the logical time principle), expect stale information everywhere (defensive discovery design), coordination is expensive — use it sparingly (the idempotency alternative), prefer safety over false progress (the split-brain lesson), centralise complexity not data paths (the control plane principle), failure is a first-class design input (blast radius analysis), and make trade-offs explicit and documented (the root cause of most outages). Closes with a complete design review checklist engineers can apply to any service or architecture decision.

Key concepts: engineering guidelines, design review checklist, retry design, timeout design, circuit breaker, idempotency, coordination cost, control plane vs data plane, trade-off documentation, failure design

Read this if: you want a practical synthesis of Part 2 that you can bookmark and return to during design reviews and architecture discussions


What Part 2 Prepares You For

Part 2 establishes the communication and coordination layer that every distributed system is built on. By the end of Part 2, three things that were abstract in Part 1 become concrete and actionable.

Retries, timeouts, and circuit breakers are no longer just reliability patterns — they are the specific mechanisms that address the network model’s unreliability. Idempotency is not just a design principle — it is the prerequisite for safe retries in a world where messages can be delivered more than once. Coordination services are not just infrastructure components — they are the answer to the question of how distributed systems agree on shared state without building consensus from scratch.

Part 3 builds directly on this foundation. Replication is a coordination problem between nodes that communicate over exactly the unreliable networks Part 2 describes. Consistency models define what clients observe when replication produces temporary disagreement. Consensus — which Paxos and Raft implement — is the strongest form of the coordination problem Part 2 introduces. Everything in Part 3 is the full elaboration of problems that Part 2 first makes visible.


Continue the Series

Series home: Distributed Systems — Concepts, Design & Real-World Engineering

Previous: Part 1 — Foundations

Next: Part 3 — Replication, Consistency & Consensus

Part 3 takes the coordination problem introduced in Part 2 and examines its hardest form: how distributed systems store and replicate data consistently, what consistency guarantees mean across the full spectrum, what limits CAP and PACELC impose, and how Paxos and Raft achieve agreement despite failures.