Distributed Systems Series — Part 2.4: Communication & Coordination
When Independent Nodes Need to Agree
Coordination and distributed locks solve a class of problems that has no equivalent in single-machine programming. Most of distributed systems engineering is about avoiding coordination — it is expensive, it introduces latency and new failure modes, and it limits the scalability ceiling. But some problems cannot be solved without it.
Only one node should write to a shared resource at a time. Only one service instance should process a given job. Only one leader should issue decisions for a cluster. These are mutual exclusion problems. In a single-process system, a mutex solves them in a few lines of code. In a distributed system, the equivalent — a distributed lock — is fundamentally harder to get right, and the failure modes are significantly more dangerous than a deadlock or a race condition in local code.
Understanding why requires the foundation established in Post 1.2: distributed systems have no shared memory, no global clock, and no reliable communication. Coordination must work despite all three absences simultaneously.
What Coordination Is Actually Answering
At its core, coordination in a distributed system answers questions that shared memory answered trivially in a single process: who is allowed to do this right now? Who is the current leader? Has this operation already happened? Are we all acting on the same version of state?
In a single-process program, these questions are answered by shared memory, mutexes, and a single scheduler. Everything runs on one machine under one operating system. Coordination is cheap, instantaneous, and guaranteed atomic by the hardware.
In a distributed system, answering the same questions requires message passing across unreliable networks, persistent state on nodes that can fail, and agreement protocols that tolerate partial failures. That gap — between how trivial coordination feels locally and how hard it is distributed — is what trips most engineers the first time they encounter it seriously in production. The moment a mutex becomes a distributed lock, a new class of failure mode enters the picture that local programming simply does not prepare engineers for.
Why a Local Mutex Is Not Enough
A local mutex works because it lives in shared memory. All threads competing for the lock run in the same process on the same machine. The lock is either held or it is not. Acquisition is atomic at the hardware level. Release is guaranteed when the holding thread exits — the operating system cleans up resources for crashed processes.
In a distributed system, none of these properties hold. Competing processes run on different machines with no shared memory. The process holding the lock can crash without releasing it — and unlike a local OS, no distributed system automatically releases locks held by crashed processes. Network partitions can isolate the lock holder from the rest of the system. And most dangerously, a slow process may still believe it holds the lock after its lease has expired and another process has legitimately acquired it.
This last point is the most dangerous one. In a local system, a crashed process releases all its locks immediately. In a distributed system, a process that is merely slow — paused by garbage collection, stalled on a disk read, or delayed by a network hiccup — may continue attempting to act on a lock it no longer legitimately holds. As established in Post 1.4, slow nodes are more dangerous than crashed ones. The 2012 GitHub MySQL GC pause incident is the canonical example: a paused primary resumed and briefly asserted leadership over data that a new primary had already taken ownership of. This is how distributed locks produce data corruption even when the locking mechanism itself is working correctly.
Safety vs Liveness: The Core Tension in Coordination
Every coordination mechanism must balance two fundamental properties. Most real-world coordination failures trace back to optimising one at the expense of the other without understanding the trade-off being made.
Safety means nothing bad ever happens. In the context of distributed locks: two nodes never hold the same lock simultaneously. In leader election: two nodes never simultaneously believe they are the leader. Safety violations produce incorrect behaviour — data corruption, duplicate work, conflicting decisions. In a regulated financial environment, a safety violation in a payment processing lock is not just a data quality problem. It is a potential double-charge, a compliance event, and an audit failure.
Liveness means something good eventually happens. A lock is eventually released so another node can acquire it. A new leader is eventually elected after the old one fails. Liveness violations produce unavailability — the system stops making progress.
In a perfect world, you have both. In a distributed system under partition, you often must choose. A system that prioritises safety will refuse to make progress when it cannot confirm it is acting correctly. A system that prioritises liveness will make progress but risks doing so incorrectly. For most coordination problems — locks, leader election, critical writes — safety wins. Brief unavailability is recoverable. Data corruption and split-brain often are not.
The Four Properties of a Correct Distributed Lock
Mutual exclusion. At any given moment, at most one process holds the lock. This is the basic correctness requirement — without it, the lock provides no protection whatsoever.
Deadlock freedom. If the lock holder crashes or becomes unreachable, the lock must eventually be released. This is implemented through a lease — a time-bounded lock that expires automatically if not renewed. The holder must continuously renew the lease while holding it. If it fails to renew — because it crashed, was paused, or lost connectivity — the lease expires and the lock becomes available to others.
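The lease mechanism described above can be sketched in a few lines. This is an illustrative, in-memory model — the class and method names are hypothetical, and a real lease would live in a replicated coordination service such as etcd or ZooKeeper, not in one process:

```python
import time

class LeaseLock:
    """Minimal in-memory sketch of lease semantics: time-bounded ownership
    that expires automatically unless the holder actively renews it.
    (Illustrative only — a real lock lives in a replicated service.)"""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock       # injectable clock makes expiry testable
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, client_id):
        now = self.clock()
        if self.holder is None or now >= self.expires_at:
            # Lock is free, or the previous holder failed to renew in time:
            # grant a fresh lease to this client.
            self.holder = client_id
            self.expires_at = now + self.ttl
            return True
        return False

    def renew(self, client_id):
        now = self.clock()
        if self.holder == client_id and now < self.expires_at:
            self.expires_at = now + self.ttl
            return True
        return False  # too late — the lease already expired


# Deterministic demonstration with a fake clock:
t = [0.0]
lock = LeaseLock(ttl_seconds=30, clock=lambda: t[0])
assert lock.acquire("A")      # A holds a 30-second lease
t[0] = 40.0                   # A is paused past its TTL
assert not lock.renew("A")    # A can no longer renew
assert lock.acquire("B")      # the expired lock is available to B
```

Note what this sketch deliberately does not solve: a paused client A may still believe it holds the lock after B acquires it. Expiry alone gives deadlock freedom, not safety — that is what fencing is for.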
Fault tolerance. The locking service itself must remain available through partial failures. A distributed lock backed by a single node is a single point of failure that defeats the purpose of distributed coordination. The locking service must replicate its state across multiple nodes with quorum-based writes to survive individual node failures without losing lock state.
Fencing. Even with leases, a slow or paused process may attempt to use a lock after it has expired. Fencing addresses this by attaching a monotonically increasing token to every lock grant. The resource being protected validates the token on every write — if the token is lower than the highest it has seen, the write is rejected. Fencing is the property that most distributed lock implementations omit, and its absence is what turns a slow node into a correctness violation.
Fencing Tokens: The Detail Most Implementations Miss
Fencing deserves its own explanation because it is the mechanism that makes distributed locks safe under the precise failure mode that matters most — the slow node that does not know its lease has expired.
Consider this sequence without fencing:
- Process A acquires a lock with a 30-second lease
- Process A pauses for 40 seconds — a long GC pause, a VM live migration, a disk stall
- The lease expires. Process B acquires the lock
- Process B writes to the shared resource successfully
- Process A resumes, believes it still holds the lock, and also writes to the shared resource
- Both writes succeed. The resource is now in an incorrect state
Now the same sequence with fencing tokens:
- Process A acquires a lock and receives token 42
- Process A pauses for 40 seconds
- The lease expires. Process B acquires the lock and receives token 43
- Process B writes to the shared resource, passing token 43. The resource records 43 as the highest token seen
- Process A resumes and attempts to write, passing token 42
- The resource rejects the write — token 42 is lower than 43. Process A’s stale operation fails safely
Fencing pushes safety enforcement to the resource being protected rather than relying on the lock holder to know its lease has expired. This is the critical design principle: never trust a process to know that its own lock has expired. It was paused. It cannot know. The resource must enforce correctness regardless of what the holder believes.
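The two sequences above can be sketched directly. The class names are hypothetical; the point is that the rejection logic lives in the resource, not in the lock service:

```python
class LockService:
    """Hypothetical lock service that attaches a monotonically increasing
    fencing token to every grant."""
    def __init__(self):
        self.token = 41  # arbitrary start, so the first grant is token 42

    def grant(self):
        self.token += 1
        return self.token


class FencedResource:
    """The protected resource enforces safety itself: it remembers the
    highest token seen and rejects any write carrying an older one."""
    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        if token < self.highest_token:
            return False  # stale holder — e.g. resumed after a long GC pause
        self.highest_token = token
        self.value = value
        return True


service = LockService()
resource = FencedResource()

token_a = service.grant()                 # Process A acquires: token 42
# ... A pauses for 40 seconds; its lease expires ...
token_b = service.grant()                 # Process B acquires: token 43
assert resource.write(token_b, "B's update")         # accepted; highest is now 43
assert not resource.write(token_a, "A's update")     # rejected: 42 < 43
assert resource.value == "B's update"                # A's stale write failed safely
```

Notice that `FencedResource` never consults a clock or a lease. It does not need to know why token 42 is stale — only that a higher token has been seen — which is exactly why fencing survives the timing failures that leases alone do not.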
This is also why Redis-based distributed locks — including the RedLock algorithm — are controversial in production: they do not natively support fencing tokens, which means a paused process can still corrupt the protected resource even when using a correctly implemented lock. Martin Kleppmann documented this limitation in detail in his analysis of RedLock, and it is the reason etcd and ZooKeeper are preferred for coordination that requires correctness guarantees.
Coordination Depends on Time — and Time Cannot Be Trusted
Coordination mechanisms rely heavily on time: timeouts determine when a lease expires, heartbeats signal that a leader is still alive, failure detection decides when to trigger re-election. But as established in Post 1.5 and Post 2.5, clocks in distributed systems are imperfect. They drift. NTP adjustments can move time backward. A GC pause can freeze a process for seconds while the wall clock advances.
Three recurring problems emerge when timing assumptions break down in coordination systems. A leader appears dead because its heartbeat was delayed by network congestion — but it is actually alive and still processing requests. A lease appears valid to its holder because its local clock has drifted behind the coordination service’s clock — but the coordination service has already expired it and granted it to someone else. A timeout fires too early under load, triggering unnecessary failovers that themselves add load, triggering more timeouts — a cascade driven entirely by timing assumptions failing under exactly the conditions where coordination is most needed.
This is why well-designed coordination systems prefer conservative failure detection — waiting longer to be sure rather than acting fast and being wrong. And why fencing tokens exist: to enforce correctness even when timing assumptions are violated, because timing assumptions will be violated in production.
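Conservative failure detection can be sketched as a detector that suspects a peer only after several whole heartbeat intervals of silence, rather than the first missed message. The thresholds below are illustrative assumptions, not recommendations:

```python
class ConservativeFailureDetector:
    """Suspects a peer only after `missed_threshold` heartbeat intervals pass
    with no message — trading slower detection for fewer false positives.
    (Sketch; interval and threshold values are assumptions.)"""

    def __init__(self, interval_seconds=1.0, missed_threshold=5):
        self.interval = interval_seconds
        self.missed_threshold = missed_threshold
        self.last_heartbeat = {}  # node -> timestamp of last heartbeat seen

    def heartbeat(self, node, now):
        self.last_heartbeat[node] = now

    def is_suspected(self, node, now):
        last = self.last_heartbeat.get(node)
        if last is None:
            return False  # never heard from it; no basis to suspect yet
        # Suspect only after missed_threshold full intervals of silence.
        return (now - last) > self.interval * self.missed_threshold


fd = ConservativeFailureDetector(interval_seconds=1.0, missed_threshold=5)
fd.heartbeat("leader", now=0.0)
assert not fd.is_suspected("leader", now=3.0)  # a delayed heartbeat is tolerated
assert fd.is_suspected("leader", now=6.0)      # 5+ intervals of silence: suspect
```

A single missed heartbeat under network congestion does not trigger failover here; only sustained silence does. Production systems often go further with adaptive detectors that learn the heartbeat distribution rather than using fixed thresholds.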
Time is not a reliable coordination primitive. It is a hint — useful when it works, dangerous when trusted absolutely.
Leader Election: Coordination at the Cluster Level
Distributed locks solve mutual exclusion for a specific resource. Leader election solves a broader problem: among a set of nodes, exactly one should act as the authoritative decision-maker for the cluster at any given time. Leader election is used in database primary selection, job schedulers, configuration managers, and stream processors. The correctness requirements are strict: at most one leader at any time (safety), and if the current leader fails a new leader must eventually be elected (liveness).
Split-brain — two nodes simultaneously believing they are the leader — is one of the most dangerous failure modes in distributed systems. Both leaders may accept conflicting writes, issue contradictory decisions, or assign the same work to multiple workers. In a financial transaction system, two simultaneous leaders in a payment processing cluster could approve the same transaction twice or allow contradictory balance updates to both succeed.
The standard prevention mechanism is a quorum. A candidate can only become leader if it receives acknowledgement from a majority of nodes. With five nodes, quorum is three. A partition splitting the cluster 3-2 allows the larger group to elect a leader. The smaller group cannot — it cannot reach three nodes. At most one leader exists at any time, even during a partition. Quorum-based election is the foundation of Raft and Paxos — covered in Post 3.7.
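The quorum arithmetic is worth making concrete. A candidate counts acknowledgements against the full cluster size — not against however many nodes it can currently reach — which is what makes two simultaneous leaders impossible:

```python
def quorum(cluster_size):
    """Majority quorum: strictly more than half the nodes."""
    return cluster_size // 2 + 1

def can_elect(acks, cluster_size):
    """A candidate becomes leader only with acknowledgements from a
    majority of the FULL cluster, including unreachable nodes."""
    return acks >= quorum(cluster_size)

# Five-node cluster partitioned 3-2:
assert quorum(5) == 3
assert can_elect(acks=3, cluster_size=5)      # majority side elects a leader
assert not can_elect(acks=2, cluster_size=5)  # minority side cannot
# Two disjoint majorities of the same cluster cannot exist,
# so at most one leader holds office at any time.
```

This is also why clusters use odd sizes: a four-node cluster needs a quorum of three, tolerating the same single failure as a three-node cluster while adding cost and coordination overhead.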
How Apache Kafka Uses Leader Election in Production
Apache Kafka provides the clearest production example of leader election at scale. Each Kafka topic is divided into partitions, and each partition has one leader broker that handles all reads and writes. Followers replicate from the leader. When a leader broker fails, Kafka must elect a new leader for each affected partition quickly to minimise unavailability.
In Kafka’s original architecture, leader election was coordinated through ZooKeeper — each broker registered an ephemeral node, and ZooKeeper’s session timeout mechanism detected failures and triggered re-election. In Kafka’s KRaft mode, leader election is handled by an internal Raft-based metadata quorum, eliminating the ZooKeeper dependency entirely. The design reflects exactly the requirements described: exactly one leader per partition at any time, automatic failover when a leader fails, quorum-based election to prevent split-brain, and fencing through epoch numbers that reject writes from brokers with stale leadership credentials. The full treatment of coordination services — ZooKeeper, etcd, and Consul — that implement these primitives in production is in Post 2.6.
Avoid Coordination Where You Can
Perhaps the most important lesson from large-scale distributed systems engineering: coordination is expensive — avoid it unless it is genuinely necessary. Every distributed lock is a source of contention. Every leader election is a period of reduced availability. Every consensus round adds round-trip latency to the critical path.
Before reaching for a distributed lock, ask whether the problem can be solved without one:
- Idempotent operations — if executing an operation twice produces the same result as executing it once, you may not need a lock at all; idempotency keys from Post 2.2 are the implementation
- Commutative updates — if the order of operations does not matter (incrementing a counter, adding to a set), concurrent writes can be applied in any order without coordination; CRDTs implement exactly this property
- Optimistic concurrency — detect conflicts after the fact rather than preventing them upfront, accepting the occasional rollback in exchange for eliminating lock contention on the happy path; this is how most MVCC database implementations work
- Partitioned ownership — assign each piece of data to exactly one owner; that owner makes decisions about its data without coordinating with anyone; this is the sharding principle from Post 5.3
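Commutative updates are the easiest of these to demonstrate. A grow-only counter (G-Counter, the simplest CRDT) lets every replica increment independently and merge in any order, with no lock and no leader — a minimal sketch:

```python
class GCounter:
    """Grow-only counter CRDT: each node increments only its own slot,
    and merge takes the per-node maximum. Because merge is commutative,
    associative, and idempotent, replicas converge in any merge order
    with no coordination. (Minimal sketch — no decrements, no persistence.)"""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node_id -> that node's local increment total

    def increment(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other):
        # Per-node max: applying the same merge twice changes nothing.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())


# Two replicas accept concurrent increments, then exchange state:
a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)  # merges can happen in any order, any number of times
b.merge(a)
assert a.value() == b.value() == 5  # both replicas converge, no lock involved
```

The counter never needed to answer "who is allowed to increment right now?" — the question a lock exists to answer — because the data structure makes every interleaving produce the same result.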
The best distributed systems are not those that coordinate correctly — they are those that coordinate as rarely as possible, and do so correctly when they must.
Key Takeaways
- Coordination answers the questions shared memory answered locally — who holds the lock, who is the leader, what is the current state — but must do so across unreliable networks with failing nodes and imperfect clocks, making every answer probabilistic rather than certain
- Safety (nothing bad happens) must be prioritised over liveness (progress eventually happens) for most coordination problems — brief unavailability is recoverable, split-brain and data corruption in a regulated financial environment often are not
- Distributed locks require leases — time-bounded ownership that expires automatically when the holder fails, is paused, or loses connectivity — because a crashed process in a distributed system does not release its locks the way an OS cleans up local mutexes
- Fencing tokens are essential for correctness — a slow or paused process cannot know its lease has expired, so the resource must validate monotonically increasing tokens and reject writes from holders whose tokens are stale
- Quorum-based leader election prevents split-brain — a candidate must receive acknowledgement from a majority of nodes before declaring itself leader, ensuring at most one leader can exist even during a network partition
- Time is not a reliable coordination primitive — it is a hint that works most of the time and fails under the precise conditions (GC pauses, network congestion, clock drift) where coordination correctness matters most
- The best distributed systems coordinate as rarely as possible — idempotent operations, commutative updates, optimistic concurrency, and partitioned ownership eliminate the need for distributed locks in many cases that appear to require them
Frequently Asked Questions (FAQ)
What is a distributed lock and why is it harder than a local mutex?
A distributed lock is a mutual exclusion mechanism that works across multiple processes on different machines, ensuring at most one process performs a specific operation at a time. Unlike a local mutex — where lock acquisition is atomic at the hardware level, lock holders are guaranteed to release on crash, and all competitors share the same memory — a distributed lock must handle process crashes without automatic release, network partitions that isolate the lock holder, clock differences that make lease expiry ambiguous, and slow nodes that continue acting on stale lock ownership. Each of these failure modes requires explicit design. The local mutex assumes none of them exist.
What is a lease in the context of distributed locks?
A lease is a time-bounded lock — the holder owns it for a fixed duration and must actively renew it before it expires. If the holder crashes, is paused by GC, or loses connectivity, the lease expires automatically and another process can acquire the lock. Leases prevent the indefinite deadlock that would result from a holder crashing without releasing — in local programming the OS handles this cleanup, in distributed systems there is no equivalent automatic mechanism without explicit lease expiry.
What is a fencing token and why do most lock implementations fail to include it?
A fencing token is a monotonically increasing number attached to each lock grant, which the resource being protected uses to reject writes from stale lock holders. Without fencing, a process that pauses (GC, VM migration, disk stall) for longer than its lease duration can resume and write to the resource after a new holder has already legitimately acquired the lock — corrupting the resource silently. Most implementations omit fencing because it requires the protected resource to participate in the correctness guarantee rather than just the lock service. Redis-based locks including RedLock are specifically criticised for this omission. etcd and ZooKeeper support fencing through their revision and version mechanisms respectively.
What is split-brain and how does quorum prevent it?
Split-brain occurs when a network partition causes two groups of nodes to each independently elect a leader, resulting in two nodes simultaneously believing they have authoritative control. Each leader may accept conflicting writes or make contradictory decisions. Quorum prevents this by requiring a candidate to receive acknowledgement from a majority of nodes before declaring itself leader. With five nodes, quorum is three. A 3-2 partition allows the larger group to elect a leader — the smaller group cannot reach three nodes and cannot elect. At most one leader can exist at any time, even during a partition.
What is the difference between safety and liveness in coordination?
Safety means nothing bad ever happens — two nodes never hold the same lock, two leaders never exist simultaneously, no write succeeds without valid lock ownership. Liveness means something good eventually happens — a lock is eventually released, a new leader is eventually elected, the system eventually makes progress. When they conflict during a network partition, most coordination systems prioritise safety — they pause and wait rather than risk producing incorrect results. Brief unavailability is recoverable. A split-brain in a payment processing system or a double-write in a financial ledger may not be.
When should I design around coordination rather than using a distributed lock?
When the operation can be made idempotent — executing it twice produces the same result as once — a lock may be unnecessary because retries are safe regardless of whether the first attempt succeeded. When updates are commutative — order does not matter — concurrent writes can be applied in any sequence without coordination. When conflicts are rare, optimistic concurrency — detecting and rolling back conflicts after the fact — eliminates lock contention on the happy path at the cost of occasional retries. When data can be partitioned by owner — each piece belonging to exactly one authoritative node — that node can make decisions without cross-node coordination. Distributed locks are the right tool for specific mutual exclusion problems, not a general solution to concurrency.
What happens if the coordination service itself goes down?
With quorum-based replication — as etcd and ZooKeeper implement using Raft and ZAB respectively — the service survives individual node failures as long as a majority remains reachable. If the service becomes entirely unavailable, processes needing new locks cannot acquire them. Existing lease holders can continue until their leases expire, at which point they should stop operating rather than assume continued ownership. This is why coordination services must be treated as critical infrastructure — a failed coordination service means no new locks, no leader elections, and no authoritative configuration changes until it recovers.
What coordination failures taught me in production
Distributed locks are the mechanism where the gap between theory and production is widest. In theory, you acquire the lock, do the work, release the lock. In production, the process holding the lock pauses for GC, the lease expires, another process acquires the lock and starts writing, the first process resumes and also writes, and you have a corruption event that produces no errors in any log file because both writes succeeded.
At IDFC First Bank, we build financial transaction systems. The cost of a coordination failure in that environment is not a stale cache or a duplicate notification — it is a double-charge, a missed settlement, or a compliance event. The fencing token pattern is not optional in that context. It is the mechanism that makes the difference between a distributed lock that provides a correctness guarantee and one that provides the feeling of a correctness guarantee while failing silently under the precise conditions that matter most.
The thing I wish I had understood earlier: fencing does not live in the lock service. It lives in the resource being protected. The lock service gives you the token. The database, the payment processor, the inventory system — whatever is being protected — must validate that token on every write. If you implement fencing in the lock service but not in the resource, you have done half the work and achieved none of the benefit. Most distributed lock implementations I have reviewed in production have this gap.
The lesson is the “avoid coordination where you can” principle in action. The systems that scaled cleanly were the ones that coordinated least. The ones that did not scale were invariably the ones where we had reached for a distributed lock when a better design would have eliminated the need for one.
The next post — Post 2.5 on Logical Clocks — connects directly to what this post covered about time. Every lease expiry, every fencing token, every heartbeat timeout is fundamentally a time-based decision in a system where time cannot be fully trusted. Logical clocks are the mechanism that makes ordering reliable when physical clocks are not. Read it with the GC pause scenario from this post in mind — the connection between untrusted time and coordination failures will click in a way it would not have before.
Series home: Distributed Systems — Concepts, Design & Real-World Engineering
Part 2 — Communication & Coordination
- 2.0 — From Constraints to Communication
- 2.1 — Communication Fundamentals in Distributed Systems
- 2.2 — Reliability and Retries in Distributed Systems
- 2.3 — Naming and Service Discovery
- 2.4 — Coordination and Distributed Locks
- 2.5 — Logical Clocks and Time
- 2.6 — Coordination Services: ZooKeeper, etcd and Consul
- 2.7 — Engineering Guidelines: Communication and Coordination
Previous: ← 2.3 — Naming and Service Discovery
Next: 2.5 — Logical Clocks and Time →
Not read Part 1 yet? Start with 1.1 — What Is a Distributed System (Really)?