Distributed Systems Series — Part 2.3: Communication & Coordination
The Problem Nobody Thinks About Until It Breaks Everything
Service discovery and naming in distributed systems solve a problem that feels trivial until it does not. When engineers first build a distributed system, they wire services together with hardcoded addresses. Service A calls http://10.0.4.87:8080. It works. The system ships.
Then the payments service gets redeployed. Its IP address changes. Service A starts failing. Someone updates the config, redeploys, everything works again — until it happens next week. At three services this is annoying. At thirty it is unmanageable. At three hundred it is a full-time job that still produces outages.
This is the naming and discovery problem. It sounds mundane. It is not. Every distributed system above a trivial scale must answer the same fundamental question: how does one service find another, reliably, in a system where addresses change constantly? And the answer is not “update the config more carefully.” It is to stop relying on static configuration entirely and use a discovery system that tracks the current state of the world in real time.
Why Static Configuration Fails at Scale
In a single-machine system, calling a function means jumping to a memory address that never changes at runtime. In a distributed system, the equivalent is a network endpoint — and unlike memory addresses, network endpoints are unstable by nature. Containers restart with new IP addresses assigned by the scheduler. Services scale horizontally — ten instances today, two tomorrow. Nodes fail and are replaced. Deployments roll instances in and out continuously. Services move between availability zones during rebalancing or failover.
Hardcoding addresses, or managing them in static configuration files, cannot keep pace with this rate of change. The deeper problem is that static configuration creates tight temporal coupling between services. Service A can only work correctly if it was configured with an address that is currently valid. In a dynamic system, “currently valid” is a moving target that static configuration cannot track. This is the fifth of the Eight Fallacies of Distributed Computing in direct production form: the assumption that topology does not change. It changes constantly.
Naming vs Discovery: Two Distinct Problems
These two problems are often conflated. Separating them precisely is the key to understanding why DNS alone is insufficient and why dedicated discovery systems exist.
Naming is the problem of giving services stable, human-readable identifiers that are independent of their physical location. payments-service is a name. 10.0.4.87:8080 is a location. The name should remain constant even when the location changes across deployments, restarts, and rebalancing. This separation is what allows systems to remain reconfigurable without breaking callers — as long as the name resolves correctly, the caller does not need to know or care where the service physically lives.
Discovery is the problem of resolving a name to a current, healthy location at the time of a request. Given the name payments-service, discovery answers: which specific instance, at which address, should this request go to right now? And if that instance becomes unhealthy thirty seconds from now, how does the caller learn about it before sending its next request?
DNS solves the naming problem well. It solves the discovery problem poorly. Understanding precisely why is the key to understanding every architectural decision that follows.
DNS: Good for Naming, Insufficient for Discovery
DNS is the oldest and most familiar name resolution system in distributed computing. For public internet services with relatively stable infrastructure, it works well. For internal service-to-service communication in a dynamic distributed system, it has three structural limitations that make it insufficient as a complete discovery solution.
TTL-based caching breaks fast failover. DNS records carry a TTL that tells resolvers how long to cache the result. In practice, some clients and intermediate resolvers ignore TTLs or clamp them to minimums of 30 to 300 seconds. When a service instance fails and its DNS record is updated, callers holding a cached answer can keep routing to the dead instance until their cache entry expires. In a system that expects failover within seconds, as Post 4.4 establishes for production failure detection, minutes of stale DNS routing is unacceptable.
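The stale window is easy to see in a toy model. The sketch below is not a real DNS client. It assumes a hypothetical `CachingResolver` that caches answers for a fixed TTL, with time passed in explicitly so the behaviour is deterministic. The replacement address 10.0.9.12 is made up for illustration.

```python
class CachingResolver:
    """Toy resolver cache: serves a cached answer until its TTL expires."""

    def __init__(self, authoritative, ttl_seconds):
        self.authoritative = authoritative  # dict of name -> address (the "truth")
        self.ttl = ttl_seconds
        self.cache = {}                     # name -> (address, expires_at)

    def resolve(self, name, now):
        entry = self.cache.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]                 # cached answer, possibly stale
        address = self.authoritative[name]
        self.cache[name] = (address, now + self.ttl)
        return address


records = {"payments-service": "10.0.4.87"}
resolver = CachingResolver(records, ttl_seconds=60)

first = resolver.resolve("payments-service", now=0)
records["payments-service"] = "10.0.9.12"   # instance replaced at t=5 (hypothetical)
stale = resolver.resolve("payments-service", now=30)
fresh = resolver.resolve("payments-service", now=61)
```

At t=30 the caller still gets the old address even though the record changed at t=5. Only after the cache entry expires does the new address appear.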
DNS returns addresses, not health. A DNS response says “the payments service is at these addresses.” It has no knowledge of whether those addresses are currently healthy, overloaded, or returning errors on every request. A caller that receives a DNS response cannot distinguish a healthy instance from one that is alive but completely broken — the gray failure scenario established in Post 1.4.
DNS does not handle ephemeral ports well. In containerised systems, each instance may run on a shared host with a dynamically assigned port. The address records most clients look up (A and AAAA) map hostnames to IP addresses and carry no port. SRV records can carry a port, but maintaining one for every container instance at the rate containers are created and destroyed is operationally impractical.
DNS remains useful as a stable naming layer — most service discovery systems expose DNS-compatible interfaces for backwards compatibility. But a dedicated service registry is required for real-time, health-aware discovery.
Service Registries: The Standard Solution
A service registry is a dedicated system that maintains a real-time catalogue of which service instances are running, where they are, and whether they are healthy. The registry operates on a four-step lifecycle that runs continuously as the system changes.
Registration. When a service instance starts, it registers itself with the registry — providing its name, address, port, and any metadata such as version, region, or capabilities. In some implementations, the instance self-registers. In others (particularly Kubernetes), the orchestration platform registers on the instance’s behalf.
Health checking. The registry continuously monitors registered instances — either by polling a health endpoint on the instance, by requiring instances to send periodic heartbeats, or through a combination of both. Instances that fail health checks are marked unhealthy and removed from the available pool. This is the critical difference from DNS: the registry knows about health in real time, not just location.
Deregistration. When an instance shuts down cleanly, it deregisters itself. When it crashes without deregistering, the registry detects the absence of heartbeats after a configurable timeout and removes it automatically. The timeout is a direct application of the failure detection principles from Post 4.4: the registry is suspecting failure from missed heartbeats, the same job a phi accrual detector does, though usually with a simple fixed timeout rather than an adaptive suspicion level.
Resolution. When a caller needs to reach a service, it queries the registry for the current list of healthy instances. The registry returns only instances that are currently passing health checks — not the full registered list, and not a cached list from ten minutes ago.
This lifecycle means the registry always reflects the current state of the system, not the state as it was when a config file was last written. Callers get live, health-filtered endpoint lists rather than static addresses.
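The four steps can be sketched as a small in-memory registry. This is a minimal illustration under stated assumptions, not how any particular product is implemented: the `ServiceRegistry` class and its method names are hypothetical, health is modelled purely as heartbeat freshness, and time is passed in explicitly.

```python
class ServiceRegistry:
    """Toy registry: an instance is healthy while its heartbeat is fresh."""

    def __init__(self, heartbeat_timeout):
        self.heartbeat_timeout = heartbeat_timeout
        self.last_seen = {}  # (service name, address) -> time of last heartbeat

    def register(self, name, address, now):
        self.last_seen[(name, address)] = now

    def heartbeat(self, name, address, now):
        if (name, address) in self.last_seen:
            self.last_seen[(name, address)] = now

    def deregister(self, name, address):
        self.last_seen.pop((name, address), None)

    def evict_expired(self, now):
        # Automatic deregistration: drop instances whose heartbeats stopped.
        expired = [key for key, seen in self.last_seen.items()
                   if now - seen > self.heartbeat_timeout]
        for key in expired:
            del self.last_seen[key]

    def resolve(self, name, now):
        self.evict_expired(now)
        return sorted(addr for (svc, addr) in self.last_seen if svc == name)


registry = ServiceRegistry(heartbeat_timeout=10)
registry.register("payments-service", "10.0.4.87:8080", now=0)
registry.register("payments-service", "10.0.4.88:8080", now=0)
registry.heartbeat("payments-service", "10.0.4.87:8080", now=8)

# The second instance crashed silently and never sent another heartbeat.
after_crash = registry.resolve("payments-service", now=12)

registry.deregister("payments-service", "10.0.4.87:8080")  # clean shutdown
after_shutdown = registry.resolve("payments-service", now=13)
```

A production registry would also run active health checks and replicate this state across nodes. The sketch only shows why resolution returns a different answer from the registered list.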
Client-Side vs Server-Side Discovery
Once a registry exists, two architectural patterns determine how callers use it. This choice shapes where complexity lives in the system and what failure modes each service must handle.
Client-side discovery. The calling service queries the registry directly and makes its own load-balancing decision. It receives a list of healthy instances and selects one — using round-robin, least-connections, or any other algorithm. Netflix’s Eureka with the Ribbon client library is the canonical example. Each Netflix microservice embedded a client that queried Eureka for the current instance list and performed its own routing.
The benefit: each service has complete control over its routing logic with no additional network hop. The cost: every service must embed discovery client logic. In a polyglot environment — services in Java, Python, Go, Node.js — each language needs its own client library implementation. Routing policy changes require updates across every service.
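A minimal sketch of the client-side pattern, assuming a hypothetical `fetch_healthy_instances` callable that stands in for the registry query. It is not Ribbon's actual algorithm, just plain round-robin over whatever the registry currently reports.

```python
import itertools


class RoundRobinClient:
    """Picks the next instance from the registry's current healthy list."""

    def __init__(self, fetch_healthy_instances):
        self.fetch = fetch_healthy_instances  # callable returning a list of addresses
        self.counter = itertools.count()

    def pick(self):
        instances = self.fetch()
        if not instances:
            raise LookupError("no healthy instances available")
        return instances[next(self.counter) % len(instances)]


healthy = ["10.0.4.87:8080", "10.0.4.88:8080", "10.0.4.89:8080"]
client = RoundRobinClient(lambda: list(healthy))

picks = [client.pick() for _ in range(4)]

healthy.remove("10.0.4.88:8080")          # registry drops a failed instance
pick_after_removal = client.pick()
```

Every caller in every language needs an equivalent of this class, which is exactly the cost described above.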
Server-side discovery. The caller sends its request to a load balancer or proxy. The proxy queries the registry, selects an instance, and forwards the request. The caller needs no knowledge of the registry or the instance list — it only knows the stable address of the proxy. Kubernetes Services operate on this model. A Kubernetes Service has a stable virtual IP. When a request arrives at that IP, kube-proxy routes it to a currently healthy pod — transparently to the caller.
The benefit: discovery logic is centralised — callers are simple, and routing policy changes can be applied in one place. The cost: an additional network hop through the proxy, and the proxy becomes a component that must itself be highly available.
How Real Systems Implement This
Consul is one of the most widely deployed service discovery systems outside Kubernetes. It combines a service registry, health checking, and key-value storage. Services register with the local Consul agent, which forwards registrations to the Consul servers. Health checks run at the agent level: HTTP checks polling a health endpoint, TCP checks verifying port availability, or script checks running arbitrary commands. Consul exposes discovery through both a DNS interface (payments.service.consul resolves to healthy instance addresses) and an HTTP API (richer metadata including health status per instance). Consul uses a gossip protocol for cluster membership and failure detection among agents, and Raft consensus among its servers to replicate the catalogue. The Raft quorum is what makes it resilient to individual server failures.
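As an illustration of the HTTP side, the sketch below parses the kind of response Consul's health endpoint returns for `GET /v1/health/service/payments?passing` against a local agent (default HTTP port 8500). The payload is a trimmed, made-up sample rather than captured output, and `healthy_endpoints` is a hypothetical helper. The one Consul-specific detail it relies on is that an empty `Service.Address` means the node's address should be used.

```python
import json


def healthy_endpoints(response_body):
    """Extract (host, port) pairs from a Consul health-service response."""
    endpoints = []
    for entry in json.loads(response_body):
        service = entry["Service"]
        # An empty service address means "use the address of the node".
        host = service.get("Address") or entry["Node"]["Address"]
        endpoints.append((host, service["Port"]))
    return endpoints


# Trimmed, hypothetical sample of the response shape (real responses carry
# more fields, including a "Checks" list per entry).
sample_body = json.dumps([
    {"Node": {"Node": "node-a", "Address": "10.0.4.87"},
     "Service": {"ID": "payments-1", "Service": "payments",
                 "Address": "", "Port": 8080}},
    {"Node": {"Node": "node-b", "Address": "10.0.4.90"},
     "Service": {"ID": "payments-2", "Service": "payments",
                 "Address": "10.0.7.15", "Port": 31842}},
])

endpoints = healthy_endpoints(sample_body)
```

Note that each entry carries its own port, which is how the registry sidesteps the ephemeral-port problem that plain address records cannot express.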
Kubernetes service discovery is server-side by default. Every Service object gets a stable DNS name via CoreDNS: payments-service.default.svc.cluster.local. This name resolves to the Service’s ClusterIP. kube-proxy programs iptables or IPVS rules on every node to forward traffic from the ClusterIP to currently ready pod endpoints. The Endpoints controller (in current Kubernetes versions, together with the EndpointSlice controller) watches pod readiness and updates the endpoint list whenever pods start, stop, or fail readiness probes. Application code only needs to know the Service name; the control plane manages all instance tracking, health filtering, and routing.
For more advanced routing — canary deployments, traffic splitting, circuit breaking, mTLS — service meshes like Istio sit on top of Kubernetes Services and add a sidecar proxy (Envoy) alongside each pod. The sidecar intercepts all inbound and outbound traffic and applies routing rules configured through the Istio control plane. This gives platform teams the ability to implement circuit breaking, load balancing algorithms, retries, and observability at the infrastructure layer, without requiring application code changes — the same pattern that Post 4.7 covers for bulkheads and fault isolation.
AWS Cloud Map provides managed service discovery integrated with Route 53 and other AWS services. Services register instances through the API or automatically through ECS and EKS integrations. Callers discover instances through the Cloud Map API or through DNS queries against Route 53 private hosted zones. For teams running primarily on AWS without Kubernetes, Cloud Map is the lowest-friction path to production-grade service discovery.
The Health Check Problem
Service discovery is only as reliable as its health checks. A registry that returns unhealthy instances is worse than no registry — callers waste time and resources on requests that will fail, and the load amplifies the failure rather than routing around it.
Liveness vs readiness. As established in Post 1.4, a service can be alive while not being ready to serve traffic — still loading its cache, waiting for a downstream dependency, or processing at capacity. A liveness check confirms the process is running. A readiness check confirms the instance can currently handle requests. Discovery systems should route traffic only to instances passing readiness checks, not just liveness checks. Routing to a liveness-passing, readiness-failing instance produces the gray failure scenario where the service appears healthy to infrastructure monitoring while returning errors on every application request. Kubernetes formalises this distinction with separate liveness and readiness probe configurations and only adds pods to Service endpoints when their readiness probe passes. The full treatment of health-check-driven routing is in Post 4.6.
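The distinction is small in code and large in effect. A minimal sketch, assuming a hypothetical `InstanceState` with two illustrative readiness conditions. Real services would check whatever dependencies they actually need.

```python
from dataclasses import dataclass


@dataclass
class InstanceState:
    cache_warmed: bool
    database_reachable: bool


def liveness_status(state):
    # The process is running and able to answer at all.
    return 200


def readiness_status(state):
    # Only report ready when the instance can actually serve requests.
    if state.cache_warmed and state.database_reachable:
        return 200
    return 503


warming_up = InstanceState(cache_warmed=False, database_reachable=True)
serving = InstanceState(cache_warmed=True, database_reachable=True)

warming_live = liveness_status(warming_up)
warming_ready = readiness_status(warming_up)
serving_ready = readiness_status(serving)
```

A discovery layer that routes on the first function would send traffic to the warming instance. One that routes on the second would not.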
The flapping instance. An instance that oscillates between healthy and unhealthy — passing health checks intermittently — causes erratic routing behaviour. Discovery systems typically address this with hysteresis: an instance must fail N consecutive checks before being removed, and must pass M consecutive checks before being re-added. This prevents a briefly recovered instance from being flooded with traffic before it has proven stable.
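The N-fail, M-pass rule can be sketched as a small state machine. The `HysteresisTracker` name and the thresholds used here (3 failures to remove, 2 passes to re-add) are illustrative assumptions, not defaults from any specific registry.

```python
class HysteresisTracker:
    """Flips pool membership only after enough consecutive contrary checks."""

    def __init__(self, fail_threshold, pass_threshold):
        self.fail_threshold = fail_threshold  # N consecutive failures to remove
        self.pass_threshold = pass_threshold  # M consecutive passes to re-add
        self.in_pool = True
        self.streak = 0  # consecutive results that contradict the current state

    def record(self, check_passed):
        contradicts = check_passed != self.in_pool
        self.streak = self.streak + 1 if contradicts else 0
        threshold = self.fail_threshold if self.in_pool else self.pass_threshold
        if contradicts and self.streak >= threshold:
            self.in_pool = not self.in_pool
            self.streak = 0
        return self.in_pool


tracker = HysteresisTracker(fail_threshold=3, pass_threshold=2)
results = [False, True, False, False, False, True, False, True, True]
membership = [tracker.record(r) for r in results]
```

The single early failure and the single mid-outage pass are both absorbed. Membership changes only after three failures in a row and again after two passes in a row.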
The slow health check. A health check endpoint that is slow to respond can cause the registry to incorrectly mark a healthy instance as failed — particularly when the health check timeout is shorter than the instance’s response time under load. Health check timeouts must be set with awareness of the instance’s realistic latency distribution, not just its average response time.
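A worked example of why the average misleads. The latency samples below are invented for illustration, the nearest-rank `percentile` helper is one simple way to compute a percentile, and the 1.5x headroom factor is an arbitrary assumption rather than a recommendation.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical health endpoint latencies in milliseconds, under load.
latencies_ms = [20, 22, 25, 21, 24, 23, 400, 26, 22, 380]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p99_ms = percentile(latencies_ms, 99)

# A timeout of 100 ms looks safe against the mean but fails slow responses.
false_failures = sum(1 for latency in latencies_ms if latency > 100)

# Size the timeout from the tail instead, with some headroom.
suggested_timeout_ms = p99_ms * 1.5
```

Here the mean is under 100 ms, yet a 100 ms timeout would mark a healthy instance as failed on two checks out of ten.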
Key Takeaways
- Static configuration cannot keep pace with the rate of change in a dynamic distributed system — names must be resolved at request time against a live registry, not baked into config files that are stale before they are deployed
- Naming (stable identifiers independent of physical location) and discovery (resolving names to current healthy instances) are distinct problems — DNS solves naming well but discovery poorly due to TTL caching, no health awareness, and poor ephemeral port handling
- Service registries maintain real-time catalogues of healthy instances through a continuous lifecycle of registration, health checking, automatic deregistration, and health-filtered resolution
- Client-side discovery gives services routing control but distributes complexity across every caller — server-side discovery centralises routing in a proxy at the cost of an additional network hop and a proxy that must itself be highly available
- Health checks must distinguish liveness from readiness — routing traffic to a live but unready instance produces gray failures that are hard to distinguish from application bugs and invisible to infrastructure monitoring
- Service meshes like Istio extend server-side discovery with circuit breaking, traffic splitting, mTLS, and observability at the infrastructure layer — making advanced routing patterns available without application code changes
- The registry itself is critical infrastructure — its failure prevents services from finding each other entirely; Consul and etcd both use Raft consensus across 3 or 5 nodes to maintain availability through individual node failures
Frequently Asked Questions (FAQ)
Why is DNS not enough for service discovery in microservices?
DNS has three limitations that make it insufficient on its own for dynamic distributed systems. TTL-based caching means stale records can persist for 30 to 300 seconds after an instance fails or moves, which is unacceptable when failure detection and failover must happen in seconds. DNS returns addresses without any knowledge of whether those addresses are currently healthy, so a caller receiving a DNS response cannot distinguish a healthy instance from one returning errors on every request. And DNS was not designed for the dynamic port assignments common in containerised environments, making SRV record management for every container instance operationally impractical. DNS works well as a stable naming layer, but a dedicated service registry is needed for real-time, health-aware discovery.
What is the difference between client-side and server-side service discovery?
In client-side discovery, the calling service queries the registry directly and selects an instance itself — giving it full control over routing logic but requiring every service to embed discovery client code. Netflix’s Eureka with Ribbon is the canonical example. In server-side discovery, the caller sends requests to a proxy or load balancer that handles registry queries and instance selection transparently — Kubernetes Services operate this way. Client-side discovery distributes routing complexity across every caller. Server-side discovery centralises it in the proxy, simplifying callers at the cost of an additional network hop and a proxy that must itself be highly available.
What is a service registry?
A service registry is a dedicated system that maintains a real-time catalogue of running service instances: their names, addresses, ports, health status, and metadata. Services register themselves on startup, send heartbeats or respond to health checks while running, and deregister on shutdown. The registry automatically removes instances that stop sending heartbeats after a configurable timeout. Callers query the registry to receive a current list of healthy instances rather than a static address. Consul, Netflix’s Eureka, and Kubernetes’ built-in Endpoints controller are widely used implementations in production. etcd is a general-purpose key-value store rather than a registry in its own right, but it often serves as the store underneath one, as it does for Kubernetes.
How does Kubernetes handle service discovery?
Kubernetes implements server-side discovery through Service objects. Each Service gets a stable DNS name via CoreDNS and a virtual ClusterIP. kube-proxy programs iptables or IPVS rules on every node to forward traffic from the ClusterIP to currently healthy pod endpoints. The Endpoints controller watches pod health and readiness probes, keeping the endpoint list current as pods start, stop, and fail. Application code only needs to know the Service name — the Kubernetes control plane handles all instance tracking, health filtering, and routing. For advanced routing, Istio adds a sidecar proxy alongside each pod that applies circuit breaking, traffic splitting, and mTLS without application code changes.
What is the difference between liveness and readiness in health checks?
A liveness check answers whether the process is alive and not deadlocked — it detects crash-stop and crash-recovery failures. A readiness check answers whether the instance can currently handle requests correctly — it detects gray failures where the process is alive but unable to serve traffic, such as a warming cache, an exhausted connection pool, or a degraded downstream dependency. Discovery systems should route traffic only to instances passing readiness checks. Routing to a liveness-passing, readiness-failing instance produces application errors that are invisible to infrastructure monitoring because the process appears healthy. Kubernetes exposes both as separate probe configurations and only includes pods in Service endpoints when their readiness probe passes.
What happens if the service registry itself goes down?
Clients with cached endpoint lists can continue routing to those cached endpoints while the registry is unavailable — degraded but functional. The critical failure mode is a cold start: a service that has never queried the registry and has no cached endpoints cannot route any requests at all. This is why the registry must be treated as critical infrastructure with strong availability guarantees. Consul and etcd both use Raft consensus across an odd number of nodes (3 or 5) to maintain availability through individual node failures — a single node failure does not take down the registry because the remaining majority can still form quorum and serve requests.
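The warm-versus-cold difference can be sketched with a client that remembers the last list it fetched. `CachingDiscoveryClient` and `FakeRegistry` are hypothetical names used only for this illustration, and a real client would also age out entries that have been cached for too long.

```python
class CachingDiscoveryClient:
    """Falls back to the last known endpoint list when the registry is down."""

    def __init__(self, query_registry):
        self.query_registry = query_registry  # callable; raises ConnectionError when down
        self.cached = None                    # None means "never fetched"

    def endpoints(self):
        try:
            self.cached = list(self.query_registry())
        except ConnectionError:
            if self.cached is None:
                raise  # cold start: nothing to fall back to
        return self.cached


class FakeRegistry:
    """Stand-in registry that can be switched off to simulate an outage."""

    def __init__(self):
        self.up = True

    def __call__(self):
        if not self.up:
            raise ConnectionError("registry unavailable")
        return ["10.0.4.87:8080"]


registry = FakeRegistry()
warm_client = CachingDiscoveryClient(registry)
warm_result = warm_client.endpoints()

registry.up = False                           # registry outage begins
during_outage = warm_client.endpoints()

cold_client = CachingDiscoveryClient(registry)
try:
    cold_client.endpoints()
    cold_start_failed = False
except ConnectionError:
    cold_start_failed = True
```

The warm client keeps routing on its cached list during the outage. The client that starts during the outage has nothing to route to.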
What I wish I had known before wiring up my first microservices platform
The first time I built something at the scale where service discovery actually mattered was at Jio Platforms. Millions of users, dozens of services, constant deployments. The moment we moved from “services that knew where to find each other” to “services that asked a registry where to find each other” felt like a small change in architecture and a large change in operational reality. Deployments stopped requiring coordinated config updates across teams. Services stopped failing because someone forgot to update an IP address in a properties file. The registry handled all of it.
What I did not fully appreciate until later was the health check problem. We had health check endpoints. They returned 200. They were not readiness checks — they were liveness checks dressed up to look like readiness checks. The endpoint returned 200 whether the downstream database connection was healthy or not. Whether the cache was warmed or not. Whether the service was processing requests correctly or returning errors on 40% of them. We were routing traffic to instances that passed our health checks and failing our users. The registry was doing its job perfectly. The health checks were lying to it.
At IDFC First Bank, the stakes are different. In a regulated banking environment, routing a request to an instance that is alive but degraded is not just a performance problem — it is a potential compliance problem. The readiness check for a service must verify that the downstream core banking connections are healthy, that the fraud check service is reachable and that the transaction log is writable. Not just that the process is alive. We invested significantly in making our health checks reflect actual readiness, not just process liveness. It changed the operational character of the platform — we started catching degraded instances before they affected customers rather than after.
The practical thing I would tell any engineer setting up service discovery for the first time: your health check endpoint is not a formality. It is the contract between your service and the discovery layer about whether you are ready to do your job. Treat it with the same care you would treat your most important API endpoint. If it lies, everything downstream of it will eventually pay the price.
The next post, Post 2.4 on Coordination and Distributed Locks, is where the problems get harder. Service discovery is about finding services. Coordination is about making services agree on shared state without corrupting it. If service discovery felt like infrastructure plumbing, coordination will feel like distributed systems engineering at its most precise and most unforgiving.