Distributed Systems Series — Part 4.8: Fault Tolerance & High Availability
Distributed Systems Without Observability Are Black Boxes
Every mechanism covered in Part 4 — failure detection, redundancy, self-healing, high availability architecture, fault isolation — produces value only if engineers can observe whether it is working. A Raft cluster that is experiencing unnecessary leader elections is degrading availability, but without metrics showing election frequency, no one knows. A gray failure producing incorrect results for 5% of requests is more damaging than a crash, but without application-level observability, heartbeat-based failure detection will not surface it. A bulkhead that is rejecting requests at an unexpected rate indicates a downstream dependency problem, but without structured logs correlating rejections to upstream requests, the dependency is invisible.
In a monolithic system, diagnosing failures means reading logs from one process. In a distributed system, a single user request traverses an API gateway, multiple microservices, a database, a cache, and a message queue. Each hop adds latency, each hop can fail, and the failure that manifests as a user-visible error may have originated several hops upstream from where the error is reported. Without the ability to follow a request across service boundaries and reconstruct what happened at each step, diagnosing failures is guesswork.
Observability is the engineering discipline that solves this. This post covers the three pillars of observability — logs, metrics, and traces — the OpenTelemetry standard that instruments them, the four golden signals that provide the minimum sufficient metrics for any distributed service, how correlation IDs connect signals across service boundaries, the tooling landscape, and the practical incident diagnosis workflow that all three signals enable together.
Monitoring vs Observability: A Precise Distinction
The terms monitoring and observability are frequently conflated. They are different disciplines that answer different questions.
Monitoring answers: is the system working? Monitoring is based on known failure modes — you instrument the things you know can go wrong, set thresholds, and alert when those thresholds are breached. Monitoring tells you that CPU is above 80%, that error rate has exceeded 1%, that the primary database has failed. It is reactive and bounded by what you anticipated when writing the alerts.
Observability answers: why is the system behaving this way? Observability allows engineers to ask new questions about system behaviour without modifying code — questions that were not anticipated when the system was built. A system is observable if its internal states can be inferred from its external outputs. When an unexpected failure mode occurs — and in production, it always does — observability allows engineers to investigate the specific condition rather than being limited to the pre-defined failure modes that monitoring anticipated.
The practical difference: monitoring detects that something is wrong. Observability explains what happened, where it happened, and why. Both are required. Monitoring without observability means fast detection and slow diagnosis. Observability without monitoring means deep investigative capability without the alerting that initiates investigation in the first place.
The connection to Post 4.1’s failure taxonomy: monitoring is sufficient for detecting crash-stop, crash-recovery, and omission failures — absences that threshold-based alerts can detect. Observability is required for detecting gray failures — partial functioning that passes infrastructure health checks but produces incorrect application results. A spike in 5xx error rate for a specific API endpoint, visible only in structured logs with request-level context, may be the only signal that a gray failure is occurring.
The Three Pillars: Logs, Metrics, and Traces
Three complementary telemetry signals together provide complete observability. Each answers questions the others cannot.
Logs: Discrete Event Records
A log is a timestamped record of a discrete event — something that happened at a specific point in time in a specific service. Logs are the highest-fidelity signal: they can capture arbitrary context about a specific event, including error messages, stack traces, request parameters, and state at the time of the event.
Unstructured logs — plain text lines — were sufficient for monolithic systems where all logs came from one process and could be grepped. In distributed systems, unstructured logs are nearly useless. They cannot be reliably parsed, queried, or correlated across services. Every service produces logs in a different format. Aggregating and searching them requires brittle regular expressions that break on format changes.
Structured logging is the production standard. Every log event is emitted as a JSON object (or another structured format) with a consistent set of mandatory fields. The mandatory fields that must be present on every log event, regardless of service:
- timestamp — ISO 8601 UTC, to millisecond precision.
- severity — one of DEBUG, INFO, WARN, ERROR, FATAL.
- service — the name of the emitting service.
- trace_id — the distributed trace ID for the request (more on this below).
- span_id — the current span within the trace.
- request_id — the correlation ID propagated from the caller.
- message — a human-readable description of the event.
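A minimal sketch of a logger that emits these fields as one JSON object per event, using only the Python standard library. The service name and ID values are illustrative; in a real system the trace and request IDs would be attached per-request (e.g. via contextvars or a logging adapter) rather than passed by hand:

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render every log record as a single JSON object with the mandatory fields."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
            "severity": record.levelname,
            "service": self.service,
            # IDs default to None so the schema stays stable even when an
            # event is emitted outside any request context.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "request_id": getattr(record, "request_id", None),
            "message": record.getMessage(),
        })

logger = logging.getLogger("checkout")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter(service="checkout"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Per-event context goes in `extra` so the IDs land in the JSON output.
logger.info("payment authorised",
            extra={"trace_id": "abc123", "span_id": "def456", "request_id": "req-789"})
```

Because every event carries the same field names, a log aggregation query like `trace_id = "abc123"` returns the complete cross-service event sequence for one request.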
With these fields present on every log event from every service, an engineer investigating a specific user’s failed request can filter logs across all services by trace_id and reconstruct the complete event sequence in chronological order — even across service boundaries, even when events arrived at the log aggregation system out of order.
Log levels require discipline to be useful. DEBUG logs are verbose and expensive — appropriate for local development and short-term production debugging, not for continuous production emission. INFO logs record significant state transitions — service startup, request completion, configuration changes. WARN logs record unexpected conditions that were handled — retried requests, degraded fallback activations, configuration values outside expected ranges. ERROR logs record failures that require investigation — unhandled exceptions, downstream timeouts, data corruption detection. Emitting ERROR logs for expected conditions (such as client 4xx errors) destroys the signal-to-noise ratio and causes alert fatigue.
Metrics: Numerical Time-Series Measurements
A metric is a numerical measurement recorded at regular intervals — a time series of values that captures the quantitative state of the system over time. Metrics are the most efficient observability signal: they are highly compressible, cheap to store long-term, and ideal for dashboards and alerting.
Google’s SRE book introduced the four golden signals as the minimum sufficient set of metrics for any user-facing distributed service. These four metrics, correctly instrumented, provide the first-level diagnostic signal that tells engineers whether a service is healthy and where to look when it is not.
Latency — the time it takes to serve a request, measured at p50, p95, p99, and p99.9. Always distinguish successful request latency from error request latency — errors that return immediately (fast 500s) can mask latency problems in successful requests if aggregated together. As established in Post 3.8, p99 and p99.9 matter more than p50 for user experience.
Traffic — the demand on the service, measured as requests per second, messages per second, or transactions per second. Traffic metrics establish the baseline against which latency and error rate anomalies are interpreted — a latency spike during a traffic spike is a different diagnosis from a latency spike during normal traffic.
Errors — the rate of requests that are failing, measured as a percentage of total requests. Distinguish between explicit errors (5xx responses, unhandled exceptions) and implicit errors (responses that are technically successful but semantically wrong — returning empty results where results are expected, or returning stale data that the application treats as an error). Implicit errors are the observability signature of gray failures.
Saturation — how full the service is, measured as the utilisation of the most constrained resource. For CPU-bound services, CPU utilisation. For I/O-bound services, disk or network I/O saturation. For thread-pool-backed services, thread pool utilisation and rejection rate. For connection-pool-backed services, connection pool wait time and exhaustion rate. Saturation metrics, combined with the bulkhead metrics from Post 4.7, are the primary signal for diagnosing resource exhaustion failures before they cascade.
Three metric types cover most instrumentation needs. Counters monotonically increase — total requests served, total errors, total bytes written. They are reset on service restart. Gauges represent current state that can increase or decrease — current active connections, current queue depth, current memory usage. Histograms record the distribution of values — request latency distributions that allow p50, p95, and p99 to be calculated correctly. Histograms are more expensive than counters or gauges but are the only correct way to instrument latency — averages of latency are meaningless and misleading.
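A toy illustration of why latency must be instrumented as a distribution rather than an average. The numbers are invented, and the nearest-rank percentile over raw samples is a simplification: a production histogram (as in Prometheus) uses pre-defined buckets rather than retaining every sample.

```python
import statistics

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile over recorded samples. Real metric
    histograms approximate this from bucket counts instead."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[rank]

# 95 fast requests and 5 slow ones: the average looks healthy,
# the p99 reveals the tail.
latencies_ms = [20.0] * 95 + [2000.0] * 5
print(statistics.mean(latencies_ms))   # 119.0: looks plausible
print(percentile(latencies_ms, 50))    # 20.0: median hides the tail too
print(percentile(latencies_ms, 99))    # 2000.0: the real user experience for the slowest requests
```

The mean (119ms) describes no request that actually happened; 95% of requests were far faster and 5% were an order of magnitude slower.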
Traces: Request-Level Execution Paths
A distributed trace follows a single request as it propagates across services, capturing the complete execution path — which services were called, in what order, how long each took, and where errors occurred. Traces answer the “where” question that logs and metrics cannot: not just that latency increased, but which specific service or database call in which specific request is responsible.
A trace is a tree of spans. A span represents a single unit of work — an HTTP request, a database query, a cache lookup, a message queue operation. Each span records: its trace ID (shared across the entire request tree), its span ID (unique to this operation), its parent span ID (identifying the caller), its start time, its duration, its status (success or error), and optional key-value attributes providing context.
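The span model described above can be sketched as a small data structure, together with the reassembly step a trace backend performs: spans arrive as a flat, possibly out-of-order list, and the tree is rebuilt from parent span IDs. Field names and IDs here are illustrative, not any specific backend's schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    trace_id: str                    # shared across the entire request tree
    span_id: str                     # unique to this operation
    parent_span_id: Optional[str]    # None for the root span
    name: str
    duration_ms: float
    children: list["Span"] = field(default_factory=list)

def build_trace_tree(spans: list[Span]) -> Optional[Span]:
    """Reassemble a flat list of spans into the request tree by linking
    each span to its parent via parent_span_id."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_span_id is None:
            root = s
        else:
            by_id[s.parent_span_id].children.append(s)
    return root

# Spans arrive out of order, as they would at a real collector.
spans = [
    Span("t1", "db", "api", "SELECT orders", 120.0),
    Span("t1", "gw", None, "GET /checkout", 180.0),
    Span("t1", "api", "gw", "checkout-service", 150.0),
]
tree = build_trace_tree(spans)
# The rebuilt tree: GET /checkout -> checkout-service -> SELECT orders,
# immediately showing that most of the 180ms request was the database call.
```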
The trace ID is the link that connects all three observability signals for a single request. When the trace ID is propagated through all service calls and included in every log event, an engineer investigating a slow or failed request can start from the trace (to see the complete execution path), jump to logs for a specific span (filtered by trace ID and span ID) to see detailed event context, and correlate with metrics to understand whether the performance of the failing span is typical or anomalous.
Trace sampling is required at scale. Full tracing of every request in a high-throughput system produces volumes that are prohibitively expensive to store and analyse. Head-based sampling makes the sampling decision at the start of a trace — a fixed percentage of all traces are recorded. Tail-based sampling makes the sampling decision after the trace completes — traces that exceed a latency threshold or contain errors are always recorded, while normal traces are sampled at a lower rate. Tail-based sampling is more expensive to implement but significantly more valuable — it ensures that exactly the traces you need for incident diagnosis are retained.
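The tail-based sampling decision can be sketched as a simple predicate evaluated once the trace has completed. The threshold and baseline rate below are illustrative tunables, not recommended values:

```python
import random

def tail_sample(trace_has_error: bool,
                trace_duration_ms: float,
                latency_threshold_ms: float = 500.0,
                baseline_rate: float = 0.01) -> bool:
    """Decide whether to retain a completed trace: always keep traces
    containing errors or exceeding the latency threshold, and keep a
    small random fraction of normal traces for baseline comparison."""
    if trace_has_error:
        return True
    if trace_duration_ms >= latency_threshold_ms:
        return True
    return random.random() < baseline_rate

# Erroring and slow traces are always retained; fast, successful
# traces are kept only at the 1% baseline rate.
assert tail_sample(trace_has_error=True, trace_duration_ms=50.0)
assert tail_sample(trace_has_error=False, trace_duration_ms=900.0)
```

The cost that makes tail-based sampling "more expensive to implement" is hidden in the inputs: every span of every trace must be buffered somewhere until the trace completes and this decision can be made.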
OpenTelemetry: The Instrumentation Standard
Before OpenTelemetry, each observability vendor provided its own SDK. Instrumenting an application for Datadog required Datadog’s SDK. Switching to Honeycomb required replacing all instrumentation code. Vendor lock-in at the instrumentation layer was pervasive and expensive.
OpenTelemetry (OTel), released as a CNCF project through the merger of OpenCensus and OpenTracing, provides vendor-neutral instrumentation APIs and SDKs for all three observability signals. Applications instrumented with OpenTelemetry can send telemetry to any compatible backend — Datadog, Honeycomb, Grafana, Jaeger, Prometheus — by changing the exporter configuration, not the instrumentation code.
OpenTelemetry has three components. The API defines the instrumentation interface — the calls that application code makes to create spans, record metrics, and emit logs. The SDK implements the API with configurable exporters, samplers, and processors. The Collector is a standalone agent that receives telemetry from applications, processes it (filtering, batching, enriching), and exports it to one or more backends. The Collector decouples instrumentation from backend — the application sends to the Collector, and the Collector routes to whatever backends are configured.
OpenTelemetry is now the standard starting point for new distributed systems instrumentation. Every major observability vendor supports OpenTelemetry ingestion. AWS, Google Cloud, and Azure all provide native OpenTelemetry integration. Instrumenting with vendor-specific SDKs for new systems in 2025 is a deliberate choice to accept vendor lock-in, not a necessity.
Correlation IDs and Context Propagation
The trace ID that OpenTelemetry propagates through all service calls is the mechanism that makes distributed tracing work. But trace ID propagation requires explicit design — every service in the call chain must extract the trace context from incoming requests and inject it into all outgoing requests.
OpenTelemetry’s W3C TraceContext propagation format standardises this. The traceparent HTTP header carries the trace ID and span ID from caller to callee. Every OpenTelemetry-instrumented service automatically extracts this header from incoming requests and injects it into outgoing requests. No application code change is required beyond instrumentation — the propagation is automatic.
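A hand-rolled sketch of the traceparent header mechanics, for illustration only (OpenTelemetry SDKs do this extraction and injection automatically). The header format is version(2 hex)-trace_id(32 hex)-parent_span_id(16 hex)-flags(2 hex); the flags handling below is simplified to the common sampled/not-sampled values:

```python
def parse_traceparent(header: str) -> dict:
    """Extract trace context from an incoming request's traceparent header."""
    version, trace_id, span_id, flags = header.split("-")
    if len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError("malformed traceparent")
    # Simplified: flags is a bitmask; bit 0 is the sampled flag.
    return {"version": version, "trace_id": trace_id,
            "span_id": span_id, "sampled": flags == "01"}

def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Build the header to inject into outgoing requests: the same
    trace_id, but this service's own span_id as the new parent."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
# Propagate the same trace onward under a new (illustrative) span ID.
outgoing = make_traceparent(ctx["trace_id"], "b7ad6b7169203331")
```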
For systems that are not yet fully instrumented with OpenTelemetry, manual correlation ID propagation provides partial observability. A correlation ID (X-Request-ID or X-Correlation-ID) is generated at the system entry point — the API gateway or the first service to receive a request — and propagated through all downstream service calls as a request header. Every service includes this ID in every log event it emits for that request. Without distributed tracing, the correlation ID allows logs to be correlated across services even without the parent-child span structure that traces provide.
The two common failures of context propagation that destroy observability are worth naming explicitly. First, services that create new correlation IDs rather than propagating the incoming one — every internal service call starts a new trace, making it impossible to connect upstream and downstream events. Second, services that propagate correlation IDs through synchronous HTTP calls but not through asynchronous message queue messages — when a request triggers an async job, the trace context is lost at the queue boundary, making the async processing invisible in traces.
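The second failure mode has a simple fix: carry the context inside the message envelope. The sketch below uses a plain list as a stand-in queue; real brokers carry this in message headers or attributes (Kafka headers, AMQP properties, SQS message attributes), and the field names are illustrative:

```python
import json

def publish_with_context(queue: list, payload: dict,
                         trace_id: str, request_id: str) -> None:
    """Attach trace context to the message envelope so the async
    consumer can continue the same trace across the queue boundary."""
    envelope = {
        "headers": {"trace_id": trace_id, "request_id": request_id},
        "body": payload,
    }
    queue.append(json.dumps(envelope))

def consume_with_context(queue: list) -> tuple:
    """Extract the propagated context before processing, instead of
    starting a fresh trace in the consumer."""
    envelope = json.loads(queue.pop(0))
    return envelope["headers"], envelope["body"]

queue = []
publish_with_context(queue, {"order_id": 42},
                     trace_id="abc123", request_id="req-7")
headers, body = consume_with_context(queue)
# headers["trace_id"] is still "abc123": the async job's spans and logs
# join the original request's trace rather than becoming invisible.
```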
The Observability Tooling Landscape
The observability tooling landscape has two tiers: open-source self-hosted stacks and commercial managed services.
The dominant open-source stack is the Grafana ecosystem. Prometheus collects and stores metrics, with a powerful query language (PromQL) for alerting and dashboards. Loki collects and indexes logs, using the same label-based query model as Prometheus. Tempo stores distributed traces. Grafana provides dashboards and alerting across all three. The combination of Prometheus + Loki + Tempo + Grafana provides a complete three-pillar observability stack with no per-datapoint cost. Jaeger and Zipkin are standalone open-source trace storage and query systems used in environments where Tempo is not the right fit.
Commercial managed services trade cost for reduced operational overhead. Datadog provides metrics, logs, traces, and alerting in a single platform with excellent cross-signal correlation and a large library of pre-built integrations. It is expensive at scale but significantly reduces the operational work of running observability infrastructure. Honeycomb specialises in high-cardinality event data and is particularly strong for trace-driven debugging — its query engine allows arbitrary filtering and aggregation on trace attributes that Prometheus-style metrics cannot express. It is the tool of choice for teams that have adopted Charity Majors’ observability-driven development philosophy. New Relic and Dynatrace provide similar full-stack managed observability with strong APM (application performance monitoring) capabilities.
SLI, SLO and SLA: Observability Driving Reliability Contracts
Observability data is the measurement foundation for service reliability contracts. The three-level hierarchy — SLI, SLO, SLA — connects raw telemetry to business commitments.
A service level indicator (SLI) is a specific metric that measures the aspect of service behaviour you care about — the percentage of requests that return a successful response within 200ms, for example. SLIs are derived from observability data: the four golden signals are common SLI candidates.
A service level objective (SLO) is a target value for an SLI — “99.9% of requests will return a successful response within 200ms, measured over a 30-day rolling window.” SLOs define what good looks like and drive the error budget framework introduced in Post 4.2. SLO-based alerting fires when the error budget burn rate exceeds a threshold — not when a single threshold is breached, but when the system is consuming its error budget faster than the target allows.
A service level agreement (SLA) is a contractual commitment to external customers — the published availability and performance commitments that carry business consequences if violated. SLAs are typically more conservative than SLOs, providing a buffer between the internal target (SLO) and the external commitment (SLA).
SLO-based alerting is preferable to threshold-based alerting for most signals. A static threshold alert on error rate fires every time error rate exceeds 1% — including brief spikes that resolve in seconds and never threaten the error budget. An SLO burn rate alert fires when the error rate is high enough and sustained enough to consume the monthly error budget at an unacceptable rate. SLO-based alerting produces fewer false positives and more actionable alerts.
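The burn rate arithmetic behind this can be sketched in a few lines. A burn rate of 1.0 means the error budget is consumed exactly over the SLO window; the 14.4x fast-burn threshold is the commonly cited figure for a 1-hour window against a 30-day budget (it consumes about 2% of the monthly budget per hour) and should be treated as a tunable, not a constant:

```python
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How fast the error budget is being consumed, as a multiple of
    the sustainable rate. >1.0 exhausts the budget before the window ends."""
    budget = 1.0 - slo  # e.g. 99.9% SLO leaves a 0.1% error budget
    return observed_error_rate / budget

def should_alert(observed_error_rate: float, slo: float = 0.999,
                 burn_threshold: float = 14.4) -> bool:
    """Fire only when errors are sustained and severe enough to
    consume the budget at an unacceptable rate."""
    return burn_rate(observed_error_rate, slo) >= burn_threshold

# Against a 99.9% SLO: 0.5% errors is a 5x burn (watch, do not page);
# 2% errors is a 20x burn and pages immediately.
```

A production implementation evaluates this over multiple windows (e.g. a fast 1-hour burn and a slow 6-hour burn) so that both sharp outages and slow leaks are caught.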
The Incident Diagnosis Workflow
The three signals, used in sequence, dramatically reduce time to diagnosis. The sequence is metrics → traces → logs — from broad signal to specific context.
Step 1 — Metrics identify what is broken and its scope. The four golden signals provide the first-level triage. Error rate spike: something is failing. Latency spike: something is slow. Saturation spike: something is running out of resources. Traffic anomaly: demand has changed unexpectedly. Metrics establish the time window of the incident and the affected services — the starting point for investigation.
Step 2 — Traces identify where it is broken. For the affected service and time window, traces show which specific downstream calls are contributing to the latency or errors. A trace that shows 80% of checkout request latency occurring in the fraud check service narrows the investigation to that service. A trace that shows a database call returning an error propagating through three upstream services identifies the root cause service immediately. Traces answer the “where” question that metrics cannot.
Step 3 — Logs identify why it is broken. With the specific service and time window identified from traces, logs provide the detailed event context. The error message, the stack trace, the specific request parameters that triggered the failure, the state of the system at the time — all of this is in the logs for the specific spans identified by the trace. Filtering logs by trace ID and span ID narrows from millions of log events to the dozens that are relevant to the specific failure being investigated.
This three-step workflow consistently reduces mean time to diagnosis in production incidents. The key enabler is the trace ID that connects all three signals — the same ID present in the trace, in the logs, and (through exemplars in Prometheus) linkable from metrics to specific traces. Without this connection, engineers must correlate signals manually by timestamp and service name — which is slow, error-prone, and frequently misleading.
Alerting That Does Not Produce Alert Fatigue
Alert fatigue — the condition where engineers stop responding to alerts because too many alerts are false positives — is one of the most common and damaging operational problems in distributed systems. A team that receives 200 alerts per week and investigates 5% of them is not operating an observability practice — it is running a noise generator.
Three principles reduce alert fatigue without reducing coverage.
- Alert on symptoms, not causes. An alert that fires when a user-visible SLI is degraded (error rate above threshold, latency above threshold) is always actionable — it means users are being affected. An alert that fires when CPU is above 80% may or may not be affecting users. Alert on the user-visible symptom; use dashboards to investigate the root cause after the alert fires.
- Use SLO burn rate alerting. As described above, SLO burn rate alerts fire only when the incident is severe enough to threaten the error budget — filtering out brief transient spikes that resolve before they matter.
- Set meaningful alert thresholds based on historical data. Alert thresholds set without reference to historical baseline behaviour produce both false positives (thresholds too tight) and missed incidents (thresholds too loose). Set initial thresholds at three sigma above the historical mean and adjust based on alert quality over time.
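The three-sigma starting point mentioned above is a one-liner over historical samples. The numbers here are invented, and this assumes roughly stable, unimodal baseline behaviour; strongly seasonal metrics need a seasonally adjusted baseline instead:

```python
import statistics

def alert_threshold(history: list, sigmas: float = 3.0) -> float:
    """Initial alert threshold at N sigma above the historical mean,
    to be tuned later based on observed alert quality."""
    return statistics.mean(history) + sigmas * statistics.stdev(history)

# Illustrative hourly p99 latency samples (ms) from a quiet baseline period:
history_ms = [110, 115, 108, 120, 112, 118, 111, 114]
threshold = alert_threshold(history_ms)
# threshold lands around 125-126ms: above normal jitter, below incident territory
```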
Key Takeaways
- Monitoring detects known failure modes; observability allows engineers to investigate unknown failure modes — both are required, and observability is specifically the mechanism that surfaces gray failures that threshold-based monitoring misses
- The three pillars of observability — logs, metrics, and traces — answer different questions: logs explain why an event happened, metrics show what is happening at scale, traces show where in the request path failures originate
- The four golden signals (latency, traffic, errors, saturation) provide the minimum sufficient metrics for any distributed service — instrument these first before adding service-specific metrics
- OpenTelemetry is the vendor-neutral instrumentation standard — instrument with OTel and export to any backend, avoiding vendor lock-in at the instrumentation layer
- Correlation IDs and trace context propagation connect all three signals for a single request — without trace ID propagation through all service calls and all log events, cross-service investigation requires manual timestamp correlation that is slow and error-prone
- The incident diagnosis workflow is metrics → traces → logs: metrics identify what is broken and its scope, traces identify where in the execution path the failure originates, logs provide the specific event context that explains why
- SLO-based alerting reduces alert fatigue by firing only when incidents are severe enough to threaten the error budget — not on every transient threshold breach that resolves before affecting users
Frequently Asked Questions (FAQ)
What is observability in distributed systems?
Observability is the ability to understand the internal state of a distributed system from its external outputs — logs, metrics, and traces. A system is observable if engineers can ask arbitrary questions about its behaviour without modifying code. Observability goes beyond monitoring (which detects known failure modes against pre-defined thresholds) by allowing investigation of unexpected failure modes that were not anticipated when the system was built. In distributed systems, observability is essential because failures can originate anywhere in a multi-service call chain and are often invisible without cross-service telemetry correlation.
What are the four golden signals?
The four golden signals, from Google’s SRE book, are the minimum sufficient set of metrics for any user-facing distributed service: latency (how long requests take, measured at p99 not just average), traffic (request volume per second, establishing the demand baseline), errors (the rate of failed requests, including both explicit errors and implicit errors like stale data returns), and saturation (how close the most constrained resource is to its limit — CPU utilisation, thread pool utilisation, connection pool wait time). Instrumenting these four signals first provides the alerting foundation before adding service-specific metrics.
What is OpenTelemetry and why does it matter?
OpenTelemetry (OTel) is a CNCF open-source project that provides vendor-neutral APIs, SDKs, and a Collector for instrumenting distributed systems with logs, metrics, and traces. Before OpenTelemetry, each observability vendor provided its own SDK — switching backends required replacing all instrumentation code. With OpenTelemetry, application code instruments against the OTel API, and the exporter configuration determines which backend receives the telemetry. This eliminates vendor lock-in at the instrumentation layer and is now the standard starting point for new distributed systems instrumentation.
What is a distributed trace and how does it work?
A distributed trace follows a single request as it propagates across services, capturing the complete execution path as a tree of spans. Each span represents one unit of work (an HTTP call, a database query, a cache lookup) and records its start time, duration, status, parent span ID, and trace ID. The trace ID is propagated through all service calls via HTTP headers (W3C TraceContext standard) so that all spans from a single request share the same trace ID. Trace data is stored in a trace backend (Jaeger, Tempo, Honeycomb) where engineers can visualise the complete execution path, identify the slowest spans, and jump from a trace to the logs for a specific span by filtering on trace ID and span ID.
What is SLO-based alerting and why is it better than threshold alerting?
SLO-based alerting fires when the error budget burn rate exceeds a threshold — when incidents are consuming the monthly error budget faster than is sustainable — rather than when a single metric exceeds a fixed threshold. A static threshold alert fires every time error rate exceeds 1%, including transient spikes that resolve in seconds. An SLO burn rate alert fires only when the error rate is high enough and sustained enough to threaten the monthly budget. SLO-based alerting produces fewer false positives, more actionable alerts, and directly connects alerting to user-visible reliability targets rather than arbitrary infrastructure thresholds.
How do you detect gray failures with observability?
Gray failures — nodes that pass infrastructure health checks while returning incorrect or degraded results — are invisible to heartbeat-based failure detection and threshold-based monitoring on infrastructure metrics. Detecting them requires application-level observability. The signatures to instrument: error rate by endpoint and response code (a specific endpoint returning elevated 5xx or returning semantically incorrect responses while overall availability appears fine), latency distribution anomalies (a specific downstream dependency showing p99 degradation while p50 is normal), and business-level metrics (order success rate, payment conversion rate, search result quality scores) that detect correctness problems that technical metrics do not capture.
Continue the Series
Series home: Distributed Systems — Concepts, Design & Real-World Engineering
Part 4 — Fault Tolerance & High Availability Overview
- 4.1 — Failure Taxonomy: How Distributed Systems Fail
- 4.2 — Fault Tolerance vs High Availability: Understanding the Difference
- 4.3 — Redundancy Patterns in Distributed Systems
- 4.4 — Failure Detection: Heartbeats, Timeouts and the Phi Accrual Detector
- 4.5 — Recovery and Self-Healing Systems
- 4.6 — Designing for High Availability: Patterns and Trade-offs
- 4.7 — Fault Isolation and Bulkheads
- 4.8 — Observability and Diagnosing Distributed Failures
- 4.9 — Chaos Engineering and Resilience Culture
Previous: ← 4.7 — Fault Isolation and Bulkheads
Next: 4.9 — Chaos Engineering and Resilience Culture →
Related posts from earlier in the series:
- 4.1 — Failure Taxonomy — Gray failures that only observability can detect
- 4.2 — Fault Tolerance vs High Availability — SLO error budgets that observability measures
- 3.8 — Performance Trade-offs in Replicated Systems — Tail latency and why p99 matters more than p50
- 4.7 — Fault Isolation and Bulkheads — Thread pool and connection pool saturation metrics
- 1.4 — Node & Failure Model — The slow node failure mode that observability surfaces