# Distributed Tracing
When service A calls service B calls service C, a single request fans out across processes, networks, and log streams. Distributed tracing gives you a unified timeline of that request — every hop, every latency contribution, every error — in one view.
## Core concepts
- **Trace** — a tree of spans sharing a single `trace_id`. One trace = one end-to-end request.
- **Span** — one unit of work (an HTTP handler, an outgoing request, a DB query) with a `span_id`, a parent reference, timing, and key-value attributes.
- **Context propagation** — passes the `trace_id` + parent `span_id` across process boundaries via the W3C `traceparent` header. Instrumentation libraries inject the header on outgoing calls and extract it on incoming ones — no application code touches headers.
```mermaid
sequenceDiagram
    participant A as ServiceA
    participant B as ServiceB
    participant C as ServiceC
    A->>B: POST /dispatch (traceparent header)
    B->>C: POST /command (traceparent header)
    C-->>B: 200 OK
    B-->>A: 200 OK
```
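For reference, the `traceparent` header packs four dash-separated fields: version, trace ID (32 hex characters), parent span ID (16 hex characters), and trace flags. The value below is the worked example from the W3C Trace Context spec:

```text
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```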
## OpenTelemetry setup
OpenTelemetry (OTel) is the vendor-neutral instrumentation standard. Your code emits telemetry in OTel format; a separate collector decides where it goes (X-Ray, Jaeger, Datadog, …). Three things to configure:
- A `TracerProvider` with a `Resource` (service name) and an exporter
- A `SpanExporter` — OTLP to a collector in production, `ConsoleSpanExporter` for local dev
- Auto-instrumentors for your frameworks
```bash
uv add opentelemetry-api opentelemetry-sdk \
    opentelemetry-exporter-otlp-proto-http \
    opentelemetry-instrumentation-fastapi \
    opentelemetry-instrumentation-httpx \
    opentelemetry-instrumentation-requests
```

```python
# src/shared/observability.py
import contextlib
import os

from opentelemetry import trace
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter, SpanExporter


def init_tracing(service_name: str, *, app=None) -> None:
    otlp_endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT")

    exporter: SpanExporter
    if otlp_endpoint:
        from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

        exporter = OTLPSpanExporter(endpoint=f"{otlp_endpoint}/v1/traces")
    else:
        exporter = ConsoleSpanExporter()

    provider = TracerProvider(resource=Resource.create({SERVICE_NAME: service_name}))
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

    with contextlib.suppress(ImportError):
        from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor

        HTTPXClientInstrumentor().instrument()

    with contextlib.suppress(ImportError):
        from opentelemetry.instrumentation.requests import RequestsInstrumentor

        RequestsInstrumentor().instrument()

    if app is not None:
        from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

        FastAPIInstrumentor.instrument_app(app, excluded_urls="health")
```

Note
`excluded_urls="health"` prevents health-check probes from creating traces. Without it, load-balancer checks pollute the trace map with “Client” nodes.
## FastAPI example
Two services: a caller that sends requests and a server that handles them. Both call init_tracing at startup — that’s all the wiring needed.
Server (receives requests, creates server spans automatically):
```python
# server/main.py
from fastapi import FastAPI

from shared.observability import init_tracing

app = FastAPI()
init_tracing("my-server", app=app)


@app.post("/command")
async def handle_command(payload: dict):
    # this handler is automatically wrapped in a span
    result = process(payload)
    return {"status": "ok"}
```

Caller (sends requests, injects the `traceparent` header automatically):
```python
# caller/main.py
import httpx
from fastapi import FastAPI

from shared.observability import init_tracing

app = FastAPI()
init_tracing("my-caller", app=app)


@app.post("/dispatch")
async def dispatch(payload: dict):
    async with httpx.AsyncClient() as client:
        # httpx auto-instrumentation injects traceparent — the server
        # sees the same trace_id and creates a child span
        resp = await client.post("http://my-server:8000/command", json=payload)
    return {"upstream_status": resp.status_code}
```

The result: one trace with three spans (the caller's server span, the caller's client span, and the server's server span), all linked by the same `trace_id`. No manual header passing, no middleware — the instrumentors handle it. The same applies if you use `requests` instead of `httpx` — the `RequestsInstrumentor` registered in `init_tracing` covers it.
## Manual spans
Auto-instrumentation covers HTTP entry points. For non-HTTP triggers — a message consumer, a protocol handler, a scheduled job — create the root span yourself:
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)


def handle_incoming_message(payload):
    with tracer.start_as_current_span("process-message"):
        # any HTTP calls made here automatically become child spans
        dispatch_to_downstream(payload)
```
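Manual spans are also where the key-value attributes and error status from the core-concepts list get attached. A sketch using only the OTel API (the attribute name here is illustrative):

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)


def handle_incoming_message(payload):
    with tracer.start_as_current_span("process-message") as span:
        # illustrative attribute name; use whatever identifies the work
        span.set_attribute("message.size", len(str(payload)))
        try:
            dispatch_to_downstream(payload)
        except Exception as exc:
            # mark the span as failed and keep the stack trace on it
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR))
            raise
```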
## Structured logs with trace context

Logs become useful for tracing when they carry `trace_id` and `span_id`. With loguru (see Logging for general setup), replace the default sink with a JSON sink that reads the current OTel context:
```python
import json
import os
import sys

from loguru import logger
from opentelemetry import trace


def _json_sink(message) -> None:
    record = message.record
    ctx = trace.get_current_span().get_span_context()
    trace_id = format(ctx.trace_id, "032x") if ctx.trace_id else "0" * 32
    span_id = format(ctx.span_id, "016x") if ctx.span_id else "0" * 16
    entry = {
        "timestamp": record["time"].strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
        "level": record["level"].name,
        "service": record["extra"].get("service", "unknown"),
        "message": record["message"],
        "trace_id": trace_id,
        "span_id": span_id,
    }
    if record["exception"] is not None:
        entry["exception"] = str(record["exception"])
    sys.stderr.write(json.dumps(entry, default=str) + "\n")
    sys.stderr.flush()


def configure_logging(service_name: str) -> None:
    logger.remove()
    logger.configure(extra={"service": service_name})
    logger.add(_json_sink, level=os.environ.get("LOGURU_LEVEL", "INFO"))
```

Your log aggregator can now group log lines by `trace_id` or link them to the trace timeline.
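With `configure_logging("my-server")` called at startup alongside `init_tracing`, a `logger.info("command processed")` inside a traced request emits a line shaped roughly like this (timestamp and IDs are illustrative):

```json
{
  "timestamp": "2024-01-01T12:00:00.000000Z",
  "level": "INFO",
  "service": "my-server",
  "message": "command processed",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7"
}
```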
Tip
Some backends (e.g. AWS X-Ray) require an additional field in a backend-specific format to link logs to traces in the UI. Check your backend’s docs.
## The collector
Your app should not export traces directly to the tracing backend. Instead, send them over OTLP to a collector running alongside it:
```mermaid
flowchart LR
    App["Your app (OTLP)"] -->|"localhost:4318"| Collector["OTel Collector"]
    Collector --> Backend["Tracing backend"]
```
The collector handles batching, retry, sampling, and export. Switching backends is a collector config change, not a code change. In a containerized environment (ECS, Kubernetes), run the collector as a sidecar and mark it non-essential so telemetry failures never crash your app.
The only config your app needs: `OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318`.
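What the collector does with the data is defined in its own config. A minimal sketch of a pipeline that receives OTLP over HTTP on 4318, batches, and exports (the `debug` exporter is a placeholder; swap in your backend's exporter, e.g. X-Ray via ADOT):

```yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}

exporters:
  # placeholder exporter that prints spans to the collector's stdout;
  # replace with your backend's exporter
  debug: {}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
```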
## Async boundaries
Context propagation works over synchronous request-response calls. Asynchronous boundaries — message queues, event streams — break the trace chain because there is no carrier for the traceparent header.
You can bridge them by embedding traceparent in the message payload, but that requires schema changes. For most cases, correlating by a business-level ID (e.g. request_id) across separate traces is simpler.
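If you do bridge the gap, the OTel propagation API can write and read `traceparent` from any dict-like carrier. A sketch assuming a message envelope with a headers field (`send_to_queue` and `process` are hypothetical):

```python
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)


def publish(message: dict) -> None:
    # producer side: write traceparent into the message's headers dict
    headers = message.setdefault("headers", {})
    inject(headers)
    send_to_queue(message)  # hypothetical transport call


def consume(message: dict) -> None:
    # consumer side: restore the context and continue the trace under it
    ctx = extract(message.get("headers", {}))
    with tracer.start_as_current_span("consume-message", context=ctx):
        process(message)  # hypothetical handler
```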
## Further reading
OpenTelemetry docs:
- Context Propagation — how trace context moves across service boundaries
- Python SDK — full Python instrumentation guide
- Python Propagation — manual inject/extract when auto-instrumentation isn’t enough
- Python Instrumentation Libraries — using and discovering auto-instrumentors
- FastAPI Instrumentation — hooks, excluded URLs, header capture
- Collector Quick Start — run the collector in Docker in minutes
- Collector Configuration — receivers, exporters, processors, pipelines
Specs and standards:
- W3C Trace Context — the `traceparent` header format
Backends:
- AWS X-Ray with ADOT — OTel with AWS X-Ray via the AWS Distro for OpenTelemetry
- ADOT Collector — collector config for AWS backends
- Jaeger — open-source tracing backend, good for local development