Agent Observability — Tracing, Logging & Evaluation
Why Observability Is Hard for Agents
Traditional software executes deterministically and synchronously: when something fails, a stack trace points at the line that caused it. Agent systems differ in four ways:
- Non-determinism: the same input can produce different traces on different runs because LLM sampling is probabilistic.
- Multi-step failures: a task may fail on step 7 because of a bad decision on step 2. The failure point and the root cause are separated by several steps and thousands of tokens.
- Nested calls: agents call tools which call APIs which may call other agents. The trace is a tree, not a line.
- Emergent costs: a single agent run can involve 5 LLM calls and 8 tool calls. The total cost and latency emerge from the combination and are not visible from any single call.
Without structured observability, debugging a production agent failure means reading hundreds of log lines manually. With it, you can filter to the failed trace, see every LLM call and tool call in order, and identify the exact point of failure in seconds.
What to Capture
For every agent run, capture:
- Run-level: run_id, task, user_id, start_time, total_duration_ms, total_cost_usd, step_count, final_status (success/failed), final_answer
- LLM call-level: model, input_tokens, output_tokens, latency_ms, cost_usd, prompt_hash
- Tool call-level: tool_name, input_args, output_summary, latency_ms, success/error
This data enables: failure diagnosis, cost analysis, latency profiling, and automated evaluation.
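The three capture levels above can be sketched as dataclasses (field names follow the lists above; the class names are illustrative, not from a specific library):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LLMCallRecord:
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float
    prompt_hash: str

@dataclass
class ToolCallRecord:
    tool_name: str
    input_args: dict
    output_summary: str
    latency_ms: float
    success: bool
    error: Optional[str] = None

@dataclass
class RunRecord:
    run_id: str
    task: str
    user_id: str
    start_time: float
    total_duration_ms: float = 0.0
    total_cost_usd: float = 0.0
    step_count: int = 0
    final_status: str = "running"  # becomes "success" or "failed"
    final_answer: str = ""
    llm_calls: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)
```

Keeping LLM and tool events nested under the run record is what lets a single run_id reconstruct the whole trace.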
OpenTelemetry Tracing
OpenTelemetry is the vendor-neutral standard for distributed tracing. Instrumented agent code emits spans that can be viewed in backends such as Jaeger, Zipkin, Grafana Tempo, or Langfuse.
Structured JSON Logging
Every agent event should be logged as a JSON object so it can be queried programmatically:
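A minimal sketch of such logging using only the standard library; the formatter class and the `fields` key passed via `extra=` are naming choices for this example, not a standard convention:

```python
import json
import logging
import sys
import time

class JSONFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""
    def format(self, record):
        event = {
            "ts": time.time(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Merge structured fields passed via `extra={"fields": {...}}`.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)

logger = logging.getLogger("agent")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Every event carries run_id so the full trace can be reconstructed later.
logger.info("llm_call", extra={"fields": {
    "run_id": "abc123", "model": "gpt-4o", "input_tokens": 512,
    "output_tokens": 128, "latency_ms": 840, "cost_usd": 0.0041,
}})
```

Each line is now a valid JSON object, so `jq`, a log analytics tool, or a SQL-on-logs engine can filter by run_id or aggregate over cost_usd directly.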
The Complete AgentTracer Class
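A condensed sketch of such a tracer. The class name comes from the heading; the method names, the expectation that traced LLM functions return a dict with usage info, and the `print`-based log sink are assumptions for illustration:

```python
import functools
import json
import time
import uuid

class AgentTracer:
    """Collects run-, LLM-, and tool-level events under one run_id."""

    def __init__(self, task: str, user_id: str = "anonymous"):
        self.run = {
            "run_id": uuid.uuid4().hex,
            "task": task,
            "user_id": user_id,
            "start_time": time.time(),
            "step_count": 0,
            "total_cost_usd": 0.0,
            "events": [],
        }

    def _log(self, kind: str, **fields):
        fields.update(kind=kind, run_id=self.run["run_id"], ts=time.time())
        self.run["events"].append(fields)
        print(json.dumps(fields))  # swap for your real log pipeline

    def trace_llm_call(self, fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            t0 = time.time()
            result = fn(*args, **kwargs)  # assumed to return usage metadata
            self.run["step_count"] += 1
            self.run["total_cost_usd"] += result.get("cost_usd", 0.0)
            self._log("llm_call", model=result.get("model"),
                      cost_usd=result.get("cost_usd", 0.0),
                      latency_ms=round((time.time() - t0) * 1000, 1))
            return result
        return wrapper

    def trace_tool_call(self, fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            t0 = time.time()
            try:
                out, ok, err = fn(*args, **kwargs), True, None
            except Exception as e:
                out, ok, err = None, False, str(e)
            self._log("tool_call", tool_name=fn.__name__, success=ok, error=err,
                      latency_ms=round((time.time() - t0) * 1000, 1))
            if not ok:
                raise  # logged, then re-raised so the agent can react
            return out
        return wrapper

    def finish(self, status: str, answer: str = ""):
        self.run["final_status"] = status
        self.run["final_answer"] = answer
        self.run["total_duration_ms"] = (time.time() - self.run["start_time"]) * 1000
        return self.run
```

The decorators wrap existing functions without touching their logic, which is what makes retrofitting observability onto an existing agent cheap.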
Automated Evaluation Pipeline
Record 100 runs, then replay them through an LLM-as-judge to measure task completion quality:
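A sketch of the record-and-replay loop, assuming the run dicts produced by a tracer like the one above. The `judge` function is a placeholder: in production it would prompt a strong model to grade the answer against the task and parse a score from the response:

```python
import json
import random
import sqlite3

def record_run(db: sqlite3.Connection, run: dict) -> None:
    """Persist one finished run as a JSON blob keyed by run_id."""
    db.execute("CREATE TABLE IF NOT EXISTS runs (run_id TEXT PRIMARY KEY, data TEXT)")
    db.execute("INSERT OR REPLACE INTO runs VALUES (?, ?)",
               (run["run_id"], json.dumps(run)))
    db.commit()

def judge(task: str, answer: str) -> float:
    """Placeholder for an LLM-as-judge call; returns a 0..1 quality score."""
    return 1.0 if answer else 0.0

def evaluate_sample(db: sqlite3.Connection, sample_rate: float = 0.1) -> float:
    """Replay a random sample of recorded runs through the judge."""
    rows = db.execute("SELECT data FROM runs").fetchall()
    runs = [json.loads(r[0]) for r in rows]
    sample = random.sample(runs, max(1, int(len(runs) * sample_rate)))
    scores = [judge(r["task"], r.get("final_answer", "")) for r in sample]
    return sum(scores) / len(scores)
```

Sampling 10% keeps judge costs bounded while still giving a statistically useful quality signal over hundreds of runs.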
Anomaly Detection
Key Takeaways
- Agent observability requires three layers: structured per-event logs, OpenTelemetry distributed traces, and aggregate metrics stored in a queryable database.
- Capture every LLM call with (model, input_tokens, output_tokens, latency_ms, cost) — this data drives both debugging and cost optimisation.
- Use a unique run_id on every task execution and attach it to every downstream log event — this is what enables end-to-end trace reconstruction.
- The @trace_llm_call and @trace_tool_call decorators add observability to existing functions without changing their logic.
- Record runs to SQLite; replay a 10% sample through LLM-as-judge evaluation to continuously measure task quality without evaluating every run.
- Anomaly detection alerts on avg_steps doubling (looping) or cost spiking (runaway LLM calls) — both are production failure modes that latency alone won't catch.
- Structured JSON logs are queryable with any log analytics tool; unstructured text logs require manual parsing.
- Track five metrics in production: task_success_rate, avg_steps, avg_cost_usd, p99_latency, and tool_error_rate. Dashboard these and set alert thresholds before launch.