An agent that works in a Jupyter notebook is a demo. An agent that handles 50,000 requests per day, stays within a monthly LLM budget, recovers from rate limits, and lets engineers debug yesterday's failed run is a product. This lesson covers everything between demo and production.
Cost Tracking
LLM API costs scale with token usage. Track every call:
```python
from dataclasses import dataclass, field
from datetime import datetime

from anthropic import Anthropic

# Prices per million tokens
PRICING = {
    "claude-opus-4-5": {"input": 15.00, "output": 75.00},
    "claude-haiku-4-5": {"input": 0.25, "output": 1.25},
    "claude-sonnet-4-5": {"input": 3.00, "output": 15.00},
}

DAILY_BUDGET_USD = 100.00  # hard cap; calls fail once today's spend exceeds it

@dataclass
class UsageRecord:
    request_id: str
    model: str
    input_tokens: int
    output_tokens: int
    timestamp: datetime = field(default_factory=datetime.utcnow)

    @property
    def cost_usd(self) -> float:
        p = PRICING.get(self.model, {"input": 0, "output": 0})
        return (
            self.input_tokens / 1_000_000 * p["input"]
            + self.output_tokens / 1_000_000 * p["output"]
        )

usage_log: list[UsageRecord] = []

def tracked_llm_call(model: str, messages: list, **kwargs):
    client = Anthropic()
    response = client.messages.create(model=model, messages=messages, **kwargs)
    usage_log.append(UsageRecord(
        request_id=response.id,
        model=model,
        input_tokens=response.usage.input_tokens,
        output_tokens=response.usage.output_tokens,
    ))
    # Enforce the budget against today's records only, not the whole log
    today = datetime.utcnow().date()
    spend_today = sum(r.cost_usd for r in usage_log if r.timestamp.date() == today)
    if spend_today > DAILY_BUDGET_USD:
        raise RuntimeError("Daily LLM budget exceeded")
    return response
```
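Once every call lands in `usage_log`, spend reports fall out of a simple aggregation. A minimal sketch over the records defined above:

```python
def cost_by_model(records: list[UsageRecord]) -> dict[str, float]:
    """Aggregate spend per model to see which tier drives cost."""
    totals: dict[str, float] = {}
    for r in records:
        totals[r.model] = totals.get(r.model, 0.0) + r.cost_usd
    return totals

# e.g. {"claude-haiku-4-5": 0.42, "claude-sonnet-4-5": 3.87}
```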
OpenTelemetry Instrumentation
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent.service")

def instrumented_agent(user_message: str, session_id: str) -> str:
    with tracer.start_as_current_span("agent.run") as span:
        span.set_attribute("session.id", session_id)
        span.set_attribute("input.length", len(user_message))
        span.set_attribute("input.preview", user_message[:100])
        try:
            result = run_agent(user_message)
            span.set_attribute("output.length", len(result))
            span.set_status(trace.StatusCode.OK)
            return result
        except Exception as e:
            span.record_exception(e)
            span.set_status(trace.StatusCode.ERROR, str(e))
            raise
```
Traces flow to Jaeger, Honeycomb, or Grafana Tempo, giving you waterfall views of every agent step with timing.
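The root span alone only tells you that a run was slow; to see where the time went, wrap each tool invocation in a child span. A minimal sketch, assuming the `tracer` configured above and a hypothetical `web_search` tool:

```python
import json

def traced_tool_call(tool_name: str, tool_fn, **tool_args):
    """Run one tool under a child span; the span's own timing becomes a bar in the waterfall."""
    with tracer.start_as_current_span(f"tool.{tool_name}") as span:
        span.set_attribute("tool.name", tool_name)
        # Truncate arguments so spans stay small (assumes JSON-serializable args)
        span.set_attribute("tool.args", json.dumps(tool_args)[:200])
        result = tool_fn(**tool_args)
        span.set_attribute("tool.result.length", len(str(result)))
        return result

# e.g., inside the agent loop:
# traced_tool_call("web_search", web_search, query="latest OTLP spec")
```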
Multi-Turn State Management
```python
import json
from pathlib import Path

class ConversationStore:
    def __init__(self, storage_dir: str = "./sessions"):
        self.dir = Path(storage_dir)
        self.dir.mkdir(exist_ok=True)

    def load(self, session_id: str) -> list[dict]:
        path = self.dir / f"{session_id}.json"
        if not path.exists():
            return []
        return json.loads(path.read_text())

    def save(self, session_id: str, messages: list[dict]):
        path = self.dir / f"{session_id}.json"
        path.write_text(json.dumps(messages, indent=2))

    def truncate(self, messages: list[dict], max_tokens: int = 80_000) -> list[dict]:
        """Keep the system message and the most recent turns within token budget."""
        # Simple approximation: 1 token ≈ 4 characters
        while sum(len(str(m)) // 4 for m in messages) > max_tokens and len(messages) > 2:
            messages.pop(1)  # remove oldest non-system message
        return messages

store = ConversationStore()

def multi_turn_agent(user_message: str, session_id: str) -> str:
    messages = store.load(session_id)
    messages.append({"role": "user", "content": user_message})
    messages = store.truncate(messages)
    result = run_agent_from_messages(messages)
    messages.append({"role": "assistant", "content": result})
    store.save(session_id, messages)
    return result
```
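Retries and Rate Limits

At 50,000 requests a day, rate limits and transient API failures are routine, not exceptional. Retry with exponential backoff plus jitter so that clustered failures don't retry in lockstep: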
```python
import random
import time

def with_retry(fn, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as e:
            # Retry only rate-limit errors; re-raise everything else,
            # and give up after the final attempt
            if "rate_limit" not in str(e).lower() or attempt == max_retries - 1:
                raise
            # Exponential backoff plus jitter to avoid synchronized retry storms
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
```
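The pieces compose: wrap the tracked call from the Cost Tracking section in the retry helper (a sketch; `tracked_llm_call` comes from the earlier example, and the message content is illustrative):

```python
response = with_retry(lambda: tracked_llm_call(
    model="claude-haiku-4-5",
    messages=[{"role": "user", "content": "Classify this support ticket: ..."}],
    max_tokens=1024,
))
```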