Observability
The practice of monitoring, tracing, and evaluating AI system behavior in production — including LLM calls, latency, costs, retrieval quality, and output correctness.
Why it matters
You cannot improve what you cannot measure. AI observability is how teams go from "it works in the demo" to "it works reliably in production at scale."
Why AI observability is different
Traditional software observability tracks request/response cycles, error rates, and latency. AI observability adds a new dimension: quality. A 200 OK response that hallucinates is worse than a 500 error you can detect and retry. You need to monitor what the model said, not just whether it responded.
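The point about transport success versus content quality can be sketched in code: a wrapper that records latency and HTTP-level success alongside a quality verdict on the output. The groundedness check below is a deliberately toy heuristic (proper nouns in the answer must appear in the retrieved context); real systems use LLM-as-judge evaluators or learned scorers, and the function names here are illustrative, not any library's API.

```python
import time

def grounded(answer: str, context: str) -> bool:
    """Toy groundedness check: every proper-noun-like word (capitalized)
    in the answer must also appear in the retrieved context.
    Real deployments use far stronger checks (e.g. LLM-as-judge)."""
    ctx_words = context.lower().split()
    proper = [w.strip(".,!?") for w in answer.split() if w[:1].isupper()]
    return all(w.lower() in ctx_words for w in proper)

def observe_call(call, prompt: str, context: str) -> dict:
    """Record latency, transport success, AND output quality for one call."""
    start = time.perf_counter()
    answer = call(prompt)  # the call can "succeed" at the HTTP level yet hallucinate
    return {
        "latency_s": time.perf_counter() - start,
        "http_ok": True,  # transport succeeded (assumed here for the sketch)...
        "grounded": grounded(answer, context),  # ...but is the content trustworthy?
        "answer": answer,
    }
```

With this shape, a hallucinated answer shows up as `http_ok: True, grounded: False` — exactly the failure mode that traditional error-rate dashboards miss.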
What to track
- Traces — full request lifecycle: prompt → retrieval → LLM call → output → guardrails.
- Cost — token usage and spend per request, per user, per feature.
- Latency — time to first token, total generation time, retrieval latency.
- Quality metrics — relevance scores, hallucination rates, user feedback signals.
Tools
Widely used platforms include Langfuse (open source), LangSmith, Braintrust, and Helicone. Most support OpenTelemetry-compatible tracing, so they can plug into existing observability pipelines.