You ship your LLM application. It works in staging. You deploy to production. Then one of three things happens: users complain about quality, costs spike unexpectedly, or nothing breaks at all for two weeks, and then something fails catastrophically and you have no idea why.
All three scenarios share a root cause: no observability.
Traditional software monitoring tells you whether your code is running. LLM observability tells you whether your AI is behaving correctly — which is a fundamentally different problem.
What Actually Breaks in Production
Before setting up monitoring, understand what you are monitoring for. LLM failures are different from service failures.
Quality drift. The model produces good outputs when you first deploy. Over time — without any code changes — output quality degrades. This happens because prompt edge cases accumulate, the model provider updates underlying weights, or user input patterns shift from what you tested. You will not catch this with uptime checks.
Hallucination spikes. Your application works 95% of the time. For 5% of inputs, the model confidently produces wrong information. If you are not sampling and evaluating outputs, you will find out from users, not dashboards.
Context window abuse. Agents that loop, conversations that grow unbounded, retrieval that pulls too much context — these all cause latency spikes and cost blowups that are invisible until your bill arrives.
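One way to keep unbounded conversations from silently blowing up context is a token-budget guard on the message history. A minimal sketch, using a rough 4-characters-per-token heuristic (in practice, substitute your tokenizer, e.g. tiktoken):

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token. Replace with a real
    # tokenizer for accurate budgeting.
    return max(1, len(text) // 4)

def trim_history(messages: list[dict], budget: int = 4000) -> list[dict]:
    """Keep the system prompt plus the most recent turns under the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(estimate_tokens(m["content"]) for m in system)
    kept = []
    for m in reversed(rest):  # walk newest turns first
        cost = estimate_tokens(m["content"])
        if used + cost > budget:
            break
        kept.append(m)
        used += cost
    return system + list(reversed(kept))
```

Logging how often this guard actually trims is itself a useful signal: frequent trimming means conversations are growing past what you designed for.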
Prompt injection. Users craft inputs that override your system prompt. You will not see this without logging full conversation turns.
Silent failures. The model produces an output that looks valid but fails downstream parsing. Your application swallows the error, the user sees a generic failure, and you have no trace of what the model actually returned.
The Metrics That Matter
Not everything needs monitoring. These are the metrics worth tracking.
Latency by percentile. P50, P95, P99. Not average. LLM latency is heavily skewed — your average can look fine while P99 is 30 seconds. Track token generation speed (tokens per second) separately from time to first token, since users perceive these differently.
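To see how the skew hides in an average, here is a small nearest-rank percentile sketch (libraries like NumPy offer `percentile` with interpolation options; this hand-rolled version just illustrates the point):

```python
def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile; unlike a mean, it exposes the tail."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p / 100 * (len(s) - 1))))
    return s[k]

# One slow tail request dominates the picture:
latencies_ms = [120, 130, 140, 150, 160, 900, 30000]
p50 = percentile(latencies_ms, 50)   # typical experience
p99 = percentile(latencies_ms, 99)   # worst-case experience
avg = sum(latencies_ms) / len(latencies_ms)
```

Here the average lands around 4.5 seconds, which neither matches the typical 150 ms experience nor the 30-second worst case.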
Cost per request and per user. Total spend is a lagging indicator. You want cost-per-request trending data so you can see when a new feature or prompt change is 3x more expensive before your bill arrives. Tag costs by feature, user tier, and endpoint.
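Per-request cost tagging can be a thin function over your provider's price table. A sketch with a hypothetical model name and made-up per-1K-token rates (substitute your provider's real pricing):

```python
# Hypothetical (input, output) prices per 1K tokens -- not real rates.
PRICES = {"small-model": (0.0005, 0.0015)}

def request_cost(model: str, in_tokens: int, out_tokens: int) -> float:
    p_in, p_out = PRICES[model]
    return in_tokens / 1000 * p_in + out_tokens / 1000 * p_out

def tagged_cost_record(model: str, in_tok: int, out_tok: int,
                       feature: str, user_tier: str, endpoint: str) -> dict:
    """Emit a cost record tagged by feature, tier, and endpoint so spend
    can be sliced later, not just totaled."""
    return {
        "feature": feature,
        "user_tier": user_tier,
        "endpoint": endpoint,
        "model": model,
        "cost_usd": round(request_cost(model, in_tok, out_tok), 6),
    }
```

Ship these records to whatever analytics store you already have; the tags are what turn a lagging total into an early-warning trend line.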
Token consumption distribution. How many tokens do your prompt templates actually use? How much context does your RAG retrieve on average? Visualizing these distributions reveals optimization opportunities and cost drivers.
Quality scores. Automated quality evaluation is not perfect, but it is much better than nothing. At minimum: response length distribution (very short or very long responses are often failures), tool call success rate for agent systems, and downstream task completion rate.
Error rates by type. Rate limit errors, context length errors, content filter blocks, and timeout rates should each be tracked separately. They have different causes and different fixes.
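Separating error types can be as simple as a classification map plus a counter. The exception names below (other than the built-in `TimeoutError`) are placeholders; map your SDK's actual error classes:

```python
from collections import Counter

# Placeholder exception names -- swap in your provider SDK's real classes.
ERROR_TYPES = {
    "RateLimitError": "rate_limit",
    "ContextLengthExceeded": "context_length",
    "ContentFilterError": "content_filter",
    "TimeoutError": "timeout",
}

error_counts: Counter = Counter()

def record_error(exc: Exception) -> str:
    """Bucket an exception by type and count it; 'other' catches anything
    you have not classified yet (watch that bucket -- it grows silently)."""
    kind = ERROR_TYPES.get(type(exc).__name__, "other")
    error_counts[kind] += 1
    return kind
```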
Tools Worth Using
LangSmith is the most capable observability tool for LangChain-based applications. It captures full traces — every LLM call, tool call, and chain step with inputs, outputs, latency, and token counts. The evaluation framework lets you run automated quality checks against a test set after every deployment. If you are on LangChain, use it.
The honest limitation: LangSmith's pricing scales with trace volume. High-traffic applications can get expensive. Use sampling in production rather than tracing every request.
Helicone sits as a proxy between your application and the OpenAI API. Zero code changes required — redirect your requests through Helicone's endpoint and you get cost tracking, latency monitoring, and prompt management out of the box. The lightweight integration makes it a good choice for teams that need observability quickly without refactoring.
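In practice the "redirect" is a base-URL swap plus an auth header on your OpenAI client. A sketch of the client configuration; verify the current endpoint and header names against Helicone's own documentation before relying on them:

```python
import os

# Helicone's documented OpenAI-compatible proxy endpoint at time of
# writing -- confirm against their docs.
HELICONE_BASE_URL = "https://oai.helicone.ai/v1"

def helicone_client_kwargs() -> dict:
    """Keyword arguments for openai.OpenAI(...) that route traffic
    through the Helicone proxy."""
    return {
        "base_url": HELICONE_BASE_URL,
        "api_key": os.environ.get("OPENAI_API_KEY", ""),
        "default_headers": {
            "Helicone-Auth": f"Bearer {os.environ.get('HELICONE_API_KEY', '')}",
        },
    }

# Usage: client = openai.OpenAI(**helicone_client_kwargs())
```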
Custom logging is often underrated. A well-designed logging schema capturing request ID, user ID, prompt hash, model parameters, input tokens, output tokens, latency, and a sample of responses will get you 80% of what you need at near-zero infrastructure cost. Start here if you are pre-traction.
Arize and other MLOps platforms make sense at enterprise scale when you need drift detection, A/B testing infrastructure, and team-level dashboards. Overkill for most early-stage applications.
Building Observability Before You Need It
The cost of adding observability after an incident is much higher than building it upfront. Here is the minimum viable observability stack for a production LLM application.
Structured Request Logging
Every LLM call should log:
- Request ID and user ID
- Timestamp and response time
- Model and parameters used
- Prompt template name and version
- Input token count and output token count
- Success or error type
- A sample of the actual request/response (sampling rate: 10-20%)
This gives you the data you need for debugging without storing everything.
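The schema above can be captured in a small dataclass that emits one JSON line per call. A minimal sketch; the field names and 15% sample rate are illustrative choices, not a standard:

```python
import json
import random
import time
from dataclasses import dataclass, asdict
from typing import Optional

RESPONSE_SAMPLE_RATE = 0.15  # within the 10-20% band suggested above

@dataclass
class LLMCallLog:
    request_id: str
    user_id: str
    model: str                # model name; add parameters as needed
    prompt_template: str      # name + version, e.g. "summarize_v3"
    input_tokens: int
    output_tokens: int
    latency_ms: float
    status: str               # "ok" or an error type
    response_sample: Optional[str] = None

def build_log(request_id: str, user_id: str, model: str, template: str,
              in_tok: int, out_tok: int, latency_ms: float, status: str,
              response_text: str, rng=random.random) -> str:
    """Build one JSON log line, sampling the response body so you keep
    evidence without storing every output."""
    sample = response_text[:2000] if rng() < RESPONSE_SAMPLE_RATE else None
    entry = LLMCallLog(request_id, user_id, model, template,
                       in_tok, out_tok, latency_ms, status, sample)
    return json.dumps({"ts": time.time(), **asdict(entry)})
```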
Cost Alerting
Set a daily cost threshold alert before you think you need it. The time between "this feature is getting expensive" and "this month's bill is three times budget" is shorter than you expect.
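The alert itself can be a one-liner over an accumulating daily total; the part that matters is wiring its return value to a pager or Slack webhook. A sketch, with the 2x multiplier as a configurable default:

```python
from collections import defaultdict

daily_spend: dict[str, float] = defaultdict(float)

def record_spend(day: str, cost: float, expected_daily: float,
                 multiplier: float = 2.0) -> bool:
    """Accumulate cost for the day; return True once spend crosses the
    alert threshold (a multiple of expected daily spend)."""
    daily_spend[day] += cost
    return daily_spend[day] >= expected_daily * multiplier
```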
Quality Sampling
Sample 2-5% of production responses for human review or automated evaluation. Even a simple binary rating (good/bad) on sampled outputs will catch quality problems weeks before users start complaining in volume.
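A hash-based sampler is a nice fit here because it is deterministic: retries of the same request land in the same bucket, and you can reproduce the sample set later. A sketch with a 3% rate (inside the 2-5% band):

```python
import hashlib

SAMPLE_RATE = 0.03  # within the 2-5% band

def should_sample(request_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministic sampling keyed on the request ID: hash to a uniform
    value in [0, 1) and compare against the rate."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```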
Agent Trace Logging
For agent systems specifically, log:
- Number of tool calls per request
- Which tools were called and in what order
- Whether the agent completed the task or hit max iterations
- Total tokens used across all steps
Agents that loop or call tools excessively are easy to catch in trace logs and impossible to catch otherwise.
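The four fields above fit in one small trace object per request. A sketch (class and field names are illustrative, not from any particular framework):

```python
from dataclasses import dataclass, field

@dataclass
class AgentTrace:
    request_id: str
    tool_calls: list = field(default_factory=list)  # (tool_name, tokens)
    completed: bool = False
    max_iterations_hit: bool = False

    def record_tool_call(self, tool: str, tokens: int) -> None:
        self.tool_calls.append((tool, tokens))

    def summary(self) -> dict:
        """One log record per agent run: counts, order, outcome, tokens."""
        return {
            "request_id": self.request_id,
            "num_tool_calls": len(self.tool_calls),
            "tool_order": [name for name, _ in self.tool_calls],
            "completed": self.completed,
            "max_iterations_hit": self.max_iterations_hit,
            "total_tokens": sum(t for _, t in self.tool_calls),
        }
```

An alert on `num_tool_calls` above a sane ceiling, or on `max_iterations_hit`, catches looping agents the day they start looping.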
A Practical Setup in Under an Hour
For a new LLM application, here is what to set up before launch:
- Add Helicone as a proxy (15 minutes, no code changes)
- Add structured logging to your LLM call wrapper (30 minutes)
- Create a cost alert at 2x your expected daily spend (5 minutes)
- Set up a weekly sampling review in your task tracker (5 minutes)
This is not the complete observability picture. It is the minimum you need to detect real problems in production.
The Phase 8 curriculum at MindloomHQ's Agentic AI course covers Production AI in depth — observability, evaluation frameworks, deployment patterns, and cost optimization for LLM systems at scale.
LLM observability is not optional for production applications. The question is whether you build it before or after your first incident.