Production LLMOps
Production LLM systems require different operational practices than traditional software. Models can degrade silently, cost can spike unpredictably, and prompt changes can have cascading effects.
Deployment Patterns
Pattern 1: Simple API Proxy
The simplest deployment. A thin wrapper around an LLM API with basic error handling.
    User → [Proxy] → Claude/GPT/Gemini API
              │
              ├── Retry on 5xx
              ├── Circuit breaker on rate limits
              └── Basic logging

Best for: Prototypes, internal tools, low-volume applications.
Gotchas:
- No fallback if the API goes down
- No cost tracking per user
- No prompt versioning
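The circuit breaker in Pattern 1 can be quite small. The following is a minimal sketch, not a production implementation: the class name, the thresholds, and the injectable `clock` parameter are all illustrative assumptions. It opens after a run of consecutive failures and lets a trial request through once a cool-down has passed.

```python
import time


class CircuitBreaker:
    """Open after `max_failures` consecutive failures; allow a trial call after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock  # injectable so the breaker can be tested without sleeping
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        """Return True if a call may be attempted right now."""
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial request through (half-open state).
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        """Close the circuit and reset the failure count."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        """Count a failure; open (or re-open) the circuit at the threshold."""
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()
```

In a proxy, `record_failure()` would typically be called on rate-limit responses (HTTP 429) and `allow_request()` checked before each upstream call, returning a fast error to the user while the circuit is open.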
Pattern 2: Multi-Model with Fallback
Route requests to a primary model, fall back to a secondary if it fails.
    User → [Router] → Primary (Claude)
              │        Fallback (GPT-5.5)
              │        Emergency (DeepSeek V4)
              │
              ├── Health checks every 30s
              ├── Timeout: 30s per model
              └── Retry: 2x on 5xx before fallback

Implementation:
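    import logging

    from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

    log = logging.getLogger(__name__)

    # Dicts preserve insertion order, so tiers are tried top to bottom.
    MODELS = {
        "primary": {"provider": "anthropic", "model": "claude-sonnet-4-20260510"},
        "fallback": {"provider": "openai", "model": "gpt-5.5"},
        "emergency": {"provider": "deepseek", "model": "deepseek-v4"},
    }

    class AllModelsExhausted(Exception):
        """Raised when every tier has failed."""

    # call_model and APIError are assumed to come from your own provider client layer.
    # Retries wrap a single model call, so each tier is retried before falling back.
    @retry(
        retry=retry_if_exception_type((APIError, TimeoutError)),
        stop=stop_after_attempt(3),  # 1 initial call + 2 retries
        wait=wait_exponential(multiplier=1, min=2, max=10),
        reraise=True,  # surface the original error instead of tenacity's RetryError
    )
    async def call_with_retry(config, prompt, context):
        return await call_model(config, prompt, context)

    async def generate_with_fallback(prompt, context):
        for tier, config in MODELS.items():
            try:
                return await call_with_retry(config, prompt, context)
            except (APIError, TimeoutError) as e:
                log.warning(f"{tier} failed: {e}, falling back")
        raise AllModelsExhausted("All models failed")

Pattern 3: Canary Deployments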
Gradually roll out new prompts or models to a subset of users.
    Day 1:   5% of users → new prompt version
    Day 2:  20% of users → new prompt version
    Day 3:  50% of users → new prompt version
    Day 4: 100% of users → new prompt version

Rollback criteria: If any metric degrades by 5%+ (accuracy, latency, user satisfaction), roll back immediately.
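The rollback criterion can be checked mechanically. The sketch below is one possible reading of it, under two assumptions that are not stated in the text: "degrades by 5%" is treated as a relative change against the baseline, and the metric names in `METRIC_DIRECTION` are hypothetical. The direction matters because a higher latency is worse while a higher accuracy is better.

```python
# Hypothetical metric names; "higher" means larger values are better.
METRIC_DIRECTION = {
    "accuracy": "higher",
    "user_satisfaction": "higher",
    "latency_p95_s": "lower",
}


def degraded_metrics(baseline, canary, threshold=0.05):
    """Return metrics where the canary is worse than baseline by >= threshold (relative)."""
    degraded = []
    for name, direction in METRIC_DIRECTION.items():
        base, new = baseline[name], canary[name]
        if direction == "higher":
            change = (base - new) / base  # relative drop
        else:
            change = (new - base) / base  # relative increase
        if change >= threshold:
            degraded.append(name)
    return degraded


def should_roll_back(baseline, canary, threshold=0.05):
    """True if at least one metric degraded past the threshold."""
    return bool(degraded_metrics(baseline, canary, threshold))
```

Running this check on a schedule during each canary stage gives an objective trigger for rollback rather than relying on someone watching a dashboard.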
Implementation with feature flags:
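    import hashlib

    CANARY_PERCENT = 5  # share of users on the new version; normally read from config

    prompt_versions = {
        "v1": {"prompt": "Answer concisely: {question}"},
        "v2": {"prompt": "You are a helpful assistant. Answer: {question}"},
    }

    def get_prompt_version(user_id):
        # Python's built-in hash() is salted per process for strings, so use a
        # stable digest to keep each user in the same bucket across restarts.
        digest = hashlib.sha256(str(user_id).encode()).hexdigest()
        bucket = int(digest, 16) % 100
        if bucket < CANARY_PERCENT:
            return prompt_versions["v2"]
        return prompt_versions["v1"]

Pattern 4: Multi-Region / Multi-Provider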
For high-availability systems, spread across regions and providers.
    US-East: [Claude API] [GPT API]
    US-West: [Claude API] [GPT API]
    EU:      [Claude API] [GPT API]

    Health checks → Route to healthiest region
    Cross-region failover in <5s

Monitoring
LLM monitoring is different from traditional monitoring. You need to track not just system metrics (latency, errors) but also semantic metrics (quality, safety, cost).
What to Monitor
| Category | Metrics | Why It Matters |
|---|---|---|
| Latency | P50/P95/P99 time to first token, total response time | User experience |
| Errors | API errors, rate limits, timeouts, model-side errors | Reliability |
| Cost | Cost per query, per user, per model, per feature | Budget control |
| Quality | Accuracy (on labeled data), user ratings, thumbs up/down | Model degradation detection |
| Safety | Toxicity scores, PII leakage, jailbreak attempts | Compliance, trust |
| Drift | Output length distribution, token usage patterns | Early warning for model changes |
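As an illustration of the latency row, percentiles can be computed directly from recorded response times. This is a minimal sketch using the nearest-rank definition; the function names are illustrative, and a metrics backend would normally compute these for you.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_s):
    """Summarize a list of latencies (seconds) as P50/P95/P99."""
    return {f"p{p}": percentile(latencies_s, p) for p in (50, 95, 99)}
```

Tail percentiles (P95, P99) matter more than the mean for LLM calls, because a few very slow generations can dominate what users experience.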
Monitoring Tools (May 2026)
| Tool | Best For | Key Feature |
|---|---|---|
| LangSmith | LLM tracing, prompt debugging, evaluation | Full trace visualization, dataset management |
| Weights & Biases | Experiment tracking, model registry | Prompts as experiments, cost tracking |
| Arize AI | ML observability, drift detection | Embedding drift monitoring, LLM-specific metrics |
| Helicone | Simple proxy-based monitoring | Easy setup (proxy layer), cost breakdowns |
| Logfire | OpenTelemetry-native, structured logging | Distributed tracing for LLM pipelines |
| Datadog / Grafana | Infrastructure monitoring | Custom dashboards, alerting |
Setting Up Alerts
Don’t alert on everything. Alert on things that matter:
    alerts:
      - name: high_error_rate
        condition: error_rate > 5% over 5 minutes
        severity: critical
        action: page on-call, auto-fallback to secondary model

      - name: latency_spike
        condition: p95_latency > 10s over 5 minutes
        severity: warning
        action: investigate provider, check rate limits

      - name: cost_anomaly
        condition: daily_cost > 2x rolling_7day_average
        severity: warning
        action: notify team, check for abuse

      - name: quality_drop
        condition: user_thumbs_down > 10% over 1 hour
        severity: critical
        action: roll back last prompt change, notify team

Trace Visualization
Every LLM call should be traceable end-to-end:
    User request ─→ Auth check ─→ Prompt assembly ─→ Model call ─→ Output validation ─→ Response
    (trace_id: abc123)
         ├── latency: 2.3s
         ├── tokens: 450 input / 120 output
         ├── model: claude-sonnet-4-20260510
         ├── cost: $0.003
         └── version: prompt_v3 / model_v2

Prompt Versioning & CI/CD
Prompts are code. They should be versioned, reviewed, and deployed through a pipeline just like application code.
Prompt as Code
Store prompts in a structured format, not scattered across code:
    name: classify_sentiment
    version: 3
    model: claude-sonnet-4-20260510
    system_prompt: |
      You are a sentiment classifier. Analyze the text and return:
      - positive
      - negative
      - neutral
    temperature: 0.1
    max_tokens: 10
    output_schema:
      type: string
      enum: [positive, negative, neutral]
    tags:
      - production
      - classification

CI/CD Pipeline for Prompts
    Developer edits prompt.yaml
        ↓
    PR created
        ↓
    Automated tests run:
      - Unit tests (formatting, length checks)
      - Eval tests (accuracy on labeled test set)
      - Regression tests (known edge cases)
        ↓
    Review + approve
        ↓
    Merge to main
        ↓
    Deploy to staging (5% canary)
        ↓
    Monitor for 1 hour
        ↓
    Deploy to production (100%)

Automated Prompt Testing
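The unit-test stage of the pipeline (formatting and length checks) needs no model call at all. A minimal sketch, assuming the prompt file has already been parsed into a dict; `REQUIRED_FIELDS` mirrors the fields of the example prompt file, while `MAX_PROMPT_CHARS` is a hypothetical budget chosen for illustration.

```python
REQUIRED_FIELDS = {"name", "version", "model", "system_prompt", "temperature", "max_tokens"}
MAX_PROMPT_CHARS = 4000  # hypothetical length budget; tune for your model and use case


def validate_prompt_config(config):
    """Return a list of human-readable problems; an empty list means the config passes."""
    problems = []

    missing = REQUIRED_FIELDS - set(config)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")

    prompt = config.get("system_prompt")
    if prompt is not None:
        if not prompt.strip():
            problems.append("system_prompt is empty")
        elif len(prompt) > MAX_PROMPT_CHARS:
            problems.append(f"system_prompt exceeds {MAX_PROMPT_CHARS} characters")

    max_tokens = config.get("max_tokens")
    if max_tokens is not None and (not isinstance(max_tokens, int) or max_tokens <= 0):
        problems.append("max_tokens must be a positive integer")

    return problems
```

Cheap structural checks like these fail fast in CI, before any tokens are spent on the eval stage.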
    # call_model and prompt_version are assumed to be provided by the test harness.
    def test_prompt_version():
        """Validate a prompt version before deployment."""
        test_cases = [
            {"input": "I love this!", "expected": "positive"},
            {"input": "This is terrible.", "expected": "negative"},
            {"input": "It's okay I guess.", "expected": "neutral"},
        ]

        correct = 0
        for case in test_cases:
            response = call_model(prompt_version, case["input"])
            if response.strip().lower() == case["expected"]:
                correct += 1

        accuracy_pct = correct / len(test_cases) * 100
        assert accuracy_pct >= 80, f"Prompt accuracy {accuracy_pct:.1f}% < 80% threshold"

A/B Testing
Run controlled experiments to validate changes.
Setup
    import hashlib

    EXPERIMENTS = {
        "prompt_v3_vs_v4": {
            "control": "prompt_v3",
            "treatment": "prompt_v4",
            "split": 50,  # percent of users in treatment (50/50 split)
            "metrics": ["accuracy", "latency", "user_satisfaction"],
            "min_sample": 1000,
            "min_duration": "1h",
        }
    }

    def assign_experiment(user_id, experiment_name):
        experiment = EXPERIMENTS[experiment_name]
        # Use a stable digest: built-in hash() is salted per process for strings,
        # which would reshuffle users between restarts.
        key = f"{user_id}:{experiment_name}".encode()
        bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
        return "treatment" if bucket < experiment["split"] else "control"

What to Measure
| Metric | How to Measure | Minimum Detectable Effect |
|---|---|---|
| Accuracy | Labeled test set, human eval, LLM-as-judge | 2% |
| Latency | P50/P95 time to first token | 100ms |
| Cost | Tokens + model tier per query | 5% |
| User satisfaction | Thumbs up/down, follow-up questions | 5% |
| Safety | Toxicity scores, PII detection rate | 1% |
Statistical significance: Use a chi-squared test or Bayesian A/B test. Don’t declare a winner until p < 0.05 with at least 1,000 samples per variant.
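For a binary metric such as thumbs up/down, the chi-squared test reduces to a 2x2 table and can be computed without extra dependencies. The sketch below is a plain Pearson chi-squared test without continuity correction; the function name and argument layout are illustrative. It relies on the fact that, for one degree of freedom, the p-value equals `erfc(sqrt(statistic / 2))`.

```python
import math


def chi_squared_2x2(successes_a, total_a, successes_b, total_b):
    """Pearson chi-squared test on a 2x2 table (no continuity correction).

    Returns (statistic, p_value) for the null hypothesis that both
    variants have the same success rate.
    """
    observed = [
        [successes_a, total_a - successes_a],
        [successes_b, total_b - successes_b],
    ]
    row_totals = [total_a, total_b]
    col_totals = [observed[0][0] + observed[1][0], observed[0][1] + observed[1][1]]
    grand_total = total_a + total_b

    # If every sample succeeded (or every sample failed), there is nothing to compare.
    if 0 in col_totals:
        return 0.0, 1.0

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed[i][j] - expected) ** 2 / expected

    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value
```

For example, 700 positive ratings out of 1,000 in control against 750 out of 1,000 in treatment gives a statistic of about 6.27, which is below the 0.05 significance level. In practice a statistics library is the safer choice; the point here is only to show what the test computes.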
Cost Tracking
LLM costs are variable and can spike. Track aggressively.
Per-Query Cost Breakdown
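    from datetime import datetime, timezone

    def track_cost(user_id, feature, model, input_tokens, output_tokens):
        # Prices are expressed per million tokens.
        cost = (
            input_tokens * model.input_price / 1_000_000
            + output_tokens * model.output_price / 1_000_000
        )

        # log_to_database is assumed to be your own persistence helper.
        log_to_database({
            "timestamp": datetime.now(timezone.utc),  # timezone-aware, avoids local-time ambiguity
            "user_id": user_id,
            "feature": feature,
            "model": model.name,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cost": cost,
        })

        return cost

Cost Dashboard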
Track cost by multiple dimensions:
| Dimension | Why |
|---|---|
| Per user | Identify heavy users, potential abuse |
| Per feature | Which features are most expensive? |
| Per model | Are you using expensive models for cheap tasks? |
| Per hour/day | Identify cost spikes |
| Per experiment | Track experiment costs separately |
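Each of these dimensions is a group-by over the per-query cost records. A minimal in-memory sketch, assuming records shaped like the dicts logged per query (with a `cost` field plus the dimension fields); a real dashboard would run the equivalent query in the database.

```python
from collections import defaultdict


def cost_by(records, dimension):
    """Sum the "cost" field of cost-log records, grouped by one dimension."""
    totals = defaultdict(float)
    for record in records:
        totals[record[dimension]] += record["cost"]
    # Most expensive first, so outliers are easy to spot.
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))
```

Calling `cost_by(records, "user_id")` surfaces heavy users, while `cost_by(records, "feature")` or `cost_by(records, "model")` answers the other questions in the table from the same data.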
Cost Optimization Playbook
- Add routing: Route simple queries to cheaper models
- Enable prompt caching: 50-90% savings on repeated prefixes
- Reduce output tokens: Shorter responses cost less
- Batch where possible: Reduce per-request overhead
- Negotiate volume pricing: API discounts at 100M+ tokens/month
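The routing item is the easiest to prototype. The sketch below is a deliberately crude heuristic, not a recommended classifier: the model names are placeholders, and the word limit and hint phrases are hypothetical values chosen for illustration. Production routers often use a small classifier model instead.

```python
# Placeholder tier names, not real model IDs.
CHEAP_MODEL = "small-model"
EXPENSIVE_MODEL = "large-model"

# Hypothetical phrases that suggest a query needs deeper reasoning.
COMPLEX_HINTS = ("explain why", "step by step", "compare", "analyze", "write code")


def choose_model(query, max_simple_words=30):
    """Send short queries without complexity hints to the cheap tier."""
    text = query.lower()
    is_long = len(text.split()) > max_simple_words
    looks_complex = any(hint in text for hint in COMPLEX_HINTS)
    return EXPENSIVE_MODEL if is_long or looks_complex else CHEAP_MODEL
```

Whatever the routing rule, log the chosen tier with each request so that quality can be compared per tier and misrouted queries can be found later.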
Caching Strategies
LLM outputs can be cached to reduce cost and latency.
Exact Match Cache
The simplest cache. If the same prompt was asked before, return the cached response.
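    import hashlib

    cache = RedisCache(ttl=3600)  # 1 hour; RedisCache is assumed to be your own wrapper

    def get_response(prompt, model):
        # Built-in hash() differs between processes for strings, so derive the
        # key from a stable digest that every worker computes identically.
        cache_key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
        cached = cache.get(cache_key)
        if cached is not None:
            return cached

        # Pass the model through so the cached response matches the cache key.
        response = call_model(prompt, model)
        cache.set(cache_key, response)
        return response

Hit rate: 5-15% for most applications
Best for: FAQ, documentation, product information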
Semantic Cache
Cache based on meaning, not exact text. If a user asks “What’s your return policy?” and another asks “How do I return something?”, return the same answer.
    def semantic_cache_lookup(query, threshold=0.95):
        # embed and cache.search are assumed to wrap your embedding model and vector store.
        query_embedding = embed(query)
        cached_entries = cache.search(query_embedding, top_k=1)

        if cached_entries and cached_entries[0].similarity > threshold:
            return cached_entries[0].response

        return None

Implementation: Use the same embedding model as your vector database. Store (embedding, response, prompt) triples. On each lookup, embed the query, find the closest match, and return it if its similarity exceeds the threshold.

Hit rate: 20-40% (much higher than exact match)
Best for: Customer support, conversational AI
Tradeoff: Higher latency than exact match (requires embedding computation + vector search)
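To make the mechanics concrete, here is a self-contained sketch of a semantic cache that stores (embedding, response, prompt) triples and compares them by cosine similarity. Everything in it is illustrative: the class is a linear scan rather than a vector index, and `toy_embed` is a keyword-counting stand-in for a real embedding model, included only so the example runs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemorySemanticCache:
    """Linear-scan semantic cache; a real deployment would use a vector index."""

    def __init__(self, embed, threshold=0.95):
        self.embed = embed  # callable: str -> list[float], supplied by the caller
        self.threshold = threshold
        self.entries = []  # (embedding, response, prompt) triples

    def store(self, prompt, response):
        self.entries.append((self.embed(prompt), response, prompt))

    def lookup(self, query):
        """Return the cached response closest to the query, or None on a miss."""
        if not self.entries:
            return None
        query_embedding = self.embed(query)
        best = max(self.entries, key=lambda e: cosine_similarity(query_embedding, e[0]))
        if cosine_similarity(query_embedding, best[0]) > self.threshold:
            return best[1]
        return None


def toy_embed(text):
    """Toy stand-in for an embedding model: counts three hard-coded keywords."""
    words = text.lower().split()
    return [float(words.count(w)) for w in ("return", "policy", "shipping")]
```

The threshold is the main tuning knob: set too low, users receive answers to questions they did not ask; set too high, the hit rate falls back toward that of an exact-match cache.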
Cache Strategy Decision
| Type | Latency Savings | Cost Savings | Complexity |
|---|---|---|---|
| None | 0x | 0% | None |
| Exact match (Redis) | 100x for hits | 5-15% | Low |
| Prompt caching (API) | 2x for cached prefixes | 50-90% on cached | Built-in |
| Semantic (Vector DB) | 10x for hits | 20-40% | Medium |
Rate Limiting & Throttling
Protect your system from abuse and cost spikes.
Per-User Rate Limits
    # rpm = requests per minute, tpm = tokens per minute, cost_limit = spend cap per day
    RATE_LIMITS = {
        "free_tier": {"rpm": 10, "tpm": 10000, "cost_limit": 0.10},
        "pro_tier": {"rpm": 100, "tpm": 100000, "cost_limit": 1.00},
        "enterprise": {"rpm": 1000, "tpm": 1000000, "cost_limit": None},
    }

    # get_usage, RateLimitExceeded and CostLimitExceeded are assumed to be defined elsewhere.
    def check_rate_limit(user):
        limits = RATE_LIMITS[user.tier]
        usage = get_usage(user.id, window="1m")

        # limits is a plain dict, so use key access rather than attribute access.
        if usage.requests >= limits["rpm"]:
            raise RateLimitExceeded("Too many requests per minute")
        if usage.tokens >= limits["tpm"]:
            raise RateLimitExceeded("Too many tokens per minute")
        if limits["cost_limit"] is not None and usage.daily_cost >= limits["cost_limit"]:
            raise CostLimitExceeded("Daily cost limit reached")

Queue Management
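When traffic spikes exceed capacity, queue requests instead of dropping them, and reject new requests only once the queue itself is full:

    async def process_with_queue(request, queue):
        # Shed load when the backlog is too deep, and tell the client when to retry.
        if queue.length() > 100:
            return {"error": "Too many requests", "retry_after": 10}

        # queue is assumed to be an async job queue whose enqueue() resolves with the response.
        response = await queue.enqueue(request, timeout=30)
        return response

Guardrails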
Prevent the model from producing harmful or undesirable outputs.
Input Guardrails
Check user input before it reaches the model:
    def check_input(user_input):
        # Each check_* helper is assumed to return True when the input passes.
        checks = [
            check_prompt_injection(user_input),
            check_pii_in_input(user_input),
            check_toxicity(user_input),
            check_max_length(user_input),
        ]
        return all(checks)

Output Guardrails
Check model output before showing it to the user:
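    def check_output(model_output, question, context_chunks):
        # Each check_* helper is assumed to return True when the output passes.
        # check_hallucination needs the question and retrieved context to verify claims.
        checks = [
            check_pii_in_output(model_output),
            check_hallucination(question, model_output, context_chunks),
            check_format(model_output),
            check_topic_boundary(model_output),
        ]
        return all(checks)

Hallucination Detection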
One of the hardest problems in production LLM systems. Approaches:
- Self-consistency check: Ask the model the same question twice. If answers diverge, likely hallucination.
- Factual grounding check: Extract claims from the output and verify against retrieved context.
- Perplexity-based: If the model’s own confidence is low (high perplexity), the answer is suspect.
- LLM-as-judge: Use a second LLM call to verify the first one’s output.
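The self-consistency check from the list above can be sketched in a few lines. This is an illustrative version under stated assumptions: `generate` is any callable you supply that wraps your model client, answers are compared with character-level similarity from `difflib` (a crude proxy for semantic agreement), and the 0.8 threshold is arbitrary. The check is only meaningful when sampling is non-deterministic (temperature above zero).

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def is_self_consistent(generate, question, samples=2, min_similarity=0.8):
    """Ask the same question `samples` times and compare every pair of answers.

    `generate` is any callable str -> str (assumed; wrap your model client in it).
    Returns False if any pair of answers is less similar than `min_similarity`.
    """
    answers = [normalize(generate(question)) for _ in range(samples)]
    for i in range(len(answers)):
        for j in range(i + 1, len(answers)):
            ratio = SequenceMatcher(None, answers[i], answers[j]).ratio()
            if ratio < min_similarity:
                return False
    return True
```

String similarity works best for short factual answers; for longer outputs, comparing embeddings or asking a judge model whether the answers agree is likely to be more reliable.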
    def check_hallucination(question, answer, context_chunks):
        # Verify answer claims against context; judge_llm is assumed to be a second model client.
        verification_prompt = f"""
        Question: {question}
        Answer: {answer}
        Context: {context_chunks}

        Does the answer contain any claims NOT supported by the context?
        Respond with ONLY: supported or unsupported
        """
        result = judge_llm.generate(verification_prompt)
        # Compare case-insensitively so "Supported" is not treated as a failure.
        return result.strip().lower() == "supported"

Incident Response
When things go wrong in production, respond systematically.
Common Incidents
| Incident | Symptoms | Response |
|---|---|---|
| Model degradation | Higher error rate, worse quality, more user complaints | Roll back last prompt/model change. Check provider status. |
| Cost spike | 3x+ normal daily cost | Check for abuse, runaway agents, model misrouting. |
| Latency spike | P95 > 10s | Check provider status, rate limits, network issues. |
| Hallucination wave | Multiple users report wrong answers | Check if context retrieval is broken (RAG), check for prompt injection. |
| Provider outage | 100% errors from one provider | Fail over to secondary provider. |
Incident Response Playbook
    1. DETECT    Alert triggers (error rate > 5%, latency > 10s)
        ↓
    2. TRIAGE    Is it us or the provider?
                 If provider: fail over
                 If us: check last deployment
        ↓
    3. MITIGATE  Roll back last change
                 Enable fallback model
                 Rate limit aggressive users
        ↓
    4. RESOLVE   Confirm metrics returned to baseline
        ↓
    5. LEARN     Post-mortem: what happened, why, how to prevent
                 Add monitoring to catch earlier next time

Production Checklist
- Multi-model fallback configured (primary + secondary + emergency)
- Canary deployment process for prompts and models
- Monitoring dashboards for latency, errors, cost, quality
- Alerts configured for error rate, latency, cost anomalies
- Prompts versioned and stored as code (YAML/JSON)
- CI/CD pipeline for prompt changes with automated eval tests
- A/B testing framework in place
- Cost tracking per user, per feature, per model
- Caching strategy implemented (exact + semantic)
- Rate limiting per user tier
- Input/output guardrails deployed
- Hallucination detection in place for critical paths
- Incident response playbook documented
- Regular load testing (quarterly)
Key Takeaways
- LLMOps is different from traditional ops: models degrade silently, costs are variable, quality is subjective
- Always have a fallback: no provider is 100% reliable
- Version everything: prompts, models, configurations
- Monitor semantically: latency and errors aren’t enough; track quality and safety too
- Cache aggressively: exact match for simple Q&A, semantic cache for conversational
- Route intelligently: simple queries to cheap models, complex to expensive
- Guard inputs and outputs: LLMs can be attacked through both
- Incident response is a playbook: don’t make it up on the spot
See Also:
- Inference Optimization - Making models faster and cheaper
- Evaluation & Testing - Testing strategies for LLMs
- Agents & Frameworks - Building and monitoring AI agents
- RAG Architecture - Production RAG patterns