
Production LLMOps

Deploying, monitoring, and operating LLM applications in production - CI/CD, caching, guardrails, and incident response
Key Takeaways
  • Always deploy multi-model fallback — primary, secondary, and emergency models prevent outages
  • Monitor quality and safety metrics alongside latency and errors — models degrade silently
  • Cache aggressively: exact match (Redis) for FAQ, semantic cache (vector DB) for conversational patterns
  • Use canary deployments for prompt changes — roll back within minutes if metrics degrade

Production LLM systems require different operational practices than traditional software. Models can degrade silently, costs can spike unpredictably, and prompt changes can have cascading effects.


Deployment Patterns

Pattern 1: Simple API Proxy

The simplest deployment. A thin wrapper around an LLM API with basic error handling.

User → [Proxy] → Claude/GPT/Gemini API
         ├── Retry on 5xx
         ├── Circuit breaker on rate limits
         └── Basic logging

Best for: Prototypes, internal tools, low-volume applications.

Gotchas:

  • No fallback if the API goes down
  • No cost tracking per user
  • No prompt versioning
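
The circuit breaker mentioned in the proxy diagram can be sketched in a few lines. This is a minimal illustration, not a production implementation: the class name, the `max_failures` and `reset_after` parameters, and the injectable `clock` are all assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then blocks calls
    until `reset_after` seconds have passed (hypothetical helper)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Half-open: allow a trial request once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()
```

The proxy would call `allow_request()` before each upstream call and report the outcome with `record_success()` or `record_failure()`, for example treating an HTTP 429 response as a failure.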

Pattern 2: Multi-Model with Fallback

Route requests to a primary model, fall back to a secondary if it fails.

User → [Router] → Primary (Claude)
          │       Fallback (GPT-5.5)
          │       Emergency (DeepSeek V4)
          ├── Health checks every 30s
          ├── Timeout: 30s per model
          └── Retry: 2x on 5xx before fallback

Implementation:

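import asyncio
import logging

from tenacity import retry, stop_after_attempt, wait_exponential

log = logging.getLogger(__name__)

MODELS = {  # tried in insertion order
    "primary": {"provider": "anthropic", "model": "claude-sonnet-4-20260510"},
    "fallback": {"provider": "openai", "model": "gpt-5.5"},
    "emergency": {"provider": "deepseek", "model": "deepseek-v4"},
}

class APIError(Exception):
    """Stand-in for your provider SDK's error type."""

class AllModelsExhausted(Exception):
    """Raised when every tier has failed."""

# Retry each model (1 attempt + 2 retries) before falling back to the next tier.
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10), reraise=True)
async def call_with_retry(config, prompt, context):
    # call_model is your provider client wrapper; enforce the 30s per-model timeout.
    return await asyncio.wait_for(call_model(config, prompt, context), timeout=30)

async def generate_with_fallback(prompt, context):
    for tier, config in MODELS.items():
        try:
            return await call_with_retry(config, prompt, context)
        except (APIError, asyncio.TimeoutError) as e:
            log.warning("%s failed: %s, falling back", tier, e)
    raise AllModelsExhausted("All models failed")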

Pattern 3: Canary Deployments

Gradually roll out new prompts or models to a subset of users.

Day 1: 5% of users → new prompt version
Day 2: 20% of users → new prompt version
Day 3: 50% of users → new prompt version
Day 4: 100% of users → new prompt version

Rollback criteria: If any metric degrades by 5%+ (accuracy, latency, user satisfaction), roll back immediately.

Implementation with feature flags:

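import hashlib

CANARY_PERCENT = 5  # share of users routed to the new prompt

prompt_versions = {
    "v1": {"prompt": "Answer concisely: {question}"},
    "v2": {"prompt": "You are a helpful assistant. Answer: {question}"},
}

def get_prompt_version(user_id):
    # Python's built-in hash() is salted per process for strings, so use a
    # stable digest to keep each user in the same bucket across restarts.
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < CANARY_PERCENT:
        return prompt_versions["v2"]
    return prompt_versions["v1"]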

Pattern 4: Multi-Region / Multi-Provider

For high-availability systems, spread across regions and providers.

US-East: [Claude API] [GPT API]
US-West: [Claude API] [GPT API]
EU: [Claude API] [GPT API]
Health checks → Route to healthiest region
Cross-region failover in <5s
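
As a rough illustration of "route to healthiest region", the sketch below picks the healthy region with the lowest P95 latency. The shape of the `health` mapping and the function name are assumptions for the example; a real router would populate this data from its health checks.

```python
def pick_region(health):
    """Pick the healthy region with the lowest P95 latency.

    `health` maps region name -> {"healthy": bool, "p95_latency_s": float}
    (an assumed structure, filled in by periodic health checks).
    """
    candidates = [name for name, status in health.items() if status["healthy"]]
    if not candidates:
        raise RuntimeError("No healthy region available")
    return min(candidates, key=lambda name: health[name]["p95_latency_s"])
```

For example, with `us-east` unhealthy and `eu` faster than `us-west`, `pick_region` returns `"eu"`.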

Monitoring

LLM monitoring is different from traditional monitoring. You need to track not just system metrics (latency, errors) but also semantic metrics (quality, safety, cost).

What to Monitor

| Category | Metrics | Why It Matters |
|---|---|---|
| Latency | P50/P95/P99 time to first token, total response time | User experience |
| Errors | API errors, rate limits, timeouts, model-side errors | Reliability |
| Cost | Cost per query, per user, per model, per feature | Budget control |
| Quality | Accuracy (on labeled data), user ratings, thumbs up/down | Model degradation detection |
| Safety | Toxicity scores, PII leakage, jailbreak attempts | Compliance, trust |
| Drift | Output length distribution, token usage patterns | Early warning for model changes |
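
The latency percentiles in the table can be computed from raw samples with the standard library. This is a small sketch; the function name and the choice of the "inclusive" interpolation method are assumptions, and most monitoring tools compute these for you.

```python
import statistics


def latency_percentiles(latencies_s):
    """Return P50/P95/P99 for a list of latencies in seconds (needs >= 2 samples)."""
    # quantiles(n=100) returns the 99 cut points P1..P99.
    cuts = statistics.quantiles(latencies_s, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```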

Monitoring Tools (May 2026)

| Tool | Best For | Key Feature |
|---|---|---|
| LangSmith | LLM tracing, prompt debugging, evaluation | Full trace visualization, dataset management |
| Weights & Biases | Experiment tracking, model registry | Prompts as experiments, cost tracking |
| Arize AI | ML observability, drift detection | Embedding drift monitoring, LLM-specific metrics |
| Helicone | Simple proxy-based monitoring | Easy setup (proxy layer), cost breakdowns |
| Logfire | OpenTelemetry-native, structured logging | Distributed tracing for LLM pipelines |
| Datadog / Grafana | Infrastructure monitoring | Custom dashboards, alerting |

Setting Up Alerts

Don’t alert on everything. Alert on things that matter:

alerts:
  - name: high_error_rate
    condition: error_rate > 5% over 5 minutes
    severity: critical
    action: page on-call, auto-fallback to secondary model
  - name: latency_spike
    condition: p95_latency > 10s over 5 minutes
    severity: warning
    action: investigate provider, check rate limits
  - name: cost_anomaly
    condition: daily_cost > 2x rolling_7day_average
    severity: warning
    action: notify team, check for abuse
  - name: quality_drop
    condition: user_thumbs_down > 10% over 1 hour
    severity: critical
    action: roll back last prompt change, notify team
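
To make the `cost_anomaly` condition concrete, here is one way the "2x rolling 7-day average" rule could be evaluated. The function name and parameters are illustrative assumptions; an alerting system would normally express this as a query rather than application code.

```python
def is_cost_anomaly(daily_costs, today_cost, factor=2.0, window=7):
    """True if today's cost exceeds `factor` times the rolling average.

    `daily_costs` holds previous days' totals, oldest first.
    """
    recent = daily_costs[-window:]
    if not recent:
        return False  # no baseline yet, so nothing to compare against
    baseline = sum(recent) / len(recent)
    return today_cost > factor * baseline
```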

Trace Visualization

Every LLM call should be traceable end-to-end:

User request ─→ Auth check ─→ Prompt assembly ─→ Model call ─→ Output validation ─→ Response
(trace_id: abc123)
├── latency: 2.3s
├── tokens: 450 input / 120 output
├── model: claude-sonnet-4-20260510
├── cost: $0.003
└── version: prompt_v3 / model_v2

Prompt Versioning & CI/CD

Prompts are code. They should be versioned, reviewed, and deployed through a pipeline just like application code.

Prompt as Code

Store prompts in a structured format, not scattered across code:

# prompts/classify_sentiment.yaml
name: classify_sentiment
version: 3
model: claude-sonnet-4-20260510
system_prompt: |
  You are a sentiment classifier. Analyze the text and return:
  - positive
  - negative
  - neutral
temperature: 0.1
max_tokens: 10
output_schema:
  type: string
  enum: [positive, negative, neutral]
tags:
  - production
  - classification
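
One benefit of declaring an `output_schema` is that model output can be checked against it at runtime. The sketch below assumes the YAML has already been loaded into a dict (mirrored here as `PROMPT_CONFIG`) and uses a hypothetical `validate_output` helper; it only handles the string-enum case shown above.

```python
# Mirrors the relevant fields of prompts/classify_sentiment.yaml (assumed already parsed).
PROMPT_CONFIG = {
    "name": "classify_sentiment",
    "version": 3,
    "max_tokens": 10,
    "output_schema": {"type": "string", "enum": ["positive", "negative", "neutral"]},
}


def validate_output(raw_output, config):
    """Return the normalized label, or raise ValueError if it is not in the enum."""
    label = raw_output.strip().lower()
    allowed = config["output_schema"]["enum"]
    if label not in allowed:
        raise ValueError(f"{label!r} is not one of {allowed}")
    return label
```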

CI/CD Pipeline for Prompts

  1. Developer edits prompt.yaml
  2. PR created
  3. Automated tests run:
     • Unit tests (formatting, length checks)
     • Eval tests (accuracy on labeled test set)
     • Regression tests (known edge cases)
  4. Review + approve
  5. Merge to main
  6. Deploy to staging (5% canary)
  7. Monitor for 1 hour
  8. Deploy to production (100%)

Automated Prompt Testing

PROMPT_VERSION = "classify_sentiment_v3"  # example identifier; set to the version under test

def test_prompt_version():
    """Validate a prompt version before deployment."""
    test_cases = [
        {"input": "I love this!", "expected": "positive"},
        {"input": "This is terrible.", "expected": "negative"},
        {"input": "It's okay I guess.", "expected": "neutral"},
    ]
    correct = 0
    for case in test_cases:
        # call_model is your model client wrapper
        response = call_model(PROMPT_VERSION, case["input"])
        if response.strip().lower() == case["expected"]:
            correct += 1
    accuracy_pct = correct / len(test_cases) * 100
    assert accuracy_pct >= 80, f"Prompt accuracy {accuracy_pct:.0f}% < 80% threshold"

A/B Testing

Run controlled experiments to validate changes.

Setup

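import hashlib

EXPERIMENTS = {
    "prompt_v3_vs_v4": {
        "control": "prompt_v3",
        "treatment": "prompt_v4",
        "split": 50,  # percent of users in treatment (50/50 split)
        "metrics": ["accuracy", "latency", "user_satisfaction"],
        "min_sample": 1000,
        "min_duration": "1h",
    }
}

def assign_experiment(user_id, experiment_name):
    experiment = EXPERIMENTS[experiment_name]
    # Use a stable digest: built-in hash() is salted per process for strings,
    # which would reshuffle users between variants on every restart.
    key = f"{user_id}:{experiment_name}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "treatment" if bucket < experiment["split"] else "control"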

What to Measure

| Metric | How to Measure | Minimum Detectable Effect |
|---|---|---|
| Accuracy | Labeled test set, human eval, LLM-as-judge | 2% |
| Latency | P50/P95 time to first token | 100ms |
| Cost | Tokens + model tier per query | 5% |
| User satisfaction | Thumbs up/down, follow-up questions | 5% |
| Safety | Toxicity scores, PII detection rate | 1% |

Statistical significance: Use a chi-squared test or Bayesian A/B test. Don’t declare a winner until p < 0.05 with at least 1,000 samples per variant.
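
For a binary metric such as thumbs-up rate, the significance check can be sketched with a two-proportion z-test, which for a 2x2 comparison is equivalent to a chi-squared test without continuity correction. The function name is an assumption for this example, and a statistics library would be the usual choice in practice.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference between two rates."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # both variants are all-success or all-failure
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

With 800 of 1,000 positive ratings in control and 850 of 1,000 in treatment, the p-value falls below 0.05, so the difference would pass the threshold stated above.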


Cost Tracking

LLM costs are variable and can spike. Track aggressively.

Per-Query Cost Breakdown

from datetime import datetime, timezone

def track_cost(user_id, feature, model, input_tokens, output_tokens):
    # Prices are per million tokens.
    cost = (input_tokens * model.input_price / 1_000_000
            + output_tokens * model.output_price / 1_000_000)
    # log_to_database is your persistence helper
    log_to_database({
        "timestamp": datetime.now(timezone.utc),
        "user_id": user_id,
        "feature": feature,
        "model": model.name,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost": cost,
    })
    return cost

Cost Dashboard

Track cost by multiple dimensions:

| Dimension | Why |
|---|---|
| Per user | Identify heavy users, potential abuse |
| Per feature | Which features are most expensive? |
| Per model | Are you using expensive models for cheap tasks? |
| Per hour/day | Identify cost spikes |
| Per experiment | Track experiment costs separately |
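
Given per-query records like those logged by `track_cost`, a dashboard dimension is just a grouped sum. The helper below is a hypothetical in-memory version; in practice this would likely be a SQL `GROUP BY` over the cost table.

```python
from collections import defaultdict


def cost_by(records, dimension):
    """Sum the `cost` field of each record, grouped by the given key."""
    totals = defaultdict(float)
    for record in records:
        totals[record[dimension]] += record["cost"]
    return dict(totals)
```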

Cost Optimization Playbook

  1. Add routing: Route simple queries to cheaper models
  2. Enable prompt caching: 50-90% savings on repeated prefixes
  3. Reduce output tokens: Shorter responses cost less
  4. Batch where possible: Reduce per-request overhead
  5. Negotiate volume pricing: API discounts at 100M+ tokens/month
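
The first item, routing, can start as a simple heuristic. The sketch below is deliberately naive: the model names, the word-count cutoff, and the marker phrases are all placeholders chosen for illustration, and many teams replace this kind of rule with a small classifier once they have traffic data.

```python
CHEAP_MODEL = "small-model"   # placeholder name, not a real model ID
STRONG_MODEL = "large-model"  # placeholder name, not a real model ID

# Phrases that suggest the query needs more reasoning (illustrative only).
HARD_MARKERS = ("explain why", "step by step", "analyze", "compare")


def choose_model(query, max_simple_words=30):
    """Send short, simple-looking queries to the cheaper model."""
    text = query.lower()
    is_long = len(text.split()) > max_simple_words
    looks_hard = any(marker in text for marker in HARD_MARKERS)
    return STRONG_MODEL if is_long or looks_hard else CHEAP_MODEL
```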

Caching Strategies

LLM outputs can be cached to reduce cost and latency.

Exact Match Cache

The simplest cache. If the same prompt was asked before, return the cached response.

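import hashlib

cache = RedisCache(ttl=3600)  # 1 hour; RedisCache is your Redis wrapper

def get_response(prompt, model):
    # Use a stable digest as the key: built-in hash() differs between
    # processes for strings, so workers would never share cache entries.
    cache_key = hashlib.sha256(f"{model}:{prompt}".encode("utf-8")).hexdigest()
    cached = cache.get(cache_key)
    if cached is not None:
        return cached
    response = call_model(prompt, model=model)
    cache.set(cache_key, response)
    return response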

Hit rate: 5-15% for most applications
Best for: FAQ, documentation, product information

Semantic Cache

Cache based on meaning, not exact text. If a user asks “What’s your return policy?” and another asks “How do I return something?”, return the same answer.

def semantic_cache_lookup(query, threshold=0.95):
    query_embedding = embed(query)
    cached_entries = cache.search(query_embedding, top_k=1)
    if cached_entries and cached_entries[0].similarity > threshold:
        return cached_entries[0].response
    return None

Implementation: Use the same embedding model as your vector database. Store (embedding, response, prompt) triples. On each lookup, embed the query, find the closest stored entry, and return its response if the similarity exceeds the threshold. On a miss, call the model and store the new triple.

Hit rate: 20-40% (much higher than exact match)
Best for: Customer support, conversational AI
Tradeoff: Higher latency than exact match (requires embedding computation + vector search)
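
To show what the lookup does under the hood, here is a self-contained in-memory version using cosine similarity. It is a sketch only: the class name is made up for the example, embeddings are passed in as plain lists of floats, and a real deployment would use a vector database with an approximate nearest-neighbor index instead of a linear scan.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemorySemanticCache:
    """Toy semantic cache storing (embedding, prompt, response) triples."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []

    def store(self, embedding, prompt, response):
        self.entries.append((embedding, prompt, response))

    def lookup(self, query_embedding):
        # Linear scan for the most similar stored embedding.
        best_score, best_response = 0.0, None
        for embedding, _prompt, response in self.entries:
            score = cosine_similarity(query_embedding, embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None
```

The threshold controls the tradeoff: a lower value raises the hit rate but increases the chance of returning an answer to a question that only looks similar.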

Cache Strategy Decision

| Type | Latency Savings | Cost Savings | Complexity |
|---|---|---|---|
| None | 0x | 0% | None |
| Exact match (Redis) | 100x for hits | 5-15% | Low |
| Prompt caching (API) | 2x for cached prefixes | 50-90% on cached | Built-in |
| Semantic (Vector DB) | 10x for hits | 20-40% | Medium |

Rate Limiting & Throttling

Protect your system from abuse and cost spikes.

Per-User Rate Limits

class RateLimitExceeded(Exception):
    pass

class CostLimitExceeded(Exception):
    pass

RATE_LIMITS = {
    # rpm = requests per minute, tpm = tokens per minute, cost_limit = daily cost cap
    "free_tier": {"rpm": 10, "tpm": 10000, "cost_limit": 0.10},
    "pro_tier": {"rpm": 100, "tpm": 100000, "cost_limit": 1.00},
    "enterprise": {"rpm": 1000, "tpm": 1000000, "cost_limit": None},
}

def check_rate_limit(user):
    limits = RATE_LIMITS[user.tier]
    usage = get_usage(user.id, window="1m")  # get_usage is your usage store lookup
    if usage.requests >= limits["rpm"]:
        raise RateLimitExceeded("Too many requests per minute")
    if usage.tokens >= limits["tpm"]:
        raise RateLimitExceeded("Too many tokens per minute")
    if limits["cost_limit"] is not None and usage.daily_cost >= limits["cost_limit"]:
        raise CostLimitExceeded("Daily cost limit reached")
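
The per-minute usage lookup is left undefined above. One possible backing structure is a sliding-window counter, sketched here in memory. The class name and the injectable `clock` are assumptions for the example; a multi-instance service would typically keep these counters in a shared store such as Redis.

```python
import time
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Tracks requests and tokens per user over the last `window_s` seconds."""

    def __init__(self, window_s=60.0, clock=time.monotonic):
        self.window_s = window_s
        self.clock = clock
        self.events = defaultdict(deque)  # user_id -> deque of (timestamp, tokens)

    def record(self, user_id, tokens):
        self.events[user_id].append((self.clock(), tokens))

    def usage(self, user_id):
        """Return (request_count, token_count) inside the current window."""
        cutoff = self.clock() - self.window_s
        events = self.events[user_id]
        # Drop events that have aged out of the window.
        while events and events[0][0] <= cutoff:
            events.popleft()
        return len(events), sum(tokens for _, tokens in events)
```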

Queue Management

When traffic spikes exceed capacity, queue requests instead of dropping them:

async def process_with_queue(request, queue):
    # Shed load only once the backlog is already full; otherwise wait in the queue.
    if queue.length() > 100:
        return {"error": "Too many requests", "retry_after": 10}
    response = await queue.enqueue(request, timeout=30)
    return response

Guardrails

Prevent the model from producing harmful or undesirable outputs.

Input Guardrails

Check user input before it reaches the model:

def check_input(user_input):
    checks = [
        check_prompt_injection(user_input),
        check_pii_in_input(user_input),
        check_toxicity(user_input),
        check_max_length(user_input),
    ]
    return all(checks)
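
The individual checks called in `check_input` are not defined in this section. As a rough sketch, three of them could look like the following, where each function returns True when the input passes. The length limit, the email-only PII pattern, and the injection phrases are simplifying assumptions; real systems usually rely on dedicated classifiers or detection services for these.

```python
import re

MAX_INPUT_CHARS = 4000  # assumed limit for this example
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
INJECTION_MARKERS = (
    "ignore previous instructions",
    "ignore all previous instructions",
)


def check_max_length(user_input, limit=MAX_INPUT_CHARS):
    return len(user_input) <= limit


def check_pii_in_input(user_input):
    """Passes when no email address is found (emails stand in for PII here)."""
    return EMAIL_RE.search(user_input) is None


def check_prompt_injection(user_input):
    """Passes when none of the known injection phrases appear."""
    text = user_input.lower()
    return not any(marker in text for marker in INJECTION_MARKERS)
```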

Output Guardrails

Check model output before showing it to the user:

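def check_output(model_output, question, context_chunks):
    checks = [
        check_pii_in_output(model_output),
        # Returns True when the answer is supported by the retrieved context.
        check_hallucination(question, model_output, context_chunks),
        check_format(model_output),
        check_topic_boundary(model_output),
    ]
    return all(checks)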

Hallucination Detection

One of the hardest problems in production LLM systems. Approaches:

  1. Self-consistency check: Ask the model the same question twice. If answers diverge, likely hallucination.
  2. Factual grounding check: Extract claims from the output and verify against retrieved context.
  3. Perplexity-based: If the model’s own confidence is low (high perplexity), the answer is suspect.
  4. LLM-as-judge: Use a second LLM call to verify the first one’s output.

def check_hallucination(question, answer, context_chunks):
    # Verify answer claims against context; returns True when the answer is supported.
    verification_prompt = f"""
    Question: {question}
    Answer: {answer}
    Context: {context_chunks}
    Does the answer contain any claims NOT supported by the context?
    Respond with ONLY: supported or unsupported
    """
    # judge_llm is a second model client used as the verifier
    result = judge_llm.generate(verification_prompt)
    return result.strip().lower() == "supported"
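
The self-consistency approach (item 1 in the list above) can be sketched as follows. The `ask` callable, the sample count, and the agreement threshold are assumptions for the example. Exact string comparison only works for short, constrained answers such as labels or numbers; free-form answers would need a semantic comparison instead.

```python
from collections import Counter


def self_consistency_check(ask, question, samples=3, min_agreement=0.6):
    """Ask the same question several times and measure agreement.

    `ask` is any callable that takes a question and returns the model's
    answer as a string. Returns (is_consistent, most_common_answer).
    """
    answers = [ask(question).strip().lower() for _ in range(samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return count / samples >= min_agreement, top_answer
```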

Incident Response

When things go wrong in production, respond systematically.

Common Incidents

| Incident | Symptoms | Response |
|---|---|---|
| Model degradation | Higher error rate, worse quality, more user complaints | Roll back last prompt/model change. Check provider status. |
| Cost spike | 3x+ normal daily cost | Check for abuse, runaway agents, model misrouting. |
| Latency spike | P95 > 10s | Check provider status, rate limits, network issues. |
| Hallucination wave | Multiple users report wrong answers | Check if context retrieval is broken (RAG), check for prompt injection. |
| Provider outage | 100% errors from one provider | Fail over to secondary provider. |

Incident Response Playbook

1. DETECT
   Alert triggers (error rate > 5%, latency > 10s)
2. TRIAGE
   Is it us or the provider?
   If provider: fail over
   If us: check last deployment
3. MITIGATE
   Roll back last change
   Enable fallback model
   Rate limit aggressive users
4. RESOLVE
   Confirm metrics returned to baseline
5. LEARN
   Post-mortem: what happened, why, how to prevent
   Add monitoring to catch earlier next time

Production Checklist

  • Multi-model fallback configured (primary + secondary + emergency)
  • Canary deployment process for prompts and models
  • Monitoring dashboards for latency, errors, cost, quality
  • Alerts configured for error rate, latency, cost anomalies
  • Prompts versioned and stored as code (YAML/JSON)
  • CI/CD pipeline for prompt changes with automated eval tests
  • A/B testing framework in place
  • Cost tracking per user, per feature, per model
  • Caching strategy implemented (exact + semantic)
  • Rate limiting per user tier
  • Input/output guardrails deployed
  • Hallucination detection in place for critical paths
  • Incident response playbook documented
  • Regular load testing (quarterly)

Key Takeaways

  1. LLMops is different from traditional ops — models degrade silently, costs are variable, quality is subjective
  2. Always have a fallback — no provider is 100% reliable
  3. Version everything — prompts, models, configurations
  4. Monitor semantically — latency and errors aren’t enough; track quality and safety too
  5. Cache aggressively — exact match for simple Q&A, semantic cache for conversational
  6. Route intelligently — simple queries to cheap models, complex to expensive
  7. Guard inputs and outputs — LLMs can be attacked through both
  8. Incident response is a playbook — don’t make it up on the spot
