Production LLMOps

Deploying, monitoring, and operating LLM applications in production - CI/CD, caching, guardrails, and incident response

Key Takeaways

Always deploy multi-model fallback — primary, secondary, and emergency models prevent outages
Monitor quality and safety metrics alongside latency and errors — models degrade silently
Cache aggressively: exact match (Redis) for FAQ, semantic cache (vector DB) for conversational patterns
Use canary deployments for prompt changes — roll back within minutes if metrics degrade

Production LLM systems require different operational practices than traditional software. Models can degrade silently, costs can spike unpredictably, and prompt changes can have cascading effects.

Deployment Patterns

Pattern 1: Simple API Proxy

The simplest deployment. A thin wrapper around an LLM API with basic error handling.

User → [Proxy] → Claude/GPT/Gemini API
          │
          ├── Retry on 5xx
          ├── Circuit breaker on rate limits
          └── Basic logging

Best for: Prototypes, internal tools, low-volume applications.

Gotchas:

No fallback if the API goes down
No cost tracking per user
No prompt versioning
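The circuit breaker from the proxy diagram above can be sketched in a few lines. This is a minimal, single-process illustration rather than a production implementation: the `CircuitBreaker` class, its default thresholds, and the injectable `clock` are assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Stops calling a provider after repeated failures, then retries later."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to stay open before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Half-open: after the cool-down, let a trial request through.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

The proxy would call `allow_request()` before each upstream call and report the outcome with `record_success()` or `record_failure()`. Counting rate-limit responses as failures gives the "circuit breaker on rate limits" behavior from the diagram.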

Pattern 2: Multi-Model with Fallback

Route requests to a primary model, fall back to a secondary if it fails.

User → [Router] → Primary (Claude)
        │         Fallback (GPT-5.5)
        │         Emergency (DeepSeek V4)
        │
        ├── Health checks every 30s
        ├── Timeout: 30s per model
        └── Retry: 2x on 5xx before fallback

Implementation:

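import logging

from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

log = logging.getLogger(__name__)

MODELS = {
    "primary": {"provider": "anthropic", "model": "claude-sonnet-4-20260510"},
    "fallback": {"provider": "openai", "model": "gpt-5.5"},
    "emergency": {"provider": "deepseek", "model": "deepseek-v4"},
}

class AllModelsExhausted(Exception):
    """Raised when every tier has failed."""

# Retry each model on its own (1 attempt + 2 retries, with backoff) before falling back.
# Decorating the whole fallback loop instead would re-run every tier on each retry.
# call_model and APIError come from your provider client wrapper (not shown here).
@retry(
    retry=retry_if_exception_type((APIError, TimeoutError)),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    reraise=True,
)
async def call_with_retry(config, prompt, context):
    return await call_model(config, prompt, context)

async def generate_with_fallback(prompt, context):
    # Dicts preserve insertion order: primary, then fallback, then emergency.
    for tier, config in MODELS.items():
        try:
            return await call_with_retry(config, prompt, context)
        except (APIError, TimeoutError) as e:
            log.warning(f"{tier} failed after retries: {e}, falling back")
    raise AllModelsExhausted("All models failed")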

Pattern 3: Canary Deployments

Gradually roll out new prompts or models to a subset of users.

Day 1:  5% of users → new prompt version
Day 2: 20% of users → new prompt version
Day 3: 50% of users → new prompt version
Day 4: 100% of users → new prompt version

Rollback criteria: If any metric degrades by 5%+ (accuracy, latency, user satisfaction), roll back immediately.

Implementation with feature flags:

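import hashlib

prompt_versions = {
    "v1": {"prompt": "Answer concisely: {question}"},
    "v2": {"prompt": "You are a helpful assistant. Answer: {question}"},
}

def get_prompt_version(user_id, canary_percent=5):
    # Python's built-in hash() is salted per process for strings, so it does not
    # give stable assignment. A SHA-256 digest keeps each user in the same bucket
    # across restarts and workers.
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < canary_percent:  # e.g., 5 means 5% of users
        return prompt_versions["v2"]
    return prompt_versions["v1"]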
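The rollback criterion above ("any metric degrades by 5%+") can be checked automatically. The sketch below is one possible reading of that rule, using relative change against the baseline; the function name, the `HIGHER_IS_BETTER` table, and the metric names are illustrative assumptions.

```python
# Direction of "good" for each metric (assumed names, matching the list above).
HIGHER_IS_BETTER = {"accuracy": True, "user_satisfaction": True, "latency": False}


def degraded_metrics(baseline, canary, threshold=0.05):
    """Return the metrics where the canary is worse than baseline by >= threshold."""
    degraded = []
    for name, base in baseline.items():
        if base == 0:
            continue  # relative change is undefined for a zero baseline
        change = (canary[name] - base) / base
        # A drop is bad for accuracy; a rise is bad for latency.
        worse_by = -change if HIGHER_IS_BETTER[name] else change
        if worse_by >= threshold:
            degraded.append(name)
    return degraded
```

A deployment job could roll back whenever the returned list is non-empty. In practice you would also want enough canary traffic for the comparison to be statistically meaningful before acting on small differences.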

Pattern 4: Multi-Region / Multi-Provider

For high-availability systems, spread across regions and providers.

US-East: [Claude API] [GPT API]
US-West: [Claude API] [GPT API]
EU:      [Claude API] [GPT API]

Health checks → Route to healthiest region
Cross-region failover in <5s
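"Route to healthiest region" needs a concrete definition of "healthiest". A minimal sketch, assuming each health check produces an error rate and a P95 latency per region (the `RegionHealth` fields and the tie-breaking order are assumptions for illustration):

```python
from dataclasses import dataclass


@dataclass
class RegionHealth:
    name: str
    healthy: bool         # result of the most recent health check
    error_rate: float     # fraction of failed requests in the recent window
    p95_latency_s: float  # recent P95 latency in seconds


def pick_region(regions):
    """Choose the healthy region with the lowest error rate, then lowest latency."""
    candidates = [r for r in regions if r.healthy]
    if not candidates:
        raise RuntimeError("No healthy region available")
    best = min(candidates, key=lambda r: (r.error_rate, r.p95_latency_s))
    return best.name
```

Real routers usually add stickiness or hysteresis so that traffic does not flap between regions on small metric changes.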

Monitoring

LLM monitoring is different from traditional monitoring. You need to track not just system metrics (latency, errors) but also semantic metrics (quality, safety, cost).

What to Monitor

Category	Metrics	Why It Matters
Latency	P50/P95/P99 time to first token, total response time	User experience
Errors	API errors, rate limits, timeouts, model-side errors	Reliability
Cost	Cost per query, per user, per model, per feature	Budget control
Quality	Accuracy (on labeled data), user ratings, thumbs up/down	Model degradation detection
Safety	Toxicity scores, PII leakage, jailbreak attempts	Compliance, trust
Drift	Output length distribution, token usage patterns	Early warning for model changes

Monitoring Tools (May 2026)

Tool	Best For	Key Feature
LangSmith	LLM tracing, prompt debugging, evaluation	Full trace visualization, dataset management
Weights & Biases	Experiment tracking, model registry	Prompts as experiments, cost tracking
Arize AI	ML observability, drift detection	Embedding drift monitoring, LLM-specific metrics
Helicone	Simple proxy-based monitoring	Easy setup (proxy layer), cost breakdowns
Logfire	OpenTelemetry-native, structured logging	Distributed tracing for LLM pipelines
Datadog / Grafana	Infrastructure monitoring	Custom dashboards, alerting

Setting Up Alerts

Don’t alert on everything. Alert on things that matter:

alerts:
  - name: high_error_rate
    condition: error_rate > 5% over 5 minutes
    severity: critical
    action: page on-call, auto-fallback to secondary model

  - name: latency_spike
    condition: p95_latency > 10s over 5 minutes
    severity: warning
    action: investigate provider, check rate limits

  - name: cost_anomaly
    condition: daily_cost > 2x rolling_7day_average
    severity: warning
    action: notify team, check for abuse

  - name: quality_drop
    condition: user_thumbs_down > 10% over 1 hour
    severity: critical
    action: roll back last prompt change, notify team
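As an illustration of how one of these rules might be evaluated, the sketch below implements the `cost_anomaly` condition (daily cost above 2x the rolling 7-day average). The function name and the choice to stay silent until seven days of history exist are assumptions.

```python
from statistics import mean


def is_cost_anomaly(previous_daily_costs, today_cost, multiplier=2.0, window=7):
    """True when today's cost exceeds multiplier x the rolling average."""
    if len(previous_daily_costs) < window:
        return False  # not enough history to form a baseline
    rolling_average = mean(previous_daily_costs[-window:])
    return today_cost > multiplier * rolling_average
```

Excluding today's cost from the average keeps a sudden spike from inflating its own baseline.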

Trace Visualization

Every LLM call should be traceable end-to-end:

User request ─→ Auth check ─→ Prompt assembly ─→ Model call ─→ Output validation ─→ Response
  (trace_id: abc123)
  ├── latency: 2.3s
  ├── tokens: 450 input / 120 output
  ├── model: claude-sonnet-4-20260510
  ├── cost: $0.003
  └── version: prompt_v3 / model_v2
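A trace record like the one above can be produced with a small context manager. This is a sketch using only the standard library; the `llm_trace` name, the field names, and the `sink` callback (for example a logger method) are assumptions, and a real deployment would more likely use one of the tracing tools listed earlier.

```python
import json
import time
import uuid
from contextlib import contextmanager


@contextmanager
def llm_trace(model, prompt_version, sink):
    """Collect per-call metadata and emit it as one JSON line when the call ends."""
    record = {
        "trace_id": uuid.uuid4().hex,
        "model": model,
        "prompt_version": prompt_version,
    }
    start = time.perf_counter()
    try:
        yield record  # the caller adds token counts, cost, and so on
    finally:
        record["latency_s"] = round(time.perf_counter() - start, 3)
        sink(json.dumps(record))
```

Usage: wrap the model call in `with llm_trace(model, version, log.info) as trace:` and fill in `trace["input_tokens"]`, `trace["output_tokens"]`, and `trace["cost"]` once the response arrives. Because emission happens in `finally`, failed calls are traced too.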

Prompt Versioning & CI/CD

Prompts are code. They should be versioned, reviewed, and deployed through a pipeline just like application code.

Prompt as Code

Store prompts in a structured format, not scattered across code:

name: classify_sentiment
version: 3
model: claude-sonnet-4-20260510
system_prompt: |
  You are a sentiment classifier. Analyze the text and return:
  - positive
  - negative
  - neutral
temperature: 0.1
max_tokens: 10
output_schema:
  type: string
  enum: [positive, negative, neutral]
tags:
  - production
  - classification

CI/CD Pipeline for Prompts

Developer edits prompt.yaml
  ↓
PR created
  ↓
Automated tests run:
  - Unit tests (formatting, length checks)
  - Eval tests (accuracy on labeled test set)
  - Regression tests (known edge cases)
  ↓
Review + approve
  ↓
Merge to main
  ↓
Deploy to staging (5% canary)
  ↓
Monitor for 1 hour
  ↓
Deploy to production (100%)
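The unit-test stage of the pipeline above (formatting and length checks) can run without calling a model at all. Below is a minimal sketch that validates a prompt config after it has been parsed into a dict; the required-field list mirrors the example file, while the function name and the 4,000-character limit are assumptions.

```python
REQUIRED_FIELDS = ("name", "version", "model", "system_prompt", "temperature", "max_tokens")


def validate_prompt_config(cfg, max_prompt_chars=4000):
    """Return a list of human-readable problems; an empty list means the config passes."""
    errors = [f"missing field: {field}" for field in REQUIRED_FIELDS if field not in cfg]
    if errors:
        return errors
    if not isinstance(cfg["version"], int) or cfg["version"] < 1:
        errors.append("version must be a positive integer")
    if cfg["temperature"] < 0:
        errors.append("temperature must be non-negative")
    if not isinstance(cfg["max_tokens"], int) or cfg["max_tokens"] < 1:
        errors.append("max_tokens must be a positive integer")
    prompt = cfg["system_prompt"].strip()
    if not prompt:
        errors.append("system_prompt is empty")
    elif len(prompt) > max_prompt_chars:
        errors.append(f"system_prompt exceeds {max_prompt_chars} characters")
    return errors
```

A CI job would fail the PR when the list is non-empty, before the slower eval and regression stages run.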

Automated Prompt Testing

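PROMPT_VERSION = "classify_sentiment_v3"  # prompt under test (illustrative identifier)

def test_prompt_version():
    """Validate a prompt version before deployment."""
    test_cases = [
        {"input": "I love this!", "expected": "positive"},
        {"input": "This is terrible.", "expected": "negative"},
        {"input": "It's okay I guess.", "expected": "neutral"},
    ]

    # call_model is the application's model client (not shown here).
    correct = 0
    for case in test_cases:
        response = call_model(PROMPT_VERSION, case["input"])
        if response.strip().lower() == case["expected"]:
            correct += 1

    # With only three cases, an 80% threshold means all three must pass;
    # use a larger labeled set for a meaningful accuracy estimate.
    accuracy_pct = correct / len(test_cases) * 100
    assert accuracy_pct >= 80, f"Prompt accuracy {accuracy_pct:.1f}% < 80% threshold"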

A/B Testing

Run controlled experiments to validate changes.

Setup

EXPERIMENTS = {
    "prompt_v3_vs_v4": {
        "control": "prompt_v3",
        "treatment": "prompt_v4",
        "split": 50,  # 50/50 split
        "metrics": ["accuracy", "latency", "user_satisfaction"],
        "min_sample": 1000,
        "min_duration": "1h",
    }
}

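import hashlib

def assign_experiment(user_id, experiment_name):
    experiment = EXPERIMENTS[experiment_name]
    # Built-in hash() is salted per process for strings; use a stable digest so a
    # user stays in the same arm across restarts and workers.
    key = f"{user_id}:{experiment_name}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "treatment" if bucket < experiment["split"] else "control"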

What to Measure

Metric	How to Measure	Minimum Detectable Effect
Accuracy	Labeled test set, human eval, LLM-as-judge	2%
Latency	P50/P95 time to first token	100ms
Cost	Tokens + model tier per query	5%
User satisfaction	Thumbs up/down, follow-up questions	5%
Safety	Toxicity scores, PII detection rate	1%

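Statistical significance: Use a chi-squared test (or the equivalent two-proportion z-test) for rate metrics such as accuracy or thumbs-up rate, a t-test or Mann-Whitney U test for continuous metrics such as latency, or a Bayesian A/B test. Don’t declare a winner until p < 0.05 with at least 1,000 samples per variant, and fix the sample size up front rather than stopping the first time p dips below 0.05.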
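For a rate metric such as thumbs-up rate, the significance check can be done with the standard library alone. The sketch below uses a two-proportion z-test, which for a 2x2 comparison gives the same p-value as a chi-squared test without continuity correction; the function name and arguments are illustrative.

```python
import math


def two_proportion_p_value(success_a, n_a, success_b, n_b):
    """Two-sided p-value for the difference between two success rates."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # both variants are all-success or all-failure
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2))
```

With 1,000 samples per variant, a move from 80% to 85% positive ratings clears p < 0.05, while a move from 80% to 81% does not, which is why the minimum detectable effect matters when sizing an experiment.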

Cost Tracking

LLM costs are variable and can spike. Track aggressively.

Per-Query Cost Breakdown

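from datetime import datetime, timezone

def track_cost(user_id, feature, model, input_tokens, output_tokens):
    # Prices are expressed in dollars per million tokens.
    cost = (input_tokens * model.input_price / 1_000_000 +
            output_tokens * model.output_price / 1_000_000)

    # log_to_database is the application's storage helper (not shown here).
    log_to_database({
        "timestamp": datetime.now(timezone.utc),  # timezone-aware, for cross-region rollups
        "user_id": user_id,
        "feature": feature,
        "model": model.name,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost": cost,
    })

    return cost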

Cost Dashboard

Track cost by multiple dimensions:

Dimension	Why
Per user	Identify heavy users, potential abuse
Per feature	Which features are most expensive?
Per model	Are you using expensive models for cheap tasks?
Per hour/day	Identify cost spikes
Per experiment	Track experiment costs separately
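Each row of this table is the same aggregation over the per-query records with a different grouping key. A minimal in-memory sketch, assuming records shaped like the ones logged by `track_cost` (in practice this would usually be a SQL `GROUP BY` in your analytics store):

```python
from collections import defaultdict


def cost_by(records, dimension):
    """Sum cost per value of `dimension`, most expensive first."""
    totals = defaultdict(float)
    for record in records:
        totals[record[dimension]] += record["cost"]
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))
```

Calling `cost_by(records, "user_id")` surfaces heavy users, and `cost_by(records, "feature")` or `cost_by(records, "model")` answers the other questions in the table.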

Cost Optimization Playbook

Add routing: Route simple queries to cheaper models
Enable prompt caching: 50-90% savings on repeated prefixes
Reduce output tokens: Shorter responses cost less
Batch where possible: Reduce per-request overhead
Negotiate volume pricing: API discounts at 100M+ tokens/month

Caching Strategies

LLM outputs can be cached to reduce cost and latency.

Exact Match Cache

The simplest cache. If the same prompt was asked before, return the cached response.

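import hashlib

cache = RedisCache(ttl=3600)  # 1 hour

def get_response(prompt, model):
    # Built-in hash() differs between processes for strings, so workers sharing
    # one Redis would never hit each other's entries. Use a stable digest.
    cache_key = hashlib.sha256(f"{model}:{prompt}".encode("utf-8")).hexdigest()
    cached = cache.get(cache_key)
    if cached is not None:
        return cached

    response = call_model(prompt, model)
    cache.set(cache_key, response)
    return response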

Hit rate: 5-15% for most applications

Best for: FAQ, documentation, product information

Semantic Cache

Cache based on meaning, not exact text. If a user asks “What’s your return policy?” and another asks “How do I return something?”, return the same answer.

def semantic_cache_lookup(query, threshold=0.95):
    query_embedding = embed(query)
    cached_entries = cache.search(query_embedding, top_k=1)

    if cached_entries and cached_entries[0].similarity > threshold:
        return cached_entries[0].response

    return None

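Implementation: Embed queries with the same model used for the cached entries (typically the embedding model behind your vector database), since similarity scores are only comparable within one embedding space. Store (embedding, response, prompt) triples. On each lookup, embed the query, find the closest stored entry, and return its response only if the similarity exceeds the threshold; otherwise call the model and store the new triple.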

Hit rate: 20-40% (much higher than exact match)

Best for: Customer support, conversational AI

Tradeoff: Higher latency than exact match (requires embedding computation + vector search)
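To make the lookup logic concrete without a vector database, here is an in-memory sketch. It assumes an `embed` function is supplied by the caller and uses a linear scan with cosine similarity, which is fine for illustration but would be replaced by an approximate nearest-neighbor index at any real scale. The class and method names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemorySemanticCache:
    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # caller-supplied: text -> list of floats
        self.threshold = threshold
        self.entries = []           # (embedding, response, prompt) triples

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response, prompt))

    def get(self, query):
        """Return the cached response for the closest entry, or None on a miss."""
        query_embedding = self.embed(query)
        best_score, best_response = 0.0, None
        for embedding, response, _prompt in self.entries:
            score = cosine(query_embedding, embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None
```

The threshold is the main tuning knob: set it too low and users get answers to questions they did not ask, too high and the cache degenerates into exact match.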

Cache Strategy Decision

Type	Latency Savings	Cost Savings	Complexity
None	None (baseline)	0%	None
Exact match (Redis)	100x for hits	5-15%	Low
Prompt caching (API)	2x for cached prefixes	50-90% on cached	Built-in
Semantic (Vector DB)	10x for hits	20-40%	Medium

Rate Limiting & Throttling

Protect your system from abuse and cost spikes.

Per-User Rate Limits

RATE_LIMITS = {
    "free_tier": {"rpm": 10, "tpm": 10000, "cost_limit": 0.10},  # per day
    "pro_tier": {"rpm": 100, "tpm": 100000, "cost_limit": 1.00},
    "enterprise": {"rpm": 1000, "tpm": 1000000, "cost_limit": None},
}

def check_rate_limit(user):
    limits = RATE_LIMITS[user.tier]  # plain dict, so use key access
    # get_usage is assumed to return per-minute counts plus the running daily cost.
    usage = get_usage(user.id, window="1m")

    if usage.requests >= limits["rpm"]:
        raise RateLimitExceeded("Too many requests per minute")
    if usage.tokens >= limits["tpm"]:
        raise RateLimitExceeded("Too many tokens per minute")
    # cost_limit is None for tiers without a daily cap.
    if limits["cost_limit"] is not None and usage.daily_cost >= limits["cost_limit"]:
        raise CostLimitExceeded("Daily cost limit reached")
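The per-minute request count that a helper like `get_usage` returns has to come from somewhere. One common approach is a sliding-window log of timestamps per user; the sketch below is a single-process illustration with hypothetical names, whereas a multi-worker deployment would keep the same structure in a shared store such as Redis.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allows at most max_requests per user within any window_s-second window."""

    def __init__(self, max_requests, window_s=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_s = window_s
        self.clock = clock
        self.events = {}  # user_id -> deque of request timestamps

    def allow(self, user_id):
        now = self.clock()
        timestamps = self.events.setdefault(user_id, deque())
        # Drop timestamps that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_s:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False
        timestamps.append(now)
        return True
```

Rejected requests are not recorded, so a client that keeps hammering the endpoint does not extend its own lockout.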

Queue Management

When traffic spikes exceed capacity, queue requests instead of dropping them, and reject new requests only once the queue itself is full:

async def process_with_queue(request, queue):
    if queue.length() > 100:
        return {"error": "Too many requests", "retry_after": 10}

    response = await queue.enqueue(request, timeout=30)
    return response

Guardrails

Prevent the model from producing harmful or undesirable outputs.

Input Guardrails

Check user input before it reaches the model:

def check_input(user_input):
    checks = [
        check_prompt_injection(user_input),
        check_pii_in_input(user_input),
        check_toxicity(user_input),
        check_max_length(user_input),
    ]
    return all(checks)
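Two of the checks above are simple enough to sketch with the standard library. The regular expressions below are deliberately crude heuristics for email addresses and phone-like digit runs, not a complete PII detector, and the 4,000-character limit is an assumption; each function returns True when the input passes.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")  # 9+ characters of digits and separators


def check_max_length(user_input, max_chars=4000):
    """Pass when the input is within the length budget."""
    return len(user_input) <= max_chars


def check_pii_in_input(user_input):
    """Pass when no email-like or phone-like pattern is found."""
    return not (EMAIL_RE.search(user_input) or PHONE_RE.search(user_input))
```

Pattern matching like this will miss names, addresses, and most free-form identifiers, so production systems usually pair it with a dedicated PII detection service or model.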

Output Guardrails

Check model output before showing it to the user:

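def check_output(model_output, question, context_chunks):
    # Each check returns True when the output passes.
    checks = [
        check_pii_in_output(model_output),
        # Grounding check needs the original question and the retrieved context.
        check_hallucination(question, model_output, context_chunks),
        check_format(model_output),
        check_topic_boundary(model_output),
    ]
    return all(checks)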

Hallucination Detection

One of the hardest problems in production LLM systems. Approaches:

Self-consistency check: Ask the model the same question several times at a nonzero temperature. If the answers diverge, a hallucination is likely.
Factual grounding check: Extract claims from the output and verify against retrieved context.
Perplexity-based: If the model’s own confidence is low (high perplexity), the answer is suspect.
LLM-as-judge: Use a second LLM call to verify the first one’s output.

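def check_hallucination(question, answer, context_chunks):
    """Return True when a judge model finds every claim supported by the context."""
    # Join chunks so the judge sees plain text, not a Python list repr.
    context = "\n\n".join(context_chunks)
    verification_prompt = f"""
    Question: {question}
    Answer: {answer}
    Context: {context}

    Is every claim in the answer supported by the context?
    Respond with ONLY: supported or unsupported
    """
    # judge_llm is a second model client (not shown here).
    result = judge_llm.generate(verification_prompt)
    # Normalize case and whitespace; anything other than "supported" fails closed.
    return result.strip().lower() == "supported"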
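The self-consistency approach from the list above can be sketched as follows. It assumes you have already sampled several answers to the same question, and it compares them by normalized exact match, which only suits short answers such as labels or entities; longer answers would need embedding or judge-based comparison. The function name and the 0.6 agreement threshold are illustrative.

```python
from collections import Counter


def self_consistency(answers, min_agreement=0.6):
    """Return (is_consistent, agreement) for a list of sampled answers."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Normalize case and whitespace so trivial differences do not count.
    normalized = [" ".join(answer.lower().split()) for answer in answers]
    _top_answer, count = Counter(normalized).most_common(1)[0]
    agreement = count / len(normalized)
    return agreement >= min_agreement, agreement
```

Each extra sample is another model call, so this check multiplies cost and is usually reserved for high-stakes responses.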

Incident Response

When things go wrong in production, respond systematically.

Common Incidents

Incident	Symptoms	Response
Model degradation	Higher error rate, worse quality, more user complaints	Roll back last prompt/model change. Check provider status.
Cost spike	3x+ normal daily cost	Check for abuse, runaway agents, model misrouting.
Latency spike	P95 > 10s	Check provider status, rate limits, network issues.
Hallucination wave	Multiple users report wrong answers	Check if context retrieval is broken (RAG), check for prompt injection.
Provider outage	100% errors from one provider	Fail over to secondary provider.

Incident Response Playbook

1. DETECT
   Alert triggers (error rate > 5%, latency > 10s)
   ↓
2. TRIAGE
   Is it us or the provider?
   If provider: fail over
   If us: check last deployment
   ↓
3. MITIGATE
   Roll back last change
   Enable fallback model
   Rate limit aggressive users
   ↓
4. RESOLVE
   Confirm metrics returned to baseline
   ↓
5. LEARN
   Post-mortem: what happened, why, how to prevent
   Add monitoring to catch earlier next time

Production Checklist

Key Takeaways

LLMOps is different from traditional ops: models degrade silently, costs are variable, quality is subjective
Always have a fallback — no provider is 100% reliable
Version everything — prompts, models, configurations
Monitor semantically — latency and errors aren’t enough; track quality and safety too
Cache aggressively — exact match for simple Q&A, semantic cache for conversational
Route intelligently — simple queries to cheap models, complex to expensive
Guard inputs and outputs — LLMs can be attacked through both
Incident response is a playbook — don’t make it up on the spot