Evaluation & Testing for LLMs
How to know if your LLM application is actually working - from individual prompt tests to production monitoring.
Why Evaluation Is Hard
LLM outputs are non-deterministic: the same prompt can produce different results on each run. Exact-match assertions (assert equals), the staple of traditional software testing, break down on free-form output. Instead, you need a layered evaluation strategy.
- Tests that pass today may fail tomorrow after a model update.
- Tests that pass for Claude may fail for GPT.
- "Passing" depends on your quality bar, not a binary right/wrong.
Evaluation Layers
Layer 1: Unit Tests (Per Prompt)
Test individual prompt templates against known inputs.

    def test_summary_prompt():
        # run_prompt and contains_hallucination are project-specific helpers.
        source = "The sky is blue because..."
        result = run_prompt(f"Summarize: {source}")
        assert "Rayleigh scattering" in result or "wavelength" in result
        assert len(result.split()) < 50  # summary should be concise
        assert not contains_hallucination(result, source_text=source)

What to check:
- Required keywords or patterns present
- Output length constraints met
- Format/structure matches expected
- No prohibited content (PII, hallucinations)
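Because outputs vary from run to run, a single assertion can pass or fail by chance. One option is to repeat the check and assert a minimum pass rate instead. The sketch below is illustrative only: `pass_rate`, `fake_prompt`, and `mentions_mechanism` are hypothetical names, and `fake_prompt` stands in for a real model call so the example runs offline.

```python
import random
from typing import Callable


def pass_rate(generate: Callable[[], str], check: Callable[[str], bool], runs: int = 10) -> float:
    """Call `generate` `runs` times and return the fraction of outputs that satisfy `check`."""
    passed = sum(1 for _ in range(runs) if check(generate()))
    return passed / runs


# Hypothetical stand-in for a real model call (assumption: replace with your own client).
_rng = random.Random(0)


def fake_prompt() -> str:
    return _rng.choice([
        "Rayleigh scattering makes the sky look blue.",
        "Shorter wavelength light scatters more.",
        "The sky is blue.",
    ])


def mentions_mechanism(output: str) -> bool:
    """Keyword check, case-insensitive."""
    text = output.lower()
    return "rayleigh scattering" in text or "wavelength" in text


rate = pass_rate(fake_prompt, mentions_mechanism, runs=20)
print(f"pass rate: {rate:.0%}")
```

In a test suite the last line would become something like `assert rate >= 0.8`, with the threshold chosen to match your quality bar.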
Layer 2: Property-Based Tests
Test that outputs satisfy certain properties regardless of content.

    properties = [
        ("starts_with_greeting", lambda o: o.startswith("Hello")),
        ("ends_with_question", lambda o: o.strip().endswith("?")),
        ("under_500_chars", lambda o: len(o) < 500),
        ("contains_no_urls", lambda o: "http" not in o),
    ]

Layer 3: Semantic Evaluation
Measure output quality using another LLM as judge.

    Input: "Explain quantum computing to a 10-year-old"
    Output: "Quantum computers use qubits that can be 0 and 1 at the same time..."

    Eval checks:
    - Is the explanation age-appropriate? ✓
    - Are there factual errors? ✓ (none found)
    - Is it engaging for a child? ✓
    - Length appropriate? ✓ (2 paragraphs)

LLM-as-Judge
Using one LLM to evaluate another’s output is often the most practical way to evaluate quality at production scale, where human review of every response is not feasible.
Structured Eval Prompt

    You are a quality evaluator. Rate the following assistant response on:
    1. Accuracy (1-5): No factual errors
    2. Completeness (1-5): Addresses all parts of the query
    3. Clarity (1-5): Easy to understand
    4. Safety (1-5): No harmful or biased content

    Return ONLY a JSON object:
    {"accuracy": 5, "completeness": 4, "clarity": 5, "safety": 5, "pass": true}

Implementation
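
    import json

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def evaluate_response(query, response, rubric):
        eval_prompt = f"""
        Rate this response on the following criteria:
        {rubric}

        Query: {query}
        Response: {response}

        Return JSON only.
        """
        eval_result = client.messages.create(
            model="claude-sonnet-4-20260510",  # substitute a model ID you have access to
            max_tokens=200,
            messages=[{"role": "user", "content": eval_prompt}],
        )
        # json.loads raises if the judge wraps the JSON in extra text
        return json.loads(eval_result.content[0].text)

LLM-as-Judge Pitfalls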
| Issue | Mitigation |
|---|---|
| Verbosity bias (judge prefers longer responses) | Control for length in the rubric; normalize response length before comparison |
| Position bias (favors first/last) | Randomize comparison order |
| Self-enhancement bias (judge favors its own outputs) | Use a different model as judge than as generator |
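The position-bias mitigation in the table can be sketched as follows. This is a minimal illustration, not a reference implementation: the `Judge` interface (a callable returning "first" or "second") and the `always_first` stub are assumptions made for the example, not part of any library.

```python
import random
from typing import Callable

# Hypothetical judge interface: given a query and two candidate responses,
# return "first" or "second" for the one it prefers.
Judge = Callable[[str, str, str], str]


def compare_randomized(judge: Judge, query: str, a: str, b: str, rng: random.Random) -> str:
    """Return "A" or "B", showing the pair to the judge in random order."""
    swapped = rng.random() < 0.5
    first, second = (b, a) if swapped else (a, b)
    picked_first = judge(query, first, second) == "first"
    # Map the positional verdict back to the original labels.
    if picked_first:
        return "B" if swapped else "A"
    return "A" if swapped else "B"


def always_first(query: str, first: str, second: str) -> str:
    """Stub judge with maximal position bias, for demonstration only."""
    return "first"


rng = random.Random(42)
votes = [compare_randomized(always_first, "q", "resp A", "resp B", rng) for _ in range(1000)]
share_a = votes.count("A") / len(votes)
print(f"A wins {share_a:.1%} of comparisons")
```

With a judge that only looks at position, randomizing the order pushes the win rate toward 50%, so the bias shows up as noise rather than as a false preference for one variant.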
A/B Testing
Compare two versions of a prompt, model, or system configuration.
Methodology
- Split traffic 50/50 between control (A) and variant (B)
- Collect both quantitative (latency, cost) and qualitative (eval scores) metrics
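- Statistical significance - fix the sample size per variant before you start (see the Sample Size Calculator); 100 samples per variant only detects large effects (a standardized effect size of roughly 0.4)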
- Decision - switch if B is clearly better; keep A if uncertain
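One common way to implement the 50/50 split is to hash a stable identifier, so the same user always sees the same variant. The sketch below assumes a string user ID is available; `assign_variant` and the experiment name `"prompt-v2"` are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str = "prompt-v2", split: float = 0.5) -> str:
    """Deterministically assign a user to "A" (control) or "B" (variant)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "A" if bucket < split else "B"


assignments = [assign_variant(f"user-{i}") for i in range(10_000)]
print("share in A:", assignments.count("A") / len(assignments))
```

Including the experiment name in the hash keeps assignments independent across experiments, so a user who lands in B for one test is not systematically in B for the next.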
Sample Size Calculator
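
    import math
    from statistics import NormalDist

    def min_sample_size(effect_size=0.1, alpha=0.05, power=0.8):
        """Samples per variant for a two-sided, two-sample comparison of means.

        effect_size is the standardized difference (Cohen's d), not a raw percentage.
        """
        z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
        z_beta = NormalDist().inv_cdf(power)           # 0.84 for power = 0.8
        n = (2 * (z_alpha + z_beta) ** 2) / (effect_size ** 2)
        return math.ceil(n)

    # effect_size=0.1:  1,570 samples per variant
    # effect_size=0.05: 6,280 samples per variant

Benchmark Methodology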
Understanding how benchmarks actually work is essential for interpreting model scores correctly.
How Major Benchmarks Work
MMLU (Massive Multitask Language Understanding):
- 57 subjects from high-school to professional-level (math, law, medicine, physics…)
- Format: Multiple choice (A/B/C/D), ~14,000 questions
- Scoring: Simple accuracy — what % did the model get right?
- Ceiling: Human experts score ~90%; top models now hit 92-93%
- Pitfall: Many questions are answerable from memorized internet text, not genuine reasoning
HumanEval:
- 164 hand-written Python programming problems
- Format: Function signature + docstring, model writes the body
- Scoring: Functional correctness — does the code pass unit tests?
- Ceiling: 100% is theoretically possible; top models hit 96-99%
- Pitfall: Tests are simple (few edge cases), models may memorize common patterns
GPQA (Graduate-Level Google-Proof Q&A):
- 448 expert-written questions in biology, physics, chemistry
- Format: Multiple choice, designed to be “Google-proof” (can’t find answers by search)
- Scoring: Accuracy
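- Ceiling: Domain experts score ~65% (about 74% after discounting clear mistakes they identified in retrospect), while skilled non-experts with web access reach only ~34%; top models hit 87% (o3, on the Diamond subset)
- Why it matters: One of the hardest public benchmarks; designed to test genuine understanding rather than memorization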
SWE-bench:
- 2,294 real GitHub issues from 12 popular Python repos
- Format: Model gets issue description + codebase, must produce a patch
- Scoring: Does the patch make the issue’s failing tests pass without breaking tests that passed before?
- Why it matters: Most realistic coding benchmark — tests real-world software engineering
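- Reported scores: Claude Sonnet 4.6 at ~49%, o3 at ~71% (figures like these go stale quickly and depend on the benchmark variant and agent scaffold used)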
GSM8K (Grade School Math 8K):
- 8,500 grade-school math word problems
- Format: Natural language problem, model must output the numeric answer
- Scoring: Exact match accuracy
- Saturation: Many models now score 95%+; useful for regression testing but no longer differentiates
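Exact-match scoring on numeric answers needs a rule for pulling the answer out of free-form text. The sketch below assumes the final number in the output is the answer, which is a simplification made for this example; `extract_final_number` and `exact_match_accuracy` are hypothetical names, and the sample outputs are invented.

```python
import re
from typing import List, Optional

# Matches integers and decimals, with optional thousands separators.
NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(output: str) -> Optional[float]:
    """Return the last number in the text, or None if there is none."""
    matches = NUMBER_RE.findall(output)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def exact_match_accuracy(outputs: List[str], gold_answers: List[str]) -> float:
    """Fraction of outputs whose final number equals the gold answer."""
    correct = sum(
        extract_final_number(out) == float(gold)
        for out, gold in zip(outputs, gold_answers)
    )
    return correct / len(gold_answers)


outputs = [
    "She has 5 + 13 = 18 apples. The answer is 18.",
    "Total cost: $1,200",
    "I think it is seven.",  # no digits, so this counts as wrong
]
accuracy = exact_match_accuracy(outputs, ["18", "1200", "7"])
print(f"accuracy: {accuracy:.2f}")
```

The third case shows why extraction rules matter: a correct answer written as a word is scored as a miss, so reported accuracy depends on the parser as well as the model.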
Design Arena:
- AI design quality benchmark with 5M+ community votes
- Format: Blind pairwise comparison of model outputs (code, UI, images, video)
- Scoring: Bradley-Terry Elo rating
- Why it matters: Measures creative/aesthetic output, not just reasoning — an entirely different axis of capability
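For readers unfamiliar with Bradley-Terry scoring, here is a minimal sketch of how pairwise votes can be turned into ratings. It uses the standard iterative (minorization-maximization) update and an Elo-style log scale. The vote counts, model names, and the 1000-point anchor are invented for illustration, and nothing here is claimed to match Design Arena's actual pipeline. It assumes every model has at least one win.

```python
import math
from collections import defaultdict
from typing import Dict, Tuple


def bradley_terry(wins: Dict[Tuple[str, str], int], iterations: int = 200) -> Dict[str, float]:
    """Fit Bradley-Terry strengths; wins[(a, b)] is how often a beat b."""
    players = sorted({p for pair in wins for p in pair})
    total_wins = defaultdict(int)
    games = defaultdict(int)  # comparisons between two players, in either order
    for (winner, loser), n in wins.items():
        total_wins[winner] += n
        games[(winner, loser)] += n
        games[(loser, winner)] += n

    strength = {p: 1.0 for p in players}
    for _ in range(iterations):
        updated = {}
        for p in players:
            denom = sum(
                games[(p, q)] / (strength[p] + strength[q])
                for q in players if q != p
            )
            updated[p] = total_wins[p] / denom
        # Rescale so strengths average 1; only ratios are meaningful.
        scale = len(players) / sum(updated.values())
        strength = {p: s * scale for p, s in updated.items()}
    return strength


# Hypothetical vote counts between three models.
votes = {
    ("model_x", "model_y"): 70, ("model_y", "model_x"): 30,
    ("model_x", "model_z"): 80, ("model_z", "model_x"): 20,
    ("model_y", "model_z"): 60, ("model_z", "model_y"): 40,
}
strengths = bradley_terry(votes)
# Elo-style scale: a 400-point gap corresponds to 10:1 odds.
ratings = {m: 1000 + 400 * math.log10(s) for m, s in strengths.items()}
for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.0f}")
```

Under this model the probability that one model beats another is its strength divided by the sum of both strengths, which is why only rating differences carry meaning.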
Benchmark Contamination
The single biggest issue with benchmarks: data leakage.
1. Training data (internet, books, code) contains benchmark questions.
2. Model memorizes answers during training.
3. Benchmark score is inflated.
4. Real-world performance is lower.
How to detect contamination:
- N-gram overlap: If 10+ consecutive words from a benchmark question appear verbatim in training data, it’s contaminated
- Perplexity analysis: Models have unusually low perplexity on contaminated questions
- Time of release: If a benchmark was released before the model’s training cutoff, it may be in the training data
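The n-gram overlap check can be sketched in a few lines. This version uses naive whitespace tokenization and compares a question against a single document; a real check would normalize punctuation and scan an indexed corpus. The function names and sample texts are hypothetical.

```python
from typing import Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """All runs of n consecutive lowercase words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shares_ngram(question: str, corpus_doc: str, n: int = 10) -> bool:
    """True if any run of n consecutive words appears in both texts."""
    return not word_ngrams(question, n).isdisjoint(word_ngrams(corpus_doc, n))


question = "Which of the following best explains why the sky appears blue during the day?"
leaked = "Quiz dump: which of the following best explains why the sky appears blue during the day? Answer: B"
clean = "The sky appears blue because air molecules scatter short wavelengths more strongly."

print("leaked doc flagged:", shares_ngram(question, leaked))
print("clean doc flagged:", shares_ngram(question, clean))
```

Texts shorter than `n` words produce no n-grams and are never flagged, so short questions need a smaller `n` or a different signal such as perplexity.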
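Illustrative example:
- GPT-4 scored 86% on MMLU
- GPT-4’s training data included the internet up to September 2021
- MMLU was released in 2020, so its questions could have been in that data
- Suppose 10-20% of MMLU questions were seen during training (an assumed figure, not a measurement)
- If every seen question was answered correctly, accuracy on the unseen remainder would be about 82-84%, not 86%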
How to mitigate:
- Use benchmarks released after your model’s training cutoff
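- Test on private held-out sets, or on re-annotated and extended variants of public benchmarks (MMLU-Redux, HumanEval-X)
- Prefer benchmarks whose answers are hard to find online (GPQA, which is “Google-proof”), keeping in mind that any published benchmark can still leak into later training data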
- Report contamination analysis alongside scores
Benchmark Saturation
As models improve, benchmarks become less useful:
| Benchmark | Year Released | Saturation | What It Tells You Now |
|---|---|---|---|
| MMLU | 2020 | Near-saturated | Whether a model is basic vs capable |
| HumanEval | 2021 | Near-saturated | Coding fluency (not real-world skill) |
| GSM8K | 2021 | Saturated | Basic arithmetic; of limited use |
| GPQA | 2023 | Not saturated | Genuine reasoning capability |
| SWE-bench | 2023 | Not saturated | Real-world coding ability |
| Design Arena | 2025 | Not saturated | Creative/design capability |
Pattern: Benchmarks lose discriminating power in 2-3 years. The community must keep creating harder benchmarks.
How to Interpret Scores
Benchmark score ≠ real-world capability
| Score Range | What It Actually Means |
|---|---|
| 90-100% | Model is good at this type of question |
| 80-90% | Model is competent but makes mistakes |
| 70-80% | Model has basic capability, needs improvement |
| Below 70% | Model struggles with this domain |
But context matters:
- 95% on MMLU (saturated) ≠ 95% on GPQA (hard)
- A coding model may score 95% on HumanEval but 30% on SWE-bench
- A model strong in English may be weak in Chinese (even with same benchmark format)
Statistical significance:
- Small score differences (<2%) are often noise
- Benchmark runs are non-deterministic (temperature, sampling variation)
- Run each benchmark 3-5 times and report mean + variance
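Reporting a mean with its spread takes only the standard library. The scores below are invented for illustration, and the final comparison is a rough screen rather than a significance test. It uses the standard deviation (the square root of the variance) because it is in the same units as the score.

```python
import statistics

# Hypothetical accuracy scores (in %) from five runs of the same benchmark per model.
runs = {
    "model_a": [86.1, 85.4, 86.8, 85.9, 86.3],
    "model_b": [86.9, 87.4, 86.2, 87.1, 86.6],
}

summary = {
    name: (statistics.mean(scores), statistics.stdev(scores))
    for name, scores in runs.items()
}
for name, (mean, stdev) in summary.items():
    print(f"{name}: {mean:.2f} ± {stdev:.2f} (n={len(runs[name])})")

# Rough screen: is the gap between means larger than the run-to-run spread?
gap = abs(summary["model_a"][0] - summary["model_b"][0])
noise = max(stdev for _, stdev in summary.values())
print("gap exceeds run-to-run spread:", gap > noise)
```

When the gap is smaller than the spread, the two models are not distinguishable from these runs and more runs are needed.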
Beyond Benchmarks
Benchmarks tell you how models compare in controlled conditions. For real-world decisions, add:
1. Human evaluation (gold standard):
- Have domain experts rate outputs blind (A vs B)
- 50-100 samples per variant is enough to surface large quality differences; detecting small effects needs far more (see the Sample Size Calculator)
- Cost: $500-5000 per evaluation round
2. LLM-as-judge evaluation:
- Use a strong model (Claude, GPT-5.5) to rate your application’s outputs
- Often correlates well with human ratings (0.7-0.9 correlation), but verify this on a human-labeled sample from your own domain
- Cost: $0.01-0.05 per evaluation
3. Task-specific evals:
- Create a dataset of 50-100 real user queries with gold-standard responses
- Run your application’s outputs through the same eval criteria
- This is more valuable than any public benchmark
4. Production metrics:
- User satisfaction scores
- Task completion rate
- Time to resolution
- Escalation rate (for support)
- These are the only metrics that truly matter
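Before relying on an LLM judge, it helps to measure its agreement with human raters on a small labeled sample, which is what the correlation figure quoted for LLM-as-judge refers to. The sketch below computes a Pearson correlation by hand; the ratings are invented, and `pearson` is a hypothetical helper.

```python
import math
import statistics
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical 1-5 ratings of the same ten outputs.
human_scores = [5, 4, 2, 5, 3, 1, 4, 4, 2, 5]
judge_scores = [5, 4, 3, 4, 3, 2, 5, 4, 1, 5]

r = pearson(human_scores, judge_scores)
print(f"judge vs human correlation: r = {r:.2f}")
```

A low correlation suggests the rubric or the judge model needs work before its scores feed a dashboard.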
Guardrails & Safety Testing
Automated checks that run on every output before showing it to users.
Types of Guardrails
| Type | Example | Tool |
|---|---|---|
| Content filter | Block profanity, hate speech | NeMo Guardrails, Guardrails AI |
| PII detection | Redact emails, phone numbers | Presidio, custom regex |
| Hallucination check | Verify claims against source | Custom LLM-as-judge |
| Format validation | Ensure JSON is valid | Pydantic, Zod |
| Toxicity scoring | Rate harmful content | Perspective API |
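As an example of the "custom regex" option in the table, the sketch below redacts email addresses and phone numbers. The patterns are deliberately simple and will miss many real-world formats; `redact_pii` and the placeholder tokens are hypothetical choices for this example.

```python
import re
from typing import Tuple

# Simple illustrative patterns; production PII detection needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> Tuple[str, bool]:
    """Return the redacted text and whether anything was redacted."""
    redacted = EMAIL_RE.sub("[EMAIL]", text)
    redacted = PHONE_RE.sub("[PHONE]", redacted)
    return redacted, redacted != text


clean, found = redact_pii("Contact jane.doe@example.com or call +1 (555) 010-9999.")
print(clean)  # Contact [EMAIL] or call [PHONE].
```

The boolean return value makes the function usable both as a redactor and as a pass/fail guardrail check.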
Implementation Pattern

    def safety_check(output, source_documents):
        # Each check_* helper returns a dict with at least a "pass" key.
        checks = [
            check_pii(output),
            check_toxicity(output),
            check_hallucination(output, source_documents),
            check_format(output),
        ]
        failed = [c for c in checks if not c["pass"]]
        if failed:
            return {"pass": False, "failures": failed, "output": "I cannot provide that response."}
        return {"pass": True, "output": output}

Production Monitoring
Continuous evaluation in production catches regressions that unit tests miss.
Metrics to Track
| Metric | What It Detects | Alert Threshold |
|---|---|---|
| Eval pass rate | Overall quality drops | <90% pass rate |
| Avg latency | Performance regression | >2x baseline |
| Error rate | API failures | >1% errors |
| User feedback score | Subjective quality | <3.5/5 stars |
| Cost per request | Budget issues | >2x baseline |
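The thresholds in the table translate directly into an alert check. In the sketch below, the `Snapshot` fields, the `alerts` function, and the sample numbers are hypothetical; only the threshold values come from the table.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    eval_pass_rate: float    # fraction of outputs passing evals, 0-1
    avg_latency_s: float     # mean latency in seconds
    error_rate: float        # fraction of failed requests, 0-1
    feedback_score: float    # mean user rating, 1-5
    cost_per_request: float  # in your billing currency


def alerts(current: Snapshot, baseline: Snapshot) -> List[str]:
    """Return the names of metrics that breach the thresholds in the table."""
    rules = {
        "eval_pass_rate": current.eval_pass_rate < 0.90,
        "avg_latency": current.avg_latency_s > 2 * baseline.avg_latency_s,
        "error_rate": current.error_rate > 0.01,
        "feedback_score": current.feedback_score < 3.5,
        "cost_per_request": current.cost_per_request > 2 * baseline.cost_per_request,
    }
    return [name for name, breached in rules.items() if breached]


baseline = Snapshot(0.95, 1.2, 0.002, 4.2, 0.004)
current = Snapshot(0.88, 2.9, 0.004, 4.0, 0.005)
print(alerts(current, baseline))  # ['eval_pass_rate', 'avg_latency']
```

Latency and cost are compared against a baseline snapshot because their thresholds are relative, while the other three are absolute.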
Automated Eval Pipeline

    User Query → LLM → Output → Guardrails → User
                         ↓
                 Eval Queue (async)
                         ↓
                  Eval LLM (judge)
                         ↓
             Metrics Dashboard + Alerts

Eval Frameworks Compared
| Framework | Approach | Best For |
|---|---|---|
| DeepEval | LLM-as-judge, unit tests | Python-based eval pipelines |
| LangSmith | Trace + evaluate | LangChain users, debugging |
| Weights & Biases | Experiment tracking | Research, model comparison |
| Arize | Production monitoring | ML observability at scale |
| Custom (your code) | Full control | Specific business logic |
Quick Start: Minimal Eval Suite

    # A minimal evaluation setup - adapt for your use case
    evals = [
        {"query": "What is RAG?", "checks": ["retrieval", "augmented", "generation"]},
        {"query": "Explain transformers", "checks": ["attention", "encoder", "decoder"]},
        {"query": "Summarize this email", "checks": [], "max_length": 50},  # max_length in words
    ]

    def run_evals(model, evals):
        results = []
        for eval_case in evals:
            response = call_model(model, eval_case["query"])  # call_model: your model wrapper
            text = response.lower()  # keyword checks are case-insensitive
            passed = all(check in text for check in eval_case["checks"])
            if "max_length" in eval_case:
                passed = passed and len(response.split()) < eval_case["max_length"]
            results.append({"query": eval_case["query"], "passed": passed, "response": response})
        return results

See Also
- Benchmarks - Public benchmark comparisons
- Training & Fine-tuning - When evals show you need a better model
- Prompt Engineering - Iterate prompts before building evals