
Evaluation & Testing for LLMs

📖 10 min read · deep-dive · evaluation · testing · benchmarks
How to evaluate LLM outputs - benchmarks, qualitative metrics, A/B testing, guardrails, and LLM-as-judge patterns.
Key Takeaways
  • Use 4 evaluation layers: unit tests, property-based tests, LLM-as-judge, and production monitoring
  • Benchmarks saturate in 2-3 years — GPQA and SWE-bench are the hardest current benchmarks
  • Best eval for your use case: 50-100 real user queries with gold-standard responses
  • LLM-as-judge correlates 0.7-0.9 with human ratings — use a different model than the one being evaluated

How to know if your LLM application is actually working - from individual prompt tests to production monitoring.


Why Evaluation Is Hard

LLM outputs are non-deterministic: the same prompt can produce different results each time, so exact-match assertions (assert equals) break down. You need a layered evaluation strategy instead.
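Because outputs vary from run to run, one way to replace an exact-match assertion is to sample the same prompt several times and measure a pass rate against a threshold. The sketch below assumes this approach; `pass_rate`, `fake_model`, and the 0.5 threshold are illustrative names and values, not part of any library.

```python
import random
from typing import Callable

def pass_rate(generate: Callable[[], str], check: Callable[[str], bool], n: int = 20) -> float:
    """Call the generator n times and return the fraction of outputs that pass the check."""
    passed = sum(1 for _ in range(n) if check(generate()))
    return passed / n

# Hypothetical stand-in for a non-deterministic model call.
rng = random.Random(0)

def fake_model() -> str:
    return rng.choice([
        "Shorter wavelengths scatter more, so the sky looks blue.",
        "Blue light has a short wavelength and scatters strongly.",
        "The sky reflects the ocean.",  # a wrong answer the check should catch
    ])

rate = pass_rate(fake_model, lambda out: "wavelength" in out.lower(), n=200)
meets_bar = rate >= 0.5  # the threshold is your own quality bar, chosen per use case
```

The test then asserts on `meets_bar` rather than on any single output, which keeps it stable across sampling variation.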

Tests that pass today may fail tomorrow after a model update.
Tests that pass for Claude may fail for GPT.
"Passing" depends on your quality bar, not a binary right/wrong.

Evaluation Layers

Layer 1: Unit Tests (Per Prompt)

Test individual prompt templates against known inputs.

def test_summary_prompt():
    # run_prompt and contains_hallucination are helpers you supply for your app.
    source = "The sky is blue because..."
    result = run_prompt(f"Summarize: {source}")
    assert "Rayleigh scattering" in result or "wavelength" in result
    assert len(result.split()) < 50  # summary should be concise
    assert not contains_hallucination(result, source_text=source)

What to check:

  • Required keywords or patterns present
  • Output length constraints met
  • Format/structure matches expected
  • No prohibited content (PII, hallucinations)
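The first three checks in this list can be written as small reusable predicates, with a simple banned-terms scan standing in for the prohibited-content check. This is a minimal sketch; the function names and the bullet-list pattern are assumptions chosen for illustration.

```python
import re

def has_keywords(text: str, keywords: list) -> bool:
    """True if every keyword appears in the text (case-insensitive)."""
    lowered = text.lower()
    return all(k.lower() in lowered for k in keywords)

def within_word_limit(text: str, max_words: int) -> bool:
    """True if the text has at most max_words whitespace-separated words."""
    return len(text.split()) <= max_words

def matches_pattern(text: str, pattern: str) -> bool:
    """True if the whole (stripped) text matches the regular expression."""
    return re.fullmatch(pattern, text.strip()) is not None

def has_no_banned_terms(text: str, banned: list) -> bool:
    """True if none of the banned terms appear in the text (case-insensitive)."""
    lowered = text.lower()
    return not any(term.lower() in lowered for term in banned)

# Example structure check: the output must be a dash-bulleted list and nothing else.
BULLET_LIST = r"(?:- .+\n?)+"
```

Each predicate returns a plain boolean, so they can be combined in an ordinary `assert` inside a unit test.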

Layer 2: Property-Based Tests

Test that outputs satisfy certain properties regardless of content.

properties = [
    ("starts_with_greeting", lambda o: o.startswith("Hello")),
    ("ends_with_question", lambda o: o.strip().endswith("?")),
    ("under_500_chars", lambda o: len(o) < 500),
    ("contains_no_urls", lambda o: "http" not in o),
]
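A list of named predicates like this needs a small runner that applies each one and reports which failed. The sketch below shows one way to do that; `failed_properties` and `sample_properties` are illustrative names.

```python
from typing import Callable, List, Tuple

Property = Tuple[str, Callable[[str], bool]]

def failed_properties(output: str, properties: List[Property]) -> List[str]:
    """Return the names of every property the output violates."""
    return [name for name, predicate in properties if not predicate(output)]

sample_properties: List[Property] = [
    ("under_500_chars", lambda o: len(o) < 500),
    ("contains_no_urls", lambda o: "http" not in o),
    ("ends_with_question", lambda o: o.strip().endswith("?")),
]

failures = failed_properties("Hello! See http://example.com for details.", sample_properties)
# failures == ["contains_no_urls", "ends_with_question"]
```

Returning the failing names rather than a single boolean makes it easier to see which property regressed after a prompt or model change.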

Layer 3: Semantic Evaluation

Measure output quality using another LLM as judge.

Input: "Explain quantum computing to a 10-year-old"
Output: "Quantum computers use qubits that can be 0 and 1 at the same time..."
Eval checks:
- Is the explanation age-appropriate? ✓
- Are there factual errors? ✓ (none found)
- Is it engaging for a child? ✓
- Length appropriate? ✓ (2 paragraphs)

LLM-as-Judge

Using one LLM to evaluate another’s output is the most practical evaluation method for production systems.

Structured Eval Prompt

You are a quality evaluator. Rate the following assistant response on:
1. Accuracy (1-5): No factual errors
2. Completeness (1-5): Addresses all parts of the query
3. Clarity (1-5): Easy to understand
4. Safety (1-5): No harmful or biased content
Return ONLY a JSON object:
{"accuracy": 5, "completeness": 4, "clarity": 5, "safety": 5, "pass": true}
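Judges do not always follow the requested schema, so it helps to validate the returned JSON before trusting it. The sketch below checks the four scores and recomputes `pass` locally; the rule that every score must be at least 4 is an assumption, since the prompt above does not define what passing means.

```python
import json

CRITERIA = ("accuracy", "completeness", "clarity", "safety")

def parse_judge_scores(raw: str, min_score: int = 4) -> dict:
    """Validate the judge's JSON and recompute "pass" from the scores."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    scores = {}
    for name in CRITERIA:
        value = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")
        scores[name] = value
    # min_score is an assumed pass rule, not something the prompt specifies.
    scores["pass"] = all(scores[name] >= min_score for name in CRITERIA)
    return scores
```

Recomputing `pass` in code keeps the decision rule in one place, so a judge that reports `"pass": true` alongside a low safety score cannot slip through.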

Implementation

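import json

import anthropic

# Reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

def evaluate_response(query, response, rubric):
    eval_prompt = f"""
Rate this response on the following criteria:
{rubric}
Query: {query}
Response: {response}
Return JSON only.
"""
    eval_result = client.messages.create(
        model="claude-sonnet-4-20260510",  # check this ID against your provider's current model list
        max_tokens=200,
        messages=[{"role": "user", "content": eval_prompt}],
    )
    # Raises json.JSONDecodeError if the judge adds any text around the JSON.
    return json.loads(eval_result.content[0].text)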

LLM-as-Judge Pitfalls

| Issue | Mitigation |
|---|---|
| Judge prefers longer responses | Control for length in rubric |
| Judge agrees with itself | Use a different model as judge |
| Position bias (favors first/last) | Randomize comparison order |
| Verbosity bias | Normalize responses before comparison |
| Self-enhancement bias | Never use the same model as judge and generator |
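Randomizing comparison order can be done with a thin wrapper around the judge call. In the sketch below the judge is a hypothetical callable that returns "first" or "second"; the wrapper shuffles the slots and maps the verdict back to the original labels.

```python
import random
from typing import Callable

def compare_randomized(
    judge: Callable[[str, str], str],
    response_a: str,
    response_b: str,
    rng: random.Random,
) -> str:
    """Return "A" or "B" for the preferred response, hiding which one came first."""
    swapped = rng.random() < 0.5
    first, second = (response_b, response_a) if swapped else (response_a, response_b)
    verdict = judge(first, second)  # hypothetical judge: returns "first" or "second"
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    picked_first = verdict == "first"
    # Map the positional verdict back to the original labels.
    if swapped:
        return "B" if picked_first else "A"
    return "A" if picked_first else "B"

# A maximally position-biased judge that always picks the first slot.
def biased_judge(first: str, second: str) -> str:
    return "first"

rng = random.Random(42)
wins = [compare_randomized(biased_judge, "short answer", "a much longer answer", rng) for _ in range(1000)]
share_a = wins.count("A") / len(wins)  # close to 0.5: the positional bias averages out
```

Randomization does not remove the bias from any single comparison; it only stops the bias from systematically favoring one variant across many comparisons.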

A/B Testing

Compare two versions of a prompt, model, or system configuration.

Methodology

  1. Split traffic 50/50 between control (A) and variant (B)
  2. Collect both quantitative (latency, cost) and qualitative (eval scores) metrics
  3. Statistical significance - run until each variant reaches the sample size your target effect requires; 100 per variant is a floor that only detects large differences
  4. Decision - switch if B is clearly better; keep A if uncertain
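For step 1, a common way to split traffic is to hash a stable user identifier, so the same user always lands in the same variant. The sketch below assumes that approach; `assign_variant` and the experiment name are illustrative.

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "prompt-v2", split: float = 0.5) -> str:
    """Deterministically map a user to "A" or "B" so repeat visits see the same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "A" if bucket < split else "B"

assignments = [assign_variant(f"user-{i}") for i in range(10_000)]
share_a = assignments.count("A") / len(assignments)  # close to 0.5
```

Including the experiment name in the hash input means a new experiment reshuffles users instead of reusing the previous split.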

Sample Size Calculator

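import math
from statistics import NormalDist

def min_sample_size(effect_size=0.1, alpha=0.05, power=0.8):
    # effect_size is the standardized difference between variants (Cohen's d).
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)  # 0.84 for 80% power
    n = (2 * (z_alpha + z_beta) ** 2) / (effect_size ** 2)
    return math.ceil(n)

# To detect an effect size of 0.10: ~1,570 samples per variant
# To detect an effect size of 0.05: ~6,280 samples per variant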
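Once both variants have collected samples, the decision step can be backed by a significance test on the eval pass rates. The sketch below uses a pooled two-proportion z-test, which is one reasonable choice among several; the function name and the example counts are made up for illustration.

```python
import math
from statistics import NormalDist

def two_proportion_p_value(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two pass rates (pooled z-test)."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # both variants all-pass or all-fail: no evidence of a difference
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

p_small = two_proportion_p_value(80, 100, 86, 100)        # 80% vs 86%, 100 samples each
p_large = two_proportion_p_value(1600, 2000, 1720, 2000)  # same rates, 2,000 samples each
```

The same six-point gap is not significant at 100 samples per variant but is clearly significant at 2,000, which is why the sample size matters more than the raw difference.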

Benchmark Methodology

Understanding how benchmarks actually work is essential for interpreting model scores correctly.

How Major Benchmarks Work

MMLU (Massive Multitask Language Understanding):

  • 57 subjects from high-school to professional-level (math, law, medicine, physics…)
  • Format: Multiple choice (A/B/C/D), ~14,000 questions
  • Scoring: Simple accuracy — what % did the model get right?
  • Ceiling: Human experts score ~90%; top models now hit 92-93%
  • Pitfall: Many questions are answerable from memorized internet text, not genuine reasoning

HumanEval:

  • 164 hand-written Python programming problems
  • Format: Function signature + docstring, model writes the body
  • Scoring: Functional correctness — does the code pass unit tests?
  • Ceiling: 100% is theoretically possible; top models hit 96-99%
  • Pitfall: Tests are simple (few edge cases), models may memorize common patterns
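Functional-correctness scoring can be illustrated with a toy harness that executes a completed function and then its unit tests. This is a simplified sketch, not the official HumanEval harness; the problem, completions, and `passes_tests` helper are invented for the example.

```python
def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Run generated code plus its unit tests; True if nothing raises.

    Warning: exec runs untrusted code. Real harnesses sandbox this step.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # defines the function under test
        exec(test_src, namespace)       # assertions raise on failure
    except Exception:
        return False
    return True

# Signature + docstring, as in the benchmark format; the model writes the body.
prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
good_completion = "    return a + b\n"
bad_completion = "    return a - b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

results = [passes_tests(prompt + body, tests) for body in (good_completion, bad_completion)]
accuracy = sum(results) / len(results)  # 0.5: one of the two completions passes
```

With only two assertions per problem, a wrong implementation can still pass by luck, which is the pitfall the list above describes.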

GPQA (Graduate-Level Google-Proof Q&A):

  • 448 expert-written questions in biology, physics, chemistry
  • Format: Multiple choice, designed to be “Google-proof” (can’t find answers by search)
  • Scoring: Accuracy
  • Ceiling: Domain experts scored ~65% in the original study; top models hit 87% (o3, on the Diamond subset)
  • Why it matters: Hardest benchmark — tests genuine understanding, not memorization

SWE-bench:

  • 2,294 real GitHub issues from 12 popular Python repos
  • Format: Model gets issue description + codebase, must produce a patch
  • Scoring: Does the patch make the issue’s failing tests pass without breaking tests that already passed?
  • Why it matters: Most realistic coding benchmark — tests real-world software engineering
  • Current best: Claude Sonnet 4.6 at ~49%, o3 at ~71%

GSM8K (Grade School Math 8K):

  • 8,500 grade-school math word problems
  • Format: Natural language problem, model must output the numeric answer
  • Scoring: Exact match accuracy
  • Saturation: Many models now score 95%+; useful for regression testing but no longer differentiates
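Exact-match scoring on numeric answers usually means pulling the final number out of the model's free-text response and comparing it with the gold answer. The sketch below is one simple way to do this; the regular expression and helper names are assumptions, and real harnesses differ in how they normalize answers.

```python
import re
from typing import Optional

NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_final_number(text: str) -> Optional[str]:
    """Return the last number in the text with thousands separators removed."""
    matches = NUMBER_RE.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")

def exact_match(model_output: str, gold_answer: str) -> bool:
    """Compare the final number in the output with the gold answer numerically."""
    predicted = extract_final_number(model_output)
    if predicted is None:
        return False
    return float(predicted) == float(gold_answer.replace(",", ""))
```

Taking the last number assumes the model states its final answer at the end, so prompts for this kind of eval usually ask for exactly that.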

Design Arena:

  • AI design quality benchmark with 5M+ community votes
  • Format: Blind pairwise comparison of model outputs (code, UI, images, video)
  • Scoring: Bradley-Terry Elo rating
  • Why it matters: Measures creative/aesthetic output, not just reasoning — an entirely different axis of capability
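Pairwise votes are turned into ratings with a logistic model. The sketch below shows the classic online Elo update as a simplified illustration; the source does not describe Design Arena's exact procedure, leaderboards of this kind often fit Bradley-Terry over all votes at once instead, and the K-factor of 32 is an assumed value.

```python
from typing import Tuple

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo (logistic) model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0) -> Tuple[float, float]:
    """Apply one pairwise vote and return the new (A, B) ratings."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)  # k is an assumed K-factor
    return rating_a + delta, rating_b - delta

new_a, new_b = update_elo(1500.0, 1500.0, a_won=True)  # (1516.0, 1484.0)
```

An upset (a low-rated model beating a high-rated one) moves both ratings further than an expected result, which is what lets the ranking converge from noisy votes.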

Benchmark Contamination

The single biggest issue with benchmarks: data leakage.

Training data (internet, books, code) contains benchmark questions.
Model memorizes answers during training.
Benchmark score is inflated.
Real-world performance is lower.

How to detect contamination:

  • N-gram overlap: If 10+ consecutive words from a benchmark question appear verbatim in training data, it’s contaminated
  • Perplexity analysis: Models have unusually low perplexity on contaminated questions
  • Time of release: If a benchmark was released before the model’s training cutoff, it may be in the training data
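The n-gram overlap check can be sketched in a few lines. This version lowercases and splits on whitespace without stripping punctuation, which is a simplifying assumption; it also rebuilds the n-grams for each document on every call, so a real pipeline would index the corpus once.

```python
from typing import List, Set, Tuple

def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """All runs of n consecutive lowercase words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(question: str, corpus_docs: List[str], n: int = 10) -> bool:
    """True if any n consecutive words of the question appear verbatim in a document."""
    question_grams = ngrams(question, n)
    return any(question_grams & ngrams(doc, n) for doc in corpus_docs)

question = "Which of the following best describes the primary function of the mitochondria in cells"
leaked_doc = "Study guide. " + question + " Answer: energy production."
clean_doc = "Mitochondria produce most of the chemical energy a cell needs."
```

Questions shorter than `n` words produce no n-grams and are never flagged, so short items need a smaller window or a different method.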

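Illustrative example (the contamination rate is an estimate, not a measurement):

  • GPT-4 scored 86% on MMLU
  • GPT-4’s training data extends to September 2021
  • MMLU was released in 2020
  • Possible contamination: 10-20% of MMLU questions may have been seen during training
  • Adjusted performance: if every seen question was answered correctly, accuracy on the unseen remainder is (0.86 - c) / (1 - c), roughly 82-84% for c between 0.10 and 0.20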

How to mitigate:

  • Use benchmarks released after your model’s training cutoff
  • Test on re-annotated or extended variants (MMLU-Redux, HumanEval-X)
  • Use benchmarks designed to resist contamination (GPQA, which is “Google-proof”)
  • Report contamination analysis alongside scores

Benchmark Saturation

As models improve, benchmarks become less useful:

| Benchmark | Year Released | Saturation | What It Tells You Now |
|---|---|---|---|
| MMLU | 2020 | Near-saturated | Whether a model is basic vs capable |
| HumanEval | 2021 | Near-saturated | Coding fluency (not real-world skill) |
| GSM8K | 2021 | Saturated | Basic arithmetic, limited use |
| GPQA | 2023 | Not saturated | Genuine reasoning capability |
| SWE-bench | 2023 | Not saturated | Real-world coding ability |
| Design Arena | 2025 | Not saturated | Creative/design capability |

Pattern: Benchmarks lose discriminating power in 2-3 years. The community must keep creating harder benchmarks.

How to Interpret Scores

Benchmark score ≠ real-world capability
| Score Range | What It Actually Means |
|---|---|
| 90-100% | Model is good at this type of question |
| 80-90% | Model is competent but makes mistakes |
| 70-80% | Model has basic capability, needs improvement |
| Below 70% | Model struggles with this domain |

But context matters:

  • 95% on MMLU (saturated) ≠ 95% on GPQA (hard)
  • A coding model may score 95% on HumanEval but 30% on SWE-bench
  • A model strong in English may be weak in Chinese (even with same benchmark format)

Statistical significance:

  • Small score differences (<2%) are often noise
  • Benchmark runs are non-deterministic (temperature, sampling variation)
  • Run each benchmark 3-5 times and report mean + variance
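Reporting mean and spread across repeated runs takes only the standard library. The sketch below also includes a rough screen for whether two models differ by more than run-to-run noise; that screen is a heuristic rather than a formal test, and the scores are made up for illustration.

```python
from statistics import mean, stdev
from typing import List

def summarize_runs(scores: List[float]) -> dict:
    """Mean and sample standard deviation across repeated benchmark runs."""
    return {"mean": mean(scores), "stdev": stdev(scores), "runs": len(scores)}

def clearly_different(a: List[float], b: List[float]) -> bool:
    """Rough screen: the gap in means exceeds the combined run-to-run spread."""
    return abs(mean(a) - mean(b)) > stdev(a) + stdev(b)

# Made-up scores from five runs of each model.
model_a = [0.861, 0.854, 0.868, 0.859, 0.863]
model_b = [0.872, 0.858, 0.866, 0.861, 0.869]

summary_a = summarize_runs(model_a)
noise_only = not clearly_different(model_a, model_b)  # True: the 0.4-point gap is within noise
```

Here model B's mean is about 0.4 points higher, but each model varies by roughly half a point between runs, so the difference should not drive a decision.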

Beyond Benchmarks

Benchmarks tell you how models compare in controlled conditions. For real-world decisions, add:

1. Human evaluation (gold standard):

  • Have domain experts rate outputs blind (A vs B)
  • 50-100 samples per variant is enough to detect large quality differences; smaller differences need more
  • Cost: $500-5000 per evaluation round

2. LLM-as-judge evaluation:

  • Use a strong model (Claude, GPT-5.5) to rate your application’s outputs
  • Correlates well with human ratings (0.7-0.9 correlation)
  • Cost: $0.01-0.05 per evaluation

3. Task-specific evals:

  • Create a dataset of 50-100 real user queries with gold-standard responses
  • Run your application’s outputs through the same eval criteria
  • This is more valuable than any public benchmark

4. Production metrics:

  • User satisfaction scores
  • Task completion rate
  • Time to resolution
  • Escalation rate (for support)
  • These are the only metrics that truly matter

Guardrails & Safety Testing

Automated checks that run on every output before showing it to users.

Types of Guardrails

| Type | Example | Tool |
|---|---|---|
| Content filter | Block profanity, hate speech | NeMo Guardrails, Guardrails AI |
| PII detection | Redact emails, phone numbers | Presidio, custom regex |
| Hallucination check | Verify claims against source | Custom LLM-as-judge |
| Format validation | Ensure JSON is valid | Pydantic, Zod |
| Toxicity scoring | Rate harmful content | Perspective API |

Implementation Pattern

def safety_check(output, source_documents):
    # Each check_* helper returns a dict with at least a boolean "pass" key.
    checks = [
        check_pii(output),
        check_toxicity(output),
        check_hallucination(output, source_documents),
        check_format(output),
    ]
    failed = [c for c in checks if not c["pass"]]
    if failed:
        # Replace the output with a refusal and report which checks failed.
        return {"pass": False, "failures": failed, "output": "I cannot provide that response."}
    return {"pass": True, "output": output}
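The pattern above leaves the `check_*` helpers undefined. The sketch below shows what two of them might look like using only the standard library; the regular expressions are deliberately crude (the phone pattern will also flag dates and other long digit runs), and the `name` and `details` keys are assumptions beyond the `pass` key the pattern relies on.

```python
import json
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")  # crude: favors recall over precision

def check_pii(output: str) -> dict:
    """Fail when the output contains something that looks like an email or phone number."""
    found = EMAIL_RE.findall(output) + PHONE_RE.findall(output)
    return {"name": "pii", "pass": not found, "details": found}

def check_format(output: str) -> dict:
    """Fail when the output is not valid JSON."""
    try:
        json.loads(output)
    except json.JSONDecodeError as exc:
        return {"name": "format", "pass": False, "details": str(exc)}
    return {"name": "format", "pass": True, "details": ""}
```

Toxicity and hallucination checks usually need a model or an external service, so they are left out of this sketch.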

Production Monitoring

Continuous evaluation in production catches regressions that unit tests miss.

Metrics to Track

| Metric | What It Detects | Alert Threshold |
|---|---|---|
| Eval pass rate | Overall quality drops | <90% pass rate |
| Avg latency | Performance regression | >2x baseline |
| Error rate | API failures | >1% errors |
| User feedback score | Subjective quality | <3.5/5 stars |
| Cost per request | Budget issues | >2x baseline |
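These thresholds translate directly into an alert check that runs over each metrics window. The sketch below encodes the values from the table; the metric key names and the `find_alerts` function are illustrative assumptions.

```python
from typing import Dict, List

def find_alerts(current: Dict[str, float], baseline: Dict[str, float]) -> List[str]:
    """Compare the current metrics window against the alert thresholds."""
    alerts = []
    if current["eval_pass_rate"] < 0.90:
        alerts.append("eval pass rate below 90%")
    if current["avg_latency_ms"] > 2 * baseline["avg_latency_ms"]:
        alerts.append("latency above 2x baseline")
    if current["error_rate"] > 0.01:
        alerts.append("error rate above 1%")
    if current["feedback_score"] < 3.5:
        alerts.append("user feedback below 3.5/5")
    if current["cost_per_request"] > 2 * baseline["cost_per_request"]:
        alerts.append("cost above 2x baseline")
    return alerts

baseline = {"avg_latency_ms": 800.0, "cost_per_request": 0.010}
healthy = {"eval_pass_rate": 0.95, "avg_latency_ms": 900.0, "error_rate": 0.002,
           "feedback_score": 4.2, "cost_per_request": 0.012}
degraded = {"eval_pass_rate": 0.84, "avg_latency_ms": 2100.0, "error_rate": 0.03,
            "feedback_score": 3.1, "cost_per_request": 0.025}
```

Calling `find_alerts(healthy, baseline)` returns an empty list, while the degraded window trips all five thresholds.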

Automated Eval Pipeline

User Query → LLM → Output → Guardrails → User
                     ↓
              Eval Queue (async)
                     ↓
               Eval LLM (judge)
                     ↓
          Metrics Dashboard + Alerts
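The asynchronous part of this pipeline can be sketched with a queue and a background worker, so scoring never delays the user-facing response. Everything here is a simplified assumption: `judge` stands in for the eval LLM call, `metrics` for the dashboard, and the guardrails step is omitted.

```python
import queue
import threading

eval_queue: queue.Queue = queue.Queue()
metrics = {"evaluated": 0, "passed": 0}

def judge(query: str, output: str) -> bool:
    """Hypothetical stand-in for the judge LLM call."""
    return len(output) > 0

def eval_worker() -> None:
    """Drain the queue in the background and update the metrics."""
    while True:
        item = eval_queue.get()
        if item is None:  # sentinel: shut down the worker
            eval_queue.task_done()
            break
        query, output = item
        metrics["evaluated"] += 1
        metrics["passed"] += int(judge(query, output))
        eval_queue.task_done()

def handle_request(query: str, model) -> str:
    output = model(query)
    eval_queue.put((query, output))  # enqueue for scoring without blocking the user
    return output

worker = threading.Thread(target=eval_worker, daemon=True)
worker.start()
replies = [handle_request(q, lambda text: f"echo: {text}") for q in ("hi", "what is RAG?")]
eval_queue.put(None)
eval_queue.join()  # only for this demo; a production worker would keep running
```

In production the in-process queue would typically be replaced by a durable message queue, so pending evaluations survive a restart.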

Eval Frameworks Compared

| Framework | Approach | Best For |
|---|---|---|
| DeepEval | LLM-as-judge, unit tests | Python-based eval pipelines |
| LangSmith | Trace + evaluate | LangChain users, debugging |
| Weights & Biases | Experiment tracking | Research, model comparison |
| Arize | Production monitoring | ML observability at scale |
| Custom (your code) | Full control | Specific business logic |

Quick Start: Minimal Eval Suite

# A minimal evaluation setup - adapt for your use case
evals = [
    {"query": "What is RAG?", "checks": ["retrieval", "augmented", "generation"]},
    {"query": "Explain transformers", "checks": ["attention", "encoder", "decoder"]},
    {"query": "Summarize this email", "checks": [], "max_length": 50},  # under 50 words
]

def run_evals(model, evals):
    # call_model is your own wrapper around the provider API.
    results = []
    for eval_case in evals:
        response = call_model(model, eval_case["query"])
        text = response.lower()  # keyword checks are case-insensitive
        passed = all(check in text for check in eval_case["checks"])
        if "max_length" in eval_case:
            passed = passed and len(response.split()) < eval_case["max_length"]
        results.append({"query": eval_case["query"], "passed": passed, "response": response})
    return results
