Evaluation & Testing for LLMs


How to evaluate LLM outputs - benchmarks, qualitative metrics, A/B testing, guardrails, and LLM-as-judge patterns.

Key Takeaways

Use 4 evaluation layers: unit tests, property-based tests, LLM-as-judge, and production monitoring
Benchmarks saturate in 2-3 years — GPQA and SWE-bench are the hardest current benchmarks
Best eval for your use case: 50-100 real user queries with gold-standard responses
LLM-as-judge correlates 0.7-0.9 with human ratings — use a different model than the one being evaluated

How to know if your LLM application is actually working - from individual prompt tests to production monitoring.

Why Evaluation Is Hard

LLM outputs are non-deterministic. The same prompt can produce different results each time. Traditional software testing (assert equals) doesn’t work. Instead, you need a layered evaluation strategy.
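Because the same prompt can produce different outputs, one option is to run each test several times and compare the pass rate against a quality bar instead of asserting on a single run. A minimal sketch, assuming a hypothetical `generate` callable that wraps your model call; `fake_generate` and `QUALITY_BAR` are illustrative stand-ins so the example runs offline:

```python
import random
from typing import Callable

def pass_rate(generate: Callable[[str], str],
              check: Callable[[str], bool],
              prompt: str,
              runs: int = 20) -> float:
    """Run one prompt `runs` times; return the fraction of outputs passing `check`."""
    passes = sum(1 for _ in range(runs) if check(generate(prompt)))
    return passes / runs

# Hypothetical stand-in for a real model call: the answer varies between runs.
_rng = random.Random(0)

def fake_generate(prompt: str) -> str:
    return _rng.choice(["Paris", "Paris, France", "I am not sure"])

QUALITY_BAR = 0.5  # assumed threshold; derive yours from product requirements
rate = pass_rate(fake_generate, lambda out: "Paris" in out, "Capital of France?")
print(f"pass rate = {rate:.2f}, meets bar: {rate >= QUALITY_BAR}")
```

Reporting a rate rather than a boolean also makes regressions after a model update visible as a drop in the number, not as a flaky test.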

Tests that pass today may fail tomorrow after a model update.
Tests that pass for Claude may fail for GPT.
"Passing" depends on your quality bar, not a binary right/wrong.

Evaluation Layers

Layer 1: Unit Tests (Per Prompt)

Test individual prompt templates against known inputs.

def test_summary_prompt():
    # run_prompt and contains_hallucination are project-specific helpers, not library functions
    source_text = "The sky is blue because..."
    result = run_prompt(f"Summarize: {source_text}")
    assert "Rayleigh scattering" in result or "wavelength" in result
    assert len(result.split()) < 50  # summary should be concise
    assert not contains_hallucination(result, source_text=source_text)

What to check:

Required keywords or patterns present
Output length constraints met
Format/structure matches expected
No prohibited content (PII, hallucinations)

Layer 2: Property-Based Tests

Test that outputs satisfy certain properties regardless of content.

properties = [
    ("starts_with_greeting", lambda o: o.startswith("Hello")),
    ("ends_with_question", lambda o: o.strip().endswith("?")),
    ("under_500_chars", lambda o: len(o) < 500),
    ("contains_no_urls", lambda o: "http" not in o),
]
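A list of (name, predicate) pairs like the one above still needs a small runner. The sketch below is one possible shape; the function name `check_properties` and the sample output string are illustrative, not part of any framework:

```python
from typing import Callable

Property = tuple[str, Callable[[str], bool]]

def check_properties(output: str, properties: list[Property]) -> dict[str, bool]:
    """Evaluate every named predicate against one model output."""
    return {name: bool(predicate(output)) for name, predicate in properties}

properties = [
    ("under_500_chars", lambda o: len(o) < 500),
    ("contains_no_urls", lambda o: "http" not in o),
    ("ends_with_question", lambda o: o.strip().endswith("?")),
]

results = check_properties("Hello! How can I help you today?", properties)
failed = [name for name, ok in results.items() if not ok]
print(results, failed)
```

Returning the per-property results, rather than a single boolean, shows which property broke when a prompt change causes a regression.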

Layer 3: Semantic Evaluation

Measure output quality using another LLM as judge.

Input: "Explain quantum computing to a 10-year-old"
Output: "Quantum computers use qubits that can be 0 and 1 at the same time..."

Eval checks:
- Is the explanation age-appropriate?   ✓
- Are there factual errors?              ✓ (none found)
- Is it engaging for a child?            ✓
- Length appropriate?                    ✓ (2 paragraphs)

LLM-as-Judge

Using one LLM to evaluate another’s output is the most practical evaluation method for production systems.

Structured Eval Prompt

You are a quality evaluator. Rate the following assistant response on:
1. Accuracy (1-5): No factual errors
2. Completeness (1-5): Addresses all parts of the query
3. Clarity (1-5): Easy to understand
4. Safety (1-5): No harmful or biased content

Return ONLY a JSON object:
{"accuracy": 5, "completeness": 4, "clarity": 5, "safety": 5, "pass": true}
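A judge's reply should be validated before it is trusted, since a model can return malformed JSON or out-of-range scores. The sketch below parses a reply in the format shown above and recomputes `pass` locally; the `min_score=4` threshold and the function name are assumptions for illustration:

```python
import json

CRITERIA = ("accuracy", "completeness", "clarity", "safety")

def parse_judge_output(raw: str, min_score: int = 4) -> dict:
    """Parse the judge's JSON reply and recompute `pass` from the scores.

    Raises ValueError if the reply is not a JSON object or a score is
    missing, not an integer, or outside 1-5.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    scores = {}
    for name in CRITERIA:
        value = data.get(name)
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"invalid score for {name!r}: {value!r}")
        scores[name] = value
    # Do not trust the judge's own "pass" field; derive it from the scores.
    scores["pass"] = all(scores[name] >= min_score for name in CRITERIA)
    return scores

reply = '{"accuracy": 5, "completeness": 4, "clarity": 5, "safety": 5, "pass": true}'
print(parse_judge_output(reply))
```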

Implementation

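import json

import anthropic

# Reads ANTHROPIC_API_KEY from the environment
client = anthropic.Anthropic()

def evaluate_response(query, response, rubric):
    eval_prompt = f"""
    Rate this response on the following criteria:
    {rubric}

    Query: {query}
    Response: {response}

    Return JSON only.
    """
    eval_result = client.messages.create(
        model="claude-sonnet-4-20260510",  # substitute a model ID available to your account
        max_tokens=200,
        messages=[{"role": "user", "content": eval_prompt}],
    )
    # json.loads raises JSONDecodeError if the judge wraps the JSON in extra text
    return json.loads(eval_result.content[0].text)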

LLM-as-Judge Pitfalls

Issue	Mitigation
Judge prefers longer responses	Control for length in rubric
Judge agrees with itself	Use a different model as judge
Position bias (favors first/last)	Randomize comparison order
Verbosity bias	Normalize responses before comparison
Self-enhancement bias	Never use the same model as judge and generator
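The "randomize comparison order" mitigation can be sketched as a wrapper around a pairwise judge. Here `judge(first, second)` is a hypothetical callable that returns "first" or "second"; the stub judges exist only to make the example runnable:

```python
import random
from typing import Callable

def compare_with_random_order(judge: Callable[[str, str], str],
                              response_a: str,
                              response_b: str,
                              rng: random.Random) -> str:
    """Return "A" or "B" for the preferred response, hiding the true order from the judge."""
    swapped = rng.random() < 0.5
    first, second = (response_b, response_a) if swapped else (response_a, response_b)
    verdict = judge(first, second)
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    # Map the positional verdict back to the original labels.
    if verdict == "first":
        return "B" if swapped else "A"
    return "A" if swapped else "B"

# A deliberately biased stub judge that always prefers whatever is shown first.
always_first = lambda first, second: "first"
rng = random.Random(42)
wins = [compare_with_random_order(always_first, "resp A", "resp B", rng) for _ in range(1000)]
print(wins.count("A") / len(wins))  # near 0.5: the position bias no longer favors one side
```

Randomizing does not remove the bias from any single comparison; it spreads the bias evenly across both variants so it cancels out in aggregate.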

A/B Testing

Compare two versions of a prompt, model, or system configuration.

Methodology

Split traffic 50/50 between control (A) and variant (B)
Collect both quantitative (latency, cost) and qualitative (eval scores) metrics
Statistical significance - treat 100 samples per variant as a floor; detecting small effects needs far more (see the Sample Size Calculator below)
Decision - switch if B is clearly better; keep A if uncertain
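For the 50/50 split, one common approach is to hash a stable user ID so each user always sees the same variant. This is a sketch under that assumption; the experiment name `prompt-v2` and the function name are made up for illustration:

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "prompt-v2") -> str:
    """Deterministically map a user to "A" or "B" with a 50/50 split."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return "A" if bucket < 50 else "B"

assignments = [assign_variant(f"user-{i}") for i in range(10_000)]
print(assignments.count("A") / len(assignments))  # roughly half
```

Including the experiment name in the hash keeps assignments independent across experiments, so the same users are not always grouped together.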

Sample Size Calculator

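import math
from statistics import NormalDist

def min_sample_size(effect_size=0.1, alpha=0.05, power=0.8):
    """Samples per variant for a two-sided, two-sample comparison of means.

    effect_size is standardized (Cohen's d): difference in means / standard deviation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for power=0.8
    n = (2 * (z_alpha + z_beta) ** 2) / (effect_size ** 2)
    return math.ceil(n)

# effect_size=0.1:  1,570 samples per variant
# effect_size=0.05: 6,280 samples per variant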
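Once the samples are collected, the "clearly better" decision can be checked with a significance test. The sketch below uses a two-proportion z-test on eval pass rates, which is one reasonable choice and not the only one; the counts in the usage example are invented:

```python
import math
from statistics import NormalDist

def two_proportion_z_test(pass_a: int, n_a: int, pass_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in pass rates B - A."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # identical all-pass or all-fail results: no evidence of a difference
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Invented counts: A passes 160/200 evals, B passes 176/200.
z, p = two_proportion_z_test(pass_a=160, n_a=200, pass_b=176, n_b=200)
print(f"z = {z:.2f}, p = {p:.3f}, significant at 0.05: {p < 0.05}")
```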

Benchmark Methodology

Understanding how benchmarks actually work is essential for interpreting model scores correctly.

How Major Benchmarks Work

MMLU (Massive Multitask Language Understanding):

57 subjects from high-school to professional-level (math, law, medicine, physics…)
Format: Multiple choice (A/B/C/D), ~14,000 questions
Scoring: Simple accuracy — what % did the model get right?
Ceiling: Human experts score ~90%; top models now hit 92-93%
Pitfall: Many questions are answerable from memorized internet text, not genuine reasoning

HumanEval:

164 hand-written Python programming problems
Format: Function signature + docstring, model writes the body
Scoring: Functional correctness — does the code pass unit tests?
Ceiling: 100% is theoretically possible; top models hit 96-99%
Pitfall: Tests are simple (few edge cases), models may memorize common patterns
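Functional-correctness scoring can be illustrated in a few lines: join the prompt and the model's completion, execute the result, then run the unit tests against it. This is a simplified sketch, not the official HumanEval harness; the `add` problem is invented, and a real harness sandboxes and time-limits the execution:

```python
def passes_tests(candidate_source: str, test_source: str) -> bool:
    """Return True if the candidate code runs and every assertion in the tests holds.

    WARNING: exec() runs arbitrary code. Only use this on trusted input or in a sandbox.
    """
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # define the candidate function
        exec(test_source, namespace)       # run assertions against it
    except Exception:
        return False
    return True

# Invented problem in the HumanEval style: signature + docstring, model writes the body.
prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
good_completion = "    return a + b\n"
bad_completion = "    return a - b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

print(passes_tests(prompt + good_completion, tests))
print(passes_tests(prompt + bad_completion, tests))
```

The benchmark score is then the fraction of problems for which the generated code passes.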

GPQA (Graduate-Level Google-Proof Q&A):

448 expert-written questions in biology, physics, chemistry
Format: Multiple choice, designed to be “Google-proof” (can’t find answers by search)
Scoring: Accuracy
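Ceiling: Domain experts score ~65% in the original paper (74% after discounting answers they later recognized as mistakes); top models hit 87% (o3)
Why it matters: One of the hardest current benchmarks; built to test genuine understanding, not memorization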

SWE-bench:

2,294 real GitHub issues from 12 popular Python repos
Format: Model gets issue description + codebase, must produce a patch
Scoring: Does the patch make the issue’s failing tests pass without breaking tests that already passed?
Why it matters: Most realistic coding benchmark — tests real-world software engineering
Current best: Claude Sonnet 4.6 at ~49%, o3 at ~71%

GSM8K (Grade School Math 8K):

8,500 grade-school math word problems
Format: Natural language problem, model must output the numeric answer
Scoring: Exact match accuracy
Saturation: Many models now score 95%+; useful for regression testing but no longer differentiates
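Exact-match scoring on a numeric answer usually needs a small extraction step, because models wrap the number in prose. The sketch below takes the last number in the output as the answer, which is a common convention and an assumption here, not necessarily what a given GSM8K harness does:

```python
import re
from typing import Optional

NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_final_number(text: str) -> Optional[str]:
    """Return the last number in the text with thousands separators removed, or None."""
    matches = NUMBER_RE.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")

def exact_match(prediction: str, gold: str) -> bool:
    """Compare the model's final number to the gold answer numerically."""
    predicted = extract_final_number(prediction)
    return predicted is not None and float(predicted) == float(gold)

print(exact_match("5 crates of 250 each, so the answer is 1,250.", "1250"))
print(exact_match("I think the answer is 12", "13"))
```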

Design Arena:

AI design quality benchmark with 5M+ community votes
Format: Blind pairwise comparison of model outputs (code, UI, images, video)
Scoring: Bradley-Terry Elo rating
Why it matters: Measures creative/aesthetic output, not just reasoning — an entirely different axis of capability
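To show how pairwise votes become a rating, here is the standard online Elo update, which approximates a Bradley-Terry model. It is a generic sketch: the K-factor of 32, the starting rating of 1000, and the model names are assumptions, and Design Arena's actual computation may differ:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_winner: float, rating_loser: float, k: float = 32.0) -> tuple[float, float]:
    """Return updated (winner, loser) ratings after one pairwise vote."""
    delta = k * (1.0 - expected_score(rating_winner, rating_loser))
    return rating_winner + delta, rating_loser - delta

ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [("model_x", "model_y")] * 3 + [("model_y", "model_x")]  # (winner, loser)
for winner, loser in votes:
    ratings[winner], ratings[loser] = update_elo(ratings[winner], ratings[loser])
print(ratings)
```

An upset (a low-rated model beating a high-rated one) moves ratings more than an expected win, which is why a few surprising votes can shift a leaderboard early on.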

Benchmark Contamination

The single biggest issue with benchmarks: data leakage.

Training data (internet, books, code) contains benchmark questions.
Model memorizes answers during training.
Benchmark score is inflated.
Real-world performance is lower.

How to detect contamination:

N-gram overlap: If 10+ consecutive words from a benchmark question appear verbatim in training data, the question is likely contaminated
Perplexity analysis: Models have unusually low perplexity on contaminated questions
Time of release: If a benchmark was released before the model’s training cutoff, it may be in the training data
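The n-gram overlap check can be sketched with word-level sets. This version lowercases and splits on whitespace, which is a simplification (real pipelines also normalize punctuation and work at corpus scale); the function names and sample strings are illustrative:

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All runs of n consecutive lowercased words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shares_ngram(question: str, corpus_text: str, n: int = 10) -> bool:
    """True if any n consecutive words of the question appear verbatim in the corpus text."""
    return not word_ngrams(question, n).isdisjoint(word_ngrams(corpus_text, n))

question = "What is the boiling point of water at sea level in degrees Celsius"
leaked = "quiz archive: what is the boiling point of water at sea level in degrees celsius answer 100"
clean = "Water boils at a lower temperature on a mountain than at the coast"

print(shares_ngram(question, leaked))
print(shares_ngram(question, clean))
```

Questions shorter than `n` words produce no n-grams and are never flagged, so short items need a smaller `n` or a different check.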

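Illustrative example:

GPT-4 scored 86% on MMLU
GPT-4’s training data included internet text up to September 2021
MMLU was released in 2020, so its questions could have been in that data
Hypothetical contamination: suppose 10-20% of MMLU questions were seen during training and all of those were answered correctly
Adjusted performance on unseen questions under that assumption: roughly 82-84%, a few points below the headline score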

How to mitigate:

Use benchmarks released after your model’s training cutoff
Test on re-annotated or extended variants (MMLU-Redux, HumanEval-X) in addition to the original test sets
Use benchmarks designed to resist contamination (GPQA, which is “Google-proof”)
Report contamination analysis alongside scores

Benchmark Saturation

As models improve, benchmarks become less useful:

Benchmark	Year Released	Saturation	What It Tells You Now
MMLU	2020	Near-saturated	Whether a model is basic vs capable
HumanEval	2021	Near-saturated	Coding fluency (not real-world skill)
GSM8K	2021	Saturated	Basic arithmetic; limited use beyond regression testing
GPQA	2023	Not saturated	Genuine reasoning capability
SWE-bench	2023	Not saturated	Real-world coding ability
Design Arena	2025	Not saturated	Creative/design capability

Pattern: Benchmarks lose discriminating power in 2-3 years. The community must keep creating harder benchmarks.

How to Interpret Scores

Benchmark score ≠ real-world capability

Score Range	What It Actually Means
90-100%	Model is good at this type of question
80-90%	Model is competent but makes mistakes
70-80%	Model has basic capability, needs improvement
Below 70%	Model struggles with this domain

But context matters:

95% on MMLU (saturated) ≠ 95% on GPQA (hard)
A coding model may score 95% on HumanEval but 30% on SWE-bench
A model strong in English may be weak in Chinese (even with same benchmark format)

Statistical significance:

Small score differences (<2%) are often noise
Benchmark runs are non-deterministic (temperature, sampling variation)
Run each benchmark 3-5 times and report mean + variance
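Two quick calculations make the noise concrete: the binomial standard error of an accuracy score (sampling noise from a finite question set) and the mean and standard deviation across repeated runs. The run scores below are invented for illustration:

```python
import math
import statistics

def accuracy_standard_error(accuracy: float, num_questions: int) -> float:
    """Binomial standard error of an accuracy measured on num_questions items."""
    return math.sqrt(accuracy * (1 - accuracy) / num_questions)

# An 87% score on a 448-question benchmark (the size of GPQA).
se = accuracy_standard_error(0.87, 448)
margin = 1.96 * se  # half-width of an approximate 95% confidence interval
print(f"87% +/- {margin * 100:.1f} points")

# Invented scores from five runs of the same benchmark.
run_scores = [0.86, 0.88, 0.87, 0.85, 0.89]
mean = statistics.mean(run_scores)
stdev = statistics.stdev(run_scores)
print(f"mean = {mean:.3f}, stdev = {stdev:.3f}")
```

On a benchmark of this size the interval is about three points wide on each side, so a 2-point gap between two models is well within sampling noise.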

Beyond Benchmarks

Benchmarks tell you how models compare in controlled conditions. For real-world decisions, add:

1. Human evaluation (gold standard):

Have domain experts rate outputs blind (A vs B)
50-100 samples per variant is usually enough to detect large quality differences; smaller differences need more samples
Cost: $500-5000 per evaluation round

2. LLM-as-judge evaluation:

Use a strong model (Claude, GPT-5.5) to rate your application’s outputs
Correlates well with human ratings (0.7-0.9 correlation)
Cost: $0.01-0.05 per evaluation

3. Task-specific evals:

Create a dataset of 50-100 real user queries with gold-standard responses
Run your application’s outputs through the same eval criteria
This is more valuable than any public benchmark

4. Production metrics:

User satisfaction scores
Task completion rate
Time to resolution
Escalation rate (for support)
These are the only metrics that truly matter

Guardrails & Safety Testing

Automated checks that run on every output before showing it to users.

Types of Guardrails

Type	Example	Tool
Content filter	Block profanity, hate speech	NeMo Guardrails, Guardrails AI
PII detection	Redact emails, phone numbers	Presidio, custom regex
Hallucination check	Verify claims against source	Custom LLM-as-judge
Format validation	Ensure JSON is valid	Pydantic, Zod
Toxicity scoring	Rate harmful content	Perspective API
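As an example of the "custom regex" option for PII, the sketch below redacts emails and phone-like numbers. The patterns are deliberately simple assumptions and will miss many real-world formats; a dedicated tool such as Presidio covers far more cases:

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Phone-like: optional "+", then at least 9 digits/separators, starting and ending on a digit.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str) -> str:
    """Replace emails and phone-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)

print(redact_pii("Contact jane.doe@example.com or call +1 (555) 123-4567."))
```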

Implementation Pattern

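def safety_check(output, source_documents):
    # Each check_* helper is assumed to return a dict with at least a boolean "pass" key
    checks = [
        check_pii(output),
        check_toxicity(output),
        check_hallucination(output, source_documents),
        check_format(output),
    ]
    failed = [c for c in checks if not c["pass"]]
    if failed:
        # Replace the output with a refusal instead of showing unsafe content
        return {"pass": False, "failures": failed, "output": "I cannot provide that response."}
    return {"pass": True, "output": output}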

Production Monitoring

Continuous evaluation in production catches regressions that unit tests miss.

Metrics to Track

Metric	What It Detects	Alert Threshold
Eval pass rate	Overall quality drops	<90% pass rate
Avg latency	Performance regression	>2x baseline
Error rate	API failures	>1% errors
User feedback score	Subjective quality	<3.5/5 stars
Cost per request	Budget issues	>2x baseline
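The thresholds in the table translate directly into an alert check. In this sketch the metric key names and the sample numbers are assumptions; in practice the values would come from your metrics store:

```python
def find_alerts(current: dict, baseline: dict) -> list[str]:
    """Compare current metrics to the alert thresholds and return triggered alerts."""
    alerts = []
    if current["eval_pass_rate"] < 0.90:
        alerts.append("eval pass rate below 90%")
    if current["avg_latency_s"] > 2 * baseline["avg_latency_s"]:
        alerts.append("average latency above 2x baseline")
    if current["error_rate"] > 0.01:
        alerts.append("error rate above 1%")
    if current["feedback_score"] < 3.5:
        alerts.append("user feedback below 3.5/5")
    if current["cost_per_request"] > 2 * baseline["cost_per_request"]:
        alerts.append("cost per request above 2x baseline")
    return alerts

baseline = {"avg_latency_s": 1.2, "cost_per_request": 0.004}
current = {
    "eval_pass_rate": 0.87,
    "avg_latency_s": 1.5,
    "error_rate": 0.002,
    "feedback_score": 4.1,
    "cost_per_request": 0.009,
}
print(find_alerts(current, baseline))
```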

Automated Eval Pipeline

User Query → LLM → Output → Guardrails → User
                              ↓
                        Eval Queue (async)
                              ↓
                         Eval LLM (judge)
                              ↓
                   Metrics Dashboard + Alerts
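The key property of this pipeline is that judging happens off the request path, so users never wait for the eval. A minimal in-process sketch using a queue and a worker thread; the stub `llm`, `guardrail`, and `judge` callables are placeholders, and a production system would more likely use a durable message queue:

```python
import queue
import threading

eval_queue: queue.Queue = queue.Queue()
eval_scores: list = []

def handle_request(query, llm, guardrail):
    """Serve the user first; queue the (query, output) pair for later judging."""
    output = guardrail(llm(query))
    eval_queue.put((query, output))
    return output

def eval_worker(judge):
    """Drain the queue in the background; a None item tells the worker to stop."""
    while True:
        item = eval_queue.get()
        if item is None:
            break
        query, output = item
        eval_scores.append(judge(query, output))

# Stubs standing in for the real model, guardrail, and judge.
llm = lambda q: f"answer to: {q}"
guardrail = lambda text: text
judge = lambda q, out: {"query": q, "pass": q in out}

worker = threading.Thread(target=eval_worker, args=(judge,))
worker.start()
handle_request("What is RAG?", llm, guardrail)
handle_request("Explain transformers", llm, guardrail)
eval_queue.put(None)  # stop the worker once earlier items are processed
worker.join()
print(eval_scores)
```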

Eval Frameworks Compared

Framework	Approach	Best For
DeepEval	LLM-as-judge, unit tests	Python-based eval pipelines
LangSmith	Trace + evaluate	LangChain users, debugging
Weights & Biases	Experiment tracking	Research, model comparison
Arize	Production monitoring	ML observability at scale
Custom (your code)	Full control	Specific business logic

Quick Start: Minimal Eval Suite

# A minimal evaluation setup - adapt for your use case
evals = [
    {"query": "What is RAG?", "checks": ["retrieval", "augmented", "generation"]},
    {"query": "Explain transformers", "checks": ["attention", "encoder", "decoder"]},
    {"query": "Summarize this email", "checks": [], "max_words": 50},
]

def run_evals(model, evals):
    # call_model is your own wrapper around the model API
    results = []
    for eval_case in evals:
        response = call_model(model, eval_case["query"])
        text = response.lower()  # keyword checks are case-insensitive
        passed = all(check in text for check in eval_case["checks"])
        max_words = eval_case.get("max_words")
        if max_words is not None:
            passed = passed and len(response.split()) < max_words
        results.append({"query": eval_case["query"], "passed": passed, "response": response})
    return results