Model Benchmarks & Leaderboards

Performance comparisons on standard benchmarks: coding, reasoning, and knowledge

Key Takeaways

MMLU, HumanEval, MATH, GPQA, and Design Arena each test different capabilities
Benchmarks can be contaminated if training data includes test questions, which inflates scores
Design Arena measures creative/design quality via Elo ratings, not academic scores
The best benchmark for your use case is a custom dataset of 50-100 real queries
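
As a rough sketch of the last takeaway, a custom evaluation can be as small as a list of real queries with expected answers and a scoring loop. Everything named here is hypothetical and only for illustration: `EvalCase`, `run_eval`, the substring-match scoring rule, the sample queries, and the `fake_model` stand-in, which you would replace with a call to whichever model API you are testing.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    query: str         # a real query taken from your own traffic or logs
    must_contain: str  # text a correct answer is expected to include


def run_eval(model: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Return the fraction of cases whose answer contains the expected text."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = 0
    for case in cases:
        answer = model(case.query)
        # Case-insensitive substring match: crude, but easy to audit by hand.
        if case.must_contain.lower() in answer.lower():
            passed += 1
    return passed / len(cases)


def fake_model(query: str) -> str:
    """Hypothetical stand-in for a real model call."""
    canned = {
        "What is our refund window?": "Refunds are accepted within 30 days.",
        "Which plan includes SSO?": "SSO is part of the Enterprise plan.",
    }
    return canned.get(query, "I am not sure.")


# Three made-up cases; a real set would hold 50-100 queries.
cases = [
    EvalCase("What is our refund window?", "30 days"),
    EvalCase("Which plan includes SSO?", "Enterprise"),
    EvalCase("What is the API rate limit?", "100 requests"),
]
score = run_eval(fake_model, cases)
print(f"pass rate: {score:.0%}")
```

Substring matching is only one possible scoring rule. For open-ended answers you would likely need human review or a rubric instead, but the loop structure stays the same.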

This page collects standard benchmarks for evaluating model capabilities, covering the most widely used measures for different types of tasks.

Interactive Benchmark Explorer

Browse models across coding, math, knowledge, reasoning, and design benchmarks. Use the Family filter to narrow by provider and Sort By to order by any column.

Model	Company	Coding (HumanEval)	Math (MATH)	Knowledge (MMLU)	Reasoning (GPQA)	Design (Design Arena)
Claude 4 Opus	Anthropic	96.2%	96.8%	92.4%	84.6%	—
Claude Sonnet 4.6	Anthropic	93.7%	94.2%	90.1%	79.8%	1331 Elo
GPT-5.5	OpenAI	95.1%	95.5%	91.8%	82.1%	1312 Elo
GPT-5.5 Instant	OpenAI	92.8%	92.1%	89.3%	—	—
Gemini 3.1 Pro	Google	94%	96%	91.5%	81.5%	—
DeepSeek V4	DeepSeek	91.5%	93.8%	89.8%	76.4%	—
Llama 4 405B	Meta	90.2%	89.6%	88.2%	73.1%	—
Mistral Large 3	Mistral	87.4%	—	—	—	—
o3	OpenAI	—	97.9%	—	87.3%	—
Claude Opus 4.7 (Thinking)	Anthropic	—	—	—	—	1350 Elo
Claude Opus 4.6	Anthropic	—	—	—	—	1346 Elo
Claude Opus 4.6 (Thinking)	Anthropic	—	—	—	—	1344 Elo
Kimi K2.6	Moonshot AI	—	—	—	—	1343 Elo
GLM 5.1	Zhipu AI	—	—	—	—	1341 Elo
Claude Opus 4.7	Anthropic	—	—	—	—	1338 Elo
GLM 5 Turbo	Zhipu AI	—	—	—	—	1336 Elo
DeepSeek V4 Pro	DeepSeek	—	—	—	—	1313 Elo
Muse Spark	Meta	—	—	—	—	1312 Elo
GLM 5	Zhipu AI	—	—	—	—	1307 Elo
Claude Opus 4.5	Anthropic	—	—	—	—	1301 Elo
Kimi K2.5 (Thinking)	Moonshot AI	—	—	—	—	1301 Elo
Gemini 3 Pro Preview	Google	—	—	—	—	1300 Elo
Gemini 3 Mini	Google	—	—	—	—	1295 Elo
Grok 3 Pro	xAI	—	—	—	—	1315 Elo
Qwen 3.6	Alibaba	—	—	—	—	1310 Elo
Llama 4 Scout	Meta	—	—	—	—	1290 Elo
GLM 4.7	Zhipu AI	—	—	—	—	1298 Elo
GLM 4	Zhipu AI	—	—	—	—	1285 Elo
MiniMax M2.7	MiniMax	—	—	—	—	1310 Elo
MiMo M2.7	Xiaomi	—	—	—	—	1280 Elo
GPT-5.4	OpenAI	—	—	—	—	1305 Elo
GPT-4.1	OpenAI	—	—	—	—	1290 Elo

Design Arena Leaderboards

Real-world model performance across design and code generation tasks. Ranked by 5M+ community votes using the Elo rating system across 30+ categories (code, UI, image, video, audio, 3D).

View all leaderboards on Design Arena →
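
To give a feel for what an Elo gap means, the sketch below converts a rating difference into an expected head-to-head win rate using the conventional Elo formula. It assumes the standard 400-point logistic scale; this page does not state which scale Design Arena uses, so treat the numbers as illustrative. The function name `elo_win_probability` is made up for this example.

```python
def elo_win_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of pairwise votes that A wins against B.

    Uses the conventional Elo logistic curve. The 400-point scale is an
    assumption, not something confirmed for Design Arena.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings taken from the table above.
top_vs_second = elo_win_probability(1350, 1346)  # Claude Opus 4.7 (Thinking) vs Claude Opus 4.6
top_vs_gpt = elo_win_probability(1350, 1312)     # Claude Opus 4.7 (Thinking) vs GPT-5.5

print(f"1350 vs 1346: {top_vs_second:.1%}")
print(f"1350 vs 1312: {top_vs_gpt:.1%}")
```

Under that assumption, a 4-point gap is close to a coin flip (about 51%), and even a 38-point gap implies only about a 55% win rate, so small Elo differences near the top of a leaderboard should not be over-read.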

How to Read Benchmark Scores

Most leaderboards report scores as percentages (higher is better) or ranked positions. Remember:

Benchmarks are proxies - They measure narrow tasks, not real-world capability
Leaderboards lag reality - In practice, new models often perform better than their published benchmark scores suggest
Context matters - The same model may score very differently with different prompting strategies
Use multiple signals - Don’t choose a model based on a single benchmark
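
One simple way to combine several signals, shown here as a sketch rather than a recommended methodology, is to rank models on each benchmark and average the ranks. The scores are copied from the explorer table above for the six models that have all four academic scores; the helper name `mean_rank` is hypothetical.

```python
from statistics import mean
from typing import Dict

# Scores (percent) copied from the table above.
scores: Dict[str, Dict[str, float]] = {
    "Claude 4 Opus":     {"HumanEval": 96.2, "MATH": 96.8, "MMLU": 92.4, "GPQA": 84.6},
    "Claude Sonnet 4.6": {"HumanEval": 93.7, "MATH": 94.2, "MMLU": 90.1, "GPQA": 79.8},
    "GPT-5.5":           {"HumanEval": 95.1, "MATH": 95.5, "MMLU": 91.8, "GPQA": 82.1},
    "Gemini 3.1 Pro":    {"HumanEval": 94.0, "MATH": 96.0, "MMLU": 91.5, "GPQA": 81.5},
    "DeepSeek V4":       {"HumanEval": 91.5, "MATH": 93.8, "MMLU": 89.8, "GPQA": 76.4},
    "Llama 4 405B":      {"HumanEval": 90.2, "MATH": 89.6, "MMLU": 88.2, "GPQA": 73.1},
}


def mean_rank(table: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average each model's per-benchmark rank (1 = best).

    Assumes every model has a score on every benchmark. Ties are broken
    by input order, which is good enough for a sketch.
    """
    benchmarks = sorted({bench for row in table.values() for bench in row})
    ranks = {model: [] for model in table}
    for bench in benchmarks:
        ordered = sorted(table, key=lambda m: table[m][bench], reverse=True)
        for position, model in enumerate(ordered, start=1):
            ranks[model].append(position)
    return {model: mean(positions) for model, positions in ranks.items()}


average_ranks = mean_rank(scores)
for model, rank in sorted(average_ranks.items(), key=lambda item: item[1]):
    print(f"{rank:.2f}  {model}")
```

Rank averaging ignores how large the score gaps are, so it is best read as a sanity check alongside your own evaluation rather than as a final answer.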

Current Model Rankings

For up-to-date leaderboard rankings, see:

General & Knowledge

Benchmark	Focus	Organized By
HELM	Multi-task evaluation (reasoning, knowledge, accuracy)	Stanford CRFM
OpenCompass	Chinese + multilingual benchmarks	OpenCompass
AlpacaEval	Chat model rankings from an automatic LLM judge validated against human preferences	Stanford
Hugging Face Open LLM Leaderboard	Open-source model rankings	Hugging Face
Chatbot Arena (LMSYS)	Human preference rankings via pairwise voting	UC Berkeley / LMSYS

Design & Code Generation

Benchmark	Focus	Organized By	Top Models
Design Arena	AI design quality (websites, UI, game dev, 3D, data viz, image, video, audio)	Arcada Labs	Claude Opus 4.7 (Thinking) — 1350 Elo
SWE-bench	Real-world software engineering tasks (bug fixes, feature implementation)	Princeton	—

Design Arena Top 20 (Cached May 2026)

View live leaderboard →

Rank	Model	Elo	Company
1	Claude Opus 4.7 (Thinking)	1350	Anthropic
2	Claude Opus 4.6	1346	Anthropic
3	Claude Opus 4.6 (Thinking)	1344	Anthropic
4	Kimi K2.6	1343	Moonshot AI
5	GLM 5.1	1341	Zhipu AI
6	Claude Opus 4.7	1338	Anthropic
7	GLM 5 Turbo	1336	Zhipu AI
8	Claude Sonnet 4.6	1331	Anthropic
9	Grok 3 Pro	1315	xAI
10	DeepSeek V4 Pro	1313	DeepSeek
11	Muse Spark	1312	Meta
12	GPT-5.5	1312	OpenAI
13	Qwen 3.6	1310	Alibaba
14	MiniMax M2.7	1310	MiniMax
15	GLM 5	1307	Zhipu AI
16	GPT-5.4	1305	OpenAI
17	Claude Opus 4.5	1301	Anthropic
18	Kimi K2.5 (Thinking)	1301	Moonshot AI
19	Gemini 3 Pro Preview	1300	Google
20	GLM 4.7	1298	Zhipu AI

Methodology: Blind pairwise comparison tournaments with Bradley-Terry (Elo) ranking. 5M+ votes across 30+ categories. See designarena.ai for full data.
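
For readers curious how pairwise votes become ratings, here is a minimal sketch of fitting a Bradley-Terry model with the textbook minorization-maximization (MM) iteration and then mapping the fitted strengths onto an Elo-style scale. This is not Design Arena's actual pipeline, which is not described on this page. The vote counts, the model names A, B, and C, the 1300 anchor, and the function names are all invented for illustration.

```python
import math
from typing import Dict, Tuple


def fit_bradley_terry(wins: Dict[Tuple[str, str], int], iterations: int = 200) -> Dict[str, float]:
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[(a, b)] is the number of votes in which a beat b. Assumes every
    model has at least one win and at least one comparison.
    """
    models = sorted({name for pair in wins for name in pair})
    strength = {name: 1.0 / len(models) for name in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            wins_i = sum(count for (winner, _), count in wins.items() if winner == i)
            # Sum over opponents of (games played) / (combined strength).
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            updated[i] = wins_i / denom
        total = sum(updated.values())
        strength = {name: value / total for name, value in updated.items()}
    return strength


def to_elo_scale(strength: Dict[str, float], anchor: float = 1300.0) -> Dict[str, float]:
    """Map strengths to Elo-style ratings whose mean equals `anchor`.

    The 400 * log10 scale and the anchor value are conventions assumed here.
    """
    raw = {name: 400.0 * math.log10(value) for name, value in strength.items()}
    shift = anchor - sum(raw.values()) / len(raw)
    return {name: rating + shift for name, rating in raw.items()}


# Invented vote counts for three hypothetical models.
votes = {
    ("A", "B"): 60, ("B", "A"): 40,
    ("B", "C"): 70, ("C", "B"): 30,
    ("A", "C"): 80, ("C", "A"): 20,
}
strengths = fit_bradley_terry(votes)
ratings = to_elo_scale(strengths)
for name in sorted(ratings, key=ratings.get, reverse=True):
    print(f"{name}: {ratings[name]:.0f}")
```

Under Bradley-Terry, the probability that A beats B is `strength[A] / (strength[A] + strength[B])`, which matches the Elo win-probability curve once strengths are placed on the 400-point log scale.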

Domain-Specific

Benchmark	Focus	Organized By
FinCanna Index	Financial domain benchmarks	Ishns Institute
HELM - Legal	Legal reasoning and knowledge	Stanford CRFM
HELM - Medical	Medical reasoning and knowledge	Stanford CRFM

Key

HELM - Holistic Evaluation of Language Models. One of the most comprehensive multi-task benchmarks.
Design Arena - 5M+ community votes across 30+ categories (code, UI, image, video, audio, 3D). Uses blind pairwise comparison with Bradley-Terry (Elo) ranking.
Chatbot Arena - Blind pairwise comparisons voted on by users. Models are ranked by Elo score.
SWE-bench - Tests models on real GitHub issues. Widely treated as a reference benchmark for coding capability.

For Practical Model Selection

For choosing a model for your use case (pricing, capabilities, tradeoffs), see:

Models Guide - Current models with detailed specs
Models Decision Guide - How to choose based on your needs