
Model Benchmarks & Leaderboards

Performance comparisons on standard benchmarks for coding, reasoning, and knowledge
Key Takeaways
  • MMLU, HumanEval, MATH, GPQA, and Design Arena each test different capabilities
  • Benchmarks can be contaminated if training data includes test questions
  • Design Arena measures creative/design quality via Elo ratings, not academic scores
  • The best benchmark for your use case is a custom dataset of 50-100 real queries
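
The last takeaway can be made concrete with a very small harness. The sketch below assumes a dataset of (query, expected answer) pairs and scores a model by normalized exact match. The names `normalize`, `evaluate`, and `toy_model` are hypothetical, `toy_model` stands in for a real API call, and exact match is only one possible scoring rule (free-form tasks usually need a rubric or a judge instead).

```python
from typing import Callable, Iterable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def evaluate(model: Callable[[str], str], dataset: Iterable[tuple[str, str]]) -> float:
    """Return the fraction of queries whose normalized answer equals the expected one."""
    total = 0
    correct = 0
    for query, expected in dataset:
        total += 1
        if normalize(model(query)) == normalize(expected):
            correct += 1
    if total == 0:
        raise ValueError("dataset is empty")
    return correct / total


# Hypothetical stand-in for a real model call; replace with your own client.
def toy_model(query: str) -> str:
    answers = {"capital of france?": "Paris", "2 + 2?": "4"}
    return answers.get(query.lower(), "unknown")


# In practice this would be 50-100 real queries from your own workload.
dataset = [
    ("Capital of France?", "paris"),
    ("2 + 2?", "4"),
    ("Largest planet?", "Jupiter"),
]

accuracy = evaluate(toy_model, dataset)
print(f"accuracy = {accuracy:.2f}")  # two of the three toy queries match
```

Because `evaluate` takes any callable, the same dataset can be run against several candidate models and the resulting accuracies compared directly.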

The benchmarks below are the most widely used measures of model capability across different types of tasks.

Interactive Benchmark Explorer

Browse models across coding, math, knowledge, reasoning, and design benchmarks. Use the Family filter to narrow by provider and Sort By to order by any column.

Model Company Coding (HumanEval) Math (MATH) Knowledge (MMLU) Reasoning (GPQA) Design (Design Arena)
Claude 4 Opus Anthropic 96.2% 96.8% 92.4% 84.6%
Claude Sonnet 4.6 Anthropic 93.7% 94.2% 90.1% 79.8% 1331 Elo
GPT-5.5 OpenAI 95.1% 95.5% 91.8% 82.1% 1312 Elo
GPT-5.5 Instant OpenAI 92.8% 92.1% 89.3%
Gemini 3.1 Pro Google 94% 96% 91.5% 81.5%
DeepSeek V4 DeepSeek 91.5% 93.8% 89.8% 76.4%
Llama 4 405B Meta 90.2% 89.6% 88.2% 73.1%
Mistral Large 3 Mistral 87.4%
o3 OpenAI 97.9% 87.3%
Claude Opus 4.7 (Thinking) Anthropic 1350 Elo
Claude Opus 4.6 Anthropic 1346 Elo
Claude Opus 4.6 (Thinking) Anthropic 1344 Elo
Kimi K2.6 Moonshot AI 1343 Elo
GLM 5.1 Zhipu AI 1341 Elo
Claude Opus 4.7 Anthropic 1338 Elo
GLM 5 Turbo Zhipu AI 1336 Elo
DeepSeek V4 Pro DeepSeek 1313 Elo
Muse Spark Meta 1312 Elo
GLM 5 Zhipu AI 1307 Elo
Claude Opus 4.5 Anthropic 1301 Elo
Kimi K2.5 (Thinking) Moonshot AI 1301 Elo
Gemini 3 Pro Preview Google 1300 Elo
Gemini 3 Mini Google 1295 Elo
Grok 3 Pro xAI 1315 Elo
Qwen 3.6 Alibaba 1310 Elo
Llama 4 Scout Meta 1290 Elo
GLM 4.7 Zhipu AI 1298 Elo
GLM 4 Zhipu AI 1285 Elo
MiniMax M2.7 MiniMax 1310 Elo
MiMo M2.7 Xiaomi 1280 Elo
GPT-5.4 OpenAI 1305 Elo
GPT-4.1 OpenAI 1290 Elo



Design Arena Leaderboards

Real-world model performance across design and code generation tasks. Ranked by 5M+ community votes using the Elo rating system across 30+ categories (code, UI, image, video, audio, 3D).
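
To get a feel for what an Elo gap means, the sketch below uses the standard Elo formulas: the expected score of A against B is 1 / (1 + 10^((R_B - R_A) / 400)), and a rating moves by K times the difference between the actual and expected result. The 400-point scale and K = 32 are the conventional chess values and are assumptions here; the source does not state which constants Design Arena uses.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A is preferred over B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Return new (A, B) ratings after one comparison.

    score_a is 1.0 if A won the vote, 0.0 if B won, 0.5 for a tie.
    The update is zero-sum: whatever A gains, B loses.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Ratings taken from the leaderboard on this page: 1350 (rank 1) vs 1301.
p = expected_score(1350, 1301)
print(f"expected win rate for the 1350-rated model: {p:.3f}")

new_a, new_b = elo_update(1350, 1301, score_a=0.0)  # an upset: the lower-rated model wins
print(f"after an upset: {new_a:.1f} vs {new_b:.1f}")
```

A 49-point gap corresponds to a win rate only modestly above 50%, which is why small Elo differences near the top of a leaderboard should not be over-read.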

View all leaderboards on Design Arena →


How to Read Benchmark Scores

Most leaderboards report scores as percentages, Elo ratings, or ranked positions (for percentages and Elo, higher is better). Remember:

  • Benchmarks are proxies - They measure narrow tasks, not real-world capability
  • Leaderboards lag reality - Published scores are snapshots, so a newly released or updated model may behave differently in practice than its listed score suggests
  • Context matters - The same model may score very differently with different prompting strategies
  • Use multiple signals - Don’t choose a model based on a single benchmark
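
One simple way to act on the "use multiple signals" advice is to average each model's rank across several benchmarks rather than trusting any single column. The sketch below does that for four rows of the explorer table on this page that report all four percentage scores. The function name `mean_ranks` is hypothetical, and mean rank is just one of many possible aggregation rules: it ignores score gaps, weights every benchmark equally, and does not handle ties specially.

```python
# Scores copied from the explorer table on this page (percent, higher is better).
scores = {
    "Claude 4 Opus": {"HumanEval": 96.2, "MATH": 96.8, "MMLU": 92.4, "GPQA": 84.6},
    "GPT-5.5": {"HumanEval": 95.1, "MATH": 95.5, "MMLU": 91.8, "GPQA": 82.1},
    "Gemini 3.1 Pro": {"HumanEval": 94.0, "MATH": 96.0, "MMLU": 91.5, "GPQA": 81.5},
    "DeepSeek V4": {"HumanEval": 91.5, "MATH": 93.8, "MMLU": 89.8, "GPQA": 76.4},
}


def mean_ranks(table: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average each model's rank (1 = best) over all benchmarks.

    Assumes every model has a score for every benchmark; a missing
    score raises KeyError rather than being silently skipped.
    """
    benchmarks = sorted({bench for row in table.values() for bench in row})
    totals = {model: 0.0 for model in table}
    for bench in benchmarks:
        ordered = sorted(table, key=lambda m: table[m][bench], reverse=True)
        for position, model in enumerate(ordered, start=1):
            totals[model] += position
    return {model: total / len(benchmarks) for model, total in totals.items()}


ranks = mean_ranks(scores)
for model, rank in sorted(ranks.items(), key=lambda item: item[1]):
    print(f"{model}: mean rank {rank:.2f}")
```

On this data GPT-5.5 and Gemini 3.1 Pro swap places on MATH, which a single-benchmark comparison would hide.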

Current Model Rankings

For up-to-date leaderboard rankings, see:

General & Knowledge

Benchmark | Focus | Organized By
HELM | Multi-task evaluation (reasoning, knowledge, accuracy) | Stanford CRFM
OpenCompass | Chinese + multilingual benchmarks | OpenCompass
AlpacaEval | Chat model rankings by an LLM-based automatic judge (win rate against a reference model) | Stanford
Hugging Face Open LLM Leaderboard | Open-source model rankings | Hugging Face
Chatbot Arena (LMSYS) | Human preference rankings via pairwise voting | UC Berkeley / LMSYS

Design & Code Generation

Benchmark | Focus | Organized By | Top Models
Design Arena | AI design quality (websites, UI, game dev, 3D, data viz, image, video, audio) | Arcada Labs | Claude Opus 4.7 (Thinking), 1350 Elo
SWE-bench | Real-world software engineering tasks (bug fixes, feature implementation) | Princeton |

Design Arena Top 20 (Cached May 2026)

View live leaderboard →

Rank | Model | Elo | Company
1 | Claude Opus 4.7 (Thinking) | 1350 | Anthropic
2 | Claude Opus 4.6 | 1346 | Anthropic
3 | Claude Opus 4.6 (Thinking) | 1344 | Anthropic
4 | Kimi K2.6 | 1343 | Moonshot AI
5 | GLM 5.1 | 1341 | Zhipu AI
6 | Claude Opus 4.7 | 1338 | Anthropic
7 | GLM 5 Turbo | 1336 | Zhipu AI
8 | Claude Sonnet 4.6 | 1331 | Anthropic
9 | Grok 3 Pro | 1315 | xAI
10 | DeepSeek V4 Pro | 1313 | DeepSeek
11 | Muse Spark | 1312 | Meta
12 | GPT-5.5 | 1312 | OpenAI
13 | Qwen 3.6 | 1310 | Alibaba
14 | MiniMax M2.7 | 1310 | MiniMax
15 | GLM 5 | 1307 | Zhipu AI
16 | GPT-5.4 | 1305 | OpenAI
17 | Claude Opus 4.5 | 1301 | Anthropic
18 | Kimi K2.5 (Thinking) | 1301 | Moonshot AI
19 | Gemini 3 Pro Preview | 1300 | Google
20 | GLM 4.7 | 1298 | Zhipu AI

Methodology: Blind pairwise comparison tournaments with Bradley-Terry (Elo) ranking. 5M+ votes across 30+ categories. See designarena.ai for full data.
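
As a rough illustration of how pairwise votes become ratings, the sketch below fits a Bradley-Terry model with the textbook minorization-maximization (MM) iteration and then maps the fitted strengths onto an Elo-like scale. This is not Design Arena's actual pipeline, which the source does not describe. The function names, the vote counts, the 400 * log10 conversion, and the 1300 anchor are all assumptions chosen for the example.

```python
import math


def fit_bradley_terry(wins: dict[tuple[str, str], int], iterations: int = 200) -> dict[str, float]:
    """Fit Bradley-Terry strengths with MM updates.

    wins[(a, b)] is the number of votes where a was preferred over b.
    Returns strengths normalized to sum to 1. Every model needs at least
    one win, otherwise its strength collapses to 0.
    """
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            total_wins = sum(w for (winner, _), w in wins.items() if winner == i)
            denom = 0.0
            for j in models:
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = total_wins / denom if denom else strength[i]
        norm = sum(updated.values())
        strength = {m: s / norm for m, s in updated.items()}
    return strength


def to_elo_scale(strength: dict[str, float], anchor: float = 1300.0) -> dict[str, float]:
    """Map strengths to an Elo-like scale centred on `anchor` (assumed convention)."""
    mean_log = sum(math.log10(s) for s in strength.values()) / len(strength)
    return {m: anchor + 400.0 * (math.log10(s) - mean_log) for m, s in strength.items()}


# Hypothetical vote counts for three placeholder models.
votes = {
    ("A", "B"): 7, ("B", "A"): 3,
    ("B", "C"): 6, ("C", "B"): 4,
    ("A", "C"): 8, ("C", "A"): 2,
}
strengths = fit_bradley_terry(votes)
ratings = to_elo_scale(strengths)
for model in sorted(ratings, key=ratings.get, reverse=True):
    print(f"{model}: {ratings[model]:.0f}")
```

Unlike an online Elo update, this fit uses all votes at once, so the result does not depend on the order in which votes arrived.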

Domain-Specific

Benchmark | Focus | Organized By
FinCanna Index | Financial domain benchmarks | Ishns Institute
HELM - Legal | Legal reasoning and knowledge | Stanford CRFM
HELM - Medical | Medical reasoning and knowledge | Stanford CRFM

Key

  • HELM - Holistic Evaluation of Language Models. A broad multi-task, multi-metric benchmark suite from Stanford CRFM.
  • Design Arena - 5M+ community votes across 30+ categories (code, UI, image, video, audio, 3D). Uses blind pairwise comparison with Bradley-Terry (Elo) ranking.
  • Chatbot Arena - Users vote on blind pairwise comparisons, and models are ranked by Elo score.
  • SWE-bench - Tests models on real GitHub issues. Widely treated as a reference benchmark for real-world coding capability.

For Practical Model Selection

For choosing a model for your use case (pricing, capabilities, tradeoffs), see: