Model Benchmarks & Leaderboards
This page collects standard benchmarks for evaluating model capabilities, covering the most widely used measures for each type of task.
Interactive Benchmark Explorer
Compare models across coding, math, knowledge, reasoning, and design benchmarks. A dash in a cell means no score is listed for that model on that benchmark.
| Model | Company | Coding (HumanEval) | Math (MATH) | Knowledge (MMLU) | Reasoning (GPQA) | Design (Design Arena) |
|---|---|---|---|---|---|---|
| Claude 4 Opus | Anthropic | 96.2% | 96.8% | 92.4% | 84.6% | — |
| Claude Sonnet 4.6 | Anthropic | 93.7% | 94.2% | 90.1% | 79.8% | 1331 Elo |
| GPT-5.5 | OpenAI | 95.1% | 95.5% | 91.8% | 82.1% | 1312 Elo |
| GPT-5.5 Instant | OpenAI | 92.8% | 92.1% | 89.3% | — | — |
| Gemini 3.1 Pro | Google | 94% | 96% | 91.5% | 81.5% | — |
| DeepSeek V4 | DeepSeek | 91.5% | 93.8% | 89.8% | 76.4% | — |
| Llama 4 405B | Meta | 90.2% | 89.6% | 88.2% | 73.1% | — |
| Mistral Large 3 | Mistral | 87.4% | — | — | — | — |
| o3 | OpenAI | — | 97.9% | — | 87.3% | — |
| Claude Opus 4.7 (Thinking) | Anthropic | — | — | — | — | 1350 Elo |
| Claude Opus 4.6 | Anthropic | — | — | — | — | 1346 Elo |
| Claude Opus 4.6 (Thinking) | Anthropic | — | — | — | — | 1344 Elo |
| Kimi K2.6 | Moonshot AI | — | — | — | — | 1343 Elo |
| GLM 5.1 | Zhipu AI | — | — | — | — | 1341 Elo |
| Claude Opus 4.7 | Anthropic | — | — | — | — | 1338 Elo |
| GLM 5 Turbo | Zhipu AI | — | — | — | — | 1336 Elo |
| DeepSeek V4 Pro | DeepSeek | — | — | — | — | 1313 Elo |
| Muse Spark | Meta | — | — | — | — | 1312 Elo |
| GLM 5 | Zhipu AI | — | — | — | — | 1307 Elo |
| Claude Opus 4.5 | Anthropic | — | — | — | — | 1301 Elo |
| Kimi K2.5 (Thinking) | Moonshot AI | — | — | — | — | 1301 Elo |
| Gemini 3 Pro Preview | Google | — | — | — | — | 1300 Elo |
| Gemini 3 Mini | Google | — | — | — | — | 1295 Elo |
| Grok 3 Pro | xAI | — | — | — | — | 1315 Elo |
| Qwen 3.6 | Alibaba | — | — | — | — | 1310 Elo |
| Llama 4 Scout | Meta | — | — | — | — | 1290 Elo |
| GLM 4.7 | Zhipu AI | — | — | — | — | 1298 Elo |
| GLM 4 | Zhipu AI | — | — | — | — | 1285 Elo |
| MiniMax M2.7 | MiniMax | — | — | — | — | 1310 Elo |
| MiMo M2.7 | Xiaomi | — | — | — | — | 1280 Elo |
| GPT-5.4 | OpenAI | — | — | — | — | 1305 Elo |
| GPT-4.1 | OpenAI | — | — | — | — | 1290 Elo |
Design Arena Leaderboards
Real-world model performance across design and code generation tasks. Ranked by 5M+ community votes using the Elo rating system across 30+ categories (code, UI, image, video, audio, 3D).
View all leaderboards on Design Arena →
How to Read Benchmark Scores
Most leaderboards report scores as percentages (higher is better), Elo ratings (higher is better), or ranked positions. Keep in mind:
- Benchmarks are proxies: they measure narrow tasks, not real-world capability
- Leaderboards lag reality: published scores can trail current model versions, and new models often perform better in practice than their benchmark scores suggest
- Context matters: the same model may score very differently under different prompting strategies
- Use multiple signals: don't choose a model based on a single benchmark
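One way to act on the "multiple signals" advice is to compare models by their average rank across several benchmarks rather than by one column. The sketch below is illustrative only: the scores are copied from the benchmark table above, while the `mean_rank` helper and the choice of an unweighted mean rank are assumptions, not a method this page prescribes. It also reports how many benchmarks each model was ranked on, since a model with few reported scores can look stronger than it is.

```python
from statistics import mean

# Scores copied from the benchmark table above (percent, higher is better).
# A benchmark is simply omitted where the table shows no score.
SCORES = {
    "Claude 4 Opus": {"HumanEval": 96.2, "MATH": 96.8, "MMLU": 92.4, "GPQA": 84.6},
    "GPT-5.5": {"HumanEval": 95.1, "MATH": 95.5, "MMLU": 91.8, "GPQA": 82.1},
    "Claude Sonnet 4.6": {"HumanEval": 93.7, "MATH": 94.2, "MMLU": 90.1, "GPQA": 79.8},
    "o3": {"MATH": 97.9, "GPQA": 87.3},
}


def mean_rank(scores):
    """Return {model: (mean rank, benchmarks covered)}; rank 1 is best.

    Hypothetical helper: each benchmark ranks only the models that report
    a score for it. Ties are not handled specially.
    """
    benchmarks = sorted({bench for row in scores.values() for bench in row})
    ranks = {model: [] for model in scores}
    for bench in benchmarks:
        reported = [m for m in scores if bench in scores[m]]
        reported.sort(key=lambda m: scores[m][bench], reverse=True)
        for position, model in enumerate(reported, start=1):
            ranks[model].append(position)
    return {m: (mean(r), len(r)) for m, r in ranks.items() if r}


for model, (avg, covered) in sorted(mean_rank(SCORES).items(), key=lambda kv: kv[1][0]):
    print(f"{model:<20} mean rank {avg:.2f} over {covered} benchmarks")
```

Here o3 gets the best mean rank, but on only two benchmarks, which is exactly the situation where a single number would mislead.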
Current Model Rankings
For up-to-date leaderboard rankings, see:
General & Knowledge
| Benchmark | Focus | Organized By |
|---|---|---|
| HELM | Multi-task evaluation (reasoning, knowledge, accuracy) | Stanford CRFM |
| OpenCompass | Chinese + multilingual benchmarks | OpenCompass |
| AlpacaEval | Chat model rankings by human preference | Stanford |
| Hugging Face Open LLM Leaderboard | Open-source model rankings | Hugging Face |
| Chatbot Arena (LMSYS) | Human preference rankings via pairwise voting | UC Berkeley / LMSYS |
Design & Code Generation
| Benchmark | Focus | Organized By | Top Models |
|---|---|---|---|
| Design Arena | AI design quality (websites, UI, game dev, 3D, data viz, image, video, audio) | Arcada Labs | Claude Opus 4.7 (Thinking) — 1350 Elo |
| SWE-bench | Real-world software engineering tasks (bug fixes, feature implementation) | Princeton | — |
Design Arena Top 20 (Cached May 2026)
| Rank | Model | Elo | Company |
|---|---|---|---|
| 1 | Claude Opus 4.7 (Thinking) | 1350 | Anthropic |
| 2 | Claude Opus 4.6 | 1346 | Anthropic |
| 3 | Claude Opus 4.6 (Thinking) | 1344 | Anthropic |
| 4 | Kimi K2.6 | 1343 | Moonshot AI |
| 5 | GLM 5.1 | 1341 | Zhipu AI |
| 6 | Claude Opus 4.7 | 1338 | Anthropic |
| 7 | GLM 5 Turbo | 1336 | Zhipu AI |
| 8 | Claude Sonnet 4.6 | 1331 | Anthropic |
| 9 | Grok 3 Pro | 1315 | xAI |
| 10 | DeepSeek V4 Pro | 1313 | DeepSeek |
| 11 | Muse Spark | 1312 | Meta |
| 12 | GPT-5.5 | 1312 | OpenAI |
| 13 | Qwen 3.6 | 1310 | Alibaba |
| 14 | MiniMax M2.7 | 1310 | MiniMax |
| 15 | GLM 5 | 1307 | Zhipu AI |
| 16 | GPT-5.4 | 1305 | OpenAI |
| 17 | Claude Opus 4.5 | 1301 | Anthropic |
| 18 | Kimi K2.5 (Thinking) | 1301 | Moonshot AI |
| 19 | GLM 4.7 | 1298 | Zhipu AI |
| 20 | Gemini 3 Pro Preview | 1300 | Google |
Methodology: Blind pairwise comparison tournaments with Bradley-Terry (Elo) ranking. 5M+ votes across 30+ categories. See designarena.ai for full data.
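To get a feel for what an Elo gap means, a rating difference can be converted into an expected head-to-head win rate. The sketch below assumes the conventional Elo formula (base 10, 400-point scale); the page does not state which scale Design Arena uses, so treat the numbers as a rough guide. The function name `expected_win_probability` is hypothetical.

```python
def expected_win_probability(rating_a, rating_b, scale=400.0):
    """Expected probability that A is preferred over B in one comparison.

    Assumes the conventional Elo logistic curve with a 400-point scale;
    the scale Design Arena actually uses is not stated in this document.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings taken from the Top 20 table above.
rank_1_vs_rank_2 = expected_win_probability(1350, 1346)
rank_1_vs_rank_17 = expected_win_probability(1350, 1301)

print(f"1350 vs 1346: {rank_1_vs_rank_2:.3f}")
print(f"1350 vs 1301: {rank_1_vs_rank_17:.3f}")
```

Under this assumption, a gap of a few Elo points corresponds to a win rate barely above 50%, so closely ranked models are best read as roughly tied.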
Domain-Specific
| Benchmark | Focus | Organized By |
|---|---|---|
| FinCanna Index | Financial domain benchmarks | Ishns Institute |
| HELM - Legal | Legal reasoning and knowledge | Stanford CRFM |
| HELM - Medical | Medical reasoning and knowledge | Stanford CRFM |
Key
- HELM: Holistic Evaluation of Language Models, a broad multi-task benchmark suite from Stanford CRFM.
- Design Arena: 5M+ community votes across 30+ categories (code, UI, image, video, audio, 3D). Uses blind pairwise comparison with Bradley-Terry (Elo) ranking.
- Chatbot Arena: Blind pairwise comparisons voted on by users. Models are ranked by Elo score.
- SWE-bench: Tests models on real GitHub issues. Widely treated as a reference benchmark for coding capability.
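The pairwise-vote rankings described in this list can be sketched as a Bradley-Terry fit over win counts. The example below is a minimal illustration, not the pipeline either arena actually runs: the vote counts and model names are made up, the fitting loop is the standard minorization-maximization update, and the 1300 anchor and 400-point scale used to print Elo-like numbers are assumptions. It requires every model to have at least one win.

```python
import math

# Hypothetical vote counts: VOTES[(winner, loser)] = number of votes.
VOTES = {
    ("model_a", "model_b"): 60, ("model_b", "model_a"): 40,
    ("model_a", "model_c"): 70, ("model_c", "model_a"): 30,
    ("model_b", "model_c"): 55, ("model_c", "model_b"): 45,
}


def fit_bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths (normalized to sum to 1)."""
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            total_wins = sum(w for (winner, _), w in wins.items() if winner == i)
            denom = 0.0
            for j in models:
                if j == i:
                    continue
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                if games:
                    denom += games / (strength[i] + strength[j])
            updated[i] = total_wins / denom
        norm = sum(updated.values())
        strength = {m: s / norm for m, s in updated.items()}
    return strength


def to_elo_scale(strength, anchor=1300.0):
    """Map strengths to an Elo-like scale centered on `anchor` (assumed)."""
    log_mean = sum(math.log10(s) for s in strength.values()) / len(strength)
    return {m: anchor + 400.0 * (math.log10(s) - log_mean) for m, s in strength.items()}


ratings = to_elo_scale(fit_bradley_terry(VOTES))
for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.0f}")
```

With this mapping, the difference between two ratings reproduces the fitted win probability under the conventional Elo curve, which is why the two names are often used together.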
For Practical Model Selection
For choosing a model for your use case (pricing, capabilities, tradeoffs), see:
- Models Guide - Current models with detailed specs
- Models Decision Guide - How to choose based on your needs