Models Decision Guide
How to choose the right model for your specific task. Reasoning, speed, cost, and capabilities compared.
Find the Right Model
Claude Opus 4.8
Anthropic: Complex reasoning, long documents
Claude Sonnet 4.6
Anthropic: Default choice for most tasks
Claude Haiku 4.5
Anthropic: Fast & cheap for simple tasks
GPT-5.5
OpenAI: Flagship reasoning & coding
GPT-5.4
OpenAI: Best value for most production workloads
GPT-5.4 mini
OpenAI: Cost-efficient coding & agents
o3
OpenAI: Hardest problems (math, logic)
o1
OpenAI: Hard reasoning at lower cost than o3
Gemini 3.1 Pro
Google: Massive documents & vision
DeepSeek V4 Flash
DeepSeek: Extreme budget, minimal quality loss
DeepSeek V4 Pro
DeepSeek: Frontier quality at fraction of cost
Llama 4
Meta: Private, self-hosted
Quick Decision Tree
1. What's your primary constraint?
├─ Cost → DeepSeek V4 Flash or GPT-5.5 Instant
├─ Speed → Claude Haiku or GPT-5.5 Instant
├─ Reasoning/Quality → Claude Opus or o3
├─ Long context → Gemini 3.1 Pro (1M tokens)
└─ Privacy/On-prem → Llama 4 self-hosted
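The first question of the tree is effectively a lookup from constraint to candidate models. A minimal sketch in Python, where the dictionary keys and the function name `models_for_constraint` are illustrative choices and not part of any API:

```python
# Hypothetical lookup table mirroring question 1 of the decision tree.
# The keys ("cost", "speed", ...) are names invented for this sketch.
CONSTRAINT_TO_MODELS = {
    "cost": ["DeepSeek V4 Flash", "GPT-5.5 Instant"],
    "speed": ["Claude Haiku", "GPT-5.5 Instant"],
    "quality": ["Claude Opus", "o3"],
    "long_context": ["Gemini 3.1 Pro"],
    "privacy": ["Llama 4 (self-hosted)"],
}


def models_for_constraint(constraint: str) -> list[str]:
    """Return the candidate models for a primary constraint (case-insensitive)."""
    key = constraint.strip().lower()
    if key not in CONSTRAINT_TO_MODELS:
        raise ValueError(f"unknown constraint: {constraint!r}")
    return CONSTRAINT_TO_MODELS[key]


if __name__ == "__main__":
    print(models_for_constraint("Cost"))
```

Keeping the mapping in data rather than in nested `if` statements makes it easy to update when the recommendations change.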
2. What's your use case?
├─ Writing/Analysis → Claude Sonnet (default)
├─ Code generation → Claude Sonnet or GPT-4o
├─ Reasoning/Math → o3 or Claude Opus
├─ Vision/Images → GPT-4o or Gemini 3.1
├─ Document processing → Gemini 3.1 Pro (1M context)
└─ Real-time → Perplexity or Claude with web search
Model Categories
Tier 1: Reasoning Models (Slow but Brilliant)
Claude Opus 4.8 (Anthropic)
- Context: 1M tokens (read a whole book)
- Speed: Slow (think for 30+ seconds)
- Cost: $15-75 per 1M tokens
- Best for: Complex reasoning, multi-step logic, deep analysis
- Why: Most capable model available. Use when Sonnet struggles.
- When NOT to use: Simple tasks, real-time applications, cost-sensitive work
o3 (OpenAI)
- Context: 128K tokens
- Speed: Very slow (extended thinking)
- Cost: Premium pricing
- Best for: Extremely hard problems (math, coding competitions, logic puzzles)
- Why: Breakthrough reasoning capability
- When NOT to use: General tasks (overkill), any time-sensitive work
o1 (OpenAI)
- Context: 128K tokens
- Speed: Slow but faster than o3
- Cost: $15 per 1M input, $60 per 1M output
- Best for: Difficult reasoning without needing extended thinking
- Why: Good reasoning at reasonable speed
- When NOT to use: Simple tasks, real-time
Tier 2: Default Models (Fast & Smart)
Claude Sonnet 4.6 (Anthropic) ⭐ Recommended Default
- Context: 200K tokens
- Speed: Fast (2-5 seconds)
- Cost: $3-15 per 1M tokens
- Best for: Almost everything - writing, code, analysis
- Why: Best balance of speed, quality, cost
- When to use: Your first choice for any task
GPT-4o (OpenAI)
- Context: 128K tokens
- Speed: Fast (2-5 seconds)
- Cost: $2-8 per 1M tokens
- Best for: All-around work, especially vision/images
- Why: Extremely reliable, good at everything
- When to use: When you need vision, or want OpenAI’s reliability
Gemini 3.1 Pro (Google)
- Context: 1M tokens (!!)
- Speed: Medium (5-10 seconds)
- Cost: $2-12 per 1M tokens
- Best for: Document analysis, long-context research
- Why: Its 1M-token window can hold entire books in a single prompt
- When to use: When context window matters more than speed
Tier 3: Speed-Focused (Fast & Cheap)
Claude Haiku 4.5 (Anthropic)
- Context: 200K tokens
- Speed: Ultra-fast (under 1 second)
- Cost: $1-5 per 1M tokens
- Best for: Classification, routing, summaries, high volume
- Why: Surprisingly capable despite being the smallest
- When to use: When speed is critical or volume is high
GPT-4 Turbo (OpenAI)
- Context: 128K tokens
- Speed: Fast
- Cost: $10-30 per 1M tokens ($0.01-0.03 per 1K)
- Best for: Production systems, high volume
- Why: Reliable, cheap
- When to use: Cost-sensitive production
DeepSeek V4 Flash (DeepSeek)
- Context: 128K tokens
- Speed: Fast
- Cost: $0.14-0.28 per 1M tokens (!!)
- Vision: ❌ Text only. Use DeepSeek VL for image tasks.
- Best for: Budget-conscious work, routing, high volume
- Why: Shockingly cheap and good quality
- When to use: Cost is the #1 constraint
DeepSeek VL (DeepSeek)
- Context: 128K tokens
- Speed: Medium
- Cost: ~$2.19 per 1M tokens
- Vision: ✅ Images
- Best for: Vision tasks within DeepSeek ecosystem
- Why: DeepSeek’s dedicated vision model.
- When to use: You need image understanding at DeepSeek pricing.
Tier 4: Specialized
GPT-4 Vision (OpenAI)
- Best for: Image analysis, OCR, visual understanding
- Why: GPT-4o is better and cheaper now
- When to use: Legacy systems
Claude 3 Opus (Anthropic, previous version)
- Replaced by the Opus 4.x line (currently Opus 4.8)
- When to use: Nowhere; use Opus 4.8 instead
Open-Source Models (Llama, Mistral, DeepSeek)
- Best for: Privacy, on-premise, fine-tuning
- Why: Full control, no API costs
- When to use: When data can’t leave your infrastructure
- How: Run locally with Ollama or LM Studio
Decision Matrix By Use Case
| Task | Model | Why | Cost |
|---|---|---|---|
| Customer support chatbot | Haiku | Fast, cheap | $2-5/month |
| Blog post writing | Sonnet | Quality + speed balance | $1-3/month |
| Code generation | Sonnet or GPT-4o | Both excellent | $2-5/month |
| Complex reasoning | Opus or o3 | Need the power | $50-200/month |
| Document analysis (100 pages) | Gemini 3.1 Pro | Fits in the 1M context window | $2-10/month |
| Real-time Q&A | Perplexity | Web search built-in | Free-$20/month |
| Vision/image tasks | GPT-4o | Best at images | $2-5/month |
| Routing/classification | Haiku | Speed + cheap | $1-2/month |
| Data extraction | Sonnet + structured output | Reliable parsing | $2-5/month |
| High volume (1000+ requests/day) | Haiku or V4 Flash | Need cheap inference | $10-50/month |
Cost Comparison for Common Scenarios
Scenario 1: Personal Research Assistant
Use case: 10 questions/day, 2000 input tokens avg, 500 output tokens
| Model | Monthly Cost | Speed | Quality |
|---|---|---|---|
| Claude Sonnet | $0.90 | Fast | Excellent |
| GPT-4o | $0.60 | Fast | Excellent |
| Gemini 3.1 Pro | $0.60 | Medium | Excellent |
| DeepSeek V4 | $0.33 | Fast | Good |
Recommendation: Sonnet or GPT-4o (negligible difference)
Scenario 2: High-Volume Classification (10,000 req/day)
Use case: 500 input tokens, 50 output tokens per request
| Model | Monthly Cost | Speed | Quality |
|---|---|---|---|
| Claude Haiku | $45 | Ultra-fast | Good |
| GPT-4 Turbo | $15 | Fast | Good |
| DeepSeek Flash | $7 | Fast | Good |
Recommendation: DeepSeek Flash (about 6x cheaper than Haiku in the table above)
Scenario 3: Complex Reasoning (50 req/day)
Use case: 3000 input tokens, 2000 output tokens per request
| Model | Monthly Cost | Speed | Quality |
|---|---|---|---|
| Claude Opus | $675 | Slow | Excellent |
| Claude Sonnet | $225 | Fast | Excellent |
| o3 | $2000 | Very slow | Best-in-class |
| o1 | $450 | Slow | Excellent |
Recommendation: Sonnet (best balance), o3 (if you need the best and can wait)
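Estimates like the ones in these scenarios follow one formula: requests per month, times tokens per request, times price per token, summed for input and output. A minimal sketch, assuming a 30-day month and separate input and output prices quoted per 1M tokens; the function name `monthly_cost` and its parameters are illustrative, and the result depends entirely on the prices you plug in, so it will not necessarily match the table figures:

```python
def monthly_cost(
    requests_per_day: float,
    input_tokens: float,
    output_tokens: float,
    input_price_per_m: float,   # USD per 1M input tokens
    output_price_per_m: float,  # USD per 1M output tokens
    days: int = 30,             # assumption: 30-day month
) -> float:
    """Estimate monthly API spend in USD for a steady workload."""
    requests = requests_per_day * days
    input_cost = requests * input_tokens / 1_000_000 * input_price_per_m
    output_cost = requests * output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


if __name__ == "__main__":
    # Scenario 1 usage with Sonnet list prices from the specifications table
    # ($3 input / $15 output per 1M tokens).
    estimate = monthly_cost(10, 2000, 500, 3.00, 15.00)
    print(f"${estimate:.2f} per month")
```

Output tokens usually dominate for reasoning-heavy workloads, so it is worth measuring both token counts on real traffic before comparing models.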
Which Model for Your Project?
If You’re Building a Startup/Product
Start with: Claude Sonnet + Haiku combo
- Sonnet for complex tasks
- Haiku for high-volume/cheap tasks
- Reason: Cost-effective, reliable, good quality
Scale with: Opus if you hit reasoning limits
If You’re Prototyping/Learning
Start with: GPT-4o or Claude Sonnet (free tier)
- Both have good free credits
- Reason: Simplest to get started
Explore: Try multiple models on the same task to see tradeoffs
If Cost Is Your #1 Constraint
Use: DeepSeek V4 Flash + Sonnet combo
- Flash for everything possible
- Sonnet when Flash isn’t good enough
- Reason: 10x cheaper overall
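The Flash-first combo can be sketched as a fallback cascade: answer with the cheap model, and escalate only when a quality check fails. In this sketch the model callables and the `is_good_enough` check are placeholders you would supply yourself (for example a length check, a schema validator, or a grader prompt); none of them correspond to a real SDK call.

```python
from typing import Callable


def answer_with_fallback(
    prompt: str,
    cheap_model: Callable[[str], str],
    strong_model: Callable[[str], str],
    is_good_enough: Callable[[str], bool],
) -> tuple[str, str]:
    """Try the cheap model first; call the strong model only if the check fails.

    Returns (tier, answer), where tier is "cheap" or "strong".
    """
    draft = cheap_model(prompt)
    if is_good_enough(draft):
        return "cheap", draft
    return "strong", strong_model(prompt)


if __name__ == "__main__":
    # Stub models standing in for real API calls.
    flash = lambda p: "" if "prove" in p else f"flash: {p}"
    sonnet = lambda p: f"sonnet: {p}"
    non_empty = lambda text: bool(text.strip())

    print(answer_with_fallback("summarize this", flash, sonnet, non_empty))
    print(answer_with_fallback("prove this lemma", flash, sonnet, non_empty))
```

Note that an escalated request pays for both calls, so the saving depends on how often the cheap answer passes the check.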
If You Need Long Context (1000+ page documents)
Use: Gemini 3.1 Pro (1M context)
- Reason: Few models can take that much text in a single prompt
If You Need On-Premise/Privacy
Use: Llama 4 (run locally with Ollama)
- Cost: Free (just electricity)
- Tradeoff: Slower, less capable
- Reason: Data never leaves your machine
If Speed Matters Most
Use: Haiku with caching
- Response time: under 1 second
- Or: o1 if you need reasoning (slower but better)
- Reason: Trade quality for speed when needed
Optimization Strategies
Strategy 1: Routing (Mixture of Models)
Use a cheap model to decide which model to use:
Input: User question
↓
Haiku: "Is this question simple? (yes/no)"
├─ yes → Use Haiku for answer (cheap)
└─ no → Use Sonnet for answer (better)
Savings: 80% of questions use Haiku, 20% use Sonnet = 30% cost reduction
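The routing flow above can be sketched in a few lines, together with a helper that estimates the saving. Everything here is an assumption for illustration: `route`, `routing_savings`, and the callables stand in for real API calls, and costs are per-question averages you would measure yourself.

```python
from typing import Callable

Model = Callable[[str], str]


def route(question: str, is_simple: Callable[[str], bool],
          cheap: Model, strong: Model) -> str:
    """Classify first, then send the question to exactly one model."""
    return cheap(question) if is_simple(question) else strong(question)


def routing_savings(simple_share: float, cheap_cost: float,
                    strong_cost: float, classifier_cost: float = 0.0) -> float:
    """Fractional saving versus sending every question to the strong model."""
    blended = (classifier_cost
               + simple_share * cheap_cost
               + (1 - simple_share) * strong_cost)
    return 1 - blended / strong_cost


if __name__ == "__main__":
    haiku = lambda q: f"haiku: {q}"
    sonnet = lambda q: f"sonnet: {q}"
    short_question = lambda q: len(q.split()) < 8  # toy classifier

    print(route("What is 2 + 2?", short_question, haiku, sonnet))
    print(f"{routing_savings(0.8, 1.0, 3.0):.0%} saved")
```

The saving is sensitive to the price ratio between the two models and to the cost of the classification call itself, so treat any single percentage as workload-specific.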
Strategy 2: Caching
If you analyze the same document repeatedly:
First request: Analyze document X (full cost)
Second request: Analyze document X (90% cheaper - cached)
Savings: With caching, 2nd-10th requests are 90% cheaper
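The effect of the 90% discount on total spend can be worked out directly. This is a simplified sketch: it applies the discount to the whole request, whereas providers typically discount only the cached input portion and may charge extra to write the cache. The function name `cost_with_caching` is illustrative.

```python
def cost_with_caching(full_cost: float, requests: int,
                      cached_discount: float = 0.90) -> float:
    """Total cost when the first request is full price and the rest are cached.

    Simplification: the discount is applied to the entire request cost.
    """
    if requests <= 0:
        return 0.0
    cached_cost = full_cost * (1 - cached_discount)
    return full_cost + (requests - 1) * cached_cost


if __name__ == "__main__":
    total = cost_with_caching(full_cost=1.00, requests=10)
    print(f"${total:.2f} with caching vs ${1.00 * 10:.2f} without")
```

Under these assumptions, ten requests at a full cost of $1.00 come to about $1.90 instead of $10.00.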
Strategy 3: Batch Processing
Don’t ask questions one-at-a-time:
❌ Bad: 1000 questions, each call costs $0.01 = $10
✅ Good: Batch 100 questions per call, 10 calls = $0.10
Savings: 100x for certain APIs
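Grouping questions into fixed-size batches is straightforward to sketch. The helpers below are hypothetical (`batches` and `build_batch_prompt` are names invented here), and the prompt wording is only one way to ask for numbered answers; whether batching actually reduces cost depends on how the API bills.

```python
from typing import Iterator, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_prompt(questions: Sequence[str]) -> str:
    """Combine several questions into one numbered prompt."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, start=1))
    return "Answer each question on its own numbered line.\n" + numbered


if __name__ == "__main__":
    questions = [f"Question {n}" for n in range(1000)]
    prompts = [build_batch_prompt(group) for group in batches(questions, 100)]
    print(len(prompts), "calls instead of", len(questions))
```

Very large batches can degrade answer quality or hit output limits, so the batch size is worth tuning on your own task.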
Vision / Image Input Support
Not all models can process images. Here’s which ones can and what they support:
| Model | Vision | Type | Notes |
|---|---|---|---|
| Claude Sonnet 4.6 | ✅ | Images | Strong image analysis, charts, documents |
| Claude Opus 4.8 | ✅ | Images | Best for detailed visual reasoning |
| GPT-5.5 | ✅ | Images | Solid multimodal |
| GPT-5.5 Instant | ❌ | Text only | Fastest, no vision |
| o3 | ❌ | Text only | Reasoning-only model |
| Gemini 3.1 Pro | ✅ | Images + Video | Best multimodal support |
| DeepSeek V4 | ❌ | Text only | Use DeepSeek VL for vision |
| DeepSeek V4 Flash | ❌ | Text only | Use DeepSeek VL for vision |
| DeepSeek VL | ✅ | Images | DeepSeek’s dedicated vision model |
| Llama 4 | ✅ | Images | Open-source multimodal |
Key takeaway: If your task involves analyzing images, charts, or screenshots, pick a model with ✅. DeepSeek V4 and V4 Flash are excellent for text-only tasks but can’t process images at all - use DeepSeek VL if you need vision in the DeepSeek ecosystem.
What Changed Recently (May 2026)
- o3 released (best reasoning ever, very expensive)
- Claude 4.7 released (Opus version, improved reasoning)
- DeepSeek V4 released (open-source reasoning, cheaper than ever)
- Gemini 3.1 released (1M context window, multimodal)
- GPT-5.5 released (minor improvements over 4o)
Implication: 2026 is dominated by reasoning models and long-context models.
Common Mistakes
❌ Using Opus for everything - Overkill 95% of the time
✅ Use Sonnet by default, Opus when needed
❌ Ignoring cost - Can add up fast with high volume
✅ Calculate your actual usage, optimize routing
❌ Assuming newer = better - Sometimes not true
✅ Test models on your actual task
❌ Using same model for everything - Suboptimal
✅ Use a mix (Haiku for cheap, Sonnet for quality)
Model Specifications & Capabilities
All current models with pricing, context windows, and vision support. Source of truth: src/data/models.ts.
| Model | Company | Context | Input/Output | Vision | Notes |
|---|---|---|---|---|---|
| Claude Opus 4.8 | Anthropic | 1M | $5/$25 per 1M | ✅ | Most capable Claude (May 2026). Best for complex reasoning and agentic coding. Adaptive thinking. Fast Mode $10/$50. |
| Claude Opus 4.8 (Thinking) | Anthropic | 1M | $5/$25 per 1M | ✅ | Top-ranked on Design Arena. Thinking mode enabled. |
| Claude Opus 4.7 | Anthropic | 1M | $5/$25 per 1M | ✅ | Previous flagship (superseded by Opus 4.8, May 2026). |
| Claude Opus 4.6 | Anthropic | 1M | $5/$25 per 1M | ✅ | Previous gen flagship. Still highly capable. |
| Claude Opus 4.6 (Thinking) | Anthropic | 1M | $5/$25 per 1M | ✅ | Previous gen with thinking mode. Strong on design benchmarks. |
| Claude Opus 4.5 | Anthropic | 200K | $5/$25 per 1M | ❌ | Earlier generation. Still available for certain use cases. |
| Claude Sonnet 4.6 | Anthropic | 1M | $3/$15 per 1M | ✅ | Best balance of speed & quality. Default pick. |
| Claude Haiku 4.5 | Anthropic | 200K | $1/$5 per 1M | ✅ | Ultra-fast, cheapest Claude. |
| GPT-5.5 | OpenAI | 1M | $5/$30 per 1M | ✅ | Flagship. Reasoning levels none→xhigh. Strong all-around. |
| GPT-5.4 | OpenAI | 1M | $2.50/$15 per 1M | ✅ | Affordable professional tier. Near-flagship capability. |
| GPT-5.4 mini | OpenAI | 400K | $0.75/$4.50 per 1M | ❌ | Strong mini for coding & agents. Fast. |
| GPT-5.4 nano | OpenAI | 400K | $0.20/$1.25 per 1M | ❌ | Fastest, cheapest. Ideal for high-throughput. |
| GPT-4.1 | OpenAI | 128K | $2/$8 per 1M | ❌ | Previous gen. Superseded by GPT-5.4 mini. |
| o3 | OpenAI | 128K | $2/$8 per 1M | ❌ | Dedicated reasoning model. Spends tokens on hidden thinking. 87% cheaper than o1. |
| o1 | OpenAI | 128K | $15/$60 per 1M | ❌ | Earlier reasoning model. Superseded by o3. |
| Gemini 3.1 Pro | Google | 1M | $2/$12 per 1M | ✅ | Flagship Gemini. Best context window, excellent multimodal. Prompts >200K billed $4/$18. |
| Gemini 3.5 Flash | Google | 1M | $1.50/$9 per 1M | ✅ | Fast Gemini. $0.15/M cached input (90% off). Free tier on AI Studio. |
| DeepSeek V4 Flash | DeepSeek | 1M | $0.14/$0.28 per 1M | ❌ | Cost leader. MIT license. FREE on OpenCode. |
| DeepSeek V4 Pro | DeepSeek | 1M | $0.435/$0.87 per 1M | ❌ | Premium tier. Thinking mode default. 75% price cut now permanent (announced May 22, 2026). |
| DeepSeek R1 | DeepSeek | 1M | $0.435/$0.87 per 1M | ❌ | Deprecated as standalone; folded into V4 Flash thinking mode (deepseek-reasoner). Open-weight. |
| DeepSeek V4 | DeepSeek | 128K | $0.55/$2.19 per 1M | ❌ | Previous gen. Superseded by V4 Flash and Pro. |
| Llama 4 | Meta | varies | Free (self-host) | ✅ | Open weights. Llama 4 Community License. Run locally. |
| Llama 4 Scout | Meta | 10M | Free (self-host) | ✅ | MoE variant. 10M context window, 109B total params. |
| Muse Spark | Meta | varies | API-only (preview) | ❌ | Meta's first proprietary (closed-weight) frontier model, Apr 2026. Powers Meta AI; private-preview API. NOT open-weight. |
| Grok 4.3 | xAI | 1M | $1.25/$2.50 per 1M | ✅ | xAI flagship (Apr 2026). Real-time X data. Legacy Grok 3/4 aliases route here. |
| Grok 3 Pro | xAI | 128K | $3/$15 per 1M | ✅ | Previous gen. Routes to Grok 4.3. |
| Kimi K2.6 | Moonshot AI | 256K | ~$0.60/$2.50 per 1M | ✅ | Latest Kimi. Top-5 on Design Arena. Agent swarm capabilities. ($0.16/M cached input.) |
| Kimi K2.5 (Thinking) | Moonshot AI | 256K | ~$0.55/$2.19 per 1M | ❌ | Previous gen with thinking mode. |
| GLM 5.1 | Zhipu AI | 200K | ~$0.98/$3.08 per 1M | ❌ | Zhipu's flagship. Top-5 on Design Arena. Open-weight. |
| GLM 5 Turbo | Zhipu AI | 128K | ~$0.30/$1.00 per 1M | ❌ | Fast inference variant of GLM 5. |
| GLM 5 | Zhipu AI | 128K | ~$0.60/$1.92 per 1M | ❌ | Base GLM 5 model. Strong multilingual performance. |
| GLM 4.7 | Zhipu AI | 128K | ~$0.30/$1.00 per 1M | ❌ | Mid-cycle update between GLM 4 and GLM 5. |
| GLM 4 | Zhipu AI | 128K | ~$0.20/$0.80 per 1M | ❌ | Previous gen. Still solid for Chinese-language tasks. |
| Qwen 3.6 | Alibaba | 128K | ~$0.33/$1.95 per 1M | ✅ | Alibaba's flagship. Strong across all benchmarks. (DashScope direct pricing.) |
| MiniMax M2.7 | MiniMax | 128K | ~$0.30/$1.20 per 1M | ✅ | Independent Chinese AI lab. Strong long-context performance. |
| MiMo V2.5 | Xiaomi | 128K | ~$1/$3 per 1M | ✅ | Xiaomi's multimodal model. |
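A table like this is easy to query once it is held as structured data. The sketch below is an assumption about how such data might be modeled, not the contents of src/data/models.ts: the `ModelSpec` class and `cheapest` function are hypothetical, and only a handful of rows from the table are copied in.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ModelSpec:
    name: str
    context_tokens: int
    input_price: float   # USD per 1M input tokens
    output_price: float  # USD per 1M output tokens
    vision: bool


# A few rows copied from the table above.
MODELS = [
    ModelSpec("Claude Opus 4.8", 1_000_000, 5.00, 25.00, True),
    ModelSpec("Claude Sonnet 4.6", 1_000_000, 3.00, 15.00, True),
    ModelSpec("Claude Haiku 4.5", 200_000, 1.00, 5.00, True),
    ModelSpec("GPT-5.4 mini", 400_000, 0.75, 4.50, False),
    ModelSpec("DeepSeek V4 Flash", 1_000_000, 0.14, 0.28, False),
]


def cheapest(models: Iterable[ModelSpec], min_context: int = 0,
             need_vision: bool = False) -> Optional[ModelSpec]:
    """Return the lowest-priced model meeting the requirements, or None."""
    candidates = [
        m for m in models
        if m.context_tokens >= min_context and (m.vision or not need_vision)
    ]
    # Crude price proxy: input price plus output price per 1M tokens.
    return min(candidates, key=lambda m: m.input_price + m.output_price,
               default=None)


if __name__ == "__main__":
    pick = cheapest(MODELS, min_context=500_000, need_vision=True)
    print(pick.name if pick else "no match")
```

Summing input and output prices is only a rough ranking; a workload-weighted cost gives a fairer comparison when input and output volumes differ a lot.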
Model Capability Matrix
How models perform across key tasks, rated on a 1-5 scale based on benchmark scores and real-world performance.
| Task | Claude 4 Opus | Claude Sonnet 4.6 | GPT-5.5 | GPT-5.5 Instant | Gemini 3.1 Pro | DeepSeek V4 | DeepSeek VL | o3 | Llama 4 405B | Kimi K2.6 | GLM 5.1 | Muse Spark | DeepSeek V4 Pro | Claude Opus 4.6 | Grok 3 Pro | Qwen 3.6 | Llama 4 Scout | Gemini 3 Mini | MiniMax M2.7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Coding (generate and refactor code) | 5: HumanEval 96.2%. Best-in-class code generation | 4: HumanEval 93.7%. Strong daily driver | 5: HumanEval 95.1%. Excellent for most tasks | 4: HumanEval 92.8%. Fast, good quality | 4: HumanEval 94.0%. Strong, especially with long context | 4: HumanEval 91.5%. Surprisingly capable for price | 4: HumanEval 90%+. Strong coder with vision understanding | 5: SWE-bench 71.7%. Top-tier for complex coding | 3: HumanEval 90.2%. Good open-source option | 4: Strong coder, agentic capabilities | 4: Strong multilingual coder | 4: Strong coder, Llama lineage | 4: Strong coder, premium variant | 5: HumanEval ~95%. Excellent coder, slightly behind 4.7 | 4: Strong coder, real-time data access | 4: Strong multilingual coder | 3: Decent coder, MoE efficiency | 4: Good coder for its size | 4: Strong coder |
| Math (mathematical reasoning) | 5: MATH 96.8%. Excellent mathematical reasoning | 4: MATH 94.2%. Strong, suitable for most needs | 5: MATH 95.5%. Very strong math capability | 4: MATH 92.1%. Fast, good for basic math | 5: MATH 96.0%. Excellent math performance | 4: MATH 93.8%. Strong for the price | 4: Good math, similar to DeepSeek V4 | 5: MATH 97.9%. Best-in-class math | 3: MATH 89.6%. Decent open-source option | 4: Solid math reasoning | 4: Good math reasoning | 4: Good math reasoning | 4: Good math reasoning | 5: MATH ~95%. Strong math capabilities | 4: Good math reasoning | 4: Good math reasoning | 3: Adequate math reasoning | 4: Solid math reasoning | 4: Good math reasoning |
| Reasoning (complex multi-step reasoning) | 5: GPQA 84.6%. Deep, nuanced reasoning | 4: GPQA 79.8%. Strong reasoning for most tasks | 4: GPQA 82.1%. Capable multi-step reasoning | 3: GPQA 78.0%. Good, but trades depth for speed | 4: GPQA 81.5%. Solid reasoning, improved with 3.1 | 4: GPQA 76.4%. Remarkably capable for cost | 4: Solid reasoning with visual context | 5: GPQA 87.3%. State-of-the-art reasoning | 3: GPQA 73.1%. Competitive open-source | 4: Strong reasoning with thinking mode | 4: Solid reasoning capabilities | 4: Competitive reasoning | 4: Solid reasoning capabilities | 5: Deep reasoning, thinking mode available | 4: Solid multi-step reasoning | 4: Solid reasoning | 3: Competitive reasoning for size | 3: Adequate reasoning | 4: Solid reasoning |
| Writing (prose, analysis, long-form) | 5: Best prose, nuance, and voice | 5: Excellent writing for daily use | 4: Very good, slightly less nuanced | 3: Adequate, optimized for speed | 4: Strong, especially analytical writing | 3: Decent, lags behind top models | 3: Decent, vision-enhanced writing | 3: Reasoning-focused, not writing-optimized | 3: Solid for open-source | 4: Good long-form writing | 4: Strong multilingual writing | 3: Adequate prose generation | 3: Decent writing quality | 5: Excellent prose quality | 3: Adequate, not writing-optimized | 4: Strong multilingual writing | 3: Solid for open-source | 3: Decent, speed-optimized | 3: Adequate writing quality |
| Vision (image understanding) | 4: Good image understanding | 4: Strong vision capability | 4: Multimodal, strong image analysis | 3: Basic vision support | 5: Best-in-class multimodal | 0: Text-only model. Use DeepSeek VL for vision. | 4: DeepSeek's dedicated vision model. Strong image understanding. | 3: Text-only reasoning model | 3: Basic multimodal support | 3: Basic vision support | 3: Basic vision support | 3: Basic multimodal support | 0: Text-only. Use DeepSeek VL for vision. | 4: Good image understanding | 3: Basic vision support | 3: Basic vision support | 3: Basic multimodal support | 4: Good vision for speed-optimized | 3: Basic vision support |
| Long Context (processing large documents) | 5: 400K context. Excellent long-doc processing | 4: 200K context. Very capable with long docs | 4: 128K context. Solid long context | 3: 128K context. Same window, faster processing | 5: 1M context. Industry-leading context window | 3: 128K context. Standard context window | 3: 128K context. Same context window as V4 | 3: 128K context. Focuses on depth, not span | 3: 128K context. Standard for open-source | 5: 256K context. Excellent long-context, agent swarm | 3: 128K context. Standard context window | 3: 128K context. Standard for open-weight | 3: 128K context. Standard context window | 4: 200K context. Solid long-doc processing | 3: 128K context. Standard context window | 3: 128K context. Standard context window | 5: 10M context. Massive context window, best in class | 3: 128K context. Standard context window | 4: 128K context. Strong long-context performance |
| Agentic (tool use, multi-step tasks) | 5: Excellent tool use and reasoning | 5: SWE-bench 49%. Best-in-class agentic coding | 4: Strong function calling | 3: Fast but less reliable | 4: Good tool use, improving | 3: Basic function calling | 3: Basic function calling with vision | 4: Reasoning-first agentic | 3: Improving with each release | 5: Up to 100 specialized agents in swarm | 3: Basic agentic capabilities | 4: Good tool use | 3: Basic function calling | 5: Excellent tool use | 3: Basic function calling | 3: Basic agentic capabilities | 3: Basic tool use | 3: Basic agentic capabilities | 3: Basic agentic capabilities |
| Speed (response latency) | 2: Slowest, but most thoughtful | 3: Moderate speed | 4: Fast for frontier quality | 5: Fastest in class, <1s responses | 4: Consistently fast | 4: Good speed for the price | 3: Slower than V4 due to vision processing | 1: Slow deliberative reasoning | 3: Varies by deployment | 3: Moderate speed | 4: Fast inference | 4: Fast inference | 4: Fast inference | 2: Slower, thoughtful responses | 4: Fast inference | 4: Fast inference | 3: MoE, moderate speed | 5: Fastest Gemini variant | 4: Fast inference |
| Cost Efficiency (value per dollar) | 2: $15/$75 per 1M. Most expensive per token | 3: $3/$15 per 1M. Reasonable for quality | 3: $2/$8 per 1M. Competitive pricing | 4: $0.05/$0.20 per 1M. Very cheap, fast | 4: $2/$12 per 1M. Good value for long context | 5: $0.55/$2.19 per 1M. 10-50x cheaper than peers | 4: Competitive pricing for vision tasks | 1: $10-60 per 1M output. Most expensive reasoning | 5: Free (self-host). Open-source, no API costs | 4: Competitive pricing | 4: Competitive pricing | 5: Free (self-host). Open-weight, no API costs | 4: Good value for quality | 2: $15/$75 per 1M. Expensive but capable | 2: $3/$15 per 1M. Premium pricing | 4: Competitive pricing | 5: Free (self-host). Open-weight, no API costs | 4: $1/$6 per 1M. Affordable for quality | 4: Competitive pricing |
| Score | Meaning |
|---|---|
| 5 (dark green) | Best in class. Top performer for this task. |
| 4 (light green) | Strong. Excellent for most use cases. |
| 3 (yellow) | Good. Capable but not top-tier. |
| 2 (orange) | Fair. Works for simple cases. |
| 1 (red) | Limited. Not recommended for this task. |
Scores combine public benchmarks and real-world usage as of May 2026. “Speed” measures output latency, not throughput. For detailed benchmark numbers, see the Benchmarks page.
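One way to use the matrix is to weight the tasks you care about and rank models by the weighted sum. The sketch below copies a few scores from the matrix; the `rank` function and the weighting scheme are illustrative assumptions, not how the scores themselves were produced.

```python
# Scores copied from the capability matrix for four models and four tasks.
SCORES = {
    "Claude 4 Opus": {"coding": 5, "reasoning": 5, "speed": 2, "cost": 2},
    "Claude Sonnet 4.6": {"coding": 4, "reasoning": 4, "speed": 3, "cost": 3},
    "DeepSeek V4": {"coding": 4, "reasoning": 4, "speed": 4, "cost": 5},
    "o3": {"coding": 5, "reasoning": 5, "speed": 1, "cost": 1},
}


def rank(weights: dict[str, float]) -> list[str]:
    """Order models by weighted score, highest first.

    Tasks missing from `weights` count as zero; ties keep the order in SCORES.
    """
    def total(model: str) -> float:
        return sum(weights.get(task, 0) * score
                   for task, score in SCORES[model].items())

    return sorted(SCORES, key=total, reverse=True)


if __name__ == "__main__":
    # A cost-sensitive coding workload: coding and cost weighted equally.
    print(rank({"coding": 1, "cost": 1}))
```

Changing the weights shifts the ranking, which is the point: the "best" model depends on which rows of the matrix matter for your task.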
Where to Start
- Pick a default: Claude Sonnet (recommended) or GPT-4o
- Use free credits to test on your actual task
- Measure cost: How many tokens? How many requests?
- Optimize: Add Haiku for cheap tasks, Opus only when Sonnet fails
- Monitor: Track costs monthly
See Also:
- Economics of AI - Cost analysis and optimization
- Tools Guide - Where to access models
- Builder Path - How to build with APIs