Models Decision Guide


Choose the right LLM for your use case - reasoning, speed, cost, and capabilities

Key Takeaways

Choose by primary constraint: cost, speed, reasoning, context, or privacy
Claude Sonnet is the best default — good balance of quality, speed, and cost
DeepSeek V4 Flash is the best budget option at $0.14/$0.28 per 1M
For privacy-sensitive apps, self-host Llama 4 or run on-device Phi-4

How to choose the right model for your specific task. Reasoning, speed, cost, and capabilities compared.

Find the Right Model


Model	Provider	Context	Speed	Price (input/output per 1M tokens)	Tags	Best for
Claude Opus 4.8	Anthropic	1M	slow	$5/$25	reasoning, writing, analysis, coding	Complex reasoning, long documents
Claude Sonnet 4.6	Anthropic	1M	fast	$3/$15	writing, coding, analysis	Default choice for most tasks
Claude Haiku 4.5	Anthropic	200K	ultra-fast	$1/$5	routing, classification, speed	Fast & cheap for simple tasks
GPT-5.5	OpenAI	1M	fast	$5/$30	writing, coding, vision, reasoning	Flagship reasoning & coding
GPT-5.4	OpenAI	1M	fast	$2.50/$15	writing, coding, analysis	Best value for most production workloads
GPT-5.4 mini	OpenAI	400K	ultra-fast	$0.75/$4.50	speed, budget, routing, coding	Cost-efficient coding & agents
o3	OpenAI	128K	very-slow	not listed	reasoning	Hardest problems (math, logic)
o1	OpenAI	128K	slow	$15/$60	reasoning	Hard reasoning at lower cost than o3
Gemini 3.1 Pro	Google	1M	medium	$2/$12	long-context, vision, research	Massive documents & vision
DeepSeek V4 Flash	DeepSeek	1M	fast	$0.14/$0.28	speed, budget, routing	Extreme budget, minimal quality loss
DeepSeek V4 Pro	DeepSeek	1M	medium	$0.435/$0.87	reasoning, budget, coding	Frontier quality at fraction of cost

Llama 4

Quick Decision Tree

1. What's your primary constraint?
   ├─ Cost → DeepSeek V4 Flash or GPT-5.5 Instant
   ├─ Speed → Claude Haiku or GPT-5.5 Instant
   ├─ Reasoning/Quality → Claude Opus or o3
   ├─ Long context → Gemini 3.1 Pro (1M tokens)
   └─ Privacy/On-prem → Llama 4 self-hosted

2. What's your use case?
   ├─ Writing/Analysis → Claude Sonnet (default)
   ├─ Code generation → Claude Sonnet or GPT-4o
   ├─ Reasoning/Math → o3 or Claude Opus
   ├─ Vision/Images → GPT-4o or Gemini 3.1
   ├─ Document processing → Gemini 3.1 Pro (1M context)
   └─ Real-time → Perplexity or Claude with web search
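
The two questions in the tree can be sketched as lookup tables. This is a minimal illustration only: the dictionary keys, the `shortlist` function, and the rule of falling back to the constraint list when the two answers share no model are assumptions made for the example, not something the guide prescribes.

```python
# Hypothetical lookup tables mirroring the decision tree above.
# Model names are copied from the tree; the keys are illustrative.
BY_CONSTRAINT = {
    "cost": ["DeepSeek V4 Flash", "GPT-5.5 Instant"],
    "speed": ["Claude Haiku", "GPT-5.5 Instant"],
    "reasoning": ["Claude Opus", "o3"],
    "long_context": ["Gemini 3.1 Pro"],
    "privacy": ["Llama 4 (self-hosted)"],
}

BY_USE_CASE = {
    "writing": ["Claude Sonnet"],
    "code": ["Claude Sonnet", "GPT-4o"],
    "math": ["o3", "Claude Opus"],
    "vision": ["GPT-4o", "Gemini 3.1"],
    "documents": ["Gemini 3.1 Pro"],
    "real_time": ["Perplexity", "Claude with web search"],
}


def shortlist(constraint, use_case):
    """Return models suggested by both questions.

    Assumption: when the two lists share no model, the primary
    constraint wins, because the tree asks about it first.
    """
    by_constraint = BY_CONSTRAINT[constraint]
    by_use_case = BY_USE_CASE[use_case]
    both = [m for m in by_constraint if m in by_use_case]
    return both or by_constraint
```

For example, `shortlist("reasoning", "math")` returns both Claude Opus and o3, while `shortlist("cost", "code")` falls back to the cost list because no model appears under both answers.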

Model Categories

Tier 1: Reasoning Models (Slow but Brilliant)

Claude Opus 4.8 (Anthropic)

Context: 1M tokens (read a whole book)
Speed: Slow (think for 30+ seconds)
Cost: $5 input / $25 output per 1M tokens
Best for: Complex reasoning, multi-step logic, deep analysis
Why: Most capable model available. Use when Sonnet struggles.
When NOT to use: Simple tasks, real-time applications, cost-sensitive work

o3 (OpenAI)

Context: 128K tokens
Speed: Very slow (extended thinking)
Cost: $2 input / $8 output per 1M tokens, plus hidden reasoning tokens billed as output
Best for: Extremely hard problems (math, coding competitions, logic puzzles)
Why: Breakthrough reasoning capability
When NOT to use: General tasks (overkill), any time-sensitive work

o1 (OpenAI)

Context: 128K tokens
Speed: Slow but faster than o3
Cost: $15 per 1M input, $60 per 1M output
Best for: Difficult reasoning without needing extended thinking
Why: Good reasoning at reasonable speed
When NOT to use: Simple tasks, real-time

Tier 2: Default Models (Fast & Smart)

Claude Sonnet 4.6 (Anthropic) ⭐ Recommended Default

Context: 1M tokens
Speed: Fast (2-5 seconds)
Cost: $3 input / $15 output per 1M tokens
Best for: Almost everything - writing, code, analysis
Why: Best balance of speed, quality, cost
When to use: Your first choice for any task

GPT-4o (OpenAI)

Context: 128K tokens
Speed: Fast (2-5 seconds)
Cost: $2-8 per 1M tokens
Best for: All-around work, especially vision/images
Why: Extremely reliable, good at everything
When to use: When you need vision, or want OpenAI’s reliability

Gemini 3.1 Pro (Google)

Context: 1M tokens (!!)
Speed: Medium (5-10 seconds)
Cost: $2 input / $12 output per 1M tokens (prompts over 200K tokens billed at $4/$18)
Best for: Document analysis, long-context research
Why: A 1M-token window fits entire books in a single request
When to use: When context window matters more than speed

Tier 3: Speed-Focused (Fast & Cheap)

Claude Haiku 4.5 (Anthropic)

Context: 200K tokens
Speed: Ultra-fast (under 1 second)
Cost: $1 input / $5 output per 1M tokens
Best for: Classification, routing, summaries, high volume
Why: Surprisingly capable despite being the smallest
When to use: When speed is critical or volume is high

GPT-4 Turbo (OpenAI)

Context: 128K tokens
Speed: Fast
Cost: $0.01 input / $0.03 output per 1K tokens ($10/$30 per 1M)
Best for: Existing production systems
Why: Reliable, but more expensive per token than Sonnet
When to use: Legacy systems

DeepSeek V4 Flash (DeepSeek)

Context: 1M tokens
Speed: Fast
Cost: $0.14 input / $0.28 output per 1M tokens (!!)
Vision: ❌ Text only. Use DeepSeek VL for image tasks.
Best for: Budget-conscious work, routing, high volume
Why: Shockingly cheap and good quality
When to use: Cost is the #1 constraint

DeepSeek VL (DeepSeek)

Context: 128K tokens
Speed: Medium
Cost: ~$0.55 input / $2.19 output per 1M tokens
Vision: ✅ Images
Best for: Vision tasks within DeepSeek ecosystem
Why: DeepSeek’s dedicated vision model. Use when you need image understanding at DeepSeek pricing.

Tier 4: Specialized

GPT-4 Vision (OpenAI)

Best for: Image analysis, OCR, visual understanding
Why: GPT-4o is better and cheaper now
When to use: Legacy systems

Claude 3 Opus (Anthropic, previous version)

Superseded by later Opus releases
When to use: Nowhere; use Opus 4.8 instead

Open-Source Models (Llama, Mistral, DeepSeek)

Best for: Privacy, on-premise, fine-tuning
Why: Full control, no API costs
When to use: When data can’t leave your infrastructure
How: Run locally with Ollama or LM Studio

Decision Matrix By Use Case

Task	Model	Why	Cost
Customer support chatbot	Haiku	Fast, cheap	$2-5/month
Blog post writing	Sonnet	Quality + speed balance	$1-3/month
Code generation	Sonnet or GPT-4o	Both excellent	$2-5/month
Complex reasoning	Opus or o3	Need the power	$50-200/month
Document analysis (100 pages)	Gemini 3.1 Pro	Fits within 1M context	$2-10/month
Real-time Q&A	Perplexity	Web search built-in	Free to $20/month
Vision/image tasks	GPT-4o	Best at images	$2-5/month
Routing/classification	Haiku	Speed + cheap	$1-2/month
Data extraction	Sonnet + structured output	Reliable parsing	$2-5/month
High volume (1000+ requests/day)	Haiku or V4 Flash	Need cheap inference	$10-50/month

Cost Comparison for Common Scenarios

Scenario 1: Personal Research Assistant

Use case: 10 questions/day, 2000 input tokens avg, 500 output tokens (30-day month, list prices)

Model	Monthly Cost	Speed	Quality
Claude Sonnet	$4.05	Fast	Excellent
GPT-4o	$2.40	Fast	Excellent
Gemini 3.1 Pro	$3.00	Medium	Excellent
DeepSeek V4	$0.66	Fast	Good

Recommendation: Sonnet or GPT-4o (negligible difference)

Scenario 2: High-Volume Classification (10,000 req/day)

Use case: 500 input tokens, 50 output tokens per request (30-day month, list prices)

Model	Monthly Cost	Speed	Quality
Claude Haiku	$225	Ultra-fast	Good
GPT-4 Turbo	$1,950	Fast	Good
DeepSeek Flash	$25.20	Fast	Good

Recommendation: DeepSeek Flash (about 9x cheaper than Haiku)

Scenario 3: Complex Reasoning (50 req/day)

Use case: 3000 input tokens, 2000 output tokens per request (30-day month, list prices)

Model	Monthly Cost	Speed	Quality
Claude Opus	$97.50	Slow	Excellent
Claude Sonnet	$58.50	Fast	Excellent
o3	$33.00 plus hidden reasoning tokens	Very slow	Best-in-class
o1	$247.50 plus hidden reasoning tokens	Slow	Excellent

Recommendation: Sonnet (best balance), o3 (if you need the best and can wait)
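
Scenario estimates like these come down to one formula: requests per month times tokens per request times the per-token price, summed over input and output. The sketch below assumes a 30-day month and the input/output list prices from the specifications table; the function name and parameters are illustrative, and it ignores caching, batch discounts, tiered pricing, and hidden reasoning tokens.

```python
def monthly_cost(requests_per_day, input_tokens, output_tokens,
                 input_price, output_price, days=30):
    """Estimate monthly spend in dollars.

    input_price and output_price are dollars per 1M tokens.
    days=30 is an assumption; adjust for your billing period.
    """
    requests = requests_per_day * days
    input_cost = requests * input_tokens / 1_000_000 * input_price
    output_cost = requests * output_tokens / 1_000_000 * output_price
    return round(input_cost + output_cost, 2)


# Scenario 1 on Claude Sonnet 4.6 at $3/$15 per 1M tokens
print(monthly_cost(10, 2000, 500, 3.00, 15.00))      # 4.05

# Scenario 2 on DeepSeek V4 Flash at $0.14/$0.28 per 1M tokens
print(monthly_cost(10_000, 500, 50, 0.14, 0.28))     # 25.2

# Scenario 3 on Claude Opus 4.8 at $5/$25 per 1M tokens
print(monthly_cost(50, 3000, 2000, 5.00, 25.00))     # 97.5
```

Output tokens usually dominate the bill for reasoning-heavy workloads, so it is worth measuring real response lengths before trusting an estimate.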

Which Model for Your Project?

If You’re Building a Startup/Product

Start with: Claude Sonnet + Haiku combo

Sonnet for complex tasks
Haiku for high-volume/cheap tasks
Reason: Cost-effective, reliable, good quality

Scale with: Opus if you hit reasoning limits

If You’re Prototyping/Learning

Start with: GPT-4o or Claude Sonnet (free tier)

Both have good free credits
Reason: Simplest to get started

Explore: Try multiple models on the same task to see tradeoffs

If Cost Is Your #1 Constraint

Use: DeepSeek V4 Flash + Sonnet combo

Flash for everything possible
Sonnet when Flash isn’t good enough
Reason: 10x cheaper overall

If You Need Long Context (1000+ page documents)

Use: Gemini 3.1 Pro (1M context)

Reason: A 1M-token window fits book-length inputs in one request, and it scores 5/5 for long context in the capability matrix

If You Need On-Premise/Privacy

Use: Llama 4 (run locally with Ollama)

Cost: Free (just electricity)
Tradeoff: Slower, less capable
Reason: Data never leaves your machine

If Speed Matters Most

Use: Haiku + use caching

Response time: under 1 second
Or: o1 if you need reasoning (slower but better)
Reason: Trade quality for speed when needed

Optimization Strategies

Strategy 1: Routing (Mixture of Models)

Use a cheap model to decide which model to use:

Input: User question
  ↓
Haiku: "Is this question simple? (yes/no)"
  ├─ yes → Use Haiku for answer (cheap)
  └─ no → Use Sonnet for answer (better)

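Savings: If 80% of questions go to Haiku and 20% to Sonnet, spend drops by roughly half compared with sending everything to Sonnet (Haiku 4.5 lists at one third of the price of Sonnet 4.6), before counting the routing call itself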
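
A minimal sketch of the routing idea and its cost arithmetic follows. The `looks_simple` heuristic is a placeholder standing in for the cheap classifier call (in practice that would be a request to the small model), and all function names are hypothetical. Costs are expressed relative to the stronger model, so 1.0 means "all traffic on Sonnet".

```python
def looks_simple(question):
    """Placeholder for the cheap classifier call.

    Assumption for illustration: short questions that do not ask
    "why" or "how" are treated as simple.
    """
    words = question.lower().split()
    return len(words) <= 12 and not {"why", "how"} & set(words)


def route(question, is_simple=looks_simple):
    """Return the tier that should answer the question."""
    return "haiku" if is_simple(question) else "sonnet"


def blended_cost(simple_share, cheap_cost, strong_cost, router_cost=0.0):
    """Average cost per request after routing, in the caller's units."""
    return (router_cost
            + simple_share * cheap_cost
            + (1 - simple_share) * strong_cost)


# 80% of traffic on a model priced at one third of the strong model.
relative = blended_cost(0.8, 1 / 3, 1.0)
print(f"{relative:.2f} of the all-Sonnet cost")
```

With an 80/20 split and a 3x price gap the blended cost is a little under half of the baseline; any nonzero `router_cost` eats into that saving, which is why the classifier prompt should stay short.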

Strategy 2: Caching

If you analyze the same document repeatedly:

First request: Analyze document X (full cost)
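Second request: Analyze document X (cached input about 90% cheaper)

Savings: With caching, the document's input tokens on the 2nd-10th requests are about 90% cheaper; new input and all output tokens are still billed at the normal rate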
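
To see how much a repeated request actually saves, the cached and uncached portions have to be priced separately. The sketch below assumes cached input is billed at a 90% discount, as stated above, and ignores any cache-write surcharge a provider may apply; the function and parameter names are illustrative, and the token counts are made-up example values.

```python
def cost_with_cache(input_tokens, cached_tokens, output_tokens,
                    input_price, output_price, cached_discount=0.90):
    """Cost of one request in dollars; prices are dollars per 1M tokens.

    cached_tokens is the part of input_tokens served from the cache.
    cached_discount=0.90 is an assumption taken from the text above.
    """
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    cached_price = input_price * (1 - cached_discount)
    total = (fresh_tokens * input_price
             + cached_tokens * cached_price
             + output_tokens * output_price)
    return total / 1_000_000


# Example values: a 50,000-token document plus a 200-token question,
# 500 output tokens, priced at $3/$15 per 1M tokens.
first = cost_with_cache(50_200, 0, 500, 3.00, 15.00)
repeat = cost_with_cache(50_200, 50_000, 500, 3.00, 15.00)
print(f"first: ${first:.4f}, repeat: ${repeat:.4f}")
```

In this example the repeat request costs about 85% less than the first rather than a full 90%, because the question and the answer are not cached.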

Strategy 3: Batch Processing

Don’t ask questions one-at-a-time:

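❌ Bad: 1000 questions, one call each, $0.01 of fixed prompt overhead per call = $10
✅ Good: Batch 100 questions per call, 10 calls = $0.10 of overhead

Savings: Up to 100x on the shared per-call overhead (for example a repeated system prompt); tokens for the questions and answers themselves are billed either way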
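
A minimal sketch of batching, assuming the saving comes from sending a shared prompt once per call instead of once per question. The helper names and the 500-token prompt size are hypothetical; provider-side batch endpoints with their own discounts are a separate mechanism not modeled here.

```python
def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def overhead_tokens(num_questions, batch_size, prompt_tokens):
    """Tokens spent on the shared prompt across all calls."""
    calls = -(-num_questions // batch_size)  # ceiling division
    return calls * prompt_tokens


questions = [f"question {i}" for i in range(1000)]
batches = list(chunks(questions, 100))
print(len(batches))                       # 10

# Shared 500-token prompt (example value): one call per question
# versus 100 questions per call.
print(overhead_tokens(1000, 1, 500))      # 500000
print(overhead_tokens(1000, 100, 500))    # 5000
```

Larger batches also mean longer responses and a bigger blast radius if one call fails, so batch size is a tradeoff rather than "as large as possible".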

Vision / Image Input Support

Not all models can process images. Here’s which ones can and what they support:

Model	Vision	Type	Notes
Claude Sonnet 4.6	✅	Images	Strong image analysis, charts, documents
Claude Opus 4.8	✅	Images	Best for detailed visual reasoning
GPT-5.5	✅	Images	Solid multimodal
GPT-5.5 Instant	❌	Text only	Fastest, no vision
o3	❌	Text only	Reasoning-only model
Gemini 3.1 Pro	✅	Images + Video	Best multimodal support
DeepSeek V4	❌	Text only	Use DeepSeek VL for vision
DeepSeek V4 Flash	❌	Text only	Use DeepSeek VL for vision
DeepSeek VL	✅	Images	DeepSeek’s dedicated vision model
Llama 4	✅	Images	Open-source multimodal

Key takeaway: If your task involves analyzing images, charts, or screenshots, pick a model with ✅. DeepSeek V4 and V4 Flash are excellent for text-only tasks but can’t process images at all - use DeepSeek VL if you need vision in the DeepSeek ecosystem.

What Changed Recently (May 2026)

o3 released (best reasoning ever, very expensive)
Claude Opus 4.8 released (supersedes Opus 4.7, improved reasoning)
DeepSeek V4 released (open-source reasoning, cheaper than ever)
Gemini 3.1 released (1M context window, multimodal)
GPT-5.5 released (minor improvements over 4o)

Implication: 2026 is dominated by reasoning models and long-context models.

Common Mistakes

❌ Using Opus for everything - Overkill 95% of the time
✅ Use Sonnet by default, Opus when needed

❌ Ignoring cost - Can add up fast with high volume
✅ Calculate your actual usage, optimize routing

❌ Assuming newer = better - Sometimes not true
✅ Test models on your actual task

❌ Using same model for everything - Suboptimal
✅ Use a mix (Haiku for cheap, Sonnet for quality)

Model Specifications & Capabilities

All current models with pricing, context windows, and vision support. Source of truth: src/data/models.ts.

Model	Company	Context	Input/Output	Vision	Notes
Claude Opus 4.8	Anthropic	1M	$5/$25 per 1M	✅	Most capable Claude (May 2026). Best for complex reasoning and agentic coding. Adaptive thinking. Fast Mode $10/$50.
Claude Opus 4.8 (Thinking)	Anthropic	1M	$5/$25 per 1M	✅	Top-ranked on Design Arena. Thinking mode enabled.
Claude Opus 4.7	Anthropic	1M	$5/$25 per 1M	✅	Previous flagship (superseded by Opus 4.8, May 2026).
Claude Opus 4.6	Anthropic	1M	$5/$25 per 1M	✅	Previous gen flagship. Still highly capable.
Claude Opus 4.6 (Thinking)	Anthropic	1M	$5/$25 per 1M	✅	Previous gen with thinking mode. Strong on design benchmarks.
Claude Opus 4.5	Anthropic	200K	$5/$25 per 1M	❌	Earlier generation. Still available for certain use cases.
Claude Sonnet 4.6	Anthropic	1M	$3/$15 per 1M	✅	Best balance of speed & quality. Default pick.
Claude Haiku 4.5	Anthropic	200K	$1/$5 per 1M	✅	Ultra-fast, cheapest Claude.
GPT-5.5	OpenAI	1M	$5/$30 per 1M	✅	Flagship. Reasoning levels none→xhigh. Strong all-around.
GPT-5.4	OpenAI	1M	$2.50/$15 per 1M	✅	Affordable professional tier. Near-flagship capability.
GPT-5.4 mini	OpenAI	400K	$0.75/$4.50 per 1M	❌	Strong mini for coding & agents. Fast.
GPT-5.4 nano	OpenAI	400K	$0.20/$1.25 per 1M	❌	Fastest, cheapest. Ideal for high-throughput.
GPT-4.1	OpenAI	128K	$2/$8 per 1M	❌	Previous gen. Superseded by GPT-5.4 mini.
o3	OpenAI	128K	$2/$8 per 1M	❌	Dedicated reasoning model. Spends tokens on hidden thinking. 87% cheaper than o1.
o1	OpenAI	128K	$15/$60 per 1M	❌	Earlier reasoning model. Superseded by o3.
Gemini 3.1 Pro	Google	1M	$2/$12 per 1M	✅	Flagship Gemini. Best context window, excellent multimodal. Prompts >200K billed $4/$18.
Gemini 3.5 Flash	Google	1M	$1.50/$9 per 1M	✅	Fast Gemini. $0.15/M cached input (90% off). Free tier on AI Studio.
DeepSeek V4 Flash	DeepSeek	1M	$0.14/$0.28 per 1M	❌	Cost leader. MIT license. FREE on OpenCode.
DeepSeek V4 Pro	DeepSeek	1M	$0.435/$0.87 per 1M	❌	Premium tier. Thinking mode default. 75% price cut now permanent (announced May 22, 2026).
DeepSeek R1	DeepSeek	1M	$0.435/$0.87 per 1M	❌	Deprecated as standalone; folded into V4 Flash thinking mode (deepseek-reasoner). Open-weight.
DeepSeek V4	DeepSeek	128K	$0.55/$2.19 per 1M	❌	Previous gen. Superseded by V4 Flash and Pro.
Llama 4	Meta	varies	Free (self-host)	✅	Open weights under the Llama 4 Community License. Run locally.
Llama 4 Scout	Meta	10M	Free (self-host)	✅	MoE variant. 10M context window, 109B total params.
Muse Spark	Meta	varies	API-only (preview)	❌	Meta's first proprietary (closed-weight) frontier model, Apr 2026. Powers Meta AI; private-preview API. NOT open-weight.
Grok 4.3	xAI	1M	$1.25/$2.50 per 1M	✅	xAI flagship (Apr 2026). Real-time X data. Legacy Grok 3/4 aliases route here.
Grok 3 Pro	xAI	128K	$3/$15 per 1M	✅	Previous gen. Routes to Grok 4.3.
Kimi K2.6	Moonshot AI	256K	~$0.60/$2.50 per 1M	✅	Latest Kimi. Top-5 on Design Arena. Agent swarm capabilities. ($0.16/M cached input.)
Kimi K2.5 (Thinking)	Moonshot AI	256K	~$0.55/$2.19 per 1M	❌	Previous gen with thinking mode.
GLM 5.1	Zhipu AI	200K	~$0.98/$3.08 per 1M	❌	Zhipu's flagship. Top-5 on Design Arena. Open-weight.
GLM 5 Turbo	Zhipu AI	128K	~$0.30/$1.00 per 1M	❌	Fast inference variant of GLM 5.
GLM 5	Zhipu AI	128K	~$0.60/$1.92 per 1M	❌	Base GLM 5 model. Strong multilingual performance.
GLM 4.7	Zhipu AI	128K	~$0.30/$1.00 per 1M	❌	Mid-cycle update between GLM 4 and GLM 5.
GLM 4	Zhipu AI	128K	~$0.20/$0.80 per 1M	❌	Previous gen. Still solid for Chinese-language tasks.
Qwen 3.6	Alibaba	128K	~$0.33/$1.95 per 1M	✅	Alibaba's flagship. Strong across all benchmarks. (DashScope direct pricing.)
MiniMax M2.7	MiniMax	128K	~$0.30/$1.20 per 1M	✅	Independent Chinese AI lab. Strong long-context performance.
MiMo V2.5	Xiaomi	128K	~$1/$3 per 1M	✅	Xiaomi's multimodal model.
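
A table like this is easy to query once it is held as structured data. The sketch below is a hypothetical Python representation of a few rows, not the actual shape of `src/data/models.ts`; ranking by the sum of input and output price is a crude proxy chosen for brevity, and tiered pricing (such as the higher Gemini rate above 200K tokens) is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    context_tokens: int
    input_price: float    # dollars per 1M input tokens
    output_price: float   # dollars per 1M output tokens
    vision: bool


# A few rows copied from the table above.
MODELS = [
    ModelSpec("Claude Opus 4.8", 1_000_000, 5.00, 25.00, True),
    ModelSpec("Claude Sonnet 4.6", 1_000_000, 3.00, 15.00, True),
    ModelSpec("Claude Haiku 4.5", 200_000, 1.00, 5.00, True),
    ModelSpec("GPT-5.4 mini", 400_000, 0.75, 4.50, False),
    ModelSpec("Gemini 3.1 Pro", 1_000_000, 2.00, 12.00, True),
    ModelSpec("DeepSeek V4 Flash", 1_000_000, 0.14, 0.28, False),
]


def cheapest(min_context=0, needs_vision=False):
    """Return the lowest-priced model meeting the constraints, or None."""
    candidates = [
        m for m in MODELS
        if m.context_tokens >= min_context and (m.vision or not needs_vision)
    ]
    return min(candidates,
               key=lambda m: m.input_price + m.output_price,
               default=None)
```

Among these rows, `cheapest()` picks DeepSeek V4 Flash, and `cheapest(needs_vision=True)` picks Claude Haiku 4.5, since the cheaper rows are text-only.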

Model Capability Matrix

How models perform across key tasks, rated on a 1-5 scale based on benchmark scores and real-world performance.

Strength: Best Strong Good Fair Limited


Task	Claude 4 Opus	Claude Sonnet 4.6	GPT-5.5	GPT-5.5 Instant	Gemini 3.1 Pro	DeepSeek V4	DeepSeek VL	o3	Llama 4 405B	Kimi K2.6	GLM 5.1	Muse Spark	DeepSeek V4 Pro	Claude Opus 4.6	Grok 3 Pro	Qwen 3.6	Llama 4 Scout	Gemini 3 Mini	MiniMax M2.7
Coding (generate and refactor code)	5: HumanEval 96.2%. Best-in-class code generation	4: HumanEval 93.7%. Strong daily driver	5: HumanEval 95.1%. Excellent for most tasks	4: HumanEval 92.8%. Fast, good quality	4: HumanEval 94.0%. Strong, especially with long context	4: HumanEval 91.5%. Surprisingly capable for price	4: HumanEval 90%+. Strong coder with vision understanding	5: SWE-bench 71.7%. Top-tier for complex coding	3: HumanEval 90.2%. Good open-source option	4: Strong coder, agentic capabilities	4: Strong multilingual coder	4: Strong coder, Llama lineage	4: Strong coder, premium variant	5: HumanEval ~95%. Excellent coder, slightly behind 4.7	4: Strong coder, real-time data access	4: Strong multilingual coder	3: Decent coder, MoE efficiency	4: Good coder for its size	4: Strong coder
Math (mathematical reasoning)	5: MATH 96.8%. Excellent mathematical reasoning	4: MATH 94.2%. Strong, suitable for most needs	5: MATH 95.5%. Very strong math capability	4: MATH 92.1%. Fast, good for basic math	5: MATH 96.0%. Excellent math performance	4: MATH 93.8%. Strong for the price	4: Good math, similar to DeepSeek V4	5: MATH 97.9%. Best-in-class math	3: MATH 89.6%. Decent open-source option	4: Solid math reasoning	4: Good math reasoning	4: Good math reasoning	4: Good math reasoning	5: MATH ~95%. Strong math capabilities	4: Good math reasoning	4: Good math reasoning	3: Adequate math reasoning	4: Solid math reasoning	4: Good math reasoning
Reasoning (complex multi-step reasoning)	5: GPQA 84.6%. Deep, nuanced reasoning	4: GPQA 79.8%. Strong reasoning for most tasks	4: GPQA 82.1%. Capable multi-step reasoning	3: GPQA 78.0%. Good, but trades depth for speed	4: GPQA 81.5%. Solid reasoning, improved with 3.1	4: GPQA 76.4%. Remarkably capable for cost	4: Solid reasoning with visual context	5: GPQA 87.3%. State-of-the-art reasoning	3: GPQA 73.1%. Competitive open-source	4: Strong reasoning with thinking mode	4: Solid reasoning capabilities	4: Competitive reasoning	4: Solid reasoning capabilities	5: Deep reasoning, thinking mode available	4: Solid multi-step reasoning	4: Solid reasoning	3: Competitive reasoning for size	3: Adequate reasoning	4: Solid reasoning
Writing (prose, analysis, long-form)	5: Best prose, nuance, and voice	5: Excellent writing for daily use	4: Very good, slightly less nuanced	3: Adequate, optimized for speed	4: Strong, especially analytical writing	3: Decent, lags behind top models	3: Decent, vision-enhanced writing	3: Reasoning-focused, not writing-optimized	3: Solid for open-source	4: Good long-form writing	4: Strong multilingual writing	3: Adequate prose generation	3: Decent writing quality	5: Excellent prose quality	3: Adequate, not writing-optimized	4: Strong multilingual writing	3: Solid for open-source	3: Decent, speed-optimized	3: Adequate writing quality
Vision (image understanding)	4: Good image understanding	4: Strong vision capability	4: Multimodal, strong image analysis	3: Basic vision support	5: Best-in-class multimodal	0: Text-only model. Use DeepSeek VL for vision.	4: DeepSeek's dedicated vision model. Strong image understanding.	3: Text-only reasoning model	3: Basic multimodal support	3: Basic vision support	3: Basic vision support	3: Basic multimodal support	0: Text-only. Use DeepSeek VL for vision.	4: Good image understanding	3: Basic vision support	3: Basic vision support	3: Basic multimodal support	4: Good vision for speed-optimized	3: Basic vision support
Long Context (processing large documents)	5: 400K context. Excellent long-doc processing	4: 200K context. Very capable with long docs	4: 128K context. Solid long context	3: 128K context. Same window, faster processing	5: 1M context. Industry-leading context window	3: 128K context. Standard context window	3: 128K context. Same context window as V4	3: 128K context. Focuses on depth, not span	3: 128K context. Standard for open-source	5: 256K context. Excellent long-context, agent swarm	3: 128K context. Standard context window	3: 128K context. Standard for open-weight	3: 128K context. Standard context window	4: 200K context. Solid long-doc processing	3: 128K context. Standard context window	3: 128K context. Standard context window	5: 10M context. Massive context window, best in class	3: 128K context. Standard context window	4: 128K context. Strong long-context performance
Agentic (tool use, multi-step tasks)	5: Excellent tool use and reasoning	5: SWE-bench 49%. Best-in-class agentic coding	4: Strong function calling	3: Fast but less reliable	4: Good tool use, improving	3: Basic function calling	3: Basic function calling with vision	4: Reasoning-first agentic	3: Improving with each release	5: Up to 100 specialized agents in swarm	3: Basic agentic capabilities	4: Good tool use	3: Basic function calling	5: Excellent tool use	3: Basic function calling	3: Basic agentic capabilities	3: Basic tool use	3: Basic agentic capabilities	3: Basic agentic capabilities
Speed (response latency)	2: Slowest, but most thoughtful	3: Moderate speed	4: Fast for frontier quality	5: Fastest in class, <1s responses	4: Consistently fast	4: Good speed for the price	3: Slower than V4 due to vision processing	1: Slow deliberative reasoning	3: Varies by deployment	3: Moderate speed	4: Fast inference	4: Fast inference	4: Fast inference	2: Slower, thoughtful responses	4: Fast inference	4: Fast inference	3: MoE, moderate speed	5: Fastest Gemini variant	4: Fast inference
Cost Efficiency (value per dollar)	2: $15/$75 per 1M. Most expensive per token	3: $3/$15 per 1M. Reasonable for quality	3: $2/$8 per 1M. Competitive pricing	4: $0.05/$0.20 per 1M. Very cheap, fast	4: $2/$12 per 1M. Good value for long context	5: $0.55/$2.19 per 1M. 10-50x cheaper than peers	4: Competitive pricing for vision tasks	1: $10-60 per 1M output. Most expensive reasoning	5: Free (self-host). Open-source, no API costs	4: Competitive pricing	4: Competitive pricing	5: Free (self-host). Open-weight, no API costs	4: Good value for quality	2: $15/$75 per 1M. Expensive but capable	2: $3/$15 per 1M. Premium pricing	4: Competitive pricing	5: Free (self-host). Open-weight, no API costs	4: $1/$6 per 1M. Affordable for quality	4: Competitive pricing

Score	Meaning
5 (dark green)	Best in class. Top performer for this task.
4 (light green)	Strong. Excellent for most use cases.
3 (yellow)	Good. Capable but not top-tier.
2 (orange)	Fair. Works for simple cases.
1 (red)	Limited. Not recommended for this task.

Scores combine public benchmarks and real-world usage as of May 2026. “Speed” measures output latency, not throughput. For detailed benchmark numbers, see the Benchmarks page.
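
One way to use the matrix is to weight the tasks that matter for a project and rank models by weighted average score. The sketch below copies four models and four task scores from the matrix; the weighting scheme, the short task keys, and the `rank` function are assumptions for illustration, not a method the guide defines.

```python
# Scores copied from the capability matrix (1-5 scale).
SCORES = {
    "Claude Sonnet 4.6": {"coding": 4, "writing": 5, "speed": 3, "cost": 3},
    "GPT-5.5 Instant":   {"coding": 4, "writing": 3, "speed": 5, "cost": 4},
    "Gemini 3.1 Pro":    {"coding": 4, "writing": 4, "speed": 4, "cost": 4},
    "DeepSeek V4":       {"coding": 4, "writing": 3, "speed": 4, "cost": 5},
}


def rank(weights):
    """Return (model, weighted average score) pairs, best first.

    weights maps task name to relative importance, e.g. {"writing": 3}.
    """
    total = sum(weights.values())
    averaged = {
        model: sum(scores[task] * w for task, w in weights.items()) / total
        for model, scores in SCORES.items()
    }
    return sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)


# A writing-heavy project that also needs some coding.
for model, score in rank({"writing": 3, "coding": 1}):
    print(f"{model}: {score:.2f}")
```

With these weights Claude Sonnet 4.6 ranks first; shifting the weights to `{"cost": 2, "speed": 1}` puts DeepSeek V4 on top. A ranking like this is only a starting shortlist, and testing on the actual task still decides.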

Where to Start

Pick a default: Claude Sonnet (recommended) or GPT-4o
Use free credits to test on your actual task
Measure cost: How many tokens? How many requests?
Optimize: Add Haiku for cheap tasks, Opus only when Sonnet fails
Monitor: Track costs monthly