LLM Model Comparison

| Model                            | Context | Strengths                      | My notes                             |
|----------------------------------|---------|--------------------------------|--------------------------------------|
| Claude Sonnet 4.6                | 200k    | Long-context reasoning, coding | Default for agentic tasks.           |
| Claude Opus 4.6                  | 200k    | Deep reasoning, nuance         | Use when Sonnet gets it wrong.       |
| Claude Haiku 4.5                 | 200k    | Fast, cheap                    | Great for classification + routing.  |
| GPT-4-class                      | 128k+   | General, strong tools          |                                      |
| Open-source (Llama/Mistral/Qwen) | varies  | On-prem, privacy               | Start here when data can't leave.    |

How I pick

1. Does this need to run on-prem or handle sensitive data?
└─ yes → open-source + local inference
└─ no → continue
2. Is latency the bottleneck (chatbot UX)?
└─ yes → small/fast model + RAG
└─ no → continue
3. Does the task need deep multi-step reasoning?
└─ yes → frontier model
└─ no → mid-tier frontier model
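The decision tree above can be sketched as a routing function. This is a minimal illustration; the `Task` fields and the returned category strings are my own labels, not anything from a real API:

```python
from dataclasses import dataclass

@dataclass
class Task:
    sensitive_data: bool     # must stay on-prem?
    latency_critical: bool   # chatbot-style UX where speed dominates?
    deep_reasoning: bool     # multi-step reasoning required?

def pick_model(task: Task) -> str:
    """Walk the three questions in order; first match wins."""
    if task.sensitive_data:
        return "open-source + local inference"
    if task.latency_critical:
        return "small/fast model + RAG"
    if task.deep_reasoning:
        return "frontier model"
    return "mid-tier frontier model"
```

Note the ordering matters: privacy trumps latency, latency trumps reasoning depth, which mirrors how the questions are numbered above.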

Cost sanity checks

Before rolling anything out to prod, run the arithmetic:

daily requests × avg input tokens × input $/Mtok
+ daily requests × avg output tokens × output $/Mtok

If the number scares you, add caching, a router, or a smaller model for the cheap path.
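That arithmetic as a one-liner, for quick what-if checks. The example volumes and $/Mtok prices in the comment are made up for illustration, not real pricing:

```python
def daily_cost(requests: int,
               avg_input_tokens: float,
               avg_output_tokens: float,
               input_usd_per_mtok: float,
               output_usd_per_mtok: float) -> float:
    """Estimated daily spend in dollars. Prices are per million tokens."""
    input_cost = requests * avg_input_tokens * input_usd_per_mtok / 1_000_000
    output_cost = requests * avg_output_tokens * output_usd_per_mtok / 1_000_000
    return input_cost + output_cost

# Illustrative: 50k requests/day, 2k input + 500 output tokens each,
# at $3 input / $15 output per Mtok → 300 + 375 = $675/day
print(daily_cost(50_000, 2_000, 500, 3, 15))
```

Plugging in a cheaper model's prices for just the high-volume path is usually the fastest way to see what a router would save.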