Economics of AI

Build-vs-buy analysis, TCO models, cost optimization strategies, and market dynamics for AI systems
Key Takeaways
  • API is cheaper than self-hosting below 50M tokens per month
  • Model routing (simple queries to cheap models, complex to expensive) is the highest-ROI optimization
  • Prompt caching saves up to 81% on repeated context prefixes
  • AI inference prices dropped 90%+ from 2024 to 2026 — don't over-optimize for today's prices

This guide covers how to make cost-effective decisions about AI: when to build versus buy, how to model total cost, and how the market is evolving.

For current model pricing, see the canonical Models Guide page. The prices below are examples for cost estimation.


Build vs Buy: The Framework

The first decision: use an API (buy) or self-host / fine-tune (build).

| Factor | Buy (API) | Build (Self-host) |
| --- | --- | --- |
| Time to market | Days | Weeks to months |
| Upfront cost | $0 | $500-$500K (infra) |
| Per-query cost | $0.001-$0.10 | $0.0001-$0.01 |
| Privacy | Data leaves your infra | Data stays on-prem |
| Customization | Prompting only | Full control |
| Maintenance | Provider handles it | Your team handles it |
| Scaling | Auto-scales | You manage capacity |
| Quality | Frontier models | SLMs or fine-tuned |

Decision Tree

Is privacy the primary concern?
├─ Yes → Self-host (build)
└─ No → Continue
Is usage volume > 10M tokens/month?
├─ Yes → Build may be cheaper
└─ No → API is almost certainly cheaper
Do you need frontier-level quality?
├─ Yes → Buy API (can't self-host frontier quality)
└─ No → Build with SLM (Phi-4, Llama 4, Mistral)
Do you need custom behavior (fine-tuned)?
├─ Yes → Build with API fine-tuning or self-host
└─ No → Buy with prompting
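
The tree above can be read as four questions asked in order. The sketch below is one possible encoding, not a definitive policy: the `Workload` fields and the `build_or_buy` function are hypothetical names, and it assumes that only the privacy branch ends the walk early while the remaining questions each contribute one recommendation.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical inputs for the four questions in the decision tree."""
    privacy_is_primary: bool
    monthly_tokens: int
    needs_frontier_quality: bool
    needs_custom_behavior: bool


def build_or_buy(w: Workload, volume_threshold: int = 10_000_000) -> list[str]:
    """Walk the tree top to bottom and collect one recommendation per question."""
    if w.privacy_is_primary:
        # Assumption: the privacy branch is terminal and overrides the cost questions.
        return ["Self-host (build)"]

    notes = []
    if w.monthly_tokens > volume_threshold:
        notes.append("Build may be cheaper")
    else:
        notes.append("API is almost certainly cheaper")

    if w.needs_frontier_quality:
        notes.append("Buy API")
    else:
        notes.append("Build with SLM")

    if w.needs_custom_behavior:
        notes.append("Build with API fine-tuning or self-host")
    else:
        notes.append("Buy with prompting")
    return notes


print(build_or_buy(Workload(False, 5_000_000, True, False)))
```

The recommendations can conflict (for example, high volume plus a need for frontier quality), in which case the trade-off still has to be weighed by hand.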

Total Cost of Ownership (TCO)

API-Based TCO

Monthly API Cost =
(input_tokens × input_price + output_tokens × output_price) / 1,000,000
× daily_requests × 30

Example: 100K requests/day, 2000 input + 500 output tokens each

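| Model | Input Price | Output Price | Monthly Cost |
| --- | --- | --- | --- |
| Claude Sonnet 4.6 | $3/M | $15/M | $40,500 |
| GPT-5.5 | $2/M | $8/M | $24,000 |
| Gemini 3.1 Pro | $2/M | $12/M | $30,000 |
| DeepSeek V4 | $0.55/M | $2.19/M | $6,585 |
| DeepSeek V4 Flash | $0.14/M | $0.28/M | $1,260 |
| GPT-5.5 Instant | $0.05/M | $0.15/M | $525 |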

Hidden API costs:

  • Embedding generation (if using RAG): $0.02-0.13/1M tokens
  • Reranking: $1/1K docs (Cohere) or free (self-host)
  • Vector DB hosting: $0.04/1K vectors/month (Pinecone)
  • Monitoring: $0-500/month depending on tool

Self-Hosted TCO

Monthly Self-Hosted Cost =
GPU rental + storage + bandwidth + engineering time + electricity

GPU rental prices (May 2026):

| GPU | Memory | Cost/Hour | Best For |
| --- | --- | --- | --- |
| NVIDIA H100 | 80GB | $1.50-3.00 | Training, 70B+ inference |
| NVIDIA B200 | 192GB | $3.00-5.00 | Training, MoE inference |
| NVIDIA A100 | 80GB | $1.00-2.00 | Inference, fine-tuning |
| AMD MI300X | 192GB | $1.00-2.00 | Inference (Linux only) |
| Apple M4 Ultra | 256GB unified | Included (Mac) | Local 7B-70B, prototyping |

Example: Self-hosting Llama 4 70B

| Item | Monthly Cost |
| --- | --- |
| 2x H100 (RunPod/Lambda) | $2,160 (30 days × 24 h × $3/h for the pair, i.e. $1.50 per GPU-hour) |
| Storage (500GB SSD) | $50 |
| Bandwidth (10TB) | $100 |
| Engineering time (20% FTE) | $2,000-5,000 |
| Total | $4,310-7,310 |
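
The line items above can be summed with a small helper. This is a minimal sketch: `self_hosted_monthly_cost` and its parameter names are hypothetical, the GPU rate is taken as the combined hourly rate for all rented GPUs, and electricity is left out because the example table does not list it separately.

```python
def self_hosted_monthly_cost(
    gpu_rate_per_hour: float,  # combined hourly rate for all rented GPUs (USD)
    storage: float,            # monthly storage cost (USD)
    bandwidth: float,          # monthly bandwidth cost (USD)
    engineering: float,        # monthly engineering time (USD)
    hours_per_day: int = 24,
    days: int = 30,
) -> float:
    """Sum the monthly line items for an always-on rented deployment."""
    gpu_rental = gpu_rate_per_hour * hours_per_day * days
    return gpu_rental + storage + bandwidth + engineering


# Low and high ends differ only in the engineering-time estimate.
low = self_hosted_monthly_cost(3.0, 50, 100, 2_000)
high = self_hosted_monthly_cost(3.0, 50, 100, 5_000)
print(f"${low:,.0f} - ${high:,.0f}")  # $4,310 - $7,310
```

Engineering time dominates the range, so it is usually the first input worth estimating carefully.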

Breakeven Analysis

At what volume does self-hosting become cheaper than API?

| Model | API Cost per 1M Tokens | Self-host Break-even |
| --- | --- | --- |
| Claude Sonnet 4.6 | $15 (output) | ~50M tokens/month |
| GPT-5.5 | $8 (output) | ~100M tokens/month |
| DeepSeek V4 Flash | $0.28 (output) | ~1B tokens/month |

Rule of thumb: You need 50M+ output tokens per month before self-hosting makes financial sense for frontier-class models. For SLMs, the break-even is lower (10-20M tokens/month).
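
A break-even volume can be approximated by dividing a fixed monthly self-hosting cost by the API price per token. The sketch below makes simplifying assumptions that are not from the tables above: it counts output tokens only, treats self-hosting as a pure fixed cost with no per-token component, and assumes the hardware can actually serve the volume. `break_even_tokens` is a hypothetical helper name.

```python
def break_even_tokens(fixed_monthly_cost: float, api_price_per_million: float) -> float:
    """Monthly token volume at which API spend equals a fixed self-hosting cost."""
    if api_price_per_million <= 0:
        raise ValueError("api_price_per_million must be positive")
    return fixed_monthly_cost / api_price_per_million * 1_000_000


# A $750/month fixed cost against a $15/M output price.
print(f"{break_even_tokens(750, 15):,.0f}")    # 50,000,000
# The $4,310 low-end total from the Llama example against the same price.
print(f"{break_even_tokens(4_310, 15):,.0f}")  # 287,333,333
```

The result is very sensitive to which costs are counted as fixed, so it is worth rerunning with your own infrastructure and staffing numbers before relying on any single threshold.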


Optimization Strategies

1. Model Routing

Route simple queries to cheap models, complex to expensive:

| Query Type | Model | Cost/Query | % of Traffic |
| --- | --- | --- | --- |
| Simple Q&A, classification | GPT-5.5 Instant | $0.00005 | 60% |
| Content generation, analysis | Claude Sonnet 4.6 | $0.003 | 30% |
| Complex reasoning, coding | Claude Opus 4.7 | $0.015 | 10% |

Weighted average cost: $0.00005 × 60% + $0.003 × 30% + $0.015 × 10% ≈ $0.0024/query

Without routing (all Claude Opus): $0.015/query, roughly 6x more expensive
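
The blended cost is a traffic-weighted average, which is easy to recompute for a different traffic mix. A minimal sketch, using the per-query costs and shares from the routing table; the `routes` structure and `blended_cost` function are hypothetical names introduced for illustration.

```python
routes = [
    # (query type, cost per query in USD, share of traffic)
    ("simple", 0.00005, 0.60),
    ("generation", 0.003, 0.30),
    ("complex", 0.015, 0.10),
]


def blended_cost(routes) -> float:
    """Traffic-weighted average cost per query."""
    total_share = sum(share for _, _, share in routes)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(cost * share for _, cost, share in routes)


routed = blended_cost(routes)
unrouted = 0.015  # every query sent to the most expensive tier
print(f"routed: ${routed:.5f}/query, ratio: {unrouted / routed:.1f}x")
```

Note that the expensive tier still accounts for most of the blended cost, so the savings depend mainly on how small that 10% share can be kept.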

2. Prompt Caching

If users share common system prompts or context:

Without caching: 10K requests × 5K tokens × $3/M = $150/day
With caching: 10K × (500 new + 4500 cached at 10% price)
= 10K × (500 × $3/M + 4500 × $0.30/M)
= $15 + $13.50 = $28.50/day
Savings: 81%
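
The same arithmetic as a reusable function, for plugging in other prefix lengths or prices. This is a sketch under the assumptions of the example above (cached tokens billed at 10% of the input price); it does not model cache-write surcharges or cache expiry, which some providers apply. `daily_prompt_cost` and `cached_price_ratio` are hypothetical names.

```python
def daily_prompt_cost(requests, new_tokens, cached_tokens, price_per_million,
                      cached_price_ratio=0.10):
    """Daily input cost when cached tokens are billed at a fraction of full price."""
    billable_tokens = new_tokens + cached_tokens * cached_price_ratio
    per_request = billable_tokens * price_per_million / 1_000_000
    return requests * per_request


uncached = daily_prompt_cost(10_000, 5_000, 0, 3.0)
cached = daily_prompt_cost(10_000, 500, 4_500, 3.0)
savings = 1 - cached / uncached
print(f"${uncached:.2f} -> ${cached:.2f} ({savings:.0%} saved)")  # $150.00 -> $28.50 (81% saved)
```

The savings scale with the cached share of the prompt: the 81% figure reflects a prompt where 90% of the tokens are a shared prefix.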

3. Semantic Caching

Cache similar queries to avoid re-computation:

  • Exact match cache — 5-15% hit rate, trivial to implement
  • Semantic cache — 20-40% hit rate, requires vector DB
  • Combined — 25-50% hit rate
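
An exact-match cache can be as small as a dictionary keyed on a hash of the normalized prompt. The sketch below is illustrative only: `ExactMatchCache`, `get_or_compute`, and the `fake_model` stand-in are hypothetical names, and it has no eviction or expiry. A semantic cache would replace the hash lookup with a nearest-neighbor search over embeddings.

```python
import hashlib


class ExactMatchCache:
    """In-memory cache keyed on a hash of the normalized prompt."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        # Lowercase and collapse whitespace so trivial variations still match.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_compute(self, prompt: str, compute):
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = compute(prompt)  # `compute` stands in for the real model call
        self._store[key] = result
        return result


cache = ExactMatchCache()
fake_model = lambda p: f"answer to: {p}"  # hypothetical stand-in for an API call
cache.get_or_compute("What is TCO?", fake_model)
cache.get_or_compute("  what is   TCO? ", fake_model)  # hit after normalization
print(cache.hits, cache.misses)  # 1 1
```

Tracking hits and misses from the start makes it possible to compare the observed hit rate against the ranges listed above.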

4. Output Length Optimization

Shorter responses cost less:

| Change | Cost Impact | Quality Impact |
| --- | --- | --- |
| Reduce max_tokens from 2000 to 500 | Up to 75% less output cost | Usually minimal |
| Request structured output (JSON) | 30-50% less | More focused responses |
| Use concise system prompt | 10-20% less | Often improves quality |

5. Batch Processing

For non-real-time workloads, batch processing reduces cost:

Real-time: 10K requests at $0.003 = $30/day
Batch: 10K requests at $0.001 = $10/day (with batching discounts)
Savings: 67%

Market Dynamics (May 2026)

The Price Collapse

AI inference costs have dropped dramatically since early 2024:

| Period | Cost of 1M Output Tokens (GPT-class) | Driver |
| --- | --- | --- |
| Jan 2024 | $30-60 | GPT-4 pricing |
| Jan 2025 | $10-20 | Competition from Claude, Gemini |
| Jan 2026 | $2-8 | DeepSeek V3/V4 shock |
| May 2026 | $0.05-$0.28 (Flash/Instant) | Price war, open-weight models |

Key driver: DeepSeek V4 Flash at $0.14/$0.28 per million input/output tokens forced every provider to cut prices. The market is now in a race to the bottom on commodity inference.

Market Structure

| Tier | Examples | Pricing | Market Share |
| --- | --- | --- | --- |
| Premium | Claude Opus, o3, Grok 3 Pro | $3-$60/M output | ~10% |
| Standard | Claude Sonnet, GPT-5.5, Gemini 3.1 Pro | $2-$15/M output | ~40% |
| Budget | GPT-5.5 Instant, DeepSeek V4 Flash, Gemini 3 Mini | $0.05-$0.28/M output | ~50% |

What This Means for Builders

| Strategy | Implication |
| --- | --- |
| Don’t optimize prematurely | Prices will likely halve again in 12 months. Over-optimizing for today’s prices is wasted effort. |
| Redundancy is free | With multiple providers at similar prices, always have a fallback. |
| Open-weight is viable | Self-hosting Llama 4 Scout, DeepSeek V4, or Muse Spark is cheaper than API at scale. |
| Fine-tuning ROI has shrunk | With prompt engineering + RAG + routing, fine-tuning rarely pays back its cost. |
| Edge AI economics shift | On-device SLMs (Phi-4, Gemma) cost $0/query. For high-volume, privacy-sensitive apps, this is the cheapest option. |

Cost Estimation Tool

def estimate_monthly_cost(
    daily_requests=1000,
    input_tokens=2000,
    output_tokens=500,
    model="claude-sonnet",
):
    """Estimate monthly API spend in USD, assuming a 30-day month.

    Token counts are per request. Prices are example rates in USD per
    1M tokens, not live pricing. Raises KeyError for an unknown model.
    """
    pricing = {
        "claude-opus": {"in": 15, "out": 75},
        "claude-sonnet": {"in": 3, "out": 15},
        "claude-haiku": {"in": 0.80, "out": 4},
        "gpt-5.5": {"in": 2, "out": 8},
        "gpt-5.5-instant": {"in": 0.05, "out": 0.15},
        "gemini-3.1-pro": {"in": 2, "out": 12},
        "deepseek-v4": {"in": 0.55, "out": 2.19},
        "deepseek-v4-flash": {"in": 0.14, "out": 0.28},
    }
    p = pricing[model]
    # Daily tokens times price per 1M tokens, divided by 1M to get USD.
    daily_cost = (
        daily_requests * input_tokens * p["in"]
        + daily_requests * output_tokens * p["out"]
    ) / 1_000_000
    return round(daily_cost * 30, 2)

Key Takeaways

Prices as of May 2026. API pricing changes frequently. Check individual provider pages for current rates.

  1. API is almost always cheaper than self-hosting at low-to-medium volume (<50M tokens/month)
  2. Routing is the highest-ROI optimization: send most traffic to cheap models and reserve expensive models for the hardest queries (60/30/10 in the example above)
  3. Prices are dropping 50%+ per year, so don’t over-optimize for today’s prices
  4. Caching is underutilized: prompt caching (up to 81% savings) and semantic caching (20-40% hit rate) are among the easiest wins
  5. Self-hosting breaks even at 50M+ tokens/month: viable for high-volume workloads, not for small-scale ones
  6. Edge AI costs $0/query in API fees: for privacy-sensitive or high-volume apps, on-device SLMs are the ultimate cost play
