Inference Optimization: Speed, Cost & Edge AI
Making LLM inference fast and affordable - techniques that work at any scale.
The Latency-Cost Tradeoff
Every optimization is a tradeoff between latency, throughput, cost, and output quality. The best approach depends on your use case.
| Goal | Optimize For | Key Techniques |
|---|---|---|
| Real-time chat | Low latency (<1s) | KV cache, smaller models, quantization |
| Batch processing | High throughput | Batching, speculative decoding |
| Budget constraint | Low cost | Quantization, smaller models, prompt caching |
| Maximum quality | No compromise | Full precision, no quantization |
Quantization
Quantization reduces the numerical precision of model weights to shrink memory use and speed up inference. Most models are trained in 16-bit or 32-bit floating point (FP16, BF16, or FP32); quantization converts the weights to a lower-precision format.
Common Formats
| Format | Bits/Weight | Speedup | Quality Impact | Use Case |
|---|---|---|---|---|
| FP16 | 16 | 1x | None | Baseline |
| INT8 | 8 | ~2x | Minimal | Production default |
| INT4 | 4 | ~3-4x | Small but noticeable | Local deployment |
| NF4 | 4 | ~3-4x | Less loss than INT4 | QLoRA fine-tuning |
| FP8 | 8 | ~2x | None (newer hardware) | H100/H200 optimized |
Rule of thumb: INT8 for production APIs where quality matters. INT4 for local/edge deployment where memory is tight.
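To make the precision reduction concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is illustrative only: the function names `quantize_int8` and `dequantize` are made up for this example, and real libraries typically use per-channel or per-group scales and fused kernels.

```python
def quantize_int8(weights):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.5, 0.33, 0.0, 0.49]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, scale, max_error)
```

The rounding error per weight is bounded by half the scale, which is why a tensor with a few large outliers loses more precision than one with a narrow value range.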
Code Example: Loading a Quantized Model (Transformers)
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization config (bitsandbytes)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quant_config,
    device_map="auto",
)
```
KV Caching
The key-value (KV) cache stores the attention key and value vectors computed for previous tokens so they don’t need to be recomputed for each new token. This is the single most impactful optimization for latency.
How It Works
- First token: compute full attention (slow, ~100ms)
- Subsequent tokens: reuse cached KV vectors (fast, ~10ms each)
- Tradeoff: KV cache grows with sequence length (~2MB per token for a 70B model)
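The per-token figure above can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes a 70B-class architecture with 80 layers, 64 attention heads of dimension 128, and FP16 cache entries; these numbers are illustrative assumptions, not taken from a specific model card. Models that use grouped-query attention store fewer KV heads and need proportionally less.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache added per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed 70B-class shape: 80 layers, 64 heads x 128 dims, FP16 cache.
full_mha = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=64, head_dim=128)
# Same shape with grouped-query attention and 8 KV heads (also an assumption).
gqa = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)

print(f"MHA: {full_mha / 2**20:.2f} MiB per token")
print(f"GQA: {gqa / 2**20:.2f} MiB per token")
print(f"MHA cache for a 4,096-token context: {full_mha * 4096 / 2**30:.1f} GiB")
```

Under these assumptions the full multi-head case works out to about 2.5 MiB per token, in line with the rough ~2MB estimate, while the grouped-query variant is eight times smaller.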
Prompt Caching
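Some providers (Anthropic, OpenAI) offer prompt caching: if you send the same prompt prefix (such as a long system prompt) repeatedly, the cached portion is billed at a discount. Anthropic bills cache reads at roughly 10% of the normal input rate; discounts and cache lifetimes vary by provider.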
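```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
LONG_SYSTEM_PROMPT = "..."  # placeholder: your long, reusable system prompt

# Anthropic prompt caching: cache_control marks the prefix to cache
response = client.messages.create(
    model="claude-sonnet-4-20260510",
    max_tokens=1000,
    system=[{"type": "text", "text": LONG_SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": "Analyze this document."}],
)
# Cached portions charged at ~10% of normal rate
```
KV Cache Strategy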
- Short conversations (<2K tokens): No special handling needed
- Long documents (10K+ tokens): Enable prompt caching for repeated prefixes
- Very long contexts (100K+): Consider sliding window attention or a StreamingLLM-style approach
Batching
Processing multiple requests simultaneously improves GPU utilization and throughput.
Static vs Dynamic Batching
| Type | How It Works | Best For |
|---|---|---|
| Static | Fixed batch size, all requests finish together | Predictable workloads |
| Dynamic (continuous) | New requests join in-progress batches as others finish | Variable traffic, real-time |
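A toy simulation can illustrate why continuous batching helps when requests have uneven lengths. The sketch below is a simplified model, not how any particular serving engine schedules work: each request needs a fixed number of decode steps, one step advances every active request by one token, and `max_batch` slots are available. Both function names are hypothetical.

```python
def static_batching_steps(lengths, max_batch):
    """Each batch runs until its longest request finishes."""
    total = 0
    for start in range(0, len(lengths), max_batch):
        total += max(lengths[start:start + max_batch])
    return total


def continuous_batching_steps(lengths, max_batch):
    """A finished request frees its slot for the next waiting request."""
    waiting = list(lengths)
    active = []
    steps = 0
    while waiting or active:
        # Fill free slots from the queue before each decode step.
        while waiting and len(active) < max_batch:
            active.append(waiting.pop(0))
        steps += 1
        # Requests with one step left finish now; the rest count down.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lengths = [10, 200, 15, 20, 180, 12, 25, 30]  # decode steps per request
print("static:    ", static_batching_steps(lengths, max_batch=4))
print("continuous:", continuous_batching_steps(lengths, max_batch=4))
```

In the static case, short requests sit idle in a batch until the longest one completes; in the continuous case, their slots are reused immediately, so total time is bounded mainly by the longest single request.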
Throughput Impact
```
Single request:  1 req → 10s → 0.1 req/s
Batch of 8:      8 req → 12s → 0.67 req/s (6.7x throughput)
Batch of 32:    32 req → 18s → 1.78 req/s (17.8x throughput)
```
Key insight: Batching increases throughput but increases latency for individual requests. Use batching for offline processing; avoid it for real-time chat.
Speculative Decoding
A smaller “draft” model predicts multiple tokens ahead, and the large model verifies them in parallel. When drafts are correct, you get multiple tokens for the cost of one verification step.
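The draft-and-verify loop can be sketched with toy stand-ins for the two models. This is a minimal greedy variant under simplifying assumptions: `target_next` and `draft_next` are hypothetical deterministic functions that return the next token for a context, and verification is written as a sequential loop, whereas a real system scores all draft positions in a single forward pass of the large model and usually applies a probabilistic acceptance rule.

```python
def speculative_decode(target_next, draft_next, prompt, num_tokens, k=4):
    """Greedy speculative decoding; returns (tokens, verification_rounds)."""
    tokens = list(prompt)
    rounds = 0
    goal = len(prompt) + num_tokens
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target model checks each position (one parallel pass in practice).
        rounds += 1
        accepted = []
        for proposed in draft:
            expected = target_next(tokens + accepted)
            if proposed != expected:
                accepted.append(expected)  # replace the first mismatch and stop
                break
            accepted.append(proposed)
        else:
            # All drafts matched: the same pass yields one bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], rounds


def target_next(context):
    """Toy target model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def draft_next(context):
    """Toy draft model: agrees with the target except right after a 6."""
    return 0 if context[-1] == 6 else (context[-1] + 1) % 10


output, rounds = speculative_decode(target_next, draft_next, [0], num_tokens=20)
print(output, rounds)
```

The output is identical to what the target model would produce on its own; the saving comes from needing fewer target-model passes than generated tokens.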
When It Works
- High-acceptance tasks: Code generation, structured output (JSON, XML)
- Low-acceptance tasks: Creative writing, nuanced reasoning
- Typical speedup: 1.5x-3x for code, 1.2x-1.5x for general text
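The link between acceptance rate and speedup can be made concrete with a simple model. Assuming each draft token is accepted independently with probability `alpha` (a simplification; real acceptance varies by position and content) and `k` tokens are drafted per round, the expected number of tokens produced per verification pass is (1 - alpha^(k+1)) / (1 - alpha). The helper name is hypothetical, and the sketch ignores the draft model's own cost, so actual speedups are lower.

```python
def expected_tokens_per_round(alpha, k):
    """Expected tokens per target-model pass with k drafts and acceptance rate alpha."""
    if alpha >= 1.0:
        return k + 1  # every draft accepted, plus the bonus token
    return (1 - alpha ** (k + 1)) / (1 - alpha)


for alpha in (0.5, 0.7, 0.9):
    print(alpha, round(expected_tokens_per_round(alpha, k=5), 2))
```

Higher acceptance rates yield more tokens per pass, which is consistent with predictable outputs such as code benefiting more than creative writing.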
Implementation
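```python
# Using vLLM with speculative decoding
# Note: argument names for speculative decoding have changed across vLLM
# releases; check the documentation for the version you have installed.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B",
    speculative_model="meta-llama/Llama-3.1-8B",  # draft model
    num_speculative_tokens=5,
)

outputs = llm.generate(["Write a haiku about GPUs."], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```
Cost Optimization Strategy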
| Technique | Cost Reduction | Implementation Effort |
|---|---|---|
| Prompt caching | 50-90% on repeated prefixes | Built into API (Anthropic, OpenAI) |
| Model routing | 60-80% overall | Requires routing logic |
| Batching | 5-10x throughput | Requires vLLM or similar |
| Quantization | 2-4x memory reduction | One-time model conversion |
| Smaller models for simple tasks | 10-100x cost difference | Task routing |
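The prompt-caching row can be checked with simple arithmetic. The sketch assumes cached input tokens are billed at 10% of the normal input rate, the figure cited for prompt caching earlier in this guide, and ignores any surcharge for writing to the cache as well as output-token costs. `input_cost_ratio` is a hypothetical helper.

```python
def input_cost_ratio(cached_fraction, cached_rate=0.10):
    """Input cost with caching, relative to the same request without caching."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    return (1 - cached_fraction) + cached_fraction * cached_rate


for fraction in (0.5, 0.9, 1.0):
    savings = 1 - input_cost_ratio(fraction)
    print(f"{fraction:.0%} of prompt cached -> {savings:.0%} lower input cost")
```

The savings scale with how much of each prompt is a repeated prefix, so the upper end of the range only applies when nearly the whole prompt is cached.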
Practical Decision Tree
```
What's your priority?
│
├─ Lowest latency → KV cache + small model + no batching
├─ Highest throughput → Large batches + speculative decoding
├─ Lowest cost → Small model + INT4 + prompt caching + routing
└─ Best quality → Large model + FP16 + no quantization
```
Small Language Models & Edge AI
Not every task needs a 400B-parameter frontier model. Small Language Models (SLMs) are optimized for efficiency, running on consumer hardware, phones, and even browsers.
The SLM Landscape (May 2026)
| Model | Parameters | Quality Relative To | Best For |
|---|---|---|---|
| Phi-4 (Microsoft) | 14B | GPT-3.5 class | Reasoning, code, general |
| Gemma 2/3 (Google) | 2B-9B | Llama 3 8B class | Lightweight, multilingual |
| Llama 3.2 (Meta) | 1B-11B | GPT-3.5 class | General, on-device |
| TinyLlama | 1.1B | GPT-2 class | Ultra-lightweight, CPU only |
| Qwen 2.5 (Alibaba) | 0.5B-72B | Frontier at 72B, efficient at smaller | Multilingual |
| Mistral 7B | 7B | GPT-3.5 class | Fast, instruction-following |
| H2O-Danube | 1.8B | GPT-2 class | Simpler tasks |
Key insight: Phi-4 (14B) matches GPT-3.5 quality at 1/10th the size. The gap between small and large models is shrinking rapidly due to better training data and techniques.
When SLMs Are Enough
| Task | SLM Works? | Frontier Model Better? |
|---|---|---|
| Classification, routing, intent detection | ✅ Yes | Marginally |
| Simple Q&A, summarization | ✅ Yes | Slightly |
| Code generation (common patterns) | ✅ Yes | For complex logic |
| Creative writing, nuanced analysis | ⚠️ Sometimes | ✅ Significantly |
| Multi-step reasoning | ❌ Rarely | ✅ Much better |
| Multilingual, low-resource | ⚠️ Depends on training | ✅ Usually better |
Rule of thumb: If a human could answer in 5 seconds, an SLM is probably sufficient. If it takes a human 30+ seconds of thinking, use a larger model.
On-Device AI
The biggest trend in SLMs is running them directly on phones, laptops, and IoT devices — no internet connection required.
Apple Intelligence (Apple, 2024-2026):
- On-device models for summarization, rewriting, image editing
- Uses a mix of on-device SLM (3B class) and cloud fallback to GPT/Claude
- Privacy-focused: sensitive queries stay on-device
- Available on iPhone 16+ and M-series Macs
Android AI (Google, 2025-2026):
- Gemini Nano: 1.8B model running on Pixel and Samsung devices
- Features: smart reply, summarization, photo editing
- Powered by Google Tensor chips with dedicated AI accelerators
Browser-based AI:
- WebLLM / WebGPU: Run SLMs directly in the browser using WebGPU API
- Transformers.js: Run Hugging Face models in-browser
- Chrome built-in AI: Gemini Nano available via the window.ai API
- Use cases: privacy-sensitive chatbots, local language translation, accessibility tools
Edge deployment formats:
| Format | Platform | Use Case |
|---|---|---|
| GGUF | llama.cpp, Ollama, LM Studio | CPU inference, personal computers |
| CoreML | Apple devices (iOS, macOS) | On-device, Apple Silicon optimized |
| TFLite | Android, embedded Linux | Mobile phones, Raspberry Pi |
| ExecuTorch | Meta’s edge runtime | Mobile, wearable, IoT |
| ONNX Runtime | Cross-platform | Production edge servers |
| WebGPU | Browser | Zero-install, in-browser AI |
The Hybrid Pattern: Routing
The most efficient deployment uses SLMs and frontier models together, with a router that decides which model to use for each query.
```
User query
    ↓
Router (lightweight classifier)
    ├─ Simple query → SLM (fast, cheap, on-device)
    ├─ Complex query → Frontier model (powerful, slower, API)
    └─ Sensitive data → SLM (privacy)
```
Implementation with a router model:
```python
# Illustrative keyword list; tune it for your own traffic
complex_keywords = ("analyze", "compare", "prove", "step by step")

def route_query(query: str) -> str:
    """Route to the right model based on query complexity."""
    text = query.lower()
    # Simple heuristic: character length and keyword detection
    if len(text) < 50 and not any(kw in text for kw in complex_keywords):
        return "slm"  # Phi-4 or Gemma
    elif "password" in text or "ssn" in text or "medical" in text:
        return "slm"  # Privacy-sensitive, keep on-device
    else:
        return "frontier"  # Claude, GPT-5.5, or Gemini
```
Alternatively, use a classifier model:
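```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# "routing-model" is a placeholder for a binary classifier (label 1 = complex);
# phi4, claude_sonnet, and claude_opus are placeholder clients with generate()
tokenizer = AutoTokenizer.from_pretrained("routing-model")
router_model = AutoModelForSequenceClassification.from_pretrained("routing-model")

def route_and_generate(query: str) -> str:
    inputs = tokenizer(query, return_tensors="pt", truncation=True)
    probs = router_model(**inputs).logits.softmax(dim=-1)
    complexity = probs[0, 1].item()  # 0-1 score
    if complexity < 0.3:
        return phi4.generate(query)  # SLM
    elif complexity < 0.7:
        return claude_sonnet.generate(query)  # Mid-tier
    else:
        return claude_opus.generate(query)  # Frontier
```
Results of a good routing strategy: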
- 70-80% of queries go to the SLM (fast, cheap)
- 15-25% go to mid-tier (balanced)
- 5-10% go to frontier (expensive but necessary)
- Overall cost reduction: 60-80%
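How the traffic split translates into savings depends on the price gap between tiers. The sketch below uses made-up per-query prices purely for illustration (they are not real provider prices) and compares a routed mix against sending everything to the frontier model. `blended_cost` is a hypothetical helper.

```python
def blended_cost(shares, prices):
    """Average cost per query given traffic shares and per-query prices by tier."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(shares[tier] * prices[tier] for tier in shares)


# Hypothetical per-query prices in dollars; substitute your own measurements.
prices = {"slm": 0.003, "mid": 0.006, "frontier": 0.02}
shares = {"slm": 0.75, "mid": 0.20, "frontier": 0.05}

routed = blended_cost(shares, prices)
baseline = prices["frontier"]
print(f"routed: ${routed:.5f}/query, reduction: {1 - routed / baseline:.0%}")
```

With a larger price gap between the SLM and the frontier model, the same split yields a larger reduction, so it is worth plugging in your actual per-query costs.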
Quantization for Edge
Edge deployment depends heavily on quantization. A 7B model in FP16 needs 14GB of memory (impossible on a phone). In INT4, it needs only 3.5GB (feasible on recent phones).
Memory requirements by format:
| Model | FP16 | INT8 | INT4 | NF4 |
|---|---|---|---|---|
| Phi-4 (14B) | 28GB | 14GB | 7GB | 7GB |
| Gemma 2 (9B) | 18GB | 9GB | 4.5GB | 4.5GB |
| Llama 3.1 (8B) | 16GB | 8GB | 4GB | 4GB |
| TinyLlama (1.1B) | 2.2GB | 1.1GB | 550MB | 550MB |
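The table values follow from parameter count times bits per weight. A minimal sketch of that arithmetic, counting weights only: it ignores activations, the KV cache, and the small overhead of quantization scales, so real footprints are somewhat higher. The function name is hypothetical.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 10^9 bytes)."""
    return params_billions * bits_per_weight / 8


for name, params in [("Phi-4", 14), ("Gemma 2", 9), ("TinyLlama", 1.1)]:
    sizes = {bits: weight_memory_gb(params, bits) for bits in (16, 8, 4)}
    print(name, sizes)
```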
Apple Neural Engine: Apple’s ANE can run 7B-class models in INT4 at 30+ tokens/sec on iPhone 17 Pro. This makes real-time on-device chat viable.
Qualcomm AI Engine: Android phones with Snapdragon 8 Elite can run 7B INT4 models at 20+ tokens/sec.
When to Use Each Approach
| Scenario | Recommended Setup | Cost | Latency |
|---|---|---|---|
| Privacy-sensitive chat | On-device SLM (INT4) | $0 | 50-200ms |
| High-volume API | Router → mostly SLM | $0.001/query | 100-500ms |
| Mobile app | On-device SLM + cloud fallback | $0.001/query | 50ms on-device |
| Browser extension | WebLLM + Transformers.js | $0 | 200-500ms |
| IoT / embedded | TinyLlama (GGUF, INT4) | $0 | 500ms+ |
Production Tools
| Tool | Best For | Key Feature |
|---|---|---|
| vLLM | High-throughput serving | PagedAttention, continuous batching |
| TensorRT-LLM | NVIDIA GPU optimization | Kernel fusion, INT4/FP8 |
| Ollama | Local experimentation | One-command setup |
| llama.cpp | CPU + edge deployment | Extremely efficient CPU inference |
| TGI (Text Generation Inference) | Hugging Face ecosystem | Token streaming, tensor parallelism |
Quick Reference: Optimization by Scenario
| Scenario | Recommended Setup | Estimated Cost/Month |
|---|---|---|
| On-device AI (phone/browser) | Phi-4 INT4 + WebLLM / CoreML | $0 (on-device) |
| Personal chatbot (<1K req/day) | Ollama + 7B INT4 model | $0 (local) |
| Production API (10K req/day) | vLLM + 70B INT8 + prompt caching | $200-500 |
| Batch processing (1M req/day) | TensorRT-LLM + FP8 + continuous batching | $1,000-3,000 |
| Real-time voice (50ms SLA) | Lightweight model + KV cache + no batching | $500-2,000 |
| Hybrid routing (100K req/day) | Router → 80% SLM / 20% frontier | $100-500 |
See Also
- How LLMs Work - Foundation concepts
- Training & Fine-tuning - Model adaptation
- Models Guide - Pricing reference