Inference Optimization: Speed, Cost & Edge AI


Techniques to make LLM inference faster, cheaper, and deployable on edge - quantization, caching, batching, speculative decoding, SLMs, and on-device AI

Key Takeaways

Quantization (INT8/INT4) reduces memory 2-4x with minimal quality loss
KV cache is the single most impactful latency optimization: it reuses the key and value tensors already computed for earlier tokens
Small Language Models (Phi-4, Gemma) match GPT-3.5 class quality at 1/10th the size
Hybrid routing (SLM for 80% of queries, frontier for 20%) cuts costs by 60-80%

Making LLM inference fast and affordable - techniques that work at any scale.

The Latency-Cost Tradeoff

Every optimization is a tradeoff between latency, throughput, cost, and output quality. The best approach depends on your use case.

Goal	Optimize For	Key Techniques
Real-time chat	Low latency (<1s)	KV cache, smaller models, quantization, speculative decoding
Batch processing	High throughput	Batching (static or continuous)
Budget constraint	Low cost	Quantization, smaller models, prompt caching
Maximum quality	No compromise	Full precision, no quantization

Quantization

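Quantization reduces the numeric precision of model weights to shrink memory use and speed up inference. Most models are trained and released in 16-bit (FP16/BF16) or 32-bit (FP32) floating point; quantization converts the weights to a lower-precision format such as INT8 or INT4.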
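
To make the conversion concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. The function names and the sample weights are illustrative assumptions; production libraries typically use per-channel or per-group scales and optimized kernels rather than this exact scheme.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> int8 values plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map int8 values back to approximate floats."""
    return [v * scale for v in q]


weights = [0.82, -1.27, 0.05, 0.0, 0.31]  # hypothetical FP weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)          # integer codes stored instead of floats
print(max_error)  # rounding error, bounded by scale / 2
```

Each weight now occupies one byte instead of two (FP16) or four (FP32), at the cost of a rounding error of at most half a quantization step.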

Common Formats

Format	Bits/Weight	Speedup	Quality Impact	Use Case
FP16	16	1x	None	Baseline
INT8	8	~2x	Minimal	Production default
INT4	4	~3-4x	Small but noticeable	Local deployment
NF4	4	~3-4x	Less loss than INT4	QLoRA fine-tuning
FP8	8	~2x	Minimal (requires FP8-capable hardware)	H100/H200 optimized

Rule of thumb: INT8 for production APIs where quality matters. INT4 for local/edge deployment where memory is tight.

Code Example: Loading a Quantized Model (Transformers)

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization config (bitsandbytes NF4)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quant_config,
    device_map="auto",
)

KV Caching

The key-value (KV) cache stores the attention key and value tensors of previous tokens so they don’t need to be recomputed for each new token. This is the single most impactful optimization for latency.

How It Works

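First token (prefill): compute attention over the full prompt (slow, ~100ms)
Subsequent tokens (decode): reuse cached KV vectors (fast, ~10ms each)
Tradeoff: KV cache grows linearly with sequence length (in FP16, roughly 2.5MB per token for a 70B model with full multi-head attention, or about 0.3MB with grouped-query attention as in Llama 70B)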
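
The per-token cache size follows directly from the model shape. The sketch below assumes the published Llama 70B configuration (80 layers, 64 attention heads, 8 key-value heads, head dimension 128) and FP16 storage; the function name is illustrative.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache added per token: keys + values for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Grouped-query attention: only 8 KV heads are cached.
gqa_bytes = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
# Full multi-head attention: all 64 heads are cached.
mha_bytes = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=64, head_dim=128)

print(gqa_bytes)  # 327680 bytes, about 0.3 MB per token
print(mha_bytes)  # 2621440 bytes, about 2.5 MB per token

# Cache for one 8,192-token sequence with GQA, in GiB.
print(gqa_bytes * 8192 / 2**30)  # 2.5
```

Multiply by the number of concurrent sequences to see why the KV cache, not the weights, often limits batch size on long contexts.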

Prompt Caching

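Some providers (Anthropic, OpenAI) offer prompt caching: when requests share a long prefix such as a system prompt, the cached portion is billed at a discount. Anthropic bills cache reads at about 10% of the base input rate (with a small surcharge when the cache is first written), and OpenAI applies its discount automatically to repeated prefixes.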

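import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
LONG_SYSTEM_PROMPT = "..."  # placeholder: your long, stable system prompt

# Anthropic prompt caching: mark the reusable prefix with cache_control
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1000,
    system=[{"type": "text", "text": LONG_SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": "Analyze this document."}],
)
# Cache reads are billed at about 10% of the base input rate
print(response.usage.cache_read_input_tokens)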
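
A rough way to estimate the savings is sketched below. The multipliers (1.25x for the first cache write, 0.10x for later reads) are modelled on Anthropic's published short-lived cache pricing and should be checked against current rates. The price, token counts, and function name are hypothetical, and the sketch assumes every request arrives while the cache is still warm.

```python
def input_cost(prefix_tokens, suffix_tokens, n_requests, price_per_mtok,
               write_mult=1.25, read_mult=0.10):
    """Return (uncached, cached) input cost in dollars for n_requests."""
    per_token = price_per_mtok / 1_000_000
    uncached = n_requests * (prefix_tokens + suffix_tokens) * per_token
    cached_tokens = (
        prefix_tokens * write_mult                       # first request writes the cache
        + (n_requests - 1) * prefix_tokens * read_mult   # later requests read it
        + n_requests * suffix_tokens                     # uncached part is always full price
    )
    return uncached, cached_tokens * per_token


# Hypothetical workload: 10K-token system prompt, 200-token question, 100 requests.
uncached, cached = input_cost(10_000, 200, 100, price_per_mtok=3.0)
savings = 1 - cached / uncached
print(f"${uncached:.2f} -> ${cached:.2f} ({savings:.0%} saved)")
```

The saving shrinks as the uncached suffix grows relative to the shared prefix, so caching pays off most for long, stable system prompts.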

KV Cache Strategy

Short conversations (<2K tokens): No special handling needed
Long documents (10K+ tokens): Enable prompt caching for repeated prefixes
Very long contexts (100K+): Consider sliding window attention or StreamingLLM-style attention sinks
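
For the very-long-context case, one option is to bound the cache by keeping only a few initial tokens plus a rolling window of recent ones, in the spirit of StreamingLLM's attention sinks. The sketch below only selects which token positions stay in the cache; the function name and default values are illustrative assumptions, not a library API.

```python
def positions_to_keep(seq_len, window, n_sink=4):
    """Return token positions retained in a bounded KV cache.

    Keeps the first n_sink positions (attention sinks) plus the most
    recent `window` positions, without duplicates.
    """
    sinks = list(range(min(n_sink, seq_len)))
    recent_start = max(n_sink, seq_len - window)
    recent = list(range(recent_start, seq_len))
    return sinks + recent


print(positions_to_keep(seq_len=10, window=3, n_sink=2))  # [0, 1, 7, 8, 9]
print(positions_to_keep(seq_len=3, window=3, n_sink=2))   # [0, 1, 2]
```

The cache never holds more than n_sink + window entries, so memory stays constant no matter how long the stream runs, at the cost of forgetting the evicted middle of the context.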

Batching

Processing multiple requests simultaneously improves GPU utilization and throughput.

Static vs Dynamic Batching

Type	How It Works	Best For
Static	Fixed batch size, all requests finish together	Predictable workloads
Dynamic (continuous)	New requests join in-progress batches as others finish	Variable traffic, real-time
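
The difference between the two rows can be seen in a toy simulation that counts decode steps. This is a simplified sketch, not how a real serving engine schedules work: each request is reduced to a number of decode steps, every step advances all active requests by one token, and the function names are hypothetical.

```python
from collections import deque


def static_batch_steps(lengths, batch_size):
    """Each fixed batch runs until its longest request finishes."""
    return sum(
        max(lengths[i:i + batch_size])
        for i in range(0, len(lengths), batch_size)
    )


def continuous_batch_steps(lengths, batch_size):
    """Freed slots are refilled from the queue after every step."""
    queue = deque(lengths)
    active = []
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Advance every active request by one token; drop finished ones.
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]
    return steps


lengths = [10, 2, 2, 2]  # decode steps needed per request
print(static_batch_steps(lengths, batch_size=2))      # 12
print(continuous_batch_steps(lengths, batch_size=2))  # 10
```

With static batching the short requests wait behind the long one in their batch; with continuous batching they slot in as soon as a position frees up, which is why it handles variable traffic better.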

Throughput Impact

Single request:    1 req → 10s → 0.1 req/s
Batch of 8:        8 req → 12s → 0.67 req/s (6.7x throughput)
Batch of 32:      32 req → 18s → 1.78 req/s (17.8x throughput)

Key insight: Batching increases throughput but also increases latency for individual requests. Use large batches for offline processing; for real-time chat, prefer continuous batching with a modest batch size so that no request waits on a full batch.
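
The arithmetic behind the throughput figures above is simply requests divided by wall-clock time. A small sketch that reproduces it (the timings are the illustrative ones from the table, not measurements):

```python
def throughput(batch_size, batch_seconds):
    """Requests completed per second for one batch."""
    return batch_size / batch_seconds


baseline = throughput(1, 10)  # single request: 0.1 req/s
for batch_size, seconds in [(8, 12), (32, 18)]:
    rps = throughput(batch_size, seconds)
    print(f"batch={batch_size}: {rps:.2f} req/s, "
          f"{rps / baseline:.1f}x throughput, {seconds}s latency per request")
# batch=8: 0.67 req/s, 6.7x throughput, 12s latency per request
# batch=32: 1.78 req/s, 17.8x throughput, 18s latency per request
```

Throughput rises almost 18x while each individual request waits 18s instead of 10s, which is the tradeoff in one line.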

Speculative Decoding

A smaller “draft” model predicts multiple tokens ahead, and the large model verifies them in parallel. When drafts are correct, you get multiple tokens for the cost of one verification step.

When It Works

High-acceptance tasks: Code generation, structured output (JSON, XML)
Low-acceptance tasks: Creative writing, nuanced reasoning
Typical speedup: 1.5x-3x for code, 1.2x-1.5x for general text
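
The acceptance behavior can be illustrated with a toy greedy version of the algorithm. In this sketch the "models" are plain Python functions that return the next token, which is an assumption made for illustration; a real system verifies all drafted positions in a single batched forward pass of the large model and handles sampling, not just greedy decoding.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy next-token function


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 3
                       ) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, target verification passes)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each proposed position (counted as one pass).
        target_passes += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(tokens + proposal[:i])
            accepted.append(expected)
            if proposal[i] != expected:
                break  # first mismatch: keep the target's token, drop the rest
        else:
            # Every draft token matched, so the pass also yields a bonus token.
            accepted.append(target(tokens + proposal))
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new], target_passes


# Toy models: the target counts upward mod 10; the draft errs after a 4.
def target_model(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def draft_model(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


out, passes = speculative_decode(target_model, draft_model, [0], max_new=12)
print(out)     # identical to plain greedy decoding with the target
print(passes)  # 4 verification passes instead of 12 sequential steps
```

The output is always what the target model alone would have produced; the draft only changes how many target passes are needed, which is why low acceptance rates erase the speedup.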

Implementation

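# Using vLLM with speculative decoding
from vllm import LLM, SamplingParams

# Draft and target must share a tokenizer. Recent vLLM releases take a
# speculative_config dict; older ones used speculative_model=... instead.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_config={
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # draft model
        "num_speculative_tokens": 5,
    },
)
outputs = llm.generate(["Write a JSON schema for a user profile."], SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)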

Cost Optimization Strategy

Technique	Cost Reduction	Implementation Effort
Prompt caching	50-90% on repeated prefixes	Built into API (Anthropic, OpenAI)
Model routing	60-80% overall	Requires routing logic
Batching	5-10x throughput	Requires vLLM or similar
Quantization	2-4x memory reduction	One-time model conversion
Smaller models for simple tasks	10-100x cost difference	Task routing

Practical Decision Tree

What's your priority?
│
├─ Lowest latency → KV cache + small model + speculative decoding + no batching
├─ Highest throughput → Large batches + continuous batching
├─ Lowest cost → Small model + INT4 + prompt caching + routing
└─ Best quality → Large model + FP16 + no quantization

Small Language Models & Edge AI

Not every task needs a 400B-parameter frontier model. Small Language Models (SLMs) are optimized for efficiency, running on consumer hardware, phones, and even browsers.

The SLM Landscape (May 2026)

Model	Parameters	Comparable Quality	Best For
Phi-4 (Microsoft)	14B	GPT-3.5 class	Reasoning, code, general
Gemma 2/3 (Google)	2B-9B	Llama 3 8B class	Lightweight, multilingual
Llama 3.2 (Meta)	1B-11B	GPT-3.5 class	General, on-device
TinyLlama	1.1B	GPT-2 class	Ultra-lightweight, CPU only
Qwen 2.5 (Alibaba)	0.5B-72B	Frontier at 72B, efficient at smaller	Multilingual
Mistral 7B	7B	GPT-3.5 class	Fast, instruction-following
H2O-Danube	1.8B	GPT-2 class	Simpler tasks

Key insight: Phi-4 (14B) reaches GPT-3.5-class quality on many tasks at a fraction of the size (for scale, GPT-3 has 175B parameters). The gap between small and large models is shrinking rapidly due to better training data and techniques.

When SLMs Are Enough

Task	SLM Works?	Frontier Model Better?
Classification, routing, intent detection	✅ Yes	Marginally
Simple Q&A, summarization	✅ Yes	Slightly
Code generation (common patterns)	✅ Yes	For complex logic
Creative writing, nuanced analysis	⚠️ Sometimes	✅ Significantly
Multi-step reasoning	❌ Rarely	✅ Much better
Multilingual, low-resource	⚠️ Depends on training	✅ Usually better

Rule of thumb: If a human could answer in 5 seconds, an SLM is probably sufficient. If it takes a human 30+ seconds of thinking, use a larger model.

On-Device AI

The biggest trend in SLMs is running them directly on phones, laptops, and IoT devices, with no internet connection required.

Apple Intelligence (Apple, 2024-2026):

On-device models for summarization, rewriting, image editing
Uses a mix of an on-device SLM (3B class) and cloud fallback to Apple's Private Cloud Compute, with optional ChatGPT integration
Privacy-focused: sensitive queries stay on-device
Available on iPhone 15 Pro and later, and on M-series iPads and Macs

Android AI (Google, 2025-2026):

Gemini Nano: compact on-device model (about 1.8B and 3.25B parameter variants) running on Pixel and Samsung devices
Features: smart reply, summarization, photo editing
On Pixel, powered by Google Tensor chips with dedicated AI accelerators

Browser-based AI:

WebLLM / WebGPU: Run SLMs directly in the browser using WebGPU API
Transformers.js: Run Hugging Face models in-browser
Chrome built-in AI: Gemini Nano exposed through Chrome's built-in AI APIs (early previews used window.ai; availability varies by Chrome version)
Use cases: privacy-sensitive chatbots, local language translation, accessibility tools

Edge deployment formats:

Format	Platform	Use Case
GGUF	llama.cpp, Ollama, LM Studio	CPU inference, personal computers
CoreML	Apple devices (iOS, macOS)	On-device, Apple Silicon optimized
TFLite	Android, embedded Linux	Mobile phones, Raspberry Pi
ExecuTorch	Meta’s edge runtime	Mobile, wearable, IoT
ONNX Runtime	Cross-platform	Production edge servers
WebGPU	Browser	Zero-install, in-browser AI

The Hybrid Pattern: Routing

The most efficient deployment uses SLMs and frontier models together, with a router that decides which model to use for each query.

User query
  ↓
Router (lightweight classifier)
  ├─ Simple query → SLM (fast, cheap, on-device)
  ├─ Complex query → Frontier model (powerful, slower, API)
  └─ Sensitive data → SLM (privacy)

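Implementation with a simple heuristic router:

# Illustrative keyword lists; tune these for your own traffic
COMPLEX_KEYWORDS = ("analyze", "compare", "prove", "step by step")
SENSITIVE_KEYWORDS = ("password", "ssn", "medical")

def route_query(query: str) -> str:
    """Route to the right model based on privacy and query complexity."""
    text = query.lower()
    # Privacy first: sensitive queries never leave the device
    if any(kw in text for kw in SENSITIVE_KEYWORDS):
        return "slm"
    # Simple heuristic: short queries (by character count) without complexity keywords
    if len(text) < 50 and not any(kw in text for kw in COMPLEX_KEYWORDS):
        return "slm"  # Phi-4 or Gemma
    return "frontier"  # Claude, GPT, or Gemini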

Alternatively, use a classifier model:

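import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# "routing-model" is a placeholder for your own classifier with one output logit
tokenizer = AutoTokenizer.from_pretrained("routing-model")
router_model = AutoModelForSequenceClassification.from_pretrained("routing-model")

def route_by_complexity(query: str) -> str:
    inputs = tokenizer(query, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logit = router_model(**inputs).logits[0, 0]
    complexity = torch.sigmoid(logit).item()  # 0-1 score
    if complexity < 0.3:
        return "slm"        # e.g. Phi-4
    elif complexity < 0.7:
        return "mid-tier"   # e.g. Claude Sonnet
    return "frontier"       # e.g. Claude Opus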

Results of a good routing strategy:

70-80% of queries go to the SLM (fast, cheap)
15-25% go to mid-tier (balanced)
5-10% go to frontier (expensive but necessary)
Overall cost reduction: 60-80%
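
How much routing saves depends on the traffic split and on what you compare against. The sketch below computes a blended per-query cost; the per-query prices are hypothetical placeholders, not real provider pricing, and the function name is illustrative.

```python
def blended_cost(shares, cost_per_query):
    """Average cost per query given routing shares that sum to 1."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("routing shares must sum to 1")
    return sum(shares[tier] * cost_per_query[tier] for tier in shares)


# Hypothetical per-query costs in dollars.
cost_per_query = {"slm": 0.0002, "mid": 0.003, "frontier": 0.015}
# Traffic split in the ranges listed above.
shares = {"slm": 0.75, "mid": 0.20, "frontier": 0.05}

routed = blended_cost(shares, cost_per_query)
vs_frontier = 1 - routed / cost_per_query["frontier"]
vs_mid = 1 - routed / cost_per_query["mid"]

print(f"routed: ${routed:.4f}/query")
print(f"saving vs all-frontier: {vs_frontier:.0%}")
print(f"saving vs all-mid-tier: {vs_mid:.0%}")
```

Under these assumed prices the saving is about 90% against an all-frontier baseline and about 50% against an all-mid-tier baseline, so quoted cost reductions only mean something alongside the baseline they are measured from.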

Quantization for Edge

Edge deployment depends heavily on quantization. A 7B model in FP16 needs about 14GB for its weights alone (2 bytes per parameter), which is out of reach for a phone. In INT4 (0.5 bytes per parameter) it needs about 3.5GB, which is feasible on recent phones.

Memory requirements by format:

Model	FP16	INT8	INT4	NF4
Phi-4 (14B)	28GB	14GB	7GB	7GB
Gemma 2 (9B)	18GB	9GB	4.5GB	4.5GB
Llama 3.1 (8B)	16GB	8GB	4GB	4GB
TinyLlama (1.1B)	2.2GB	1.1GB	550MB	550MB
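
The table values follow from parameters × bits ÷ 8. A minimal sketch of that calculation (weights only; it ignores the KV cache, activations, and the small per-group overhead that real 4-bit formats add, and the function name is illustrative):

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


models = {"Phi-4": 14, "Gemma 2": 9, "TinyLlama": 1.1}
for name, params in models.items():
    row = ", ".join(
        f"{label}={weight_memory_gb(params, bits):.2f}GB"
        for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]
    )
    print(f"{name} ({params}B): {row}")
```

In practice, budget some extra headroom on top of these figures for the runtime and the KV cache before deciding whether a model fits on a given device.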

Apple Neural Engine: Apple’s ANE can run 7B-class models in INT4 at 30+ tokens/sec on iPhone 17 Pro. This makes real-time on-device chat viable.

Qualcomm AI Engine: Android phones with Snapdragon 8 Elite can run 7B INT4 models at 20+ tokens/sec.

When to Use Each Approach

Scenario	Recommended Setup	Cost	Latency
Privacy-sensitive chat	On-device SLM (INT4)	$0	50-200ms
High-volume API	Router → mostly SLM	$0.001/query	100-500ms
Mobile app	On-device SLM + cloud fallback	$0.001/query	50ms on-device
Browser extension	WebLLM + Transformers.js	$0	200-500ms
IoT / embedded	TinyLlama (GGUF, INT4)	$0	500ms+

Production Tools

Tool	Best For	Key Feature
vLLM	High-throughput serving	PagedAttention, continuous batching
TensorRT-LLM	NVIDIA GPU optimization	Kernel fusion, INT4/FP8
Ollama	Local experimentation	One-command setup
llama.cpp	CPU + edge deployment	Extremely efficient CPU inference
TGI (Text Generation Inference)	Hugging Face ecosystem	Token streaming, tensor parallelism

Quick Reference: Optimization by Scenario

Scenario	Recommended Setup	Estimated Cost/Month
On-device AI (phone/browser)	1-4B SLM INT4 + WebLLM / CoreML	$0 (on-device)
Personal chatbot (<1K req/day)	Ollama + 7B INT4 model	$0 (local)
Production API (10K req/day)	vLLM + 70B INT8 + prompt caching	$200-500
Batch processing (1M req/day)	TensorRT-LLM + FP8 + continuous batching	$1,000-3,000
Real-time voice (50ms SLA)	Lightweight model + KV cache + no batching	$500-2,000
Hybrid routing (100K req/day)	Router → 80% SLM / 20% frontier	$100-500