
Inference Optimization: Speed, Cost & Edge AI

📖 10 min read · deep-dive · inference · optimization · edge-ai · slm
Techniques to make LLM inference faster, cheaper, and deployable on edge - quantization, caching, batching, speculative decoding, SLMs, and on-device AI
Key Takeaways
  • Quantization (INT8/INT4) reduces memory 2-4x with minimal quality loss
  • KV caching is the single most impactful latency optimization: it reuses the key and value vectors already computed for earlier tokens
  • Small Language Models (Phi-4, Gemma) match GPT-3.5 class quality at 1/10th the size
  • Hybrid routing (SLM for 80% of queries, frontier for 20%) cuts costs by 60-80%

Making LLM inference fast and affordable - techniques that work at any scale.



The Latency-Cost Tradeoff

Every optimization is a tradeoff between latency, throughput, cost, and output quality. The best approach depends on your use case.

Goal | Optimize For | Key Techniques
Real-time chat | Low latency (<1s) | KV cache, smaller models, quantization
Batch processing | High throughput | Batching, speculative decoding
Budget constraint | Low cost | Quantization, smaller models, prompt caching
Maximum quality | No compromise | Full precision, no quantization

Quantization

Reducing model precision to shrink memory and speed up inference. Most models are trained in FP16 or FP32; quantization converts weights to lower precision.
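To make the idea concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is an illustration under simplifying assumptions (one scale for the whole tensor, random toy weights), not how any particular library implements it; production schemes usually use per-channel or per-group scales. The function names are hypothetical.

```python
import random

def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize_int8(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1000)]  # toy "layer"

quantized, scale = quantize_int8(weights)
restored = dequantize_int8(quantized, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

# Rounding to the nearest integer bounds the error by half a quantization step.
print(f"scale={scale:.6f}  max error={max_err:.6f}  bound={scale / 2:.6f}")
```

Each weight now needs 8 bits instead of 16 or 32, at the cost of an error of at most half a quantization step per weight.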

Common Formats

Format | Bits/Weight | Speedup | Quality Impact | Use Case
FP16 | 16 | 1x | None | Baseline
INT8 | 8 | ~2x | Minimal | Production default
INT4 | 4 | ~3-4x | Small but noticeable | Local deployment
NF4 | 4 | ~3-4x | Less loss than INT4 | QLoRA fine-tuning
FP8 | 8 | ~2x | None (newer hardware) | H100/H200 optimized

Rule of thumb: INT8 for production APIs where quality matters. INT4 for local/edge deployment where memory is tight.

Code Example: Loading a Quantized Model (Transformers)

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization config (bitsandbytes; defaults to FP4,
# set bnb_4bit_quant_type="nf4" to use NF4 instead)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # matmuls run in FP16
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quant_config,
    device_map="auto",
)

KV Caching

The key-value (KV) cache stores the attention key and value vectors of previous tokens so they don’t need to be recomputed for each new token. This is the single most impactful optimization for latency.
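The following toy sketch shows why caching is safe: appending each new token's key and value to a cache gives the same attention output as re-projecting the whole prefix at every step. It assumes a single attention head, and the `project` helper is a hypothetical stand-in for the learned key/value projections.

```python
import math

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[i] for w, v in zip(weights, values)) / total
        for i in range(len(values[0]))
    ]

def project(x, factor):
    """Stand-in for a learned key/value projection (hypothetical: just scales x)."""
    return [factor * xi for xi in x]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy token embeddings

def decode_without_cache(prefix):
    """Re-project every earlier token at every step (wasted work)."""
    keys = [project(t, 0.5) for t in prefix]
    values = [project(t, 2.0) for t in prefix]
    return attend(prefix[-1], keys, values)

# With a cache: project only the new token, then append it to the cache.
cached_keys, cached_values = [], []
outputs_match = []
for step, token in enumerate(tokens):
    cached_keys.append(project(token, 0.5))
    cached_values.append(project(token, 2.0))
    with_cache = attend(token, cached_keys, cached_values)
    outputs_match.append(with_cache == decode_without_cache(tokens[: step + 1]))

print(outputs_match)  # each entry compares cached vs. recomputed output
```

Without the cache, step n performs n projections; with it, each step performs one, which is where the per-token speedup comes from.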

How It Works

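  1. First token (prefill): compute attention over the full prompt (slow, e.g. ~100ms)
  2. Subsequent tokens (decode): reuse cached KV vectors (fast, e.g. ~10ms each)
  3. Tradeoff: KV cache grows linearly with sequence length (roughly 2.5MB per token in FP16 for a 70B model with full multi-head attention; about 0.3MB with grouped-query attention, as in Llama 2/3 70B)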
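The per-token growth in step 3 can be estimated from the model shape. The sketch below assumes FP16 storage and a Llama-2-70B-like shape (80 layers, head dimension 128, 64 attention heads, 8 KV heads under grouped-query attention); the function name is hypothetical.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache added per token: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

MIB = 1024 * 1024

# Assumed 70B-class shape: 80 layers, head_dim 128.
full_mha = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=64, head_dim=128)
with_gqa = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)

print(f"multi-head attention:    {full_mha / MIB:.2f} MiB per token")
print(f"grouped-query attention: {with_gqa / MIB:.2f} MiB per token")
print(f"8K-token context (GQA):  {with_gqa * 8192 / MIB / 1024:.2f} GiB")
```

Grouped-query attention shrinks the cache by the ratio of attention heads to KV heads (8x here), which is why the per-token figure depends so much on the architecture.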

Prompt Caching

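Some providers (Anthropic, OpenAI) offer prompt caching: if you send the same prefix (such as a long system prompt) repeatedly, the cached portion is billed at a discount. Anthropic bills cache reads at about 10% of the normal input rate and charges a premium on the initial cache write; OpenAI applies its discount automatically, and the rate varies by model.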

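import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
LONG_SYSTEM_PROMPT = "..."      # placeholder: your long, reusable system prompt

# Anthropic prompt caching: mark the reusable prefix with cache_control
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1000,
    system=[{"type": "text", "text": LONG_SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": "Analyze this document."}],
)
# Cache reads are charged at about 10% of the normal input rate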
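A quick way to see what the discount means in practice is to compute the input cost of one request with and without a cache hit. This is a rough sketch: the $3 per million input tokens price is a hypothetical figure, the 10% cached rate is the one quoted above, and any cache-write premium is ignored.

```python
def input_cost(prompt_tokens: int, cached_tokens: int,
               price_per_mtok: float, cached_rate: float = 0.10) -> float:
    """Input cost in dollars when `cached_tokens` of the prompt hit the cache."""
    uncached_tokens = prompt_tokens - cached_tokens
    billable = uncached_tokens + cached_tokens * cached_rate
    return billable * price_per_mtok / 1_000_000

PRICE = 3.00  # hypothetical: dollars per million input tokens

no_cache = input_cost(10_000, cached_tokens=0, price_per_mtok=PRICE)
with_cache = input_cost(10_000, cached_tokens=9_000, price_per_mtok=PRICE)
saving = 1 - with_cache / no_cache

print(f"no cache:   ${no_cache:.4f}")
print(f"with cache: ${with_cache:.4f}")
print(f"saving:     {saving:.0%}")
```

The saving scales with the share of the prompt that is a stable prefix, so caching pays off most when a long system prompt or document is reused across many short user turns.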

KV Cache Strategy

  • Short conversations (<2K tokens): No special handling needed
  • Long documents (10K+ tokens): Enable prompt caching for repeated prefixes
  • Very long contexts (100K+): Consider sliding window attention or streaming LLMs
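For the very-long-context case, the core of a sliding-window cache is simply a bounded buffer that evicts the oldest entries. The class below is a hypothetical sketch of that bookkeeping only; in practice the model has to be trained or adapted for windowed attention, and streaming approaches also keep a few initial tokens around as attention sinks.

```python
from collections import deque

class SlidingWindowKVCache:
    """Keeps only the most recent `window` key/value pairs."""

    def __init__(self, window: int):
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value) -> None:
        # deque(maxlen=...) silently evicts the oldest entry once full
        self.keys.append(key)
        self.values.append(value)

    def __len__(self) -> int:
        return len(self.keys)

cache = SlidingWindowKVCache(window=3)
for position in range(5):
    cache.append(f"k{position}", f"v{position}")

print(len(cache), list(cache.keys))
```

Memory stays constant no matter how long the conversation runs, at the price of the model no longer attending to tokens that fell out of the window.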

Batching

Processing multiple requests simultaneously improves GPU utilization and throughput.

Static vs Dynamic Batching

Type | How It Works | Best For
Static | Fixed batch size, all requests finish together | Predictable workloads
Dynamic (continuous) | New requests join in-progress batches as others finish | Variable traffic, real-time
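The difference between the two schedulers shows up when output lengths vary. The toy simulation below counts decode steps for each approach; it assumes one token per request per step and ignores prefill and memory limits, and the request lengths are made up for illustration.

```python
def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps

def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Free slots are refilled from the queue as soon as a request finishes."""
    pending = list(lengths)
    active: list[int] = []  # remaining tokens for each in-flight request
    steps = 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))
        steps += 1  # one decode step generates one token per active request
        active = [left - 1 for left in active if left - 1 > 0]
    return steps

# Hypothetical workload: two long generations mixed with short ones.
output_lengths = [100, 10, 10, 10, 100, 10, 10, 10]

static_steps = static_batching_steps(output_lengths, batch_size=4)
continuous_steps = continuous_batching_steps(output_lengths, batch_size=4)
print(f"static: {static_steps} steps, continuous: {continuous_steps} steps")
```

With static batching the short requests sit idle behind the long one in their batch; continuous batching lets the second long request start as soon as the first slot frees up.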

Throughput Impact

Single request: 1 req → 10s → 0.1 req/s
Batch of 8: 8 req → 12s → 0.67 req/s (6.7x throughput)
Batch of 32: 32 req → 18s → 1.78 req/s (17.8x throughput)

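Key insight: Batching increases throughput but also increases latency for individual requests (10s alone versus 18s in a batch of 32 in the example above). Favor large batches for offline processing; for real-time chat, keep batches small or rely on continuous batching so short requests do not wait on long ones.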
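The throughput figures above are simple arithmetic, reproduced here so you can plug in your own measurements. The batch times are the illustrative numbers from this section, not benchmarks.

```python
def throughput(batch_size: int, batch_seconds: float) -> float:
    """Requests completed per second for one batch."""
    return batch_size / batch_seconds

measurements = [(1, 10.0), (8, 12.0), (32, 18.0)]  # (batch size, seconds)
baseline = throughput(*measurements[0])

speedups = {}
for batch_size, seconds in measurements:
    rps = throughput(batch_size, seconds)
    speedups[batch_size] = round(rps / baseline, 1)
    print(f"batch={batch_size:>2}: {rps:.2f} req/s, "
          f"{speedups[batch_size]}x throughput, {seconds:.0f}s latency")
```

Throughput rises almost 18x while per-request latency rises from 10s to 18s, which is the tradeoff in numbers.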


Speculative Decoding

A smaller “draft” model predicts multiple tokens ahead, and the large model verifies them in parallel. When drafts are correct, you get multiple tokens for the cost of one verification step.

When It Works

  • High-acceptance tasks: Code generation, structured output (JSON, XML)
  • Low-acceptance tasks: Creative writing, nuanced reasoning
  • Typical speedup: 1.5x-3x for code, 1.2x-1.5x for general text
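The draft-then-verify loop can be sketched with toy stand-ins for the two models. This is a simplified greedy variant under stated assumptions: both "models" are hypothetical deterministic functions, and verification is written as a loop for clarity, whereas a real system scores all drafted positions in one batched forward pass of the large model.

```python
from typing import Callable, Sequence

NextToken = Callable[[Sequence[str]], str]

def speculative_step(context: list[str], draft: NextToken,
                     target: NextToken, k: int) -> list[str]:
    """One draft-and-verify round; returns the tokens accepted this round."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposal: list[str] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))
    # 2. The target model checks each drafted position in order.
    accepted: list[str] = []
    for token in proposal:
        expected = target(context + accepted)
        if token != expected:
            accepted.append(expected)  # keep the target's token, drop the rest
            return accepted
        accepted.append(token)
    # 3. Every draft was accepted: the same pass yields one bonus token.
    accepted.append(target(context + accepted))
    return accepted

TEXT = "the quick brown fox jumps over the lazy dog".split()

def target_model(context: Sequence[str]) -> str:
    """Toy 'large model' that always continues TEXT."""
    return TEXT[len(context)] if len(context) < len(TEXT) else "<eos>"

def draft_model(context: Sequence[str]) -> str:
    """Toy 'draft model': agrees with the target except at position 3."""
    return "cat" if len(context) == 3 else target_model(context)

output: list[str] = []
rounds = 0
while len(output) < len(TEXT):
    output.extend(speculative_step(output, draft_model, target_model, k=4))
    rounds += 1
output = output[: len(TEXT)]

print(" ".join(output))
print(f"{len(output)} tokens in {rounds} verification rounds")
```

The output is identical to what the target model would produce alone, because every accepted token is checked against it; the gain is that several tokens are confirmed per large-model pass when the draft is right.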

Implementation

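# Using vLLM with speculative decoding
# Note: these keyword arguments match older vLLM releases; newer releases
# take a speculative_config dict instead, so check the docs for your version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B",
    speculative_model="meta-llama/Llama-3.1-8B",  # draft model
    num_speculative_tokens=5,  # tokens drafted per verification step
)
outputs = llm.generate(["Write a JSON schema for a user profile."], SamplingParams(max_tokens=256))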

Cost Optimization Strategy

Technique | Cost Reduction | Implementation Effort
Prompt caching | 50-90% on repeated prefixes | Built into API (Anthropic, OpenAI)
Model routing | 60-80% overall | Requires routing logic
Batching | 5-10x throughput | Requires vLLM or similar
Quantization | 2-4x memory reduction | One-time model conversion
Smaller models for simple tasks | 10-100x cost difference | Task routing

Practical Decision Tree

What's your priority?
├─ Lowest latency → KV cache + small model + no batching
├─ Highest throughput → Large batches + speculative decoding
├─ Lowest cost → Small model + INT4 + prompt caching + routing
└─ Best quality → Large model + FP16 + no quantization

Small Language Models & Edge AI

Not every task needs a 400B-parameter frontier model. Small Language Models (SLMs) are optimized for efficiency, running on consumer hardware, phones, and even browsers.

The SLM Landscape (May 2026)

Model | Parameters | Quality Relative To | Best For
Phi-4 (Microsoft) | 14B | GPT-3.5 class | Reasoning, code, general
Gemma 2/3 (Google) | 2B-9B | Llama 3 8B class | Lightweight, multilingual
Llama 3.2 (Meta) | 1B-11B | GPT-3.5 class | General, on-device
TinyLlama | 1.1B | GPT-2 class | Ultra-lightweight, CPU only
Qwen 2.5 (Alibaba) | 0.5B-72B | Frontier at 72B, efficient at smaller | Multilingual
Mistral 7B | 7B | GPT-3.5 class | Fast, instruction-following
H2O-Danube | 1.8B | GPT-2 class | Simpler tasks

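Key insight: Phi-4 (14B) delivers GPT-3.5-class quality at roughly a tenth of the size (taking GPT-3's published 175B parameters as the reference point, since GPT-3.5's size is undisclosed). The gap between small and large models is shrinking rapidly due to better training data and techniques.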

When SLMs Are Enough

Task | SLM Works? | Frontier Model Better?
Classification, routing, intent detection | ✅ Yes | Marginally
Simple Q&A, summarization | ✅ Yes | Slightly
Code generation (common patterns) | ✅ Yes | For complex logic
Creative writing, nuanced analysis | ⚠️ Sometimes | ✅ Significantly
Multi-step reasoning | ❌ Rarely | ✅ Much better
Multilingual, low-resource | ⚠️ Depends on training | ✅ Usually better

Rule of thumb: If a human could answer in 5 seconds, an SLM is probably sufficient. If it takes a human 30+ seconds of thinking, use a larger model.

On-Device AI

The biggest trend in SLMs is running them directly on phones, laptops, and IoT devices, with no internet connection required.

Apple Intelligence (Apple, 2024-2026):

  • On-device models for summarization, rewriting, image editing
  • Uses a mix of an on-device SLM (3B class) and cloud fallback to Apple's Private Cloud Compute, with optional hand-off to ChatGPT
  • Privacy-focused: sensitive queries stay on-device
  • Available on iPhone 15 Pro and later, and on M-series iPads and Macs

Android AI (Google, 2025-2026):

  • Gemini Nano: 1.8B model running on Pixel and Samsung devices
  • Features: smart reply, summarization, photo editing
  • Powered by Google Tensor chips with dedicated AI accelerators

Browser-based AI:

  • WebLLM / WebGPU: Run SLMs directly in the browser using WebGPU API
  • Transformers.js: Run Hugging Face models in-browser
  • Chrome built-in AI: Gemini Nano available via the experimental Prompt API (exposed as window.ai in early previews)
  • Use cases: privacy-sensitive chatbots, local language translation, accessibility tools

Edge deployment formats:

Format | Platform | Use Case
GGUF | llama.cpp, Ollama, LM Studio | CPU inference, personal computers
CoreML | Apple devices (iOS, macOS) | On-device, Apple Silicon optimized
TFLite | Android, embedded Linux | Mobile phones, Raspberry Pi
ExecuTorch | Meta’s edge runtime | Mobile, wearable, IoT
ONNX Runtime | Cross-platform | Production edge servers
WebGPU | Browser | Zero-install, in-browser AI

The Hybrid Pattern: Routing

The most efficient deployment uses SLMs and frontier models together, with a router that decides which model to use for each query.

User query
    ↓
Router (lightweight classifier)
├─ Simple query → SLM (fast, cheap, on-device)
├─ Complex query → Frontier model (powerful, slower, API)
└─ Sensitive data → SLM (privacy)

Implementation with a simple heuristic router:

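# Example keyword lists (assumptions; tune them for your own traffic)
COMPLEX_KEYWORDS = ("analyze", "compare", "explain why", "step by step")
SENSITIVE_KEYWORDS = ("password", "ssn", "medical")

def route_query(query: str) -> str:
    """Route to the right model based on query complexity."""
    q = query.lower()
    # Privacy first: sensitive queries stay on-device regardless of complexity
    if any(kw in q for kw in SENSITIVE_KEYWORDS):
        return "slm"
    # Simple heuristic: character length and keyword detection
    if len(q) < 50 and not any(kw in q for kw in COMPLEX_KEYWORDS):
        return "slm"  # Phi-4 or Gemma
    return "frontier"  # Claude, GPT-5.5, or Gemini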

Alternatively, use a classifier model:

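import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# Placeholders: "routing-model" (your binary complexity classifier) and the phi4 / claude_* clients
tokenizer = AutoTokenizer.from_pretrained("routing-model")
router_model = AutoModelForSequenceClassification.from_pretrained("routing-model")

def answer(query: str):
    inputs = tokenizer(query, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = router_model(**inputs).logits
    complexity = torch.softmax(logits, dim=-1)[0, 1].item()  # P(label 1 = complex), 0-1 score
    if complexity < 0.3:
        return phi4.generate(query)  # SLM
    elif complexity < 0.7:
        return claude_sonnet.generate(query)  # Mid-tier
    return claude_opus.generate(query)  # Frontier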

Results of a good routing strategy:

  • 70-80% of queries go to the SLM (fast, cheap)
  • 15-25% go to mid-tier (balanced)
  • 5-10% go to frontier (expensive but necessary)
  • Overall cost reduction: 60-80%
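How much routing saves depends on the traffic split and on the price ratio between tiers. The sketch below computes a blended cost per query; the relative costs (1, 10, 30) are hypothetical units chosen only to show the arithmetic, and the traffic shares are taken from the ranges above.

```python
def blended_cost(shares: dict[str, float], cost_per_query: dict[str, float]) -> float:
    """Average cost per query for a given traffic split across tiers."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * cost_per_query[tier] for tier, share in shares.items())

# Hypothetical relative cost per query for each tier (SLM = 1 unit).
relative_cost = {"slm": 1.0, "mid": 10.0, "frontier": 30.0}
traffic = {"slm": 0.75, "mid": 0.20, "frontier": 0.05}

routed = blended_cost(traffic, relative_cost)
vs_mid = 1 - routed / relative_cost["mid"]
vs_frontier = 1 - routed / relative_cost["frontier"]

print(f"blended cost: {routed:.2f} units per query")
print(f"saving vs. sending everything to mid-tier: {vs_mid:.0%}")
print(f"saving vs. sending everything to frontier: {vs_frontier:.0%}")
```

The baseline matters: the same split looks far better against an all-frontier deployment than against an all-mid-tier one, so state the baseline whenever you quote a percentage.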

Quantization for Edge

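Edge deployment depends heavily on quantization. A 7B model in FP16 needs about 14GB for its weights alone (2 bytes per parameter), which is more than current phones can spare. In INT4 (half a byte per parameter), the weights need only about 3.5GB, which is feasible on recent phones.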

Memory requirements by format:

Model | FP16 | INT8 | INT4 | NF4
Phi-4 (14B) | 28GB | 14GB | 7GB | 7GB
Gemma 2 (9B) | 18GB | 9GB | 4.5GB | 4.5GB
Llama 3.1 (8B) | 16GB | 8GB | 4GB | 4GB
TinyLlama (1.1B) | 2.2GB | 1.1GB | 550MB | 550MB
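The table values follow directly from parameter count times bits per weight. A small helper, sketched here with a hypothetical name, reproduces them; it counts weights only, so the KV cache, activations, and quantization metadata (scales and zero points) come on top.

```python
BITS_PER_WEIGHT = {"FP16": 16, "INT8": 8, "INT4": 4, "NF4": 4}

def weight_memory_gb(params_billion: float, fmt: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * BITS_PER_WEIGHT[fmt] / 8

models = {"Phi-4": 14, "Gemma 2": 9, "7B-class model": 7, "TinyLlama": 1.1}

for name, size_b in models.items():
    row = ", ".join(f"{fmt} {weight_memory_gb(size_b, fmt):.2f}GB"
                    for fmt in BITS_PER_WEIGHT)
    print(f"{name} ({size_b}B): {row}")
```

As a rule of thumb, leave headroom beyond these numbers: a model whose weights just fit in memory will not have room for a useful context window.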

Apple Neural Engine: Apple’s ANE can run 7B-class models in INT4 at 30+ tokens/sec on iPhone 17 Pro. This makes real-time on-device chat viable.

Qualcomm AI Engine: Android phones with Snapdragon 8 Elite can run 7B INT4 models at 20+ tokens/sec.

When to Use Each Approach

Scenario | Recommended Setup | Cost | Latency
Privacy-sensitive chat | On-device SLM (INT4) | $0 | 50-200ms
High-volume API | Router → mostly SLM | $0.001/query | 100-500ms
Mobile app | On-device SLM + cloud fallback | $0.001/query | 50ms on-device
Browser extension | WebLLM + Transformers.js | $0 | 200-500ms
IoT / embedded | TinyLlama (GGUF, INT4) | $0 | 500ms+

Production Tools

Tool | Best For | Key Feature
vLLM | High-throughput serving | PagedAttention, continuous batching
TensorRT-LLM | NVIDIA GPU optimization | Kernel fusion, INT4/FP8
Ollama | Local experimentation | One-command setup
llama.cpp | CPU + edge deployment | Extremely efficient CPU inference
TGI (Text Generation Inference) | Hugging Face ecosystem | Token streaming, tensor parallelism

Quick Reference: Optimization by Scenario

Scenario | Recommended Setup | Estimated Cost/Month
On-device AI (phone/browser) | Phi-4 INT4 + WebLLM / CoreML | $0 (on-device)
Personal chatbot (<1K req/day) | Ollama + 7B INT4 model | $0 (local)
Production API (10K req/day) | vLLM + 70B INT8 + prompt caching | $200-500
Batch processing (1M req/day) | TensorRT-LLM + FP8 + continuous batching | $1,000-3,000
Real-time voice (50ms SLA) | Lightweight model + KV cache + no batching | $500-2,000
Hybrid routing (100K req/day) | Router → 80% SLM / 20% frontier | $100-500

See Also