
AI System Design — Interview Prep

Key frameworks for system design interviews targeting ML/LLM Engineer, AI Platform Engineer, and Applied Scientist roles.

Roles covered: AI Engineer · ML Platform Engineer · RAG Engineer · Inference Engineer · MLOps Engineer


1. Designing a RAG System

Core Components

Component | Options | Tradeoff
Embedding model | OpenAI text-embedding-3-large, Cohere, BGE, E5 | Quality vs cost vs latency vs self-host capability
Vector DB | Pinecone (managed), Weaviate (self-host), pgvector, Chroma, Qdrant | Managed vs self-host, filtering support, index type
Chunking strategy | Fixed size, semantic, recursive, sentence-window | Granularity vs context retention vs retrieval quality
Retrieval | Dense (vector), sparse (BM25), hybrid | Precision vs recall vs complexity

Architecture Patterns

User Query → Query Rewriting → Retrieval → Reranking → Generation

  • Query Rewriting: rewrite for multiple intents
  • Retrieval: search both dense + sparse
  • Reranking: cross-encoder re-ranks top results

Simple RAG: Embed query → top-k vector search → concatenate chunks → LLM generation. Fast, minimal infra, but can miss context.

Hybrid RAG: BM25 + vector search → reciprocal rank fusion → rerank → generate. Better recall, more complex, needs both sparse and dense indexes.

Agentic RAG: Query → route to tool → execute search → evaluate results → iterate if needed. Best for complex questions, higher latency, needs orchestration.
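
The reciprocal rank fusion step in Hybrid RAG can be sketched as below. This is a minimal illustration, not a production implementation: the document IDs and the two input rankings are hypothetical, and `k=60` is a commonly used default rather than a value taken from this guide.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["doc_a", "doc_b", "doc_c"]    # hypothetical sparse (BM25) results
vector_hits = ["doc_b", "doc_d", "doc_a"]  # hypothetical dense (vector) results
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Because RRF uses only ranks, it needs no score normalization between BM25 and cosine similarity, which is one reason it is a popular fusion choice before reranking.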

Key Design Decisions

  • Chunk size: 256-512 tokens is typical. Smaller = more precise, larger = more context. Test on your data.
  • Top-k: 3-5 chunks typically. More = more context but more noise and higher token cost.
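  • Reranker: Cross-encoder (Cohere, BGE) typically adds 50-100ms per query and can improve relevance by 10-20%; the exact figures depend on the model and how many candidates are reranked.
  • Caching: Embedding cache (exact match) + LLM response cache (semantic) can reduce average latency by 40-60% on workloads with repeated or near-duplicate queries.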
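
To make the chunk-size decision concrete, here is a minimal sketch of fixed-size chunking with overlap. The `overlap` parameter and the use of a plain word list in place of real tokenizer output are assumptions for illustration; a real pipeline would count tokens with the embedding model's tokenizer.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=32):
    """Split a token list into fixed-size chunks that overlap by `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


# A synthetic word list stands in for real tokenizer output (assumption).
words = [f"w{i}" for i in range(600)]
chunks = chunk_tokens(words, chunk_size=256, overlap=32)
print([len(c) for c in chunks])  # [256, 256, 152]
```

The overlap trades some index size for a lower chance that a relevant sentence is split across a chunk boundary.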

2. Designing a Model Serving Platform

Latency Budget

Component | Budget | Strategy
Network | 50-100ms | Edge deployment, multi-region
Tokenization | 5-10ms | Cached tokenizers
Prefill (prompt) | 200-500ms | KV cache, prompt caching
Decode (per token) | 10-50ms | Continuous batching, speculation
Post-processing | 10-20ms | Streaming, partial parsing
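
As a worked example, the budget rows above can be added up to bound time-to-first-token and total response time. This is a rough sketch under simplifying assumptions: components run strictly in sequence, and the first token appears after one decode step.

```python
# (low, high) budgets in milliseconds, copied from the latency budget table.
BUDGET_MS = {
    "network": (50, 100),
    "tokenization": (5, 10),
    "prefill": (200, 500),
    "decode_per_token": (10, 50),
    "post_processing": (10, 20),
}


def ttft_range_ms():
    """Time to first token: network + tokenization + prefill + one decode step."""
    parts = ["network", "tokenization", "prefill", "decode_per_token"]
    return tuple(sum(BUDGET_MS[p][i] for p in parts) for i in (0, 1))


def latency_range_ms(output_tokens):
    """End-to-end latency for a full (non-streamed) response of `output_tokens` tokens."""
    fixed = ["network", "tokenization", "prefill", "post_processing"]
    return tuple(
        sum(BUDGET_MS[p][i] for p in fixed)
        + output_tokens * BUDGET_MS["decode_per_token"][i]
        for i in (0, 1)
    )


print(ttft_range_ms())        # (265, 660)
print(latency_range_ms(200))  # (2265, 10630)
```

For a 200-token answer, decode dominates the total, which is why streaming is usually the first lever for perceived latency.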

Serving Strategies

Strategy | Latency | Throughput | Cost | Complexity
Synchronous | Low | Low | High | Low
Batch | High | High | Low | Medium
Streaming | First-token low | Medium | Medium | Medium
Speculative decoding | Low | High | Medium | High

Scaling Considerations

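  • GPU memory: A 70B model needs ≈ 140GB for weights alone at FP16 (2 bytes per parameter). That fits on 2x A100-80GB but leaves little room for KV cache, so plan for more GPUs or quantization (FP8 ≈ 70GB, INT4 ≈ 35GB).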
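  • Throughput: vLLM with continuous batching can achieve roughly an order of magnitude higher throughput than a naive Hugging Face Transformers deployment (vLLM's own benchmarks report up to 24x); the gain depends on model and workload.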
  • Cold start: Loading a 70B model takes 30-120 seconds. Keep instances warm, or offload to a hosted inference provider (Groq, Replicate) that manages model loading for you.
  • Multi-model routing: Route simple queries to small/cheap models, complex to large models. Save 60-80% on compute.
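
The GPU memory arithmetic above is just parameters × bytes per parameter. A small sketch follows, counting weights only; the `headroom` fraction reserved for KV cache and activations is an assumed value, not a figure from this guide.

```python
import math

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, dtype):
    """Memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def min_gpus(num_params, dtype, gpu_gb=80, headroom=0.1):
    """GPUs needed if `headroom` of each GPU is kept free for KV cache and activations.

    headroom=0.1 is an illustrative assumption; long contexts or large
    batches need considerably more.
    """
    usable_gb = gpu_gb * (1 - headroom)
    return math.ceil(weight_memory_gb(num_params, dtype) / usable_gb)


for dtype in ("fp16", "fp8", "int4"):
    print(dtype, weight_memory_gb(70e9, dtype), "GB,", min_gpus(70e9, dtype), "x 80GB GPU")
```

Raising `headroom` to 0.2 pushes the FP16 case from two GPUs to three, which shows how tight the 2x A100-80GB configuration is.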

3. Designing a Feature Store

Architecture

Online API <──> Online Store (Redis/DynamoDB)
                      ↕ sync
Feature Pipeline → Offline Store (S3/Parquet) → Training Pipeline

Store | Purpose | Low-Latency | High-Throughput | Consistency
Online | Real-time serving | ✅ Yes | ❌ No | Eventual
Offline | Training data | ❌ No | ✅ Yes | Strong
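
The online/offline split and the sync step can be illustrated with a toy in-memory store. This is only a sketch: a Python list stands in for the offline store (S3/Parquet), a dict stands in for the online store (Redis/DynamoDB), and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFeatureStore:
    offline: list = field(default_factory=list)  # append-only history of rows
    online: dict = field(default_factory=dict)   # latest features per entity

    def write(self, entity_id, features, ts):
        """The feature pipeline writes to the offline store first."""
        self.offline.append((ts, entity_id, dict(features)))

    def sync(self):
        """Materialize the most recent row per entity into the online store."""
        for ts, entity_id, features in sorted(self.offline, key=lambda r: r[0]):
            self.online[entity_id] = features

    def get_online(self, entity_id):
        """Low-latency lookup used at serving time."""
        return self.online.get(entity_id)

    def training_rows(self, as_of):
        """Historical rows up to a timestamp, used to build training sets."""
        return [row for row in self.offline if row[0] <= as_of]


store = ToyFeatureStore()
store.write("user_1", {"clicks_7d": 3}, ts=1)
store.sync()
store.write("user_1", {"clicks_7d": 5}, ts=2)
print(store.get_online("user_1"))  # still {'clicks_7d': 3} until the next sync
store.sync()
print(store.get_online("user_1"))  # {'clicks_7d': 5}
```

The stale read between the second write and the second sync is the eventual consistency listed for the online store in the table.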

4. Designing an A/B Testing Platform

Metrics Framework

Metric type | Examples | What It Measures
Offline | Accuracy, F1, BLEU | Model quality in isolation
Online | CTR, engagement, retention | User-level impact
Guardrail | Latency p99, error rate, toxicity | System health
Counterfactual | What-if analysis | Causal impact

Experiment Design

  1. Hypothesis: Model A improves response quality by 10% over Model B
  2. Randomization: User-level, session-level, or request-level
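  3. Power analysis: Minimum sample size per group n = 2 * (Z_α/2 + Z_β)² * σ² / δ², where σ² is the metric variance, δ is the minimum detectable effect, α is the significance level, and 1 - β is the power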
  4. Duration: Minimum 1 full business cycle (1-2 weeks) to capture weekly patterns
  5. Analysis: Two-sample t-test, delta method for ratios, Benjamini-Hochberg for multiple comparisons
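
The power-analysis formula in step 3 can be evaluated with the standard library alone. A minimal sketch, assuming a two-sided test, equal group sizes, and a normally distributed metric; the defaults `alpha=0.05` and `power=0.8` are conventional choices, not values from this guide.

```python
import math
from statistics import NormalDist


def sample_size_per_group(sigma, delta, alpha=0.05, power=0.8):
    """n = 2 * (z_{alpha/2} + z_beta)^2 * sigma^2 / delta^2, rounded up.

    sigma: standard deviation of the metric
    delta: minimum detectable difference between the two group means
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.8
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)


# Detect a shift of 0.1 standard deviations.
print(sample_size_per_group(sigma=1.0, delta=0.1))  # 1570
```

Since n scales with 1/δ², halving the detectable effect roughly quadruples the required sample, which often drives the experiment duration in step 4.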

5. Tradeoff Decision Framework

Build vs Buy

Factor | Build | Buy (API)
Time to market | Months | Days
Control | Full | Limited to API
Cost at low volume | High infra | Pay-per-token
Cost at high volume | Amortized | Scales linearly
Customization | Unlimited | Prompting only
Maintenance | Your team | Provider handles
Privacy | Full control | Data leaves infra

Decision rule: Build when >50M tokens/month OR custom fine-tuning is needed OR data cannot leave the premises.
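
Written as code, the decision rule is a simple OR over three conditions. The function and parameter names below are hypothetical, and the threshold is the rule of thumb stated above rather than a universal constant.

```python
def should_build(tokens_per_month, needs_custom_finetuning, data_must_stay_on_premises,
                 volume_threshold=50_000_000):
    """Return True if any single build criterion from the decision rule holds."""
    return (
        tokens_per_month > volume_threshold
        or needs_custom_finetuning
        or data_must_stay_on_premises
    )


print(should_build(5_000_000, False, False))   # False: low volume, no constraints
print(should_build(5_000_000, False, True))    # True: privacy constraint alone decides
print(should_build(80_000_000, False, False))  # True: volume above threshold
```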

Cost Per Query (May 2026)

Simple classification: GPT-5.5 Instant ~$0.00001/query
RAG with GPT-5.5: ~$0.001-0.005/query
Agentic workflow: ~$0.01-0.05/query
Fine-tuned small model (self-host): ~$0.0001/query (amortized)

Quick Reference

  • Throughput vs latency: Continuous batching maximizes throughput; speculative decoding reduces latency
  • Memory bandwidth bound: LLM inference is typically memory-bandwidth bound, not compute bound
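  • KV cache: Linear in batch size × sequence length. Naive contiguous allocation can waste 60-80% of KV cache memory to fragmentation and over-reservation; PagedAttention cuts that waste to a few percent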
  • Prompt caching: Cache prefix KV for common system prompts → 50-80% reduction in prefill time
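
The linear growth of the KV cache in batch size and sequence length can be checked with a short calculation. A sketch under stated assumptions: the model shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128) is a Llama-2-70B-style configuration chosen for illustration, and values are stored in FP16.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2):
    """Total KV cache size in bytes.

    The leading factor of 2 covers the separate key and value tensors
    stored in every layer.
    """
    return (2 * num_layers * num_kv_heads * head_dim * bytes_per_value
            * batch_size * seq_len)


# Assumed Llama-2-70B-style shape (illustrative, not from this guide).
shape = dict(num_layers=80, num_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(batch_size=1, seq_len=1, **shape)
total_gib = kv_cache_bytes(batch_size=16, seq_len=4096, **shape) / 2**30

print(per_token)  # 327680 bytes, i.e. 320 KiB per token
print(total_gib)  # 20.0 GiB for 16 sequences of 4096 tokens
```

Doubling either the batch size or the sequence length doubles the cache, so KV cache, not weights, often sets the practical limit on concurrent requests.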