How LLMs Work: Technical Deep Dive
Understanding the mechanics of Large Language Models - from tokens to transformers to training.
Part 1: Tokens and Tokenization
The building blocks of everything an LLM does:
What Are Tokens?
LLMs don’t understand words. They understand numbers. A token is a number representing a piece of text.
Examples:
- “Hello” → token 15234
- “,” → token 89
- “ world” → token 62
In English text, one token averages roughly 4 characters, or about 0.75 words.
Why tokens?
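- Efficiency: Neural networks operate on numbers, so text has to be converted to numeric IDs first
- Consistency: For a given tokenizer, the same text always maps to the same token sequence
- Compression: Subword tokens cover several characters each, so sequences are much shorter than character-level input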
The Tokenizer
A tokenizer is a lookup table (or learned algorithm) that converts text to tokens.
Text: "Hello, world"
↓ (tokenizer)
Tokens: [15234, 89, 62]
↓ (reverse lookup)
Text: "Hello, world"
Common tokenizers:
- BPE (Byte Pair Encoding) - used by the GPT family
- SentencePiece - a tokenization library (BPE or unigram) used by many open models
- WordPiece - used by Google’s BERT
Key insight: Different models use different tokenizers, so the same text has different token counts in different models.
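The text-to-IDs round trip can be sketched with a toy lookup table. This is a minimal illustration, not how production tokenizers work: the vocabulary below is hypothetical (it reuses the illustrative IDs from the examples above), and it uses greedy longest-match lookup, whereas real BPE tokenizers apply a learned sequence of merges.

```python
# Hypothetical three-entry vocabulary reusing the illustrative IDs from the text.
VOCAB = {"Hello": 15234, ",": 89, " world": 62}
ID_TO_TEXT = {token_id: text for text, token_id in VOCAB.items()}


def encode(text):
    """Greedy longest-match lookup: repeatedly take the longest known piece."""
    ids = []
    pos = 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            # No vocabulary entry starts at this position.
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids


def decode(ids):
    """Reverse lookup: map each ID back to its text and concatenate."""
    return "".join(ID_TO_TEXT[i] for i in ids)


tokens = encode("Hello, world")
print(tokens)          # [15234, 89, 62]
print(decode(tokens))  # Hello, world
```

Swapping in a different `VOCAB` changes both the IDs and the number of tokens for the same text, which is why token counts differ between models.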
Part 2: Embeddings
How words become numbers the model can process:
What Is an Embedding?
An embedding is a vector (list of numbers) representing the meaning of text.
Example:
- “king” embedding: [0.2, -0.1, 0.8, 0.3, …, 0.1]
- “queen” embedding: [0.25, -0.08, 0.75, 0.35, …, 0.12]
- “man” embedding: [0.1, 0.2, 0.6, 0.1, …, 0.0]
Famous property (word2vec):
king - man + woman ≈ queen
Embeddings capture semantic meaning. Similar words have similar embeddings.
How Embeddings Are Made
LLMs have an embedding layer - a lookup table that converts tokens to vectors:
Token: 15234
↓
Embedding Layer (lookup table)
↓
Vector: [0.2, -0.1, 0.8, ..., 0.1] (768 dimensions for smaller models)
Sizes vary:
- Small models: 768 dimensions
- Medium models: 1024-2048 dimensions
- Large models: 4096+ dimensions
Key insight: Embeddings are learned during training. The model learns to create embeddings where similar words are close together.
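To make "close together" concrete, here is a small sketch using cosine similarity. The vectors are hand-made and three-dimensional purely for illustration (real embeddings are learned and far larger), and the axis meanings in the comment are an assumption chosen so that the analogy works out exactly.

```python
import math

# Hand-made toy vectors, not learned embeddings.
# Assumed axes: [royalty, male, female]
EMBEDDINGS = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "child": [0.0, 0.5, 0.5],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(vector, exclude=()):
    """Return the vocabulary word whose embedding is most similar to `vector`."""
    candidates = {w: v for w, v in EMBEDDINGS.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine_similarity(vector, candidates[w]))


# king - man + woman, computed element by element.
analogy = [
    k - m + w
    for k, m, w in zip(EMBEDDINGS["king"], EMBEDDINGS["man"], EMBEDDINGS["woman"])
]
print(nearest(analogy, exclude=("king", "man", "woman")))  # queen
```

Excluding the input words from the search is the usual convention for analogy queries; without it, the nearest neighbor is often one of the inputs.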
Part 3: Positional Encoding
Transformers process all input tokens in parallel rather than one at a time, so the architecture has no built-in notion of order. But order matters: “dog bites man” ≠ “man bites dog”.
Solution: Positional encoding
Each token also gets a position vector:
Token 1: "The" → [0.2, -0.1, 0.8] + position_1
Token 2: "dog" → [0.5, 0.3, 0.1] + position_2
Token 3: "bites" → [0.1, 0.9, -0.2] + position_3
The model learns that position matters and uses it to understand sequence.
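The text does not say how the position vectors are built. One well-known option is the fixed sinusoidal scheme from the original transformer paper, sketched below; treating it as the scheme in use here is an assumption, since many models instead learn position vectors or use rotary encodings. The function names are hypothetical.

```python
import math


def sinusoidal_position(pos, dim):
    """Position vector with sin on even indices and cos on odd indices."""
    vec = []
    for i in range(dim):
        # Each sin/cos pair shares one frequency, set by the even index of the pair.
        angle = pos / (10000 ** ((i - i % 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def add_position(token_vector, pos):
    """Add the position vector to the token embedding, element by element."""
    pe = sinusoidal_position(pos, len(token_vector))
    return [t + p for t, p in zip(token_vector, pe)]


sentence = [("The", [0.2, -0.1, 0.8]), ("dog", [0.5, 0.3, 0.1]), ("bites", [0.1, 0.9, -0.2])]
for pos, (word, vector) in enumerate(sentence, start=1):
    print(word, [round(x, 3) for x in add_position(vector, pos)])
```

Because the position vector differs at every index, the same word embedding produces a different input vector depending on where the word appears.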
Part 4: The Attention Mechanism
This is the secret sauce. Attention is what makes transformers work.
The Problem Attention Solves
Without attention: The model treats all tokens equally. “The dog bites man” - all tokens are equally relevant to predicting the next word.
With attention: The model focuses on relevant tokens. When predicting the next word after “bites”, it focuses on “dog” and “man”, ignores “The”.
How Attention Works (Simplified)
For each token, compute:
- Query (Q): What am I looking for?
- Key (K): What information do I have?
- Value (V): What’s the information?
Query for "bites": "I need info about the subject/object"
↓
Score each other token's Key:
"The" → low score (not important)
"dog" → high score (subject!)
"man" → medium score (object!)
↓
Use scores to weight Values:
dog's value (40%) + man's value (35%) + The's value (25%)
↓
Result: Weighted information = "attention output"
Why It’s Powerful
Attention lets the model:
- Focus on relevant context
- Understand long-range dependencies
- Handle ambiguity (“bank” - financial or river?)
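The query/key/value steps described earlier in this part can be sketched as scaled dot-product attention for a single query. The vectors below are made up for illustration; in a real model the queries, keys, and values come from learned projections of the token embeddings, which this sketch leaves out.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query over a list of keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


# Hypothetical 2-dimensional vectors for "The", "dog", "man".
keys = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.5]]
values = [[0.1, 0.0], [0.9, 0.2], [0.3, 0.8]]
query_for_bites = [1.0, 1.0]

weights, output = attend(query_for_bites, keys, values)
for word, weight in zip(["The", "dog", "man"], weights):
    print(f"{word}: {weight:.2f}")
```

With these toy vectors "dog" receives the largest weight, "man" the next, and "The" the smallest, and the output is the correspondingly weighted mix of the value vectors.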
Part 5: The Transformer Architecture
A transformer is built from attention blocks stacked on top of each other.
One Transformer Block
Input tokens (with embeddings)
↓
Multi-head Attention (8-16 attention heads in parallel)
↓
Feed-forward Network (neural network)
↓
Output (fed to next block or to prediction layer)
Why multiple attention heads? Each head can focus on different types of relationships:
- Head 1: “focus on grammar”
- Head 2: “focus on meaning”
- Head 3: “focus on long-range dependencies”
Stacking Blocks
A modern LLM has 12-96 transformer blocks stacked:
Input
↓
Block 1 (extract basic patterns)
↓
Block 2 (extract medium-level patterns)
↓
...
↓
Block 96 (extract complex semantic understanding)
↓
Output prediction
Intuition: Earlier blocks extract simple patterns (grammar, word relationships). Later blocks extract complex patterns (reasoning, meaning).
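The data flow through a stack of blocks can be sketched as a loop. This is only a structural illustration with stand-in sub-layers: `uniform_attention` assumes every token attends equally to all tokens, `feed_forward` is a bare ReLU, and the residual additions are an assumption not mentioned in the text (real blocks also use layer normalization and have their own learned weights per block).

```python
def uniform_attention(tokens):
    """Stand-in for attention: every token attends equally to all tokens."""
    count = len(tokens)
    dim = len(tokens[0])
    mean = [sum(tok[d] for tok in tokens) / count for d in range(dim)]
    return [mean[:] for _ in tokens]


def feed_forward(tokens):
    """Stand-in for the feed-forward network: a ReLU applied per token."""
    return [[max(0.0, x) for x in tok] for tok in tokens]


def add(a, b):
    """Element-wise sum of two lists of token vectors."""
    return [[x + y for x, y in zip(ta, tb)] for ta, tb in zip(a, b)]


def transformer_block(tokens):
    tokens = add(tokens, uniform_attention(tokens))  # attention sub-layer + residual
    tokens = add(tokens, feed_forward(tokens))       # feed-forward sub-layer + residual
    return tokens


def run_stack(tokens, num_blocks):
    """Feed the output of each block into the next one."""
    for _ in range(num_blocks):
        tokens = transformer_block(tokens)
    return tokens


hidden = run_stack([[0.2, -0.1, 0.8], [0.5, 0.3, 0.1], [0.1, 0.9, -0.2]], num_blocks=12)
print(len(hidden), len(hidden[0]))  # 3 3
```

The shape stays the same from block to block (one vector per token), which is what allows blocks to be stacked to any depth.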
Part 6: Training vs Inference
How models learn versus how they’re used:
Training (Expensive, One-Time)
Goal: Learn the embeddings, attention weights, and parameters
Process:
- Start with random parameters
- Feed in text (billions of tokens)
- Predict next token (language modeling task)
- Compare prediction to actual
- Adjust parameters slightly to reduce error
- Repeat millions of times
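The six steps above can be sketched with a deliberately tiny model: a bigram table that predicts the next word from the previous word only. Everything here (the corpus, the learning rate, the table-of-logits "model") is a hypothetical stand-in; a real LLM runs the same loop with a transformer and billions of tokens.

```python
import math
import random

random.seed(0)
text = "the dog bites the man and the man bites the dog"
words = text.split()
vocab = sorted(set(words))
index = {w: i for i, w in enumerate(vocab)}
pairs = [(index[a], index[b]) for a, b in zip(words, words[1:])]

# Step 1: start with random parameters (one row of logits per previous token).
size = len(vocab)
logits = [[random.uniform(-0.1, 0.1) for _ in range(size)] for _ in range(size)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def average_loss():
    """Mean cross-entropy of the true next token over the corpus."""
    return sum(-math.log(softmax(logits[prev])[nxt]) for prev, nxt in pairs) / len(pairs)


loss_before = average_loss()
learning_rate = 0.5
for _ in range(200):                      # Step 6: repeat
    for prev, nxt in pairs:               # Step 2: feed in text
        probs = softmax(logits[prev])     # Step 3: predict next token
        for j in range(size):             # Steps 4-5: compare, then adjust
            target = 1.0 if j == nxt else 0.0
            # Gradient of cross-entropy with respect to a logit is (prob - target).
            logits[prev][j] -= learning_rate * (probs[j] - target)
loss_after = average_loss()
print(loss_before > loss_after)  # True
```

The loss cannot reach zero on this corpus because "the" is followed by both "dog" and "man"; the best the model can do is learn that each is about equally likely.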
Cost:
- GPT-4: reportedly more than $100M in compute
- Claude Opus: not publicly disclosed; presumably a similar scale
- Llama 70B: roughly $5-10M (estimate)
Time: Weeks to months on clusters of GPUs/TPUs
Inference (Cheap, Happens Constantly)
Goal: Use the trained model to predict text
Process:
- Load trained parameters (frozen, don’t change)
- For each token, predict the next one
- Pick (sample) next token
- Repeat until done
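The inference steps above amount to a short loop. In this sketch the "model" is a hypothetical frozen lookup table keyed only by the last token, and the loop always picks the most likely token; a real model conditions on the whole context and usually samples instead.

```python
# Hypothetical frozen "model": next-token probabilities keyed by the last token.
FROZEN_PARAMS = {
    "The": {"capital": 0.9, "city": 0.1},
    "capital": {"of": 1.0},
    "of": {"France": 0.8, "Spain": 0.2},
    "France": {"is": 1.0},
    "is": {"Paris": 0.5, "Lyon": 0.2, "Marseille": 0.1, "<end>": 0.2},
    "Paris": {"<end>": 1.0},
}


def predict_next(tokens):
    """Look up the next-token distribution; the parameters are never modified."""
    return FROZEN_PARAMS[tokens[-1]]


def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = predict_next(tokens)
        next_token = max(probs, key=probs.get)  # pick the most likely token
        if next_token == "<end>":
            break
        tokens.append(next_token)               # feed it back in and repeat
    return tokens


print(generate(["The", "capital", "of", "France", "is"]))
# ['The', 'capital', 'of', 'France', 'is', 'Paris']
```

Each generated token is appended to the input before the next prediction, which is why output is produced one token at a time even though the prompt itself can be processed in parallel.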
Cost: typically a fraction of a cent to a few cents per thousand tokens, depending on the model and provider
Time: Milliseconds to seconds
Key insight: Training is expensive per run. Inference is cheap per request. That’s why, for simple tasks, you prompt an existing model rather than pay for additional training such as fine-tuning.
Part 7: Generation (How Models Produce Output)
How the model turns probabilities into text:
Greedy Decoding
Always pick the most likely next token:
Prompt: "The capital of France is"
Model thinks: 50% Paris, 20% Lyon, 10% Marseille
Result: Always pick "Paris" (greedy)
Output: "The capital of France is Paris"
Pros: Deterministic, fast
Cons: Boring, repetitive
Temperature Sampling
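Rescale the probabilities with a temperature, then sample randomly. Temperatures below 1 sharpen the distribution (the top choice gains share); temperatures above 1 flatten it.
Temperature = 0.7 (common):
Original: 50% Paris, 20% Lyon, 10% Marseille
After (renormalized over these three candidates): ~73% Paris, ~20% Lyon, ~7% Marseille
Sample: Randomly pick (usually Paris, sometimes Lyon)
Temperature = 0.1 (nearly deterministic): Probabilities become more extreme → almost always pick the top choice
Temperature = 2.0 (creative): Probabilities flatten → more random choices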
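The effect of temperature can be checked numerically. Dividing the logits by T before the softmax is equivalent to raising each probability to the power 1/T and renormalizing, which is what this sketch does. It assumes only the three named candidates exist, so the distribution is renormalized over them; the function names are hypothetical.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale a distribution: p_i ** (1 / T), then renormalize."""
    scaled = {tok: math.exp(math.log(p) / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: s / total for tok, s in scaled.items()}


def sample(probs, rng):
    """Draw one token at random according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# The three candidates from the example; assumed to be the only options.
candidates = {"Paris": 0.5, "Lyon": 0.2, "Marseille": 0.1}
for t in (0.1, 0.7, 1.0, 2.0):
    adjusted = apply_temperature(candidates, t)
    print(t, {tok: round(p, 2) for tok, p in adjusted.items()})

print(sample(apply_temperature(candidates, 0.7), random.Random(0)))
```

At T = 1.0 the output is just the original distribution renormalized; lowering T moves probability toward "Paris" and raising T moves it toward the other options. Greedy decoding is the limit as T approaches 0.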
Part 8: Scaling Laws
There’s a predictable relationship between model size, data, compute, and performance. Understanding this is essential for making informed decisions about model selection and architecture.
The Core Finding (Kaplan et al., 2020)
Scaling laws describe how model performance improves as you increase three resources:
Performance improvement ≈ f(
Model parameters (size),
Training tokens (data),
Compute budget (FLOPs)
)
Empirical relationship (Kaplan), using the reported parameter exponent of about 0.076 with data and compute not limiting:
- 2x parameters → ~5% lower loss
- 10x parameters → ~16% lower loss
- 100x parameters → ~30% lower loss
Key insight: Performance scales as a power-law with each resource. There are diminishing returns, but no “plateau” was observed at GPT-3 scale.
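A power law makes the diminishing returns easy to compute. The sketch below assumes the form L(N) ∝ N^(-α) with the parameter exponent α ≈ 0.076 reported by Kaplan et al., and assumes data and compute are not the bottleneck; the function name is hypothetical.

```python
ALPHA_N = 0.076  # parameter-count exponent reported by Kaplan et al. (2020)


def loss_ratio(param_multiplier, alpha=ALPHA_N):
    """Loss after scaling parameters by `param_multiplier`, relative to before."""
    return param_multiplier ** (-alpha)


for multiplier in (2, 10, 100):
    reduction = 1 - loss_ratio(multiplier)
    print(f"{multiplier}x parameters -> {reduction:.0%} lower loss")
# 2x parameters -> 5% lower loss
# 10x parameters -> 16% lower loss
# 100x parameters -> 30% lower loss
```

Each further 10x buys the same multiplicative improvement (about 16%), so the absolute gain keeps shrinking while never quite reaching zero.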
The Three Axes
| Axis | What it costs | What it buys |
|---|---|---|
| Parameters | Memory (VRAM), inference latency | Model capacity, knowledge retention |
| Training tokens | Data collection, compute for processing | Broad knowledge, language fluency |
| Compute budget | GPU/TPU hours, dollars | Overall capability |
The art of model design is deciding how to allocate your compute budget across parameters and data.
Chinchilla Optimal Compute (2022)
DeepMind showed that most models were undertrained — they had too many parameters for the amount of data they were trained on.
Chinchilla’s finding: For a compute-optimal model, the ratio should be:
- 20 tokens of training data per parameter
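Optimal model size ≈ sqrt(Compute budget / (6 × 20))
Training tokens = 20 × parameters
(This follows from the common approximation Compute ≈ 6 × parameters × training tokens.)
Example: If you have about 4.3 × 10^21 FLOPs of compute (the budget implied by the figures below):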
- Optimal model size: ~6B parameters
- Optimal training tokens: ~120B tokens
Why this matters: Many models (including GPT-3) were significantly over-parameterized. You could get the same performance from a smaller model trained on more data, for less cost.
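The allocation can be worked out from two rules of thumb: training compute C ≈ 6 × N × D (N parameters, D tokens) and D = 20 × N. Substituting gives C ≈ 120 × N², so N ≈ sqrt(C / 120). The sketch below applies that; the constant names and the example budget are assumptions chosen to reproduce the 6B / 120B figures above.

```python
import math

FLOPS_PER_PARAM_PER_TOKEN = 6   # common approximation: C ≈ 6 * N * D
TOKENS_PER_PARAM = 20           # Chinchilla rule of thumb from the text


def compute_optimal(compute_flops):
    """Return (parameters, training_tokens) for a given training FLOPs budget."""
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_PER_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


params, tokens = compute_optimal(4.32e21)
print(f"{params:.2e} parameters, {tokens:.2e} tokens")
# 6.00e+09 parameters, 1.20e+11 tokens
```

Because of the square root, a 100x larger compute budget calls for only a 10x larger model, trained on 10x more tokens.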
Current State (May 2026)
Scaling has evolved significantly since the original papers:
What’s changed:
- Data quality > data quantity: The best models now prioritize curated, high-quality data over raw scale. Chinchilla assumed all tokens are equal — they’re not.
- MoE decouples capacity from compute: Mixture of Experts models can have hundreds of billions of parameters but only use a fraction per token (see Part 10).
- Test-time compute is a new axis: Models like o3 and DeepSeek R1 spend more compute at inference time (thinking) rather than just at training time.
The diminishing returns debate:
- Pro-scaling: Larger models continue to improve on hard benchmarks (GPQA, SWE-bench)
- Anti-scaling: Marginal gains shrink; better data, architecture, and inference techniques offer higher ROI
- Consensus: Scaling still works, but the frontier has moved from “scale blindly” to “scale intelligently” (better data, better architecture, better training methods)
Practical implications:
- Frontier models (Claude Opus, GPT-5.5) are believed to be in the 1T+ parameter range (with MoE), though exact parameter counts are not published
- Small models (Phi-4, Gemma) achieve GPT-3.5-class performance at a small fraction of the size through better data and training
- The cost of training frontier models has risen to the order of $1B per run
- Most teams should not train models — fine-tune existing ones instead
Part 9: Context Window
Models can only see a limited context:
Claude: 200K tokens (≈150K words)
GPT-4o: 128K tokens (≈96K words)
Gemini 3.1: 1M tokens (≈750K words)
Why the Limit?
Self-attention compares every token with every other token, so its cost grows as O(n²):
- 1000 tokens → 1M pairwise operations
- 100K tokens → 10B pairwise operations (expensive!)
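The quadratic growth is simple arithmetic. This sketch counts query-key pairs for full self-attention, per head and per layer; it ignores optimizations such as causal masking or sparse attention, and the function name is hypothetical.

```python
def attention_pairs(num_tokens):
    """Number of query-key pairs scored by full self-attention (per head, per layer)."""
    return num_tokens * num_tokens


for n in (1_000, 100_000, 200_000):
    print(f"{n:>7} tokens -> {attention_pairs(n):,} pairs")
```

Doubling the context from 100K to 200K tokens quadruples the pair count, which is why long context windows are disproportionately expensive.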
How It Matters
With 200K context, you can:
- Read a 150-page document
- Have a 100-turn conversation
- But not both simultaneously
Part 10: Mixture of Experts (MoE) Architecture
Many frontier models no longer use a single dense neural network. Instead, they use a Mixture of Experts (MoE) architecture that decouples total parameter count from per-token compute cost.
The Problem MoE Solves
A dense model uses all its parameters for every token. A 1T parameter dense model would be prohibitively expensive to run. MoE solves this by activating only a subset of parameters per token.
Dense model (inefficient at scale): Every token → All parameters → Output
MoE model (efficient at scale): Every token → Router → Top 2 of 16 experts → Output
How MoE Works
1. Experts: Feed-forward networks, each of which tends to specialize in different types of processing. A model might have anywhere from 8 to 256 experts.
2. Router (Gating Network): A small neural network that decides which experts to use for each token. It outputs a probability distribution over experts.
3. Top-K Routing: Only the top K experts (typically K=2) are activated per token. The others are skipped entirely.
Token: "bites"
↓
Router scores: Expert 7 (48%), Expert 12 (31%), Expert 3 (6%), ...
↓
Activate: Expert 7 + Expert 12 (weighted by router scores)
↓
Output: Combined expert output → next layer
Why K=2? Two experts provide enough specialization without the overhead of coordinating more. Some models use K=1 (cheapest) or K=4 (more capacity).
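Top-K routing can be sketched in a few lines. The router logits here are passed in directly (in a real model they come from a small learned layer applied to the token), the experts are toy functions, and renormalizing the weights over the selected experts is one common convention rather than the only one.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, k=2):
    """Return [(expert_index, weight)] for the top-k experts, weights renormalized."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]


def moe_layer(token, router_logits, experts, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    output = [0.0] * len(token)
    for index, weight in route(router_logits, k):
        expert_out = experts[index](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output


# Toy experts: each just scales the token vector by a different factor.
experts = [lambda tok, f=f: [f * x for x in tok] for f in (1.0, 2.0, 3.0, 4.0)]
selected = route([0.1, 2.0, -1.0, 1.5], k=2)
print([i for i, _ in selected])  # [1, 3]
print(moe_layer([1.0, -1.0], [0.1, 2.0, -1.0, 1.5], experts))
```

Experts 0 and 2 are never called for this token, which is where the compute saving comes from.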
Expert Specialization
Experts naturally specialize during training without explicit supervision:
- Some experts learn grammar patterns
- Some experts learn factual knowledge
- Some experts learn coding syntax
- Some experts learn multi-step reasoning
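This emerges from training; no one assigns roles to experts. The router learns to route tokens to suitable experts. Published routing analyses (for example, of Mixtral) suggest that in practice the specialization is often by token type and syntax rather than by clean topic-level roles like those listed above.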
Load Balancing
A critical challenge: the router might route all tokens to the same few experts, leaving others unused. Solutions:
Load balancing loss: A penalty term added during training that encourages the router to distribute tokens evenly across experts.
Expert capacity: A hard limit on how many tokens each expert can process. Tokens routed to a full expert are passed through (not processed) or rerouted.
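The text does not give formulas for these two mechanisms. One published formulation is the auxiliary loss and capacity rule from the Switch Transformer paper, sketched below as an assumption: the loss is the number of experts times the sum over experts of (fraction of tokens routed to the expert) × (mean router probability for the expert). The default `capacity_factor` is illustrative.

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    """Auxiliary loss: num_experts * sum_i(f_i * P_i).

    f_i = fraction of tokens whose chosen expert is i
    P_i = mean router probability given to expert i
    """
    num_tokens = len(assignments)
    loss = 0.0
    for expert in range(num_experts):
        fraction = sum(1 for a in assignments if a == expert) / num_tokens
        mean_prob = sum(p[expert] for p in router_probs) / num_tokens
        loss += fraction * mean_prob
    return num_experts * loss


def expert_capacity(num_tokens, num_experts, capacity_factor=1.25):
    """Hard limit on tokens per expert: the even share, padded by a factor."""
    return int(num_tokens / num_experts * capacity_factor)


balanced = load_balancing_loss(
    [[0.25, 0.25, 0.25, 0.25]] * 4, assignments=[0, 1, 2, 3], num_experts=4
)
collapsed = load_balancing_loss(
    [[0.97, 0.01, 0.01, 0.01]] * 4, assignments=[0, 0, 0, 0], num_experts=4
)
print(balanced, collapsed)
print(expert_capacity(num_tokens=64, num_experts=8))  # 10
```

The loss is at its minimum of 1.0 when routing is perfectly even and grows toward the number of experts as routing collapses onto one expert, so adding it (scaled by a small coefficient) to the training loss pushes the router toward balance.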
MoE vs Dense: Tradeoffs
| Aspect | Dense | MoE |
|---|---|---|
| Total parameters | N | 4N to 10N (experts + shared) |
| Active parameters per token | N | ~N (only K experts plus shared layers) |
| Training cost | Baseline | Higher in practice (routing, load balancing, and cross-device communication overhead) |
| Inference cost | Baseline | Similar to dense (same active params) |
| Quality at same active params | Worse | Better (more total knowledge) |
| Memory (VRAM) | N | 4N-10N (all experts must be loaded) |
Key insight: MoE gives you the knowledge capacity of a much larger model at the inference cost of a smaller one. The tradeoff is higher memory (to store all experts) and higher training cost.
Real-World MoE Models (May 2026)
| Model | Total Params | Active Params | Experts | Top-K |
|---|---|---|---|---|
| Llama 4 Maverick | 400B | 17B | 128 | 1 routed + 1 shared |
| Llama 4 Scout | 109B | 17B | 16 | 1 routed + 1 shared |
| Grok 2 | ~300B | ~60B | 8 | 2 |
| Mixtral 8x7B | 47B | 13B | 8 | 2 |
| DeepSeek V2 | 236B | 21B | 160 routed + 2 shared | 6 |
| DeepSeek V3 | 671B | 37B | 256 routed + 1 shared | 8 |
| Gemini 3.1 Pro | Unknown | Unknown | Unknown | Unknown |
MoE Challenges
- Memory footprint: All experts must be loaded in memory even though only a few are used per token. This is why MoE models need high-memory GPUs.
- Expert collapse: If load balancing fails, some experts become useless. The model wastes parameters.
- Fine-tuning difficulty: MoE models are harder to fine-tune because updates can destabilize the router.
- Batch inference complexity: Different tokens in a batch might route to different experts, making efficient batching harder.
The Future: Dense vs MoE
As of May 2026, the industry is split:
- Anthropic (Claude) is reported to use dense architectures, on the argument that they produce more coherent outputs (not officially confirmed)
- Google (Gemini), Meta (Llama 4), and DeepSeek use MoE for cost-effective scaling
- OpenAI (GPT-5.5) is reported to use a hybrid approach: a dense core with MoE-like routing for specialized tasks (not officially confirmed)
Part 11: Fine-tuning
After training, you can adapt a model to your domain:
How: Start with a trained model and train on your data (thousands of examples). Full fine-tuning updates every parameter; parameter-efficient methods such as LoRA train only about 0.1-1% as many.
Cost: Typically at least 10-100x cheaper than pre-training.
When to use: Model outputs wrong style/tone, needs domain expertise.
When not to: You have under 100 examples (use RAG instead), you need simple reasoning (use prompting).
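One way to end up training well under 1% of the parameters is a low-rank adapter (LoRA), a technique the text does not name and which is assumed here for illustration. The weight matrix stays frozen and only two small factor matrices are trained; the sketch computes what fraction of the original matrix they represent. The dimensions and rank are hypothetical.

```python
def lora_fraction(d_in, d_out, rank):
    """Trainable adapter parameters as a fraction of the frozen weight matrix."""
    full = d_in * d_out
    adapter = rank * (d_in + d_out)   # two low-rank factors: d_in x rank and rank x d_out
    return adapter / full


print(f"{lora_fraction(4096, 4096, rank=8):.2%}")  # 0.39%
```

Raising the rank increases the adapter's capacity and its share of trainable parameters in direct proportion.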
For a complete guide, see Training & Fine-tuning.
Common Misconceptions
❌ “Bigger models understand better”
✅ They have more parameters, but understanding is still pattern matching
❌ “LLMs read sequentially”
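✅ They process all input tokens in parallel (that’s why transformers are fast to train); only output generation happens one token at a time
❌ “Attention is human-interpretable”
✅ Attention weights can be inspected, which makes it more interpretable than many neural network components, but the model as a whole is still opaque
❌ “LLMs have memory”
✅ They have a context window. Each conversation starts fresh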
Key Takeaways
- Tokens: Text → numbers for processing
- Embeddings: Numbers → vectors representing meaning
- Attention: Focus on relevant context
- Transformers: Blocks of attention + feedforward layers
- Training: Expensive (done once), learn parameters
- Inference: Cheap (done constantly), use parameters
- Scaling Laws: Performance scales with parameters, data, and compute — but with diminishing returns and intelligent tradeoffs
- Context window: Limits what the model can see
- MoE: Enables much larger models by only activating a fraction of parameters per token
- Fine-tuning: Cheap adaptation to your data
See Also:
- Beginner Path - Non-technical intro
- Builder Path - Hands-on implementation
- RAG Architecture - Understanding embeddings in practice
- Prompt Engineering - Using transformers effectively