How LLMs Work: Technical Deep Dive


Technical deep dive - transformers, attention, embeddings, scaling laws, and MoE

Key Takeaways

Tokens become embeddings, attention identifies relevance, and transformer blocks build understanding
Training costs over $100M for frontier models while inference costs $0.001 per 1K tokens
MoE activates only a fraction of parameters per token, enabling much larger models at similar cost
Scaling laws predict diminishing returns: in the Kaplan et al. fit, 10x parameters lowers loss by only about 16%

Understanding the mechanics of Large Language Models - from tokens to transformers to training.

Part 1: Tokens and Tokenization

The building blocks of everything an LLM does:

What Are Tokens?

LLMs don’t understand words. They understand numbers. A token is a number representing a piece of text.

Examples:

“Hello” → token 15234
“,” → token 89
“ world” → token 62

In English text, one token averages roughly 4 characters, or about 0.75 words.

Why tokens?

Efficiency: Numbers are faster to process than text
Consistency: The same text always maps to the same tokens under a given tokenizer
Compression: Reduces context size

The Tokenizer

A tokenizer is a lookup table (or learned algorithm) that converts text to tokens.

Text: "Hello, world"
  ↓ (tokenizer)
Tokens: [15234, 89, 62]
  ↓ (reverse lookup)
Text: "Hello, world"
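
The round trip above can be sketched in a few lines. This is a toy, assuming a hand-written three-entry vocabulary and greedy longest-match lookup; the names `VOCAB`, `encode`, and `decode` are made up for illustration, and real tokenizers learn their vocabulary from data rather than hard-coding it.

```python
# Toy vocabulary: the IDs are the illustrative ones from the example above,
# not those of any real tokenizer.
VOCAB = {"Hello": 15234, ",": 89, " world": 62}
ID_TO_TEXT = {token_id: piece for piece, token_id in VOCAB.items()}


def encode(text: str) -> list[int]:
    """Greedy longest-match tokenization against VOCAB."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate piece first, then shrink.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers text at position {pos}")
    return ids


def decode(ids: list[int]) -> str:
    """Reverse lookup: concatenate the text piece for each ID."""
    return "".join(ID_TO_TEXT[i] for i in ids)


tokens = encode("Hello, world")
print(tokens)          # [15234, 89, 62]
print(decode(tokens))  # Hello, world
```

Note that the space belongs to the “ world” token, which is why the same word can map to different tokens depending on what precedes it.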

Common tokenizers:

BPE (Byte Pair Encoding) - GPT uses this
SentencePiece - Used by many open models
WordPiece - Used by Google’s BERT

Key insight: Different models use different tokenizers, so the same text has different token counts in different models.

Part 2: Embeddings

How words become numbers the model can process:

What Is an Embedding?

An embedding is a vector (list of numbers) representing the meaning of text.

Example:

“king” embedding: [0.2, -0.1, 0.8, 0.3, …, 0.1]
“queen” embedding: [0.25, -0.08, 0.75, 0.35, …, 0.12]
“man” embedding: [0.1, 0.2, 0.6, 0.1, …, 0.0]

Famous property (word2vec):

king - man + woman ≈ queen

Embeddings capture semantic meaning. Similar words have similar embeddings.

How Embeddings Are Made

LLMs have an embedding layer - a lookup table that converts tokens to vectors:

Token: 15234
  ↓
Embedding Layer (lookup table)
  ↓
Vector: [0.2, -0.1, 0.8, ..., 0.1] (768 dimensions for smaller models)

Sizes vary:

Small models: 768 dimensions
Medium models: 1024-2048 dimensions
Large models: 4096+ dimensions

Key insight: Embeddings are learned during training. The model learns to create embeddings where similar words are close together.
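
As a rough illustration of "similar words are close together", the sketch below measures closeness with cosine similarity. It assumes a toy table keyed by word (a real embedding layer is keyed by token ID) and reuses only the first four numbers of the illustrative vectors above; `EMBEDDINGS`, `embed`, and `cosine_similarity` are hypothetical names.

```python
import math

# Toy 4-dimensional embedding table. The values are the first four numbers of
# the illustrative vectors above, not weights from a real model.
EMBEDDINGS = {
    "king": [0.2, -0.1, 0.8, 0.3],
    "queen": [0.25, -0.08, 0.75, 0.35],
    "man": [0.1, 0.2, 0.6, 0.1],
}


def embed(word: str) -> list[float]:
    """The embedding layer is just a table lookup."""
    return EMBEDDINGS[word]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """1.0 means same direction; values near 0 mean unrelated directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


king_queen = cosine_similarity(embed("king"), embed("queen"))
king_man = cosine_similarity(embed("king"), embed("man"))
print(f"king vs queen: {king_queen:.3f}")
print(f"king vs man:   {king_man:.3f}")
```

With these toy numbers, "king" sits closer to "queen" than to "man", which is the kind of geometry training is meant to produce.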

Part 3: Positional Encoding

Transformers process all input tokens in parallel rather than one at a time, so on their own they have no notion of word order. But order matters: “dog bites man” ≠ “man bites dog”.

Solution: Positional encoding

Each token also gets a position vector:

Token 1: "The" → [0.2, -0.1, 0.8] + position_1
Token 2: "dog" → [0.5, 0.3, 0.1] + position_2
Token 3: "bites" → [0.1, 0.9, -0.2] + position_3

The model learns that position matters and uses it to understand sequence.
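
One way to build the position vectors is the fixed sinusoidal scheme from the original Transformer paper; this is only one option, and many current models use learned or rotary position schemes instead. The sketch below assumes a toy 4-dimensional embedding with made-up values, and the function names are hypothetical.

```python
import math


def positional_encoding(position: int, dim: int) -> list[float]:
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    encoding = []
    for i in range(dim):
        # Each pair of dimensions (2k, 2k+1) shares one wavelength.
        angle = position / (10000 ** (2 * (i // 2) / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


def add_position(embedding: list[float], position: int) -> list[float]:
    """Element-wise sum of a token embedding and its position vector."""
    pe = positional_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]


dog = [0.5, 0.3, 0.1, 0.0]  # toy 4-dimensional embedding (hypothetical values)
print(add_position(dog, 1))
print(add_position(dog, 2))  # same token, different position, different vector
```

The same token ends up with a different input vector at each position, which is what lets attention tell “dog bites man” from “man bites dog”.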

Part 4: The Attention Mechanism

This is the secret sauce. Attention is what makes transformers work.

The Problem Attention Solves

Without attention: The model treats all tokens equally. “The dog bites man” - all tokens are equally relevant to predicting the next word.

With attention: The model focuses on relevant tokens. When predicting the next word after “bites”, it focuses on “dog” and “man”, ignores “The”.

How Attention Works (Simplified)

For each token, compute:

Query (Q): What am I looking for?
Key (K): What information do I have?
Value (V): What’s the information?

Query for "bites": "I need info about the subject/object"
  ↓
Score each other token's Key:
  "The" → low score (not important)
  "dog" → high score (subject!)
  "man" → medium score (object!)
  ↓
Use scores to weight Values:
  dog's value (55%) + man's value (35%) + The's value (10%)
  ↓
Result: Weighted information = "attention output"
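
The score-then-weight flow above corresponds to scaled dot-product attention. Below is a minimal single-query sketch; the 2-dimensional key and value vectors are hand-picked for illustration (in a real model they come from learned projections of the token embeddings), and `attend` is a hypothetical name.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output


# Hypothetical 2-dimensional vectors, hand-picked so "dog" scores highest.
tokens = ["The", "dog", "man"]
keys = [[0.0, 0.1], [2.0, 1.0], [1.0, 1.5]]
values = [[0.1, 0.0], [0.9, 0.2], [0.3, 0.8]]
query_bites = [1.0, 1.0]

weights, output = attend(query_bites, keys, values)
for token, weight in zip(tokens, weights):
    print(f"{token}: {weight:.0%}")
```

The output vector is a blend of the value vectors, dominated by whichever tokens the query scored highest.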

Why It’s Powerful

Attention lets the model:

Focus on relevant context
Understand long-range dependencies
Handle ambiguity (“bank” - financial or river?)

Part 5: The Transformer Architecture

A transformer is built from attention blocks stacked on top of each other.

One Transformer Block

Input tokens (with embeddings)
  ↓
Multi-head Attention (8-16 attention heads in parallel)
  ↓
Feed-forward Network (neural network)
  ↓
Output (fed to next block or to prediction layer)

Why multiple attention heads? Each head can focus on different types of relationships:

Head 1: “focus on grammar”
Head 2: “focus on meaning”
Head 3: “focus on long-range dependencies”

Stacking Blocks

A modern LLM has 12-96 transformer blocks stacked:

Input
  ↓
Block 1 (extract basic patterns)
  ↓
Block 2 (extract medium-level patterns)
  ↓
...
  ↓
Block 96 (extract complex semantic understanding)
  ↓
Output prediction

Intuition: Earlier blocks extract simple patterns (grammar, word relationships). Later blocks extract complex patterns (reasoning, meaning).
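
To make the data flow concrete, here is a heavily simplified sketch of a stack of blocks. It is not a faithful transformer: it assumes a single attention head with no learned projections, a fixed ReLU standing in for the feed-forward network, and the same block reused at every layer (real models give each block its own weights). It also adds residual connections, which real transformers use but the diagram above does not draw, and it omits layer normalization.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(xs):
    """Single head with Q = K = V = xs (no learned projections)."""
    dim = len(xs[0])
    out = []
    for query in xs:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in xs]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, xs)) for j in range(dim)])
    return out


def feed_forward(x):
    """Stand-in for the learned feed-forward network: a fixed ReLU."""
    return [max(0.0, v) for v in x]


def transformer_block(xs):
    """Attention, then feed-forward, each wrapped in a residual connection."""
    attended = self_attention(xs)
    xs = [[a + b for a, b in zip(x, h)] for x, h in zip(xs, attended)]
    return [[a + b for a, b in zip(x, feed_forward(x))] for x in xs]


def run_stack(xs, num_blocks):
    """Feed the output of each block into the next one."""
    for _ in range(num_blocks):
        xs = transformer_block(xs)
    return xs


token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy inputs
stack_output = run_stack(token_vectors, num_blocks=3)
print(stack_output)
```

The point is the shape of the computation: every block takes one vector per token and returns one vector per token, so blocks can be stacked to any depth.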

Part 6: Training vs Inference

How models learn versus how they’re used:

Training (Expensive, One-Time)

Goal: Learn the embeddings, attention weights, and parameters

Process:

Start with random parameters
Feed in text (billions of tokens)
Predict next token (language modeling task)
Compare prediction to actual
Adjust parameters slightly to reduce error
Repeat millions of times
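
The six steps above fit in a short script if the "model" is shrunk to a table of logits that predicts the next token from the previous one. This is a toy sketch under that assumption (a five-word corpus, plain gradient descent, hypothetical variable names), not how a transformer is actually trained at scale.

```python
import math
import random

corpus = "the dog bites the man".split()
vocab = sorted(set(corpus))
index = {word: i for i, word in enumerate(vocab)}
# Training examples: (previous token, next token) pairs.
pairs = [(index[a], index[b]) for a, b in zip(corpus, corpus[1:])]

random.seed(0)
V = len(vocab)
# Step 1: start with random parameters. logits[prev][nxt] scores each next token.
logits = [[random.uniform(-0.1, 0.1) for _ in range(V)] for _ in range(V)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def average_loss():
    """Mean cross-entropy of the true next token over all training pairs."""
    return sum(-math.log(softmax(logits[p])[n]) for p, n in pairs) / len(pairs)


initial_loss = average_loss()
learning_rate = 0.5
for epoch in range(200):                   # Step 6: repeat
    for prev, nxt in pairs:                # Step 2: feed in text
        probs = softmax(logits[prev])      # Step 3: predict the next token
        for j in range(V):
            # Steps 4-5: for softmax + cross-entropy, the gradient with respect
            # to each logit is (predicted probability - target indicator).
            grad = probs[j] - (1.0 if j == nxt else 0.0)
            logits[prev][j] -= learning_rate * grad
final_loss = average_loss()

print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

The loss cannot reach zero here because "the" is followed by both "dog" and "man" in the corpus, so the best the model can do is split its probability between them.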

Cost:

GPT-4: ~$100M in compute
Claude Opus: Similar scale
Llama 70B: ~$5-10M

Time: Weeks to months on clusters of GPUs/TPUs

Inference (Cheap, Happens Constantly)

Goal: Use the trained model to predict text

Process:

Load trained parameters (frozen, don’t change)
For each token, predict the next one
Pick (sample) next token
Repeat until done

Cost: ~$0.001 per thousand tokens (Claude)

Time: Milliseconds to seconds

Key insight: Training is expensive. Inference is cheap. That’s why simple tasks are better served by prompting an existing model than by fine-tuning one.

Part 7: Generation (How Models Produce Output)

How the model turns probabilities into text:

Greedy Decoding

Always pick the most likely next token:

Prompt: "The capital of France is"
Model thinks: 50% Paris, 20% Lyon, 10% Marseille
Result: Always pick "Paris" (greedy)
Output: "The capital of France is Paris"

Pros: Deterministic, fast
Cons: Boring, repetitive
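
Combined with the inference loop from Part 6, greedy decoding looks roughly like this. The sketch assumes a stand-in "model" that is just a lookup table conditioned on the last token (a real model conditions on the whole context); `NEXT_TOKEN_PROBS`, the `<end>` marker, and the extra probabilities are invented for illustration.

```python
# Hypothetical stand-in for a trained model: maps the last token to
# next-token probabilities. A real model conditions on the whole context.
NEXT_TOKEN_PROBS = {
    "is": {"Paris": 0.5, "Lyon": 0.2, "Marseille": 0.1, "the": 0.2},
    "Paris": {"<end>": 0.9, ",": 0.1},
}


def greedy_decode(prompt: list[str], max_new_tokens: int = 10) -> list[str]:
    """Repeatedly append the single most likely next token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = NEXT_TOKEN_PROBS.get(tokens[-1], {"<end>": 1.0})
        next_token = max(probs, key=probs.get)  # greedy: take the argmax
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens


result = greedy_decode(["The", "capital", "of", "France", "is"])
print(" ".join(result))  # The capital of France is Paris
```

Because the argmax never changes, running this twice on the same prompt always gives the same output.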

Temperature Sampling

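Divide the model's logits by a temperature T before the softmax, then sample from the resulting distribution. A temperature below 1 sharpens the distribution; a temperature above 1 flattens it.

Temperature = 0.7 (common):

Original: 50% Paris, 20% Lyon, 10% Marseille
After: ~61% Paris, ~16% Lyon, ~6% Marseille (treating the remaining 20% as one token)
Sample: Randomly pick (usually Paris, sometimes Lyon)

Temperature = 0.1 (near-deterministic): Probabilities become more extreme → almost always pick the top choice

Temperature = 2.0 (creative): Probabilities flatten → more random choices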
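
Here is a small sketch of how temperature reshapes a distribution before sampling. It assumes the remaining 20% of probability mass sits on a single `<other>` token, which is a simplification, and the function names are hypothetical. Raising each probability to the power 1/T and renormalizing is equivalent to dividing the logits by T before the softmax.

```python
import math
import random


def apply_temperature(probs: dict[str, float], temperature: float) -> dict[str, float]:
    """Rescale a distribution; same as softmax(logits / temperature)."""
    scaled = {tok: math.exp(math.log(p) / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: value / total for tok, value in scaled.items()}


def sample(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one token at random, weighted by probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# The remaining 20% is lumped into one "<other>" token (an assumption).
original = {"Paris": 0.5, "Lyon": 0.2, "Marseille": 0.1, "<other>": 0.2}

for t in (0.1, 0.7, 1.0, 2.0):
    adjusted = apply_temperature(original, t)
    print(t, {tok: round(p, 2) for tok, p in adjusted.items()})

rng = random.Random(0)  # fixed seed so the run is repeatable
print(sample(apply_temperature(original, 0.7), rng))
```

At T = 1.0 the distribution is unchanged; lower values push probability toward "Paris", higher values spread it across the alternatives.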

Part 8: Scaling Laws

There’s a predictable relationship between model size, data, compute, and performance. Understanding this is essential for making informed decisions about model selection and architecture.

The Core Finding (Kaplan et al., 2020)

Scaling laws describe how model performance improves as you increase three resources:

Performance improvement ≈ f(
  Model parameters (size),
  Training tokens (data),
  Compute budget (FLOPs)
)

Empirical relationship (Kaplan, where loss falls roughly as parameters^-0.076 when data is not the bottleneck):

2x parameters → ~5% lower loss
10x parameters → ~16% lower loss
100x parameters → ~30% lower loss

Key insight: Performance scales as a power-law with each resource. There are diminishing returns, but no “plateau” was observed at GPT-3 scale.
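
The power law can be turned into a quick calculator. The sketch below uses the parameter-scaling exponent of about 0.076 reported by Kaplan et al., and assumes data and compute are not the limiting factor; `loss_reduction` is a hypothetical helper name.

```python
ALPHA_N = 0.076  # parameter-scaling exponent reported by Kaplan et al. (2020)


def loss_reduction(scale_factor: float, alpha: float = ALPHA_N) -> float:
    """Fractional drop in loss when parameter count grows by scale_factor."""
    return 1.0 - scale_factor ** (-alpha)


for factor in (2, 10, 100):
    print(f"{factor}x parameters -> {loss_reduction(factor):.1%} lower loss")
```

Each additional 10x in parameters multiplies the loss by the same factor, which is why the gains look smaller and smaller in absolute terms.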

The Three Axes

Axis	What it costs	What it buys
Parameters	Memory (VRAM), inference latency	Model capacity, knowledge retention
Training tokens	Data collection, compute for processing	Broad knowledge, language fluency
Compute budget	GPU/TPU hours, dollars	Overall capability

The art of model design is deciding how to allocate your compute budget across parameters and data.

Chinchilla Optimal Compute (2022)

DeepMind showed that most models were undertrained — they had too many parameters for the amount of data they were trained on.

Chinchilla’s finding: For a compute-optimal model, the ratio should be:

20 tokens of training data per parameter

Training compute ≈ 6 × parameters × training tokens
Training tokens = 20 × parameters
Optimal model size = sqrt(Compute budget / (6 × 20))

Example: If you have $10^{23}$ FLOPs of compute:

Optimal model size: ~29B parameters
Optimal training tokens: ~580B tokens

Why this matters: Many models (including GPT-3) were significantly over-parameterized. You could get the same performance from a smaller model trained on more data, for less cost.
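
The allocation can be worked out directly. This sketch assumes the common approximation that training compute is about 6 FLOPs per parameter per token, together with the 20-tokens-per-parameter rule of thumb; combining the two gives compute ≈ 120 × parameters², so the model size is a square root of the budget. The function name is hypothetical.

```python
import math


def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (parameters, training tokens).

    Assumes compute ≈ 6 * parameters * tokens and tokens = 20 * parameters.
    """
    params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


params, tokens = chinchilla_optimal(1e23)
print(f"parameters: {params / 1e9:.0f}B, tokens: {tokens / 1e9:.0f}B")
```

Because of the square root, a 100x larger compute budget calls for only about a 10x larger model, trained on about 10x more tokens.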

Current State (May 2026)

Scaling has evolved significantly since the original papers:

What’s changed:

Data quality > data quantity: The best models now prioritize curated, high-quality data over raw scale. Chinchilla assumed all tokens are equal — they’re not.
MoE decouples capacity from compute: Mixture of Experts models can have hundreds of billions of parameters but only use a fraction per token (see Part 10).
Test-time compute is a new axis: Models like o3 and DeepSeek R1 spend more compute at inference time (thinking) rather than just at training time.

The diminishing returns debate:

Pro-scaling: Larger models continue to improve on hard benchmarks (GPQA, SWE-bench)
Anti-scaling: Marginal gains shrink; better data, architecture, and inference techniques offer higher ROI
Consensus: Scaling still works, but the frontier has moved from “scale blindly” to “scale intelligently” (better data, better architecture, better training methods)

Practical implications:

Frontier models (Claude Opus, GPT-5.5) are in the 1T+ parameter range (with MoE)
Small models (Phi-4, Gemma) achieve GPT-3.5-class performance at 1/100th the size through better data and training
The cost of training frontier models has risen to $100M-$1B per run
Most teams should not train models — fine-tune existing ones instead

Part 9: Context Window

Models can only see a limited context:

Claude: 200K tokens (≈150K words)
GPT-4o: 128K tokens (≈96K words)
Gemini 3.1: 1M tokens (≈750K words)

Why the Limit?

Attention is O(n²) complexity:

1000 tokens → 1M operations
100K tokens → 10B operations (expensive!)

How It Matters

With 200K context, you can:

Read a 150-page document
Have a 100-turn conversation
But not both simultaneously

Part 10: Mixture of Experts (MoE) Architecture

Most frontier models no longer use a single dense neural network. Instead, they use a Mixture of Experts (MoE) architecture that decouples total parameter count from per-token compute cost.

The Problem MoE Solves

A dense model uses all its parameters for every token. A 1T parameter dense model would be impossibly expensive to run. MoE solves this by activating only a subset of parameters per token.

Dense model (inefficient at scale):
  Every token → All parameters → Output

MoE model (efficient at scale):
  Every token → Router → Top 2 of 16 experts → Output

How MoE Works

1. Experts: Feed-forward networks, each specialized in different types of processing. A model might have 16 to 256 experts.

2. Router (Gating Network): A small neural network that decides which experts to use for each token. It outputs a probability distribution over experts.

3. Top-K Routing: Only the top K experts (typically K=2) are activated per token. The others are skipped entirely.

Token: "bites"
  ↓
Router scores: Expert 7 (52%), Expert 12 (31%), Expert 3 (4%), ...
  ↓
Activate: Expert 7 + Expert 12 (weighted by router scores)
  ↓
Output: Combined expert output → next layer

Why K=2? Two experts provide enough specialization without the overhead of coordinating more. Some models use K=1 (cheapest) or K=4 (more capacity).
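
The routing steps above can be sketched as follows. The experts here are trivial scaling functions and the router logits are hard-coded, both purely for illustration; in a real model the logits come from a learned gating network applied to the token. Renormalizing the selected weights so they sum to 1 is one common choice, not the only one.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, k=2):
    """Return [(expert_index, weight)] for the top-k experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize so the selected experts' weights sum to 1.
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(x, router_logits, experts, k=2):
    """Run only the selected experts and mix their outputs."""
    output = [0.0] * len(x)
    for expert_index, weight in route(router_logits, k):
        expert_out = experts[expert_index](x)
        output = [o + weight * y for o, y in zip(output, expert_out)]
    return output


# Hypothetical experts: each just scales its input by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, 1.0, -1.0]  # in a real model, computed from the token

chosen = route(router_logits)
mixed = moe_layer([1.0, 1.0], router_logits, experts)
print(chosen)
print(mixed)
```

Experts 0 and 3 are never called for this token, which is where the compute savings come from.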

Expert Specialization

Experts naturally specialize during training without explicit supervision:

Some experts learn grammar patterns
Some experts learn factual knowledge
Some experts learn coding syntax
Some experts learn multi-step reasoning

This emerges from training — no one assigns roles to experts. The router learns to route tokens to the right experts based on the task.

Load Balancing

A critical challenge: the router might route all tokens to the same few experts, leaving others unused. Solutions:

Load balancing loss: A penalty term added during training that encourages the router to distribute tokens evenly across experts.

Expert capacity: A hard limit on how many tokens each expert can process. Tokens routed to a full expert are passed through (not processed) or rerouted.
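
One widely used form of the load balancing loss is the auxiliary loss from the Switch Transformer paper: the number of experts times the sum, over experts, of (fraction of tokens sent to the expert) × (mean router probability for the expert). The sketch below assumes top-1 routing and omits the small coefficient that real training runs multiply this term by; the function name and the example inputs are hypothetical.

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    """Switch-style auxiliary loss: equals 1.0 when routing is perfectly uniform.

    router_probs: for each token, the router's probabilities over experts.
    assignments:  for each token, the index of the expert it was sent to.
    """
    n_tokens = len(assignments)
    frac_tokens = [assignments.count(e) / n_tokens for e in range(num_experts)]
    mean_prob = [sum(p[e] for p in router_probs) / n_tokens for e in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))


# Balanced: four tokens spread evenly over four experts.
balanced = load_balancing_loss([[0.25] * 4] * 4, [0, 1, 2, 3], 4)

# Collapsed: the router sends every token to expert 0.
collapsed = load_balancing_loss([[0.97, 0.01, 0.01, 0.01]] * 4, [0, 0, 0, 0], 4)

print(balanced, collapsed)
```

Adding this term to the training loss penalizes the collapsed case, nudging the router back toward spreading tokens across experts.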

MoE vs Dense: Tradeoffs

Aspect	Dense	MoE
Total parameters	N	4N to 10N (experts + shared)
Active parameters per token	N	~N (only K experts plus shared layers)
Training cost	Baseline	2-3x more (experts add gradient compute)
Inference cost	Baseline	Similar to dense (same active params)
Quality at same active params	Worse	Better (more total knowledge)
Memory (VRAM)	N	4N-10N (all experts must be loaded)

Key insight: MoE gives you the knowledge capacity of a much larger model at the inference cost of a smaller one. The tradeoff is higher memory (to store all experts) and higher training cost.

Real-World MoE Models (May 2026)

Model	Total Params	Active Params	Experts	Top-K
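Llama 4 Maverick	~400B	~17B	128	1 (+ shared expert)
Llama 4 Scout	109B	~17B	16	1 (+ shared expert)
Grok 2	~300B	~60B	8	2
Mixtral 8x7B	47B	13B	8	2
DeepSeek V2	236B	~21B	160 routed + 2 shared	6
DeepSeek V3	671B	~37B	256 routed + 1 shared	8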
Gemini 3.1 Pro	Unknown	Unknown	Unknown	Unknown

MoE Challenges

Memory footprint: All experts must be loaded in memory even though only K of them are used per token. This is why MoE models need high-memory GPUs.
Expert collapse: If load balancing fails, some experts become useless. The model wastes parameters.
Fine-tuning difficulty: MoE models are harder to fine-tune because updates can destabilize the router.
Batch inference complexity: Different tokens in a batch might route to different experts, making efficient batching harder.

The Future: Dense vs MoE

As of May 2026, the industry is split:

Anthropic (Claude) uses dense architectures, arguing they produce more coherent outputs
Google (Gemini), Meta (Llama 4), and DeepSeek use MoE for cost-effective scaling
OpenAI (GPT-5.5) uses a hybrid approach — dense core with MoE-like routing for specialized tasks

Part 11: Fine-tuning

After training, you can adapt a model to your domain:

How: Start with a trained model and train on your data (thousands of examples). With parameter-efficient methods such as LoRA, only about 0.1-1% of the parameters are updated.

Cost: 10-100x cheaper than pre-training.

When to use: Model outputs wrong style/tone, needs domain expertise.

When not to: You have under 100 examples (use RAG instead), you need simple reasoning (use prompting).

For a complete guide, see Training & Fine-tuning.

Common Misconceptions

❌ “Bigger models understand better”
✅ They have more parameters, but understanding is still pattern matching

❌ “LLMs read sequentially”
✅ They process all input tokens in parallel within a forward pass (that’s why transformers train fast); only generation happens one token at a time

❌ “Attention is human-interpretable”
✅ It’s interpretable compared to other neural networks, but still opaque

❌ “LLMs have memory”
✅ They have a context window. Each conversation starts fresh

Key Takeaways

Tokens: Text → numbers for processing
Embeddings: Numbers → vectors representing meaning
Attention: Focus on relevant context
Transformers: Blocks of attention + feedforward layers
Training: Expensive (done once), learn parameters
Inference: Cheap (done constantly), use parameters
Scaling Laws: Performance scales with parameters, data, and compute — but with diminishing returns and intelligent tradeoffs
Context window: Limits what the model can see
MoE: Enables much larger models by only activating a fraction of parameters per token
Fine-tuning: Cheap adaptation to your data