How LLMs Work: Technical Deep Dive

Technical deep dive - transformers, attention, embeddings, scaling laws, and MoE
Key Takeaways
  • Tokens become embeddings, attention identifies relevance, and transformer blocks build understanding
  • Training a frontier model costs on the order of $100M or more, while inference costs from a fraction of a cent to a few cents per 1K tokens
  • MoE activates only a fraction of parameters per token, enabling much larger models at similar cost
  • Scaling laws predict diminishing returns: in the Kaplan et al. fit, 10x parameters gives roughly 16% lower loss

Understanding the mechanics of Large Language Models - from tokens to transformers to training.


Part 1: Tokens and Tokenization

The building blocks of everything an LLM does:

What Are Tokens?

LLMs don’t understand words. They understand numbers. A token is a number representing a piece of text.

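Examples (the IDs are illustrative; real IDs depend on the tokenizer):

  • “Hello” → token 15234
  • “,” → token 89
  • “ world” → token 62
  • “!” → token 0

In English text, one token averages roughly 4 characters, or about 0.75 words.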

Why tokens?

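  1. Fixed vocabulary: The model works over a finite set of IDs (typically tens of thousands to a few hundred thousand) instead of open-ended strings
  2. Consistency: The same text always maps to the same token sequence under a given tokenizer
  3. Compression: Subword tokens give shorter sequences than raw characters or bytes, so more text fits in the context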

The Tokenizer

A tokenizer is a lookup table (or learned algorithm) that converts text to tokens.

Text: "Hello, world!"
↓ (tokenizer)
Tokens: [15234, 89, 62, 0]
↓ (reverse lookup)
Text: "Hello, world!"
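
To make the round trip concrete, here is a minimal sketch of a tokenizer built on a toy vocabulary. The vocabulary, the IDs (including the entry for "!"), and the greedy longest-match strategy are all assumptions for illustration; real BPE tokenizers learn merge rules from data and do not work this way internally.

```python
# Toy vocabulary; the IDs are made up for illustration and match no real tokenizer.
VOCAB = {"Hello": 15234, ",": 89, " world": 62, "!": 0}
ID_TO_TEXT = {token_id: text for text, token_id in VOCAB.items()}


def encode(text: str) -> list[int]:
    """Greedy longest-match encoding against VOCAB."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Pick the longest vocabulary entry that matches at this position.
        match = max(
            (piece for piece in VOCAB if text.startswith(piece, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no token covers {text[pos:]!r}")
        tokens.append(VOCAB[match])
        pos += len(match)
    return tokens


def decode(tokens: list[int]) -> str:
    """Reverse lookup: concatenate the text piece for each ID."""
    return "".join(ID_TO_TEXT[t] for t in tokens)


ids = encode("Hello, world!")
print(ids)
print(decode(ids))
```

Note that the leading space is part of the “ world” token, which is why decoding is a plain concatenation.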

Common tokenizers:

  • BPE (Byte Pair Encoding) - used by the GPT models
  • SentencePiece - a tokenizer library (BPE or unigram) used by many open models
  • WordPiece - used by Google’s BERT

Key insight: Different models use different tokenizers, so the same text has different token counts in different models.


Part 2: Embeddings

How words become numbers the model can process:

What Is an Embedding?

An embedding is a vector (list of numbers) representing the meaning of text.

Example:

  • “king” embedding: [0.2, -0.1, 0.8, 0.3, …, 0.1]
  • “queen” embedding: [0.25, -0.08, 0.75, 0.35, …, 0.12]
  • “man” embedding: [0.1, 0.2, 0.6, 0.1, …, 0.0]

Famous property (word2vec):

king - man + woman ≈ queen

Embeddings capture semantic meaning. Similar words have similar embeddings.

How Embeddings Are Made

LLMs have an embedding layer - a lookup table that converts tokens to vectors:

Token: 15234
Embedding Layer (lookup table)
Vector: [0.2, -0.1, 0.8, ..., 0.1] (768 dimensions for smaller models)

Sizes vary:

  • Small models: 768 dimensions
  • Medium models: 1024-2048 dimensions
  • Large models: 4096+ dimensions

Key insight: Embeddings are learned during training. The model learns to create embeddings where similar words are close together.
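
A small sketch of the lookup and of what “close together” means in practice. The token IDs and the 4-dimensional vectors are hypothetical (real tables are learned and far wider), and cosine similarity is used here as one common way to measure closeness.

```python
import math

# Hypothetical 4-dimensional embedding table keyed by token ID.
EMBEDDINGS = {
    101: [0.20, -0.10, 0.80, 0.30],   # "king"
    102: [0.25, -0.08, 0.75, 0.35],   # "queen"
    103: [0.90, 0.40, -0.30, -0.60],  # "banana"
}


def embed(token_id: int) -> list[float]:
    """The embedding layer is a table lookup: token ID in, vector out."""
    return EMBEDDINGS[token_id]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


king_queen = cosine_similarity(embed(101), embed(102))
king_banana = cosine_similarity(embed(101), embed(103))
print(f"king vs queen:  {king_queen:.3f}")
print(f"king vs banana: {king_banana:.3f}")
```

With these made-up vectors, “king” and “queen” score close to 1 while “king” and “banana” score much lower, which is the property training is meant to produce.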


Part 3: Positional Encoding

Transformers process all tokens of the input in parallel rather than one at a time (generation still emits one new token per step). But order matters: “dog bites man” ≠ “man bites dog”.

Solution: Positional encoding

Each token also gets a position vector:

Token 1: "The" → [0.2, -0.1, 0.8] + position_1
Token 2: "dog" → [0.5, 0.3, 0.1] + position_2
Token 3: "bites" → [0.1, 0.9, -0.2] + position_3

The model learns that position matters and uses it to understand sequence.
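
As one possible way to build the position vectors, the sketch below uses the fixed sinusoidal scheme from the original Transformer paper. That choice is an assumption: many current LLMs use learned or rotary position encodings instead, and the 4-dimensional token vectors are toy values.

```python
import math


def sinusoidal_position(pos: int, dim: int) -> list[float]:
    """Fixed position vector: sin on even dimensions, cos on odd ones."""
    vec = []
    for i in range(dim):
        # Dimensions come in (sin, cos) pairs that share one frequency.
        freq = 1.0 / (10000 ** ((i - i % 2) / dim))
        angle = pos * freq
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def add_position(embedding: list[float], pos: int) -> list[float]:
    """Element-wise sum of a token embedding and its position vector."""
    position = sinusoidal_position(pos, len(embedding))
    return [e + p for e, p in zip(embedding, position)]


# The same token vector at two different positions yields two different inputs.
dog = [0.5, 0.3, 0.1, 0.0]
print(add_position(dog, 1))
print(add_position(dog, 2))
```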


Part 4: The Attention Mechanism

This is the secret sauce. Attention is what makes transformers work.

The Problem Attention Solves

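Without attention: The model has no way to decide which other tokens matter. In “The dog bites man”, every token would contribute equally to each prediction.

With attention: The model weights tokens by relevance. When building its representation of “bites”, it weights “dog” and “man” heavily and “The” lightly. (A decoder-only LLM lets each token attend only to itself and earlier tokens; the example ignores that mask for simplicity.)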

How Attention Works (Simplified)

For each token, compute:

  1. Query (Q): What am I looking for?
  2. Key (K): What information do I have?
  3. Value (V): What’s the information?
Query for "bites": "I need info about the subject/object"
Score each other token's Key:
"The" → low score (not important)
"dog" → high score (subject!)
"man" → medium score (object!)
Use scores to weight Values:
dog's value (40%) + man's value (35%) + The's value (25%)
Result: Weighted information = "attention output"
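
The walkthrough above corresponds to scaled dot-product attention. Here is a minimal single-query sketch; the 2-dimensional query, key, and value vectors are invented for illustration, and in a real model they come from learned projections of the token embeddings.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Score the query against each key, then average the values by weight."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    width = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]
    return weights, output


# Toy vectors for the tokens "The", "dog", "man" (assumed values).
names = ["The", "dog", "man"]
keys = [[0.1, 0.0], [1.0, 0.5], [0.8, 0.4]]
values = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
query_bites = [1.0, 1.0]

weights, output = attend(query_bites, keys, values)
for name, weight in zip(names, weights):
    print(f"{name}: {weight:.2f}")
print("attention output:", output)
```

The weights always sum to 1, so even a low-scoring token such as “The” keeps a small share rather than being dropped outright.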

Why It’s Powerful

Attention lets the model:

  • Focus on relevant context
  • Understand long-range dependencies
  • Handle ambiguity (“bank” - financial or river?)

Part 5: The Transformer Architecture

A transformer is built from attention blocks stacked on top of each other.

One Transformer Block

Input tokens (with embeddings)
Multi-head Attention (several attention heads in parallel, e.g. 12 in small models and 96 or more in large ones)
Feed-forward Network (neural network)
Output (fed to next block or to prediction layer)

Why multiple attention heads? Each head can focus on different types of relationships:

  • Head 1: “focus on grammar”
  • Head 2: “focus on meaning”
  • Head 3: “focus on long-range dependencies”

Stacking Blocks

A modern LLM has 12-96 transformer blocks stacked:

Input
Block 1 (extract basic patterns)
Block 2 (extract medium-level patterns)
...
Block 96 (extract complex semantic understanding)
Output prediction

Intuition: Earlier blocks extract simple patterns (grammar, word relationships). Later blocks extract complex patterns (reasoning, meaning).


Part 6: Training vs Inference

How models learn versus how they’re used:

Training (Expensive, One-Time)

Goal: Learn the embeddings, attention weights, and parameters

Process:

  1. Start with random parameters
  2. Feed in text (billions of tokens)
  3. Predict next token (language modeling task)
  4. Compare prediction to actual
  5. Adjust parameters slightly to reduce error
  6. Repeat millions of times
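
The loop above can be sketched at toy scale. In this assumed setup the whole “model” is three logits for a single fixed context, and the parameters start at zero rather than random values so the run is reproducible. Real training applies the same predict, compare, adjust cycle to billions of parameters through backpropagation.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["Paris", "Lyon", "Marseille"]
logits = [0.0, 0.0, 0.0]   # parameters of the toy model
target = 0                 # the actual next token in the training text: "Paris"
learning_rate = 0.5        # assumed value for illustration

losses = []
for step in range(20):
    probs = softmax(logits)                     # predict the next token
    losses.append(-math.log(probs[target]))     # compare: cross-entropy loss
    # For softmax + cross-entropy, the gradient w.r.t. logits is probs - one_hot.
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    # Adjust parameters slightly in the direction that reduces the error.
    logits = [w - learning_rate * g for w, g in zip(logits, grads)]

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
print(f"P(Paris) after training: {softmax(logits)[target]:.3f}")
```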

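Cost (public estimates; labs rarely disclose exact figures):

  • GPT-4: reportedly more than $100M in compute
  • Claude Opus: undisclosed, presumably similar scale
  • Llama 70B: estimated at ~$5-10M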

Time: Weeks to months on clusters of GPUs/TPUs

Inference (Cheap, Happens Constantly)

Goal: Use the trained model to predict text

Process:

  1. Load trained parameters (frozen, don’t change)
  2. For each token, predict the next one
  3. Pick (sample) next token
  4. Repeat until done
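
A minimal sketch of that loop, assuming a hypothetical frozen “model” that is just a table of next-token probabilities keyed by the previous token. A real LLM conditions on the entire context, not only the last token, and this sketch always picks the most likely token instead of sampling.

```python
# Hypothetical frozen parameters: next-token probabilities by previous token.
NEXT_TOKEN_PROBS = {
    "<start>": {"The": 0.9, "A": 0.1},
    "The": {"dog": 0.6, "man": 0.4},
    "A": {"dog": 0.5, "man": 0.5},
    "dog": {"bites": 0.7, "barks": 0.3},
    "barks": {"<end>": 1.0},
    "bites": {"man": 0.8, "dog": 0.2},
    "man": {"<end>": 1.0},
}


def generate(max_tokens: int = 10) -> list[str]:
    """Autoregressive loop: predict, pick, append, repeat."""
    tokens = ["<start>"]
    for _ in range(max_tokens):
        probs = NEXT_TOKEN_PROBS[tokens[-1]]      # predict the next token
        next_token = max(probs, key=probs.get)    # pick the most likely one
        if next_token == "<end>":
            break
        tokens.append(next_token)                 # feed it back in
    return tokens[1:]


print(" ".join(generate()))
```

The table is never modified during generation, which mirrors the frozen parameters in step 1.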

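Cost: From a fraction of a cent to a few cents per thousand tokens, depending on the model (roughly $0.001 per thousand tokens for small hosted models)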

Time: Milliseconds to seconds

Key insight: Training is a large up-front cost; inference is cheap per request. That’s why, for simple tasks, you reuse an existing model through prompting instead of training or fine-tuning one.


Part 7: Generation (How Models Produce Output)

How the model turns probabilities into text:

Greedy Decoding

Always pick the most likely next token:

Prompt: "The capital of France is"
Model thinks: 50% Paris, 20% Lyon, 10% Marseille
Result: Always pick "Paris" (greedy)
Output: "The capital of France is Paris"

Pros: Deterministic, fast
Cons: Boring, repetitive

Temperature Sampling

Divide the model’s logits by a temperature T before the softmax, then sample from the resulting probabilities:

Temperature = 0.7 (common):

Original (T = 1): 50% Paris, 20% Lyon, 10% Marseille
After (T = 0.7): Paris rises to roughly 60-70%; Lyon, Marseille, and the long tail lose share
Sample: Randomly pick (usually Paris, sometimes Lyon)

Temperature = 0.1 (near-deterministic): Probabilities become more extreme → almost always pick the top choice

Temperature = 2.0 (creative): Probabilities flatten → more random choices
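
To see the direction of the effect, here is a small sketch that rescales a distribution by temperature. It assumes the remaining 20% of probability sits in a single catch-all “other” bucket, which is a simplification; with a real vocabulary the exact numbers differ, but values below 1 still sharpen the distribution and values above 1 still flatten it.

```python
import math
import random


def apply_temperature(probs: list[float], temperature: float) -> list[float]:
    """Rescale a distribution: log(p) / T, then softmax."""
    # log(p) recovers the logits up to a constant, which softmax ignores.
    scaled = [math.log(p) / temperature for p in probs]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


tokens = ["Paris", "Lyon", "Marseille", "<other>"]
base = [0.5, 0.2, 0.1, 0.2]   # assumed full distribution at T = 1

for temperature in (0.1, 0.7, 1.0, 2.0):
    adjusted = apply_temperature(base, temperature)
    print(temperature, [round(p, 3) for p in adjusted])

# Sampling: draw one token according to the adjusted probabilities.
rng = random.Random(0)
choice = rng.choices(tokens, weights=apply_temperature(base, 0.7), k=1)[0]
print("sampled:", choice)
```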


Part 8: Scaling Laws

There’s a predictable relationship between model size, data, compute, and performance. Understanding this is essential for making informed decisions about model selection and architecture.

The Core Finding (Kaplan et al., 2020)

Scaling laws describe how model performance improves as you increase three resources:

Performance improvement ≈ f(
Model parameters (size),
Training tokens (data),
Compute budget (FLOPs)
)

Empirical relationship (Kaplan: loss ∝ N^-0.076 in the parameter count N, when data and compute are not the bottleneck):

  • 2x parameters → ~5% lower loss
  • 10x parameters → ~16% lower loss
  • 100x parameters → ~30% lower loss

Key insight: Performance scales as a power-law with each resource. There are diminishing returns, but no “plateau” was observed at GPT-3 scale.
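
The power law is easy to evaluate directly. The sketch below uses the parameter-scaling exponent of about 0.076 reported by Kaplan et al. and assumes data and compute are not the limiting factor; the function name is made up for this example.

```python
ALPHA_N = 0.076  # parameter-scaling exponent reported by Kaplan et al. (2020)


def loss_ratio(param_multiplier: float, alpha: float = ALPHA_N) -> float:
    """Ratio of new loss to old loss after scaling parameters by a factor."""
    return param_multiplier ** (-alpha)


for factor in (2, 10, 100):
    reduction = 1 - loss_ratio(factor)
    print(f"{factor}x parameters -> {reduction:.1%} lower loss")
```

Because the relationship is a power law, each further 10x multiplies the loss by the same ratio, which is why the absolute gains shrink without ever reaching a hard plateau.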

The Three Axes

| Axis | What it costs | What it buys |
| --- | --- | --- |
| Parameters | Memory (VRAM), inference latency | Model capacity, knowledge retention |
| Training tokens | Data collection, compute for processing | Broad knowledge, language fluency |
| Compute budget | GPU/TPU hours, dollars | Overall capability |

The art of model design is deciding how to allocate your compute budget across parameters and data.

Chinchilla Optimal Compute (2022)

DeepMind showed that most models were undertrained — they had too many parameters for the amount of data they were trained on.

Chinchilla’s finding: For a compute-optimal model, the ratio should be roughly:

  • 20 tokens of training data per parameter

Using the common approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6 × N × D):

Training tokens D = 20 × N
Compute budget C ≈ 6 × N × 20N = 120 × N²
Optimal model size N ≈ sqrt(C / 120)

Example: If you have $10^{23}$ FLOPs of compute:

  • Optimal model size: ~29B parameters
  • Optimal training tokens: ~580B tokens

Why this matters: Many models (including GPT-3, with 175B parameters trained on about 300B tokens) were significantly over-parameterized for their compute budget. For the same training compute, a smaller model trained on more data performs better and is cheaper to serve.
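
The allocation can be worked out in a few lines. This sketch assumes the common approximation C ≈ 6 × N × D for training FLOPs together with the 20-tokens-per-parameter rule of thumb, which gives N = sqrt(C / 120); the function name is hypothetical.

```python
import math

FLOPS_PER_PARAM_PER_TOKEN = 6   # common approximation: C ≈ 6 * N * D
TOKENS_PER_PARAM = 20           # Chinchilla rule of thumb


def compute_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (parameters, training tokens) for a given FLOP budget."""
    # C = 6 * N * (20 * N) = 120 * N^2, so N = sqrt(C / 120).
    params = math.sqrt(
        compute_flops / (FLOPS_PER_PARAM_PER_TOKEN * TOKENS_PER_PARAM)
    )
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


params, tokens = compute_optimal(1e23)
print(f"parameters: {params / 1e9:.0f}B")
print(f"tokens:     {tokens / 1e9:.0f}B")
```

Because model size grows with the square root of compute, a 100x larger budget calls for a model only about 10x larger, trained on about 10x more tokens.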

Current State (May 2026)

Scaling has evolved significantly since the original papers:

What’s changed:

  • Data quality > data quantity: The best models now prioritize curated, high-quality data over raw scale. Chinchilla assumed all tokens are equal — they’re not.
  • MoE decouples capacity from compute: Mixture of Experts models can have hundreds of billions of parameters but only use a fraction per token (see Part 10).
  • Test-time compute is a new axis: Models like o3 and DeepSeek R1 spend more compute at inference time (thinking) rather than just at training time.

The diminishing returns debate:

  • Pro-scaling: Larger models continue to improve on hard benchmarks (GPQA, SWE-bench)
  • Anti-scaling: Marginal gains shrink; better data, architecture, and inference techniques offer higher ROI
  • Consensus: Scaling still works, but the frontier has moved from “scale blindly” to “scale intelligently” (better data, better architecture, better training methods)

Practical implications:

  • Frontier models (Claude Opus, GPT-5.5) are believed to be in the 1T+ parameter range (total parameters, where MoE is used); vendors do not publish exact counts
  • Small models (Phi-4, Gemma) achieve GPT-3.5-class performance at a small fraction of the size through better data and training
  • The cost of training frontier models has risen to $100M-$1B per run
  • Most teams should not train models — fine-tune existing ones instead

Part 9: Context Window

Models can only see a limited context:

Claude: 200K tokens (≈150K words)
GPT-4o: 128K tokens (≈96K words)
Gemini 3.1: 1M tokens (≈750K words)

Why the Limit?

Self-attention scores every token against every other token, so its cost grows as O(n²) in the sequence length n:

  • 1,000 tokens → 1M pairwise scores
  • 100K tokens → 10B pairwise scores (expensive!)
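
The quadratic growth is simple to check. This sketch counts pairwise attention scores for a single head in a single layer (a simplifying assumption; a full model multiplies this by heads and layers, and optimized kernels reduce memory use but not the number of scores).

```python
def attention_score_count(num_tokens: int) -> int:
    """Pairwise scores for one attention head in one layer."""
    return num_tokens * num_tokens


for n in (1_000, 100_000, 200_000):
    print(f"{n:>7,} tokens -> {attention_score_count(n):,} scores")

# Doubling the context quadruples the work.
print(attention_score_count(200_000) / attention_score_count(100_000))
```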

How It Matters

With 200K context, you can:

  • Read a 150-page document
  • Have a 100-turn conversation
  • But not both simultaneously

Part 10: Mixture of Experts (MoE) Architecture

Most frontier models no longer use a single dense neural network. Instead, they use a Mixture of Experts (MoE) architecture that decouples total parameter count from per-token compute cost.

The Problem MoE Solves

A dense model uses all its parameters for every token. A 1T parameter dense model would be impossibly expensive to run. MoE solves this by activating only a subset of parameters per token.

Dense model (inefficient at scale):
Every token → All parameters → Output
MoE model (efficient at scale):
Every token → Router → Top 2 of 16 experts → Output

How MoE Works

1. Experts: Feed-forward networks, each specialized in different types of processing. A model might have 16 to 256 experts.

2. Router (Gating Network): A small neural network that decides which experts to use for each token. It outputs a probability distribution over experts.

3. Top-K Routing: Only the top K experts (typically K=2) are activated per token. The others are skipped entirely.

Token: "bites"
Router scores (softmax over experts): Expert 7 (48%), Expert 12 (31%), Expert 3 (6%), ...
Activate: Expert 7 + Expert 12 (weighted by router scores)
Output: Combined expert output → next layer

Why K=2? Two experts provides enough specialization without the overhead of coordinating more. Some models use K=1 (cheapest) or K=4 (more capacity).
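
Here is a minimal sketch of top-K routing for one token. The four “experts” are stand-in functions that just scale their input, the router logits are made up, and renormalizing the selected gates so they sum to 1 is one common choice, not the only one.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: list[float], k: int = 2) -> dict[int, float]:
    """Pick the k highest-probability experts and renormalize their gates."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return {i: probs[i] / total for i in top}


def moe_layer(token_vector, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = route_top_k(router_logits, k)
    output = [0.0] * len(token_vector)
    for expert_index, gate in gates.items():
        expert_out = experts[expert_index](token_vector)
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, gates


# Hypothetical experts: each one scales its input by a different factor.
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, 0.3, 1.5]   # would come from the gating network

output, gates = moe_layer([1.0, -1.0], experts, router_logits)
print("selected experts and gates:", gates)
print("layer output:", output)
```

Experts 0 and 2 are never called for this token, which is where the compute saving comes from.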

Expert Specialization

Experts naturally specialize during training without explicit supervision:

  • Some experts learn grammar patterns
  • Some experts learn factual knowledge
  • Some experts learn coding syntax
  • Some experts learn multi-step reasoning

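This emerges from training; no one assigns roles to experts. The router learns which experts to use for each token. In published analyses of open MoE models, the specialization often follows token-level and syntactic patterns more than clean topic boundaries, so the list above is illustrative.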

Load Balancing

A critical challenge: the router might route all tokens to the same few experts, leaving others unused. Solutions:

Load balancing loss: A penalty term added during training that encourages the router to distribute tokens evenly across experts.

Expert capacity: A hard limit on how many tokens each expert can process. Tokens routed to a full expert are passed through (not processed) or rerouted.
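
One widely used form of the load balancing penalty, in the style of the Switch Transformer auxiliary loss, multiplies the fraction of tokens sent to each expert by the mean router probability for that expert. The sketch below assumes top-1 routing and toy inputs; the function and variable names are made up for this example, and real implementations scale the result by a small coefficient before adding it to the training loss.

```python
def load_balancing_loss(
    router_probs: list[list[float]],
    assignments: list[int],
    num_experts: int,
) -> float:
    """num_experts * sum_i(fraction_i * mean_prob_i); equals 1.0 when balanced."""
    num_tokens = len(assignments)
    # fraction_i: share of tokens actually dispatched to expert i.
    fraction = [assignments.count(i) / num_tokens for i in range(num_experts)]
    # mean_prob_i: average router probability assigned to expert i.
    mean_prob = [
        sum(probs[i] for probs in router_probs) / num_tokens
        for i in range(num_experts)
    ]
    return num_experts * sum(f * p for f, p in zip(fraction, mean_prob))


# Balanced routing: tokens split evenly across two experts.
balanced = load_balancing_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], num_experts=2)

# Collapsed routing: every token goes to expert 0.
collapsed = load_balancing_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], num_experts=2)

print(f"balanced:  {balanced:.2f}")
print(f"collapsed: {collapsed:.2f}")
```

The penalty is smallest when tokens are spread evenly, so minimizing it pushes the router away from sending everything to a few experts.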

MoE vs Dense: Tradeoffs

| Aspect | Dense | MoE |
| --- | --- | --- |
| Total parameters | N | 4N to 10N (experts + shared) |
| Active parameters per token | N | ~N (only K experts plus shared layers) |
| Training cost | Baseline | Higher (more memory, communication, and routing overhead) |
| Inference cost | Baseline | Similar compute to dense (same active params) |
| Quality at same active params | Worse | Better (more total knowledge) |
| Memory (VRAM) | N | 4N-10N (all experts must be loaded) |

Key insight: MoE gives you the knowledge capacity of a much larger model at the inference cost of a smaller one. The tradeoff is higher memory (to store all experts) and higher training cost.

Real-World MoE Models (May 2026)

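| Model | Total Params | Active Params | Experts | Top-K |
| --- | --- | --- | --- | --- |
| Llama 4 Maverick | ~400B | ~17B | 128 (plus a shared expert) | 1 |
| Llama 4 Scout | 109B | ~17B | 16 | 1 |
| Grok 2 | ~300B | ~60B | 8 | 2 |
| Mixtral 8x7B | 47B | 13B | 8 | 2 |
| DeepSeek V2 | 236B | ~21B | 160 routed + 2 shared | 6 |
| DeepSeek V3 | 671B | ~37B | 256 routed + 1 shared | 8 |
| Gemini 3.1 Pro | Unknown | Unknown | Unknown | Unknown |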

MoE Challenges

  1. Memory footprint: All experts must be loaded in memory even though only K of them (often 2) are used per token. This is why MoE models need high-memory GPUs.
  2. Expert collapse: If load balancing fails, some experts become useless. The model wastes parameters.
  3. Fine-tuning difficulty: MoE models are harder to fine-tune because updates can destabilize the router.
  4. Batch inference complexity: Different tokens in a batch might route to different experts, making efficient batching harder.

The Future: Dense vs MoE

As of May 2026, the industry is split:

  • Anthropic (Claude) has not published architecture details; it is reported to use dense architectures, on the argument that they produce more coherent outputs
  • Google (Gemini), Meta (Llama 4), and DeepSeek use MoE for cost-effective scaling
  • OpenAI (GPT-5.5) has not published architecture details; it is reported to use a hybrid approach: a dense core with MoE-like routing for specialized tasks

Part 11: Fine-tuning

After training, you can adapt a model to your domain:

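How: Start with a trained model and train on your data (thousands of examples). Parameter-efficient methods such as LoRA train only about 0.1-1% as many parameters as the base model has; full fine-tuning updates all of them.

Cost: Typically orders of magnitude cheaper than pre-training.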

When to use: Model outputs wrong style/tone, needs domain expertise.

When not to: You have under 100 examples (use RAG instead), you need simple reasoning (use prompting).

For a complete guide, see Training & Fine-tuning.


Common Misconceptions

“Bigger models understand better”
They have more parameters, but understanding is still pattern matching

“LLMs read sequentially”
They process all input tokens in parallel (that’s why transformers train efficiently); only generation happens one token at a time

“Attention is human-interpretable”
It’s interpretable compared to other neural networks, but still opaque

“LLMs have memory”
They have a context window. Unless the application re-supplies earlier text, each conversation starts fresh


Key Takeaways

  1. Tokens: Text → numbers for processing
  2. Embeddings: Numbers → vectors representing meaning
  3. Attention: Focus on relevant context
  4. Transformers: Blocks of attention + feedforward layers
  5. Training: Expensive (done once), learn parameters
  6. Inference: Cheap (done constantly), use parameters
  7. Scaling Laws: Performance scales with parameters, data, and compute — but with diminishing returns and intelligent tradeoffs
  8. Context window: Limits what the model can see
  9. MoE: Enables much larger models by only activating a fraction of parameters per token
  10. Fine-tuning: Cheap adaptation to your data
