Build an LLM from Scratch - Book Summary
This guide summarizes key concepts from Sebastian Raschka’s book “Build a Large Language Model (From Scratch)” - a practical, code-first journey through building a GPT-like LLM from the ground up.
Philosophy: “What I cannot create, I do not understand.” - Richard Feynman
Chapter 1: Understanding Large Language Models
What is an LLM?
A Large Language Model is a deep neural network trained on vast amounts of text to predict the next word (or token) given previous words. This simple objective, when applied at scale, produces models with remarkable capabilities:
- Emergent abilities: Reasoning, translation, coding emerge from scale
- In-context learning: Can perform new tasks without explicit training
- Generative: Creates human-like text continuations
Stages of Building an LLM
1. Planning & Design → Architecture choice, scale
2. Data Preparation → Collect, clean, tokenize
3. Pretraining → Train on next-token prediction
4. Fine-tuning → Adapt for specific tasks
5. Evaluation → Benchmark performance
6. Deployment → Serve efficiently
Transformer Architecture Overview
Input Tokens → Embedding + Positional Encoding → Transformer Blocks × N → Output
Each Transformer block:
- Multi-head self-attention (capture relationships between all tokens)
- Feed-forward network (process attention outputs)
- Residual connections + LayerNorm (stable training)
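How the pieces of a block fit together can be sketched in plain Python. This is a minimal illustration, not the book's implementation: it assumes the pre-norm arrangement (LayerNorm applied before each sublayer), omits LayerNorm's learned scale and shift, works on a single vector, and uses hypothetical toy functions in place of real attention and feed-forward layers.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual(x, sublayer):
    """Pre-norm residual connection: x + sublayer(layer_norm(x))."""
    out = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, out)]

def transformer_block(x, attention, feed_forward):
    """One block: attention sublayer, then feed-forward sublayer, each with a residual."""
    x = residual(x, attention)
    x = residual(x, feed_forward)
    return x

# Hypothetical stand-ins for real attention / feed-forward sublayers
def toy_attention(v):
    return [0.5 * t for t in v]

def toy_feed_forward(v):
    return [max(0.0, t) for t in v]

normalized = layer_norm([1.0, 2.0, 3.0, 4.0])
block_output = transformer_block([1.0, 2.0, 3.0, 4.0], toy_attention, toy_feed_forward)
```

Because each sublayer's output is added to its input, a sublayer that outputs zeros leaves the signal unchanged, which is part of why residual connections help gradients flow through deep stacks.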
GPT vs BERT
| Aspect | GPT | BERT |
|---|---|---|
| Architecture | Decoder-only | Encoder-only |
| Training | Next token prediction | Masked language modeling |
| Use case | Generation | Understanding/Classification |
| Examples | GPT-2, GPT-3, GPT-4 | BERT, RoBERTa |
Chapter 2: Working with Text Data
Tokenization Pipeline
Raw Text → Split into Words → Tokenize → Convert to IDs → Add Special Tokens
Byte Pair Encoding (BPE)
BPE is one of the most widely used subword tokenization methods. It builds its vocabulary as follows:
- Start with character-level vocabulary
- Find most frequent adjacent pair
- Merge into new token
- Repeat until desired vocabulary size
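```python
# Pseudocode: find_most_frequent_consecutive_pair and merge_pair are placeholder helpers
vocab = {"a", "b", "c", ...}  # Start with characters
merge_rules = {}
while len(vocab) < target_size:
    pair = find_most_frequent_consecutive_pair(text)
    new_token = pair[0] + pair[1]
    vocab.add(new_token)
    merge_rules[pair] = new_token
    text = merge_pair(text, pair, new_token)  # Apply the merge before counting again
```
Why BPE?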
- Handles out-of-vocabulary words (subword units)
- Efficient: shorter token sequences than character-level tokenization, with a much smaller vocabulary than word-level
- Used by GPT-2, GPT-3, GPT-4
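The merge loop described in this section can be made concrete with a small, runnable sketch. It is a toy version under stated assumptions: it works on characters rather than bytes, splits the corpus on whitespace, and the function names (`get_pair_counts`, `merge_pair`, `train_bpe`) are illustrative rather than taken from the book or from any tokenizer library.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # Ties go to the pair seen first
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words

merges, words = train_bpe("low low low lower lowest", num_merges=2)
```

On this corpus the first two merges build the symbol `low`, so the rarer words `lower` and `lowest` end up represented as `low` plus a few leftover characters. This is the subword behaviour that lets BPE handle words it has never seen whole.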
Word Embeddings
Words are mapped to dense vectors so that similar words lie close together in embedding space:
```text
# Simple concept
word_embedding["king"]  ≈ [0.2, -0.5, 0.8, ...]
word_embedding["queen"] ≈ [0.2, -0.4, 0.9, ...]  # Similar!
word_embedding["king"] - word_embedding["man"] + word_embedding["woman"] ≈ word_embedding["queen"]
```
Position Embeddings
Since attention has no inherent sense of position, we add positional information:
- Sinusoidal: Fixed functions of position (sin/cos)
- Learned: Trainable position embeddings
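The sinusoidal option can be sketched as follows. The base of 10000 and the interleaved sin/cos layout are assumptions taken from the original Transformer formulation, not something this summary specifies. The learned option needs no formula, since it is a trainable lookup table indexed by position.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Return the d_model-dimensional sinusoidal encoding for one position."""
    encoding = []
    for i in range(0, d_model, 2):
        # Each (sin, cos) pair uses a progressively longer wavelength
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        if i + 1 < d_model:
            encoding.append(math.cos(angle))
    return encoding

first_position = sinusoidal_encoding(0, 4)
later_position = sinusoidal_encoding(5, 8)
```

The resulting vector is added to the token embedding at the same position, so two identical tokens at different positions enter the Transformer blocks with different inputs.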
Sliding Window for Training
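```python
# Create (input, target) training pairs with a sliding window over token IDs
window_size = 512
pairs = []
for i in range(len(token_ids) - window_size):
    input_chunk = token_ids[i:i + window_size]
    target_chunk = token_ids[i + 1:i + window_size + 1]  # Inputs shifted by one token
    pairs.append((input_chunk, target_chunk))
```
Chapter 3: Coding Attention Mechanisms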
The Problem with RNNs
RNNs process tokens one at a time, which makes them slow to train and prone to losing long-range dependencies (vanishing gradients).
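A rough piece of arithmetic shows why long-range dependencies fade. The sketch assumes, purely for illustration, that backpropagating through each time step scales the gradient by a constant factor of 0.9. Real networks have step-dependent factors, but the compounding effect is the same.

```python
def gradient_after_steps(per_step_factor, num_steps):
    """Gradient magnitude after being scaled by the same factor at every time step."""
    return per_step_factor ** num_steps

short_range = gradient_after_steps(0.9, 10)    # roughly 0.35
long_range = gradient_after_steps(0.9, 100)    # roughly 2.7e-5
```

A token 100 steps back contributes a gradient thousands of times smaller than one 10 steps back, so the network barely learns from it.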
Self-Attention
Each token attends to ALL other tokens, computing relevance scores:
For each token:
1. Compute Query (what I'm looking for)
2. Compute Key (what I offer)
3. Compute Value (what I contain)
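The three roles can be sketched in plain Python for a single attention head. This is an illustrative version, not the book's PyTorch code: the helper names are made up, the weight matrices are plain nested lists, and the scores are dot products of queries and keys, scaled by the square root of the key dimension and normalized with softmax. The optional `causal` flag hides future tokens, as decoder-only models do.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vector, matrix)) for j in range(len(matrix[0]))]

def self_attention(tokens, w_q, w_k, w_v, causal=False):
    """Single-head self-attention over a list of token vectors."""
    queries = [matvec(t, w_q) for t in tokens]
    keys = [matvec(t, w_k) for t in tokens]
    values = [matvec(t, w_v) for t in tokens]
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        if causal:
            # Hide future positions so token i only sees tokens 0..i
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        all_weights.append(weights)
        # Output is the attention-weighted average of the value vectors
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs, all_weights

identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, identity, identity, identity, causal=True)
```

Each row of `weights` sums to 1. With `causal=True` the first token can only attend to itself, so its output equals its own value vector.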
Attention(Q, K, V) = softmax(Q × K^T / √d) × V
Multi-Head Attention
Run multiple attention “heads” in parallel, each learning different relationships:
```python
# Simplified multi-head attention (pseudocode; x holds the input token embeddings)
num_heads = 8
head_outputs = []
for head in range(num_heads):
    Q_head = x @ W_Q[head]
    K_head = x @ W_K[head]
    V_head = x @ W_V[head]
    head_outputs.append(Attention(Q_head, K_head, V_head))
output = concat(head_outputs) @ W_O
```
Causal (Masked) Attention
In decoder-only models, each token can only attend to previous tokens:
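```python
import torch

# Mask future positions: keep the lower triangle, set the rest to -inf before softmax
mask = torch.tril(torch.ones(seq_len, seq_len))
masked_scores = scores.masked_fill(mask == 0, float("-inf"))
```
Chapter 4: Implementing a GPT Model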
Architecture Components
```
GPT Model:
├── Token Embedding Layer
├── Positional Embedding Layer
├── Transformer Blocks × N
│   ├── Multi-Head Attention (causal)
│   ├── LayerNorm
│   ├── Feed-Forward Network (GELU)
│   └── LayerNorm
└── Final LayerNorm → Linear → Softmax
```
Key Components Implementation
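```python
import torch
import torch.nn as nn

class GPTModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.tok_emb = nn.Embedding(config.vocab_size, config.d_emb)
        self.pos_emb = nn.Embedding(config.ctx_len, config.d_emb)
        # TransformerBlock is assumed to be defined elsewhere
        self.blocks = nn.ModuleList([TransformerBlock(config) for _ in range(config.n_layers)])
        self.ln = nn.LayerNorm(config.d_emb)
        self.head = nn.Linear(config.d_emb, config.vocab_size, bias=False)

    def forward(self, x):
        # x: (batch, seq_len) token IDs; position indices are 0..seq_len-1
        positions = torch.arange(x.shape[1], device=x.device)
        x = self.tok_emb(x) + self.pos_emb(positions)
        for block in self.blocks:
            x = block(x)
        x = self.ln(x)
        return self.head(x)  # Logits over the vocabulary
```
GELU Activation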
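```python
import math
import torch

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))
```
Used in GPT and BERT as a smooth alternative to ReLU that tends to work well for language models.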
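To see how close the tanh formula is to the exact definition, the sketch below compares the two on scalars using only the standard library. The exact form, `x * Phi(x)` with `Phi` the standard normal CDF, is the usual definition of GELU and is an addition here rather than something the summary states.

```python
import math

def gelu_exact(x):
    """GELU defined via the standard normal CDF: x * Phi(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """The tanh approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def relu(x):
    return max(0.0, x)

# Largest gap between the two GELU variants on a grid from -5.0 to 5.0
max_gap = max(abs(gelu_exact(x / 10) - gelu_tanh(x / 10)) for x in range(-50, 51))
```

Unlike ReLU, GELU returns a small negative value for moderately negative inputs instead of a hard zero, so gradients do not vanish abruptly there.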
Generating Text
```python
import torch
import torch.nn.functional as F

def generate(model, idx, max_new_tokens, ctx_len, temperature=1.0, top_k=None):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -ctx_len:]  # Crop to the supported context length
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :] / temperature  # Last position, temperature-scaled

        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            # Keep only the top_k logits per row; v[:, [-1]] stays a column so it broadcasts
            logits[logits < v[:, [-1]]] = float("-inf")

        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, idx_next), dim=1)
    return idx
```
Chapter 5: Pretraining
Next-Token Prediction
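```python
import torch.nn.functional as F

# Forward pass
logits = model(input_ids)
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))

# Backward pass
optimizer.zero_grad()  # Clear gradients left over from the previous step
loss.backward()
optimizer.step()
```
Loss Calculation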
Cross-Entropy Loss = -log(probability of correct next token)
Lower loss = better model at predicting the next token.
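A small worked example, using made-up logits over a three-token vocabulary, shows how the loss reacts to confident, uncertain, and wrong predictions. The function is a plain-Python sketch of the same quantity that `F.cross_entropy` computes for a single position.

```python
import math

def cross_entropy(logits, target_index):
    """-log(softmax(logits)[target_index]), computed with the log-sum-exp trick."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum_exp - logits[target_index]

confident_loss = cross_entropy([4.0, 0.0, 0.0], target_index=0)  # Correct and confident
uniform_loss = cross_entropy([0.0, 0.0, 0.0], target_index=0)    # No preference: log(3)
wrong_loss = cross_entropy([4.0, 0.0, 0.0], target_index=1)      # Confident but wrong
```

The loss is near zero when the model puts most of its probability on the correct token, equals log(vocabulary size) when it guesses uniformly, and grows large when it is confidently wrong.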
Evaluation: Perplexity
perplexity = exp(loss)
Perplexity measures how “surprised” the model is by the test data. Lower = better.
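One way to read perplexity is as the effective number of tokens the model is choosing between at each step. The sketch below computes it from the probabilities a model assigned to the correct tokens. The probabilities are invented for illustration, and 50257 is used as an example vocabulary size (that of GPT-2).

```python
import math

def perplexity(token_probabilities):
    """exp of the average negative log-probability assigned to the correct tokens."""
    losses = [-math.log(p) for p in token_probabilities]
    return math.exp(sum(losses) / len(losses))

uniform_guess = perplexity([1 / 50257] * 4)       # Model that guesses uniformly
good_model = perplexity([0.5, 0.25, 0.5, 0.25])   # Model that is usually close
```

A uniform guesser has perplexity equal to the vocabulary size, while the second model behaves as if it were choosing among fewer than three tokens per step.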
Text Generation Strategies
| Strategy | Description | Trade-off |
|---|---|---|
| Greedy | Always pick highest prob | Deterministic, may repeat |
| Temperature | Adjust probability distribution | Higher = creative, lower = focused |
| Top-k | Sample from top k tokens | Controls diversity |
| Top-p (nucleus) | Sample from the smallest set of tokens whose cumulative probability reaches p | Adaptive diversity |
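The four strategies in the table can be compared side by side with a standard-library sketch. The logits are made up, the function names are illustrative, and a real implementation would operate on tensors of logits rather than Python lists.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Zero out everything except the k most probable tokens, then renormalize."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]

def greedy(probs):
    """Always pick the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def sample(probs, rng):
    """Draw one token index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]
rng = random.Random(0)  # Seeded so runs are repeatable
token = sample(top_p_filter(softmax(logits, temperature=0.8), p=0.9), rng)
```

Top-k always keeps a fixed number of candidates. Top-p keeps few candidates when the model is confident and more when the distribution is flat, which is why the table calls it adaptive.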
Training Loop Essentials
```python
import torch.nn.functional as F

for epoch in range(num_epochs):
    for batch_num, (batch_input, batch_target) in enumerate(train_loader):
        # Forward pass
        logits = model(batch_input)
        loss = F.cross_entropy(logits.view(-1, vocab_size), batch_target.view(-1))

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Logging
        if batch_num % 100 == 0:
            print(f"Epoch {epoch}, Loss: {loss.item():.4f}")
```
Chapter 6: Fine-tuning for Classification
Adding a Classification Head
```python
import torch.nn as nn

class GPTForClassification(nn.Module):
    def __init__(self, config, num_classes):
        super().__init__()
        self.transformer = GPTModel(config)
        # Replace the vocabulary-sized output layer with a classification head
        self.transformer.head = nn.Linear(config.d_emb, num_classes)

    def forward(self, x):
        logits = self.transformer(x)  # (batch, seq_len, num_classes)
        # Use the last token's output (mean pooling is an alternative)
        return logits[:, -1, :]
```
Fine-tuning Process
- Load pretrained GPT weights
- Replace output layer for classification
- Train with lower learning rate (lr = 1e-5 typically)
- Freeze earlier layers (optional for efficiency)
```python
import torch

# Example: fine-tune with a lower learning rate
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
```
Evaluation Metrics
- Accuracy: % correct predictions
- Precision/Recall/F1: For imbalanced classes
- AUROC: Discrimination ability
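Why accuracy alone can mislead on imbalanced classes is easiest to see with numbers. The sketch below computes the first two groups of metrics for a binary task. The labels are toy data and the function name is illustrative.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Imbalanced toy data: 8 negatives and 2 positives
y_true = [0] * 8 + [1] * 2
always_negative = binary_metrics(y_true, [0] * 10)
mixed = binary_metrics(y_true, [0] * 7 + [1] + [1, 0])
```

A classifier that always predicts the negative class reaches 80% accuracy here while its recall and F1 are zero. The second classifier has the same accuracy but actually finds one of the two positives.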
Chapter 7: Fine-tuning to Follow Instructions
RLHF (Reinforcement Learning from Human Feedback)
Three stages:
- SFT (Supervised Fine-tuning): Fine-tune on human-written responses
- Reward Model: Train model to predict human preference
- PPO: Optimize model to maximize reward
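The reward-model stage can be illustrated with the pairwise loss commonly used for it. This formulation is an assumption, since the summary does not specify one: given scalar rewards for a preferred and a rejected response, the loss is `-log(sigmoid(r_chosen - r_rejected))`. The function name and the reward values are hypothetical.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow for large |m|
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

agrees_with_human = pairwise_preference_loss(2.0, 0.0)
undecided = pairwise_preference_loss(0.0, 0.0)
disagrees_with_human = pairwise_preference_loss(0.0, 2.0)
```

Minimizing this loss pushes the reward of the preferred response above that of the rejected one. The PPO stage then optimizes the language model against the trained reward model.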
Instruction Fine-tuning
```python
# Format: [INST] instruction [/INST] response
formatted_text = f"[INST] {instruction} [/INST] {response}"
```
DPO (Direct Preference Optimization)
Simpler than RLHF: it skips the separate reward model and the PPO stage, and optimizes the model directly on preference pairs:
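```python
# DPO loss (simplified): compare policy vs. frozen reference log-probs of each response
chosen = policy_logp_chosen - ref_logp_chosen
rejected = policy_logp_rejected - ref_logp_rejected
loss = -F.logsigmoid(beta * (chosen - rejected)).mean()
```
Appendix: LoRA (Low-Rank Adaptation)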
Efficient fine-tuning by adding small trainable matrices:
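```python
import torch
import torch.nn as nn

# Instead of updating the full weight matrix W (in_features × out_features),
# keep W frozen and train a low-rank update: W + A × B
# where A is (in_features × r), B is (r × out_features), r << min(in_features, out_features)

class LoRALayer(nn.Module):
    def __init__(self, in_features, out_features, rank=8):
        super().__init__()
        self.A = nn.Parameter(torch.randn(in_features, rank) * 0.01)
        # B starts at zero so the update A × B is zero before training
        self.B = nn.Parameter(torch.zeros(rank, out_features))

    def forward(self, x):
        # Low-rank update only; add this to the frozen layer's output
        return x @ (self.A @ self.B)
```
Benefits: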
- Trains only a small fraction of the parameters, cutting the memory and cost of fine-tuning
- Small checkpoints: only A and B need to be stored per task
- Good performance retention
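The savings follow from simple arithmetic. The sketch below counts trainable parameters for a single weight matrix. The 768 × 768 shape is an illustrative choice (the embedding width of the smallest GPT-2), and the function name is made up.

```python
def lora_parameter_counts(in_features, out_features, rank):
    """Trainable parameters for full fine-tuning vs. a LoRA update of one matrix."""
    full = in_features * out_features
    lora = rank * (in_features + out_features)  # A is in×r, B is r×out
    return full, lora, lora / full

full, lora, fraction = lora_parameter_counts(768, 768, rank=8)
```

For this matrix, full fine-tuning updates 589,824 weights while the rank-8 LoRA update trains 12,288, about 2% as many. The ratio shrinks further as the matrix grows, because the LoRA count scales with the sum of the dimensions rather than their product.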
Key Takeaways
- LLMs are simple at core: Next-token prediction at scale produces emergent intelligence
- Attention is key: Self-attention captures long-range dependencies efficiently
- Data preparation matters: Tokenization, embeddings, position encodings are foundational
- Pretraining is expensive but one-time: Fine-tuning is cheap and adaptable
- Code understanding: Building from scratch reveals the “how” behind libraries
See Also
- LLM Primer Cheatsheet
- Coding Assistants & Agents Cheatsheet
- Prompt Engineering Deep Dive
- RAG Architecture Deep Dive
- Training & Fine-tuning Deep Dive
Book Reference: Build a Large Language Model (From Scratch) by Sebastian Raschka, Manning Publications, 2024.