Training & Fine-tuning: Adapting Models to Your Data
When and how to customize large language models for your specific needs - from fine-tuning to training from scratch.
The Training Spectrum
Do NOT fine-tune if:
- You just need the model to know facts (use RAG instead)
- Your task is simple reasoning (use better prompting)
- You have fewer than 100 examples (not enough data)
- Cost is critical (fine-tuning is expensive)
DO fine-tune if:
- Model outputs don’t match your style/tone
- Model repeatedly makes the same mistakes
- You have 1000+ examples of your use case
- Inference cost savings justify training cost
Pre-training (Foundation Model Training)
What Is Pre-training?
Training a model from scratch (random weights) on massive text data:
Random Model
↓
Read trillions of tokens
↓
Learn to predict next token
↓
Repeated gradient updates across billions of parameters
↓
Trained Foundation Model
The Numbers
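| Model | Parameters | Training Tokens | Cost (est.) | Time (est.) | Organization |
|---|---|---|---|---|---|
| GPT-3 | 175B | 300B | $10-15M | 3+ months | OpenAI |
| GPT-4 | 1.7T+ (unconfirmed) | 13T (unconfirmed) | $100M+ | 6+ months | OpenAI |
| Claude 3 Opus | 200B+ (not disclosed) | 2T+ (not disclosed) | $50-100M | 4+ months | Anthropic |
| Llama 3.1 405B | 405B | 15.6T | $50M+ | 6+ months | Meta |
Parameter and token counts for GPT-4 and Claude 3 Opus have not been published by their vendors. Treat those rows, and all cost and time figures, as rough outside estimates.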
Key insight: Foundation model training is a one-time, expensive investment. But once trained, inference is cheap for millions of users.
Why Pre-train?
- Better performance - More tokens → better understanding
- Broad knowledge - Covers internet, books, research
- General capability - Can do many tasks (zero-shot, few-shot)
When to Pre-train
Only if:
- Building new model architecture
- Designing closed ecosystem (can’t use existing models)
- Have 10B+ tokens of unique domain data
- Budget: $100M+ in compute
- Timeline: 6+ months
Almost never for most organizations.
Data Engineering for AI
Before any training or fine-tuning happens, you need data. The quality of your data determines the quality of your model more than any architectural choice.
Data Collection
Sources for training data:
| Source | Quality | Scale | Cost | Best For |
|---|---|---|---|---|
| Public web crawl (Common Crawl) | Low | Trillions of tokens | Free | Pre-training base |
| Books / research papers | High | Billions | $0-5M | Deep knowledge |
| Code repositories (GitHub) | Medium | Hundreds of billions | Free | Coding capability |
| Social media / forums | Low-medium | Trillions | Free | Dialogue, Q&A |
| Proprietary customer data | Very high | Millions-billions | N/A | Domain fine-tuning |
| Synthetic data generation | Medium | Unlimited | API cost | Augmentation |
Key considerations:
- Diversity matters more than volume. A model trained on 1T diverse tokens can outperform one trained on 10T repetitive tokens, because repeated text adds compute without adding new signal.
- Permission is critical. Scraping copyrighted content for training is legally contested. Use open datasets (Common Crawl, The Pile, Dolma) when possible.
- Domain balance. Most web data is English, technical, Western. Deliberately include non-English, non-technical, and diverse cultural sources.
Data Filtering & Cleaning
Raw data is messy. A standard pre-processing pipeline:
Raw text → Deduplication → Quality filter → PII removal → Toxicity filter → Clean text
1. Deduplication:
Duplicate data wastes compute and can cause overfitting. Techniques:
- Exact deduplication: Remove identical documents (hash-based, O(n))
- Near-deduplication (MinHash): Remove documents that are 80%+ similar even if not identical
- Line-level dedup: Remove repeated boilerplate (navigation bars, copyright notices, HTML artifacts)
Impact: Removing duplicates can reduce dataset size by 10-30%, typically without hurting model quality. This directly saves training compute.
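The two document-level techniques above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production pipeline: the function names, the 3-word shingle size, and the 64-hash signature length are assumptions, and the pairwise comparison is O(n²) (large pipelines usually add locality-sensitive hashing to avoid comparing every pair).

```python
import hashlib
import re


def exact_dedup(docs):
    """Keep the first copy of each document; compares hashes of whitespace/case-normalized text."""
    seen, kept = set(), []
    for doc in docs:
        normalized = " ".join(doc.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text, k=3):
    """Set of overlapping k-word shingles (k=3 is an assumed default)."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(text, num_hashes=64):
    """For each seeded hash function, record the minimum hash over the document's shingles."""
    doc_shingles = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest()
            for s in doc_shingles
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates the Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def near_dedup(docs, threshold=0.8):
    """Drop any document whose estimated similarity to an already-kept one reaches the threshold."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```

Running `exact_dedup` first is cheap and shrinks the input before the more expensive `near_dedup` pass.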
2. Quality filtering:
Not all text is worth training on. Filter based on:
| Signal | What it catches | Threshold |
|---|---|---|
| Perplexity (using a small LM) | Gibberish, low-quality text | Remove top 10% highest perplexity |
| Number of punctuation errors | Machine-translated, OCR garbage | Remove if >5 errors/100 chars |
| Adult content score | NSFW content | Remove >0.8 score |
| Language ID | Non-target languages | Keep only desired languages |
| Document length | Too short (no content) or too long (merged docs) | Keep 100-100K chars |
The “FineWeb” approach: The FineWeb dataset work reports that carefully tuned heuristic filters combined with deduplication produce a strong web corpus, competitive with more elaborate pipelines. Learned quality classifiers can add further gains on some benchmarks, but start simple.
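A minimal sketch of the cheap, model-free signals is shown below. Only the 100-100K character bound comes from the table above. The symbol-ratio and repeated-line thresholds are assumed values for illustration, and the perplexity and language-ID signals are left out because they need a model.

```python
def passes_quality_filter(text, min_chars=100, max_chars=100_000,
                          max_symbol_ratio=0.3, min_unique_line_ratio=0.5):
    """Return True if a document survives simple heuristic checks."""
    n = len(text)
    # Length bound from the table: too short has no content, too long is likely merged docs.
    if n < min_chars or n > max_chars:
        return False
    # Assumed heuristic: a high share of non-alphanumeric characters suggests markup or OCR debris.
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    if symbols / n > max_symbol_ratio:
        return False
    # Assumed heuristic: mostly repeated lines suggests navigation bars or other boilerplate.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines and len(set(lines)) / len(lines) < min_unique_line_ratio:
        return False
    return True
```

Thresholds like these are usually tuned by sampling what each rule removes and reading the rejected documents.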
3. PII (Personally Identifiable Information) removal:
Critical for privacy and compliance:
    import re

    def remove_pii(text):
        """Replace common PII patterns with placeholder tags (regexes are a first pass, not a guarantee)."""
        text = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', text)  # US Social Security number
        text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', text)  # Email address
        text = re.sub(r'\b\d{16}\b', '[CC_NUMBER]', text)  # 16-digit card number without separators
        # ... more patterns
        return text
4. Toxicity and bias filtering:
Remove or downweight hate speech, graphic violence, and other harmful content. This is a policy decision — some models (uncensored) choose to keep it.
Data Curation
After cleaning, you need to curate the data — decide what goes in and in what proportion.
Domain mixing:
The ratio of different data types matters enormously:
Typical pre-training mix:
50% Web text (Common Crawl, filtered)
20% Books and articles
15% Code
10% Academic papers
5% Other (social, forums, multilingual)
Why mixing matters:
- Too much web text → model is fluent but shallow
- Too much code → model is good at logic but bad at prose
- Too many books → model is formal but can’t handle casual dialogue
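To make the mix concrete, the sketch below turns the typical percentages into per-source token budgets. The dictionary keys and the function name are illustrative, not part of any library.

```python
# Shares taken from the typical pre-training mix above.
TYPICAL_MIX = {
    "web": 0.50,
    "books_articles": 0.20,
    "code": 0.15,
    "academic": 0.10,
    "other": 0.05,
}


def token_budget(total_tokens, mix):
    """Split a total token budget across sources according to their mix shares."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix shares must sum to 1.0")
    return {source: round(total_tokens * share) for source, share in mix.items()}
```

For a 1T-token run, `token_budget(1_000_000_000_000, TYPICAL_MIX)` assigns 150B tokens to code. If a source cannot supply its share, that is a signal to adjust the mix rather than to repeat its data many times.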
Data selection for fine-tuning:
For fine-tuning, quality beats quantity by a wide margin:
1000 high-quality instruction examples > 10000 auto-generated examples > 100000 web-scraped examples
The curation process:
- Collect 5x more data than you think you need
- Have domain experts review a sample (100-500 examples)
- Identify common quality issues (wrong format, hallucinations, ambiguity)
- Fix the issues in the collection process, not by hand-editing
- Iterate until expert review passes at 95%+ quality rate
Synthetic Data Generation
When you don’t have enough real data, you can generate synthetic data using a capable model (distillation).
When to use synthetic data:
- You have 50 real examples but need 1000
- You need variations on existing data (rewordings, perspectives)
- You need edge cases that don’t exist in your real data
- You want to teach the model to handle specific failure modes
The process:
    # Generate ~1000 synthetic instruction examples (100 batches x 10 examples each).
    # `llm` stands in for whatever generation client you use; generate(prompt, n=...) is a
    # placeholder call, and {customer_support_queries} is a template slot to fill in.
    prompt = """You are a data generator. Create 10 diverse examples of
    {customer_support_queries} in the format:
    {"instruction": "customer question", "response": "support answer"}

    Make sure examples cover:
    - Different products
    - Different issue types (billing, technical, account)
    - Different tones (frustrated, confused, happy)"""

    synthetic_data = llm.generate(prompt, n=100)  # Generate 100 batches
Risks of synthetic data:
- Model collapse: If you train on synthetic data from the same model, the model’s output quality degrades over generations. This is a well-documented phenomenon.
- Bias amplification: The synthetic data inherits the generating model’s biases, then the fine-tuned model amplifies them.
- Hallucination propagation: If the generating model hallucinates, those hallucinations become training data.
Safe use of synthetic data:
- Use a stronger model to generate data for a weaker model (distillation, not self-training)
- Always verify synthetic data (human spot-check, automated validation)
- Mix synthetic with real data (never use 100% synthetic)
- Limit to 20-30% of total training data
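The mixing and cap rules above can be enforced mechanically. This sketch assumes a 30% ceiling (the upper end of the range above) and uses illustrative names. It works on any list of examples.

```python
import random


def mix_real_and_synthetic(real, synthetic, max_synthetic_pct=30, seed=0):
    """Combine real and synthetic examples so synthetic data stays at or below the cap."""
    if not real:
        raise ValueError("need some real data; never train on 100% synthetic")
    # s / (r + s) <= p / 100   =>   s <= r * p / (100 - p)   (integer math avoids rounding surprises)
    max_synthetic = (len(real) * max_synthetic_pct) // (100 - max_synthetic_pct)
    rng = random.Random(seed)
    chosen = rng.sample(list(synthetic), min(max_synthetic, len(synthetic)))
    mixed = list(real) + chosen
    rng.shuffle(mixed)
    return mixed
```

With 70 real examples and a 30% cap, at most 30 synthetic examples are admitted, however many were generated.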
Data Contamination
The problem: Your training data may contain test data from benchmarks (MMLU, HumanEval, etc.). If so, your model appears to perform better than it actually does.
Examples of contamination:
- A model is trained on the internet, which includes the full MMLU test set
- The model “scores” 90% on MMLU, but it has seen the answers during training
- Real performance might be 70-80% — 10-20 points inflated
How to detect contamination:
- N-gram overlap: Check if test set examples appear verbatim in training data
- Perplexity analysis: Models have unusually low perplexity on contaminated test examples
- Membership inference: Train a classifier to distinguish training vs non-training data
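The n-gram overlap check is the easiest of these to run yourself. The sketch below flags a benchmark item when any of its word 8-grams appears verbatim in the training corpus. The window size of 8 and the function names are assumptions, and items shorter than the window are never flagged.

```python
import re


def ngrams(text, n=8):
    """Set of word n-grams after lowercasing and stripping punctuation."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items, training_docs, n=8):
    """Return (share of benchmark items with a verbatim n-gram hit, list of flagged items)."""
    train_ngrams = set()
    for doc in training_docs:
        train_ngrams |= ngrams(doc, n)
    flagged = [item for item in benchmark_items if ngrams(item, n) & train_ngrams]
    rate = len(flagged) / len(benchmark_items) if benchmark_items else 0.0
    return rate, flagged
```

This only catches verbatim leakage. Paraphrased or translated test items need the perplexity or membership-inference approaches.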
How to prevent it:
- Use benchmarks released after your training data cutoff date
- Deduplicate training data against known benchmark sets
- Report contamination analysis alongside benchmark scores
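- Test on refreshed or re-annotated benchmark variants (MMLU-Redux, HumanEval-X), keeping in mind that variants derived from the original items can inherit their contamination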
Data Versioning
Treat training data like code: version it, track changes, document decisions.
What to track:
- Data source (URL, dataset name, version)
- Processing steps applied (filtering, dedup, cleaning)
- Date collected
- Selection criteria (what was included/excluded and why)
- License and usage terms
Tools:
- DVC (Data Version Control): Git-like versioning for datasets
- Hugging Face Datasets: Versioned dataset storage with provenance tracking
- Git LFS (Git Large File Storage): For smaller datasets (<5GB)
- Custom manifest files: JSON/YAML with hashes for each data version
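A custom manifest can be as small as the sketch below, which records the fields from the "What to track" list plus a SHA-256 hash per file. The field names and the function signature are illustrative choices, not a standard format.

```python
import hashlib
import json
from datetime import date
from pathlib import Path


def file_sha256(path):
    """Hash a file in 1 MiB chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir, version, source, license_name, processing_steps):
    """Describe one dataset version: provenance fields plus a content hash for every file."""
    files = sorted(p for p in Path(data_dir).rglob("*") if p.is_file())
    return {
        "version": version,
        "source": source,
        "license": license_name,
        "date_collected": date.today().isoformat(),
        "processing_steps": list(processing_steps),
        "files": {p.relative_to(data_dir).as_posix(): file_sha256(p) for p in files},
    }
```

Writing the result with `json.dumps(manifest, indent=2)` and committing it next to the training code gives a diffable record of what changed between data versions.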
Data Engineering Checklist
- Identify data sources (public + proprietary)
- Run deduplication (exact + near-dedup with MinHash)
- Apply quality filters (perplexity, length, language)
- Remove PII and sensitive information
- Curate domain mix ratios
- Verify sample quality (human review 100-500 examples)
- Check for benchmark contamination
- Version dataset (DVC or similar)
- Document all processing decisions
- Re-evaluate as new data becomes available
Instruction Fine-tuning (Most Important)
What Is Instruction Fine-tuning?
Taking a pre-trained model and training it on instruction examples (question-answer pairs):
Pre-trained Model (trained on next-token prediction)
↓
Fine-tune on 1000-100000 <instruction, response> pairs
↓
Model learns to follow instructions better
How It Works
Before instruction fine-tuning:
User: "Classify this: Great product!"
Model: "Great product! is a great example of a positive review in the market. Let me tell you why products..."
Problem: Rambles, doesn't answer the question
After instruction fine-tuning:
User: "Classify this: Great product!"
Model: "Positive"
Better: Direct, follows instruction
Why It Works
The model learns:
- What instruction-following looks like (Q→A format)
- How to structure responses (short, direct, relevant)
- Diverse tasks (classification, summarization, extraction)
Data Requirements
| Quality | Examples Needed | Typical Cost | Effort |
|---|---|---|---|
| Low (scraped, auto-generated) | 10K | $1K | 1 week |
| Medium (human-reviewed) | 5-10K | $5-50K | 2-4 weeks |
| High (expert-curated) | 1-5K | $50-200K | 4-12 weeks |
Rule of thumb: 1000 good examples > 10000 mediocre examples.
Instruction Fine-tuning Checklist
- Collect or curate instruction examples (Q→A pairs)
- Split: 80% train, 10% validation, 10% test
- Format consistently (system message → user → assistant)
- Remove duplicates and near-duplicates
- Verify quality (human review first 100)
- Start with pre-trained model
- Train 1-3 epochs (more risks overfitting)
- Monitor validation loss (stop when it plateaus)
- Evaluate on test set (accuracy, F1, human ratings)
- A/B test: fine-tuned vs original model
- Only deploy if test results clearly better
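Two checklist items lend themselves to a short sketch: the 80/10/10 split and stopping when validation loss plateaus. The function names, the fixed seed, and the patience of 2 evaluations are assumptions for illustration.

```python
import random


def train_val_test_split(examples, seed=42, train=0.8, val=0.1):
    """Shuffle once with a fixed seed, then cut into train / validation / test (80/10/10 by default)."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


def should_stop(val_losses, patience=2, min_delta=0.0):
    """True when the last `patience` evaluations failed to beat the best earlier validation loss."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return all(loss >= best_before - min_delta for loss in val_losses[-patience:])
```

Deduplicate before splitting. Otherwise near-identical examples can land on both sides of the split and make the test score look better than it is.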
Preference Fine-tuning (RLHF, DPO)
What Is Preference Fine-tuning?
Training on preferences instead of gold answers:
Base Model generates: A1, A2, A3 (multiple responses)
↓
Human rater ranks: A2 > A1 > A3
↓
Model learns to predict preferred responses
↓
Better, more natural outputs
The Difference: Instruction vs Preference
Instruction fine-tuning:
Q: "What's 2+2?"
A: "4"
Preference fine-tuning:
Q: "What's 2+2?"
A1: "It equals 4"
A2: "The sum is 4"
A3: "2 plus 2 gives you 4"

Preference: A1 = A2 > A3 (all correct, but A1/A2 better style)
RLHF (Reinforcement Learning from Human Feedback)
Standard approach used by OpenAI, Anthropic:
Base Model
↓
Generate multiple responses
↓
Collect human rankings
↓
Train reward model (predict which response humans prefer)
↓
Use reward model to fine-tune base model (RL)
↓
Aligned Model
Cost: $500K-10M (depends on scale)
Time: 2-4 months
Examples needed: 10K-100K human-rated pairs
DPO (Direct Preference Optimization)
Newer, faster approach (2023):
Preference pairs only (no reward model needed)
↓
Direct loss function (compare preferred vs rejected)
↓
Simpler, cheaper than RLHF
Cost: $50-200K
Time: 2-4 weeks
Examples needed: 5-20K preference pairs
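The "direct loss function" can be written out for a single preference pair. The sketch below follows the published DPO objective, which compares the policy being trained against a frozen reference copy of the model. The inputs are summed log-probabilities of each full response, and the function name and `beta=0.1` default are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp        # how much more the policy likes the preferred answer
    rejected_logratio = policy_rejected_logp - ref_rejected_logp  # same for the rejected answer
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m); this form stays stable for large |m|.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

When the policy matches the reference, the loss is log 2. It falls as the policy shifts probability toward the preferred response relative to the reference, which is why no separate reward model is needed.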
When to Use Preference Fine-tuning
- Model outputs are technically correct but wrong style
- You care about tone, length, format
- You have human raters available
- You need reproducible quality
Domain Fine-tuning (Specialized Knowledge)
What Is Domain Fine-tuning?
Training on domain-specific text to improve performance on that domain:
Base: GPT-4 (trained on internet text)
↓
Fine-tune on 10000 medical papers
↓
Result: Better at medical diagnosis, terminology, reasoning
Example: Medical Domain
Before domain fine-tuning:
Q: "Patient has elevated troponin, ECG shows ST elevation"
Model: "Could be many things. Recommend seeing a doctor."
Problem: Vague, misses obvious diagnosis (MI)
After domain fine-tuning on medical literature:
Q: "Patient has elevated troponin, ECG shows ST elevation"
Model: "Classic presentation of acute myocardial infarction (AMI). Likely STEMI. Needs immediate reperfusion therapy."
Better: Correct diagnosis, uses proper terminology
Data Requirements
- Structured data: 5-50K domain documents
- Quality: Higher is better (domain-relevant is key)
- Format: Mix of Q&A, research papers, case studies
When to Use Domain Fine-tuning
- Model lacks domain-specific knowledge
- Domain has specialized terminology
- High-stakes domain where accuracy matters
- You have domain-specific training data available
Adapter Models (Lightweight Fine-tuning)
What Are Adapters?
Instead of fine-tuning all weights, add small trainable layers:
Pre-trained Model (frozen weights)
↓
+ Adapter Layer 1 (small, trainable)
↓
+ Adapter Layer 2 (small, trainable)
↓
Much cheaper to train, can have many adapters
The Math
Full fine-tuning:
- Update all 7B parameters of a model
- 1 model per task/domain
- Storage: ~14GB per model at 16-bit precision (7B parameters × 2 bytes), or ~7GB at 8-bit
Adapter fine-tuning:
- Update only 0.1-1% of parameters
- Multiple adapters can share base model
- Storage: 10-100MB per adapter
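These numbers can be sanity-checked with a worked calculation for one popular adapter style, LoRA, which adds two low-rank matrices next to each adapted weight matrix. The defaults below assume a 7B model shaped like Llama 2 7B (32 layers, hidden size 4096), rank 16, and four adapted attention matrices per layer. All of these are assumptions to adjust for your setup.

```python
def lora_adapter_size(num_layers=32, hidden_size=4096, rank=16,
                      matrices_per_layer=4, bytes_per_param=2, base_params=7e9):
    """Estimate LoRA adapter size versus the full model (square hidden x hidden matrices assumed)."""
    # Each adapted matrix gets A (hidden x rank) and B (rank x hidden).
    params_per_matrix = 2 * hidden_size * rank
    adapter_params = num_layers * matrices_per_layer * params_per_matrix
    return {
        "adapter_params": adapter_params,
        "fraction_of_base": adapter_params / base_params,
        "adapter_megabytes": adapter_params * bytes_per_param / 1e6,
        "full_model_gigabytes": base_params * bytes_per_param / 1e9,
    }
```

With these defaults the adapter has about 16.8M parameters, roughly 0.24% of the base model and about 34MB at 16-bit precision, against about 14GB for a full 16-bit copy of the model.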
When to Use Adapters
- Many domains/tasks (can’t afford N × model size storage)
- Limited compute (adapters train faster)
- Need quick deployment of domain variants
- Cost-sensitive fine-tuning
Practical Fine-tuning Workflow
Step 1: Decide What to Fine-tune
Do you need to fine-tune?
Is model output correct but wrong style/tone?
├─ Yes → Instruction or preference fine-tuning
└─ No → Go to next question

Does model lack domain knowledge?
├─ Yes → Domain fine-tuning or RAG
└─ No → Go to next question

Is your use case repetitive and defined?
├─ Yes → Instruction fine-tuning makes sense
└─ No → Use RAG or agents instead
Step 2: Prepare Training Data
    # Format: JSONL (one example per line)
    {"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What's 2+2?"}, {"role": "assistant", "content": "4"}]}
Quality checks:
- Sample 100 random examples (human review)
- Check for duplicates (remove if >5% dups)
- Verify output format consistency
- Check the token-length distribution (flag examples that are far too long or too short)
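Some of these checks can be automated before any human review. The sketch below validates the chat-style JSONL format shown in this step and measures the duplicate rate against the 5% guideline. The function name, the report keys, and the rule that each example ends with an assistant turn are illustrative assumptions.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def check_jsonl(lines, max_duplicate_rate=0.05):
    """Validate JSONL training examples and report format problems and duplicate rate."""
    problems, seen = [], set()
    total = duplicates = 0
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        total += 1
        try:
            messages = json.loads(line)["messages"]
        except (json.JSONDecodeError, KeyError, TypeError):
            problems.append(f"line {number}: expected a JSON object with a 'messages' key")
            continue
        if not isinstance(messages, list) or not messages:
            problems.append(f"line {number}: 'messages' must be a non-empty list")
            continue
        bad = [m for m in messages
               if not isinstance(m, dict)
               or m.get("role") not in VALID_ROLES
               or not isinstance(m.get("content"), str)]
        if bad:
            problems.append(f"line {number}: message with unknown role or non-string content")
        elif messages[-1]["role"] != "assistant":
            problems.append(f"line {number}: last message should be the assistant response")
        key = json.dumps(messages, sort_keys=True)
        duplicates += key in seen  # counts exact repeats of an earlier example
        seen.add(key)
    rate = duplicates / total if total else 0.0
    return {"total": total, "duplicate_rate": rate,
            "too_many_duplicates": rate > max_duplicate_rate, "problems": problems}
```

A typical call is `check_jsonl(open("train.jsonl", encoding="utf-8"))`, where the file name is only an example.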
Step 3: Run Fine-tuning
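Option A: Using a hosted fine-tuning API (Easiest)
Hosted providers expose fine-tuning as a job created from uploaded examples. The snippet below is illustrative pseudocode only: `model_create` is not a method of the real Anthropic Python SDK, and fine-tuning for Claude (Claude 3 Haiku) has been offered through Amazon Bedrock rather than the direct API, so consult your provider's documentation for the actual call.
    import anthropic

    client = anthropic.Anthropic()

    # Hypothetical call shape, kept to show what a hosted job needs: a base model plus examples.
    response = client.beta.model_management.beta.model_create(
        model="claude-3-5-sonnet-20241022",
        training_data=[
            {
                "messages": [
                    {"role": "user", "content": "What's 2+2?"},
                    {"role": "assistant", "content": "4"}
                ]
            },
            # ... more examples
        ]
    )

    fine_tuned_model = response.id
Option B: Using Open Source (More Control)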
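    from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

    # The "-hf" repository holds the weights in transformers format (gated: requires accepting Meta's license).
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

    # train_dataset / eval_dataset: tokenized datasets built from the Step 2 data (not shown here).
    trainer = Trainer(
        model=model,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        args=TrainingArguments(
            output_dir="./fine-tuned-model",
            num_train_epochs=3,
            learning_rate=2e-5,
            per_device_train_batch_size=8,
        ),
    )

    trainer.train()
Step 4: Evaluate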
On test set (automated):
- Accuracy (for classification)
- BLEU/ROUGE (for generation)
- Exact match (for extraction)
Human evaluation (critical):
- Sample 50 outputs from fine-tuned + base model
- Rate on: correctness, style, usefulness (1-5 scale)
- If the fine-tuned model rates 20%+ higher than the base model, deploy
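The deploy rule can be made explicit once ratings are collected. This sketch reads the 20% threshold as a relative improvement in mean rating on the 1-5 scale, which is one reasonable interpretation rather than the only one. The function names are illustrative.

```python
from statistics import mean


def relative_improvement(base_ratings, tuned_ratings):
    """Relative gain in mean human rating of the fine-tuned model over the base model."""
    base_mean = mean(base_ratings)
    return (mean(tuned_ratings) - base_mean) / base_mean


def should_deploy(base_ratings, tuned_ratings, threshold=0.20):
    """Apply the 20%+ rule to two lists of 1-5 ratings gathered on the same sampled prompts."""
    return relative_improvement(base_ratings, tuned_ratings) >= threshold
```

With only 50 samples per model the estimate is noisy, so a result close to the threshold is a reason to rate more outputs before deciding.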
Step 5: Deploy
    # Use your fine-tuned model
    response = client.messages.create(
        model=fine_tuned_model,  # Your custom model
        max_tokens=1024,
        messages=[{"role": "user", "content": "Your prompt"}]
    )
Monitor:
- Cost per token (usually higher than base)
- Latency (usually similar)
- Error rates (should be lower)
- User satisfaction (should improve)
Common Mistakes
❌ Fine-tuning for knowledge - Models pick up new facts unreliably from fine-tuning
✅ Use RAG for knowledge, fine-tuning for style
❌ Using too little data (10 examples) - Overfits immediately
✅ Collect 1000+ examples minimum
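❌ Fine-tuning a raw base model for an assistant task - You have to re-teach instruction following and risk degrading general ability
✅ Fine-tune instruction-tuned models (Claude, GPT-4o) where the provider supports it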
❌ Ignoring validation set - Can’t tell if you’re overfitting
✅ Monitor validation loss during training
❌ Not comparing to baseline - Can’t tell if it’s better
✅ Always A/B test fine-tuned vs original model
Cost Comparison
Typical fine-tuning costs:
| Approach | Data Size | Training Cost | Inference Cost | Total (1 Year) |
|---|---|---|---|---|
| Prompt engineering only | 0 | $0 | $1000 | $1000 |
| RAG (basic) | 100 docs | $100 | $500 | $600 |
| Instruction fine-tune | 5K examples | $500 | $1500 | $2000 |
| Preference fine-tune | 10K pairs | $10K | $2000 | $12K |
| Domain fine-tune | 10K docs | $1000 | $2000 | $3000 |
Breakeven analysis:
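- If the base model costs $1.50 per 1,000 tokens
- And fine-tuning reduces the tokens needed by 20% (a saving of $0.30 per 1,000 baseline tokens)
- Breakeven on the $500 instruction fine-tune from the table: 500 / 0.30 × 1,000 ≈ 1.7M baseline tokens of queries
- ROI is positive beyond that volume, provided the fine-tuned model's per-token price is not higher than the base model's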
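A breakeven estimate is easy to recompute for your own numbers. The sketch below assumes the $500 instruction fine-tuning cost from the cost table and the same per-token price for both models. The function name and parameters are illustrative.

```python
def breakeven_tokens(training_cost, price_per_1k_tokens, token_reduction):
    """Baseline query tokens needed before per-token savings repay the one-time training cost."""
    savings_per_1k = price_per_1k_tokens * token_reduction  # dollars saved per 1,000 baseline tokens
    return training_cost / savings_per_1k * 1000


# $500 training cost, $1.50 per 1,000 tokens, 20% fewer tokens after fine-tuning.
tokens_needed = breakeven_tokens(500, 1.50, 0.20)
```

Under these assumptions the breakeven lands near 1.7M baseline tokens. A smaller token reduction, or a per-token surcharge on the fine-tuned model, pushes it higher.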
When Each Approach Makes Sense
Just Prompt Engineering
Use when: Simple tasks, clear instructions work
Cost: Minimal
Example: “Classify this review as positive/negative”
RAG (Knowledge)
Use when: Model needs to know facts about your company
Cost: $100-500 one-time + hosting
Example: “What’s our return policy?” (read current docs)
Fine-tuning
Use when: Model outputs wrong style/tone, needs domain expertise
Cost: $500-10000 one-time + inference surcharge
Example: “Generate product descriptions in our brand voice”
Pre-training
Use when: Building new model, huge unique dataset, huge budget
Cost: $50M+
Example: Only Google, OpenAI, Meta, Anthropic (nearly never for others)
Key Takeaways
- RAG > Fine-tuning for knowledge - Simpler, cheaper, fresher
- Fine-tune for style, not facts - Fine-tuning shapes behavior far more reliably than it stores facts
- 1000 examples >> 100 examples - Quality and quantity both matter
- Always baseline - Compare to original model before deploying
- Start simple - Prompt engineering → RAG → fine-tuning (in that order)
- Instruction fine-tuning ROI - Clear if you have defined tasks and data
- Preference fine-tuning is hard - Needs human raters, expensive
- Adapters for scale - Multiple domains without model duplication
See Also:
- How LLMs Work - Understanding pre-training
- RAG Architecture - Alternative to fine-tuning for knowledge
- Prompt Engineering - Often better than fine-tuning