Case Study: Document Classification
Company: B2B SaaS platform for legal/compliance
Problem: Auto-categorize user documents into 12 categories; accuracy poor (72%)
Solution: Fine-tune Claude on domain-specific examples
Results: Accuracy 72% → 90% on the held-out test set (92% measured in production); latency roughly unchanged (2.5 s → 2.4 s); production ready
The Challenge
The platform lets users upload documents. The system needed to auto-categorize them:
12 Categories:
- Contract
- Invoice / Receipt
- Tax Document
- Insurance Document
- Medical Record
- Real Estate Document
- HR Document
- Financial Statement
- Permit / License
- Correspondence
- Technical Document
- Other
The problem:
User uploads "Mortgage Agreement.pdf"
Base Claude:
- Returned: "Contract" (plausible, but the expected category is "Real Estate Document"; overall accuracy was only 72%)
- Mistakes:
- Called contracts "Legal Documents"
- Called invoices "Financial Documents"
- Misclassified insurance docs as contracts
Performance by category:
- Easy (contract, invoice): 95% accurate
- Medium (insurance, tax): 70% accurate
- Hard (permits, HR): 45% accurate
The base model understood the concepts but made domain-specific mistakes (e.g., it didn’t know that mortgage agreements are “Real Estate Document”, not just “Contract”).
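Off-taxonomy labels such as "Legal Documents" or "Financial Documents" are easy to catch mechanically. A minimal sketch, assuming the model's raw text output is compared against the 12 category names; `CATEGORIES` and `normalize_label` are illustrative names, not part of the original system:

```python
from typing import Optional

CATEGORIES = [
    "Contract", "Invoice / Receipt", "Tax Document", "Insurance Document",
    "Medical Record", "Real Estate Document", "HR Document",
    "Financial Statement", "Permit / License", "Correspondence",
    "Technical Document", "Other",
]

# Case-insensitive lookup from a cleaned label to its canonical spelling
_LOOKUP = {c.casefold(): c for c in CATEGORIES}


def normalize_label(raw: str) -> Optional[str]:
    """Return the canonical category, or None if the label is off-taxonomy."""
    cleaned = raw.strip(" \t\n\"'.").casefold()
    return _LOOKUP.get(cleaned)
```

A `None` result can be counted as an error in evaluation or routed to manual review, rather than being stored as a new, unintended category.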
Why This Needed Fine-tuning (Not Prompting)
Could we just improve the prompt?
Attempt 1: "Be very precise about categories"
Result: Same mistakes
Attempt 2: "Think about the document type, then categorize carefully"
Result: Slightly better (75%) but still wrong often
Attempt 3: Gave exact examples of each category
System message became 2000 tokens
Result: Better (80%) but expensive and still missing edge cases
Why it failed: Base model didn’t have deep knowledge of domain-specific categorization. No amount of prompting could teach it the subtle differences between:
- “Real Estate Document” (mortgage, deed, title)
- “Contract” (generic agreements)
- “Financial Statement” (balance sheet, P&L)
These require seeing examples.
The Fine-tuning Approach
How we prepared data and trained the model:
Data Collection
How we got training data:
Months 1-2: Manually labeled 5,000 documents
- 300-500 per category
- Balance: didn't matter (all 5K used)
- Quality: Reviewed by domain expert (lawyer)
Result: 5,000 labeled examples
- ~400 documents per category
- All validated by humans
Data Format

    {
      "messages": [
        {
          "role": "user",
          "content": "Categorize this document:\n\n[document text or excerpt]"
        },
        {
          "role": "assistant",
          "content": "Real Estate Document"
        }
      ]
    }

Key decision: How much document text?
Option 1: Full document (1000+ tokens)
- Pro: Model sees full context
- Con: Expensive to fine-tune, wastes tokens on boilerplate
- Result: Tried it, wasn't better than option 2
Option 2: First 2000 characters + relevant excerpt (200 tokens)
- Pro: Enough to judge category, efficient
- Con: Might miss important sections
- Result: Sweet spot, used this
Option 3: Just document name/header (50 tokens)
- Pro: Super efficient
- Con: Not enough context, 65% accuracy
- Result: Too simple
Chose Option 2: First paragraph + any section title/header that seemed relevant.
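Option 2 can be sketched roughly as follows. The header heuristic (`HEADER_RE`), the `[Later section headers]` marker, and the helper names are assumptions made for illustration; the write-up does not specify how "relevant" headers were chosen.

```python
import json
import re

MAX_CHARS = 2000  # leading characters kept, as in Option 2

# Hypothetical heuristic: short ALL-CAPS lines or "Title:"-style lines count as headers.
HEADER_RE = re.compile(r"^(?:[A-Z][A-Z0-9 /&\-]{3,60}|[A-Z][\w /&\-]{2,40}:)\s*$")


def build_excerpt(text: str, max_headers: int = 5) -> str:
    """Keep the first MAX_CHARS characters plus a few later header-like lines."""
    head = text[:MAX_CHARS]
    headers = []
    # Skip the first element: it is the remainder of the line the cut landed in.
    for line in text[MAX_CHARS:].splitlines()[1:]:
        line = line.strip()
        if HEADER_RE.match(line) and line not in headers:
            headers.append(line)
        if len(headers) == max_headers:
            break
    if headers:
        return head + "\n\n[Later section headers]\n" + "\n".join(headers)
    return head


def to_training_record(text: str, label: str) -> str:
    """Serialize one labeled document as a JSONL line in the chat format above."""
    record = {
        "messages": [
            {"role": "user", "content": "Categorize this document:\n\n" + build_excerpt(text)},
            {"role": "assistant", "content": label},
        ]
    }
    return json.dumps(record, ensure_ascii=False)
```

Writing one `to_training_record(...)` result per line produces a file in the same shape as the `training_data.jsonl` used in the fine-tuning run.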
Fine-tuning Run
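
    import json

    import anthropic

    client = anthropic.Anthropic()

    def load_jsonl(path):
        # Read one JSON object per non-empty line
        with open(path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    # Labeled examples in the chat format shown under Data Format
    training_data = load_jsonl("training_data.jsonl")

    # Fine-tune.
    # NOTE: read this call as pseudocode. The public `anthropic` Python SDK does not
    # document a method with this name; substitute the fine-tuning job API of
    # whichever platform hosts your model.
    response = client.beta.model_management.beta.model_create(
        model="claude-3-5-sonnet-20241022",
        training_data=training_data,
        learning_rate=2e-5,
        epochs=1,
    )

    finetuned_model_id = response.id

Training configuration: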
- Model: Claude 3.5 Sonnet (good accuracy/cost tradeoff)
- Data: 5,000 examples (80/10/10 split)
- Epochs: 1 (more causes overfitting)
- Learning rate: 2e-5 (conservative, prevents catastrophic forgetting)
- Time: 45 minutes
- Cost: ~$300
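An 80/10/10 split is usually done per category so that small classes show up in all three sets. The write-up does not say whether the split was stratified, so the following is a sketch under that assumption, with hypothetical helper names:

```python
import random
from collections import defaultdict


def stratified_split(examples, label_of, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split examples into train/val/test, keeping each label's share similar.

    The test set receives whatever remains after train and val, so
    fractions[2] is implied by the first two values.
    """
    by_label = defaultdict(list)
    for ex in examples:
        by_label[label_of(ex)].append(ex)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val, test = [], [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_train = int(len(items) * fractions[0])
        n_val = int(len(items) * fractions[1])
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]
    return train, val, test
```

With 5,000 labeled examples this yields roughly 4,000 for training and about 500 each for validation and testing, which lines up with the 500-example held-out test set.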
Evaluation: Before vs After
How the fine-tuned model compared to the base:
Test Set Results (500 examples, held out)
| Category | Base Claude | Fine-tuned | Improvement (percentage points) |
|---|---|---|---|
| Contract | 95% | 98% | +3 |
| Invoice / Receipt | 92% | 96% | +4 |
| Tax Document | 68% | 88% | +20 |
| Insurance Document | 65% | 92% | +27 |
| Medical Record | 72% | 91% | +19 |
| Real Estate Document | 45% | 89% | +44 |
| HR Document | 52% | 87% | +35 |
| Financial Statement | 88% | 94% | +6 |
| Permit / License | 40% | 82% | +42 |
| Correspondence | 78% | 90% | +12 |
| Technical Document | 81% | 88% | +7 |
| Other | 85% | 91% | +6 |
| Overall | 72% | 90% | +18 |
Confusion Matrix (Before Fine-tuning)
Most common mistakes:
- Real Estate documents misclassified as Contracts (35% of errors)
- Permits classified as Other (40% of errors)
- HR documents confused with Contracts (25% of errors)
After fine-tuning: These confusion patterns almost disappeared.
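Per-category accuracy and the most common confusion pairs can be computed from held-out predictions in a few lines. This is a generic sketch, not the team's evaluation script; the function names are illustrative:

```python
from collections import Counter


def per_category_accuracy(y_true, y_pred):
    """Fraction of correct predictions for each true category."""
    total, correct = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += (t == p)
    return {c: correct[c] / total[c] for c in total}


def top_confusions(y_true, y_pred, k=3):
    """Most frequent (true, predicted) error pairs with their share of all errors."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    n_errors = sum(errors.values())
    return [(pair, count / n_errors) for pair, count in errors.most_common(k)]
```

Running both before and after fine-tuning on the same test set shows whether a specific confusion pair, such as Real Estate labeled as Contract, has actually shrunk.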
Edge Cases
Tested on tricky documents:
Example 1: "Operating Agreement" (legal doc, but not a contract in categorization sense)
Base: "Contract" (wrong → should be "HR Document")
Fine-tuned: "HR Document" (correct)
Example 2: "Schedule C" (tax form attached to return)
Base: "Financial Statement" (reasonable, but wrong)
Fine-tuned: "Tax Document" (correct)
Example 3: "Home Inspection Report"
Base: "Technical Document" (plausible, wrong)
Fine-tuned: "Real Estate Document" (correct)
Fine-tuned model handled edge cases correctly 89% of the time.
Confidence Scoring
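Not all predictions are equally reliable, so we asked the model to report a confidence score with each prediction and applied a threshold. Note that this score is self-reported in the model's JSON output, not a calibrated probability:

    import json

    SYSTEM_PROMPT = """Categorize the document and provide your confidence level.
    Respond in JSON: {"category": "...", "confidence": 0.9}"""

    # Get prediction + confidence
    response = client.messages.create(
        model=finetuned_model_id,
        max_tokens=100,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": doc_text}],
    )

    # Extract category and confidence; treat unparseable output as low confidence
    try:
        result = json.loads(response.content[0].text)
        category, confidence = result["category"], float(result["confidence"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        category, confidence = None, 0.0

    if confidence < 0.7:
        # Manual review queue
        send_to_manual_review()
    else:
        # Auto-categorize
        save_category()

Distribution of confidences: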
- High confidence (above 0.9): 75% of documents, 96% accuracy
- Medium (0.7–0.9): 20% of documents, 85% accuracy
- Low (below 0.7): 5% of documents, 60% accuracy
→ Sent low-confidence documents to manual review (2–5% of volume)
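As a sanity check, the three bands above can be combined into a blended accuracy. The numbers are taken from the list; the variable names are illustrative:

```python
# (share of documents, accuracy) per confidence band, from the list above
bands = {
    "high": (0.75, 0.96),
    "medium": (0.20, 0.85),
    "low": (0.05, 0.60),
}

# Accuracy if every prediction were accepted as-is
overall = sum(share * acc for share, acc in bands.values())

# Accuracy on the documents that remain automated once the low band is reviewed by hand
auto = {name: band for name, band in bands.items() if name != "low"}
auto_share = sum(share for share, _ in auto.values())
auto_accuracy = sum(share * acc for share, acc in auto.values()) / auto_share

print(f"overall accuracy: {overall:.1%}")
print(f"auto-categorized share: {auto_share:.0%}, accuracy: {auto_accuracy:.1%}")
```

This gives about 92% overall, and about 93.7% on the 95% of documents that are auto-categorized when the low band goes to manual review.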
Production Deployment
Rolling the model out to real traffic:
A/B Test Results
Week 1: Ran both systems in parallel on 10% of incoming documents
Base Claude:
- Accuracy: 72%
- Avg cost: $0.015 per document
- Time: 2.5 seconds
Fine-tuned Claude:
- Accuracy: 90%
- Avg cost: $0.018 per document
- Time: 2.4 seconds
Difference:
- +18 percentage points accuracy (huge improvement)
- +$0.003 cost per document (negligible)
- Time: same
Decision: Deploy fine-tuned model to 100%.
Implementation

    from datetime import datetime, timezone

    # Assuming FastAPI, which the decorator and UploadFile suggest
    from fastapi import FastAPI, UploadFile

    app = FastAPI()

    # API endpoint for categorization
    @app.post("/categorize")
    def categorize_document(file: UploadFile):
        # Extract text from PDF/image (extract_text is the platform's own helper)
        text = extract_text(file)

        # Call fine-tuned model
        response = client.messages.create(
            model="claude-3-5-sonnet-ft-20250508",  # Fine-tuned
            max_tokens=50,
            system="Categorize this document into one of 12 categories...",
            messages=[{"role": "user", "content": text}],
        )

        category = response.content[0].text.strip()

        # Save to database (db is the platform's own storage client)
        db.insert({
            "file_id": file.filename,
            "category": category,
            "model": "fine-tuned",
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

        return {"category": category}

Monitoring
Track in production:
- Category distribution (should match expected distribution)
- Manual review rate (should stay under 5%)
- Accuracy on manual reviews (measure true accuracy)
- Drift (if category distribution changes, model might be wrong)
After 1 month in production:
- Actual accuracy: 92% (validated against manual reviews)
- Manual review rate: 4.2% (close to target)
- No drift detected
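One simple way to make the drift check concrete is to compare the category mix of recent traffic against a baseline period. The sketch below uses total variation distance; the metric and the 0.10 alert threshold are assumptions, not necessarily what the production system used:

```python
from collections import Counter


def category_shares(labels):
    """Map each category to its fraction of the given predictions."""
    counts = Counter(labels)
    n = sum(counts.values())
    return {c: k / n for c, k in counts.items()}


def total_variation(p, q):
    """Total variation distance between two distributions (0 = identical, 1 = disjoint)."""
    cats = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cats)


DRIFT_THRESHOLD = 0.10  # hypothetical alert level; tune it on historical weeks


def drifted(baseline_labels, recent_labels, threshold=DRIFT_THRESHOLD):
    """True when the recent category mix differs from the baseline by more than threshold."""
    distance = total_variation(category_shares(baseline_labels), category_shares(recent_labels))
    return distance > threshold
```

A shift in the category mix does not prove the model is wrong, since the incoming documents may have genuinely changed, so an alert is best treated as a prompt to sample and manually check recent predictions.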
Cost Analysis
What it costs to build and run:
Fine-tuning
| Item | Cost |
|---|---|
| Training data labeling (5K × $0.06) | $300 |
| Fine-tuning compute | $300 |
| Total upfront | $600 |
Inference (Monthly)
Assume 100K documents processed/month
Option 1: Base Claude
- Cost: 100K docs × $0.015 = $1,500
- Accuracy: 72% (28% fail → manual review costs ~$2K)
- Total: ~$3,500
Option 2: Fine-tuned Claude
- Cost: 100K docs × $0.018 = $1,800
- Accuracy: 90% (10% fail → manual review costs ~$700)
- Total: $2,500
Savings: ~$1,000/month = ~$12K/year
ROI: Break-even within month 1 (the $600 upfront cost is less than one month of savings), then pure savings.
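The monthly comparison can be reproduced with a small cost model. The per-error review cost of $0.07 is an assumption, backed out from the ~$700 quoted for 10% of 100K documents; at 72% accuracy, 28% of documents are misclassified.

```python
def monthly_cost(docs, cost_per_doc, accuracy, review_cost_per_error):
    """Inference spend plus manual review spend for misclassified documents."""
    inference = docs * cost_per_doc
    review = docs * (1 - accuracy) * review_cost_per_error
    return inference + review


DOCS = 100_000
REVIEW_COST = 0.07  # assumption: implied by ~$700 for 10K misclassified documents
UPFRONT = 600       # labeling + fine-tuning compute

base = monthly_cost(DOCS, 0.015, 0.72, REVIEW_COST)
tuned = monthly_cost(DOCS, 0.018, 0.90, REVIEW_COST)
savings = base - tuned

print(f"base ${base:,.0f}, fine-tuned ${tuned:,.0f}, savings ${savings:,.0f}/month")
print(f"months to recover upfront cost: {UPFRONT / savings:.1f}")
```

Under that assumption the base model comes to roughly $3,460 per month and the fine-tuned model to $2,500, so the upfront cost is recovered within the first month.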
Lessons Learned
Key takeaways from building and shipping this system:
What Went Well
Data quality over quantity:
- Thought we needed 10K examples
- 5K high-quality examples (reviewed by expert) beat 10K auto-labeled
- Invested in good labeling process
Domain-specific edge cases matter:
- Base model was “good enough” (72%) for simple cases
- Fine-tuning made the difference on hard cases (permits, HR)
- Needed real examples to see these patterns
Confidence scores work:
- Set threshold 0.7, worked well
- 95%+ of high-confidence predictions were correct
- Low-confidence documents flagged for manual review
What We’d Do Differently
A/B test confidence thresholds:
- Guessed 0.7 initially
- Should have tested multiple values (0.6, 0.7, 0.8)
- 0.7 happened to be right, but got lucky
Monitor per-category accuracy:
- Didn’t track which categories had highest error
- Late in project, realized “permits” was weakest (82% after fine-tuning, up from 40%)
- Could have added more permit examples to training data
Start with smaller dataset:
- Labeled 5K examples immediately
- Could have started with 1K, evaluated accuracy
- 3K would have gotten to 87% accuracy (probably “good enough”)
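Testing several thresholds needs only a labeled validation set with the model's reported confidence for each prediction. A minimal sketch, assuming `records` holds `(confidence, is_correct)` pairs; the function name is hypothetical:

```python
def sweep_thresholds(records, thresholds=(0.6, 0.7, 0.8)):
    """For each threshold, report the manual review rate and accuracy of auto-accepted predictions."""
    records = list(records)
    rows = []
    for t in thresholds:
        # Predictions at or above the threshold are auto-categorized; the rest go to review
        auto = [ok for conf, ok in records if conf >= t]
        review_rate = 1 - len(auto) / len(records)
        auto_accuracy = sum(auto) / len(auto) if auto else float("nan")
        rows.append({"threshold": t, "review_rate": review_rate, "auto_accuracy": auto_accuracy})
    return rows
```

The output makes the trade-off explicit: a higher threshold raises accuracy on auto-categorized documents but sends more volume to manual review.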
Unexpected Challenges
Document formatting:
- PDFs scanned as images → OCR errors
- Some documents have “Document Type: INVOICE” in header
- Model overfit to headers
- Had to add constraint: “Ignore text that says ‘Document Type’”
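An alternative to the prompt constraint would be to strip such lines before training and inference so the model never sees them. This is a swapped-in preprocessing technique, not what the team did, and the regex is an assumption about how the header lines look:

```python
import re

# Matches a whole line such as "Document Type: INVOICE", including its newline
DOC_TYPE_LINE = re.compile(
    r"^[ \t]*document type[ \t]*:.*(?:\n|$)",
    re.IGNORECASE | re.MULTILINE,
)


def strip_doc_type_lines(text: str) -> str:
    """Remove explicit 'Document Type: ...' lines so the label cannot leak from the header."""
    return DOC_TYPE_LINE.sub("", text)
```

Applying the same function to both training excerpts and production inputs keeps the two distributions consistent.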
Regional differences:
- Tax documents in Canada use different terminology
- Training data was 90% US documents
- Fine-tuned model biased toward US categories
- Solution: Added 500 Canadian examples, retrained
Summary
Fine-tuning was the right choice here because:
- Base model couldn’t learn domain-specific categorization from prompts
- Clear, labeled training data existed (just needed labeling)
- ROI was obvious (lower costs + higher accuracy)
- Risk was low (could roll back if issues appeared)
This wouldn’t have worked for:
- Needing real-time information (use RAG instead)
- Insufficient training data (under 1K examples)
- Extremely high accuracy requirement (over 99%, would need more data)
See Also:
- Training & Fine-tuning - Technical deep dive
- Prompt Engineering - Why prompting wasn’t enough
- Decide: Models Guide - Choosing the right model