
Case Study: Document Classification

Fine-tuning for domain-specific categorization - improving accuracy from 72% to 90%
Key Takeaways
  • Fine-tuned Claude improves classification accuracy from 72% to 90% on a held-out test set
  • Uses instruction fine-tuning with 5000 labeled examples
  • Includes A/B testing methodology and cost analysis

Company: B2B SaaS platform for legal/compliance
Problem: Auto-categorize user documents into 12 categories; accuracy poor (72%)
Solution: Fine-tune Claude on domain-specific examples
Results: Accuracy 72% → 90% on the held-out test set (92% after one month in production); latency essentially unchanged (2.5 s → 2.4 s); deployed to 100% of traffic


The Challenge

The platform lets users upload documents, and the system needed to auto-categorize each one:

12 Categories:

  • Contract
  • Invoice / Receipt
  • Tax Document
  • Insurance Document
  • Medical Record
  • Real Estate Document
  • HR Document
  • Financial Statement
  • Permit / License
  • Correspondence
  • Technical Document
  • Other

The problem:

User uploads "Mortgage Agreement.pdf"
Base Claude:
- Returned: "Contract" (plausible, but the platform files mortgages under "Real Estate Document")
- Typical mistakes behind the 72% overall accuracy:
  - Called contracts "Legal Documents" (not one of the 12 categories)
  - Called invoices "Financial Documents" (also not a valid category)
  - Misclassified insurance docs as contracts
Performance by category:
- Easy (contract, invoice): 95% accurate
- Medium (insurance, tax): 70% accurate
- Hard (permits, HR): 45% accurate

The base model understood the concepts but made domain-specific mistakes (e.g., didn’t know that mortgage agreements are “Real Estate” not just “Contract”).
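
A per-category breakdown like the one above can be computed from any list of labeled predictions. The following is a minimal sketch, not the team's actual evaluation code; the function name and the toy data are hypothetical.

```python
from collections import defaultdict


def per_category_accuracy(records):
    """Return {true_label: accuracy} for (true_label, predicted_label) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for true_label, predicted in records:
        totals[true_label] += 1
        if predicted == true_label:
            correct[true_label] += 1
    return {label: correct[label] / totals[label] for label in totals}


# Hypothetical toy data, not the case study's evaluation set.
sample = [
    ("Contract", "Contract"),
    ("Contract", "Contract"),
    ("Permit / License", "Other"),
    ("Permit / License", "Permit / License"),
]
print(per_category_accuracy(sample))
```

Grouping by the true label (rather than the predicted one) is what exposes weak categories such as permits, which an overall accuracy number hides.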


Why This Needed Fine-tuning (Not Prompting)

Could we just improve the prompt?

Attempt 1: "Be very precise about categories"
  Result: Same mistakes
Attempt 2: "Think about the document type, then categorize carefully"
  Result: Slightly better (75%) but still wrong often
Attempt 3: Gave exact examples of each category
  System message became 2000 tokens
  Result: Better (80%) but expensive and still missing edge cases

Why it failed: Base model didn’t have deep knowledge of domain-specific categorization. No amount of prompting could teach it the subtle differences between:

  • “Real Estate Document” (mortgage, deed, title)
  • “Contract” (generic agreements)
  • “Financial Statement” (balance sheet, P&L)

These require seeing examples.
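
For reference, a system prompt of the kind tried in Attempt 3 might be assembled as follows. This is only a sketch: the function name and prompt wording are hypothetical, and the sample items are just the three categories listed above. With all 12 categories and realistic examples, a prompt built this way grows quickly, which is the cost problem Attempt 3 ran into.

```python
# Sample items taken from the three categories discussed above.
CATEGORY_EXAMPLES = {
    "Real Estate Document": ["mortgage", "deed", "title"],
    "Contract": ["generic agreements"],
    "Financial Statement": ["balance sheet", "P&L"],
}


def build_system_prompt(examples):
    """Build a category list with inline examples (hypothetical wording)."""
    lines = ["Categorize the document into exactly one of these categories:"]
    for category, samples in examples.items():
        lines.append(f"- {category} (e.g., {', '.join(samples)})")
    lines.append("Respond with the category name only.")
    return "\n".join(lines)


print(build_system_prompt(CATEGORY_EXAMPLES))
```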


The Fine-tuning Approach

How we prepared data and trained the model:

Data Collection

How we got training data:

Months 1-2: Manually labeled 5,000 documents
├─ 300-500 per category
├─ Balance: didn't matter (all 5K used)
└─ Quality: Reviewed by domain expert (lawyer)
Result: 5,000 labeled examples
├─ ~400 documents per category
└─ All validated by humans

Data Format

{
  "messages": [
    {
      "role": "user",
      "content": "Categorize this document:\n\n[document text or excerpt]"
    },
    {
      "role": "assistant",
      "content": "Real Estate Document"
    }
  ]
}
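
Turning labeled documents into records of this shape is mechanical. A minimal sketch, assuming the labeled data is available as (excerpt, category) pairs; the helper names and the one-record-per-line JSONL layout are assumptions, not details from the original pipeline.

```python
import json


def to_training_record(excerpt, category):
    """Wrap one labeled excerpt in the messages format shown above."""
    return {
        "messages": [
            {"role": "user", "content": f"Categorize this document:\n\n{excerpt}"},
            {"role": "assistant", "content": category},
        ]
    }


def write_jsonl(rows, path):
    """Write one JSON object per line; rows is an iterable of (excerpt, category)."""
    with open(path, "w", encoding="utf-8") as f:
        for excerpt, category in rows:
            f.write(json.dumps(to_training_record(excerpt, category)) + "\n")
```

Keeping the assistant turn to the bare category name means the model learns to emit exactly one of the 12 labels.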

Key decision: How much document text?

Option 1: Full document (1000+ tokens)
├─ Pro: Model sees full context
├─ Con: Expensive to fine-tune, wastes tokens on boilerplate
└─ Result: Tried it, wasn't better than option 2
Option 2: First 2000 characters + relevant excerpt (roughly 500 tokens at ~4 characters per token)
├─ Pro: Enough to judge category, efficient
├─ Con: Might miss important sections
└─ Result: Sweet spot, used this
Option 3: Just document name/header (50 tokens)
├─ Pro: Super efficient
├─ Con: Not enough context, 65% accuracy
└─ Result: Too simple

Chose Option 2: First paragraph + any section title/header that seemed relevant.
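
One way to build such an excerpt is sketched below. The 2000-character cutoff comes from Option 2; everything else is an assumption, in particular the header heuristic (a short line in all caps, or a short line ending in a colon), which stands in for whatever "seemed relevant" meant in practice.

```python
import re

MAX_CHARS = 2000  # from Option 2: first 2000 characters

# Assumption: a "header" is a short all-caps line or a short line ending in ":".
HEADER_RE = re.compile(r"^(?:[A-Z][A-Z0-9 /&-]{3,60}|.{3,60}:)$")


def build_excerpt(text, max_headers=10):
    """Return the document head plus any header-like lines found after it."""
    head = text[:MAX_CHARS]
    headers = []
    for line in text[MAX_CHARS:].splitlines():
        line = line.strip()
        if HEADER_RE.match(line) and line not in headers:
            headers.append(line)
        if len(headers) >= max_headers:
            break
    if not headers:
        return head
    return head + "\n\nSection headers:\n" + "\n".join(headers)
```

Short documents pass through unchanged; longer ones keep their opening text and gain a compact list of later section titles.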

Fine-tuning Run

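import json

import anthropic

client = anthropic.Anthropic()

def load_jsonl(path):
    """Read one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Training data: 5000 examples
training_data = load_jsonl("training_data.jsonl")

# Fine-tune
# NOTE: illustrative only. This call path is not part of the public
# Anthropic Python SDK; substitute your provider's fine-tuning interface.
response = client.beta.model_management.beta.model_create(
    model="claude-3-5-sonnet-20241022",
    training_data=training_data,
    learning_rate=2e-5,
    epochs=1,
)
finetuned_model_id = response.id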

Training configuration:

  • Model: Claude 3.5 Sonnet (good accuracy/cost tradeoff)
  • Data: 5,000 examples (80/10/10 split)
  • Epochs: 1 (more causes overfitting)
  • Learning rate: 2e-5 (conservative, prevents catastrophic forgetting)
  • Time: 45 minutes
  • Cost: ~$300
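
The 80/10/10 split can be done per category so that small classes such as permits appear in all three sets. A minimal sketch; the source does not say whether the split was stratified, so treat that, the function name, and the fixed seed as assumptions.

```python
import random
from collections import defaultdict


def stratified_split(examples, seed=0, train=0.8, val=0.1):
    """Split (text, label) pairs into train/val/test, category by category."""
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    splits = ([], [], [])
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_train = int(len(items) * train)
        n_val = int(len(items) * val)
        splits[0].extend(items[:n_train])
        splits[1].extend(items[n_train:n_train + n_val])
        splits[2].extend(items[n_train + n_val:])  # remainder is the test set
    return splits
```

Applied to 5,000 examples this yields roughly 4,000 for training and 500 each for validation and testing, matching the 500-example held-out set used in the evaluation.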

Evaluation: Before vs After

How the fine-tuned model compared to the base:

Test Set Results (500 examples, held out)

Category             | Base Claude | Fine-tuned | Improvement
Contract             | 95%         | 98%        | +3%
Invoice / Receipt    | 92%         | 96%        | +4%
Tax Document         | 68%         | 88%        | +20%
Insurance Document   | 65%         | 92%        | +27%
Medical Record       | 72%         | 91%        | +19%
Real Estate Document | 45%         | 89%        | +44%
HR Document          | 52%         | 87%        | +35%
Financial Statement  | 88%         | 94%        | +6%
Permit / License     | 40%         | 82%        | +42%
Correspondence       | 78%         | 90%        | +12%
Technical Document   | 81%         | 88%        | +7%
Other                | 85%         | 91%        | +6%
Overall              | 72%         | 90%        | +18%

Confusion Matrix (Before Fine-tuning)

Most common mistakes:

  • Real Estate documents misclassified as Contracts (35% of errors)
  • Permits classified as Other (40% of errors)
  • HR documents confused with Contracts (25% of errors)

After fine-tuning: These confusion patterns almost disappeared.
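
Confusion patterns like these can be pulled straight from the evaluation output by counting (true, predicted) pairs among the errors. A minimal sketch with a hypothetical function name:

```python
from collections import Counter


def top_confusions(pairs, n=3):
    """Return the n most common (true, predicted, share_of_errors) triples."""
    errors = Counter((t, p) for t, p in pairs if t != p)
    total = sum(errors.values())
    return [(t, p, count / total) for (t, p), count in errors.most_common(n)]
```

Running it on the base model's predictions and again on the fine-tuned model's shows whether a specific confusion, such as Real Estate documents labeled as Contracts, actually went away.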

Edge Cases

Tested on tricky documents:

Example 1: "Operating Agreement" (legal doc, but not a contract in categorization sense)
  Base: "Contract" (wrong → should be "HR Document")
  Fine-tuned: "HR Document" (correct)
Example 2: "Schedule C" (tax form attached to return)
  Base: "Financial Statement" (reasonable, but wrong)
  Fine-tuned: "Tax Document" (correct)
Example 3: "Home Inspection Report"
  Base: "Technical Document" (plausible, wrong)
  Fine-tuned: "Real Estate Document" (correct)

Fine-tuned model handled edge cases correctly 89% of the time.


Confidence Scoring

Not all predictions are equally confident. Added confidence thresholds:

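import json

# Get prediction + self-reported confidence.
# Note: this is the model's own estimate, not a calibrated probability.
response = client.messages.create(
    model=finetuned_model_id,
    max_tokens=100,
    system="""Categorize the document and provide your confidence level.
Respond in JSON: {"category": "...", "confidence": 0.9}""",
    messages=[{"role": "user", "content": doc_text}],
)

# Extract confidence; unparseable output is treated as low confidence
try:
    result = json.loads(response.content[0].text)
    confidence = float(result["confidence"])
except (json.JSONDecodeError, KeyError, TypeError, ValueError):
    confidence = 0.0

if confidence < 0.7:
    # Manual review queue
    send_to_manual_review()
else:
    # Auto-categorize
    save_category()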

Distribution of confidences:

  • High confidence (above 0.9): 75% of documents, 96% accuracy
  • Medium (0.7–0.9): 20% of documents, 85% accuracy
  • Low (below 0.7): 5% of documents, 60% accuracy

→ Sent low-confidence documents to manual review (2–5% of volume)
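
The three buckets above imply a blended accuracy of about 92% (0.75 × 0.96 + 0.20 × 0.85 + 0.05 × 0.60). To see how a different cutoff would trade review volume against accuracy, the threshold can be swept over a labeled validation set. A minimal sketch; the function name and input shape are assumptions.

```python
def sweep_thresholds(predictions, thresholds=(0.6, 0.7, 0.8)):
    """predictions: list of (confidence, is_correct) tuples.

    Returns (threshold, review_rate, auto_accuracy) per threshold, where
    auto_accuracy covers only the documents that skip manual review.
    """
    rows = []
    for threshold in thresholds:
        auto = [ok for conf, ok in predictions if conf >= threshold]
        review_rate = 1 - len(auto) / len(predictions)
        auto_accuracy = sum(auto) / len(auto) if auto else None
        rows.append((threshold, review_rate, auto_accuracy))
    return rows
```

A higher threshold raises the accuracy of auto-categorized documents but sends more of them to reviewers, so the right value depends on what a manual review costs.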


Production Deployment

Rolling the model out to real traffic:

A/B Test Results

Week 1: Ran both systems in parallel on 10% of incoming documents

Base Claude:
- Accuracy: 72%
- Avg cost: $0.015 per document
- Time: 2.5 seconds
Fine-tuned Claude:
- Accuracy: 90%
- Avg cost: $0.018 per document
- Time: 2.4 seconds
Difference:
- +18 percentage points accuracy (huge improvement)
- +$0.003 cost per document (negligible)
- Time: effectively the same

Decision: Deploy fine-tuned model to 100%.
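
Before acting on an A/B result it is worth checking that the gap is larger than sampling noise. The sketch below uses a standard two-proportion z-test; it is not something the write-up says the team ran, and the sample counts in the usage lines are hypothetical because the A/B sample size is not reported.

```python
import math


def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """Two-sided z-test for a difference in accuracy between arms A and B."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Hypothetical counts: 1,000 documents per arm at 72% and 90% accuracy.
z, p_value = two_proportion_z(720, 1000, 900, 1000)
print(f"z = {z:.2f}, p = {p_value:.3g}")
```

With a gap of 18 points, even a few hundred documents per arm are enough to rule out chance; smaller gaps need proportionally more traffic.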

Implementation

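from datetime import datetime, timezone

import anthropic
from fastapi import FastAPI, UploadFile  # assuming FastAPI, as the decorator suggests

app = FastAPI()
client = anthropic.Anthropic()

# extract_text() and db are application-specific helpers defined elsewhere.

# API endpoint for categorization
@app.post("/categorize")
def categorize_document(file: UploadFile):
    # Extract text from PDF/image
    text = extract_text(file)
    # Call fine-tuned model
    response = client.messages.create(
        model="claude-3-5-sonnet-ft-20250508",  # Fine-tuned
        max_tokens=50,
        system="Categorize this document into one of 12 categories...",
        messages=[{"role": "user", "content": text}],
    )
    category = response.content[0].text.strip()
    # Save to database
    db.insert({
        "file_id": file.filename,
        "category": category,
        "model": "fine-tuned",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    return {"category": category}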
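
Since the base model sometimes returned labels outside the taxonomy ("Legal Documents", "Financial Documents"), it is safer to validate the model's output before saving it. A minimal sketch: the category list is the one from this case study, while the helper name and the choice to return None for unknown labels are assumptions.

```python
CATEGORIES = [
    "Contract", "Invoice / Receipt", "Tax Document", "Insurance Document",
    "Medical Record", "Real Estate Document", "HR Document",
    "Financial Statement", "Permit / License", "Correspondence",
    "Technical Document", "Other",
]

_LOOKUP = {name.lower(): name for name in CATEGORIES}


def normalize_category(raw):
    """Map raw model output to a canonical category, or None if unrecognized."""
    cleaned = raw.strip().strip("\"'.").lower()
    return _LOOKUP.get(cleaned)


print(normalize_category(' "Real Estate Document". '))
print(normalize_category("Legal Documents"))
```

Returning None rather than silently falling back to "Other" lets the caller route unrecognized labels to the manual review queue.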

Monitoring

Track in production:

  • Category distribution (should match expected distribution)
  • Manual review rate (should stay under 5%)
  • Accuracy on manual reviews (measure true accuracy)
  • Drift (if category distribution changes, model might be wrong)

After 1 month in production:

  • Actual accuracy: 92% (validated against manual reviews)
  • Manual review rate: 4.2% (close to target)
  • No drift detected
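
Drift in the category distribution can be reduced to a single number by comparing the current week's shares against a baseline. The sketch below uses total variation distance, which is one reasonable choice and not necessarily what the team used; the function names are hypothetical and any alert threshold would need tuning on real traffic.

```python
from collections import Counter


def category_shares(labels):
    """Convert a list of predicted labels into {label: share_of_total}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(expected, observed):
    """0.0 means identical distributions, 1.0 means completely disjoint."""
    labels = set(expected) | set(observed)
    return 0.5 * sum(
        abs(expected.get(label, 0.0) - observed.get(label, 0.0))
        for label in labels
    )
```

A sustained rise in this distance does not prove the model is wrong, since the upload mix may genuinely have changed, but it is a cheap trigger for sampling extra documents into manual review.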

Cost Analysis

What it costs to build and run:

Fine-tuning

Item                                | Cost
Training data labeling (5K × $0.06) | $300
Fine-tuning compute                 | $300
Total upfront                       | $600

Inference (Monthly)

Assume 100K documents processed/month
Option 1: Base Claude
- Cost: 100K docs × $0.015 = $1,500
- Accuracy: 72% (~28% fail → manual review costs ~$2K)
- Total: ~$3,500
Option 2: Fine-tuned Claude
- Cost: 100K docs × $0.018 = $1,800
- Accuracy: 90% (10% fail → manual review costs ~$700)
- Total: ~$2,500
Savings: ~$1,000/month ≈ $12K/year

ROI: Break-even at month 1, then pure savings.
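
The monthly comparison can be reproduced with a few lines of arithmetic. In this sketch the $0.07 manual review cost per failed document is an assumption, back-calculated from the ~$700 figure for 10K failures; the function name is hypothetical.

```python
def monthly_cost(docs, cost_per_doc, accuracy, review_cost_per_doc):
    """Inference spend plus manual review of misclassified documents."""
    inference = docs * cost_per_doc
    review = docs * (1 - accuracy) * review_cost_per_doc
    return inference + review


DOCS = 100_000
REVIEW_COST = 0.07  # assumption: implied by ~$700 for 10K failed documents
UPFRONT = 600       # labeling + fine-tuning compute

base = monthly_cost(DOCS, 0.015, 0.72, REVIEW_COST)
tuned = monthly_cost(DOCS, 0.018, 0.90, REVIEW_COST)
savings = base - tuned
print(f"base ${base:,.0f}, fine-tuned ${tuned:,.0f}, savings ${savings:,.0f}/month")
print(f"break-even after {UPFRONT / savings:.2f} months")
```

With these inputs the totals come to about $3,460 and $2,500 per month, in line with the rounded figures above, and the $600 upfront cost is recovered in under a month.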


Lessons Learned

Key takeaways from building and shipping this system:

What Went Well

  1. Data quality over quantity

    • Thought we needed 10K examples
    • 5K high-quality examples (reviewed by expert) beat 10K auto-labeled
    • Invested in good labeling process
  2. Domain-specific edge cases matter

    • Base model was “good enough” (72%) for simple cases
    • Fine-tuning made difference on hard cases (permits, HR)
    • Needed real examples to see these patterns
  3. Confidence scores work

    • Set threshold at 0.7; it held up in production (4.2% manual review rate)
    • 95%+ of high-confidence predictions were correct
    • Low-confidence documents flagged for manual review

What We’d Do Differently

  1. A/B test confidence thresholds

    • Guessed 0.7 initially
    • Should have tested multiple values (0.6, 0.7, 0.8)
    • 0.7 happened to be right, but got lucky
  2. Monitor per-category accuracy

    • Didn’t track which categories had highest error
    • Late in project, realized “permits” was weakest (82% accuracy after fine-tuning)
    • Could have added more permit examples to training data
  3. Start with smaller dataset

    • Labeled 5K examples immediately
    • Could have started with 1K, evaluated accuracy
    • 3K would have gotten to 87% accuracy (probably “good enough”)

Unexpected Challenges

Document formatting:

  • PDFs scanned as images → OCR errors
  • Some documents have “Document Type: INVOICE” in header
    • Model overfit to headers
    • Had to add constraint: “Ignore text that says ‘Document Type’”
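
An alternative to the prompt constraint is to strip such header lines before the text reaches the model. This is a different technique from the one described above, shown only as a sketch; the regex and function name are hypothetical, and it would need to be applied to the training excerpts as well as to production inputs.

```python
import re

# Matches a whole line such as "Document Type: INVOICE" (case-insensitive).
DOC_TYPE_LINE = re.compile(
    r"^[ \t]*document type[ \t]*:.*$", re.IGNORECASE | re.MULTILINE
)


def strip_document_type_lines(text):
    """Blank out explicit 'Document Type: ...' lines, leaving other text intact."""
    return DOC_TYPE_LINE.sub("", text)
```

Removing the shortcut forces the model to rely on the document body, at the price of discarding a header that is often correct.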

Regional differences:

  • Tax documents in Canada use different terminology
  • Training data was 90% US documents
  • Fine-tuned model biased toward US categories
  • Solution: Added 500 Canadian examples, retrained

Summary

Fine-tuning was the right choice here because:

  1. Base model couldn’t learn domain-specific categorization from prompts
  2. Plenty of real documents were available to train on (they just needed labeling)
  3. ROI was obvious (lower costs + higher accuracy)
  4. Risk was low (could roll back if issues appeared)

This wouldn’t have worked for:

  • Needing real-time information (use RAG instead)
  • Insufficient training data (under 1K examples)
  • Extremely high accuracy requirement (over 99%, would need more data)
