Case Study: Document Classification


Fine-tuning for domain-specific categorization - improving accuracy from 72% to 90%

Key Takeaways

Fine-tuned Claude improves classification accuracy from 72% to 90% on a held-out test set
Uses instruction fine-tuning with 5000 labeled examples
Includes A/B testing methodology and cost analysis

Company: B2B SaaS platform for legal/compliance
Problem: Auto-categorize user documents into 12 categories; accuracy poor (72%)
Solution: Fine-tune Claude on domain-specific examples
Results: Accuracy 72% → 90% on the held-out test set (92% measured in production); inference time essentially unchanged; production ready

The Challenge

The platform lets users upload documents, and the system needed to auto-categorize each one into one of 12 categories:

12 Categories:

Contract
Invoice / Receipt
Tax Document
Insurance Document
Medical Record
Real Estate Document
HR Document
Financial Statement
Permit / License
Correspondence
Technical Document
Other

The problem:

User uploads "Mortgage Agreement.pdf"

Base Claude:
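- Returned: "Contract" (plausible, but the intended category is "Real Estate Document")
- Overall accuracy: only 72%
- Mistakes: Called contracts "Legal Documents" (not one of the 12 categories)
            Called invoices "Financial Documents" (not one of the 12 categories)
            Misclassified insurance docs as contracts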

Performance by category:
- Easy (contract, invoice): 95% accurate
- Medium (insurance, tax): 70% accurate
- Hard (permits, HR): 45% accurate

The base model understood the concepts but made domain-specific mistakes (e.g., didn’t know that mortgage agreements are “Real Estate” not just “Contract”).

Why This Needed Fine-tuning (Not Prompting)

Could we just improve the prompt?

Attempt 1: "Be very precise about categories"
Result: Same mistakes

Attempt 2: "Think about the document type,
          then categorize carefully"
Result: Slightly better (75%) but still wrong often

Attempt 3: Gave exact examples of each category
System message became 2000 tokens
Result: Better (80%) but expensive and still missing edge cases

Why it failed: The base model lacked deep knowledge of our domain-specific categorization. Prompting alone plateaued at about 80% and could not teach it the subtle differences between:

“Real Estate Document” (mortgage, deed, title)
“Contract” (generic agreements)
“Financial Statement” (balance sheet, P&L)

These distinctions require seeing many labeled examples, far more than fit economically in a prompt.

The Fine-tuning Approach

How we prepared data and trained the model:

Data Collection

How we got training data:

Months 1-2: Manually labeled 5,000 documents
├─ 300-500 per category
├─ Balance: not enforced (all 5K labeled documents were used as-is)
└─ Quality: Reviewed by domain expert (lawyer)

Result: 5,000 labeled examples
├─ ~400 documents per category
└─ All validated by humans

Data Format

{
  "messages": [
    {
      "role": "user",
      "content": "Categorize this document:\n\n[document text or excerpt]"
    },
    {
      "role": "assistant",
      "content": "Real Estate Document"
    }
  ]
}
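
A minimal sketch of how labeled documents could be serialized into this format, one JSON object per line. The helper names (`to_training_record`, `write_jsonl`) and the `CATEGORIES` constant are illustrative assumptions, not taken from the original pipeline.

```python
import json

# The 12 categories listed earlier in this case study.
CATEGORIES = [
    "Contract", "Invoice / Receipt", "Tax Document", "Insurance Document",
    "Medical Record", "Real Estate Document", "HR Document",
    "Financial Statement", "Permit / License", "Correspondence",
    "Technical Document", "Other",
]


def to_training_record(excerpt: str, label: str) -> dict:
    """Build one chat-style training example (hypothetical helper)."""
    if label not in CATEGORIES:
        raise ValueError(f"Unknown category: {label!r}")
    return {
        "messages": [
            {"role": "user", "content": f"Categorize this document:\n\n{excerpt}"},
            {"role": "assistant", "content": label},
        ]
    }


def write_jsonl(records, path: str) -> None:
    """Write one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Rejecting labels outside the 12 categories at write time keeps out-of-taxonomy labels such as "Legal Document" from leaking into the training set.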

Key decision: How much document text?

Option 1: Full document (1000+ tokens)
├─ Pro: Model sees full context
├─ Con: Expensive to fine-tune, wastes tokens on boilerplate
└─ Result: Tried it, wasn't better than option 2

Option 2: First 2000 characters + relevant excerpt (roughly 500 tokens at ~4 characters per token)
├─ Pro: Enough to judge category, efficient
├─ Con: Might miss important sections
└─ Result: Sweet spot, used this

Option 3: Just document name/header (50 tokens)
├─ Pro: Super efficient
├─ Con: Not enough context, 65% accuracy
└─ Result: Too simple

Chose Option 2: the opening of the document (first 2000 characters) plus any section title or header that seemed relevant.
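
One way the Option 2 excerpt could be built is sketched below. The header heuristic (short lines in all caps or ending with a colon) and the function name `build_excerpt` are assumptions for illustration; the case study does not say how "relevant" headers were chosen.

```python
def build_excerpt(text: str, max_chars: int = 2000, max_headers: int = 10) -> str:
    """Return the first `max_chars` characters plus later header-like lines."""
    head = text[:max_chars]
    lines = text[max_chars:].splitlines()

    # If the cut fell mid-line, drop the remainder of that line.
    if head and not head.endswith("\n"):
        lines = lines[1:]

    headers = []
    for line in lines:
        stripped = line.strip()
        # Assumed heuristic: short lines in ALL CAPS or ending with ":".
        looks_like_header = (
            3 <= len(stripped) <= 80
            and (stripped.isupper() or stripped.endswith(":"))
        )
        if looks_like_header and stripped not in headers:
            headers.append(stripped)
        if len(headers) >= max_headers:
            break

    if not headers:
        return head
    return head + "\n\n[Section headers]\n" + "\n".join(headers)
```

Using the same excerpt function at training time and at inference time keeps the model's inputs consistent between the two.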

Fine-tuning Run

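import json

import anthropic

client = anthropic.Anthropic()

def load_jsonl(path):
    # Read one JSON object per line
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Training split of the 5,000 labeled examples
training_data = load_jsonl("training_data.jsonl")

# Fine-tune
# NOTE: submit_finetune_job is a placeholder (hypothetical name) for the
# job-submission call of the fine-tuning service; the exact call depends
# on the provider.
job = submit_finetune_job(
    base_model="claude-3-5-sonnet-20241022",
    training_data=training_data,
    learning_rate=2e-5,
    epochs=1,
)

finetuned_model_id = job.id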

Training configuration:

Model: Claude 3.5 Sonnet (good accuracy/cost tradeoff)
Data: 5,000 examples (80/10/10 split: 4,000 train, 500 validation, 500 test)
Epochs: 1 (more epochs led to overfitting)
Learning rate: 2e-5 (conservative, prevents catastrophic forgetting)
Time: 45 minutes
Cost: ~$300
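
The 80/10/10 split can be sketched as below. Splitting per category (stratified) is an assumption here, chosen so that rare categories still appear in the validation and test sets; the case study does not state how the split was drawn. `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_split(examples, seed=0, train=0.8, val=0.1):
    """Split (text, label) pairs into train/val/test within each category."""
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {"train": [], "val": [], "test": []}
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_train = round(len(items) * train)
        n_val = round(len(items) * val)
        splits["train"] += items[:n_train]
        splits["val"] += items[n_train:n_train + n_val]
        splits["test"] += items[n_train + n_val:]  # remainder goes to test
    return splits
```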

Evaluation: Before vs After

How the fine-tuned model compared to the base:

Test Set Results (500 examples, held out)

Category	Base Claude	Fine-tuned	Improvement
Contract	95%	98%	+3%
Invoice / Receipt	92%	96%	+4%
Tax Document	68%	88%	+20%
Insurance Document	65%	92%	+27%
Medical Record	72%	91%	+19%
Real Estate Document	45%	89%	+44%
HR Document	52%	87%	+35%
Financial Statement	88%	94%	+6%
Permit / License	40%	82%	+42%
Correspondence	78%	90%	+12%
Technical Document	81%	88%	+7%
Other	85%	91%	+6%
Overall	72%	90%	+18%
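
As a sanity check, the Overall row matches the unweighted (macro) average of the per-category figures. The sketch below assumes test examples are spread roughly evenly across categories, which the case study does not state explicitly.

```python
# Per-category accuracy (%) from the table above: (base, fine-tuned).
results = {
    "Contract": (95, 98),
    "Invoice / Receipt": (92, 96),
    "Tax Document": (68, 88),
    "Insurance Document": (65, 92),
    "Medical Record": (72, 91),
    "Real Estate Document": (45, 89),
    "HR Document": (52, 87),
    "Financial Statement": (88, 94),
    "Permit / License": (40, 82),
    "Correspondence": (78, 90),
    "Technical Document": (81, 88),
    "Other": (85, 91),
}

base_macro = sum(base for base, _ in results.values()) / len(results)
ft_macro = sum(ft for _, ft in results.values()) / len(results)

print(f"Base macro-average:       {base_macro:.2f}%")  # 71.75%
print(f"Fine-tuned macro-average: {ft_macro:.2f}%")    # 90.50%
```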

Confusion Matrix (Before Fine-tuning)

Most common mistakes:

Real Estate documents misclassified as Contracts (35% of errors)
Permits classified as Other (40% of errors)
HR documents confused with Contracts (25% of errors)

After fine-tuning: These confusion patterns almost disappeared.
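
A minimal sketch of how the most common confusions can be pulled out of a list of true and predicted labels. The function name `top_confusions` is illustrative, and the "share of all errors" definition is an assumption about how the percentages above were computed.

```python
from collections import Counter


def top_confusions(y_true, y_pred, k=3):
    """Return the k most frequent (true, predicted) error pairs
    together with each pair's share of all errors."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    total = sum(errors.values())
    if total == 0:
        return []
    return [(pair, count / total) for pair, count in errors.most_common(k)]
```

Running this on the base model's and the fine-tuned model's predictions for the same test set makes it easy to see which confusion pairs disappeared.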

Edge Cases

Tested on tricky documents:

Example 1: "Operating Agreement" (legal doc, but not a contract in categorization sense)
Base: "Contract" (wrong → should be "HR Document")
Fine-tuned: "HR Document" (correct)

Example 2: "Schedule C" (tax form attached to return)
Base: "Financial Statement" (reasonable, but wrong)
Fine-tuned: "Tax Document" (correct)

Example 3: "Home Inspection Report"
Base: "Technical Document" (plausible, wrong)
Fine-tuned: "Real Estate Document" (correct)

Fine-tuned model handled edge cases correctly 89% of the time.

Confidence Scoring

Not all predictions are equally confident. Added confidence thresholds:

import json

# Get prediction + confidence
response = client.messages.create(
    model=finetuned_model_id,
    max_tokens=100,
    system="""Categorize the document and provide your confidence level.
    Respond in JSON: {"category": "...", "confidence": 0.9}""",
    messages=[{"role": "user", "content": doc_text}]
)

# Extract the model's self-reported confidence
prediction = json.loads(response.content[0].text)
confidence = prediction["confidence"]

if confidence < 0.7:
    # Manual review queue
    send_to_manual_review()
else:
    # Auto-categorize
    save_category()
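
In practice the model's reply may occasionally be malformed or name a category outside the taxonomy. A defensive version of the routing step might look like the sketch below; `route_prediction` and the returned dictionary shape are assumptions for illustration, and only the 0.7 threshold comes from this case study.

```python
import json

REVIEW_THRESHOLD = 0.7  # threshold used in this case study


def route_prediction(raw_text: str, valid_categories: set) -> dict:
    """Parse the model's JSON reply and choose auto-accept or manual review."""
    try:
        parsed = json.loads(raw_text)
        category = parsed["category"]
        confidence = float(parsed["confidence"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Anything we cannot parse goes to a human.
        return {"action": "manual_review", "reason": "unparseable_reply"}

    if not isinstance(category, str) or category not in valid_categories:
        return {"action": "manual_review", "reason": "unknown_category"}
    if confidence < REVIEW_THRESHOLD:
        return {"action": "manual_review", "reason": "low_confidence"}
    return {"action": "auto_accept", "category": category, "confidence": confidence}
```

Sending unparseable or out-of-taxonomy replies to manual review means a formatting failure never silently becomes a wrong category.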

Distribution of confidences:

High confidence (above 0.9): 75% of documents, 96% accuracy
Medium (0.7–0.9): 20% of documents, 85% accuracy
Low (below 0.7): 5% of documents, 60% accuracy

→ Sent low-confidence documents to manual review (2–5% of volume)
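
The bucket figures above imply an overall accuracy and an accuracy on auto-accepted documents, which can be worked out as a weighted average. This is only arithmetic on the numbers reported in this section; the variable names are illustrative.

```python
# (share of documents, accuracy) per confidence bucket, from the figures above.
buckets = {
    "high": (0.75, 0.96),
    "medium": (0.20, 0.85),
    "low": (0.05, 0.60),
}

# Accuracy over all documents: weight each bucket by its share.
overall = sum(share * acc for share, acc in buckets.values())

# Accuracy over documents that skip manual review (confidence >= 0.7).
auto_share = buckets["high"][0] + buckets["medium"][0]
auto_correct = sum(buckets[name][0] * buckets[name][1] for name in ("high", "medium"))
auto_accuracy = auto_correct / auto_share

print(f"Overall accuracy:       {overall:.1%}")
print(f"Auto-accepted accuracy: {auto_accuracy:.1%}")
```

This works out to about 92% overall and about 93.7% on the auto-accepted 95% of documents.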

Production Deployment

Rolling the model out to real traffic:

A/B Test Results

Week 1: Ran both systems in parallel on 10% of incoming documents

Base Claude:
- Accuracy: 72%
- Avg cost: $0.015 per document
- Time: 2.5 seconds

Fine-tuned Claude:
- Accuracy: 90%
- Avg cost: $0.018 per document
- Time: 2.4 seconds

Difference:
- +18 percentage points accuracy (huge improvement)
- +$0.003 cost per document (+20%, negligible in absolute terms)
- Time: essentially the same (2.4 s vs 2.5 s)

Decision: Deploy fine-tuned model to 100%.
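
One common way to pick a stable 10% sample for a parallel run like this is to hash the document ID. The sketch below is an assumption about how such sampling could be done, not a description of the original setup; `in_ab_sample` is a hypothetical helper.

```python
import hashlib


def in_ab_sample(doc_id: str, fraction: float = 0.10) -> bool:
    """Deterministically place about `fraction` of documents in the A/B sample."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction
```

Because the decision depends only on the ID, the same document always lands in the same group, even across retries or service restarts.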

Implementation

from datetime import datetime, timezone

import anthropic
from fastapi import FastAPI, UploadFile

app = FastAPI()
client = anthropic.Anthropic()

# API endpoint for categorization
@app.post("/categorize")
def categorize_document(file: UploadFile):
    # Extract text from PDF/image (extract_text is the application's own helper)
    text = extract_text(file)

    # Call fine-tuned model
    response = client.messages.create(
        model="claude-3-5-sonnet-ft-20250508",  # Fine-tuned
        max_tokens=50,
        system="Categorize this document into one of 12 categories...",
        messages=[{"role": "user", "content": text}]
    )

    category = response.content[0].text.strip()

    # Save to database (db is the application's storage client)
    db.insert({
        "file_id": file.filename,
        "category": category,
        "model": "fine-tuned",
        "timestamp": datetime.now(timezone.utc).isoformat()
    })

    return {"category": category}

Monitoring

Track in production:

Category distribution (should match expected distribution)
Manual review rate (should stay under 5%)
Accuracy on manual reviews (measure true accuracy)
Drift (a shift in the category distribution can signal that incoming documents have changed and accuracy may be degrading)

After 1 month in production:

Actual accuracy: 92% (validated against manual reviews)
Manual review rate: 4.2% (close to target)
No drift detected
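
A simple way to quantify drift in the category distribution is the total variation distance between the expected and observed distributions. This is one possible metric, chosen here for illustration; the case study does not say which drift measure was used, and any alert threshold would need to be tuned.

```python
def to_distribution(counts: dict) -> dict:
    """Convert per-category counts into proportions."""
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}


def total_variation(expected: dict, observed: dict) -> float:
    """Total variation distance between two category distributions.

    0.0 means identical distributions, 1.0 means no overlap at all.
    """
    categories = set(expected) | set(observed)
    return 0.5 * sum(
        abs(expected.get(c, 0.0) - observed.get(c, 0.0)) for c in categories
    )
```

Comparing last month's distribution with the current week's, for example, gives a single number that can be tracked on a dashboard alongside the manual review rate.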

Cost Analysis

What it costs to build and run:

Fine-tuning

Item	Cost
Training data labeling (5K × $0.06)	$300
Fine-tuning compute	$300
Total upfront	$600

Inference (Monthly)

Assume 100K documents processed/month

Option 1: Base Claude
- Cost: 100K docs × $0.015 = $1,500
- Accuracy: 72% (28% fail → manual review costs ~$2K)
- Total: $3,500

Option 2: Fine-tuned Claude
- Cost: 100K docs × $0.018 = $1,800
- Accuracy: 90% (10% fail → manual review costs ~$700)
- Total: $2,500

Savings: $1,000/month = $12K/year

ROI: Break-even within the first month ($600 upfront vs. $1,000/month savings), then pure savings.
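
The monthly comparison can be reproduced with a few lines of arithmetic. All inputs below are the figures quoted in this section; the function and variable names are illustrative.

```python
DOCS_PER_MONTH = 100_000
UPFRONT_COST = 600  # labeling + fine-tuning compute, from the table above


def monthly_total(cost_per_doc: float, manual_review_cost: float) -> float:
    """Inference cost plus the cost of manually fixing misclassified documents."""
    return DOCS_PER_MONTH * cost_per_doc + manual_review_cost


base_total = monthly_total(0.015, 2_000)  # base Claude
ft_total = monthly_total(0.018, 700)      # fine-tuned Claude

monthly_savings = base_total - ft_total
breakeven_months = UPFRONT_COST / monthly_savings

print(f"Base: ${base_total:,.0f}/month, fine-tuned: ${ft_total:,.0f}/month")
print(f"Savings: ${monthly_savings:,.0f}/month, break-even after {breakeven_months:.1f} months")
```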

Lessons Learned

Key takeaways from building and shipping this system:

What Went Well

Data quality over quantity
- Thought we needed 10K examples
- 5K high-quality examples (reviewed by expert) beat 10K auto-labeled
- Invested in good labeling process
Domain-specific edge cases matter
- Base model was “good enough” (72%) for simple cases
- Fine-tuning made difference on hard cases (permits, HR)
- Needed real examples to see these patterns
Confidence scores work
- Set threshold 0.7, worked well in practice
- 95%+ of high-confidence predictions were correct
- Low-confidence documents flagged for manual review

What We’d Do Differently

A/B test confidence thresholds
- Guessed 0.7 initially
- Should have tested multiple values (0.6, 0.7, 0.8)
- 0.7 happened to be right, but got lucky
Monitor per-category accuracy
- Didn’t track which categories had highest error
- Late in project, realized “permits” was weakest (82% accuracy after fine-tuning, the lowest of any category)
- Could have added more permit examples to training data
Start with smaller dataset
- Labeled 5K examples immediately
- Could have started with 1K, evaluated accuracy
- 3K would have gotten to 87% accuracy (probably “good enough”)
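
Testing several confidence thresholds, as suggested above, only needs a labeled validation set with the model's confidence for each prediction. The sketch below is a hypothetical helper (`sweep_thresholds`), not code from the original project.

```python
def sweep_thresholds(predictions, thresholds=(0.6, 0.7, 0.8)):
    """Compare thresholds on a labeled validation set.

    predictions: non-empty list of (confidence, is_correct) pairs.
    Returns, per threshold, the share sent to manual review and the
    accuracy of the predictions that would be auto-accepted.
    """
    rows = []
    for threshold in thresholds:
        accepted = [ok for conf, ok in predictions if conf >= threshold]
        review_rate = 1 - len(accepted) / len(predictions)
        auto_accuracy = sum(accepted) / len(accepted) if accepted else None
        rows.append({
            "threshold": threshold,
            "review_rate": review_rate,
            "auto_accuracy": auto_accuracy,
        })
    return rows
```

Each row exposes the trade-off directly: a higher threshold raises auto-accepted accuracy but sends more documents to manual review.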

Unexpected Challenges

Document formatting:

PDFs scanned as images → OCR errors
Some documents have “Document Type: INVOICE” in header
- Model overfit to headers
- Had to add constraint: “Ignore text that says ‘Document Type’”
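
An alternative to the prompt constraint is to strip such header lines before the text reaches the model. This is a different technique from the one described above, shown only as a sketch; the regular expression and the name `strip_doc_type_lines` are assumptions.

```python
import re

# Matches whole lines such as "Document Type: INVOICE" (case-insensitive).
DOC_TYPE_LINE = re.compile(
    r"^[ \t]*document[ \t]+type[ \t]*:.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_doc_type_lines(text: str) -> str:
    """Remove explicit 'Document Type: ...' lines from extracted text."""
    return DOC_TYPE_LINE.sub("", text)
```

Applying the same preprocessing to the training data and to production inputs would keep the model from learning to rely on these headers in the first place.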

Regional differences:

Tax documents in Canada use different terminology
Training data was 90% US documents
Fine-tuned model biased toward US categories
Solution: Added 500 Canadian examples, retrained

Summary

Fine-tuning was the right choice here because:

Base model couldn’t learn domain-specific categorization from prompts
Training documents were readily available (they just needed labeling)
ROI was obvious (lower costs + higher accuracy)
Risk was low (could roll back if issues appeared)

Fine-tuning would not have been the right tool for:

Tasks that need real-time information (use RAG instead)
Tasks with insufficient training data (under 1K examples)
Tasks with extremely high accuracy requirements (over 99%, which would need more data)