
DeepSeek Workflows & Best Practices

Thinking mode usage strategies, cost optimization (caching, model routing), agent sharing patterns, rate limit management, and production deployment patterns for DeepSeek.
Key Takeaways
  • Thinking mode: enable it for complex tasks (code, math, reasoning); disable it for simple tasks to save tokens and latency
  • Cost optimization: use DeepSeek V4 Flash for simple tasks and V4 Pro for complex ones. KV cache hits cut the input cost of repeated prefixes by about 98% at the Flash prices quoted below ($0.0028 vs. $0.14 per 1M tokens)
  • Agent sharing: one DeepSeek API key powers all 15+ integrated agents, including Claude Code and Copilot

Thinking Mode — When to Use

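from openai import OpenAI

# Assumption: the OpenAI Python SDK is the client; the key is a placeholder
client = OpenAI(api_key="sk-your-deepseek-key", base_url="https://api.deepseek.com")

# Complex tasks: enable thinking
response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Design a distributed rate limiter"}],
    reasoning_effort="high",
    # The OpenAI SDK rejects unknown keyword arguments, so `thinking` goes through extra_body
    extra_body={"thinking": {"type": "enabled"}},
)

# Simple tasks: disable thinking (V4 Flash only)
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    extra_body={"thinking": {"type": "disabled"}},
)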
| Task Type | Model | Thinking | reasoning_effort |
|---|---|---|---|
| Simple Q&A, classification | V4 Flash | Disabled | n/a |
| Summarization, translation | V4 Flash | Disabled | n/a |
| Code generation | V4 Pro | Enabled | medium |
| Debugging, refactoring | V4 Pro | Enabled | high |
| Architecture design | V4 Pro | Enabled | high |
| Complex math, research | V4 Pro | Enabled | high |
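
The table above can also be written as a lookup that yields the request settings for a given task. This is a minimal sketch: the task-type keys and the `settings_for` helper are hypothetical names, and the fallback to V4 Pro with high effort for unknown tasks is an assumption rather than something the table specifies.

```python
# Hypothetical lookup mirroring the table: task type -> (model, thinking, reasoning_effort)
TASK_SETTINGS = {
    "simple_qa": ("deepseek-v4-flash", "disabled", None),
    "classification": ("deepseek-v4-flash", "disabled", None),
    "summarization": ("deepseek-v4-flash", "disabled", None),
    "translation": ("deepseek-v4-flash", "disabled", None),
    "code_generation": ("deepseek-v4-pro", "enabled", "medium"),
    "debugging": ("deepseek-v4-pro", "enabled", "high"),
    "refactoring": ("deepseek-v4-pro", "enabled", "high"),
    "architecture": ("deepseek-v4-pro", "enabled", "high"),
    "math": ("deepseek-v4-pro", "enabled", "high"),
    "research": ("deepseek-v4-pro", "enabled", "high"),
}

# Assumption: unknown task types fall back to the most capable configuration
DEFAULT_SETTINGS = ("deepseek-v4-pro", "enabled", "high")


def settings_for(task_type):
    """Return model and thinking settings; reasoning_effort is omitted when thinking is off."""
    model, thinking, effort = TASK_SETTINGS.get(task_type, DEFAULT_SETTINGS)
    settings = {"model": model, "thinking": {"type": thinking}}
    if effort is not None:
        settings["reasoning_effort"] = effort
    return settings
```

How the returned dictionary is passed to the API depends on the client library in use.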

Cost Optimization

Model Routing

def route(task_type):
    if task_type in ["classification", "simple_qa", "summarization"]:
        return "deepseek-v4-flash"  # $0.14/$0.28, cheapest
    elif task_type in ["code_gen", "analysis", "writing"]:
        return "deepseek-v4-pro"  # $0.435/$0.87, quality
    else:
        return "deepseek-v4-pro"  # Default to quality
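
To see what a routing decision is worth, the prices in the comments above can be turned into a per-request estimate. This sketch assumes the two figures for each model are USD per 1M input tokens and per 1M output tokens respectively, which the source does not label explicitly; `PRICES` and `estimate_cost` are hypothetical names.

```python
# Prices copied from the routing comments; assumed to be USD per 1M tokens as (input, output)
PRICES = {
    "deepseek-v4-flash": (0.14, 0.28),
    "deepseek-v4-pro": (0.435, 0.87),
}


def estimate_cost(model, input_tokens, output_tokens):
    """Estimated USD cost of one request, ignoring any cache discount."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000
```

Under these assumptions, a request with 2,000 input tokens and 500 output tokens costs about $0.00042 on Flash and about $0.0013 on Pro, roughly a 3.1x difference.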

Maximize Cache Hits

# ✅ Good: Reuse system prompts, so cache hits are billed at $0.0028/1M
SYSTEM_PROMPT = "You are a helpful assistant..."  # Cached after first call
for query in user_queries:
    client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": query}
        ]
    )

# ❌ Bad: Unique system prompt per request, so every call pays full price
for query in user_queries:
    client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=[
            {"role": "system", "content": f"Context: {random_context()}"},
            {"role": "user", "content": query}
        ]
    )
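
The difference between the two loops can be estimated with a little arithmetic. This sketch assumes $0.14 per 1M uncached Flash input tokens (the figure from the routing section) and $0.0028 per 1M cached tokens. It also assumes every request after the first hits the cache for the entire system prompt, which is optimistic. `system_prompt_cost` is a hypothetical helper.

```python
MISS_PRICE = 0.14    # USD per 1M uncached input tokens (Flash figure from the routing section)
HIT_PRICE = 0.0028   # USD per 1M input tokens served from cache


def system_prompt_cost(prompt_tokens, requests, cached=True):
    """Input cost in USD attributable to the system prompt across `requests` calls."""
    if requests <= 0:
        return 0.0
    if not cached:
        return requests * prompt_tokens * MISS_PRICE / 1_000_000
    # Assumption: the first call is a miss and every later call is a full hit
    first = prompt_tokens * MISS_PRICE / 1_000_000
    rest = (requests - 1) * prompt_tokens * HIT_PRICE / 1_000_000
    return first + rest
```

For a 2,000-token system prompt reused across 10,000 requests, this comes to about $2.80 uncached versus about $0.056 cached, a reduction of roughly 98% under these assumptions.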

Batch Operations

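# Route high-volume, non-urgent tasks to V4 Flash
import concurrent.futures

def process_batch(items, model="deepseek-v4-flash", max_workers=8):
    # max_workers caps parallel requests; 8 is an arbitrary default, tune it to your concurrency limit
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [
            executor.submit(
                client.chat.completions.create,
                model=model,
                messages=[{"role": "user", "content": f"Classify: {item}"}],
                # Assuming the OpenAI SDK, which takes non-standard fields via extra_body
                extra_body={"thinking": {"type": "disabled"}},
                max_tokens=50,
            )
            for item in items
        ]
        # Results come back in input order; the first failed request raises here
        return [f.result() for f in futures]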
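
Calling `f.result()` re-raises the first exception it meets, so one failed request discards the results of the whole batch. One way to keep per-item outcomes is sketched below. `run_batch` is a hypothetical helper, and `call` stands in for whatever function issues the request for a single item.

```python
import concurrent.futures


def run_batch(call, items, max_workers=8):
    """Apply `call` to every item concurrently; return (item, result, error) tuples in input order."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(call, item) for item in items]
        outcomes = []
        for item, future in zip(items, futures):
            try:
                outcomes.append((item, future.result(), None))
            except Exception as exc:
                # Record the failure and keep going so one bad item does not sink the batch
                outcomes.append((item, None, exc))
        return outcomes
```

Items whose third element is not `None` can then be retried or logged separately.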

Agent Sharing Pattern

One DeepSeek API key powers all your coding agents:

# Set once, use everywhere
# OpenAI-compatible tools that read these variables will send requests to DeepSeek
export OPENAI_API_KEY=sk-your-deepseek-key
export OPENAI_BASE_URL=https://api.deepseek.com

# Claude Code and other Anthropic-compatible tools read these instead
export ANTHROPIC_AUTH_TOKEN=sk-your-deepseek-key
export ANTHROPIC_BASE_URL=https://api.deepseek.com/anthropic
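
Whether a given tool honors these variables depends on how it resolves its endpoint. The sketch below shows the kind of lookup an OpenAI-compatible client might perform. `resolve_endpoint` is a hypothetical function, and the fallback URL is the standard OpenAI endpoint, included only to illustrate what happens when the override is missing.

```python
import os


def resolve_endpoint(env=None):
    """Return (base_url, api_key) the way an OpenAI-compatible tool might read them."""
    env = os.environ if env is None else env
    # Without OPENAI_BASE_URL, requests would go to the default OpenAI endpoint
    base_url = env.get("OPENAI_BASE_URL", "https://api.openai.com/v1")
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return base_url, api_key
```

A tool that hard-codes its endpoint ignores the variables entirely, so it is worth confirming each agent's behavior before relying on the shared key.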

Rate Limit Management

| Model | Concurrency limit | Strategy |
|---|---|---|
| V4 Flash | 2,500 | High-volume, classification, real-time |
| V4 Pro | 500 | Complex tasks, reasoning, code generation |

import asyncio

class DeepSeekRateLimiter:
    """Caps the number of in-flight requests (default: the V4 Flash limit)."""

    def __init__(self, max_concurrent=2500):
        self.max_concurrent = max_concurrent
        self._slots = asyncio.Semaphore(max_concurrent)

    async def acquire(self):
        # Suspends until a slot is free, without polling
        await self._slots.acquire()

    def release(self):
        # Call once per acquire(), ideally in a finally block
        self._slots.release()
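
A limiter only helps if every request passes through it and gives its slot back even when the request fails. The sketch below uses `asyncio.Semaphore` from the standard library directly. `bounded_gather` and `worker` are hypothetical names, and the default of 500 is the V4 Pro limit from the table.

```python
import asyncio


async def bounded_gather(worker, items, max_concurrent=500):
    """Run `worker(item)` for every item with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(item):
        # `async with` releases the slot even if worker raises
        async with semaphore:
            return await worker(item)

    # gather preserves input order in its result list
    return await asyncio.gather(*(guarded(item) for item in items))
```

Here `worker` would be a coroutine that sends one request, for example through an async client.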

Production Checklist

  • Use V4 Flash for high-volume, V4 Pro for quality-critical tasks
  • Reuse system prompts for cache hits (about 98% off cached input tokens at the Flash prices above)
  • Disable thinking for simple tasks (saves output tokens)
  • Set max_tokens appropriately — unused budget is still reserved
  • Use streaming for better UX on interactive apps
  • Monitor concurrency limits — 2500 (Flash), 500 (Pro)
  • Implement retry logic with exponential backoff for rate limits
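
The last item can be sketched as a small wrapper. Which exception signals a rate limit depends on the client library (typically an HTTP 429 response), so the caller supplies a predicate. `with_backoff`, `is_retryable`, and the default delays are hypothetical choices, not values taken from DeepSeek's documentation.

```python
import random
import time


def with_backoff(call, is_retryable, max_attempts=5, base_delay=1.0,
                 max_delay=30.0, sleep=time.sleep):
    """Run `call()`, retrying retryable failures with exponentially growing, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            # Give up on the last attempt or on errors that retrying will not fix
            if attempt == max_attempts - 1 or not is_retryable(exc):
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out clients that were throttled at the same moment
            sleep(delay + random.uniform(0, delay / 2))
```

The `sleep` parameter exists so the wrapper can be tested without waiting; in production the default `time.sleep` applies.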

Where Next

For broader prompt engineering techniques, see the Prompt Engineering Deep Dive.

For Claude-specific and GPT-specific workflows, see Claude Workflows and OpenAI Workflows.