DeepSeek Workflows & Best Practices
Thinking Mode — When to Use
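    from openai import OpenAI

    # DeepSeek exposes an OpenAI-compatible endpoint
    client = OpenAI(api_key="sk-your-deepseek-key", base_url="https://api.deepseek.com")

    # Complex tasks: enable thinking.
    # The OpenAI Python SDK rejects unknown keyword arguments,
    # so the thinking field is passed through extra_body.
    response = client.chat.completions.create(
        model="deepseek-v4-pro",
        messages=[{"role": "user", "content": "Design a distributed rate limiter"}],
        reasoning_effort="high",
        extra_body={"thinking": {"type": "enabled"}},
    )

    # Simple tasks: disable thinking (V4 Flash only)
    response = client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=[{"role": "user", "content": "What is 2+2?"}],
        extra_body={"thinking": {"type": "disabled"}},
    )

| Task Type | Model | Thinking | reasoning_effort |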
|---|---|---|---|
| Simple Q&A, classification | V4 Flash | Disabled | — |
| Summarization, translation | V4 Flash | Disabled | — |
| Code generation | V4 Pro | Enabled | medium |
| Debugging, refactoring | V4 Pro | Enabled | high |
| Architecture design | V4 Pro | Enabled | high |
| Complex math, research | V4 Pro | Enabled | high |
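The table above can be turned into a small lookup so that callers never hand-pick settings. The following is a minimal sketch, not an official helper: the task-type keys, the `request_kwargs` name, and the fallback to V4 Pro with high effort are assumptions made for illustration. It passes the thinking field through `extra_body`, which is how the OpenAI Python SDK forwards non-standard fields.

```python
# Hypothetical lookup built from the table above.
# Each entry: (model, thinking enabled?, reasoning_effort or None).
TASK_SETTINGS = {
    "simple_qa":       ("deepseek-v4-flash", False, None),
    "classification":  ("deepseek-v4-flash", False, None),
    "summarization":   ("deepseek-v4-flash", False, None),
    "translation":     ("deepseek-v4-flash", False, None),
    "code_generation": ("deepseek-v4-pro", True, "medium"),
    "debugging":       ("deepseek-v4-pro", True, "high"),
    "refactoring":     ("deepseek-v4-pro", True, "high"),
    "architecture":    ("deepseek-v4-pro", True, "high"),
    "math":            ("deepseek-v4-pro", True, "high"),
    "research":        ("deepseek-v4-pro", True, "high"),
}

# Assumed fallback for unknown task types: favor quality.
DEFAULT_SETTINGS = ("deepseek-v4-pro", True, "high")


def request_kwargs(task_type, messages):
    """Build keyword arguments for client.chat.completions.create()."""
    model, thinking, effort = TASK_SETTINGS.get(task_type, DEFAULT_SETTINGS)
    kwargs = {
        "model": model,
        "messages": messages,
        "extra_body": {"thinking": {"type": "enabled" if thinking else "disabled"}},
    }
    if effort is not None:
        # reasoning_effort only applies when thinking is enabled
        kwargs["reasoning_effort"] = effort
    return kwargs
```

A call would then look like `client.chat.completions.create(**request_kwargs("debugging", messages))`.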
Cost Optimization
Model Routing
    def route(task_type):
        if task_type in ["classification", "simple_qa", "summarization"]:
            return "deepseek-v4-flash"  # $0.14/$0.28, cheapest
        elif task_type in ["code_gen", "analysis", "writing"]:
            return "deepseek-v4-pro"  # $0.435/$0.87, quality
        else:
            return "deepseek-v4-pro"  # Default to quality

Maximize Cache Hits
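To get a feel for why cache hits matter, here is a back-of-the-envelope sketch using the V4 Flash prices quoted in this guide ($0.14 input, $0.0028 cache hit). It assumes both figures are USD per 1M input tokens and that the whole system prompt is served from cache on repeat calls. The token counts and the `input_cost` helper are made up for illustration.

```python
# Prices quoted in this guide; the per-1M-token unit is an assumption.
FLASH_INPUT_PER_M = 0.14        # USD per 1M input tokens on a cache miss
FLASH_CACHE_HIT_PER_M = 0.0028  # USD per 1M input tokens on a cache hit


def input_cost(prompt_tokens, cached_tokens):
    """Input cost in USD for one request, given how many tokens hit the cache."""
    uncached = prompt_tokens - cached_tokens
    return (uncached * FLASH_INPUT_PER_M
            + cached_tokens * FLASH_CACHE_HIT_PER_M) / 1_000_000


# Illustrative workload: 10,000 requests, each with a 2,000-token
# system prompt and a 100-token user query.
cold = 10_000 * input_cost(2_100, cached_tokens=0)      # unique prompt every time
warm = 10_000 * input_cost(2_100, cached_tokens=2_000)  # shared prompt cached

print(f"no cache reuse: ${cold:.2f}")   # about $2.94
print(f"with cache hits: ${warm:.3f}")  # about $0.196
```

The uncached user query still costs full price, so the saving grows with the share of each request taken up by the stable prefix.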
    # ✅ Good: Reuse system prompts, cache hits at $0.0028/1M
    SYSTEM_PROMPT = "You are a helpful assistant..."  # Cached after first call
    for query in user_queries:
        client.chat.completions.create(
            model="deepseek-v4-flash",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": query},
            ],
        )

    # ❌ Bad: Unique system prompt per request, full price each time
    for query in user_queries:
        client.chat.completions.create(
            model="deepseek-v4-flash",
            messages=[
                {"role": "system", "content": f"Context: {random_context()}"},
                {"role": "user", "content": query},
            ],
        )

Batch Operations
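    # Route high-volume, non-urgent tasks to V4 Flash
    import concurrent.futures

    def process_batch(items, model="deepseek-v4-flash", max_workers=8):
        # max_workers caps the number of in-flight requests; 8 is an arbitrary
        # default, tune it to stay under your concurrency limit.
        with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = [
                executor.submit(
                    client.chat.completions.create,
                    model=model,
                    messages=[{"role": "user", "content": f"Classify: {item}"}],
                    max_tokens=50,
                    # Non-standard field, so it goes through extra_body in the OpenAI SDK
                    extra_body={"thinking": {"type": "disabled"}},
                )
                for item in items
            ]
            # Results come back in the same order as items
            return [f.result() for f in futures]

Agent Sharing Pattern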
One DeepSeek API key powers all your coding agents:
    # Set once, use everywhere
    export OPENAI_API_KEY=sk-your-deepseek-key
    export OPENAI_BASE_URL=https://api.deepseek.com
    # Now all OpenAI-compatible tools automatically use DeepSeek

    export ANTHROPIC_AUTH_TOKEN=sk-your-deepseek-key
    export ANTHROPIC_BASE_URL=https://api.deepseek.com/anthropic
    # Claude Code and Anthropic-compatible tools use DeepSeek

Rate Limit Management
| Model | Concurrency | Strategy |
|---|---|---|
| V4 Flash | 2,500 | High-volume, classification, real-time |
| V4 Pro | 500 | Complex tasks, reasoning, code generation |
    import asyncio

    class DeepSeekRateLimiter:
        """Caps the number of in-flight requests (default: the V4 Flash limit)."""

        def __init__(self, max_concurrent=2500):
            self.max_concurrent = max_concurrent
            self._semaphore = asyncio.Semaphore(max_concurrent)

        async def acquire(self):
            # Suspends until a slot is free, without polling
            await self._semaphore.acquire()

        def release(self):
            # Frees the slot and wakes one waiting caller, if any
            self._semaphore.release()

Production Checklist
- Use V4 Flash for high-volume, V4 Pro for quality-critical tasks
- Reuse system prompts for cache hits (about 98% input savings on V4 Flash at the prices quoted above)
- Disable thinking for simple tasks (saves output tokens)
- Set max_tokens appropriately; unused budget is still reserved
- Use streaming for better UX on interactive apps
- Monitor concurrency limits — 2500 (Flash), 500 (Pro)
- Implement retry logic with exponential backoff for rate limits
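The last checklist item can be sketched as a small wrapper. This is a generic illustration rather than a DeepSeek-specific API: `with_backoff`, its parameters, and the default delays are all hypothetical, and the jitter range is one common choice among several.

```python
import random
import time


def with_backoff(call, max_retries=5, base_delay=1.0, max_delay=30.0,
                 is_retryable=lambda exc: True, sleep=time.sleep):
    """Run call(); on a retryable error, wait and try again.

    The wait doubles after each failed attempt (base_delay, 2x, 4x, ...),
    is capped at max_delay, and is scaled by random jitter so that many
    clients do not retry in lockstep.
    """
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception as exc:
            # Give up when out of retries or when the error is not retryable
            if attempt == max_retries or not is_retryable(exc):
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))
```

With the OpenAI Python SDK, passing `is_retryable=lambda e: isinstance(e, openai.RateLimitError)` would restrict retries to rate-limit errors, and `call` would be a lambda wrapping the `client.chat.completions.create` request. The SDK client also accepts a `max_retries` option of its own, which may be enough for simple cases.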
Where Next
For broader prompt engineering techniques, see the Prompt Engineering Deep Dive.
For Claude-specific and GPT-specific workflows, see Claude Workflows and OpenAI Workflows.