v2026-05 Updated May 3, 2026 · V4 models

DeepSeek API Reference

Complete developer reference for the DeepSeek API. The API is compatible with the OpenAI ChatCompletions format and the Anthropic Messages format — change two lines of code to migrate from either. Access powerful reasoning with V4-Pro and high-speed inference with V4-Flash.

https://api.deepseek.com
Base URL
$0.14/1M
V4-Flash input
1M tokens
Context window
5M free
New accounts
Jul 24, 2026
Legacy alias retirement

Authentication

All API requests require an API key passed as a Bearer token in the Authorization header. API keys are created in the DeepSeek Platform console.

⚠ Security

Never expose your API key in client-side code, version control, or logs. Use environment variables. If compromised, immediately revoke in the Platform console — keys cannot be recovered once deleted.

HTTP Header
Authorization: Bearer $DEEPSEEK_API_KEY
Content-Type: application/json
Set API Key — Shell
# Add to ~/.bashrc or ~/.zshrc
export DEEPSEEK_API_KEY="sk-your-key-here"

# Verify it's set
echo $DEEPSEEK_API_KEY

Your First API Call

Make your first request in under 2 minutes. Install the OpenAI SDK — no DeepSeek-specific package needed.

# pip install openai
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.getenv("DEEPSEEK_API_KEY"),
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    stream=False,
)

print(response.choices[0].message.content)
// npm install openai
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: process.env.DEEPSEEK_API_KEY,
  baseURL: 'https://api.deepseek.com',
});

const response = await client.chat.completions.create({
  model: 'deepseek-v4-flash',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Hello!' },
  ],
  stream: false,
});

console.log(response.choices[0].message.content);
curl https://api.deepseek.com/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPSEEK_API_KEY" \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ],
    "stream": false
  }'

Base URLs

Two base URL options — both route to the same models. The /v1 suffix is a compatibility alias for OpenAI SDK users who hardcode the version segment.

Base URL | Format | Notes
https://api.deepseek.com | OpenAI & Anthropic | Recommended — no version suffix needed
https://api.deepseek.com/v1 | OpenAI only | Compatibility alias for OpenAI SDK defaults
https://api.deepseek.com/beta | OpenAI only | Required for FIM & Chat Prefix Completion
https://api.deepseek.com/anthropic | Anthropic | Anthropic Messages API format

Available Models

Two production V4 models and two legacy aliases pending retirement. Always use deepseek-v4-flash or deepseek-v4-pro for new integrations.

⚡ DEFAULT · FAST
DeepSeek-V4-Flash
deepseek-v4-flash

High-speed, cost-efficient frontier model. 284B MoE (13B active). 79% SWE-bench. 83 tok/s. Recommended for chat, coding, summaries, high-volume pipelines.

1M tokens
Context
384K
Max output
$0.14/1M
Input
$0.28/1M
Output
🧠 FLAGSHIP
DeepSeek-V4-Pro
deepseek-v4-pro

Frontier reasoning model. 1.6T MoE (49B active). 80.6% SWE-bench. Codeforces #1 (3206 rating). Use Think Max for hard math and agentic tasks.

1M tokens
Context
384K
Max output
$0.435/1M
Input (promo)
$0.87/1M
Output (promo)
⚠ DEPRECATED
deepseek-chat
deepseek-chat → deepseek-v4-flash (non-thinking)

Legacy alias. Currently routes to V4-Flash non-thinking mode. Retires July 24, 2026 15:59 UTC. Will error with no fallback after that date.

Jul 24, 2026
Retirement
V4-Flash
Routes to
⚠ DEPRECATED
deepseek-reasoner
deepseek-reasoner → deepseek-v4-flash (thinking)

Legacy alias. Currently routes to V4-Flash thinking mode. Retires July 24, 2026 15:59 UTC. Migrate to deepseek-v4-flash or deepseek-v4-pro with thinking enabled.

Jul 24, 2026
Retirement
V4-Flash
Routes to
🚨 Migration Deadline

On July 24, 2026 at 15:59 UTC, deepseek-chat and deepseek-reasoner will start returning errors with no fallback. Migration is a single-line change: update the model parameter; everything else stays identical. Search your entire codebase, including retry handlers, config files, and fallback logic.

List Models (API)

Retrieve all currently available model IDs programmatically.

GET /models
Returns a list of available models with their IDs and metadata. Useful for verifying model availability before sending requests.
cURL
curl https://api.deepseek.com/models \
  -H "Authorization: Bearer $DEEPSEEK_API_KEY"
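For programmatic model selection, the ID list returned by GET /models can feed a small guard before sending requests. A sketch — the helper name and fallback policy are illustrative, not part of the API; with the OpenAI SDK the ID list would come from `[m.id for m in client.models.list()]`:

```python
# Hypothetical helper (not part of the API): choose a servable model ID
# from the list returned by GET /models, skipping the retiring aliases.
DEPRECATED = {"deepseek-chat", "deepseek-reasoner"}

def pick_model(available_ids, preferred="deepseek-v4-flash"):
    """Return `preferred` if it is served, else the first non-deprecated ID."""
    if preferred in available_ids:
        return preferred
    for model_id in available_ids:
        if model_id not in DEPRECATED:
            return model_id
    raise RuntimeError("no supported model available")

# In practice: ids = [m.id for m in client.models.list()]
ids = ["deepseek-chat", "deepseek-v4-pro", "deepseek-v4-flash"]
print(pick_model(ids))  # deepseek-v4-flash
```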

POST /chat/completions

The primary API endpoint. Creates a model response for the given conversation. Returns a completion object or a stream of SSE chunks when stream: true.

POST https://api.deepseek.com/chat/completions
Creates a model response for the given chat conversation. Supports streaming, tool calling, JSON mode, thinking modes, and multi-turn conversation.

Request Parameters

Core Parameters

Parameter | Type | Required | Description
model | string | Required | Model ID. Use deepseek-v4-flash or deepseek-v4-pro. Legacy: deepseek-chat, deepseek-reasoner (retiring Jul 24, 2026).
messages | array | Required | Conversation history. Each message has role (system, user, assistant, or tool) and content. Max context: 1M tokens.
stream | boolean | Optional | If true, returns Server-Sent Events (SSE) chunks. Each chunk is a data: line with a partial completion object. Terminated with data: [DONE]. Default: false.
max_tokens | integer | Optional | Maximum output tokens. Default varies by model. For Think Max mode, set to at least 32768. Hard maximum: 384000 (384K) for V4 models.
temperature | number | Optional | Sampling temperature 0–2. Higher = more random. Default: 1.0. Recommended: 0.6–0.7 for thinking mode. Note: ignored in pure reasoning modes.
top_p | number | Optional | Nucleus sampling. Considers tokens with cumulative probability up to top_p. Default: 1.0. Recommended: 0.95 for reasoning models.
response_format | object | Optional | Set {"type": "json_object"} to force valid JSON output. Set {"type": "text"} (default) for natural language. See JSON Mode.
tools | array | Optional | Array of tool definitions in OpenAI function-calling format. Each tool has type: "function" and a function object with name, description, parameters.
tool_choice | string or object | Optional | Controls tool selection. "auto" (model decides), "none" (no tools), "required" (must call a tool), or {"type": "function", "function": {"name": "..."}} to force a specific function.
stop | string or array | Optional | Up to 4 stop sequences. Generation halts when any sequence is produced. Default: null.
frequency_penalty | number | Optional | Range −2.0 to 2.0. Positive values penalize tokens proportional to their frequency, reducing repetition. Default: 0.
presence_penalty | number | Optional | Range −2.0 to 2.0. Positive values penalize tokens that have appeared at all, encouraging new topics. Default: 0.
logprobs | boolean | Optional | If true, returns log probabilities for each output token. Default: false.
top_logprobs | integer | Optional | Number of top-k token log probability alternatives to return per token. Requires logprobs: true. Range: 0–20.
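The optional parameters above compose freely in one request body. A sketch of a combined payload — the specific values are illustrative, not recommendations beyond those in the table; the same dict can be passed as `client.chat.completions.create(**request_body)` or sent as the JSON body of a raw POST to /chat/completions:

```python
import json

# Illustrative request body combining several optional parameters.
request_body = {
    "model": "deepseek-v4-flash",
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize SSE in one sentence."},
    ],
    "temperature": 0.7,        # below default for more focused output
    "top_p": 0.95,
    "max_tokens": 512,
    "stop": ["\n\n"],          # halt at the first blank line
    "frequency_penalty": 0.5,  # discourage repetition
    "stream": False,
}

# Serializes cleanly for a raw HTTP POST as well.
payload = json.dumps(request_body)
```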

Thinking Mode Parameters

Control reasoning depth via extra_body in the OpenAI SDK, or directly in the request body when using cURL. V4 models support three reasoning effort levels.

Parameter | Type | Values | Description
thinking.type | string | enabled · disabled | Enables or disables chain-of-thought reasoning. When enabled, responses include a reasoning_content field with the full thinking chain. Pass via extra_body={"thinking": {"type": "enabled"}}.
reasoning_effort | string | low · high · none | Controls reasoning depth when thinking is enabled. high = Think Max (uses up to 50K+ tokens internally). low = faster CoT. none = no thinking (same as thinking.type: disabled).
Python — Think Max
response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Prove √2 is irrational."}],
    max_tokens=65536,
    reasoning_effort="high",
    extra_body={"thinking": {"type": "enabled"}},
)

# reasoning_content = full chain-of-thought
chain = response.choices[0].message.reasoning_content
answer = response.choices[0].message.content

# ⚠ CRITICAL for multi-turn: never include reasoning_content
# in conversation history — only include content

Response Object

The API returns a completion object on success. In streaming mode, multiple SSE chunks with partial deltas are emitted, terminated by data: [DONE].

Response · 200 OK
{
  "id": "chatcmpl-930c60df-bf64-41c9-a88e-3ec75f81e00e",
  "object": "chat.completion",
  "created": 1746259200,
  "model": "deepseek-v4-flash",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you today?",
      "reasoning_content": null      // populated in thinking mode
    },
    "finish_reason": "stop"          // stop | length | tool_calls | content_filter
  }],
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 10,
    "total_tokens": 26,
    "prompt_cache_hit_tokens": 12,   // billed at 10% (cache hit)
    "prompt_cache_miss_tokens": 4    // billed at 100% (cache miss)
  }
}

finish_reason Values

Value | Meaning
stop | Generation completed naturally or reached a stop sequence.
length | Generation stopped due to the max_tokens limit. Increase max_tokens to get a complete response.
tool_calls | Model decided to call one or more tools. Process the tool calls and continue the conversation with tool results.
content_filter | Generation stopped by the content safety filter. Response may be partial or empty.
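Production code should branch on finish_reason before trusting a response. A minimal dispatch sketch — the function name and return strings are illustrative, not part of the API:

```python
# Hypothetical dispatch: map each finish_reason to the action the
# table above prescribes, and fail loudly on anything unrecognized.
def check_finish(finish_reason):
    if finish_reason == "stop":
        return "complete"
    if finish_reason == "length":
        return "truncated: raise max_tokens and retry"
    if finish_reason == "tool_calls":
        return "execute tools, then continue the conversation"
    if finish_reason == "content_filter":
        return "filtered: response may be partial or empty"
    raise ValueError(f"unknown finish_reason: {finish_reason}")

print(check_finish("length"))  # truncated: raise max_tokens and retry
```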

Function Calling / Tool Use

Define tools in the same JSON schema format as OpenAI. The model decides which tools to call and with what arguments. Both V4-Flash and V4-Pro support function calling in all thinking modes.

Python — Tool Definition & Response
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    }
}]

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto",
)

# If finish_reason == "tool_calls", handle the call
msg = response.choices[0].message
if msg.tool_calls:
    for call in msg.tool_calls:
        print(call.function.name, call.function.arguments)
💡 Think + Tools

Enable thinking mode with tools to let the model reason about which tools to call before making the calls. This significantly improves accuracy on complex multi-step agentic tasks. Set extra_body={"thinking": {"type": "enabled"}} together with reasoning_effort="high".
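The example above stops once the model requests a tool; completing the loop means appending the assistant turn plus one "tool" message per call, then requesting again. A sketch of that second half — `build_tool_followup`, the call ID, and the weather result are illustrative stand-ins, and the message shapes follow the OpenAI function-calling format:

```python
import json

# Sketch: after finish_reason == "tool_calls", append the assistant turn
# and one "tool" message per call, then send the list as messages= again.
def build_tool_followup(messages, assistant_msg, results_by_call_id):
    """Return the message list for the follow-up request."""
    followup = list(messages)
    followup.append({
        "role": "assistant",
        "content": assistant_msg.get("content"),
        "tool_calls": assistant_msg["tool_calls"],
    })
    for call in assistant_msg["tool_calls"]:
        followup.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(results_by_call_id[call["id"]]),
        })
    return followup

msgs = [{"role": "user", "content": "Weather in Tokyo?"}]
assistant = {
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather",
                     "arguments": '{"location": "Tokyo"}'},
    }],
}
followup = build_tool_followup(msgs, assistant, {"call_1": {"temp_c": 21}})
# Pass `followup` as messages= in the next create() call.
```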

JSON Mode

Force the model to produce a valid JSON object by setting response_format. The model will always return parseable JSON — no need to strip markdown fences or handle parse errors.

Python — JSON Mode
import json

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{
        "role": "user",
        "content": "Return a JSON object with fields: name (string), age (number), city (string). Example person: Alice, 30, Tokyo.",
    }],
    response_format={"type": "json_object"},
)

data = json.loads(response.choices[0].message.content)
print(data["name"])  # "Alice"
ℹ Note

The system prompt or user message should describe the expected JSON structure. The model uses this description to format its output. JSON mode cannot be combined with FIM completion.

Streaming (SSE)

Set stream: true to receive Server-Sent Events. Each chunk delivers a delta of new content. V4-Flash delivers ~83 tokens/second. Use streaming for chat UIs and real-time applications.

stream = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Tell me a joke."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
const stream = await client.chat.completions.create({
  model: 'deepseek-v4-flash',
  messages: [{ role: 'user', content: 'Tell me a joke.' }],
  stream: true,
});

for await (const chunk of stream) {
  const text = chunk.choices[0]?.delta?.content ?? '';
  process.stdout.write(text);
}

Raw SSE Format

SSE Stream
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"delta":{"role":"assistant","content":""},"index":0}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"delta":{"content":"Hello"},"index":0}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"delta":{"content":"!"},"index":0}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"delta":{},"finish_reason":"stop","index":0}]}

data: [DONE]
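If you consume the raw stream without an SDK, each data: line must be parsed individually. A minimal parser sketch for the format above (the official SDKs handle this for you; the function name is illustrative):

```python
import json

# Minimal SSE-line parser for the stream format above.
# Returns the text delta, or None for [DONE], empty, and non-data lines.
def parse_sse_line(line):
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload == "[DONE]":
        return None
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content")

line = 'data: {"choices":[{"delta":{"content":"Hello"},"index":0}]}'
print(parse_sse_line(line))  # Hello
```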

Context Caching

Context caching reduces costs by 90% for repeated prompt prefixes. It is fully automatic — no code changes, no explicit cache keys. Repeated content at the beginning of your prompt is served at cache-hit pricing.

💡 Cache Optimization

Structure prompts to maximize cache hits: put static content first (system prompts, tool definitions, reference documents) and variable content last (user messages, query-specific context). Cache persists for ~30 minutes of inactivity. Content below 64 tokens is never cached.
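The static-first ordering can be enforced with a tiny message builder; a sketch under the assumption that the system prompt and reference document are byte-identical across requests (SYSTEM_PROMPT and REFERENCE_DOC are illustrative placeholders):

```python
# Sketch: keep the static prefix identical across requests so it stays
# cache-eligible, and append only the variable user query at the end.
SYSTEM_PROMPT = "You are a support agent for ExampleCorp."
REFERENCE_DOC = "<product manual text, identical on every request>"

def build_messages(user_query):
    return [
        {"role": "system", "content": SYSTEM_PROMPT + "\n\n" + REFERENCE_DOC},
        {"role": "user", "content": user_query},  # variable part goes last
    ]

a = build_messages("How do I reset my password?")
b = build_messages("What is the warranty period?")
# Identical static prefix across calls → served at cache-hit pricing.
assert a[0] == b[0]
```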

Reading Cache Usage

Every response includes precise cache hit/miss breakdown in the usage object for billing analysis.

Usage object with cache fields
"usage": {
  "prompt_tokens": 1024,
  "completion_tokens": 256,
  "total_tokens": 1280,
  "prompt_cache_hit_tokens": 900,   // billed at $0.014/1M (V4-Flash)
  "prompt_cache_miss_tokens": 124   // billed at $0.14/1M (V4-Flash)
}

FIM Completion Beta

Fill-In-Middle (FIM) completion predicts a missing segment given both prefix and suffix context. Used for IDE code completion, story continuation, and document infilling. Requires base_url="https://api.deepseek.com/beta".

POST https://api.deepseek.com/beta/completions Beta
FIM completion API. Compatible with OpenAI FIM API format. Only available with base_url="https://api.deepseek.com/beta". Only supported in non-thinking mode.
Parameter | Type | Required | Description
model | string | Required | deepseek-v4-pro (only model supporting FIM via beta)
prompt | string | Required | Code/text prefix — content before the gap to fill.
suffix | string | Optional | Code/text suffix — content after the gap. If omitted, the model generates until completion.
max_tokens | integer | Optional | Maximum tokens to generate for the middle segment. Default: 4096.
Python — FIM Code Completion
# Must use beta base URL for FIM
beta_client = OpenAI(
    api_key=os.getenv("DEEPSEEK_API_KEY"),
    base_url="https://api.deepseek.com/beta",
)

response = beta_client.completions.create(
    model="deepseek-v4-pro",
    prompt="def fibonacci(n):\n    if n <= 1:\n        return n\n",
    suffix="\n    return fibonacci(n-1) + fibonacci(n-2)",
    max_tokens=256,
)

print(response.choices[0].text)  # filled middle segment

Chat Prefix Completion Beta

Force the assistant to begin its response with a specific prefix. Useful for output format control, structured completions, and guided generation. Set the last message role to assistant with prefix: true. Requires base_url="https://api.deepseek.com/beta".

Python — Prefix Completion
response = beta_client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "user", "content": "Write a haiku about programming."},
        {"role": "assistant", "content": "Fingers on keyboard,", "prefix": True},
    ],
)

# Model completes the haiku starting from "Fingers on keyboard,"

Anthropic API Format

The DeepSeek API also supports the Anthropic Messages API format at https://api.deepseek.com/anthropic. This allows existing Anthropic SDK code to work with DeepSeek models by changing the base URL and API key only.

Python — Anthropic SDK
# pip install anthropic
import anthropic

client = anthropic.Anthropic(
    api_key=os.getenv("DEEPSEEK_API_KEY"),
    base_url="https://api.deepseek.com/anthropic",
)

message = client.messages.create(
    model="deepseek-v4-flash",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}],
)

print(message.content[0].text)

Multi-Turn Conversations

The DeepSeek API is stateless — each call is independent. To maintain conversation context, include the full message history on every request. This is identical to the OpenAI pattern.

🚨 Thinking Mode Multi-Turn

When using thinking mode, the response contains both content and reasoning_content. When building conversation history, only include content — never include reasoning_content. Including the thinking chain in history significantly degrades subsequent response quality.

Python — Stateless Multi-Turn
history = [
    {"role": "system", "content": "You are a helpful assistant."}
]

def chat(user_msg):
    history.append({"role": "user", "content": user_msg})
    resp = client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=history,
    )
    assistant_content = resp.choices[0].message.content
    # Only append .content — NOT .reasoning_content
    history.append({"role": "assistant", "content": assistant_content})
    return assistant_content

chat("What is the capital of France?")
chat("What language do they speak there?")  # context is preserved

Agent & IDE Integrations

The DeepSeek API works directly with popular agent frameworks and coding tools — no wrapper code required. Switch the backend model in a config file.

Tool | Integration Type | Config
Claude Code | Agentic CLI coding assistant | ANTHROPIC_BASE_URL=https://api.deepseek.com/anthropic
GitHub Copilot | IDE code completion (VS Code) | Set custom model endpoint in Copilot settings
OpenCode | Terminal AI coding assistant | OpenAI-compatible — set baseUrl and apiKey
Continue.dev | VS Code / JetBrains plugin | Add DeepSeek provider with base_url and key in config.json
Cursor | AI-first IDE | Add custom OpenAI-compatible endpoint in Models settings
LangChain | LLM orchestration framework | ChatOpenAI(base_url="https://api.deepseek.com", model="deepseek-v4-flash")

GET /user/balance

Returns your current account balance. Useful for pre-flight checks before starting long-running jobs. The API returns both topped-up balance (purchased) and granted balance (free credits). Granted balance is consumed first.

GET https://api.deepseek.com/user/balance
cURL
curl https://api.deepseek.com/user/balance \
  -H "Authorization: Bearer $DEEPSEEK_API_KEY"
Response · 200 OK
{
  "is_available": true,
  "balance_infos": [
    {
      "currency": "CNY",
      "total_balance": "0.98",
      "granted_balance": "0.00",
      "topped_up_balance": "0.98"
    }
  ]
}
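A pre-flight check is just a parse of this response before launching a job. A sketch — the helper name and the minimum-balance threshold are illustrative, and summing across currencies is a simplification:

```python
# Hypothetical pre-flight check: parse the /user/balance response shown
# above and fail fast before starting a long-running job.
def has_sufficient_balance(balance_response, minimum=1.0):
    if not balance_response["is_available"]:
        return False
    total = sum(float(info["total_balance"])
                for info in balance_response["balance_infos"])
    return total >= minimum

resp = {
    "is_available": True,
    "balance_infos": [{"currency": "CNY", "total_balance": "0.98",
                       "granted_balance": "0.00",
                       "topped_up_balance": "0.98"}],
}
print(has_sufficient_balance(resp, minimum=0.5))  # True
```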

Token Pricing

Pay-as-you-go per token. No monthly subscription. No minimum spend. Granted balance (free credits) is consumed before topped-up balance. All prices in USD per 1M tokens.

💡 V4-Pro Promo

V4-Pro is at 75% discount until May 31, 2026: $0.435/1M input (regular $1.74/1M), $0.87/1M output (regular $3.48/1M). After May 31, regular pricing applies. Cache hit prices are 1/10 of cache miss prices across all models (effective 2026-04-26).

Model | Context | Max Out | Input (cache hit) | Input (cache miss) | Output | Notes
deepseek-v4-flash | 1M | 384K | $0.014/1M | $0.14/1M | $0.28/1M | Default — recommended
deepseek-v4-pro | 1M | 384K | $0.044/1M | $0.435/1M (75% off) | $0.87/1M | Promo until May 31
deepseek-v4-pro | 1M | 384K | $0.174/1M | $1.74/1M | $3.48/1M | Regular (after May 31)
deepseek-chat (deprecated) | 128K | 8K | $0.014/1M | $0.14/1M | $0.28/1M | Retires Jul 24, 2026
deepseek-reasoner (deprecated) | 128K | 64K (incl. CoT) | $0.014/1M | $0.14/1M | $0.28/1M | Retires Jul 24, 2026

Cache hit price = 1/10 of cache miss price. New accounts: $5 in free API credits (≈ 35M input tokens on V4-Flash). Always verify at api-docs.deepseek.com.
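Per-request cost follows directly from the usage object and the table above. A sketch using the V4-Flash prices (the helper name and price-table structure are mine, not part of the API):

```python
# Sketch: estimate request cost in USD from the usage object, using the
# V4-Flash prices in the table above (USD per 1M tokens).
PRICES = {  # (cache_hit_input, cache_miss_input, output) per 1M tokens
    "deepseek-v4-flash": (0.014, 0.14, 0.28),
}

def estimate_cost(model, usage):
    hit_p, miss_p, out_p = PRICES[model]
    return (usage["prompt_cache_hit_tokens"] * hit_p
            + usage["prompt_cache_miss_tokens"] * miss_p
            + usage["completion_tokens"] * out_p) / 1_000_000

usage = {"prompt_cache_hit_tokens": 900, "prompt_cache_miss_tokens": 124,
         "completion_tokens": 256}
print(f"${estimate_cost('deepseek-v4-flash', usage):.8f}")
```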

Error Codes

HTTP status codes and error response format. All errors include a JSON body with error.message and error.type fields.

400
Bad Request / Invalid Format

Malformed request — invalid JSON, missing required field, or unsupported parameter combination. Check that messages is a non-empty array, model is a valid model ID, and parameter types are correct. FIM + JSON mode is not supported simultaneously.

401
Authentication Failure

Invalid, missing, or revoked API key. Verify Authorization: Bearer sk-... header. Check the API key is active in the Platform console. Do not wrap the key in extra quotes.

402
Insufficient Balance

Account balance is insufficient to complete the request. Add credits at platform.deepseek.com/billing. Use GET /user/balance to pre-check balance before long-running jobs.

422
Unprocessable Entity

Request structure is valid but contains logical errors — e.g., FIM mode with thinking enabled, prefix completion without prefix: true, or incompatible parameter combinations. Check the error message body for the specific violation.

429
Rate Limit Exceeded

Too many requests. Implement exponential backoff with jitter. Check the Retry-After header if present. For production systems requiring guaranteed capacity, consider routing through AWS Bedrock or Azure AI for dedicated throughput.

500
Internal Server Error

Unexpected error on the DeepSeek server. Retry with exponential backoff. If the error persists across multiple retries, check status.deepseek.com for ongoing incidents.

503
Service Unavailable / Overloaded

Server is temporarily overloaded. Implement retry logic with at least 30 seconds delay. For mission-critical workloads requiring 99.9%+ SLA, route through a cloud provider (AWS Bedrock, Azure AI, Together AI, Fireworks) for dedicated capacity.

Error Response Format

Error Response · 4xx
{
  "error": {
    "message": "Invalid API key provided: sk-xxx...xxx",
    "type": "invalid_request_error",
    "code": "invalid_api_key"
  }
}
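Client-side handling usually reduces to three buckets: retry, top up, or fix the request. A classification sketch over the error body above — the status-code groupings follow the Error Codes section, while the function and bucket names are illustrative:

```python
# Sketch: classify an error response into an action bucket.
RETRYABLE = {429, 500, 503}

def classify_error(status_code, body):
    msg = body["error"]["message"]
    if status_code in RETRYABLE:
        return ("retry", msg)      # back off and retry
    if status_code == 402:
        return ("top_up", msg)     # add credits; retrying won't help
    return ("fix_request", msg)    # 400/401/422: correct and resend

body = {"error": {"message": "Invalid API key provided: sk-xxx...xxx",
                  "type": "invalid_request_error",
                  "code": "invalid_api_key"}}
print(classify_error(401, body)[0])  # fix_request
```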

Rate Limits

DeepSeek does not publish fixed public rate limits — limits are applied dynamically based on account tier and server load. Free-tier accounts (using granted balance) are subject to stricter limits than paid accounts.

ℹ Retry Strategy

Implement exponential backoff on 429 and 503 responses: start at 1s, double on each retry, cap at 60s, add 0–1s jitter. Check the Retry-After header first. For high-volume production workloads with SLA requirements, DeepSeek recommends routing through cloud provider endpoints (AWS Bedrock, Azure AI) which offer dedicated capacity.

Python — Retry with Backoff
import time, random

def call_with_retry(client, **kwargs):
    max_retries, delay = 5, 1.0
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except Exception as e:
            status = getattr(e, "status_code", None)
            if status in (429, 500, 503) and attempt < max_retries - 1:
                jitter = random.uniform(0, 1)
                time.sleep(min(delay + jitter, 60))
                delay *= 2
            else:
                raise

Migration Guide

From OpenAI

Migrating from GPT to DeepSeek requires two line changes. All request/response structure, streaming, tool calling, and JSON mode work identically.

Python — OpenAI → DeepSeek
# BEFORE (OpenAI)
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
model = "gpt-4o"

# AFTER (DeepSeek) — only these two lines change
client = OpenAI(
    api_key=os.getenv("DEEPSEEK_API_KEY"),
    base_url="https://api.deepseek.com",  # ← add this
)
model = "deepseek-v4-flash"  # ← change model name

From Legacy Aliases

If your code uses deepseek-chat or deepseek-reasoner, update before July 24, 2026:

Migration Map
# deepseek-chat     → deepseek-v4-flash (no thinking, same pricing)
# deepseek-reasoner → deepseek-v4-flash (with thinking enabled)
#                OR → deepseek-v4-pro   (if you need stronger reasoning)

# For deepseek-chat replacement:
model = "deepseek-v4-flash"  # drop-in, same price, better context

# For deepseek-reasoner replacement:
model = "deepseek-v4-flash"
extra_body = {"thinking": {"type": "enabled"}}

# Or use V4-Pro for stronger reasoning:
model = "deepseek-v4-pro"
reasoning_effort = "high"

API Changelog

2026-04-24
DeepSeek-V4 — V4-Pro & V4-Flash

New: deepseek-v4-pro and deepseek-v4-flash. 1M context window. Available via OpenAI and Anthropic format. Legacy aliases (deepseek-chat, deepseek-reasoner) deprecated — retirement date July 24, 2026. Cache hit price reduced to 1/10 of launch price.

2025-12-01
DeepSeek-V3.2 — deepseek-chat & deepseek-reasoner upgraded

Both aliases upgraded to DeepSeek-V3.2. deepseek-chat → V3.2 non-thinking. deepseek-reasoner → V3.2 thinking. V3.2-Speciale served via temporary endpoint (expired Dec 15, 2025).

2025-05-28
R1-0528 — deepseek-reasoner upgraded + function calling

deepseek-reasoner upgraded to R1-0528. Added function calling and JSON output support to R1. System prompts now supported. AIME 2025 accuracy: 70% → 87.5%.

2025-01-20
DeepSeek-R1 — deepseek-reasoner launched

deepseek-reasoner launched as the reasoning alias, serving DeepSeek-R1. 97.3% MATH-500. Introduced reasoning_content field and chain-of-thought API surface.

2024-08-02
Context Caching & FIM Beta

Launched automatic context caching on disk. Launched FIM Completion Beta and Chat Prefix Completion Beta via api.deepseek.com/beta.