DeepSeek V3.2 rewrites what open-source AI can do. DeepSeek Sparse Attention slashes long-context costs. Integrated thinking+tool-use enables real agent workflows. And Speciale earns Gold Medals at IMO, IOI, ICPC, and CMO 2025.
V3.2 ships in three configurations covering production agentic workloads, early experimentation, and frontier research reasoning — all sharing the same underlying architecture.
**DeepSeek-V3.2**: The production model — built for real agentic workflows. First open-source model to integrate chain-of-thought thinking directly into tool-use, enabling true end-to-end reasoning agents. Trained on 1,800+ environments and 85,000+ synthesized complex instructions. Supports tool-use in both thinking and non-thinking modes. GPT-5-level performance across multiple reasoning benchmarks.
**DeepSeek-V3.2-Exp**: Experimental release — the infrastructure testbed that shipped first. Introduced DeepSeek Sparse Attention (DSA) on top of V3.1-Terminus. Released to prepare the ecosystem and inference stack before the full V3.2 launch. Same architecture as V3.2 — same parameter count, same DSA mechanism, different post-training. Recommended for research into the DSA architecture itself.
**DeepSeek-V3.2-Speciale**: The research frontier variant — built by relaxing length constraints to allow unlimited test-time compute. Achieves Gemini-3.0-Pro-level reasoning and surpasses GPT-5 on complex tasks. Gold Medal at IMO 2025, IOI 2025, ICPC World Finals 2025, and CMO 2025. Intended for deep reasoning research — does not support tool-calling. Higher token usage than V3.2.
For new production projects on the API, prefer deepseek-v4-flash or deepseek-v4-pro. V3.2 weights remain the best open-source option for self-hosted agentic deployments that require the V3.2 architecture specifically.
DeepSeek V3.2 doesn't just improve benchmark numbers — it solves structural problems that blocked open-source models from competing with frontier proprietary systems.
Standard transformer attention scales as O(n²) with sequence length — the reason long contexts are expensive. DeepSeek V3.2 replaces dense self-attention with DSA: a selective, relevance-driven attention mechanism that identifies the top-k most relevant tokens before applying attention.
A lightweight scorer called the Lightning Indexer evaluates token relevance without running full attention; attention is then applied only to the selected top-k tokens. This transforms the computational profile from quadratic, O(n²), to approximately linear, O(n·k) with k ≪ n.
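To make the selection step concrete, here is a minimal sketch of top-k sparse attention driven by a lightweight indexer. The names, shapes, dot-product scorer, and single-head layout are assumptions for illustration; this is not DeepSeek's implementation, and the real Lightning Indexer is paired with MLA and runs far cheaper per score.

```python
# Minimal sketch of indexer-driven top-k sparse attention (PyTorch).
# All names, shapes, and the dot-product scorer are illustrative assumptions,
# not the official DSA code.
import torch
import torch.nn.functional as F

def sparse_attention(q, k, v, idx_q, idx_k, top_k=2048):
    """q, k, v: [seq, d_model]; idx_q, idx_k: [seq, d_idx] with d_idx << d_model."""
    seq, d_model = q.shape
    top_k = min(top_k, seq)
    # 1) Cheap relevance scores from the small indexer projections.
    scores = idx_q @ idx_k.T                      # [seq, seq], low-dimensional
    # 2) Each query keeps only its top-k most relevant keys.
    sel = scores.topk(top_k, dim=-1).indices      # [seq, top_k]
    k_sel, v_sel = k[sel], v[sel]                 # [seq, top_k, d_model]
    # 3) Full attention restricted to the selected keys: O(n*k) instead of O(n^2).
    attn = torch.einsum("qd,qkd->qk", q, k_sel) / d_model ** 0.5
    w = F.softmax(attn, dim=-1)
    return torch.einsum("qk,qkd->qd", w, v_sel)
```

The indexer pass still scores every query–key pair, but at a much smaller hidden size, so the expensive full-width attention is what drops from O(n²) to O(n·k).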
Results: 2–3× faster inference, 30–40% lower memory usage, and dramatically better performance on long-context reasoning without the quality degradation seen in other sparse attention approaches. DSA is combined with Multi-Head Latent Attention (MLA) from V3 for KV cache compression.
Before V3.2, reasoning and tool-use were separate behaviors — a model either reasoned deeply (like R1) or used tools effectively (like V3), but not both simultaneously. V3.2 is the first open-source model to integrate thinking directly into tool-use, enabling true end-to-end reasoning agents.
Three operating modes per request: non-thinking + tools, thinking + tools, and pure thinking (Speciale only).
In "Thinking + tools" mode, V3.2 generates chain-of-thought reasoning about which tools to use, in what order, with what parameters — before making any API calls. This dramatically improves multi-step agent tasks where naive tool selection leads to failure cascades. In τ²-Bench, MCP-Universe, and Tool-Decathlon, V3.2 thinking mode performs competitively with GPT-5 High and Gemini 3.0 Pro.
Previous models trained on limited agentic data struggled with the long-tail of complex interactive environments — real-world agent scenarios that require multi-step planning, error recovery, and tool composition. V3.2 addresses this with a novel synthesis pipeline.
The pipeline systematically generates training data at scale: 1,800+ distinct environments (code execution, web search, file operations, API integrations, and more) paired with 85,000+ synthesized complex instructions, with long-tail agent tasks deliberately emphasized.
This synthesized data drives RL fine-tuning, significantly enhancing generalization and instruction-following robustness in complex interactive environments. DeepSeek's ablation shows that restricting RL to code and search alone doesn't improve agent benchmarks — the diversity of the synthetic environments is the key factor. The result is substantial improvements on τ²-Bench, MCP-Mark, and MCP-Universe over any previous open-source model.
The core MoE structure is maintained from V3/V3.1: 671B total parameters, 37B activated per token. Each MoE layer routes every token to Kᵣ = 8 of 256 routed FFN experts, plus 1 always-active shared expert — resulting in 9 parallel expert computations per token.
Key changes in V3.2 versus V3.1 center on the attention mechanism (DSA) and on expert routing (dynamic biasing):
The dynamic expert biasing (replacing auxiliary load-balancing penalties) improves expert specialization: experts become more focused on specific task types, improving interpretability and stability. Load balance is maintained without adding a penalty term to the loss, which improves overall model quality. Sampling recommendation for local deployment: temperature 1.0, top-p 0.95.
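As a rough sketch of how bias-based routing can keep experts balanced without an auxiliary loss term, the example below selects the top-k experts using biased affinities but combines expert outputs with the unbiased gate values. The sigmoid affinity, the sizes, and the bias-update rule are illustrative assumptions, not the official V3.2 routing code.

```python
# Illustrative top-k expert routing with a per-expert bias (aux-loss-free
# load balancing). Not the official V3.2 implementation.
import torch

def route_tokens(hidden, gate_weight, expert_bias, k=8):
    """hidden: [tokens, d]; gate_weight: [n_experts, d]; expert_bias: [n_experts]."""
    affinity = torch.sigmoid(hidden @ gate_weight.T)             # [tokens, n_experts]
    # The bias shifts only *which* experts are selected, not the combine weights,
    # so load can be balanced without adding a penalty term to the loss.
    selected = (affinity + expert_bias).topk(k, dim=-1).indices  # [tokens, k]
    gates = torch.gather(affinity, -1, selected)
    gates = gates / gates.sum(dim=-1, keepdim=True)              # normalized combine weights
    return selected, gates

def update_bias(expert_bias, tokens_per_expert, step=1e-3):
    # Hypothetical update rule: push overloaded experts down, underloaded up.
    load = tokens_per_expert.float()
    return expert_bias - step * torch.sign(load - load.mean())
```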
V3.2 matches GPT-5 and Kimi-K2-Thinking across reasoning benchmarks. Speciale matches Gemini-3.0-Pro and surpasses GPT-5 on frontier tasks. All data from the official paper (arXiv:2512.02556).
Every major capability area is improved in V3.2 — from the raw attention mechanism to the training pipeline to the agentic behavior.
- **DSA architecture**: O(n·k) complexity via Lightning Indexer scoring. 2–3× faster inference, 30–40% less memory. Long-context reasoning without quality degradation. Handles 128K tokens efficiently where dense attention buckles.
- **Novel capability**: First open model to chain thinking and tool-use — reason about which tools to call, then call them, all in a single model pass. Three modes: non-thinking + tools, thinking + tools, thinking-only (Speciale).
- **Broadest coverage**: Trained on 85,000+ synthesized complex instructions across 1,800+ distinct environments: code execution, web search, file operations, API integrations, and more. Long-tail agent tasks are deliberately emphasized.
- **MoE improvement**: Replaces auxiliary load-balance penalties with dynamic expert biasing — improving expert specialization while maintaining load balance. More interpretable stepwise reasoning and smoother agentic behavior.
- **+19% agent score**: When reasoning approaches 80% of the context window, context management kicks in with simple but effective strategies to extend the token budget. Improves long agent task scores from 32.4% to 51.4% — a 19-point gain.
- **+8% MATH-500**: MATH-500 improved from 74.8% to 82.8% vs V3.1. Speciale achieves IMO and CMO Gold. Extended thinking budget with relaxed length constraints enables frontier-level mathematical reasoning.
- **IOI Gold**: LiveCodeBench improved from 29.2% to 34.4%. Speciale earns IOI and ICPC Gold — top performance in the world's hardest algorithmic contests. Understands complexity analysis, algorithmic patterns, and test-case design.
- **API ready**: Full OpenAI-compatible function calling across all thinking modes (except Speciale, which is research-only). JSON output mode for structured responses. Production-ready tool integration APIs.
- **2-line migration**: Same endpoint structure as OpenAI. Change the base URL and API key. Existing streaming, function calling, and structured output integrations work without modification.
- **Commercial ✓**: Full MIT license on all V3.2 weights — commercial use, fine-tuning, distillation, and redistribution all permitted. Download from Hugging Face and deploy anywhere.
- **MCP ready**: Deep compatibility with the Model Context Protocol (MCP) ecosystem — supports the MCP-Mark and MCP-Universe benchmarks. Designed for integration with MCP servers across file, database, web, and code environments.
- **Hardware efficient**: Expert Storage Server (ESS) enables high-throughput, memory-efficient inference at 128K context by offloading expert weights on demand. Makes the 671B MoE practical on realistic hardware setups.

How the V3 architecture evolved through V3.1, V3.2-Exp, and V3.2 before being succeeded by V4.
- **DeepSeek-V3** (671B MoE · Challenged GPT-4o): Original V3 released with 671B parameters, 37B active. MoE + MLA architecture. Multi-token prediction. First open-source model to challenge GPT-4o head-to-head across coding and reasoning. Built at an estimated $5.5M — far less than frontier closed-source models.
- **V3-0324** (GPT-4.5 beater on math): Post-training improvements drawing on RL techniques from R1. Math and coding gains — outperforms GPT-4.5 on math and coding evaluations. Smarter tool calling. API update: the deepseek-chat alias routed to this version.
- **V3.1** (Hybrid CoT · 128K context): Major architectural milestone. Combines V3 general capabilities with R1 chain-of-thought reasoning in a single model — switch between thinking and non-thinking modes via chat template. 128K context via two-phase extension (630B + 209B tokens). Stronger tool calling and agent performance than both V3-0324 and R1-0528.
- **V3.1-Terminus** (stability refinement): Small but significant improvement to the V3.1 checkpoint — improved training stability and base model quality. Became the foundation for V3.2-Exp's DSA continued training. Not widely publicized separately.
- **V3.2-Exp** (First DSA model · Infrastructure prep): Experimental release introducing DeepSeek Sparse Attention (DSA) via continued training on V3.1-Terminus. Primary purpose: test inference infrastructure and ecosystem tools before the full V3.2 release. Benchmark performance on par with V3.1-Terminus — the early release was intentionally conservative. A RoPE implementation discrepancy was identified and fixed in November 2025.
- **V3.2** (GPT-5 level · Agent reasoning · MIT): The main event. DSA for efficient long-context attention. Integrated thinking+tool-use in a single model. 1,800+ environment agentic synthesis pipeline. GPT-5-level reasoning benchmarks. Three variants: V3.2 (production), V3.2-Exp (experimental), and V3.2-Speciale (research frontier). Context management for extended agent tasks. MIT licensed.
- **V4** (Supersedes V3.2 · V3.2 remains open-source): V3.2 superseded by V4-Pro and V4-Flash. Key advances: 1M token context window, Compressed Sparse Attention (CSA) — the evolution of DSA — and Hierarchical Context Aggregation (HCA). Both V4 models trained on 32T+ tokens. V3.2 weights remain open-source and the best available for self-hosted deployments requiring the V3.2 architecture.

V3.2's combination of sparse attention, hybrid thinking, and agentic training opens up applications that previous models couldn't reliably handle.
The 1,800+ environment training and integrated thinking+tools make V3.2 the best open-source choice for building real production agents — code execution agents, search agents, file-processing agents. The thinking mode reasons about which tools to use before making any calls.
DSA's O(n·k) complexity makes 128K context affordable at inference time. Process entire research papers, legal contracts, codebases, or technical manuals in a single request without the latency and cost penalty of dense attention at scale.
V3.2 was specifically designed for the Model Context Protocol ecosystem. Use it as the reasoning backbone for MCP-based agent systems — web browsing, database queries, code execution, and file management — all orchestrated through a single model.
MATH-500 at 82.8%, IMO Gold via Speciale. For mathematical research assistance — conjecture exploration, proof sketching, competition training — V3.2 delivers the best open-source results. Use Speciale for frontier-level work where token budget isn't a constraint.
IOI and ICPC Gold via Speciale; significant LiveCodeBench improvement in V3.2-Exp. For algorithm design, complexity analysis, and competitive programming practice, V3.2 matches the best closed-source models at the hardest levels.
MIT license + ESS offload architecture makes V3.2 the best self-hosted choice for enterprise agent infrastructure. Run full 671B reasoning agents on-premises — no API dependency, no data leaving your network.
V3.2 is available via the DeepSeek API and as open-source weights. The Speciale endpoint expired Dec 15, 2025; for most new production use cases, deepseek-v4-pro (which supersedes V3.2) is now recommended.
Go to chat.deepseek.com. The platform now serves V4-Pro in Expert Mode — the successor with better benchmarks and 1M context. Toggle DeepThink for thinking mode.
The V3.2-Exp weights are on Hugging Face. V3.2 and Speciale share the same architecture as V3.2-Exp — refer to the V3.2-Exp inference code. Requires 8×80GB GPUs for BF16.
Critical: use non-interleaved RoPE layout in the indexer module, and interleaved in the MLA module. The original inference demo had this swapped — check for the November 2025 fix before running.
Official recommendation for local deployment: temperature=1.0, top_p=0.95. Do not use lower temperatures — they cause repetition. Note: V3.2 ignores most sampling parameters beyond temp/top_p.
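For local serving, these settings map directly onto common inference stacks. Below is a minimal vLLM-style sketch; the Hugging Face repo name, the tensor-parallel setting, and vLLM support for this architecture are assumptions, so follow the official deployment guide for your stack.

```python
# Illustrative local-inference sketch with the recommended sampling settings.
# Model repo name and tensor_parallel_size are assumptions for an 8x80GB node.
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-V3.2-Exp", tensor_parallel_size=8)
params = SamplingParams(temperature=1.0, top_p=0.95)  # do not lower the temperature
outputs = llm.generate(["Summarize DeepSeek Sparse Attention in two sentences."], params)
print(outputs[0].outputs[0].text)
```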
Pass "thinking": {"type": "enabled", "budget": "high"} in extra_body to activate reasoning-before-tool-calling mode. This is V3.2's flagship capability — essential for complex multi-step agent tasks.
For new production projects, use deepseek-v4-flash or deepseek-v4-pro. V4 adds 1M context, Compressed Sparse Attention, and stronger benchmarks. deepseek-chat alias retires July 24, 2026.
How V3.2 and Speciale compare against the field at the time of release (December 2025).
| Model | Reasoning | Agent perf | Context | Open source | Think+Tools | Competition |
|---|---|---|---|---|---|---|
| DeepSeek-V3.2 | ≈ GPT-5 | Best open | 128K | ✓ MIT | ✓ Integrated | Standard |
| V3.2-Speciale | > GPT-5 | No tools | 128K+ | ✓ MIT | ✗ Research | 🥇 IMO·IOI·ICPC·CMO |
| GPT-5 | ≈ V3.2 | Strong | 128K | ✗ Closed | ✓ | Silver |
| Gemini 3.0 Pro | Leads | Strong | 1M | ✗ Closed | ✓ | 🥇 IMO Gold |
| Kimi-K2-Thinking | ≈ V3.2 | Partial | 128K | ✓ Modified MIT | Partial | — |
Data from official paper (arXiv:2512.02556, Dec 2025). Speciale endpoint expired Dec 15, 2025. V4 supersedes V3.2 for API use from April 2026.
DeepSeek V3.2 (December 1, 2025) introduces three major improvements over V3.1: (1) DeepSeek Sparse Attention (DSA) — replacing O(n²) attention with O(n·k) using a Lightning Indexer scorer, delivering 2–3× faster inference and 30–40% lower memory for long contexts. (2) Integrated thinking + tool-use — the first open-source model where chain-of-thought reasoning is embedded directly into tool-calling workflows, enabling true reasoning agents. (3) Agentic synthesis pipeline — 1,800+ environments and 85,000+ complex instructions, dramatically improving agent task generalization.
Standard transformer attention computes relationships between every pair of tokens — O(n²) complexity that makes long-context inference quadratically more expensive as sequences grow. DSA uses a lightweight Lightning Indexer to score token relevance and select the top-k most relevant tokens before applying attention — transforming the complexity to O(n·k) where k ≪ n. Results: 2–3× faster inference, 30–40% lower memory usage, and no quality degradation on long contexts. This makes 128K context practically affordable on real hardware, and laid the technical foundation for V4's Compressed Sparse Attention (CSA) for 1M context.
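As a back-of-the-envelope illustration (the value of k below is an assumed selection size, not a figure quoted in the paper):

```python
# Rough attention-pair count at 128K context: dense vs. top-k sparse.
n, k = 128_000, 2_048          # k is an assumed selection size for illustration
dense_pairs = n * n            # every query attends to every key
sparse_pairs = n * k           # every query attends to its top-k keys only
print(f"{dense_pairs / sparse_pairs:.0f}x fewer scored pairs")  # ~62x at 128K
```

End-to-end gains are smaller (the 2–3× quoted above) because attention is only one part of the forward pass.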
Speciale is a research variant built by relaxing length constraints — allowing unlimited test-time compute during reasoning. This pushed performance beyond GPT-5 on complex tasks, achieving Gemini-3.0-Pro-level reasoning and gold medals at IMO 2025, IOI 2025, ICPC World Finals 2025, and CMO 2025. The API endpoint (base_url ending in v3.2_speciale_expires_on_20251215) expired December 15, 2025. Speciale weights are on Hugging Face under MIT license — you can run it locally. It does not support tool calling (research use only).
For new API projects: use DeepSeek V4. V4-Flash and V4-Pro (April 2026) supersede V3.2 across all production metrics: 1M token context (vs 128K), stronger benchmarks (80.6% SWE-bench for V4-Pro), and Compressed Sparse Attention (CSA) — the evolution of V3.2's DSA. The deepseek-chat alias (which routed to V3.2) retires July 24, 2026. For self-hosted deployment where you specifically need the V3.2 architecture, the MIT-licensed weights remain the best available open-source option for 671B-scale reasoning.
V3.2 supports three modes via the API: (1) Non-thinking + tools: fast, direct tool calls without internal reasoning. (2) Thinking + tools: the model generates chain-of-thought reasoning about which tools to call, in what order, and with what parameters — before making any API calls. This dramatically improves complex multi-step tasks where naive tool selection cascades into failures. (3) Pure thinking (Speciale only): maximum reasoning depth with no tool-use. Enable thinking mode via extra_body={"thinking": {"type": "enabled", "budget": "high"}} in the API call.
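A sketch of a thinking + tools request, assuming the same OpenAI-compatible function-calling interface described above; the tool definition is a made-up example for illustration.

```python
# Hedged example: thinking + tools via OpenAI-compatible function calling.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "search_web",  # hypothetical tool, defined only for this example
        "description": "Search the web and return the top results.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Find the latest DSA paper and summarize it."}],
    tools=tools,
    extra_body={"thinking": {"type": "enabled", "budget": "high"}},
)
# In thinking + tools mode the model reasons about which tool to call first,
# then emits a tool call for the client to execute.
print(response.choices[0].message.tool_calls)
```

In non-thinking mode the same request works with the extra_body field omitted.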
V3.2 is a 671B MoE model — the same size as the V3 family. Minimum for BF16 inference: 8×80GB GPUs (e.g., 8×H100 80GB or 8×A100 80GB). The Expert Storage Server (ESS) offload architecture helps by keeping inactive expert weights on storage rather than always in GPU memory, making 128K context inference more practical. For quantized inference (FP8 or INT4), requirements drop — community implementations exist for 4×80GB setups. There is no consumer-hardware path for the full 671B model. For self-hosted reasoning on consumer hardware, use the R1 distilled models (7B–70B via Ollama).
A known implementation discrepancy was identified in November 2025: the input tensor to RoPE (Rotary Position Embedding) in the indexer module requires a non-interleaved layout, whereas RoPE in the MLA module expects an interleaved layout. Earlier inference demo code had these swapped, leading to degraded model performance — particularly on long sequences where RoPE position encoding matters most. The fix is in the updated inference demo code on Hugging Face. Before running V3.2-Exp locally, ensure you are using the post-November 2025 version of the inference code from the official repo.
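For readers unfamiliar with the distinction, the following is a generic sketch of the two RoPE layouts (not the official V3.2 code): interleaved RoPE rotates adjacent dimension pairs, while the non-interleaved (half-split) variant pairs each dimension with its counterpart half a head-width away. Applying the wrong variant scrambles position information, which is why the swap degraded long-sequence quality.

```python
# Generic illustration of interleaved vs. non-interleaved RoPE layouts.
# cos/sin are broadcastable to shape [..., d/2]; not DeepSeek's implementation.
import torch

def rope_interleaved(x, cos, sin):
    # Rotates adjacent pairs: (x0, x1), (x2, x3), ...
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_non_interleaved(x, cos, sin):
    # Rotates half-split pairs: (x_i, x_{i + d/2})
    d = x.shape[-1]
    x1, x2 = x[..., : d // 2], x[..., d // 2:]
    return torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
```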
Sparse attention that scales. Thinking agents that reason. Gold medals at the world's hardest competitions. MIT licensed. Download the weights or access via API today.