DeepSeek-V2 introduced two ideas that every major model now uses: Multi-head Latent Attention (MLA), which compresses the KV cache by 93.3%, and DeepSeekMoE, a fine-grained expert architecture that delivers 236B-parameter capacity at 21B-active cost. Together they produced 5.76× generation throughput, 42.5% lower training cost, and the first open-source frontier model with 128K context.
The V2 family launched with four variants — the flagship 236B MoE, a lightweight Lite model, and Chat/Code specialised versions — all built on MLA and DeepSeekMoE.
The flagship. 236B total parameters with 21B activated per token via DeepSeekMoE. MLA reduces KV cache 93.3% vs the 67B dense predecessor. 128K context window. Trained on 8.1T tokens. On a single 8× H800 node achieves 50K+ tokens/sec throughput — 5.76× the old 67B model.
Community-requested lighter sibling — released days after V2 following high interest in MLA research. 16B total, 2.4B active per token. Trained on 5.7T tokens. Deployable on a single 40G GPU; fine-tunable on 8× 80G GPUs. Outperforms 7B dense and other 16B MoE models on English and Chinese benchmarks.
The instruction-tuned and RL-aligned chat variant. Achieves 38.9% win rate on AlpacaEval 2.0 and 8.97 overall score on MT-Bench — top open-source chat performance at release. Two sub-variants: V2-Chat (SFT) and V2-Chat (RL). Chat-RL shows further gains on math and coding vs SFT alone.
Code-specialised V2 variant. Continues pre-training from V2-Base with an additional code-heavy corpus to 10.2T total tokens, mixed at roughly 60% code / 10% math / 30% natural language. HumanEval 90.2%, MBPP 76.2%. First open-source model to surpass 10% on SWE-Bench. Rivals GPT-4-Turbo, Claude 3 Opus, Gemini 1.5 Pro on coding benchmarks.
deepseek-chat now routes to V3.2 / V4. See the changelog for the full API history.
Both architectural innovations introduced in V2 were adopted verbatim in V3, R1, V3.1, V3.2, and V4. MLA became the standard attention mechanism for the entire DeepSeek lineage. DeepSeekMoE became the expert routing strategy. Nothing has been replaced — only extended.
Traditional MHA stores a full Key and Value vector for every token in the KV cache. For a model with n_h = 128 heads and head dimension d_h = 128, that's 2 × 128 × 128 = 32,768 floats per token per layer — a massive memory bottleneck at 128K context.
MLA solves this with low-rank joint compression: instead of caching the full K and V matrices, it caches a single compressed latent vector c_KV ∈ ℝ^{d_c} where d_c ≪ d_h × n_h. During inference, K and V are reconstructed on-demand from this latent vector via learned up-projection matrices W_UK and W_UV. At inference time, W_UK can be absorbed into W_Q, and W_UV into W_O — eliminating even the decompression step entirely via weight absorption.
The result: only the tiny latent vector needs to be cached, reducing KV cache size by 93.3% compared to DeepSeek-67B. This directly enables 5.76× higher generation throughput, 128K context windows, and dramatically reduced serving costs — without sacrificing model quality. Empirically, MLA outperforms MHA, MQA, and GQA in ablation studies.
MLA also uses a decoupled RoPE strategy: position encoding is applied separately to a small RoPE portion of the query/key and does not interact with the compressed latent vector, keeping the absorption trick valid.
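To make the caching path concrete, here is a minimal single-layer sketch in PyTorch. It is an illustration rather than DeepSeek's implementation: the latent size and head counts follow the configuration table below, the model width of 5120 is an assumption, and query compression plus the decoupled RoPE path are omitted.

```python
import torch
import torch.nn as nn

# Minimal sketch of MLA-style KV caching (illustration only, not DeepSeek's code).
# Dimensions follow the V2 configuration: 128 heads of size 128, latent d_c = 512;
# the model width of 5120 is an assumption. Query compression and the decoupled
# RoPE path are omitted.
d_model, n_heads, d_head, d_c = 5120, 128, 128, 512

W_DKV = nn.Linear(d_model, d_c, bias=False)           # down-projection: hidden -> cached latent
W_UK  = nn.Linear(d_c, n_heads * d_head, bias=False)  # up-projection to per-head Keys
W_UV  = nn.Linear(d_c, n_heads * d_head, bias=False)  # up-projection to per-head Values

def decode_step(h_t, latent_cache):
    """h_t: (batch, d_model) hidden state of the newly generated token."""
    latent_cache.append(W_DKV(h_t))                   # only the d_c-dim latent is cached
    c = torch.stack(latent_cache, dim=1)              # (batch, seq, d_c)
    # K and V are reconstructed on demand; in a real serving stack W_UK / W_UV are
    # absorbed into W_Q / W_O so this reconstruction never materialises.
    K = W_UK(c).view(c.size(0), -1, n_heads, d_head)
    V = W_UV(c).view(c.size(0), -1, n_heads, d_head)
    return K, V

cache = []
K, V = decode_step(torch.randn(2, d_model), cache)
# Cache cost per token per layer: d_c = 512 floats vs 2 * 128 * 128 = 32,768 for full MHA.
```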
Conventional MoE architectures (e.g., GShard, Switch Transformer) use coarse expert segmentation: 8 or 16 large experts, 2 activated per token. DeepSeekMoE takes the opposite approach: fine-grained expert segmentation with many smaller experts and a higher activation ratio, plus a dedicated shared expert isolation mechanism.
In V2's DeepSeekMoE, each FFN layer contains 160 routed experts plus 2 shared experts that are always active. For each token, 6 of the 160 routed experts are selected (top-6 routing), plus both shared experts fire unconditionally. The result: the model activates experts covering diverse knowledge areas per token, while shared experts handle universal capabilities (syntax, basic reasoning) that should fire regardless of routing.
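The toy sketch below shows this routing pattern: shared experts always on, top-6 of 160 routed experts weighted by their gate scores. Expert counts match V2, but the tiny model width and single-Linear "experts" are stand-ins, and the gating is simplified relative to the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sketch of DeepSeekMoE-style routing (illustration only, not DeepSeek's code).
# Expert counts follow V2 (160 routed, top-6, 2 shared); the model width and the
# single-Linear "experts" are shrunk stand-ins so the snippet runs anywhere.
d_model, n_routed, n_shared, top_k = 64, 160, 2, 6

routed_experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_routed)])
shared_experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_shared)])
router = nn.Linear(d_model, n_routed, bias=False)

def moe_layer(x):                                        # x: (tokens, d_model)
    gate = F.softmax(router(x), dim=-1)                  # token-to-expert affinities
    topk_score, topk_idx = gate.topk(top_k, dim=-1)
    outputs = []
    for t in range(x.size(0)):                           # per-token loop for clarity, not speed
        y = sum(expert(x[t]) for expert in shared_experts)   # shared experts always fire
        for w, i in zip(topk_score[t], topk_idx[t]):         # plus the token's top-6 routed experts
            y = y + w * routed_experts[int(i)](x[t])
        outputs.append(y)
    return torch.stack(outputs)

out = moe_layer(torch.randn(4, d_model))
```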
This design achieves higher expert specialisation than conventional MoE while keeping the total active parameter budget at 21B. During training, an auxiliary load-balancing loss is used to prevent expert collapse (some experts receiving all tokens, others starving). The loss coefficient is carefully tuned to balance load without harming model quality.
V2's expert-parallel training adds supplementary mechanisms to control communication overhead across GPU nodes — a critical engineering detail that makes training a 236B model economical on H800 clusters.
| Component | Value |
|---|---|
| Total parameters | 236B |
| Active per token | 21B |
| Routed experts per layer | 160 |
| Shared experts per layer | 2 (always active) |
| Top-k routing | 6 of 160 routed |
| Transformer layers | 60 |
| Attention heads | 128 |
| KV compression dim | 512 (d_c) |
| Context length | 128K tokens |
| Vocabulary | 100,014 (BPE) |
DeepSeek-V2 wasn't just a bigger model. It introduced architectural and engineering innovations that changed how the industry thinks about MoE inference, long context, and open-source economic efficiency.
Architecture: Multi-head Latent Attention compresses the KV cache into a single low-rank latent vector per token. 93.3% reduction in KV cache size vs the 67B dense predecessor. Empirically outperforms MHA in ablation studies — unlike GQA/MQA which trade quality for memory. Adopted in every DeepSeek model after V2. Now being retrofitted onto other architectures (MHA2MLA paper, 2025).
Architecture: DeepSeekMoE. 160 routed experts + 2 shared experts per FFN layer. Top-6 routing + always-on shared experts. Higher specialisation than GShard/Switch at the same FLOP budget. Expert parallelism engineering controls inter-node communication overhead at scale. Enables 236B knowledge capacity at 21B activation cost per token.
Context: First DeepSeek flagship with a 128K token context window — enabling entire codebases, long legal documents, and book-length analyses. Made practical by MLA: without the 93.3% KV cache reduction, 128K context on 8× H800 would exhaust GPU memory before reaching meaningful batch sizes.
Economics: Training cost falls 42.5% compared to training a comparable-capability dense model at 67B scale. MoE sparse computation means only 21B parameters are activated and updated per forward pass. The FLOPs-per-token are comparable to a 21B dense model while knowledge capacity scales to 236B. DeepSeek trained V2 at economical cost on H800 GPUs with full engineering details published.
Performance: On a single 8× H800 GPU node, DeepSeek-V2 achieves over 50,000 tokens/second generation throughput — 5.76× the maximum throughput of DeepSeek 67B on identical hardware. Driven by MLA's smaller KV cache enabling larger batch sizes, plus FP8 quantisation and KV cache compression (6 bits/element) for serving.
Data: Pre-trained on an 8.1T-token multi-source corpus — roughly 4× the V1 LLM's 2T. Broad coverage of English, Chinese, code, math, and scientific text. The quality and scale of the training data underpin the large performance gap over V1: V2 matches or exceeds models trained on far more compute through architectural and data efficiency.
Compared to the previous-generation DeepSeek 67B dense model, V2 delivers dramatic efficiency gains across every dimension that matters for production deployment.
Results from the official arXiv:2405.04434 paper. At release, DeepSeek-V2 was the strongest open-source model on most benchmarks, matching or exceeding Llama 3 70B and Mixtral 8×22B with a fraction of activated parameters.
V2 was trained on more than 4× the tokens of V1's LLM — but at far lower compute cost per token thanks to MoE sparse activation. Every training decision was documented in the published technical report.
Data: 8.1 trillion tokens from a high-quality multi-source corpus. Significantly broader and larger than V1's 2T tokens. Multi-source means code, English web text, Chinese web text, scientific papers, books, and technical documentation — with domain-specific quality filtering and deduplication at each stage.
Context extension: During the final pre-training phase, the context window is extended from 4K to 128K tokens using YaRN (Yet another RoPE extensioN) — a positional interpolation technique that allows pre-trained models to generalise to longer sequences without retraining from scratch.
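As a rough sketch of the underlying idea (not DeepSeek's exact recipe), the snippet below interpolates RoPE frequencies so that a 32× longer window maps back into the positional range seen during pre-training. Real YaRN defines the per-wavelength ramp and an extra attention scaling term precisely, and V2's specific hyperparameters are given in the paper.

```python
import math
import torch

# Conceptual sketch of wavelength-aware RoPE interpolation in the spirit of YaRN
# (illustration only; the real ramp bounds, attention scaling, and DeepSeek-V2's
# exact settings differ and are documented in the respective papers).
d_head, base = 128, 10000.0
orig_len, target_len = 4096, 131072
s = target_len / orig_len                          # 32x context extension

dims = torch.arange(0, d_head, 2).float()
inv_freq = base ** (-dims / d_head)                # original RoPE frequencies
wavelen = 2 * math.pi / inv_freq                   # wavelength of each rotary dimension

# High-frequency dimensions (short wavelengths) are left untouched; dimensions whose
# wavelength approaches or exceeds the 4K training window are interpolated by s.
blend = (wavelen / orig_len).clamp(0.0, 1.0)       # crude 0..1 ramp; YaRN makes this precise
scaled_inv_freq = inv_freq * (1.0 - blend) + (inv_freq / s) * blend

angles = torch.outer(torch.arange(target_len, dtype=torch.float32), scaled_inv_freq)
cos_table, sin_table = angles.cos(), angles.sin()  # lookup tables used by attention
```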
Load balancing: An auxiliary loss is applied to balance expert utilisation across the 160 routed experts. Without this, routing collapse occurs — most tokens go to a handful of popular experts while others starve. The loss coefficient is carefully tuned to balance load without degrading model quality.
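A minimal sketch of such a balance loss, in the general Switch-Transformer style (the V2 paper additionally defines device-level and communication-level variants; the coefficient shown is indicative only):

```python
import torch
import torch.nn.functional as F

# Sketch of an expert-level load-balancing loss (Switch-style form; the coefficient
# is illustrative, not DeepSeek's tuned value).
def load_balance_loss(router_logits, top_k=6, alpha=0.003):
    n_tokens, n_experts = router_logits.shape
    probs = F.softmax(router_logits, dim=-1)
    topk_idx = probs.topk(top_k, dim=-1).indices
    dispatch = torch.zeros_like(probs).scatter_(1, topk_idx, 1.0)
    load = dispatch.sum(dim=0) * n_experts / (top_k * n_tokens)  # relative load per expert
    importance = probs.mean(dim=0)                               # mean gate probability per expert
    return alpha * (load * importance).sum()                     # grows when a few experts dominate

loss = load_balance_loss(torch.randn(32, 160))
```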
Supervised Fine-Tuning (SFT): Applied on top of the pre-trained Base model with a broad instruction-following dataset covering coding, mathematics, creative writing, safety, and Chinese-language tasks. The SFT dataset includes substantial math and code content — explaining why V2-Chat (SFT) already shows strong improvement in these domains vs the Base model.
Reinforcement Learning (RL): GRPO-based RL further improves performance on math and coding benchmarks. V2-Chat (RL) shows noticeable gains over V2-Chat (SFT) on GSM8K, MATH, and HumanEval — demonstrating that RL is particularly valuable for tasks with verifiable outcomes where reward signals are clear.
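The core of GRPO is a group-relative baseline: several answers are sampled per prompt and each answer's advantage is its reward standardised within that group. A minimal sketch of just that step (the full objective adds a clipped policy ratio and a KL penalty, omitted here):

```python
import torch

# Group-relative advantages as used in GRPO (sketch only: the full objective also
# includes a clipped importance ratio and a KL penalty against a reference model).
def group_relative_advantages(rewards):
    # rewards: (G,) scalar rewards for G sampled answers to the same prompt,
    # e.g. 1.0 for a verified-correct math solution and 0.0 otherwise.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

adv = group_relative_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0]))
print(adv)  # correct answers get positive advantage, incorrect ones negative
```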
Serving: For deployment, V2 parameters are converted to FP8 precision and KV cache elements are further quantised to 6 bits on average — additional compression on top of MLA's latent caching that makes production serving economics favourable.
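A toy symmetric quantiser illustrates the flavour of this low-bit compression (not DeepSeek's serving kernels; the per-row scale and int8 container are simplifications):

```python
import torch

# Toy symmetric quantiser to illustrate low-bit KV-cache compression (values are
# packed into an int8 container here for simplicity even though only 6 bits are used).
def quantize(x, n_bits=6):
    qmax = 2 ** (n_bits - 1) - 1                         # 31 for 6 bits
    scale = x.abs().amax(dim=-1, keepdim=True) / qmax    # per-row scale
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

def dequantize(q, scale):
    return q.float() * scale

latents = torch.randn(4, 512)            # e.g. cached MLA latent vectors
q, s = quantize(latents)
error = (dequantize(q, s) - latents).abs().max()
```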
DeepSeek-V2 didn't just improve benchmark numbers. It changed what open-source AI could be: a viable alternative to GPT-4 at dramatically lower cost, with architectural innovations the entire industry subsequently adopted.
New Architecture: Multi-head Latent Attention — a new attention mechanism that beats MHA on quality while compressing KV cache 93.3%. First described in the V2 paper; now used in V3, R1, V3.1, V3.2, V4, and being retrofitted into other architectures.
New Architecture: DeepSeekMoE, fine-grained expert segmentation with 160 routed + 2 shared experts. Higher specialisation than GShard/Switch at the same FLOP cost. The MoE architecture that powers every subsequent DeepSeek model through V4.
Long Context: First DeepSeek flagship with 128K tokens — made practical by MLA's 93.3% KV cache reduction. Enables entire codebases, long legal documents, and book-level analyses in a single context.
Performance: 5.76× the throughput of DeepSeek 67B on identical 8× H800 hardware. The higher throughput comes directly from MLA enabling larger batch sizes: batches that were previously capped by KV-cache memory can now grow much larger.
Economics: Sparse MoE means only 21B parameters are activated and updated per forward pass while knowledge scales to 236B. Economic training that enables frontier intelligence at sustainable cost.
Scale: 236B total parameters, 21B activated per token. This ratio — roughly an 11× expansion factor — became the template for V3 (671B/37B) and V4-Pro (1.6T/49B). The specific balance of capacity vs activation cost was validated here.
Milestone: DeepSeek-Coder-V2 was the first open-source model to surpass 10% on SWE-Bench Verified — the real-world software engineering benchmark. Previously only closed-source models had crossed this threshold.
Chinese: V2 outperforms LLaMA-3-70B by over 20 percentage points on CMMLU and C-Eval. The 100K BPE vocabulary (inherited from V1) and the bilingual training corpus make V2 the dominant open-source model for Chinese-language tasks.
Inference: During serving, W_UK is absorbed into W_Q and W_UV into W_O — eliminating the MLA decompression step entirely. This "free" speedup means the theoretical memory savings translate directly into practical inference speedups with no quality loss.
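The absorption trick is plain matrix associativity, which a quick numerical check makes visible (toy sizes, single head, RoPE and the value path ignored):

```python
import torch

# Numerical check of the MLA weight-absorption identity for one head, toy sizes
# (hypothetical dimensions, not V2's):
#   (x @ W_Q) @ (c @ W_UK).T  ==  x @ (W_Q @ W_UK.T) @ c.T
d_model, d_head, d_c = 64, 16, 8
W_Q, W_UK = torch.randn(d_model, d_head), torch.randn(d_c, d_head)

x = torch.randn(1, d_model)        # hidden state of the current query token
c = torch.randn(5, d_c)            # cached latent vectors for 5 earlier tokens

scores_naive    = (x @ W_Q) @ (c @ W_UK).T     # reconstructs K explicitly
scores_absorbed = x @ (W_Q @ W_UK.T) @ c.T     # K is never materialised
assert torch.allclose(scores_naive, scores_absorbed, atol=1e-5)
```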
Accessible: The 16B / 2.4B-active V2-Lite model was released in response to the surge of community interest in MLA research. Deployable on a single 40G GPU — making MLA accessible for researchers without large-scale infrastructure.
Open: arXiv:2405.04434 published all architecture details, training hyperparameters, alignment pipeline, serving optimisations, and ablation studies. Every innovation — MLA, DeepSeekMoE, expert load balancing — fully documented.
Legacy: V3 explicitly states: "MLA and DeepSeekMoE architectures, thoroughly validated in DeepSeek-V2." V3 inherited both innovations, extended them, and added FP8 training and MTP. V4 further extends this foundation. V2 is not a stepping stone — it's the bedrock.
V2 model weights remain on Hugging Face. The V2 API endpoint is historical (now routes to V3.2/V4), but the weights are permanently available for local inference, fine-tuning, and research.
The V2 generation powered the hosted DeepSeek API from May to December 2024. During this period, the endpoint went through three major model updates before transitioning to the V3 generation.
- DeepSeek-V2.5-1210: deepseek-chat upgraded to DeepSeek-V2.5-1210. Improved math (MATH-500: 74.8% → 82.8%), coding (LiveCodeBench: 29.2% → 34.38%), writing, and reasoning. Better file upload and webpage summarisation. This was the last deepseek-chat alias on the V2 architecture — the next update in December 2024 moved to V3.
- DeepSeek-V2.5: merged general chat and coding into one model, served via both the deepseek-chat and deepseek-coder endpoints. First all-in-one model for general use and programming.
- DeepSeek-V2-0628: deepseek-chat alias moved to this checkpoint.
- DeepSeek-V2-0517: deepseek-chat alias moved to DeepSeek-V2-0517 from the original V2 weights. Enhanced capability for structured data generation tasks.
- DeepSeek-V2 launch: deepseek-chat moved from the V1-era model to V2. DeepSeek-V2-Lite (16B/2.4B) released days later following community demand for a smaller MLA research model.

The V2 paper was published in May 2024. By December 2024, V3 explicitly credited V2 as the architectural foundation. By April 2026, V4-Pro carried 1.6T parameters on the same two innovations. No architecture since has needed to replace MLA or DeepSeekMoE.
DeepSeek-V2: 236B/21B MoE, 128K context, 93.3% KV cache reduction, 5.76× throughput. Two architectural innovations published openly. API price: $0.14/1M input — a fraction of GPT-4's $30/1M at the time. Community MLA interest triggers release of V2-Lite within days.
MLA invented · DeepSeekMoE established · 128K context · arXiv:2405.04434
DeepSeek-Coder-V2: Code-specialised V2 variant. HumanEval 90.2%, MBPP 76.2%, MATH 75.7%. First open-source model to cross the 10% SWE-Bench threshold. Rivals GPT-4-Turbo, Claude 3 Opus, Gemini 1.5 Pro on coding and math. 338 programming languages. 128K context window.
First open-source SWE-Bench >10% · Code + math parity with GPT-4o
DeepSeek-V2.5: Merged V2-0628 general chat and Coder-V2-0724 into one model — the first DeepSeek all-in-one model for both general tasks and code. Backward compatible via both deepseek-chat and deepseek-coder API aliases. Sets the template for V3's unified approach.
DeepSeek-V3: The V3 paper explicitly states: "MLA and DeepSeekMoE architectures, thoroughly validated in DeepSeek-V2." V3 scales to 671B/37B, adds FP8 mixed-precision training and a Multi-Token Prediction (MTP) objective. 2.788M H800 GPU hours. API: $0.27/1M — still far below GPT-4o's cost. Cost: ~$5.5M to train.
V2 architecture validated and scaled · FP8 training added · 671B/37B
DeepSeek-V4: V4-Pro (1.6T/49B) and V4-Flash (284B/13B). Both use MLA and DeepSeekMoE — exactly as introduced in V2 two years earlier. Codeforces #1 (3206 Elo), 80.6% SWE-bench Verified, 1M token context. The architecture DeepSeek invented for 236B now scales to 1.6 trillion parameters.
MLA + MoE at 1.6T · Codeforces #1 · 1M context · 2 years from V2
Multi-head Latent Attention (MLA) and Grouped-Query Attention (GQA) both reduce KV cache memory — but through fundamentally different approaches. GQA reduces cache size by sharing Key/Value heads across groups of Query heads: instead of n_h KV heads, you have n_h/g KV heads. This reduces memory at the cost of performance — GQA and MQA trade quality for efficiency. MLA takes a different approach: it compresses all K and V information into a single low-rank latent vector per token using learned down-projection and up-projection matrices. Only this tiny vector is cached. At inference time, K and V are reconstructed on demand — and the up-projection matrices can even be absorbed into W_Q and W_O, eliminating the decompression step entirely. The result: MLA achieves better performance than MHA (not worse) while cutting KV cache by 93.3%. GQA was a tradeoff; MLA is a genuine improvement. This is why every DeepSeek model since V2 uses MLA.
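Back-of-envelope cache sizes per token per layer make the contrast concrete. The figures below use V2's 128 heads of dimension 128 and d_c = 512; the 8 KV groups for GQA and the 64-element decoupled RoPE key are illustrative assumptions rather than specs stated on this page:

```python
# Per-token, per-layer KV-cache elements under each scheme (illustrative figures).
n_heads, d_head, d_c, d_rope, gqa_groups = 128, 128, 512, 64, 8

mha = 2 * n_heads * d_head      # full K and V for every head       -> 32,768
gqa = 2 * gqa_groups * d_head   # K and V shared within each group  ->  2,048
mla = d_c + d_rope              # one latent + small RoPE key       ->    576

print(f"MHA {mha}, GQA {gqa}, MLA {mla}")
```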
The V2 API era ended in December 2024. The deepseek-chat and deepseek-coder endpoints that pointed to V2-era models have since been upgraded — first to V3 in December 2024, and as of April 2026 to V4 models. There is no way to call the V2 model via the hosted API — it is a historical endpoint. For V2, you have two options: (1) Download the model weights from Hugging Face and run locally. (2) Use a community-hosted endpoint. For production use, use platform.deepseek.com with deepseek-chat (now V4-Flash) or deepseek-v4-pro.
Standard MoE (like Switch Transformer or GShard) uses coarse expert segmentation: typically 8–16 large experts with top-2 routing. This means each token activates 2 of 8 experts — a 25% activation ratio but with low specialisation, since each expert must handle a broad range of token types. DeepSeekMoE uses fine-grained expert segmentation: 160 smaller routed experts with top-6 routing, plus 2 shared experts that always fire. This means each token activates 8 of 162 experts — a ~5% activation ratio. The critical difference: with more but smaller experts, each expert can specialise on a much narrower subspace of token types (e.g., Python syntax, Chinese grammar, mathematical notation), while the shared experts handle universal capabilities that should fire for all tokens. The result is higher effective specialisation with the same total parameter activation budget.
The full 236B V2 model requires roughly 470 GB of GPU memory in BF16 (236B parameters × 2 bytes) — in practice a full 8× 80 GB GPU node. For most researchers, V2-Lite is the practical choice: the 16B / 2.4B active model deploys on a single 40G GPU (e.g., A100 40G; smaller consumer cards need quantisation) and is fine-tunable on 8× 80G GPUs. For quantised variants: the 236B model in Q4_K_M quantisation (~4.8 bits/weight) reduces to roughly 140 GB, fitting 2× A100 80GB with little headroom for KV cache. The community has published various GGUF quantisations via llama.cpp. For production inference without local hardware, the current V4 API is the recommended path.
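For the local route, a minimal inference sketch with the V2-Lite chat weights (assuming the published Hugging Face repo id deepseek-ai/DeepSeek-V2-Lite-Chat, a recent transformers release with trust_remote_code enabled, and accelerate installed for device_map):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal local-inference sketch for DeepSeek-V2-Lite-Chat (assumes the public
# Hugging Face weights and enough GPU memory for the 16B model in BF16).
model_id = "deepseek-ai/DeepSeek-V2-Lite-Chat"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Explain Multi-head Latent Attention in one paragraph."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tok.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```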
V3 (December 2024) kept MLA and DeepSeekMoE verbatim from V2 and added four main advances: (1) Scale: 671B/37B vs 236B/21B — roughly 2.9× more total parameters, 1.76× more active. (2) FP8 mixed-precision training: V3 pioneered FP8 training, dramatically reducing memory and compute cost during pre-training. (3) Multi-Token Prediction (MTP): V3 trains to predict multiple future tokens simultaneously, improving sample efficiency and enabling speculative decoding. (4) Auxiliary-loss-free load balancing: V3 replaces V2's auxiliary load-balancing loss with a bias-based mechanism that achieves better balance with less quality degradation. Pre-training corpus: 14.8T tokens vs V2's 8.1T. Both were strong at release; V3 is significantly more capable.
On most coding benchmarks in 2026, Coder-V2 has been surpassed by V4-Pro (80.6% SWE-bench vs Coder-V2's ~12%), Claude 3.7, and GPT-4o. However, Coder-V2 retains value in specific contexts: (1) Local inference: If you need a strong code model you can run locally at 236B scale without needing the latest closed-source model, Coder-V2 still delivers competitive performance. (2) 128K context for code: Useful for repository-level analysis. (3) Historical research: As the first open-source model to pass 10% SWE-Bench, it's an important reference point. For production code generation in 2026, V4-Flash ($0.14/1M) offers significantly better performance at lower cost than hosting Coder-V2 locally.
May 2024. Multi-head Latent Attention. DeepSeekMoE. 93.3% less KV cache. 5.76× throughput. 128K context. 42.5% cheaper training. Every DeepSeek model since has been built on these two ideas — unmodified since V2.