May 7, 2024 — DeepSeek-V2 released · 236B MoE, 21B active
MLA invented — Multi-head Latent Attention cuts KV cache 93.3%
5.76× throughput — vs DeepSeek 67B on single 8× H800 node
128K context — first DeepSeek flagship with 128K token window
42.5% cheaper training — compared to dense 67B model
8.1T training tokens — highest quality multi-source corpus at the time
DeepSeekMoE — fine-grained expert segmentation + shared expert isolation
$0.14/1M input — V2-era API pricing · dramatically undercut GPT-4
AlpacaEval 38.9% — top open-source chat performance at release
V2-Lite also released — 16B / 2.4B active, fits single 40G GPU
MLA · DeepSeekMoE · 128K Context · May 2024

The Architecture
that changed it all.

DeepSeek-V2 introduced two ideas that every major model now uses: Multi-head Latent Attention (MLA), which compresses the KV cache by 93.3%, and DeepSeekMoE, a fine-grained expert architecture that enables 236B parameter intelligence at 21B active cost. Together they produced 5.76× throughput, 42.5% cost savings, and the first open-source frontier model with 128K context.

🤗 Download V2 Chat arXiv Paper ↗ GitHub →
236BTotal parameters
21BActive per token
128KContext tokens
−93.3%KV cache vs dense
5.76×Generation throughput
8.1TTraining tokens
V2 Model Family

Four Models. One Architecture.

The V2 family launched with four variants — the flagship 236B MoE, a lightweight Lite model, and Chat/Code specialised versions — all built on MLA and DeepSeekMoE.

⚡ FLAGSHIP · MoE
🏆
DeepSeek-V2
236B / 21B active · May 7, 2024

The flagship. 236B total parameters with 21B activated per token via DeepSeekMoE. MLA reduces KV cache 93.3% vs the 67B dense predecessor. 128K context window. Trained on 8.1T tokens. On a single 8× H800 node achieves 50K+ tokens/sec throughput — 5.76× the old 67B model.

236B
Total params
21B
Active/token
128K
Context
160
Routed experts
🪶 LITE · MoE
DeepSeek-V2-Lite
16B / 2.4B active · May 2024

Community-requested lighter sibling — released days after V2 following high interest in MLA research. 16B total, 2.4B active per token. Trained on 5.7T tokens. Deployable on a single 40G GPU; fine-tunable on 8× 80G GPUs. Outperforms 7B dense and other 16B MoE models on English and Chinese benchmarks.

16B
Total params
2.4B
Active/token
5.7T
Training tokens
40G
Single GPU
💬 CHAT · SFT + RL
🗣️
DeepSeek-V2-Chat
236B / 21B active · SFT + RL-aligned

The instruction-tuned and RL-aligned chat variant. Achieves 38.9% win rate on AlpacaEval 2.0 and 8.97 overall score on MT-Bench — top open-source chat performance at release. Two sub-variants: V2-Chat (SFT) and V2-Chat (RL). Chat-RL shows further gains on math and coding vs SFT alone.

38.9%
AlpacaEval 2.0
8.97
MT-Bench
SFT+RL
Alignment
128K
Context
💻 CODER V2 · CODE
⌨️
DeepSeek-Coder-V2
236B / 21B active · Code-specialised

Code-specialised V2 variant. Continues pre-training from V2-Base on 10.2T tokens at 60% code / 10% math / 30% natural language. HumanEval 90.2%, MBPP 76.2%. First open-source model to surpass 10% on SWE-Bench. Rivals GPT-4-Turbo, Claude 3 Opus, Gemini 1.5 Pro on coding benchmarks.

90.2%
HumanEval
76.2%
MBPP
10.2T
Total tokens
338
Languages
Historical note: DeepSeek-V2 was released on May 7, 2024 — six months after the V1 models in November 2023. The "V2" naming retroactively made November 2023's DeepSeek Coder and DeepSeek LLM into "V1." DeepSeek-V2 is no longer the live API model — deepseek-chat now routes to V3.2 / V4. See the changelog for the full API history.
Architecture Deep Dive

MLA + DeepSeekMoE: Two Ideas That Lasted.

Both architectural innovations introduced in V2 were adopted verbatim in V3, R1, V3.1, V3.2, and V4. MLA became the standard attention mechanism for the entire DeepSeek lineage. DeepSeekMoE became the expert routing strategy. Nothing has been replaced — only extended.

🔑
Multi-head Latent Attention (MLA)

Traditional MHA stores a full Key and Value vector for every token in the KV cache. For a model with n_h = 128 heads and head dimension d_h = 128, that's 32,768 floats per token per layer — a massive memory bottleneck at 128K context.

MLA solves this with low-rank joint compression: instead of caching the full K and V matrices, it caches a single compressed latent vector c_KV ∈ ℝ^{d_c} where d_c ≪ d_h × n_h. During inference, K and V are reconstructed on-demand from this latent vector via learned up-projection matrices W_UK and W_UV. At inference time, W_UK can be absorbed into W_Q, and W_UV into W_O — eliminating even the decompression step entirely via weight absorption.

The result: only the tiny latent vector needs to be cached, reducing KV cache size by 93.3% compared to DeepSeek-67B. This directly enables 5.76× higher generation throughput, 128K context windows, and dramatically reduced serving costs — without sacrificing model quality. Empirically, MLA outperforms MHA, MQA, and GQA in ablation studies.

MLA also uses a decoupled RoPE strategy: position encoding is applied separately to a small RoPE portion of the query/key and does not interact with the compressed latent vector, keeping the absorption trick valid.

Standard MHA cache per token:
K: d_h × n_h + V: d_h × n_h = 32,768 floats / layer
MLA cache per token:
c_KV (latent): d_c = 512 floats / layer → −93.3% memory ✓
K and V reconstructed on-demand from c_KV via W_UK, W_UV. W_UK absorbed into W_Q at serving time → zero decompression overhead.
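The arithmetic in the card above can be checked directly. A toy calculation using V2's published dimensions (the 64-wide decoupled RoPE key follows the paper's configuration); note the per-token ratio against vanilla MHA comes out even larger than the headline 93.3%, which the paper measures against DeepSeek 67B — a model that already used grouped-query attention:

```python
# Per-token, per-layer KV cache, using V2's published dimensions
n_h = 128          # attention heads
d_h = 128          # head dimension
d_c = 512          # MLA latent (compressed KV) dimension
d_rope = 64        # decoupled RoPE key, shared across heads

mha = 2 * n_h * d_h        # full K and V for every head
mla = d_c + d_rope         # one latent vector + small decoupled RoPE key

print(mha)                             # 32768 floats
print(mla)                             # 576 floats
print(f"{1 - mla / mha:.1%} smaller")  # vs vanilla MHA
```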
🧩
DeepSeekMoE Architecture

Conventional MoE architectures (e.g., GShard, Switch Transformer) use coarse expert segmentation: 8 or 16 large experts, 2 activated per token. DeepSeekMoE takes the opposite approach: fine-grained expert segmentation with many smaller experts and a higher activation ratio, plus a dedicated shared expert isolation mechanism.

In V2's DeepSeekMoE, each FFN layer contains 160 routed experts plus 2 shared experts that are always active. For each token, 6 of the 160 routed experts are selected (top-6 routing), plus both shared experts fire unconditionally. The result: the model activates experts covering diverse knowledge areas per token, while shared experts handle universal capabilities (syntax, basic reasoning) that should fire regardless of routing.

This design achieves higher expert specialisation than conventional MoE while keeping the total active parameter budget at 21B. During training, an auxiliary load-balancing loss is used to prevent expert collapse (some experts receiving all tokens, others starving). The loss coefficient is carefully tuned to balance load without harming model quality.

V2's expert parallelism strategy during training adds supplementary mechanisms to control communication overhead across GPU nodes — a critical engineering detail that makes 236B training economical on H800 clusters.
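The routing scheme described above can be sketched in a few lines. This is a toy illustration with made-up dimensions and softmax gating — real V2 adds load-balancing losses, expert parallelism, and much larger experts:

```python
import numpy as np

rng = np.random.default_rng(0)

n_routed, n_shared, top_k = 160, 2, 6
d_model, d_ff = 64, 128    # toy sizes — V2's real dims are much larger

# Toy expert FFNs (routed experts first, then shared) and a router
experts = [
    (rng.standard_normal((d_model, d_ff)) * 0.02,
     rng.standard_normal((d_ff, d_model)) * 0.02)
    for _ in range(n_routed + n_shared)
]
router_w = rng.standard_normal((d_model, n_routed)) * 0.02

def ffn(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2     # ReLU FFN stand-in

def moe_layer(x):
    gate = np.exp(x @ router_w)
    gate /= gate.sum()                      # softmax over 160 routed experts
    top = np.argsort(gate)[-top_k:]         # top-6 routing
    y = sum(gate[e] * ffn(x, *experts[e]) for e in top)
    for s in range(n_shared):               # 2 shared experts always fire
        y = y + ffn(x, *experts[n_routed + s])
    return y, top

y, selected = moe_layer(rng.standard_normal(d_model))
print(len(selected), "routed experts fired (+2 shared)")
```

Per token, only 6 of 160 routed expert FFNs run, so compute scales with the activated parameters rather than the total.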

Component · Value
Total parameters · 236B
Active per token · 21B
Routed experts per layer · 160
Shared experts per layer · 2 (always active)
Top-k routing · 6 of 160 routed
Transformer layers · 60
Attention heads · 128
KV compression dim · 512 (d_c)
Context length · 128K tokens
Vocabulary · 100,014 (BPE)
Key Innovations

Six Ideas That Defined V2.

DeepSeek-V2 wasn't just a bigger model. It introduced architectural and engineering innovations that changed how the industry thinks about MoE inference, long context, and open-source economic efficiency.

01
🔑
MLA — KV Cache Redefined

Multi-head Latent Attention compresses the KV cache into a single low-rank latent vector per token. 93.3% reduction in KV cache size vs dense MHA. Empirically outperforms MHA in ablation studies — unlike GQA/MQA which trade quality for memory. Adopted in every DeepSeek model after V2. Now being retrofitted onto other architectures (MHA2MLA paper, 2025).

Architecture
02
🧩
DeepSeekMoE — Fine-Grained Experts

160 routed experts + 2 shared experts per FFN layer. Top-6 routing + always-on shared experts. Higher specialisation than GShard/Switch at same FLOP budget. Expert parallelism engineering controls inter-node communication overhead at scale. Enables 236B knowledge capacity at 21B activation cost per token.

Architecture
03
📏
128K Context Window

First DeepSeek flagship with a 128K token context window — enabling entire codebases, long legal documents, and book-length analyses. Made practical by MLA: without the 93.3% KV cache reduction, 128K context on 8× H800 would exhaust GPU memory before reaching meaningful batch sizes.

Context
04
💰
42.5% Training Cost Reduction

Compared to training a comparable-capability dense model at 67B. MoE sparse computation means only 21B parameters are activated and updated per forward pass. The FLOPs-per-token are comparable to a 21B dense model while knowledge capacity scales to 236B. DeepSeek trained V2 at economical cost on H800 GPUs with full engineering details published.

Economics
05
5.76× Generation Throughput

On a single 8× H800 GPU node, DeepSeek-V2 achieves over 50,000 tokens/second generation throughput — 5.76× the maximum throughput of DeepSeek 67B on identical hardware. Driven by MLA's smaller KV cache enabling larger batch sizes, plus FP8 quantisation and KV cache compression (6 bits/element) for serving.

Performance
06
📖
8.1T Token Pre-training Corpus

Pre-trained on an 8.1T-token multi-source corpus — roughly 4× the V1 LLM's 2T. Broad coverage of English, Chinese, code, math, and scientific text. The corpus quality and scale explain the performance gap over V1: V2 matches or exceeds models trained with far more compute through architectural efficiency.

Data
Efficiency Wins vs Dense 67B

Same Quality. Fraction of the Cost.

Compared to the previous-generation DeepSeek 67B dense model with the same quality tier, V2 delivers dramatic efficiency gains across every dimension that matters for production deployment.

93.3%
KV Cache Reduction
vs DeepSeek-67B dense model
5.76×
Max Generation Throughput
50K+ tok/s on 8× H800
42.5%
Training Cost Savings
vs comparable-quality dense training
32×
Parameter Efficiency
236B capacity, 21B active FLOPs
Benchmarks — May 2024

Top-Tier Open-Source at Release.

Results from the official arXiv:2405.04434 paper. At release, DeepSeek-V2 was the strongest open-source model on most benchmarks, matching or exceeding Llama 3 70B and Mixtral 8×22B with a fraction of activated parameters.

MMLU (5-shot) — General Knowledge
57 subjects spanning STEM, humanities, professional knowledge
DS-V2 beats Mixtral 8×22B
DeepSeek-V2
78.5%
LLaMA-3-70B
79.5%
Mixtral 8×22B
77.8%
DeepSeek-67B
71.3%
BBH (3-shot CoT) — Complex Reasoning
BIG-Bench Hard — challenging tasks requiring multi-step reasoning
DS-V2 top open-source
DeepSeek-V2
78.9%
LLaMA-3-70B
81.0%
Mixtral 8×22B
78.9%
DeepSeek-67B
68.3%
AlpacaEval 2.0 LC Win Rate — Chat Quality
Length-controlled win rate vs GPT-4-Turbo as judge
DS-V2-Chat (RL): 38.9%
DS-V2-Chat (RL)
38.9%
DS-V2-Chat (SFT)
24.3%
LLaMA-3-70B-Chat
34.4%
HumanEval Python Pass@1
164 hand-written Python problems, zero-shot
Coder-V2: 90.2% — rivals GPT-4o
DS-Coder-V2
90.2%
DeepSeek-V2-Chat
73.7%
LLaMA-3-70B
75.6%
Mixtral 8×22B
75.6%
LiveCodeBench — Real-World Coding Contests
Questions from Dec 2023 – Jun 2024; no training contamination
DS-Coder-V2: 43.4%
DS-Coder-V2
43.4%
GPT-4-Turbo
42.3%
Claude-3-Opus
27.1%
SWE-Bench Verified — Real Software Engineering
First open-source model to exceed 10%
Historic open-source first
DS-Coder-V2
>10% ✓
Best prior open-source
<10%
GSM8K — Grade School Math (0-shot)
Multi-step arithmetic word problems
DS-V2-Chat: 92.2%
DS-V2-Chat (RL)
92.2%
LLaMA-3-70B-Chat
93.0%
DeepSeek-67B-Chat
84.1%
MATH Benchmark (0-shot)
Competition-level mathematics
+20pts vs V1 67B
DS-V2-Chat (RL)
52.7%
DS-Coder-V2
75.7%
LLaMA-3-70B-Chat
50.4%
DeepSeek-67B-Chat
32.6%
CRUXEval-I + CRUXEval-O — Code Reasoning
Input/output prediction for Python programs
DS-Coder-V2
74.2%
GPT-4-Turbo
66.9%
C-Eval — Chinese Academic Knowledge
52-subject Chinese university entrance exam
DS-V2 leads all open-source
DeepSeek-V2-Chat
81.7%
LLaMA-3-70B-Chat
61.6%
Mixtral 8×22B
59.6%
DeepSeek-67B-Chat
76.7%
CMMLU — Chinese Multi-task Language Understanding
Counterpart to English MMLU in Chinese
DS-V2: +20pts vs LLaMA-3-70B
DeepSeek-V2
84.0%
LLaMA-3-70B
64.5%
Mixtral 8×22B
60.0%
AlignBench — Chinese Open-ended Chat
GPT-4 judged Chinese conversational quality
DS-V2 Chat top-ranked
DS-V2-Chat (RL)
Top-ranked
LLaMA-3-70B-Chat
Lower
Training Details

8.1T Tokens. Economical by Design.

V2 was trained on more than 4× the tokens of V1's LLM — but at significantly lower cost per FLOP due to MoE sparse activation. Every training decision was documented in the published technical report.

📊 Pre-Training

Data: 8.1 trillion tokens from a high-quality multi-source corpus. Significantly broader and larger than V1's 2T tokens. Multi-source means code, English web text, Chinese web text, scientific papers, books, and technical documentation — with domain-specific quality filtering and deduplication at each stage.

Context extension: During the final pre-training phase, the context window is extended from 4K to 128K tokens using YaRN (Yet another RoPE extensioN) — a positional interpolation technique that allows pre-trained models to generalise to longer sequences without retraining from scratch.
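The core idea behind RoPE extension methods — of which YaRN is a refinement — is to rescale rotary frequencies so 128K positions map into the angle range the model saw at 4K. A minimal position-interpolation sketch (YaRN itself scales each frequency band differently and adjusts attention temperature; this shows only the basic interpolation idea):

```python
import numpy as np

def rope_angles(pos, d=8, base=10000.0, scale=1.0):
    # Rotary angles for one position: (pos / scale) * base^(-2i/d)
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    return (pos / scale) * inv_freq

# Position 128_000 under 32x interpolation lands on the same angles
# the model learned for position 4_000 during pre-training.
print(np.allclose(rope_angles(128_000, scale=32.0), rope_angles(4_000)))  # True
```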

Load balancing: An auxiliary loss is applied to balance expert utilisation across the 160 routed experts. Without this, routing collapse occurs — most tokens go to a handful of popular experts while others starve. The loss coefficient is carefully tuned to balance load without degrading model quality.

8.1T
Training tokens
4K→128K
Context ext.
BF16
Precision
AdamW
Optimiser
🎯 Post-Training (SFT + RL)

Supervised Fine-Tuning (SFT): Applied on top of the pre-trained Base model with a broad instruction-following dataset covering coding, mathematics, creative writing, safety, and Chinese-language tasks. The SFT dataset includes substantial math and code content — explaining why V2-Chat (SFT) already shows strong improvement in these domains vs the Base model.

Reinforcement Learning (RL): GRPO-based RL further improves performance on math and coding benchmarks. V2-Chat (RL) shows noticeable gains over V2-Chat (SFT) on GSM8K, MATH, and HumanEval — demonstrating that RL is particularly valuable for tasks with verifiable outcomes where reward signals are clear.
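GRPO's central trick — group-relative advantages with no learned value model — can be sketched as follows (simplified; the full objective adds a clipped probability ratio and a KL penalty against the reference policy):

```python
import numpy as np

def grpo_advantages(rewards):
    """For one prompt, sample a group of outputs, score each, and use the
    group-normalised reward as the advantage (no value-network baseline)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Verifiable task: reward 1.0 for a correct answer, 0.0 otherwise.
# Correct samples get positive advantage, incorrect ones negative.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

This is why RL pays off most on math and code: a verifier supplies the group rewards directly, with no reward-model noise.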

Serving: For deployment, V2 parameters are converted to FP8 precision and KV cache elements are further quantised to 6 bits on average — additional compression on top of MLA's latent caching that makes production serving economics favourable.
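Quantising cache elements to ~6 bits is, conceptually, plain affine quantisation — a generic sketch, not DeepSeek's actual serving kernel:

```python
import numpy as np

def quantize(x, bits=6):
    # Map values onto 2^bits - 1 uniform levels between min and max
    levels = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize(q, lo, scale):
    return q * scale + lo

x = np.random.default_rng(0).standard_normal(512).astype(np.float32)
q, lo, scale = quantize(x)
err = np.abs(dequantize(q, lo, scale) - x).max()
print(err < 0.1)   # True — small reconstruction error at 6 bits
```

At 6 bits per element instead of 16, the already-compressed MLA latent cache shrinks by a further ~2.7×.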

SFT
Stage 1
GRPO
RL method
FP8
Serving precision
6-bit
KV quantisation
What V2 Delivered

Twelve Advances in One Release.

DeepSeek-V2 didn't just improve benchmark numbers. It changed what open-source AI could be: a viable alternative to GPT-4 at dramatically lower cost, with architectural innovations the entire industry subsequently adopted.

🔑
MLA Invention

Multi-head Latent Attention — a new attention mechanism that beats MHA on quality while compressing KV cache 93.3%. First described here; now used in V3, R1, V3.1, V3.2, V4, and being retrofitted into other architectures.

New Architecture
🧩
DeepSeekMoE

Fine-grained expert segmentation with 160 routed + 2 shared experts. Higher specialisation than GShard/Switch at same FLOP cost. The MoE architecture that powers every subsequent DeepSeek model through V4.

New Architecture
📏
128K Context Window

First DeepSeek flagship with 128K tokens — made practical by MLA's 93.3% KV cache reduction. Enables entire codebases, long legal documents, and book-level analyses in a single context.

Long Context
50K+ Tokens/Sec Throughput

5.76× the throughput of DeepSeek 67B on identical 8× H800 hardware. The higher throughput comes directly from MLA enabling larger batch sizes — the same batch that was GPU memory-limited is now unconstrained.

Performance
💰
42.5% Lower Training Cost

Sparse MoE means only 21B parameters are activated and updated per forward pass while knowledge scales to 236B. Economic training that enables frontier intelligence at sustainable cost.

Economics
🔬
236B/21B MoE Configuration

236B total parameters, 21B activated per token. This ratio — roughly 11× expansion factor — became the template for V3 (671B/37B) and V4-Pro (1.6T/49B). The specific balance of capacity vs activation cost was validated here.

Scale
🏅
Open-Source SWE-Bench First

DeepSeek-Coder-V2 was the first open-source model to surpass 10% on SWE-Bench Verified — the real-world software engineering benchmark. Previously only closed-source models had crossed this threshold.

Milestone
🇨🇳
+20pts Chinese vs LLaMA-3

V2 outperforms LLaMA-3-70B by over 20 percentage points on CMMLU and C-Eval. The 100K BPE vocabulary (inherited from V1) and the bilingual training corpus make V2 the dominant open-source model for Chinese-language tasks.

Chinese
🧪
Weight Absorption Trick

During serving, W_UK is absorbed into W_Q and W_UV into W_O — eliminating the MLA decompression step entirely. This "free" speedup means the theoretical memory savings translate directly into practical inference speedups with no quality loss.
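The absorption identity is plain linear algebra: since an attention score is q·k = (x W_Q)(c W_UK)ᵀ = x (W_Q W_UKᵀ) cᵀ, the product W_Q W_UKᵀ can be precomputed once, so the latent vector is queried directly and never decompressed. A toy numpy check (shapes are illustrative, not V2's real dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_c, d_head = 32, 8, 16

W_Q  = rng.standard_normal((d_model, d_head))  # query projection
W_UK = rng.standard_normal((d_c, d_head))      # key up-projection (decompression)

x = rng.standard_normal(d_model)   # current token's hidden state
c = rng.standard_normal(d_c)       # cached latent for a past token

# Naive path: decompress the latent into a full key, then dot with the query
score_naive = (x @ W_Q) @ (c @ W_UK)

# Absorbed path: fold W_UK into W_Q once, query the latent directly
W_absorbed = W_Q @ W_UK.T          # (d_model, d_c), precomputed at load time
score_fast = (x @ W_absorbed) @ c

print(np.allclose(score_naive, score_fast))   # True
```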

Inference
📦
V2-Lite for Single GPU

The 16B / 2.4B active V2-Lite model was released as a community response to the surge of MLA research interest. Deployable on a single 40G GPU — making MLA accessible for researchers without large-scale infrastructure.

Accessible
🔓
Full Technical Transparency

arXiv:2405.04434 published all architecture details, training hyperparameters, alignment pipeline, serving optimisations, and ablation studies. Every innovation — MLA, DeepSeekMoE, expert load balancing — fully documented.

Open
🌱
Foundation for V3, R1, V4

V3 explicitly states: "MLA and DeepSeekMoE architectures, thoroughly validated in DeepSeek-V2." V3 inherited both innovations, extended them, and added FP8 training and MTP. V4 further extends this foundation. V2 is not a stepping stone — it's the bedrock.

Legacy
Code Examples

Use DeepSeek-V2 Today.

V2 model weights remain on Hugging Face. The V2 API endpoint is historical (now routes to V3.2/V4), but the weights are permanently available for local inference, fine-tuning, and research.

# DeepSeek-V2-Chat — local inference with transformers
# BF16 weights need ~8× 80GB GPUs — use quantisation for smaller hardware
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "deepseek-ai/DeepSeek-V2-Chat"
# Lite version (fits single 40G GPU):
# model_id = "deepseek-ai/DeepSeek-V2-Lite-Chat"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Explain Multi-head Latent Attention in simple terms."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(
    inputs, max_new_tokens=2048, do_sample=True, temperature=1.0, top_p=0.95
)
print(tokenizer.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
# DeepSeek-Coder-V2-Instruct — code generation
# 90.2% HumanEval · rivals GPT-4o on code tasks
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "deepseek-ai/DeepSeek-Coder-V2-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Implement a balanced BST with insert, delete, and search in Python."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=4096)
print(tokenizer.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
# Current DeepSeek API (V4-era) — V2 API endpoints are now historical
# Use platform.deepseek.com and these model strings for production
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",
    base_url="https://api.deepseek.com"
)

# Current model strings (as of May 2026):
# "deepseek-chat"     → DeepSeek-V4-Flash (instant mode)
# "deepseek-reasoner" → DeepSeek-V4-Flash with thinking
# "deepseek-v4-pro"   → DeepSeek-V4-Pro (expert mode)
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "How does MLA improve on GQA for inference?"},
    ],
    stream=False,
)
print(response.choices[0].message.content)
V2-Era API Changelog

How the API Evolved Through V2.

The V2 generation powered the hosted DeepSeek API from May to December 2024. During this period, the endpoint went through three major model updates before transitioning to the V3 generation.

V2.5-1210
December 10, 2024
Final V2-era API update. API deepseek-chat upgraded to DeepSeek-V2.5-1210. Improved math (MATH-500: 74.8% → 82.8%), coding (LiveCodeBench: 29.2% → 34.38%), writing, and reasoning. Better file upload and webpage summarisation. This was the last deepseek-chat alias on the V2 architecture — the next update in December 2024 moved to V3.
V2.5
September 5, 2024
DeepSeek-V2.5 launched — merged the general chat path (V2-0628) and the coder path (DeepSeek-Coder-V2-0724) into one model. NEW Single model for both general and code tasks. Backwards compatible through both deepseek-chat and deepseek-coder endpoints. First all-in-one model for general use and programming.
V2-0628
June 28, 2024
DeepSeek-V2-0628 API update. API Stronger reasoning and role-playing behaviour. Improved instruction following for complex multi-turn conversations. deepseek-chat alias moved to this checkpoint.
V2-0517
May 17, 2024
First API update after V2 launch. API Substantially improved instruction following and JSON output quality. deepseek-chat alias moved to DeepSeek-V2-0517 from the original V2 weights. Enhanced capability for structured data generation tasks.
V2 Launch
May 7, 2024
DeepSeek-V2 released. NEW 236B MoE, 21B active, 128K context. MLA and DeepSeekMoE introduced. Open-source weights on Hugging Face. API endpoint launched. deepseek-chat moved from V1-era model to V2. DeepSeek-V2-Lite (16B/2.4B) released days later following community demand for a smaller MLA research model.
Heritage

V2: The Architecture Model That Stuck.

The V2 paper was published in May 2024. By December 2024, V3 explicitly credited V2 as the architectural foundation. By April 2026, V4-Pro carried 1.6T parameters on the same two innovations. No architecture since has needed to replace MLA or DeepSeekMoE.

May 7, 2024
DeepSeek-V2 — MLA + DeepSeekMoE Introduced

236B/21B MoE, 128K context, 93.3% KV cache reduction, 5.76× throughput. Two architectural innovations published openly. API price: $0.14/1M input — a fraction of GPT-4's $30/1M at the time. Community MLA interest triggers release of V2-Lite within days.

MLA invented · DeepSeekMoE established · 128K context · arXiv:2405.04434
June 2024
DeepSeek-Coder-V2 — First Open-Source SWE-Bench >10%

Code-specialised V2 variant. HumanEval 90.2%, MBPP 76.2%, MATH 75.7%. First open-source model to cross the 10% SWE-Bench threshold. Rivals GPT-4-Turbo, Claude 3 Opus, Gemini 1.5 Pro on coding and math. 338 programming languages. 128K context window.

First open-source SWE-Bench >10% · Code + math parity with GPT-4o
September 2024
DeepSeek-V2.5 — Chat + Code Unified

Merged V2-0628 general chat and Coder-V2-0724 into one model — the first DeepSeek all-in-one model for both general tasks and code. Backward compatible via both deepseek-chat and deepseek-coder API aliases. Sets the template for V3's unified approach.

First unified chat + code model · API consolidation
December 26, 2024
DeepSeek-V3 — MLA + MoE Extended to 671B

V3 paper explicitly states: "MLA and DeepSeekMoE architectures, thoroughly validated in DeepSeek-V2." V3 scales to 671B/37B, adds FP8 mixed-precision training and Multi-Token Prediction (MTP) objective. 2.788M H800 GPU hours. API: $0.27/1M — still far below GPT-4o's cost. Cost: ~$5.5M to train.

V2 architecture validated and scaled · FP8 training added · 671B/37B
April 24, 2026
DeepSeek-V4 — 1.6T Parameters on V2's Foundation

V4-Pro (1.6T/49B) and V4-Flash (284B/13B). Both use MLA and DeepSeekMoE — exactly as introduced in V2 two years earlier. Codeforces #1 (3206 Elo), 80.6% SWE-bench Verified, 1M token context. The architecture DeepSeek invented for 236B now scales to 1.6 trillion parameters.

MLA + MoE at 1.6T · Codeforces #1 · 1M context · 2 years from V2
FAQ

DeepSeek V2 Questions Answered.

What is MLA and why does it matter more than GQA?+

Multi-head Latent Attention (MLA) and Grouped-Query Attention (GQA) both reduce KV cache memory — but through fundamentally different approaches.

GQA reduces cache size by sharing Key/Value heads across groups of Query heads: instead of n_h KV heads, you have n_h/g KV heads. This reduces memory at the cost of performance — GQA and MQA trade quality for efficiency.

MLA takes a different approach: it compresses all K and V information into a single low-rank latent vector per token using learned down-projection and up-projection matrices. Only this tiny vector is cached. At inference time, K and V are reconstructed on demand — and the up-projection matrix can even be absorbed into W_Q and W_O, eliminating the decompression step entirely.

The result: MLA achieves better performance than MHA (not worse) while cutting KV cache by 93.3%. GQA was a tradeoff; MLA is a genuine improvement. This is why every DeepSeek model since V2 uses MLA.
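The memory gap is easy to see numerically. A toy comparison with V2-style attention dimensions (the GQA group count of 8 KV heads is illustrative, not any specific model's configuration):

```python
# Per-token, per-layer KV cache in floats
n_h, d_h = 128, 128

mha = 2 * n_h * d_h    # every head caches its full K and V
gqa = 2 * 8 * d_h      # 8 shared KV heads serve all 128 query heads
mla = 512 + 64         # one 512-dim latent + a 64-dim decoupled RoPE key

print(mha, gqa, mla)   # 32768 2048 576
```

GQA buys its 16× saving by collapsing KV heads; MLA goes further still while keeping per-head key/value diversity recoverable from the latent.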

Can I still use the DeepSeek-V2 API endpoint?+

The V2 API era ended in December 2024. The deepseek-chat and deepseek-coder endpoints that pointed to V2-era models have since been upgraded — first to V3 in December 2024, and as of April 2026 to V4 models. There is no way to call the V2 model via the hosted API — it is a historical endpoint. For V2, you have two options: (1) Download the model weights from Hugging Face and run locally. (2) Use a community-hosted endpoint. For production use, use platform.deepseek.com with deepseek-chat (now V4-Flash) or deepseek-v4-pro.

What is DeepSeekMoE and how does it differ from standard MoE?+

Standard MoE (like Switch Transformer or GShard) uses coarse expert segmentation: typically 8–16 large experts with top-2 routing. This means each token activates 2 of 8 experts — a 25% activation ratio but with low specialisation, since each expert must handle a broad range of token types. DeepSeekMoE uses fine-grained expert segmentation: 160 smaller routed experts with top-6 routing, plus 2 shared experts that always fire. This means each token activates 8 of 162 experts — a ~5% activation ratio. The critical difference: with more but smaller experts, each expert can specialise on a much narrower subspace of token types (e.g., Python syntax, Chinese grammar, mathematical notation), while the shared experts handle universal capabilities that should fire for all tokens. The result is higher effective specialisation with the same total parameter activation budget.

How much GPU memory does V2 require to run locally?+

The full 236B V2 model needs roughly 470 GB for the BF16 weights alone — the official repository calls for 8× 80GB GPUs (A100/H100). For most researchers, V2-Lite is the practical choice: the 16B / 2.4B active model deploys on a single 40G GPU (e.g. A100 40G) and is fine-tunable on 8× 80G GPUs. For quantised variants: the 236B model in Q4_K_M quantisation (~4-bit) reduces to roughly 120–130 GB, fitting 2× A100 80GB. The community has published various GGUF quantisations via llama.cpp. For production inference without local hardware, the current V4 API is the recommended path.
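These figures follow from simple parameter arithmetic — a back-of-envelope sketch that ignores activation and KV-cache overhead, and assumes ~4.5 effective bits per weight for Q4_K_M-style quantisation (an approximation, not a published figure):

```python
def weight_gb(n_params_billion, bits_per_param):
    """Raw weight storage in GB (decimal), ignoring runtime overhead."""
    return n_params_billion * bits_per_param / 8

print(weight_gb(236, 16))    # BF16 full V2: 472.0 GB → 8× 80GB GPUs
print(weight_gb(236, 4.5))   # ~4.5 bits/weight GGUF: 132.75 GB
print(weight_gb(16, 16))     # V2-Lite BF16: 32.0 GB → one 40G GPU
```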

What changed from V2 to V3?+

V3 (December 2024) kept MLA and DeepSeekMoE verbatim from V2 and added four main advances: (1) Scale: 671B/37B vs 236B/21B — roughly 2.9× more total parameters, 1.76× more active. (2) FP8 mixed-precision training: V3 pioneered FP8 training, dramatically reducing memory and compute cost during pre-training. (3) Multi-Token Prediction (MTP): V3 trains to predict multiple future tokens simultaneously, improving sample efficiency and enabling speculative decoding. (4) Auxiliary-loss-free load balancing: V3 replaces V2's auxiliary load-balancing loss with a bias-based mechanism that achieves better balance with less quality degradation. Pre-training corpus: 14.8T tokens vs V2's 8.1T. Both were strong at release; V3 is significantly more capable.

Is DeepSeek-Coder-V2 still competitive in 2026?+

On most coding benchmarks in 2026, Coder-V2 has been surpassed by V4-Pro (80.6% SWE-bench vs Coder-V2's ~12%), Claude 3.7, and GPT-4o. However, Coder-V2 retains value in specific contexts: (1) Local inference: If you need a strong code model you can run locally at 236B scale without needing the latest closed-source model, Coder-V2 still delivers competitive performance. (2) 128K context for code: Useful for repository-level analysis. (3) Historical research: As the first open-source model to pass 10% SWE-Bench, it's an important reference point. For production code generation in 2026, V4-Flash ($0.14/1M) offers significantly better performance at lower cost than hosting Coder-V2 locally.

The Architecture

The model that
changed the formula.

May 2024. Multi-head Latent Attention. DeepSeekMoE. 93.3% less KV cache. 5.76× throughput. 128K context. 42.5% cheaper training. Every DeepSeek model since has been built on these two ideas — unmodified since V2.

🤗 Download V2 Chat → Read Paper ↗ GitHub →