DeepSeek-V2 introduced two ideas that every major model now uses: Multi-head Latent Attention (MLA), which compresses the KV cache by 93.3%, and DeepSeekMoE, a fine-grained expert architecture that delivers 236B-parameter capacity at 21B-active cost. Together they produced 5.76× generation throughput, 42.5% lower training cost, and the first open-source frontier model with 128K context.
The V2 family launched with four variants — the flagship 236B MoE, a lightweight Lite model, and Chat/Code specialised versions — all built on MLA and DeepSeekMoE.
The flagship. 236B total parameters with 21B activated per token via DeepSeekMoE. MLA reduces KV cache 93.3% vs the 67B dense predecessor. 128K context window. Trained on 8.1T tokens. On a single 8× H800 node achieves 50K+ tokens/sec throughput — 5.76× the old 67B model.
Community-requested lighter sibling — released days after V2 following high interest in MLA research. 16B total, 2.4B active per token. Trained on 5.7T tokens. Deployable on a single 40G GPU; fine-tunable on 8× 80G GPUs. Outperforms 7B dense and other 16B MoE models on English and Chinese benchmarks.
The instruction-tuned and RL-aligned chat variant. Achieves 38.9% win rate on AlpacaEval 2.0 and 8.97 overall score on MT-Bench — top open-source chat performance at release. Two sub-variants: V2-Chat (SFT) and V2-Chat (RL). Chat-RL shows further gains on math and coding vs SFT alone.
Code-specialised V2 variant. Continues pre-training from V2-Base with an additional code-heavy corpus to 10.2T total tokens, mixed at roughly 60% code / 10% math / 30% natural language. HumanEval 90.2%, MBPP 76.2%. First open-source model to surpass 10% on SWE-Bench. Rivals GPT-4-Turbo, Claude 3 Opus, Gemini 1.5 Pro on coding benchmarks.
deepseek-chat now routes to V3.2 / V4. See the changelog for the full API history.
Both architectural innovations introduced in V2 were adopted verbatim in V3, R1, V3.1, V3.2, and V4. MLA became the standard attention mechanism for the entire DeepSeek lineage. DeepSeekMoE became the expert routing strategy. Nothing has been replaced — only extended.
Traditional MHA stores a full Key and Value vector for every token in the KV cache. For a model with n_h = 128 heads and head dimension d_h = 128, that's 2 × 128 × 128 = 32,768 floats per token per layer — a massive memory bottleneck at 128K context.
MLA solves this with low-rank joint compression: instead of caching the full K and V matrices, it caches a single compressed latent vector c_KV ∈ ℝ^{d_c} where d_c ≪ d_h × n_h. During inference, K and V are reconstructed on-demand from this latent vector via learned up-projection matrices W_UK and W_UV. At inference time, W_UK can be absorbed into W_Q, and W_UV into W_O — eliminating even the decompression step entirely via weight absorption.
The result: only the tiny latent vector needs to be cached, reducing KV cache size by 93.3% compared to DeepSeek-67B. This directly enables 5.76× higher generation throughput, 128K context windows, and dramatically reduced serving costs — without sacrificing model quality. Empirically, MLA outperforms MHA, MQA, and GQA in ablation studies.
MLA also uses a decoupled RoPE strategy: position encoding is applied separately to a small RoPE portion of the query/key and does not interact with the compressed latent vector, keeping the absorption trick valid.
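To make the caching path concrete, here is a minimal single-layer sketch in PyTorch. It is an illustration rather than DeepSeek's implementation: the latent size and head counts follow the configuration table below, the model width of 5120 is an assumption, and query compression plus the decoupled RoPE path are omitted.

```python
import torch
import torch.nn as nn

# Minimal sketch of MLA-style KV caching (illustration only, not DeepSeek's code).
# Dimensions follow the V2 configuration: 128 heads of size 128, latent d_c = 512;
# the model width of 5120 is an assumption. Query compression and the decoupled
# RoPE path are omitted.
d_model, n_heads, d_head, d_c = 5120, 128, 128, 512

W_DKV = nn.Linear(d_model, d_c, bias=False)           # down-projection: hidden -> cached latent
W_UK  = nn.Linear(d_c, n_heads * d_head, bias=False)  # up-projection to per-head Keys
W_UV  = nn.Linear(d_c, n_heads * d_head, bias=False)  # up-projection to per-head Values

def decode_step(h_t, latent_cache):
    """h_t: (batch, d_model) hidden state of the newly generated token."""
    latent_cache.append(W_DKV(h_t))                   # only the d_c-dim latent is cached
    c = torch.stack(latent_cache, dim=1)              # (batch, seq, d_c)
    # K and V are reconstructed on demand; in a real serving stack W_UK / W_UV are
    # absorbed into W_Q / W_O so this reconstruction never materialises.
    K = W_UK(c).view(c.size(0), -1, n_heads, d_head)
    V = W_UV(c).view(c.size(0), -1, n_heads, d_head)
    return K, V

cache = []
K, V = decode_step(torch.randn(2, d_model), cache)
# Cache cost per token per layer: d_c = 512 floats vs 2 * 128 * 128 = 32,768 for full MHA.
```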
Conventional MoE architectures (e.g., GShard, Switch Transformer) use coarse expert segmentation: 8 or 16 large experts, 2 activated per token. DeepSeekMoE takes the opposite approach: fine-grained expert segmentation with many smaller experts and a higher activation ratio, plus a dedicated shared expert isolation mechanism.
In V2's DeepSeekMoE, each FFN layer contains 160 routed experts plus 2 shared experts that are always active. For each token, 6 of the 160 routed experts are selected (top-6 routing), plus both shared experts fire unconditionally. The result: the model activates experts covering diverse knowledge areas per token, while shared experts handle universal capabilities (syntax, basic reasoning) that should fire regardless of routing.
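The toy sketch below shows this routing pattern: shared experts always on, top-6 of 160 routed experts weighted by their gate scores. Expert counts match V2, but the tiny model width and single-Linear "experts" are stand-ins, and the gating is simplified relative to the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sketch of DeepSeekMoE-style routing (illustration only, not DeepSeek's code).
# Expert counts follow V2 (160 routed, top-6, 2 shared); the model width and the
# single-Linear "experts" are shrunk stand-ins so the snippet runs anywhere.
d_model, n_routed, n_shared, top_k = 64, 160, 2, 6

routed_experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_routed)])
shared_experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_shared)])
router = nn.Linear(d_model, n_routed, bias=False)

def moe_layer(x):                                        # x: (tokens, d_model)
    gate = F.softmax(router(x), dim=-1)                  # token-to-expert affinities
    topk_score, topk_idx = gate.topk(top_k, dim=-1)
    outputs = []
    for t in range(x.size(0)):                           # per-token loop for clarity, not speed
        y = sum(expert(x[t]) for expert in shared_experts)   # shared experts always fire
        for w, i in zip(topk_score[t], topk_idx[t]):         # plus the token's top-6 routed experts
            y = y + w * routed_experts[int(i)](x[t])
        outputs.append(y)
    return torch.stack(outputs)

out = moe_layer(torch.randn(4, d_model))
```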
This design achieves higher expert specialisation than conventional MoE while keeping the total active parameter budget at 21B. During training, an auxiliary load-balancing loss is used to prevent expert collapse (some experts receiving all tokens, others starving). The loss coefficient is carefully tuned to balance load without harming model quality.
V2's expert-parallel training adds supplementary mechanisms to control communication overhead across GPU nodes — a critical engineering detail that makes training a 236B model economical on H800 clusters.
| Component | Value |
|---|---|
| Total parameters | 236B |
| Active per token | 21B |
| Routed experts per layer | 160 |
| Shared experts per layer | 2 (always active) |
| Top-k routing | 6 of 160 routed |
| Transformer layers | 60 |
| Attention heads | 128 |
| KV compression dim | 512 (d_c) |
| Context length | 128K tokens |
| Vocabulary | 100,014 (BPE) |
DeepSeek-V2 wasn't just a bigger model. It introduced architectural and engineering innovations that changed how the industry thinks about MoE inference, long context, and open-source economic efficiency.
Architecture: Multi-head Latent Attention compresses the KV cache into a single low-rank latent vector per token. 93.3% reduction in KV cache size vs the 67B dense predecessor. Empirically outperforms MHA in ablation studies — unlike GQA/MQA which trade quality for memory. Adopted in every DeepSeek model after V2. Now being retrofitted onto other architectures (MHA2MLA paper, 2025).
Architecture: DeepSeekMoE. 160 routed experts + 2 shared experts per FFN layer. Top-6 routing + always-on shared experts. Higher specialisation than GShard/Switch at the same FLOP budget. Expert parallelism engineering controls inter-node communication overhead at scale. Enables 236B knowledge capacity at 21B activation cost per token.
Context: First DeepSeek flagship with a 128K token context window — enabling entire codebases, long legal documents, and book-length analyses. Made practical by MLA: without the 93.3% KV cache reduction, 128K context on 8× H800 would exhaust GPU memory before reaching meaningful batch sizes.
Economics: Training cost falls 42.5% compared to training a comparable-capability dense model at 67B scale. MoE sparse computation means only 21B parameters are activated and updated per forward pass. The FLOPs-per-token are comparable to a 21B dense model while knowledge capacity scales to 236B. DeepSeek trained V2 at economical cost on H800 GPUs with full engineering details published.
Performance: On a single 8× H800 GPU node, DeepSeek-V2 achieves over 50,000 tokens/second generation throughput — 5.76× the maximum throughput of DeepSeek 67B on identical hardware. Driven by MLA's smaller KV cache enabling larger batch sizes, plus FP8 quantisation and KV cache compression (6 bits/element) for serving.
Data: Pre-trained on an 8.1T-token multi-source corpus — roughly 4× the V1 LLM's 2T. Broad coverage of English, Chinese, code, math, and scientific text. The quality and scale of the training data underpin the large performance gap over V1: V2 matches or exceeds models trained on far more compute through architectural and data efficiency.
Compared to the previous-generation DeepSeek 67B dense model, V2 delivers dramatic efficiency gains across every dimension that matters for production deployment.
Results from the official arXiv:2405.04434 paper. At release, DeepSeek-V2 was the strongest open-source model on most benchmarks, matching or exceeding Llama 3 70B and Mixtral 8×22B with a fraction of activated parameters.
V2 was trained on more than 4× the tokens of V1's LLM — but at far lower compute cost per token thanks to MoE sparse activation. Every training decision was documented in the published technical report.
Data: 8.1 trillion tokens from a high-quality multi-source corpus. Significantly broader and larger than V1's 2T tokens. Multi-source means code, English web text, Chinese web text, scientific papers, books, and technical documentation — with domain-specific quality filtering and deduplication at each stage.
Context extension: During the final pre-training phase, the context window is extended from 4K to 128K tokens using YaRN (Yet another RoPE extensioN) — a positional interpolation technique that allows pre-trained models to generalise to longer sequences without retraining from scratch.
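As a rough sketch of the underlying idea (not DeepSeek's exact recipe), the snippet below interpolates RoPE frequencies so that a 32× longer window maps back into the positional range seen during pre-training. Real YaRN defines the per-wavelength ramp and an extra attention scaling term precisely, and V2's specific hyperparameters are given in the paper.

```python
import math
import torch

# Conceptual sketch of wavelength-aware RoPE interpolation in the spirit of YaRN
# (illustration only; the real ramp bounds, attention scaling, and DeepSeek-V2's
# exact settings differ and are documented in the respective papers).
d_head, base = 128, 10000.0
orig_len, target_len = 4096, 131072
s = target_len / orig_len                          # 32x context extension

dims = torch.arange(0, d_head, 2).float()
inv_freq = base ** (-dims / d_head)                # original RoPE frequencies
wavelen = 2 * math.pi / inv_freq                   # wavelength of each rotary dimension

# High-frequency dimensions (short wavelengths) are left untouched; dimensions whose
# wavelength approaches or exceeds the 4K training window are interpolated by s.
blend = (wavelen / orig_len).clamp(0.0, 1.0)       # crude 0..1 ramp; YaRN makes this precise
scaled_inv_freq = inv_freq * (1.0 - blend) + (inv_freq / s) * blend

angles = torch.outer(torch.arange(target_len, dtype=torch.float32), scaled_inv_freq)
cos_table, sin_table = angles.cos(), angles.sin()  # lookup tables used by attention
```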
Load balancing: An auxiliary loss is applied to balance expert utilisation across the 160 routed experts. Without this, routing collapse occurs — most tokens go to a handful of popular experts while others starve. The loss coefficient is carefully tuned to balance load without degrading model quality.
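A minimal sketch of such a balance loss, in the general Switch-Transformer style (the V2 paper additionally defines device-level and communication-level variants; the coefficient shown is indicative only):

```python
import torch
import torch.nn.functional as F

# Sketch of an expert-level load-balancing loss (Switch-style form; the coefficient
# is illustrative, not DeepSeek's tuned value).
def load_balance_loss(router_logits, top_k=6, alpha=0.003):
    n_tokens, n_experts = router_logits.shape
    probs = F.softmax(router_logits, dim=-1)
    topk_idx = probs.topk(top_k, dim=-1).indices
    dispatch = torch.zeros_like(probs).scatter_(1, topk_idx, 1.0)
    load = dispatch.sum(dim=0) * n_experts / (top_k * n_tokens)  # relative load per expert
    importance = probs.mean(dim=0)                               # mean gate probability per expert
    return alpha * (load * importance).sum()                     # grows when a few experts dominate

loss = load_balance_loss(torch.randn(32, 160))
```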
Supervised Fine-Tuning (SFT): Applied on top of the pre-trained Base model with a broad instruction-following dataset covering coding, mathematics, creative writing, safety, and Chinese-language tasks. The SFT dataset includes substantial math and code content — explaining why V2-Chat (SFT) already shows strong improvement in these domains vs the Base model.
Reinforcement Learning (RL): GRPO-based RL further improves performance on math and coding benchmarks. V2-Chat (RL) shows noticeable gains over V2-Chat (SFT) on GSM8K, MATH, and HumanEval — demonstrating that RL is particularly valuable for tasks with verifiable outcomes where reward signals are clear.
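The core of GRPO is a group-relative baseline: several answers are sampled per prompt and each answer's advantage is its reward standardised within that group. A minimal sketch of just that step (the full objective adds a clipped policy ratio and a KL penalty, omitted here):

```python
import torch

# Group-relative advantages as used in GRPO (sketch only: the full objective also
# includes a clipped importance ratio and a KL penalty against a reference model).
def group_relative_advantages(rewards):
    # rewards: (G,) scalar rewards for G sampled answers to the same prompt,
    # e.g. 1.0 for a verified-correct math solution and 0.0 otherwise.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

adv = group_relative_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0]))
print(adv)  # correct answers get positive advantage, incorrect ones negative
```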
Serving: For deployment, V2 parameters are converted to FP8 precision and KV cache elements are further quantised to 6 bits on average — additional compression on top of MLA's latent caching that makes production serving economics favourable.
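A toy symmetric quantiser illustrates the flavour of this low-bit compression (not DeepSeek's serving kernels; the per-row scale and int8 container are simplifications):

```python
import torch

# Toy symmetric quantiser to illustrate low-bit KV-cache compression (values are
# packed into an int8 container here for simplicity even though only 6 bits are used).
def quantize(x, n_bits=6):
    qmax = 2 ** (n_bits - 1) - 1                         # 31 for 6 bits
    scale = x.abs().amax(dim=-1, keepdim=True) / qmax    # per-row scale
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

def dequantize(q, scale):
    return q.float() * scale

latents = torch.randn(4, 512)            # e.g. cached MLA latent vectors
q, s = quantize(latents)
error = (dequantize(q, s) - latents).abs().max()
```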
DeepSeek-V2 didn't just improve benchmark numbers. It changed what open-source AI could be: a viable alternative to GPT-4 at dramatically lower cost, with architectural innovations the entire industry subsequently adopted.
New Architecture: Multi-head Latent Attention — a new attention mechanism that beats MHA on quality while compressing KV cache 93.3%. First described in the V2 paper; now used in V3, R1, V3.1, V3.2, V4, and being retrofitted into other architectures.
New Architecture: DeepSeekMoE, fine-grained expert segmentation with 160 routed + 2 shared experts. Higher specialisation than GShard/Switch at the same FLOP cost. The MoE architecture that powers every subsequent DeepSeek model through V4.
Long Context: First DeepSeek flagship with 128K tokens — made practical by MLA's 93.3% KV cache reduction. Enables entire codebases, long legal documents, and book-level analyses in a single context.
Performance: 5.76× the throughput of DeepSeek 67B on identical 8× H800 hardware. The higher throughput comes directly from MLA enabling larger batch sizes: batches that were previously capped by KV-cache memory can now grow much larger.
Economics: Sparse MoE means only 21B parameters are activated and updated per forward pass while knowledge scales to 236B. Economic training that enables frontier intelligence at sustainable cost.
Scale: 236B total parameters, 21B activated per token. This ratio — roughly an 11× expansion factor — became the template for V3 (671B/37B) and V4-Pro (1.6T/49B). The specific balance of capacity vs activation cost was validated here.
Milestone: DeepSeek-Coder-V2 was the first open-source model to surpass 10% on SWE-Bench Verified — the real-world software engineering benchmark. Previously only closed-source models had crossed this threshold.
Chinese: V2 outperforms LLaMA-3-70B by over 20 percentage points on CMMLU and C-Eval. The 100K BPE vocabulary (inherited from V1) and the bilingual training corpus make V2 the dominant open-source model for Chinese-language tasks.
Inference: During serving, W_UK is absorbed into W_Q and W_UV into W_O — eliminating the MLA decompression step entirely. This "free" speedup means the theoretical memory savings translate directly into practical inference speedups with no quality loss.
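The absorption trick is plain matrix associativity, which a quick numerical check makes visible (toy sizes, single head, RoPE and the value path ignored):

```python
import torch

# Numerical check of the MLA weight-absorption identity for one head, toy sizes
# (hypothetical dimensions, not V2's):
#   (x @ W_Q) @ (c @ W_UK).T  ==  x @ (W_Q @ W_UK.T) @ c.T
d_model, d_head, d_c = 64, 16, 8
W_Q, W_UK = torch.randn(d_model, d_head), torch.randn(d_c, d_head)

x = torch.randn(1, d_model)        # hidden state of the current query token
c = torch.randn(5, d_c)            # cached latent vectors for 5 earlier tokens

scores_naive    = (x @ W_Q) @ (c @ W_UK).T     # reconstructs K explicitly
scores_absorbed = x @ (W_Q @ W_UK.T) @ c.T     # K is never materialised
assert torch.allclose(scores_naive, scores_absorbed, atol=1e-5)
```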
Accessible: The 16B / 2.4B-active V2-Lite model was released in response to the surge of community interest in MLA research. Deployable on a single 40G GPU — making MLA accessible for researchers without large-scale infrastructure.
Open: arXiv:2405.04434 published all architecture details, training hyperparameters, alignment pipeline, serving optimisations, and ablation studies. Every innovation — MLA, DeepSeekMoE, expert load balancing — fully documented.
Legacy: V3 explicitly states: "MLA and DeepSeekMoE architectures, thoroughly validated in DeepSeek-V2." V3 inherited both innovations, extended them, and added FP8 training and MTP. V4 further extends this foundation. V2 is not a stepping stone — it's the bedrock.
V2 model weights remain on Hugging Face. The V2 API endpoint is historical (now routes to V3.2/V4), but the weights are permanently available for local inference, fine-tuning, and research.
The V2 generation powered the hosted DeepSeek API from May to December 2024. During this period, the endpoint went through three major model updates before transitioning to the V3 generation.
- DeepSeek-V2.5-1210: deepseek-chat upgraded to DeepSeek-V2.5-1210. Improved math (MATH-500: 74.8% → 82.8%), coding (LiveCodeBench: 29.2% → 34.38%), writing, and reasoning. Better file upload and webpage summarisation. This was the last deepseek-chat alias on the V2 architecture — the next update in December 2024 moved to V3.
- DeepSeek-V2.5: merged general chat and coding into one model, served via both the deepseek-chat and deepseek-coder endpoints. First all-in-one model for general use and programming.
- DeepSeek-V2-0628: deepseek-chat alias moved to this checkpoint.
- DeepSeek-V2-0517: deepseek-chat alias moved to DeepSeek-V2-0517 from the original V2 weights. Enhanced capability for structured data generation tasks.
- DeepSeek-V2 launch: deepseek-chat moved from the V1-era model to V2. DeepSeek-V2-Lite (16B/2.4B) released days later following community demand for a smaller MLA research model.

The V2 paper was published in May 2024. By December 2024, V3 explicitly credited V2 as the architectural foundation. By April 2026, V4-Pro carried 1.6T parameters on the same two innovations. No architecture since has needed to replace MLA or DeepSeekMoE.
DeepSeek-V2: 236B/21B MoE, 128K context, 93.3% KV cache reduction, 5.76× throughput. Two architectural innovations published openly. API price: $0.14/1M input — a fraction of GPT-4's $30/1M at the time. Community MLA interest triggers release of V2-Lite within days.
MLA invented · DeepSeekMoE established · 128K context · arXiv:2405.04434
DeepSeek-Coder-V2: Code-specialised V2 variant. HumanEval 90.2%, MBPP 76.2%, MATH 75.7%. First open-source model to cross the 10% SWE-Bench threshold. Rivals GPT-4-Turbo, Claude 3 Opus, Gemini 1.5 Pro on coding and math. 338 programming languages. 128K context window.
First open-source SWE-Bench >10% · Code + math parity with GPT-4o
DeepSeek-V2.5: Merged V2-0628 general chat and Coder-V2-0724 into one model — the first DeepSeek all-in-one model for both general tasks and code. Backward compatible via both deepseek-chat and deepseek-coder API aliases. Sets the template for V3's unified approach.
DeepSeek-V3: The V3 paper explicitly states: "MLA and DeepSeekMoE architectures, thoroughly validated in DeepSeek-V2." V3 scales to 671B/37B, adds FP8 mixed-precision training and a Multi-Token Prediction (MTP) objective. 2.788M H800 GPU hours. API: $0.27/1M — still far below GPT-4o's cost. Cost: ~$5.5M to train.
V2 architecture validated and scaled · FP8 training added · 671B/37B
DeepSeek-V4: V4-Pro (1.6T/49B) and V4-Flash (284B/13B). Both use MLA and DeepSeekMoE — exactly as introduced in V2 two years earlier. Codeforces #1 (3206 Elo), 80.6% SWE-bench Verified, 1M token context. The architecture DeepSeek invented for 236B now scales to 1.6 trillion parameters.
MLA + MoE at 1.6T · Codeforces #1 · 1M context · 2 years from V2
Multi-head Latent Attention (MLA) and Grouped-Query Attention (GQA) both reduce KV cache memory — but through fundamentally different approaches. GQA reduces cache size by sharing Key/Value heads across groups of Query heads: instead of n_h KV heads, you have n_h/g KV heads. This reduces memory at the cost of performance — GQA and MQA trade quality for efficiency. MLA takes a different approach: it compresses all K and V information into a single low-rank latent vector per token using learned down-projection and up-projection matrices. Only this tiny vector is cached. At inference time, K and V are reconstructed on demand — and the up-projection matrices can even be absorbed into W_Q and W_O, eliminating the decompression step entirely. The result: MLA achieves better performance than MHA (not worse) while cutting KV cache by 93.3%. GQA was a tradeoff; MLA is a genuine improvement. This is why every DeepSeek model since V2 uses MLA.
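Back-of-envelope cache sizes per token per layer make the contrast concrete. The figures below use V2's 128 heads of dimension 128 and d_c = 512; the 8 KV groups for GQA and the 64-element decoupled RoPE key are illustrative assumptions rather than specs stated on this page:

```python
# Per-token, per-layer KV-cache elements under each scheme (illustrative figures).
n_heads, d_head, d_c, d_rope, gqa_groups = 128, 128, 512, 64, 8

mha = 2 * n_heads * d_head      # full K and V for every head       -> 32,768
gqa = 2 * gqa_groups * d_head   # K and V shared within each group  ->  2,048
mla = d_c + d_rope              # one latent + small RoPE key       ->    576

print(f"MHA {mha}, GQA {gqa}, MLA {mla}")
```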
The V2 API era ended in December 2024. The deepseek-chat and deepseek-coder endpoints that pointed to V2-era models have since been upgraded — first to V3 in December 2024, and as of April 2026 to V4 models. There is no way to call the V2 model via the hosted API — it is a historical endpoint. For V2, you have two options: (1) Download the model weights from Hugging Face and run locally. (2) Use a community-hosted endpoint. For production use, use platform.deepseek.com with deepseek-chat (now V4-Flash) or deepseek-v4-pro.
Standard MoE (like Switch Transformer or GShard) uses coarse expert segmentation: typically 8–16 large experts with top-2 routing. This means each token activates 2 of 8 experts — a 25% activation ratio but with low specialisation, since each expert must handle a broad range of token types. DeepSeekMoE uses fine-grained expert segmentation: 160 smaller routed experts with top-6 routing, plus 2 shared experts that always fire. This means each token activates 8 of 162 experts — a ~5% activation ratio. The critical difference: with more but smaller experts, each expert can specialise on a much narrower subspace of token types (e.g., Python syntax, Chinese grammar, mathematical notation), while the shared experts handle universal capabilities that should fire for all tokens. The result is higher effective specialisation with the same total parameter activation budget.
The full 236B V2 model requires roughly 470 GB of GPU memory in BF16 (236B parameters × 2 bytes) — in practice a full 8× 80 GB GPU node. For most researchers, V2-Lite is the practical choice: the 16B / 2.4B active model deploys on a single 40G GPU (e.g., A100 40G; smaller consumer cards need quantisation) and is fine-tunable on 8× 80G GPUs. For quantised variants: the 236B model in Q4_K_M quantisation (~4.8 bits/weight) reduces to roughly 140 GB, fitting 2× A100 80GB with little headroom for KV cache. The community has published various GGUF quantisations via llama.cpp. For production inference without local hardware, the current V4 API is the recommended path.
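For the local route, a minimal inference sketch with the V2-Lite chat weights (assuming the published Hugging Face repo id deepseek-ai/DeepSeek-V2-Lite-Chat, a recent transformers release with trust_remote_code enabled, and accelerate installed for device_map):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal local-inference sketch for DeepSeek-V2-Lite-Chat (assumes the public
# Hugging Face weights and enough GPU memory for the 16B model in BF16).
model_id = "deepseek-ai/DeepSeek-V2-Lite-Chat"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Explain Multi-head Latent Attention in one paragraph."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tok.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```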
V3 (December 2024) kept MLA and DeepSeekMoE verbatim from V2 and added four main advances: (1) Scale: 671B/37B vs 236B/21B — roughly 2.9× more total parameters, 1.76× more active. (2) FP8 mixed-precision training: V3 pioneered FP8 training, dramatically reducing memory and compute cost during pre-training. (3) Multi-Token Prediction (MTP): V3 trains to predict multiple future tokens simultaneously, improving sample efficiency and enabling speculative decoding. (4) Auxiliary-loss-free load balancing: V3 replaces V2's auxiliary load-balancing loss with a bias-based mechanism that achieves better balance with less quality degradation. Pre-training corpus: 14.8T tokens vs V2's 8.1T. Both were strong at release; V3 is significantly more capable.
On most coding benchmarks in 2026, Coder-V2 has been surpassed by V4-Pro (80.6% SWE-bench vs Coder-V2's ~12%), Claude 3.7, and GPT-4o. However, Coder-V2 retains value in specific contexts: (1) Local inference: If you need a strong code model you can run locally at 236B scale without needing the latest closed-source model, Coder-V2 still delivers competitive performance. (2) 128K context for code: Useful for repository-level analysis. (3) Historical research: As the first open-source model to pass 10% SWE-Bench, it's an important reference point. For production code generation in 2026, V4-Flash ($0.14/1M) offers significantly better performance at lower cost than hosting Coder-V2 locally.
May 2024. Multi-head Latent Attention. DeepSeekMoE. 93.3% less KV cache. 5.76× throughput. 128K context. 42.5% cheaper training. Every DeepSeek model since has been built on these two ideas — unmodified since V2.