[Feat] DeepSeek V4 Rebased by ivanium · Pull Request #40860 · vllm-project/vllm

ivanium · 2026-04-25T04:40:59Z

Purpose

Rebased version of #40760

Roadmap: #40902

Co-authored by: Bugen Zhao, Giancarlo Delfin, Jie Li, Kaichao You, Roy Wang, Woosuk Kwon, Yifan Qiao, Yongye Zhu, Zhewen Li, Zijing Liu, Zixi Qi

Test Plan

Test Result

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

`w8a8_triton_block_scaled_mm` falls back to a hardcoded default config when no pre-tuned `configs/N=*,K=*,device_name=*.json` file matches the GPU. The default uses `BLOCK_SIZE_M=64`, which wastes 98% of the M dimension in single-request decode (M=1). GPUs without a pre-tuned JSON file for their (N, K, device) tuple pay this cost.

Narrow the change: only specialize the M<=8 case (single-request decode and short MTP-style draft batches). Larger M keeps the previous default unchanged so non-decode paths and tuned configs are not perturbed.

    M <= 8 (CUDA) -> BLOCK_SIZE_M=16, num_stages=3 (new)
    M <= 8 (ROCm) -> BLOCK_SIZE_M=16, num_stages=2 (new)
    else          -> BLOCK_SIZE_M=64, num_stages=2 (previous default)

num_stages=3 is gated to non-ROCm because MI300/MI250X LDS (64 KB) is borderline for 3-stage Triton pipelining at typical [128, 128] block sizes; on ROCm we keep num_stages=2 so the M<=8 branch still gets the BLOCK_SIZE_M=16 wave-quantisation win without LDS pressure.

Pre-tuned JSON configs are unaffected (they short-circuit before this branch). Workloads that already have a JSON for their (N, K, device) get the same kernel as before.

Verified on dual DGX Spark (GB10, sm_121, TP=2) running V4-Flash: median single-request decode goes from 5.45 t/s to 6.73 t/s (+23%) with no other changes. Output remains coherent.

The win is expected to generalize to other architectures lacking a pre-tuned JSON for the target (N, K) pair, but only the GB10 case is verified here; reviewers on Hopper/Ampere are welcome to confirm or push back.

Refs vllm-project#40860 (V4 rebase), vllm-project#40899 (jasl SM12x scope is orthogonal)

Signed-off-by: Tony Liu <tonyliu0512@gmail.com>
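The fallback rule in the commit message above can be sketched roughly as follows. This is an illustration, not the code from the PR: the function names `default_block_scaled_mm_config` and `padded_fraction`, the `is_rocm` flag, and the choice to take `BLOCK_SIZE_N` / `BLOCK_SIZE_K` from a [128, 128] block shape are all assumptions made for the example. Only the `BLOCK_SIZE_M` and `num_stages` values come from the text.

```python
def default_block_scaled_mm_config(
    m: int,
    block_n: int = 128,   # assumed: taken from the quantization block shape
    block_k: int = 128,   # assumed: taken from the quantization block shape
    is_rocm: bool = False,
) -> dict[str, int]:
    """Hypothetical fallback, used only when no tuned JSON config matched."""
    if m <= 8:
        # Small-M decode: shrink the M tile. Three pipeline stages on CUDA,
        # two on ROCm to stay clear of the LDS limit described above.
        return {
            "BLOCK_SIZE_M": 16,
            "BLOCK_SIZE_N": block_n,
            "BLOCK_SIZE_K": block_k,
            "num_stages": 2 if is_rocm else 3,
        }
    # Larger M keeps the previous default untouched.
    return {
        "BLOCK_SIZE_M": 64,
        "BLOCK_SIZE_N": block_n,
        "BLOCK_SIZE_K": block_k,
        "num_stages": 2,
    }


def padded_fraction(m: int, block_size_m: int) -> float:
    """Fraction of rows in the launched M tiles that are padding."""
    tiles = -(-m // block_size_m)  # ceiling division
    return 1.0 - m / (tiles * block_size_m)


if __name__ == "__main__":
    for m in (1, 8, 9):
        print(m, default_block_scaled_mm_config(m))
    print(f"M=1, BLOCK_SIZE_M=64: {padded_fraction(1, 64):.1%} padding")
    print(f"M=1, BLOCK_SIZE_M=16: {padded_fraction(1, 16):.1%} padding")
```

Under these assumptions, `padded_fraction(1, 64)` is 63/64, which is the roughly 98% waste quoted in the commit message; with `BLOCK_SIZE_M=16` the padding for M=1 drops to 15/16.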

tonyliu312 · 2026-04-27T01:36:27Z

Congrats @ivanium @zyongye @ywang96 — landing V4 in main is a big step. Quick heads-up for the sm_120 / sm_121 crowd (DGX Spark / GB10 / RTX 50-series users) who will pull main and try to deploy:

To get a working V4 / V4-Flash / V4-Pro on sm_12x out of post-#40860 main, two small follow-ups are still needed (both rebased clean on top of this merge):

#40923 [Kernel] Marlin MoE: include SM 12.x in default arch list. Adds 12.0;12.1 to MARLIN_ARCHS / MARLIN_BF16_ARCHS / MARLIN_MOE_ARCHS, mirroring the existing MARLIN_FP8_ARCHS = "8.9;12.0;12.1" precedent. Without it, the 8.0+PTX JIT fallback produces no native sm_12x cubin and gives silently wrong outputs on Marlin-MoE (verified end-to-end: gibberish → coherent on dual DGX Spark TP=2; independently re-verified by @idonati on 8× DGX Spark TP=8 running V4-Pro, as reported in #40899, "DeepSeek V4 support on SM12x with Triton sparse MLA fallback").
#40925 [Kernel] Tune default fp8 block-scaled Triton config for M<=8 decode. Narrow specialisation of the w8a8_triton_block_scaled_mm fallback default (applies only when no tuned JSON matches): BLOCK_SIZE_M=16, num_stages=3 for M <= 8, larger M unchanged. ROCm is gated to num_stages=2 per the gemini-code-assist review. +23% on GB10 V4-Flash single-request decode; the M > 8 path and hosts with a tuned JSON get the same config as before, so they cannot regress.

Both are review-clean since 04-26 16:16 UTC (gemini-code-assist closed all concerns), CI gated only on first-time-contributor ready label (cc CODEOWNERs: @LucasWilkinson @tlrmchlsmth for the CMakeLists change, @mgoin @tlrmchlsmth for the fp8_utils tune). Posting here primarily as a heads-up for sm_12x users grabbing main today, not as a label nudge.

Thanks again for the V4 work.

BowenBao · 2026-04-27T16:27:38Z

    return Mxfp4MoeBackend.NONE, None


+def select_mxfp4_moe_backend(


imo we shouldn't create separate select_gpt_oss_mxfp4_moe_backend and select_mxfp4_moe_backend

what's the reason that these two can't be merged?

cc @mgoin , @robertgshaw2-redhat

claude Bot reviewed Apr 25, 2026


zyongye mentioned this pull request Apr 27, 2026

[New Model] Support DeepseekV4 #40760

Closed

MengqingCao mentioned this pull request Apr 27, 2026

[Usage]: Has function calling been adapted for deepseek-v4-flash yet? Using the v3 function-call parameters currently does not work as well vllm-project/vllm-ascend#8713

Open

tonyliu312 mentioned this pull request Apr 27, 2026

Integrate flashinfer b12x MoE and FP4 GEMM kernels for SM120/121 #40082

Merged

noooop mentioned this pull request Apr 27, 2026

[Feature]: deepseek v4 support #40778

Closed

1 task

AlpinDale mentioned this pull request Apr 27, 2026

feat: implement DeepSeek-V4 model dphnAI/aphrodite-engine#1651

Merged

BowenBao reviewed Apr 27, 2026


This was referenced Apr 28, 2026

DeepSeek V4 + MegaMoE #40868

Closed

Integrate DeepGeMM MegaMoE #40843

Closed

Defilan mentioned this pull request Apr 28, 2026

test + integrate vLLM v0.20.0 (TurboQuant 2-bit KV, DeepSeek V4, FA4 default) defilantech/LLMKube#354

Closed

5 tasks

ProExpertProg mentioned this pull request Apr 28, 2026

[CI Failure]: Fusion E2E TP2 Quick (H100) #41156

Closed

3 tasks

Rohan138 mentioned this pull request Apr 28, 2026

[ROCm][Bugfix][GPTOSS]: fix input_ids and expert_map args for quark w4a8 gptoss #41165

Merged

4 tasks

pawel-olejniczak mentioned this pull request Apr 29, 2026

[FIX_FOR_VLLM_CUSTOM=5b39b268f506150dbab38f6f6c04b7c843e37c07] Fix upstream regressions: MoE refactor, DeepSeek V4 router, KV offload HMA vllm-project/vllm-gaudi#1403

Merged

This was referenced Apr 29, 2026

[Refactor] Merge select_gpt_oss_mxfp4_moe_backend and select_mxfp4_moe_backend #41291

Open

[ROCm][Quantization][2/N] Refactor quark_moe w4a8 w/ oracle #39136

Merged

demian-overflow mentioned this pull request Apr 30, 2026

[Refactor] Extract shared helpers from MXFP4 MoE backend selectors #41317

Closed

gcanlin mentioned this pull request May 2, 2026

[Misc][Main2Main] Upgrade vLLM to 0429(DSV4/v0.20.0) vllm-project/vllm-ascend#8856

Closed

shen-shanshan mentioned this pull request May 6, 2026

[Misc][Main2Main] Upgrade vLLM to 0427 vllm-project/vllm-ascend#8899

Merged

Ph0rk0z mentioned this pull request May 7, 2026

Feature Request: Deepseek V4-Flash? Qwen sized deepseek... ikawrakow/ik_llama.cpp#1752

Open

4 tasks

AlonKellner-RedHat mentioned this pull request May 11, 2026

[Bug]: RMSNorm kernel ignores weight dtype, always uses FP32 (regression in v0.20.0) #42325

Closed

3 tasks

liulanze mentioned this pull request May 12, 2026

[Bugfix] Fix RMSNorm kernels to multiply in weight's native dtype #42379

Merged

4 tasks

varjoranta mentioned this pull request May 15, 2026

[Bug]: DeepSeek V4 model fails to load with transformers ≥ 4.57 — compress_ratios attribute removed #42741

Open

This was referenced May 20, 2026

Fix int32 overflow in csrc/layernorm_kernels.cu indexing #43156

Closed

Fix int32 overflow in csrc/activation_kernels.cu indexing #43157

Closed

Mark chat_template_kwargs as not-yet-wired for the Responses API in docs #43158

Closed

pasta-paul mentioned this pull request May 23, 2026

DSV4-Pro MTP draft: stacked attn FP8 scale loader gap + MTP forward-path mainline-vs-fork divergence #43472

Closed

ywang96 merged 25 commits into vllm-project:main from ivanium:feat/dsv4-support

12 participants
