[Feat] DeepSeek V4 Rebased #40860
`w8a8_triton_block_scaled_mm` falls back to a hardcoded default config when no pre-tuned `configs/N=*,K=*,device_name=*.json` file matches the GPU. The default uses `BLOCK_SIZE_M=64`, which wastes 98% of the M dimension in single-request decode (M=1). GPUs without a pre-tuned JSON file for their (N, K, device) tuple pay this cost.

Narrow the change: only specialize the M<=8 case (single-request decode and short MTP-style draft batches). Larger M keeps the previous default unchanged so non-decode paths and tuned configs are not perturbed.

    M <= 8 (CUDA) -> BLOCK_SIZE_M=16, num_stages=3 (new)
    M <= 8 (ROCm) -> BLOCK_SIZE_M=16, num_stages=2 (new)
    else          -> BLOCK_SIZE_M=64, num_stages=2 (previous default)

num_stages=3 is gated to non-ROCm because MI300/MI250X LDS (64 KB) is borderline for 3-stage Triton pipelining at typical [128, 128] block sizes; on ROCm we keep num_stages=2 so the M<=8 branch still gets the BLOCK_SIZE_M=16 wave-quantisation win without LDS pressure.

Pre-tuned JSON configs are unaffected (they short-circuit before this branch). Workloads that already have a JSON for their (N, K, device) get the same kernel as before.

Verified on dual DGX Spark (GB10, sm_121, TP=2) running V4-Flash: median single-request decode goes from 5.45 t/s to 6.73 t/s (+23%) with no other changes. Output remains coherent.

The win is expected to generalize to other architectures lacking a pre-tuned JSON for the target (N, K) pair, but only the GB10 case is verified here; reviewers on Hopper/Ampere are welcome to confirm or push back.

Refs vllm-project#40860 (V4 rebase), vllm-project#40899 (jasl SM12x scope is orthogonal)

Signed-off-by: Tony Liu <tonyliu0512@gmail.com>
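The fallback selection described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the actual vLLM code: the helper names `fallback_config` and `padded_m_waste` and the `is_rocm` flag are hypothetical, and only the two fields named in the commit message (`BLOCK_SIZE_M`, `num_stages`) are modeled. The second helper reproduces the "98% wasted" figure for M=1 with a 64-row tile.

```python
import math


def fallback_config(M: int, is_rocm: bool) -> dict[str, int]:
    """Hypothetical sketch of the default config used when no tuned JSON matches.

    Only BLOCK_SIZE_M and num_stages are modeled; the real default config
    carries more fields that the commit message does not describe.
    """
    if M <= 8:
        # Small-M decode path: smaller M tile; 3 stages only off ROCm.
        return {"BLOCK_SIZE_M": 16, "num_stages": 2 if is_rocm else 3}
    # Larger M keeps the previous default unchanged.
    return {"BLOCK_SIZE_M": 64, "num_stages": 2}


def padded_m_waste(M: int, block_size_m: int) -> float:
    """Fraction of tile rows along M that hold padding rather than real rows."""
    padded_rows = math.ceil(M / block_size_m) * block_size_m
    return 1.0 - M / padded_rows


if __name__ == "__main__":
    print(fallback_config(M=1, is_rocm=False))
    print(fallback_config(M=1, is_rocm=True))
    print(fallback_config(M=128, is_rocm=False))
    print(f"waste at M=1, BLOCK_SIZE_M=64: {padded_m_waste(1, 64):.1%}")
    print(f"waste at M=1, BLOCK_SIZE_M=16: {padded_m_waste(1, 16):.1%}")
```

With a 16-row tile a single decode row still leaves most of the tile idle, but the padded work per tile drops by 4x relative to the 64-row default, which is consistent with the direction of the reported decode speedup.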
Congrats @ivanium @zyongye @ywang96 — landing V4 in main is a big step. Quick heads-up for the sm_120 / sm_121 crowd (DGX Spark / GB10 / RTX 50-series users) who will pull main and try to deploy: To get a working V4 / V4-Flash / V4-Pro on sm_12x out of post-#40860 main, two small follow-ups are still needed (both rebased clean on top of this merge):
Both have been review-clean since 04-26 16:16 UTC (gemini-code-assist closed all concerns), with CI gated only on first-time-contributor. Thanks again for the V4 work.
    return Mxfp4MoeBackend.NONE, None


def select_mxfp4_moe_backend(
IMO we shouldn't create separate `select_gpt_oss_mxfp4_moe_backend` and `select_mxfp4_moe_backend` functions.
What's the reason these two can't be merged?
Purpose
Rebased version of #40760
Roadmap: #40902
Co-authored by: Bugen Zhao, Giancarlo Delfin, Jie Li, Kaichao You, Roy Wang, Woosuk Kwon, Yifan Qiao, Yongye Zhu, Zhewen Li, Zijing Liu, Zixi Qi
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.