[Reland] DeepSeek-R1-0528-w4a8: DeepEP Low Latency Dispatch Adopts FP8 Communication#22316
BBuf merged 9 commits into sgl-project:main
Conversation
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

/tag-and-rerun-ci
…8 Communication (#22316) Co-authored-by: undefined <zhouchen.arrebol@jd.com> Co-authored-by: xq25478 <xq25478@qq.com>
```diff
 if input_global_scale is not None:
     use_nvfp4 = True
-elif not envs.SGLANG_DEEPEP_BF16_DISPATCH.get():
+else:
```
The environment variable SGLANG_DEEPEP_BF16_DISPATCH is unrelated to W4AFP8 weights; it is set for non-quantized weights. Deleting this branch will prevent non-quantized weights from correctly using DeepEP. Could you simply avoid setting export SGLANG_DEEPEP_BF16_DISPATCH=1 when starting the service for W4AFP8 weights, instead of modifying the code?
I confirm that this change breaks models with the wna16 quantization scheme, like Kimi-K2.5
Here is my quickfix: #22822, please review it. I think we can add a variable to pass quant_type into this function; I'll think about how to do it better.
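One possible shape for such a fix is sketched below: thread the model's quantization type into the dispatch path and pick the wire dtype there, so W4AFP8 always gets FP8 dispatch while non-quantized weights keep honoring SGLANG_DEEPEP_BF16_DISPATCH. This is a hypothetical illustration; `select_dispatch_dtype` and its parameter names are not actual sglang APIs.

```python
# Hypothetical sketch: choose the DeepEP low-latency dispatch wire dtype
# from the model's quantization scheme instead of only from an env var.
# The function and argument names are illustrative, not sglang's real code.

def select_dispatch_dtype(quant_type, has_input_global_scale, bf16_dispatch_env):
    """Return the communication dtype for DeepEP low-latency dispatch."""
    if has_input_global_scale:
        return "nvfp4"   # NVFP4 path when an input global scale is configured
    if quant_type == "w4afp8":
        return "fp8"     # W4AFP8 weights: FP8 on the wire, env var ignored
    if bf16_dispatch_env:
        return "bf16"    # non-quantized weights keep the env-var override
    return "fp8"

# Intended behavior:
print(select_dispatch_dtype("w4afp8", False, True))  # fp8 (env ignored)
print(select_dispatch_dtype(None, False, True))      # bf16 (env honored)
print(select_dispatch_dtype(None, True, False))      # nvfp4
```

With this shape, exporting SGLANG_DEEPEP_BF16_DISPATCH=1 for a non-quantized model still works, and W4AFP8 no longer depends on the operator remembering to unset it.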
Ok, I will review it.
Motivation
profiling (screenshots omitted): deepseek-R1-0528 vs. deepseek-R1-0528-w4afp8
1. When DeepEP is enabled, the communication latency of the DeepSeek-R1-0528-W4AFP8 model is twice that of the DeepSeek-R1-0528 model.
2. The root cause is that DeepEP dispatch in the DeepSeek-R1-0528-W4AFP8 model uses BF16 for communication, which increases bandwidth consumption and hurts inference performance.
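A back-of-the-envelope calculation shows why BF16 dispatch roughly doubles the communication volume relative to FP8. The hidden size of 7168 and one fp32 scale per 128-element block are assumptions about the DeepSeek-R1 configuration, and real DeepEP packets also carry routing metadata not counted here.

```python
# Rough per-token dispatch payload, per expert copy.
# Assumptions: hidden size 7168 (DeepSeek-R1) and one fp32 scale per
# 128-element block for FP8 block quantization; routing metadata ignored.
HIDDEN = 7168

bf16_bytes = HIDDEN * 2                        # 2 bytes per bf16 element
fp8_bytes = HIDDEN * 1 + (HIDDEN // 128) * 4   # 1 byte/elem + fp32 block scales

print(bf16_bytes)                        # 14336
print(fp8_bytes)                         # 7392
print(round(bf16_bytes / fp8_bytes, 2))  # 1.94
```

Under these assumptions the BF16 payload is close to 2x the FP8 payload, which matches the roughly doubled communication latency observed in profiling.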
Modifications
Switch DeepEP low-latency dispatch to FP8 communication for the DeepSeek-R1-0528-W4AFP8 model instead of BF16.
Accuracy Tests
dataset: gsm8k (main vs. this pr, result screenshots omitted)
dataset: ceval (main vs. this pr, result screenshots omitted)
compare: (screenshot omitted)
Benchmarking and Profiling
GPU: H20

```bash
export NVIDIA_SPARSE_ENABLE=1
export NCCL_SOCKET_IFNAME=bond0
export GLOO_SOCKET_IFNAME=bond0
export NCCL_IB_GID_INDEX=3
export NCCL_NET_GDR_LEVEL=2
export NCCL_IB_TC=160
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export NCCL_DEBUG_FILE=nccl_debug.log
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_NTHREADS=4        # added
export NCCL_NSOCKS_PERTHREAD=8       # added
export NCCL_IB_QPS_PER_CONNECTION=8  # added
export NCCL_IB_SPLIT_DATA_ON_QPS=1   # added
export OMP_NUM_THREADS=1             # TODO
export NCCL_IB_HCA=mlx5_gdr_0:1,mlx5_gdr_1:1,mlx5_gdr_2:1,mlx5_gdr_3:1,mlx5_gdr_4:1,mlx5_gdr_5:1,mlx5_gdr_6:1,mlx5_gdr_7:1
export MC_TE_METRIC=true
export MC_LOG_LEVEL=TRACE
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
export SGLANG_MOE_PADDING=1
export SGLANG_ENABLE_JIT_DEEPGEMM=1
export SGLANG_JIT_DEEPGEMM_COMPILE_WORKERS=32
export SGLANG_CHUNKED_PREFIX_CACHE_THRESHOLD=16384
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64
export PYTHONUNBUFFERED=1
export UCX_TLS=tcp,sm
export UCX_NET_DEVICES=bond0
export METRICS_PREFIX=prefill
export SGLANG_HEALTH_CHECK_MEM_FREE=50
export ENABLE_PD_GUARDIAN=0
export SGLANG_DEEPEP_NORMAL_BF16_DISPATCH=1
```
```bash
python3 -m sglang.launch_server \
  --model-path /export/models/DeepSeek-R1-0528-W4AFP8 \
  --host 0.0.0.0 \
  --port 15000 \
  --trust-remote-code \
  --tp-size 8 \
  --dp-size 8 \
  --enable-dp-attention \
  --moe-dense-tp-size 1 \
  --enable-dp-lm-head \
  --ep-dispatch-algorithm dynamic \
  --eplb-algorithm deepseek \
  --disable-shared-experts-fusion \
  --moe-a2a-backend deepep \
  --moe-runner-backend cutlass \
  --deepep-mode low_latency \
  --chunked-prefill-size 16384 \
  --mem-fraction-static 0.81 \
  --cuda-graph-max-bs 32 \
  --max-running-requests 256 \
  --context-length 16384 \
  --tokenizer-worker-num 8 \
  --enable-dynamic-batch-tokenizer \
  --dynamic-batch-tokenizer-batch-size 8 \
  --attention-backend flashinfer \
  --tool-call-parser deepseekv3 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 1 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 2 \
  --disable-radix-cache \
  --page-size 128
```
```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --host localhost \
  --port 15000 \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --dataset-name sharegpt \
  --num-prompts 100 \
  --sharegpt-output-len 256 \
  --warmup-requests 10 \
  --sharegpt-context-len 4096 \
  --max-concurrency 64 \
  --tokenizer /export/models/DeepSeek-R1-0528-W4AFP8 \
  --request-rate 10 \
  --model /export/models/DeepSeek-R1-0528-W4AFP8
```
main vs. this pr (benchmark result screenshots omitted)
Throughput improved by approximately 10% with this PR.
Checklist