[Reland] DeepSeek-R1-0528-w4a8: DeepEP Low Latency Dispatch Adopts FP8 Communication#22316
BBuf merged 9 commits into sgl-project:main
Conversation
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

/tag-and-rerun-ci
…8 Communication (#22316) Co-authored-by: undefined <zhouchen.arrebol@jd.com> Co-authored-by: xq25478 <xq25478@qq.com>
```diff
 if input_global_scale is not None:
     use_nvfp4 = True
-elif not envs.SGLANG_DEEPEP_BF16_DISPATCH.get():
+else:
```
The environment variable SGLANG_DEEPEP_BF16_DISPATCH is unrelated to W4AFP8 weights; it is set for non-quantized weights. Deleting this branch will prevent non-quantized weights from correctly using DeepEP. Could you simply avoid setting export SGLANG_DEEPEP_BF16_DISPATCH=1 when starting the service for W4AFP8 weights, instead of modifying the code?
I confirm that this change breaks models with the wna16 quantization scheme, like Kimi-K2.5
Here is my quickfix: #22822, please review it. I think we can add a variable to pass quant_type into this function; I'll think about how to do it better.
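One possible shape for such a fix is sketched below: thread the model's quantization type into the dispatch path and pick the wire dtype there, so W4AFP8 always gets FP8 dispatch while non-quantized weights keep honoring SGLANG_DEEPEP_BF16_DISPATCH. This is a hypothetical illustration; `select_dispatch_dtype` and its parameter names are not actual sglang APIs.

```python
# Hypothetical sketch: choose the DeepEP low-latency dispatch wire dtype
# from the model's quantization scheme instead of only from an env var.
# The function and argument names are illustrative, not sglang's real code.

def select_dispatch_dtype(quant_type, has_input_global_scale, bf16_dispatch_env):
    """Return the communication dtype for DeepEP low-latency dispatch."""
    if has_input_global_scale:
        return "nvfp4"   # NVFP4 path when an input global scale is configured
    if quant_type == "w4afp8":
        return "fp8"     # W4AFP8 weights: FP8 on the wire, env var ignored
    if bf16_dispatch_env:
        return "bf16"    # non-quantized weights keep the env-var override
    return "fp8"

# Intended behavior:
print(select_dispatch_dtype("w4afp8", False, True))  # fp8 (env ignored)
print(select_dispatch_dtype(None, False, True))      # bf16 (env honored)
print(select_dispatch_dtype(None, True, False))      # nvfp4
```

With this shape, exporting SGLANG_DEEPEP_BF16_DISPATCH=1 for a non-quantized model still works, and W4AFP8 no longer depends on the operator remembering to unset it.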
Ok, I will review it.
Motivation
profiling (screenshots omitted): deepseek-R1-0528 vs. deepseek-R1-0528-w4afp8
1. When DeepEP is enabled, the communication latency of the DeepSeek-R1-0528-W4AFP8 model is twice that of the DeepSeek-R1-0528 model.
2. The root cause is that DeepEP dispatch in the DeepSeek-R1-0528-W4AFP8 model uses BF16 for communication, which increases bandwidth consumption and hurts inference performance.
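A back-of-the-envelope calculation shows why BF16 dispatch roughly doubles the communication volume relative to FP8. The hidden size of 7168 and one fp32 scale per 128-element block are assumptions about the DeepSeek-R1 configuration, and real DeepEP packets also carry routing metadata not counted here.

```python
# Rough per-token dispatch payload, per expert copy.
# Assumptions: hidden size 7168 (DeepSeek-R1) and one fp32 scale per
# 128-element block for FP8 block quantization; routing metadata ignored.
HIDDEN = 7168

bf16_bytes = HIDDEN * 2                        # 2 bytes per bf16 element
fp8_bytes = HIDDEN * 1 + (HIDDEN // 128) * 4   # 1 byte/elem + fp32 block scales

print(bf16_bytes)                        # 14336
print(fp8_bytes)                         # 7392
print(round(bf16_bytes / fp8_bytes, 2))  # 1.94
```

Under these assumptions the BF16 payload is close to 2x the FP8 payload, which matches the roughly doubled communication latency observed in profiling.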
Modifications
Switch DeepEP low-latency dispatch to FP8 communication for the DeepSeek-R1-0528-W4AFP8 model instead of BF16.
Accuracy Tests
dataset: gsm8k (main vs. this pr, result screenshots omitted)
dataset: ceval (main vs. this pr, result screenshots omitted)
compare: (screenshot omitted)
Benchmarking and Profiling
GPU: H20

```bash
export NVIDIA_SPARSE_ENABLE=1
export NCCL_SOCKET_IFNAME=bond0
export GLOO_SOCKET_IFNAME=bond0
export NCCL_IB_GID_INDEX=3
export NCCL_NET_GDR_LEVEL=2
export NCCL_IB_TC=160
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export NCCL_DEBUG_FILE=nccl_debug.log
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_NTHREADS=4        # added
export NCCL_NSOCKS_PERTHREAD=8       # added
export NCCL_IB_QPS_PER_CONNECTION=8  # added
export NCCL_IB_SPLIT_DATA_ON_QPS=1   # added
export OMP_NUM_THREADS=1             # TODO
export NCCL_IB_HCA=mlx5_gdr_0:1,mlx5_gdr_1:1,mlx5_gdr_2:1,mlx5_gdr_3:1,mlx5_gdr_4:1,mlx5_gdr_5:1,mlx5_gdr_6:1,mlx5_gdr_7:1
export MC_TE_METRIC=true
export MC_LOG_LEVEL=TRACE
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
export SGLANG_MOE_PADDING=1
export SGLANG_ENABLE_JIT_DEEPGEMM=1
export SGLANG_JIT_DEEPGEMM_COMPILE_WORKERS=32
export SGLANG_CHUNKED_PREFIX_CACHE_THRESHOLD=16384
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64
export PYTHONUNBUFFERED=1
export UCX_TLS=tcp,sm
export UCX_NET_DEVICES=bond0
export METRICS_PREFIX=prefill
export SGLANG_HEALTH_CHECK_MEM_FREE=50
export ENABLE_PD_GUARDIAN=0
export SGLANG_DEEPEP_NORMAL_BF16_DISPATCH=1
```
```bash
python3 -m sglang.launch_server \
  --model-path /export/models/DeepSeek-R1-0528-W4AFP8 \
  --host 0.0.0.0 \
  --port 15000 \
  --trust-remote-code \
  --tp-size 8 \
  --dp-size 8 \
  --enable-dp-attention \
  --moe-dense-tp-size 1 \
  --enable-dp-lm-head \
  --ep-dispatch-algorithm dynamic \
  --eplb-algorithm deepseek \
  --disable-shared-experts-fusion \
  --moe-a2a-backend deepep \
  --moe-runner-backend cutlass \
  --deepep-mode low_latency \
  --chunked-prefill-size 16384 \
  --mem-fraction-static 0.81 \
  --cuda-graph-max-bs 32 \
  --max-running-requests 256 \
  --context-length 16384 \
  --tokenizer-worker-num 8 \
  --enable-dynamic-batch-tokenizer \
  --dynamic-batch-tokenizer-batch-size 8 \
  --attention-backend flashinfer \
  --tool-call-parser deepseekv3 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 1 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 2 \
  --disable-radix-cache \
  --page-size 128
```
```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --host localhost \
  --port 15000 \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --dataset-name sharegpt \
  --num-prompts 100 \
  --sharegpt-output-len 256 \
  --warmup-requests 10 \
  --sharegpt-context-len 4096 \
  --max-concurrency 64 \
  --tokenizer /export/models/DeepSeek-R1-0528-W4AFP8 \
  --request-rate 10 \
  --model /export/models/DeepSeek-R1-0528-W4AFP8
```
main vs. this pr (benchmark result screenshots omitted)
Throughput improved by approximately 10% with this PR.
Checklist