[Reland] DeepSeek-R1-0528-w4a8: DeepEP Low Latency Dispatch Adopts FP8 Communication#22316

Merged
BBuf merged 9 commits into sgl-project:main from xieminghe1:main
Apr 10, 2026

Conversation

@xieminghe1
Contributor

Motivation

Profiling:

DeepSeek-R1-0528: [profiling screenshot]

DeepSeek-R1-0528-W4AFP8: [profiling screenshot]

1. When DeepEP is enabled, the communication latency of the DeepSeek-R1-0528-W4AFP8 model is twice that of the DeepSeek-R1-0528 model.
2. The root cause is that DeepEP dispatch for the DeepSeek-R1-0528-W4AFP8 model uses BF16 for communication, which roughly doubles bandwidth consumption and hurts inference performance.
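A quick back-of-the-envelope calculation shows why BF16 dispatch costs roughly twice the bandwidth of FP8. The hidden size of 7168 matches DeepSeek-R1, but the 128-element FP8 scaling-group size and FP32 scale storage are assumptions for illustration, not taken from this PR's kernels:

```python
# Approximate dispatch payload per token for BF16 vs FP8 communication.
# HIDDEN matches DeepSeek-R1; GROUP and FP32 scales are assumptions.
HIDDEN = 7168
GROUP = 128  # assumed FP8 quantization group size

bf16_bytes = HIDDEN * 2                          # 2 bytes per BF16 element
fp8_bytes = HIDDEN * 1 + (HIDDEN // GROUP) * 4   # 1-byte values + FP32 scales

print(bf16_bytes, fp8_bytes, round(bf16_bytes / fp8_bytes, 2))
```

Even after accounting for the extra scale metadata, FP8 dispatch moves close to half the bytes per token, which is consistent with the 2x latency gap observed in the profiles above.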

Modifications

  1. The DeepEP dispatch for the DeepSeek-R1-0528-W4AFP8 model now uses FP8 for communication instead of BF16.
  2. A per-token to per-tensor quantization kernel was developed to adapt the dispatched FP8 activations to the cutlass_w4a8_moe operator interface.
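The second modification can be sketched in pure Python. This is a hedged illustration of the rescaling math only, not the actual CUDA kernel: the function name is hypothetical, and the FP8 E4M3 max of 448 is the standard format limit, assumed here to be what the kernel targets:

```python
# Hypothetical sketch: re-quantize per-token FP8 activations to a single
# per-tensor scale, as a per-tensor-scaled interface like cutlass_w4a8_moe
# would expect. Pure Python for clarity; the real kernel runs on GPU.
FP8_E4M3_MAX = 448.0  # representable max of FP8 E4M3

def per_token_to_per_tensor(q_values, token_scales):
    """q_values: rows of quantized ints; token_scales: one scale per row."""
    # 1) Dequantize each row with its own per-token scale.
    deq = [[v * s for v in row] for row, s in zip(q_values, token_scales)]
    # 2) Derive one shared scale from the global absolute maximum.
    amax = max(abs(v) for row in deq for v in row)
    tensor_scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    # 3) Requantize every element with the shared per-tensor scale.
    requant = [[round(v / tensor_scale) for v in row] for row in deq]
    return requant, tensor_scale
```

The key point is that the conversion is a cheap elementwise rescale plus one global reduction for the absolute maximum, so it can be fused efficiently between dispatch and the grouped GEMM.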

Accuracy Tests

dataset: gsm8k

main: [screenshot]

this PR: [screenshot]

dataset: ceval

main: [screenshot]

this PR: [screenshot]

compare: [screenshots]

Benchmarking and Profiling

GPU: H20

```shell
export NVIDIA_SPARSE_ENABLE=1
export NCCL_SOCKET_IFNAME=bond0
export GLOO_SOCKET_IFNAME=bond0
export NCCL_IB_GID_INDEX=3
export NCCL_NET_GDR_LEVEL=2
export NCCL_IB_TC=160
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export NCCL_DEBUG_FILE=nccl_debug.log
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_NTHREADS=4 # added
export NCCL_NSOCKS_PERTHREAD=8 # added
export NCCL_IB_QPS_PER_CONNECTION=8 # added
export NCCL_IB_SPLIT_DATA_ON_QPS=1 # added
export OMP_NUM_THREADS=1 # TODO
export NCCL_IB_HCA=mlx5_gdr_0:1,mlx5_gdr_1:1,mlx5_gdr_2:1,mlx5_gdr_3:1,mlx5_gdr_4:1,mlx5_gdr_5:1,mlx5_gdr_6:1,mlx5_gdr_7:1
export MC_TE_METRIC=true
export MC_LOG_LEVEL=TRACE
export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
export SGLANG_MOE_PADDING=1
export SGLANG_ENABLE_JIT_DEEPGEMM=1
export SGLANG_JIT_DEEPGEMM_COMPILE_WORKERS=32
export SGLANG_CHUNKED_PREFIX_CACHE_THRESHOLD=16384
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64
export PYTHONUNBUFFERED=1
export UCX_TLS=tcp,sm
export UCX_NET_DEVICES=bond0
export METRICS_PREFIX=prefill
export SGLANG_HEALTH_CHECK_MEM_FREE=50
export ENABLE_PD_GUARDIAN=0
export SGLANG_DEEPEP_NORMAL_BF16_DISPATCH=1

python3 -m sglang.launch_server \
  --model-path /export/models/DeepSeek-R1-0528-W4AFP8 \
  --host 0.0.0.0 \
  --port 15000 \
  --trust-remote-code \
  --tp-size 8 \
  --dp-size 8 \
  --enable-dp-attention \
  --moe-dense-tp-size 1 \
  --enable-dp-lm-head \
  --ep-dispatch-algorithm dynamic \
  --eplb-algorithm deepseek \
  --disable-shared-experts-fusion \
  --moe-a2a-backend deepep \
  --moe-runner-backend cutlass \
  --deepep-mode low_latency \
  --chunked-prefill-size 16384 \
  --mem-fraction-static 0.81 \
  --cuda-graph-max-bs 32 \
  --max-running-requests 256 \
  --context-length 16384 \
  --tokenizer-worker-num 8 \
  --enable-dynamic-batch-tokenizer \
  --dynamic-batch-tokenizer-batch-size 8 \
  --attention-backend flashinfer \
  --tool-call-parser deepseekv3 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 1 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 2 \
  --disable-radix-cache \
  --page-size 128
```

```shell
python3 -m sglang.bench_serving --backend sglang --host localhost --port 15000 \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name sharegpt \
  --num-prompts 100 --sharegpt-output-len 256 --warmup-requests 10 \
  --sharegpt-context-len 4096 --max-concurrency 64 \
  --tokenizer /export/models/DeepSeek-R1-0528-W4AFP8 --request-rate 10 \
  --model /export/models/DeepSeek-R1-0528-W4AFP8
```

main: [screenshot]

this PR: [screenshot]

Throughput improved by approximately 10% with this PR.



@BBuf BBuf changed the title DeepSeek-R1-0528-w4a8: DeepEP Low Latency Dispatch Adopts FP8 Communication [Reland] DeepSeek-R1-0528-w4a8: DeepEP Low Latency Dispatch Adopts FP8 Communication Apr 9, 2026
@BBuf
Collaborator

BBuf commented Apr 9, 2026

/tag-and-rerun-ci

@github-actions github-actions Bot added the run-ci label Apr 9, 2026
@BBuf BBuf merged commit 18f41ac into sgl-project:main Apr 10, 2026
235 of 271 checks passed
Fridge003 pushed a commit that referenced this pull request Apr 11, 2026
…8 Communication (#22316)

Co-authored-by: undefined <zhouchen.arrebol@jd.com>
Co-authored-by: xq25478 <xq25478@qq.com>
pyc96 pushed a commit to pyc96/sglang that referenced this pull request Apr 14, 2026
…8 Communication (sgl-project#22316)

Co-authored-by: undefined <zhouchen.arrebol@jd.com>
Co-authored-by: xq25478 <xq25478@qq.com>
Quoted diff context:

```python
if input_global_scale is not None:
    use_nvfp4 = True
elif not envs.SGLANG_DEEPEP_BF16_DISPATCH.get():
    ...
else:
    ...
```
Contributor

The environment variable SGLANG_DEEPEP_BF16_DISPATCH is unrelated to W4AFP8 weights; it is set for non-quantized weights. Deleting it here will prevent non-quantized weights from using DeepEP correctly. Could you simply avoid setting export SGLANG_DEEPEP_BF16_DISPATCH=1 when starting the service with W4AFP8 weights, instead of modifying the code?
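The reviewer's concern can be sketched as a small dispatch-dtype selector that keeps the BF16 escape hatch for non-quantized weights and keys the FP8 fast path on the quantization scheme instead. The function name, the `quant_method` parameter, and the string return values are assumptions for illustration, not the actual sglang code:

```python
# Hypothetical sketch: choose the DeepEP dispatch dtype from the quant
# scheme instead of deleting the SGLANG_DEEPEP_BF16_DISPATCH check.
def select_dispatch_dtype(quant_method, input_global_scale, bf16_dispatch_env):
    if input_global_scale is not None:
        return "nvfp4"                 # NVFP4 path, as in the quoted diff
    if quant_method == "w4afp8":
        return "fp8"                   # this PR's FP8 fast path
    if bf16_dispatch_env:
        return "bf16"                  # preserved for non-quantized / wna16 weights
    return "fp8"
```

With a gate like this, W4AFP8 gets FP8 dispatch unconditionally while wna16 models such as Kimi-K2.5 keep their BF16 behavior when the environment variable is set.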

Contributor

I confirm that this change breaks models with the wna16 quantization scheme, like Kimi-K2.5

Contributor

Here is my quickfix: #22822, please review it. I think we can introduce a variable to pass quant_type into this function; I'll think about how to do it better.

@xieminghe1
Contributor Author

xieminghe1 commented Apr 15, 2026 via email

yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
…8 Communication (sgl-project#22316)

Co-authored-by: undefined <zhouchen.arrebol@jd.com>
Co-authored-by: xq25478 <xq25478@qq.com>