[Refactor] Refactor DeepEP dispatcher #22822
OrangeRedeng wants to merge 35 commits into sgl-project:main from
Conversation
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
OK, but for users, right now there's no assertion, so it defaults to FP8. However, setting the env var `export SGLANG_DEEPEP_BF16_DISPATCH=1` will disable the FP8 dispatch path. Could you add me on WeChat? My WeChat: xieminghe0119
Due to a bug, we can't enable `SGLANG_DEEPEP_BF16_DISPATCH` right now: even if we specify it, we will always get quantized FP8/int8 hidden_states back from the dispatcher. As I understand it, this issue isn't limited to Ascend.
Yes, `export SGLANG_DEEPEP_BF16_DISPATCH=1` has been deleted. You can fix the bug and restore the `SGLANG_DEEPEP_BF16_DISPATCH=1` functionality, but then FP8_DISPATCH may be disabled. We may need a more reasonable method.
I can't see a problem here. In case you need FP8_DISPATCH, just set `SGLANG_DEEPEP_BF16_DISPATCH=false`. Double-checked with the quant type: W4AFP8 should support FP8_DISPATCH since its activation is FP8 dtype, so the question is why we needed this assert in the very beginning:

```python
assert (
    envs.SGLANG_DEEPEP_BF16_DISPATCH.get()
), "W4AFP8 does not support FP8 dispatch; please set SGLANG_DEEPEP_BF16_DISPATCH=1."
```
We discussed this with @xieminghe1 and we thought it would be better to create a new variable that could control the type of the dispatcher's output (nvfp4, fp8, int8, bf16, etc.), because right now it isn't obvious and is determined by `input_global_scale`. Everything related to `SGLANG_DEEPEP_BF16_DISPATCH` should be deleted.
Force-pushed from 4202ec6 to 28bd3d7.
/gemini review
/gemini summary |
Description
Refactor the DeepEP dispatcher to introduce structured output dtype control, replacing the `SGLANG_DEEPEP_BF16_DISPATCH` environment variable with a `DeepOutputDtype` enum and automatic detection. This is part of a broader effort to make the dispatch pipeline robust for quantized MoE models, especially on Ascend NPU, while reducing memory overhead and simplifying user configuration.
Motivation
The previous dispatch pipeline had several shortcomings:

- FP8 dispatch was effectively always enabled on non-NPU, non-flashinfer-cuteDSL paths, causing incorrect results for W8A8 quant and BF16-pretrained models.
- Users could set `SGLANG_DEEPEP_BF16_DISPATCH=1` to disable FP8, but this variable was not properly plumbed through all code paths (e.g., `_dispatch_core` in low-latency mode) and was globally scoped, leading to subtle bugs.
- `params_bytes` was hardcoded to 2, which wasted up to 50% of the staging buffer memory when dispatching in FP8 or NVFP4 (where 1 byte per parameter is sufficient).
- `ModelSlimW4A4Int4MoE` and NPU compressed-tensors MoE methods needed manual tuning but lacked any automatic dtype selection.
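For intuition on the 50% figure, here is a rough, purely illustrative sketch of the staging-buffer arithmetic; the sizes and formula below are assumptions, not DeepEP's exact buffer math:

```python
# Illustrative only: per-rank staging cost at different params_bytes values.
hidden_size = 7168          # hypothetical model hidden dim
max_dispatch_tokens = 4096  # hypothetical staging capacity per rank

bf16_bytes = max_dispatch_tokens * hidden_size * 2  # params_bytes = 2
fp8_bytes = max_dispatch_tokens * hidden_size * 1   # params_bytes = 1

print(f"BF16 staging: {bf16_bytes / 2**20:.0f} MiB")  # 56 MiB
print(f"FP8 staging:  {fp8_bytes / 2**20:.0f} MiB")   # 28 MiB
# Hardcoding params_bytes = 2 while dispatching FP8/NVFP4 wastes the difference.
```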
Modifications
A concise summary of the changes follows; see the commit history for full details.
1. Introduce `DeepOutputDtype` enum (`moe/utils.py`)
   - Values: `BF16`, `FP8`, `NVFP4`.
   - New flag `--deepep-dispatch-output-dtype` accepts `bf16`, `fp8`, `nvfp4`, or the default `auto`.
   - `SGLANG_DEEPEP_BF16_DISPATCH` removed from docs, scripts, and the enforce-path assert.
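A minimal sketch of what the enum and its dtype-to-byte-width mapping could look like; the `params_bytes` property and the choices list are illustrative assumptions, not verbatim PR code:

```python
from enum import Enum


class DeepOutputDtype(Enum):
    """Resolved output dtype of the DeepEP dispatcher."""

    BF16 = "bf16"
    FP8 = "fp8"
    NVFP4 = "nvfp4"

    @property
    def params_bytes(self) -> int:
        # BF16 needs 2 bytes per parameter; FP8/NVFP4 pack into 1 byte.
        return 2 if self is DeepOutputDtype.BF16 else 1


# Values accepted by --deepep-dispatch-output-dtype; "auto" defers the
# decision to the detection logic described in the next item.
DEEPEP_DISPATCH_OUTPUT_DTYPE_CHOICES = ["auto", "bf16", "fp8", "nvfp4"]
```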
2. Automatic detection logic (`moe/utils.py`)
   - New `get_deepep_output_dtype()` infers the correct output dtype from, in priority order:
     0. Explicit deprecated env variables.
     1. The `--deepep-dispatch-output-dtype` flag (when not `auto`).
     2. `flashinfer_cutedsl` runner backend → `NVFP4`.
     3. `BF16` on NPU.
     4. `FP8` for standard FP8-pretrained models.
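The priority chain could be sketched roughly as below, reusing the `DeepOutputDtype` enum from the previous sketch; the parameter names and the final BF16 fallback are assumptions, not the actual PR signature:

```python
def get_deepep_output_dtype(
    cli_value: str,
    runner_backend: str,
    is_npu: bool,
    is_fp8_pretrained: bool,
) -> DeepOutputDtype:
    """Resolve the dispatcher output dtype following the documented priority order."""
    # 0. Explicit deprecated env variables would be honored first (omitted here).
    # 1. An explicit CLI value short-circuits everything else.
    if cli_value != "auto":
        return DeepOutputDtype(cli_value)
    # 2. The flashinfer_cutedsl runner backend dispatches in NVFP4.
    if runner_backend == "flashinfer_cutedsl":
        return DeepOutputDtype.NVFP4
    # 3. Ascend NPU paths default to BF16 dispatch.
    if is_npu:
        return DeepOutputDtype.BF16
    # 4. Standard FP8-pretrained models keep FP8 dispatch.
    if is_fp8_pretrained:
        return DeepOutputDtype.FP8
    # Fallback (assumption): BF16 as the safe lossless default.
    return DeepOutputDtype.BF16
```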
3. Dispatcher refactor (`deepep.py`)
   - `_DeepEPDispatcherImplBase.__init__` now receives `deepep_output_dtype` (as `DeepOutputDtype`), sets `self.use_fp8` and `self.use_nvfp4`, and computes `self.params_bytes` dynamically based on the resolved dtype: `BF16` → `params_bytes = 2`; `FP8`/`NVFP4` → `params_bytes = 1`.
   - `dispatch_a`/`_dispatch_core` in both the normal and low-latency implementations no longer compute `use_fp8` locally; they use the pre-computed flags stored in `self`.
   - Buffer sizes (`nvl_buffer_size`, `rdma_buffer_size`) shrink proportionally, as `params_bytes` is now computed per-layer.
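A condensed sketch of how `__init__` could derive these fields; the constructor signature here is an assumption, and only the field names come from the PR description:

```python
class _DeepEPDispatcherImplBase:
    def __init__(self, deepep_output_dtype: DeepOutputDtype):
        self.deepep_output_dtype = deepep_output_dtype
        # Flags are resolved once here rather than recomputed inside
        # dispatch_a/_dispatch_core, keeping the normal and low-latency
        # paths consistent.
        self.use_fp8 = deepep_output_dtype is DeepOutputDtype.FP8
        self.use_nvfp4 = deepep_output_dtype is DeepOutputDtype.NVFP4
        # BF16 -> 2 bytes per parameter; FP8/NVFP4 -> 1 byte. The staging
        # buffers (nvl_buffer_size, rdma_buffer_size) scale with this value.
        self.params_bytes = (
            2 if deepep_output_dtype is DeepOutputDtype.BF16 else 1
        )
```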
4. NPU-specific path enhancements (`ep_moe/layer.py`)
   - Adds an `isinstance(self.scheme, (ModelSlimW4A4Int4MoE,))` check that calls `torch_npu.npu_dynamic_quant(hidden_states, dst_type=torch.quint4x2)` before dispatch, enabling the correct BF16 output path for `ModelSlimW4A4Int4MoE` on Ascend.
   - `deepep_output_dtype` is inferred from the scheme.
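A hedged sketch of that pre-dispatch quantization step; it assumes an Ascend environment with `torch_npu` installed, and `ModelSlimW4A4Int4MoE` stands in for the PR's scheme class (its import path is not shown here):

```python
import torch
import torch_npu  # Ascend NPU extension for PyTorch; requires NPU hardware


def maybe_quantize_before_dispatch(scheme, hidden_states: torch.Tensor):
    """Dynamic int4 quantization before dispatch for ModelSlimW4A4Int4MoE."""
    if isinstance(scheme, (ModelSlimW4A4Int4MoE,)):  # scheme class from the PR
        # npu_dynamic_quant returns the quantized tensor and per-token scales.
        hidden_states, scale = torch_npu.npu_dynamic_quant(
            hidden_states, dst_type=torch.quint4x2
        )
        return hidden_states, scale
    return hidden_states, None
```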
5. Additional improvements
   - `Kimi K2.5` (and some other compressed-tensors models): loading speed improved up to 3× on NPU.
   - Docs updated to remove references to the deprecated env var.
Accuracy Tests and Speed Tests
Kimi K2.5
Qwen3.5-35B-A3B-w8a8-mtp
server:

```bash
SGLANG_SET_CPU_AFFINITY=1 PYTORCH_NPU_ALLOC_CONF=expandable_segments:True STREAMS_PER_DEVICE=32 HCCL_BUFFSIZE=1536 ASCEND_RT_VISIBLE_DEVICES=12,13,14,15 python3 -m sglang.launch_server --model-path "./weights/Qwen/Qwen3.5-35B-A3B-w8a8-mtp/" --tp 4 --mem-fraction-static 0.8 --max-total-tokens 66000 --trust-remote-code --attention-backend ascend --device npu --host 127.0.0.1 --port 30088 --moe-a2a-backend deepep --deepep-mode auto --cuda-graph-max-bs 128 --disable-radix-cache
```

client:
```bash
python ./benchmark/gsm8k/bench_sglang.py --num-questions 1319 --port 30088 --data-path ./datasets/gsm8k/test.jsonl --parallel 128
```

Checklist
Review and Merge Process
`/tag-and-rerun-ci`, `/tag-run-ci-label`, `/rerun-failed-ci`