[Refactor] Refactor DeepEP dispatcher #22822
OrangeRedeng wants to merge 35 commits into sgl-project:main from
Conversation
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
OK, but for users, right now there's no assertion, so it defaults to FP8. However, setting the env var `export SGLANG_DEEPEP_BF16_DISPATCH=1` will disable the FP8 dispatch path. Could you add me on WeChat? My WeChat: xieminghe0119
Due to a bug, we can't enable `SGLANG_DEEPEP_BF16_DISPATCH` right now: even if we specify it, we will always get quantized FP8/int8 hidden_states back from the dispatcher. As I understand it, this issue isn't limited to Ascend.
Yes, `export SGLANG_DEEPEP_BF16_DISPATCH=1` has been deleted. You can fix the bug and restore the `SGLANG_DEEPEP_BF16_DISPATCH=1` functionality, but then FP8_DISPATCH may be disabled. We may need a more reasonable method.
I can't see a problem here. In case you need FP8_DISPATCH, just set `SGLANG_DEEPEP_BF16_DISPATCH=false`. Double-checked with the quant type: W4AFP8 should support FP8_DISPATCH since its activation is FP8 dtype, so the question is why we needed this assert in the very beginning:

```python
assert (
    envs.SGLANG_DEEPEP_BF16_DISPATCH.get()
), "W4AFP8 does not support FP8 dispatch; please set SGLANG_DEEPEP_BF16_DISPATCH=1."
```
We discussed this with @xieminghe1 and we thought it would be better to create a new variable that could control the type of the dispatcher's output (nvfp4, fp8, int8, bf16, etc.), because right now it isn't obvious and is determined by `input_global_scale`. Everything related to `SGLANG_DEEPEP_BF16_DISPATCH` should be deleted.
Force-pushed from 4202ec6 to 28bd3d7.
/gemini review
/gemini summary |
Description
Refactor the DeepEP dispatcher to introduce structured output dtype control, replacing the `SGLANG_DEEPEP_BF16_DISPATCH` environment variable with a `DeepOutputDtype` enum and automatic detection. This is part of a broader effort to make the dispatch pipeline robust for quantized MoE models, especially on Ascend NPU, while reducing memory overhead and simplifying user configuration.
Motivation
The previous dispatch pipeline had several shortcomings:

- FP8 dispatch was effectively always enabled on non-NPU, non-flashinfer-cuteDSL paths, causing incorrect results for W8A8 quant and BF16-pretrained models.
- Users could set `SGLANG_DEEPEP_BF16_DISPATCH=1` to disable FP8, but this variable was not properly plumbed through all code paths (e.g., `_dispatch_core` in low-latency mode) and was globally scoped, leading to subtle bugs.
- `params_bytes` was hardcoded to 2, which wasted up to 50% of the staging buffer memory when dispatching in FP8 or NVFP4 (where 1 byte per parameter is sufficient).
- `ModelSlimW4A4Int4MoE` and NPU compressed-tensors MoE methods needed manual tuning but lacked any automatic dtype selection.
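For intuition on the 50% figure, here is a rough, purely illustrative sketch of the staging-buffer arithmetic; the sizes and formula below are assumptions, not DeepEP's exact buffer math:

```python
# Illustrative only: per-rank staging cost at different params_bytes values.
hidden_size = 7168          # hypothetical model hidden dim
max_dispatch_tokens = 4096  # hypothetical staging capacity per rank

bf16_bytes = max_dispatch_tokens * hidden_size * 2  # params_bytes = 2
fp8_bytes = max_dispatch_tokens * hidden_size * 1   # params_bytes = 1

print(f"BF16 staging: {bf16_bytes / 2**20:.0f} MiB")  # 56 MiB
print(f"FP8 staging:  {fp8_bytes / 2**20:.0f} MiB")   # 28 MiB
# Hardcoding params_bytes = 2 while dispatching FP8/NVFP4 wastes the difference.
```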
Modifications
A concise summary of the changes follows; see the commit history for full details.
1. Introduce `DeepOutputDtype` enum (`moe/utils.py`)
   - Values: `BF16`, `FP8`, `NVFP4`.
   - New flag `--deepep-dispatch-output-dtype` accepts `bf16`, `fp8`, `nvfp4`, or the default `auto`.
   - `SGLANG_DEEPEP_BF16_DISPATCH` removed from docs, scripts, and the enforce-path assert.
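A minimal sketch of what the enum and its dtype-to-byte-width mapping could look like; the `params_bytes` property and the choices list are illustrative assumptions, not verbatim PR code:

```python
from enum import Enum


class DeepOutputDtype(Enum):
    """Resolved output dtype of the DeepEP dispatcher."""

    BF16 = "bf16"
    FP8 = "fp8"
    NVFP4 = "nvfp4"

    @property
    def params_bytes(self) -> int:
        # BF16 needs 2 bytes per parameter; FP8/NVFP4 pack into 1 byte.
        return 2 if self is DeepOutputDtype.BF16 else 1


# Values accepted by --deepep-dispatch-output-dtype; "auto" defers the
# decision to the detection logic described in the next item.
DEEPEP_DISPATCH_OUTPUT_DTYPE_CHOICES = ["auto", "bf16", "fp8", "nvfp4"]
```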
2. Automatic detection logic (`moe/utils.py`)
   - New `get_deepep_output_dtype()` infers the correct output dtype from, in priority order:
     0. Explicit deprecated env variables.
     1. The `--deepep-dispatch-output-dtype` flag (when not `auto`).
     2. `flashinfer_cutedsl` runner backend → `NVFP4`.
     3. `BF16` on NPU.
     4. `FP8` for standard FP8-pretrained models.
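The priority chain could be sketched roughly as below, reusing the `DeepOutputDtype` enum from the previous sketch; the parameter names and the final BF16 fallback are assumptions, not the actual PR signature:

```python
def get_deepep_output_dtype(
    cli_value: str,
    runner_backend: str,
    is_npu: bool,
    is_fp8_pretrained: bool,
) -> DeepOutputDtype:
    """Resolve the dispatcher output dtype following the documented priority order."""
    # 0. Explicit deprecated env variables would be honored first (omitted here).
    # 1. An explicit CLI value short-circuits everything else.
    if cli_value != "auto":
        return DeepOutputDtype(cli_value)
    # 2. The flashinfer_cutedsl runner backend dispatches in NVFP4.
    if runner_backend == "flashinfer_cutedsl":
        return DeepOutputDtype.NVFP4
    # 3. Ascend NPU paths default to BF16 dispatch.
    if is_npu:
        return DeepOutputDtype.BF16
    # 4. Standard FP8-pretrained models keep FP8 dispatch.
    if is_fp8_pretrained:
        return DeepOutputDtype.FP8
    # Fallback (assumption): BF16 as the safe lossless default.
    return DeepOutputDtype.BF16
```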
3. Dispatcher refactor (`deepep.py`)
   - `_DeepEPDispatcherImplBase.__init__` now receives `deepep_output_dtype` (as `DeepOutputDtype`), sets `self.use_fp8` and `self.use_nvfp4`, and computes `self.params_bytes` dynamically based on the resolved dtype: `BF16` → `params_bytes = 2`; `FP8`/`NVFP4` → `params_bytes = 1`.
   - `dispatch_a`/`_dispatch_core` in both the normal and low-latency implementations no longer compute `use_fp8` locally; they use the pre-computed flags stored in `self`.
   - Buffer sizes (`nvl_buffer_size`, `rdma_buffer_size`) shrink proportionally, as `params_bytes` is now computed per-layer.
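A condensed sketch of how `__init__` could derive these fields; the constructor signature here is an assumption, and only the field names come from the PR description:

```python
class _DeepEPDispatcherImplBase:
    def __init__(self, deepep_output_dtype: DeepOutputDtype):
        self.deepep_output_dtype = deepep_output_dtype
        # Flags are resolved once here rather than recomputed inside
        # dispatch_a/_dispatch_core, keeping the normal and low-latency
        # paths consistent.
        self.use_fp8 = deepep_output_dtype is DeepOutputDtype.FP8
        self.use_nvfp4 = deepep_output_dtype is DeepOutputDtype.NVFP4
        # BF16 -> 2 bytes per parameter; FP8/NVFP4 -> 1 byte. The staging
        # buffers (nvl_buffer_size, rdma_buffer_size) scale with this value.
        self.params_bytes = (
            2 if deepep_output_dtype is DeepOutputDtype.BF16 else 1
        )
```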
4. NPU-specific path enhancements (`ep_moe/layer.py`)
   - Adds an `isinstance(self.scheme, (ModelSlimW4A4Int4MoE,))` check that calls `torch_npu.npu_dynamic_quant(hidden_states, dst_type=torch.quint4x2)` before dispatch, enabling the correct BF16 output path for `ModelSlimW4A4Int4MoE` on Ascend.
   - `deepep_output_dtype` is inferred from the scheme.
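A hedged sketch of that pre-dispatch quantization step; it assumes an Ascend environment with `torch_npu` installed, and `ModelSlimW4A4Int4MoE` stands in for the PR's scheme class (its import path is not shown here):

```python
import torch
import torch_npu  # Ascend NPU extension for PyTorch; requires NPU hardware


def maybe_quantize_before_dispatch(scheme, hidden_states: torch.Tensor):
    """Dynamic int4 quantization before dispatch for ModelSlimW4A4Int4MoE."""
    if isinstance(scheme, (ModelSlimW4A4Int4MoE,)):  # scheme class from the PR
        # npu_dynamic_quant returns the quantized tensor and per-token scales.
        hidden_states, scale = torch_npu.npu_dynamic_quant(
            hidden_states, dst_type=torch.quint4x2
        )
        return hidden_states, scale
    return hidden_states, None
```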
5. Additional improvements
   - `Kimi K2.5` (and some other compressed-tensors models): loading speed improved up to 3× on NPU.
   - Docs updated to remove references to the deprecated env var.
Accuracy Tests and Speed Tests
Kimi K2.5
Qwen3.5-35B-A3B-w8a8-mtp
server:

```bash
SGLANG_SET_CPU_AFFINITY=1 PYTORCH_NPU_ALLOC_CONF=expandable_segments:True STREAMS_PER_DEVICE=32 HCCL_BUFFSIZE=1536 ASCEND_RT_VISIBLE_DEVICES=12,13,14,15 python3 -m sglang.launch_server --model-path "./weights/Qwen/Qwen3.5-35B-A3B-w8a8-mtp/" --tp 4 --mem-fraction-static 0.8 --max-total-tokens 66000 --trust-remote-code --attention-backend ascend --device npu --host 127.0.0.1 --port 30088 --moe-a2a-backend deepep --deepep-mode auto --cuda-graph-max-bs 128 --disable-radix-cache
```

client:
```bash
python ./benchmark/gsm8k/bench_sglang.py --num-questions 1319 --port 30088 --data-path ./datasets/gsm8k/test.jsonl --parallel 128
```

Checklist
Review and Merge Process
`/tag-and-rerun-ci`, `/tag-run-ci-label`, `/rerun-failed-ci`