
[Refactor] Refactor DeepEP dispatcher #22822

Open

OrangeRedeng wants to merge 35 commits into sgl-project:main from OrangeRedeng:kimi-k2.5-fix

Conversation

OrangeRedeng (Contributor) commented Apr 14, 2026

Description

Refactor the DeepEP dispatcher to introduce structured output dtype control, replacing the
SGLANG_DEEPEP_BF16_DISPATCH environment variable with a DeepOutputDtype enum and
automatic detection. This is part of a broader effort to make the dispatch pipeline robust for
quantized MoE models — especially on Ascend NPU — while reducing memory overhead and
simplifying user configuration.

Motivation

The previous dispatch pipeline had several shortcomings:

  • Hardcoded FP8 assumption. The dispatcher unconditionally enabled FP8 dispatch on
    non-NPU, non-flashinfer-cuteDSL paths, producing incorrect results for W8A8-quantized
    and BF16-pretrained models.
  • Fragile env-var workaround. Users were forced to set
    SGLANG_DEEPEP_BF16_DISPATCH=1 to disable FP8, but this variable was not properly
    plumbed through all code paths (e.g., _dispatch_core in low‑latency mode) and was
    globally scoped, leading to subtle bugs.
  • Static buffer sizing. params_bytes was hardcoded to 2, which wasted up to 50%
    of the staging buffer memory when dispatching in FP8 or NVFP4 (where 1 byte
    per parameter is sufficient).
  • No scheme‑aware dispatch. Quantization schemes like ModelSlimW4A4Int4MoE and
    NPU‑compressed‑tensors MoE methods needed manual tuning but lacked any automatic
    dtype selection.

Modifications

A concise summary of the changes follows; see the commit history for full details.

1. Introduce DeepOutputDtype enum (moe/utils.py)

  • New enum with members BF16, FP8, NVFP4 (sketched after this list).
  • Server argument --deepep-dispatch-output-dtype accepts bf16, fp8, nvfp4, or
    the default auto.
  • Deprecated SGLANG_DEEPEP_BF16_DISPATCH — removed from docs, scripts, and the
    W4AFP8 enforcement assert.
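
A minimal sketch of what the enum might look like; the member names come from this PR, while the string values and the from_str helper are illustrative assumptions:

    from enum import Enum

    class DeepOutputDtype(Enum):
        BF16 = "bf16"
        FP8 = "fp8"
        NVFP4 = "nvfp4"

        @classmethod
        def from_str(cls, value: str) -> "DeepOutputDtype":
            # Maps a --deepep-dispatch-output-dtype value
            # ("bf16" / "fp8" / "nvfp4") to an enum member; "auto" is
            # resolved earlier by the detection logic (see section 2).
            return cls(value.lower())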

2. Automatic detection logic (moe/utils.py)

  • New function get_deepep_output_dtype() infers the output dtype in priority order
    (a sketch follows this list):
    1. Explicit (deprecated) env variables.
    2. Explicit server argument (if not auto).
    3. If quant_config contains input_global_scale → NVFP4.
    4. If quant_config contains dispather_output_dtype → parse it.
    5. If the runner backend is flashinfer_cutedsl → NVFP4.
    6. Fallback to BF16 on NPU.
    7. Fallback to FP8 for standard FP8-pretrained models.
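
A sketch of the fallback chain; the priority order comes from the list above, but the signature, the quant_config access pattern, and the omission of the deprecated env-var check (step 1) are illustrative simplifications:

    def get_deepep_output_dtype(
        server_arg: str,
        quant_config: dict | None,
        runner_backend: str,
        is_npu: bool,
    ) -> DeepOutputDtype:
        # An explicit server argument always wins.
        if server_arg != "auto":
            return DeepOutputDtype.from_str(server_arg)
        quant_config = quant_config or {}
        # NVFP4-calibrated checkpoints carry input_global_scale.
        if "input_global_scale" in quant_config:
            return DeepOutputDtype.NVFP4
        # The checkpoint may name its own dispatch dtype (key as spelled in the PR).
        if "dispather_output_dtype" in quant_config:
            return DeepOutputDtype.from_str(quant_config["dispather_output_dtype"])
        # The flashinfer cuteDSL runner backend expects NVFP4 dispatch.
        if runner_backend == "flashinfer_cutedsl":
            return DeepOutputDtype.NVFP4
        # Safe default on Ascend NPU.
        if is_npu:
            return DeepOutputDtype.BF16
        # Standard FP8-pretrained models.
        return DeepOutputDtype.FP8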

3. Dispatcher refactor (deepep.py)

  • _DeepEPDispatcherImplBase.__init__ now receives deepep_output_dtype (as
    DeepOutputDtype), sets self.use_fp8 and self.use_nvfp4, and
    computes self.params_bytes dynamically from the resolved dtype (sketched after
    this list):
    • BF16 → params_bytes = 2
    • FP8 / NVFP4 → params_bytes = 1
  • dispatch_a / _dispatch_core in both normal and low-latency implementations
    no longer compute use_fp8 locally; they use the pre-computed flags stored in self.
  • Staging buffer sizes (nvl_buffer_size, rdma_buffer_size) shrink proportionally,
    as params_bytes is now computed per layer.
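
A sketch of the dtype-driven sizing; only deepep_output_dtype, use_fp8, use_nvfp4, and params_bytes are named in this PR, and the reduced constructor signature is an assumption:

    class _DeepEPDispatcherImplBase:
        def __init__(self, deepep_output_dtype: DeepOutputDtype):
            self.deepep_output_dtype = deepep_output_dtype
            # Pre-compute the flags once so dispatch_a / _dispatch_core
            # read them from self instead of deriving use_fp8 locally.
            self.use_fp8 = deepep_output_dtype == DeepOutputDtype.FP8
            self.use_nvfp4 = deepep_output_dtype == DeepOutputDtype.NVFP4
            # BF16 needs 2 bytes per element; FP8 and NVFP4 dispatch need only 1,
            # which shrinks the nvl/rdma staging buffers proportionally.
            self.params_bytes = 2 if deepep_output_dtype == DeepOutputDtype.BF16 else 1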

4. NPU‑specific path enhancements (ep_moe/layer.py)

  • Explicit isinstance(self.scheme, (ModelSlimW4A4Int4MoE,)) check that calls
    torch_npu.npu_dynamic_quant(hidden_states, dst_type=torch.quint4x2) before dispatch,
    enabling the correct BF16 output path for ModelSlimW4A4Int4MoE on Ascend
    (sketched after this list).
  • NPU compressed‑tensors fused W4A16 MoE method patched to pass through the proper
    deepep_output_dtype inferred from the scheme.
  • AWQ DeepEP now supported on Ascend.
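
A sketch of the pre-dispatch hook; the isinstance check and the npu_dynamic_quant call are quoted from this PR, while the helper name _maybe_quantize_before_dispatch and its return convention are hypothetical:

    import torch
    import torch_npu  # Ascend NPU extensions

    def _maybe_quantize_before_dispatch(self, hidden_states: torch.Tensor):
        if isinstance(self.scheme, (ModelSlimW4A4Int4MoE,)):
            # Dynamically quantize activations to packed int4 before dispatch,
            # enabling the correct BF16 output path on Ascend.
            hidden_states, scale = torch_npu.npu_dynamic_quant(
                hidden_states, dst_type=torch.quint4x2
            )
            return hidden_states, scale
        return hidden_states, None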

5. Additional improvements

  • Kimi K2.5 (and some other compressed‑tensors models): loading speed improved up to 3× on NPU.
  • CI: updated CI tests to use the new server argument
    and removed references to the deprecated env var.
  • Docs: updated the expert-parallelism documentation to mention the new server argument
    and removed references to the deprecated env var.

Accuracy Tests and Speed Tests

Kimi K2.5

[image: Kimi K2.5 accuracy/speed results]

Qwen3.5-35B-A3B-w8a8-mtp

server:

    SGLANG_SET_CPU_AFFINITY=1 PYTORCH_NPU_ALLOC_CONF=expandable_segments:True STREAMS_PER_DEVICE=32 HCCL_BUFFSIZE=1536 ASCEND_RT_VISIBLE_DEVICES=12,13,14,15 python3 -m sglang.launch_server --model-path "./weights/Qwen/Qwen3.5-35B-A3B-w8a8-mtp/" --tp 4 --mem-fraction-static 0.8 --max-total-tokens 66000 --trust-remote-code --attention-backend ascend --device npu --host 127.0.0.1 --port 30088 --moe-a2a-backend deepep --deepep-mode auto --cuda-graph-max-bs 128 --disable-radix-cache

client:

    python ./benchmark/gsm8k/bench_sglang.py --num-questions 1319 --port 30088 --data-path ./datasets/gsm8k/test.jsonl --parallel 128

[image: GSM8K benchmark results]

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

xieminghe1 (Contributor) commented

OK, but for users, right now there’s no assertion, so it defaults to FP8. However, setting the env var export SGLANG_DEEPEP_BF16_DISPATCH=1 will disable the FP8 dispatch path. Could you add me on WeChat? My WeChat: xieminghe0119

OrangeRedeng (Contributor, Author) commented

> OK, but for users, right now there’s no assertion, so it defaults to FP8. However, setting the env var export SGLANG_DEEPEP_BF16_DISPATCH=1 will disable the FP8 dispatch path. Could you add me on WeChat? My WeChat: xieminghe0119

Due to a bug, we can't enable SGLANG_DEEPEP_BF16_DISPATCH right now: even if we specify it, the dispatcher always returns quantized FP8/int8 hidden_states. As I understand it, this issue isn't limited to Ascend.

xieminghe1 (Contributor) commented

Yes, export SGLANG_DEEPEP_BF16_DISPATCH=1 has been deleted. You can fix the bug and restore the SGLANG_DEEPEP_BF16_DISPATCH=1 functionality, but FP8_DISPATCH may be disabled. We may need a reasonable method.

iforgetmyname (Collaborator) commented Apr 15, 2026

> Yes, export SGLANG_DEEPEP_BF16_DISPATCH=1 has been deleted. You can fix the bug and restore the SGLANG_DEEPEP_BF16_DISPATCH=1 functionality, but FP8_DISPATCH may be disabled. We may need a reasonable method.

I can't see a problem here. In case you need FP8_DISPATCH, just set SGLANG_DEEPEP_BF16_DISPATCH=false.

Double-checked the quant type: W4FP8 should support FP8_DISPATCH since its activation is FP8 dtype, so the question is why we needed this assert in the first place:

        assert (
            envs.SGLANG_DEEPEP_BF16_DISPATCH.get()
        ), "W4AFP8 does not support FP8 dispatch; please set SGLANG_DEEPEP_BF16_DISPATCH=1."

@OrangeRedeng OrangeRedeng marked this pull request as draft April 15, 2026 07:33
OrangeRedeng (Contributor, Author) commented

> > Yes, export SGLANG_DEEPEP_BF16_DISPATCH=1 has been deleted. You can fix the bug and restore the SGLANG_DEEPEP_BF16_DISPATCH=1 functionality, but FP8_DISPATCH may be disabled. We may need a reasonable method.
>
> I can't see a problem here. In case you need FP8_DISPATCH, just set SGLANG_DEEPEP_BF16_DISPATCH=false.
>
> Double-checked the quant type: W4FP8 should support FP8_DISPATCH since its activation is FP8 dtype, so the question is why we needed this assert in the first place:
>
>     assert (
>         envs.SGLANG_DEEPEP_BF16_DISPATCH.get()
>     ), "W4AFP8 does not support FP8 dispatch; please set SGLANG_DEEPEP_BF16_DISPATCH=1."

We discussed it with @xieminghe1 and we think it would be better to create a new variable that controls the dtype of the dispatcher's output (nvfp4, fp8, int8, bf16, etc.), because right now it's not obvious and is determined by input_global_scale. Everything related to SGLANG_DEEPEP_BF16_DISPATCH has been deleted.

github-actions bot added the documentation, npu, deepseek labels Apr 15, 2026
@OrangeRedeng OrangeRedeng changed the title [Bugfix] Fix DeepEP BF16 dispatcher [Refactor] Refactor DeepEP dispatcher May 4, 2026
OrangeRedeng (Contributor, Author) commented

/gemini review

OrangeRedeng (Contributor, Author) commented

/gemini summary

@ping1jing2 ping1jing2 self-assigned this May 5, 2026
Comment threads:
  • python/sglang/srt/environ.py
  • python/sglang/srt/layers/moe/utils.py (outdated)
  • python/sglang/srt/layers/moe/ep_moe/layer.py (outdated)

Labels

deepseek, documentation, npu


5 participants