[DRAFT] Support dispatch low latency #10120
Conversation
Summary of Changes
Hello @kaixih, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request implements a new low-latency dispatch mechanism for Mixture-of-Experts (MoE) layers by integrating FlashInfer's CuteDSL kernels. The changes enhance the DeepEP configuration to support Bfloat16 dispatch and incorporate global scaling factors for improved handling of quantized operations. This work aims to optimize the performance of MoE computations within the system.
Highlights
- New MoE Dispatch Backend: Introduces FLASHINFER_CUTEDSL as a new backend for Mixture-of-Experts (MoE) dispatch, leveraging FlashInfer's CuteDSL kernels for potentially lower latency.
- Bfloat16 Dispatch Support: Adds an environment variable SGLANG_DEEPEP_BF16_DISPATCH to enable Bfloat16 precision for dispatch operations within the DeepEP configuration.
- Quantization Scale Integration: Modifies the dispatch process to pass input global scales, crucial for accurate quantized operations, particularly with FP4 weights.
- New Test Suite: Includes a new test file (test_flashinfer_cutedsl_scaled_mm.py) to validate the correctness of the FlashInfer CuteDSL grouped GEMM and MoE masked operations.
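The global-scale integration in the third highlight can be illustrated with a small sketch. The function name and the two-level NVFP4 scaling convention below (FP4 E2M1 max of 6.0, FP8 E4M3 max of 448) are assumptions for illustration and are not taken from this PR; the actual kernels may compute scales differently:

```python
import numpy as np

FP4_MAX = 6.0    # largest magnitude representable in FP4 E2M1
FP8_MAX = 448.0  # largest magnitude representable in FP8 E4M3

def compute_input_global_scale(x: np.ndarray) -> np.ndarray:
    """Per-expert global scale for NVFP4-style two-level quantization.

    x: [num_experts, m, k] activations. The global scale maps each
    expert's amax into the combined FP8 (block scale) x FP4 range so
    that both block scales and quantized values stay representable.
    """
    amax = np.abs(x).reshape(x.shape[0], -1).max(axis=1)  # (num_experts,)
    return (FP4_MAX * FP8_MAX) / np.maximum(amax, 1e-12)  # (num_experts,)

x = np.random.randn(4, 8, 32).astype(np.float32)  # 4 experts
scale = compute_input_global_scale(x)
print(scale.shape)  # (4,)
```

This also shows why the dispatch path needs a `(num_experts,)`-shaped scale tensor: the scale is computed per expert, not per tensor.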
Code Review
This pull request adds support for a new low-latency dispatch mechanism for Mixture-of-Experts (MoE) layers, utilizing FlashInfer's CuteDSL kernels. This is exposed via a new flashinfer_cutedsl backend for the MoE runner. The changes include adding the new backend option, implementing the corresponding MoE computation path with FP4 quantization, and adding a new test file for the functionality. My review focuses on the correctness and consistency of these changes. I've found a critical issue in the new test file where an argument is passed with an incorrect type, which would lead to a runtime error. I've also identified some minor inconsistencies in docstrings and assertion messages that should be fixed for better maintainability.
```python
out = flashinfer_cutedsl_moe_masked(
    hidden_states_3d.to(hidden_states.device),
    input_global_scale,
    w1_fp4.permute(2, 0, 1),
    w1_blockscale,
    w1_alpha,
    w2_fp4.permute(2, 0, 1),
    a2_global_scale,
    w2_blockscale,
    w2_alpha,
    masked_m.to(hidden_states.device),
)
```
The flashinfer_cutedsl_moe_masked function expects hidden_states to be a tuple of two tensors (quantized data and scales), but hidden_states_3d, which is a single tensor, is passed. This will cause a runtime error. The hidden_states_3d tensor should be quantized before being passed to the function, for example by using scaled_fp4_grouped_quant.
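The shape contract behind that fix can be sketched without FlashInfer. The helper below is a hypothetical stand-in for scaled_fp4_grouped_quant that only mimics the output layout (`[num_experts, m, k // 2]` packed uint8 data plus `[num_experts, m, k // 16]` block scales); the real kernel's rounding, swizzling, and e4m3 scale encoding will differ:

```python
import numpy as np

BLOCK = 16  # elements covered by one FP4 block scale

def mock_scaled_fp4_grouped_quant(x: np.ndarray):
    """Shape-only mock: turn [num_experts, m, k] activations into the
    (packed_fp4, blockscales) tuple the masked MoE kernel expects."""
    e, m, k = x.shape
    assert k % BLOCK == 0
    # One scale per 16-element block (stored as e4m3 in the real kernel).
    blocks = x.reshape(e, m, k // BLOCK, BLOCK)
    scales = np.abs(blocks).max(axis=-1) / 6.0            # [e, m, k // 16]
    safe = np.maximum(scales, 1e-12)[..., None]
    q = np.clip(np.rint(blocks / safe), -6, 6).astype(np.int8)
    # Pack two 4-bit values per uint8: [e, m, k] -> [e, m, k // 2].
    q = (q & 0xF).reshape(e, m, k // 2, 2).astype(np.uint8)
    packed = q[..., 0] | (q[..., 1] << 4)
    return packed, scales.astype(np.float32)

x = np.random.randn(2, 4, 32).astype(np.float32)
packed, scales = mock_scaled_fp4_grouped_quant(x)
print(packed.shape, scales.shape)  # (2, 4, 16) (2, 4, 2)
```

Passing the resulting two-element tuple, rather than the raw bf16 tensor, is what the reviewed call site is missing.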
```python
Args:
    hidden_states (tuple[torch.Tensor, torch.Tensor]): [num_experts, m, k // 2], uint8, [num_experts, m, k // 16], float8_e4m3fn
    input_global_scale (torch.Tensor): (l,)
    w1 (torch.Tensor): fp4 weights, [l, 2 * n, k // 2], uint8
    w1_blockscale (torch.Tensor): blockscale factors, e4m3,
    w1_alpha (torch.Tensor): (l,)
    w2 (torch.Tensor): fp4 weights, [l, k, n // 2], uint8
    a2_global_scale (torch.Tensor): (l,)
    w2_blockscale (torch.Tensor): blockscale factors, e4m3,
    w2_alpha (torch.Tensor): (l,)
    masked_m (torch.Tensor): Masked dimension indices
```
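A quick way to read those shapes: with `l = num_experts` experts, hidden size `k`, and intermediate size `n`, the FP4 operands pack two 4-bit values per uint8 byte (hence `k // 2` and `n // 2`) and carry one block scale per 16 elements (hence `k // 16`). The sketch below just materializes dummy buffers with those shapes; the dtypes are approximations (numpy has no fp4 or e4m3), and the reading of `masked_m` as valid-rows-per-expert is an assumption:

```python
import numpy as np

l, m, n, k = 8, 16, 256, 512  # experts, tokens/expert, intermediate, hidden

hidden_states = (
    np.zeros((l, m, k // 2), dtype=np.uint8),     # packed fp4 activations
    np.zeros((l, m, k // 16), dtype=np.float32),  # e4m3 block scales (approx.)
)
w1 = np.zeros((l, 2 * n, k // 2), dtype=np.uint8)  # gate+up projection, fp4
w2 = np.zeros((l, k, n // 2), dtype=np.uint8)      # down projection, fp4
input_global_scale = np.ones(l, dtype=np.float32)  # one scale per expert
masked_m = np.full(l, m, dtype=np.int32)           # valid rows per expert (assumed)

# Packing invariants implied by the docstring:
assert hidden_states[0].shape[-1] * 2 == k   # two fp4 values per byte
assert hidden_states[1].shape[-1] * 16 == k  # one block scale per 16 elements
assert w1.shape[1] == 2 * n and w2.shape[1] == k
```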
```python
assert input_global_scale.shape == (
    num_experts,
), f"input_global_scale must be (l,), got {input_global_scale.shape}"
assert w1_alpha.shape == (
    num_experts,
), f"w1_alpha must be (l,), got {w1_alpha.shape}"
assert a2_global_scale.shape == (
    num_experts,
), f"a2_global_scale must be (l,), got {a2_global_scale.shape}"
assert w2_alpha.shape == (
    num_experts,
), f"w2_alpha must be (l,), got {w2_alpha.shape}"
```
The assertion messages use (l,) to describe the expected shape, but the variable l is not defined in this scope. The code actually checks against num_experts. For consistency and clarity, please use num_experts in the assertion messages.
Suggested change:

```python
assert input_global_scale.shape == (
    num_experts,
), f"input_global_scale must be (num_experts,), got {input_global_scale.shape}"
assert w1_alpha.shape == (
    num_experts,
), f"w1_alpha must be (num_experts,), got {w1_alpha.shape}"
assert a2_global_scale.shape == (
    num_experts,
), f"a2_global_scale must be (num_experts,), got {a2_global_scale.shape}"
assert w2_alpha.shape == (
    num_experts,
), f"w2_alpha must be (num_experts,), got {w2_alpha.shape}"
```
FYI my future dev work will be in: feat/deepep_ll_nvfp4 #10263
well gsm8k for this branch is zero :/ |
Closing, since the new work is in #10263 and has already been merged.
WIP
cc. @kushanam @wenscarl @fzyzcjy