Conversation
Summary of Changes

Hello @Fridge003, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request performs a comprehensive revert of a previous feature that introduced support for two-batch overlapping with Mori Expert Parallelism (EP), particularly targeting AMD platforms. The changes systematically remove the specialized low-latency dispatch and combine mechanisms, dual-stream communication, and associated metadata management. By eliminating this functionality and the related conditional code paths, the PR aims to simplify the codebase and enhance stability.
Code Review
This pull request reverts the support for two-batch overlapping for Mori EP on AMD. While the revert successfully removes the overlapping logic, it introduces several regressions and potential bugs: an AttributeError in the Mori MoE layer due to an unguarded attribute access, a double-scaling bug in the DeepSeek-V2 model when using the aiter backend, and a regression in how MLA metadata is initialized for CUDA graphs on AMD. Additionally, the benchmark script's token-capacity calculation is now incorrect for multi-DP configurations.
```diff
-is_quark_w4a4 = hasattr(self, "scheme") and isinstance(
-    self.scheme, QuarkW4A4MXFp4MoE
-)
+is_quark_w4a4 = isinstance(self.scheme, QuarkW4A4MXFp4MoE)
```
This line will raise an AttributeError if self.scheme is not defined on the layer instance. The previous code correctly used hasattr(self, "scheme") to guard this check. Since MoriEPMoE (and its parent DeepEPMoE) does not guarantee the presence of a scheme attribute, this revert introduces a regression.
Suggested change:

```diff
-is_quark_w4a4 = isinstance(self.scheme, QuarkW4A4MXFp4MoE)
+is_quark_w4a4 = hasattr(self, "scheme") and isinstance(self.scheme, QuarkW4A4MXFp4MoE)
```
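For illustration, here is a small standalone sketch (the layer class below is hypothetical, not SGLang code) of why the unguarded check fails and how the guarded form behaves; an equivalent one-liner uses getattr with a None default:

```python
class QuarkW4A4MXFp4MoE:  # stand-in for the real scheme class
    pass


class MoriEPMoELike:  # hypothetical layer: "scheme" may or may not be set
    def __init__(self, scheme=None):
        if scheme is not None:
            self.scheme = scheme


layer = MoriEPMoELike()  # no "scheme" attribute on this instance

# Unguarded check: raises AttributeError when "scheme" is missing.
# is_quark_w4a4 = isinstance(layer.scheme, QuarkW4A4MXFp4MoE)

# Guarded check, as in the suggestion above:
is_quark_w4a4 = hasattr(layer, "scheme") and isinstance(layer.scheme, QuarkW4A4MXFp4MoE)

# Equivalent guard using getattr with a default:
is_quark_w4a4 = isinstance(getattr(layer, "scheme", None), QuarkW4A4MXFp4MoE)
```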
```diff
 if (shared_output := state.pop("shared_output")) is not None:
     x = shared_output
-    if _use_aiter:
-        x.add_(final_hidden_states)
-    else:
-        x.add_(final_hidden_states, alpha=self.routed_scaling_factor)
+    x.add_(final_hidden_states, alpha=self.routed_scaling_factor)
     final_hidden_states = x
-elif _use_aiter:
-    # fused in aiter_biased_grouped_topk so we can skip here
-    pass
 else:
     final_hidden_states *= self.routed_scaling_factor
```
The removal of the if _use_aiter: check in this block introduces a double-scaling bug for the aiter backend on AMD. When _use_aiter is enabled, the routed_scaling_factor is already applied to the topk_weights inside aiter_biased_grouped_topk. Applying it again here (either via alpha in add_ or via *= in the else block) will produce incorrect model outputs.
Suggested change:

```diff
 if (shared_output := state.pop("shared_output")) is not None:
     x = shared_output
-    x.add_(final_hidden_states, alpha=self.routed_scaling_factor)
+    if _use_aiter:
+        x.add_(final_hidden_states)
+    else:
+        x.add_(final_hidden_states, alpha=self.routed_scaling_factor)
     final_hidden_states = x
+elif _use_aiter:
+    # fused in aiter_biased_grouped_topk so we can skip here
+    pass
 else:
     final_hidden_states *= self.routed_scaling_factor
```
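To see the effect concretely, here is a toy numeric sketch (values and variable names are illustrative only, not taken from the PR) of what happens when routed_scaling_factor is folded into the top-k weights and then applied a second time in the residual add:

```python
import torch

routed_scaling_factor = 2.5
shared_output = torch.ones(4)
expert_output = torch.ones(4)  # stand-in for the routed-expert output

# aiter path: the scaling is already folded into topk_weights, so the
# expert output effectively arrives pre-scaled by routed_scaling_factor.
prescaled = expert_output * routed_scaling_factor

correct = shared_output + prescaled                                 # scale applied once
double_scaled = shared_output + prescaled * routed_scaling_factor   # scale applied twice

print(correct)        # tensor([3.5000, 3.5000, 3.5000, 3.5000])
print(double_scaled)  # tensor([7.2500, 7.2500, 7.2500, 7.2500])
```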
```python
if layer.layer_id == 0 and _use_mla_ps_kernel:
    self.make_mla_meta_data(
        self.forward_metadata.qo_indptr,
        self.forward_metadata.kv_indptr,
        self.forward_metadata.kv_last_page_len,
        work_metadata,
        work_info_set,
        work_indptr,
        reduce_indptr,
        reduce_final_map,
        reduce_partial_map,
        self.forward_metadata.max_q_len,
        fast_mode=fast_mode,
        max_split_per_batch=num_kv_splits,
        intra_batch_mode=intra_batch_mode,
    )
```
Moving make_mla_meta_data from the CUDA graph initialization phase (init_forward_metadata_replay_cuda_graph) into the forward pass (forward_extend and forward_decode) is a regression.
- Efficiency: This is a CPU-side metadata preparation function. Calling it in every forward pass adds unnecessary CPU overhead to the critical path.
- CUDA Graph Compatibility: Python logic inside a captured forward pass is not re-executed during graph replay. If this metadata needs to be updated for different batches (even with fixed shapes), the replay will use stale values. It should be kept in the initialization/replay-prep functions, where it can be called once before capture or replay (see the sketch below).
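As a minimal illustration of the replay pitfall (a standalone PyTorch sketch, not SGLang code, and it assumes a CUDA device is available): the Python call inside the captured region runs only at capture time, so replays never re-execute it.

```python
import torch

calls = 0

def prepare_metadata():
    # Host-side (CPU) work: runs as ordinary Python code.
    global calls
    calls += 1

static_x = torch.ones(4, device="cuda")
static_y = torch.empty(4, device="cuda")

# Warm-up on a side stream, as recommended before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    prepare_metadata()
    static_y.copy_(static_x * 2)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    prepare_metadata()            # executes now, during capture
    static_y.copy_(static_x * 2)  # only the GPU ops are recorded

graph.replay()
graph.replay()
print(calls)  # 2 (warm-up + capture); the replays never re-ran the Python call
```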
```python
skip_token_capacity_threshold = (
    internal_state[0].get("memory_usage", {}).get("token_capacity", 1000000000)
)
```
This change incorrectly calculates the total token capacity by looking only at the first DP rank. In a multi-DP configuration, the total capacity should be the sum of the capacities across all ranks. As written, benchmarks will be skipped prematurely on multi-GPU systems.
Suggested change:

```diff
-skip_token_capacity_threshold = (
-    internal_state[0].get("memory_usage", {}).get("token_capacity", 1000000000)
-)
+# Get token capacity
+skip_token_capacity_threshold = 0
+for i in range(dp_size):
+    skip_token_capacity_threshold += (
+        internal_state[i]
+        .get("memory_usage", {})
+        .get("token_capacity", 1000000000)
+    )
```
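Equivalently, assuming internal_state is a per-rank list of dicts as above, the aggregation can be written with a built-in sum (sketch only):

```python
# Sum the per-rank token capacities across all DP ranks.
skip_token_capacity_threshold = sum(
    internal_state[i].get("memory_usage", {}).get("token_capacity", 1000000000)
    for i in range(dp_size)
)
```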
Motivation
Fix broken CI https://github.com/sgl-project/sglang/actions/runs/22256775640/job/64445687327
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci