
Revert "[AMD] support two batch overlapping for mori ep #17953"#19161

Merged
Fridge003 merged 1 commit intomainfrom
revert
Feb 22, 2026
Merged

Revert "[AMD] support two batch overlapping for mori ep #17953"#19161
Fridge003 merged 1 commit intomainfrom
revert

Conversation

Fridge003 (Collaborator) commented Feb 22, 2026

Motivation

Fix broken CI https://github.com/sgl-project/sglang/actions/runs/22256775640/job/64445687327

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

gemini-code-assist (Contributor) commented

Summary of Changes

Hello @Fridge003, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request performs a comprehensive revert of a previous feature that introduced support for two-batch overlapping with Mori Expert Parallelism (EP), particularly targeting AMD platforms. The changes systematically remove the specialized low-latency dispatch and combine mechanisms, dual-stream communication, and associated metadata management. By eliminating this specific functionality and related conditional code paths, the PR aims to simplify the codebase and enhance stability.

Highlights

  • Mori EP Low-Latency Mode Removal: Removed explicit support for a low-latency mode within the Mori Expert Parallelism (EP) dispatcher, including associated data structures, communication stream handling, and related configuration options.
  • Two-Batch Overlapping (TBO) Simplification: Eliminated specific logic and dispatcher instantiation related to Mori EP within the two-batch overlapping engine, streamlining its implementation.
  • Attention Backend Metadata Cleanup: Simplified metadata initialization in the attention backend by removing conditional logic and variables previously used for Multi-head Latent Attention (MLA) and speculative decoding within CUDA graph replay.
  • Configuration and Documentation Updates: Updated server arguments to automatically set deepep_mode to 'normal' for Mori EP and removed 'mori' as a low-latency option from the documentation.
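A minimal sketch of the server-argument behavior described in the last highlight, assuming hypothetical helper and attribute names rather than the exact SGLang code:

```python
import logging

logger = logging.getLogger(__name__)


def _normalize_mori_ep_args(args):
    """Hypothetical helper: when the Mori A2A backend is selected, fall back to
    deepep_mode='normal' and warn, since low-latency Mori EP has been reverted."""
    if getattr(args, "moe_a2a_backend", None) == "mori" and args.deepep_mode != "normal":
        logger.warning(
            "Mori EP currently supports only deepep_mode='normal'; overriding deepep_mode=%r.",
            args.deepep_mode,
        )
        args.deepep_mode = "normal"
```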


Changelog
  • docs/advanced_features/server_arguments.md
    • Removed 'mori' from the --moe-a2a-backend options list.
  • python/sglang/srt/batch_overlap/operations_strategy.py
    • Removed is_hip utility import and conditional logic for deep_gemm_num_sms.
  • python/sglang/srt/batch_overlap/two_batch_overlap.py
    • Removed MoriEPDispatcher import and its instantiation logic from TwoBatchOverlapEngine.
  • python/sglang/srt/layers/attention/aiter_backend.py
    • Simplified init_forward_metadata by removing a condition for idle forward mode.
    • Removed several metadata initialization blocks and related variables in init_forward_metadata_replay_cuda_graph.
  • python/sglang/srt/layers/moe/ep_moe/layer.py
    • Removed MoriEPLLCombineInput import.
    • Updated comments regarding DeepEP low-latency support.
    • Refactored forward and run_moe_core methods to simplify dispatch output handling and combine input creation.
  • python/sglang/srt/layers/moe/fused_moe_triton/layer.py
    • Adjusted create_moe_dispatcher to separate Mori EP dispatcher creation, re-adding specific MoriEPDispatcher instantiation outside of the MaybeTboDeepEPDispatcher block.
  • python/sglang/srt/layers/moe/token_dispatcher/__init__.py
    • Removed MoriEPLLCombineInput and MoriEPLLDispatchOutput from imports and __all__ export list.
  • python/sglang/srt/layers/moe/token_dispatcher/moriep.py
    • Removed DeepEPPDispatchHooks import and class, and is_tbo_enabled import.
    • Removed MoriEPLLDispatchOutput and MoriEPLLCombineInput named tuples.
    • Removed EpMode.LOW_LATENCY from the EpMode enum.
    • Reduced lru_cache maxsize for init_mori_op from 2 to 1.
    • Removed deepep_mode parameter and related async logic from init_mori_op.
    • Removed CommStreamPool class.
    • Simplified _MoriEPDispatcherImplBase by removing quant_config, overlap_args, and meta_overlap_args properties and methods.
    • Simplified _MoriEPDispatcherImplNormal by removing async/dual-stream logic and related event handling.
    • Modified _dispatch_core to use self.mori_op.dispatch instead of dispatch_send.
    • Simplified combine_a and combine_b in _MoriEPDispatcherImplNormal to remove stream handling and use combine instead of combine_send.
    • Refactored MoriEPDispatcher to remove _low_latency_dispatcher and related logic, explicitly raising NotImplementedError for low-latency mode.
  • python/sglang/srt/models/deepseek_v2.py
    • Removed conditional logic related to Mori EP and _use_aiter in op_output.
    • Removed num_tokens tracking in op_comm_prepare_attn.
  • python/sglang/srt/server_args.py
    • Added self.deepep_mode = "normal" and a warning for Mori EP.
    • Simplified chunked_prefill_size validation by removing a condition related to disaggregation_mode.
  • python/sglang/test/bench_one_batch_server_internal.py
    • Simplified token capacity calculation in run_benchmark_internal.

gemini-code-assist (bot) left a comment


Code Review

This pull request reverts the support for two-batch overlapping for Mori EP on AMD. While the revert successfully removes the overlapping logic, it introduces several regressions and potential bugs. Specifically, it introduces an AttributeError in the Mori MoE layer due to an unguarded attribute access, a double-scaling bug in the DeepSeek-V2 model when using the aiter backend, and a regression in how MLA metadata is initialized for CUDA graphs on AMD. Additionally, the benchmark script's token capacity calculation is now incorrect for multi-DP configurations.

-    is_quark_w4a4 = hasattr(self, "scheme") and isinstance(
-        self.scheme, QuarkW4A4MXFp4MoE
-    )
+    is_quark_w4a4 = isinstance(self.scheme, QuarkW4A4MXFp4MoE)

Severity: high

This line will raise an AttributeError if self.scheme is not defined on the layer instance. The previous code correctly used hasattr(self, "scheme") to guard this check. Since MoriEPMoE (and its parent DeepEPMoE) does not guarantee the presence of a scheme attribute, this revert introduces a regression.

Suggested change
-    is_quark_w4a4 = isinstance(self.scheme, QuarkW4A4MXFp4MoE)
+    is_quark_w4a4 = hasattr(self, "scheme") and isinstance(self.scheme, QuarkW4A4MXFp4MoE)
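For illustration, a self-contained sketch (stub classes, not SGLang code) of why the `hasattr` guard matters when `scheme` may be absent:

```python
class _MoELayerStub:
    """Hypothetical stand-in for a MoE layer that, like MoriEPMoE, may never set `scheme`."""


class _QuarkSchemeStub:
    """Stand-in for QuarkW4A4MXFp4MoE."""


layer = _MoELayerStub()

# Unguarded: raises AttributeError because `layer.scheme` was never assigned.
try:
    isinstance(layer.scheme, _QuarkSchemeStub)
except AttributeError as exc:
    print("unguarded check failed:", exc)

# Guarded: short-circuits to False without touching the missing attribute.
is_quark_w4a4 = hasattr(layer, "scheme") and isinstance(layer.scheme, _QuarkSchemeStub)
print(is_quark_w4a4)  # False
```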

Comment on lines 1057 to 1063
 if (shared_output := state.pop("shared_output")) is not None:
     x = shared_output
-    if _use_aiter:
-        x.add_(final_hidden_states)
-    else:
-        x.add_(final_hidden_states, alpha=self.routed_scaling_factor)
+    x.add_(final_hidden_states, alpha=self.routed_scaling_factor)
     final_hidden_states = x
 elif _use_aiter:
     # fused in aiter_biased_grouped_topk so we can skip here
     pass
 else:
     final_hidden_states *= self.routed_scaling_factor


Severity: high

The removal of the if _use_aiter: check in this block introduces a double-scaling bug for the aiter backend on AMD. When _use_aiter is enabled, the routed_scaling_factor is already applied within aiter_biased_grouped_topk to the topk_weights. Applying it again here (either via alpha in add_ or via *= in the else block) will result in incorrect model outputs.

Suggested change
-    if (shared_output := state.pop("shared_output")) is not None:
-        x = shared_output
-        x.add_(final_hidden_states, alpha=self.routed_scaling_factor)
-        final_hidden_states = x
-    elif _use_aiter:
-        # fused in aiter_biased_grouped_topk so we can skip here
-        pass
-    else:
-        final_hidden_states *= self.routed_scaling_factor
+    if (shared_output := state.pop("shared_output")) is not None:
+        x = shared_output
+        if _use_aiter:
+            x.add_(final_hidden_states)
+        else:
+            x.add_(final_hidden_states, alpha=self.routed_scaling_factor)
+        final_hidden_states = x
+    elif _use_aiter:
+        # fused in aiter_biased_grouped_topk so we can skip here
+        pass
+    else:
+        final_hidden_states *= self.routed_scaling_factor
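To see the double scaling concretely, a standalone numeric sketch (made-up factor and tensors, not SGLang code); on the aiter path the routed output is assumed to arrive already scaled:

```python
import torch

routed_scaling_factor = 2.5  # illustrative value only

shared = torch.ones(4)
# aiter path: the factor is already folded into the topk weights upstream,
# so the routed contribution arrives pre-scaled.
routed_pre_scaled = torch.ones(4) * routed_scaling_factor

correct = shared + routed_pre_scaled                        # factor applied once
buggy = shared + routed_pre_scaled * routed_scaling_factor  # factor applied twice (the bug)

print(correct)  # tensor([3.5000, 3.5000, 3.5000, 3.5000])
print(buggy)    # tensor([7.2500, 7.2500, 7.2500, 7.2500])
```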

Comment on lines +1369 to +1384
if layer.layer_id == 0 and _use_mla_ps_kernel:
    self.make_mla_meta_data(
        self.forward_metadata.qo_indptr,
        self.forward_metadata.kv_indptr,
        self.forward_metadata.kv_last_page_len,
        work_metadata,
        work_info_set,
        work_indptr,
        reduce_indptr,
        reduce_final_map,
        reduce_partial_map,
        self.forward_metadata.max_q_len,
        fast_mode=fast_mode,
        max_split_per_batch=num_kv_splits,
        intra_batch_mode=intra_batch_mode,
    )

Severity: high

Moving make_mla_meta_data from the CUDA graph initialization phase (init_forward_metadata_replay_cuda_graph) into the forward pass (forward_extend and forward_decode) is a regression.

  1. Efficiency: This is a CPU-side metadata preparation function. Calling it in every forward pass adds unnecessary CPU overhead to the critical path.
  2. CUDA Graph Compatibility: Python logic inside a captured forward pass is not re-executed during graph replay. If this metadata needs to be updated for different batches (even with fixed shapes), the graph replay will use stale values. It should be kept in the initialization/replay-prep functions where it can be called once before capture or replay.
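A minimal standalone sketch of the replay pitfall (plain PyTorch on a CUDA/ROCm device, not SGLang code): Python-side branches run once at capture time and are not re-evaluated on replay.

```python
import torch

static_in = torch.zeros(4, device="cuda")
static_out = torch.empty(4, device="cuda")

flag = True  # stands in for Python-side logic such as `layer.layer_id == 0`

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    if flag:  # evaluated once, during capture
        static_out.copy_(static_in + 1)
    else:
        static_out.copy_(static_in + 2)

flag = False           # flipping the Python condition later has no effect
static_in.fill_(10.0)  # updating the static input buffer *is* picked up
g.replay()
print(static_out)      # tensor of 11.0: replay reuses the branch captured earlier
```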

Comment on lines +795 to +797
skip_token_capacity_threshold = (
internal_state[0].get("memory_usage", {}).get("token_capacity", 1000000000)
)

Severity: medium

This change incorrectly calculates the total token capacity by only looking at the first DP rank. In a multi-DP configuration, the total capacity should be the sum of capacities across all ranks. This will cause benchmarks to be skipped prematurely on multi-GPU systems.

Suggested change
-    skip_token_capacity_threshold = (
-        internal_state[0].get("memory_usage", {}).get("token_capacity", 1000000000)
-    )
+    # Get token capacity
+    skip_token_capacity_threshold = 0
+    for i in range(dp_size):
+        skip_token_capacity_threshold += (
+            internal_state[i]
+            .get("memory_usage", {})
+            .get("token_capacity", 1000000000)
+        )
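To make the difference concrete, a tiny illustration with made-up per-rank numbers and the same nesting assumed by the code above:

```python
# Two DP ranks with 80k tokens of capacity each (illustrative numbers only).
internal_state = [
    {"memory_usage": {"token_capacity": 80_000}},  # DP rank 0
    {"memory_usage": {"token_capacity": 80_000}},  # DP rank 1
]

first_rank_only = internal_state[0].get("memory_usage", {}).get("token_capacity", 0)
total_capacity = sum(
    s.get("memory_usage", {}).get("token_capacity", 0) for s in internal_state
)

print(first_rank_only)  # 80000  -> benchmarks skipped too early on multi-DP setups
print(total_capacity)   # 160000 -> the threshold the suggested change computes
```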

Fridge003 merged commit 43f8352 into main on Feb 22, 2026 (53 of 61 checks passed).
Fridge003 deleted the revert branch on February 22, 2026 at 17:19.
HaiShaw (Collaborator) commented Feb 24, 2026

cc @billishyahao @kkHuang-amd


Labels

deepseek, documentation
