Conversation
Summary of Changes

Hello @Fridge003, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request performs a comprehensive revert of a previous feature that introduced support for two-batch overlapping with Mori Expert Parallelism (EP), particularly targeting AMD platforms. The changes systematically remove the specialized low-latency dispatch and combine mechanisms, dual-stream communication, and associated metadata management. By eliminating this functionality and the related conditional code paths, the PR aims to simplify the codebase and enhance stability.
Code Review
This pull request reverts the support for two-batch overlapping for Mori EP on AMD. While the revert successfully removes the overlapping logic, it introduces several regressions and potential bugs: an AttributeError in the Mori MoE layer due to an unguarded attribute access, a double-scaling bug in the DeepSeek-V2 model when using the aiter backend, and a regression in how MLA metadata is initialized for CUDA graphs on AMD. Additionally, the benchmark script's token-capacity calculation is now incorrect for multi-DP configurations.
```diff
-is_quark_w4a4 = hasattr(self, "scheme") and isinstance(
-    self.scheme, QuarkW4A4MXFp4MoE
-)
+is_quark_w4a4 = isinstance(self.scheme, QuarkW4A4MXFp4MoE)
```
This line will raise an AttributeError if self.scheme is not defined on the layer instance. The previous code correctly used hasattr(self, "scheme") to guard this check. Since MoriEPMoE (and its parent DeepEPMoE) does not guarantee the presence of a scheme attribute, this revert introduces a regression.
Suggested change:

```diff
-is_quark_w4a4 = isinstance(self.scheme, QuarkW4A4MXFp4MoE)
+is_quark_w4a4 = hasattr(self, "scheme") and isinstance(self.scheme, QuarkW4A4MXFp4MoE)
```
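For illustration, here is a small standalone sketch (the layer class below is hypothetical, not SGLang code) of why the unguarded check fails and how the guarded form behaves; an equivalent one-liner uses getattr with a None default:

```python
class QuarkW4A4MXFp4MoE:  # stand-in for the real scheme class
    pass


class MoriEPMoELike:  # hypothetical layer: "scheme" may or may not be set
    def __init__(self, scheme=None):
        if scheme is not None:
            self.scheme = scheme


layer = MoriEPMoELike()  # no "scheme" attribute on this instance

# Unguarded check: raises AttributeError when "scheme" is missing.
# is_quark_w4a4 = isinstance(layer.scheme, QuarkW4A4MXFp4MoE)

# Guarded check, as in the suggestion above:
is_quark_w4a4 = hasattr(layer, "scheme") and isinstance(layer.scheme, QuarkW4A4MXFp4MoE)

# Equivalent guard using getattr with a default:
is_quark_w4a4 = isinstance(getattr(layer, "scheme", None), QuarkW4A4MXFp4MoE)
```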
```diff
 if (shared_output := state.pop("shared_output")) is not None:
     x = shared_output
-    if _use_aiter:
-        x.add_(final_hidden_states)
-    else:
-        x.add_(final_hidden_states, alpha=self.routed_scaling_factor)
+    x.add_(final_hidden_states, alpha=self.routed_scaling_factor)
     final_hidden_states = x
-elif _use_aiter:
-    # fused in aiter_biased_grouped_topk so we can skip here
-    pass
 else:
     final_hidden_states *= self.routed_scaling_factor
```
The removal of the if _use_aiter: check in this block introduces a double-scaling bug for the aiter backend on AMD. When _use_aiter is enabled, the routed_scaling_factor is already applied to the topk_weights inside aiter_biased_grouped_topk. Applying it again here (either via alpha in add_ or via *= in the else block) will produce incorrect model outputs.
Suggested change:

```diff
 if (shared_output := state.pop("shared_output")) is not None:
     x = shared_output
-    x.add_(final_hidden_states, alpha=self.routed_scaling_factor)
+    if _use_aiter:
+        x.add_(final_hidden_states)
+    else:
+        x.add_(final_hidden_states, alpha=self.routed_scaling_factor)
     final_hidden_states = x
+elif _use_aiter:
+    # fused in aiter_biased_grouped_topk so we can skip here
+    pass
 else:
     final_hidden_states *= self.routed_scaling_factor
```
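To see the effect concretely, here is a toy numeric sketch (values and variable names are illustrative only, not taken from the PR) of what happens when routed_scaling_factor is folded into the top-k weights and then applied a second time in the residual add:

```python
import torch

routed_scaling_factor = 2.5
shared_output = torch.ones(4)
expert_output = torch.ones(4)  # stand-in for the routed-expert output

# aiter path: the scaling is already folded into topk_weights, so the
# expert output effectively arrives pre-scaled by routed_scaling_factor.
prescaled = expert_output * routed_scaling_factor

correct = shared_output + prescaled                                 # scale applied once
double_scaled = shared_output + prescaled * routed_scaling_factor   # scale applied twice

print(correct)        # tensor([3.5000, 3.5000, 3.5000, 3.5000])
print(double_scaled)  # tensor([7.2500, 7.2500, 7.2500, 7.2500])
```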
```python
if layer.layer_id == 0 and _use_mla_ps_kernel:
    self.make_mla_meta_data(
        self.forward_metadata.qo_indptr,
        self.forward_metadata.kv_indptr,
        self.forward_metadata.kv_last_page_len,
        work_metadata,
        work_info_set,
        work_indptr,
        reduce_indptr,
        reduce_final_map,
        reduce_partial_map,
        self.forward_metadata.max_q_len,
        fast_mode=fast_mode,
        max_split_per_batch=num_kv_splits,
        intra_batch_mode=intra_batch_mode,
    )
```
Moving make_mla_meta_data from the CUDA graph initialization phase (init_forward_metadata_replay_cuda_graph) into the forward pass (forward_extend and forward_decode) is a regression.
- Efficiency: This is a CPU-side metadata preparation function. Calling it in every forward pass adds unnecessary CPU overhead to the critical path.
- CUDA Graph Compatibility: Python logic inside a captured forward pass is not re-executed during graph replay. If this metadata needs to be updated for different batches (even with fixed shapes), the replay will use stale values. It should be kept in the initialization/replay-prep functions, where it can be called once before capture or replay (see the sketch below).
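As a minimal illustration of the replay pitfall (a standalone PyTorch sketch, not SGLang code, and it assumes a CUDA device is available): the Python call inside the captured region runs only at capture time, so replays never re-execute it.

```python
import torch

calls = 0

def prepare_metadata():
    # Host-side (CPU) work: runs as ordinary Python code.
    global calls
    calls += 1

static_x = torch.ones(4, device="cuda")
static_y = torch.empty(4, device="cuda")

# Warm-up on a side stream, as recommended before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    prepare_metadata()
    static_y.copy_(static_x * 2)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    prepare_metadata()            # executes now, during capture
    static_y.copy_(static_x * 2)  # only the GPU ops are recorded

graph.replay()
graph.replay()
print(calls)  # 2 (warm-up + capture); the replays never re-ran the Python call
```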
```python
skip_token_capacity_threshold = (
    internal_state[0].get("memory_usage", {}).get("token_capacity", 1000000000)
)
```
This change incorrectly calculates the total token capacity by looking only at the first DP rank. In a multi-DP configuration, the total capacity should be the sum of the capacities across all ranks. As written, benchmarks will be skipped prematurely on multi-GPU systems.
Suggested change:

```diff
-skip_token_capacity_threshold = (
-    internal_state[0].get("memory_usage", {}).get("token_capacity", 1000000000)
-)
+# Get token capacity
+skip_token_capacity_threshold = 0
+for i in range(dp_size):
+    skip_token_capacity_threshold += (
+        internal_state[i]
+        .get("memory_usage", {})
+        .get("token_capacity", 1000000000)
+    )
```
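Equivalently, assuming internal_state is a per-rank list of dicts as above, the aggregation can be written with a built-in sum (sketch only):

```python
# Sum the per-rank token capacities across all DP ranks.
skip_token_capacity_threshold = sum(
    internal_state[i].get("memory_usage", {}).get("token_capacity", 1000000000)
    for i in range(dp_size)
)
```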
Motivation
Fix broken CI https://github.com/sgl-project/sglang/actions/runs/22256775640/job/64445687327
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci