[AMD] Serialize cross-ProcessGroup collectives for dp_attention#11184
hubertlu-tw wants to merge 3 commits into sgl-project:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request resolves critical stream-capture issues encountered with ROCm 7.0 when performing collective operations in dp_attention.
Code Review
This pull request addresses a stream capture issue on ROCm 7.0 for dp_attention by serializing cross-process-group collectives. The fix involves using an asynchronous reduce_scatter followed by a wait() before all_gather. My review identified a critical bug that would cause a NameError on non-HIP systems, an unused import, and an inconsistent use of a communication wrapper that could bypass optimizations. I have provided suggestions to resolve these issues.
```python
if _is_hip:
    _USE_ROCM7 = get_rocm_version()[0] >= 7
```
The variable _USE_ROCM7 is only defined within the if _is_hip: block. This will cause a NameError on non-HIP platforms where _is_hip is False, as _USE_ROCM7 will be referenced later in _dp_gather_via_all_gather without being defined. To fix this, _USE_ROCM7 should be defined regardless of the platform. A cleaner way to write this would be to combine the check for _is_hip into the assignment.
```python
_USE_ROCM7 = _is_hip and get_rocm_version()[0] >= 7
```

```python
torch.distributed.all_gather_into_tensor(
    global_tokens,
    scattered_local_tokens,
    group=get_tp_group().device_group,
)
```
For consistency and to leverage potential optimizations (like pynccl), it's better to use the GroupCoordinator wrapper get_tp_group().all_gather_into_tensor(...) here, similar to the else branch. The current implementation calls torch.distributed.all_gather_into_tensor directly, which bypasses the logic in the wrapper. The wrapper is designed to be graph-safe, so it should be suitable for this context.
```python
get_tp_group().all_gather_into_tensor(global_tokens, scattered_local_tokens)
```
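The design point behind that suggestion can be sketched abstractly: a coordinator wrapper can dispatch to an optimized communicator when one is registered, while a direct `torch.distributed` call always takes the plain path. All names below are illustrative stand-ins, not sglang's actual `GroupCoordinator` implementation:

```python
# Schematic only -- not sglang's GroupCoordinator. It shows why routing a
# collective through the wrapper matters: the wrapper can pick an
# optimized backend (e.g. pynccl) when one is available, whereas calling
# torch.distributed directly always bypasses that dispatch.

class GroupCoordinatorSketch:
    def __init__(self, fast_comm=None):
        # fast_comm stands in for an optional optimized communicator.
        self.fast_comm = fast_comm

    def all_gather_into_tensor(self, output, input):
        if self.fast_comm is not None:
            return f"fast:{self.fast_comm}"     # optimized, graph-safe path
        return "fallback:torch.distributed"     # plain collective path


coord = GroupCoordinatorSketch(fast_comm="pynccl")
print(coord.all_gather_into_tensor(None, None))  # -> fast:pynccl
```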
@ch-wan could you please help review the PR? Thanks!
HaiShaw left a comment:
Also - any performance indicator?
```python
if _is_hip:
    _USE_ROCM7 = get_rocm_version()[0] >= 7
```
…pact + sglang precedent analysis

- Correct PyTorch #176251 status: merged twice and reverted twice (latest revert 2026-03-31), so the watchdog workaround is currently NOT in main. An empirical check confirms no public torch wheel (upstream nightly, AMD gfx950-dcgpu nightly, rocm/pytorch images) ships RocmWatchdogEventQueryContextGuard. Strike out the "upgrade to a nightly" path (Fallback 1) and add Fallback 2 (cherry-pick the #176942 patch into a local torch 2.9.1 rebuild) as the only software-side option for staying on ROCm 7.2.0.
- Refine §3 v0.1.11 vs v0.1.12 narrative: both releases ship the fused allreduce+rmsnorm+quant kernel; the actual change is v0.1.12 introducing dynamic in-graph output-buffer registration (is_broadcast_reg_outptr -> get_output_buffer_RD), which is what doubles the per-AR host-side bookkeeping inside the capture window.
- Expand §4 multi-PG impact analysis: enumerate the actual NCCL-bearing PGs under --tp 4 --ep 2 (TP, MOE_EP, MOE_TP, WORLD), explain per-capture cost compounding, and note that other models on the same nightly suite are statistically lucky rather than immune.
- Add sgl-project#10434 / sgl-project#11184 precedent: the rocm-7.0.0-alpha cross-PG capture issue had a one-line algorithmic workaround (DpPaddingMode.MAX_LEN -> SUM_LEN). aiter#2857 has no equivalent algorithmic out and depends on the runtime fix in ROCm 7.2.1+.
Co-authored-by: @kkHuang-amd
Motivation
We previously landed a workaround in #10434 that switched _dp_gather_via_all_gather to _dp_gather_via_all_reduce to avoid an RCCL/HIP failure when DP was enabled on dsv3.
The root cause is stricter stream-capture checks in ROCm 7.0: chaining reduce_scatter_tensor on the attention TP process group immediately followed by all_gather_into_tensor on the TP process group inside a captured region can lead to
hipErrorCapturedEvent ("operation not permitted on an event last recorded in a capturing stream"). ROCm 7 updated event/callback behavior during capture (e.g., hipEventQuery, hipStreamAddCallback) to match CUDA, so any out-of-order polling or cross-stream/event use triggers an error (https://rocm.docs.amd.com/projects/HIP/en/docs-develop/hip-7-changes.html#stream-capture-updates). NCCL/RCCL collectives are capturable, but they require stable streams and explicit ordering; PyTorch also documents that when using multiple process groups, outstanding async ops on one PG must be synchronized before issuing collectives on another.
This PR restores the all-gather path and makes it graph-safe by launching reduce_scatter_tensor(..., async_op=True) on the attention process group (from get_attention_tp_group), then calling work.wait() before invoking all_gather_into_tensor on the TP process group (from get_tp_group), both on the same capturing stream. With TORCH_NCCL_BLOCKING_WAIT=1, the wait avoids host-side polling of captured events. This preserves the original RS→AG algorithm while complying with ROCm 7's capture rules.

Modifications
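The change boils down to one ordering invariant, which can be illustrated without GPUs. The sketch below is a pure-Python stand-in (none of these classes are torch.distributed APIs): the Work handle from the async reduce_scatter on the attention-TP group is waited on before any collective is issued on the TP group.

```python
# Illustrative stand-ins only -- not torch.distributed APIs. The point is
# the ordering: async collective on PG A, wait(), then collective on PG B,
# which is what keeps the sequence legal under ROCm 7's capture rules.

class FakeWork:
    """Mimics the handle returned by an async_op=True collective."""
    def __init__(self, log, pg_name):
        self._log, self._pg_name = log, pg_name

    def wait(self):
        # With TORCH_NCCL_BLOCKING_WAIT=1 the real wait blocks instead of
        # polling captured events on the host.
        self._log.append(f"wait({self._pg_name})")


class FakeProcessGroup:
    """Mimics a process group that records the collectives issued on it."""
    def __init__(self, log, name):
        self._log, self._name = log, name

    def reduce_scatter_async(self):
        self._log.append(f"reduce_scatter_tensor({self._name}, async_op=True)")
        return FakeWork(self._log, self._name)

    def all_gather(self):
        self._log.append(f"all_gather_into_tensor({self._name})")


def dp_gather_graph_safe(attn_tp_group, tp_group):
    # RS on the attention-TP group, AG on the TP group: two different PGs,
    # so the outstanding async work must be serialized in between.
    work = attn_tp_group.reduce_scatter_async()
    work.wait()  # serialize before switching process groups
    tp_group.all_gather()


log = []
dp_gather_graph_safe(FakeProcessGroup(log, "attn_tp"), FakeProcessGroup(log, "tp"))
print(log)
```

Running it prints the three steps in the only order the capture rules permit: the reduce_scatter, its wait, then the all_gather.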
Accuracy Tests
Benchmarking and Profiling
Checklist
CC: @HaiShaw