
[SymmMem] Multi-root tile reduction #164757

Closed

kwen2501 wants to merge 2 commits into gh/kwen2501/273/base from gh/kwen2501/273/head

Conversation

Collaborator

@kwen2501 kwen2501 commented Oct 6, 2025

Stack from ghstack (oldest at bottom):

Perform multiple tile reductions concurrently, with each tile reduced to a separate root.

  • The number of concurrent reductions can be smaller than world size, i.e. roots can be a subset of all ranks. But all ranks are still required to call into this API.

  • Currently supports NVLink SHARP scope only.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci

[ghstack-poisoned]
@pytorch-bot

pytorch-bot Bot commented Oct 6, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/164757

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit e17e8b3 with merge base a707042:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot Bot added the ciflow/h100-symm-mem, oncall: distributed, and release notes: distributed (c10d) labels Oct 6, 2025
kwen2501 added a commit that referenced this pull request Oct 6, 2025
ghstack-source-id: d62d23b
Pull-Request: #164757
@kwen2501 kwen2501 requested review from fduwjj, fegin and ngimel October 6, 2025 18:31
@kwen2501 kwen2501 added the release notes: distributed (symm_mem) label Oct 6, 2025
Contributor

@weifengpy weifengpy left a comment
looking good on UX part

root = rank;
}
i++;
}
Contributor
Should we check that root != world_size here?

Collaborator Author

@kwen2501 kwen2501 Oct 6, 2025

This implementation uses root == world_size to indicate that the current rank does not need to reduce any tile (yet it still calls into this API to fulfill the collective requirement). See the "Note" above.

Collaborator

we don't have a test for it though (root==world_size), do we?

Collaborator Author

In test_multi_root_tile_reduce, when root_ratio is 2, we exercise this case: root_ratio=2 means only half of the ranks are roots, and the remaining ranks pass root == world_size here to skip the reduction.

- `reduce_op` is the reduction operation to perform. Currently only "sum" is supported.
*/
TORCH_CHECK(reduce_op == "sum", "tile_reduce: only sum is supported for now");
TORCH_CHECK(out_tile.dtype() == at::kFloat, "Only float is supported");
Collaborator

can you support at least BFloat16 also?

Collaborator Author

Added in base PR #162243

for (auto& in_tile : in_tiles) {
TORCH_CHECK(in_tile.dtype() == at::kFloat, "Only float is supported");
c10d::symmetric_memory::rendezvous(in_tile, group_name);
if (roots[i] == rank) {
Collaborator

We should check that roots[i] is valid (>= 0 and < world_size).

Collaborator Author

Added now.

int nblocks = at::ceil_div(
out_tile.numel() * out_tile.element_size(),
(int64_t)THREADS_PER_BLOCK * 16);
nblocks = std::min(nblocks, 16);
Collaborator

Why limit it to 16? I think for the CUDA backend we cap at 24 at least; we may even need more for Blackwell.

Collaborator Author

Set it to 24 now.

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Oct 7, 2025
ghstack-source-id: c47225e
Pull-Request: #164757
@kwen2501
Collaborator Author

kwen2501 commented Oct 8, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot Bot added the ciflow/trunk label Oct 8, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x ef426ec413907290b5c25b0005bf265ec2d0e6eb returned non-zero exit code 1

Auto-merging test/distributed/test_nvshmem.py
CONFLICT (content): Merge conflict in test/distributed/test_nvshmem.py
Auto-merging torch/csrc/distributed/c10d/symm_mem/SymmetricMemory.cpp
CONFLICT (content): Merge conflict in torch/csrc/distributed/c10d/symm_mem/SymmetricMemory.cpp
Auto-merging torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu
CONFLICT (content): Merge conflict in torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu
Auto-merging torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cuh
CONFLICT (content): Merge conflict in torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cuh
error: could not apply ef426ec4139... [SymmMem] Multi-root tile reduction
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Details for Dev Infra team: raised by workflow job

@kwen2501
Collaborator Author

kwen2501 commented Oct 8, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025

Pull Request resolved: pytorch#164757
Approved by: https://github.com/weifengpy, https://github.com/fegin
ghstack dependencies: pytorch#162243
@github-actions github-actions Bot deleted the gh/kwen2501/273/head branch November 8, 2025 02:11
@nWEIdia
Collaborator

nWEIdia commented Nov 19, 2025

In retrospect, it would have been great if the test had been guarded by:
"Currently supports NVLink SHARP scope only."
We found numerical mismatches on platforms without NVLink (A100 or H100).

@ngimel
Collaborator

ngimel commented Nov 19, 2025

Were those large precision errors, as in grossly incorrect results, or some accuracy mismatch? If the latter, feel free to submit PRs guarding the tests, if the former, can we disable those ops on hardware that doesn't support it?

@nWEIdia
Collaborator

nWEIdia commented Nov 19, 2025

Were those large precision errors, as in grossly incorrect results, or some accuracy mismatch? If the latter, feel free to submit PRs guarding the tests, if the former, can we disable those ops on hardware that doesn't support it?

Below is an example of mismatch:

Mismatched elements: 16384 / 4194304 (0.4%)
Greatest absolute difference: 6.0 at index (0, 0) (up to 1e-05 allowed)
Greatest relative difference: 1.0 at index (0, 0) (up to 1.3e-06 allowed)

To execute this test, run the following from the base repo dir:
python test/distributed/test_nvshmem.py NVSHMEMTileCommTest.test_multi_root_tile_reduce_tile_size_128_root_ratio_2_float32

In certain cases, these test cases caused other problems on non-NVL platforms, which I'm still looking into.

@nWEIdia
Collaborator

nWEIdia commented Nov 25, 2025

FYI: we also noticed that certain H100 systems that do not support NVLink SHARP hit an illegal memory access when running the multi_root tile unit test.
Our NVSHMEM team identified a missed assertion check on one of the (nullptr) pointers; they will fix this.


Labels

ciflow/h100-symm-mem · ciflow/trunk · Merged · oncall: distributed · release notes: distributed (c10d) · release notes: distributed (symm_mem)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants