[SymmMem] Multi-root tile reduction #164757

kwen2501 wants to merge 2 commits into gh/kwen2501/273/base from
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/164757
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit e17e8b3 with merge base a707042. This comment was automatically generated by Dr. CI and updates every 15 minutes.
weifengpy left a comment:

Looking good on the UX part.
```cpp
    root = rank;
  }
  i++;
}
```
Should we check that root != world_size here?
This implementation uses root == world_size to indicate that current rank does not need to reduce any tile. (Yet it still calls into this API to fulfill the collective requirement). You can see the "Note" above.
we don't have a test for it though (root==world_size), do we?
In test_multi_root_tile_reduce, when root_ratio is 2, we will exercise this case.
root_ratio=2 means only half of the ranks are roots; the rest of the ranks will pass root==world_size here to skip the reduction.
```cpp
- `reduce_op` is the reduction operation to perform. Currently only "sum" is supported.
*/
TORCH_CHECK(reduce_op == "sum", "tile_reduce: only sum is supported for now");
TORCH_CHECK(out_tile.dtype() == at::kFloat, "Only float is supported");
```
can you support at least BFloat16 also?
```cpp
for (auto& in_tile : in_tiles) {
  TORCH_CHECK(in_tile.dtype() == at::kFloat, "Only float is supported");
  c10d::symmetric_memory::rendezvous(in_tile, group_name);
  if (roots[i] == rank) {
```
we should check that roots[i] is valid (>=0 and < world_size)
```cpp
int nblocks = at::ceil_div(
    out_tile.numel() * out_tile.element_size(),
    (int64_t)THREADS_PER_BLOCK * 16);
nblocks = std::min(nblocks, 16);
```
Why limit it to 16? I think for the CUDA backend we limit at 24 at least; maybe we even need more for Blackwell.
|
@pytorchbot merge |
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: Command
Details for Dev Infra team: raised by workflow job.
|
@pytorchbot merge |
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0) (oldest at bottom):

Perform multiple tile reductions concurrently, with each tile reduced to a separate root.

- The number of concurrent reductions can be smaller than the world size, i.e. roots can be a subset of all ranks. But all ranks are still required to call into this API.
- Currently supports NVLink SHARP scope only.

Pull Request resolved: pytorch#164757
Approved by: https://github.com/weifengpy, https://github.com/fegin
ghstack dependencies: pytorch#162243
|
In retrospect, it would have been great if the test could be guarded with: |
|
Were those large precision errors, as in grossly incorrect results, or some accuracy mismatch? If the latter, feel free to submit PRs guarding the tests; if the former, can we disable those ops on hardware that doesn't support it? |
Below is an example of a mismatch:

Mismatched elements: 16384 / 4194304 (0.4%)

To execute this test, run the following from the base repo dir:

In certain cases, these test cases caused other problems on non-NVL platforms, which I'm still looking into. |
|
FYI: We also noticed that certain H100 systems that do not support NVLink SHARP hit an illegal memory access when running the multi_root tile unit test. |
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci