[SymmMem] Tiled reduce (#162243)
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162243
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: There is 1 currently active SEV. If your PR is affected, please view it below.
✅ No Failures: As of commit a622842 with merge base a707042.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Can we have some benchmarks for inter- and intra-node? E.g., compared to copy + NCCL?

@ngimel I need to add some util code to create multiple teams to boost the bandwidth. Stacked PR coming :)
ngimel left a comment:
Can you give a tl;dr how you expect multiple teams to improve perf?
```
// src_tensor and dst_tensor are already the tiles to operate on, thus we set
// the start_coord to 0
auto start_coord = nvshmemx::make_shape(0, 0);
nvshmemx::tile_sum_reduce<decltype(src_tensor), decltype(dst_tensor), Shape2D,
                          nvshmemx::tile_coll_algo_t::NVLS_ONE_SHOT_PULL_NBI>(
```
One-shot algorithms are OK only for small sizes; for larger sizes they result in 4x more network traffic at world size 8.
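(A back-of-envelope model of where the 4x comes from, using the standard per-rank traffic counts for allreduce; illustrative numbers, not measurements from this PR:)

```
# Per-rank traffic for an N-byte allreduce over w ranks (standard cost model).
w, N = 8, 1.0
one_shot = (w - 1) * N           # every rank pulls each peer's full buffer
two_shot = 2 * (w - 1) * N / w   # reduce-scatter + all-gather
print(one_shot / two_shot)       # -> 4.0 at w = 8
```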
The docs say, "The users are expected to use the tile_collective_wait routine to ensure completion of the non-blocking collectives." I don't see it here.
That's a limitation of NVSHMEM today; only three algorithms are available:
- tile_coll_algo_t::NVLS_ONE_SHOT_PUSH_NBI
- tile_coll_algo_t::NVLS_ONE_SHOT_PULL_NBI
- tile_coll_algo_t::NVLS_TWO_SHOT_PUSH_NBI

And I don't think TWO_SHOT would work for reduce.

One-shot reduce (not all-reduce) will not create extra traffic, but it would indeed create a hot spot at the root GPU. So this relates to whether the collective has access to intermediate buffers or not:
- if not, reduce can only do one-shot, thus hot-spotting the root and never bandwidth-optimal;
- if yes, then algorithms like ring are possible, and thus bandwidth-optimal (see the rough count below).
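(To make the hot-spot vs. bandwidth-optimal distinction concrete, a rough per-link count using the usual textbook numbers; illustrative, not measured on this PR:)

```
# One-shot reduce vs. ring reduce of an N-byte tile over w ranks (illustrative).
w, N = 8, 1.0
root_ingress_one_shot = (w - 1) * N  # all (w-1) peer tiles converge on the root's links
per_link_ring = (w - 1) * N / w      # a ring spreads ~N bytes evenly across links
# Total bytes moved is comparable; one-shot just concentrates them at the root.
```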
If (for some not-very-small sizes) TWO_SHOT allreduce is faster than one-shot reduce, shouldn't we be using it? At a higher level, what are you trying to achieve? Is it inter-node or intra-node?
TWO_SHOT allreduce will modify non-root ranks' buffers, so it is not quite a 1:1 match in terms of semantics.

@ngimel Today the PR launches only 1 CUDA block to work on the tile. If we want to scale to multiple blocks, e.g. 1 block per k rows, we'd need 1 team per block, because that's the semantics of NVSHMEM teams: each concurrently running collective needs its own team.
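(A hypothetical caller-side sketch of that scaling: split the tile into k row-chunks and issue one reduce per chunk, each of which would need its own team/block underneath. The chunking loop and the `in_tile`/`out_tile`/`root`/`group_name` names are illustrative assumptions, not this PR's code; the `torch.ops.symm_mem` namespace is assumed from where symmetric-memory ops are registered.)

```
# Hypothetical: k concurrent row-chunk reductions, one team/block each underneath.
k = 4
for src, dst in zip(in_tile.chunk(k, dim=0), out_tile.chunk(k, dim=0)):
    torch.ops.symm_mem.tile_reduce(src, dst, root, group_name)
```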
```
 * receiving the reduced tensor. */
TORCH_CHECK(reduce_op == "sum", "tile_reduce: only sum is supported for now");
TORCH_CHECK(in_tile.dim() == 2 && out_tile.dim() == 2, "Only 2D tensors are supported");
TORCH_CHECK_EQ(in_tile.dtype(), out_tile.dtype());
```
Add user-friendly error messages here.
I had a look at the macro expansion of TORCH_CHECK_EQ; looks friendly enough?
(See pytorch/c10/util/logging_is_not_google_glog.h, lines 139 to 141 at commit 6861a27.)
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Starting merge as part of PR stack under #164757.
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0) (oldest at bottom):

Perform multiple tile reductions concurrently, with each tile reduced to a separate root.
- The number of concurrent reductions can be smaller than the world size, i.e. roots can be a subset of all ranks, but all ranks are still required to call into this API.
- Currently supports NVLink SHARP scope only.

Pull Request resolved: #164757
Approved by: https://github.com/weifengpy, https://github.com/fegin
ghstack dependencies: #162243
Added op: `tile_reduce(Tensor input, Tensor(a!) out, int root, str group_name)`
For now supports only:
- NVSHMEM backed symmetric tensor;
- 2D tensor and tile;
- torch.float.
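A minimal usage sketch (assumptions: the op lives under `torch.ops.symm_mem`, the `torch.distributed._symmetric_memory` helpers shown are the setup path, and passing the same tile as input and out is allowed; exact calls may differ by PyTorch version):

```
import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)
group_name = dist.group.WORLD.group_name

symm_mem.set_backend("NVSHMEM")  # NVSHMEM-backed allocation (assumed call)
t = symm_mem.empty((8, 8), dtype=torch.float, device="cuda")
symm_mem.rendezvous(t, group=group_name)
t.fill_(1.0)

# Sum-reduce every rank's bottom-right 4x4 tile into rank 0's tile
# (same tile used as input and out here; an assumption for brevity).
tile = t[4:, 4:]
torch.ops.symm_mem.tile_reduce(tile, tile, 0, group_name)
```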
Testing on right-bottom quadrant:
```
rank 0:
tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 1., 1., 1.],
[0., 0., 0., 0., 1., 1., 1., 1.],
[0., 0., 0., 0., 1., 1., 1., 1.],
[0., 0., 0., 0., 1., 1., 1., 1.]], device='cuda:0')
PASSED
```
Pull Request resolved: pytorch#162243
Approved by: https://github.com/ngimel