
[SymmMem] Add team pool to hold duplicated teams for the same rank group #162320

Closed

kwen2501 wants to merge 8 commits into gh/kwen2501/232/base from gh/kwen2501/232/head

Conversation

kwen2501 (Collaborator) commented Sep 6, 2025

Stack from ghstack (oldest at bottom):

When multiple threadblocks call device-side collectives concurrently, NVSHMEM requires that each call be made on a separate team struct; see [Collective operations scopes and active sets](https://docs.nvidia.com/nvshmem/api/gen/api/collectives.html#collective-operations-scopes-and-active-sets).

This PR adds a utility `get_n_teams` for creating duplicated NVSHMEM teams over the same rank group, i.e. a team pool, so that they can be used on the device side.
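For context, a minimal sketch of how such a pool can be built with the public NVSHMEM API; the helper `make_team_pool` and the kernel are illustrative, not the actual `get_n_teams` implementation, and `dst`/`src` are assumed to be symmetric (nvshmem_malloc'ed) buffers:

```cpp
#include <nvshmem.h>
#include <vector>

// Illustrative only: split the same (start, stride, size) rank group out of
// NVSHMEM_TEAM_WORLD n times. Team creation is collective, so every PE in the
// parent team must make the same calls. Each call yields a distinct team
// handle over the identical rank group.
std::vector<nvshmem_team_t> make_team_pool(int start, int stride, int size, int n) {
  std::vector<nvshmem_team_t> pool(n, NVSHMEM_TEAM_INVALID);
  for (int i = 0; i < n; ++i) {
    nvshmem_team_split_strided(
        NVSHMEM_TEAM_WORLD, start, stride, size,
        /*config=*/nullptr, /*config_mask=*/0, &pool[i]);
  }
  return pool;
}

// Hypothetical device kernel: each threadblock takes its own team from the
// pool (resident in device memory) so that concurrent device-side collectives
// never share a team struct.
__global__ void per_block_reduce(nvshmem_team_t* team_pool, int* dst, int* src) {
  nvshmem_team_t team = team_pool[blockIdx.x];
  if (threadIdx.x == 0) {
    // One collective call per team at a time, as NVSHMEM requires.
    nvshmem_int_sum_reduce(team, dst + blockIdx.x, src + blockIdx.x, 1);
  }
}
```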

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim

[ghstack-poisoned]
pytorch-bot (Bot) commented Sep 6, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162320

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit 2f3b55a with merge base 7a83cf4:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot (Bot) added the ciflow/h100-symm-mem, oncall: distributed, and release notes: distributed (c10d) labels on Sep 6, 2025
kwen2501 added a commit that referenced this pull request Sep 6, 2025
[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Sep 6, 2025
Comment thread on torch/csrc/distributed/c10d/symm_mem/nvshmem_team_manager.hpp (outdated)
[ghstack-poisoned]
if (it == team_pool_devptrs_.end()) {
  // If not, allocate a new pool in device memory
  C10_CUDA_CHECK(cudaMalloc((void**)&team_pool_dev, pool_bytes));
  team_pool_devptrs_[group_name] = team_pool_dev;
Collaborator:
You should think about structuring the code in such a way that when a group is destroyed, its team manager entries are also freed; right now you have a cudaMalloc (and possibly other resources) that's leaking.

kwen2501 (Collaborator, Author):
Added in the destructor now.

Skylion007 (Collaborator) commented Sep 7, 2025:
Wouldn't it be cleaner to use a std::unique_ptr to accomplish this (with a custom deleter passed to the unique_ptr)?

kwen2501 (Collaborator, Author):
Normally yes (thanks for the suggestion)! But here I have specific comments + warning message, so I kind of prefer writing it out explicitly.

Collaborator:
@kwen2501 True, but comments have way more footguns than RAII.

Collaborator:
Agree, using standard RAII techniques is preferable.
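For reference, a minimal sketch of the RAII pattern being suggested (hypothetical; not the code this PR ultimately landed): a `std::unique_ptr` whose custom deleter frees the device allocation.

```cpp
#include <cuda_runtime.h>
#include <memory>

// Hypothetical sketch: the deleter runs automatically when the map entry
// (or the owning manager) is destroyed, so nothing leaks on normal paths.
struct CudaFreeDeleter {
  void operator()(void* p) const noexcept {
    // Best-effort free: at static-destruction time the CUDA context may
    // already be gone, so the returned error code is deliberately ignored.
    (void)cudaFree(p);
  }
};

using DevicePtr = std::unique_ptr<void, CudaFreeDeleter>;

DevicePtr alloc_device_buffer(size_t bytes) {
  void* raw = nullptr;
  if (cudaMalloc(&raw, bytes) != cudaSuccess) {
    return nullptr;
  }
  return DevicePtr(raw);
}
```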

[ghstack-poisoned]
  }
} catch (...) {
  // Ignore the error
  std::cerr << "Failed to free the team pool in device memory, skipping\n";
Collaborator:
Why not use the logging utilities we already have in torch?

kwen2501 (Collaborator, Author):
I was worried that those logging utilities might have been destructed at this point. Can you shed some light on this?

Comment on lines +91 to +109
~TeamManager() {
  // Free the team pools in device memory
  // Note that we do it in a best-effort manner because the team pool is
  // managed by a static TeamManager and the destruction order of static
  // objects is undetermined. If the destructor is called after the CUDA
  // context is destroyed, cudaFree would fail.
  try {
    // cudaFree generally implies a device synchronization, meaning it will
    // block until all preceding CUDA operations on the device have completed
    // before freeing the memory. Thus we don't need to worry about freeing
    // the memory before CUDA kernels complete.
    for (auto& [_, team_pool_dev] : team_pool_devptrs_) {
      c10::cuda::CUDACachingAllocator::raw_delete(team_pool_dev);
    }
  } catch (...) {
    // Ignore the error
    std::cerr << "Failed to free the team pool in device memory, skipping\n";
  }
}
kwen2501 (Collaborator, Author) commented Sep 7, 2025:
@ngimel I am doing the free in a best-effort manner due to the undetermined destruction order of static objects and the CUDA context. In my test runs, I never see the cerr message being printed, though, so I guess we are lucky enough.

Groups are a bit static today too, so I guess it would make little difference if we were to implement a callback from the group destructor.
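For comparison, a common way to sidestep the static-destruction-order problem entirely (hypothetical here; not what this PR does) is to heap-allocate the singleton and never destroy it:

```cpp
// Hypothetical alternative: intentionally leak the singleton. Its destructor
// never runs, so no CUDA calls are issued during static teardown; the OS
// reclaims the device memory at process exit.
struct TeamManager {  // stand-in for the real manager in this PR
  ~TeamManager() { /* would free the device pools here */ }
};

TeamManager& get_team_manager() {
  static TeamManager* manager = new TeamManager();  // leaked on purpose
  return *manager;
}
```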

Comment thread torch/csrc/distributed/c10d/symm_mem/nvshmem_team_manager.hpp Outdated
[ghstack-poisoned]
ngimel (Collaborator) left a comment:

I take it there will be tests for this somewhere down the stack?

kwen2501 (Collaborator, Author) commented Sep 9, 2025:

All the existing tests will exercise the host-side `get_team` API.
The `tile_reduce` feature on top will exercise the device-side `get_n_teams` API.

kwen2501 (Collaborator, Author) commented Sep 9, 2025:

@pytorchbot merge

pytorch-bot (Bot) added the ciflow/trunk label on Sep 9, 2025
pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot (Collaborator):

Starting merge as part of PR stack under #162394

pytorchmergebot pushed a commit that referenced this pull request Sep 9, 2025
NVSHMEM put/get APIs take a global PE rather than its team-local counterpart, so we need to do a translation within the kernel.

Also added a sub-group test for dispatch and combine, mimicking the Expert Parallel cases.

Pull Request resolved: #162394
Approved by: https://github.com/ngimel, https://github.com/fegin
ghstack dependencies: #162320
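As an aside, the translation mentioned in that commit maps a team-relative PE to a global PE. A host-side sketch of the same mapping using NVSHMEM's team routines (illustrative; the commit performs the translation inside the kernel):

```cpp
#include <nvshmem.h>
#include <vector>

// Illustrative: precompute a local->global PE table for `team`; the table can
// then be copied to device memory and indexed by the team-local peer in kernels.
std::vector<int> build_pe_table(nvshmem_team_t team) {
  int n = nvshmem_team_n_pes(team);
  std::vector<int> table(n);
  for (int pe = 0; pe < n; ++pe) {
    // Map team-relative PE `pe` to its rank in NVSHMEM_TEAM_WORLD.
    table[pe] = nvshmem_team_translate_pe(team, pe, NVSHMEM_TEAM_WORLD);
  }
  return table;
}
```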
github-actions (Bot) deleted the gh/kwen2501/232/head branch on October 10, 2025 02:09

Labels

ciflow/h100-symm-mem, ciflow/trunk, Merged, oncall: distributed, release notes: distributed (c10d)