
[c10d][Sym mem] Make nccl backend full fledged with nccl 2.28.9-1 #168129

Closed
fduwjj wants to merge 26 commits into gh/fduwjj/238/base from gh/fduwjj/238/head

Conversation

@fduwjj
Contributor

@fduwjj fduwjj commented Nov 19, 2025

Stack from ghstack (oldest at bottom):

(This PR will be rebased on #166174.) (There is another PR that updates the NCCL version: #168091.)

This PR does the following:

  1. Add exchange of buffer pointers and signal pad pointers via the NCCL device API introduced in NCCL 2.28.
  2. Using item 1, show that symmetric memory from the NCCL backend works with the existing one_shot_all_reduce kernel (a UT is added for it).
  3. Add simple put, put-with-signal, wait-for-signal, and get operations, so that symmetric memory's one-sided API works.
  4. Show in a UT that symmetric memory from the NCCL backend also works with traditional c10d collectives.
  5. Store the DevComm inside symmetric memory so that users can access it for customized kernels.

Resolves #167682
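The one-sided primitives in item 3 (put, put-with-signal, wait-for-signal) follow a common signaling pattern. Below is a minimal host-side Python simulation of that protocol; it is only an illustration of the semantics, not the NCCL device API, and every name in it (`put_with_signal`, `wait_signal`, the pad layout) is made up for this sketch.

```python
import threading

# Hypothetical sketch of put-with-signal / wait-for-signal semantics.
# Two "ranks" share per-rank symmetric buffers and signal pads; the real
# implementation uses NCCL 2.28 device-side APIs, not threads.
world_size = 2
buffers = [[0] * 4 for _ in range(world_size)]                # symmetric buffers
signal_pads = [threading.Event() for _ in range(world_size)]  # signal pads

def put_with_signal(dst_rank, data):
    # One-sided write into the peer's buffer, then raise its signal.
    buffers[dst_rank][:len(data)] = data
    signal_pads[dst_rank].set()

def wait_signal(rank):
    # Block until a peer signals this rank, then clear the signal.
    signal_pads[rank].wait()
    signal_pads[rank].clear()

def rank1_body():
    wait_signal(1)  # returns only after rank 0's put has landed
    assert buffers[1][:3] == [7, 8, 9]

t = threading.Thread(target=rank1_body)
t.start()
put_with_signal(dst_rank=1, data=[7, 8, 9])
t.join()
print(buffers[1][:3])  # [7, 8, 9]
```

The point of the pattern is that the consumer never polls the data itself; it waits on the signal pad, which the producer sets only after the payload write completes.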

@fduwjj fduwjj requested a review from jeffdaily as a code owner November 19, 2025 00:17
@pytorch-bot

pytorch-bot bot commented Nov 19, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/168129

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

⏳ 17 Pending, 1 Unrelated Failure

As of commit db8eaef with merge base fc4f334:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

fduwjj added a commit that referenced this pull request Nov 19, 2025
ghstack-source-id: 8c46a5c
Pull Request resolved: #168129
@pytorch-bot pytorch-bot bot added the release notes: releng release notes category label Nov 19, 2025
@Skylion007
Collaborator

We can't use 2.28.7 due to perf regressions. We are in the process of updating CU13 to 2.28.9; I'd recommend updating CU12 to that value as well.

fduwjj added a commit that referenced this pull request Nov 25, 2025
ghstack-source-id: aa4665e
Pull Request resolved: #168129
fduwjj added a commit that referenced this pull request Nov 25, 2025
ghstack-source-id: b75a599
Pull Request resolved: #168129
@fduwjj fduwjj changed the title [Symm mem] Make nccl backend full fleged [c10d][Sym mem] Make nccl backend full fledged with nccl 2.28.9-1 Nov 25, 2025
Collaborator

@kwen2501 kwen2501 left a comment


looks good.
Can you please have a look at the comments? Thanks.

int tid = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int peer = tid; peer < world_size; peer += stride) {
  table[peer] = ncclGetLsaPointer(handle, offset, peer);
}
Collaborator

Does this NCCL API have a host-side version? If it does, then we don't need to fill the pointer array using a CUDA kernel.

Contributor Author

@fduwjj fduwjj Nov 26, 2025


No... at least I didn't find one. Now that you're at NV in the comm department, you can ask them to add one. lol
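For what it's worth, the grid-stride loop in the snippet above is easy to sanity-check on the host. A small Python model (the blockDim/gridDim values here are arbitrary, and the string stands in for ncclGetLsaPointer's return value) shows that every peer slot is filled exactly once even when world_size is not a multiple of the thread count:

```python
# Host-side model of the grid-stride loop that fills the per-peer
# pointer table. block_dim/grid_dim are illustrative values only.
block_dim = 4     # threads per block
grid_dim = 2      # blocks in the grid
world_size = 11   # deliberately not a multiple of block_dim * grid_dim

stride = block_dim * grid_dim
table = {}

for block in range(grid_dim):
    for thread in range(block_dim):
        tid = block * block_dim + thread
        # Each "thread" covers peers tid, tid + stride, tid + 2*stride, ...
        for peer in range(tid, world_size, stride):
            assert peer not in table          # no slot written twice
            table[peer] = f"lsa_ptr({peer})"  # stand-in for ncclGetLsaPointer

assert sorted(table) == list(range(world_size))  # every slot written once
print(len(table))  # 11
```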

ncclWindow_t buffer_handle,
ncclWindow_t signal_handle,
ncclDevComm devComm,
void* signal_pad_ptr)
Collaborator

Is this argument used?

Contributor Author

sure, we can get rid of all signal_pads/buffers on the host side for now.

@kwen2501
Collaborator

Oh, can you please put the NCCL upgrade into a separate PR, so that it is easier to search?

fduwjj added a commit that referenced this pull request Nov 27, 2025
ghstack-source-id: e43579a
Pull Request resolved: #168129
fduwjj added a commit that referenced this pull request Nov 27, 2025
ghstack-source-id: d91467a
Pull Request resolved: #168129
return base + byte_offset;
}

template <typename T>
Collaborator

You should dig into why that happens.

@albanD
Collaborator

albanD commented Dec 11, 2025

@pytorchbot revert -c weird

The inductor test fails when compiling NCCL code.
It's super weird that this is a flaky compilation failure, though.

@pytorch-bot

pytorch-bot bot commented Dec 11, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: the following arguments are required: -m/--message

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst,autorevert}

Try @pytorchbot --help for more info.

@albanD
Collaborator

albanD commented Dec 11, 2025

@pytorchbot revert -m "the auto-revert was right" -c weird

The inductor test fails when compiling NCCL code.
It's super weird that this is a flaky compilation failure, though.

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Dec 11, 2025
….9-1 (#168129)"

This reverts commit 033659b.

Reverted #168129 on behalf of https://github.com/albanD due to "the auto-revert was right"
@pytorchmergebot
Collaborator

@fduwjj your PR has been successfully reverted.

@fduwjj
Contributor Author

fduwjj commented Dec 11, 2025

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to

Aborting rebase because rebasing the branch resulted in the same sha as the target branch.
This usually happens because the PR has already been merged.  Please rebase locally and push.

Raised by https://github.com/pytorch/pytorch/actions/runs/20148333937

Comment on lines +9 to +11
#if NCCL_VERSION_CODE >= NCCL_VERSION(2, 28, 0)
#define NCCL_HAS_SYMMEM_SUPPORT
#endif
Collaborator

Will this entirely disable NCCL SymmMem for builds with NCCL 2.27? The torch 2.10 CU12 build still uses 2.27.

Contributor Author

I really don't like this idea of multiple macros, but yes we can do that.
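As a sanity check on the gating discussed above: NCCL encodes versions with its NCCL_VERSION(X, Y, Z) macro, which (to the best of my knowledge; treat the exact encoding below as an assumption, not a quote of nccl.h) uses a 4-digit code through 2.8 and a 5-digit code afterwards. A 2.27 build therefore falls below the 2.28 threshold, so NCCL_HAS_SYMMEM_SUPPORT would not be defined, which is exactly the concern about the CU12 build:

```python
# Python model of the NCCL_VERSION(X, Y, Z) encoding (assumed, not quoted
# from nccl.h): 4-digit scheme through 2.8, 5-digit scheme afterwards.
def nccl_version(x, y, z):
    if x <= 2 and y <= 8:
        return x * 1000 + y * 100 + z
    return x * 10000 + y * 100 + z

SYMMEM_MIN = nccl_version(2, 28, 0)  # threshold used by the #if above

assert nccl_version(2, 27, 0) < SYMMEM_MIN   # 2.27 builds: symmem gated off
assert nccl_version(2, 28, 9) >= SYMMEM_MIN  # 2.28.9 builds: symmem enabled
print(SYMMEM_MIN)  # 22800
```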

@tinglvv
Collaborator

tinglvv commented Dec 12, 2025

Suggesting to run the ciflow/binaries tests as well, since this PR adds nccl_extension.cu to the build.

@fduwjj
Contributor Author

fduwjj commented Dec 13, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: HTTP Error 502: Bad Gateway

Details for Dev Infra team: raised by workflow job.

@kwen2501
Collaborator

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 1 checks: trunk / linux-jammy-rocm-py3.10 / test (distributed, 1, 3, linux.rocm.gpu.gfx942.4)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@kwen2501
Collaborator

@pytorchbot merge -f "rocm build has been running > 4hrs, bypassing it"

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@fduwjj
Contributor Author

fduwjj commented Dec 13, 2025

@pytorchbot cherry-pick --onto release/2.10 -c fixnewfeature --fixes #167683

@pytorchbot
Collaborator

Cherry picking #168129

The cherry pick PR is at #170389 and it is linked with issue #167683. The following tracker issues are updated:

Details for Dev Infra team: raised by workflow job.


Labels

- autorevert: disable (Disable autorevert for a specific PR)
- ci-no-td (Do not run TD on this PR)
- ciflow/binaries (Trigger all binary build and upload jobs on the PR)
- ciflow/h100-symm-mem
- ciflow/inductor
- ciflow/trunk (Trigger trunk jobs on your pull request)
- Merged
- no-runner-experiments (Bypass Meta/LF runner determinator)
- release notes: releng (release notes category)
- Reverted
