
[c10d][Sym mem] Make nccl backend full fledged with nccl 2.28.9-1 #168129

Closed
fduwjj wants to merge 26 commits into gh/fduwjj/238/base from gh/fduwjj/238/head

Conversation

@fduwjj
Contributor

@fduwjj fduwjj commented Nov 19, 2025

Stack from ghstack (oldest at bottom):

(This PR will be rebased on #166174.) (There is another PR that updates the NCCL version: #168091.)

This PR does the following:

  1. Add exchange of buffer pointers and signal pad pointers via the NCCL device API introduced in NCCL 2.28.
  2. Using item 1, show that symmetric memory from the NCCL backend works with the existing one_shot_all_reduce kernel (a UT is added for it).
  3. Add simple put, put-with-signal, wait-for-signal, and get operations, so that symmetric memory's one-sided API works.
  4. Show in a UT that symmetric memory from the NCCL backend also works with traditional c10d collectives.
  5. Store the DevComm inside symmetric memory so that users can access it for customized kernels.

Resolves #167682
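The one-sided primitives in item 3 (put, put-with-signal, wait-for-signal) follow a common signaling pattern. Below is a minimal host-side Python simulation of that protocol; it is only an illustration of the semantics, not the NCCL device API, and every name in it (`put_with_signal`, `wait_signal`, the pad layout) is made up for this sketch.

```python
import threading

# Hypothetical sketch of put-with-signal / wait-for-signal semantics.
# Two "ranks" share per-rank symmetric buffers and signal pads; the real
# implementation uses NCCL 2.28 device-side APIs, not threads.
world_size = 2
buffers = [[0] * 4 for _ in range(world_size)]                # symmetric buffers
signal_pads = [threading.Event() for _ in range(world_size)]  # signal pads

def put_with_signal(dst_rank, data):
    # One-sided write into the peer's buffer, then raise its signal.
    buffers[dst_rank][:len(data)] = data
    signal_pads[dst_rank].set()

def wait_signal(rank):
    # Block until a peer signals this rank, then clear the signal.
    signal_pads[rank].wait()
    signal_pads[rank].clear()

def rank1_body():
    wait_signal(1)  # returns only after rank 0's put has landed
    assert buffers[1][:3] == [7, 8, 9]

t = threading.Thread(target=rank1_body)
t.start()
put_with_signal(dst_rank=1, data=[7, 8, 9])
t.join()
print(buffers[1][:3])  # [7, 8, 9]
```

The point of the pattern is that the consumer never polls the data itself; it waits on the signal pad, which the producer sets only after the payload write completes.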

@fduwjj fduwjj requested a review from jeffdaily as a code owner November 19, 2025 00:17
@pytorch-bot

pytorch-bot bot commented Nov 19, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/168129

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

⏳ 17 Pending, 1 Unrelated Failure

As of commit db8eaef with merge base fc4f334:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

fduwjj added a commit that referenced this pull request Nov 19, 2025
ghstack-source-id: 8c46a5c
Pull Request resolved: #168129
@pytorch-bot pytorch-bot bot added the release notes: releng release notes category label Nov 19, 2025
@Skylion007
Collaborator

We can't use 2.28.7 due to perf regressions. We are in the process of updating CU13 to 2.28.9; I'd recommend updating CU12 to that value as well.

fduwjj added a commit that referenced this pull request Nov 25, 2025
ghstack-source-id: aa4665e
Pull Request resolved: #168129
fduwjj added a commit that referenced this pull request Nov 25, 2025
ghstack-source-id: b75a599
Pull Request resolved: #168129
@fduwjj fduwjj changed the title [Symm mem] Make nccl backend full fleged [c10d][Sym mem] Make nccl backend full fledged with nccl 2.28.9-1 Nov 25, 2025
Collaborator

@kwen2501 kwen2501 left a comment


looks good.
Can you please have a look at the comments? Thanks.

int tid = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int peer = tid; peer < world_size; peer += stride) {
  table[peer] = ncclGetLsaPointer(handle, offset, peer);
}
Collaborator

Does this NCCL API have a host-side version? If it does, then we don't need to fill the pointer array using a CUDA kernel.

Contributor Author

@fduwjj fduwjj Nov 26, 2025


No... at least I didn't find one. Now that you're at NV in the comm department, you can ask them to add one. lol
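For what it's worth, the grid-stride loop in the snippet above is easy to sanity-check on the host. A small Python model (the blockDim/gridDim values here are arbitrary, and the string stands in for ncclGetLsaPointer's return value) shows that every peer slot is filled exactly once even when world_size is not a multiple of the thread count:

```python
# Host-side model of the grid-stride loop that fills the per-peer
# pointer table. block_dim/grid_dim are illustrative values only.
block_dim = 4     # threads per block
grid_dim = 2      # blocks in the grid
world_size = 11   # deliberately not a multiple of block_dim * grid_dim

stride = block_dim * grid_dim
table = {}

for block in range(grid_dim):
    for thread in range(block_dim):
        tid = block * block_dim + thread
        # Each "thread" covers peers tid, tid + stride, tid + 2*stride, ...
        for peer in range(tid, world_size, stride):
            assert peer not in table          # no slot written twice
            table[peer] = f"lsa_ptr({peer})"  # stand-in for ncclGetLsaPointer

assert sorted(table) == list(range(world_size))  # every slot written once
print(len(table))  # 11
```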

ncclWindow_t buffer_handle,
ncclWindow_t signal_handle,
ncclDevComm devComm,
void* signal_pad_ptr)
Collaborator

Is this argument used?

Contributor Author

sure, we can get rid of all signal_pads/buffers on the host side for now.

@kwen2501
Collaborator

Oh, can you please put the NCCL upgrade into a separate PR, so that it is easier to search?

fduwjj added a commit that referenced this pull request Nov 27, 2025
ghstack-source-id: e43579a
Pull Request resolved: #168129
fduwjj added a commit that referenced this pull request Nov 27, 2025
ghstack-source-id: d91467a
Pull Request resolved: #168129
return base + byte_offset;
}

template <typename T>
Collaborator

You should dig into why that happens.

@albanD
Collaborator

albanD commented Dec 11, 2025

@pytorchbot revert -c weird

The inductor test fails when compiling NCCL code.
It's super weird that this is a flaky compilation failure, though.

@pytorch-bot

pytorch-bot bot commented Dec 11, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: the following arguments are required: -m/--message

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst,autorevert}

Try @pytorchbot --help for more info.

@albanD
Collaborator

albanD commented Dec 11, 2025

@pytorchbot revert -m "the auto-revert was right" -c weird

The inductor test fails when compiling NCCL code.
It's super weird that this is a flaky compilation failure, though.

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Dec 11, 2025
….9-1 (#168129)"

This reverts commit 033659b.

Reverted #168129 on behalf of https://github.com/albanD due to "the auto-revert was right"
@pytorchmergebot
Collaborator

@fduwjj your PR has been successfully reverted.

@fduwjj
Contributor Author

fduwjj commented Dec 11, 2025

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to

Aborting rebase because rebasing the branch resulted in the same sha as the target branch.
This usually happens because the PR has already been merged.  Please rebase locally and push.

Raised by https://github.com/pytorch/pytorch/actions/runs/20148333937

Comment on lines +9 to +11
#if NCCL_VERSION_CODE >= NCCL_VERSION(2, 28, 0)
#define NCCL_HAS_SYMMEM_SUPPORT
#endif
Collaborator

Will this entirely disable NCCL SymmMem for builds with NCCL 2.27? The torch 2.10 CU12 build still uses 2.27.

Contributor Author

I really don't like this idea of multiple macros, but yes we can do that.
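As a sanity check on the gating discussed above: NCCL encodes versions with its NCCL_VERSION(X, Y, Z) macro, which (to the best of my knowledge; treat the exact encoding below as an assumption, not a quote of nccl.h) uses a 4-digit code through 2.8 and a 5-digit code afterwards. A 2.27 build therefore falls below the 2.28 threshold, so NCCL_HAS_SYMMEM_SUPPORT would not be defined, which is exactly the concern about the CU12 build:

```python
# Python model of the NCCL_VERSION(X, Y, Z) encoding (assumed, not quoted
# from nccl.h): 4-digit scheme through 2.8, 5-digit scheme afterwards.
def nccl_version(x, y, z):
    if x <= 2 and y <= 8:
        return x * 1000 + y * 100 + z
    return x * 10000 + y * 100 + z

SYMMEM_MIN = nccl_version(2, 28, 0)  # threshold used by the #if above

assert nccl_version(2, 27, 0) < SYMMEM_MIN   # 2.27 builds: symmem gated off
assert nccl_version(2, 28, 9) >= SYMMEM_MIN  # 2.28.9 builds: symmem enabled
print(SYMMEM_MIN)  # 22800
```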

@tinglvv
Collaborator

tinglvv commented Dec 12, 2025

Suggesting to run the ciflow/binaries tests as well, since this PR adds nccl_extension.cu to the build.

@fduwjj
Contributor Author

fduwjj commented Dec 13, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: HTTP Error 502: Bad Gateway

Details for Dev Infra team: raised by workflow job.

@kwen2501
Collaborator

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 1 checks: trunk / linux-jammy-rocm-py3.10 / test (distributed, 1, 3, linux.rocm.gpu.gfx942.4)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@kwen2501
Collaborator

@pytorchbot merge -f "rocm build has been running > 4hrs, bypassing it"

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@fduwjj
Contributor Author

fduwjj commented Dec 13, 2025

@pytorchbot cherry-pick --onto release/2.10 -c fixnewfeature --fixes #167683

@pytorchbot
Collaborator

Cherry picking #168129

The cherry pick PR is at #170389 and it is linked with issue #167683. The following tracker issues are updated:

Details for Dev Infra team: raised by workflow job.


Labels

- autorevert: disable (Disable autorevert for a specific PR)
- ci-no-td (Do not run TD on this PR)
- ciflow/binaries (Trigger all binary build and upload jobs on the PR)
- ciflow/h100-symm-mem
- ciflow/inductor
- ciflow/trunk (Trigger trunk jobs on your pull request)
- Merged
- no-runner-experiments (Bypass Meta/LF runner determinator)
- release notes: releng (release notes category)
- Reverted
