[c10d][Sym mem] Make nccl backend full fledged with nccl 2.28.9-1 #168129
fduwjj wants to merge 26 commits into gh/fduwjj/238/base
Conversation
[ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/168129
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV
There is 1 currently active SEV. If your PR is affected, please view it below.
⏳ 17 Pending, 1 Unrelated Failure
As of commit db8eaef with merge base fc4f334:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
We can't use 2.28.7 due to perf regressions. We are in the process of updating CU13 to 2.28.9; I'd recommend updating CU12 to that value as well.
kwen2501 left a comment
Looks good. Can you please have a look at the comments? Thanks.
```cuda
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int peer = tid; peer < world_size; peer += stride) {
  table[peer] = ncclGetLsaPointer(handle, offset, peer);
```
Does this NCCL API have a host-side version? If it does, we don't need to fill the pointer array using a CUDA kernel.
No... at least I didn't find one. Now that you are at NV in the comm department, you can tell them to add one. lol
```cpp
ncclWindow_t buffer_handle,
ncclWindow_t signal_handle,
ncclDevComm devComm,
void* signal_pad_ptr)
```
sure, we can get rid of all signal_pads/buffers on the host side for now.
Oh, can you please put the NCCL upgrade into a separate PR? So that it is easier to search.
….28.9-1" (This PR will be rebased on #166174)

We did the following things:

1. Add exchange of the buffer ptr and signal pad ptr via the NCCL device API introduced in NCCL 2.28.
2. With 1, show that the symmem from the NCCL backend works with the existing one_shot_all_reduce kernel (add a UT for it).
3. Add a simple put, put with signal, wait for signal, and get, so that symmem's one-sided API works.
4. Show that symmem from the NCCL backend works with traditional c10d collectives as well in a UT.
5. Store the DevComm inside symmetric memory so that users can access it for customized kernels.

[ghstack-poisoned]
```cpp
  return base + byte_offset;
}

template <typename T>
```
You should dig into why that happens.
@pytorchbot revert -c weird

The inductor test fails when compiling nccl code.
❌ 🤖 pytorchbot command failed: Try
@pytorchbot revert -m "the auto-revert was right" -c weird

The inductor test fails when compiling nccl code.
@pytorchbot successfully started a revert job. Check the current status here.
….9-1 (#168129)"

This reverts commit 033659b.

Reverted #168129 on behalf of https://github.com/albanD due to "the auto-revert was right" (comment on #168129).
@fduwjj your PR has been successfully reverted.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Rebase failed. Raised by https://github.com/pytorch/pytorch/actions/runs/20148333937
```cpp
#if NCCL_VERSION_CODE >= NCCL_VERSION(2, 28, 0)
#define NCCL_HAS_SYMMEM_SUPPORT
#endif
```
This will entirely disable NCCL SymmMem for builds with NCCL 2.27? The torch 2.10 CU12 build still uses 2.27.
I really don't like this idea of multiple macros, but yes we can do that.
Suggest running the ciflow/binaries tests as well, since this PR added nccl_extension.cu into the build.
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Merge failed
Reason: HTTP Error 502: Bad Gateway
Details for Dev Infra team: Raised by workflow job
@pytorchbot merge -i
Merge started
Your change will be merged while ignoring the following 1 checks: trunk / linux-jammy-rocm-py3.10 / test (distributed, 1, 3, linux.rocm.gpu.gfx942.4)
Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
@pytorchbot merge -f "rocm build has been running > 4hrs, bypassing it"
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
@pytorchbot cherry-pick --onto release/2.10 -c fixnewfeature --fixes #167683
Cherry picking #168129
The cherry pick PR is at #170389 and it is linked with issue #167683. The following tracker issues are updated:
Details for Dev Infra team: Raised by workflow job
Stack from ghstack (oldest at bottom):

(This PR will be rebased on #166174.) (There is another PR which updates the NCCL version: #168091.)

We did the following things:

1. Add exchange of the buffer ptr and signal pad ptr via the NCCL device API introduced in NCCL 2.28.
2. With 1, show that the symmem from the NCCL backend works with the existing one_shot_all_reduce kernel (add a UT for it).
3. Add a simple put, put with signal, wait for signal, and get, so that symmem's one-sided API works.
4. Show that symmem from the NCCL backend works with traditional c10d collectives as well in a UT.
5. Store the DevComm inside symmetric memory so that users can access it for customized kernels.

Resolves #167682