Enables NCCL symmetric memory kernels through mempool registration #155134
syed-ahmed wants to merge 4 commits into gh/syed-ahmed/2/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/155134
Note: Links to docs will display an error until the docs builds have been completed. ⏳ No Failures, 1 Pending as of commit acd9b65 with merge base 1036f6d. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
Needs NCCL 2.27.
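A guard for this version requirement can be sketched with NCCL's version encoding. The helper names below (`nccl_version_code`, `supports_symm_mem`) are illustrative assumptions, not the PR's actual code; the encoding mirrors NCCL's `NCCL_VERSION(X,Y,Z)` macro for releases since 2.9, re-defined locally so the sketch is self-contained:

```cpp
#include <cassert>

// Mirrors NCCL's version encoding for releases >= 2.9:
// NCCL_VERSION(X,Y,Z) == X*10000 + Y*100 + Z.
// Re-defined locally so this sketch does not depend on nccl.h.
constexpr int nccl_version_code(int major, int minor, int patch) {
  return major * 10000 + minor * 100 + patch;
}

// Hypothetical guard: mempool symmetric-memory registration needs NCCL >= 2.27.
// In real code the runtime version would come from ncclGetVersion().
bool supports_symm_mem(int runtime_version) {
  constexpr int kRequired = nccl_version_code(2, 27, 0);
  return runtime_version >= kRequired;
}
```

In practice the check would compare the value reported by `ncclGetVersion` at runtime, so a build against older headers still fails gracefully rather than at link time.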
kwen2501 left a comment:
Thanks! LGTM in general. Left some comments inline.
c10/cuda/CUDACachingAllocator.h (Outdated)

```diff
  ~MemPool();
  MempoolId_t id();
+ bool symm_mem();
```

```diff
  // register future segments allocated in this pool (this call is idempotent).
  attachAllocatorHooks();
- auto snapshot = c10::cuda::CUDACachingAllocator::snapshot(pool->id());
+ auto snapshot = c10::cuda::CUDACachingAllocator::snapshot();
```
Why is pool->id() removed?

Reply: I think this is a typo from when I had to rebase. We should have the pool id.
```cpp
// TODO:
// if(pool->symm_mem()) {
```
Do you feel there is some granularity mismatch here?

- pool->symm_mem() is a pool-level attribute;
- in the registerSegment function above, it seems possible to create a symmetric window for some segments while not for others (the previous ncclCommRegister call).

Reply: Yeah, because symm_mem is a pool-level attribute, we should make sure all memory in it is symmetric. I don't have a good solution right now other than having these two all-gathers check and raise an error if the pool becomes non-symmetric. Left this as a TODO because I'm not sure how it would impact performance, or whether we should only enable these checks as a debugging feature.
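The cross-rank check described in the reply can be sketched on gathered per-rank state. The all-gather itself is simulated with a plain vector here, and `PoolState`/`check_pool_symmetric` are assumed names for illustration, not the PR's code:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical per-rank summary that an all-gather over the communicator
// would collect before relying on a symmetric-memory pool.
struct PoolState {
  size_t num_segments; // segments currently in the pool on this rank
  bool symm_mem;       // pool-level symm_mem attribute on this rank
};

// Raise if the pool is no longer symmetric across ranks: every rank must
// report the symm_mem flag set and the same segment count, otherwise some
// ranks would hold windows that others never created.
void check_pool_symmetric(const std::vector<PoolState>& gathered) {
  if (gathered.empty()) {
    return;
  }
  const PoolState& ref = gathered.front();
  for (const PoolState& s : gathered) {
    if (!s.symm_mem || s.num_segments != ref.num_segments) {
      throw std::runtime_error(
          "MemPool with symm_mem=true became non-symmetric across ranks");
    }
  }
}
```

As the reply notes, such a check costs collectives on every registration, which is why gating it behind a debugging flag is a plausible design.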
2.27 upgrade here: #155233
kwen2501 left a comment:
Approving to unblock.
NCCL 2.27 has landed -- do you mind rebasing this PR to make sure all tests pass? Then hopefully we can land this feature in PyTorch 2.8. Thanks!
torch/csrc/cuda/MemPool.cpp (Outdated)

```diff
  bool is_user_created,
- bool use_on_oom) {
+ bool use_on_oom,
+ bool symm_mem) {
```

```diff
  // register future segments allocated in this pool (this call is idempotent).
  attachAllocatorHooks();
- auto snapshot = c10::cuda::CUDACachingAllocator::snapshot(pool->id());
+ auto snapshot = c10::cuda::CUDACachingAllocator::snapshot();
```
@pytorchbot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict.

Successfully rebased.

@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).

Merge failed. Reason: 1 mandatory check(s) failed.

@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
`all_to_all_vdev` is not a binding of NVSHMEM APIs. Removing the `nvshmem_` prefix. Pull Request resolved: #156582. Approved by: https://github.com/fduwjj. ghstack dependencies: #155134.
As mentioned here: #155134 (comment). Pull Request resolved: #157293. Approved by: https://github.com/Skylion007.
Thank you for adding this support, @syed-ahmed. Why didn't the PR include user-facing documentation? What is the point of having a great feature if nobody knows about it? Thank you.
Stack from ghstack (oldest at bottom):
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k