Build Error
CI build failing with:
/opt/pytorch/pytorch/torch/csrc/distributed/c10d/symm_mem/NCCLSymmetricMemory.cu(143): error: identifier "offset_" is undefined
ncclGetLsaMultimemDevicePointer(buffer_win_, offset_, &mc_addr_)
Root Cause
PR #172185 introduced this error at NCCLSymmetricMemory.cu:144. The code attempts to use offset_ in the NCCLPeerAllocInfo constructor, but offset_ is a member of NCCLSymmetricMemory, not NCCLPeerAllocInfo.
Proposed Fix
Change line 144 from:
ncclGetLsaMultimemDevicePointer(buffer_win_, offset_, &mc_addr_),
to:
ncclGetLsaMultimemDevicePointer(buffer_win_, 0, &mc_addr_),
This matches the pattern in CUDASymmetricMemory and NVSHMEMSymmetricMemory, where the multicast pointer is obtained for the base allocation (offset 0), and the specific offset is applied later in get_multicast_ptr() at line 213 (which is already correctly implemented).
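To illustrate the split of responsibilities described above, here is a minimal, self-contained sketch. It is not the PyTorch code: PeerAllocInfoSketch, SymmetricMemorySketch, and fake_get_multimem_pointer are hypothetical stand-ins, and the stand-in function does not reproduce the real ncclGetLsaMultimemDevicePointer signature. It only shows the idea that the per-allocation object resolves the multicast pointer at offset 0, while the view that owns offset_ applies it on lookup.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the NCCL lookup: writes (base + offset) to *out.
// The real ncclGetLsaMultimemDevicePointer takes a window handle, not a raw pointer.
inline void fake_get_multimem_pointer(void* window_base, std::size_t offset, void** out) {
  assert(out != nullptr);
  *out = static_cast<char*>(window_base) + offset;
}

// Simplified model of the per-allocation info object.
// Like NCCLPeerAllocInfo in the report, it has no offset_ member.
struct PeerAllocInfoSketch {
  explicit PeerAllocInfoSketch(void* window_base) {
    // Resolve the multicast pointer for the base of the allocation (offset 0).
    fake_get_multimem_pointer(window_base, 0, &mc_addr_);
  }
  void* mc_addr_ = nullptr;
};

// Simplified model of the symmetric-memory view, which does own offset_.
class SymmetricMemorySketch {
 public:
  SymmetricMemorySketch(const PeerAllocInfoSketch* info, std::size_t offset)
      : info_(info), offset_(offset) {}

  // Apply the view's offset to the base multicast pointer at lookup time.
  void* get_multicast_ptr() const {
    if (info_->mc_addr_ == nullptr) {
      return nullptr;
    }
    return static_cast<char*>(info_->mc_addr_) + offset_;
  }

 private:
  const PeerAllocInfoSketch* info_;
  std::size_t offset_;
};
```

In this sketch, two views created over the same PeerAllocInfoSketch with different offsets share one base pointer and differ only in what get_multicast_ptr() returns, which is why the constructor does not need access to offset_.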
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @malfet @seemethere @kwen2501 @Skylion007 @dzmitry-huba