Build failure in NCCLSymmetricMemory.cu: undefined identifier offset_ #172348

@xwang233

Build Error

The CI build fails with:

/opt/pytorch/pytorch/torch/csrc/distributed/c10d/symm_mem/NCCLSymmetricMemory.cu(143): error: identifier "offset_" is undefined
    ncclGetLsaMultimemDevicePointer(buffer_win_, offset_, &mc_addr_)

Root Cause

PR #172185 introduced this error at NCCLSymmetricMemory.cu:144 (the compiler output above reports line 143). The code uses offset_ inside the NCCLPeerAllocInfo constructor, but offset_ is a member of NCCLSymmetricMemory, not NCCLPeerAllocInfo, so the identifier is not in scope there.

Proposed Fix

Change line 144 from:

ncclGetLsaMultimemDevicePointer(buffer_win_, offset_, &mc_addr_),

to:

ncclGetLsaMultimemDevicePointer(buffer_win_, 0, &mc_addr_),

This matches the pattern in CUDASymmetricMemory and NVSHMEMSymmetricMemory: the multicast pointer is obtained once for the base allocation (offset 0), and the per-instance offset is applied later in get_multicast_ptr() at line 213, which is already implemented correctly.
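The split between the base pointer and the per-instance offset can be sketched as below. This is a minimal illustration, not the actual PyTorch code: PeerAllocInfoSketch, SymmetricMemorySketch, and fake_get_multimem_pointer are hypothetical stand-ins, the NCCL call is replaced by a stub that just returns base + offset, and the null check in get_multicast_ptr() is an assumption.

```cpp
#include <cstddef>

// Hypothetical stub standing in for ncclGetLsaMultimemDevicePointer:
// it writes the address `offset` bytes into the window to *out.
inline int fake_get_multimem_pointer(char* window_base, std::size_t offset,
                                     void** out) {
  *out = window_base + offset;
  return 0;  // 0 plays the role of a success code in this sketch
}

// Per-allocation state. It has no offset_ member, so its constructor can
// only ask for the base of the allocation, hence the literal 0.
struct PeerAllocInfoSketch {
  void* mc_addr_ = nullptr;

  explicit PeerAllocInfoSketch(char* window_base) {
    fake_get_multimem_pointer(window_base, 0, &mc_addr_);
  }
};

// Per-instance view onto the allocation. This is where offset_ lives,
// and where it is applied to the base multicast pointer.
struct SymmetricMemorySketch {
  const PeerAllocInfoSketch& alloc_;
  std::size_t offset_;

  void* get_multicast_ptr() const {
    if (alloc_.mc_addr_ == nullptr) {
      return nullptr;  // assumed behavior when multicast is unavailable
    }
    return static_cast<char*>(alloc_.mc_addr_) + offset_;
  }
};
```

With a 256-byte buffer as the window and a view constructed with offset_ = 64, get_multicast_ptr() returns the buffer address plus 64, while mc_addr_ in the allocation info stays at the buffer base. Had the constructor applied the offset as well, the offset would have been counted twice.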

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @malfet @seemethere @kwen2501 @Skylion007 @dzmitry-huba

Labels

module: build (Build system issues)
module: nccl (Problems related to nccl support)
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
