Refactor: decouple segment tracking from comm registration#21392
Refactor: decouple segment tracking from comm registration#21392ShangmingCai merged 24 commits intosgl-project:mainfrom
Conversation
Signed-off-by: wangfakang <fakangwang@gmail.com>
… comms Signed-off-by: wangfakang <fakangwang@gmail.com>
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request refactors the symmetric memory allocator to decouple NCCL window registration from memory allocation. The previous approach led to issues when multiple communication groups shared a MemPool, as memory registered with one communicator could be incorrectly reused by another. The new design defers registration to the context exit time, allowing memory to be tracked at the C++ level and registered with the appropriate communicator in Python, which also significantly reduces CPU overhead compared to the prior Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request refactors the PyNCCL allocator to introduce a C++-based segment tracking mechanism, deferring memory registration to the Python layer within the SymmetricMemoryContext. A new benchmark script is added to evaluate the CPU overhead of this new tracking method. Review feedback indicates a memory leak in the _ptr_to_registered_comms dictionary, as stale entries are not removed, and suggests a performance improvement for the C++ untrack_segment function by considering a different data structure than std::vector for better deallocation time complexity.
There was a problem hiding this comment.
Code Review
This pull request introduces a new benchmark script to compare the CPU overhead of custom C++ segment tracking (using std::vector) against PyTorch's memory_snapshot(). The core changes in pynccl_allocator.py refactor how NCCL memory segments are tracked and registered. Segments are now tracked in a C++ std::vector and exposed to Python via ctypes. The registration with NCCL communicators is deferred to the Python SymmetricMemoryContext's __exit__ method, allowing for a single shared memory pool across groups and handling registration for both new and reused memory. Review comments point out a potential performance bottleneck in untrack_segment due to a linear scan, a thread-safety issue in the global _ptr_to_registered_comms dictionary, and several minor issues in the benchmark script including an incorrect type hint, an unused import, and code style improvements. An outdated comment in the C++ source also needs to be updated to reflect the use of std::vector instead of map.
Signed-off-by: wangfakang <fakangwang@gmail.com>
…cator.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
/rerun-failed-ci |
1 similar comment
|
/rerun-failed-ci |
nvcastet
left a comment
There was a problem hiding this comment.
Instead of copying list of segments from C++ to python, can you have a single C++ API register_segments_with_comm(comm_ptr)and have the registrations and book keeping inside C++.
Ideally you would have an unordered_map mapping a comm_ptr to next index idx in g_segments to register then you would register with:
next_idx = map[comm_ptr]; // next_idx will be 0 if map does not contain comm_prt
ncclComm_t comm = (ncclComm_t)(comm_ptr);
for (size_t i = next_idx; i < g_segments.size(); ++i) {
auto seg = g_segments[i];
ncclWindow_t win;
NCCLCHECK(ncclCommWindowRegister(comm, seg[0], seg[1], &win, NCCL_WIN_COLL_SYMMETRIC));
}
map[comm_ptr] = g_segments.size();
Signed-off-by: wangfakang <fakangwang@gmail.com>
@nvcastet Thank you for the suggestions. I have addressed all comments. |
Signed-off-by: wangfakang <fakangwang@gmail.com>
|
/tag-and-rerun-ci |
|
/rerun-failed-ci stage-b-test-2-gpu-large (2) install dependencies failed
|
|
/rerun-failed-ci Trigger waiting test task. |
|
Hello @nvcastet, the GPU-related test cases ( |
|
/rerun-stage stage-c-test-4-gpu-b200 |
|
✅ Triggered |
|
/rerun-stage stage-c-test-8-gpu-h200 |
|
✅ Triggered |
|
You will need a code owner to review first for the PR to be merged. |
hello @nvcastet , thanks for the reminder and your patient review. This PR has already been approved by the code owner, @yizhang2077. |
|
stage-c-test-4-gpu-b200 times out. I don't know if it is related to your changes. |
Hi @nvcastet , I checked the logs and found that the failing test case is test_fp8_blockwise_gemm.py. Since this test case doesn't enable symm, it won't execute the code modified in this PR. Therefore, the timeout issue is unrelated to the changes in this PR.
Meanwhile, the corresponding |
Frendly ping @nvcastet |
|
There is a conflict to solve but looks good to me. |
Signed-off-by: wangfakang <fakangwang@gmail.com>
hello @nvcastet, Conflicts resolved with no logic changes. PTAL when you have time. Thanks! |
|
LGTM. |
Thank you for the review, @nvcastet and @yizhang2077. Ping @ch-wan, @ShangmingCai, or @Fridge003 could you please take a look and help merge this when you have a moment? Thank you! |
Signed-off-by: wangfakang <fakangwang@gmail.com>
|
/rerun-stage stage-c-test-4-gpu-b200 |
|
✅ Triggered |
I checked the logs and found that the failing test case is test_qwen35_models.py . Since this test case doesn't enable symm, it won't execute the code modified in this PR. The error message is
This PR has been thoroughly validated for both performance and accuracy, as detailed in the previous comment: #21392 (comment). Based on these results, I believe it's ready to merge.
|





CC @nvcastet @yizhang2077 @merrymercy @ShangmingCai @Fridge003 @ch-wan PTAL, thx.
Motivation
When multiple communication groups share a single global MemPool, memory blocks released by one group's comm may be reused by another group's comm. However, symmetric memory requires buffers to be registered with a specific
ncclCommviancclCommWindowRegister. Reusing memory across groups causes the registration to be associated with the wrong communicator.So redesign symmetric memory allocator to defer NCCL window registration from allocation-time to context exit-time. This enables correct memory reuse across different communicators and eliminates the CPU overhead of
snapshot(). Thanks to @nvcastet for the help!Related PR: #19329 (comment) and #20153
Modifications
Key changes:
Allocation-time tracking: C++ layer now tracks memory segments (ptr, size) during their lifetime without registering to any comm.
Deferred registration: Registration with the NCCL communicator happens at SymmetricMemoryContext.exit() using pynccl API, enabling proper handling of both newly allocated and reused memory.
Shared MemPool: All groups share a single MemPool to reduce memory fragmentation, with proper per-comm registration tracking.
Benchmarking and Profiling
In benchmark testing, it was found that the cpu overhead of
_get_tracked_segments()is about 25 times lower than thesnapshot()function (5.351μs vs 134.320μs).Checklist
Review Process
/tag-run-ci-label,/rerun-failed-ci,/tag-and-rerun-ci