
Refactor: decouple segment tracking from comm registration #21392

Merged
ShangmingCai merged 24 commits into sgl-project:main from wangfakang:symm_spool_track
May 6, 2026

Conversation

@wangfakang
Contributor

@wangfakang wangfakang commented Mar 25, 2026

CC @nvcastet @yizhang2077 @merrymercy @ShangmingCai @Fridge003 @ch-wan PTAL, thx.

Motivation

When multiple communication groups share a single global MemPool, memory blocks released by one group's comm may be reused by another group's comm. However, symmetric memory requires buffers to be registered with a specific ncclComm via ncclCommWindowRegister. Reusing memory across groups causes the registration to be associated with the wrong communicator.

This PR redesigns the symmetric memory allocator to defer NCCL window registration from allocation time to context exit time. This enables correct memory reuse across different communicators and eliminates the CPU overhead of snapshot(). Thanks to @nvcastet for the help!

Related PR: #19329 (comment) and #20153

Modifications

Key changes:

  1. Allocation-time tracking: C++ layer now tracks memory segments (ptr, size) during their lifetime without registering to any comm.

  2. Deferred registration: Registration with the NCCL communicator happens at SymmetricMemoryContext.__exit__() using the pynccl API, enabling proper handling of both newly allocated and reused memory.

  3. Shared MemPool: All groups share a single MemPool to reduce memory fragmentation, with proper per-comm registration tracking.
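The two-phase design above can be sketched with a small, self-contained Python model. This is a hypothetical simplification, not the actual sglang code: the names track_segment, SymmetricMemoryContextSketch, _ptr_to_registered_comms, and register_fn are illustrative, with register_fn standing in for the real ncclCommWindowRegister call.

```python
# Sketch: allocation-time tracking (phase 1) + deferred per-comm
# registration at context exit (phase 2). All names are illustrative.

_tracked_segments = []         # (ptr, size) pairs, not tied to any comm
_ptr_to_registered_comms = {}  # ptr -> set of comms it is registered with


def track_segment(ptr: int, size: int) -> None:
    """Allocation time: record the segment; register nothing yet."""
    _tracked_segments.append((ptr, size))


class SymmetricMemoryContextSketch:
    """On __exit__, register every tracked segment this comm has not
    seen yet, which covers both freshly allocated and reused memory."""

    def __init__(self, comm, register_fn):
        self.comm = comm
        self.register_fn = register_fn  # stand-in for ncclCommWindowRegister

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        for ptr, size in _tracked_segments:
            comms = _ptr_to_registered_comms.setdefault(ptr, set())
            if self.comm not in comms:
                self.register_fn(self.comm, ptr, size)
                comms.add(self.comm)
        return False
```

Because registration is keyed per communicator, a block released by one group and reused by another gets registered again with the new comm, which is exactly the cross-group correctness issue the PR fixes.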

Benchmarking and Profiling

In benchmark testing, the CPU overhead of _get_tracked_segments() was measured to be about 25 times lower than that of snapshot() (5.351 µs vs 134.320 µs).

#NCCL_DEBUG=WARN python benchmark/bench_pynccl_allocator/bench_segment_tracking.py --num-segments 50 --num-iters 1000
================================================================================
Benchmark: Segment Tracking CPU Overhead
================================================================================
Segment size: 1.00 MB
Iterations per measurement: 1000

Segments     _get_tracked_segments (µs)     snapshot (µs)        Speedup   
--------------------------------------------------------------------------------
25           5.351                          134.320              25.10     x
--------------------------------------------------------------------------------
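The measurement loop can be illustrated with a generic, self-contained timing harness in the same spirit as bench_segment_tracking.py. This is a sketch: bench, tracked_segments, and snapshot_like are hypothetical stand-ins (no GPU or torch required), with a flat list copy modeling reading (ptr, size) pairs out of the C++ tracker and a nested-dict build modeling parsing a full allocator snapshot.

```python
import time


def bench(fn, iters=1000):
    """Return mean wall-clock time per call in microseconds."""
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1e6


# Stand-in workloads for the two approaches being compared.
segments = [(i * 4096, 4096) for i in range(25)]


def tracked_segments():
    # Light: copy out the tracked (ptr, size) pairs.
    return list(segments)


def snapshot_like():
    # Heavy: build a nested per-segment structure, as a full
    # allocator snapshot would.
    return [{"address": p, "total_size": s, "blocks": [{"size": s}]}
            for p, s in segments]


t_tracked = bench(tracked_segments)
t_snapshot = bench(snapshot_like)
print(f"tracked: {t_tracked:.3f} us  snapshot-like: {t_snapshot:.3f} us")
```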

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

Signed-off-by: wangfakang <fakangwang@gmail.com>
… comms

Signed-off-by: wangfakang <fakangwang@gmail.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the symmetric memory allocator to decouple NCCL window registration from memory allocation. The previous approach led to issues when multiple communication groups shared a MemPool, as memory registered with one communicator could be incorrectly reused by another. The new design defers registration to the context exit time, allowing memory to be tracked at the C++ level and registered with the appropriate communicator in Python, which also significantly reduces CPU overhead compared to the prior snapshot() method.

Highlights

  • Allocation-time tracking: The C++ layer now tracks memory segments (pointer, size) throughout their lifetime without performing immediate NCCL registration.
  • Deferred registration: Registration with the NCCL communicator is now handled at the SymmetricMemoryContext.__exit__() using the pynccl API, ensuring correct handling for both newly allocated and reused memory.
  • Shared MemPool: All communication groups now share a single MemPool to minimize memory fragmentation, with proper per-communicator registration tracking managed by the Python layer.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors the PyNCCL allocator to introduce a C++-based segment tracking mechanism, deferring memory registration to the Python layer within the SymmetricMemoryContext. A new benchmark script is added to evaluate the CPU overhead of this new tracking method. Review feedback indicates a memory leak in the _ptr_to_registered_comms dictionary, as stale entries are not removed, and suggests a performance improvement for the C++ untrack_segment function by considering a different data structure than std::vector for better deallocation time complexity.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a new benchmark script to compare the CPU overhead of custom C++ segment tracking (using std::vector) against PyTorch's memory_snapshot(). The core changes in pynccl_allocator.py refactor how NCCL memory segments are tracked and registered. Segments are now tracked in a C++ std::vector and exposed to Python via ctypes. The registration with NCCL communicators is deferred to the Python SymmetricMemoryContext's __exit__ method, allowing for a single shared memory pool across groups and handling registration for both new and reused memory. Review comments point out a potential performance bottleneck in untrack_segment due to a linear scan, a thread-safety issue in the global _ptr_to_registered_comms dictionary, and several minor issues in the benchmark script including an incorrect type hint, an unused import, and code style improvements. An outdated comment in the C++ source also needs to be updated to reflect the use of std::vector instead of map.
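The untrack_segment bottleneck the review mentions can be seen with a tiny sketch: keying segments by base pointer (a dict in Python, std::unordered_map in C++) makes removal O(1) on average instead of a linear scan over a vector. This illustrates the review point only; the names here are hypothetical, not the code in the PR.

```python
# Dict keyed by base pointer: O(1) average-case untrack, versus an
# O(n) linear scan when segments live in a plain vector/list.
tracked = {}  # base ptr -> size


def track_segment(ptr: int, size: int) -> None:
    tracked[ptr] = size


def untrack_segment(ptr: int):
    # pop returns the size if present, None otherwise.
    return tracked.pop(ptr, None)
```

The trade-off is that a map loses the cheap "register everything after index N" iteration a vector gives, which is why the final design keeps an append-only segment list and tracks a per-comm cursor separately.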

wangfakang and others added 4 commits March 25, 2026 17:24
Signed-off-by: wangfakang <fakangwang@gmail.com>
…cator.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: wangfakang <fakangwang@gmail.com>
@wangfakang
Contributor Author

/rerun-failed-ci

1 similar comment
@wangfakang
Contributor Author

/rerun-failed-ci

Collaborator

@nvcastet nvcastet left a comment


Instead of copying the list of segments from C++ to Python, can you have a single C++ API register_segments_with_comm(comm_ptr) and keep the registrations and bookkeeping inside C++?
Ideally you would have an unordered_map mapping a comm_ptr to the next index in g_segments to register; then you would register with:

size_t next_idx = map[comm_ptr]; // next_idx will be 0 if map does not contain comm_ptr
ncclComm_t comm = (ncclComm_t)(comm_ptr);
for (size_t i = next_idx; i < g_segments.size(); ++i) {
  auto seg = g_segments[i];
  ncclWindow_t win;
  NCCLCHECK(ncclCommWindowRegister(comm, seg[0], seg[1], &win, NCCL_WIN_COLL_SYMMETRIC));
}
map[comm_ptr] = g_segments.size();
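The suggested per-comm cursor bookkeeping can be modeled in a few lines of self-contained Python (the C++ version would use an unordered_map; register_fn is a hypothetical stand-in for ncclCommWindowRegister):

```python
# Append-only segment list plus a per-comm cursor: each comm registers
# only the segments added since its last visit.
g_segments = []   # (ptr, size) pairs, append-only
g_next_idx = {}   # comm_ptr -> first index not yet registered with it


def register_segments_with_comm(comm_ptr, register_fn):
    """Register the segments this comm has not seen yet, then
    advance its cursor to the end of the list."""
    next_idx = g_next_idx.get(comm_ptr, 0)  # 0 for a brand-new comm
    for ptr, size in g_segments[next_idx:]:
        register_fn(comm_ptr, ptr, size)    # would be ncclCommWindowRegister
    g_next_idx[comm_ptr] = len(g_segments)
```

A comm seen for the first time starts at index 0 and therefore picks up every segment allocated so far, including memory that was first used under a different group.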

@wangfakang
Contributor Author

Instead of copying the list of segments from C++ to Python, can you have a single C++ API register_segments_with_comm(comm_ptr) and keep the registrations and bookkeeping inside C++? Ideally you would have an unordered_map mapping a comm_ptr to the next index in g_segments to register; then you would register with:

size_t next_idx = map[comm_ptr]; // next_idx will be 0 if map does not contain comm_ptr
ncclComm_t comm = (ncclComm_t)(comm_ptr);
for (size_t i = next_idx; i < g_segments.size(); ++i) {
  auto seg = g_segments[i];
  ncclWindow_t win;
  NCCLCHECK(ncclCommWindowRegister(comm, seg[0], seg[1], &win, NCCL_WIN_COLL_SYMMETRIC));
}
map[comm_ptr] = g_segments.size();

@nvcastet Thank you for the suggestions. I have addressed all comments.

Collaborator

@nvcastet nvcastet left a comment


Once all changes are done, please make sure to test the TP config and the DP+EP config for DeepSeek-R1 FP4 and check GPQA eval accuracy.
See comments at:
#8238
#9358

Signed-off-by: wangfakang <fakangwang@gmail.com>
Signed-off-by: wangfakang <fakangwang@gmail.com>
@nvcastet
Collaborator

/tag-and-rerun-ci

@wangfakang
Contributor Author

/rerun-failed-ci

stage-b-test-2-gpu-large (2) install dependencies failed


@wangfakang
Contributor Author

/rerun-failed-ci

Triggering the waiting test tasks.

@wangfakang
Contributor Author

Hello @nvcastet, the GPU-related test cases (stage-a-test-*, stage-b-test-*, and some stage-c-test-*) have all passed. I don't have permission to trigger the remaining stage-c-test-* cases individually. Could you please help with those? Thanks!

@wangfakang
Contributor Author

Hello @nvcastet, the GPU-related test cases (stage-a-test-*, stage-b-test-*, and some stage-c-test-*) have all passed. I don't have permission to trigger the remaining stage-c-test-* cases individually. Could you please help with those? Thanks!

Friendly ping @nvcastet

@nvcastet
Collaborator

/rerun-stage stage-c-test-4-gpu-b200

@github-actions
Contributor

✅ Triggered stage-c-test-4-gpu-b200 to run independently (skipping dependencies). View workflow run

@nvcastet
Collaborator

/rerun-stage stage-c-test-8-gpu-h200

@github-actions
Contributor

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

@nvcastet
Collaborator

You will need a code owner to review first for the PR to be merged.

@wangfakang
Contributor Author

You will need a code owner to review first for the PR to be merged.

Hello @nvcastet, thanks for the reminder and your patient review. This PR has already been approved by the code owner, @yizhang2077.

@nvcastet
Collaborator

stage-c-test-4-gpu-b200 times out. I don't know if it is related to your changes.

@wangfakang
Contributor Author

wangfakang commented Apr 29, 2026

stage-c-test-4-gpu-b200 times out. I don't know if it is related to your changes.

Hi @nvcastet, I checked the logs and found that the failing test case is test_fp8_blockwise_gemm.py. Since this test case doesn't enable symm, it won't execute the code modified in this PR. Therefore, the timeout issue is unrelated to the changes in this PR.
Additionally, I noticed that these test cases have had timeout problems before, as shown in these previous fix PRs:

Meanwhile, the corresponding deepseek-v3-fp4 and deepseek-v32 tests in this PR were also executed successfully, as shown in the logs. This PR has been thoroughly validated in terms of both performance and accuracy, as detailed in the previous report.

@wangfakang
Contributor Author

stage-c-test-4-gpu-b200 times out. I don't know if it is related to your changes.

Hi @nvcastet, I checked the logs and found that the failing test case is test_fp8_blockwise_gemm.py. Since this test case doesn't enable symm, it won't execute the code modified in this PR. Therefore, the timeout issue is unrelated to the changes in this PR. Additionally, I noticed that these test cases have had timeout problems before, as shown in these previous fix PRs:

Meanwhile, the corresponding deepseek-v3-fp4 and deepseek-v32 tests in this PR were also executed successfully, as shown in the logs. This PR has been thoroughly validated in terms of both performance and accuracy, as detailed in the previous report.

Friendly ping @nvcastet

@nvcastet
Collaborator

nvcastet commented May 4, 2026

There is a conflict to solve but looks good to me.

wangfakang added 2 commits May 5, 2026 00:12
Signed-off-by: wangfakang <fakangwang@gmail.com>
@wangfakang
Contributor Author

There is a conflict to solve but looks good to me.

Hello @nvcastet, conflicts resolved with no logic changes. PTAL when you have time. Thanks!

@nvcastet
Collaborator

nvcastet commented May 4, 2026

LGTM.
Someone with merge permission would need to push it. @Fridge003 ?

@wangfakang
Contributor Author

LGTM. Someone with merge permission would need to push it. @Fridge003 ?

Thank you for the review, @nvcastet and @yizhang2077. Ping @ch-wan, @ShangmingCai, or @Fridge003: could you please take a look and help merge this when you have a moment? Thank you!

Signed-off-by: wangfakang <fakangwang@gmail.com>
@ShangmingCai
Collaborator

/rerun-stage stage-c-test-4-gpu-b200

@github-actions
Contributor

github-actions Bot commented May 6, 2026

✅ Triggered stage-c-test-4-gpu-b200 to run independently (skipping dependencies). View workflow run

@wangfakang
Contributor Author

✅ Triggered stage-c-test-4-gpu-b200 to run independently (skipping dependencies). View workflow run

I checked the logs and found that the failing test case is test_qwen35_models.py. Since this test case doesn't enable symm, it won't execute the code modified in this PR. The error message is CUDA out of memory, which is unrelated to the changes in this PR.


cc @ShangmingCai

This PR has been thoroughly validated for both performance and accuracy, as detailed in the previous comment: #21392 (comment). Based on these results, I believe it's ready to merge.


@ShangmingCai
Collaborator


Related CI has passed.

@ShangmingCai ShangmingCai merged commit c8bc235 into sgl-project:main May 6, 2026
78 of 108 checks passed