[Fix] Solve the error caused by `_commit_transfer_to_req()` when using IntraNode NVLink in PD disaggregation #23252
Conversation
…ack is CudaMalloc
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…d else if condition for custom_mem_pool
Code Review
This pull request updates the Mooncake disaggregation logic to support the INTRA_NODE_NVLINK custom memory pool type. Key changes include forcing the use of TCP for auxiliary data transfers and switching the device type to CPU for specific memory pool configurations. A review comment suggests improving code readability in conn.py by using internal class properties instead of directly accessing environment variables in conditional checks.
```diff
 if (
     self.enable_custom_mem_pool and self.custom_mem_pool_type == "NVLINK"
-) or envs.SGLANG_MOONCAKE_SEND_AUX_TCP.get():
+) or envs.SGLANG_MOONCAKE_CUSTOM_MEM_POOL.get() == "INTRA_NODE_NVLINK" or envs.SGLANG_MOONCAKE_SEND_AUX_TCP.get():
```
For better readability and to avoid re-reading environment variables, consider using the `self.custom_mem_pool_type` property, which should already hold the correct value. This simplifies the condition and makes it more consistent with the surrounding code.
Suggested change:

```python
if (
    self.enable_custom_mem_pool
    and self.custom_mem_pool_type in ("NVLINK", "INTRA_NODE_NVLINK")
) or envs.SGLANG_MOONCAKE_SEND_AUX_TCP.get():
```
ShangmingCai left a comment
The bootstrap room validation makes sure the values are correct on the decode side. Maybe we can sync the status before marking it as successful, then move it to the waiting queue.
This fix should work, but it is not a real fix, actually. I will look into it.
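The synchronization idea suggested above can be sketched as follows. This is a minimal illustration, not the actual sglang code: the rank workers, `results` list, and the use of a `threading.Barrier` to stand in for a cross-rank all-reduce are all hypothetical. The point is that no rank reports success until every rank has finished reading its metadata, so none can race ahead and remove state that others still need.

```python
import threading

NUM_RANKS = 4
# Hypothetical stand-in for a cross-rank synchronization (e.g. an all-reduce):
# no rank proceeds past this point until all ranks have arrived.
barrier = threading.Barrier(NUM_RANKS)
results = [None] * NUM_RANKS

def rank_worker(rank):
    # 1. Read (and remove) this rank's metadata.
    metadata_read = True
    # 2. Sync: wait until *all* ranks have read their metadata before
    #    marking the transfer as successful.
    barrier.wait()
    # 3. Only now report success and move the request to the waiting queue.
    results[rank] = "Success" if metadata_read else "Failed"

threads = [threading.Thread(target=rank_worker, args=(r,)) for r in range(NUM_RANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# All ranks report Success together; none observes a half-finished transfer.
```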
Please fix lint first.
Following our offline discussion, LGTM.
Exactly, synchronization can be used to avoid this problem.
…traNode NVLink in PD disaggregation (sgl-project#23252) Co-authored-by: 百麒 <yaozhong.lyz@alibaba-inc.com>
Motivation
When using Mooncake IntraNode NVLink as the KV cache transport backend in PD disaggregation, the decode server crashes and the following error log can be observed:

This is related to commit #8ed35df, which introduced metadata validation in `_commit_transfer_to_req()`. In IntraNode NVLink scenarios, the metadata buffer is allocated on the GPU and transferred via NVLink. However, due to the asynchronous nature of NVLink transport, the metadata buffer may not have been fully transferred yet when `poll == KVPoll.Success`. Consequently, `bootstrap_room` remains 0 for some ranks, while other ranks have already read and removed their metadata, leading to a mismatch in the subsequent `poll_and_all_reduce()` call.
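The race described above can be illustrated with a minimal sketch. All names here (`metadata_buffer`, `status`, `commit_transfer_to_req`) are hypothetical stand-ins, not the actual sglang implementation: a background thread flips the poll status to Success before the metadata write lands, mimicking the asynchronous NVLink copy, so an early reader can observe a zeroed `bootstrap_room` even though the transfer is "done".

```python
import threading
import time

# Hypothetical stand-in for the GPU-resident metadata buffer: the sender
# writes bootstrap_room asynchronously, the receiver polls a status flag.
metadata_buffer = {"bootstrap_room": 0}
status = {"poll": "WaitingForInput"}

def async_nvlink_transfer(room_id):
    # Status flips to Success before the metadata write completes,
    # mimicking the asynchronous NVLink transport.
    status["poll"] = "Success"
    time.sleep(0.01)  # the copy is still in flight
    metadata_buffer["bootstrap_room"] = room_id

def commit_transfer_to_req(expected_room):
    # Validation analogous to _commit_transfer_to_req(): a rank that reads
    # too early sees bootstrap_room == 0 and fails the check.
    return status["poll"] == "Success" and \
        metadata_buffer["bootstrap_room"] == expected_room

t = threading.Thread(target=async_nvlink_transfer, args=(42,))
t.start()
while status["poll"] != "Success":
    pass
early_read_ok = commit_transfer_to_req(42)  # likely False: buffer not written yet
t.join()
late_read_ok = commit_transfer_to_req(42)   # True once the transfer completes
```

The mismatch in the real system arises because some ranks take the "early read" path while others take the "late read" path within the same `poll_and_all_reduce()` round.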
Modifications
Accuracy Tests
launch server:

```bash
model_path=/mnt/models/Qwen3-235B-A22B-FP8
FILE_NAME_PREFIX=Decode_Mooncake_INTRANVLINK_kv_transfer_Hicache_test_qwen3_235b_tp4_0412
export MC_TE_METRIC=true
export MC_INTRANODE_NVLINK=true
SGLANG_MOONCAKE_CUSTOM_MEM_POOL=true SGLANG_MOONCAKE_CUSTOM_MEM_POOL=INTRA_NODE_NVLINK MC_LOG_LEVEL=INFO SGLANG_TORCH_PROFILER_DIR=/root/profile/ python3 -m sglang.launch_server \
    --model-path ${model_path} \
    --tp 4 \
    --mem-fraction-static 0.85 \
    --base-gpu-id 4 \
    --disaggregation-mode decode \
    --port 7002 \
    --watchdog-timeout 1000000 --decode-log-interval 1 >/root/log/${FILE_NAME_PREFIX}.log 2>&1
```
Speed Tests and Profiling
Tested on 4K/2K input/output using the bench serving test.

Checklist
Review and Merge Process
/tag-and-rerun-ci,/tag-run-ci-label,/rerun-failed-ci