[PD] feat: support mooncake intra-node nvlink kv transfer by TTThanos · Pull Request #17866 · sgl-project/sglang

TTThanos · 2026-01-28T07:13:51Z

Motivation

Add a new feature: enable intra-node NVLink in SGLang so that it is compatible with the Mooncake INTRA_NODE NVLINK isolation PR:
kvcache-ai/Mooncake#1341 (comment)

Modifications

The main changes are in /Mooncake/utils.py.

Accuracy Tests

To enable intra-node NVLink, launch the server with the following command:

export NCCL_IB_GID_INDEX=1
export NCCL_SOCKET_IFNAME=eth0,eth1
export NCCL_IB_DISABLE=0
export GLOO_SOCKET_IFNAME=eth0
model_path=/mnt/models/Qwen3-235B-A22B-FP8
FILE_NAME_PREFIX=Prefill_Mooncake_INRTANVLINK_kv_transfer_Hicache_test_qwen3_235b_tp4_1210_0
SGLANG_MOONCAKE_CUSTOM_MEM_POOL=true \
SGLANG_MOONCAKE_CUSTOM_MEM_POOL=INTRA_NVLINK \
MC_LOG_LEVEL=INFO \
MC_TE_METRIC=true \
MC_INTRANODE_NVLINK=true \
SGLANG_TORCH_PROFILER_DIR=/root/Yaozhong_hiecache/profile/ \
python3 -m sglang.launch_server \
  --model-path ${model_path} \
  --tp 4 \
  --mem-fraction-static 0.85 \
  --disaggregation-mode prefill \
  --port 7001 \
  --watchdog-timeout 1000000 --decode-log-interval 1 >/root/LYZ_hicache/log/${FILE_NAME_PREFIX}.log 2>&1

SGLANG_MOONCAKE_CUSTOM_MEM_POOL=true is set to align with the design used for MNNVL ("MC_FORCE_MNNVL=true"). Please be aware that "MC_FORCE_MNNVL=true" and "MC_INTRANODE_NVLINK=true" are mutually exclusive: set only one of them when launching the SGLang server.
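
The exclusivity rule for the two Mooncake flags could be checked before launch with a small guard like the one below. This is only a sketch: the helper names `_enabled` and `select_nvlink_mode` are hypothetical and exist in neither SGLang nor Mooncake, and treating only the literal string "true" as enabled is an assumption based on the values shown in this description.

```python
import os
from typing import Mapping, Optional


def _enabled(env: Mapping[str, str], name: str) -> bool:
    # Assumption: a flag is on only when its value is "true" (case-insensitive).
    return env.get(name, "").strip().lower() == "true"


def select_nvlink_mode(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return "MNNVL", "INTRA_NODE", or None; raise if both flags are set."""
    env = os.environ if env is None else env
    mnnvl = _enabled(env, "MC_FORCE_MNNVL")
    intra = _enabled(env, "MC_INTRANODE_NVLINK")
    if mnnvl and intra:
        raise ValueError(
            "MC_FORCE_MNNVL and MC_INTRANODE_NVLINK are mutually exclusive; "
            "set only one of them."
        )
    if mnnvl:
        return "MNNVL"
    if intra:
        return "INTRA_NODE"
    return None
```

Calling `select_nvlink_mode()` with no argument reads the current process environment; passing a plain dict makes the check easy to test.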

Benchmarking and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.

gemini-code-assist · 2026-01-28T07:13:55Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

stmatengss · 2026-01-28T09:05:41Z

/tag-and-rerun-ci

ShangmingCai

LGTM.

ShangmingCai · 2026-01-29T04:34:11Z

-        if (
-            self.enable_custom_mem_pool and self.custom_mem_pool_type == "NVLINK"
-        ) or envs.SGLANG_MOONCAKE_SEND_AUX_TCP.get():
+        if self.enable_custom_mem_pool and self.custom_mem_pool_type in ("NVLINK", "INTRA_NVLINK"):


Is this required? Why does intra-node NVLink require sending aux data over TCP?

According to the log, this is required for intra-node NVLink. Otherwise, it shows "Prefill transfer failed for request rank=xxx", and I don't know the reason. Enabling aux sending over TCP makes the problem disappear.

Was there a precision issue with NVL72 previously, so that TCP is a workaround for aux data? @ShangmingCai

No modifications are needed. "Fix me when Mooncake's nvlink_transport is bug-free" applies to mnnvl.

Yeah, MNNVL has a sync issue when transferring tiny data, so the same granularity issue may affect intra-node NVLink as well. I am shepherding this PR: #17430, which may help.

This logic might be wrong? I think intra-node nvlink should not set up SGLANG_MOONCAKE_CUSTOM_MEM_POOL, so self.enable_custom_mem_pool is False.

Problem solved by adding the condition:
'elif envs.SGLANG_MOONCAKE_CUSTOM_MEM_POOL.get() == "INTRA_NVLINK":'
In the previous version device was 'cpu', so register_memory failed in Mooncake when registering aux_data_ptrs. Now device is set to 'cuda' when INTRA_NVLINK is used, so aux_data_ptrs are registered successfully and there is no need to send aux data over TCP.
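
The device choice described in this reply can be pictured with the sketch below. It is not the code from this PR: `GPU_AUX_POOL_TYPES` and `aux_buffer_device` are hypothetical names, and only INTRA_NVLINK is modeled because it is the only pool type whose device this thread describes.

```python
from typing import Optional

# Hypothetical: pool types whose aux buffers are allocated on the GPU.
# Other pool types are left out because the thread does not describe them.
GPU_AUX_POOL_TYPES = frozenset({"INTRA_NVLINK"})


def aux_buffer_device(pool_type: Optional[str]) -> str:
    """Pick the device string for the aux (metadata) buffers."""
    # With a GPU-backed pool the buffers live on "cuda", which per this thread
    # lets Mooncake register aux_data_ptrs; otherwise fall back to "cpu".
    return "cuda" if pool_type in GPU_AUX_POOL_TYPES else "cpu"
```

For example, `aux_buffer_device("INTRA_NVLINK")` gives "cuda", while `aux_buffer_device(None)` keeps the earlier "cpu" behavior.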

ShangmingCai · 2026-01-29T06:42:04Z

Let me fix the conflicts

ShangmingCai · 2026-01-29T09:23:36Z

please fix lint

TTThanos · 2026-01-29T09:36:00Z

Solved

ShangmingCai · 2026-01-29T09:54:44Z

/tag-and-rerun-ci

stmatengss · 2026-01-29T15:31:21Z


 # Global constants for custom memory pool types
-SUPPORTED_MOONCAKE_CUSTOM_MEM_POOL_TYPES = ["NVLINK", "BAREX"]
+SUPPORTED_MOONCAKE_CUSTOM_MEM_POOL_TYPES = ["NVLINK", "BAREX", "INTRA_NVLINK"]


INTRA_NVLINK appears unprofessional. How about INTRA_NODE_NVLINK?

stmatengss · 2026-02-01T08:39:54Z

Merge main due to #18044

stmatengss · 2026-02-02T07:12:51Z

/rerun-failed-ci

stmatengss · 2026-02-02T17:10:00Z

/rerun-failed-ci

stmatengss · 2026-02-03T07:23:57Z

/rerun-failed-ci

百麒 and others added 10 commits December 27, 2025 08:16

Refine for intraNode NVLINK using CUDAIPC

e88309d

Compatible with enumerate type which indicate mem backend type for nvlink connection

73febe1

Set environment variable to enable intraNode nvlink when detect_mem_back is CudaMalloc

3ede45a

Merge branch 'main' into Feature/nvlink_refine

1419441

Modify SGlang to compatible with Mooncake IntraNode IPC

cd3ee89

import MemoryBackend from nvlink allocator to replace previous enumerate type

2761497

Refine mooncake utils by deleting extra printing

c25305d

Apply suggestions from code review

40da980

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Use IntraNode-nvlink w/o set SGLANG_MOONCAKE_CUSTOM_MEM_POOL

0c96a63

Revert utils to delete detect_mem_backend and add Intra_nvlink related else if condition for custom_mem_pool

f7a7615

TTThanos requested review from ByronHsu, ShangmingCai and hnyls2002 as code owners January 28, 2026 07:13

stmatengss assigned stmatengss and ShangmingCai Jan 28, 2026

百麒 added 3 commits January 28, 2026 08:50

Delete redundant if condition related to MC_INTRA_NVLINK

bb48130

Minimize the change to keep original version

fd35446

keep to original version

0ce0b53

github-actions Bot added the run-ci label Jan 28, 2026

stmatengss changed the title ~~Feature/new intra nvlink~~ [PD] feat: support intra nvlink kv transfer Jan 28, 2026

ShangmingCai approved these changes Jan 29, 2026

[Fix] Add "Intra_NVLINK" to send aux_tcp

c963759

ShangmingCai changed the title ~~[PD] feat: support intra nvlink kv transfer~~ [PD] feat: support mooncake intra nvlink kv transfer Jan 29, 2026

ShangmingCai changed the title ~~[PD] feat: support mooncake intra nvlink kv transfer~~ [PD] feat: support mooncake intra-node nvlink kv transfer Jan 29, 2026

ShangmingCai reviewed Jan 29, 2026

Modify the metadataBuffer allocation type, considering intraNode nvlink

dd46f71

Pre-commit fix problems

aa1092c

stmatengss reviewed Jan 29, 2026

Refine the name of IntraNode nvlink for custom_mem_pool

121dfdf

Merge branch 'main' into Feature/new_intra_nvlink

7887d55

stmatengss approved these changes Feb 2, 2026

ShangmingCai merged commit a45647b into sgl-project:main Feb 3, 2026
336 of 360 checks passed

sfiisf pushed a commit to sfiisf/sglang that referenced this pull request Feb 5, 2026

[PD] feat: support mooncake intra-node nvlink kv transfer (sgl-project#17866)

fe3f8aa

Co-authored-by: 百麒 <yaozhong.lyz@alibaba-inc.com> Co-authored-by: Teng Ma <teng-ma@linux.alibaba.com>

Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026

[PD] feat: support mooncake intra-node nvlink kv transfer (sgl-project#17866)

58f3120

Co-authored-by: 百麒 <yaozhong.lyz@alibaba-inc.com> Co-authored-by: Teng Ma <teng-ma@linux.alibaba.com>

Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026

[PD] feat: support mooncake intra-node nvlink kv transfer (sgl-project#17866)

7a50551

Co-authored-by: 百麒 <yaozhong.lyz@alibaba-inc.com> Co-authored-by: Teng Ma <teng-ma@linux.alibaba.com>
