[Fix] Solve the error caused by `_commit_transfer_to_req()` when using IntraNode NVLink in PD disaggregation #23252
Conversation
…ack is CudaMalloc
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…d else if condition for custom_mem_pool
Code Review
This pull request updates the Mooncake disaggregation logic to support the INTRA_NODE_NVLINK custom memory pool type. Key changes include forcing the use of TCP for auxiliary data transfers and switching the device type to CPU for specific memory pool configurations. A review comment suggests improving code readability in conn.py by using internal class properties instead of directly accessing environment variables in conditional checks.
```diff
 if (
     self.enable_custom_mem_pool and self.custom_mem_pool_type == "NVLINK"
-) or envs.SGLANG_MOONCAKE_SEND_AUX_TCP.get():
+) or envs.SGLANG_MOONCAKE_CUSTOM_MEM_POOL.get() == "INTRA_NODE_NVLINK" or envs.SGLANG_MOONCAKE_SEND_AUX_TCP.get():
```
For better readability and to avoid re-reading environment variables, consider using the `self.custom_mem_pool_type` property, which should already hold the correct value. This simplifies the condition and makes it more consistent with the surrounding code.
Suggested change:

```python
if (
    self.enable_custom_mem_pool
    and self.custom_mem_pool_type in ("NVLINK", "INTRA_NODE_NVLINK")
) or envs.SGLANG_MOONCAKE_SEND_AUX_TCP.get():
```
ShangmingCai left a comment
The bootstrap room validation makes sure the values are correct on the decode side. Maybe we can sync the status before marking it as successful, then move it to the waiting queue.
This fix should work, but it is not a real fix, actually. I will look into it.
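The synchronization idea suggested above can be sketched as follows. This is a minimal illustration, not the actual sglang code: the rank workers, `results` list, and the use of a `threading.Barrier` to stand in for a cross-rank all-reduce are all hypothetical. The point is that no rank reports success until every rank has finished reading its metadata, so none can race ahead and remove state that others still need.

```python
import threading

NUM_RANKS = 4
# Hypothetical stand-in for a cross-rank synchronization (e.g. an all-reduce):
# no rank proceeds past this point until all ranks have arrived.
barrier = threading.Barrier(NUM_RANKS)
results = [None] * NUM_RANKS

def rank_worker(rank):
    # 1. Read (and remove) this rank's metadata.
    metadata_read = True
    # 2. Sync: wait until *all* ranks have read their metadata before
    #    marking the transfer as successful.
    barrier.wait()
    # 3. Only now report success and move the request to the waiting queue.
    results[rank] = "Success" if metadata_read else "Failed"

threads = [threading.Thread(target=rank_worker, args=(r,)) for r in range(NUM_RANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# All ranks report Success together; none observes a half-finished transfer.
```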
Please fix lint first.
Following our offline discussion, LGTM.
Exactly, synchronization can be used to avoid this problem.
…traNode NVLink in PD disaggregation (sgl-project#23252) Co-authored-by: 百麒 <yaozhong.lyz@alibaba-inc.com>
Motivation
When using Mooncake IntraNode NVLink as the KV cache transport backend in PD disaggregation, the decode server crashes and the following error log can be observed:

This is related to commit #8ed35df, which introduced metadata validation in `_commit_transfer_to_req()`. In IntraNode NVLink scenarios, the metadata buffer is allocated on the GPU and transferred via NVLink. However, due to the asynchronous nature of NVLink transport, the metadata buffer may not have been fully transferred yet when `poll == KVPoll.Success`. Consequently, `bootstrap_room` remains 0 for some ranks, while other ranks have already read and removed their metadata, leading to a mismatch in the subsequent `poll_and_all_reduce()` call.
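The race described above can be illustrated with a minimal sketch. All names here (`metadata_buffer`, `status`, `commit_transfer_to_req`) are hypothetical stand-ins, not the actual sglang implementation: a background thread flips the poll status to Success before the metadata write lands, mimicking the asynchronous NVLink copy, so an early reader can observe a zeroed `bootstrap_room` even though the transfer is "done".

```python
import threading
import time

# Hypothetical stand-in for the GPU-resident metadata buffer: the sender
# writes bootstrap_room asynchronously, the receiver polls a status flag.
metadata_buffer = {"bootstrap_room": 0}
status = {"poll": "WaitingForInput"}

def async_nvlink_transfer(room_id):
    # Status flips to Success before the metadata write completes,
    # mimicking the asynchronous NVLink transport.
    status["poll"] = "Success"
    time.sleep(0.01)  # the copy is still in flight
    metadata_buffer["bootstrap_room"] = room_id

def commit_transfer_to_req(expected_room):
    # Validation analogous to _commit_transfer_to_req(): a rank that reads
    # too early sees bootstrap_room == 0 and fails the check.
    return status["poll"] == "Success" and \
        metadata_buffer["bootstrap_room"] == expected_room

t = threading.Thread(target=async_nvlink_transfer, args=(42,))
t.start()
while status["poll"] != "Success":
    pass
early_read_ok = commit_transfer_to_req(42)  # likely False: buffer not written yet
t.join()
late_read_ok = commit_transfer_to_req(42)   # True once the transfer completes
```

The mismatch in the real system arises because some ranks take the "early read" path while others take the "late read" path within the same `poll_and_all_reduce()` round.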
Modifications
Accuracy Tests
launch server:

```bash
model_path=/mnt/models/Qwen3-235B-A22B-FP8
FILE_NAME_PREFIX=Decode_Mooncake_INTRANVLINK_kv_transfer_Hicache_test_qwen3_235b_tp4_0412
export MC_TE_METRIC=true
export MC_INTRANODE_NVLINK=true
SGLANG_MOONCAKE_CUSTOM_MEM_POOL=true SGLANG_MOONCAKE_CUSTOM_MEM_POOL=INTRA_NODE_NVLINK MC_LOG_LEVEL=INFO SGLANG_TORCH_PROFILER_DIR=/root/profile/ python3 -m sglang.launch_server \
    --model-path ${model_path} \
    --tp 4 \
    --mem-fraction-static 0.85 \
    --base-gpu-id 4 \
    --disaggregation-mode decode \
    --port 7002 \
    --watchdog-timeout 1000000 --decode-log-interval 1 >/root/log/${FILE_NAME_PREFIX}.log 2>&1
```
Speed Tests and Profiling
Tested on 4K/2K input/output using the bench serving test.

Checklist
Review and Merge Process
/tag-and-rerun-ci,/tag-run-ci-label,/rerun-failed-ci