Summary
SGLang intra-node PD disaggregation does not work correctly in TENT.
The root cause appears to be that the transfer engine returns the wrong address for the metadata buffer. As a result, the Prefill node's metadata is never delivered to the Decode node, which prevents the Decode node from starting properly.
Problem
Because the metadata is not propagated as expected, the Decode node gets stuck at:
decode.py#L988
https://github.com/sgl-project/sglang/blob/f6e85676b578108207c7b7d4e3e8a65625e90f00/python/sglang/srt/disaggregation/decode.py#L989
As a result, the Decode node never begins processing.
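The failure mode can be illustrated with a toy model. This is only a sketch: the flat `bytearray` memory, the addresses, and the functions `transfer`, `decode_poll`, and `run` are hypothetical and do not come from SGLang or Mooncake. It assumes the Decode side waits for a ready flag in its metadata buffer, and it bounds the polling so the example terminates, whereas the real wait has no such bound.

```python
# Toy model only: none of these names come from SGLang or Mooncake.
MEMORY_SIZE = 64
METADATA_ADDR = 16  # hypothetical address where the Decode side expects metadata
READY = 1           # hypothetical "metadata arrived" flag value


def transfer(memory: bytearray, dst_addr: int, payload: bytes) -> None:
    """Copy payload into memory at dst_addr (stand-in for a transfer engine write)."""
    memory[dst_addr:dst_addr + len(payload)] = payload


def decode_poll(memory: bytearray, max_polls: int = 5) -> bool:
    """Return True once the ready flag at METADATA_ADDR is set, else False after max_polls."""
    for _ in range(max_polls):
        if memory[METADATA_ADDR] == READY:
            return True
    return False


def run(resolved_addr: int) -> bool:
    """Simulate one Prefill-to-Decode metadata transfer to resolved_addr."""
    memory = bytearray(MEMORY_SIZE)
    transfer(memory, resolved_addr, bytes([READY, 42]))
    return decode_poll(memory)


if __name__ == "__main__":
    print(run(METADATA_ADDR))      # correct address: the flag is seen
    print(run(METADATA_ADDR + 8))  # wrong address: the flag is never seen
```

In this model the write itself succeeds in both cases. With the wrong address the payload simply lands somewhere the Decode side never reads, which matches the symptom of a Decode node that waits without reporting an error.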
The metadata is registered in Mooncake here:
Mooncake/mooncake-transfer-engine/tent/src/transport/nvlink/nvlink_transport.cpp, line 211 in 6da2572:
Status NVLinkTransport::addMemoryBuffer(BufferDesc& desc,
The root cause was found and is discussed in #1622.
Possible Fix
The same fix can be migrated to TENT:
#1622
@ishandhanani @TTThanos @stmatengss