[Bug][TENT]: SGLang P/D intra-node communication fails due to wrong IPC address #1829

@shuoerw

Summary

SGLang intra-node PD disaggregation does not work correctly in TENT.
The root cause appears to be that the transfer engine returns the wrong IPC address for the metadata buffer. The Prefill node's metadata is therefore never delivered to the Decode node, which prevents the Decode node from starting properly.

Problem

Because the metadata is not propagated as expected, the Decode node gets stuck at:
decode.py#L988
https://github.com/sgl-project/sglang/blob/f6e85676b578108207c7b7d4e3e8a65625e90f00/python/sglang/srt/disaggregation/decode.py#L989
As a result, the Decode node never begins processing.
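
To make the failure mode concrete, here is a toy model of it, not SGLang or Mooncake code. It assumes the Decode side polls a metadata buffer until it changes and that the writer resolves the target through some address mapping. Every name in it (`poll_metadata`, `write_metadata`, the registries, the buffers) is hypothetical, and a timeout is added so the sketch terminates instead of hanging.

```python
import time


def poll_metadata(buf, timeout_s=0.05, interval_s=0.005):
    """Return True once the first byte of buf becomes non-zero, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if buf[0] != 0:
            return True
        time.sleep(interval_s)
    return False


def write_metadata(registry, key, payload):
    """Write payload into whichever buffer the registry maps key to."""
    target = registry[key]
    target[:len(payload)] = payload


# Hypothetical stand-ins for registered memory regions.
decode_metadata = bytearray(8)  # the buffer the Decode side actually polls
stale_buffer = bytearray(8)     # stands in for a wrongly resolved address

# Wrong mapping: the write succeeds, but lands in a buffer nobody polls.
wrong_registry = {"metadata": stale_buffer}
write_metadata(wrong_registry, "metadata", b"\x01")
delivered_wrong = poll_metadata(decode_metadata)

# Correct mapping: the same write is now visible to the poller.
correct_registry = {"metadata": decode_metadata}
write_metadata(correct_registry, "metadata", b"\x01")
delivered_right = poll_metadata(decode_metadata)
```

In this model the write with the wrong mapping raises no error, so the only symptom is a poller that never sees the update. That is consistent with the hang described above.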

The metadata is registered in Mooncake here:

Status NVLinkTransport::addMemoryBuffer(BufferDesc& desc,

The root cause was found and is discussed in #1622

Possible Fix

The same fix can be migrated to TENT: #1622

@ishandhanani @TTThanos @stmatengss
