…s used with external routing

Remove the follow_bootstrap_room fast path in PD disaggregation DP rank resolution.

The fast path assumed the prefill DP rank is always bootstrap_room % dp_size, which is incorrect when an external router (e.g., the model gateway) overrides routing via routed_dp_rank / data_parallel_rank on the request. The DP controller correctly respects this override, but the fast path bypassed the actual rank registration and query:

- Prefill: skipped _register_prefill_dp_rank when load_balance_method was follow_bootstrap_room, so the bootstrap server never learned the actual DP rank used.
- Decode: returned bootstrap_room % dp_size directly instead of querying the bootstrap server, getting the wrong rank when external routing was in effect.

Now prefill always registers its DP rank (when dp_size > 1), and decode always queries for it when disagg_prefill_dp_rank is not explicitly set.

Made-with: Cursor
0bc16df to
638a986
ShangmingCai
left a comment
Looks good. But then it is no longer a pure follow_bootstrap_room load-balancing strategy, and it will increase TTFT slightly; I think it becomes a mixed strategy. Should we add a condition that checks whether the DP rank has been assigned by the router, and only then decide whether to bypass the DP rank register and query?
CC @hnyls2002
Can we disallow using explicit dp rank for |
|
/rerun-test test_disaggregation_basic.py test_disaggregation_dp_attention.py |
|
/rerun-test test/registered/disaggregation/test_disaggregation_basic.py |
|
/rerun-test test/registered/disaggregation/test_disaggregation_basic.py test_disaggregation_dp_attention.py |
|
✅ ✅ |
… DP rank resolution (#22901)
… DP rank resolution (sgl-project#22901)
Motivation
In PD (Prefill-Decode) disaggregation mode with data parallelism, the decode server needs to know which prefill DP worker handled a given request to establish the correct KV transfer connection. There are two mechanisms for determining the DP worker:

1. An explicit rank on the request, `data_parallel_rank` (mapped to `routed_dp_rank`), which the DP controller dispatcher respects with highest priority via `maybe_external_dp_rank_routing`, overriding any load balance method.
2. The configured load balance method (`round_robin`, `follow_bootstrap_room`, etc.).

To communicate the actual prefill DP rank to the decode server:

- Prefill calls `_register_prefill_dp_rank` to register the request's assigned DP rank to the bootstrap server.
- Decode calls `query_prefill_dp_ranks` to retrieve the actual DP rank used by prefill.

Bug: When the prefill load balance method is `follow_bootstrap_room`, there were two fast-path optimizations that assumed the prefill DP rank is always `bootstrap_room % dp_size`:

- Prefill (`CommonKVSender.__init__`): Skipped calling `_register_prefill_dp_rank` when `load_balance_method == "follow_bootstrap_room"`, since the decode side could infer the rank from the bootstrap room number.
- Decode (`_resolve_prefill_dp_rank`): Returned `bootstrap_room % dp_size` directly when `prefill_info.follow_bootstrap_room` was true, bypassing the `query_prefill_dp_ranks` call.

This assumption breaks when an external router (e.g., the model gateway) sets `routed_dp_rank` on the request. The DP controller routes to the externally specified rank (which may differ from `bootstrap_room % dp_size`), but prefill never registers the actual rank and decode computes the wrong one.

Revised Fix
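Before the details of the fix, the mismatch and the guarded fast path can be made concrete with a small sketch. This is an illustration under stated assumptions, not SGLang's actual code: `dispatch_dp_rank`, `inferred_dp_rank`, and `check_strict_dispatch` are hypothetical helpers, and the `force_query` flag stands in for `SGLANG_DISAGGREGATION_FORCE_QUERY_PREFILL_DP_RANK=1` (how that variable is parsed is not shown in this PR description).

```python
from typing import Optional


def dispatch_dp_rank(
    bootstrap_room: int, dp_size: int, routed_dp_rank: Optional[int] = None
) -> int:
    """Hypothetical stand-in for the DP controller under follow_bootstrap_room.

    An explicit routed_dp_rank takes priority over the load balance method.
    """
    if routed_dp_rank is not None:
        return routed_dp_rank
    return bootstrap_room % dp_size


def inferred_dp_rank(bootstrap_room: int, dp_size: int) -> int:
    """Rank the decode fast path assumes without querying the bootstrap server."""
    return bootstrap_room % dp_size


def check_strict_dispatch(
    bootstrap_room: int, dp_size: int, attn_dp_rank: int, force_query: bool = False
) -> None:
    """Hypothetical guard: fail loudly instead of letting decode infer a wrong rank.

    force_query models the env-var escape hatch; when it is set the fast path
    is off, so the rank is registered and queried rather than inferred.
    """
    if force_query:
        return
    expected = inferred_dp_rank(bootstrap_room, dp_size)
    if attn_dp_rank != expected:
        raise ValueError(
            f"follow_bootstrap_room expects DP rank {expected}, "
            f"but the request was dispatched to rank {attn_dp_rank}"
        )


# Without an override, dispatch and inference agree.
assert dispatch_dp_rank(10, 4) == inferred_dp_rank(10, 4)

# With an external override they diverge, which is the bug described above.
actual = dispatch_dp_rank(10, 4, routed_dp_rank=1)
assert actual != inferred_dp_rank(10, 4)
```

With the guard in place, `check_strict_dispatch(10, 4, actual)` raises, while `check_strict_dispatch(10, 4, actual, force_query=True)` passes because the rank would then be looked up rather than inferred.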
The original approach removed the fast path entirely and always fell through to register/query. This causes an unnecessary performance regression for strict `follow_bootstrap_room` deployments (an extra HTTP round-trip per request).

`follow_bootstrap_room` is a strict dispatch policy: SGLang assumes that if prefill uses it, the rank is always `bootstrap_room % dp_size`, and decode infers the rank accordingly without querying. The real issue is that `routed_dp_rank` can silently override this policy on the prefill side, causing a mismatch that decode cannot detect.

The revised fix takes a different approach:

- The fast path is kept: the rank is taken as `bootstrap_room % dp_size` when `follow_bootstrap_room` is active. No perf regression for existing deployments.
- If `follow_bootstrap_room` is configured but the actual `attn_dp_rank != bootstrap_room % dp_size` (i.e., an external `routed_dp_rank` overrode the dispatch), the request is aborted with a clear error message. This prevents decode from silently computing the wrong rank.
- Setting `SGLANG_DISAGGREGATION_FORCE_QUERY_PREFILL_DP_RANK=1` disables the fast path on both sides: prefill always registers, decode always queries. This supports non-strict routing (e.g., an external router overriding `follow_bootstrap_room`) at the cost of an extra round-trip.

Checklist