[P/D disagg] Support decode-side radix cache #19746
Conversation
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
|
@ishandhanani Can this feature be understood as follows? Current status of pd-disagg, based on this PR's implementation:
Yep, this is correct.
|
/gemini review |
|
- Set `req.prefix_indices` in `_pre_alloc` so that `init_next_round_input(None)` computes `extend_input_len` correctly from the cached prefix length. Without this, `prepare_for_prebuilt` runs a full-length extend instead of a delta extend.
- Always call `inc_lock_ref` on the matched node (even on an empty match) to mirror the aggregated scheduler's behavior. This prevents `lock_ref` underflow, since `cache_finished_req` unconditionally calls `dec_lock_ref`.
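The lock-ref pairing described above can be sketched as follows; the classes and method bodies here are illustrative stand-ins, not SGLang's actual implementation. The point is that the alloc path must lock whatever node the prefix match returns, even an empty match, because the finish path unlocks unconditionally:

```python
# Hypothetical sketch of the inc_lock_ref/dec_lock_ref pairing (names are
# illustrative, not SGLang's real classes).

class Node:
    def __init__(self):
        self.lock_ref = 0


class RadixCacheSketch:
    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        # An empty match still returns a node (the root), mirroring the
        # aggregated scheduler: the caller locks whatever node comes back.
        return [], self.root

    def inc_lock_ref(self, node):
        node.lock_ref += 1

    def dec_lock_ref(self, node):
        assert node.lock_ref > 0, "lock_ref underflow"
        node.lock_ref -= 1


def pre_alloc(cache, req_tokens):
    prefix, node = cache.match_prefix(req_tokens)
    cache.inc_lock_ref(node)  # always lock, even on an empty match
    return prefix, node


def cache_finished_req(cache, node):
    cache.dec_lock_ref(node)  # unconditional unlock on the finish path


cache = RadixCacheSketch()
prefix, node = pre_alloc(cache, [1, 2, 3])
cache_finished_req(cache, node)  # balanced: no underflow
```

If `pre_alloc` skipped the lock on an empty match, the `dec_lock_ref` in `cache_finished_req` would trip the underflow assertion.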
|
|
Next step is testing with a larger model on B200. The step after that (maybe in a follow-up) is to do the same for Mooncake.
|
@ishandhanani There seems to be a constraint here:
|
Could you share the exact command you used to run this? I'd like to reproduce it and test it on my side.
There are a few things here.
|
|
/rerun-stage stage-c-test-8-gpu-h20 |
|
✅ Triggered |
|
CI passed for this job |
|
/rerun-test test/registered/distributed/test_disaggregation_decode_radix_cache.py |
|
✅ |
```python
def maybe_cache_unfinished_req(req: Req, tree_cache: BasePrefixCache, **kwargs):
    if getattr(req, "skip_radix_cache_insert", False):
        return

    tree_cache.cache_unfinished_req(req, **kwargs)
```
Just wondering if we should replace all `tree_cache.cache_unfinished_req(req, **kwargs)` calls with `maybe_cache_unfinished_req`?
CC: @xiezhq-hermann
```python
if match_result.mamba_branching_seqlen is not None:
    req.mamba_branching_seqlen = match_result.mamba_branching_seqlen
if match_result.cache_protected_len is not None:
    req.cache_protected_len = match_result.cache_protected_len
```
This looks like new logic, but is it unused for now?
ShangmingCai left a comment:
Others LGTM, if CI passes.
|
/rerun-failed-ci |
|
CC: @ByronHsu, please help review.
|
/rerun-stage stage-c-test-8-gpu-h200 |
|
✅ Triggered |
|
@ByronHsu Please check this PR when you have time; I think it is good to merge.
|
If there are no further comments or suggestions, we will merge this PR today.
|
LGTM. Excited to try this feature for long-context PD!
```python
# Aux data is still sent below when is_last=True.
if len(kv_indices) > 0:
    notif = (
        f"{req.room}_kv_{chunk_id}_{int(is_last)}_{self.kv_args.pp_rank}"
    )
```
This looks like a rebase issue; it reverted a fix for the hang when TP P > D. I will fix it in #23967.
|
Have you tested low-concurrency scenarios (e.g., 30 concurrent requests)? As you mentioned, the baseline decode can only accommodate ~37 running requests. Consequently, a large number of requests will be queued during the decode preallocation phase, which leads to a higher TTFT for the baseline setup.

Summary
In PD disaggregation, the decode worker can now use radix cache to reuse shared prefixes and request only the delta KV from prefill instead of transferring the full prefix on every turn.
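To make the delta idea concrete, here is a minimal sketch; the function and variable names are illustrative, not SGLang's actual API:

```python
# Sketch of the delta-transfer idea: decode matches the request's tokens
# against its radix cache and asks prefill only for the KV beyond the
# matched prefix (names are illustrative).

def kv_delta_to_transfer(input_ids, decode_prefix_len):
    """Return the token positions whose KV must be sent from prefill."""
    # Positions [0, decode_prefix_len) are already resident in the decode
    # worker's radix cache; only the tail needs to cross the wire.
    return list(range(decode_prefix_len, len(input_ids)))


# Multi-turn chat: turn 2 shares the first 6 tokens with turn 1.
turn2_ids = [1, 2, 3, 4, 5, 6, 7, 8]
needed = kv_delta_to_transfer(turn2_ids, decode_prefix_len=6)
# Only positions 6 and 7 are transferred instead of all 8.
```

Without the decode-side radix cache, `decode_prefix_len` is effectively 0 and the full prefix is transferred on every turn.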
This is enabled with `--disaggregation-decode-enable-radix-cache` on the decode server. For now, this path is supported only with `--disaggregation-transfer-backend nixl`; `server_args.py` now rejects other transfer backends early when the decode radix cache flag is enabled. Mooncake support will follow in a separate PR.

Main Changes

- Send `decode_prefix_len` from decode to prefill for the NIXL path.
- Add `--disaggregation-decode-enable-radix-cache`.
- Require `--disaggregation-transfer-backend nixl` when that flag is set.

Interface
Enable decode radix cache on the decode worker with `--disaggregation-decode-enable-radix-cache`.

Prefill continues to run with `--disaggregation-transfer-backend nixl`.

Note: DP attention is still experimental here. The flag is allowed, but good cache hit rates require prefix-aware DP routing.
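A launch sketch under assumptions: the model path and the `--disaggregation-mode` flag below are illustrative defaults; only the two flags named in this PR (`--disaggregation-decode-enable-radix-cache` and `--disaggregation-transfer-backend nixl`) come from the summary above.

```shell
# Decode worker: enable the decode-side radix cache over NIXL
# (hypothetical invocation; adjust model path and networking flags).
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-32B \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend nixl \
  --disaggregation-decode-enable-radix-cache

# Prefill worker: same transfer backend, no extra flag needed.
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-32B \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend nixl
```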
Benchmark
Setup
Qwen/Qwen3-32B, FP8 KV cache, 3P1D, TP=2 per worker

Results
Decode-side logs show the reason for the throughput gain: baseline decode ran near KV capacity (`token_usage ~ 0.99`) and only fit ~37 running requests, while decode radix cache reduced duplicate prefix residency (`token_usage ~ 0.75`) and fit roughly 104-126 running requests. The ITL regression is expected from the larger decode batch.

Test Plan
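The capacity arithmetic above can be sanity-checked with a toy calculation. The KV pool size below is a made-up constant; only the `token_usage` values and running-request counts come from the logs quoted above:

```python
# Hypothetical KV pool size in tokens (illustrative constant; the real
# pool size depends on the model and GPU memory).
POOL = 100_000

# Average resident tokens per request = token_usage * pool / num_running.
baseline_per_req = 0.99 * POOL / 37    # near-capacity baseline, ~37 requests
radix_per_req = 0.75 * POOL / 115      # midpoint of the 104-126 range

# Deduplicating shared prefixes shrinks each request's unique KV footprint
# by roughly 4x, which is how ~3x more requests fit at lower token usage.
ratio = baseline_per_req / radix_per_req
print(round(ratio, 1))
```

Note the pool size cancels out of the ratio, so the ~4x footprint reduction holds regardless of the assumed capacity.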
- `nixl` in `server_args.py`