
[P/D disagg] - support decode side radix cache #19746

Merged

ShangmingCai merged 95 commits into main from ishan/add-radix-cache-decode on May 1, 2026
Conversation

@ishandhanani
Collaborator

@ishandhanani ishandhanani commented Mar 3, 2026

Summary

In PD disaggregation, the decode worker can now use radix cache to reuse shared prefixes and request only the delta KV from prefill instead of transferring the full prefix on every turn.

This is enabled with --disaggregation-decode-enable-radix-cache on the decode server.

For now, this path is supported only with --disaggregation-transfer-backend nixl. server_args.py now rejects other transfer backends early when the decode radix cache flag is enabled. Mooncake support will follow in a separate PR.

Main Changes

  • Decode scheduler (see the sketch after this list)
    • Match incoming requests against the decode-side radix tree.
    • Lock matched prefix nodes for the request lifetime.
    • Pre-allocate only the delta KV pages beyond the matched prefix.
  • Decode -> prefill protocol
    • Plumb decode_prefix_len from decode to prefill for the NIXL path.
    • Allow full-prefix hits where decode may need no KV pages transferred.
  • Prefill transfer path
    • Initialize the sender with only the unsent delta pages.
    • Keep the chunked transfer cursor monotonic when decode already has part of the prefix.
    • Skip empty non-last chunks so the sender/receiver chunk protocol stays consistent.
  • Correctness / cleanup
    • Align matched prefix length to page boundaries for paged KV allocators.
    • Guard lock release / cleanup paths for transfer-failure cases.
    • Batch finished prebuilt frees through the free-group path.
  • CLI / config
    • The user-facing switch is --disaggregation-decode-enable-radix-cache.
    • Current validation requires --disaggregation-transfer-backend nixl when that flag is set.
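
A minimal sketch of the decode-side pre-allocation and prefill-side chunk loop described above. This is illustrative, not the actual SGLang code: helper names such as match_prefix_indices, alloc_kv_pages, and the chunk tuple layout are assumptions; only inc_lock_ref and decode_prefix_len come from this PR.

def prealloc_with_decode_radix_cache(req, tree_cache, page_size):
    # Match the incoming tokens against the decode-side radix tree.
    matched_indices, last_node = match_prefix_indices(tree_cache, req.input_ids)

    # Align the matched prefix down to a page boundary for paged KV allocators.
    matched_len = (len(matched_indices) // page_size) * page_size
    req.prefix_indices = matched_indices[:matched_len]

    # Lock the matched nodes for the request lifetime so they cannot be
    # evicted while the transfer is in flight.
    tree_cache.inc_lock_ref(last_node)
    req.last_node = last_node

    # Pre-allocate only the delta pages beyond the matched prefix; a
    # full-prefix hit may need zero pages transferred.
    req.kv_indices = alloc_kv_pages(len(req.input_ids) - matched_len, page_size)

    # Plumbed to prefill over the NIXL path so the sender skips what
    # decode already holds.
    req.decode_prefix_len = matched_len

def send_kv_chunks(sender, req, chunks):
    # Prefill side: start the cursor at the prefix decode already holds,
    # and keep it monotonic across chunks.
    cursor = req.decode_prefix_len
    for chunk_id, (start, end, is_last) in enumerate(chunks):
        kv_indices = req.kv_indices[max(start, cursor):end]
        # Skip empty non-last chunks so the sender/receiver chunk protocol
        # stays consistent; the last chunk is always sent (it carries aux data).
        if len(kv_indices) == 0 and not is_last:
            continue
        sender.send(chunk_id, kv_indices, is_last)
        cursor = max(cursor, end)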

Interface

Enable decode radix cache on the decode worker with:

--disaggregation-mode decode --disaggregation-transfer-backend nixl --disaggregation-decode-enable-radix-cache

Prefill continues to run with --disaggregation-transfer-backend nixl.
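
For example, a pair of launch commands might look like the following (model, TP size, and everything other than the disaggregation flags are placeholders, not taken from this PR):

# decode worker
python3 -m sglang.launch_server --model-path Qwen/Qwen3-32B --tp 2 \
    --disaggregation-mode decode \
    --disaggregation-transfer-backend nixl \
    --disaggregation-decode-enable-radix-cache

# prefill worker
python3 -m sglang.launch_server --model-path Qwen/Qwen3-32B --tp 2 \
    --disaggregation-mode prefill \
    --disaggregation-transfer-backend nixl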

Note: DP attention is still experimental here. The flag is allowed, but good cache hit rates require prefix-aware DP routing.

Benchmark

Setup

  • Hardware: 1x NVIDIA B200 node (8 GPUs), single-node PD disaggregation via NIXL
  • Model: Qwen/Qwen3-32B, FP8 KV cache, 3P1D, TP=2 per worker
  • Workload: 20 unique ~50K-token prefixes + ~4.5K suffix (~91% prefix reuse), 1000 requests, concurrency 128
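
(The ~91% reuse figure follows from the request shape: about 50K shared prefix tokens out of roughly 54.5K input tokens per request, i.e. 50,000 / 54,500 ≈ 0.917.)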

Results

| Metric | Baseline | Decode Radix Cache | Improvement |
| --- | --- | --- | --- |
| Request throughput (req/s) | 1.21 | 1.59 | 1.32x |
| Output token throughput (tok/s) | 430 | 566 | 1.32x |
| TTFT p50 (s) | 73.2 | 9.0 | 8.1x |
| TTFT avg (s) | 77.7 | 31.6 | 2.5x |
| Request latency p50 (s) | 99.1 | 73.4 | 1.35x |
| ITL avg (ms) | 65.6 | 130.6 | 0.50x |
| Benchmark duration (s) | 827 | 628 | 1.32x |

Decode-side logs show the reason for the throughput gain: baseline decode ran near KV capacity (token_usage ~ 0.99) and only fit ~37 running requests, while decode radix cache reduced duplicate prefix residency (token_usage ~ 0.75) and fit roughly 104-126 running requests. The ITL regression is expected from the larger decode batch.

Test Plan

  • Qwen3-0.6B local PD disagg sanity runs
  • MiniMax-M2.5 1P1D on B200
  • Qwen3-32B 3P1D on B200 (results above)
  • Guard decode radix cache behind nixl in server_args.py
  • Multi-node cross-host testing (RDMA transport)
  • Mooncake transfer backend support (separate PR)


@ishandhanani ishandhanani changed the title from "[Draft] [P/D disagg] - support decode side radix cache" to "[P/D disagg] - support decode side radix cache" on Mar 3, 2026
@dongyibo

dongyibo commented Mar 3, 2026

@ishandhanani Can this feature be understood as follows:
In a multi-turn dialogue scenario, the first round takes tokens 1, 2, 3 as input and outputs tokens 4, 5, 6.
The second round takes tokens 1, 2, 3, 4, 5, 6, 7, 8, 9 as input and outputs tokens 10, 11, 12.

Current status of pd-disagg:
In the first round, for the decode worker, the generated tokens 4, 5, 6 are not cached, and the KV cache of the input tokens 1, 2, 3 is not saved.
In the second round, the prefill worker needs to send the KV cache for all tokens 1, 2, 3, 4, 5, 6, 7, 8, and 9 to the decode worker.

Based on this PR's implementation:
In the first round, the decode worker saves the KV cache for tokens 1, 2, 3, 4, 5, and 6.
In the second round, the prefill worker only needs to send the KV cache for tokens 7, 8, and 9 to the decode worker.
Is my understanding correct?

@ishandhanani
Collaborator Author

ishandhanani commented Mar 3, 2026

> @ishandhanani Can this feature be understood as follows: In a multi-turn dialogue scenario, the first round takes tokens 1, 2, 3 as input and outputs tokens 4, 5, 6. The second round takes tokens 1, 2, 3, 4, 5, 6, 7, 8, 9 as input and outputs tokens 10, 11, 12.
>
> Current status of pd-disagg: In the first round, for the decode worker, the generated tokens 4, 5, 6 are not cached, and the KV cache of the input tokens 1, 2, 3 is not saved. In the second round, the prefill worker needs to send the KV cache for all tokens 1, 2, 3, 4, 5, 6, 7, 8, and 9 to the decode worker.
>
> Based on this PR's implementation: In the first round, the decode worker saves the KV cache for tokens 1, 2, 3, 4, 5, and 6. In the second round, the prefill worker only needs to send the KV cache for tokens 7, 8, and 9 to the decode worker. Is my understanding correct?

Yep. This is correct.

@ishandhanani
Collaborator Author

/gemini review


- Set req.prefix_indices in _pre_alloc so init_next_round_input(None) computes extend_input_len correctly from the cached prefix length. Without this, prepare_for_prebuilt runs a full-length extend instead of a delta extend.

- Always call inc_lock_ref on the matched node (even on an empty match) to match aggregated scheduler behavior. This prevents lock_ref underflow when cache_finished_req unconditionally calls dec_lock_ref.
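
A sketch of the pairing the second bullet relies on (the scaffolding around the calls is illustrative; inc_lock_ref, dec_lock_ref, and cache_finished_req are the names from the commit message):

# Admission path: always take a reference on the matched node, even when
# the match is empty, mirroring the aggregated scheduler.
tree_cache.inc_lock_ref(req.last_node)

# ... request runs to completion ...

# Completion path: cache_finished_req releases unconditionally, so the
# unconditional inc above keeps lock_ref from underflowing.
tree_cache.dec_lock_ref(req.last_node)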

@ishandhanani
Collaborator Author

ishandhanani commented Mar 4, 2026

Next step is testing with a larger model on B200. The step after that (maybe in a follow-up) is to do the same for Mooncake.

Comment thread on python/sglang/srt/disaggregation/prefill.py (outdated)
@dongyibo

dongyibo commented Mar 4, 2026

@ishandhanani There seems to be a constraint here: for multiple decode workers, such as when decode runs with DP, it's best if the same DP rank serves the entire conversation; otherwise the cached KV cannot be reused?

@nananall

nananall commented Mar 4, 2026

Could you share the exact command you used to run this? I'd like to reproduce it and test it on my side.

@ishandhanani
Collaborator Author

ishandhanani commented Mar 4, 2026

> @ishandhanani There seems to be a constraint here: for multiple decode workers, such as when decode runs with DP, it's best if the same DP rank serves the entire conversation; otherwise the cached KV cannot be reused?

There are a few things here.

  1. When running with multiple decode workers (standard data parallelism of workers), I expect the router to pick the right decode worker based on KV load. The Dynamo router handles this well and performantly out of the box (rough sketch below).
  2. For DP attention, agreed. I have not added support for that yet; it still needs to be done.
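
Not part of this PR, but to illustrate point 1, a KV-load-aware router might pick the decode worker along these lines (every name here is hypothetical):

def pick_decode_worker(workers, input_ids):
    # Prefer the worker with the longest cached prefix for this request;
    # break ties by choosing the lowest KV token usage.
    return max(
        workers,
        key=lambda w: (w.cached_prefix_len(input_ids), -w.kv_token_usage),
    )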

@ishandhanani
Collaborator Author

/rerun-stage stage-c-test-8-gpu-h20

@github-actions
Contributor

✅ Triggered stage-c-test-8-gpu-h20 to run independently (skipping dependencies). View workflow run

@ishandhanani
Collaborator Author

CI passed for this job

@ishandhanani
Collaborator Author

/rerun-test test/registered/distributed/test_disaggregation_decode_radix_cache.py

@github-actions
Contributor

8-gpu-h20 (1 test): View workflow run

cd test/ && python3 registered/distributed/test_disaggregation_decode_radix_cache.py

Comment on lines +44 to +48
def maybe_cache_unfinished_req(req: Req, tree_cache: BasePrefixCache, **kwargs):
    # Requests flagged to skip radix-cache insertion are left out of the tree.
    if getattr(req, "skip_radix_cache_insert", False):
        return

    tree_cache.cache_unfinished_req(req, **kwargs)
Collaborator

Just wondering if we should replace all tree_cache.cache_unfinished_req(req, **kwargs) calls with maybe_cache_unfinished_req?

CC: @xiezhq-hermann
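
For reference, a call site using the wrapper would presumably look like the following (which requests set the flag is an assumption, not taken from this PR):

# Requests whose partial KV should not land in the radix tree opt out via
# the flag; every other call site behaves exactly as before.
req.skip_radix_cache_insert = True
maybe_cache_unfinished_req(req, tree_cache)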

Comment on lines +109 to +112
if match_result.mamba_branching_seqlen is not None:
    req.mamba_branching_seqlen = match_result.mamba_branching_seqlen
if match_result.cache_protected_len is not None:
    req.cache_protected_len = match_result.cache_protected_len
Collaborator

These look like new logic, but are they currently unused?

Collaborator

@ShangmingCai ShangmingCai left a comment

Otherwise LGTM, if CI passes.

@ShangmingCai
Collaborator

/rerun-failed-ci

@ShangmingCai
Collaborator

CC: @ByronHsu please help review

@ishandhanani
Collaborator Author

/rerun-stage stage-c-test-8-gpu-h200

@github-actions
Contributor

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies).
⚠️ Could not retrieve workflow run URL. Check the Actions tab for progress.

@ShangmingCai
Collaborator

@ByronHsu Please check this PR if you have time, I think it is good to merge.

@ShangmingCai
Collaborator

If there are no further comments or suggestions, we will merge this PR today.
CC: @cctry @ByronHsu @xiezhq-hermann

@ByronHsu
Collaborator

ByronHsu commented May 1, 2026

LGTM. Excited to try this feature for long context PD!

@ShangmingCai
Collaborator

CI has passed.


@ShangmingCai ShangmingCai merged commit 5b7ce41 into main May 1, 2026
264 of 294 checks passed
@ShangmingCai ShangmingCai deleted the ishan/add-radix-cache-decode branch May 1, 2026 13:55
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request May 2, 2026
# Aux data is still sent below when is_last=True.
if len(kv_indices) > 0:
    notif = (
        f"{req.room}_kv_{chunk_id}_{int(is_last)}_{self.kv_args.pp_rank}"
Contributor

This looks like a rebase issue; it reverted a fix for the hang when TP P > D. I will fix it in #23967.

@llc-kc
Contributor

llc-kc commented May 7, 2026

Have you tested low-concurrency scenarios (e.g., 30 concurrent requests)? As you mentioned, the baseline decode can only accommodate ~37 running requests, so a large number of requests queue during the decode pre-allocation phase, which inflates the baseline's TTFT.
Tests under low concurrency would more fairly reflect the performance improvement brought by delta KV cache transmission.
In contrast, the high-concurrency results primarily highlight the higher concurrency enabled by the decode radix cache, while the advantages of delta KV cache transmission remain obscured.
