[UnifiedRadixTree]: Support HiCache Framework for UnifiedRadixTree #23316

Merged
ispobock merged 23 commits into main from hybrid_tree/hicache_integrate
May 3, 2026

Conversation

@hzh0425 (Collaborator) commented Apr 21, 2026

Motivation

This branch adds HiCache support to UnifiedRadixTree and unifies the device/host cache lifecycle across Full, Mamba, and related hybrid components.

In particular, it enables HiCache on top of the unified tree for Hybrid Linear models and DeepSeek DSA-style models, with componentized eviction/load-back logic, explicit D-leaf/H-leaf tracking, host-side LRU management for auxiliary components, and integrated HiCache pool/controller wiring in the unified path.
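The host-side LRU bookkeeping mentioned above can be sketched roughly as follows. This is a minimal illustration using Python's `OrderedDict`; the class and method names are hypothetical, not the PR's actual API:

```python
from collections import OrderedDict


class HostLRU:
    """Minimal sketch of host-side LRU management for auxiliary
    cache components (names are illustrative, not SGLang's API)."""

    def __init__(self, capacity_pages: int):
        self.capacity = capacity_pages
        self.pages = OrderedDict()  # page_id -> host buffer handle

    def touch(self, page_id, handle=None):
        # Move an accessed page to the MRU end; insert if new.
        if page_id in self.pages:
            self.pages.move_to_end(page_id)
        else:
            self.pages[page_id] = handle

    def evict(self):
        # Pop pages from the LRU end until back under capacity.
        evicted = []
        while len(self.pages) > self.capacity:
            page_id, handle = self.pages.popitem(last=False)
            evicted.append((page_id, handle))
        return evicted
```

The real implementation has to coordinate eviction with D-leaf/H-leaf state and in-flight transfers; this sketch only shows the recency-ordering idea.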

TODO for subsequent PRs:

Thanks to @linjianyu233 and @icepoint666 for the testing feedback.

Modifications

Accuracy Tests

Speed Tests and Profiling

Comparison for Hybrid Linear Model (Qwen3_5-397B-A17B-FP8)
image

Comparison for Hybrid DeepSeek DSA Model (DeepSeek V32)
image

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@github-actions github-actions Bot added the hicache Hierarchical Caching for SGLang label Apr 21, 2026
@hzh0425 hzh0425 added the run-ci label Apr 21, 2026
linjianyu233 commented Apr 21, 2026

GSM8K benchmark accuracy test passed.

 INFO - Benchmark 'precision_gsm8k' setup starting
 INFO - Benchmark 'precision_gsm8k' run starting with params: {'eval-name': 'gsm8k', 'num-examples': 100, 'max-tokens': 20480, 'num-shots': 5, 'gsm8k-data-path': '<WORKDIR>/store_test/datasets/GSM8K.jsonl', 'score-tolerance': 0.03, 'num-threads': 32, 'model-path': '<MODELS>/Qwen3.5-35B-A3B-FP8', 'port': 6178, 'host': '0.0.0.0', 'output-dir': '<WORKDIR>/store_test/results/20260417_072038_native_l2_qwen3_5_precision'}
 INFO - [precision_gsm8k] Phase 1: Running first eval (gsm8k)
 INFO - Running eval: python3 -m sglang.test.run_eval --port 6178 --eval-name gsm8k --num-threads 32 --host 0.0.0.0 --num-examples 100 --output-dir <WORKDIR>/store_test/results/20260417_072038_native_l2_qwen3_5_precision --max-tokens 20480 --num-shots 5 --gsm8k-data-path <WORKDIR>/store_test/datasets/GSM8K.jsonl
 INFO - Eval stdout saved to: <WORKDIR>/store_test/results/20260417_072038_native_l2_qwen3_5_precision/gsm8k_before_eval.log
 INFO - Eval stderr saved to: <WORKDIR>/store_test/results/20260417_072038_native_l2_qwen3_5_precision/gsm8k_before_eval_stderr.log
 INFO - Auto-detected eval result file: <WORKDIR>/store_test/results/20260417_072038_native_l2_qwen3_5_precision/gsm8k__home_admin_models_Qwen3.5-35B-A3B-FP8.json
 INFO - Parsed score from <WORKDIR>/store_test/results/20260417_072038_native_l2_qwen3_5_precision/gsm8k__home_admin_models_Qwen3.5-35B-A3B-FP8.json: 0.97
 INFO - Renamed eval output: gsm8k__home_admin_models_Qwen3.5-35B-A3B-FP8.json → gsm8k__home_admin_models_Qwen3.5-35B-A3B-FP8_before.json
 INFO - Renamed eval output: gsm8k__home_admin_models_Qwen3.5-35B-A3B-FP8.html → gsm8k__home_admin_models_Qwen3.5-35B-A3B-FP8_before.html
 INFO - [precision_gsm8k] Phase 2: Flushing cache
 INFO - Flushing cache (attempt 1/3): POST http://0.0.0.0:6178/flush_cache
 INFO - Cache flushed successfully (status: 200)
 INFO - [precision_gsm8k] Phase 3: Running second eval (gsm8k)
 INFO - Running eval: python3 -m sglang.test.run_eval --port 6178 --eval-name gsm8k --num-threads 32 --host 0.0.0.0 --num-examples 100 --output-dir <WORKDIR>/store_test/results/20260417_072038_native_l2_qwen3_5_precision --max-tokens 20480 --num-shots 5 --gsm8k-data-path <WORKDIR>/store_test/datasets/GSM8K.jsonl
 INFO - Eval stdout saved to: <WORKDIR>/store_test/results/20260417_072038_native_l2_qwen3_5_precision/gsm8k_after_eval.log
 INFO - Eval stderr saved to: <WORKDIR>/store_test/results/20260417_072038_native_l2_qwen3_5_precision/gsm8k_after_eval_stderr.log
 INFO - Auto-detected eval result file: <WORKDIR>/store_test/results/20260417_072038_native_l2_qwen3_5_precision/gsm8k__home_admin_models_Qwen3.5-35B-A3B-FP8.json
 INFO - Parsed score from <WORKDIR>/store_test/results/20260417_072038_native_l2_qwen3_5_precision/gsm8k__home_admin_models_Qwen3.5-35B-A3B-FP8.json: 0.99
 INFO - Renamed eval output: gsm8k__home_admin_models_Qwen3.5-35B-A3B-FP8.json → gsm8k__home_admin_models_Qwen3.5-35B-A3B-FP8_after.json
 INFO - Renamed eval output: gsm8k__home_admin_models_Qwen3.5-35B-A3B-FP8.html → gsm8k__home_admin_models_Qwen3.5-35B-A3B-FP8_after.html
 INFO - [precision_gsm8k] Precision test PASSED: before=0.97, after=0.99, diff=0.020000000000000018, tolerance=0.03
 INFO - Benchmark 'precision_gsm8k' run completed with status: success
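The pass criterion in the log above is a simple before/after score comparison under tolerance; a sketch of that check (the helper name is illustrative, not the harness's actual code):

```python
def precision_test_passed(before: float, after: float, tolerance: float) -> bool:
    """Pass if the post-flush score differs from the pre-flush score
    by no more than the tolerance (restates the check logged above)."""
    return abs(after - before) <= tolerance


# Values from the GSM8K log: before=0.97, after=0.99, tolerance=0.03
assert precision_test_passed(0.97, 0.99, 0.03)
```

The `diff=0.020000000000000018` in the log is ordinary floating-point rounding of `0.99 - 0.97`, well within the 0.03 tolerance.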

linjianyu233 commented Apr 21, 2026

Comparison of multi-turn conversation tests between HiCache disabled and HiCache with UnifiedRadixTree enabled.

  • deploy:
export SGLANG_ENABLE_UNIFIED_RADIX_TREE="1"
export SGLANG_LOG_MS="true"
export SGLANG_MM_FEATURE_CACHE_MB="4096"
export SGLANG_USE_CUDA_IPC_TRANSPORT="1"
export SGLANG_VLM_CACHE_SIZE_MB="0"

python3 -m sglang.launch_server \
    --model-path /home/models/Qwen3.5-397B-A17B-FP8 \
    --host 0.0.0.0 \
    --port 6178 \
    --tp 8 \
    --page-size 128 \
    --dtype auto \
    --mamba-scheduler-strategy extra_buffer \
    --kv-cache-dtype fp8_e4m3 \
    --cuda-graph-max-bs 64 \
    --chunked-prefill-size 65536 \
    --expert-parallel-size 8 \
    --enable-metrics \
    --enable-cache-report \
    --log-level info \
    --enable-hierarchical-cache \
    --hicache-size 40 \
    --hicache-mem-layout page_first_direct \
    --hicache-io-backend direct \
    --hicache-write-policy write_through \
    --mem-fraction-static 0.8
  • run benchmark:
python3 sglang/benchmark/hicache/bench_multiturn.py --output-length 8 --request-length 400 --num-clients 300 --num-rounds 19 --max-parallel 32 --request-rate 4 --ready-queue-policy random --disable-random-sample --disable-auto-run --enable-round-barrier --model-path /home/models/Qwen3.5-397B-A17B-FP8/ --port 6178 

HiCache disabled:
image
HiCache with UnifiedRadixTree:
image

@hzh0425 (Collaborator, Author) commented Apr 21, 2026

/rerun-test test_unified_radix_cache_unittest.py test_unified_radix_cache_bench.py test_unified_radix_cache_kl.py

@ispobock (Collaborator):

/tag-and-rerun-ci

@icepoint666 (Contributor):

Validated UnifiedRadixTree + L1+L2 HiCache on Qwen3.5-397B-A17B for both stability and correctness.

Deploy (TP16, 2 nodes, page_size=64)

export SGLANG_ENABLE_UNIFIED_RADIX_TREE="1"

python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3.5-397B-A17B \
    --host 0.0.0.0 \
    --port 30000 \
    --tp 16 \
    --nnodes 2 \
    --node-rank ${NODE_RANK} \
    --dist-init-addr <launcher_ip>:29500 \
    --page-size 64 \
    --mamba-scheduler-strategy extra_buffer \
    --enable-hierarchical-cache \
    --hicache-ratio 2 \
    --hicache-size 0 \
    --hicache-mem-layout page_first_direct \
    --hicache-io-backend direct \
    --hicache-write-policy write_through \
    --hicache-storage-prefetch-policy wait_complete

Stability (3 rounds × 2048 prompts, shared-prefix workload)

python3 -m sglang.bench_serving \
    --backend sglang \
    --base-url http://127.0.0.1:30000 \
    --model Qwen/Qwen3.5-397B-A17B \
    --dataset-name generated-shared-prefix \
    --gsp-num-groups 128 \
    --gsp-prompts-per-group 16 \
    --gsp-system-prompt-len 4096 \
    --gsp-question-len 512 \
    --gsp-output-len 128 \
    --num-prompts 2048 \
    --request-rate inf
| Round | Duration (s) | Input tok/s | Output tok/s | Median TTFT (s) |
|-------|--------------|-------------|--------------|-----------------|
| 1     | 2382         | 4063        | 110.1        | 1297            |
| 2     | 2033         | 4759        | 128.9        | 952             |
| 3     | 2027         | 4774        | 129.3        | 961             |

Rounds 2 and 3 reuse the shared prefix populated in Round 1. Compared to Round 1, duration drops 14.6% and median TTFT drops 27%; Rounds 2 and 3 are nearly identical, so cache reuse is stable. The server stayed healthy the entire time: no OOM, assert, or hang.
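The round-over-round deltas quoted above can be reproduced from the table with plain arithmetic (not project code):

```python
# Round 1 vs. Round 2 figures from the table above
dur_r1, dur_r2 = 2382, 2033      # duration in seconds
ttft_r1, ttft_r2 = 1297, 952     # median TTFT

dur_drop = (dur_r1 - dur_r2) / dur_r1 * 100    # ~14.65%, quoted as 14.6%
ttft_drop = (ttft_r1 - ttft_r2) / ttft_r1 * 100  # ~26.6%, quoted as ~27%
```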

Byte-level correctness

To check whether the L2 host copy is byte-identical to what the device actually wrote, I sampled SHA-256 fingerprints in both backup_from_device_all_layer (D→H) and load_to_device_per_layer (H→D).

The hook is installed via source-level append (every TP worker picks it up after spawn), samples 4 pages per call, and uses a 128-byte sketch SHA-256.

Result — aggregated across all 16 TP ranks:

| metric     | per-rank | total (×16) | meaning                                                    |
|------------|----------|-------------|------------------------------------------------------------|
| backup_n   | 1292     | 20672       | pages sampled after a D→H backup (fingerprint stored)      |
| load_n     | 48       | 768         | pages sampled before an H→D load (layer_id=0)              |
| verify_n   | 32       | 512         | loads where the stored fingerprint was found and compared  |
| match_n    | 32       | 512         | comparisons where stored == loaded                         |
| mismatch_n | 0        | 0           | comparisons where stored != loaded                         |

Every verified sample is byte-exact.
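A minimal sketch of the sampling scheme described above: hash a fixed-size prefix of each sampled page on D→H backup and re-check it before the H→D load. The 128-byte sketch size follows the comment; the function names and bookkeeping are illustrative, not the actual hook:

```python
import hashlib

SKETCH_BYTES = 128  # only the first 128 bytes of a page are hashed

_fingerprints = {}  # (layer_id, page_id) -> sha256 hex digest


def sketch(page: bytes) -> str:
    """128-byte-prefix SHA-256 fingerprint of a page buffer."""
    return hashlib.sha256(page[:SKETCH_BYTES]).hexdigest()


def on_backup(layer_id: int, page_id: int, device_page: bytes) -> None:
    # D→H path: record a fingerprint of what the device wrote.
    _fingerprints[(layer_id, page_id)] = sketch(device_page)


def on_load(layer_id: int, page_id: int, host_page: bytes) -> bool:
    # H→D path: compare the host copy against the stored fingerprint.
    stored = _fingerprints.get((layer_id, page_id))
    if stored is None:
        return True  # page was never sampled on backup; nothing to verify
    return stored == sketch(host_page)
```

In the real hook the page buffers are device/host tensors rather than `bytes`, and only a few pages per call are sampled; the comparison logic is the same.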

@sgl-project sgl-project deleted a comment from github-actions Bot Apr 21, 2026

hzh0425 commented Apr 21, 2026

/rerun-test test_unified_radix_cache_unittest.py test_unified_radix_cache_bench.py test_unified_radix_cache_kl.py

@github-actions (bot):

1-gpu-5090 (2 tests): View workflow run

cd test/ && python3 registered/unit/mem_cache/test_unified_radix_cache_unittest.py
cd test/ && python3 registered/unit/mem_cache/test_unified_radix_cache_bench.py

4-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/radix_cache/test_unified_radix_cache_kl.py

@huangtingwei9988 huangtingwei9988 self-assigned this Apr 21, 2026
req.init_next_round_input(self.tree_cache, cow_mamba=False)
last_host_node = req.last_host_node
if last_host_node.backuped or last_host_node is self.tree_cache.root_node:
    last_hash = last_host_node.get_last_hash_value()

UnifiedTreeNode is missing the get_last_hash_value method, so running this PR with the Mooncake HiCache backend results in the following exception:

AttributeError: 'UnifiedTreeNode' object has no attribute 'get_last_hash_value'

@riZZZhik commented Apr 24, 2026

Also, UnifiedRadixCache is missing prefetch-related methods such as prefetch_from_storage, check_prefetch_progress, ..., which are called by the scheduler.

@hzh0425 (Collaborator, Author):

Hi @riZZZhik,
This PR only supports L2 HiCache for now; the missing methods are used for L3 HiCache and will be supported in the next PR.

@hzh0425 (Collaborator, Author):

Please refer to #23316 (comment).

Comment thread test/registered/radix_cache/test_unified_radix_cache_kl.py
@ispobock ispobock merged commit c0f5950 into main May 3, 2026
171 of 190 checks passed
@ispobock ispobock deleted the hybrid_tree/hicache_integrate branch May 3, 2026 14:13

Labels

hicache Hierarchical Caching for SGLang high priority run-ci

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants