[UnifiedRadixTree]: Support HiCache Framework for UnifiedRadixTree #23316
# Conflicts: # python/sglang/srt/mem_cache/unified_cache_components/tree_component.py
# Conflicts: # python/sglang/srt/mem_cache/unified_radix_cache.py
Co-authored-by: diemchai <diemchai@tencent.com>
GSM8K benchmark accuracy test passed.

Compare multi-turn conversation tests between HiCache disabled and HiCache with UnifiedRadixTree enabled.

/rerun-test test_unified_radix_cache_unittest.py test_unified_radix_cache_bench.py test_unified_radix_cache_kl.py

/tag-and-rerun-ci

Validated Deploy (TP16, 2 nodes, page_size=64)

export SGLANG_ENABLE_UNIFIED_RADIX_TREE="1"
python3 -m sglang.launch_server \
--model-path Qwen/Qwen3.5-397B-A17B \
--host 0.0.0.0 \
--port 30000 \
--tp 16 \
--nnodes 2 \
--node-rank ${NODE_RANK} \
--dist-init-addr <launcher_ip>:29500 \
--page-size 64 \
--mamba-scheduler-strategy extra_buffer \
--enable-hierarchical-cache \
--hicache-ratio 2 \
--hicache-size 0 \
--hicache-mem-layout page_first_direct \
--hicache-io-backend direct \
--hicache-write-policy write_through \
--hicache-storage-prefetch-policy wait_complete

Stability (3 rounds × 2048 prompts, shared-prefix workload)

python3 -m sglang.bench_serving \
--backend sglang \
--base-url http://127.0.0.1:30000 \
--model Qwen/Qwen3.5-397B-A17B \
--dataset-name generated-shared-prefix \
--gsp-num-groups 128 \
--gsp-prompts-per-group 16 \
--gsp-system-prompt-len 4096 \
--gsp-question-len 512 \
--gsp-output-len 128 \
--num-prompts 2048 \
--request-rate inf

Rounds 2 and 3 reuse the shared prefix populated in Round 1. Compared to Round 1, duration drops by 14.6% and TTFT by 27%; Rounds 2 and 3 are nearly identical, and cache reuse is stable. The server stayed healthy the entire time, with no OOM, assert, or hang.

Byte-level correctness

To check whether the L2 host copy is byte-identical to what the device actually wrote, I sampled SHA-256 fingerprints on both backup_from_device_all_layer (D→H) and load_to_device_per_layer (H→D). The hook is installed via source-level append (every TP worker picks it up after

Result, aggregated across all 16 TP ranks:

Every verified sample is byte-exact.
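
For readers who want to picture the byte-level check described above, here is a minimal sketch of the idea: record a SHA-256 digest when a page is backed up (D→H) and compare it when the page is loaded back (H→D). This is not the hook used in this PR. `FingerprintLedger`, `record_backup`, `verify_load`, and the integer `page_id` key are hypothetical names, and plain `bytes` objects stand in for the KV buffers.

```python
import hashlib
from typing import Dict, Optional


def fingerprint(buf: bytes) -> str:
    """Return the SHA-256 hex digest of a raw byte buffer."""
    return hashlib.sha256(buf).hexdigest()


class FingerprintLedger:
    """Hypothetical helper: stores D->H digests and checks them on H->D."""

    def __init__(self) -> None:
        self._expected: Dict[int, str] = {}
        self.verified = 0    # samples whose digests matched
        self.mismatched = 0  # samples whose digests differed

    def record_backup(self, page_id: int, device_bytes: bytes) -> None:
        # Called on the D->H path with the bytes the device wrote.
        self._expected[page_id] = fingerprint(device_bytes)

    def verify_load(self, page_id: int, host_bytes: bytes) -> Optional[bool]:
        # Called on the H->D path; returns None if the page was never sampled.
        expected = self._expected.get(page_id)
        if expected is None:
            return None
        ok = fingerprint(host_bytes) == expected
        if ok:
            self.verified += 1
        else:
            self.mismatched += 1
        return ok
```

A per-rank ledger like this could then be summed across ranks to produce an aggregate count of verified and mismatched samples. How the bytes are extracted from the real device and host pools is outside the scope of the sketch.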
Support hicache_anchor_kv_shared_indices_pools in unifiedRadixTree directly
req.init_next_round_input(self.tree_cache, cow_mamba=False)
last_host_node = req.last_host_node
if last_host_node.backuped or last_host_node is self.tree_cache.root_node:
    last_hash = last_host_node.get_last_hash_value()
UnifiedTreeNode is missing the get_last_hash_value method, so running this PR with the Mooncake HiCache backend results in the following exception:
AttributeError: 'UnifiedTreeNode' object has no attribute 'get_last_hash_value'
Also, UnifiedRadixCache is missing prefetch-related methods such as prefetch_from_storage, check_prefetch_progress, ... which are called by the scheduler.
Hi @riZZZhik,
This PR only supports L2 HiCache for now. The missing methods are used for L3 HiCache and will be supported in the next PR.
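
Until L3 support lands, one way to surface the gap discussed in this thread early is a capability check at startup instead of an AttributeError mid-request. The sketch below is only an illustration and is not part of this PR. `missing_l3_methods` and `L3_PREFETCH_METHODS` are hypothetical names, and the method list contains only the two names mentioned in the review comment, so it is not exhaustive.

```python
from typing import List

# Only the two methods named in the review comment; the real set is larger.
L3_PREFETCH_METHODS = ("prefetch_from_storage", "check_prefetch_progress")


def missing_l3_methods(cache: object) -> List[str]:
    """Return the L3 prefetch method names that `cache` does not implement."""
    return [
        name
        for name in L3_PREFETCH_METHODS
        if not callable(getattr(cache, name, None))
    ]


def require_l3_support(cache: object) -> None:
    """Raise a descriptive error if any listed L3 prefetch method is absent."""
    missing = missing_l3_methods(cache)
    if missing:
        raise NotImplementedError(
            f"{type(cache).__name__} lacks L3 HiCache methods: {', '.join(missing)}"
        )
```

A caller could run `require_l3_support(tree_cache)` only when an L3 storage backend is configured, so L2-only deployments are unaffected.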


Motivation
This branch adds HiCache support to UnifiedRadixTree and unifies the device/host cache lifecycle across Full, Mamba, and related hybrid components.
In particular, it enables HiCache on top of the unified tree for Hybrid Linear models and DeepSeek DSA-style models, with componentized eviction/load-back logic, explicit D-leaf/H-leaf tracking, host-side LRU management for auxiliary components, and integrated HiCache pool/controller wiring in the unified path.
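
To make the "host-side LRU management" idea above concrete, here is a minimal, generic LRU sketch. It is not the data structure used in this branch. `HostLRU`, `touch`, `evict`, and the string node ids are hypothetical, and the sketch assumes each entry simply holds a page count.

```python
from collections import OrderedDict
from typing import Hashable, List


class HostLRU:
    """Hypothetical host-side LRU: least recently touched entries evict first."""

    def __init__(self) -> None:
        # Maps node id -> number of host pages held; order is oldest to newest.
        self._entries: "OrderedDict[Hashable, int]" = OrderedDict()

    def touch(self, node_id: Hashable, num_pages: int) -> None:
        # Insert or refresh an entry and mark it as most recently used.
        self._entries[node_id] = num_pages
        self._entries.move_to_end(node_id)

    def evict(self, pages_needed: int) -> List[Hashable]:
        # Pop oldest entries until enough pages are freed or nothing is left.
        freed = 0
        victims: List[Hashable] = []
        while self._entries and freed < pages_needed:
            node_id, num_pages = self._entries.popitem(last=False)
            victims.append(node_id)
            freed += num_pages
        return victims
```

A real implementation would also need to respect tree structure (for example, evicting only leaves) and any locks held by in-flight requests, both of which this sketch ignores.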
TODO for subsequent PRs:
Thanks to @linjianyu233 and @icepoint666 for the testing feedback.
Modifications
Accuracy Tests
Speed Tests and Profiling
Comparison for Hybrid Linear Model (Qwen3_5-397B-A17B-FP8)

Comparison for Hybrid DeepSeek DSA Model (DeepSeek V32)
