[UnifiedRadixTree]: Support HiCache Framework for UnifiedRadixTree #23316

Merged
ispobock merged 23 commits into main from hybrid_tree/hicache_integrate
May 3, 2026

Conversation

@hzh0425 (Collaborator) commented Apr 21, 2026

Motivation

This branch adds HiCache support to UnifiedRadixTree and unifies the device/host cache lifecycle across Full, Mamba, and related hybrid components.

In particular, it enables HiCache on top of the unified tree for Hybrid Linear models and DeepSeek DSA-style models, with componentized eviction/load-back logic, explicit D-leaf/H-leaf tracking, host-side LRU management for auxiliary components, and integrated HiCache pool/controller wiring in the unified path.
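The host-side LRU bookkeeping mentioned above can be sketched roughly as follows. This is a minimal illustration using Python's `OrderedDict`; the class and method names are hypothetical, not the PR's actual API:

```python
from collections import OrderedDict


class HostLRU:
    """Minimal sketch of host-side LRU management for auxiliary
    cache components (names are illustrative, not SGLang's API)."""

    def __init__(self, capacity_pages: int):
        self.capacity = capacity_pages
        self.pages = OrderedDict()  # page_id -> host buffer handle

    def touch(self, page_id, handle=None):
        # Move an accessed page to the MRU end; insert if new.
        if page_id in self.pages:
            self.pages.move_to_end(page_id)
        else:
            self.pages[page_id] = handle

    def evict(self):
        # Pop pages from the LRU end until back under capacity.
        evicted = []
        while len(self.pages) > self.capacity:
            page_id, handle = self.pages.popitem(last=False)
            evicted.append((page_id, handle))
        return evicted
```

The real implementation has to coordinate eviction with D-leaf/H-leaf state and in-flight transfers; this sketch only shows the recency-ordering idea.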

TODO for subsequent PRs:

Thanks to @linjianyu233 and @icepoint666 for the testing feedback.

Modifications

Accuracy Tests

Speed Tests and Profiling

Comparison for Hybrid Linear Model (Qwen3_5-397B-A17B-FP8)
image

Comparison for Hybrid DeepSeek DSA Model (DeepSeek V32)
image

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@github-actions github-actions Bot added the hicache Hierarchical Caching for SGLang label Apr 21, 2026
@hzh0425 hzh0425 added the run-ci label Apr 21, 2026
linjianyu233 commented Apr 21, 2026

GSM8K benchmark accuracy test passed.

 INFO - Benchmark 'precision_gsm8k' setup starting
 INFO - Benchmark 'precision_gsm8k' run starting with params: {'eval-name': 'gsm8k', 'num-examples': 100, 'max-tokens': 20480, 'num-shots': 5, 'gsm8k-data-path': '<WORKDIR>/store_test/datasets/GSM8K.jsonl', 'score-tolerance': 0.03, 'num-threads': 32, 'model-path': '<MODELS>/Qwen3.5-35B-A3B-FP8', 'port': 6178, 'host': '0.0.0.0', 'output-dir': '<WORKDIR>/store_test/results/20260417_072038_native_l2_qwen3_5_precision'}
 INFO - [precision_gsm8k] Phase 1: Running first eval (gsm8k)
 INFO - Running eval: python3 -m sglang.test.run_eval --port 6178 --eval-name gsm8k --num-threads 32 --host 0.0.0.0 --num-examples 100 --output-dir <WORKDIR>/store_test/results/20260417_072038_native_l2_qwen3_5_precision --max-tokens 20480 --num-shots 5 --gsm8k-data-path <WORKDIR>/store_test/datasets/GSM8K.jsonl
 INFO - Eval stdout saved to: <WORKDIR>/store_test/results/20260417_072038_native_l2_qwen3_5_precision/gsm8k_before_eval.log
 INFO - Eval stderr saved to: <WORKDIR>/store_test/results/20260417_072038_native_l2_qwen3_5_precision/gsm8k_before_eval_stderr.log
 INFO - Auto-detected eval result file: <WORKDIR>/store_test/results/20260417_072038_native_l2_qwen3_5_precision/gsm8k__home_admin_models_Qwen3.5-35B-A3B-FP8.json
 INFO - Parsed score from <WORKDIR>/store_test/results/20260417_072038_native_l2_qwen3_5_precision/gsm8k__home_admin_models_Qwen3.5-35B-A3B-FP8.json: 0.97
 INFO - Renamed eval output: gsm8k__home_admin_models_Qwen3.5-35B-A3B-FP8.json → gsm8k__home_admin_models_Qwen3.5-35B-A3B-FP8_before.json
 INFO - Renamed eval output: gsm8k__home_admin_models_Qwen3.5-35B-A3B-FP8.html → gsm8k__home_admin_models_Qwen3.5-35B-A3B-FP8_before.html
 INFO - [precision_gsm8k] Phase 2: Flushing cache
 INFO - Flushing cache (attempt 1/3): POST http://0.0.0.0:6178/flush_cache
 INFO - Cache flushed successfully (status: 200)
 INFO - [precision_gsm8k] Phase 3: Running second eval (gsm8k)
 INFO - Running eval: python3 -m sglang.test.run_eval --port 6178 --eval-name gsm8k --num-threads 32 --host 0.0.0.0 --num-examples 100 --output-dir <WORKDIR>/store_test/results/20260417_072038_native_l2_qwen3_5_precision --max-tokens 20480 --num-shots 5 --gsm8k-data-path <WORKDIR>/store_test/datasets/GSM8K.jsonl
 INFO - Eval stdout saved to: <WORKDIR>/store_test/results/20260417_072038_native_l2_qwen3_5_precision/gsm8k_after_eval.log
 INFO - Eval stderr saved to: <WORKDIR>/store_test/results/20260417_072038_native_l2_qwen3_5_precision/gsm8k_after_eval_stderr.log
 INFO - Auto-detected eval result file: <WORKDIR>/store_test/results/20260417_072038_native_l2_qwen3_5_precision/gsm8k__home_admin_models_Qwen3.5-35B-A3B-FP8.json
 INFO - Parsed score from <WORKDIR>/store_test/results/20260417_072038_native_l2_qwen3_5_precision/gsm8k__home_admin_models_Qwen3.5-35B-A3B-FP8.json: 0.99
 INFO - Renamed eval output: gsm8k__home_admin_models_Qwen3.5-35B-A3B-FP8.json → gsm8k__home_admin_models_Qwen3.5-35B-A3B-FP8_after.json
 INFO - Renamed eval output: gsm8k__home_admin_models_Qwen3.5-35B-A3B-FP8.html → gsm8k__home_admin_models_Qwen3.5-35B-A3B-FP8_after.html
 INFO - [precision_gsm8k] Precision test PASSED: before=0.97, after=0.99, diff=0.020000000000000018, tolerance=0.03
 INFO - Benchmark 'precision_gsm8k' run completed with status: success
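The pass criterion in the log above is a simple before/after score comparison under tolerance; a sketch of that check (the helper name is illustrative, not the harness's actual code):

```python
def precision_test_passed(before: float, after: float, tolerance: float) -> bool:
    """Pass if the post-flush score differs from the pre-flush score
    by no more than the tolerance (restates the check logged above)."""
    return abs(after - before) <= tolerance


# Values from the GSM8K log: before=0.97, after=0.99, tolerance=0.03
assert precision_test_passed(0.97, 0.99, 0.03)
```

The `diff=0.020000000000000018` in the log is ordinary floating-point rounding of `0.99 - 0.97`, well within the 0.03 tolerance.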

linjianyu233 commented Apr 21, 2026

Comparison of multi-turn conversation tests between HiCache disabled and HiCache with UnifiedRadixTree enabled.

  • deploy:
export SGLANG_ENABLE_UNIFIED_RADIX_TREE="1"
export SGLANG_LOG_MS="true"
export SGLANG_MM_FEATURE_CACHE_MB="4096"
export SGLANG_USE_CUDA_IPC_TRANSPORT="1"
export SGLANG_VLM_CACHE_SIZE_MB="0"

python3 -m sglang.launch_server \
    --model-path /home/models/Qwen3.5-397B-A17B-FP8 \
    --host 0.0.0.0 \
    --port 6178 \
    --tp 8 \
    --page-size 128 \
    --dtype auto \
    --mamba-scheduler-strategy extra_buffer \
    --kv-cache-dtype fp8_e4m3 \
    --cuda-graph-max-bs 64 \
    --chunked-prefill-size 65536 \
    --expert-parallel-size 8 \
    --enable-metrics \
    --enable-cache-report \
    --log-level info \
    --enable-hierarchical-cache \
    --hicache-size 40 \
    --hicache-mem-layout page_first_direct \
    --hicache-io-backend direct \
    --hicache-write-policy write_through \
    --mem-fraction-static 0.8
  • run benchmark:
python3 sglang/benchmark/hicache/bench_multiturn.py --output-length 8 --request-length 400 --num-clients 300 --num-rounds 19 --max-parallel 32 --request-rate 4 --ready-queue-policy random --disable-random-sample --disable-auto-run --enable-round-barrier --model-path /home/models/Qwen3.5-397B-A17B-FP8/ --port 6178 

HiCache disabled:
image
HiCache with UnifiedRadixTree:
image

@hzh0425 (Collaborator, Author) commented Apr 21, 2026

/rerun-test test_unified_radix_cache_unittest.py test_unified_radix_cache_bench.py test_unified_radix_cache_kl.py

@ispobock (Collaborator):

/tag-and-rerun-ci

@icepoint666 (Contributor):

Validated UnifiedRadixTree + L1+L2 HiCache on Qwen3.5-397B-A17B for both stability and correctness.

Deploy (TP16, 2 nodes, page_size=64)

export SGLANG_ENABLE_UNIFIED_RADIX_TREE="1"

python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3.5-397B-A17B \
    --host 0.0.0.0 \
    --port 30000 \
    --tp 16 \
    --nnodes 2 \
    --node-rank ${NODE_RANK} \
    --dist-init-addr <launcher_ip>:29500 \
    --page-size 64 \
    --mamba-scheduler-strategy extra_buffer \
    --enable-hierarchical-cache \
    --hicache-ratio 2 \
    --hicache-size 0 \
    --hicache-mem-layout page_first_direct \
    --hicache-io-backend direct \
    --hicache-write-policy write_through \
    --hicache-storage-prefetch-policy wait_complete

Stability (3 rounds × 2048 prompts, shared-prefix workload)

python3 -m sglang.bench_serving \
    --backend sglang \
    --base-url http://127.0.0.1:30000 \
    --model Qwen/Qwen3.5-397B-A17B \
    --dataset-name generated-shared-prefix \
    --gsp-num-groups 128 \
    --gsp-prompts-per-group 16 \
    --gsp-system-prompt-len 4096 \
    --gsp-question-len 512 \
    --gsp-output-len 128 \
    --num-prompts 2048 \
    --request-rate inf
| Round | Duration (s) | Input tok/s | Output tok/s | Median TTFT (s) |
|-------|--------------|-------------|--------------|-----------------|
| 1     | 2382         | 4063        | 110.1        | 1297            |
| 2     | 2033         | 4759        | 128.9        | 952             |
| 3     | 2027         | 4774        | 129.3        | 961             |

Rounds 2 and 3 reuse the shared prefix populated in Round 1. Compared to Round 1, duration drops 14.6% and median TTFT drops 27%; Rounds 2 and 3 are nearly identical, so cache reuse is stable. The server stayed healthy the entire time: no OOM, assert, or hang.
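The round-over-round deltas quoted above can be reproduced from the table with plain arithmetic (not project code):

```python
# Round 1 vs. Round 2 figures from the table above
dur_r1, dur_r2 = 2382, 2033      # duration in seconds
ttft_r1, ttft_r2 = 1297, 952     # median TTFT

dur_drop = (dur_r1 - dur_r2) / dur_r1 * 100    # ~14.65%, quoted as 14.6%
ttft_drop = (ttft_r1 - ttft_r2) / ttft_r1 * 100  # ~26.6%, quoted as ~27%
```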

Byte-level correctness

To check whether the L2 host copy is byte-identical to what the device actually wrote, I sampled SHA-256 fingerprints in both backup_from_device_all_layer (D→H) and load_to_device_per_layer (H→D).

The hook is installed via source-level append (every TP worker picks it up after spawn), samples 4 pages per call, and uses a 128-byte sketch SHA-256.

Result — aggregated across all 16 TP ranks:

| metric     | per-rank | total (×16) | meaning                                                    |
|------------|----------|-------------|------------------------------------------------------------|
| backup_n   | 1292     | 20672       | pages sampled after a D→H backup (fingerprint stored)      |
| load_n     | 48       | 768         | pages sampled before an H→D load (layer_id=0)              |
| verify_n   | 32       | 512         | loads where the stored fingerprint was found and compared  |
| match_n    | 32       | 512         | comparisons where stored == loaded                         |
| mismatch_n | 0        | 0           | comparisons where stored != loaded                         |

Every verified sample is byte-exact.
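A minimal sketch of the sampling scheme described above: hash a fixed-size prefix of each sampled page on D→H backup and re-check it before the H→D load. The 128-byte sketch size follows the comment; the function names and bookkeeping are illustrative, not the actual hook:

```python
import hashlib

SKETCH_BYTES = 128  # only the first 128 bytes of a page are hashed

_fingerprints = {}  # (layer_id, page_id) -> sha256 hex digest


def sketch(page: bytes) -> str:
    """128-byte-prefix SHA-256 fingerprint of a page buffer."""
    return hashlib.sha256(page[:SKETCH_BYTES]).hexdigest()


def on_backup(layer_id: int, page_id: int, device_page: bytes) -> None:
    # D→H path: record a fingerprint of what the device wrote.
    _fingerprints[(layer_id, page_id)] = sketch(device_page)


def on_load(layer_id: int, page_id: int, host_page: bytes) -> bool:
    # H→D path: compare the host copy against the stored fingerprint.
    stored = _fingerprints.get((layer_id, page_id))
    if stored is None:
        return True  # page was never sampled on backup; nothing to verify
    return stored == sketch(host_page)
```

In the real hook the page buffers are device/host tensors rather than `bytes`, and only a few pages per call are sampled; the comparison logic is the same.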

@sgl-project sgl-project deleted a comment from github-actions Bot Apr 21, 2026

hzh0425 commented Apr 21, 2026

/rerun-test test_unified_radix_cache_unittest.py test_unified_radix_cache_bench.py test_unified_radix_cache_kl.py

@github-actions (bot):

1-gpu-5090 (2 tests): View workflow run

cd test/ && python3 registered/unit/mem_cache/test_unified_radix_cache_unittest.py
cd test/ && python3 registered/unit/mem_cache/test_unified_radix_cache_bench.py

4-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/radix_cache/test_unified_radix_cache_kl.py

@huangtingwei9988 huangtingwei9988 self-assigned this Apr 21, 2026
req.init_next_round_input(self.tree_cache, cow_mamba=False)
last_host_node = req.last_host_node
if last_host_node.backuped or last_host_node is self.tree_cache.root_node:
    last_hash = last_host_node.get_last_hash_value()

UnifiedTreeNode is missing the get_last_hash_value method, so running this PR with the Mooncake HiCache backend results in the following exception:

AttributeError: 'UnifiedTreeNode' object has no attribute 'get_last_hash_value'

@riZZZhik commented Apr 24, 2026

Also, UnifiedRadixCache is missing prefetch-related methods such as prefetch_from_storage, check_prefetch_progress, ..., which are called by the scheduler.

@hzh0425 (Collaborator, Author):

Hi @riZZZhik,
This PR only supports L2 HiCache for now; the missing methods are used for L3 HiCache and will be supported in the next PR.

@hzh0425 (Collaborator, Author):

Please refer to #23316 (comment).

Comment thread test/registered/radix_cache/test_unified_radix_cache_kl.py
@ispobock ispobock merged commit c0f5950 into main May 3, 2026
171 of 190 checks passed
@ispobock ispobock deleted the hybrid_tree/hicache_integrate branch May 3, 2026 14:13

Labels

hicache Hierarchical Caching for SGLang high priority run-ci

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants