
Support swa HiCache for unified radix cache#23391

Merged
ispobock merged 15 commits into main from hybrid_tree/hicache_integrate_swa
May 6, 2026

Conversation

@ispobock
Collaborator

@ispobock ispobock commented Apr 21, 2026

Motivation

Add SWA support to HiCache on unified radix cache. Follow-up to #23316.

Benchmark

(benchmark results figure)

w/ HiCache:

SGLANG_ENABLE_UNIFIED_RADIX_TREE=1 \
python3 -m sglang.launch_server \
    --model-path openai/gpt-oss-120b \
    --tp 2 --port 30001 \
    --page-size 64 \
    --attention-backend fa3 --decode-attention-backend fa3 \
    --enable-hierarchical-cache \
    --hicache-ratio 2.0 \
    --hicache-write-policy write_through \
    --hicache-io-backend kernel

w/o HiCache:

SGLANG_ENABLE_UNIFIED_RADIX_TREE=1 \
python3 -m sglang.launch_server \
    --model-path openai/gpt-oss-120b \
    --tp 2 --port 30001 \
    --page-size 64 \
    --attention-backend fa3 --decode-attention-backend fa3

bench:

python3 ./benchmark/hicache/bench_multiturn.py \
    --model-path openai/gpt-oss-120b \
    --port 30001 \
    --disable-random-sample \
    --request-length 2048 --output-length 1024 \
    --num-clients 140 --num-rounds 12 \
    --max-parallel 32 --request-rate 4 \
    --ready-queue-policy random --disable-auto-run \
    --enable-round-barrier \
    --log-file metrics.jsonl

Accuracy

python3 benchmark/gsm8k/bench_sglang.py --num-questions 1400 --parallel 1400
Accuracy: 0.841
Invalid: 0.013
Latency: 130.863 s
Output throughput: 3288.482 token/s

@baoskee

baoskee commented Apr 27, 2026

Thank you for making this PR! I am excited for L3 file support.

@hzh0425
Collaborator

hzh0425 commented Apr 28, 2026

> thank you for making this PR! I am excited for L3 file support.

This PR currently supports only L2 SWA HiCache.

There are still several L2-support PRs that haven’t been merged yet.

Once the preceding PRs are merged, L3 support will be enabled immediately. @baoskee

@baoskee

baoskee commented Apr 28, 2026

> thank you for making this PR! I am excited for L3 file support.
>
> This PR currently supports only L2 SWA HiCache.
>
> There are still several L2-support PRs that haven’t been merged yet.
>
> Once the preceding PRs are merged, L3 support will be enabled immediately. @baoskee

Amazing, thank you. This is pretty critical to our business right now and you are saving it. I'm running this branch in production right now 😂 but once L3 is added it will reduce inference costs significantly more.

KV cache is the limiting factor right now for our long context inference setup.

@hzh0425
Collaborator

hzh0425 commented Apr 28, 2026

> thank you for making this PR! I am excited for L3 file support.
>
> This PR currently supports only L2 SWA HiCache.
> There are still several L2-support PRs that haven’t been merged yet.
> Once the preceding PRs are merged, L3 support will be enabled immediately. @baoskee
>
> Amazing, thank you. This is pretty critical to our business right now and you are saving it. I'm running this branch in production right now 😂 but once L3 is added it will reduce inference costs significantly more.
>
> KV cache is the limiting factor right now for our long context inference setup.

Just fixed a small issue; you can go ahead and update. Looking forward to your feedback!

@rarepepi

> thank you for making this PR! I am excited for L3 file support.
>
> This PR currently supports only L2 SWA HiCache.
> There are still several L2-support PRs that haven’t been merged yet.
> Once the preceding PRs are merged, L3 support will be enabled immediately. @baoskee
>
> Amazing, thank you. This is pretty critical to our business right now and you are saving it. I'm running this branch in production right now 😂 but once L3 is added it will reduce inference costs significantly more.
> KV cache is the limiting factor right now for our long context inference setup.
>
> Just fixed a small issue; you can go ahead and update. Looking forward to your feedback!

UnifiedRadixCache fails assertions: replicas die after about 45 minutes of accepting traffic.

Setup

  • Hardware: 8× B200 (single node), running 4 replicas at TP=2
  • Model: nvidia/Gemma-4-31B-IT-NVFP4
  • SGLANG_ENABLE_UNIFIED_RADIX_TREE=1
  • --enable-hierarchical-cache
  • --hicache-ratio 1.0
  • --hicache-write-policy write_through_selective
  • --hicache-storage-prefetch-policy wait_complete
  • --hicache-io-backend kernel
  • --hicache-mem-layout page_first
  • --page-size 64
  • --kv-cache-dtype fp8_e4m3
  • --mem-fraction-static 0.92
  • --context-length 32768
  • --chunked-prefill-size 4096
  • --schedule-policy lpm

Crashes (both in unified_radix_cache.py)

File ".../mem_cache/unified_radix_cache.py", line 537, in cache_unfinished_req
req.cache_protected_len <= len(new_indices) + self.page_size - 1
AssertionError: req.cache_protected_len=4288, len(new_indices)=0, page_aligned_len=4608

File ".../mem_cache/unified_radix_cache.py", line 1786, in sanity_check
AssertionError: Sanity check FAILED (2 violations across 980 nodes):
[INV-2] swa host LRU: +S3=set(), +lru={476}
[INV-5] swa in both device and host LRU: {476}
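
For context on the first assertion: with `--page-size 64`, the cache compares the request's protected prefix length against the newly produced, page-aligned KV indices. A minimal sketch of that arithmetic, plugging in the values from the report above (names here are illustrative, not the actual SGLang code):

```python
# Hypothetical sketch of the page-alignment arithmetic behind the first
# assertion; names and logic are illustrative, not the real SGLang code.
def page_aligned_len(num_tokens: int, page_size: int) -> int:
    """Round num_tokens up to a whole number of pages."""
    return ((num_tokens + page_size - 1) // page_size) * page_size

page_size = 64
cache_protected_len = 4288   # protected prefix length from the report
new_indices_len = 0          # len(new_indices) was 0 when the assert fired

# 4608 is already 72 full pages, so it aligns to itself
print(page_aligned_len(4608, page_size))   # 4608

# The assertion requires the protected prefix to be covered by the new
# indices modulo one partial page; with zero new indices it cannot hold:
print(cache_protected_len <= new_indices_len + page_size - 1)  # False
```

With `len(new_indices) == 0` the right-hand side is at most `page_size - 1 = 63`, so any nonzero protected prefix trips the assert, which matches the reported values.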

Really appreciate your help thank you!

@baoskee

baoskee commented May 3, 2026

Hello! Thank you again for making this PR.

I'm running this branch in production in tp=2 across four workers. This could be a race condition since it runs fine for a while but occasionally crashes (possibly when high rates of eviction are happening).

We're getting:

File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/mem_cache/unified_cache_components/swa_component.py", line 321, in drive_eviction
    self.cache._cascade_evict(x, self, tracker)
  File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/mem_cache/unified_radix_cache.py", line 892, in _cascade_evict
    assert cd.lock_ref == 0
           ^^^^^^^^^^^^^^^^
AssertionError

[2026-05-02 05:55:41 TP1] Scheduler hit an exception: Traceback (most recent call last):
  File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 3807, in run_scheduler_process
    scheduler.run_event_loop()
  File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 1394, in run_event_loop
    dispatch_event_loop(self)
  File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 3677, in dispatch_event_loop
    scheduler.event_loop_overlap()
  File "/home/baoskee/root/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 1444, in event_loop_overlap
    batch = self.get_next_batch_to_run()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 2410, in get_next_batch_to_run
    self.running_batch = self.update_running_batch(self.running_batch)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 2757, in update_running_batch
    batch.prepare_for_decode()
  File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/managers/schedule_batch.py", line 2225, in prepare_for_decode
    self.out_cache_loc = alloc_for_decode(self, token_per_req=1)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/mem_cache/common.py", line 444, in alloc_for_decode
    out_cache_loc = alloc_paged_token_slots_decode(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/mem_cache/common.py", line 405, in alloc_paged_token_slots_decode
    evict_from_tree_cache(tree_cache, num_tokens)
  File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/mem_cache/common.py", line 246, in evict_from_tree_cache
    tree_cache.evict(
  File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/mem_cache/unified_radix_cache.py", line 369, in evict
    component.drive_eviction(params=params, tracker=tracker)
  File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/mem_cache/unified_cache_components/swa_component.py", line 321, in drive_eviction
    self.cache._cascade_evict(x, self, tracker)
  File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/mem_cache/unified_radix_cache.py", line 892, in _cascade_evict
    assert cd.lock_ref == 0
           ^^^^^^^^^^^^^^^^
AssertionError

[2026-05-02 05:55:41] SIGQUIT received. signum=None, frame=None. It usually means one child failed.
Killed
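
The failing `assert cd.lock_ref == 0` expects every eviction candidate to be unreferenced, but under concurrent scheduling an in-flight request can still hold a lock when cascade eviction runs, consistent with the race-condition theory above. A minimal sketch (hypothetical, not SGLang's implementation) of an LRU evictor that skips locked nodes instead of asserting:

```python
# Illustrative LRU eviction with lock refcounts; not SGLang code.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Node:
    last_access: int                           # LRU ordering key
    key: str = field(compare=False)
    lock_ref: int = field(compare=False, default=0)

def evict_lru(nodes: list[Node], num_to_evict: int) -> list[str]:
    """Evict up to num_to_evict least-recently-used *unlocked* nodes."""
    heap = list(nodes)
    heapq.heapify(heap)                        # min-heap by last_access
    evicted = []
    while heap and len(evicted) < num_to_evict:
        node = heapq.heappop(heap)
        if node.lock_ref > 0:                  # still held by a request
            continue                           # skip instead of asserting
        evicted.append(node.key)
    return evicted

nodes = [Node(1, "a"), Node(2, "b", lock_ref=1), Node(3, "c")]
print(evict_lru(nodes, 2))  # ['a', 'c'] -- 'b' is locked and skipped
```

Whether skipping (versus treating a locked candidate as a hard invariant violation) is the right fix depends on how the unified radix tree maintains its LRU sets; the sketch only illustrates the failure mode.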

@hzh0425
Collaborator

hzh0425 commented May 3, 2026

I'll sync some fixes to this branch later; I've fixed a few issues, so you can give it another try.
@rarepepi @baoskee

@hzh0425
Collaborator

hzh0425 commented May 3, 2026

> Crashes (both in unified_radix_cache.py)
>
> File ".../mem_cache/unified_radix_cache.py", line 537, in cache_unfinished_req
> req.cache_protected_len <= len(new_indices) + self.page_size - 1
> AssertionError: req.cache_protected_len=4288, len(new_indices)=0, page_aligned_len=4608
>
> File ".../mem_cache/unified_radix_cache.py", line 1786, in sanity_check
> AssertionError: Sanity check FAILED (2 violations across 980 nodes):
> [INV-2] swa host LRU: +S3=set(), +lru={476}
> [INV-5] swa in both device and host LRU: {476}
>
> Really appreciate your help thank you!

This is a known issue, and I'll fix it in this branch. @rarepepi

Base automatically changed from hybrid_tree/hicache_integrate to main May 3, 2026 14:13
@ispobock ispobock force-pushed the hybrid_tree/hicache_integrate_swa branch from 2fcd434 to 5aa2023 on May 3, 2026 17:39
@v-shobhit

Hello! Will this also be applicable to deepseek-v4-pro?

@hzh0425
Collaborator

hzh0425 commented May 4, 2026

> Hello! Will this also be applicable to deepseek-v4-pro?

Yes: we developed HiCache for ds-v4 based on the swa_hicache branch. We'll wait until swa_hicache is merged and then rebase the ds branch. @v-shobhit

@ispobock
Collaborator Author

ispobock commented May 4, 2026

/tag-and-rerun-ci

@github-actions bot added the run-ci label on May 4, 2026
@ispobock ispobock merged commit eb5f0fb into main May 6, 2026
330 of 360 checks passed
@ispobock ispobock deleted the hybrid_tree/hicache_integrate_swa branch May 6, 2026 14:19
LLThomas pushed a commit to LLThomas/sglang that referenced this pull request May 8, 2026
Co-authored-by: hzh0425 <hzh0425@apache.org>
@baoskee

baoskee commented May 8, 2026

This doubled our throughput on B200 clusters. Thank you @ispobock and @hzh0425 🙏

@rarepepi

rarepepi commented May 9, 2026

> This doubled our throughput on B200 clusters. Thank you @ispobock and @hzh0425 🙏

Yes thank you!!


Labels

hicache (Hierarchical Caching for SGLang) · high priority · run-ci


6 participants