Support swa HiCache for unified radix cache #23391
|
Thank you for making this PR! I am excited for L3 file support. |
This PR currently supports only L2 SWA HiCache. There are still several L2-support PRs that haven’t been merged yet. Once the preceding PRs are merged, L3 support will be enabled immediately. @baoskee |
Amazing, thank you. This is pretty critical to our business right now and you are saving it. I'm running this branch in production right now 😂 but once L3 is added it will reduce inference costs significantly more. KV cache is the limiting factor right now for our long context inference setup. |
Just fixed a small issue—you can go ahead and update. Looking forward to your feedback! |
UnifiedRadixCache errors with assertions: replicas die after 45 mins of accepting traffic.
Setup
Crashes (both in unified_radix_cache.py)
File ".../mem_cache/unified_radix_cache.py", line 537, in cache_unfinished_req
File ".../mem_cache/unified_radix_cache.py", line 1786, in sanity_check
Really appreciate your help, thank you! |
|
Hello! Thank you again for making this PR. I'm running this branch in production with tp=2 across four workers. This could be a race condition, since it runs fine for a while but occasionally crashes (possibly when high rates of eviction are happening). We're getting:
File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/mem_cache/unified_cache_components/swa_component.py", line 321, in drive_eviction
self.cache._cascade_evict(x, self, tracker)
File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/mem_cache/unified_radix_cache.py", line 892, in _cascade_evict
assert cd.lock_ref == 0
^^^^^^^^^^^^^^^^
AssertionError
[2026-05-02 05:55:41 TP1] Scheduler hit an exception: Traceback (most recent call last):
File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 3807, in run_scheduler_process
scheduler.run_event_loop()
File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 1394, in run_event_loop
dispatch_event_loop(self)
File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 3677, in dispatch_event_loop
scheduler.event_loop_overlap()
File "/home/baoskee/root/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 1444, in event_loop_overlap
batch = self.get_next_batch_to_run()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 2410, in get_next_batch_to_run
self.running_batch = self.update_running_batch(self.running_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 2757, in update_running_batch
batch.prepare_for_decode()
File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/managers/schedule_batch.py", line 2225, in prepare_for_decode
self.out_cache_loc = alloc_for_decode(self, token_per_req=1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/mem_cache/common.py", line 444, in alloc_for_decode
out_cache_loc = alloc_paged_token_slots_decode(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/mem_cache/common.py", line 405, in alloc_paged_token_slots_decode
evict_from_tree_cache(tree_cache, num_tokens)
File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/mem_cache/common.py", line 246, in evict_from_tree_cache
tree_cache.evict(
File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/mem_cache/unified_radix_cache.py", line 369, in evict
component.drive_eviction(params=params, tracker=tracker)
File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/mem_cache/unified_cache_components/swa_component.py", line 321, in drive_eviction
self.cache._cascade_evict(x, self, tracker)
File "/home/baoskee/root/.venv/lib/python3.12/site-packages/sglang/srt/mem_cache/unified_radix_cache.py", line 892, in _cascade_evict
assert cd.lock_ref == 0
^^^^^^^^^^^^^^^^
AssertionError
[2026-05-02 05:55:41] SIGQUIT received. signum=None, frame=None. It usually means one child failed.
Killed |
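For readers following the traceback above, here is a toy sketch of the kind of invariant that `assert cd.lock_ref == 0` appears to guard: a node pinned by an in-flight request must not be evicted. Everything in it (`ToyNode`, `inc_lock`, `dec_lock`, `cascade_evict`) is a hypothetical stand-in written for illustration, not the code in `unified_radix_cache.py`, and it does not claim to reproduce the crash reported here.

```python
from typing import Dict, List, Optional


class ToyNode:
    """Hypothetical cache-tree node; NOT SGLang's node class."""

    def __init__(self, name: str, parent: Optional["ToyNode"] = None) -> None:
        self.name = name
        self.parent = parent
        self.children: Dict[str, "ToyNode"] = {}
        self.lock_ref = 0  # count of in-flight requests pinning this node

    def add_child(self, name: str) -> "ToyNode":
        child = ToyNode(name, parent=self)
        self.children[name] = child
        return child


def inc_lock(node: Optional[ToyNode]) -> None:
    """Pin a node and all of its ancestors (assumed locking scheme)."""
    while node is not None:
        node.lock_ref += 1
        node = node.parent


def dec_lock(node: Optional[ToyNode]) -> None:
    """Release one pin on a node and all of its ancestors."""
    while node is not None:
        assert node.lock_ref > 0, f"{node.name} was not locked"
        node.lock_ref -= 1
        node = node.parent


def cascade_evict(node: ToyNode) -> List[str]:
    """Evict a node and its descendants; every one must be unlocked."""
    evicted: List[str] = []
    for child in list(node.children.values()):
        evicted.extend(cascade_evict(child))
    # The invariant: nothing that is still pinned may be evicted.
    assert node.lock_ref == 0, f"{node.name} is still locked"
    if node.parent is not None:
        del node.parent.children[node.name]
    evicted.append(node.name)
    return evicted


if __name__ == "__main__":
    root = ToyNode("root")
    a = root.add_child("a")
    b = a.add_child("b")

    inc_lock(b)  # a running request pins b, a, and root
    try:
        cascade_evict(a)
    except AssertionError as exc:
        print("eviction refused:", exc)

    dec_lock(b)  # the request finishes and releases its pins
    print("evicted:", cascade_evict(a))
```

In this toy model the assertion fires whenever eviction reaches a subtree that a request still holds, which is one way such a failure could surface only under heavy eviction pressure.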
This is a known issue, and I'll fix it in this branch. @rarepepi |
2fcd434 to 5aa2023
|
Hello! Will this also be applicable to deepseek-v4-pro? |
Yes; based on the swa_hicache branch, we developed HiCache for ds-v4. |
|
/tag-and-rerun-ci |
Co-authored-by: hzh0425 <hzh0425@apache.org>
Motivation
Add SWA support to HiCache on unified radix cache. Follow-up to #23316.
Benchmark
w/ HiCache:
SGLANG_ENABLE_UNIFIED_RADIX_TREE=1 \
python3 -m sglang.launch_server \
  --model-path openai/gpt-oss-120b \
  --tp 2 --port 30001 \
  --page-size 64 \
  --attention-backend fa3 --decode-attention-backend fa3 \
  --enable-hierarchical-cache \
  --hicache-ratio 2.0 \
  --hicache-write-policy write_through \
  --hicache-io-backend kernel
w/o HiCache:
SGLANG_ENABLE_UNIFIED_RADIX_TREE=1 \
python3 -m sglang.launch_server \
  --model-path openai/gpt-oss-120b \
  --tp 2 --port 30001 \
  --page-size 64 \
  --attention-backend fa3 --decode-attention-backend fa3
bench:
Accuracy