[RaidxTree Refactor]: Support Unified HybridRadixTree V2 by hzh0425 · Pull Request #21206 · sgl-project/sglang

hzh0425 · 2026-03-23T10:26:44Z

Motivation

Collabrate with @ispobock @yizhang2077 @pansicheng @xiezhq-hermann

MambaRadixCache and SWARadixCache maintain separate cache management logic with significant code duplication. Worse, each new attention variant (full, sliding-window, SSM, or future combinations) would require yet another standalone cache implementation, making the radix tree layer increasingly hard to maintain and extend.

HybridRadixCache replaces both with a unified tree structure and pluggable TreeComponents. New attention types can be supported by adding a TreeComponent — without touching the core tree logic (match / insert / evict). This makes it straightforward to handle hybrid models with arbitrary attention layer compositions.

Modifications

HybridRadixCache: Unified multi-component radix tree based on BasePrefixCache, enabled via --enable-hybrid-radix-tree(Temporarily defaulting to False), replacing MambaRadixCache / SWARadixCache. Core tree operations (match, insert, evict, LRU) are component-agnostic; each TreeComponent hooks into them via a well-defined interface.
MambaComponent / SWAComponent: Pluggable components handling SSM state caching (aligned to mamba_track_interval) and sliding-window KV caching (tombstone + in-window recovery) respectively.

Future plan:

Support HiCache with Hybrid Cache Controller (This means that mamba / swa / full can natively support HiCache without needing to implement another HiMamba or HiSWA Radix Tree.)
More Testing; Making HybridRadixCache as default option

Accuracy Tests

AIME 25 Repeat16 Test:

Qwen3-next :

with hybrid tree:
nemo-run_1/0 ----------------------------------------- aime25 ----------------------------------------
nemo-run_1/0 evaluation_mode   | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
nemo-run_1/0 pass@1[avg-of-16] | 30          | 6247       | 1147        | 64.38% ± 5.54%   | 0.00%    
nemo-run_1/0 majority@16       | 30          | 6247       | 1147        | 75.00%           | 0.00%    
nemo-run_1/0 pass@16           | 30          | 6247       | 1147        | 86.67%           | 0.00%  

with mambaRadixTree:
nemo-run_1/0 ----------------------------------------- aime25 ----------------------------------------
nemo-run_1/0 evaluation_mode   | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
nemo-run_1/0 pass@1[avg-of-16] | 30          | 6599       | 1228        | 64.79% ± 4.71%   | 0.00%    
nemo-run_1/0 majority@16       | 30          | 6599       | 1228        | 78.33%           | 0.00%    
nemo-run_1/0 pass@16           | 30          | 6599       | 1228        | 86.67%           | 0.00%

GPT-OSS-20B:

with hybrid tree:
nemo-run_1/0 ----------------------------------------- aime25 ----------------------------------------
nemo-run_1/0 evaluation_mode   | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
nemo-run_1/0 pass@1[avg-of-16] | 30          | 8751       | 854         | 73.33% ± 5.31%   | 5.62%    
nemo-run_1/0 majority@16       | 30          | 8751       | 854         | 86.67%           | 0.00%    
nemo-run_1/0 pass@16           | 30          | 8751       | 854         | 100.00%          | 0.00%    

with SWARadixTree:
nemo-run_1/0 ----------------------------------------- aime25 ----------------------------------------
nemo-run_1/0 evaluation_mode   | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
nemo-run_1/0 pass@1[avg-of-16] | 30          | 9249       | 925         | 71.67% ± 6.32%   | 8.96%    
nemo-run_1/0 majority@16       | 30          | 9249       | 925         | 90.00%           | 0.00%    
nemo-run_1/0 pass@16           | 30          | 9249       | 925         | 93.33%           | 0.00%

Benchmarking and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.

gemini-code-assist · 2026-03-23T10:27:08Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant architectural improvement to the memory cache management by refactoring the radix tree implementation into a unified, component-based system. This change aims to streamline the integration of diverse attention mechanisms, such as Mamba and Sliding Window Attention, by providing a flexible framework that can adapt to future attention types without requiring extensive modifications to the core caching logic. The new structure promotes modularity and simplifies the process of extending cache support for complex model architectures.

Highlights

Unified Radix Tree Structure: Replaced separate MambaRadixCache and SWARadixCache implementations with a single, unified HybridRadixCache, significantly reducing code duplication and improving maintainability.
Pluggable TreeComponents: Introduced a pluggable architecture using TreeComponents (e.g., MambaComponent, SWAComponent, FullComponent) to handle different attention types, allowing new attention variants to be supported by simply adding a new component without altering core tree logic.
Enhanced Extensibility: The new design makes it straightforward to support models with arbitrary combinations of attention layer compositions (e.g., interleaved full, sliding-window, and SSM attention).
New Server Argument: Added a --enable-hybrid-radix-tree flag to activate the new hybrid cache, with a temporary default to False.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

merrymercy · 2026-04-07T22:32:10Z

I feel we can just migrate the whole class to rust as the immediate next step before implementing other features. Several reasons

This class becomes more general and it maintains more metadata. The overhead of python is larger.
This class is very modular. It is all new code. We can easily rewrite it in rust without changing too many other places.
We will do many large scale tests to validate this and make it as the default path. If we do rust migration in this effort, we only need to run the test once and enable it by default. Otherwise, we need to run multiple times of large scale tests or maintain many code paths (old python, unified python, unified rust) at the same time in the future.

whybeyoung · 2026-04-08T16:35:38Z

I feel we can just migrate the whole class to rust as the immediate next step before implementing other features. Several reasons

This class becomes more general and it maintains more metadata. The overhead of python is larger.

This class is very modular. It is all new code. We can easily rewrite it in rust without changing too many other places.

We will do many large scale tests to validate this and make it as the default path. If we do rust migration in this effort, we only need to run the test once and enable it by default. Otherwise, we need to run multiple times of large scale tests or maintain many code paths (old python, unified python, unified rust) at the same time in the future.

definitely agreed~

ispobock · 2026-04-13T02:06:41Z

https://github.com/sgl-project/sglang/actions/runs/24282997210?pr=21206 all CI passed

nvpohanh · 2026-04-13T07:50:50Z

cc @hlu1 @YAMY1234

…#21206) Co-authored-by: ispobock <ispobaoke@gmail.com> Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com> Co-authored-by: yizhang2077 <1109276519@qq.com> Co-authored-by: xiezhq-hermann <xiezhq@stanford.edu>

…omponent The unified HybridRadixTree V2 (sgl-project#21206) landed after this branch was originally cut, and its MambaComponent still used the ping-pong buffer model. After merging main, the mamba_component path was broken — it referenced attributes and APIs that no longer exist (mamba_ping_pong_track_buffer, mamba_next_track_idx, get_mamba_ping_pong_other_idx, free_mamba_cache's mamba_ping_pong_track_buffer_to_keep kwarg). Port the pending_radix_mamba_slot producer-consumer model to the unified tree's component hooks: - prepare_for_caching_req (both finished and unfinished branches): replace ping-pong indexing with zero-copy ownership transfer of req.pending_radix_mamba_slot into insert_params.mamba_value. When no pending slot is available due to mamba pool pressure, return 0 so the caller takes the empty-key short-circuit in _insert_helper (finished) or the effective_cache_len<=0 early-return in cache_unfinished_req (unfinished). enable_mamba_extra_buffer=False keeps the legacy fork_from path unchanged. - cleanup_after_caching_req: drop the obsolete mamba_ping_pong_track_buffer_to_keep argument; on finished, free the pending slot when the tree already holds mamba state; on unfinished, pre-allocate the next pending slot off the forward hot path (evict once on failure, leave as None if the pool is fully locked). Also address review comments from yizhang2077 on the original path: - schedule_batch._mamba_prefix_cache_update: restore three explanatory comments that were dropped during the earlier ping-pong removal. - tree_component.prepare_for_caching_req docstring: update the Mamba description from ping-pong buffer to pending_radix_mamba_slot ownership transfer. - test_streaming_session_unit._FakeReq: rename mamba_ping_pong_track_buffer/mamba_next_track_idx fields to pending_radix_mamba_slot.

…#21206) Co-authored-by: ispobock <ispobaoke@gmail.com> Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com> Co-authored-by: yizhang2077 <1109276519@qq.com> Co-authored-by: xiezhq-hermann <xiezhq@stanford.edu>

hzh0425 and others added 20 commits March 19, 2026 11:31

Init support for hybrid radix tree

f2b5482

Support evict with priority

34c638e

tiny cleanup type annotations

d350e34

add swa component

372b04b

update type

4784b4d

Support unified cache_finished_req&cache_unfinished_req

dc5257b

Fix code lint

f12231e

update iteratively delete

387da78

remove cascade evict internal node

db4f034

move out dup kv free to avoid double free

45d1e97

update cascade internal

ca869a9

update insert hook for swa

eb7b4b1

skip leaf creation for corner case

d344977

fix full lru

aab9ed8

add swa entry

d7cfb0e

add swa kl test

18c3a0b

Release mamba when cache_finished_req return early

e9ef0bb

Add multi-turn kl test for mamba hybrid tree

3c0640a

add kl multiturn test for swa

41a4ab0

Add unit test for hybrid tree

b2246ca

hzh0425 assigned yizhang2077 and ispobock Mar 23, 2026

hzh0425 requested review from Ying1123, hanming-lu, hnyls2002, ispobock, merrymercy, xiezhq-hermann and yizhang2077 as code owners March 23, 2026 10:26

merrymercy reviewed Apr 7, 2026

View reviewed changes

Comment thread python/sglang/srt/mem_cache/unified_radix_cache.py Outdated

hzh0425 added 5 commits April 8, 2026 11:45

Fix commnet

112ac9a

Tmp fix perf using int enum

b25051c

Perf1: replace dict index with list index

fa04e70

Perf2: Using component_tuple to replace components.values()

ec4a079

Free full only in full component when swa component exist

693641d

hzh0425 and others added 9 commits April 9, 2026 14:30

perf3: Optimize lock/unlock perf

143c415

Optimize codeStyle

bdcb8df

Fix lint

e248e18

Add sanity_check for unified tree

7202a60

Update test

6e48292

fix suite

22a0ad9

fix test

898eb8d

Merge branch 'main' into hybrid/hybrid_radix_tree

f0a6992

fix ci

d119818

hzh0425 force-pushed the hybrid/hybrid_radix_tree branch from 0eb4b6f to d119818 Compare April 11, 2026 05:47

hzh0425 added 2 commits April 11, 2026 20:28

Merge branch 'main' into hybrid/hybrid_radix_tree

93db94e

fix lint

4870c1c

ispobock approved these changes Apr 13, 2026

View reviewed changes

ispobock merged commit bc59cc0 into main Apr 13, 2026
228 of 278 checks passed

ispobock deleted the hybrid/hybrid_radix_tree branch April 13, 2026 02:28

merrymercy reviewed Apr 13, 2026

View reviewed changes

Comment thread test/registered/unit/mem_cache/test_unified_radix_cache_bench.py

Comment thread test/registered/unit/mem_cache/test_unified_radix_cache_unittest.py

hzh0425 mentioned this pull request Apr 24, 2026

[Feature] UnifiedRadix support hicache for DeepSeekV4 #23639

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RaidxTree Refactor]: Support Unified HybridRadixTree V2#21206

[RaidxTree Refactor]: Support Unified HybridRadixTree V2#21206
ispobock merged 55 commits intomainfrom
hybrid/hybrid_radix_tree

hzh0425 commented Mar 23, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot commented Mar 23, 2026

Uh oh!

Uh oh!

merrymercy commented Apr 7, 2026 •

edited

Loading

Uh oh!

whybeyoung commented Apr 8, 2026

Uh oh!

ispobock commented Apr 13, 2026

Uh oh!

Uh oh!

nvpohanh commented Apr 13, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

Conversation

hzh0425 commented Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

Uh oh!

gemini-code-assist Bot commented Mar 23, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

Uh oh!

merrymercy commented Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

whybeyoung commented Apr 8, 2026

Uh oh!

ispobock commented Apr 13, 2026

Uh oh!

Uh oh!

nvpohanh commented Apr 13, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

hzh0425 commented Mar 23, 2026 •

edited

Loading

merrymercy commented Apr 7, 2026 •

edited

Loading