Skip to content

[RaidxTree Refactor]: Support Unified HybridRadixTree V2#21206

Merged
ispobock merged 55 commits intomainfrom
hybrid/hybrid_radix_tree
Apr 13, 2026
Merged

[RaidxTree Refactor]: Support Unified HybridRadixTree V2#21206
ispobock merged 55 commits intomainfrom
hybrid/hybrid_radix_tree

Conversation

@hzh0425
Copy link
Copy Markdown
Collaborator

@hzh0425 hzh0425 commented Mar 23, 2026

Motivation

Collabrate with @ispobock @yizhang2077 @pansicheng @xiezhq-hermann

MambaRadixCache and SWARadixCache maintain separate cache management logic with significant code duplication. Worse, each new attention variant (full, sliding-window, SSM, or future combinations) would require yet another standalone cache implementation, making the radix tree layer increasingly hard to maintain and extend.

HybridRadixCache replaces both with a unified tree structure and pluggable TreeComponents. New attention types can be supported by adding a TreeComponent — without touching the core tree logic (match / insert / evict). This makes it straightforward to handle hybrid models with arbitrary attention layer compositions.

Modifications

  • HybridRadixCache: Unified multi-component radix tree based on BasePrefixCache, enabled via --enable-hybrid-radix-tree(Temporarily defaulting to False), replacing MambaRadixCache / SWARadixCache. Core tree operations (match, insert, evict, LRU) are component-agnostic; each TreeComponent hooks into them via a well-defined interface.
  • MambaComponent / SWAComponent: Pluggable components handling SSM state caching (aligned to mamba_track_interval) and sliding-window KV caching (tombstone + in-window recovery) respectively.

Future plan:

  • Support HiCache with Hybrid Cache Controller (This means that mamba / swa / full can natively support HiCache without needing to implement another HiMamba or HiSWA Radix Tree.)
  • More Testing; Making HybridRadixCache as default option

Accuracy Tests

AIME 25 Repeat16 Test:

Qwen3-next :

with hybrid tree:
nemo-run_1/0 ----------------------------------------- aime25 ----------------------------------------
nemo-run_1/0 evaluation_mode   | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
nemo-run_1/0 pass@1[avg-of-16] | 30          | 6247       | 1147        | 64.38% ± 5.54%   | 0.00%    
nemo-run_1/0 majority@16       | 30          | 6247       | 1147        | 75.00%           | 0.00%    
nemo-run_1/0 pass@16           | 30          | 6247       | 1147        | 86.67%           | 0.00%  

with mambaRadixTree:
nemo-run_1/0 ----------------------------------------- aime25 ----------------------------------------
nemo-run_1/0 evaluation_mode   | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
nemo-run_1/0 pass@1[avg-of-16] | 30          | 6599       | 1228        | 64.79% ± 4.71%   | 0.00%    
nemo-run_1/0 majority@16       | 30          | 6599       | 1228        | 78.33%           | 0.00%    
nemo-run_1/0 pass@16           | 30          | 6599       | 1228        | 86.67%           | 0.00% 

GPT-OSS-20B:

with hybrid tree:
nemo-run_1/0 ----------------------------------------- aime25 ----------------------------------------
nemo-run_1/0 evaluation_mode   | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
nemo-run_1/0 pass@1[avg-of-16] | 30          | 8751       | 854         | 73.33% ± 5.31%   | 5.62%    
nemo-run_1/0 majority@16       | 30          | 8751       | 854         | 86.67%           | 0.00%    
nemo-run_1/0 pass@16           | 30          | 8751       | 854         | 100.00%          | 0.00%    

with SWARadixTree:
nemo-run_1/0 ----------------------------------------- aime25 ----------------------------------------
nemo-run_1/0 evaluation_mode   | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
nemo-run_1/0 pass@1[avg-of-16] | 30          | 9249       | 925         | 71.67% ± 6.32%   | 8.96%    
nemo-run_1/0 majority@16       | 30          | 9249       | 925         | 90.00%           | 0.00%    
nemo-run_1/0 pass@16           | 30          | 9249       | 925         | 93.33%           | 0.00% 

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant architectural improvement to the memory cache management by refactoring the radix tree implementation into a unified, component-based system. This change aims to streamline the integration of diverse attention mechanisms, such as Mamba and Sliding Window Attention, by providing a flexible framework that can adapt to future attention types without requiring extensive modifications to the core caching logic. The new structure promotes modularity and simplifies the process of extending cache support for complex model architectures.

Highlights

  • Unified Radix Tree Structure: Replaced separate MambaRadixCache and SWARadixCache implementations with a single, unified HybridRadixCache, significantly reducing code duplication and improving maintainability.
  • Pluggable TreeComponents: Introduced a pluggable architecture using TreeComponents (e.g., MambaComponent, SWAComponent, FullComponent) to handle different attention types, allowing new attention variants to be supported by simply adding a new component without altering core tree logic.
  • Enhanced Extensibility: The new design makes it straightforward to support models with arbitrary combinations of attention layer compositions (e.g., interleaved full, sliding-window, and SSM attention).
  • New Server Argument: Added a --enable-hybrid-radix-tree flag to activate the new hybrid cache, with a temporary default to False.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Comment thread python/sglang/srt/mem_cache/unified_radix_cache.py Outdated
@merrymercy
Copy link
Copy Markdown
Contributor

merrymercy commented Apr 7, 2026

I feel we can just migrate the whole class to rust as the immediate next step before implementing other features. Several reasons

  1. This class becomes more general and it maintains more metadata. The overhead of python is larger.
  2. This class is very modular. It is all new code. We can easily rewrite it in rust without changing too many other places.
  3. We will do many large scale tests to validate this and make it as the default path. If we do rust migration in this effort, we only need to run the test once and enable it by default. Otherwise, we need to run multiple times of large scale tests or maintain many code paths (old python, unified python, unified rust) at the same time in the future.

@whybeyoung
Copy link
Copy Markdown
Collaborator

I feel we can just migrate the whole class to rust as the immediate next step before implementing other features. Several reasons

  1. This class becomes more general and it maintains more metadata. The overhead of python is larger.
  2. This class is very modular. It is all new code. We can easily rewrite it in rust without changing too many other places.
  3. We will do many large scale tests to validate this and make it as the default path. If we do rust migration in this effort, we only need to run the test once and enable it by default. Otherwise, we need to run multiple times of large scale tests or maintain many code paths (old python, unified python, unified rust) at the same time in the future.

definitely agreed~

@hzh0425 hzh0425 force-pushed the hybrid/hybrid_radix_tree branch from 0eb4b6f to d119818 Compare April 11, 2026 05:47
@ispobock
Copy link
Copy Markdown
Collaborator

@ispobock ispobock merged commit bc59cc0 into main Apr 13, 2026
228 of 278 checks passed
@ispobock ispobock deleted the hybrid/hybrid_radix_tree branch April 13, 2026 02:28
@nvpohanh
Copy link
Copy Markdown
Collaborator

cc @hlu1 @YAMY1234

Comment thread test/registered/unit/mem_cache/test_unified_radix_cache_bench.py
Comment thread test/registered/unit/mem_cache/test_unified_radix_cache_unittest.py
pyc96 pushed a commit to pyc96/sglang that referenced this pull request Apr 14, 2026
…#21206)

Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>
Co-authored-by: yizhang2077 <1109276519@qq.com>
Co-authored-by: xiezhq-hermann <xiezhq@stanford.edu>
YAMY1234 added a commit to YAMY1234/sglang that referenced this pull request Apr 17, 2026
…omponent

The unified HybridRadixTree V2 (sgl-project#21206) landed after this branch was
originally cut, and its MambaComponent still used the ping-pong buffer
model. After merging main, the mamba_component path was broken —
it referenced attributes and APIs that no longer exist
(mamba_ping_pong_track_buffer, mamba_next_track_idx,
get_mamba_ping_pong_other_idx, free_mamba_cache's
mamba_ping_pong_track_buffer_to_keep kwarg).

Port the pending_radix_mamba_slot producer-consumer model to the
unified tree's component hooks:

- prepare_for_caching_req (both finished and unfinished branches):
  replace ping-pong indexing with zero-copy ownership transfer of
  req.pending_radix_mamba_slot into insert_params.mamba_value. When
  no pending slot is available due to mamba pool pressure, return 0
  so the caller takes the empty-key short-circuit in _insert_helper
  (finished) or the effective_cache_len<=0 early-return in
  cache_unfinished_req (unfinished). enable_mamba_extra_buffer=False
  keeps the legacy fork_from path unchanged.

- cleanup_after_caching_req: drop the obsolete
  mamba_ping_pong_track_buffer_to_keep argument; on finished, free
  the pending slot when the tree already holds mamba state; on
  unfinished, pre-allocate the next pending slot off the forward
  hot path (evict once on failure, leave as None if the pool is
  fully locked).

Also address review comments from yizhang2077 on the original path:

- schedule_batch._mamba_prefix_cache_update: restore three explanatory
  comments that were dropped during the earlier ping-pong removal.

- tree_component.prepare_for_caching_req docstring: update the Mamba
  description from ping-pong buffer to pending_radix_mamba_slot
  ownership transfer.

- test_streaming_session_unit._FakeReq: rename
  mamba_ping_pong_track_buffer/mamba_next_track_idx fields to
  pending_radix_mamba_slot.
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
…#21206)

Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>
Co-authored-by: yizhang2077 <1109276519@qq.com>
Co-authored-by: xiezhq-hermann <xiezhq@stanford.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation high priority run-ci

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants