[RadixTree Refactor]: Support Unified HybridRadixTree V2 #21206
Conversation
Summary of Changes (Gemini Code Assist): This pull request introduces a significant architectural improvement to memory cache management by refactoring the radix tree implementation into a unified, component-based system. The change streamlines the integration of diverse attention mechanisms, such as Mamba and Sliding Window Attention, by providing a flexible framework that can adapt to future attention types without extensive modifications to the core caching logic. The new structure promotes modularity and simplifies extending cache support for complex model architectures.
I feel we can just migrate the whole class to Rust as the immediate next step, before implementing other features. Several reasons:
Definitely agreed~
…#21206)

Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>
Co-authored-by: yizhang2077 <1109276519@qq.com>
Co-authored-by: xiezhq-hermann <xiezhq@stanford.edu>
…omponent

The unified HybridRadixTree V2 (sgl-project#21206) landed after this branch was originally cut, and its MambaComponent still used the ping-pong buffer model. After merging main, the mamba_component path was broken — it referenced attributes and APIs that no longer exist (mamba_ping_pong_track_buffer, mamba_next_track_idx, get_mamba_ping_pong_other_idx, free_mamba_cache's mamba_ping_pong_track_buffer_to_keep kwarg).

Port the pending_radix_mamba_slot producer-consumer model to the unified tree's component hooks:

- prepare_for_caching_req (both finished and unfinished branches): replace ping-pong indexing with zero-copy ownership transfer of req.pending_radix_mamba_slot into insert_params.mamba_value. When no pending slot is available due to mamba pool pressure, return 0 so the caller takes the empty-key short-circuit in _insert_helper (finished) or the effective_cache_len<=0 early-return in cache_unfinished_req (unfinished). enable_mamba_extra_buffer=False keeps the legacy fork_from path unchanged.
- cleanup_after_caching_req: drop the obsolete mamba_ping_pong_track_buffer_to_keep argument; on finished, free the pending slot when the tree already holds mamba state; on unfinished, pre-allocate the next pending slot off the forward hot path (evict once on failure, leave as None if the pool is fully locked).

Also address review comments from yizhang2077 on the original path:

- schedule_batch._mamba_prefix_cache_update: restore three explanatory comments that were dropped during the earlier ping-pong removal.
- tree_component.prepare_for_caching_req docstring: update the Mamba description from ping-pong buffer to pending_radix_mamba_slot ownership transfer.
- test_streaming_session_unit._FakeReq: rename mamba_ping_pong_track_buffer/mamba_next_track_idx fields to pending_radix_mamba_slot.
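The ownership-transfer idea in the commit message above can be sketched as follows. This is a minimal illustration, not the actual sglang code: `FakeReq`, `InsertParams`, and the integer slot indices are simplified stand-ins, and the return-0 convention mirrors the empty-key short-circuit described in the commit message.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InsertParams:
    # Simplified stand-in: slot index into the mamba state pool, or None.
    mamba_value: Optional[int] = None

@dataclass
class FakeReq:
    # Slot pre-allocated off the hot path by cleanup_after_caching_req;
    # None means the pool was under pressure and no slot is available.
    pending_radix_mamba_slot: Optional[int] = None

def prepare_for_caching_req(req: FakeReq, insert_params: InsertParams) -> int:
    """Zero-copy ownership transfer of the pending slot into the tree.

    Returns 0 when no slot is available, so the caller takes the
    empty-key short-circuit instead of inserting partial mamba state.
    """
    if req.pending_radix_mamba_slot is None:
        return 0
    # Move, don't copy: the tree now owns the slot, the request does not.
    insert_params.mamba_value = req.pending_radix_mamba_slot
    req.pending_radix_mamba_slot = None
    return 1
```

The key property is that exactly one owner (request or tree) holds the slot at any time, so no double-free or duplicated state copy can occur under pool pressure.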
Motivation
Collaborate with @ispobock @yizhang2077 @pansicheng @xiezhq-hermann
MambaRadixCache and SWARadixCache maintain separate cache management logic with significant code duplication. Worse, each new attention variant (full, sliding-window, SSM, or future combinations) would require yet another standalone cache implementation, making the radix tree layer increasingly hard to maintain and extend.
HybridRadixCache replaces both with a unified tree structure and pluggable TreeComponents. New attention types can be supported by adding a TreeComponent — without touching the core tree logic (match / insert / evict). This makes it straightforward to handle hybrid models with arbitrary attention layer compositions.
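The pluggable-component design described above can be sketched roughly as follows. This is a hedged illustration of the pattern, not the real sglang interface: the hook names follow the ones mentioned elsewhere in this PR (`prepare_for_caching_req`, `cleanup_after_caching_req`), but `FullAttentionComponent`, `HybridRadixCache.cache_req`, and the min-over-components rule are simplified assumptions.

```python
from abc import ABC, abstractmethod

class TreeComponent(ABC):
    """Per-attention-type hooks invoked by the unified tree.

    The core tree logic (match / insert / evict) stays attention-agnostic;
    supporting a new attention type means adding a new component, not
    touching the tree itself.
    """

    @abstractmethod
    def prepare_for_caching_req(self, req: dict, insert_params: dict) -> int:
        """Return how many tokens this component can cache for the request."""

    @abstractmethod
    def cleanup_after_caching_req(self, req: dict) -> None:
        """Release or pre-allocate per-request resources after insertion."""

class FullAttentionComponent(TreeComponent):
    # Full attention has per-token KV state, so every token is cacheable.
    def prepare_for_caching_req(self, req: dict, insert_params: dict) -> int:
        return len(req.get("tokens", []))

    def cleanup_after_caching_req(self, req: dict) -> None:
        pass  # nothing to release in this simplified sketch

class HybridRadixCache:
    """Unified tree dispatching to one component per attention family."""

    def __init__(self, components: list[TreeComponent]):
        self.components = list(components)

    def cache_req(self, req: dict) -> int:
        insert_params: dict = {}
        # Each component reports its cacheable length; the tree can only
        # cache the prefix every layer family agrees on (assumed rule).
        lens = [c.prepare_for_caching_req(req, insert_params)
                for c in self.components]
        cacheable = min(lens)
        for c in self.components:
            c.cleanup_after_caching_req(req)
        return cacheable
```

Under this shape, a hybrid model mixing full, sliding-window, and SSM layers would instantiate one component per family, and a component returning 0 (e.g. under mamba pool pressure) naturally short-circuits caching for the whole request.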
Modifications
Future plan:
Accuracy Tests
AIME 25 Repeat16 Test:
Qwen3-next :
GPT-OSS-20B:
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci