[HiCache] Add L2 prefetch-buffer-only memory mode #20535
vladnosiv wants to merge 56 commits into sgl-project:main
Conversation
Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
Summary of Changes (Gemini Code Assist): This pull request introduces a significant enhancement to HiCache's memory management by offering a buffer_only mode for host memory. This mode reconfigures the typically large, exclusive per-worker host memory pools into small, transient staging buffers. The change aims to drastically reduce memory duplication across workers and improve overall memory utilization, allowing more of the memory budget to be allocated to the shared storage backend. The changes are designed to boost cache hit rates and improve latency, as evidenced by the benchmarks below.
Code Review
This pull request introduces a buffer_only mode for HiCache's host memory, a significant feature for optimizing memory usage in high-throughput scenarios. The changes are extensive, touching core caching logic, memory management, and storage backends. The pending-write queue used to handle backpressure in buffer_only mode is a solid design choice, and the refactoring of the storage-backend interaction to use generic methods for the different KV-cache layouts is a good step toward better maintainability. I found one critical issue, a memory leak, which should be addressed.
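The pending-write queue mentioned in the review can be pictured with a minimal sketch. This is illustrative only, not the PR's actual code: the names (`PendingWriteQueue`, `submit`, `drain`) are hypothetical, and the key idea is simply that a bounded queue makes producers block when too many host-to-storage writes are outstanding, which is what provides backpressure when the staging buffer is small.

```python
import queue
import threading

class PendingWriteQueue:
    """Hypothetical sketch of a bounded pending-write queue with backpressure."""

    def __init__(self, max_pending: int):
        # A bounded queue: put() blocks once max_pending writes are queued,
        # so producers cannot overrun the small staging buffer.
        self._q = queue.Queue(maxsize=max_pending)

    def submit(self, page_id: int, data: bytes) -> None:
        # Blocks the caller when the queue is full (backpressure).
        self._q.put((page_id, data))

    def stop(self) -> None:
        # Sentinel tells the drain worker to exit.
        self._q.put(None)

    def drain(self, write_to_storage) -> None:
        # Background worker: complete each storage write, after which the
        # corresponding staging page could be freed immediately.
        while True:
            item = self._q.get()
            if item is None:
                break
            page_id, data = item
            write_to_storage(page_id, data)
            self._q.task_done()
```

A drain thread started with `threading.Thread(target=pwq.drain, args=(writer,))` consumes writes in FIFO order while `submit` callers are throttled by the queue bound.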
/tag-and-rerun-ci
Resolved conflict in hiradix_cache.py: combined upstream's InitLoadBackParams refactor with branch's buffer_only early return in init_load_back.
/rerun-failed-ci
/rerun-stage stage-b-test-1-gpu-large
✅ Triggered
@xiezhq-hermann Can we merge this PR if most of the CI tests have passed? The AMD CI tests always fail.
/rerun-stage stage-b-test-1-gpu-large
/rerun-failed-ci
I checked this path locally and I think the reviewer's concern is valid. In buffer_only mode, last_host_node is protected on the host side, but it is still not protected from GPU eviction because lock_ref is not held. So the current implementation leaves the prefetch anchor evictable while the prefetch is still in progress. It would be safer to explicitly pin the anchor for the full prefetch lifetime, release it on completion or abort, and add a regression test for that lifecycle.
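The pin-for-the-full-lifetime fix suggested above can be sketched as follows. This is a hedged illustration, not the PR's code: `TreeNode`, `inc_lock_ref`, and `dec_lock_ref` here are simplified stand-ins that mirror the radix-tree locking convention (`lock_ref > 0` means the node's GPU pages must not be evicted), and `do_prefetch` is a placeholder for the actual async prefetch.

```python
class TreeNode:
    """Simplified stand-in for a radix-tree node with a GPU lock refcount."""

    def __init__(self):
        self.lock_ref = 0  # > 0 means GPU pages of this node are pinned

def inc_lock_ref(node: TreeNode) -> None:
    node.lock_ref += 1

def dec_lock_ref(node: TreeNode) -> None:
    assert node.lock_ref > 0, "unbalanced lock release"
    node.lock_ref -= 1

def prefetch_with_pinned_anchor(anchor: TreeNode, do_prefetch) -> None:
    # Pin the anchor BEFORE the prefetch starts so it cannot be evicted
    # from GPU while the prefetch is in flight.
    inc_lock_ref(anchor)
    try:
        do_prefetch(anchor)
    finally:
        # Always release, on both normal completion and abort/exception,
        # so the refcount stays balanced over the prefetch lifecycle.
        dec_lock_ref(anchor)
```

A regression test for the lifecycle would assert that `lock_ref` returns to zero on both the success and the abort path.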
I wrote a demo to verify the correctness. REF: https://github.com/kvcache-ai/sglang/tree/copilot/pr20535-hicache-tests
# Conflicts: # python/sglang/srt/mem_cache/hiradix_cache.py
Hi @stmatengss @xiezhq-hermann !
Sorry for the delay. I will review it again and let you know.


Motivation
Currently, each HiCache worker allocates a large, exclusive host memory pool (up to hundreds of GB per single-GPU worker). Popular prefixes get duplicated in every worker's host cache, wasting the majority of the memory budget. Meanwhile, the shared storage backend (MoonCake) is allocated a relatively small pool and remains underutilized.
With buffer_only mode, host memory shrinks to a small staging buffer (up to dozens of GB per worker), and the freed budget goes entirely to the shared storage pool. Any worker can read data written by any other worker, eliminating duplication.
Modifications
- buffer_only host memory mode: host memory acts as a small, transient staging buffer instead of a persistent cache tier. Pages are freed immediately after the async storage write completes. Controlled via --hicache-host-memory-mode buffer_only and --hicache-buffer-pages.
- storage_backed flag on TreeNode: tracks whether a node is durably written to external storage, allowing GPU eviction without losing the node from the radix tree. Unified with backuped via a new storage_ready property.
Accuracy Tests
Qwen3/Qwen3-32B-FP8
With hot cache in MoonCake:
L2 as cache mode:
L2 as buffer only:
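The node-state flags described under Modifications can be pictured with a small sketch. This is a hedged reading of the PR description, not its actual code: the class body is a simplified stand-in for TreeNode, and the assumption (labeled here) is that storage_ready unifies backuped (a host-tier copy exists) with storage_backed (a durable external-storage copy exists), so a node with either copy can be evicted from GPU without being lost from the radix tree.

```python
class TreeNode:
    """Simplified stand-in for a radix-tree node's durability flags."""

    def __init__(self):
        self.backuped = False        # copy exists in the host (L2) cache tier
        self.storage_backed = False  # durably written to external storage

    @property
    def storage_ready(self) -> bool:
        # Assumption: "unified with backuped" means either copy suffices
        # for the GPU pages to be safely evictable.
        return self.backuped or self.storage_backed
```

Under this reading, buffer_only mode relies on storage_backed alone becoming true once the async write completes, since host pages are freed immediately afterward.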
Benchmarking and Profiling
Qwen3-32B-FP8 on 8xH200 with TP1.
Common flags:
Configuration for buffer_only: --hicache-host-memory-mode buffer_only --hicache-buffer-pages 512
Configuration for cache: --hicache-host-memory-mode cache --hicache-ratio 2.0 (107 GB host memory per worker)
MoonCake Traces bench with toolagent workload and 0.5 slowdown factor.
[results table: buffer_only vs. cache]