Conversation
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a significant optimization for sparse attention models called 'HiSparse'. The primary goal is to enhance the efficiency of KV cache management by offloading less frequently used tokens to CPU memory, thereby freeing up valuable GPU resources. This system is designed to improve batch size and overall throughput, particularly benefiting models that use Native Sparse Attention (NSA), such as DeepSeek-V3.2 and GLM-5. The changes span from low-level CUDA kernel implementations for fast data movement to high-level Python coordination for memory allocation and scheduling, forming a comprehensive solution for hierarchical sparse attention.
Activity
```python
self.init_diffusion_llm(dllm_config)

# For hisparse
self.staging = False
```
The name 'staging' is very generic. It's already used in dllm and may cause confusion; a more specific name for hisparse would be better.
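For illustration, a more scoped name could look like the sketch below; the class name and the attribute name `hisparse_staging` are hypothetical suggestions, not code from this PR:

```python
class Scheduler:
    def __init__(self):
        # Hypothetical rename: scoping the attribute to the feature avoids
        # confusion with the `staging` name already used by dllm.
        self.hisparse_staging = False
```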
| "Required fields: algorithm (str), backend (str). " | ||
| "All other fields are algorithm-specific and passed to the algorithm constructor. " | ||
| 'Example: \'{"algorithm": "quest", "backend": "flashattention", "sparsity_ratio": 0.7, "min_sparse_prompt_len": 2048}\'', | ||
| 'Example: \'{"top_k": 2048, "device_buffer_size": 4096}\'', |
There are several configs here. We may need a more detailed usage guide for users later.
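For reference, a minimal sketch of how the help text above implies the config string is consumed; the parsing code is an illustration only, not the PR's implementation:

```python
import json

# Parse the sparse-attention config string shown in the help text above.
raw = '{"algorithm": "quest", "backend": "flashattention", "sparsity_ratio": 0.7, "min_sparse_prompt_len": 2048}'
cfg = json.loads(raw)
algorithm = cfg.pop("algorithm")  # required field
backend = cfg.pop("backend")      # required field
# Per the help text, the remaining fields (sparsity_ratio,
# min_sparse_prompt_len) are passed to the algorithm constructor as kwargs.
print(algorithm, backend, cfg)
```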
```python
req: Req


class HiSparseCoordinator:
```
Many of the component logics in this PR should have unit/e2e test coverage (ref). We can add them in a follow-up PR.
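As one example of what such coverage could check (a sketch only, requiring a CUDA device; the shapes and the round-trip-equality criterion are assumptions, not taken from this PR):

```python
import torch

def test_kv_offload_roundtrip():
    # Offloaded KV data must survive a GPU -> pinned host -> GPU round trip
    # bit-exactly; the tensor shape here is illustrative.
    kv_gpu = torch.randn(4, 128, 8, 64, device="cuda")
    host_buf = torch.empty(
        kv_gpu.shape, dtype=kv_gpu.dtype, device="cpu", pin_memory=True
    )
    host_buf.copy_(kv_gpu, non_blocking=True)  # async D2H via pinned memory
    torch.cuda.synchronize()
    restored = host_buf.to("cuda", non_blocking=True)
    torch.cuda.synchronize()
    assert torch.equal(kv_gpu, restored)
```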
```python
# Merge the new batch into the running batch.
if not self.last_batch.is_empty():
    if self.enable_hisparse:
```
Here we have a branch that diverges from the original handling logic, which will need extra maintenance whenever the original branch is refactored or changed. Do we have a plan to merge it into the original branch with incremental changes?
cc: @hnyls2002 for double check
I think for all non-hisparse paths, nothing breaks.
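For the record, one hedged way to reduce the divergence (the method and helper names here are hypothetical, not this PR's code) would be to keep a single merge path and isolate the hisparse work in a hook:

```python
def merge_running_batch(self):
    # Single shared merge path; hisparse-specific work lives in one hook so
    # refactors to the original logic automatically cover both modes.
    if self.last_batch.is_empty():
        return
    if self.enable_hisparse:
        # Hypothetical hook, e.g. evicting idle KV entries to host memory
        # before the merge.
        self._hisparse_pre_merge(self.last_batch)
    self.running_batch.merge_batch(self.last_batch)
```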
Motivation
This PR introduces HiSparse, which leverages CPU memory to store idle KV cache during decoding, thereby increasing batch size and improving throughput for models that use the NSA sparse attention mechanism, such as DeepSeek-V3.2 and GLM-5.
This PR follows a prior attempt to support hicache for sparse models, #14619, also contributed by @hzh0425 and @huangtingwei9988.
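To make the mechanism concrete, here is a minimal sketch of the idea at the PyTorch level; the function names and buffer layout are illustrative assumptions, and the actual PR uses dedicated CUDA kernels for this data movement:

```python
import torch

def offload_idle_kv(kv_gpu: torch.Tensor, copy_stream: torch.cuda.Stream) -> torch.Tensor:
    """Move an idle KV block to pinned host memory, freeing its GPU slot."""
    host = torch.empty(
        kv_gpu.shape, dtype=kv_gpu.dtype, device="cpu", pin_memory=True
    )
    with torch.cuda.stream(copy_stream):       # overlap the copy with compute
        host.copy_(kv_gpu, non_blocking=True)  # async D2H via pinned memory
    return host

def fetch_kv(kv_cpu: torch.Tensor, copy_stream: torch.cuda.Stream) -> torch.Tensor:
    """Bring an offloaded KV block back to the GPU when it is needed again."""
    with torch.cuda.stream(copy_stream):
        return kv_cpu.to("cuda", non_blocking=True)  # async H2D
```

The GPU slots freed this way can be reassigned to other requests, which is what allows a larger running batch during decode.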
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci