
HiSparse for Sparse Attention #20343

Merged

hnyls2002 merged 23 commits into main from hisparse on Mar 23, 2026

Conversation

@xiezhq-hermann (Collaborator) commented Mar 11, 2026

Motivation

This PR introduces HiSparse, which leverages CPU memory to store idle KV cache during decoding, thereby increasing batch size and improving throughput for models that use the NSA sparse attention mechanism, such as DeepSeek-V3.2 and GLM-5.
This PR follows a prior attempt to support HiCache for sparse models, #14619, contributed by @hzh0425 and @huangtingwei9988.

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant optimization for sparse attention models called 'HiSparse'. The primary goal is to enhance the efficiency of KV cache management by offloading less frequently used tokens to CPU memory, thereby freeing up valuable GPU resources. This system is designed to improve batch size and overall throughput, particularly benefiting models that use Native Sparse Attention (NSA), such as DeepSeek-V3.2 and GLM-5. The changes span from low-level CUDA kernel implementations for fast data movement to high-level Python coordination for memory allocation and scheduling, ensuring a comprehensive solution for hierarchical sparse attention.

Highlights

  • HiSparse Feature Introduction: Introduced 'HiSparse' to optimize sparse attention mechanisms by leveraging CPU memory for idle KV cache, aiming to increase batch size and improve throughput for models like DeepSeek-V3.2 and GLM-5.
  • New CUDA Kernel for Cache Management: Added a new CUDA kernel (hisparse.cuh) that includes warp-level data transfer, inclusive scan operations, and a specialized kernel for loading sparse attention KV cache to device buffers with LRU-like eviction (a simplified sketch of the eviction idea follows this list).
  • Python Interface for HiSparse: Created a new Python module (hisparse.py) to provide a JIT-compiled interface to the hisparse.cuh CUDA kernel, enabling Python-side control over the sparse cache loading process.
  • Integrated Memory and Scheduling Coordination: Implemented HiSparseCoordinator to manage the complex interplay between host and device memory for sparse attention, including request staging, dynamic device buffer allocation, and data transfers.
  • Core System Integration: Extensively integrated HiSparse logic into the SGLang scheduler, model runner, and attention backend components to ensure seamless operation during prefill and decode phases, including updates to KV cache allocation and top-k token handling.
  • Command-Line Argument: Added a new command-line argument --enable-hisparse to allow users to activate this new hierarchical sparse attention feature.
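
To make the eviction idea in the kernel highlight concrete, here is a minimal Python sketch of an LRU-managed device buffer. It is an illustration only: the PR performs this logic on-GPU inside hisparse.cuh, and the names below (LRUDeviceBuffer, swap_in) are hypothetical, not the PR's API.

```python
from collections import OrderedDict

# Hypothetical sketch of the LRU-like residency policy; the real eviction
# happens inside the hisparse.cuh CUDA kernel, not in Python.
class LRUDeviceBuffer:
    def __init__(self, num_slots: int):
        self.free = list(range(num_slots))   # unused device slots
        self.resident = OrderedDict()        # host page id -> device slot

    def swap_in(self, page_ids):
        """Return device slots for the requested pages, evicting the
        least recently used resident page when the buffer is full."""
        slots = []
        for pid in page_ids:
            if pid in self.resident:
                self.resident.move_to_end(pid)               # refresh recency
            else:
                if self.free:
                    slot = self.free.pop()
                else:
                    _, slot = self.resident.popitem(last=False)  # evict LRU
                # A real implementation would copy the page host->device here.
                self.resident[pid] = slot
            slots.append(self.resident[pid])
        return slots
```

With a 2-slot buffer, swap_in([0, 1]) fills both slots, and a later swap_in([2]) evicts page 0, the least recently used; only the pages the top-k indexer actually selects ever occupy GPU memory.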


Changelog
  • python/sglang/jit_kernel/csrc/hisparse.cuh
    • Added a new CUDA kernel for efficient sparse attention KV cache management, including warp-level data transfer and inclusive scan primitives.
  • python/sglang/jit_kernel/hisparse.py
    • Added a new Python module to provide a JIT-compiled interface for the hisparse.cuh CUDA kernel, specifically for loading cache to device buffers in MLA mode.
  • python/sglang/srt/layers/attention/nsa_backend.py
    • Added force_unfused_topk to NSAIndexerMetadata to control top-k transformation behavior.
    • Modified topk_transform to respect force_unfused_topk when SGLANG_NSA_FUSE_TOPK is enabled.
    • Integrated HiSparse coordinator to translate page table locations during forward_extend.
    • Updated forward_decode to use hisparse_coordinator.swap_in_selected_pages for loading top-k indices when HiSparse is enabled.
    • Added a condition to set_nsa_prefill_impl to disable MHA prefill if hisparse_coordinator is active.
    • Modified get_indexer_metadata to set force_unfused_topk based on HiSparse and forward mode.
  • python/sglang/srt/managers/hisparse_coordinator.py
    • Added a new Python module defining HiSparseCoordinator to manage the lifecycle of sparse attention KV cache, including staging requests, allocating device buffers, handling host-device data transfers, and managing LRU slots (a schematic of these scheduler-side hooks appears after this changelog).
  • python/sglang/srt/managers/schedule_batch.py
    • Imported HiSparseCoordinator.
    • Added staging and batch attributes to Req class for HiSparse management.
    • Added hisparse_coordinator attribute to ScheduleBatch.
    • Modified init_new to assign the created batch to requests for staging.
    • Updated prepare_for_decode to call hisparse_coordinator.map_last_loc_to_buffer when HiSparse is enabled.
  • python/sglang/srt/managers/scheduler.py
    • Imported HiSparseCoordinator.
    • Added enable_hisparse and hisparse_coordinator attributes to Scheduler.
    • Initialized hisparse_coordinator from tp_worker.model_runner and set its decode producer stream.
    • Modified get_next_batch_to_run to collect ready batches from hisparse_coordinator when enabled.
    • Added hisparse_coordinator.retract_req call when requests are retracted.
    • Added hisparse_coordinator.request_finished call when requests are aborted.
  • python/sglang/srt/managers/scheduler_output_processor_mixin.py
    • Added hisparse_coordinator.admit_request_into_staging call for unfinished requests during prefill.
    • Added hisparse_coordinator.request_finished call for finished requests during decode.
  • python/sglang/srt/managers/scheduler_runtime_checker_mixin.py
    • Added a check for hisparse_coordinator.has_ongoing_staging() during idle periods.
  • python/sglang/srt/managers/tp_worker.py
    • Added register_hisparse_coordinator method to set the coordinator on the model_runner.
  • python/sglang/srt/mem_cache/hisparse_memory_pool.py
    • Added new classes HiSparseNSATokenToKVPool and HiSparseTokenToKVPoolAllocator to manage memory for HiSparse, including mapping logical indices to hisparse device indices and handling allocation/deallocation.
  • python/sglang/srt/mem_cache/memory_pool.py
    • Modified NSATokenToKVPool to accept an optional index_buf_size and use it for buffer sizing.
  • python/sglang/srt/mem_cache/memory_pool_host.py
    • Modified free method to move indices to CPU before concatenating to free_slots.
  • python/sglang/srt/model_executor/cuda_graph_runner.py
    • Set forward_batch.hisparse_coordinator and updated num_real_reqs during CUDA graph capture.
    • Updated num_real_reqs in replay_prepare for HiSparse.
  • python/sglang/srt/model_executor/forward_batch_info.py
    • Added hisparse_coordinator attribute to ForwardBatch.
  • python/sglang/srt/model_executor/model_runner.py
    • Added enable_hisparse attribute.
    • Initialized hisparse_coordinator to None before initialize().
    • Instantiated HiSparseCoordinator within initialize() if enable_hisparse is true.
    • Set forward_batch.hisparse_coordinator and updated num_real_reqs in _forward_raw.
  • python/sglang/srt/model_executor/model_runner_kv_cache_mixin.py
    • Imported new HiSparse memory pool classes.
    • Modified _init_pools to conditionally use HiSparseNSATokenToKVPool and HiSparseTokenToKVPoolAllocator when HiSparse is enabled.
  • python/sglang/srt/server_args.py
    • Added --enable-hisparse argument to enable hierarchical sparse attention.
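
Taken together, the scheduler-facing changes above imply a hook order over a request's lifetime. The stub below is a schematic reconstruction from the changelog entries, not SGLang code; signatures and arguments are simplified assumptions.

```python
# Schematic stub of the HiSparseCoordinator hooks named in this changelog.
# Not the real implementation: signatures are simplified guesses.
class StubHiSparseCoordinator:
    def admit_request_into_staging(self, req):
        # prefill done, request unfinished: begin offloading idle KV to host
        print(f"staging {req}")

    def map_last_loc_to_buffer(self, batch):
        # prepare_for_decode: point the newest token's page-table entry
        # at the bounded device buffer
        print(f"mapping last loc for {batch}")

    def swap_in_selected_pages(self, topk_indices):
        # forward_decode: load only the top-k selected pages from host
        print(f"swapping in pages {topk_indices}")

    def request_finished(self, req):
        # finish/abort: release the request's host pages and device slots
        print(f"releasing {req}")

coord = StubHiSparseCoordinator()
coord.admit_request_into_staging("req-0")     # scheduler_output_processor, prefill
coord.map_last_loc_to_buffer("decode-batch")  # schedule_batch.prepare_for_decode
coord.swap_in_selected_pages([3, 17, 42])     # nsa_backend.forward_decode
coord.request_finished("req-0")               # decode finish or abort
```

All of this is gated behind the new --enable-hisparse launch flag; with it off, the PR intends the non-hisparse code paths to behave as before.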
Activity
  • The pull request introduces a new feature, HiSparse, for sparse attention, leveraging CPU memory for KV cache.
  • It involves adding new CUDA kernels and Python modules to manage this hierarchical memory system.
  • The new functionality is integrated into existing SGLang components, including the scheduler, memory pools, and model execution, to coordinate data movement and resource allocation.
  • The author, xiezhq-hermann, has provided a clear motivation for the PR, highlighting its benefits for models using NSA sparse attention.

Comment thread: python/sglang/srt/managers/hisparse_coordinator.py
Comment thread: python/sglang/srt/model_executor/cuda_graph_runner.py (outdated)
self.init_diffusion_llm(dllm_config)

# For hisparse
self.staging = False

Collaborator:
The name 'staging' is very generic; it is already used in the dLLM code and may cause confusion. A more hisparse-specific name would be better.

"Required fields: algorithm (str), backend (str). "
"All other fields are algorithm-specific and passed to the algorithm constructor. "
'Example: \'{"algorithm": "quest", "backend": "flashattention", "sparsity_ratio": 0.7, "min_sparse_prompt_len": 2048}\'',
'Example: \'{"top_k": 2048, "device_buffer_size": 4096}\'',

Collaborator:
There are several config options here. We may need a more detailed usage guide for users later.

req: Req


class HiSparseCoordinator:

Collaborator:
Many of the components in this PR should have unit/e2e test coverage (ref). We can add it in a follow-up PR.
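
As one illustration of the kind of coverage meant here, a minimal pytest sketch of the staging lifecycle, written against a hypothetical stub rather than the real HiSparseCoordinator API:

```python
# Hypothetical pytest sketch; StubCoordinator mirrors only the lifecycle hooks
# named in this PR (admit_request_into_staging / request_finished /
# has_ongoing_staging) and is not the real class.
class StubCoordinator:
    def __init__(self):
        self._staging = set()

    def admit_request_into_staging(self, rid):
        self._staging.add(rid)

    def request_finished(self, rid):
        self._staging.discard(rid)

    def has_ongoing_staging(self):
        return bool(self._staging)


def test_staging_lifecycle():
    coord = StubCoordinator()
    assert not coord.has_ongoing_staging()
    coord.admit_request_into_staging("req-0")
    assert coord.has_ongoing_staging()       # idle checker must see pending work
    coord.request_finished("req-0")
    assert not coord.has_ongoing_staging()   # safe to go idle again
```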


# Merge the new batch into the running batch.
if not self.last_batch.is_empty():
if self.enable_hisparse:

Collaborator:
Here we have a branch that diverges from the original handling logic, so it will need additional maintenance whenever the original path is refactored or changed. Do we have a plan to merge it into the original path as an incremental change?

cc: @hnyls2002 for a double check


Collaborator:
I think none of the non-hisparse paths are broken.


hnyls2002 merged commit 13f4f01 into main on Mar 23, 2026
214 of 235 checks passed
hnyls2002 deleted the hisparse branch on March 23, 2026 at 06:09
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
hnyls2002 mentioned this pull request Apr 29, 2026