
HiSparse for Sparse Attention #20343

Merged

hnyls2002 merged 23 commits into main from hisparse on Mar 23, 2026

Conversation

@xiezhq-hermann (Collaborator) commented Mar 11, 2026

Motivation

This PR introduces HiSparse, which leverages CPU memory to store idle KV cache during decoding, thereby increasing batch size and improving throughput for models that use the NSA sparse attention mechanism, such as DeepSeek-V3.2 and GLM-5.
This PR follows a prior attempt to support HiCache for sparse models, #14619, contributed by @hzh0425 and @huangtingwei9988.

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant optimization for sparse attention models called 'HiSparse'. The primary goal is to enhance the efficiency of KV cache management by offloading less frequently used tokens to CPU memory, thereby freeing up valuable GPU resources. This system is designed to improve batch size and overall throughput, particularly benefiting models that use Native Sparse Attention (NSA), such as DeepSeek-V3.2 and GLM-5. The changes span from low-level CUDA kernel implementations for fast data movement to high-level Python coordination for memory allocation and scheduling, ensuring a comprehensive solution for hierarchical sparse attention.

Highlights

  • HiSparse Feature Introduction: Introduced 'HiSparse' to optimize sparse attention mechanisms by leveraging CPU memory for idle KV cache, aiming to increase batch size and improve throughput for models like DeepSeek-V3.2 and GLM-5.
  • New CUDA Kernel for Cache Management: Added a new CUDA kernel (hisparse.cuh) that includes warp-level data transfer, inclusive scan operations, and a specialized kernel for loading sparse attention KV cache to device buffers with LRU-like eviction (a simplified sketch of the eviction idea follows this list).
  • Python Interface for HiSparse: Created a new Python module (hisparse.py) to provide a JIT-compiled interface to the hisparse.cuh CUDA kernel, enabling Python-side control over the sparse cache loading process.
  • Integrated Memory and Scheduling Coordination: Implemented HiSparseCoordinator to manage the complex interplay between host and device memory for sparse attention, including request staging, dynamic device buffer allocation, and data transfers.
  • Core System Integration: Extensively integrated HiSparse logic into the SGLang scheduler, model runner, and attention backend components to ensure seamless operation during prefill and decode phases, including updates to KV cache allocation and top-k token handling.
  • Command-Line Argument: Added a new command-line argument --enable-hisparse to allow users to activate this new hierarchical sparse attention feature.
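
To make the eviction idea in the kernel highlight concrete, here is a minimal Python sketch of an LRU-managed device buffer. It is an illustration only: the PR performs this logic on-GPU inside hisparse.cuh, and the names below (LRUDeviceBuffer, swap_in) are hypothetical, not the PR's API.

```python
from collections import OrderedDict

# Hypothetical sketch of the LRU-like residency policy; the real eviction
# happens inside the hisparse.cuh CUDA kernel, not in Python.
class LRUDeviceBuffer:
    def __init__(self, num_slots: int):
        self.free = list(range(num_slots))   # unused device slots
        self.resident = OrderedDict()        # host page id -> device slot

    def swap_in(self, page_ids):
        """Return device slots for the requested pages, evicting the
        least recently used resident page when the buffer is full."""
        slots = []
        for pid in page_ids:
            if pid in self.resident:
                self.resident.move_to_end(pid)               # refresh recency
            else:
                if self.free:
                    slot = self.free.pop()
                else:
                    _, slot = self.resident.popitem(last=False)  # evict LRU
                # A real implementation would copy the page host->device here.
                self.resident[pid] = slot
            slots.append(self.resident[pid])
        return slots
```

With a 2-slot buffer, swap_in([0, 1]) fills both slots, and a later swap_in([2]) evicts page 0, the least recently used; only the pages the top-k indexer actually selects ever occupy GPU memory.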


Changelog
  • python/sglang/jit_kernel/csrc/hisparse.cuh
    • Added a new CUDA kernel for efficient sparse attention KV cache management, including warp-level data transfer and inclusive scan primitives.
  • python/sglang/jit_kernel/hisparse.py
    • Added a new Python module to provide a JIT-compiled interface for the hisparse.cuh CUDA kernel, specifically for loading cache to device buffers in MLA mode.
  • python/sglang/srt/layers/attention/nsa_backend.py
    • Added force_unfused_topk to NSAIndexerMetadata to control top-k transformation behavior.
    • Modified topk_transform to respect force_unfused_topk when SGLANG_NSA_FUSE_TOPK is enabled.
    • Integrated HiSparse coordinator to translate page table locations during forward_extend.
    • Updated forward_decode to use hisparse_coordinator.swap_in_selected_pages for loading top-k indices when HiSparse is enabled.
    • Added a condition to set_nsa_prefill_impl to disable MHA prefill if hisparse_coordinator is active.
    • Modified get_indexer_metadata to set force_unfused_topk based on HiSparse and forward mode.
  • python/sglang/srt/managers/hisparse_coordinator.py
    • Added a new Python module defining HiSparseCoordinator to manage the lifecycle of sparse attention KV cache, including staging requests, allocating device buffers, handling host-device data transfers, and managing LRU slots (a schematic of these scheduler-side hooks appears after this changelog).
  • python/sglang/srt/managers/schedule_batch.py
    • Imported HiSparseCoordinator.
    • Added staging and batch attributes to Req class for HiSparse management.
    • Added hisparse_coordinator attribute to ScheduleBatch.
    • Modified init_new to assign the created batch to requests for staging.
    • Updated prepare_for_decode to call hisparse_coordinator.map_last_loc_to_buffer when HiSparse is enabled.
  • python/sglang/srt/managers/scheduler.py
    • Imported HiSparseCoordinator.
    • Added enable_hisparse and hisparse_coordinator attributes to Scheduler.
    • Initialized hisparse_coordinator from tp_worker.model_runner and set its decode producer stream.
    • Modified get_next_batch_to_run to collect ready batches from hisparse_coordinator when enabled.
    • Added hisparse_coordinator.retract_req call when requests are retracted.
    • Added hisparse_coordinator.request_finished call when requests are aborted.
  • python/sglang/srt/managers/scheduler_output_processor_mixin.py
    • Added hisparse_coordinator.admit_request_into_staging call for unfinished requests during prefill.
    • Added hisparse_coordinator.request_finished call for finished requests during decode.
  • python/sglang/srt/managers/scheduler_runtime_checker_mixin.py
    • Added a check for hisparse_coordinator.has_ongoing_staging() during idle periods.
  • python/sglang/srt/managers/tp_worker.py
    • Added register_hisparse_coordinator method to set the coordinator on the model_runner.
  • python/sglang/srt/mem_cache/hisparse_memory_pool.py
    • Added new classes HiSparseNSATokenToKVPool and HiSparseTokenToKVPoolAllocator to manage memory for HiSparse, including mapping logical indices to hisparse device indices and handling allocation/deallocation.
  • python/sglang/srt/mem_cache/memory_pool.py
    • Modified NSATokenToKVPool to accept an optional index_buf_size and use it for buffer sizing.
  • python/sglang/srt/mem_cache/memory_pool_host.py
    • Modified free method to move indices to CPU before concatenating to free_slots.
  • python/sglang/srt/model_executor/cuda_graph_runner.py
    • Set forward_batch.hisparse_coordinator and updated num_real_reqs during CUDA graph capture.
    • Updated num_real_reqs in replay_prepare for HiSparse.
  • python/sglang/srt/model_executor/forward_batch_info.py
    • Added hisparse_coordinator attribute to ForwardBatch.
  • python/sglang/srt/model_executor/model_runner.py
    • Added enable_hisparse attribute.
    • Initialized hisparse_coordinator to None before initialize().
    • Instantiated HiSparseCoordinator within initialize() if enable_hisparse is true.
    • Set forward_batch.hisparse_coordinator and updated num_real_reqs in _forward_raw.
  • python/sglang/srt/model_executor/model_runner_kv_cache_mixin.py
    • Imported new HiSparse memory pool classes.
    • Modified _init_pools to conditionally use HiSparseNSATokenToKVPool and HiSparseTokenToKVPoolAllocator when HiSparse is enabled.
  • python/sglang/srt/server_args.py
    • Added --enable-hisparse argument to enable hierarchical sparse attention.
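
Taken together, the scheduler-facing changes above imply a hook order over a request's lifetime. The stub below is a schematic reconstruction from the changelog entries, not SGLang code; signatures and arguments are simplified assumptions.

```python
# Schematic stub of the HiSparseCoordinator hooks named in this changelog.
# Not the real implementation: signatures are simplified guesses.
class StubHiSparseCoordinator:
    def admit_request_into_staging(self, req):
        # prefill done, request unfinished: begin offloading idle KV to host
        print(f"staging {req}")

    def map_last_loc_to_buffer(self, batch):
        # prepare_for_decode: point the newest token's page-table entry
        # at the bounded device buffer
        print(f"mapping last loc for {batch}")

    def swap_in_selected_pages(self, topk_indices):
        # forward_decode: load only the top-k selected pages from host
        print(f"swapping in pages {topk_indices}")

    def request_finished(self, req):
        # finish/abort: release the request's host pages and device slots
        print(f"releasing {req}")

coord = StubHiSparseCoordinator()
coord.admit_request_into_staging("req-0")     # scheduler_output_processor, prefill
coord.map_last_loc_to_buffer("decode-batch")  # schedule_batch.prepare_for_decode
coord.swap_in_selected_pages([3, 17, 42])     # nsa_backend.forward_decode
coord.request_finished("req-0")               # decode finish or abort
```

All of this is gated behind the new --enable-hisparse launch flag; with it off, the PR intends the non-hisparse code paths to behave as before.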
Activity
  • The pull request introduces a new feature, HiSparse, for sparse attention, leveraging CPU memory for KV cache.
  • It involves adding new CUDA kernels and Python modules to manage this hierarchical memory system.
  • The new functionality is integrated into existing SGLang components, including the scheduler, memory pools, and model execution, to coordinate data movement and resource allocation.
  • The author, xiezhq-hermann, has provided a clear motivation for the PR, highlighting its benefits for models using NSA sparse attention.

Comment thread: python/sglang/srt/managers/hisparse_coordinator.py
Comment thread: python/sglang/srt/model_executor/cuda_graph_runner.py (outdated)
self.init_diffusion_llm(dllm_config)

# For hisparse
self.staging = False

Collaborator:
The name 'staging' is very generic; it is already used in the dLLM code and may cause confusion. A more hisparse-specific name would be better.

"Required fields: algorithm (str), backend (str). "
"All other fields are algorithm-specific and passed to the algorithm constructor. "
'Example: \'{"algorithm": "quest", "backend": "flashattention", "sparsity_ratio": 0.7, "min_sparse_prompt_len": 2048}\'',
'Example: \'{"top_k": 2048, "device_buffer_size": 4096}\'',

Collaborator:
There are several config options here. We may need a more detailed usage guide for users later.

req: Req


class HiSparseCoordinator:

Collaborator:
Many of the components in this PR should have unit/e2e test coverage (ref). We can add it in a follow-up PR.
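
As one illustration of the kind of coverage meant here, a minimal pytest sketch of the staging lifecycle, written against a hypothetical stub rather than the real HiSparseCoordinator API:

```python
# Hypothetical pytest sketch; StubCoordinator mirrors only the lifecycle hooks
# named in this PR (admit_request_into_staging / request_finished /
# has_ongoing_staging) and is not the real class.
class StubCoordinator:
    def __init__(self):
        self._staging = set()

    def admit_request_into_staging(self, rid):
        self._staging.add(rid)

    def request_finished(self, rid):
        self._staging.discard(rid)

    def has_ongoing_staging(self):
        return bool(self._staging)


def test_staging_lifecycle():
    coord = StubCoordinator()
    assert not coord.has_ongoing_staging()
    coord.admit_request_into_staging("req-0")
    assert coord.has_ongoing_staging()       # idle checker must see pending work
    coord.request_finished("req-0")
    assert not coord.has_ongoing_staging()   # safe to go idle again
```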


# Merge the new batch into the running batch.
if not self.last_batch.is_empty():
if self.enable_hisparse:

Collaborator:
Here we have a branch that diverges from the original handling logic, so it will need additional maintenance whenever the original path is refactored or changed. Do we have a plan to merge it into the original path as an incremental change?

cc: @hnyls2002 for a double check


Collaborator:
I think none of the non-hisparse paths are broken.


hnyls2002 merged commit 13f4f01 into main on Mar 23, 2026
214 of 235 checks passed
hnyls2002 deleted the hisparse branch on March 23, 2026 at 06:09
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
hnyls2002 mentioned this pull request Apr 29, 2026