[DeepSeek-V3.2][JIT-kernel] Support nsa fuse store indexer k cache#19148
BBuf merged 2 commits into sgl-project:main from …
Conversation
Summary of Changes: Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request optimizes the key-cache population mechanism within the DeepSeek V3.2 architecture: by consolidating the quantization of BF16 keys and their subsequent storage into a single JIT-compiled CUDA kernel, it aims to minimize kernel-launch overhead and improve overall inference performance. The implementation includes fallback logic, ensuring stability even when the optimized kernel cannot be used.
/tag-and-rerun-ci
Code Review
This pull request introduces a JIT-compiled CUDA kernel to fuse quantization and storage of the NSA indexer K cache, aiming to reduce kernel launch overhead and improve performance. The changes include a new CUDA kernel, a Python wrapper for JIT compilation, and modifications to nsa_indexer.py to use this new fused kernel.
My review has identified a critical race condition in the new CUDA kernel where multiple threads in a warp attempt to write to the same memory location. I've also found some opportunities for code cleanup by removing unused functions and a redundant buffer fetch. Addressing these points will improve the correctness and maintainability of the new implementation.
Force-pushed: 9c6ddb2 → 75bca33
/rerun-failed-ci
Force-pushed: 055d29f → 781acbd
/rerun-failed-ci

(6 similar comments)
Can we add PDL support for this kernel? I'm not sure if this will bring a performance improvement.
/// NOTE: 132 = 128 + 4
constexpr int64_t kPageBytes = 132 << kPageBits;
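For intuition, the constant works out as follows; a minimal Python sketch, assuming page_size = 64 as in the fast-path check quoted later in this thread (the names and the per-row packing are ours, not the kernel's):

```python
# Sketch of the page arithmetic behind kPageBytes. Only 132 = 128 + 4 and
# page_size = 64 come from the PR; everything else is an assumption for
# illustration.
K_PAGE_BITS = 6                        # page_size = 64 rows per page
ROW_BYTES = 128 + 4                    # 128 FP8 bytes + one 4-byte fp32 scale
PAGE_BYTES = ROW_BYTES << K_PAGE_BITS  # mirrors kPageBytes = 132 << kPageBits

def page_base(loc: int) -> int:
    """Byte offset of the page containing cache slot `loc`."""
    return (loc >> K_PAGE_BITS) * PAGE_BYTES
```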
// each warp handles 128 elements, 1 warp, each block handles multiple rows

Suggested change:
- // each warp handles 128 elements, 1 warp, each block handles multiple rows
+ // each warp handles 128 elements, each block handles multiple rows
# Fast path: JIT fused store (CUDA, page_size=64, non-fnuz)
if can_use_nsa_fused_store() and _is_cuda and (not _is_fp8_fnuz):
    if forward_batch.token_to_kv_pool.page_size == 64:

Can we merge these two if statements into one?
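One possible merged form, as a sketch keeping exactly the quoted conditions (not the merged code from the PR):

```python
# Hypothetical merged form of the two quoted if statements.
if (
    can_use_nsa_fused_store()
    and _is_cuda
    and not _is_fp8_fnuz
    and forward_batch.token_to_kv_pool.page_size == 64
):
    ...
```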
    layer_id=layer_id
)
fused_store_index_k_cache(key, buf, forward_batch.out_cache_loc)
else:

The can_use_nsa_fused_store function has a fallback now, so why do we need another fallback here?
can_use_nsa_fused_store itself doesn't have a fallback.
There are two branches in forward_cuda, and each needs its own fallback (see the sketch below):

- fast path (seqlen < 2048): skips top-k computation and calls _forward_cuda_k_only --> fused_store_index_k_cache
- normal path: performs top-k computation and calls _store_index_k_cache()

Addressed and refactored the code.
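A minimal sketch of that branch structure; only _forward_cuda_k_only, fused_store_index_k_cache, _store_index_k_cache, can_use_nsa_fused_store, out_cache_loc, and page_size come from this thread, the rest is hypothetical:

```python
# Hypothetical sketch of the two forward_cuda branches; each wraps the store
# with its own fallback to the unfused _store_index_k_cache helper.
def forward_cuda(self, key, forward_batch, layer_id):
    if self._is_fast_path(forward_batch):  # seqlen < 2048: top-k is skipped
        return self._forward_cuda_k_only(key, forward_batch, layer_id)
    # Normal path: compute top-k, then store via the unfused helper.
    topk_result = self._compute_topk(key, forward_batch)
    self._store_index_k_cache(key, forward_batch, layer_id)
    return topk_result

def _forward_cuda_k_only(self, key, forward_batch, layer_id):
    # Fast path: try the fused JIT store, with this branch's own fallback.
    if can_use_nsa_fused_store() and forward_batch.token_to_kv_pool.page_size == 64:
        buf = forward_batch.token_to_kv_pool.get_index_k_with_scale_buffer(
            layer_id=layer_id
        )
        fused_store_index_k_cache(key, buf, forward_batch.out_cache_loc)
    else:
        self._store_index_k_cache(key, forward_batch, layer_id)
```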
Force-pushed: f63bef3 → c4d59ad
@yuan-luo Can you please test the result of gpqa and aime25 as shown here: https://docs.sglang.io/basic_usage/deepseek_v32.html#accuracy-test-with-gpqa-diamond
Also, can you please test on some extreme workloads (e.g. 128k input), to make sure it doesn't crash due to any IMA-like (illegal memory access) errors (although with the int64 out_cache_loc this shouldn't happen).
Can we add a test for this JIT kernel?
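One shape such a test could take: compare the fused store against a two-step (quantize, then store) reference on random inputs. A sketch; only fused_store_index_k_cache is a name from the PR, and the reference function plus the flat 132-bytes-per-slot buffer layout are assumptions:

```python
# Hypothetical correctness test: the fused kernel should produce exactly the
# same buffer contents as an unfused quantize-then-store reference path.
import torch

def test_fused_store_index_k_cache():
    torch.manual_seed(0)
    num_tokens, head_dim, num_slots = 256, 128, 4096
    key = torch.randn(num_tokens, head_dim, dtype=torch.bfloat16, device="cuda")
    loc = torch.randperm(num_slots, device="cuda")[:num_tokens].to(torch.int64)

    # In the real test the buffers would come from the layer's KV pool.
    buf_fused = torch.zeros(num_slots * 132, dtype=torch.uint8, device="cuda")
    buf_ref = buf_fused.clone()

    fused_store_index_k_cache(key, buf_fused, loc)  # kernel under test
    reference_quant_then_store(key, buf_ref, loc)   # assumed reference path

    torch.testing.assert_close(buf_fused, buf_ref, rtol=0, atol=0)
```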
gpqa result: [image]

AIME25 result: WIP.
Force-pushed: c4d59ad → 350c76d
Co-authored-by: Yuan Luo <yuan.luo@hotmail.com>
Co-authored-by: DarkSharpness <76582120+darksharpness@users.noreply.github.com>
Force-pushed: 350c76d → 8f6a1f3
Motivation
In DeepSeek V3.2, after the Indexer produces the key in bf16 (roughly shape (N, 128)), it needs to populate NSA's index_k_with_scale_buffer. The previous implementation did this in two steps: first quantize the bf16 key to FP8 (producing a per-row scale), then store the quantized key and its scale into the buffer.
This path requires at least two kernel launches (quant + store). Under CUDA Graph / multi-stream execution, launch and sync overhead becomes more noticeable.
This PR introduces a JIT-compiled CUDA kernel that fuses quantization and store into a single launch.

Inside the kernel, each warp handles one 128-element row: it loads the bf16 key, computes the row's FP8 scale, quantizes the 128 values, and writes the quantized bytes plus the 4-byte scale (132 bytes per row) into the paged index_k_with_scale_buffer at the slot given by out_cache_loc. A rough reference of the per-row math is sketched below.
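A plain PyTorch reference of the per-row computation; a sketch assuming amax-based scaling to float8_e4m3fn and a [128 FP8 bytes | 4-byte fp32 scale] row packing (the actual kernel's rounding, clamping, and intra-page layout may differ):

```python
# Reference for one row of the fused kernel's work, under the assumptions
# stated above. Only 132 = 128 + 4 is taken from the PR itself.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def quant_store_row(key_row_bf16: torch.Tensor) -> torch.Tensor:
    """key_row_bf16: (128,) bf16 -> (132,) uint8 packed row."""
    x = key_row_bf16.float()
    scale = x.abs().max().clamp(min=1e-12) / FP8_MAX  # per-row scale
    q = (x / scale).to(torch.float8_e4m3fn)           # 128 FP8 values
    packed = torch.empty(132, dtype=torch.uint8)
    packed[:128] = q.view(torch.uint8)                # raw FP8 bytes
    packed[128:] = scale.reshape(1).view(torch.uint8) # fp32 scale bytes
    return packed
```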
Inspired by @DarkSharpness
Before PR:

After PR:

Performance improved slightly. Will do more testing.
No accuracy drop on gsm8k.
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci