DeepSeek-V3.2: Add Adaptive MHA Attention Pathway for Short-Sequence Prefill #11892
Fridge003 merged 22 commits into sgl-project:main
Conversation
Summary of Changes
Hello @YAMY1234, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly optimizes the inference performance of DeepSeek-V3.2 models by introducing an adaptive attention mechanism. It intelligently selects between Multi-Head Attention (MHA) and Multi-Latent Attention (MLA) based on the sequence length during the prefill stage. This ensures that the more efficient MHA is used for shorter sequences, where MLA's overhead is detrimental, while MLA's benefits for longer sequences are preserved. The change results in notable reductions in latency and increases in throughput across various sequence lengths.
Code Review
This pull request introduces an adaptive attention mechanism for DeepSeek-V3.2 models, switching between Multi-Head Attention (MHA) for short sequences and Multi-Latent Attention (MLA) for longer ones to optimize performance. The changes are well-structured, with clear logic for selecting the attention pathway based on sequence length during the prefill phase. The new _forward_standard_mha method in the NSA backend is a clean implementation for handling the MHA path.
My review focuses on the correctness and efficiency of the new logic. I've found a minor opportunity for improvement in python/sglang/srt/models/deepseek_v2.py to make the code more efficient and idiomatic. Overall, this is a solid contribution with impressive performance gains demonstrated in the benchmarks.
@YAMY1234 Can you post some accuracy results on GPQA?

Please test with MTP and make sure it's not broken.
Thanks! Just added the GPQA result.
Ensured MHA is currently triggered without MTP. Will test with MTP later to confirm it runs correctly.
Please update the benchmark data after adding back the indexer.
Updated the results after adding back the indexer but skipping topK!
@YAMY1234 Please fix lint
@Fridge003 Done! Thanks
Great job! @YAMY1234
I noticed that decoding in a context smaller than 2048 also spends a lot of time in the indexer() function. Why is that? Can't we skip this step during decoding? @YAMY1234
In the case of decoding, because of CUDA graph limitations, we would need to instantiate two versions of the CUDA graph: one with the indexer and one that skips the logits computation. Given that prompts with <2k tokens are not the majority of cases, I don't think it's worth the effort or complexity.
So you mean that within a CUDA graph, an if branch will only capture one path? Therefore, branching on sequence length is invalid inside a CUDA graph?
Perhaps piecewise CUDA graphs can solve this issue. I believe skipping the attention graph would be much more cost-effective than performing the default indexing. |
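To illustrate the point discussed above, here is a minimal, self-contained PyTorch sketch (not from this PR; all names and values are illustrative) of why a captured CUDA graph replays only the branch taken at capture time, so a single graph cannot switch on sequence length at replay:

```python
import torch

device = torch.device("cuda")
x = torch.randn(8, 64, device=device)
static_out = torch.empty_like(x)


def step(x, seq_len):
    # Python-level `if`: evaluated eagerly, only when this code actually runs.
    if seq_len < 2048:
        return x * 2.0  # stands in for the "skip indexer" branch
    return x * 3.0      # stands in for the "run indexer" branch


seq_len = 1024  # value seen at capture time

# Warm up on a side stream before capture (recommended practice).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_out.copy_(step(x, seq_len))
torch.cuda.current_stream().wait_stream(s)

# Capture: only the kernels of the `< 2048` branch are recorded into the graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out.copy_(step(x, seq_len))

# Replay never re-evaluates the Python `if`; the captured kernels always run,
# regardless of the sequence length of the batch being served.
seq_len = 4096
g.replay()
print(torch.allclose(static_out, x * 2.0))  # True: still the short-sequence branch
```

This is why the options discussed above are either capturing two separate graphs (one per branch) or using piecewise CUDA graphs, rather than branching inside a single captured graph.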
Motivation
For DeepSeek-V3.2 models, using MLA (Multi-Latent Attention) uniformly across all sequence lengths during prefill is suboptimal. For short sequences, the overhead of MLA's compression/decompression and absorbed attention mechanism outweighs any potential benefits, making standard MHA (Multi-Head Attention) more efficient. This PR implements adaptive attention mechanism selection based on sequence length to optimize inference performance across different workloads.
Modifications
This PR implements sequence-length-based adaptive attention mechanism selection in the NSA (Native Sparse Attention) backend:
Core Changes
- Add MHA mode detection and processing in nsa_backend.py: a new _forward_standard_mha() method uses the FlashAttention varlen interface for standard MHA.
- Implement intelligent attention mechanism selection in deepseek_v2.py: handle_attention_nsa() chooses the pathway based on sequence length, with an nsa_seq_len_threshold parameter (default: 2048) for flexible threshold tuning (see the illustrative sketch below).
- Note: this threshold is also chosen because of NSA's top-k sparse filtering, which operates within the MLA pathway.
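A minimal sketch of the dispatch and the standard-MHA prefill path described above. Only the names nsa_seq_len_threshold, handle_attention_nsa, and _forward_standard_mha come from this PR; the forward_batch fields, tensor layouts, and the direct call to flash_attn_varlen_func are simplifying assumptions, so treat this as an outline rather than the actual SGLang implementation:

```python
import torch
from flash_attn import flash_attn_varlen_func  # FlashAttention varlen interface

NSA_SEQ_LEN_THRESHOLD = 2048  # default value of nsa_seq_len_threshold in this PR


def handle_attention_nsa(forward_batch):
    """Choose the attention pathway for a batch (illustrative only)."""
    is_prefill = forward_batch.forward_mode.is_extend()        # assumed helper
    max_extend_len = int(forward_batch.extend_seq_lens.max())  # assumed field
    if is_prefill and max_extend_len <= NSA_SEQ_LEN_THRESHOLD:
        return "mha"  # short prefill: standard MHA, no absorb / top-k indexer
    return "mla"      # long prefill and decode: absorbed MLA with NSA top-k


def _forward_standard_mha(q, k, v, cu_seqlens, max_seqlen, softmax_scale):
    """Standard varlen MHA for short-sequence prefill (illustrative only).

    q, k, v: (total_tokens, num_heads, head_dim), packed across the batch.
    cu_seqlens: (batch_size + 1,) int32 cumulative sequence lengths.
    """
    return flash_attn_varlen_func(
        q, k, v,
        cu_seqlens_q=cu_seqlens,
        cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max_seqlen,
        max_seqlen_k=max_seqlen,
        softmax_scale=softmax_scale,
        causal=True,
    )
```

One likely reading of the threshold note above: when the prompt is no longer than the top-k budget, the sparse filtering in the MLA pathway cannot reduce the amount of attended context, so the dense MHA path is the cheaper choice.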
Accuracy Tests
GSM8K:
GPQA:
Benchmarking and Profiling
1024 tokens, 100 requests
512 tokens, 100 requests
2000 tokens, 100 requests
Kernel level analysis:
Forward Total Time
Prepare Stage
MHA path (forward_normal_prepare / forward_normal_chunked_kv_prepare + top-k-skipped indexer):
MLA path (forward_absorb_prepare):

Core Stage

MHA path (forward_normal_chunked_kv_core): ~300 μs

Checklist