For context lengths below 2K, sparse attention is identical to non-sparse attention, so we can skip the logits computation and directly generate the indices for the sparse MLA kernel, or use MHA when possible.
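A minimal sketch of this short-context fast path in PyTorch (the helper name and the -1 padding convention are assumptions, not the actual SGLang API):

```python
import torch

def maybe_skip_indexer(seq_lens: torch.Tensor, topk: int = 2048):
    """If every sequence fits within the sparse top-k, skip the indexer entirely."""
    if int(seq_lens.max()) > topk:
        return None  # some sequence exceeds top-k: run the real indexer (logits + top-k)
    # Each row attends to all of its own tokens, so the indices are simply
    # 0..len-1, padded with -1 so the sparse MLA kernel skips unused slots.
    bs = seq_lens.numel()
    indices = torch.arange(topk, device=seq_lens.device).expand(bs, topk).clone()
    indices[indices >= seq_lens.unsqueeze(1)] = -1
    return indices
```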
The current flashmla_decode kernel is not well optimized on B200, so a separate dequant kernel + flashmla_sparse_bf16 works better for prefill with an fp8 KV cache, provided the KV cache is not too long relative to the q sequence length. The heuristics will need to be updated whenever new optimizations land in either the prefill or decode kernels, which makes them a bit hard to use in practice. Detailed analysis is here
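As a rough illustration of such a heuristic (the function name and threshold below are made up; the real dispatch depends on the kernel measurements in the linked analysis):

```python
def prefer_dequant_bf16_path(q_len: int, kv_len: int, max_kv_to_q_ratio: float = 4.0) -> bool:
    # With an fp8 KV cache on B200, a separate dequant pass + flashmla_sparse_bf16
    # can beat flashmla_decode when the KV length is small relative to the query
    # length, because the dequant cost is amortized over many query tokens.
    return kv_len <= max_kv_to_q_ratio * q_len
```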
[Decode] Move deep_gemm.get_paged_mqa_logits_metadata to init time, similar to how the attention kernel metadata is computed (sketched below).
[Prefill] Optimize _get_topk_ragged, which currently launches a lot of small kernels: try multi-stream execution, torch.compile (sketched below), and add new kernels where necessary.
[MTP] Enable nextn = 2/4 in deep_gemm.fp8_paged_mqa_logits, which is faster than the current implementation that uses the nextn = 1 kernel regardless of the MTP size.
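A sketch of the [Decode] metadata refactor (the class and attribute names are hypothetical, and the exact argument list should be taken from DeepGEMM):

```python
import torch
import deep_gemm

class IndexerForwardMetadata:
    """Computed once per batch at metadata-init time and reused by every layer."""

    def __init__(self, context_lens: torch.Tensor, block_kv: int, num_sms: int):
        # Previously recomputed inside each layer's forward pass; now a single call.
        self.paged_mqa_logits_metadata = deep_gemm.get_paged_mqa_logits_metadata(
            context_lens, block_kv, num_sms
        )
```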
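And a sketch of the [Prefill] torch.compile route (the body below is a stand-in for the real _get_topk_ragged, which operates on ragged sequences; the point is wrapping it so Inductor can fuse the many small masking/top-k kernels):

```python
import torch

@torch.compile(dynamic=True)
def _get_topk_ragged(logits: torch.Tensor, seq_lens: torch.Tensor, topk: int) -> torch.Tensor:
    # logits: [batch, max_len] indexer scores; seq_lens: [batch] valid lengths.
    # Mask positions beyond each row's valid length, then take per-row top-k;
    # compiled, the mask + fill prologue launches far fewer kernels.
    positions = torch.arange(logits.shape[-1], device=logits.device)
    logits = logits.masked_fill(positions >= seq_lens.unsqueeze(-1), float("-inf"))
    return logits.topk(topk, dim=-1).indices
```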
Attention algorithm
Link to original table
The parts highlighted in blue are work that has been done or is in progress.
To summarize:
Kernel optimizations
Speed up torch.cat([q_nope, q_rope]) by either writing a fast triton/cuda kernel or using torch.compile. It's used for both prefill and decode, but the prefill one is much bigger and has more room for optimizations. The trtllm kernel supports separate q_nope and q_rope, but flashmla doesn't. Related PRs: [DeepseekV32]: use _concat_mla_absorb_q_general to replace torch.cat #12215; [Deepseek V3.2] Use torch.compile to speed up torch.cat in nsa #13022
Indexer optimizations
Min latency optimizations