
[RL] [DSv32] [GLM-5] Add --nsa-topk-backend and integrate FlashInfer and pytorch topk#22851

Open
zianglih wants to merge 8 commits into sgl-project:main from zianglih:torch-topk

Conversation

@zianglih
Contributor

@zianglih zianglih commented Apr 15, 2026

Motivation


Add --nsa-topk-backend for configurable topk backend implementation selection.

torch.topk is used by GLM-5 for RL.
FlashInfer topk offers determinism and a configurable tie break (flashinfer-ai/flashinfer#3095), as well as better long-context performance.
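As background on why a configurable tie break matters: when several scores are exactly equal, different top-k implementations may return different index sets, which breaks run-to-run reproducibility. A minimal pure-Python sketch of the two common tie-break policies (illustrative only, not the FlashInfer, sgl-kernel, or torch CUDA implementations):

```python
def topk_with_tie_break(scores, k, prefer_lower_index=True):
    """Top-k indices with an explicit tie-break policy.

    Pure-Python sketch for illustration only; the real backends
    (sgl-kernel, flashinfer, torch.topk) are CUDA kernels.
    """
    if prefer_lower_index:
        key = lambda i: (scores[i], -i)  # on ties, the smaller index wins
    else:
        key = lambda i: (scores[i], i)   # on ties, the larger index wins
    return sorted(range(len(scores)), key=key, reverse=True)[:k]

scores = [0.9, 0.5, 0.9, 0.1]  # indices 0 and 2 tie
print(topk_with_tie_break(scores, 2, prefer_lower_index=True))   # [0, 2]
print(topk_with_tie_break(scores, 2, prefer_lower_index=False))  # [2, 0]
```

A deterministic, documented tie break makes token selection reproducible across runs, which is what matters for RL training.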

Modifications

  • Add --nsa-topk-backend, defaulting to the existing sgl-kernel backend
  • Integrate flashinfer and torch topk for the unfused code path
  • Integrate flashinfer topk for the fused code path
  • Add SGLANG_NSA_TOPK_FLASHINFER_TIE_BREAK and SGLANG_NSA_TOPK_FLASHINFER_DETERMINISTIC
  • Add a new unit test
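For illustration, backend selection behind a flag like --nsa-topk-backend can be sketched as a simple registry. The names below are hypothetical and the implementations are placeholders; they do not mirror the actual SGLang code paths:

```python
from typing import Callable, Dict, List

def _reference_topk(scores: List[float], k: int) -> List[int]:
    # Reference implementation standing in for all backends in this sketch;
    # the real backends call the sgl-kernel CUDA kernel, flashinfer, or torch.topk.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Hypothetical registry; the keys mirror the flag values, the values do not.
TOPK_BACKENDS: Dict[str, Callable[[List[float], int], List[int]]] = {
    "sgl-kernel": _reference_topk,
    "flashinfer": _reference_topk,
    "torch": _reference_topk,
}

def select_topk_backend(name: str = "sgl-kernel") -> Callable[[List[float], int], List[int]]:
    if name not in TOPK_BACKENDS:
        raise ValueError(f"unknown --nsa-topk-backend: {name!r}")
    return TOPK_BACKENDS[name]

topk = select_topk_backend("torch")
print(topk([0.2, 0.8, 0.5], 2))  # [1, 2]
```

A registry keyed by the flag value keeps the dispatch in one place and makes an invalid backend name fail fast at startup rather than deep in the forward pass.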

Accuracy Tests

The new unit test passed: `python3 -m pytest -q test/registered/kernels/test_nsa_indexer.py -k test_topk_unfused_backends_valid_selection`.

```
# sgl-kernel (default)
SGLANG_NSA_PREFILL_DENSE_ATTN_KV_LEN_THRESHOLD=0 python3 -m sglang.launch_server --nsa-topk-backend sgl-kernel --kv-cache-dtype bf16 --model /data/models/ziangli_v32/DeepSeek-V3.2 --tp 8 --dp 8 --enable-dp-attention --moe-runner-backend flashinfer_trtllm_routed --attention-backend nsa --nsa-decode-backend flashmla_sparse --nsa-prefill-backend flashmla_sparse --page-size 64 --trust-remote-code
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.977
Invalid: 0.000
Latency: 13.146 s
Output throughput: 8566.746 token/s
Accuracy: 0.978
Invalid: 0.000
Latency: 12.749 s
Output throughput: 8878.813 token/s
Accuracy: 0.981
Invalid: 0.000
Latency: 17.272 s
Output throughput: 6584.294 token/s

# torch unfused
SGLANG_NSA_PREFILL_DENSE_ATTN_KV_LEN_THRESHOLD=0 SGLANG_NSA_FUSE_TOPK=0 python3 -m sglang.launch_server --nsa-topk-backend torch --kv-cache-dtype bf16 --model /data/models/ziangli_v32/DeepSeek-V3.2 --tp 8 --dp 8 --enable-dp-attention --moe-runner-backend flashinfer_trtllm_routed --attention-backend nsa --nsa-decode-backend flashmla_sparse --nsa-prefill-backend flashmla_sparse --page-size 64 --trust-remote-code
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.982
Invalid: 0.000
Latency: 18.256 s
Output throughput: 6183.790 token/s
Accuracy: 0.983
Invalid: 0.000
Latency: 17.637 s
Output throughput: 6388.987 token/s
Accuracy: 0.980
Invalid: 0.000
Latency: 17.609 s
Output throughput: 6403.039 token/s

# flashinfer unfused
SGLANG_NSA_PREFILL_DENSE_ATTN_KV_LEN_THRESHOLD=0 SGLANG_NSA_FUSE_TOPK=0 python3 -m sglang.launch_server --nsa-topk-backend flashinfer --kv-cache-dtype bf16 --model /data/models/ziangli_v32/DeepSeek-V3.2 --tp 8 --dp 8 --enable-dp-attention --moe-runner-backend flashinfer_trtllm_routed --attention-backend nsa --nsa-decode-backend flashmla_sparse --nsa-prefill-backend flashmla_sparse --page-size 64 --trust-remote-code
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.978
Invalid: 0.000
Latency: 20.846 s
Output throughput: 5413.876 token/s
Accuracy: 0.978
Invalid: 0.000
Latency: 24.896 s
Output throughput: 4557.003 token/s
Accuracy: 0.979
Invalid: 0.000
Latency: 21.313 s
Output throughput: 5292.839 token/s

# flashinfer fused
SGLANG_NSA_PREFILL_DENSE_ATTN_KV_LEN_THRESHOLD=0 SGLANG_NSA_FUSE_TOPK=1 python3 -m sglang.launch_server --nsa-topk-backend flashinfer --kv-cache-dtype bf16 --model /data/models/ziangli_v32/DeepSeek-V3.2 --tp 8 --dp 8 --enable-dp-attention --moe-runner-backend flashinfer_trtllm_routed --attention-backend nsa --nsa-decode-backend flashmla_sparse --nsa-prefill-backend flashmla_sparse --page-size 64 --trust-remote-code
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.980
Invalid: 0.000
Latency: 13.531 s
Output throughput: 8320.213 token/s
Accuracy: 0.981
Invalid: 0.000
Latency: 12.771 s
Output throughput: 8832.274 token/s
Accuracy: 0.978
Invalid: 0.000
Latency: 12.121 s
Output throughput: 9267.255 token/s

# flashinfer fused with tie_break=1
SGLANG_NSA_TOPK_FLASHINFER_TIE_BREAK=1 SGLANG_NSA_PREFILL_DENSE_ATTN_KV_LEN_THRESHOLD=0 SGLANG_NSA_FUSE_TOPK=1 python3 -m sglang.launch_server --nsa-topk-backend flashinfer --kv-cache-dtype bf16 --model /data/models/ziangli_v32/DeepSeek-V3.2 --tp 8 --dp 8 --enable-dp-attention --moe-runner-backend flashinfer_trtllm_routed --attention-backend nsa --nsa-decode-backend flashmla_sparse --nsa-prefill-backend flashmla_sparse --page-size 64 --trust-remote-code
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.978
Invalid: 0.000
Latency: 13.716 s
Output throughput: 8219.616 token/s
Accuracy: 0.980
Invalid: 0.000
Latency: 13.008 s
Output throughput: 8652.700 token/s
Accuracy: 0.978
Invalid: 0.000
Latency: 17.669 s
Output throughput: 6457.714 token/s

# flashinfer fused with tie_break=2
SGLANG_NSA_TOPK_FLASHINFER_TIE_BREAK=2 SGLANG_NSA_PREFILL_DENSE_ATTN_KV_LEN_THRESHOLD=0 SGLANG_NSA_FUSE_TOPK=1 python3 -m sglang.launch_server --nsa-topk-backend flashinfer --kv-cache-dtype bf16 --model /data/models/ziangli_v32/DeepSeek-V3.2 --tp 8 --dp 8 --enable-dp-attention --moe-runner-backend flashinfer_trtllm_routed --attention-backend nsa --nsa-decode-backend flashmla_sparse --nsa-prefill-backend flashmla_sparse --page-size 64 --trust-remote-code
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.979
Invalid: 0.000
Latency: 13.370 s
Output throughput: 8438.633 token/s
Accuracy: 0.982
Invalid: 0.000
Latency: 13.129 s
Output throughput: 8628.890 token/s
Accuracy: 0.980
Invalid: 0.000
Latency: 12.498 s
Output throughput: 9047.713 token/s
```

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.


@github-actions github-actions Bot added the documentation Improvements or additions to documentation label Apr 15, 2026
@ziang-and ziang-and requested a review from wisclmy0611 as a code owner April 21, 2026 02:54
@zianglih zianglih changed the title [RL] [V3.2] [GLM-5] Add SGLANG_NSA_TORCH_TOPK [RL] [DSv32] [GLM-5] Add --nsa-topk-backend and integrate FlashInfer and pytorch topk Apr 21, 2026
@nvpohanh
Collaborator

cc @nvjullin

@DarkSharpness
Collaborator

qq: Does the flashinfer kernel support CUDA graph? I know flashinfer may dispatch to different algorithms based on the static sequence length, but is that safe under CUDA graph?

@zianglih
Contributor Author

Hi @DarkSharpness, thank you for calling this out. This is indeed a valid concern. FlashInfer's current dispatch heuristics use max_len, which is not CUDA-graph-safe in the current implementation. We are also working with the CCCL team on a graph-safe topk (flashinfer-ai/flashinfer#3091, etc.), which will be integrated into FlashInfer soon. For now, this PR can disallow CUDA graph when the flashinfer topk backend is used.
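The stopgap described here (rejecting CUDA graph when the flashinfer topk backend is selected) could be a simple startup check. This is a hypothetical sketch with illustrative argument names, not the actual SGLang validation code:

```python
def validate_cuda_graph_compat(nsa_topk_backend: str, enable_cuda_graph: bool) -> None:
    """Hypothetical guard: flashinfer's topk dispatch depends on max_len,
    which is not CUDA-graph-safe until a graph-safe mode lands
    (flashinfer-ai/flashinfer#3133)."""
    if enable_cuda_graph and nsa_topk_backend == "flashinfer":
        raise ValueError(
            "--nsa-topk-backend flashinfer is not CUDA-graph-safe yet; "
            "disable CUDA graph or use the sgl-kernel/torch topk backend."
        )

validate_cuda_graph_compat("flashinfer", enable_cuda_graph=False)  # OK
validate_cuda_graph_compat("torch", enable_cuda_graph=True)        # OK
```

Failing at server startup surfaces the incompatibility immediately, instead of capturing a graph whose kernel choice silently depends on the sequence lengths seen during capture.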

@zianglih
Contributor Author

zianglih commented Apr 21, 2026

Hold until flashinfer-ai/flashinfer#3133, which introduces a graph-safe mode.

kahyunnam pushed a commit to flashinfer-ai/flashinfer that referenced this pull request Apr 24, 2026

## 📌 Description
Parent PR: #3095
SGLang PR: sgl-project/sglang#22851

Add `row_starts` and `dsa_graph_safe` for SGLang DSA integration.

## 🔍 Related Issues
sgl-project/sglang#22851 (comment)


## 🚀 Pull Request Checklist


### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.


## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Summary by CodeRabbit

* **New Features**
* Added dsa_graph_safe flag to top-k APIs to opt into DSA-graph safe
execution.
* Added optional row_starts parameter to page-table and ragged top-k
transforms to support per-row score offsets.

* **Behavior**
* When dsa_graph_safe=True the optimized clusters fast-path is disabled
to ensure safe execution.

* **Tests**
* Added tests covering row_starts behavior for page-table and ragged
transforms.
@zianglih
Contributor Author

Hold until flashinfer v0.6.10 release.

