[RL] [DSv32] [GLM-5] Add --nsa-topk-backend and integrate FlashInfer and pytorch topk #22851
zianglih wants to merge 8 commits into sgl-project:main

Conversation
cc @nvjullin
qq: Does the FlashInfer kernel support CUDA graph? I know FlashInfer may dispatch to different algorithms based on static sequence length, but is that safe under CUDA graph?
Hi @DarkSharpness, thank you for calling this out. This is indeed a valid concern. FlashInfer's current dispatch heuristics use
Hold until flashinfer-ai/flashinfer#3133, which introduces a graph-safe mode.
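The capture-safety issue can be sketched in plain Python (function and kernel names here are illustrative, not FlashInfer's actual API): a dispatcher that branches on runtime sequence length freezes whichever branch was taken at capture time into the graph, while a graph-safe mode always takes one length-independent path.

```python
# Illustrative sketch only; names are assumptions, not the FlashInfer API.

def heuristic_dispatch(seq_len: int) -> str:
    # Branching on a runtime value means only one path gets captured
    # into a CUDA graph, then replayed for every later input.
    return "clusters_fast_path" if seq_len <= 4096 else "fallback_kernel"

def graph_safe_dispatch(seq_len: int, graph_safe: bool = True) -> str:
    # A graph-safe mode always takes the same, length-independent path,
    # so replaying the captured graph stays correct for any length.
    if graph_safe:
        return "fallback_kernel"
    return heuristic_dispatch(seq_len)

# At capture time seq_len happened to be short...
captured = heuristic_dispatch(1024)       # "clusters_fast_path"
# ...but replay may later see lengths the captured fast path can't handle.
replayed = graph_safe_dispatch(65536)     # "fallback_kernel", always
```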
From the linked FlashInfer PR (flashinfer-ai/flashinfer#3133):

📌 Description
Parent PR: #3095. SGLang PR: sgl-project/sglang#22851. Adds `row_starts` and `dsa_graph_safe` for SGLang DSA integration.

🔍 Related Issues
sgl-project/sglang#22851 (comment)

Summary by CodeRabbit
- New Features
  - Added a `dsa_graph_safe` flag to top-k APIs to opt into DSA-graph-safe execution.
  - Added an optional `row_starts` parameter to page-table and ragged top-k transforms to support per-row score offsets.
- Behavior
  - When `dsa_graph_safe=True`, the optimized clusters fast path is disabled to ensure safe execution.
- Tests
  - Added tests covering `row_starts` behavior for page-table and ragged transforms.
Hold until flashinfer v0.6.10 release. |
Motivation

- Add `--nsa-topk-backend` for configurable topk backend implementation selection.
- `torch.topk` is used by GLM-5 for RL.
- FlashInfer topk has determinism and a configurable tie break (flashinfer-ai/flashinfer#3095), and better long-context performance.
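The determinism and tie-break point can be illustrated with a pure-Python top-k sketch (a stand-in, not the FlashInfer kernel; the `tie_break` values are assumptions): among equal scores, the winner is chosen by index rather than left unspecified.

```python
# Pure-Python stand-in for a deterministic top-k with configurable tie break.
# Not the real kernel; illustrates the property only.

def topk_indices(scores, k, tie_break="lowest_index"):
    order = 1 if tie_break == "lowest_index" else -1
    # Sort by score descending; among equal scores, prefer the chosen
    # index order so results are fully deterministic.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], order * i))
    return sorted(ranked[:k])

scores = [0.9, 0.5, 0.5, 0.5]
print(topk_indices(scores, 3, "lowest_index"))   # [0, 1, 2]
print(topk_indices(scores, 3, "highest_index"))  # [0, 2, 3]
```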
Modifications

- Add `--nsa-topk-backend`, defaulting to the existing `sgl-kernel` backend.
- Add `SGLANG_NSA_TOPK_FLASHINFER_TIE_BREAK` and `SGLANG_NSA_TOPK_FLASHINFER_DETERMINISTIC` environment variables.

Accuracy Tests
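A hypothetical sketch of what these knobs imply (flag, backend, and env-var names are taken from this PR, but the validation and parsing logic here is assumed, not the actual SGLang code):

```python
import os

# Assumed backend choices; the PR names sgl-kernel (default), FlashInfer,
# and torch.topk as the available implementations.
VALID_NSA_TOPK_BACKENDS = ("sgl-kernel", "flashinfer", "torch")

def select_topk_backend(name: str) -> str:
    # Reject anything outside the configured backend choices.
    if name not in VALID_NSA_TOPK_BACKENDS:
        raise ValueError(f"unknown --nsa-topk-backend: {name!r}")
    return name

def env_flag(name: str, default: bool = False) -> bool:
    # Common truthy spellings for boolean environment variables (assumed).
    val = os.environ.get(name)
    if val is None:
        return default
    return val.strip().lower() in ("1", "true", "yes", "on")

os.environ["SGLANG_NSA_TOPK_FLASHINFER_DETERMINISTIC"] = "1"
backend = select_topk_backend("flashinfer")
deterministic = env_flag("SGLANG_NSA_TOPK_FLASHINFER_DETERMINISTIC")
print(backend, deterministic)  # flashinfer True
```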
New unit test: `python3 -m pytest -q test/registered/kernels/test_nsa_indexer.py -k test_topk_unfused_backends_valid_selection` passed.

Speed Tests and Profiling
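For speed testing, a minimal timing harness can look like the following (pure-Python stand-ins for the backends; a real comparison would time the CUDA kernels on GPU instead):

```python
import heapq
import time

def bench(fn, *args, iters=50):
    # Average wall-clock time per call.
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters

def topk_sorted(scores, k):
    # Full sort; stable tie break by lowest index.
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

def topk_heap(scores, k):
    # Heap-based selection; avoids sorting the whole list.
    return heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])

# Deterministic pseudo-random scores (hash-style mix, nearly all unique).
scores = [((i * 2654435761) % 4093) / 4093 for i in range(4096)]
t_sort = bench(topk_sorted, scores, 64)
t_heap = bench(topk_heap, scores, 64)
print(f"sorted: {t_sort * 1e6:.1f} us/call, heap: {t_heap * 1e6:.1f} us/call")
```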
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci