add sgl_kernel.fast_topk_v2 to top_k benchmark #2461
yzh119 merged 1 commit into flashinfer-ai:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request adds a comparison against SGLang's fast_topk_v2 kernel to the top-k benchmark, so the two implementations can be timed side by side.
📝 Walkthrough: SGLang comparison functionality is integrated into the top-k benchmark, with conditional execution when the `--compare-sglang` flag is set and the `sgl_kernel` package is available.
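The conditional execution presumably hinges on an optional-import guard; the flag name `HAS_SGL_KERNEL` matches the one visible in the quoted diffs, but this exact wiring is a sketch, not the PR's verbatim code:

```python
# Sketch of an optional-import guard for the SGLang kernel package.
# If sgl_kernel is not installed, the benchmark simply skips the
# SGLang comparison instead of crashing at import time.
try:
    import sgl_kernel  # noqa: F401

    HAS_SGL_KERNEL = True
except ImportError:
    HAS_SGL_KERNEL = False
```

This pattern keeps the benchmark runnable in environments where only FlashInfer is installed.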
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: 1 passed, 1 warning, 1 inconclusive
Code Review
This pull request adds a benchmark for SGLang's fast_topk_v2 kernel to the existing top-k benchmark script. The changes correctly add the new benchmark logic and update the output printing to include the SGLang results. My review focuses on improving code maintainability by replacing hardcoded magic numbers related to SGLang's constraints with named constants. This will make the code cleaner and easier to update in the future.
```python
        result["speedup_vs_torch"] = torch_ms / fi_ms

    # SGLang comparison (only supports k=2048 and float32)
    if compare_sglang and HAS_SGL_KERNEL and k == 2048 and dtype == torch.float32:
```
The values 2048 and torch.float32 are hardcoded. These seem to be specific constraints from the SGLang kernel. To improve readability and maintainability, it's better to define them as constants at the top of the file. This would also allow reusing them in other parts of the code where the same constraints apply.
For example, you could define at the top of the file:

```python
SGLANG_SUPPORTED_K = 2048
SGLANG_SUPPORTED_DTYPE = torch.float32
```

And then use these constants here.
```diff
-    if compare_sglang and HAS_SGL_KERNEL and k == 2048 and dtype == torch.float32:
+    if compare_sglang and HAS_SGL_KERNEL and k == SGLANG_SUPPORTED_K and dtype == SGLANG_SUPPORTED_DTYPE:
```
```python
    print("=" * 100)
    print(f"top_k: Basic radix-based top-k selection (dtype={dtype_str})")
    if args.compare_sglang:
        print("NOTE: SGLang only supports k=2048 and float32")
```
The hardcoded value 2048 should be replaced with the SGLANG_SUPPORTED_K constant for better maintainability. You could also consider deriving the 'float32' string from the SGLANG_SUPPORTED_DTYPE constant for full consistency.
```diff
-        print("NOTE: SGLang only supports k=2048 and float32")
+        print(f"NOTE: SGLang only supports k={SGLANG_SUPPORTED_K} and float32")
```
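The review's further suggestion, deriving the printed dtype name from the constant, could look like the following sketch. A string stand-in is used for the dtype here so the snippet runs without torch; in the real script, `str(torch.float32)` renders as `"torch.float32"` in the same way:

```python
# Hypothetical sketch: keep the NOTE line in sync with the constants so
# the message cannot drift from the actual constraint check.
SGLANG_SUPPORTED_K = 2048
SGLANG_SUPPORTED_DTYPE = "torch.float32"  # stand-in for torch.float32

# str(torch.float32) == "torch.float32", so taking the last dotted
# component yields the short name used in the printed NOTE.
dtype_name = str(SGLANG_SUPPORTED_DTYPE).split(".")[-1]
note = f"NOTE: SGLang only supports k={SGLANG_SUPPORTED_K} and {dtype_name}"
```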
```python
        )
        if "sglang_us" in result:
            line += f" {result['sglang_us']:>10.2f}us {result['speedup_vs_sglang']:>9.2f}x"
        elif args.compare_sglang and k == 2048:
```
Actionable comments posted: 1
```python
        if "sglang_us" in result:
            line += f" {result['sglang_us']:>10.2f}us {result['speedup_vs_sglang']:>9.2f}x"
        elif args.compare_sglang and k == 2048:
            line += " (SGLang error)"
        print(line)
```
Misleading "(SGLang error)" message when dtype is not float32.
When `--compare-sglang --dtype fp16` is used with `k=2048`, SGLang won't run (since it only supports float32), but this condition will print "(SGLang error)", implying a failure rather than an unsupported configuration.
The condition should also check for dtype == torch.float32 to correctly distinguish between unsupported configurations and actual runtime errors.
🔧 Proposed fix

```diff
         if "sglang_us" in result:
             line += f" {result['sglang_us']:>10.2f}us {result['speedup_vs_sglang']:>9.2f}x"
-        elif args.compare_sglang and k == 2048:
+        elif args.compare_sglang and k == 2048 and dtype == torch.float32:
             line += " (SGLang error)"
         print(line)
```

Note: The same pattern appears at lines 372-373 and 416-417 for the other benchmark sections. Consider applying the same fix there for consistency.
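The three-way distinction the review draws, kernel timed, real runtime failure, or merely unsupported configuration, can be sketched as a small classifier. This is a hypothetical helper for illustration, not code from the PR:

```python
# Hypothetical helper: classify the SGLang column for one benchmark row.
# "ok"      -> SGLang ran and was timed (result carries "sglang_us")
# "error"   -> the config was eligible (k=2048, float32) but no timing
#              was produced, so something genuinely failed at runtime
# "skipped" -> unsupported k or dtype; not an error, just not comparable
def sglang_status(result, compare_sglang, k, dtype_is_f32):
    if "sglang_us" in result:
        return "ok"
    if compare_sglang and k == 2048 and dtype_is_f32:
        return "error"
    return "skipped"
```

With this helper, the "(SGLang error)" suffix would only ever be printed for rows classified as `"error"`.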
yzh119 left a comment:
@huangzhilin-hzl thanks for the contribution, here is the result I get for fast_topk_v2 on B200:
====================================================================================================
top_k_page_table_transform: Fused top-k + page table gather (dtype=FP32)
NOTE: SGLang only supports k=2048 and float32
====================================================================================================
batch seq_len k | FlashInfer SGLang Speedup
------------------------------------------------------------------------------------------
1 4096 256 | 8.38us
1 4096 512 | 7.62us
1 4096 1024 | 8.67us
1 4096 2048 | 5.34us 5.17us 0.97x
1 4096 4096 | 4.29us
1 16384 256 | 12.46us
1 16384 512 | 11.87us
1 16384 1024 | 14.67us
1 16384 2048 | 15.07us 15.39us 1.02x
1 16384 4096 | 20.90us
1 65536 256 | 31.30us
1 65536 512 | 31.62us
1 65536 1024 | 32.16us
1 65536 2048 | 33.76us 40.88us 1.21x
1 65536 4096 | 35.46us
1 131072 256 | 36.32us
1 131072 512 | 36.32us
1 131072 1024 | 36.86us
1 131072 2048 | 38.38us 68.00us 1.77x
1 131072 4096 | 40.75us
1 262144 256 | 40.75us
1 262144 512 | 40.99us
1 262144 1024 | 41.22us
1 262144 2048 | 42.85us 124.90us 2.91x
1 262144 4096 | 45.22us
1 524288 256 | 43.07us
1 524288 512 | 43.14us
1 524288 1024 | 43.23us
1 524288 2048 | 43.84us 213.49us 4.87x
1 524288 4096 | 45.22us
16 4096 256 | 8.54us
16 4096 512 | 9.50us
16 4096 1024 | 8.80us
16 4096 2048 | 7.97us 7.84us 0.98x
16 4096 4096 | 4.48us
16 16384 256 | 12.90us
16 16384 512 | 12.90us
16 16384 1024 | 14.85us
16 16384 2048 | 15.01us 15.52us 1.03x
16 16384 4096 | 20.99us
16 65536 256 | 24.35us
16 65536 512 | 29.58us
16 65536 1024 | 30.53us
16 65536 2048 | 33.60us 41.15us 1.22x
16 65536 4096 | 39.87us
16 131072 256 | 40.46us
16 131072 512 | 41.20us
16 131072 1024 | 51.65us
16 131072 2048 | 51.78us 69.22us 1.34x
16 131072 4096 | 45.63us
16 262144 256 | 45.44us
16 262144 512 | 45.62us
16 262144 1024 | 46.21us
16 262144 2048 | 47.54us 126.29us 2.66x
16 262144 4096 | 49.71us
16 524288 256 | 87.39us
16 524288 512 | 87.84us
16 524288 1024 | 88.67us
16 524288 2048 | 89.50us 216.21us 2.42x
16 524288 4096 | 92.43us
64 4096 256 | 9.60us
64 4096 512 | 9.12us
64 4096 1024 | 9.73us
64 4096 2048 | 8.48us 8.35us 0.98x
64 4096 4096 | 4.70us
64 16384 256 | 13.18us
64 16384 512 | 13.65us
64 16384 1024 | 15.26us
64 16384 2048 | 16.10us 16.61us 1.03x
64 16384 4096 | 21.76us
64 65536 256 | 25.50us
64 65536 512 | 30.46us
64 65536 1024 | 31.30us
64 65536 2048 | 34.85us 42.24us 1.21x
64 65536 4096 | 41.60us
64 131072 256 | 42.08us
64 131072 512 | 42.03us
64 131072 1024 | 52.34us
64 131072 2048 | 53.52us 71.42us 1.33x
64 131072 4096 | 90.30us
64 262144 256 | 72.22us
64 262144 512 | 79.20us
64 262144 1024 | 79.63us
64 262144 2048 | 96.35us 145.25us 1.51x
64 262144 4096 | 146.66us
64 524288 256 | 157.60us
64 524288 512 | 158.18us
64 524288 1024 | 170.53us
64 524288 2048 | 171.41us 354.24us 2.07x
64 524288 4096 | 229.20us
256 4096 256 | 16.61us
256 4096 512 | 17.70us
256 4096 1024 | 17.22us
256 4096 2048 | 15.10us 10.53us 0.70x
256 4096 4096 | 7.47us
256 16384 256 | 25.73us
256 16384 512 | 27.20us
256 16384 1024 | 29.10us
256 16384 2048 | 32.26us 25.31us 0.78x
256 16384 4096 | 42.08us
256 65536 256 | 51.63us
256 65536 512 | 62.42us
256 65536 1024 | 64.74us
256 65536 2048 | 73.12us 64.42us 0.88x
256 65536 4096 | 153.02us
256 131072 256 | 88.61us
256 131072 512 | 90.22us
256 131072 1024 | 108.10us
256 131072 2048 | 112.61us 125.34us 1.11x
256 131072 4096 | 261.50us
256 262144 256 | 173.25us
256 262144 512 | 186.21us
256 262144 1024 | 188.06us
256 262144 2048 | 225.66us 231.12us 1.02x
256 262144 4096 | 422.05us
256 524288 256 | 320.86us
256 524288 512 | 322.24us
256 524288 1024 | 349.15us
256 524288 2048 | 352.86us 397.81us 1.13x
256 524288 4096 | 835.82us

A faster version of topk will be landing in flashinfer soon.
Do you have any specific timings for the top-k optimization? And is the performance gain also expected to apply to top_k_page_table_transform? I'm trying to integrate it into SGLang, but I'm finding that for seq_len < 65536, top_k_page_table_transform is only beneficial during the prefill phase; in the decode phase, it actually hurts performance.
Hi @huangzhilin-hzl, is there a specific timeline you and SGLang are trying to aim for? We are sorting out some final optimizations, such as variable-seqlen support, before landing a mature version, but I'm curious whether there is a need to accelerate our plan.
@bkryu Thanks for asking! There's no urgent timeline on my end. I'm just benchmarking FlashInfer's top_k_page_table_transform against sgl-kernel to explore potential performance improvements for SGLang. Here are my H20 benchmarks for prefill and decode:
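Measurements like the ones discussed above generally come from a warmup-then-time loop. A minimal host-side sketch follows; note that for real GPU kernels the loop must synchronize the device (e.g. `torch.cuda.synchronize()` or CUDA events) before reading the clock, or it measures only launch overhead:

```python
import time


def bench_us(fn, warmup=5, iters=20):
    """Return mean wall-clock microseconds per call of fn.

    Host-side sketch only: for CUDA kernels, add a device
    synchronization before each perf_counter() read, or switch
    to CUDA-event timing for accurate kernel durations.
    """
    for _ in range(warmup):
        fn()  # warm caches, trigger any lazy compilation
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1e6
```

This is an assumed shape for the measurement loop, not the benchmark script's verbatim code.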
## 📌 Description

add sgl_kernel.fast_topk_v2 to top_k benchmark

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Summary by CodeRabbit

- **Tests**
  - Added SGLang comparison integration to top-k benchmarking with timing measurements and performance speedup metrics.
  - Enhanced benchmark reporting to display SGLang results alongside existing comparisons.
  - Improved result table formatting with conditional column expansion and dynamic layout to accommodate additional metrics.
  - Maintained backward compatibility with existing benchmark features and error handling.

Co-authored-by: 墨楼 <huangzhilin.hzl@antgroup.com>