add sgl_kernel.fast_topk_v2 to top_k benchmark #2461

Merged
yzh119 merged 1 commit into flashinfer-ai:main from huangzhilin-hzl:bench_indexer_topk
Feb 3, 2026

Conversation

@huangzhilin-hzl
Contributor

@huangzhilin-hzl huangzhilin-hzl commented Feb 2, 2026

📌 Description

add sgl_kernel.fast_topk_v2 to top_k benchmark

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Tests
    • Added SGLang comparison integration to top-k benchmarking with timing measurements and performance speedup metrics.
    • Enhanced benchmark reporting to display SGLang results alongside existing comparisons.
    • Improved result table formatting with conditional column expansion and dynamic layout to accommodate additional metrics.
    • Maintained backward compatibility with existing benchmark features and error handling.

@gemini-code-assist
Contributor

Summary of Changes

Hello @huangzhilin-hzl, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the top_k benchmark by incorporating SGLang's fast_topk_v2 implementation. This addition provides a more comprehensive performance comparison for top_k operations, allowing developers to evaluate FlashInfer's efficiency against another optimized library under specific conditions. The changes also refine the benchmark's output presentation to clearly display these new comparative metrics.

Highlights

  • SGLang fast_topk_v2 Integration: Integrated sgl_kernel.fast_topk_v2 into the top_k benchmark suite to allow performance comparison against FlashInfer and PyTorch implementations.
  • Conditional Benchmarking: The SGLang benchmark is conditionally executed only when k=2048 and dtype=torch.float32, reflecting its current limitations.
  • Enhanced Benchmark Output: Modified the benchmark output table to include dedicated columns for SGLang performance (sglang_us) and its speedup relative to FlashInfer (speedup_vs_sglang) when the comparison is enabled.
  • compare_sglang Argument: Introduced a new boolean argument compare_sglang to the bench_top_k function and the main script's argument parser, allowing users to enable or disable the SGLang comparison.


@coderabbitai
Contributor

coderabbitai Bot commented Feb 2, 2026

📝 Walkthrough

Walkthrough

SGLang comparison functionality is integrated into the top-k benchmark with conditional execution when compare_sglang flag is enabled. The implementation enforces constraints (k=2048, float32 dtype), executes SGLang's fast_topk_v2 kernel, records timing and speedup metrics, and adjusts result reporting accordingly while maintaining backward compatibility.

Changes

Cohort / File(s) Summary
SGLang Comparison Integration
benchmarks/bench_topk.py
Adds SGLang kernel execution with constraint validation (k=2048, float32 required), timing measurement, and speedup calculation. Updates result table to conditionally display SGLang metrics and adjusts UI formatting (header dashes, column expansion) to accommodate optional timing columns. Maintains graceful fallback if SGLang unavailable.
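The "graceful fallback if SGLang unavailable" mentioned here is typically the standard optional-import pattern. The following is a sketch under that assumption: `HAS_SGL_KERNEL` appears in the review snippets in this thread, but the exact import shape is inferred, not copied from `bench_topk.py`.

```python
# Optional dependency: the SGLang comparison is skipped, not fatal,
# when sgl_kernel is not installed.
try:
    import sgl_kernel  # noqa: F401  (only needed for --compare-sglang)
    HAS_SGL_KERNEL = True
except ImportError:
    sgl_kernel = None
    HAS_SGL_KERNEL = False

print("sgl_kernel available:", HAS_SGL_KERNEL)
```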

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A speedy topk, now measured twice,
SGLang joins the benchmark spice,
When k is 2048, float's the way,
Speedup metrics dance and play,
Compare and contrast, side by side! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ❓ Inconclusive PR description includes the main objective but lacks detail in required sections. The Description section only briefly states what was added without explaining why, and Related Issues and Reviewer Notes sections are left empty. Expand the Description section to explain the motivation and benefits of adding SGLang comparison. Fill in Related Issues if applicable, and optionally add Reviewer Notes highlighting areas of focus (e.g., SGLang integration handling, conditional metrics).
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: adding sgl_kernel.fast_topk_v2 to the top_k benchmark.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds a benchmark for SGLang's fast_topk_v2 kernel to the existing top-k benchmark script. The changes correctly add the new benchmark logic and update the output printing to include the SGLang results. My review focuses on improving code maintainability by replacing hardcoded magic numbers related to SGLang's constraints with named constants. This will make the code cleaner and easier to update in the future.

Comment thread benchmarks/bench_topk.py
result["speedup_vs_torch"] = torch_ms / fi_ms

# SGLang comparison (only supports k=2048 and float32)
if compare_sglang and HAS_SGL_KERNEL and k == 2048 and dtype == torch.float32:
Contributor


medium

The values 2048 and torch.float32 are hardcoded. These seem to be specific constraints from the SGLang kernel. To improve readability and maintainability, it's better to define them as constants at the top of the file. This would also allow reusing them in other parts of the code where the same constraints apply.

For example, you could define at the top of the file:

SGLANG_SUPPORTED_K = 2048
SGLANG_SUPPORTED_DTYPE = torch.float32

And then use these constants here.

Suggested change
if compare_sglang and HAS_SGL_KERNEL and k == 2048 and dtype == torch.float32:
if compare_sglang and HAS_SGL_KERNEL and k == SGLANG_SUPPORTED_K and dtype == SGLANG_SUPPORTED_DTYPE:

Comment thread benchmarks/bench_topk.py
print("=" * 100)
print(f"top_k: Basic radix-based top-k selection (dtype={dtype_str})")
if args.compare_sglang:
print("NOTE: SGLang only supports k=2048 and float32")
Contributor


medium

The hardcoded value 2048 should be replaced with the SGLANG_SUPPORTED_K constant for better maintainability. You could also consider deriving the 'float32' string from the SGLANG_SUPPORTED_DTYPE constant for full consistency.

Suggested change
print("NOTE: SGLang only supports k=2048 and float32")
print(f"NOTE: SGLang only supports k={SGLANG_SUPPORTED_K} and float32")

Comment thread benchmarks/bench_topk.py
)
if "sglang_us" in result:
line += f" {result['sglang_us']:>10.2f}us {result['speedup_vs_sglang']:>9.2f}x"
elif args.compare_sglang and k == 2048:
Contributor


medium

The magic number 2048 should be replaced with the SGLANG_SUPPORTED_K constant for consistency and maintainability.

Suggested change
elif args.compare_sglang and k == 2048:
elif args.compare_sglang and k == SGLANG_SUPPORTED_K:

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1


Comment thread benchmarks/bench_topk.py
Comment on lines +326 to +330
if "sglang_us" in result:
line += f" {result['sglang_us']:>10.2f}us {result['speedup_vs_sglang']:>9.2f}x"
elif args.compare_sglang and k == 2048:
line += " (SGLang error)"
print(line)
Contributor


⚠️ Potential issue | 🟡 Minor

Misleading "(SGLang error)" message when dtype is not float32.

When --compare-sglang --dtype fp16 is used with k=2048, SGLang won't run (since it only supports float32), but this condition will print "(SGLang error)" — implying a failure rather than an unsupported configuration.

The condition should also check for dtype == torch.float32 to correctly distinguish between unsupported configurations and actual runtime errors.

🔧 Proposed fix
                         if "sglang_us" in result:
                             line += f" {result['sglang_us']:>10.2f}us {result['speedup_vs_sglang']:>9.2f}x"
-                        elif args.compare_sglang and k == 2048:
+                        elif args.compare_sglang and k == 2048 and dtype == torch.float32:
                             line += " (SGLang error)"
                         print(line)

Note: The same pattern appears at lines 372-373 and 416-417 for the other benchmark sections. Consider applying the same fix there for consistency.


Collaborator

@yzh119 yzh119 left a comment


@huangzhilin-hzl thanks for the contribution, here is the result I get for fast_topk_v2 on B200:

====================================================================================================
top_k_page_table_transform: Fused top-k + page table gather (dtype=FP32)
NOTE: SGLang only supports k=2048 and float32
====================================================================================================
 batch    seq_len      k |   FlashInfer       SGLang    Speedup
------------------------------------------------------------------------------------------
     1       4096    256 |       8.38us
     1       4096    512 |       7.62us
     1       4096   1024 |       8.67us
     1       4096   2048 |       5.34us       5.17us      0.97x
     1       4096   4096 |       4.29us
     1      16384    256 |      12.46us
     1      16384    512 |      11.87us
     1      16384   1024 |      14.67us
     1      16384   2048 |      15.07us      15.39us      1.02x
     1      16384   4096 |      20.90us
     1      65536    256 |      31.30us
     1      65536    512 |      31.62us
     1      65536   1024 |      32.16us
     1      65536   2048 |      33.76us      40.88us      1.21x
     1      65536   4096 |      35.46us
     1     131072    256 |      36.32us
     1     131072    512 |      36.32us
     1     131072   1024 |      36.86us
     1     131072   2048 |      38.38us      68.00us      1.77x
     1     131072   4096 |      40.75us
     1     262144    256 |      40.75us
     1     262144    512 |      40.99us
     1     262144   1024 |      41.22us
     1     262144   2048 |      42.85us     124.90us      2.91x
     1     262144   4096 |      45.22us
     1     524288    256 |      43.07us
     1     524288    512 |      43.14us
     1     524288   1024 |      43.23us
     1     524288   2048 |      43.84us     213.49us      4.87x
     1     524288   4096 |      45.22us
    16       4096    256 |       8.54us
    16       4096    512 |       9.50us
    16       4096   1024 |       8.80us
    16       4096   2048 |       7.97us       7.84us      0.98x
    16       4096   4096 |       4.48us
    16      16384    256 |      12.90us
    16      16384    512 |      12.90us
    16      16384   1024 |      14.85us
    16      16384   2048 |      15.01us      15.52us      1.03x
    16      16384   4096 |      20.99us
    16      65536    256 |      24.35us
    16      65536    512 |      29.58us
    16      65536   1024 |      30.53us
    16      65536   2048 |      33.60us      41.15us      1.22x
    16      65536   4096 |      39.87us
    16     131072    256 |      40.46us
    16     131072    512 |      41.20us
    16     131072   1024 |      51.65us
    16     131072   2048 |      51.78us      69.22us      1.34x
    16     131072   4096 |      45.63us
    16     262144    256 |      45.44us
    16     262144    512 |      45.62us
    16     262144   1024 |      46.21us
    16     262144   2048 |      47.54us     126.29us      2.66x
    16     262144   4096 |      49.71us
    16     524288    256 |      87.39us
    16     524288    512 |      87.84us
    16     524288   1024 |      88.67us
    16     524288   2048 |      89.50us     216.21us      2.42x
    16     524288   4096 |      92.43us
    64       4096    256 |       9.60us
    64       4096    512 |       9.12us
    64       4096   1024 |       9.73us
    64       4096   2048 |       8.48us       8.35us      0.98x
    64       4096   4096 |       4.70us
    64      16384    256 |      13.18us
    64      16384    512 |      13.65us
    64      16384   1024 |      15.26us
    64      16384   2048 |      16.10us      16.61us      1.03x
    64      16384   4096 |      21.76us
    64      65536    256 |      25.50us
    64      65536    512 |      30.46us
    64      65536   1024 |      31.30us
    64      65536   2048 |      34.85us      42.24us      1.21x
    64      65536   4096 |      41.60us
    64     131072    256 |      42.08us
    64     131072    512 |      42.03us
    64     131072   1024 |      52.34us
    64     131072   2048 |      53.52us      71.42us      1.33x
    64     131072   4096 |      90.30us
    64     262144    256 |      72.22us
    64     262144    512 |      79.20us
    64     262144   1024 |      79.63us
    64     262144   2048 |      96.35us     145.25us      1.51x
    64     262144   4096 |     146.66us
    64     524288    256 |     157.60us
    64     524288    512 |     158.18us
    64     524288   1024 |     170.53us
    64     524288   2048 |     171.41us     354.24us      2.07x
    64     524288   4096 |     229.20us
   256       4096    256 |      16.61us
   256       4096    512 |      17.70us
   256       4096   1024 |      17.22us
   256       4096   2048 |      15.10us      10.53us      0.70x
   256       4096   4096 |       7.47us
   256      16384    256 |      25.73us
   256      16384    512 |      27.20us
   256      16384   1024 |      29.10us
   256      16384   2048 |      32.26us      25.31us      0.78x
   256      16384   4096 |      42.08us
   256      65536    256 |      51.63us
   256      65536    512 |      62.42us
   256      65536   1024 |      64.74us
   256      65536   2048 |      73.12us      64.42us      0.88x
   256      65536   4096 |     153.02us
   256     131072    256 |      88.61us
   256     131072    512 |      90.22us
   256     131072   1024 |     108.10us
   256     131072   2048 |     112.61us     125.34us      1.11x
   256     131072   4096 |     261.50us
   256     262144    256 |     173.25us
   256     262144    512 |     186.21us
   256     262144   1024 |     188.06us
   256     262144   2048 |     225.66us     231.12us      1.02x
   256     262144   4096 |     422.05us
   256     524288    256 |     320.86us
   256     524288    512 |     322.24us
   256     524288   1024 |     349.15us
   256     524288   2048 |     352.86us     397.81us      1.13x
   256     524288   4096 |     835.82us

A faster version of topk will be landing in flashinfer soon.

@yzh119 yzh119 closed this Feb 3, 2026
@yzh119 yzh119 reopened this Feb 3, 2026
@yzh119 yzh119 merged commit 95d8511 into flashinfer-ai:main Feb 3, 2026
33 checks passed
@huangzhilin-hzl
Contributor Author

@huangzhilin-hzl thanks for the contribution, here is the result I get for fast_topk_v2 on B200: [full B200 table quoted above]

A faster version of topk will be landing in flashinfer soon.

Do you have any specific timings for the top-k optimization? And is the performance gain also expected to apply to top_k_page_table_transform?

I'm trying to integrate it into SGLang, but I'm finding that for seq_len < 65536, top_k_page_table_transform is only beneficial during the prefill phase. In the decode phase, it actually hurts performance.

@bkryu
Collaborator

bkryu commented Feb 3, 2026

Do you have any specific timings for the top-k optimization? And is the performance gain also expected to apply to top_k_page_table_transform?

I'm trying to integrate it into SGLang, but I'm finding that for seq_len < 65536, top_k_page_table_transform is only beneficial during the prefill phase. In the decode phase, it actually hurts performance.

Hi @huangzhilin-hzl, is there a specific timeline you and SGLang are trying to aim for? We are sorting out some final optimizations such as variable seqlen support before landing a mature version, but I'm curious whether there is a need to accelerate our plan.

@huangzhilin-hzl
Contributor Author

Do you have any specific timings for the top-k optimization? And is the performance gain also expected to apply to top_k_page_table_transform?
I'm trying to integrate it into SGLang, but I'm finding that for seq_len < 65536, top_k_page_table_transform is only beneficial during the prefill phase. In the decode phase, it actually hurts performance.

Hi @huangzhilin-hzl, is there a specific timeline you and SGLang are trying to aim for? We are sorting out some final optimizations such as variable seqlen support before landing a mature version, but I'm curious whether there is a need to accelerate our plan.

@bkryu Thanks for asking! There's no urgent timeline on my end. I'm just benchmarking flashinfer's top_k_page_table_transform against sgl-kernel to explore potential performance improvements for SGLang.

Here are my H20 benchmarks for prefill and decode:
Setup: dsv3.2 with chunk-size=16384, SGLang using fast_topk_transform_fused API.

  • In the prefill scenario, where prefill_bs <= B, the topk_transform_prefill_kernel is called:

     batch    seq_len      k |   FlashInfer       SGLang
         1       2048   2048 |      22.24us      34.56us
         1       4096   2048 |     202.94us     223.95us
         1       8192   2048 |     636.80us     676.74us
         1      16384   2048 |    1885.76us    2022.46us
  • In the decode scenario, where prefill_bs == B and the batch size is a multiple of 8, the topk_transform_decode_kernel is called:

     batch    seq_len      k |   FlashInfer       SGLang
         8       1024   2048 |       6.59us       6.05us
         8       2048   2048 |       7.09us       6.24us
         8       4096   2048 |      11.49us      11.06us
         8       8192   2048 |      15.41us      14.98us
         8      16384   2048 |      19.44us      19.36us
         8      65536   2048 |      40.16us      46.19us
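For reference, the speedup columns printed throughout this thread are the ratio of the other kernel's time to FlashInfer's time. The helper below is a hypothetical illustration of that metric, not code from `bench_topk.py`; the row values come from the B200 table above.

```python
def speedup(flashinfer_us, other_us):
    """Speedup of FlashInfer relative to another kernel.

    Values above 1.0 mean FlashInfer is faster; below 1.0 mean
    the other kernel (here SGLang) is faster.
    """
    return other_us / flashinfer_us


# B200 row (batch=1, seq_len=524288, k=2048): 43.84us vs 213.49us
print(f"{speedup(43.84, 213.49):.2f}x")  # -> 4.87x
```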

raayandhar pushed a commit to raayandhar/flashinfer that referenced this pull request Feb 5, 2026
Co-authored-by: 墨楼 <huangzhilin.hzl@antgroup.com>
@coderabbitai coderabbitai Bot mentioned this pull request Apr 14, 2026