
Use FlashInfer tinygemm for GPT-OSS MoE router on SM90+ #20755

Merged: Qiaolin-Yu merged 10 commits into sgl-project:main from mmangkad-dev:gpt-oss-router-tinygemm on Mar 24, 2026
Conversation

@mmangkad (Contributor)

Summary

FlashInfer 0.6.6 (the kernel actually landed in 0.6.5) added tinygemm_bf16, a faster kernel for small GEMMs. This PR applies it to the GPT-OSS MoE router on SM90+ GPUs.
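
To make the scope concrete, here is a minimal sketch of the dispatch condition, assuming the checks described later in this thread (CUDA device, bfloat16, 2-D input, SM90+). The helper name is hypothetical and not taken from the PR:

```python
import torch

def _router_can_use_tinygemm(x: torch.Tensor) -> bool:
    # Hypothetical helper (not from the PR): gate the tinygemm_bf16 fast path
    # on the conditions this PR describes (CUDA, bfloat16, 2-D, SM90+).
    if not (x.is_cuda and x.dtype == torch.bfloat16 and x.dim() == 2):
        return False
    major, _ = torch.cuda.get_device_capability(x.device)
    return major >= 9  # SM90 and newer (Hopper+)
```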

Accuracy Test (GPQA)

Before:

[{'eval_name': 'gpqa', 'model_name': 'openai__gpt-oss-20b-low_temp1.0_20260317_060903', 'metric': 0.5625}]

After:

[{'eval_name': 'gpqa', 'model_name': 'openai__gpt-oss-20b-low_temp1.0_20260317_062812', 'metric': 0.5593434343434344}]

Benchmarking and Profiling

Before:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  8.84
Total input tokens:                      2517
Total input text tokens:                 2517
Total generated tokens:                  2684
Total generated tokens (retokenized):    2576
Request throughput (req/s):              1.13
Input token throughput (tok/s):          284.65
Output token throughput (tok/s):         303.53
Peak output token throughput (tok/s):    307.00
Peak concurrent requests:                3
Total token throughput (tok/s):          588.18
Concurrency:                             1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   882.95
Median E2E Latency (ms):                 847.49
P90 E2E Latency (ms):                    1527.18
P99 E2E Latency (ms):                    1654.68
---------------Time to First Token----------------
Mean TTFT (ms):                          12.41
Median TTFT (ms):                        12.05
P99 TTFT (ms):                           15.16
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          3.25
Median TPOT (ms):                        3.25
P99 TPOT (ms):                           3.26
---------------Inter-Token Latency----------------
Mean ITL (ms):                           3.26
Median ITL (ms):                         3.25
P95 ITL (ms):                            3.50
P99 ITL (ms):                            3.98
Max ITL (ms):                            8.07
==================================================
Screenshot 2026-03-17 143543

After:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  8.61
Total input tokens:                      2517
Total input text tokens:                 2517
Total generated tokens:                  2684
Total generated tokens (retokenized):    2521
Request throughput (req/s):              1.16
Input token throughput (tok/s):          292.39
Output token throughput (tok/s):         311.79
Peak output token throughput (tok/s):    316.00
Peak concurrent requests:                3
Total token throughput (tok/s):          604.19
Concurrency:                             1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   859.66
Median E2E Latency (ms):                 826.02
P90 E2E Latency (ms):                    1486.07
P99 E2E Latency (ms):                    1610.24
---------------Time to First Token----------------
Mean TTFT (ms):                          12.22
Median TTFT (ms):                        11.53
P99 TTFT (ms):                           16.45
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          3.17
Median TPOT (ms):                        3.17
P99 TPOT (ms):                           3.18
---------------Inter-Token Latency----------------
Mean ITL (ms):                           3.17
Median ITL (ms):                         3.17
P95 ITL (ms):                            3.59
P99 ITL (ms):                            5.68
Max ITL (ms):                            9.44
==================================================
Screenshot 2026-03-17 143948

@gemini-code-assist (Contributor, bot)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly enhances the performance of the GPT-OSS Mixture-of-Experts (MoE) router by leveraging the tinygemm_bf16 kernel from the FlashInfer library. This targeted optimization, active on SM90+ NVIDIA GPUs, aims to reduce latency and increase throughput for small GEMM operations within the router, as evidenced by the provided benchmarks, while maintaining model accuracy.

Highlights

  • Performance Enhancement: Integrated FlashInfer's tinygemm_bf16 kernel to accelerate the GPT-OSS MoE router.
  • Hardware Targeting: The optimization is specifically enabled for NVIDIA GPUs with SM90+ architecture.
  • Code Refactoring: Introduced GptOssRouterLinear to encapsulate the tinygemm_bf16 fast path, replacing the generic ReplicatedLinear for the router.


Changelog
  • python/sglang/srt/models/gpt_oss.py
    • Imported the tinygemm_bf16 function from flashinfer.gemm.
    • Defined a new class, GptOssRouterLinear, which inherits from ReplicatedLinear and implements a conditional fast path using tinygemm_bf16 for bfloat16 2D tensor operations on SM90+ CUDA devices (see the sketch after this summary).
    • Updated the GptOssSparseMoeBlock to utilize GptOssRouterLinear for its self.router attribute, enabling the new optimized path.
Activity
  • The author performed accuracy tests using GPQA, showing a negligible change in metric (0.5625 to 0.5593).
  • Comprehensive benchmarking and profiling were conducted, demonstrating improvements in request throughput, input/output token throughput, and reduced end-to-end latency and time per output token.
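
Putting the changelog together, a rough sketch of what the new router class could look like is below. This is an illustrative reconstruction, not the PR's code: the import path for ReplicatedLinear, the tuple return convention, and especially the call signature of flashinfer.gemm.tinygemm_bf16 are assumptions.

```python
import torch
from flashinfer.gemm import tinygemm_bf16  # kernel added in FlashInfer 0.6.x

# Assumed import path; the PR only states that the class inherits from ReplicatedLinear.
from sglang.srt.layers.linear import ReplicatedLinear


class GptOssRouterLinear(ReplicatedLinear):
    """Router linear with a tinygemm_bf16 fast path on SM90+ (illustrative sketch)."""

    def forward(self, x: torch.Tensor):
        use_fast_path = (
            x.is_cuda
            and x.dtype == torch.bfloat16
            and x.dim() == 2
            and torch.cuda.get_device_capability(x.device)[0] >= 9
        )
        if use_fast_path:
            # The real tinygemm_bf16 signature is not shown in this PR; the
            # (activation, weight) call below is an assumption for illustration.
            out = tinygemm_bf16(x, self.weight)
            if self.bias is not None:
                out = out + self.bias
            return out, None  # ReplicatedLinear-style (output, output_bias) tuple
        # Safe fallback to the generic replicated GEMM on other dtypes/devices.
        return super().forward(x)
```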

@gemini-code-assist (Contributor, bot) left a comment

Code Review

This pull request introduces a performance optimization for the GPT-OSS MoE router on SM90+ GPUs by using the tinygemm_bf16 kernel from FlashInfer. A new GptOssRouterLinear class is implemented with a fast path for bfloat16 data types on supported hardware, including necessary checks and a safe fallback to the original implementation. The changes are well-contained and appear correct. The provided benchmarks show a performance improvement.

@mmangkad (Contributor, Author)

mmangkad commented Mar 17, 2026

/rerun-failed-ci

Five outdated review comment threads on python/sglang/srt/models/gpt_oss.py
@mmangkad (Contributor, Author)

@Qiaolin-Yu comments addressed; here are the decoding results for batch sizes 64 and 128:

bs = 64

State    Median Latency (s)    Median Throughput (token/s)
Before   0.00778               8222.16
After    0.00764 (-1.80%)      8376.83 (+1.88%)

bs = 128

State    Median Latency (s)    Median Throughput (token/s)
Before   0.01026               12469.96
After    0.01021 (-0.49%)      12538.58 (+0.55%)
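
As a quick arithmetic check (not part of the PR), the percentage deltas quoted in these tables follow directly from the before/after medians:

```python
# Recompute the relative deltas from the medians in the tables above.
pairs = {
    "bs=64  median latency (s)":        (0.00778, 0.00764),
    "bs=64  median throughput (tok/s)": (8222.16, 8376.83),
    "bs=128 median latency (s)":        (0.01026, 0.01021),
    "bs=128 median throughput (tok/s)": (12469.96, 12538.58),
}
for name, (before, after) in pairs.items():
    print(f"{name}: {100 * (after - before) / before:+.2f}%")
# Expected output: -1.80%, +1.88%, -0.49%, +0.55%
```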

@Qiaolin-Yu (Collaborator) left a comment

lgtm. let's wait for CIs.

@mmangkad (Contributor, Author)

@Qiaolin-Yu any blockers to merge this?

@Qiaolin-Yu (Collaborator)

@elvischenv what's your github email? will add you as a coauthor when merging.

@elvischenv (Contributor)

> @elvischenv what's your github email? will add you as a coauthor when merging.

@Qiaolin-Yu Thank you, please use this one: 219235043+elvischenv@users.noreply.github.com
@mmangkad Good PR!

@Qiaolin-Yu Qiaolin-Yu merged commit bbe25b2 into sgl-project:main Mar 24, 2026
229 of 263 checks passed
@mmangkad mmangkad deleted the gpt-oss-router-tinygemm branch March 25, 2026 00:12
Commits referencing this pull request (each carries "Co-authored-by: elvischenv <219235043+elvischenv@users.noreply.github.com>"):
  • 0-693 pushed a commit to 0-693/sglang (Mar 25, 2026)
  • johnnycxm pushed a commit to johnnycxm/sglang (Mar 25, 2026)
  • JustinTong0323 pushed a commit to JustinTong0323/sglang (Apr 7, 2026)
  • yhyang201 pushed a commit to yhyang201/sglang (Apr 22, 2026)