
Use FlashInfer tinygemm for GPT-OSS MoE router on SM90+ #20755

Merged: Qiaolin-Yu merged 10 commits into sgl-project:main from mmangkad-dev:gpt-oss-router-tinygemm on Mar 24, 2026
Conversation

@mmangkad (Contributor)

Summary

FlashInfer 0.6.6 (the kernel actually landed in 0.6.5) added tinygemm_bf16, a faster kernel for small GEMMs. This PR applies it to the GPT-OSS MoE router on SM90+ GPUs.
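
To make the scope concrete, here is a minimal sketch of the dispatch condition, assuming the checks described later in this thread (CUDA device, bfloat16, 2-D input, SM90+). The helper name is hypothetical and not taken from the PR:

```python
import torch

def _router_can_use_tinygemm(x: torch.Tensor) -> bool:
    # Hypothetical helper (not from the PR): gate the tinygemm_bf16 fast path
    # on the conditions this PR describes (CUDA, bfloat16, 2-D, SM90+).
    if not (x.is_cuda and x.dtype == torch.bfloat16 and x.dim() == 2):
        return False
    major, _ = torch.cuda.get_device_capability(x.device)
    return major >= 9  # SM90 and newer (Hopper+)
```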

Accuracy Test (GPQA)

Before:

[{'eval_name': 'gpqa', 'model_name': 'openai__gpt-oss-20b-low_temp1.0_20260317_060903', 'metric': 0.5625}]

After:

[{'eval_name': 'gpqa', 'model_name': 'openai__gpt-oss-20b-low_temp1.0_20260317_062812', 'metric': 0.5593434343434344}]

Benchmarking and Profiling

Before:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  8.84
Total input tokens:                      2517
Total input text tokens:                 2517
Total generated tokens:                  2684
Total generated tokens (retokenized):    2576
Request throughput (req/s):              1.13
Input token throughput (tok/s):          284.65
Output token throughput (tok/s):         303.53
Peak output token throughput (tok/s):    307.00
Peak concurrent requests:                3
Total token throughput (tok/s):          588.18
Concurrency:                             1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   882.95
Median E2E Latency (ms):                 847.49
P90 E2E Latency (ms):                    1527.18
P99 E2E Latency (ms):                    1654.68
---------------Time to First Token----------------
Mean TTFT (ms):                          12.41
Median TTFT (ms):                        12.05
P99 TTFT (ms):                           15.16
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          3.25
Median TPOT (ms):                        3.25
P99 TPOT (ms):                           3.26
---------------Inter-Token Latency----------------
Mean ITL (ms):                           3.26
Median ITL (ms):                         3.25
P95 ITL (ms):                            3.50
P99 ITL (ms):                            3.98
Max ITL (ms):                            8.07
==================================================
Screenshot 2026-03-17 143543

After:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     10
Benchmark duration (s):                  8.61
Total input tokens:                      2517
Total input text tokens:                 2517
Total generated tokens:                  2684
Total generated tokens (retokenized):    2521
Request throughput (req/s):              1.16
Input token throughput (tok/s):          292.39
Output token throughput (tok/s):         311.79
Peak output token throughput (tok/s):    316.00
Peak concurrent requests:                3
Total token throughput (tok/s):          604.19
Concurrency:                             1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   859.66
Median E2E Latency (ms):                 826.02
P90 E2E Latency (ms):                    1486.07
P99 E2E Latency (ms):                    1610.24
---------------Time to First Token----------------
Mean TTFT (ms):                          12.22
Median TTFT (ms):                        11.53
P99 TTFT (ms):                           16.45
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          3.17
Median TPOT (ms):                        3.17
P99 TPOT (ms):                           3.18
---------------Inter-Token Latency----------------
Mean ITL (ms):                           3.17
Median ITL (ms):                         3.17
P95 ITL (ms):                            3.59
P99 ITL (ms):                            5.68
Max ITL (ms):                            9.44
==================================================
Screenshot 2026-03-17 143948

@gemini-code-assist (Contributor, bot)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly enhances the performance of the GPT-OSS Mixture-of-Experts (MoE) router by leveraging the tinygemm_bf16 kernel from the FlashInfer library. This targeted optimization, active on SM90+ NVIDIA GPUs, aims to reduce latency and increase throughput for small GEMM operations within the router, as evidenced by the provided benchmarks, while maintaining model accuracy.

Highlights

  • Performance Enhancement: Integrated FlashInfer's tinygemm_bf16 kernel to accelerate the GPT-OSS MoE router.
  • Hardware Targeting: The optimization is specifically enabled for NVIDIA GPUs with SM90+ architecture.
  • Code Refactoring: Introduced GptOssRouterLinear to encapsulate the tinygemm_bf16 fast path, replacing the generic ReplicatedLinear for the router.


Changelog
  • python/sglang/srt/models/gpt_oss.py
    • Imported the tinygemm_bf16 function from flashinfer.gemm.
    • Defined a new class, GptOssRouterLinear, which inherits from ReplicatedLinear and implements a conditional fast path using tinygemm_bf16 for bfloat16 2D tensor operations on SM90+ CUDA devices (see the sketch after this summary).
    • Updated the GptOssSparseMoeBlock to utilize GptOssRouterLinear for its self.router attribute, enabling the new optimized path.
Activity
  • The author performed accuracy tests using GPQA, showing a negligible change in metric (0.5625 to 0.5593).
  • Comprehensive benchmarking and profiling were conducted, demonstrating improvements in request throughput, input/output token throughput, and reduced end-to-end latency and time per output token.
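
Putting the changelog together, a rough sketch of what the new router class could look like is below. This is an illustrative reconstruction, not the PR's code: the import path for ReplicatedLinear, the tuple return convention, and especially the call signature of flashinfer.gemm.tinygemm_bf16 are assumptions.

```python
import torch
from flashinfer.gemm import tinygemm_bf16  # kernel added in FlashInfer 0.6.x

# Assumed import path; the PR only states that the class inherits from ReplicatedLinear.
from sglang.srt.layers.linear import ReplicatedLinear


class GptOssRouterLinear(ReplicatedLinear):
    """Router linear with a tinygemm_bf16 fast path on SM90+ (illustrative sketch)."""

    def forward(self, x: torch.Tensor):
        use_fast_path = (
            x.is_cuda
            and x.dtype == torch.bfloat16
            and x.dim() == 2
            and torch.cuda.get_device_capability(x.device)[0] >= 9
        )
        if use_fast_path:
            # The real tinygemm_bf16 signature is not shown in this PR; the
            # (activation, weight) call below is an assumption for illustration.
            out = tinygemm_bf16(x, self.weight)
            if self.bias is not None:
                out = out + self.bias
            return out, None  # ReplicatedLinear-style (output, output_bias) tuple
        # Safe fallback to the generic replicated GEMM on other dtypes/devices.
        return super().forward(x)
```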

@gemini-code-assist (Contributor, bot) left a comment

Code Review

This pull request introduces a performance optimization for the GPT-OSS MoE router on SM90+ GPUs by using the tinygemm_bf16 kernel from FlashInfer. A new GptOssRouterLinear class is implemented with a fast path for bfloat16 data types on supported hardware, including necessary checks and a safe fallback to the original implementation. The changes are well-contained and appear correct. The provided benchmarks show a performance improvement.

@mmangkad (Contributor, Author)

mmangkad commented Mar 17, 2026

/rerun-failed-ci

Five outdated review comment threads on python/sglang/srt/models/gpt_oss.py
@mmangkad (Contributor, Author)

@Qiaolin-Yu comments addressed; here are the decoding results for batch sizes 64 and 128:

bs = 64

State    Median Latency (s)    Median Throughput (token/s)
Before   0.00778               8222.16
After    0.00764 (-1.80%)      8376.83 (+1.88%)

bs = 128

State    Median Latency (s)    Median Throughput (token/s)
Before   0.01026               12469.96
After    0.01021 (-0.49%)      12538.58 (+0.55%)
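
As a quick arithmetic check (not part of the PR), the percentage deltas quoted in these tables follow directly from the before/after medians:

```python
# Recompute the relative deltas from the medians in the tables above.
pairs = {
    "bs=64  median latency (s)":        (0.00778, 0.00764),
    "bs=64  median throughput (tok/s)": (8222.16, 8376.83),
    "bs=128 median latency (s)":        (0.01026, 0.01021),
    "bs=128 median throughput (tok/s)": (12469.96, 12538.58),
}
for name, (before, after) in pairs.items():
    print(f"{name}: {100 * (after - before) / before:+.2f}%")
# Expected output: -1.80%, +1.88%, -0.49%, +0.55%
```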

@Qiaolin-Yu (Collaborator) left a comment

lgtm. let's wait for CIs.

@mmangkad (Contributor, Author)

@Qiaolin-Yu any blockers to merge this?

@Qiaolin-Yu (Collaborator)

@elvischenv what's your github email? will add you as a coauthor when merging.

@elvischenv (Contributor)

> @elvischenv what's your github email? will add you as a coauthor when merging.

@Qiaolin-Yu Thank you, please use this one: 219235043+elvischenv@users.noreply.github.com
@mmangkad Good PR!

@Qiaolin-Yu Qiaolin-Yu merged commit bbe25b2 into sgl-project:main Mar 24, 2026
229 of 263 checks passed
@mmangkad mmangkad deleted the gpt-oss-router-tinygemm branch March 25, 2026 00:12
Commits referencing this pull request (each carries "Co-authored-by: elvischenv <219235043+elvischenv@users.noreply.github.com>"):
  • 0-693 pushed a commit to 0-693/sglang (Mar 25, 2026)
  • johnnycxm pushed a commit to johnnycxm/sglang (Mar 25, 2026)
  • JustinTong0323 pushed a commit to JustinTong0323/sglang (Apr 7, 2026)
  • yhyang201 pushed a commit to yhyang201/sglang (Apr 22, 2026)