[Perf] Tune Llama-4-Scout-17B-16E-Instruct fused moe kernel by zhendonghua · Pull Request #17891 · sgl-project/sglang

zhendonghua · 2026-01-28T16:05:12Z

Motivation

Llama-4-Scout-17B-16E-Instruct is not tuned on fused_moe_kernel and b200 gpu, which leads to a sub-optimal performance of throughput.

This PR runs the script that tunes Triton MoE Kernel, benchmarks the decode throughput and profiles the performance of tuned fused_moe_kernel.

The fused_moe_kernel is tuned by the following instruction.

python benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py \
    --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
	--tp-size 4 \
	--disable-shared-experts-fusion \
    --tune

Modifications

Add a configuration json file python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=16,N=2048,device_name=NVIDIA_B200.json

Accuracy Tests

Benchmarking and Profiling

Tests are conducted on B200 GPUs.

Serve the model using python3 -m sglang.launch_server --model-path "meta-llama/Llama-4-Scout-17B-16E-Instruct" --port 23333 --trust-remote-code --tp 4 --context-length 1000000.

Benchmark the batch size = 16 case using python3 -m sglang.bench_serving --backend sglang-oai --dataset-name random --random-input-len 1024 --random-output-len 1024 --random-range-ratio 0.98 --num-prompts 80 --max-concurrency 16 --port 23333 .

The output is as depicted below. The output throughput is 1604 token/s.

Profile the fused_moe_kernel under batch size = 16 using python3 -m sglang.bench_serving --dataset-name random --backend sglang-oai --random-range-ratio 1 --random-input-len 1200 --random-output-len 20 --max-concurrency 16 --num-prompts 80 --port 23333 --profile

The performance is as depicted below. The fused_moe_kernel takes 58 us.

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.

gemini-code-assist · 2026-01-28T16:05:28Z

Summary of Changes

Hello @zhendonghua, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the performance of the Llama-4-Scout-17B-16E-Instruct model when running on NVIDIA B200 GPUs. It achieves this by integrating a newly tuned configuration for the fused_moe_kernel, which was previously sub-optimal for this specific model and hardware combination. The changes are validated with comprehensive benchmarking and profiling, confirming a substantial boost in decoding throughput and reduced kernel execution time.

Highlights

Performance Tuning: Introduces optimized configurations for the fused_moe_kernel specifically for the Llama-4-Scout-17B-16E-Instruct model on NVIDIA B200 GPUs, addressing previously sub-optimal performance.
New Configuration File: Adds a new JSON file (E=16,N=2048,device_name=NVIDIA_B200.json) containing the tuned kernel parameters, including BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K, GROUP_SIZE_M, num_warps, and num_stages for various expert counts.
Benchmarking Results: Demonstrates a significant improvement in decode throughput, achieving 1604 tokens/s on B200 GPUs with the new configuration.
Profiling Results: Shows the fused_moe_kernel execution time reduced to 58 microseconds under batch size 16, confirming the efficiency of the tuning.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a new tuned configuration file for the fused_moe_kernel targeting the Llama-4-Scout-17B-16E-Instruct model on NVIDIA B200 GPUs. The new configuration is expected to improve performance. My review focuses on ensuring this new configuration is correctly integrated and utilized. I have one suggestion to ensure the new configuration can be used as a fallback by different Triton versions, maximizing the impact of this performance tuning.

gemini-code-assist · 2026-01-28T16:06:35Z

@@ -0,0 +1,146 @@
+{


This configuration file is placed in a directory for Triton v3.5.1. For these performance optimizations to be available as a fallback for users with other Triton versions, 3.5.1 should be added to the supported_triton_versions list in python/sglang/srt/layers/moe/fused_moe_triton/fused_moe_triton_config.py. Without this change, the performance benefits of this new configuration might not be realized on systems that are not using Triton 3.5.1 exactly. It is recommended to update the list to ["3.5.1", "3.4.0", "3.3.1", "3.2.0", "3.1.0"] to ensure broader applicability.

…ect#17891)

tune Llama-4-Scout-17B-16E-Instruct fused moe kernel

9c13e6d

zhendonghua requested review from BBuf, Edwardf0t1, Fridge003, HaiShaw, Ying1123, ch-wan, ispobock and merrymercy as code owners January 28, 2026 16:05

gemini-code-assist Bot reviewed Jan 28, 2026

View reviewed changes

Qiaolin-Yu approved these changes Jan 28, 2026

View reviewed changes

Qiaolin-Yu added the ready-to-merge The PR is ready to merge after the CI is green. label Jan 28, 2026

Qiaolin-Yu assigned Fridge003 Jan 28, 2026

Kangyan-Zhou merged commit 0998de0 into sgl-project:main Jan 28, 2026
54 of 61 checks passed

charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Jan 30, 2026

[Perf] Tune Llama-4-Scout-17B-16E-Instruct fused moe kernel (sgl-proj…

7a6912b

…ect#17891)

Chen-0210 pushed a commit to Chen-0210/sglang that referenced this pull request Jan 30, 2026

[Perf] Tune Llama-4-Scout-17B-16E-Instruct fused moe kernel (sgl-proj…

af8201b

…ect#17891)

sfiisf pushed a commit to sfiisf/sglang that referenced this pull request Feb 5, 2026

[Perf] Tune Llama-4-Scout-17B-16E-Instruct fused moe kernel (sgl-proj…

b425466

…ect#17891)

Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026

[Perf] Tune Llama-4-Scout-17B-16E-Instruct fused moe kernel (sgl-proj…

2bbec47

…ect#17891)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Perf] Tune Llama-4-Scout-17B-16E-Instruct fused moe kernel#17891

[Perf] Tune Llama-4-Scout-17B-16E-Instruct fused moe kernel#17891
Kangyan-Zhou merged 1 commit intosgl-project:mainfrom
zhendonghua:tune_llama4_fused_moe

zhendonghua commented Jan 28, 2026

Uh oh!

gemini-code-assist Bot commented Jan 28, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jan 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

zhendonghua commented Jan 28, 2026

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

Uh oh!

gemini-code-assist Bot commented Jan 28, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jan 28, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants