Skip to content

[Perf] Tune Llama-4-Scout-17B-16E-Instruct fused moe kernel#17891

Merged
Kangyan-Zhou merged 1 commit intosgl-project:mainfrom
zhendonghua:tune_llama4_fused_moe
Jan 28, 2026
Merged

[Perf] Tune Llama-4-Scout-17B-16E-Instruct fused moe kernel#17891
Kangyan-Zhou merged 1 commit intosgl-project:mainfrom
zhendonghua:tune_llama4_fused_moe

Conversation

@zhendonghua
Copy link
Copy Markdown
Contributor

Motivation

Llama-4-Scout-17B-16E-Instruct is not tuned on fused_moe_kernel and b200 gpu, which leads to a sub-optimal performance of throughput.

This PR runs the script that tunes Triton MoE Kernel, benchmarks the decode throughput and profiles the performance of tuned fused_moe_kernel.

The fused_moe_kernel is tuned by the following instruction.

python benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py \
    --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
	--tp-size 4 \
	--disable-shared-experts-fusion \
    --tune

Modifications

Add a configuration json file python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=16,N=2048,device_name=NVIDIA_B200.json

Accuracy Tests

Benchmarking and Profiling

Tests are conducted on B200 GPUs.

Serve the model using python3 -m sglang.launch_server --model-path "meta-llama/Llama-4-Scout-17B-16E-Instruct" --port 23333 --trust-remote-code --tp 4 --context-length 1000000.

Benchmark the batch size = 16 case using python3 -m sglang.bench_serving --backend sglang-oai --dataset-name random --random-input-len 1024 --random-output-len 1024 --random-range-ratio 0.98 --num-prompts 80 --max-concurrency 16 --port 23333 .

The output is as depicted below. The output throughput is 1604 token/s.
Screenshot 2026-01-27 at 11 07 49 PM

Profile the fused_moe_kernel under batch size = 16 using python3 -m sglang.bench_serving --dataset-name random --backend sglang-oai --random-range-ratio 1 --random-input-len 1200 --random-output-len 20 --max-concurrency 16 --num-prompts 80 --port 23333 --profile

The performance is as depicted below. The fused_moe_kernel takes 58 us.

Screenshot 2026-01-28 at 11 02 49 AM

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello @zhendonghua, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the performance of the Llama-4-Scout-17B-16E-Instruct model when running on NVIDIA B200 GPUs. It achieves this by integrating a newly tuned configuration for the fused_moe_kernel, which was previously sub-optimal for this specific model and hardware combination. The changes are validated with comprehensive benchmarking and profiling, confirming a substantial boost in decoding throughput and reduced kernel execution time.

Highlights

  • Performance Tuning: Introduces optimized configurations for the fused_moe_kernel specifically for the Llama-4-Scout-17B-16E-Instruct model on NVIDIA B200 GPUs, addressing previously sub-optimal performance.
  • New Configuration File: Adds a new JSON file (E=16,N=2048,device_name=NVIDIA_B200.json) containing the tuned kernel parameters, including BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K, GROUP_SIZE_M, num_warps, and num_stages for various expert counts.
  • Benchmarking Results: Demonstrates a significant improvement in decode throughput, achieving 1604 tokens/s on B200 GPUs with the new configuration.
  • Profiling Results: Shows the fused_moe_kernel execution time reduced to 58 microseconds under batch size 16, confirming the efficiency of the tuning.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new tuned configuration file for the fused_moe_kernel targeting the Llama-4-Scout-17B-16E-Instruct model on NVIDIA B200 GPUs. The new configuration is expected to improve performance. My review focuses on ensuring this new configuration is correctly integrated and utilized. I have one suggestion to ensure the new configuration can be used as a fallback by different Triton versions, maximizing the impact of this performance tuning.

@@ -0,0 +1,146 @@
{
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This configuration file is placed in a directory for Triton v3.5.1. For these performance optimizations to be available as a fallback for users with other Triton versions, 3.5.1 should be added to the supported_triton_versions list in python/sglang/srt/layers/moe/fused_moe_triton/fused_moe_triton_config.py. Without this change, the performance benefits of this new configuration might not be realized on systems that are not using Triton 3.5.1 exactly. It is recommended to update the list to ["3.5.1", "3.4.0", "3.3.1", "3.2.0", "3.1.0"] to ensure broader applicability.

@Qiaolin-Yu Qiaolin-Yu added the ready-to-merge The PR is ready to merge after the CI is green. label Jan 28, 2026
@Kangyan-Zhou Kangyan-Zhou merged commit 0998de0 into sgl-project:main Jan 28, 2026
54 of 61 checks passed
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Jan 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-to-merge The PR is ready to merge after the CI is green.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants