[Perf] Tune Llama-4-Scout-17B-16E-Instruct fused moe kernel #17891
Kangyan-Zhou merged 1 commit into sgl-project:main
Conversation
Summary of Changes: Hello @zhendonghua, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the performance of the Llama-4-Scout-17B-16E-Instruct model when running on NVIDIA B200 GPUs. It achieves this by integrating a newly tuned configuration for the fused MoE kernel.
Code Review
This pull request introduces a new tuned configuration file for the fused_moe_kernel targeting the Llama-4-Scout-17B-16E-Instruct model on NVIDIA B200 GPUs. The new configuration is expected to improve performance. My review focuses on ensuring this new configuration is correctly integrated and utilized. I have one suggestion to ensure the new configuration can be used as a fallback by different Triton versions, maximizing the impact of this performance tuning.
```diff
@@ -0,0 +1,146 @@
+{
```
This configuration file is placed in a directory for Triton v3.5.1. For these performance optimizations to be available as a fallback for users with other Triton versions, 3.5.1 should be added to the supported_triton_versions list in python/sglang/srt/layers/moe/fused_moe_triton/fused_moe_triton_config.py. Without this change, the performance benefits of this new configuration might not be realized on systems that are not using Triton 3.5.1 exactly. It is recommended to update the list to ["3.5.1", "3.4.0", "3.3.1", "3.2.0", "3.1.0"] to ensure broader applicability.
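A minimal sketch of the suggested change (assuming `supported_triton_versions` is a plain module-level list in that file, as the comment implies):

```python
# python/sglang/srt/layers/moe/fused_moe_triton/fused_moe_triton_config.py
# Suggested edit: include "3.5.1" so configs tuned under Triton 3.5.1 can
# also serve as a fallback for other Triton versions (variable shape assumed).
supported_triton_versions = ["3.5.1", "3.4.0", "3.3.1", "3.2.0", "3.1.0"]
```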
Motivation
The `fused_moe_kernel` for Llama-4-Scout-17B-16E-Instruct is not tuned for the B200 GPU, which leads to sub-optimal throughput. This PR runs the Triton MoE kernel tuning script, benchmarks the decode throughput, and profiles the performance of the tuned `fused_moe_kernel`.

The `fused_moe_kernel` is tuned with the following command:

```bash
python benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py \
    --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --tp-size 4 \
    --disable-shared-experts-fusion \
    --tune
```

Modifications
Add a configuration JSON file:
`python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=16,N=2048,device_name=NVIDIA_B200.json`
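The file follows the usual tuned fused-MoE config layout: top-level keys are batch sizes (M), each mapping to a Triton launch configuration. A minimal sketch of the structure, with placeholder values rather than the actual tuned numbers from this PR:

```python
# Illustrative shape of the tuned config (stored on disk as plain JSON).
# All numeric values below are placeholders, not the tuned results.
config = {
    "16": {                  # batch size M
        "BLOCK_SIZE_M": 16,  # Triton tile sizes
        "BLOCK_SIZE_N": 128,
        "BLOCK_SIZE_K": 128,
        "GROUP_SIZE_M": 1,   # program grouping for L2 reuse
        "num_warps": 4,
        "num_stages": 3,     # software pipelining depth
    },
}
```

The filename encodes the expert count (E=16) and the per-rank intermediate size (N=2048 under tp-size 4), which sglang appears to use to match a config to the model and device at load time.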
Accuracy Tests
Benchmarking and Profiling
Serve the model using:

```bash
python3 -m sglang.launch_server --model-path "meta-llama/Llama-4-Scout-17B-16E-Instruct" --port 23333 --trust-remote-code --tp 4 --context-length 1000000
```

Benchmark the batch size = 16 case using:

```bash
python3 -m sglang.bench_serving --backend sglang-oai --dataset-name random --random-input-len 1024 --random-output-len 1024 --random-range-ratio 0.98 --num-prompts 80 --max-concurrency 16 --port 23333
```

The output is shown below; the output throughput is 1604 token/s.

Profile the `fused_moe_kernel` under batch size = 16 using:

```bash
python3 -m sglang.bench_serving --dataset-name random --backend sglang-oai --random-range-ratio 1 --random-input-len 1200 --random-output-len 20 --max-concurrency 16 --num-prompts 80 --port 23333 --profile
```

The performance is shown below: the `fused_moe_kernel` takes 58 us.

Checklist
Review Process
`/tag-run-ci-label`, `/rerun-failed-ci`, `/tag-and-rerun-ci`