[VLM] Optimize Gemma4 VLM with PCG and fuse RMSNorm + residual add + scalar#24048
Kangyan-Zhou merged 1 commit into sgl-project:main
Conversation
/tag-and-rerun-ci
Code Review
This pull request introduces a fused Triton kernel, gemma_dual_rmsnorm_residual_scalar, to optimize the Gemma4 model's forward pass by combining multiple RMSNorm and residual operations. It also refactors model resolution logic in model_runner.py and piecewise_cuda_graph_runner.py to better handle different model architectures. Feedback highlights the need for stricter input validation in the new Triton wrapper to prevent potential memory issues and identifies a logic error in resolve_language_model that would lead to a guaranteed AttributeError.
kpham-sgl
left a comment
Nice! Can you run MMMU and verify against the score here https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4#mmmu
Updated.
Motivation
Optimize Gemma4 26B-A4B prefill performance through two complementary approaches: capturing the prefill path with piecewise CUDA graphs (PCG), and fusing RMSNorm + residual add + scalar multiplication into a single Triton kernel.
Modifications
Fused Triton Kernels (sglang/srt/layers/gemma4_fused_ops.py): adds the gemma_dual_rmsnorm_residual_scalar kernel, which fuses RMSNorm, residual add, and scalar multiplication into a single launch (a simplified sketch of this kind of fusion follows this list).
Gemma4 Model Integration (sglang/srt/models/gemma4_causal.py): wires the fused kernel into the Gemma4 forward pass.
PCG VLM Compatibility (sglang/srt/model_executor/model_runner.py): refactors model resolution so piecewise CUDA graphs can target the language-model component of the VLM.
PCG Runner Fix (sglang/srt/model_executor/piecewise_cuda_graph_runner.py): aligns the piecewise CUDA graph runner with the refactored model-resolution logic.
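For readers unfamiliar with the fusion, below is a minimal sketch of a fused RMSNorm + residual add + scalar Triton kernel. It is illustrative only: the function names, argument defaults, and the residual-then-normalize ordering are assumptions, and the actual gemma_dual_rmsnorm_residual_scalar kernel in gemma4_fused_ops.py fuses two norms and handles layout and validation differently.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _rmsnorm_residual_scalar_kernel(
    x_ptr, res_ptr, weight_ptr, out_ptr,
    hidden_size, eps, scale,
    BLOCK_SIZE: tl.constexpr,
):
    # One program instance processes one token (one row of the hidden states).
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < hidden_size
    offs = row * hidden_size + cols

    x = tl.load(x_ptr + offs, mask=mask, other=0.0).to(tl.float32)
    res = tl.load(res_ptr + offs, mask=mask, other=0.0).to(tl.float32)

    # Residual add first, then RMSNorm over the combined hidden state
    # (ordering here is an assumption for illustration).
    h = x + res
    variance = tl.sum(h * h, axis=0) / hidden_size
    inv_rms = 1.0 / tl.sqrt(variance + eps)

    w = tl.load(weight_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    # Gemma-style RMSNorm multiplies by (1 + weight); `scale` is the extra scalar.
    y = h * inv_rms * (1.0 + w) * scale

    tl.store(out_ptr + offs, y.to(out_ptr.dtype.element_ty), mask=mask)


def rmsnorm_residual_scalar(x, residual, weight, eps=1e-6, scale=1.0):
    # Hypothetical wrapper; the shape/contiguity checks echo the review's
    # request for stricter input validation before launching the kernel.
    assert x.dim() == 2 and x.shape == residual.shape
    assert x.is_contiguous() and residual.is_contiguous() and weight.is_contiguous()
    num_tokens, hidden_size = x.shape
    out = torch.empty_like(x)
    block_size = triton.next_power_of_2(hidden_size)
    _rmsnorm_residual_scalar_kernel[(num_tokens,)](
        x, residual, weight, out, hidden_size, eps, scale, BLOCK_SIZE=block_size
    )
    return out
```

Fusing these elementwise operations avoids separate kernel launches and extra global-memory round trips between the norm, the residual add, and the scalar multiply.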
Accuracy Tests
The MMMU score matches the official 54.9%.
Speed Tests and Profiling
PCG eliminates CPU dispatch overhead by capturing per-layer CUDA graphs for prefill. The benefit is most significant at small-to-medium token counts where kernel launch latency dominates compute time.
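As background on the mechanism (not SGLang's actual PCG implementation), the sketch below shows plain PyTorch CUDA graph capture and replay for a single layer: after a one-time capture, replay launches every recorded kernel with a single CPU call, which is where the dispatch-overhead saving comes from. The layer, shapes, and dtypes are arbitrary placeholders.

```python
import torch

# Stand-in "layer": any fixed-shape forward can be captured.
layer = torch.nn.Linear(4096, 4096, device="cuda", dtype=torch.float16).eval()
static_in = torch.zeros(256, 4096, device="cuda", dtype=torch.float16)

# Warm up on a side stream before capture, as CUDA graph capture requires.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side), torch.no_grad():
    for _ in range(3):
        layer(static_in)
torch.cuda.current_stream().wait_stream(side)

# Capture the layer's forward into a graph with static input/output buffers.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_out = layer(static_in)

# At runtime: copy fresh activations into the static buffer and replay.
# Replay issues all captured kernels with a single CPU call, removing the
# per-kernel dispatch overhead that PCG targets for prefill.
static_in.copy_(torch.randn_like(static_in))
graph.replay()
torch.cuda.synchronize()
print(static_out.shape)
```

Piecewise capture applies this idea per layer (or per graph-safe segment), so dynamic pieces such as attention over varying sequence lengths can stay outside the captured regions.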
Compatibility
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci