
[VLM] Optimize Gemma4 VLM with PCG and fuse RMSNorm + residual add + scalar#24048

Merged
Kangyan-Zhou merged 1 commit into sgl-project:main from yuan-luo:opt_gemma4_mm
May 4, 2026

Conversation

@yuan-luo
Collaborator

@yuan-luo yuan-luo commented Apr 29, 2026

Motivation

Optimize Gemma4 26B-A4B prefill performance through two complementary approaches:

  1. Fused Triton kernels for Gemma4 decoder layers — Reduces kernel launch overhead by fusing multiple operations into single kernels.
  2. Enable Piecewise CUDA Graph (PCG) for VLM models — Fixes PCG support for multimodal models that use self.language_model instead of self.model to reference their text backbone.

Main: [profiling screenshot]

PR: [profiling screenshot]

Modifications

Fused Triton Kernels (sglang/srt/layers/gemma4_fused_ops.py)

  • gemma_rmsnorm_residual_scalar: Fuses RMSNorm + residual add + scalar multiply into a single Triton kernel (replaces 3 separate ops).
  • gemma_dual_rmsnorm_residual_scalar: Fuses the full MoE post-processing pipeline — rmsnorm(rmsnorm(dense_out, w1) + rmsnorm(moe_out, w2), w3) + residual) * scalar — into a single kernel (replaces 5+ separate ops per MoE layer).
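As a rough illustration of the math these two kernels compute, here is a pure-Python reference (not the Triton implementation; Gemma's `(1 + weight)` RMSNorm scaling convention is assumed):

```python
import math

def gemma_rmsnorm(x, w, eps=1e-6):
    # Gemma-style RMSNorm: x / rms(x), scaled by (1 + w)
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * (1.0 + wi) for v, wi in zip(x, w)]

def rmsnorm_residual_scalar(x, residual, w, scalar, eps=1e-6):
    # One pass instead of three kernel launches:
    #   out = (rmsnorm(x, w) + residual) * scalar
    normed = gemma_rmsnorm(x, w, eps)
    return [(n + r) * scalar for n, r in zip(normed, residual)]

def dual_rmsnorm_residual_scalar(dense_out, moe_out, residual,
                                 w1, w2, w3, scalar, eps=1e-6):
    # MoE post-processing pipeline collapsed into one pass:
    #   (rmsnorm(rmsnorm(dense, w1) + rmsnorm(moe, w2), w3) + residual) * scalar
    a = gemma_rmsnorm(dense_out, w1, eps)
    b = gemma_rmsnorm(moe_out, w2, eps)
    summed = [x + y for x, y in zip(a, b)]
    c = gemma_rmsnorm(summed, w3, eps)
    return [(v + r) * scalar for v, r in zip(c, residual)]
```

The fused Triton versions compute the same values but read and write each hidden-state element once, which is where the launch-overhead and bandwidth savings come from.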

Gemma4 Model Integration (sglang/srt/models/gemma4_causal.py)

  • Updated Gemma4DecoderLayer.forward() to use the fused kernels in the MoE block, eliminating redundant kernel launches.

PCG VLM Compatibility (sglang/srt/model_executor/model_runner.py)

  • resolve_language_model(): Added fallback to model.language_model when model.model is not present, supporting VLM architectures like Gemma4ForConditionalGeneration.
  • PCG gate check: Extended the guard condition to also check for language_model attribute, preventing false disabling of PCG for VLMs.
  • Layer resolution: Added flexible layer discovery that handles both language_model.model.layers (standard) and language_model.layers (Gemma4 VLM where TextModel is the language_model directly).

PCG Runner Fix (sglang/srt/model_executor/piecewise_cuda_graph_runner.py)

  • Updated patch_model target resolution to handle models where the TextModel (with .layers) is the language_model itself, not nested under .model.

Accuracy Tests

MMMU accuracy matches the official score of 54.9%.

| Tasks | Filter | n-shot | Metric | Value | Stderr |
|----------|--------|--------|----------|--------|-------|
| mmmu_val | none | 0 | mmmu_acc | 0.5489 | ± N/A |

Speed Tests and Profiling

PCG eliminates CPU dispatch overhead by capturing per-layer CUDA graphs for prefill. The benefit is most significant at small-to-medium token counts where kernel launch latency dominates compute time.

| Tokens | Baseline | Fused + PCG | Improvement |
|--------|----------|-------------|-------------|
| 93 | 35.6ms | 16.5ms | -53.7% |
| 189 | 35.4ms | 17.1ms | -51.7% |
| 1453 | 36.3ms | 25.1ms | -30.9% |
| 2893 | 40.6ms | 30.6ms | -24.6% |
| 5773 | ~49ms | 41.6ms | -15.1% |
| 11533 | ~70ms | 66.4ms | -5.1% |

Compatibility

  • Non-VLM models: No change — existing model.model path is tried first.
  • VLM models with self.model (e.g., Qwen2.5-VL): No change — hasattr(model, "model") succeeds as before.
  • VLM models with self.language_model (e.g., Gemma4): Now correctly supported.
  • Verified correctness on both Gemma4 (text + image) and Qwen2.5-VL (text + image) with PCG enabled.


@yuan-luo
Collaborator Author

/tag-and-rerun-ci

@yuan-luo yuan-luo changed the title [VLM] Optimize Gemm4 VLM with PCG and fuse RMSNorm + residual add + scalar [VLM] Optimize Gemma4 VLM with PCG and fuse RMSNorm + residual add + scalar Apr 29, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a fused Triton kernel, gemma_dual_rmsnorm_residual_scalar, to optimize the Gemma4 model's forward pass by combining multiple RMSNorm and residual operations. It also refactors model resolution logic in model_runner.py and piecewise_cuda_graph_runner.py to better handle different model architectures. Feedback highlights the need for stricter input validation in the new Triton wrapper to prevent potential memory issues and identifies a logic error in resolve_language_model that would lead to a guaranteed AttributeError.

Comment thread python/sglang/srt/layers/gemma4_fused_ops.py
Comment thread python/sglang/srt/model_executor/model_runner.py
Collaborator

@kpham-sgl kpham-sgl left a comment


Nice! Can you run MMMU and verify against the score here https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4#mmmu

@yuan-luo
Collaborator Author

> Nice! Can you run MMMU and verify against the score here https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4#mmmu

Updated.

@yuan-luo yuan-luo force-pushed the opt_gemma4_mm branch 3 times, most recently from 85d7f4b to a1f46b9 Compare May 2, 2026 02:52
@yuan-luo yuan-luo requested a review from wisclmy0611 as a code owner May 2, 2026 02:52
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label May 2, 2026
@yuan-luo yuan-luo force-pushed the opt_gemma4_mm branch 2 times, most recently from 8cdbdbf to e35d303 Compare May 3, 2026 14:12
@Kangyan-Zhou Kangyan-Zhou merged commit e5c58eb into sgl-project:main May 4, 2026
436 of 505 checks passed
@yuan-luo yuan-luo deleted the opt_gemma4_mm branch May 6, 2026 03:11

Labels

documentation (Improvements or additions to documentation), run-ci


3 participants