[New Model] Gemma 4 by JustinTong0323 · Pull Request #21952 · sgl-project/sglang

JustinTong0323 · 2026-04-02T15:05:17Z

Motivation

Add Gemma 4 model support to SGLang. Gemma 4 is Google's next-generation family of open models featuring Dense and MoE architectures, multimodal support (text, image, audio), hybrid reasoning, and native tool calling.

Supported Models:

Model	Architecture	Parameters
google/gemma-4-E2B-it	Dense	~2B
google/gemma-4-E4B-it	Dense	~4B
google/gemma-4-31B-it	Dense	31B
google/gemma-4-26B-A4B-it	MoE	26B total / 4B active

Installation

# Install SGLang (after this PR is merged)
pip install 'git+https://github.com/sgl-project/sglang.git#subdirectory=python'

# Install transformers with Gemma 4 support
pip install 'git+https://github.com/huggingface/transformers.git@91b1ab1fdfa81a552644a92fbe3e8d88de40e167'

Usage

Launch Server

# E2B (~2B, single GPU)
sglang serve --model-path google/gemma-4-E2B-it \
  --reasoning-parser gemma4 --tool-call-parser gemma4 \
  --host 0.0.0.0 --port 30000

# E4B (~4B, single GPU)
sglang serve --model-path google/gemma-4-E4B-it \
  --reasoning-parser gemma4 --tool-call-parser gemma4 \
  --host 0.0.0.0 --port 30000

# 31B Dense (2x H200 TP=2, or 1x MI300X TP=1)
sglang serve --model-path google/gemma-4-31B-it \
  --tp 2 --reasoning-parser gemma4 --tool-call-parser gemma4 \
  --host 0.0.0.0 --port 30000

# 26B-A4B MoE (single GPU)
sglang serve --model-path google/gemma-4-26B-A4B-it \
  --reasoning-parser gemma4 --tool-call-parser gemma4 \
  --host 0.0.0.0 --port 30000

Basic Chat

from openai import OpenAI
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[{"role": "user", "content": "What are the key differences between TCP and UDP?"}],
    max_tokens=1024
)
print(response.choices[0].message.content)

Vision

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://farm4.staticflickr.com/3175/2653711032_804ff86d81_z.jpg"}},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }],
    max_tokens=1024
)

Reasoning (Thinking Mode)

Thinking is not enabled by default — pass chat_template_kwargs: {"enable_thinking": true} to activate:

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[{"role": "user", "content": "Solve: If a train travels at 60 km/h for 2.5 hours, how far does it go?"}],
    max_tokens=4096,
    stream=True,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}}
)

Tool Calling

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}},
            "required": ["location"]
        }
    }
}]

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools
)

Accuracy Tests

MMLU (gemma-4-26B-A4B-it, H200)

Metric	Value
Overall	0.631

GSM8K (gemma-4-26B-A4B-it, H200)

Metric	Value
Accuracy	0.405
Invalid	0.010

MMMU (gemma-4-26B-A4B-it, H200)

Metric	Value
Overall	0.549

Speed Tests and Profiling

See full benchmark results in the SGLang Cookbook - Gemma 4.

Modifications

New model files: gemma4_causal.py, gemma4_mm.py, gemma4_vision.py, gemma4_audio.py
Architecture registration for Gemma4ForCausalLM and Gemma4ForConditionalGeneration
Gemma4SGLangProcessor multimodal processor (image + audio)
gemma4 reasoning parser (<|channel> / <channel|> tokens)
gemma4 tool call parser (<|tool_call> / <tool_call|> tokens with streaming)
Triton fused RMSNorm + residual + scalar kernel
Hybrid SWA support with sliding/full attention layer types

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

…oes AR

…e test

The HF reference applies layer_scalar to every Gemma4DecoderLayer, not just full-attention layers. New checkpoints have non-trivial scalar values on SWA layers that were being silently ignored. Made-with: Cursor

Gate the two-buffer path on sliding_window_size to make intent explicit, and rewrite comment to explain the kernel's // Lv stride constraint. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Kp/gemma4 multimodal support

JustinTong0323 · 2026-04-06T20:15:42Z

/rerun-failed-ci

JustinTong0323 · 2026-04-07T03:30:44Z

/rerun-failed-ci

JustinTong0323 · 2026-04-07T05:30:41Z

/rerun-failed-ci

JustinTong0323 · 2026-04-07T06:45:40Z

/rerun-failed-ci

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com> Co-authored-by: Pengyu Chen <pychen96@gmail.com> Co-authored-by: kpham-sgl <khoa.pham@radixark.ai> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Andy Luo <andy.luo@amd.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: adarshxs <adarsh.shirawalmath@gmail.com>

Add accuracy tests for Google Gemma 4 models on AMD GPUs, covering both MI30x (MI325/MI300X) and MI35x platforms with both default ROCm and ROCm 7.2 workflows. Models tested: - google/gemma-4-E4B-it (Dense, ~4B params, TP=1) - google/gemma-4-31B-it (Dense, 31B params, TP=1) All models use --attention-backend triton (required for bidirectional image-token attention on AMD GPUs) and --reasoning-parser/--tool-call-parser gemma4 per the upstream model PR #21952. Test files: - test/registered/amd/accuracy/mi30x/test_gemma4_eval_amd.py - test/registered/amd/accuracy/mi35x/test_gemma4_eval_mi35x.py Workflow jobs added: - nightly-accuracy-2-gpu-gemma4 (MI30x, 2-GPU runner) - nightly-8-gpu-mi35x-gemma4 (MI35x, 8-GPU runner) - Corresponding ROCm 7.2 variants Depends on: #21952 (Gemma 4 model support) Ref: https://www.amd.com/en/developer/resources/technical-articles/2026/day-0-support-for-gemma-4-on-amd-processors-and-gpus.html

The CI Docker image ships an older transformers that doesn't recognize the gemma4 architecture. Install from the specific commit required by the Gemma 4 model PR (#21952).

The CI Docker image has an older transformers that doesn't recognize the gemma4 model architecture. Install from the specific commit required by model PR #21952.

Add accuracy tests for Google Gemma 4 models on AMD GPUs (MI30x and MI35x) with both default ROCm and ROCm 7.2 workflows. Models tested: - google/gemma-4-E4B-it (Dense ~4B, TP=1) - google/gemma-4-31B-it (Dense 31B, TP=1) Server config: --attention-backend triton (required for bidirectional image-token attention on AMD GPUs per AMD Day 0 article). Each CI job installs transformers from the commit required by #21952.

Add Gemma 4 accuracy test as a step within existing CI jobs rather than standalone jobs: - 2-GPU job (nightly-accuracy-2-gpu): new step after GSM8K eval - MI35x 8-GPU job (nightly-accuracy-8-gpu-mi35x): new step after GPT-OSS Tests google/gemma-4-31B-it (Dense 31B, TP=1) on mgsm_en with --attention-backend triton and threshold 0.90 (observed 0.984). Each step installs transformers from the commit required by #21952.

Add Gemma 4 accuracy test as a step within the existing 2-GPU accuracy job (nightly-accuracy-2-gpu) for both default ROCm and ROCm 7.2 workflows. Tests google/gemma-4-31B-it (Dense 31B, TP=1) on mgsm_en with --attention-backend triton and threshold 0.90 (observed 0.984). Step uses if:always() to run even if prior GSM8K step fails. Each step installs transformers from the commit required by #21952.

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com> Co-authored-by: Pengyu Chen <pychen96@gmail.com> Co-authored-by: kpham-sgl <khoa.pham@radixark.ai> Co-authored-by: Andy Luo <andy.luo@amd.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: adarshxs <adarsh.shirawalmath@gmail.com>

pyc96 and others added 30 commits March 10, 2026 03:18

Gemma4 init.

2e5b079

format and cleanup

fe25241

temp fix for kv sharing

cff018c

cleanup & tp

5939270

Reasoning parser.

dea02a2

tool call parser

418ba40

mm init

3289b26

config conversion global_head_dim <-> swa_head_dim

67b7b29

more mm

9a87e88

re-add gemma4 rope. (was removed as part of rebase)

416eccb

lint on main

1614769

gemma4 mm init and kvcache fix

2af1b41

add vision tower support, pending some refactor

89bf65c

don't partite the embedding projection because vision tower already d…

c9959ab

…oes AR

so many changes to make vision encoder work

18115f9

clean up

898b52d

add more comments

d6652f7

init audio support

8b4c06f

TP fix for audio encoder, change act_fn for vision_encoder, and updat…

f85490e

…e test

audio, vision, and text all work correctly now

5b77767

fix swa memory pool indices to retrieve

785b99e

softmax_scale should not be kwargs

27687c9

fix misc bugs with SWA kv cache

278d9fc

addressing comments

8b2dbe4

lint

95950dc

nit

f6b9759

Fix layer_scalar to apply unconditionally on all decoder layers

879aaef

The HF reference applies layer_scalar to every Gemma4DecoderLayer, not just full-attention layers. New checkpoints have non-trivial scalar values on SWA layers that were being silently ignored. Made-with: Cursor

Clarify SWA attn_logits buffer condition in triton backend

e1f8f61

Gate the two-buffer path on sliding_window_size to make intent explicit, and rewrite comment to explain the kernel's // Lv stride constraint. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge pull request #3 from pyc96/kp/gemma4-audio

52d7fe8

Kp/gemma4 multimodal support

initial dense 31b support

702a55e

michaelzhang-ai mentioned this pull request Apr 6, 2026

[AMD][CI] Add Gemma 4 nightly accuracy tests for MI30x and MI35x #22201

Open

5 tasks

michaelzhang-ai added a commit that referenced this pull request Apr 6, 2026

Merge Gemma 4 model support (PR #21952) to enable CI testing

599fe1d

Copilot AI mentioned this pull request Apr 7, 2026

Fix xpu_backend value_cache reshaping to use layer.v_head_dim jianan-gu/sglang#107

Closed

Kangyan-Zhou merged commit 2813cb6 into sgl-project:main Apr 7, 2026
218 of 242 checks passed

leofan-lab mentioned this pull request Apr 7, 2026

[Question] Gemma 4 support via HF wrapping approach? THUDM/slime#1811

Open

3 tasks

This was referenced Apr 7, 2026

modify testcases Ascend/sglang#254

Closed

Update testcase Ascend/sglang#268

Closed

aqweteddy mentioned this pull request Apr 7, 2026

[Bug] Gemma4 E4B: fp8 KV cache crashes with num_kv_shared_layers > 0 — Triton extend_attention dtype mismatch #22277

Open

5 tasks

kpham-sgl mentioned this pull request Apr 9, 2026

[CI] Adding Gemma 4 to Nightly CI #22408

Merged

5 tasks

leofan-lab mentioned this pull request Apr 24, 2026

feat(gemma4): add Gemma4 26B-A4B MoE and 31B dense support THUDM/slime#1855

Open

skytruax mentioned this pull request May 4, 2026

Inference: gemma4 tool/reasoning parsers; add flatclaw-inference-dev image skytruax/FlatClaw#2

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[New Model] Gemma 4#21952

[New Model] Gemma 4#21952
Kangyan-Zhou merged 140 commits intosgl-project:mainfrom
JustinTong0323:new-model-gg

JustinTong0323 commented Apr 2, 2026 •

edited

Loading

Uh oh!

JustinTong0323 commented Apr 6, 2026

Uh oh!

Uh oh!

JustinTong0323 commented Apr 7, 2026

Uh oh!

JustinTong0323 commented Apr 7, 2026

Uh oh!

JustinTong0323 commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

13 participants

Conversation

JustinTong0323 commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Installation

Usage

Launch Server

Basic Chat

Vision

Reasoning (Thinking Mode)

Tool Calling

Accuracy Tests

MMLU (gemma-4-26B-A4B-it, H200)

GSM8K (gemma-4-26B-A4B-it, H200)

MMMU (gemma-4-26B-A4B-it, H200)

Speed Tests and Profiling

Modifications

Checklist

Review and Merge Process

Uh oh!

JustinTong0323 commented Apr 6, 2026

Uh oh!

Uh oh!

JustinTong0323 commented Apr 7, 2026

Uh oh!

JustinTong0323 commented Apr 7, 2026

Uh oh!

JustinTong0323 commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

13 participants

JustinTong0323 commented Apr 2, 2026 •

edited

Loading