
[New Model] Gemma 4 #21952

Merged
Kangyan-Zhou merged 140 commits into sgl-project:main from JustinTong0323:new-model-gg
Apr 7, 2026

Conversation

@JustinTong0323 (Collaborator) commented Apr 2, 2026

Motivation

Add Gemma 4 model support to SGLang. Gemma 4 is Google's next-generation family of open models featuring Dense and MoE architectures, multimodal support (text, image, audio), hybrid reasoning, and native tool calling.

Supported Models:

| Model | Architecture | Parameters |
|---|---|---|
| google/gemma-4-E2B-it | Dense | ~2B |
| google/gemma-4-E4B-it | Dense | ~4B |
| google/gemma-4-31B-it | Dense | 31B |
| google/gemma-4-26B-A4B-it | MoE | 26B total / 4B active |

Installation

```bash
# Install SGLang (after this PR is merged)
pip install 'git+https://github.com/sgl-project/sglang.git#subdirectory=python'

# Install transformers with Gemma 4 support
pip install 'git+https://github.com/huggingface/transformers.git@91b1ab1fdfa81a552644a92fbe3e8d88de40e167'
```

Usage

Launch Server

```bash
# E2B (~2B, single GPU)
sglang serve --model-path google/gemma-4-E2B-it \
  --reasoning-parser gemma4 --tool-call-parser gemma4 \
  --host 0.0.0.0 --port 30000

# E4B (~4B, single GPU)
sglang serve --model-path google/gemma-4-E4B-it \
  --reasoning-parser gemma4 --tool-call-parser gemma4 \
  --host 0.0.0.0 --port 30000

# 31B Dense (2x H200 TP=2, or 1x MI300X TP=1)
sglang serve --model-path google/gemma-4-31B-it \
  --tp 2 --reasoning-parser gemma4 --tool-call-parser gemma4 \
  --host 0.0.0.0 --port 30000

# 26B-A4B MoE (single GPU)
sglang serve --model-path google/gemma-4-26B-A4B-it \
  --reasoning-parser gemma4 --tool-call-parser gemma4 \
  --host 0.0.0.0 --port 30000
```

Basic Chat

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[{"role": "user", "content": "What are the key differences between TCP and UDP?"}],
    max_tokens=1024
)
print(response.choices[0].message.content)
```

Vision

```python
response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://farm4.staticflickr.com/3175/2653711032_804ff86d81_z.jpg"}},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }],
    max_tokens=1024
)
print(response.choices[0].message.content)
```
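The request above fetches the image over HTTP. For local files, OpenAI-compatible servers generally also accept `data:` URLs; a minimal sketch (the byte string below is a stand-in for real JPEG file contents):

```python
import base64

# Stand-in bytes; in practice read the real file, e.g. open("photo.jpg", "rb").read()
jpeg_bytes = b"\xff\xd8\xff\xe0 fake jpeg payload"
data_url = "data:image/jpeg;base64," + base64.b64encode(jpeg_bytes).decode()

# Use data_url wherever the request above passes an http(s) image URL
image_part = {"type": "image_url", "image_url": {"url": data_url}}
```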

Reasoning (Thinking Mode)

Thinking is disabled by default; pass `chat_template_kwargs: {"enable_thinking": true}` (via `extra_body` in the OpenAI client) to activate it:

```python
response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[{"role": "user", "content": "Solve: If a train travels at 60 km/h for 2.5 hours, how far does it go?"}],
    max_tokens=4096,
    stream=True,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}}
)
for chunk in response:
    delta = chunk.choices[0].delta
    # With --reasoning-parser gemma4, thinking tokens arrive in reasoning_content
    print(getattr(delta, "reasoning_content", None) or delta.content or "", end="")
```

Tool Calling

```python
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}},
            "required": ["location"]
        }
    }
}]

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools
)
print(response.choices[0].message.tool_calls)
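The server returns parsed calls in `tool_calls`, each with a function name and a JSON `arguments` string. A minimal local dispatch sketch (the `get_weather` stub and `dispatch` helper below are illustrative, not part of SGLang; the dict mirrors the OpenAI tool-call shape):

```python
import json

def get_weather(location, unit="celsius"):
    # Hypothetical local implementation backing the get_weather tool
    return {"location": location, "temperature": 18, "unit": unit}

def dispatch(tool_call):
    # Look up the named function and call it with the JSON-decoded arguments
    fn = {"get_weather": get_weather}[tool_call["function"]["name"]]
    args = json.loads(tool_call["function"]["arguments"])
    return fn(**args)

result = dispatch({"function": {"name": "get_weather",
                                "arguments": '{"location": "Tokyo"}'}})
```

The result would then be appended to `messages` as a `"role": "tool"` message and the conversation continued with a second `chat.completions.create` call.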

Accuracy Tests

MMLU (gemma-4-26B-A4B-it, H200)

| Metric | Value |
|---|---|
| Overall | 0.631 |

GSM8K (gemma-4-26B-A4B-it, H200)

| Metric | Value |
|---|---|
| Accuracy | 0.405 |
| Invalid | 0.010 |

MMMU (gemma-4-26B-A4B-it, H200)

| Metric | Value |
|---|---|
| Overall | 0.549 |

Speed Tests and Profiling

See full benchmark results in the SGLang Cookbook - Gemma 4.

Modifications

  • New model files: gemma4_causal.py, gemma4_mm.py, gemma4_vision.py, gemma4_audio.py
  • Architecture registration for Gemma4ForCausalLM and Gemma4ForConditionalGeneration
  • Gemma4SGLangProcessor multimodal processor (image + audio)
  • gemma4 reasoning parser (<|channel> / <channel|> tokens)
  • gemma4 tool call parser (<|tool_call> / <tool_call|> tokens with streaming)
  • Triton fused RMSNorm + residual + scalar kernel
  • Hybrid SWA support with sliding/full attention layer types
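As a plain-Python reference for what the fused RMSNorm + residual + scalar kernel computes, here is a sketch under the assumption that the residual add happens first, then RMSNorm with a learned weight, then multiplication by the per-layer scalar (names are illustrative, not the kernel's API):

```python
import math

def rmsnorm_residual_scalar(x, residual, weight, scalar, eps=1e-6):
    """Reference: h = x + residual; out = h / rms(h) * weight * scalar."""
    h = [a + b for a, b in zip(x, residual)]
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in h) / len(h) + eps)
    out = [v * inv_rms * w * scalar for v, w in zip(h, weight)]
    return out, h  # fused kernels typically also return the updated residual

out, h = rmsnorm_residual_scalar([1.0, 1.0], [0.0, 0.0], [1.0, 1.0], 2.0)
```

Fusing these three steps into one Triton kernel avoids materializing the intermediate residual sum and normalized tensor in global memory.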

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

pyc96 and others added 30 commits March 10, 2026 03:18
The HF reference applies layer_scalar to every Gemma4DecoderLayer,
not just full-attention layers. New checkpoints have non-trivial
scalar values on SWA layers that were being silently ignored.

Made-with: Cursor
Gate the two-buffer path on sliding_window_size to make intent explicit,
and rewrite comment to explain the kernel's // Lv stride constraint.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Kp/gemma4 multimodal support
@JustinTong0323 (Collaborator, Author)

/rerun-failed-ci

@Kangyan-Zhou Kangyan-Zhou merged commit 2813cb6 into sgl-project:main Apr 7, 2026
218 of 242 checks passed
@JustinTong0323 (Collaborator, Author)

/rerun-failed-ci

1 similar comment

@JustinTong0323 (Collaborator, Author)

/rerun-failed-ci

@JustinTong0323 (Collaborator, Author)

/rerun-failed-ci

Fridge003 pushed a commit that referenced this pull request Apr 7, 2026
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Pengyu Chen <pychen96@gmail.com>
Co-authored-by: kpham-sgl <khoa.pham@radixark.ai>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Andy Luo <andy.luo@amd.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: adarshxs <adarsh.shirawalmath@gmail.com>
@kpham-sgl mentioned this pull request Apr 9, 2026
michaelzhang-ai added a commit that referenced this pull request Apr 9, 2026
Add accuracy tests for Google Gemma 4 models on AMD GPUs, covering
both MI30x (MI325/MI300X) and MI35x platforms with both default ROCm
and ROCm 7.2 workflows.

Models tested:
- google/gemma-4-E4B-it (Dense, ~4B params, TP=1)
- google/gemma-4-31B-it (Dense, 31B params, TP=1)

All models use --attention-backend triton (required for bidirectional
image-token attention on AMD GPUs) and --reasoning-parser/--tool-call-parser
gemma4 per the upstream model PR #21952.

Test files:
- test/registered/amd/accuracy/mi30x/test_gemma4_eval_amd.py
- test/registered/amd/accuracy/mi35x/test_gemma4_eval_mi35x.py

Workflow jobs added:
- nightly-accuracy-2-gpu-gemma4 (MI30x, 2-GPU runner)
- nightly-8-gpu-mi35x-gemma4 (MI35x, 8-GPU runner)
- Corresponding ROCm 7.2 variants

Depends on: #21952 (Gemma 4 model support)

Ref: https://www.amd.com/en/developer/resources/technical-articles/2026/day-0-support-for-gemma-4-on-amd-processors-and-gpus.html
michaelzhang-ai added a commit that referenced this pull request Apr 9, 2026
The CI Docker image ships an older transformers that doesn't
recognize the gemma4 architecture. Install from the specific
commit required by the Gemma 4 model PR (#21952).
michaelzhang-ai added a commit that referenced this pull request Apr 9, 2026
The CI Docker image has an older transformers that doesn't recognize
the gemma4 model architecture. Install from the specific commit
required by model PR #21952.
michaelzhang-ai added a commit that referenced this pull request Apr 9, 2026
Add accuracy tests for Google Gemma 4 models on AMD GPUs (MI30x and
MI35x) with both default ROCm and ROCm 7.2 workflows.

Models tested:
- google/gemma-4-E4B-it (Dense ~4B, TP=1)
- google/gemma-4-31B-it (Dense 31B, TP=1)

Server config: --attention-backend triton (required for bidirectional
image-token attention on AMD GPUs per AMD Day 0 article).

Each CI job installs transformers from the commit required by #21952.
michaelzhang-ai added a commit that referenced this pull request Apr 10, 2026
Add Gemma 4 accuracy test as a step within existing CI jobs rather
than standalone jobs:
- 2-GPU job (nightly-accuracy-2-gpu): new step after GSM8K eval
- MI35x 8-GPU job (nightly-accuracy-8-gpu-mi35x): new step after GPT-OSS

Tests google/gemma-4-31B-it (Dense 31B, TP=1) on mgsm_en with
--attention-backend triton and threshold 0.90 (observed 0.984).

Each step installs transformers from the commit required by #21952.
michaelzhang-ai added a commit that referenced this pull request Apr 13, 2026
Add Gemma 4 accuracy test as a step within the existing 2-GPU
accuracy job (nightly-accuracy-2-gpu) for both default ROCm and
ROCm 7.2 workflows.

Tests google/gemma-4-31B-it (Dense 31B, TP=1) on mgsm_en with
--attention-backend triton and threshold 0.90 (observed 0.984).

Step uses if:always() to run even if prior GSM8K step fails.
Each step installs transformers from the commit required by #21952.
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Pengyu Chen <pychen96@gmail.com>
Co-authored-by: kpham-sgl <khoa.pham@radixark.ai>
Co-authored-by: Andy Luo <andy.luo@amd.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: adarshxs <adarsh.shirawalmath@gmail.com>

Labels

multi-modal · quant · run-ci
