[AMD] Add fused GemmaRMSNorm forward_hip to use aiter/vllm kernels for qwen3.5 by yichiche · Pull Request #21188 · sgl-project/sglang

yichiche · 2026-03-23T06:24:07Z

Motivation

Previously, GemmaRMSNorm forced the HIP path to forward_native, bypassing the fused kernels. This PR adds a dedicated forward_hip that routes through the aiter or vllm fused_add_rms_norm/rms_norm kernels, matching the existing CUDA path logic and applying the +1 weight offset that Gemma requires.

Modifications

Removed the __init__ override that forced _forward_method = forward_native on HIP.
Added forward_hip() method to GemmaRMSNorm that:
- Applies the Gemma-specific weight + 1.0 offset.
- Routes through aiter fused kernels when _use_aiter is set.
- Falls back to vllm fused kernels otherwise.
- Falls back to forward_native if neither is available.
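
The offset and the fallback order listed above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this PR: `gemma_rms_norm_reference`, `pick_backend`, and the `eps` default are hypothetical names and values chosen here, and the real forward_hip operates on tensors and calls the aiter/vllm kernels rather than looping over lists.

```python
import math


def gemma_rms_norm_reference(x, weight, eps=1e-6, residual=None):
    """Reference Gemma-style RMSNorm over one vector (lists of floats).

    Hypothetical helper for illustration only. If `residual` is given, it is
    added to `x` before normalization, and the sum is returned as the new
    residual, which is the behaviour a fused add + RMSNorm kernel provides.
    """
    if residual is not None:
        x = [xi + ri for xi, ri in zip(x, residual)]
        residual = x
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    # Gemma scales by (1 + weight) rather than by weight alone.
    out = [v * inv_rms * (1.0 + w) for v, w in zip(x, weight)]
    return out if residual is None else (out, residual)


def pick_backend(use_aiter, has_vllm_rms_norm):
    """Mirror the fallback order from the list above (hypothetical helper)."""
    if use_aiter:
        return "aiter"
    if has_vllm_rms_norm:
        return "vllm"
    return "native"


if __name__ == "__main__":
    print(gemma_rms_norm_reference([3.0, 4.0], [0.0, 0.0]))
    print(pick_backend(use_aiter=False, has_vllm_rms_norm=True))
```

With a zero weight the output is the plain RMS-normalized input, which shows why a kernel expecting a standard RMSNorm weight has to be handed weight + 1.0 to reproduce Gemma's behaviour.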

Accuracy Tests

Model: Qwen3.5-397B-A17B-FP8, Hardware: 8x MI355X, Image: rocm/sgl-dev:v0.5.9-rocm720-mi35x-20260318

GSM8K (5-shot, 2000 questions, parallel=1000):

Metric	Before (aiter)	After (aiter + PR)
Accuracy	0.943	0.955

Server launch command:

SGLANG_USE_AITER=1 python3 -m sglang.launch_server \
  --model-path /data/Qwen3.5-397B-A17B-FP8/ \
  --tp 8 \
  --attention-backend aiter \
  --trust-remote-code \
  --model-loader-extra-config '{"enable_multithread_load": true}' \
  --watchdog-timeout 1200 \
  --mem-fraction-static 0.8 \
  --host 0.0.0.0 --port 9000

Accuracy test command:

python3 benchmark/gsm8k/bench_sglang.py \
  --num-questions 2000 --parallel 1000 --num-shots 5 --port 9000

Benchmarking and Profiling

Workload: sglang.bench_serving --dataset-name random --random-input 8192 --random-output 1024 --random-range-ratio 1.0

Benchmark command:

python3 -m sglang.bench_serving \
  --host localhost --port 9000 \
  --model /data/Qwen3.5-397B-A17B-FP8/ \
  --dataset-name random \
  --random-input 8192 --random-output 1024 \
  --random-range-ratio 1.0 \
  --max-concurrency 1 --num-prompt 8

Concurrency=1 comparison (8 prompts):

Latency & Throughput

Metric	Before (aiter)	After (aiter + PR)	Change
Median E2E Latency (ms)	16,179.00	12,438.75	-23.1%
Total Throughput (tok/s)	569.53	740.14	+30.0%

TTFT & ITL

Metric	Before (aiter)	After (aiter + PR)	Change
Median TTFT (ms)	243.40	202.14	-17.0%
Median ITL (ms)	15.58	11.96	-23.2%
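
The percentage column in the two tables above is the relative change from the "before" run to the "after" run. A small sketch of that arithmetic, using only the numbers reported in the tables (`relative_change_pct` is a name chosen here for illustration):

```python
def relative_change_pct(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (before, after) pairs copied from the tables above.
rows = {
    "Median E2E Latency (ms)": (16179.00, 12438.75),
    "Total Throughput (tok/s)": (569.53, 740.14),
    "Median TTFT (ms)": (243.40, 202.14),
    "Median ITL (ms)": (15.58, 11.96),
}

for name, (before, after) in rows.items():
    print(f"{name}: {relative_change_pct(before, after):+.1f}%")
```

For the latency metrics a negative value is the improvement; for throughput a positive value is.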

Checklist

Format your code according to "Format code with pre-commit".
Add unit tests according to "Run and add unit tests".
Update documentation according to "Write documentations".
Provide accuracy and speed benchmark results according to "Test the accuracy" and "Benchmark the speed".
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.



HaiShaw

Approve for now. Let's remove/clone vllm dependencies in a different PR.

HaiShaw · 2026-03-23T06:49:27Z

/tag-and-rerun-ci

yichiche · 2026-03-23T08:00:31Z

Removing the HIP override that forced forward_native lets CustomOp dispatch hit forward_hip; nit: double-check no callers still expect forward_native on HIP when _has_vllm_rms_norm is false (forward_hip already falls back, so this should be OK).

Confirmed -- forward_hip handles all three cases:

- _use_aiter=True → aiter fused kernels
- _has_vllm_rms_norm=True (via vllm) → vllm fused kernels
- neither available → falls back to forward_native at line 505

No callers check which method _forward_method points to; they all go through MultiPlatformOp.forward(), which just calls self._forward_method(). The fallback path is identical to what the old _is_hip override did; it is now reached via forward_hip instead of being wired directly in __init__.
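
The dispatch pattern described in this reply can be sketched as follows. The class names `MultiPlatformOpSketch` and `GemmaRMSNormSketch`, the `platform` string, and the tuple return values are all hypothetical stand-ins for illustration; the actual SGLang classes handle more platforms and call real kernels.

```python
class MultiPlatformOpSketch:
    """Hypothetical stand-in for a multi-platform op base class."""

    def __init__(self, platform):
        # The forward method is chosen once, at construction time.
        self._forward_method = self._dispatch(platform)

    def _dispatch(self, platform):
        if platform == "hip":
            return self.forward_hip
        return self.forward_native

    def forward(self, *args, **kwargs):
        # Callers only ever go through forward(); they never inspect
        # which concrete method was selected.
        return self._forward_method(*args, **kwargs)

    def forward_native(self, x):
        raise NotImplementedError

    def forward_hip(self, x):
        return self.forward_native(x)


class GemmaRMSNormSketch(MultiPlatformOpSketch):
    """Hypothetical op whose HIP path has three outcomes."""

    def __init__(self, platform, use_aiter=False, has_vllm_rms_norm=False):
        super().__init__(platform)
        self.use_aiter = use_aiter
        self.has_vllm_rms_norm = has_vllm_rms_norm

    def forward_native(self, x):
        return ("native", x)

    def forward_hip(self, x):
        if self.use_aiter:
            return ("aiter", x)  # placeholder for the aiter fused kernel
        if self.has_vllm_rms_norm:
            return ("vllm", x)  # placeholder for the vllm fused kernel
        return self.forward_native(x)  # same result as the old override


if __name__ == "__main__":
    print(GemmaRMSNormSketch("hip").forward(1.0))
    print(GemmaRMSNormSketch("hip", use_aiter=True).forward(1.0))
```

In this sketch, a HIP instance with neither backend available returns the native result, so a caller cannot distinguish it from an instance that was wired to forward_native directly.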


yichiche requested review from BBuf, Edwardf0t1, Fridge003, Ying1123, ch-wan, ispobock and merrymercy as code owners March 23, 2026 06:24

HaiShaw approved these changes Mar 23, 2026


github-actions Bot added the run-ci label Mar 23, 2026

yichiche marked this pull request as draft March 23, 2026 09:00

yichiche marked this pull request as ready for review March 23, 2026 09:17

yichiche added the amd label Mar 23, 2026

HaiShaw merged commit b4d3fb0 into sgl-project:main Mar 23, 2026
149 of 175 checks passed


zhentaocc mentioned this pull request Mar 30, 2026

[AMD][MI35X]Update qwen3.5 perf SemiAnalysisAI/InferenceX#980

Closed


zhentaocc mentioned this pull request Apr 16, 2026

[AMD][MI35X]Update qwen3.5 perf SemiAnalysisAI/InferenceX#1036

Merged


HaiShaw merged 1 commit into sgl-project:main from yichiche:yichiche/fuse-gemma-rmsnorm-hip
