
[AMD] Add fused GemmaRMSNorm forward_hip to use aiter/vllm kernels for qwen3.5#21188

Merged
HaiShaw merged 1 commit into sgl-project:main from yichiche:yichiche/fuse-gemma-rmsnorm-hip
Mar 23, 2026

Conversation

@yichiche
Collaborator

@yichiche yichiche commented Mar 23, 2026

co-author: @zhentaocc

Motivation

Previously, GemmaRMSNorm re-dispatched HIP to forward_native, bypassing the fused kernels. This PR adds a dedicated forward_hip that routes through the aiter or vllm fused_add_rms_norm/rms_norm kernels, matching the existing CUDA path logic, including the +1 weight offset that Gemma requires.
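
For reference, Gemma's RMSNorm scales by (1 + weight) rather than weight, which is why fused kernels written for standard RMSNorm need the offset applied before dispatch. A minimal sketch of the reference semantics (mirroring forward_native; the exact sglang internals may differ):

import torch

def gemma_rms_norm_ref(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
    # Compute in float32 for numerical stability, as the native path does.
    orig_dtype = x.dtype
    x = x.float()
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    x = x * torch.rsqrt(variance + eps)
    # Gemma stores the scale as (weight - 1), so apply the +1 offset here.
    return (x * (1.0 + weight.float())).to(orig_dtype)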

Modifications

  • Removed the __init__ override that forced _forward_method = forward_native on HIP.
  • Added a forward_hip() method to GemmaRMSNorm (see the sketch after this list) that:
    • Applies the Gemma-specific weight + 1.0 offset.
    • Routes through aiter fused kernels when _use_aiter is set.
    • Falls back to vllm fused kernels otherwise.
    • Falls back to forward_native if neither is available.
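
A rough sketch of the dispatch this adds, assuming vllm-style rms_norm(out, x, weight, eps) and fused_add_rms_norm(x, residual, weight, eps) entry points; the actual wrapper imports and flag plumbing in the merged code may differ:

def forward_hip(self, x, residual=None):
    if not (_use_aiter or _has_vllm_rms_norm):
        # Neither fused backend is available: keep the old native behavior.
        return self.forward_native(x, residual)

    # The fused kernels implement standard RMSNorm, so fold in Gemma's +1 offset.
    weight = self.weight + 1.0

    if residual is not None:
        # In-place fused residual-add + RMSNorm, mirroring the CUDA path.
        fused_add_rms_norm(x, residual, weight, self.variance_epsilon)
        return x, residual

    out = torch.empty_like(x)
    rms_norm(out, x, weight, self.variance_epsilon)
    return out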

Accuracy Tests

Model: Qwen3.5-397B-A17B-FP8, Hardware: 8x MI355X, Image: rocm/sgl-dev:v0.5.9-rocm720-mi35x-20260318

GSM8K (5-shot, 2000 questions, parallel=1000):

           Before (aiter)   After (aiter + PR)
Accuracy   0.943            0.955

Server launch command:

SGLANG_USE_AITER=1 python3 -m sglang.launch_server \
  --model-path /data/Qwen3.5-397B-A17B-FP8/ \
  --tp 8 \
  --attention-backend aiter \
  --trust-remote-code \
  --model-loader-extra-config '{"enable_multithread_load": true}' \
  --watchdog-timeout 1200 \
  --mem-fraction-static 0.8 \
  --host 0.0.0.0 --port 9000

Accuracy test command:

python3 benchmark/gsm8k/bench_sglang.py \
  --num-questions 2000 --parallel 1000 --num-shots 5 --port 9000

Benchmarking and Profiling

Workload: sglang.bench_serving --dataset-name random --random-input 8192 --random-output 1024 --random-range-ratio 1.0

Benchmark command:

python3 -m sglang.bench_serving \
  --host localhost --port 9000 \
  --model /data/Qwen3.5-397B-A17B-FP8/ \
  --dataset-name random \
  --random-input 8192 --random-output 1024 \
  --random-range-ratio 1.0 \
  --max-concurrency 1 --num-prompt 8

Concurrency=1 comparison (8 prompts):

Latency & Throughput

                           Before (aiter)   After (aiter + PR)   Improvement
Median E2E Latency (ms)    16,179.00        12,438.75            -23.1%
Total Throughput (tok/s)   569.53           740.14               +30.0%

TTFT & ITL

                   Before (aiter)   After (aiter + PR)   Improvement
Median TTFT (ms)   243.40           202.14               -17.0%
Median ITL (ms)    15.58            11.96                -23.2%

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.


@HaiShaw HaiShaw left a comment


Approve for now. Let's remove/clone vllm dependencies in a different PR.

@HaiShaw
Collaborator

HaiShaw commented Mar 23, 2026

/tag-and-rerun-ci

@yichiche
Collaborator Author

yichiche commented Mar 23, 2026

> Removing the HIP override that forced forward_native lets CustomOp dispatch hit forward_hip; nit: double-check no callers still expect forward_native on HIP when _has_vllm_rms_norm is false (forward_hip already falls back, so this should be OK).

Confirmed -- forward_hip handles all three cases:

  • _use_aiter=True → aiter fused kernels
  • _has_vllm_rms_norm=True (via vllm) → vllm fused kernels
  • neither available → falls back to forward_native at line 505

No callers check which method _forward_method points to; they all go through MultiPlatformOp.forward(), which just calls self._forward_method(). The fallback path is identical to what the old _is_hip override did, just reached via forward_hip instead of being wired directly in __init__.
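
In other words, the dispatch is a single indirection chosen once at construction time; a simplified sketch of the pattern described here (class and flag names as used in this thread, details assumed):

import torch

class MultiPlatformOp(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Chosen once; callers never branch on the platform themselves.
        # _is_hip / _is_cuda stand in for sglang's platform flags, and the
        # forward_* methods are defined by subclasses such as GemmaRMSNorm.
        if _is_hip:
            self._forward_method = self.forward_hip  # reachable again after this PR
        elif _is_cuda:
            self._forward_method = self.forward_cuda
        else:
            self._forward_method = self.forward_native

    def forward(self, *args, **kwargs):
        return self._forward_method(*args, **kwargs)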

@yichiche yichiche marked this pull request as draft March 23, 2026 09:00
@yichiche yichiche marked this pull request as ready for review March 23, 2026 09:17
@yichiche yichiche added the amd label Mar 23, 2026
@HaiShaw HaiShaw merged commit b4d3fb0 into sgl-project:main Mar 23, 2026
149 of 175 checks passed
adityavaid pushed a commit to adityavaid/sglang that referenced this pull request Mar 24, 2026
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
