[Perf] Precompute gemma_weight to avoid redundant add on every forward #22673

Merged
ispobock merged 1 commit into sgl-project:main from Chen-0210:gemmarmsnorm-precompute-clean
Apr 17, 2026

Conversation

Contributor

@Chen-0210 Chen-0210 commented Apr 13, 2026

Motivation

GemmaRMSNorm computes `weight + 1.0` on every forward call in `forward_hip` and `forward_with_allreduce_fusion`. This repeated tensor addition is unnecessary overhead, since the result never changes between calls.

Modifications

Replace the runtime `weight + 1.0` computation with the cached `gemma_weight` in `forward_hip` and `forward_with_allreduce_fusion`.
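To illustrate the idea, here is a minimal hedged sketch of the optimization (not the actual sglang implementation; class shape and names are simplified): Gemma's RMSNorm scales by `1 + weight`, so the sum can be cached once at construction instead of being materialized on every forward pass.

```python
import torch
from torch import nn


class GemmaRMSNorm(nn.Module):
    """Simplified sketch of a Gemma-style RMSNorm that caches weight + 1.0.

    Gemma parameterizes the norm scale as (1 + weight), so the addition
    can be done once and stored as a buffer rather than repeated per call.
    """

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        # Gemma initializes weight to zero, i.e. an effective scale of 1.
        self.weight = nn.Parameter(torch.zeros(hidden_size))
        # Precompute weight + 1.0 once; forward reads this cached buffer.
        self.register_buffer("gemma_weight", self.weight.data + 1.0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Standard RMSNorm in float32 for numerical stability.
        var = x.float().pow(2).mean(-1, keepdim=True)
        x_norm = (x.float() * torch.rsqrt(var + self.eps)).to(x.dtype)
        # Uses the cached buffer instead of computing weight + 1.0 here.
        return x_norm * self.gemma_weight
```

The buffer must be refreshed whenever `weight` is loaded or updated, which is what the weight-loader discussion below is about.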

Accuracy Tests

python -m sglang.launch_server \
  --model-path /Qwen/Qwen3.5-397B-A17B/ \
  --tp-size 8 \
  --mamba-scheduler-strategy extra_buffer \
  --speculative-algo NEXTN \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4

python3 benchmark/gsm8k/bench_sglang.py --parallel 1000 --num-questions 1000

100%|██████████| 1000/1000 [02:32<00:00,  6.55it/s]
Accuracy: 0.959
Invalid: 0.011
Latency: 152.584 s

Speed Tests and Profiling

Speed Tests

python -m sglang.launch_server \
  --model-path /Qwen/Qwen3.5-397B-A17B/ \
  --tp-size 8 \
  --mamba-scheduler-strategy extra_buffer \
  --speculative-algo NEXTN \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4

python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 256 --random-input-len 1024 --random-output-len 1024 --random-range-ratio 1 --max-concurrency 8..32
concurrency   before E2E (ms)   after E2E (ms)
8             6807.02           6762.07
16            9887.29           9642.32
32            13988.34          13954.73

Profiling

[profiling screenshot attached to the PR]

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@Chen-0210 Chen-0210 changed the title [Perf] Precompute GemmaRMSNorm gemma_weight to avoid redundant add on every forward [Perf] Precompute gemma_weight to avoid redundant add on every forward Apr 13, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request optimizes Gemma layer normalization by precomputing gemma_weight (the standard weight plus 1.0) and storing it as a buffer, rather than recalculating it during every forward pass. A critical issue was identified in the weight loader, where self.gemma_weight is reassigned rather than updated in place; this would break the buffer's connection to the module and cause issues when moving the model between devices.
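The reassignment pitfall the bot flags can be demonstrated with a small hedged sketch (hypothetical classes, not the PR's actual code): a cached tensor stored as a plain attribute is invisible to `module.to(device)` and `state_dict()`, whereas a registered buffer refreshed in place with `copy_()` stays tracked by the module.

```python
import torch
from torch import nn


class NormPlainAttr(nn.Module):
    """Anti-pattern: caches the scale as a plain attribute.

    Plain tensor attributes are not tracked by nn.Module, so
    module.to(device) will not move the cache and state_dict()
    will not save it.
    """

    def __init__(self, hidden_size: int = 4):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(hidden_size))
        self.gemma_weight = self.weight.data + 1.0  # untracked tensor


class NormBuffer(nn.Module):
    """Registers the cache as a buffer and refreshes it in place."""

    def __init__(self, hidden_size: int = 4):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(hidden_size))
        self.register_buffer("gemma_weight", self.weight.data + 1.0)

    def load_weight(self, loaded: torch.Tensor) -> None:
        self.weight.data.copy_(loaded)
        # In-place copy keeps the buffer registration (and its device
        # placement) intact after the weights are loaded.
        self.gemma_weight.copy_(self.weight.data + 1.0)
```

With the buffer variant, device moves and checkpointing continue to see `gemma_weight`; the plain-attribute variant silently loses it.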

Comment thread python/sglang/srt/layers/layernorm.py
@Chen-0210
Contributor Author

/tag-and-rerun-ci

@Chen-0210 Chen-0210 force-pushed the gemmarmsnorm-precompute-clean branch from 3bd1362 to fa5120d on April 14, 2026 03:07
@Chen-0210
Contributor Author

/tag-and-rerun-ci


@Chen-0210
Contributor Author

/tag-and-rerun-ci


@ispobock
Collaborator

/rerun-stage stage-c-test-8-gpu-h200

@github-actions
Contributor

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

@ispobock ispobock merged commit 2bac219 into sgl-project:main Apr 17, 2026
492 of 570 checks passed
@Chen-0210 Chen-0210 deleted the gemmarmsnorm-precompute-clean branch April 18, 2026 03:47