
Revert "perf: optimize PCG inductor path for FP8 models (#21734)"#23159

Merged
HaiShaw merged 2 commits intomainfrom
bingxche/revert-pr-21734
Apr 20, 2026
Merged

Revert "perf: optimize PCG inductor path for FP8 models (#21734)"#23159
HaiShaw merged 2 commits intomainfrom
bingxche/revert-pr-21734

Conversation

@bingxche (Collaborator) commented Apr 19, 2026

Summary

Reverts #21734 (perf: optimize PCG inductor path for FP8 models, commit 6da3aba).

Re-submission of the previously closed #23039, this time with full Tier 2 + Tier 3 AMD CI verification evidence.

cc @jasperjiaguo , @HaiShaw

Motivation

#21734 introduces a non-recoverable "Memory access fault by GPU" on AMD CI that takes down 7+ tests across 14 model files. The faulty change is in `python/sglang/srt/models/utils.py::apply_qk_norm`, where four `q.reshape(-1, head_dim)` / `k.reshape(-1, head_dim)` calls are replaced with `q.view(*q.shape[:-1], -1, head_dim)` / `k.view(*k.shape[:-1], -1, head_dim)`:

  • Under PCG capture mode q / k are frequently non-contiguous (stride-trick views from the QKV split), so `.view()` yields a tensor whose strides disagree with what the downstream q_norm / k_norm kernels expect (see the sketch after this list).
  • The kernel then writes into an unmapped page, producing the fingerprint `Memory access fault by GPU node-2 (Agent handle: 0x...) on address 0x7f...`, immediately followed by scheduler death with exit code -6 / SIGKILL -9 / a 30-min action timeout.
  • apply_qk_norm is called from 14 model files (qwen3.py, qwen3_5.py, qwen3_next.py, qwen3_moe.py, llada2.py, glm4_moe.py, olmo2.py, commandr.py, dflash.py, sdar.py, sdar_moe.py, bailing_moe.py, bailing_moe_linear.py, afmoe.py), so the regression is not limited to FP8 models as the PR title suggests; bf16 models hit it too.
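
A minimal sketch of the stride issue, with illustrative shapes (not sglang code; the tensor sizes and packed-QKV layout here are assumptions for demonstration):

```python
import torch

num_heads, head_dim = 4, 8
# Fused QKV projection output; split() returns stride-trick views that
# share qkv's storage, so q is non-contiguous.
qkv = torch.randn(2, 3 * num_heads * head_dim)
q, _k, _v = qkv.split(num_heads * head_dim, dim=-1)
print(q.is_contiguous())      # False
print(q.stride())             # (96, 1) -- row stride spans all of qkv

# Old code: reshape() silently falls back to a copy when strides forbid
# a view, so the result is densely packed.
q_old = q.reshape(-1, head_dim)
print(q_old.is_contiguous())  # True

# #21734's code: this view is legal (it only splits the last dim), but
# it inherits the parent's row stride, so it is NOT densely packed.
q_new = q.view(*q.shape[:-1], -1, head_dim)
print(q_new.stride())         # (96, 8, 1) instead of packed (32, 8, 1)
# A kernel that indexes q_new assuming packed strides walks out of the
# slice into unmapped memory -- the "Memory access fault by GPU".
```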

Evidence

Bisect:

  • The PR's own check (run 24262716047 partition 4, Apr 11) was already failing with Memory access fault ... addr 0x7f28244bf000. This is before the aiter 0.1.12.post1 bump (Apr 13), so the regression cannot be attributed to aiter.
  • Baseline AMD schedule PR test (run 24527146590) hits the same fingerprint on partitions 2 / 4 / 8 / 10 / 12 / 13 of stage-b-test-1-gpu-small-amd.

Affected tests (all share the same fault fingerprint):

  • test_lora_load_from_tensor.py
  • test_priority_metrics.py
  • test_metrics.py
  • test_llada2_mini.py
  • test_llada2_mini_amd.py
  • test_reasoning.py
  • test_deterministic.py
  • (likely) test_eagle_dp_attention.py

Manual revert verification:

  • One-shot revert run 24450852614: test_priority_metrics.py 70s PASS / test_metrics.py 73s PASS / test_deterministic.py 155s PASS / test_llada2_mini.py 293s PASS, all Memory access fault occurrences gone.

Tier 2 + Tier 3 verification (this revert):

  • Tier 2 run 24596337420 (Apr 18 03:55 UTC, 1h 11m): stage-b-test-1-gpu-small-amd + stage-b-test-1-gpu-small-amd-nondeterministic + stage-c-test-4-gpu-amd (16 jobs) all green, all 7 affected tests PASS.
  • Tier 3 run 24599716745 (Apr 18 07:21 UTC, full stage matrix ~35 jobs): Memory access fault fingerprint never re-appeared.

Modifications

  • Revert python/sglang/srt/layers/quantization/fp8_utils.py and python/sglang/srt/models/utils.py to their state before 6da3aba.
  • Diff stat: 2 files changed, 10 insertions(+), 35 deletions(-).

Checklist

  • Code formatted according to the repo's "Format code with pre-commit" guide.
  • CI verified green on this revert (Tier 2 run 24596337420 + Tier 3 run 24599716745).

This reverts commit 6da3aba.

The change to apply_qk_norm in python/sglang/srt/models/utils.py replaces
q.reshape(-1, head_dim) / k.reshape(-1, head_dim) with q.view(*q.shape[:-1],
-1, head_dim) and the analogous view for k. Under PCG capture mode q/k are
frequently non-contiguous (stride-trick views from the QKV split), so
.view() yields a tensor whose strides disagree with what the q_norm /
k_norm kernels expect. The kernel then writes into an unmapped page and
triggers a non-recoverable "Memory access fault by GPU node-2 ... Reason:
Unknown" that kills the scheduler with SIGABRT/SIGKILL.

apply_qk_norm is called from 14 model files (qwen3, qwen3_5, qwen3_next,
qwen3_moe, llada2, glm4_moe, olmo2, commandr, dflash, sdar, sdar_moe,
bailing_moe, bailing_moe_linear, afmoe), so the regression is not limited
to FP8 models as the PR title suggests; bf16 models hit it too.

Made-with: Cursor
@gemini-code-assist (Bot) left a comment

Code Review

This pull request removes the conditional logic tied to the 'inductor' compiler and the global server arguments in apply_fp8_linear and apply_qk_norm. It also simplifies tensor reshaping in apply_qk_norm by using .reshape() instead of .view(). I have no feedback to provide.

@HaiShaw (Collaborator) commented Apr 20, 2026

@ispobock I'd like to revert #21734 as it broke AMD CI

@HaiShaw HaiShaw merged commit ab936ce into main Apr 20, 2026
96 of 151 checks passed
@HaiShaw HaiShaw deleted the bingxche/revert-pr-21734 branch April 20, 2026 06:32
jasperjiaguo added a commit to jasperjiaguo/sglang that referenced this pull request Apr 20, 2026
…21734)

Re-applies the inductor-fusion optimizations from sgl-project#21734 that were
reverted in sgl-project#23159, with the AMD crash fixed.

Changes vs. sgl-project#21734:
- models/utils.py: use `reshape(*q.shape[:-1], -1, head_dim)` instead
  of `view(...)`. reshape is safe on non-contiguous QKV-split tensors
  (the AMD crash root cause), while still giving inductor the same
  multi-dim shape information as view for fusion with surrounding ops.
- fp8_utils.py: unchanged from sgl-project#21734 (FP8 per-tensor activation quant
  inductor-fusion path).

Why sgl-project#21734's view() crashed on AMD:
- Under PCG capture, q/k are stride-trick views from the QKV split
  and are frequently non-contiguous.
- `view()` on non-contiguous input yields a tensor whose strides
  disagree with what q_norm/k_norm kernels expect, leading to writes
  into unmapped pages (`Memory access fault by GPU`) on AMD.
- `reshape(*q.shape[:-1], -1, head_dim)` preserves the same multi-dim
  shape for inductor but copies when strides don't allow a view,
  avoiding the fault.

Intention matches sgl-project#21734: preserve the multi-dim shape for inductor
fusion while skipping the opaque fused_inplace_qknorm under the
inductor compiler so inductor can fuse QK norm with RMSNorm / quant
/ residual add.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
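
A hedged sketch of the fixed reshaping this follow-up describes, assuming q_norm/k_norm are RMSNorm-like modules applied over a trailing head_dim-sized axis (the real apply_qk_norm in python/sglang/srt/models/utils.py differs in detail):

```python
import torch

def apply_qk_norm_sketch(q: torch.Tensor, k: torch.Tensor,
                         q_norm, k_norm, head_dim: int):
    # reshape keeps the multi-dim (..., num_heads, head_dim) layout that
    # inductor can fuse through, but silently copies whenever q/k are
    # non-contiguous QKV-split views, so downstream strides stay valid.
    q_flat_shape, k_flat_shape = q.shape, k.shape
    q = q_norm(q.reshape(*q.shape[:-1], -1, head_dim))
    k = k_norm(k.reshape(*k.shape[:-1], -1, head_dim))
    # Restore the original flattened (..., num_heads * head_dim) shape.
    return q.reshape(q_flat_shape), k.reshape(k_flat_shape)
```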

jasperjiaguo added further commits to jasperjiaguo/sglang that referenced this pull request on Apr 20–21, 2026, each with the same message as above.
zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request Apr 23, 2026
jasperjiaguo added further commits to jasperjiaguo/sglang that referenced this pull request on Apr 23–24, 2026, again with the same message.
kyx1999 pushed a commit to KMSorSMS/sglang that referenced this pull request Apr 27, 2026
zhentaocc pushed a commit to zhentaocc/sglang that referenced this pull request Apr 29, 2026
A profile of DFlash on Qwen3-8B (MI355X, conc=64) showed that Q/K RMSNorm
runs as two separate kernels on ROCm (~12.9ms total per batch in the draft
model alone), because apply_qk_norm only had a fast path for CUDA.

Add an AITER fused fast path that runs Q-norm + K-norm in a single HIP
kernel:
- Imports aiter.ops.fused_qk_norm_rope_cache_quant.fused_qk_rmsnorm
  when running on ROCm with AITER installed (gracefully falls back if
  the module is missing)
- Gated on _is_hip + matching epsilons + non-deterministic mode
- Uses reshape (not view) because qkv.split() returns strided views and
  ROCm RMSNorm kernels fault on strided inputs (see sgl-project#23159)
- Returns new tensors (AITER kernel is functional, not in-place)

Benefits both the DFlash draft model (5 layers x N requests per decode
step) and the target Qwen3 model on ROCm. CUDA path is untouched.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
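
A hedged sketch of the gated fast path this commit describes; the fused_qk_rmsnorm call signature and the variance_epsilon attribute below are illustrative assumptions, not the actual AITER API:

```python
import torch

# Graceful fallback: use the AITER fused kernel only when importable.
try:
    from aiter.ops.fused_qk_norm_rope_cache_quant import fused_qk_rmsnorm
except ImportError:
    fused_qk_rmsnorm = None

def qk_norm_rocm_sketch(q: torch.Tensor, k: torch.Tensor,
                        q_norm, k_norm, head_dim: int,
                        is_hip: bool, deterministic: bool):
    # Fused single-HIP-kernel path, gated as the commit describes:
    # ROCm + AITER present + matching epsilons + non-deterministic mode.
    if (is_hip and fused_qk_rmsnorm is not None and not deterministic
            and q_norm.variance_epsilon == k_norm.variance_epsilon):
        # reshape, not view: qkv.split() returns strided views, and the
        # ROCm RMSNorm kernels fault on strided inputs (see #23159).
        # Hypothetical signature; per the commit, the AITER kernel is
        # functional (returns new tensors) rather than in-place.
        return fused_qk_rmsnorm(
            q.reshape(-1, head_dim), k.reshape(-1, head_dim),
            q_norm.weight, k_norm.weight, q_norm.variance_epsilon,
        )
    # Fallback: two separate RMSNorm launches (the pre-commit behavior).
    return q_norm(q.reshape(-1, head_dim)), k_norm(k.reshape(-1, head_dim))
```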

zhentaocc pushed a commit to zhentaocc/sglang that referenced this pull request May 9, 2026, with the same message as above.