Revert "perf: optimize PCG inductor path for FP8 models (#21734)"#23159
Merged
Revert "perf: optimize PCG inductor path for FP8 models (#21734)"#23159
Conversation
This reverts commit 6da3aba.

The change to `apply_qk_norm` in `python/sglang/srt/models/utils.py` replaces `q.reshape(-1, head_dim)` / `k.reshape(-1, head_dim)` with `q.view(*q.shape[:-1], -1, head_dim)` (and the same for `k`). Under PCG capture mode, `q`/`k` are frequently non-contiguous (stride-trick views from the QKV split), so `.view()` yields a tensor with a wrong stride relative to what the `q_norm`/`k_norm` kernels expect. The kernel then writes into an unmapped page and triggers a non-recoverable "Memory access fault by GPU node-2 ... Reason: Unknown" that kills the scheduler with SIGABRT/SIGKILL.

`apply_qk_norm` is called from 14 model files (qwen3, qwen3_5, qwen3_next, qwen3_moe, llada2, glm4_moe, olmo2, commandr, dflash, sdar, sdar_moe, bailing_moe, bailing_moe_linear, afmoe), so the regression is not limited to FP8 models as the PR title suggests; bf16 models hit it too.

Made-with: Cursor
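A minimal PyTorch repro of the stride hazard described above, with illustrative sizes (the real shapes come from the model config; none of these names are sglang code):

```python
import torch

# q is a last-dim split of a fused QKV projection, so it shares qkv's
# storage and its row stride spans the full QKV width.
tokens, heads, head_dim = 8, 4, 16
q_size = heads * head_dim
qkv = torch.randn(tokens, 3 * q_size)
q, k, v = qkv.split([q_size, q_size, q_size], dim=-1)

print(q.is_contiguous())            # False: stride is (192, 1), not (64, 1)

# view() succeeds here (only the last dim is subdivided) but inherits the
# parent's row stride, so a kernel that assumes a dense layout indexes past
# the mapped allocation -> "Memory access fault by GPU" on AMD.
q_view = q.view(*q.shape[:-1], -1, head_dim)
print(q_view.stride())              # (192, 16, 1), not (64, 16, 1)

# reshape() detects the incompatible strides and copies to a dense buffer,
# which is the layout the q_norm/k_norm kernels expect.
q_flat = q.reshape(-1, head_dim)
print(q_flat.is_contiguous())       # True
```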
Contributor
Code Review
This pull request removes the conditional logic tied to the 'inductor' compiler and the global server arguments in `apply_fp8_linear` and `apply_qk_norm`. It also simplifies tensor reshaping in `apply_qk_norm` by using `.reshape()` instead of `.view()`. I have no feedback to provide.
HaiShaw (Collaborator) approved these changes on Apr 20, 2026
jasperjiaguo added a commit to jasperjiaguo/sglang that referenced this pull request on Apr 20, 2026
…21734)

Re-applies the inductor-fusion optimizations from sgl-project#21734 that were reverted in sgl-project#23159, with the AMD crash fixed.

Changes vs. sgl-project#21734:

- models/utils.py: use `reshape(*q.shape[:-1], -1, head_dim)` instead of `view(...)`. reshape is safe on non-contiguous QKV-split tensors (the AMD crash root cause), while still giving inductor the same multi-dim shape information as view for fusion with surrounding ops.
- fp8_utils.py: unchanged from sgl-project#21734 (FP8 per-tensor activation quant inductor-fusion path).

Why sgl-project#21734's view() crashed on AMD:

- Under PCG capture, q/k are stride-trick views from the QKV split and are frequently non-contiguous.
- `view()` on non-contiguous input yields a tensor whose strides disagree with what q_norm/k_norm kernels expect, leading to writes into unmapped pages (`Memory access fault by GPU`) on AMD.
- `reshape(*q.shape[:-1], -1, head_dim)` preserves the same multi-dim shape for inductor but copies when strides don't allow a view, avoiding the fault.

Intention matches sgl-project#21734: preserve the multi-dim shape for inductor fusion while skipping the opaque fused_inplace_qknorm under the inductor compiler so inductor can fuse QK norm with RMSNorm / quant / residual add.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
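A hedged sketch of the pattern that commit message describes (hypothetical function and argument names, not the actual sglang implementation):

```python
import torch

def apply_qk_norm_sketch(q: torch.Tensor, k: torch.Tensor,
                         q_norm, k_norm, head_dim: int):
    # reshape() keeps the multi-dim (..., num_heads, head_dim) shape that
    # inductor uses for fusion, but copies to a dense buffer whenever the
    # input is a non-contiguous QKV-split view, avoiding the AMD fault.
    q_shape, k_shape = q.shape, k.shape
    q = q_norm(q.reshape(*q_shape[:-1], -1, head_dim)).reshape(q_shape)
    k = k_norm(k.reshape(*k_shape[:-1], -1, head_dim)).reshape(k_shape)
    return q, k
```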
zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request on Apr 23, 2026

…21734)" (sgl-project#23159)

Feel free to PR again.
kyx1999 pushed a commit to KMSorSMS/sglang that referenced this pull request on Apr 27, 2026

…21734)" (sgl-project#23159)

Feel free to PR again.
zhentaocc pushed a commit to zhentaocc/sglang that referenced this pull request on Apr 29, 2026

Profile of DFlash on Qwen3-8B (MI355X, conc=64) showed Q/K RMSNorm running as two separate kernels on ROCm (~12.9 ms total per batch in the draft model alone), because `apply_qk_norm` only had a fast path for CUDA. Add an AITER fused fast path that runs Q-norm + K-norm in a single HIP kernel:

- Imports `aiter.ops.fused_qk_norm_rope_cache_quant.fused_qk_rmsnorm` when running on ROCm with AITER installed (gracefully falls back if the module is missing)
- Gated on `_is_hip` + matching epsilons + non-deterministic mode
- Uses reshape (not view) because `qkv.split()` returns strided views and ROCm RMSNorm kernels fault on strided inputs (see sgl-project#23159)
- Returns new tensors (the AITER kernel is functional, not in-place)

Benefits both the DFlash draft model (5 layers x N requests per decode step) and the target Qwen3 model on ROCm. The CUDA path is untouched.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
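A hedged sketch of the gating that commit message describes; the AITER import path is taken from the message itself, while the helper name and flag plumbing are hypothetical:

```python
# Probe for the fused kernel at import time; non-ROCm or AITER-less
# installs fall back to the existing two-kernel path.
try:
    from aiter.ops.fused_qk_norm_rope_cache_quant import fused_qk_rmsnorm
except ImportError:
    fused_qk_rmsnorm = None

def can_use_fused_qk_rmsnorm(is_hip: bool, q_eps: float, k_eps: float,
                             deterministic: bool) -> bool:
    # Gates per the commit message: ROCm only, matching epsilons,
    # non-deterministic mode, and the AITER kernel importable.
    return (is_hip and fused_qk_rmsnorm is not None
            and q_eps == k_eps and not deterministic)
```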
Summary

Reverts #21734 (perf: optimize PCG inductor path for FP8 models, commit 6da3aba). Re-submission of the previously closed #23039, this time with full Tier 2 + Tier 3 AMD CI verification evidence.

cc @jasperjiaguo, @HaiShaw

Motivation

#21734 introduces a non-recoverable `Memory access fault by GPU` on AMD CI that takes down 7+ tests across 14 model files.

- The faulty change is in `python/sglang/srt/models/utils.py::apply_qk_norm`, where four `q.reshape(-1, head_dim)` / `k.reshape(-1, head_dim)` calls are replaced with `q.view(*q.shape[:-1], -1, head_dim)` / `k.view(*k.shape[:-1], -1, head_dim)`.
- `q`/`k` are frequently non-contiguous (stride-trick views from the QKV split), so `.view()` yields a tensor whose stride disagrees with what the downstream `q_norm`/`k_norm` kernels expect.
- The result is `Memory access fault by GPU node-2 (Agent handle: 0x...) on address 0x7f...`, immediately followed by the scheduler dying with exit code -6 / SIGKILL -9 / a 30-min action timeout.
- `apply_qk_norm` is called from 14 model files (`qwen3.py`, `qwen3_5.py`, `qwen3_next.py`, `qwen3_moe.py`, `llada2.py`, `glm4_moe.py`, `olmo2.py`, `commandr.py`, `dflash.py`, `sdar.py`, `sdar_moe.py`, `bailing_moe.py`, `bailing_moe_linear.py`, `afmoe.py`), so the regression is not limited to FP8 models as the PR title suggests; bf16 models hit it too.

Evidence

Bisect: the first failing run shows `Memory access fault ... addr 0x7f28244bf000`. This is before the aiter 0.1.12.post1 bump (Apr 13), so the regression cannot be attributed to aiter. Stage: `stage-b-test-1-gpu-small-amd`.

Affected tests (all share the same fault fingerprint):

- `test_lora_load_from_tensor.py`
- `test_priority_metrics.py`
- `test_metrics.py`
- `test_llada2_mini.py`
- `test_llada2_mini_amd.py`
- `test_reasoning.py`
- `test_deterministic.py`
- `test_eagle_dp_attention.py`

Manual revert verification: `test_priority_metrics.py` 70s PASS / `test_metrics.py` 73s PASS / `test_deterministic.py` 155s PASS / `test_llada2_mini.py` 293s PASS; all `Memory access fault` occurrences gone.

Tier 2 + Tier 3 verification (this revert): `stage-b-test-1-gpu-small-amd` + `stage-b-test-1-gpu-small-amd-nondeterministic` + `stage-c-test-4-gpu-amd` (16 jobs) all green; all affected tests PASS. The `Memory access fault` fingerprint never re-appeared.

Modifications

Reverts `python/sglang/srt/layers/quantization/fp8_utils.py` and `python/sglang/srt/models/utils.py` to their state before 6da3aba. 2 files changed, 10 insertions(+), 35 deletions(-).

Checklist