
Revert "perf: optimize PCG inductor path for FP8 models (#21734)"#23159

Merged
HaiShaw merged 2 commits intomainfrom
bingxche/revert-pr-21734
Apr 20, 2026
Merged

Revert "perf: optimize PCG inductor path for FP8 models (#21734)"#23159
HaiShaw merged 2 commits intomainfrom
bingxche/revert-pr-21734

Conversation

@bingxche (Collaborator) commented Apr 19, 2026

Summary

Reverts #21734 (perf: optimize PCG inductor path for FP8 models, commit 6da3aba).

Re-submission of the previously closed #23039, this time with full Tier 2 + Tier 3 AMD CI verification evidence.

cc @jasperjiaguo , @HaiShaw

Motivation

#21734 introduces a non-recoverable "Memory access fault by GPU" on AMD CI that takes down 7+ tests across 14 model files. The faulty change is in `python/sglang/srt/models/utils.py::apply_qk_norm`, where four `q.reshape(-1, head_dim)` / `k.reshape(-1, head_dim)` calls are replaced with `q.view(*q.shape[:-1], -1, head_dim)` / `k.view(*k.shape[:-1], -1, head_dim)`:

  • Under PCG capture mode q / k are frequently non-contiguous (stride-trick views from the QKV split), so `.view()` yields a tensor whose strides disagree with what the downstream q_norm / k_norm kernels expect (see the sketch after this list).
  • The kernel then writes into an unmapped page, producing the fingerprint `Memory access fault by GPU node-2 (Agent handle: 0x...) on address 0x7f...`, immediately followed by scheduler death with exit code -6 / SIGKILL -9 / a 30-min action timeout.
  • apply_qk_norm is called from 14 model files (qwen3.py, qwen3_5.py, qwen3_next.py, qwen3_moe.py, llada2.py, glm4_moe.py, olmo2.py, commandr.py, dflash.py, sdar.py, sdar_moe.py, bailing_moe.py, bailing_moe_linear.py, afmoe.py), so the regression is not limited to FP8 models as the PR title suggests; bf16 models hit it too.
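
A minimal sketch of the stride issue, with illustrative shapes (not sglang code; the tensor sizes and packed-QKV layout here are assumptions for demonstration):

```python
import torch

num_heads, head_dim = 4, 8
# Fused QKV projection output; split() returns stride-trick views that
# share qkv's storage, so q is non-contiguous.
qkv = torch.randn(2, 3 * num_heads * head_dim)
q, _k, _v = qkv.split(num_heads * head_dim, dim=-1)
print(q.is_contiguous())      # False
print(q.stride())             # (96, 1) -- row stride spans all of qkv

# Old code: reshape() silently falls back to a copy when strides forbid
# a view, so the result is densely packed.
q_old = q.reshape(-1, head_dim)
print(q_old.is_contiguous())  # True

# #21734's code: this view is legal (it only splits the last dim), but
# it inherits the parent's row stride, so it is NOT densely packed.
q_new = q.view(*q.shape[:-1], -1, head_dim)
print(q_new.stride())         # (96, 8, 1) instead of packed (32, 8, 1)
# A kernel that indexes q_new assuming packed strides walks out of the
# slice into unmapped memory -- the "Memory access fault by GPU".
```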

Evidence

Bisect:

  • The PR's own check (run 24262716047 partition 4, Apr 11) was already failing with Memory access fault ... addr 0x7f28244bf000. This is before the aiter 0.1.12.post1 bump (Apr 13), so the regression cannot be attributed to aiter.
  • Baseline AMD schedule PR test (run 24527146590) hits the same fingerprint on partitions 2 / 4 / 8 / 10 / 12 / 13 of stage-b-test-1-gpu-small-amd.

Affected tests (all share the same fault fingerprint):

  • test_lora_load_from_tensor.py
  • test_priority_metrics.py
  • test_metrics.py
  • test_llada2_mini.py
  • test_llada2_mini_amd.py
  • test_reasoning.py
  • test_deterministic.py
  • (likely) test_eagle_dp_attention.py

Manual revert verification:

  • One-shot revert run 24450852614: test_priority_metrics.py 70s PASS / test_metrics.py 73s PASS / test_deterministic.py 155s PASS / test_llada2_mini.py 293s PASS, all Memory access fault occurrences gone.

Tier 2 + Tier 3 verification (this revert):

  • Tier 2 run 24596337420 (Apr 18 03:55 UTC, 1h 11m): stage-b-test-1-gpu-small-amd + stage-b-test-1-gpu-small-amd-nondeterministic + stage-c-test-4-gpu-amd (16 jobs) all green, all 7 affected tests PASS.
  • Tier 3 run 24599716745 (Apr 18 07:21 UTC, full stage matrix ~35 jobs): Memory access fault fingerprint never re-appeared.

Modifications

  • Revert python/sglang/srt/layers/quantization/fp8_utils.py and python/sglang/srt/models/utils.py to their state before 6da3aba.
  • Diff stat: 2 files changed, 10 insertions(+), 35 deletions(-).

Checklist

  • Code formatted according to the repo's "Format code with pre-commit" guide.
  • CI verified green on this revert (Tier 2 run 24596337420 + Tier 3 run 24599716745).

This reverts commit 6da3aba.

The change to apply_qk_norm in python/sglang/srt/models/utils.py replaces
q.reshape(-1, head_dim) / k.reshape(-1, head_dim) with q.view(*q.shape[:-1],
-1, head_dim) and the analogous view for k. Under PCG capture mode q/k are
frequently non-contiguous (stride-trick views from the QKV split), so
.view() yields a tensor whose strides disagree with what the q_norm /
k_norm kernels expect. The kernel then writes into an unmapped page and
triggers a non-recoverable "Memory access fault by GPU node-2 ... Reason:
Unknown" that kills the scheduler with SIGABRT/SIGKILL.

apply_qk_norm is called from 14 model files (qwen3, qwen3_5, qwen3_next,
qwen3_moe, llada2, glm4_moe, olmo2, commandr, dflash, sdar, sdar_moe,
bailing_moe, bailing_moe_linear, afmoe), so the regression is not limited
to FP8 models as the PR title suggests; bf16 models hit it too.

Made-with: Cursor
@gemini-code-assist (Bot) left a comment

Code Review

This pull request removes the conditional logic tied to the 'inductor' compiler and the global server arguments in apply_fp8_linear and apply_qk_norm. It also simplifies tensor reshaping in apply_qk_norm by using .reshape() instead of .view(). I have no feedback to provide.

@HaiShaw (Collaborator) commented Apr 20, 2026

@ispobock I'd like to revert #21734 as it broke AMD CI

@HaiShaw HaiShaw merged commit ab936ce into main Apr 20, 2026
96 of 151 checks passed
@HaiShaw HaiShaw deleted the bingxche/revert-pr-21734 branch April 20, 2026 06:32
jasperjiaguo added a commit to jasperjiaguo/sglang that referenced this pull request Apr 20, 2026
…21734)

Re-applies the inductor-fusion optimizations from sgl-project#21734 that were
reverted in sgl-project#23159, with the AMD crash fixed.

Changes vs. sgl-project#21734:
- models/utils.py: use `reshape(*q.shape[:-1], -1, head_dim)` instead
  of `view(...)`. reshape is safe on non-contiguous QKV-split tensors
  (the AMD crash root cause), while still giving inductor the same
  multi-dim shape information as view for fusion with surrounding ops.
- fp8_utils.py: unchanged from sgl-project#21734 (FP8 per-tensor activation quant
  inductor-fusion path).

Why sgl-project#21734's view() crashed on AMD:
- Under PCG capture, q/k are stride-trick views from the QKV split
  and are frequently non-contiguous.
- `view()` on non-contiguous input yields a tensor whose strides
  disagree with what q_norm/k_norm kernels expect, leading to writes
  into unmapped pages (`Memory access fault by GPU`) on AMD.
- `reshape(*q.shape[:-1], -1, head_dim)` preserves the same multi-dim
  shape for inductor but copies when strides don't allow a view,
  avoiding the fault.

Intention matches sgl-project#21734: preserve the multi-dim shape for inductor
fusion while skipping the opaque fused_inplace_qknorm under the
inductor compiler so inductor can fuse QK norm with RMSNorm / quant
/ residual add.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
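
A hedged sketch of the fixed reshaping this follow-up describes, assuming q_norm/k_norm are RMSNorm-like modules applied over a trailing head_dim-sized axis (the real apply_qk_norm in python/sglang/srt/models/utils.py differs in detail):

```python
import torch

def apply_qk_norm_sketch(q: torch.Tensor, k: torch.Tensor,
                         q_norm, k_norm, head_dim: int):
    # reshape keeps the multi-dim (..., num_heads, head_dim) layout that
    # inductor can fuse through, but silently copies whenever q/k are
    # non-contiguous QKV-split views, so downstream strides stay valid.
    q_flat_shape, k_flat_shape = q.shape, k.shape
    q = q_norm(q.reshape(*q.shape[:-1], -1, head_dim))
    k = k_norm(k.reshape(*k.shape[:-1], -1, head_dim))
    # Restore the original flattened (..., num_heads * head_dim) shape.
    return q.reshape(q_flat_shape), k.reshape(k_flat_shape)
```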

jasperjiaguo added further commits to jasperjiaguo/sglang that referenced this pull request on Apr 20–21, 2026, each with the same message as above.
zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request Apr 23, 2026
jasperjiaguo added further commits to jasperjiaguo/sglang that referenced this pull request on Apr 23–24, 2026, again with the same message.
kyx1999 pushed a commit to KMSorSMS/sglang that referenced this pull request Apr 27, 2026
zhentaocc pushed a commit to zhentaocc/sglang that referenced this pull request Apr 29, 2026
A profile of DFlash on Qwen3-8B (MI355X, conc=64) showed that Q/K RMSNorm
runs as two separate kernels on ROCm (~12.9ms total per batch in the draft
model alone), because apply_qk_norm only had a fast path for CUDA.

Add an AITER fused fast path that runs Q-norm + K-norm in a single HIP
kernel:
- Imports aiter.ops.fused_qk_norm_rope_cache_quant.fused_qk_rmsnorm
  when running on ROCm with AITER installed (gracefully falls back if
  the module is missing)
- Gated on _is_hip + matching epsilons + non-deterministic mode
- Uses reshape (not view) because qkv.split() returns strided views and
  ROCm RMSNorm kernels fault on strided inputs (see sgl-project#23159)
- Returns new tensors (AITER kernel is functional, not in-place)

Benefits both the DFlash draft model (5 layers x N requests per decode
step) and the target Qwen3 model on ROCm. CUDA path is untouched.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
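
A hedged sketch of the gated fast path this commit describes; the fused_qk_rmsnorm call signature and the variance_epsilon attribute below are illustrative assumptions, not the actual AITER API:

```python
import torch

# Graceful fallback: use the AITER fused kernel only when importable.
try:
    from aiter.ops.fused_qk_norm_rope_cache_quant import fused_qk_rmsnorm
except ImportError:
    fused_qk_rmsnorm = None

def qk_norm_rocm_sketch(q: torch.Tensor, k: torch.Tensor,
                        q_norm, k_norm, head_dim: int,
                        is_hip: bool, deterministic: bool):
    # Fused single-HIP-kernel path, gated as the commit describes:
    # ROCm + AITER present + matching epsilons + non-deterministic mode.
    if (is_hip and fused_qk_rmsnorm is not None and not deterministic
            and q_norm.variance_epsilon == k_norm.variance_epsilon):
        # reshape, not view: qkv.split() returns strided views, and the
        # ROCm RMSNorm kernels fault on strided inputs (see #23159).
        # Hypothetical signature; per the commit, the AITER kernel is
        # functional (returns new tensors) rather than in-place.
        return fused_qk_rmsnorm(
            q.reshape(-1, head_dim), k.reshape(-1, head_dim),
            q_norm.weight, k_norm.weight, q_norm.variance_epsilon,
        )
    # Fallback: two separate RMSNorm launches (the pre-commit behavior).
    return q_norm(q.reshape(-1, head_dim)), k_norm(k.reshape(-1, head_dim))
```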

zhentaocc pushed a commit to zhentaocc/sglang that referenced this pull request May 9, 2026, with the same message as above.