[AMD] Fix memory access fault when --page-size > 1 with speculative decoding on AMD GPUs#23596

Merged
HaiShaw merged 2 commits into sgl-project:main from hubertlu-tw:spec_fix_amd
Apr 24, 2026

Conversation

@hubertlu-tw
Collaborator

Co-author: @kkHuang-amd

Motivation

Running with --page-size > 1 and speculative decoding on AMD GPUs results in a memory access fault. To reproduce:

python3 -m sglang.launch_server --model openai/gpt-oss-120b --speculative-algorithm EAGLE3 --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --tp 4 --attention-backend triton --page-size 16

Modifications

Accuracy Tests

Server command:

python3 -m sglang.launch_server --model openai/gpt-oss-120b --speculative-algorithm EAGLE3 --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --tp 4 --attention-backend triton --page-size 16

Client command:

python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319 --parallel 1319 --num-shots 5
---
Accuracy: 0.851
Invalid: 0.011
Latency: 99.752 s

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

Split out of sgl-project#23146 per review request to expedite merging (HaiShaw).

On HIP with `--attention-backend aiter`, the legacy `get_last_loc_triton`
kernel emits a mixed-width int32 -> int64 store that the HIP Triton
backend mis-compiles under EAGLE + `page_size > 1` + aiter unified
attention, producing out-of-range `last_loc` values that subsequently
crash `set_kv_buffer` with an HSA aperture fault.

Route this combination (HIP + attention_backend == "aiter") through a
new int32-safe Triton variant `get_last_loc_triton_safe`: the in-kernel
result buffer stays int32 (matching `req_to_token.dtype`), and the
consumer-dtype promotion happens in torch after the kernel returns, so
Triton never issues a mixed-width store. Other hardware backends
(CUDA / ascend / torch_native) and other attention backends on HIP
keep the original dispatcher unchanged.
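
The routing and the "compute narrow, promote afterwards" pattern described above can be sketched in plain Python. This is an illustration only, not the code from this PR: `select_last_loc_impl`, `get_last_loc_reference`, and `promote_after_kernel` are hypothetical names, the `array` module stands in for torch tensors, and the lookup rule (last token slot of each request's prefix, or -1 for an empty prefix) is an assumption about what `last_loc` holds.

```python
from array import array


def select_last_loc_impl(is_hip: bool, attention_backend: str) -> str:
    """Hypothetical routing rule: only HIP + aiter takes the int32-safe kernel."""
    if is_hip and attention_backend == "aiter":
        return "get_last_loc_triton_safe"
    return "original dispatcher"


def get_last_loc_reference(req_to_token, req_pool_indices, prefix_lens):
    """Assumed semantics: last_loc[i] is the slot of the last prefix token, or -1.

    The result buffer uses typecode "i" (a C int, typically 32-bit), mirroring
    the idea that the kernel writes in the same width as the table it reads.
    """
    out = array("i", [-1] * len(prefix_lens))
    for i, (pool_idx, prefix_len) in enumerate(zip(req_pool_indices, prefix_lens)):
        if prefix_len > 0:
            out[i] = req_to_token[pool_idx][prefix_len - 1]
    return out


def promote_after_kernel(values: array, typecode: str = "q") -> array:
    """Widen to the consumer's type (here "q", 64-bit) only after the lookup."""
    return array(typecode, values)


# Two requests' token tables, stored in the narrow type.
table = [array("i", [10, 11, 12, 13]), array("i", [20, 21, 22, 23])]
narrow = get_last_loc_reference(table, req_pool_indices=[1, 0, 1], prefix_lens=[2, 0, 4])
wide = promote_after_kernel(narrow)
```

The point of the split is that the lookup step never writes into a buffer wider than the one it reads from; the widening is a separate, ordinary conversion performed once the lookup has finished.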

Validated on Qwen3.5-397B-A17B-{FP8,MXFP4} TP=8 on MI355X with
`--speculative-algorithm EAGLE --speculative-num-steps 3
--speculative-eagle-topk 1 --speculative-num-draft-tokens 4
--page-size 16`: crashes no longer reproduce and GSM8K accuracy is
stable across 3 runs (FP8 avg 0.949, MXFP4 avg 0.933; both above the
gates of 0.94 / 0.91).

Non-HIP and non-aiter paths are bitwise unchanged.

Made-with: Cursor

@HaiShaw HaiShaw merged commit 4cb0c4e into sgl-project:main Apr 24, 2026
57 of 65 checks passed