[AMD] Fix memory access fault when `--page-size > 1` with speculative decoding on AMD GPUs by hubertlu-tw · Pull Request #23596 · sgl-project/sglang

hubertlu-tw · 2026-04-24T00:29:16Z

Motivation

Running with `--page-size > 1` and speculative decoding on AMD GPUs results in a memory access fault, for example:

python3 -m sglang.launch_server --model openai/gpt-oss-120b --speculative-algorithm EAGLE3 --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --tp 4 --attention-backend triton --page-size 16

Modifications

Accuracy Tests

Server command:

python3 -m sglang.launch_server --model openai/gpt-oss-120b --speculative-algorithm EAGLE3 --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --tp 4 --attention-backend triton --page-size 16

Client command:

python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319 --parallel 1319 --num-shots 5
---
Accuracy: 0.851
Invalid: 0.011
Latency: 99.752 s

Speed Tests and Profiling

Checklist

Format your code according to "Format code with pre-commit".
Add unit tests according to "Run and add unit tests".
Update documentation according to "Write documentations".
Provide accuracy and speed benchmark results according to "Test the accuracy" and "Benchmark the speed".
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

Split out of sgl-project#23146 per review request to expedite merging (HaiShaw).

On HIP with `--attention-backend aiter`, the legacy `get_last_loc_triton` kernel emits a mixed-width int32 -> int64 store that the HIP Triton backend mis-compiles under EAGLE + `page_size > 1` + aiter unified attention, producing out-of-range `last_loc` values that subsequently crash `set_kv_buffer` with an HSA aperture fault.

Route this combination (HIP + `attention_backend == "aiter"`) through a new int32-safe Triton variant, `get_last_loc_triton_safe`: the in-kernel result buffer stays int32 (matching `req_to_token.dtype`), and the consumer-dtype promotion happens in torch after the kernel returns, so Triton never issues a mixed-width store.

Other hardware backends (CUDA / ascend / torch_native) and other attention backends on HIP keep the original dispatcher unchanged.

Validated on Qwen3.5-397B-A17B-{FP8,MXFP4} TP=8 on MI355X with `--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --page-size 16`: crashes no longer reproduce and GSM8K accuracy is stable across 3 runs (FP8 avg 0.949, MXFP4 avg 0.933; both above the gates of 0.94 / 0.91). Non-HIP and non-aiter paths are bitwise unchanged.

Made-with: Cursor
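The routing and the "keep the buffer narrow, widen afterwards" idea in the commit message can be illustrated with a minimal pure-Python sketch. This is not the Triton kernel from the PR. It assumes that `last_loc` for a request is the `req_to_token` entry at position `prefix_len - 1`, with -1 for an empty prefix, which the PR text does not spell out. The stdlib `array` typecodes `"i"` and `"q"` stand in for int32 and int64 tensors, and every function name below is hypothetical.

```python
from array import array


def _lookup(req_to_token, req, prefix_len):
    # Assumed semantics: last cached slot of the prefix, or -1 if there is none.
    return req_to_token[req][prefix_len - 1] if prefix_len > 0 else -1


def get_last_loc_legacy(req_to_token, req_pool_indices, prefix_lens):
    # Hypothetical stand-in for the original path: narrow values are written
    # straight into a wide ("q", int64-like) result buffer.
    out = array("q", [0] * len(prefix_lens))
    for i, (req, plen) in enumerate(zip(req_pool_indices, prefix_lens)):
        out[i] = _lookup(req_to_token, req, plen)
    return out


def get_last_loc_safe(req_to_token, req_pool_indices, prefix_lens):
    # Hypothetical stand-in for the int32-safe path: the result buffer keeps
    # the table's narrow typecode ("i"), so the "kernel" loop never mixes widths.
    narrow = array("i", [0] * len(prefix_lens))
    for i, (req, plen) in enumerate(zip(req_pool_indices, prefix_lens)):
        narrow[i] = _lookup(req_to_token, req, plen)
    # Promotion to the consumer's wide type happens only after the loop.
    return array("q", narrow.tolist())


def select_get_last_loc(is_hip, attention_backend):
    # Only HIP + aiter takes the safe variant; everything else is unchanged.
    if is_hip and attention_backend == "aiter":
        return get_last_loc_safe
    return get_last_loc_legacy


if __name__ == "__main__":
    table = [array("i", [10, 11, 12, 13]), array("i", [20, 21, 22, 23])]
    fn = select_get_last_loc(is_hip=True, attention_backend="aiter")
    print(fn(table, [1, 0, 1], [2, 0, 4]).tolist())  # [21, -1, 23]
```

In this sketch both paths return identical values; the only difference is where the widening happens, which is the property the commit message relies on to avoid the mis-compiled store.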

hubertlu-tw added 2 commits April 23, 2026 20:09

Refactor the code

da69c2e

hubertlu-tw requested review from HaiShaw and kkHuang-amd April 24, 2026 00:29

hubertlu-tw requested review from Ying1123, hanming-lu, hnyls2002, hzh0425, ispobock, merrymercy, xiezhq-hermann and yizhang2077 as code owners April 24, 2026 00:29

HaiShaw approved these changes Apr 24, 2026

HaiShaw merged commit 4cb0c4e into sgl-project:main Apr 24, 2026
57 of 65 checks passed

hubertlu-tw mentioned this pull request Apr 27, 2026

[AMD] Enable EAGLE speculative decoding for Qwen3.5 FP8 and MXFP4 models with aiter's unified attention #23146

Merged

5 tasks

HaiShaw merged 2 commits into sgl-project:main from hubertlu-tw:spec_fix_amd
