[AMD] Fix memory access fault when --page-size > 1 with speculative decoding on AMD GPUs #23596
Merged
HaiShaw merged 2 commits into sgl-project:main on Apr 24, 2026
Split out of sgl-project#23146 per review request to expedite merging (HaiShaw).

On HIP with `--attention-backend aiter`, the legacy `get_last_loc_triton` kernel emits a mixed-width int32 -> int64 store that the HIP Triton backend mis-compiles under EAGLE + `page_size > 1` + aiter unified attention, producing out-of-range `last_loc` values that subsequently crash `set_kv_buffer` with an HSA aperture fault.

Route this combination (HIP + `attention_backend == "aiter"`) through a new int32-safe Triton variant, `get_last_loc_triton_safe`: the in-kernel result buffer stays int32 (matching `req_to_token.dtype`), and the consumer-dtype promotion happens in torch after the kernel returns, so Triton never issues a mixed-width store. Other hardware backends (CUDA / ascend / torch_native) and other attention backends on HIP keep the original dispatcher unchanged.

Validated on Qwen3.5-397B-A17B-{FP8,MXFP4} TP=8 on MI355X with `--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --page-size 16`: crashes no longer reproduce and GSM8K accuracy is stable across 3 runs (FP8 avg 0.949, MXFP4 avg 0.933; both above the gates of 0.94 / 0.91). Non-HIP and non-aiter paths are bitwise unchanged.

Made-with: Cursor
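The approach described above can be illustrated with a minimal CPU-only sketch in plain torch. It is not the code from this PR: the Triton kernel is replaced by a torch stand-in, and the function names `get_last_loc_reference_int32`, `get_last_loc_safe_sketch`, and `use_safe_variant` are hypothetical. The sketch also assumes that `last_loc` for a request is the `req_to_token` entry at its last prefix position, or -1 when the prefix is empty; the description above does not spell that out.

```python
import torch


def get_last_loc_reference_int32(req_to_token, req_pool_indices, prefix_lens):
    """Torch stand-in for the kernel: the result keeps req_to_token's dtype (int32).

    Assumed semantics (not stated in the PR text): last_loc[i] is
    req_to_token[req_pool_indices[i], prefix_lens[i] - 1], or -1 if prefix_lens[i] == 0.
    """
    rows = req_pool_indices.long()
    # Clamp so an empty prefix indexes column 0; that value is masked out below.
    cols = (prefix_lens.long() - 1).clamp(min=0)
    gathered = req_to_token[rows, cols]          # same dtype as req_to_token
    minus_one = torch.full_like(gathered, -1)    # same dtype, so no widening here
    return torch.where(prefix_lens > 0, gathered, minus_one)


def get_last_loc_safe_sketch(req_to_token, req_pool_indices, prefix_lens,
                             out_dtype=torch.int64):
    """Compute in int32 first, then promote to the consumer dtype in torch."""
    last_loc_i32 = get_last_loc_reference_int32(
        req_to_token, req_pool_indices, prefix_lens
    )
    assert last_loc_i32.dtype == req_to_token.dtype
    # The widening happens here, outside the (stand-in) kernel.
    return last_loc_i32.to(out_dtype)


def use_safe_variant(is_hip, attention_backend):
    """Hypothetical dispatch predicate: only HIP + aiter takes the safe path."""
    return is_hip and attention_backend == "aiter"


# Tiny example: 3 request slots, 4 token positions each.
req_to_token = torch.arange(12, dtype=torch.int32).reshape(3, 4)
req_pool_indices = torch.tensor([2, 0, 1], dtype=torch.int64)
prefix_lens = torch.tensor([3, 0, 4], dtype=torch.int64)

last_loc = get_last_loc_safe_sketch(req_to_token, req_pool_indices, prefix_lens)
print(last_loc.dtype, last_loc.tolist())  # torch.int64 [10, -1, 7]
```

The point of the split is that the only int32 -> int64 conversion is an ordinary torch cast on the finished result, so the kernel itself writes values of a single width.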
HaiShaw
approved these changes
Apr 24, 2026
Co-author: @kkHuang-amd
Motivation
`--page-size > 1` with speculative decoding on AMD GPUs results in a memory access fault.
Modifications
Accuracy Tests
Server command:
Client command
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci