[AMD] Optimize MiniMax-M2.5 - enable fused Triton kernel for FP8 KV cache write in aiter decode path#23620
Merged
HaiShaw merged 1 commit into sgl-project:main on Apr 25, 2026
Conversation
HaiShaw approved these changes on Apr 25, 2026
sogalin added a commit to sogalin/sglang that referenced this pull request on Apr 28, 2026:
…l only

The fused Triton kernel introduced in PR sgl-project#23620 (commit adc5932) is correct enough for non-speculative target-model decode (its original target, MiniMax-M2.5), but its bf16->fp8 implicit cast through tl.store does not match PyTorch .to(torch.float8_e4m3fn) bit-exactly. PyTorch casts with round-to-nearest-even + saturation; the Triton path on ROCm/HIP rounds differently and may not saturate, even when the per-tensor k_scale / v_scale are 1.0 (verified for Kimi-K2.5 Quark MXFP4 with kv_cache_dtype=fp8 by direct probe).

Non-speculative inference tolerates this small numerical drift, but EAGLE3 draft decode reads back its own freshly written K/V cache on every subsequent draft step, so any drift in the draft cache compounds across draft steps and collapses the accept length.

Kimi-K2.5-MXFP4 + EAGLE3 (8xMI300, in/out 8192/1024, conc 4):
  pr-23146 baseline               : accept=3.26  out=675 tok/s
  + seqused_k fix (2bee3c3)       : accept=3.46  out=706 tok/s
  + this commit (target-only gate): accept=3.97  out=807 tok/s
  pr-23461 baseline reference     : accept=3.97  out=798 tok/s

Restrict the fast path to target-model backends by checking model_runner.is_draft_worker. The SWA path is unchanged (it already works because SWA models did not exercise the corrupted draft cache). The Triton kernel itself can be revisited later to match PyTorch fp8 cast semantics; until then, draft model writes route through the legacy MHATokenToKVPool.set_kv_buffer path.
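A probe of the kind described above could look roughly like the sketch below. This is not the script actually used for the commit; it assumes a PyTorch/Triton build where float8_e4m3fn tensors can be passed directly to a Triton kernel, and the kernel name and sizes are illustrative only. It compares PyTorch's bf16 -> fp8 cast against the implicit cast that tl.store performs, which is the mismatch the commit message describes.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def store_cast_kernel(src_ptr, dst_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(src_ptr + offs, mask=mask)
    # The bf16 -> fp8 conversion happens implicitly in the store, as in the fused kernel.
    tl.store(dst_ptr + offs, x, mask=mask)


x = torch.randn(1 << 16, dtype=torch.bfloat16, device="cuda") * 300.0
ref = x.to(torch.float8_e4m3fn)          # PyTorch reference: round-to-nearest-even + saturation
out = torch.empty_like(ref)
store_cast_kernel[(triton.cdiv(x.numel(), 1024),)](x, out, x.numel(), BLOCK=1024)

# Compare raw bit patterns; per the commit message, this is nonzero on ROCm/HIP.
mismatches = (ref.view(torch.uint8) != out.view(torch.uint8)).sum().item()
print(f"bitwise mismatches: {mismatches} / {x.numel()}")
```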
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request on May 2, 2026:
…ache write in aiter decode path (sgl-project#23620)
Motivation
With FP8 KV cache (--kv-cache-dtype fp8_e4m3) and unified attention enabled, the decode KV cache write previously required two separate kernel launches: a bf16→fp8 dtype cast (float8_copy_kernel) followed by a paged store (store_kvcache). This PR adds a fast path in AiterAttnBackend.forward_decode that uses launch_reshape_and_cache_flash (an existing Triton kernel already used for SWA models) to fuse the cast and store into a single kernel launch.
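For illustration, a minimal sketch of the fused-write idea follows. It is not the actual launch_reshape_and_cache_flash kernel from aiter; the tensor layout, names, and sizes are assumptions, and it assumes a Triton build that accepts float8_e4m3fn tensors as kernel arguments. One program per token copies a key slice into its paged slot, and the bf16 -> fp8 conversion happens implicitly inside tl.store, so no separate cast kernel launch is needed.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def fused_cast_store_kernel(k_ptr, cache_ptr, slot_ptr, head_dim, BLOCK: tl.constexpr):
    # One program per token: read a bf16 key slice and write it into the fp8 paged cache.
    token = tl.program_id(0)
    slot = tl.load(slot_ptr + token)                      # destination cache slot for this token
    offs = tl.arange(0, BLOCK)
    mask = offs < head_dim
    k = tl.load(k_ptr + token * head_dim + offs, mask=mask)
    # The bf16 -> fp8 cast is implicit in storing into the fp8 cache buffer.
    tl.store(cache_ptr + slot * head_dim + offs, k, mask=mask)


num_tokens, head_dim, num_slots = 8, 128, 64
k = torch.randn(num_tokens, head_dim, dtype=torch.bfloat16, device="cuda")
cache = torch.empty(num_slots, head_dim, dtype=torch.float8_e4m3fn, device="cuda")
slots = torch.randperm(num_slots, device="cuda")[:num_tokens]
fused_cast_store_kernel[(num_tokens,)](k, cache, slots, head_dim, BLOCK=128)
```

A production kernel would additionally handle V alongside K, multiple heads, per-tensor k_scale / v_scale, and strided paged layouts; the point of the sketch is only that the dtype cast and the paged store share a single launch instead of two.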
Modifications
Accuracy Tests
Speed Tests and Profiling
+2.5% output throughput at conc=64, +2.3% at conc=32, up to +5.9% at
conc=4. No regression at conc=128 (+0.4%).
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci