Replace QH16 bf16 kernel with a new one that does not use ptr_RP by JohnNikolay84 · Pull Request #2999 · ROCm/aiter

JohnNikolay84 · 2026-05-01T11:37:41Z

Motivation

#2729 introduced a new QH64 kernel that does not write directly to ptr_RP; instead it writes split data into ptr_R/logits.

As #2983 states, other kernels such as MLA_A16W16_1TG_4W_32mx1_16nx1_Coex0_Msk1_QH16.co do not follow the same logic and write into a null pointer instead.

Technical Details

This change introduces a new kernel for nhead=32 bf16 that uses the same convention as the QH64 kernel. However, I was not able to find a kernel with the 32x32x16 MFMA layout, so I am using the one with 16x16x32 instead.
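For readers unfamiliar with the split convention, here is a minimal sketch of a split-then-reduce attention computation in plain Python. It assumes that stage 1 stores one normalized partial output per KV split (the role ptr_R plays here) together with a log-sum-exp value per split (the role of logits), and that stage 2 rescales and sums them. The function names and the scalar toy data are illustrative only and are not taken from aiter.

```python
import math

def attend(scores, values):
    """Reference: softmax-weighted sum over the whole KV range."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def attend_split(scores, values):
    """Stage 1 for one KV split: normalized partial output and its log-sum-exp."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    partial = sum(w * v for w, v in zip(weights, values)) / total
    lse = m + math.log(total)
    return partial, lse

def reduce_splits(partials, lses):
    """Stage 2: rescale each partial by exp(lse_i - lse_total) and sum."""
    m = max(lses)
    lse_total = m + math.log(sum(math.exp(l - m) for l in lses))
    return sum(p * math.exp(l - lse_total) for p, l in zip(partials, lses))

# Toy data: six KV positions, processed as three splits of two positions each.
scores = [0.1, 1.5, -0.3, 2.0, 0.7, -1.2]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
splits = [(scores[i:i + 2], values[i:i + 2]) for i in range(0, len(scores), 2)]
stage1 = [attend_split(s, v) for s, v in splits]
combined = reduce_splits([p for p, _ in stage1], [l for _, l in stage1])

# The reduced result matches the unsplit reference up to rounding.
assert math.isclose(combined, attend(scores, values), rel_tol=1e-9)
```

A kernel that follows this convention never needs a final-output pointer in stage 1, which is why a null ptr_RP is harmless for it.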

Test Plan

Run the new test in aiter and make sure it passes against the torch reference.
Run DeepSeek in TP4 and make sure it does not crash.

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

github-actions · 2026-05-01T11:38:12Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| `ci:triton-300x` | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| `ci:sglang` | SGLang integration tests |
| `ci:atom` | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| `ci:vllm` | vLLM benchmark |
| `ci:all` | All of the above |

Add labels via the sidebar or `gh pr edit 2999 --add-label <label>`

The legacy QH16 m32x1_n16x1 ASM kernel (gqa_ratio=32, bf16/bf16, non-persistent, decode qseqlen=1) writes its output directly via ptr_RP when kv_split==1. Upstream passes ptr_RP=nullptr and out_16_nosplit=0, causing GPU memory faults on gfx950 (DeepSeek-V3.2 TP4 hits this with nhead=32).

Fix:
- C++: set ptr_RP and out_16_nosplit only when gqa_ratio==32 AND max_seqlen_q==1 (the exact legacy kernel condition). Other non-persistent kernels (v3, stage1) use split-reduce and expect ptr_RP = nullptr, so they are unaffected.
- Python: reuse output buffer for logits and skip stage2 only when nhead==32 and max_seqlen_q==1 (matches the C++ gate).

Tested on MI355X (gfx950): nhead=8,16,32,64,128 all pass. bf16/bf16, ctx_lens=[256,1024], batch=[1,4,16].

Supersedes: #2999 (broken: tile mismatch, 85% wrong output)

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: azaidy <aliasger.zaidy@amd.com>
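The gate described in that commit message can be sketched as follows. This is a hypothetical Python model, not the actual aiter C++ or Python code: `build_launch_args`, the dictionary layout, the use of `None` for nullptr, and the value 1 for out_16_nosplit are all assumptions made for illustration.

```python
def legacy_qh16_gate(gqa_ratio: int, max_seqlen_q: int) -> bool:
    """True only for the legacy-kernel condition named in the commit message."""
    return gqa_ratio == 32 and max_seqlen_q == 1

def build_launch_args(gqa_ratio: int, max_seqlen_q: int, output_buf):
    """Model which arguments change when the gate is hit (hypothetical helper)."""
    direct = legacy_qh16_gate(gqa_ratio, max_seqlen_q)
    return {
        # None stands in for nullptr; split-reduce kernels expect it.
        "ptr_RP": output_buf if direct else None,
        # Assumed flag value: 1 when the kernel writes the final output itself.
        "out_16_nosplit": 1 if direct else 0,
        # Stage 2 (the reduce step) is skipped only on the direct-write path.
        "run_stage2": not direct,
    }

hit = build_launch_args(gqa_ratio=32, max_seqlen_q=1, output_buf="out")
miss = build_launch_args(gqa_ratio=16, max_seqlen_q=1, output_buf="out")
```

Under these assumptions `hit` carries a non-null ptr_RP and skips stage 2, while `miss` keeps the split-reduce path unchanged.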

ChuanLi1101

LGTM.

Clean follow-up to merged #2729 / #2983 -- adds the matching bf16 nhead=32 path to the QH16 kernel family using the same ptr_R/logits convention as the QH64 kernel. Diff is +26/-3 across 4 files with a precise dispatch guard (q.dtype==bf16 AND kv.dtype==bf16 AND nhead==32); behavior on every other path is preserved. CI all green.
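The dispatch guard mentioned above could look roughly like this. It is a sketch only: the function name, the string dtype tags, and the returned kernel labels are hypothetical and do not come from the diff.

```python
def select_mla_kernel(q_dtype: str, kv_dtype: str, nhead: int) -> str:
    """Pick the new QH16 path only for bf16 query, bf16 KV, and nhead == 32."""
    if q_dtype == "bf16" and kv_dtype == "bf16" and nhead == 32:
        return "qh16_bf16_split"  # hypothetical label for the new kernel
    return "existing_path"        # every other combination is left untouched

choice_new = select_mla_kernel("bf16", "bf16", 32)
choice_old = select_mla_kernel("bf16", "fp8", 32)
```

Because all three conditions must hold, any change in dtype or head count falls through to the pre-existing selection.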

cc @frida-andersson @xaguilar-amd for a courtesy MLA-area LGTM since this lives next to your merged MLA fixes -- happy to merge once one of you takes a quick look.

xaguilar-amd · 2026-05-15T07:26:06Z

LGTM too.

frida-andersson · 2026-05-15T08:30:40Z

LGTM

JohnNikolay84 · 2026-05-15T09:12:34Z

This now has to be rebased on top of main in the following order due to conflicts:

- this PR
- Frida's 2983 PR reverted
- HEAD

JohnNikolay84 · 2026-05-15T15:06:31Z

It is ready to be merged now once approved.

JohnNikolay84 self-assigned this May 1, 2026

JohnNikolay84 requested review from a team and fangche123 May 1, 2026 11:37

JohnNikolay84 requested review from Zzz9990 and valarLip May 1, 2026 11:48

JohnNikolay84 force-pushed the mla_nheads32_fault_fix branch from ce19134 to 46d6983 Compare May 4, 2026 13:38

ChuanLi1101 previously approved these changes May 14, 2026

Sergey Solo added 2 commits May 15, 2026 09:50

Revert "[MLA] Fix nhead=32 non-persistent decode crash on gfx950 (#2983…

1b51c23

…)" This reverts commit e09effa.

Replace QH16 bf16 kernel with a new one that does not use ptr_RP

da27d6d

JohnNikolay84 dismissed ChuanLi1101’s stale review via da27d6d May 15, 2026 09:55

JohnNikolay84 force-pushed the mla_nheads32_fault_fix branch from 4f7d467 to da27d6d Compare May 15, 2026 09:55

JohnNikolay84 requested a review from ChuanLi1101 May 18, 2026 07:56

valarLip approved these changes May 18, 2026

valarLip merged commit 1eb1c9b into main May 18, 2026
43 of 45 checks passed

valarLip deleted the mla_nheads32_fault_fix branch May 18, 2026 10:00

valarLip merged 2 commits into main from mla_nheads32_fault_fix
