[AUTOGENERATED] [release/2.9] [ROCm][CI] Fix failing FP8 tests on RDNA4 (#174873) #3092

Merged
rocm-repo-management-api-6[bot] merged 1 commit into release/2.9 from autogenerated/release/2.9_cherry-pick_pr-3090
Mar 20, 2026

@rocm-repo-management-api-6

Cherry-pick of #3090

## Summary

This PR fixes FP8 inductor test failures that occur on AMD RDNA4 GPUs
when testing matrix multiplications with small M dimensions (M < 16).

## Problem

On gfx120x GPUs, FP8 scaled matrix multiplication tests fail, but only in `max-autotune` mode, with:
- 92.4% NaN outputs when M < BLOCK_M (typically 16)
- Large numerical mismatches between eager and compiled results

**Root cause:** Autotuned Triton kernels on gfx120x generate incorrect
tensor indexing for small M values, using partial indices instead of
full computed indices in load/store operations.

## Solution

 - Added GPU-specific compile mode selection for small M values:
   - gfx120x with M < 16: use `compile_mode="default"`
   - All other cases: use `compile_mode="max-autotune"`
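
The selection rule above can be sketched as a small helper. This is an illustration only, not the code from the PR: the function name `select_compile_mode`, its parameters, and the way the architecture string is parsed are all assumptions. In a real test the architecture string would presumably come from the device properties on a ROCm build; here it is passed in as a plain string so the sketch runs without a GPU.

```python
def select_compile_mode(gcn_arch: str, m: int, small_m_threshold: int = 16) -> str:
    """Hypothetical helper: pick a compile mode for an FP8 matmul test.

    gcn_arch: architecture name such as "gfx1201" (assumed input format).
    m: the M dimension of the matrix multiplication under test.
    small_m_threshold: M values below this are treated as "small" (16 in the PR text).
    """
    # Architecture strings may carry feature suffixes after a colon
    # (for example "gfx90a:sramecc+:xnack-"), so keep only the base name.
    base_arch = gcn_arch.split(":")[0]

    # "gfx120x" in the PR text is read here as any name starting with "gfx120".
    is_gfx120x = base_arch.startswith("gfx120")

    # Small M on gfx120x: avoid the autotuned kernels and use the default mode.
    if is_gfx120x and m < small_m_threshold:
        return "default"

    # Everything else keeps the original max-autotune behaviour.
    return "max-autotune"


if __name__ == "__main__":
    for arch, m in [("gfx1201", 8), ("gfx1201", 16), ("gfx942", 8)]:
        print(arch, m, select_compile_mode(arch, m))
```

Keeping the threshold as a parameter rather than a hard-coded constant makes it easy to adjust if BLOCK_M differs from the typical value of 16.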

Pull Request resolved: pytorch#174873
Approved by: https://github.com/jeffdaily

(cherry picked from commit d667ffe)
rocm-repo-management-api-6[bot] merged commit eb12231 into release/2.9 on Mar 20, 2026
rocm-repo-management-api-6[bot] deleted the autogenerated/release/2.9_cherry-pick_pr-3090 branch on March 20, 2026 11:19