
[ROCm][CI] Fix failing FP8 tests on RDNA4 (#174873) #3090

Merged
mstankov-amd merged 1 commit into release/2.11 from fix_faling_fp8_on_gfx120x on Mar 20, 2026
Conversation


@mstankov-amd mstankov-amd commented Mar 20, 2026

Summary

This PR fixes FP8 inductor test failures that occur on AMD RDNA4 GPUs when testing matrix multiplications with small M dimensions (M < 16).

Problem

On gfx120x GPUs, FP8 scaled matrix multiplication tests fail with:

  • 92.4% NaN outputs when M < BLOCK_M (typically 16)
  • Large numerical mismatches between eager and compiled results
  • Failures occur only in `max-autotune` mode

Root cause: Autotuned Triton kernels on gfx120x generate incorrect tensor indexing for small M values, using partial indices instead of full computed indices in load/store operations.

Solution

  • Added GPU-specific compile mode selection for small M values
  • gfx120x with M < 16: use compile_mode="default"
  • All other cases: use compile_mode="max-autotune"
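The selection logic above can be sketched as a small helper. This is an illustrative assumption of how such a fallback might look, not the PR's actual test code; the function name and threshold parameter are made up for the sketch:

```python
def select_compile_mode(gfx_arch: str, m: int, block_m: int = 16) -> str:
    """Pick a torch.compile mode for an FP8 scaled-matmul test.

    On gfx120x (RDNA4), autotuned Triton kernels generate incorrect
    tensor indexing when M < BLOCK_M, so fall back to the default
    compile mode there; use max-autotune everywhere else.
    """
    if gfx_arch.startswith("gfx120") and m < block_m:
        return "default"
    return "max-autotune"
```

On ROCm builds the architecture string can be read from `torch.cuda.get_device_properties(0).gcnArchName`, and the test would then compile with `torch.compile(fn, mode=select_compile_mode(arch, m))`.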

Pull Request resolved: pytorch#174873
Approved by: https://github.com/jeffdaily

(cherry picked from commit d667ffe)

Cherry-picked to release/2.10 branch via #3091

Cherry-picked to release/2.9 branch via #3092

@mstankov-amd mstankov-amd merged commit c92f998 into release/2.11 Mar 20, 2026
1 check passed
@mstankov-amd mstankov-amd deleted the fix_faling_fp8_on_gfx120x branch March 20, 2026 10:29
@mstankov-amd
Author

!cherry-pick --onto release/2.10

@mstankov-amd
Author

!cherry-pick --onto release/2.9

rocm-repo-management-api-6 bot pushed a commit that referenced this pull request Mar 20, 2026
rocm-repo-management-api-6 bot added a commit that referenced this pull request Mar 20, 2026
Cherry-pick of #3090

Co-authored-by: Milica Stankovic <milica.stankovic@amd.com>
@rocm-repo-management-api-6

Created branch autogenerated/release/2.10_cherry-pick_pr-3090 and #3091


rocm-repo-management-api-6 bot pushed a commit that referenced this pull request Mar 20, 2026
@rocm-repo-management-api-6

Created branch autogenerated/release/2.9_cherry-pick_pr-3090 and #3092


rocm-repo-management-api-6 bot added a commit that referenced this pull request Mar 20, 2026
Cherry-pick of #3090

Co-authored-by: Milica Stankovic <milica.stankovic@amd.com>
jithunnair-amd pushed a commit that referenced this pull request Mar 25, 2026
