
[ROCm][CI] Fix failing FP8 tests on RDNA4#174873

Closed
mstankov-amd wants to merge 4 commits into pytorch:main from mstankov-amd:fix_fp8_tests_gfx120x

Conversation

Contributor

@mstankov-amd mstankov-amd commented Feb 12, 2026

Summary

This PR fixes FP8 inductor test failures that occur on AMD RDNA4 GPUs when testing matrix multiplications with small M dimensions (M < 16).

Problem

On gfx120x GPUs, FP8 scaled matrix multiplication tests fail with:

  • 92.4% NaN outputs when M < BLOCK_M (typically 16)
  • Large numerical mismatches between eager and compiled results
  • Only occurs in max-autotune mode

Root cause: Autotuned Triton kernels on gfx120x generate incorrect tensor indexing for small M values, using partial indices instead of full computed indices in load/store operations.

Solution

  • Added GPU-specific compile mode selection for small M values
  • gfx120x with M < 16: use compile_mode="default"
  • All other cases: use compile_mode="max-autotune"
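
The selection logic above can be sketched as a small helper. This is a sketch, not the PR's actual code: the function name and signature are hypothetical, and in practice the architecture string on ROCm builds can be read from `torch.cuda.get_device_properties(0).gcnArchName`.

```python
def pick_compile_mode(arch: str, m: int, small_m_threshold: int = 16) -> str:
    """Return the torch.compile mode to use for an FP8 scaled matmul.

    arch: GPU architecture string, e.g. "gfx1201" for RDNA4.
    m:    the M dimension of the matmul.
    """
    # On gfx120x, autotuned Triton kernels mis-index when M < BLOCK_M
    # (typically 16), producing NaNs, so fall back to the default
    # compile mode there; everywhere else keep max-autotune.
    if arch.startswith("gfx120") and m < small_m_threshold:
        return "default"
    return "max-autotune"
```

For example, `pick_compile_mode("gfx1201", 8)` yields `"default"`, while `pick_compile_mode("gfx942", 8)` keeps `"max-autotune"`.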

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben


pytorch-bot bot commented Feb 12, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/174873

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 3002082 with merge base 08b6f48:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@soulitzer soulitzer requested a review from jeffdaily February 13, 2026 15:44
@soulitzer soulitzer added triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module and removed open source labels Feb 13, 2026
@jeffdaily jeffdaily added ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 ciflow/inductor-rocm-mi300 Trigger "inductor" config CI on ROCm MI300/MI325 labels Mar 4, 2026
@jeffdaily
Collaborator

@mstankov-amd fix lint errors please.

@pytorch-bot pytorch-bot bot removed ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 ciflow/inductor-rocm-mi300 Trigger "inductor" config CI on ROCm MI300/MI325 labels Mar 5, 2026
@mstankov-amd
Contributor Author

@mstankov-amd fix lint errors please.

The linter errors are now fixed.

@jeffdaily
Collaborator

I don't understand why these cancelled jobs persist even when requesting rerun. Will try rebase against viable/strict to see if that clears the logjam.

@jeffdaily
Collaborator

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased fix_fp8_tests_gfx120x onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fix_fp8_tests_gfx120x && git pull --rebase)

@jeffdaily jeffdaily added ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 ciflow/inductor-rocm-mi300 Trigger "inductor" config CI on ROCm MI300/MI325 labels Mar 11, 2026
@jeffdaily jeffdaily changed the title Fix failing FP8 tests on RDNA4 [ROCm][CI] Fix failing FP8 tests on RDNA4 Mar 11, 2026
@jeffdaily
Collaborator

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 12, 2026
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@mstankov-amd mstankov-amd deleted the fix_fp8_tests_gfx120x branch March 17, 2026 10:30
mstankov-amd added a commit to ROCm/pytorch that referenced this pull request Mar 20, 2026
Pull Request resolved: pytorch#174873
Approved by: https://github.com/jeffdaily

(cherry picked from commit d667ffe)
mstankov-amd added a commit to ROCm/pytorch that referenced this pull request Mar 20, 2026
rocm-repo-management-api-6 bot pushed a commit to ROCm/pytorch that referenced this pull request Mar 20, 2026
rocm-repo-management-api-6 bot added a commit to ROCm/pytorch that referenced this pull request Mar 20, 2026
Cherry-pick of #3090

Co-authored-by: Milica Stankovic <milica.stankovic@amd.com>
rocm-repo-management-api-6 bot pushed a commit to ROCm/pytorch that referenced this pull request Mar 20, 2026
rocm-repo-management-api-6 bot added a commit to ROCm/pytorch that referenced this pull request Mar 20, 2026
jithunnair-amd pushed a commit to ROCm/pytorch that referenced this pull request Mar 25, 2026
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026