[ROCm][CI] Fix failing FP8 tests on RDNA4 #174873
mstankov-amd wants to merge 4 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/174873
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure) As of commit 3002082 with merge base 08b6f48. UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@mstankov-amd please fix the lint errors.
The lint errors are now fixed.
I don't understand why these cancelled jobs persist even when a rerun is requested. I will try rebasing against viable/strict to see if that clears the logjam.
@pytorchbot rebase |
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased d6bab24 to 3002082.
@pytorchbot merge |
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Cherry-pick of #3090
Co-authored-by: Milica Stankovic <milica.stankovic@amd.com>
## Summary

This PR fixes FP8 inductor test failures that occur on AMD RDNA4 GPUs when testing matrix multiplications with small M dimensions (M < 16).

## Problem

On gfx120x GPUs, FP8 scaled matrix multiplication tests fail with:

- 92.4% NaN outputs when M < BLOCK_M (typically 16)
- Large numerical mismatches between eager and compiled results
- Failures that occur only in `max-autotune` mode

**Root cause:** Autotuned Triton kernels on gfx120x generate incorrect tensor indexing for small M values, using partial indices instead of full computed indices in load/store operations.
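The failure rate cited above (92.4% NaN) can be quantified with a small diagnostic helper when comparing eager and compiled outputs. This is an illustrative sketch, not code from the PR; `nan_fraction` is a hypothetical name and operates on a flat list of floats:

```python
import math

def nan_fraction(values):
    """Return the fraction of entries that are NaN.

    Illustrative helper (not from the PR) for quantifying failures such as
    "92.4% NaN outputs" when comparing eager vs. compiled results.
    """
    if not values:
        return 0.0
    nans = sum(1 for v in values if math.isnan(v))
    return nans / len(values)
```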
## Solution

Added GPU-specific compile mode selection for small M values:

- gfx120x with M < 16: use `compile_mode="default"`
- All other cases: use `compile_mode="max-autotune"`

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben
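The mode-selection logic described above could look roughly like the following. This is a hedged sketch, not the PR's actual code: `select_compile_mode` is a hypothetical helper, and the architecture string is assumed to come from something like `torch.cuda.get_device_properties(0).gcnArchName` on ROCm builds:

```python
def select_compile_mode(arch_name: str, m: int) -> str:
    """Pick a torch.compile mode for the FP8 scaled-mm tests.

    Hypothetical sketch of the workaround: on gfx120x (RDNA4) with M < 16,
    max-autotune can select Triton configs whose BLOCK_M exceeds M and
    mis-index loads/stores, so fall back to the default compile mode there.
    """
    if arch_name.startswith("gfx120") and m < 16:
        return "default"
    return "max-autotune"
```

For example, `select_compile_mode("gfx1200", 8)` returns `"default"`, while any other GPU/shape combination keeps `"max-autotune"`, so non-RDNA4 coverage is unchanged.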