[ROCm] Updating failing unit tests: Navi & MI200 & MI300 & MI350#172780

Closed
apakbin wants to merge 23 commits into pytorch:main from apakbin:arpakbin-mi355-test-failures

@apakbin commented Jan 19, 2026

This PR fixes ROCm-specific unit test issues:

  • test_sdpa_autocast, test_sdpa_backwards, and test_compile_preserves_metadata_cache: These were previously skipped only on AMD MI200 and MI300. This PR extends the skip to all ROCm architectures for consistency.

This PR also allows the following DISABLED GitHub issues to be closed:

test_sdpa_autocast:
Fixes #173715

test_sdpa_backwards:
Fixes #173712
Fixes #173713
Fixes #173714

test_compile_preserves_metadata_cache:
Fixes #173717
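
A blanket ROCm skip of the kind described above can be sketched with the standard library alone. This is a minimal illustration, not the decorator the PR uses: `skip_on_rocm` and `run_example` are hypothetical names, and the `is_rocm` flag is passed in explicitly here, whereas a real test would presumably derive it from something like `torch.version.hip`.

```python
import unittest


def skip_on_rocm(is_rocm: bool, reason: str = "skipped on all ROCm architectures"):
    """Hypothetical helper: skip the decorated test on any ROCm build."""
    # unittest.skipIf skips the test when the first argument is truthy.
    return unittest.skipIf(is_rocm, reason)


def run_example(is_rocm: bool) -> unittest.TestResult:
    """Build and run a one-test suite so the skip behaviour can be observed."""

    class ExampleTests(unittest.TestCase):
        @skip_on_rocm(is_rocm)
        def test_sdpa_autocast(self):
            # Placeholder body; the real test exercises SDPA under autocast.
            self.assertTrue(True)

    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTests)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Because the flag no longer depends on a specific architecture, the same decorator covers Navi, MI200, MI300, and MI350 without listing each one.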

  • test_mm_plus_mm3: Replaced expectedFailureDynamicWrapper with pytest.mark.xfail(condition=not torch.version.hip, ...). The C++ wrapper dynamic-shapes variant passes on ROCm, so the expected failure now applies only to non-ROCm builds.

  • test_triton_autotuning: This test failed on ROCm because it expected a grid value of 32736 on every AMD architecture. It now checks that the grid value is one of the values the autotuning configs can produce. This lets the test run on ROCm instead of being skipped entirely, as it is in rocm/pytorch.

Fixes #173619

  • test_triton_mutated_autotuning: Applied the same grid value fix as test_triton_autotuning.

Fixes #173620
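
The grid-value fix for the two autotuning tests amounts to a membership check instead of an equality check. The sketch below assumes a one-dimensional launch where the grid is the element count divided by the block size, rounded up; the helper names and the sample numbers are hypothetical and are not taken from the tests themselves.

```python
from typing import Iterable, Set


def possible_grid_values(numel: int, block_sizes: Iterable[int]) -> Set[int]:
    """Grid size each candidate config would produce (assumed: ceil(numel / block))."""
    # -(-a // b) is integer ceiling division without floating point.
    return {-(-numel // block) for block in block_sizes}


def grid_is_plausible(observed: int, numel: int, block_sizes: Iterable[int]) -> bool:
    """True if the observed grid matches any config the autotuner could pick."""
    return observed in possible_grid_values(numel, block_sizes)


# Hypothetical numbers, chosen only to illustrate the check.
candidates = possible_grid_values(1000, [32, 64, 128])
```

Accepting any value in the candidate set keeps the assertion meaningful while tolerating different autotuning winners on different architectures.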

  • All tests in test/inductor/test_select_algorithm.py: Guarded the self.assertEqual(counters["inductor"]["select_algorithm_autotune"], ...) assertions so they do not run on ROCm, where the autotune count is non-deterministic (candidate prescreening may filter more aggressively depending on architecture-specific kernel availability).
  • test_copy_non_blocking_is_pinned: Fails on Navi machines; skipped there while the failures are investigated.
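
The guard on the select_algorithm counter assertions could look roughly like the following. `assert_autotune_count` and its `is_rocm` parameter are hypothetical; only the counter key `counters["inductor"]["select_algorithm_autotune"]` comes from the description above, and the nested `Counter` is a stand-in for the real counters object.

```python
import unittest
from collections import Counter


def assert_autotune_count(test_case, counters, expected, is_rocm):
    """Check the autotune counter, except on ROCm where the count can vary."""
    if is_rocm:
        # Assumed guard: skip only this assertion, not the rest of the test.
        return
    test_case.assertEqual(
        counters["inductor"]["select_algorithm_autotune"], expected
    )


# Stand-in data for illustration only.
example_counters = {"inductor": Counter({"select_algorithm_autotune": 2})}
```

Guarding the single assertion, instead of skipping the whole test, keeps the numerical checks in each test active on ROCm.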

  • test_2d_reduction_odd_shapes: On ROCm (Navi vs. MI*), backend scheduling differences can produce one fewer block descriptor than expected. The test now accepts at most one fewer block descriptor, with a minimum of 1. It is also skipped on Navi and MI200 because the last part of the test, which matches BLOCK_R0 and BLOCK_R1 in the generated code, is flaky: the behavior is not consistent across CI runs.
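
The relaxed block-descriptor check can be written as a small predicate. This is a sketch under the stated rule (at most one fewer than expected, never below 1); the function name and the decision to keep strict equality off ROCm are assumptions, not the test's actual code.

```python
def block_descriptor_count_ok(actual: int, expected: int, is_rocm: bool) -> bool:
    """Accept `expected`, or one fewer on ROCm, but never fewer than 1."""
    if not is_rocm:
        # Assumed: other backends keep the exact-match requirement.
        return actual == expected
    lower_bound = max(1, expected - 1)
    return lower_bound <= actual <= expected
```

The `max(1, ...)` floor matters when the expected count is already 1: a result of 0 descriptors is still rejected.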

  • test_upsample_layout: On ROCm, bfloat16 may use extern_kernels.convolution instead of MKLDNN. Updated test to check for extern_kernels.convolution when MKLDNN is not present, since transpose_mxn is only required for MKLDNN.
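
The branching check for test_upsample_layout might be sketched as below. How the real test decides whether MKLDNN is in use is not stated here, so `uses_mkldnn` is passed in as a hypothetical flag, and the plain substring matching on the generated code is likewise an assumption.

```python
def upsample_layout_check(generated_code: str, uses_mkldnn: bool) -> bool:
    """Pick the marker to look for based on the convolution backend."""
    if uses_mkldnn:
        # transpose_mxn is only required on the MKLDNN path.
        return "transpose_mxn" in generated_code
    # Otherwise (e.g. bfloat16 on ROCm) expect the extern convolution kernel.
    return "extern_kernels.convolution" in generated_code
```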

Prior version of PR: #172681.

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

pytorch-bot Bot commented Jan 19, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/172780

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 63a57a6 with merge base 2186edb (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot added the module: inductor, module: rocm (AMD GPU support for Pytorch), and topic: not user facing (topic category) labels Jan 19, 2026
Comment thread test/inductor/test_aot_inductor.py Outdated
Comment thread test/test_nestedtensor.py
@apakbin changed the title from "[ROCm] Skipping unit tests on ROCm: compile_preserves_metadata_cache and triton_autotuning" to "[ROCm] Updating unit tests on ROCm: compile_preserves_metadata_cache and triton_autotuning" Jan 19, 2026
@jeffdaily previously approved these changes Jan 19, 2026
@jeffdaily added the ciflow/rocm-mi300 (Trigger "default" config CI on ROCm MI300) and ciflow/rocm-mi355 (Trigger "default" config CI on ROCm MI355 runners) labels Jan 19, 2026
@apakbin requested a review from a team as a code owner January 20, 2026 00:26
@pytorch-bot removed the ciflow/rocm-mi300 and ciflow/rocm-mi355 labels Jan 20, 2026
@apakbin force-pushed the arpakbin-mi355-test-failures branch from 7a94c7f to e978dc0 January 20, 2026 00:30
@naromero77amd added the ciflow/rocm-mi300 and ciflow/rocm-mi355 labels Jan 20, 2026
@pytorch-bot removed the ciflow/rocm-mi300 and ciflow/rocm-mi355 labels Jan 20, 2026
@jeffdaily added the ciflow/rocm-mi300 and ciflow/rocm-mi355 labels Jan 20, 2026
@apakbin changed the title from "[ROCm] Updating unit tests on ROCm: compile_preserves_metadata_cache and triton_autotuning" to "[ROCm] Updating unit tests on ROCm: compile_preserves_metadata_cache + triton_autotuning + test_triton_mutated_autotuning" Jan 20, 2026
@apakbin force-pushed the arpakbin-mi355-test-failures branch from c7c5712 to 50b52fd January 21, 2026 21:59
@pytorch-bot removed the ciflow/rocm-mi300 and ciflow/rocm-mi355 labels Jan 21, 2026
@jeffdaily previously approved these changes Jan 21, 2026
@jeffdaily added the ciflow/rocm-mi300 and ciflow/rocm-mi355 labels Jan 21, 2026
pytorch-bot Bot commented Jan 21, 2026

To add the ciflow label ciflow/rocm-mi355 please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot removed the ciflow/rocm-mi355 (Trigger "default" config CI on ROCm MI355 runners) label Jan 21, 2026
@pytorchmergebot (Collaborator)

Successfully rebased arpakbin-mi355-test-failures onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout arpakbin-mi355-test-failures && git pull --rebase)

Comment thread test/inductor/test_aot_inductor.py Outdated
Comment thread test/inductor/test_select_algorithm.py
Comment thread test/inductor/test_select_algorithm.py
Comment thread test/inductor/test_select_algorithm.py
Comment thread test/inductor/test_torchinductor_strided_blocks.py
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
@jeffdaily (Collaborator)

@pytorchbot merge

pytorch-bot Bot commented Feb 6, 2026

This PR has pending changes requested. Please address the comments and update the PR before merging.

@jeffdaily (Collaborator)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@jeffdaily (Collaborator)

@pytorchbot merge

@pytorchmergebot (Collaborator)

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job waited more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

