[ROCm] Updating failing unit tests: Navi & MI200 & MI300 & MI350#172780
apakbin wants to merge 23 commits into pytorch:main
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
This PR fixes ROCm-specific unit test issues:

- `test_sdpa_autocast`, `test_sdpa_backwards`, and `test_compile_preserves_metadata_cache`: Currently skipped on AMD MI200 and MI300. This PR extends the skip to all ROCm architectures for consistency. This PR will also enable the DISABLED GitHub issues below to be closed:
  - `test_sdpa_autocast`: Fixes #173715
  - `test_sdpa_backwards`: Fixes #173712, Fixes #173713, Fixes #173714
  - `test_compile_preserves_metadata_cache`: Fixes #173717
- `test_mm_plus_mm3`: Replaced `expectedFailureDynamicWrapper` with `pytest.mark.xfail(condition=not torch.version.hip, ...)`. The C++ wrapper dynamic-shapes variant passes on ROCm, so the expected failure now applies only to non-ROCm builds.
- `test_triton_autotuning`: This test was failing on ROCm because it expected a grid value of 32736 for all AMD architectures. Fixed by checking whether the grid value is one of the possible values based on the configs. This enables the test to run on ROCm instead of being skipped entirely, as it is in rocm/pytorch. Fixes #173619
- `test_triton_mutated_autotuning`: Applied the same grid value fix as `test_triton_autotuning`. Fixes #173620
- `test/inductor/test_select_algorithm.py`: Guarded the `self.assertEqual(counters["inductor"]["select_algorithm_autotune"], ...)` assertions so they do not run on ROCm, as autotuning behavior is non-deterministic on this platform (candidate prescreening may filter more aggressively based on architecture-specific kernel availability).
- `test_copy_non_blocking_is_pinned`: Observed failures on Navi machines; skipped the test there while the failures are being investigated.
- `test_2d_reduction_odd_shapes`: On ROCm (Navi vs. MI*), backend scheduling differences can cause one fewer block descriptor than expected. Updated the test to allow at most one fewer block descriptor (minimum 1). The test is additionally skipped on Navi and MI200 because the last part of the test, which matches `BLOCK_R0` and `BLOCK_R1` in the generated code, is flaky: its behavior is not consistent across CI runs.
- `test_upsample_layout`: On ROCm, `bfloat16` may use `extern_kernels.convolution` instead of MKLDNN. Updated the test to check for `extern_kernels.convolution` when MKLDNN is not present, since `transpose_mxn` is only required for MKLDNN.

Prior version of PR: #172681.
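The two tolerance-style checks described above (accepting any grid value the autotuning configs could produce, and allowing at most one fewer block descriptor on ROCm) can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names, the `block_sizes` parameter, the `is_rocm` flag, and the assumption that the grid is a ceiling division of the element count by a config's block size are all hypothetical.

```python
def possible_grid_values(numel, block_sizes):
    """Return every grid size that the candidate configs could produce.

    Assumption: a 1-D launch grid equals ceil(numel / block_size) for the
    block size of whichever config the autotuner ends up selecting.
    """
    return {-(-numel // block) for block in block_sizes}  # integer ceil-div


def grid_is_expected(observed_grid, numel, block_sizes):
    """Accept the observed grid if any candidate config could have produced it,
    instead of comparing against one hard-coded value."""
    return observed_grid in possible_grid_values(numel, block_sizes)


def descriptor_count_ok(observed, expected, is_rocm):
    """Check a block-descriptor count against the expected count.

    On ROCm (hypothetical `is_rocm` flag) one fewer descriptor is tolerated,
    but never fewer than 1. Elsewhere the count must match exactly.
    """
    lower = max(1, expected - 1) if is_rocm else expected
    return lower <= observed <= expected


if __name__ == "__main__":
    # Two hypothetical configs with block sizes 32 and 64 over 1000 elements.
    print(sorted(possible_grid_values(1000, [32, 64])))
    print(grid_is_expected(32, 1000, [32, 64]))
    print(descriptor_count_ok(1, 2, is_rocm=True))
```

In a real test, the set of block sizes would come from the autotuning configs that the test itself defines, and the ROCm flag would typically be derived from `torch.version.hip`.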
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben