[ROCm][TunableOp] Fix UT race condition and reduce UT duration. by naromero77amd · Pull Request #150463 · pytorch/pytorch

naromero77amd · 2025-04-01T20:40:48Z

This PR fixes two race conditions that occur when the unit tests (UTs) are run:

- In a particular order within a single shard.
- Concurrently in multiple shards.

Each test now gets a unique filename that depends on the test name.
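The unique-filename idea can be sketched as a small context manager. This is a minimal illustration, not the code from this PR: the helper name `unique_tunableop_results` and the `tunableop_results_<test name>.csv` naming scheme are assumptions. It relies on the `PYTORCH_TUNABLEOP_FILENAME` environment variable; the same effect may be achievable through `torch.cuda.tunable.set_filename`, and TunableOp may read the variable only once per process, so the variable should be set before the first tuned operation runs.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def unique_tunableop_results(test_name, directory=None):
    """Hypothetical helper: give one test its own TunableOp results file."""
    directory = directory or tempfile.gettempdir()
    # Assumed naming scheme: the test name makes the file unique, so tests
    # neither inherit stale results from an earlier test nor overwrite a
    # file that another shard is using.
    filename = os.path.join(directory, f"tunableop_results_{test_name}.csv")
    previous = os.environ.get("PYTORCH_TUNABLEOP_FILENAME")
    os.environ["PYTORCH_TUNABLEOP_FILENAME"] = filename
    try:
        yield filename
    finally:
        # Restore the caller's setting, whether or not one existed.
        if previous is None:
            os.environ.pop("PYTORCH_TUNABLEOP_FILENAME", None)
        else:
            os.environ["PYTORCH_TUNABLEOP_FILENAME"] = previous
        # Clean up so that a later test cannot pick up this test's results.
        if os.path.exists(filename):
            os.remove(filename)
```

A test body would then run inside `with unique_tunableop_results("test_bmm_tunableop_rocm"):` so that the file is removed even if the test fails.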

There were two other minor improvements to the UTs:

- matmul_offline_mgpu could occasionally fail when run on 8 GPUs. The assertion was relaxed.
- bmm_tunableop_rocm now checks that the rotating buffer size is not zero; otherwise, the test is not useful.

Additionally, several UTs took over 1 minute to run. Their duration was reduced by a combination of setting max tuning iterations to one, setting the rotating buffer size to zero, and/or reducing the matrix dimensions.
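As a rough sketch of the first two settings, the snippet below builds the corresponding TunableOp environment variables and applies them temporarily. The helper name `fast_tunableop_env` is hypothetical and this is not the code from the PR; inside a test, the equivalent Python calls would likely be `torch.cuda.tunable.set_max_tuning_iterations` and `torch.cuda.tunable.set_rotating_buffer_size`.

```python
import os
from unittest import mock


def fast_tunableop_env(max_iterations=1, rotating_buffer_size=0):
    """Hypothetical helper: environment overrides for a short tuning run."""
    return {
        # Try each candidate kernel at most this many times.
        "PYTORCH_TUNABLEOP_MAX_TUNING_ITERATIONS": str(max_iterations),
        # Zero disables the rotating buffer, which shortens each tuning pass.
        "PYTORCH_TUNABLEOP_ROTATING_BUFFER_SIZE": str(rotating_buffer_size),
    }


# patch.dict restores the previous environment when the block exits.
with mock.patch.dict(os.environ, fast_tunableop_env()):
    pass  # run the test body or launch the test subprocess here
```

A test that is only meaningful with a rotating buffer, such as bmm_tunableop_rocm, would skip the second override.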

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang

pytorch-bot · 2025-04-01T20:40:51Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150463

✅ You can merge normally! (2 Unrelated Failures)

As of commit 541642d with merge base 783f045:

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / linux-jammy-xpu-2025.0-py3.9 / build (gh) (trunk failure)
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/sstream:152:52: error: expected value in expression

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

pull / cuda12.4-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu) (gh) (#149370)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

naromero77amd · 2025-04-03T22:37:27Z

@pytorchbot merge

pytorchmergebot · 2025-04-03T22:39:25Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

…rch#150463)
This PR fixes two race conditions that occur when UT tests are run:
- In a particular order within a single shard.
- Concurrently in multiple shards.
Each test now gets a unique filename that depends on the test name.
There were two other minor improvements to the UTs:
- matmul_offline_mgpu could occasionally fail if run on 8 GPUs. Criteria was relaxed.
- bmm_tunableop_rocm checks that the rotating buffer is not zero. Otherwise, the test is not useful.
Additionally, several UTs took over 1 minute to run. Their duration was reduced by a combination of setting max tuning iterations to one, setting the rotating buffer size to zero, and/or reducing the matrix dimensions.
Pull Request resolved: pytorch#150463
Approved by: https://github.com/jeffdaily

…rch#150463) Cherry-pick of the commit above, with an identical message. (cherry picked from commit d0026fa)

…ledGEMM rowwise fix (#2106)
Align TunableOp UTs, features, and bug fixes with upstream PyTorch main
UTs: pytorch#148982 pytorch#149930 pytorch#150142 pytorch#150463
Feature: offline tuning for submatrices: pytorch#151138
Bug Fix: ScaledGEMM rowwise pytorch#152403
---------
Co-authored-by: Jeff Daily <jeff.daily@amd.com>

naromero77amd added 7 commits April 1, 2025 19:23

Make this unit test run faster while maintaining coverage.

436a2b4

Fix race condition.

233e2f4

Adjust dims to a shape that has not been tuned before.

611aa0e

Guard against accidentally having the rotating buffer set to zero.

ae7d58c

Reduce time of this UT.

1b074e6

Reduce time on UTs and use dtype in decorator only.

b95601a

Shorten duration of UTs by making matrices smaller.

590d186

naromero77amd requested review from IvanYashchuk, lezcano and nikitaved as code owners April 1, 2025 20:40

pytorch-bot bot added the `module: rocm`, `topic: not user facing`, and `ciflow/rocm` labels Apr 1, 2025

naromero77amd removed the `ciflow/rocm` label Apr 1, 2025

naromero77amd requested a review from jeffdaily April 1, 2025 20:41

naromero77amd changed the title ~~[ROCm][TunableOp] Fix UT race condition and reduce duration~~ [ROCm][TunableOp] Fix UT race condition and reduce UT duration. Apr 1, 2025

pytorch-bot bot added the `ciflow/rocm` label Apr 1, 2025

pytorchbot added the open source label Apr 1, 2025

naromero77amd removed the `ciflow/rocm` label Apr 1, 2025

Set rotating buffer size to zero to reduce duration even further.

b9825c1

naromero77amd marked this pull request as draft April 1, 2025 22:12

naromero77amd added 8 commits April 2, 2025 05:18

Move helper functions to internal functions.

13722cd

Context manager sets unique filename and cleans up unique filenames as well as default filenames.

04a62b7

Lint.

68e128b

Support results filename as an argument.

70033f2

Use API or environment variable to extract filename patterns.

60f3cbc

Update TunableOp UTs to use unique filenames.

19b19fc

Removed debug info.

39b6c41

Another UT that needs to use a unique filename.

9f93858

naromero77amd added 3 commits April 2, 2025 16:42

Simplify assert as tests could occasionally fail with 8 GPUs.

ac9dc9a

Helper function to retrieve TunableOp untuned filename.

40ccded

Reduce duplicated code by using helper function.

541642d

pytorch-bot bot added the `ciflow/rocm` label Apr 2, 2025

naromero77amd removed the `ciflow/rocm` label Apr 2, 2025

naromero77amd marked this pull request as ready for review April 2, 2025 17:57

jeffdaily approved these changes Apr 3, 2025

jeffdaily added the `ciflow/rocm` and `ciflow/rocm-mi300` labels Apr 3, 2025

pytorch-bot bot added the `ciflow/trunk` label Apr 3, 2025

pytorchmergebot added the merging label Apr 3, 2025

pytorchmergebot added the Merged label Apr 4, 2025

pytorchmergebot closed this in d0026fa Apr 4, 2025

pytorchmergebot removed the merging label Apr 4, 2025

naromero77amd mentioned this pull request May 8, 2025

[release/2.7][ROCm][TunableOp] UTs, submatrix offline tuning, and ScaledGEMM rowwise fix ROCm/pytorch#2106

Merged

naromero77amd deleted the fix_tunableop_ut_race_condition branch October 29, 2025 22:35

naromero77amd wants to merge 19 commits into pytorch:main from ROCm:fix_tunableop_ut_race_condition

4 participants
