
[ROCm][CI] Update rocm.yml workflow to use 1 GPU ARC runners#165481

Closed
amdfaa wants to merge 10 commits into pytorch:main from amdfaa:patch-24

Conversation

@amdfaa
Contributor

@amdfaa amdfaa commented Oct 14, 2025

  • Moving rocm.yml from the persistent non-ARC runners on the combined MI2xx (MI210 + MI250) cluster to the ARC runners on the MI250 cluster. This halves the number of nodes but provides roughly 4 times as many runners, since every 8-GPU MI250 node now provides eight 1-GPU runners. This should help with concurrent capacity and queueing on the MI2xx jobs.
  • This PR was previously reverted due to three issues:
  1. The jobs themselves are taking a long time (4.5h+). This is not something we've seen before. The specific test that's causing timeouts will take longer to triage; I'll tackle this first.
  2. Some of the machines appear to have issues with rocminfo, failing with the error below:
    hsa api call failure at: /longer_pathname_so_that_rpms_can_support_packaging_the_debug_info_for_all_os_profiles/src/rocminfo/rocminfo.cc:1306
    Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
    This error is typically resolved with a reboot. THIS ISSUE IS RESOLVED.
  3. The jobs started off with really long docker pull times (~1 hour), which decreased to ~8 minutes for later jobs.
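
The capacity math above can be sketched as a quick sanity check. The node count below is hypothetical, and it assumes the old non-ARC pool exposed one multi-GPU runner per node, which the description implies but does not state explicitly:

```python
# Hedged sketch of the capacity math in the PR description.
# Assumption (not stated in the PR): each persistent non-ARC node
# exposed a single 8-GPU runner, so runner count equaled node count.

def runner_count(nodes: int, runners_per_node: int) -> int:
    """Total CI runners provided by a pool of identical nodes."""
    return nodes * runners_per_node

old_nodes = 16                             # hypothetical MI2xx node count
old = runner_count(old_nodes, 1)           # one 8-GPU runner per node
new = runner_count(old_nodes // 2, 8)      # half the nodes, eight 1-GPU runners each

print(old, new, new / old)  # 16 64 4.0 -- half the nodes, ~4x the runners
```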

See HUD here:
https://hud.pytorch.org/hud/pytorch/pytorch/21131a2/1?per_page=50&name_filter=rocm&mergeEphemeralLF=true

Tested here successfully: https://github.com/pytorch/pytorch/actions/runs/18620814622/job/53092469720
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang @dllehr-amd

@amdfaa amdfaa requested a review from a team as a code owner October 14, 2025 22:06
@pytorch-bot

pytorch-bot bot commented Oct 14, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/165481

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 Cancelled Jobs, 3 Unrelated Failures

As of commit d0e8a74 with merge base 96b0e7a:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: rocm (AMD GPU support for PyTorch) and topic: not user facing (topic category) labels Oct 14, 2025
@amdfaa amdfaa changed the title [ROCm][CI] Update ROCm workflow to use 1 GPU runners [ROCm][CI] Update ROCm workflow to use 1 GPU ARC runners Oct 14, 2025
@jithunnair-amd jithunnair-amd changed the title [ROCm][CI] Update ROCm workflow to use 1 GPU ARC runners [ROCm][CI] Update rocm.yml workflow to use 1 GPU ARC runners Oct 14, 2025
@jithunnair-amd jithunnair-amd added the ciflow/rocm Trigger "default" config CI on ROCm label Oct 14, 2025
jeffdaily
jeffdaily previously approved these changes Oct 14, 2025
@jithunnair-amd
Collaborator

The issue I see is that the test shards in https://github.com/pytorch/pytorch/actions/runs/18511405909/job/52754870918?pr=165481 all take more than 1 hour just to pull the docker image.

@pytorch-bot pytorch-bot bot removed the ciflow/rocm Trigger "default" config CI on ROCm label Oct 15, 2025
@jeffdaily jeffdaily added the ciflow/rocm Trigger "default" config CI on ROCm label Oct 15, 2025
Collaborator

@jeffdaily jeffdaily left a comment

Changing review to request changes -- we need to understand the long docker pull times before landing this.
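
For context, one common mitigation for slow pulls on freshly provisioned ephemeral runners is to warm the image before the test step runs, e.g. via a registry mirror or an explicit pre-pull step. A sketch of the latter — this is illustrative only, not the actual rocm.yml contents, and the step name and image tag are placeholders:

```yaml
# Illustrative only. A pre-pull step makes the docker pull happen (and be
# timed separately) before the test step needs the image.
- name: Pre-pull test image
  run: docker pull "${DOCKER_IMAGE}"
  env:
    DOCKER_IMAGE: ghcr.io/example/rocm-ci:latest  # placeholder tag
```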

@pytorch-bot pytorch-bot bot removed the ciflow/rocm Trigger "default" config CI on ROCm label Oct 16, 2025
@jeffdaily jeffdaily added the ciflow/rocm Trigger "default" config CI on ROCm label Oct 16, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/rocm Trigger "default" config CI on ROCm label Oct 16, 2025
@jeffdaily jeffdaily added the ciflow/rocm Trigger "default" config CI on ROCm label Oct 17, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/rocm Trigger "default" config CI on ROCm label Oct 18, 2025
@jeffdaily jeffdaily added the ciflow/rocm Trigger "default" config CI on ROCm label Oct 18, 2025
@albanD albanD added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Oct 19, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/rocm Trigger "default" config CI on ROCm label Oct 19, 2025
pruthvistony
pruthvistony previously approved these changes Oct 19, 2025
@amdfaa amdfaa requested a review from jeffdaily October 19, 2025 21:14
@jithunnair-amd jithunnair-amd added the ciflow/rocm Trigger "default" config CI on ROCm label Oct 20, 2025
@jithunnair-amd
Collaborator

@pytorchbot merge -f "Previous round of CI jobs were clean: https://hud.pytorch.org/pytorch/pytorch/pull/165481?sha=62710bc1f21e83b9e5de5d3fe125546a50689cdd#rocm, merging to prevent queue buildup on MI2xx jobs"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: Approvers from one of the following sets are needed:

  • OSS CI (alband, dagitses, pytorch/pytorch-dev-infra)
  • superuser (pytorch/metamates)
  • Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10, ...)
  • Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet, ...)
Details for Dev Infra team: raised by workflow job

Failing merge rule: Core Maintainers

jeffdaily
jeffdaily previously approved these changes Oct 21, 2025
@jeffdaily
Collaborator

@pytorchbot merge -f "lint mystery was reason for revert, lint is passing (again)"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
…#165481)

* Moving rocm.yml from using persistent non-ARC runners from the combined MI2xx (MI210 + MI250) cluster to the ARC runners from the MI250 cluster. This halves the number of nodes, but provides access to approximately 4 times the runners, since every 8-GPU MI250 node now provides 8 1-GPU runners. This should help with concurrent capacity and queueing on the MI2xx jobs.

Tested here successfully: https://github.com/pytorch/pytorch/actions/runs/18620814622/job/53092469720

Pull Request resolved: pytorch#165481
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/albanD

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
…#165481)

* Moving rocm.yml from using persistent non-ARC runners from the combined MI2xx (MI210 + MI250) cluster to the ARC runners from the MI250 cluster. This halves the number of nodes, but provides access to approximately 4 times the runners, since every 8-GPU MI250 node now provides 8 1-GPU runners. This should help with concurrent capacity and queueing on the MI2xx jobs.

Tested here successfully: https://github.com/pytorch/pytorch/actions/runs/18620814622/job/53092469720

Pull Request resolved: pytorch#165481
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
@jeffdaily
Collaborator

@pytorchbot revert -c weird -m “timeouts after merge”

@pytorch-bot

pytorch-bot bot commented Oct 21, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: after merge”

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick} ...

Try @pytorchbot --help for more info.

@jeffdaily
Collaborator

@pytorchbot revert -c weird -m "timeouts after merge"

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Oct 21, 2025
@pytorchmergebot
Collaborator

@amdfaa your PR has been successfully reverted.

@pytorch-bot pytorch-bot bot dismissed jeffdaily’s stale review October 21, 2025 14:16

This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.

@jeffdaily jeffdaily added the ciflow/rocm Trigger "default" config CI on ROCm label Oct 21, 2025
@jithunnair-amd jithunnair-amd marked this pull request as draft October 21, 2025 15:23
zhudada0120 pushed a commit to zhudada0120/pytorch that referenced this pull request Oct 22, 2025
zhudada0120 pushed a commit to zhudada0120/pytorch that referenced this pull request Oct 22, 2025
zhudada0120 pushed a commit to zhudada0120/pytorch that referenced this pull request Oct 22, 2025
@github-actions
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Dec 20, 2025
@pytorch-bot pytorch-bot bot added the ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 label Dec 20, 2025
@github-actions github-actions bot closed this Jan 20, 2026

Labels

  • ci-no-td (Do not run TD on this PR)
  • ciflow/rocm (Trigger "default" config CI on ROCm)
  • ciflow/rocm-mi300 (Trigger "default" config CI on ROCm MI300)
  • Merged
  • module: rocm (AMD GPU support for PyTorch)
  • open source
  • Reverted
  • Stale
  • topic: not user facing (topic category)
  • triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

8 participants