
[ROCm][CI] Update rocm.yml workflow to use 1 GPU ARC runners#165481

Closed
amdfaa wants to merge 10 commits into pytorch:main from amdfaa:patch-24

Conversation

@amdfaa
Contributor

@amdfaa amdfaa commented Oct 14, 2025

  • Moving rocm.yml from the persistent non-ARC runners on the combined MI2xx (MI210 + MI250) cluster to the ARC runners on the MI250 cluster. This halves the number of nodes but provides roughly 4 times as many runners, since every 8-GPU MI250 node now provides eight 1-GPU runners. This should help with concurrent capacity and queueing on the MI2xx jobs.
  • This PR was previously reverted due to three issues:
  1. The jobs themselves are taking a long time (4.5h+). This is not something we've seen before. The specific test that's causing timeouts will take longer to triage; I'll tackle this first.
  2. Some of the machines appear to have issues with rocminfo, failing with the error below:
    hsa api call failure at: /longer_pathname_so_that_rpms_can_support_packaging_the_debug_info_for_all_os_profiles/src/rocminfo/rocminfo.cc:1306
    Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
    This error is typically resolved with a reboot. THIS ISSUE IS RESOLVED.
  3. The jobs started off with really long docker pull times (~1 hour), which decreased to ~8 minutes for later jobs.
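
The capacity math above can be sketched as a quick sanity check. The node count below is hypothetical, and it assumes the old non-ARC pool exposed one multi-GPU runner per node, which the description implies but does not state explicitly:

```python
# Hedged sketch of the capacity math in the PR description.
# Assumption (not stated in the PR): each persistent non-ARC node
# exposed a single 8-GPU runner, so runner count equaled node count.

def runner_count(nodes: int, runners_per_node: int) -> int:
    """Total CI runners provided by a pool of identical nodes."""
    return nodes * runners_per_node

old_nodes = 16                             # hypothetical MI2xx node count
old = runner_count(old_nodes, 1)           # one 8-GPU runner per node
new = runner_count(old_nodes // 2, 8)      # half the nodes, eight 1-GPU runners each

print(old, new, new / old)  # 16 64 4.0 -- half the nodes, ~4x the runners
```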

See HUD here:
https://hud.pytorch.org/hud/pytorch/pytorch/21131a2/1?per_page=50&name_filter=rocm&mergeEphemeralLF=true

Tested here successfully: https://github.com/pytorch/pytorch/actions/runs/18620814622/job/53092469720
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang @dllehr-amd

@amdfaa amdfaa requested a review from a team as a code owner October 14, 2025 22:06
@pytorch-bot

pytorch-bot bot commented Oct 14, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/165481

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 Cancelled Jobs, 3 Unrelated Failures

As of commit d0e8a74 with merge base 96b0e7a:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: rocm (AMD GPU support for PyTorch) and topic: not user facing (topic category) labels Oct 14, 2025
@amdfaa amdfaa changed the title [ROCm][CI] Update ROCm workflow to use 1 GPU runners [ROCm][CI] Update ROCm workflow to use 1 GPU ARC runners Oct 14, 2025
@jithunnair-amd jithunnair-amd changed the title [ROCm][CI] Update ROCm workflow to use 1 GPU ARC runners [ROCm][CI] Update rocm.yml workflow to use 1 GPU ARC runners Oct 14, 2025
@jithunnair-amd jithunnair-amd added the ciflow/rocm Trigger "default" config CI on ROCm label Oct 14, 2025
jeffdaily
jeffdaily previously approved these changes Oct 14, 2025
@jithunnair-amd
Collaborator

The issue I see is that the test shards in https://github.com/pytorch/pytorch/actions/runs/18511405909/job/52754870918?pr=165481 all take more than 1 hour just to pull the docker image.

@pytorch-bot pytorch-bot bot removed the ciflow/rocm Trigger "default" config CI on ROCm label Oct 15, 2025
@jeffdaily jeffdaily added the ciflow/rocm Trigger "default" config CI on ROCm label Oct 15, 2025
Collaborator

@jeffdaily jeffdaily left a comment

Changing review to request changes -- we need to understand the long docker pull times before landing this.
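
For context, one common mitigation for slow pulls on freshly provisioned ephemeral runners is to warm the image before the test step runs, e.g. via a registry mirror or an explicit pre-pull step. A sketch of the latter — this is illustrative only, not the actual rocm.yml contents, and the step name and image tag are placeholders:

```yaml
# Illustrative only. A pre-pull step makes the docker pull happen (and be
# timed separately) before the test step needs the image.
- name: Pre-pull test image
  run: docker pull "${DOCKER_IMAGE}"
  env:
    DOCKER_IMAGE: ghcr.io/example/rocm-ci:latest  # placeholder tag
```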

@pytorch-bot pytorch-bot bot removed the ciflow/rocm Trigger "default" config CI on ROCm label Oct 16, 2025
@jeffdaily jeffdaily added the ciflow/rocm Trigger "default" config CI on ROCm label Oct 16, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/rocm Trigger "default" config CI on ROCm label Oct 16, 2025
@jeffdaily jeffdaily added the ciflow/rocm Trigger "default" config CI on ROCm label Oct 17, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/rocm Trigger "default" config CI on ROCm label Oct 18, 2025
@jeffdaily jeffdaily added the ciflow/rocm Trigger "default" config CI on ROCm label Oct 18, 2025
@albanD albanD added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Oct 19, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/rocm Trigger "default" config CI on ROCm label Oct 19, 2025
pruthvistony
pruthvistony previously approved these changes Oct 19, 2025
@amdfaa amdfaa requested a review from jeffdaily October 19, 2025 21:14
@jithunnair-amd jithunnair-amd added the ciflow/rocm Trigger "default" config CI on ROCm label Oct 20, 2025
@jithunnair-amd
Collaborator

@pytorchbot merge -f "Previous round of CI jobs were clean: https://hud.pytorch.org/pytorch/pytorch/pull/165481?sha=62710bc1f21e83b9e5de5d3fe125546a50689cdd#rocm, merging to prevent queue buildup on MI2xx jobs"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: Approvers from one of the following sets are needed:

  • OSS CI (alband, dagitses, pytorch/pytorch-dev-infra)
  • superuser (pytorch/metamates)
  • Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10, ...)
  • Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet, ...)
Details for Dev Infra team: raised by workflow job

Failing merge rule: Core Maintainers

jeffdaily
jeffdaily previously approved these changes Oct 21, 2025
@jeffdaily
Collaborator

@pytorchbot merge -f "lint mystery was reason for revert, lint is passing (again)"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
…#165481)

* Moving rocm.yml from using persistent non-ARC runners from the combined MI2xx (MI210 + MI250) cluster to the ARC runners from the MI250 cluster. This halves the number of nodes, but provides access to approximately 4 times the runners, since every 8-GPU MI250 node now provides 8 1-GPU runners. This should help with concurrent capacity and queueing on the MI2xx jobs.

Tested here successfully: https://github.com/pytorch/pytorch/actions/runs/18620814622/job/53092469720

Pull Request resolved: pytorch#165481
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/albanD

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
…#165481)

* Moving rocm.yml from using persistent non-ARC runners from the combined MI2xx (MI210 + MI250) cluster to the ARC runners from the MI250 cluster. This halves the number of nodes, but provides access to approximately 4 times the runners, since every 8-GPU MI250 node now provides 8 1-GPU runners. This should help with concurrent capacity and queueing on the MI2xx jobs.

Tested here successfully: https://github.com/pytorch/pytorch/actions/runs/18620814622/job/53092469720

Pull Request resolved: pytorch#165481
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
@jeffdaily
Collaborator

@pytorchbot revert -c weird -m “timeouts after merge”

@pytorch-bot

pytorch-bot bot commented Oct 21, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: after merge”

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick} ...

Try @pytorchbot --help for more info.

@jeffdaily
Collaborator

@pytorchbot revert -c weird -m "timeouts after merge"

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Oct 21, 2025
@pytorchmergebot
Collaborator

@amdfaa your PR has been successfully reverted.

@pytorch-bot pytorch-bot bot dismissed jeffdaily’s stale review October 21, 2025 14:16

This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.

@jeffdaily jeffdaily added the ciflow/rocm Trigger "default" config CI on ROCm label Oct 21, 2025
@jithunnair-amd jithunnair-amd marked this pull request as draft October 21, 2025 15:23
zhudada0120 pushed a commit to zhudada0120/pytorch that referenced this pull request Oct 22, 2025
zhudada0120 pushed a commit to zhudada0120/pytorch that referenced this pull request Oct 22, 2025
zhudada0120 pushed a commit to zhudada0120/pytorch that referenced this pull request Oct 22, 2025
@github-actions
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Dec 20, 2025
@pytorch-bot pytorch-bot bot added the ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 label Dec 20, 2025
@github-actions github-actions bot closed this Jan 20, 2026

Labels

  • ci-no-td (Do not run TD on this PR)
  • ciflow/rocm (Trigger "default" config CI on ROCm)
  • ciflow/rocm-mi300 (Trigger "default" config CI on ROCm MI300)
  • Merged
  • module: rocm (AMD GPU support for PyTorch)
  • open source
  • Reverted
  • Stale
  • topic: not user facing (topic category)
  • triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

8 participants