[ROCm][CI] Update rocm.yml workflow to use 1 GPU ARC runners #165481
amdfaa wants to merge 10 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/165481
Note: Links to docs will display an error until the docs builds have been completed.
❌ 5 Cancelled Jobs, 3 Unrelated Failures as of commit d0e8a74 with merge base 96b0e7a.
CANCELLED JOBS - The following jobs were cancelled. Please retry:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
The issue I see is that all the test shards in https://github.com/pytorch/pytorch/actions/runs/18511405909/job/52754870918?pr=165481 take more than 1hr just to pull the docker image.
jeffdaily left a comment
Changing review to request changes -- we need to understand the long docker pull times before landing this.
@pytorchbot merge -f "Previous round of CI jobs were clean: https://hud.pytorch.org/pytorch/pytorch/pull/165481?sha=62710bc1f21e83b9e5de5d3fe125546a50689cdd#rocm, merging to prevent queue buildup on MI2xx jobs" |
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: Approvers from one of the following sets are needed:
@pytorchbot merge -f "lint mystery was reason for revert, lint is passing (again)" |
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…#165481)

* Moving rocm.yml from the persistent non-ARC runners in the combined MI2xx (MI210 + MI250) cluster to the ARC runners in the MI250 cluster. This halves the number of nodes, but provides access to approximately 4 times the runners, since every 8-GPU MI250 node now provides 8 1-GPU runners. This should help with concurrent capacity and queueing on the MI2xx jobs.

Tested here successfully: https://github.com/pytorch/pytorch/actions/runs/18620814622/job/53092469720

Pull Request resolved: pytorch#165481
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/albanD
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
…ytorch#165481)" This reverts commit 8700d68. Reverted pytorch#165481 on behalf of https://github.com/malfet due to Broke lint somehow, see https://hud.pytorch.org/hud/pytorch/pytorch/8f06a1308f256ed7f2610e5e92e06a6871618a06/1?per_page=50&name_filter=lint&mergeEphemeralLF=true ([comment](pytorch#165481 (comment)))
…#165481)

* Moving rocm.yml from the persistent non-ARC runners in the combined MI2xx (MI210 + MI250) cluster to the ARC runners in the MI250 cluster. This halves the number of nodes, but provides access to approximately 4 times the runners, since every 8-GPU MI250 node now provides 8 1-GPU runners. This should help with concurrent capacity and queueing on the MI2xx jobs.

Tested here successfully: https://github.com/pytorch/pytorch/actions/runs/18620814622/job/53092469720

Pull Request resolved: pytorch#165481
Approved by: https://github.com/jeffdaily
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
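For context on what a migration like this typically touches in the workflow file: the rocm.yml test jobs select their runners through runner labels in a test matrix passed to a reusable test workflow, so moving to 1-GPU ARC runners mostly amounts to swapping those labels (and adjusting shard counts if desired). The snippet below is a minimal sketch under that assumption; the job name, shard counts, and runner labels ("linux.rocm.gpu.2", "linux.rocm.gpu.mi250.1") are illustrative placeholders, not the exact values from this PR.

```yaml
# Illustrative sketch only -- not the actual diff from this PR.
# Assumption: the test job picks per-shard runners via a test-matrix input
# passed to a reusable _rocm-test.yml workflow.
name: rocm
on:
  workflow_dispatch:

jobs:
  linux-jammy-rocm-py3_10-test:
    name: linux-jammy-rocm-py3.10
    uses: ./.github/workflows/_rocm-test.yml
    with:
      build-environment: linux-jammy-rocm-py3.10
      # Before: persistent non-ARC runners on the combined MI2xx (MI210 + MI250)
      # cluster, e.g. runner: "linux.rocm.gpu.2" (placeholder multi-GPU label).
      # After: ephemeral 1-GPU ARC runners -- each 8-GPU MI250 node registers
      # eight of them (placeholder label below), hence the ~4x runner count.
      test-matrix: |
        { include: [
          { config: "default", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.mi250.1" },
          { config: "default", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.mi250.1" },
        ]}
```

One side effect of ARC worth noting: runners are ephemeral pods rather than persistent hosts, so a fresh runner may start with a cold Docker image cache, which could be related to the long image pull times discussed above.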
@pytorchbot revert -c weird -m “timeouts after merge”
❌ 🤖 pytorchbot command failed. Try @pytorchbot --help for more info.
@pytorchbot revert -c weird -m "timeouts after merge"
@pytorchbot successfully started a revert job. Check the current status here.
…165481)" This reverts commit ffa90d4. Reverted #165481 on behalf of https://github.com/jeffdaily due to timeouts after merge ([comment](#165481 (comment)))
@amdfaa your PR has been successfully reverted.
This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
The jobs themselves are taking a long time (4.5h+), which is not something we've seen before; I'll tackle that first. The specific test that's causing the timeouts will take longer to triage.
hsa api call failure at: /longer_pathname_so_that_rpms_can_support_packaging_the_debug_info_for_all_os_profiles/src/rocminfo/rocminfo.cc:1306
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
This error is typically resolved with a reboot. THIS ISSUE IS RESOLVED.
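The quoted failure comes from rocminfo, the ROCm tool that enumerates HSA agents. If one wanted to catch a node in this state before tests run, a hypothetical pre-test health-check job could run the same tool and fail fast. The sketch below is only an assumption about how such a gate could be wired up, not an existing job in rocm.yml; the workflow name, job name, and runner label are placeholders.

```yaml
# Hypothetical health-check job (not part of the actual workflow).
name: rocm-node-health-check
on:
  workflow_dispatch:

jobs:
  rocm-health-check:
    runs-on: linux.rocm.gpu.mi250.1  # placeholder runner label
    steps:
      - name: Verify the HSA runtime can enumerate GPUs
        run: |
          # rocminfo exits non-zero on failures such as
          # HSA_STATUS_ERROR_OUT_OF_RESOURCES, so this step fails and flags the
          # node (typically fixed by a reboot) before any test shard starts.
          rocminfo
          rocm-smi --showproductname
```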
See HUD here:
https://hud.pytorch.org/hud/pytorch/pytorch/21131a2/1?per_page=50&name_filter=rocm&mergeEphemeralLF=true
Tested here successfully: https://github.com/pytorch/pytorch/actions/runs/18620814622/job/53092469720
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang @dllehr-amd