[ROCm] Update setup-rocm for almalinux-based images #143590
amdfaa wants to merge 7 commits into pytorch:main
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/143590
Note: Links to docs will display an error until the docs builds have been completed.
❌ 22 New Failures, 1 Cancelled Job
As of commit 95c5aa6 with merge base 7ced49d.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@huydhn This is one of 3 PRs to get torchao CI working on ROCm. Can you please review and approve the PR?
  run: |
    # All GPUs are visible to the runner; visibility, if needed, will be set by run_test.py.
-   echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
+   echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon --group-add bin" >> "${GITHUB_ENV}"
The --group-add daemon and --group-add bin flags are needed on Ubuntu 24.04 and AlmaLinux respectively. On bare metal, the device files (/dev/kfd and /dev/dri) are owned by the video group, and that group's ID maps to subgid 1 inside the Docker image. The group name that corresponds to group ID 1 differs between OSs, so both flags are necessary.
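The GID-to-name difference described above can be illustrated with a small lookup. This is only a sketch: group_name_for_gid is a hypothetical helper that is not part of this PR, and the two /etc/group excerpts are sample data assumed to match the default group numbering on Ubuntu and AlmaLinux as this comment describes it, not output captured from the CI images.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): read /etc/group-style text on
# stdin and print the name of the first group whose numeric GID equals $1.
group_name_for_gid() {
  awk -F: -v gid="$1" '$3 == gid { print $1; exit }'
}

# Sample excerpts (assumption): default group numbering on each OS.
ubuntu_groups='root:x:0:
daemon:x:1:
bin:x:2:'

almalinux_groups='root:x:0:
bin:x:1:
daemon:x:2:'

# Look up which group name GID 1 resolves to in each sample.
printf '%s\n' "$ubuntu_groups" | group_name_for_gid 1
printf '%s\n' "$almalinux_groups" | group_name_for_gid 1
```

Against a real image, running getent group 1 inside the container should show which name applies there, which is why GPU_FLAG carries both --group-add daemon and --group-add bin.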
atalman left a comment
lgtm, please fix lint before merging. Looks like lint failure is legit
@pytorchbot merge -f "Current lint and other CI failures are unrelated to this PR, probably due to older base. ROCm CI succeeded."
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team
Needed for pytorch/test-infra#6104 and pytorch/ao#999
- Use the fully qualified pytorch/pytorch/.github/actions/diskspace-cleanup@main to be able to use setup-rocm in test-infra's .github/workflows/linux_job_v2.yml (like in PR "Enable linux_job_v2.yml workflow for ROCm", test-infra#6104); otherwise GitHub Actions complains about not finding the diskspace-cleanup action in the test-infra repo.
- Use RUNNER_TEMP instead of /tmp.
- Add bin group permissions for AlmaLinux images, due to the difference in default OS group numbering between Ubuntu and AlmaLinux.

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd