[ROCm] Update setup-rocm for almalinux-based images#143590

Closed
amdfaa wants to merge 7 commits into pytorch:main from amdfaa:torchao_experiment

Conversation

@amdfaa
Contributor

@amdfaa amdfaa commented Dec 19, 2024

Needed for pytorch/test-infra#6104 and pytorch/ao#999

  • Explicitly specify the repo and branch in pytorch/pytorch/.github/actions/diskspace-cleanup@main so that setup-rocm can be used from test-infra's .github/workflows/linux_job_v2.yml (as in pytorch/test-infra#6104, "Enable linux_job_v2.yml workflow for ROCm"). Otherwise GitHub Actions complains that it cannot find the diskspace-cleanup action in the test-infra repo.
  • Use RUNNER_TEMP instead of /tmp
  • Add bin group permissions for Almalinux images due to difference in default OS group numbering in Ubuntu vs Almalinux
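
For reference, the RUNNER_TEMP change can be sketched in shell as below. RUNNER_TEMP is the per-job temporary directory that GitHub Actions sets on the runner. The helper name scratch_path, the file name gpu_flags.env, and the /tmp fallback for running outside Actions are assumptions for illustration only, not what setup-rocm actually contains.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a scratch-file path under the runner's
# per-job temp directory instead of hard-coding /tmp.
# RUNNER_TEMP is set by GitHub Actions; the /tmp fallback is an
# assumption so the function also works outside a workflow run.
scratch_path() {
  local name="$1"
  printf '%s/%s\n' "${RUNNER_TEMP:-/tmp}" "$name"
}

# Example: pick a location for a scratch file (file name is illustrative).
flags_file="$(scratch_path gpu_flags.env)"
echo "scratch file would be: ${flags_file}"
```

Files under RUNNER_TEMP live in a directory that belongs to the job, which avoids leftovers from other jobs sharing /tmp on a long-lived self-hosted runner.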

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

@pytorch-bot

pytorch-bot bot commented Dec 19, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/143590

Note: Links to docs will display an error until the docs builds have been completed.

❌ 22 New Failures, 1 Cancelled Job

As of commit 95c5aa6 with merge base 7ced49d (image):

NEW FAILURES - The following jobs have failed:

  • Lint / lintrunner-noclang / linux-job (gh)
    >>> Lint for test/test_transformers.py:
  • linux-binary-manywheel / manywheel-py3_9-cuda12_6-test / test (gh)
    RuntimeError: cuDNN version incompatibility: PyTorch was compiled against (9, 5, 1) but found runtime version (9, 1, 0). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN. one possibility is that there is a conflicting cuDNN in LD_LIBRARY_PATH.
  • pull / cuda12.4-py3.10-gcc9-sm75 / build (gh)
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-focal-cuda11.8-py3.10-gcc9 / build (gh)
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-focal-cuda12.4-py3.10-gcc9 / build (gh)
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-focal-cuda12.4-py3.10-gcc9-sm89 / build (gh)
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-focal-py3_9-clang9-xla / build (gh)
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-focal-py3.13-clang10 / build (gh)
    Final attempt failed. Child_process exited with error code 1
  • pull / linux-focal-py3.9-clang10 / build (gh)
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-focal-py3.9-clang10-onnx / build (gh)
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-focal-rocm6.2-py3.10 / build (gh)
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-cuda11.8-cudnn9-py3.9-clang12 / build (gh)
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-py3-clang12-executorch / build (gh)
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-py3-clang12-mobile-build / build (gh)
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-py3.10-clang15-asan / build (gh)
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-py3.9-gcc11 / build (gh)
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-py3.9-gcc11-mobile-lightweight-dispatch-build / build (gh)
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-py3.9-gcc11-no-ops / build (gh)
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-py3.9-gcc11-pch / build (gh)
    Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / win-vs2019-cpu-py3 / build (gh)
    AttributeError: module 'distutils' has no attribute '_msvccompiler'
  • trunk / win-vs2019-cpu-py3 / build (gh)
    AttributeError: module 'distutils' has no attribute '_msvccompiler'
  • trunk / win-vs2019-cuda12.1-py3 / build (gh)
    AttributeError: module 'distutils' has no attribute '_msvccompiler'

CANCELLED JOB - The following job was cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Dec 19, 2024
@facebook-github-bot facebook-github-bot added the module: rocm AMD GPU support for Pytorch label Dec 19, 2024
@jithunnair-amd jithunnair-amd changed the title from "Torchao experiment" to "[ROCm] Update setup-rocm for almalinux-based images" Dec 19, 2024
@jithunnair-amd jithunnair-amd marked this pull request as ready for review December 19, 2024 19:51
@jithunnair-amd jithunnair-amd requested a review from a team as a code owner December 19, 2024 19:51
@jithunnair-amd
Collaborator

@huydhn This is one of 3 PRs to get torchao CI working on ROCm. Can you please review and approve the PR?

  run: |
    # All GPUs are visible to the runner; visibility, if needed, will be set by run_test.py.
-   echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
+   echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon --group-add bin" >> "${GITHUB_ENV}"
Contributor Author

@amdfaa amdfaa Dec 19, 2024

The --group-add daemon and --group-add bin flags are needed on Ubuntu 24.04 and Almalinux respectively. The device files (/dev/kfd and /dev/dri) are owned by the video group on bare metal, and this video group ID maps to subgid 1 inside the docker image. The group name corresponding to group ID 1 differs between OSs, so both flags are necessary.
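
A quick way to check this on a given image is to look up which group name owns GID 1. The sketch below assumes the standard name:password:GID:members layout of /etc/group. The function name group_name_for_gid is hypothetical and not part of setup-rocm.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of the group that has the given
# numeric GID, reading an /etc/group-style file (name:password:GID:members).
# The second argument defaults to /etc/group and exists mainly so the
# function can be tried against a copy of another image's group file.
group_name_for_gid() {
  local gid="$1"
  local group_file="${2:-/etc/group}"
  awk -F: -v gid="$gid" '$3 == gid { print $1; exit }' "$group_file"
}

# Usage inside a container:
#   group_name_for_gid 1
```

On an Ubuntu image this should report daemon and on an Almalinux image bin, which is why GPU_FLAG carries both --group-add values.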

Contributor

@atalman atalman left a comment

lgtm, please fix lint before merging. Looks like lint failure is legit

@jithunnair-amd jithunnair-amd added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 19, 2024
@jithunnair-amd
Collaborator

trunk workflow ROCm jobs passed, so these changes seem to have not broken existing workflows.

@pytorchbot merge -f "Current lint and other CI failures are unrelated to this PR, probably due to older base. ROCm CI succeeded."

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorch-bot pytorch-bot bot added the ciflow/rocm Trigger "default" config CI on ROCm label Dec 23, 2024
amdfaa added a commit to pytorch/test-infra that referenced this pull request Jan 16, 2025

Labels

  • ciflow/rocm: Trigger "default" config CI on ROCm
  • ciflow/trunk: Trigger trunk jobs on your pull request
  • Merged
  • module: rocm: AMD GPU support for Pytorch
  • open source
  • topic: not user facing: topic category

6 participants