Update ROCm base docker images to focal (ubuntu20.04) #79596

Closed
jithunnair-amd wants to merge 2 commits into pytorch:master from jithunnair-amd:update_rocm_ci_to_focal

@jithunnair-amd
Collaborator

No description provided.

@jithunnair-amd jithunnair-amd requested a review from a team as a code owner June 15, 2022 04:20
@pytorch-bot pytorch-bot bot added the module: rocm AMD GPU support for Pytorch label Jun 15, 2022
@facebook-github-bot
Contributor

facebook-github-bot commented Jun 15, 2022

❌ 2 New Failures, 1 Flaky Failures

As of commit 8383e0d (more details on the Dr. CI page):

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build pull / linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test (1/2)

Step: "Build"

2022-07-06T05:38:51.7487634Z ./configure: line 6928: /usr/bin/file: No such file or directory
2022-07-06T05:38:51.6424130Z checking for dlltool... no
2022-07-06T05:38:51.6426208Z checking how to associate runtime and link libraries... printf %s\n
2022-07-06T05:38:51.6430786Z checking for ar... ar
2022-07-06T05:38:51.6650634Z checking for archiver @FILE support... @
2022-07-06T05:38:51.6654444Z checking for strip... strip
2022-07-06T05:38:51.6659061Z checking for ranlib... ranlib
2022-07-06T05:38:51.7280413Z checking command to parse /usr/bin/nm -B output from gcc object... ok
2022-07-06T05:38:51.7290199Z checking for sysroot... no
2022-07-06T05:38:51.7328989Z checking for a working dd... /bin/dd
2022-07-06T05:38:51.7361142Z checking how to truncate binary pipes... /bin/dd bs=4096 count=1
2022-07-06T05:38:51.7487634Z ./configure: line 6928: /usr/bin/file: No such file or directory
2022-07-06T05:38:51.7501409Z checking for mt... no
2022-07-06T05:38:51.7530107Z checking if : is a manifest tool... no
2022-07-06T05:38:51.7717522Z checking how to run the C preprocessor... gcc -E
2022-07-06T05:38:51.8582592Z checking for ANSI C header files... yes
2022-07-06T05:38:51.8780478Z checking for sys/types.h... yes
2022-07-06T05:38:51.9005075Z checking for sys/stat.h... yes
2022-07-06T05:38:51.9234031Z checking for stdlib.h... yes
2022-07-06T05:38:51.9476533Z checking for string.h... yes
2022-07-06T05:38:51.9713930Z checking for memory.h... yes
2022-07-06T05:38:51.9951428Z checking for strings.h... yes

See GitHub Actions build pull / linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test (2/2)

Step: "Build"

2022-07-06T05:38:39.9485401Z ./configure: line 6928: /usr/bin/file: No such file or directory
2022-07-06T05:38:39.8486842Z checking for dlltool... no
2022-07-06T05:38:39.8488855Z checking how to associate runtime and link libraries... printf %s\n
2022-07-06T05:38:39.8493283Z checking for ar... ar
2022-07-06T05:38:39.8707820Z checking for archiver @FILE support... @
2022-07-06T05:38:39.8712438Z checking for strip... strip
2022-07-06T05:38:39.8716920Z checking for ranlib... ranlib
2022-07-06T05:38:39.9291008Z checking command to parse /usr/bin/nm -B output from gcc object... ok
2022-07-06T05:38:39.9302560Z checking for sysroot... no
2022-07-06T05:38:39.9341459Z checking for a working dd... /bin/dd
2022-07-06T05:38:39.9372860Z checking how to truncate binary pipes... /bin/dd bs=4096 count=1
2022-07-06T05:38:39.9485401Z ./configure: line 6928: /usr/bin/file: No such file or directory
2022-07-06T05:38:39.9498165Z checking for mt... no
2022-07-06T05:38:39.9527511Z checking if : is a manifest tool... no
2022-07-06T05:38:39.9711482Z checking how to run the C preprocessor... gcc -E
2022-07-06T05:38:40.0504204Z checking for ANSI C header files... yes
2022-07-06T05:38:40.0684121Z checking for sys/types.h... yes
2022-07-06T05:38:40.0893061Z checking for sys/stat.h... yes
2022-07-06T05:38:40.1108326Z checking for stdlib.h... yes
2022-07-06T05:38:40.1325382Z checking for string.h... yes
2022-07-06T05:38:40.1551209Z checking for memory.h... yes
2022-07-06T05:38:40.1767405Z checking for strings.h... yes

❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See GitHub Actions build pull / linux-bionic-rocm5.1-py3.7 / build (1/1)

Step: "Calculate docker image" ❄️

2022-07-06T05:43:56.6388699Z E: Failed to fetch...64/Packages 404 Not Found [IP: 13.82.220.49 443]
2022-07-06T05:43:55.5915654Z Ign:7 https://repo.radeon.com/amdgpu//ubuntu bionic/main amd64 Packages
2022-07-06T05:43:55.5996766Z Ign:9 https://repo.radeon.com/amdgpu//ubuntu bionic/main all Packages
2022-07-06T05:43:55.6084401Z Ign:7 https://repo.radeon.com/amdgpu//ubuntu bionic/main amd64 Packages
2022-07-06T05:43:55.6164165Z Ign:9 https://repo.radeon.com/amdgpu//ubuntu bionic/main all Packages
2022-07-06T05:43:55.6243478Z Err:7 https://repo.radeon.com/amdgpu//ubuntu bionic/main amd64 Packages
2022-07-06T05:43:55.6243985Z   404  Not Found [IP: 13.82.220.49 443]
2022-07-06T05:43:55.6321182Z Ign:9 https://repo.radeon.com/amdgpu//ubuntu bionic/main all Packages
2022-07-06T05:43:55.7624759Z Fetched 29.0 kB in 0s (90.8 kB/s)
2022-07-06T05:43:56.6104395Z Reading package lists...
2022-07-06T05:43:56.6388112Z W: The repository 'https://repo.radeon.com/amdgpu//ubuntu bionic Release' does not have a Release file.
2022-07-06T05:43:56.6388699Z E: Failed to fetch https://repo.radeon.com/amdgpu//ubuntu/dists/bionic/main/binary-amd64/Packages  404  Not Found [IP: 13.82.220.49 443]
2022-07-06T05:43:56.6389094Z E: Some index files failed to download. They have been ignored, or old ones used instead.
2022-07-06T05:43:57.0483241Z The command '/bin/sh -c bash ./install_rocm.sh' returned a non-zero code: 100
2022-07-06T05:43:57.0499645Z
2022-07-06T05:43:57.0504071Z ##[error]Process completed with exit code 100.
2022-07-06T05:43:57.0598128Z Prepare all required actions
2022-07-06T05:43:57.0598365Z Getting action download info
2022-07-06T05:43:57.2543873Z Download action repository 'nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a' (SHA:71062288b76e2b6214ebde0e673ce0de1755740a)
2022-07-06T05:43:57.4395040Z ##[group]Run ./.github/actions/get-workflow-job-id
2022-07-06T05:43:57.4395259Z with:
2022-07-06T05:43:57.4395559Z   github-token: ***
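
When triaging an apt failure like the one in the log above, it can help to pull the failing index URLs out of the log text. The following is a minimal sketch and is not part of Dr. CI or the PyTorch CI tooling; the function name `failed_fetches` and the regular expression are illustrative assumptions. It also flags URLs whose path contains an empty segment (`//`), as in `amdgpu//ubuntu`. What produced that empty segment is not established by the log itself.

```python
import re
from urllib.parse import urlsplit

# Matches apt error lines of the form:
#   E: Failed to fetch <url>  <status>  <reason> [IP: ...]
FAILED_FETCH = re.compile(r"E: Failed to fetch (?P<url>\S+)\s+(?P<status>\d{3})\s")


def failed_fetches(log_text):
    """Return (url, status, has_empty_segment) for each failed apt fetch."""
    results = []
    for match in FAILED_FETCH.finditer(log_text):
        url = match.group("url")
        # An empty path segment shows up as '//' inside the URL path.
        has_empty_segment = "//" in urlsplit(url).path
        results.append((url, int(match.group("status")), has_empty_segment))
    return results


# Sample input copied from the log excerpt above.
log = (
    "2022-07-06T05:43:56.6388699Z E: Failed to fetch "
    "https://repo.radeon.com/amdgpu//ubuntu/dists/bionic/main/binary-amd64/Packages"
    "  404  Not Found [IP: 13.82.220.49 443]\n"
)
for url, status, empty_segment in failed_fetches(log):
    print(status, empty_segment, url)
```

A flagged empty segment is only a hint about where to look first, for example at a value that was expected to fill that position in the repository URL.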

This comment was automatically generated by Dr. CI.

@jithunnair-amd jithunnair-amd marked this pull request as draft June 15, 2022 04:21
@jithunnair-amd
Collaborator Author

@seemethere I think you will need to create new docker tags in the registry? Please create pytorch-linux-focal-rocm5.0-py3.7 as well as pytorch-linux-focal-rocm5.1-py3.7
https://github.com/pytorch/pytorch/runs/6893810090?check_suite_focus=true#step:5:20501

+ docker push 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-rocm5.0-py3.7:5e325a9256867d845ff0ff9922eded165458ff97
The push refers to repository [308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-rocm5.0-py3.7]
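
The push target above has the shape registry/repository:tag. As a rough illustration, the sketch below splits such a reference into its parts, which makes it easier to see that `pytorch/pytorch-linux-focal-rocm5.0-py3.7` is the repository name being requested here. `split_image_ref` is a hypothetical helper written for this example, not a Docker, AWS, or PyTorch API, and it only handles tagged references of this simple form.

```python
def split_image_ref(ref):
    """Split 'registry/repository:tag' into (registry, repository, tag)."""
    # The tag follows the last ':' in the reference.
    name, sep, tag = ref.rpartition(":")
    if not sep or "/" in tag:
        # No tag present, or the last ':' belonged to a host:port prefix.
        raise ValueError(f"expected a tagged image reference, got {ref!r}")
    # The registry host is the first path component; the rest is the repository.
    registry, sep, repository = name.partition("/")
    if not sep:
        raise ValueError(f"expected a registry prefix in {ref!r}")
    return registry, repository, tag


# Reference copied from the docker push command above.
ref = (
    "308535385114.dkr.ecr.us-east-1.amazonaws.com/"
    "pytorch/pytorch-linux-focal-rocm5.0-py3.7:"
    "5e325a9256867d845ff0ff9922eded165458ff97"
)
registry, repository, tag = split_image_ref(ref)
print(repository)
```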

@kit1980
Contributor

kit1980 commented Jun 20, 2022

@seemethere I think you will need to create new docker tags in the registry? Please create pytorch-linux-focal-rocm5.0-py3.7 as well as pytorch-linux-focal-rocm5.1-py3.7 https://github.com/pytorch/pytorch/runs/6893810090?check_suite_focus=true#step:5:20501

I'll add the images.

@kit1980
Contributor

kit1980 commented Jun 20, 2022

Created pytorch-linux-focal-rocm5.0-py3.7 and pytorch-linux-focal-rocm5.1-py3.7

@jithunnair-amd jithunnair-amd marked this pull request as ready for review June 20, 2022 21:45
@ngimel ngimel added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jun 21, 2022
@jithunnair-amd
Collaborator Author

jithunnair-amd commented Jun 26, 2022

@kit1980 @malfet @seemethere I'm not sure why the docker-builds / docker-build (pytorch-linux-focal-py3.7-gcc7) job fails, but that causes the rocm docker build jobs to be cancelled. I've tried rebasing and it still failed. Can you please suggest a solution?

@kit1980
Contributor

kit1980 commented Jun 26, 2022

@kit1980 @malfet @seemethere I'm not sure why the docker-builds / docker-build (pytorch-linux-focal-py3.7-gcc7) job fails, but that causes the rocm docker build jobs to be cancelled. I've tried rebasing and it still failed. Can you please suggest a solution?

As a temporary workaround to unblock you, try changing something inside .circleci/docker dir.
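
The thread does not say why this workaround helps. One plausible reading, stated here purely as an assumption, is that the docker image tag is derived from the contents of that directory (the 40-character tag in the earlier push log is consistent with a content hash), so any change there yields a new tag and forces a fresh build. The sketch below illustrates the general idea of a content-derived tag; `content_tag` is a hypothetical function and is not how the CI actually computes its tag.

```python
import hashlib
import tempfile
from pathlib import Path


def content_tag(directory):
    """Derive a tag from the relative path and bytes of every file under directory."""
    digest = hashlib.sha1()
    root = Path(directory)
    # Sort so the result does not depend on filesystem listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "build.sh"  # stand-in file name for the demo
    script.write_text("echo build\n")
    before = content_tag(tmp)
    # Any edit inside the directory changes the derived tag.
    script.write_text("echo build  # touched\n")
    after = content_tag(tmp)
    print(before != after)
```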

@malfet
Contributor

malfet commented Jul 5, 2022

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job.

@pytorchmergebot
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/master pull/79596/head returned non-zero exit code 1

Rebasing (1/4)
Auto-merging .github/workflows/docker-builds.yml
CONFLICT (content): Merge conflict in .github/workflows/docker-builds.yml
error: could not apply 890530f139... Update ROCm base docker images to focal (ubuntu20.04)
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply 890530f139... Update ROCm base docker images to focal (ubuntu20.04)

Raised by https://github.com/pytorch/pytorch/actions/runs/2618842266

@malfet
Contributor

malfet commented Jul 5, 2022

@jithunnair-amd please manually rebase (looks like there is a conflict of sorts)

@jithunnair-amd jithunnair-amd force-pushed the update_rocm_ci_to_focal branch from 6430227 to 8383e0d July 6, 2022 05:30
@jithunnair-amd
Collaborator Author

@malfet The ROCm docker build jobs succeeded. Merging this PR. Will file another PR to move the ROCm CI jobs to use focal images.

@jithunnair-amd
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

@pytorchbot successfully started a merge job.

@pytorchmergebot
Collaborator

Merge failed due to Refusing to merge as mandatory check(s) pull failed for rule OSS CI
Raised by https://github.com/pytorch/pytorch/actions/runs/2625526308

@malfet
Contributor

malfet commented Jul 6, 2022

@jithunnair-amd are you not concerned with the ROCm failure?

@malfet
Contributor

malfet commented Jul 6, 2022

@pytorchbot merge -f

@pytorchmergebot
Collaborator

@pytorchbot successfully started a merge job.

@github-actions
Contributor

github-actions bot commented Jul 6, 2022

Hey @jithunnair-amd.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

@jeffdaily
Collaborator

@jithunnair-amd are you not concerned with the ROCm failure?

There probably should have been a single PR to build rocm focal images and also switch CI to use them. There is now an inconsistency in the hard-coded image names in .circleci/docker/build.sh compared to the expected image names in the various workflows. The build.sh script is attempting to build a missing bionic image and the ROCM_VERSION variable is getting set incorrectly as "5.1" instead of "5.1.1".
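
The two problems described above (image names that differ between the build script and the workflows, and a version string that comes out as "5.1" rather than "5.1.1") can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the thread does not show how `build.sh` derives `ROCM_VERSION`, so the sketch assumes it is parsed from the image name, and the `FULL_ROCM_VERSION` table and the example name lists are hypothetical values, not copied from the repository.

```python
import re

# Hypothetical table mapping the short version in an image name to the
# full release that should be installed.
FULL_ROCM_VERSION = {"5.1": "5.1.1"}


def rocm_version_from_image(image):
    """Extract the version that follows 'rocm' in an image name, or None."""
    match = re.search(r"rocm(\d+(?:\.\d+)*)", image)
    return match.group(1) if match else None


def resolve_rocm_version(image):
    """Map the short version from the image name to a full release if known."""
    short = rocm_version_from_image(image)
    return FULL_ROCM_VERSION.get(short, short)


def missing_images(known_to_build_script, expected_by_workflows):
    """Return image names the workflows expect but the build script lacks."""
    return sorted(set(expected_by_workflows) - set(known_to_build_script))


# Illustrative name lists only.
build_script_images = ["pytorch-linux-focal-rocm5.1-py3.7"]
workflow_images = ["pytorch-linux-bionic-rocm5.1-py3.7"]

print(rocm_version_from_image("pytorch-linux-focal-rocm5.1-py3.7"))
print(resolve_rocm_version("pytorch-linux-focal-rocm5.1-py3.7"))
print(missing_images(build_script_images, workflow_images))
```

A check like `missing_images` run over both lists before merging would surface this kind of mismatch, which is one argument for changing the image builds and the CI jobs in a single PR.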

@jeffdaily
Collaborator

@malfet can you revert?

@malfet
Contributor

malfet commented Jul 6, 2022

@pytorchbot revert -m "Jeff asked for it" -c nosignal

@malfet
Contributor

malfet commented Jul 6, 2022

@malfet can you revert?

Sure, though I believe you should be able to issue revert command as well (please try next time and ping me if this is not the case)

@jithunnair-amd
Collaborator Author

@jeffdaily Indeed I was, and was trying to figure out whether we had the upgrade for the docker images and the CI jobs in one PR last time. Didn't expect it would get landed :)

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job.

@pytorchmergebot
Collaborator

@jithunnair-amd your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Jul 6, 2022
pytorchmergebot pushed a commit that referenced this pull request Jul 8, 2022
…81031)

Re-attempting after original PR #79596 was reverted due to causing ROCm build failures
Pull Request resolved: #81031
Approved by: https://github.com/jeffdaily, https://github.com/malfet
facebook-github-bot pushed a commit that referenced this pull request Jul 8, 2022
…81031) (#81031)

Summary:
Re-attempting after original PR #79596 was reverted due to causing ROCm build failures

Pull Request resolved: #81031
Approved by: https://github.com/jeffdaily, https://github.com/malfet

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/8a5d9843ff5d5dd865fc922853a15b3e7e459fdb

Reviewed By: mehtanirav

Differential Revision: D37719967

Pulled By: mehtanirav

fbshipit-source-id: 8be30b4fecb0dc2911661f6a5259e147f1726286
pytorchmergebot pushed a commit that referenced this pull request Sep 13, 2022
…#80015)

CI doesn't have any MI25s anymore. Should improve docker and Pytorch build times in CI for ROCm.

Will take out of Draft mode after #79596 is merged

Pull Request resolved: #80015
Approved by: https://github.com/jeffdaily, https://github.com/malfet
mehtanirav pushed a commit that referenced this pull request Oct 4, 2022
…#80015)

CI doesn't have any MI25s anymore. Should improve docker and Pytorch build times in CI for ROCm.

Will take out of Draft mode after #79596 is merged

Pull Request resolved: #80015
Approved by: https://github.com/jeffdaily, https://github.com/malfet

Labels

cla signed, Merged, module: rocm (AMD GPU support for Pytorch), open source, Reverted, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)


8 participants