Update ROCm base docker images to focal (ubuntu20.04) (attempt #2)#81031

Closed
jithunnair-amd wants to merge 3 commits into pytorch:master from jithunnair-amd:update_rocm_ci_to_focal_2

@jithunnair-amd
Collaborator

Re-attempting after the original PR #79596 was reverted because it caused ROCm build failures.

@pytorch-bot added the "module: rocm" (AMD GPU support for Pytorch) label on Jul 7, 2022
@facebook-github-bot
Contributor

facebook-github-bot commented Jul 7, 2022

❌ 2 New Failures

As of commit 499f7e6 (more details on the Dr. CI page):

  • 2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build trunk / linux-bionic-cuda10.2-py3.9-gcc7 / test (default, 1, 2, linux.4xlarge.nvidia.gpu) (1/2)

Step: "Test"

2022-07-07T09:05:05.6324286Z Generated XML report: test-reports/python-unittest/test_autograd/TEST-TestAutogradForwardModeBatchedGrad-20220707090439.xml
2022-07-07T09:05:05.6433923Z Generated XML report: test-reports/python-unittest/test_autograd/TEST-autograd.test_functional.TestAutogradFunctional-20220707090439.xml
2022-07-07T09:05:05.6453167Z Generated XML report: test-reports/python-unittest/test_autograd/TEST-TestAutogradInferenceMode-20220707090439.xml
2022-07-07T09:05:05.6462958Z Generated XML report: test-reports/python-unittest/test_autograd/TEST-TestMultithreadAutograd-20220707090439.xml
2022-07-07T09:05:05.9766112Z corrupted double-linked list
2022-07-07T09:05:08.0779253Z Traceback (most recent call last):
2022-07-07T09:05:08.0779846Z   File "/var/lib/jenkins/workspace/test/run_test.py", line 945, in <module>
2022-07-07T09:05:08.0781002Z     main()
2022-07-07T09:05:08.0781874Z   File "/var/lib/jenkins/workspace/test/run_test.py", line 923, in main
2022-07-07T09:05:08.0784469Z     raise RuntimeError(err_message)
2022-07-07T09:05:08.0785259Z RuntimeError: test_autograd failed! Received signal: SIGIOT
2022-07-07T09:05:08.6335371Z 
2022-07-07T09:05:08.6336024Z real	131m15.913s
2022-07-07T09:05:08.6336322Z user	131m18.905s
2022-07-07T09:05:08.6336785Z sys	6m14.080s
2022-07-07T09:05:08.6382460Z ##[error]Process completed with exit code 1.
2022-07-07T09:05:08.6420207Z Prepare all required actions
2022-07-07T09:05:08.6420649Z Getting action download info
2022-07-07T09:05:08.8233617Z ##[group]Run ./.github/actions/get-workflow-job-id
2022-07-07T09:05:08.8233920Z with:
2022-07-07T09:05:08.8234351Z   github-token: ***

See GitHub Actions build trunk / linux-bionic-cuda10.2-py3.9-gcc7 / test (default, 2, 2, linux.4xlarge.nvidia.gpu) (2/2)

Step: "Test"

2022-07-07T08:40:48.0583789Z   test_random_seed (__main__.TestDataLoaderUtils) ... ok (0.174s)
2022-07-07T08:40:48.0598099Z   test_single_drop (__main__.TestDataLoaderUtils) ... ok (0.001s)
2022-07-07T08:40:48.0608738Z   test_single_keep (__main__.TestDataLoaderUtils) ... ok (0.001s)
2022-07-07T08:40:48.0640658Z   test_external_module_register (__main__.TestExtensionUtils) ... ok (0.003s)
2022-07-07T08:40:48.0647517Z   test_import_hipify (__main__.TestHipify) ... ok (0.001s)
2022-07-07T08:40:48.0673210Z   test_check_onnx_broadcast (__main__.TestONNXUtils) ... ok (0.002s)
2022-07-07T08:40:48.0684731Z   test_prepare_onnx_paddings (__main__.TestONNXUtils) ... ok (0.001s)
2022-07-07T08:41:05.3294994Z   test_load_standalone (__main__.TestStandaloneCPPJIT) ... ok (17.261s)
2022-07-07T08:41:05.3295821Z 
2022-07-07T08:41:05.3295978Z ======================================================================
2022-07-07T08:41:05.3301734Z FAIL [7.661s]: test_bottleneck_cuda (__main__.TestBottleneck)
2022-07-07T08:41:05.3302847Z ----------------------------------------------------------------------
2022-07-07T08:41:05.3303556Z Traceback (most recent call last):
2022-07-07T08:41:05.3304633Z   File "/var/lib/jenkins/workspace/test/test_utils.py", line 513, in test_bottleneck_cuda
2022-07-07T08:41:05.3305745Z     self.assertEqual(rc, 0, msg='Run failed with\n{}'.format(err))
2022-07-07T08:41:05.3306845Z   File "/opt/conda/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 2249, in assertEqual
2022-07-07T08:41:05.3307583Z     assert_equal(
2022-07-07T08:41:05.3308540Z   File "/opt/conda/lib/python3.9/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
2022-07-07T08:41:05.3309298Z     raise error_metas[0].to_error(msg)
2022-07-07T08:41:05.3309884Z AssertionError: Scalars are not equal!
2022-07-07T08:41:05.3310248Z 

This comment was automatically generated by Dr. CI.

@pruthvistony added the "ciflow/trunk" (Trigger trunk jobs on your pull request) label on Jul 7, 2022
@jithunnair-amd marked this pull request as ready for review on July 7, 2022 at 16:07
@jithunnair-amd requested review from a team and jeffdaily as code owners on July 7, 2022 at 16:07
@jithunnair-amd
Collaborator Author

@malfet @jeffdaily Please review this second attempt to update ROCm CI docker images to focal.

@jeffdaily
Collaborator

The CUDA CI failures do not seem related to this PR. All ROCm CI jobs have passed.

@jithunnair-amd
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

@pytorchbot successfully started a merge job.

@github-actions
Contributor

github-actions bot commented Jul 8, 2022

Hey @jithunnair-amd.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.
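
The labeling rule described in this bot message can be sketched as a small check. This is only an illustration under stated assumptions, not the bot's actual implementation: the function name `missing_label_kinds`, the prefix constants, and the example label `release notes: rocm` are hypothetical, and the `topic:` prefix is assumed from the 'topic: not user facing' wording above.

```python
from typing import Iterable, List

# Assumed label prefixes, taken from the wording of the bot message above.
RELEASE_NOTES_PREFIX = "release notes:"
TOPIC_PREFIX = "topic:"
NOT_USER_FACING = "topic: not user facing"


def missing_label_kinds(labels: Iterable[str]) -> List[str]:
    """Return which kinds of label a PR still needs (hypothetical helper)."""
    normalized = [label.strip().lower() for label in labels]
    has_release_notes = any(l.startswith(RELEASE_NOTES_PREFIX) for l in normalized)
    has_topic = any(l.startswith(TOPIC_PREFIX) for l in normalized)

    missing = []
    if not has_topic:
        missing.append("topic")
    # A 'topic: not user facing' label waives the release notes requirement.
    if not has_release_notes and NOT_USER_FACING not in normalized:
        missing.append("release notes")
    return missing


if __name__ == "__main__":
    # Labels carried by this PR when the bot commented.
    pr_labels = ["ciflow/trunk", "cla signed", "Merged", "module: rocm", "open source"]
    print(missing_label_kinds(pr_labels))
```

With the labels on this PR, the check reports both kinds as missing, which matches the bot's request; adding only 'topic: not user facing' would satisfy it.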

facebook-github-bot pushed a commit that referenced this pull request Jul 8, 2022
…81031) (#81031)

Summary:
Re-attempting after original PR #79596 was reverted due to causing ROCm build failures

Pull Request resolved: #81031
Approved by: https://github.com/jeffdaily, https://github.com/malfet

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/8a5d9843ff5d5dd865fc922853a15b3e7e459fdb

Reviewed By: mehtanirav

Differential Revision: D37719967

Pulled By: mehtanirav

fbshipit-source-id: 8be30b4fecb0dc2911661f6a5259e147f1726286

Labels

ciflow/trunk (Trigger trunk jobs on your pull request), cla signed, Merged, module: rocm (AMD GPU support for Pytorch), open source
