
Prevent clobbering of docker images by parallelnative/paralleltbb builds #39863

Closed
zou3519 wants to merge 2 commits into gh/zou3519/257/base from gh/zou3519/257/head

Conversation

@zou3519
Contributor

@zou3519 zou3519 commented Jun 11, 2020

Stack from ghstack:

The paralleltbb and parallelnative builds use the same docker image as
pytorch-linux-trusty-py3.6-gcc5.4-build
(https://github.com/pytorch/pytorch/blob/da3073e9b1db503f106842339f50f522d973be84/.circleci/config.yml#L6752):

    docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4:209062ef-ab58-422a-b295-36c4eed6e906"

Therefore they should push to a different intermediate docker image for
the next phase (testing), according to
(https://github.com/pytorch/pytorch/blob/da3073e9b1db503f106842339f50f522d973be84/.circleci/config.yml#L434-L439):

    # Push intermediate Docker image for next phase to use
    if [ -z "${BUILD_ONLY}" ]; then
      # Note [Special build images]
      # The xla build uses the same docker image as
      # pytorch-linux-trusty-py3.6-gcc5.4-build. In the push step, we have to
      # distinguish between them so the test can pick up the correct image.

However, they're not actually included in that list.

We've found evidence of what looks like clobbering in recent CI jobs
(https://circleci.com/gh/pytorch/pytorch/5787534?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link),
(https://circleci.com/gh/pytorch/pytorch/5787763?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link)

This PR adds parallelnative and paralleltbb to the list to prevent
clobbering.

Test Plan:

  • Wait for CI tests to pass on this PR.
  • The paralleltbb and parallelnative builds don't actually run on PRs.
    So I think the plan here is to yolo land and hope it works.

Differential Revision: D22002279
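The fix amounts to extending the image-suffix logic quoted above so each build that shares a base docker image pushes its intermediate image under a distinct tag. A minimal sketch of that idea, with a hypothetical helper name (`select_intermediate_tag`) and variable shapes that are assumptions, not the actual config.yml code:

```shell
#!/bin/sh
# Hypothetical sketch: jobs that share a base Docker image must push their
# intermediate image under a distinct tag, or they clobber each other.
select_intermediate_tag() {
  build_environment="$1"  # e.g. "...-paralleltbb-build"
  commit_sha="$2"
  case "$build_environment" in
    # Special build images (see Note [Special build images]): each gets
    # its own suffix so the test phase pulls the matching image.
    *xla*)            suffix="-xla" ;;
    *paralleltbb*)    suffix="-paralleltbb" ;;
    *parallelnative*) suffix="-parallelnative" ;;
    *)                suffix="" ;;
  esac
  echo "${commit_sha}${suffix}"
}

# Before this PR, the paralleltbb/parallelnative cases were missing, so both
# builds pushed under the plain commit tag and overwrote the gcc5.4 image.
select_intermediate_tag "pytorch-linux-xenial-py3.6-gcc5.4-paralleltbb-build" "abc123"
# prints "abc123-paralleltbb"
```

The test phase then reconstructs the same suffixed tag from its own build environment name, so each of the three jobs pulls the image its own build step pushed.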

zou3519 added a commit that referenced this pull request Jun 11, 2020
ghstack-source-id: e2fac1e
Pull Request resolved: #39863
@zou3519 zou3519 requested review from ezyang and pbelevich June 11, 2020 16:43
@dr-ci

dr-ci bot commented Jun 11, 2020

💊 CI failures summary and remediations

As of commit 6561a12 (more details on the Dr. CI page):



❄️ 1 failure tentatively classified as flaky, but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_linux_bionic_py3_8_gcc9_test (1/1)

Step: "Run tests" ❄️

Jun 11 17:55:45 RuntimeError: Process 2 terminated or timed out after 100.05585741996765 seconds
Jun 11 17:55:45 ====================================================================== 
Jun 11 17:55:45 ERROR [100.075s]: test_backward_node_failure (__main__.TensorPipeAgentDistAutogradTestWithSpawn) 
Jun 11 17:55:45 ---------------------------------------------------------------------- 
Jun 11 17:55:45 Traceback (most recent call last): 
Jun 11 17:55:45   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 204, in wrapper 
Jun 11 17:55:45     self._join_processes(fn) 
Jun 11 17:55:45   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 306, in _join_processes 
Jun 11 17:55:45     self._check_return_codes(elapsed_time) 
Jun 11 17:55:45   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 344, in _check_return_codes 
Jun 11 17:55:45     raise RuntimeError('Process {} terminated or timed out after {} seconds'.format(i, elapsed_time)) 
Jun 11 17:55:45 RuntimeError: Process 2 terminated or timed out after 100.05585741996765 seconds 
Jun 11 17:55:45  
Jun 11 17:55:45 ---------------------------------------------------------------------- 
Jun 11 17:55:45 Ran 65 tests in 322.000s 
Jun 11 17:55:45  
Jun 11 17:55:45 FAILED (errors=1) 
Jun 11 17:55:45  
Jun 11 17:55:45 Generating XML reports... 
Jun 11 17:55:45 Generated XML report: test-reports/dist-gloo/TEST-TensorPipeAgentDistAutogradTestWithSpawn-20200611175023.xml 
Jun 11 17:55:45 Traceback (most recent call last): 
Jun 11 17:55:45   File "test/run_test.py", line 711, in <module> 

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.

@facebook-github-bot
Contributor

@zou3519 merged this pull request in 7a79287.

4 participants