
[reland][distributed] pass in timeout to TCP store when initializing #33434

Closed
rohan-varma wants to merge 4 commits into gh/rohan-varma/82/base from gh/rohan-varma/82/head

@rohan-varma rohan-varma commented Feb 18, 2020

Stack from ghstack:

Reland of #33325, which was reverted because its unit test was flaky and failed on land.

To keep the test from flaking, I bumped the timeout so that the rendezvous
does not time out (a 1s rendezvous timeout caused the flakiness). I also
generalized our mechanism for retrying on errors so that it also retries on
rendezvous timeouts.
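
To illustrate the idea of retrying on a configurable set of errors that includes rendezvous timeouts, here is a minimal sketch. The names `RendezvousTimeout`, `retry_on_errors`, and `flaky_rendezvous` are hypothetical and are not taken from this PR or from the PyTorch test helpers; the real helper may differ in name, signature, and behavior.

```python
import functools
import time


class RendezvousTimeout(RuntimeError):
    """Hypothetical error type standing in for a rendezvous timeout."""


def retry_on_errors(retryable=(ConnectionError, RendezvousTimeout),
                    attempts=3, delay=0.0):
    """Retry the wrapped callable when it raises one of `retryable`."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except retryable:
                    if attempt == attempts:
                        raise  # out of attempts: surface the last error
                    time.sleep(delay)
        return wrapper
    return decorator


calls = {"count": 0}


@retry_on_errors(attempts=3)
def flaky_rendezvous():
    # Fails twice with a timeout, then succeeds, to mimic a slow peer.
    calls["count"] += 1
    if calls["count"] < 3:
        raise RendezvousTimeout("peer did not join in time")
    return "store-ready"


result = flaky_rendezvous()
```

Keeping the retryable error types in a tuple means a new failure mode, such as a timeout, can be added without changing the retry loop itself.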

Ran the test 500 times and every run passed; also built on macOS and verified that it passes there too.

Differential Revision: D19935390

NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on Phabricator!

rohan-varma added a commit that referenced this pull request Feb 18, 2020
ghstack-source-id: 98450046
Pull Request resolved: #33434
@rohan-varma rohan-varma changed the title from Back out "Revert D19871946: [distributed] pass in timeout to TCP store when initializing" to [reland][distributed] pass in timeout to TCP store when initializing on Feb 18, 2020
dr-ci bot commented Feb 18, 2020

💊 CircleCI build failures summary and remediations

As of commit b47da11:

None of the build failures appear to be your fault.

  • 1/1 broken upstream at merge base d13c1b8 since Feb 18

    Please rebase on the viable/strict branch.

    If your commit is newer than viable/strict, you can try basing on an older, stable commit:

    git fetch origin viable/strict
    git rebase --onto viable/strict $(git merge-base origin/master HEAD)
    

    If your commit is older than viable/strict:

    git fetch origin viable/strict
    git rebase viable/strict
    

    Check out the recency history of this "viable master" tracking branch.

Detailed failure analysis

One may explore the probable reasons each build failed interactively on the Dr. CI website.

🚧 1 upstream failure recognized by patterns:

These builds matched patterns, but were probably caused by upstream breakages:



This comment has been revised 6 times.

@pritamdamania87 pritamdamania87 left a comment

Please fix lint issues before landing.

…itializing"


Reland of #33325, which was reverted because its unit test was flaky and failed on land.

To keep the test from flaking, I bumped the timeout so that the rendezvous
does not time out (a 1s rendezvous timeout caused the flakiness). I also
generalized our mechanism for retrying on errors so that it also retries on
rendezvous timeouts.

Ran the test 500 times and every run passed, but I will also SSH into the macOS CI docker, which is where it failed, to ensure that it runs properly there.

Differential Revision: [D19935390](https://our.internmc.facebook.com/intern/diff/D19935390/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19935390/)!

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Feb 18, 2020
ghstack-source-id: 98504874

rohan-varma added a commit that referenced this pull request Feb 19, 2020
ghstack-source-id: 98526630

rohan-varma added a commit that referenced this pull request Feb 19, 2020
ghstack-source-id: 98558377

@facebook-github-bot
This pull request has been merged in 6cb9e6b.

@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/82/head branch February 23, 2020 15:17
ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020
…e when initializing" (pytorch#33434)

Summary:
Pull Request resolved: pytorch#33434

Reland of pytorch#33325, which was reverted because its unit test was flaky
and failed on land.
To keep the test from flaking, I bumped the timeout so that the rendezvous
does not time out (a 1s rendezvous timeout caused the flakiness). I also
generalized our mechanism for retrying on errors so that it also retries on
rendezvous timeouts.
ghstack-source-id: 98558377

Test Plan: Added UT test_tcp_store_timeout_set

Differential Revision: D19935390

fbshipit-source-id: 56ccf8c333dd2f954a33614d35cd1642d4e9473a
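
The test named in the test plan above is not shown in this thread. As a rough sketch of the kind of check a store-timeout test can make, the example below uses a hypothetical in-memory `WaitingStore` (not the real TCP store) and asserts that waiting on a missing key fails after roughly the configured timeout. All class and method names here are assumptions made for illustration.

```python
import threading
import time
import unittest


class WaitingStore:
    """Hypothetical in-memory store whose `get` blocks until a key is set."""

    def __init__(self, timeout_s):
        self._timeout_s = timeout_s
        self._data = {}
        self._cond = threading.Condition()

    def set(self, key, value):
        with self._cond:
            self._data[key] = value
            self._cond.notify_all()

    def get(self, key):
        with self._cond:
            # wait_for returns False if the predicate is still unmet
            # when the timeout expires.
            found = self._cond.wait_for(
                lambda: key in self._data, timeout=self._timeout_s
            )
            if not found:
                raise TimeoutError(
                    f"key {key!r} not set within {self._timeout_s}s"
                )
            return self._data[key]


class StoreTimeoutTest(unittest.TestCase):
    def test_get_honors_configured_timeout(self):
        store = WaitingStore(timeout_s=0.2)
        start = time.monotonic()
        with self.assertRaises(TimeoutError):
            store.get("missing")
        elapsed = time.monotonic() - start
        # Generous bounds keep the check from becoming flaky itself.
        self.assertGreaterEqual(elapsed, 0.15)
        self.assertLess(elapsed, 5.0)
```

The loose upper bound is deliberate: a tight bound on elapsed time is a common source of the kind of flakiness this PR set out to remove.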
        init_method, rank, world_size, timeout=timeout
    )
    store, rank, world_size = next(rendezvous_iterator)
    store.set_timeout(timeout)
There is no need to call set_timeout again, since the timeout is already passed in when the store is initialized.
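
The reviewer's point can be sketched with stand-ins. `InMemoryStore` and this `rendezvous` generator are hypothetical and only mimic the shape of the snippet under review; they are not the PyTorch implementations. The assumption is that the rendezvous forwards `timeout` to the store it creates, which makes a later `set_timeout` call redundant.

```python
from datetime import timedelta


class InMemoryStore:
    """Hypothetical stand-in for a key-value store; not the real TCP store."""

    def __init__(self, timeout):
        self.timeout = timeout  # applied once, at construction
        self.set_timeout_calls = 0

    def set_timeout(self, timeout):
        self.set_timeout_calls += 1
        self.timeout = timeout


def rendezvous(init_method, rank, world_size, timeout):
    """Hypothetical rendezvous generator that forwards `timeout` to the store."""
    store = InMemoryStore(timeout=timeout)
    yield store, rank, world_size


timeout = timedelta(seconds=30)
rendezvous_iterator = rendezvous(
    "tcp://127.0.0.1:29500", 0, 1, timeout=timeout
)
store, rank, world_size = next(rendezvous_iterator)
# No store.set_timeout(timeout) here: the store already carries the value.
```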
