Skip to content

[distributed] pass in timeout to TCP store when initializing#33325

Closed
rohan-varma wants to merge 4 commits intogh/rohan-varma/81/basefrom
gh/rohan-varma/81/head
Closed

[distributed] pass in timeout to TCP store when initializing#33325
rohan-varma wants to merge 4 commits intogh/rohan-varma/81/basefrom
gh/rohan-varma/81/head

Conversation

@rohan-varma
Copy link
Contributor

@rohan-varma rohan-varma commented Feb 14, 2020

Stack from ghstack:

Closes #32924. There was a bug where for TCPStore, we would not respect the timeout passed into init_process_group while constructing the TCPStore. Instead, we'd set the timeout after the rendezvous created the store, meaning that we used the default timeout of 300s while connecting to the server. This diff passes the timeout passed into init_process_group to rendezvous so that it can be passed into the constructor for TCPStore, so that we can use the right timeout at construction time.

Question: Should we make this change for FileStore as well? Currently the FileStore constructor does not take in a timeout at all.

Differential Revision: D19871946

NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on Phabricator!

Closes #32924. There was a bug where for TCPStore, we would not respect the timeout passed into `init_process_group` while constructing the TCPStore. Instead, we'd set the timeout after the rendezvous created the store, meaning that we used the default timeout of 300s while connecting to the server. This diff passes the timeout passed into `init_process_group` to rendezvous so that it can be passed into the constructor for TCPStore, so that we can use the right timeout at construction time.

Question: Should we make this change for FileStore as well? Currently the FileStore constructor does not take in a timeout at all.

Differential Revision: [D19871946](https://our.internmc.facebook.com/intern/diff/D19871946/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19871946/)!

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Feb 14, 2020
Closes #32924. There was a bug where for TCPStore, we would not respect the timeout passed into `init_process_group` while constructing the TCPStore. Instead, we'd set the timeout after the rendezvous created the store, meaning that we used the default timeout of 300s while connecting to the server. This diff passes the timeout passed into `init_process_group` to rendezvous so that it can be passed into the constructor for TCPStore, so that we can use the right timeout at construction time.

Question: Should we make this change for FileStore as well? Currently the FileStore constructor does not take in a timeout at all.

Differential Revision: [D19871946](https://our.internmc.facebook.com/intern/diff/D19871946/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19871946/)!

ghstack-source-id: 98307574
Pull Request resolved: #33325
@rohan-varma rohan-varma changed the title [wip][distributed] pass in timeout to TCP store when initializing [distributed] pass in timeout to TCP store when initializing Feb 14, 2020
Closes #32924. There was a bug where for TCPStore, we would not respect the timeout passed into `init_process_group` while constructing the TCPStore. Instead, we'd set the timeout after the rendezvous created the store, meaning that we used the default timeout of 300s while connecting to the server. This diff passes the timeout passed into `init_process_group` to rendezvous so that it can be passed into the constructor for TCPStore, so that we can use the right timeout at construction time.

Question: Should we make this change for FileStore as well? Currently the FileStore constructor does not take in a timeout at all.

Differential Revision: [D19871946](https://our.internmc.facebook.com/intern/diff/D19871946/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19871946/)!

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Feb 14, 2020
Pull Request resolved: #33325

Closes #32924. There was a bug where for TCPStore, we would not respect the timeout passed into `init_process_group` while constructing the TCPStore. Instead, we'd set the timeout after the rendezvous created the store, meaning that we used the default timeout of 300s while connecting to the server. This diff passes the timeout passed into `init_process_group` to rendezvous so that it can be passed into the constructor for TCPStore, so that we can use the right timeout at construction time.

Question: Should we make this change for FileStore as well? Currently the FileStore constructor does not take in a timeout at all.
ghstack-source-id: 98316154

Differential Revision: [D19871946](https://our.internmc.facebook.com/intern/diff/D19871946/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19871946/)!
Copy link
Contributor

@pritamdamania87 pritamdamania87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this!


end = time.time()
time_diff = end - start
self.assertGreater(test_store_timeout.seconds * 4, time_diff)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default timeout would've been 30 mins without this PR right? Lets just do * 10 to be conservative.

@dr-ci
Copy link

dr-ci bot commented Feb 14, 2020

💊 CircleCI build failures summary and remediations

As of commit 8b9faec:

None of the build failures appear to be your fault.

  • 2/2 broken upstream at merge base 6dd6b0b since Feb 15

    Please rebase on the viable/strict branch (expand for instructions)

    If your commit is newer than viable/strict, you can try basing on an older, stable commit:

    git fetch origin viable/strict
    git rebase --onto viable/strict $(git merge-base origin/master HEAD)
    

    If your commit is older than viable/strict:

    git fetch origin viable/strict
    git rebase viable/strict
    

    Check out the recency history of this "viable master" tracking branch.

Detailed failure analysis

One may explore the probable reasons each build failed interactively on the Dr. CI website.

🚧 2 upstream failures recognized by patterns:

These builds matched patterns, but were probably caused by upstream breakages:


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker.

This comment has been revised 10 times.

Closes #32924. There was a bug where for TCPStore, we would not respect the timeout passed into `init_process_group` while constructing the TCPStore. Instead, we'd set the timeout after the rendezvous created the store, meaning that we used the default timeout of 300s while connecting to the server. This diff passes the timeout passed into `init_process_group` to rendezvous so that it can be passed into the constructor for TCPStore, so that we can use the right timeout at construction time.

Question: Should we make this change for FileStore as well? Currently the FileStore constructor does not take in a timeout at all.

Differential Revision: [D19871946](https://our.internmc.facebook.com/intern/diff/D19871946/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19871946/)!

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Feb 14, 2020
Pull Request resolved: #33325

Closes #32924. There was a bug where for TCPStore, we would not respect the timeout passed into `init_process_group` while constructing the TCPStore. Instead, we'd set the timeout after the rendezvous created the store, meaning that we used the default timeout of 300s while connecting to the server. This diff passes the timeout passed into `init_process_group` to rendezvous so that it can be passed into the constructor for TCPStore, so that we can use the right timeout at construction time.

Question: Should we make this change for FileStore as well? Currently the FileStore constructor does not take in a timeout at all.
ghstack-source-id: 98351617

Differential Revision: [D19871946](https://our.internmc.facebook.com/intern/diff/D19871946/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19871946/)!
Closes #32924. There was a bug where for TCPStore, we would not respect the timeout passed into `init_process_group` while constructing the TCPStore. Instead, we'd set the timeout after the rendezvous created the store, meaning that we used the default timeout of 300s while connecting to the server. This diff passes the timeout passed into `init_process_group` to rendezvous so that it can be passed into the constructor for TCPStore, so that we can use the right timeout at construction time.

Question: Should we make this change for FileStore as well? Currently the FileStore constructor does not take in a timeout at all.

Differential Revision: [D19871946](https://our.internmc.facebook.com/intern/diff/D19871946/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19871946/)!

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Feb 15, 2020
Pull Request resolved: #33325

Closes #32924. There was a bug where for TCPStore, we would not respect the timeout passed into `init_process_group` while constructing the TCPStore. Instead, we'd set the timeout after the rendezvous created the store, meaning that we used the default timeout of 300s while connecting to the server. This diff passes the timeout passed into `init_process_group` to rendezvous so that it can be passed into the constructor for TCPStore, so that we can use the right timeout at construction time.

Question: Should we make this change for FileStore as well? Currently the FileStore constructor does not take in a timeout at all.
ghstack-source-id: 98401875

Differential Revision: [D19871946](https://our.internmc.facebook.com/intern/diff/D19871946/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19871946/)!
@rohan-varma
Copy link
Contributor Author

Test failure is in JIT fuser and unrelated.

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in df47a3a.

@rohan-varma
Copy link
Contributor Author

It looks like the test failed when I committed, it could be flaky. I am reverting this.

rohan-varma added a commit that referenced this pull request Feb 18, 2020
…e when initializing"

Reland of #33325, since the
unit test was flaky and failed on land.
To ensure that the test is not flaky, I bumped the timeout so the rendezvous
does not timeout (timing out the rendezvous in 1s led to the flakiness). I also
generalized our mechanism for retrying on errors to include retrying on errors
due to timeout in rendezvous.

Differential Revision: [D19935390](https://our.internmc.facebook.com/intern/diff/D19935390/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19935390/)!

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Feb 18, 2020
…e when initializing"

Reland of #33325, since the
unit test was flaky and failed on land.
To ensure that the test is not flaky, I bumped the timeout so the rendezvous
does not timeout (timing out the rendezvous in 1s led to the flakiness). I also
generalized our mechanism for retrying on errors to include retrying on errors
due to timeout in rendezvous.

Differential Revision: [D19935390](https://our.internmc.facebook.com/intern/diff/D19935390/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19935390/)!

ghstack-source-id: 98450046
Pull Request resolved: #33434
rohan-varma added a commit that referenced this pull request Feb 18, 2020
…itializing"


Reland of #33325, since the
unit test was flaky and failed on land.

To ensure that the test is not flaky, I bumped the timeout so the rendezvous
does not timeout (timing out the rendezvous in 1s led to the flakiness). I also
generalized our mechanism for retrying on errors to include retrying on errors
due to timeout in rendezvous.

Ran the test 500 times and it all passed, but I will also SSH into the macOS CI docker to ensure that it runs properly there (which is where it failed on)

Differential Revision: [D19935390](https://our.internmc.facebook.com/intern/diff/D19935390/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19935390/)!

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Feb 18, 2020
…e when initializing"

Pull Request resolved: #33434

Reland of #33325, since the
unit test was flaky and failed on land.
To ensure that the test is not flaky, I bumped the timeout so the rendezvous
does not timeout (timing out the rendezvous in 1s led to the flakiness). I also
generalized our mechanism for retrying on errors to include retrying on errors
due to timeout in rendezvous.
ghstack-source-id: 98504874

Differential Revision: [D19935390](https://our.internmc.facebook.com/intern/diff/D19935390/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19935390/)!
rohan-varma added a commit that referenced this pull request Feb 19, 2020
…itializing"


Reland of #33325, since the
unit test was flaky and failed on land.

To ensure that the test is not flaky, I bumped the timeout so the rendezvous
does not timeout (timing out the rendezvous in 1s led to the flakiness). I also
generalized our mechanism for retrying on errors to include retrying on errors
due to timeout in rendezvous.

Ran the test 500 times and it all passed, but I will also SSH into the macOS CI docker to ensure that it runs properly there (which is where it failed on)

Differential Revision: [D19935390](https://our.internmc.facebook.com/intern/diff/D19935390/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19935390/)!

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Feb 19, 2020
…e when initializing"

Pull Request resolved: #33434

Reland of #33325, since the
unit test was flaky and failed on land.
To ensure that the test is not flaky, I bumped the timeout so the rendezvous
does not timeout (timing out the rendezvous in 1s led to the flakiness). I also
generalized our mechanism for retrying on errors to include retrying on errors
due to timeout in rendezvous.
ghstack-source-id: 98526630

Differential Revision: [D19935390](https://our.internmc.facebook.com/intern/diff/D19935390/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19935390/)!
rohan-varma added a commit that referenced this pull request Feb 19, 2020
…itializing"


Reland of #33325, since the
unit test was flaky and failed on land.

To ensure that the test is not flaky, I bumped the timeout so the rendezvous
does not timeout (timing out the rendezvous in 1s led to the flakiness). I also
generalized our mechanism for retrying on errors to include retrying on errors
due to timeout in rendezvous.

Ran the test 500 times and it all passed, but I will also SSH into the macOS CI docker to ensure that it runs properly there (which is where it failed on)

Differential Revision: [D19935390](https://our.internmc.facebook.com/intern/diff/D19935390/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19935390/)!

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Feb 19, 2020
…e when initializing"

Pull Request resolved: #33434

Reland of #33325, since the
unit test was flaky and failed on land.
To ensure that the test is not flaky, I bumped the timeout so the rendezvous
does not timeout (timing out the rendezvous in 1s led to the flakiness). I also
generalized our mechanism for retrying on errors to include retrying on errors
due to timeout in rendezvous.
ghstack-source-id: 98558377

Differential Revision: [D19935390](https://our.internmc.facebook.com/intern/diff/D19935390/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19935390/)!
facebook-github-bot pushed a commit that referenced this pull request Feb 20, 2020
…e when initializing" (#33434)

Summary:
Pull Request resolved: #33434

Reland of #33325, since the
unit test was flaky and failed on land.
To ensure that the test is not flaky, I bumped the timeout so the rendezvous
does not timeout (timing out the rendezvous in 1s led to the flakiness). I also
generalized our mechanism for retrying on errors to include retrying on errors
due to timeout in rendezvous.
ghstack-source-id: 98558377

Test Plan: Added UT test_tcp_store_timeout_set

Differential Revision: D19935390

fbshipit-source-id: 56ccf8c333dd2f954a33614d35cd1642d4e9473a
@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/81/head branch February 20, 2020 15:16
ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020
…#33325)

Summary:
Pull Request resolved: pytorch#33325

Closes pytorch#32924. There was a bug where for TCPStore, we would not respect the timeout passed into `init_process_group` while constructing the TCPStore. Instead, we'd set the timeout after the rendezvous created the store, meaning that we used the default timeout of 300s while connecting to the server. This diff passes the timeout passed into `init_process_group` to rendezvous so that it can be passed into the constructor for TCPStore, so that we can use the right timeout at construction time.

Question: Should we make this change for FileStore as well? Currently the FileStore constructor does not take in a timeout at all.
ghstack-source-id: 98401875

Test Plan: Added a UT

Differential Revision: D19871946

fbshipit-source-id: dd002180c4c883216645b8a97cc472c6116ac117
ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020
…e when initializing" (pytorch#33434)

Summary:
Pull Request resolved: pytorch#33434

Reland of pytorch#33325, since the
unit test was flaky and failed on land.
To ensure that the test is not flaky, I bumped the timeout so the rendezvous
does not timeout (timing out the rendezvous in 1s led to the flakiness). I also
generalized our mechanism for retrying on errors to include retrying on errors
due to timeout in rendezvous.
ghstack-source-id: 98558377

Test Plan: Added UT test_tcp_store_timeout_set

Differential Revision: D19935390

fbshipit-source-id: 56ccf8c333dd2f954a33614d35cd1642d4e9473a
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants