[reland][distributed] pass in timeout to TCP store when initializing #33434
rohan-varma wants to merge 4 commits into gh/rohan-varma/82/base
…e when initializing" Reland of #33325, since the unit test was flaky and failed on land. To ensure that the test is not flaky, I bumped the timeout so the rendezvous does not timeout (timing out the rendezvous in 1s led to the flakiness). I also generalized our mechanism for retrying on errors to include retrying on errors due to timeout in rendezvous. Differential Revision: [D19935390](https://our.internmc.facebook.com/intern/diff/D19935390/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19935390/)! [ghstack-poisoned]
…e when initializing" Reland of #33325, since the unit test was flaky and failed on land. To ensure that the test is not flaky, I bumped the timeout so the rendezvous does not timeout (timing out the rendezvous in 1s led to the flakiness). I also generalized our mechanism for retrying on errors to include retrying on errors due to timeout in rendezvous. Differential Revision: [D19935390](https://our.internmc.facebook.com/intern/diff/D19935390/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19935390/)! ghstack-source-id: 98450046 Pull Request resolved: #33434
CircleCI build failures summary and remediations
As of commit b47da11: None of the build failures appear to be your fault.
1 upstream failure recognized by patterns: these builds matched patterns, but were probably caused by upstream breakages.
pritamdamania87
left a comment
Please fix lint issues before landing.
…itializing" Reland of #33325, since the unit test was flaky and failed on land. To ensure that the test is not flaky, I bumped the timeout so the rendezvous does not timeout (timing out the rendezvous in 1s led to the flakiness). I also generalized our mechanism for retrying on errors to include retrying on errors due to timeout in rendezvous. Ran the test 500 times and it all passed, but I will also SSH into the macOS CI docker to ensure that it runs properly there (which is where it failed on) Differential Revision: [D19935390](https://our.internmc.facebook.com/intern/diff/D19935390/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19935390/)! [ghstack-poisoned]
…e when initializing" Pull Request resolved: #33434 Reland of #33325, since the unit test was flaky and failed on land. To ensure that the test is not flaky, I bumped the timeout so the rendezvous does not timeout (timing out the rendezvous in 1s led to the flakiness). I also generalized our mechanism for retrying on errors to include retrying on errors due to timeout in rendezvous. ghstack-source-id: 98504874 Differential Revision: [D19935390](https://our.internmc.facebook.com/intern/diff/D19935390/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19935390/)!
…e when initializing" Pull Request resolved: #33434 Reland of #33325, since the unit test was flaky and failed on land. To ensure that the test is not flaky, I bumped the timeout so the rendezvous does not timeout (timing out the rendezvous in 1s led to the flakiness). I also generalized our mechanism for retrying on errors to include retrying on errors due to timeout in rendezvous. ghstack-source-id: 98526630 Differential Revision: [D19935390](https://our.internmc.facebook.com/intern/diff/D19935390/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19935390/)!
…e when initializing" Pull Request resolved: #33434 Reland of #33325, since the unit test was flaky and failed on land. To ensure that the test is not flaky, I bumped the timeout so the rendezvous does not timeout (timing out the rendezvous in 1s led to the flakiness). I also generalized our mechanism for retrying on errors to include retrying on errors due to timeout in rendezvous. ghstack-source-id: 98558377 Differential Revision: [D19935390](https://our.internmc.facebook.com/intern/diff/D19935390/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19935390/)!
This pull request has been merged in 6cb9e6b.
…e when initializing" (pytorch#33434)
Summary: Pull Request resolved: pytorch#33434
Reland of pytorch#33325, since the unit test was flaky and failed on land. To ensure that the test is not flaky, I bumped the timeout so the rendezvous does not timeout (timing out the rendezvous in 1s led to the flakiness). I also generalized our mechanism for retrying on errors to include retrying on errors due to timeout in rendezvous.
ghstack-source-id: 98558377
Test Plan: Added UT test_tcp_store_timeout_set
Differential Revision: D19935390
fbshipit-source-id: 56ccf8c333dd2f954a33614d35cd1642d4e9473a
            init_method, rank, world_size, timeout=timeout
        )
        store, rank, world_size = next(rendezvous_iterator)
        store.set_timeout(timeout)
There is no need to call set_timeout again here, since the timeout is already passed in to the rendezvous.
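To make the point of this review comment concrete, here is a minimal sketch built on hypothetical stand-ins: `FakeStore` and `fake_rendezvous` are not PyTorch APIs, and only the names `timeout` and `set_timeout` come from the diff. It assumes the rendezvous handler forwards `timeout` to the store constructor, in which case the returned store already carries the requested timeout and the extra call changes nothing.

```python
from datetime import timedelta


class FakeStore:
    """Hypothetical stand-in for a key-value store that accepts a timeout."""

    def __init__(self, timeout):
        self.timeout = timeout

    def set_timeout(self, timeout):
        self.timeout = timeout


def fake_rendezvous(rank, world_size, timeout):
    """Hypothetical rendezvous generator that forwards timeout to the store."""
    yield FakeStore(timeout), rank, world_size


timeout = timedelta(seconds=30)
store, rank, world_size = next(fake_rendezvous(0, 1, timeout=timeout))

# The store was constructed with the timeout, so it is already set.
timeout_before = store.timeout

# The redundant call flagged in the review: the value does not change.
store.set_timeout(timeout)
timeout_after = store.timeout
```

If the handler did not forward the timeout, the second call would still be needed; the sketch only covers the case described in this PR.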
Stack from ghstack:
Reland of #33325, since the
unit test was flaky and failed on land.
To ensure that the test is not flaky, I bumped the timeout so the rendezvous
does not time out (timing out the rendezvous in 1s led to the flakiness). I also
generalized our mechanism for retrying on errors to include retrying on errors
due to timeout in rendezvous.
Ran the test 500 times and it passed every time. Also built on macOS and verified that it passes there too.
Differential Revision: D19935390
NOTE FOR REVIEWERS: This PR has internal Facebook-specific changes or comments; please review them on Phabricator!
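The generalized retry mechanism mentioned in the description could look roughly like the sketch below. This is not the helper from the PR: the decorator name `retry_on_errors`, the matched error strings, and `flaky_rendezvous` are all illustrative assumptions. The idea is simply to retry a call when the raised error message matches one of a configurable set of retryable substrings, with a rendezvous timeout being one of them.

```python
import functools


def retry_on_errors(substrings, retries=3):
    """Retry the wrapped function when a RuntimeError message matches.

    `substrings` and `retries` are hypothetical parameters for illustration.
    """

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(retries):
                try:
                    return fn(*args, **kwargs)
                except RuntimeError as err:
                    retryable = any(s in str(err) for s in substrings)
                    # Re-raise unknown errors, or give up on the last attempt.
                    if not retryable or attempt == retries - 1:
                        raise

        return wrapper

    return decorator


calls = {"count": 0}


# The error strings here are assumed examples, not the ones PyTorch emits.
@retry_on_errors(["Address already in use", "timed out"], retries=3)
def flaky_rendezvous():
    """Simulated rendezvous that times out twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("rendezvous timed out")
    return "store"


result = flaky_rendezvous()
```

Matching on a list of substrings keeps the retry policy in one place, so adding a new retryable failure mode only means adding one more string.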