Fix rendezvous error due to EtcdStore get method not waiting in some cases#137056
Fix rendezvous error due to EtcdStore get method not waiting in some cases#137056alenawang wants to merge 2 commits intopytorch:mainfrom
Conversation
Co-authored-by: tarat44 <32471142+tarat44@users.noreply.github.com>
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137056
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit af29007 with merge base c07ebaf ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
XilunWu
left a comment
There was a problem hiding this comment.
The PR looks good to me except some suggestions on lint. Thanks for the fix!!
test/distributed/elastic/rendezvous/etcd_rendezvous_backend_test.py
Outdated
Show resolved
Hide resolved
@XilunWu Thank you! Apologies for missing the linter instructions - I ran the linter locally and fixed the issues it was complaining about. |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
…cases (pytorch#137056) Fixes pytorch#132950 This fixes an issue in `torch/distributed/elastic/rendezvous/etcd_store.py` where the [get method](https://github.com/pytorch/pytorch/blob/v2.4.0/torch/distributed/elastic/rendezvous/etcd_store.py#L60) does not wait as expected when no keys have been written under the store prefix yet (and therefore the store prefix key does not exist). This was because the `_try_wait_get` method would error out immediately [here](https://github.com/alenawang/pytorch/blob/main/torch/distributed/elastic/rendezvous/etcd_store.py#L179) if the prefix was not found instead of continuing to the etcd watch. This was causing upstream issues where distributed jobs using etcd-v2 could not get past the initial rendezvous at all (details in issue pytorch#132950). We added a test demonstrating this issue and the fix. Without the fix the test fails with `etcd.EtcdKeyNotFound: Key not found : /torch/elastic/store` instead of waiting for the first key to be written; with the fix the test waits properly. Co-authored-by: tarat44 <32471142+tarat44@users.noreply.github.com> Pull Request resolved: pytorch#137056 Approved by: https://github.com/fduwjj Co-authored-by: tarat44 <32471142+tarat44@users.noreply.github.com>
Fixes #132950
This fixes an issue in
torch/distributed/elastic/rendezvous/etcd_store.pywhere the get method does not wait as expected when no keys have been written under the store prefix yet (and therefore the store prefix key does not exist). This was because the_try_wait_getmethod would error out immediately here if the prefix was not found instead of continuing to the etcd watch.This was causing upstream issues where distributed jobs using etcd-v2 could not get past the initial rendezvous at all (details in issue #132950).
We added a test demonstrating this issue and the fix. Without the fix the test fails with
etcd.EtcdKeyNotFound: Key not found : /torch/elastic/storeinstead of waiting for the first key to be written; with the fix the test waits properly.Co-authored-by: tarat44 32471142+tarat44@users.noreply.github.com
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o