[1.5 Release][Dist Autograd][Better Engineering] Notify Workers on Failure during Distributed Autograd #34638
osalpekar wants to merge 1 commit into pytorch:master
Conversation
This pull request was exported from Phabricator. Differential Revision: D20164420
Force-pushed from 1f61dec to f943c42
💊 CircleCI build failures summary (Dr. CI, as of commit 94cb64f): 1 upstream failure, probably caused by upstream breakages.
Force-pushed from f943c42 to 3103de8
Force-pushed from 3103de8 to 4890a80
Force-pushed from 4890a80 to 7c01788
Force-pushed from 7c01788 to d26f333
Can we add the number of DIST_AUTOGRAD_FAILURE_REQ messages received per context id to debug info and then verify here that each node gets one of these messages for all the other context ids? Can add this as a separate PR.
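For reference, a minimal sketch of what that verification might look like, assuming the proposed counter is exposed through the distributed autograd debug info; the metric key `num_autograd_failure_reqs_received` is hypothetical and would only exist after the follow-up PR suggested here:

```python
# Hypothetical sketch: verify that each node received one
# DIST_AUTOGRAD_FAILURE_REQ per remote autograd context id.
# The "num_autograd_failure_reqs_received" key does not exist yet; it is the
# counter this comment proposes adding in a separate PR.
import torch.distributed.autograd as dist_autograd

def verify_failure_notifications(expected_count):
    # _get_debug_info() returns a dict of string-valued metrics describing
    # the local distributed autograd engine's state.
    debug_info = dist_autograd._get_debug_info()
    received = int(debug_info["num_autograd_failure_reqs_received"])
    assert received == expected_count, (
        f"expected {expected_count} failure notifications, got {received}"
    )
```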
Force-pushed from d26f333 to 17b1cd8
Notify Workers on Failure during Distributed Autograd (pytorch#34638)

Summary: Pull Request resolved: pytorch#34638. Fixes: pytorch#27643. This PR manages notifying workers in the event of a failure during distributed autograd. Gracefully handles propagating errors across all nodes in the backward pass and sets state in the local autograd engines accordingly.

Test Plan: Added 2 new tests checking errors when they are thrown in an intermediate node during distributed autograd. Ensured that all existing distributed autograd tests pass.

Differential Revision: D20164420
fbshipit-source-id: 5aada5544ed12cd7e24053ba3b93f8b9b38ba021
Force-pushed from 17b1cd8 to 94cb64f
This pull request has been merged in 5f67c92.
this broke lint
Notify Workers on Failure during Distributed Autograd (pytorch#34638)

Summary: Pull Request resolved: pytorch#34638. Fixes: pytorch#27643. This PR manages notifying workers in the event of a failure during distributed autograd. Gracefully handles propagating errors across all nodes in the backward pass and sets state in the local autograd engines accordingly. (Note: this ignores all push blocking failures!)

Test Plan: Added 2 new tests checking errors when they are thrown in an intermediate node during distributed autograd. Ensured that all existing distributed autograd tests pass.

Differential Revision: D20164420
fbshipit-source-id: 3d4ed74230969ac70bb763f1b5b1c16d979f66a2
Summary:
Fixes: #27643
This PR handles notifying workers in the event of a failure during distributed autograd. It gracefully propagates errors across all nodes in the backward pass and sets state in the local autograd engines accordingly.
Test Plan: Added 2 new tests checking errors when they are thrown in an intermediate node during distributed autograd. Ensured that all existing distributed autograd tests pass.
Differential Revision: D20164420
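As an illustration only, here is a minimal sketch of the kind of test described in the Test Plan: an error raised in an intermediate node's backward pass during distributed autograd should surface on the caller. The worker name "worker1", the `FailingBackward` function, and the surrounding RPC setup are assumptions for the sketch, not the actual tests added by this PR:

```python
# Minimal sketch (not the actual tests from this PR) of checking that an
# error raised on an intermediate node during the distributed backward pass
# propagates back to the node that started the backward.
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc


class FailingBackward(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # Simulates a failure on the node executing this backward.
        raise RuntimeError("simulated failure in intermediate node")


def remote_op(x):
    # Executed on the remote worker; its backward will raise.
    return FailingBackward.apply(x)


def run_failing_backward():
    # Assumes RPC has already been initialized on all participating workers.
    t = torch.ones(2, 2, requires_grad=True)
    with dist_autograd.context() as context_id:
        out = rpc.rpc_sync("worker1", remote_op, args=(t,))
        loss = out.sum()
        raised = False
        try:
            # The error raised in FailingBackward on worker1 should propagate
            # back to this node and abort the backward pass.
            dist_autograd.backward(context_id, [loss])
        except RuntimeError as e:
            raised = True
            assert "simulated failure" in str(e)
        assert raised, "expected the remote backward failure to propagate"
```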