Make torch.distributed.breakpoint() set a long timeout #158481
wconstab wants to merge 3 commits into gh/wconstab/428/base
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158481
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit c45e5bc with merge base 900fba4:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
torch/distributed/__init__.py
    # avoid having the default timeout (if short) interrupt your debug session
    if timeout_s is not None:
        for group in torch.distributed.distributed_c10d._pg_map:
            torch.distributed.distributed_c10d._set_pg_timeout(timedelta(seconds=timeout_s), group)
do we need to revert this after the breakpoint ends?
maybe we should. tbh i don't consider 'continue' a well supported feature bc half the time something crashes due to delays. But it doesn't hurt to do that.
i believe we do not currently have a way to get the existing timeout value. (!!!)
torch.distributed.distributed_c10d._get_default_timeout(group) just gets default values that are hardcoded in distributed_c10d.py, to use as defaults when initializing a pg without a user specified value.
we only bind a 'setter' for the backends. I'll leave this as is for now and we should fix the timeout methods separately.
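For reference, a minimal sketch of what "revert after the breakpoint ends" could look like given the constraint above. This is not what the PR does. `temporary_timeout` and its parameters are hypothetical names; because there is no getter for the current timeout, the value to restore has to be passed in by the caller. The setter is injected so the sketch runs without an initialized process group.

```python
from contextlib import contextmanager
from datetime import timedelta


@contextmanager
def temporary_timeout(groups, set_timeout, debug_timeout, restore_timeout):
    """Apply debug_timeout to every group, then restore_timeout on exit.

    Hypothetical helper, not part of torch.distributed. restore_timeout
    must come from the caller, since the existing value cannot be read back.
    """
    groups = list(groups)  # snapshot, in case the mapping changes meanwhile
    for group in groups:
        set_timeout(debug_timeout, group)
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions raised inside the block.
        for group in groups:
            set_timeout(restore_timeout, group)


# Intended wiring, using only the names that appear in the diff (assumption:
# the caller knows the timeout it originally configured, e.g. 10 minutes):
#
#   c10d = torch.distributed.distributed_c10d
#   with temporary_timeout(c10d._pg_map, c10d._set_pg_timeout,
#                          timedelta(seconds=timeout_s),
#                          timedelta(minutes=10)):
#       ...  # enter the debugger here
```

If a real getter is bound for the backends later, `restore_timeout` could be replaced by a per-group value read before the first `set_timeout` call.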
cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Merge failed
Reason: 1 mandatory check(s) failed. The first few are:
Dig deeper by viewing the failures on hud
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Stack from ghstack (oldest at bottom):
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @d4l3k @pragupta