Use default timeout of 30 minutes for gloo backend#13056
Closed
pietern wants to merge 1 commit intopytorch:masterfrom
Collaborator
10s seems too little, but 30m seems a bit too much, no? Something like 1 min?
Contributor
Author
@soumith There are instances where a single process group is used for a barrier between 2+ hour asynchronous tasks that complete with some variance (e.g. +/- 10m). I'm working on supporting per-operation timeouts, which will let you explicitly set a timeout on a barrier for this case, but until then we have to use a single timeout for all ops.
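To make the barrier argument concrete, here is a minimal sketch using Python's standard-library `threading.Barrier` as a stand-in for a process-group barrier. It is an analogy only, not the gloo or `torch.distributed` implementation: the helper name `run_barrier` and the scaled-down delays (fractions of a second instead of hours) are assumptions made for illustration.

```python
import threading
import time


def run_barrier(delays, default_timeout, barrier_timeout=None):
    """Return True if every worker passes the barrier, False if it breaks.

    delays          -- per-worker task durations in seconds (illustrative)
    default_timeout -- single timeout applied to every wait on the barrier
    barrier_timeout -- optional per-call override; None falls back to the default
    """
    barrier = threading.Barrier(len(delays), timeout=default_timeout)
    results = []
    lock = threading.Lock()

    def worker(delay):
        time.sleep(delay)  # stands in for a long asynchronous task
        try:
            # A per-call timeout takes precedence over the constructor default.
            barrier.wait(timeout=barrier_timeout)
            ok = True
        except threading.BrokenBarrierError:
            # Raised when this wait times out or another waiter already timed out.
            ok = False
        with lock:
            results.append(ok)

    threads = [threading.Thread(target=worker, args=(d,)) for d in delays]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return all(results)


if __name__ == "__main__":
    # One worker finishes 0.5 s after the other (the "variance").
    print(run_barrier([0.0, 0.5], default_timeout=0.1))  # default too short
    print(run_barrier([0.0, 0.5], default_timeout=2.0))  # generous default
    print(run_barrier([0.0, 0.5], default_timeout=0.1, barrier_timeout=2.0))
```

With a default shorter than the spread in completion times, the early worker gives up and the barrier breaks for everyone. Raising the default fixes that at the cost of a long timeout on every other operation, which is the trade-off a per-operation timeout (the third call) avoids.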
Collaborator
@pietern interesting!
1d92a7f to 2f4469c
2f4469c to 96abcad
96abcad to 1238c38
1238c38 to dfe33c2
Contributor
facebook-github-bot
left a comment
pietern is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
dfe33c2 to 16067c2
16067c2 to 93e35c1
Summary: The existing default timeout was set at 10 seconds, which is too low for asynchronous tasks that depend on a barrier to resynchronize. Having a single timeout for all operations is not ideal and this will be addressed in future commits.
Differential Revision: D10558746
fbshipit-source-id: 26c811da27907c1388c40d6f3a352ba96a011c15
93e35c1 to c676c46
Contributor
facebook-github-bot
left a comment
pietern is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request (Apr 24, 2026)
Summary: The existing default timeout was set at 10 seconds, which is too low for asynchronous tasks that depend on a barrier to resynchronize. Having a single timeout for all operations is not ideal and this will be addressed in future commits.
Pull Request resolved: pytorch#13056
Reviewed By: teng-li
Differential Revision: D10558746
Pulled By: pietern
fbshipit-source-id: d857ea55b1776fc7d0baf2efd77951b5d98beabb
Summary:
The existing default timeout was set at 10 seconds, which is too low
for asynchronous tasks that depend on a barrier to resynchronize.
Having a single timeout for all operations is not ideal and this will
be addressed in future commits.
Differential Revision: D10558746