
Make ProcessGroupAgent take num_threads as constructor argument #26313

Closed

xush6528 wants to merge 1 commit into pytorch:master from xush6528:export-D17405491

Conversation

@xush6528
Contributor

Summary:

Problem

If the RPC agent thread pool does not have enough threads, work items with circular dependencies can deadlock.

The current workaround for this deadlock is to provide an abundant number of threads.

Solution

As titled: make ProcessGroupAgent take num_threads as a constructor argument.
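To illustrate the deadlock the summary describes, here is a minimal sketch using Python's standard `concurrent.futures` rather than the actual ProcessGroupAgent internals: a task that blocks on a second task submitted to the same pool can never complete when the pool has only one worker, because the inner task is queued behind the task waiting on it.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def run(num_threads):
    # The outer task blocks on an inner task submitted to the same pool.
    # With a single worker, the inner task sits in the queue behind the
    # outer task that occupies the worker, so the wait can never finish.
    pool = ThreadPoolExecutor(max_workers=num_threads)

    def inner():
        return 42

    def outer():
        # Blocks until inner() completes (or the timeout expires).
        return pool.submit(inner).result(timeout=1.0)

    fut = pool.submit(outer)
    try:
        return fut.result(timeout=2.0)
    except TimeoutError:
        # The circular wait could not be resolved: effectively a deadlock,
        # surfaced here as a timeout so the example terminates.
        return None
    finally:
        pool.shutdown(wait=False)

print(run(2))  # enough workers: inner runs concurrently, returns 42
print(run(1))  # too few workers: the wait times out, returns None
```

Exposing `num_threads` on the agent constructor lets the caller size the pool for the expected depth of such nested waits, instead of relying on a hard-coded pool size being "abundant" enough.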

Differential Revision: D17405491

@pytorchbot pytorchbot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Sep 16, 2019
Contributor

@mrshenli mrshenli left a comment


Do you also need to add that to init_model_parallel as that's the entry point API?
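The reviewer's point is about plumbing the new knob through the user-facing entry point. A hypothetical sketch of that pattern (names are illustrative only, not the real `torch.distributed.rpc` API):

```python
# Illustrative sketch: forward a num_threads keyword from the entry-point
# API down to the agent constructor so callers control the pool size.
class ProcessGroupAgent:
    def __init__(self, name, num_threads=4):
        self.name = name
        self.num_threads = num_threads  # size of the RPC thread pool

def init_model_parallel(name, num_threads=4):
    # Entry-point API simply forwards the knob to the agent.
    return ProcessGroupAgent(name, num_threads=num_threads)

agent = init_model_parallel("worker0", num_threads=16)
print(agent.num_threads)  # 16
```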

Comment thread torch/distributed/rpc.py Outdated
Contributor


where is rpc_backend from? should it not be "backend"?

Contributor Author

@xush6528 xush6528 Sep 16, 2019


@zhaojuanmao Thanks. That's in @satgera 's next PR. I am keeping each PR atomic.

@xush6528
Contributor Author

@mrshenli Right, updating it.

Contributor

@facebook-github-bot facebook-github-bot left a comment


@xush6528 is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@xush6528 xush6528 added module: rpc Related to RPC, distributed autograd, RRef, and distributed optimizer triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels Sep 17, 2019
Comment thread torch/distributed/__init__.py Outdated
…ment (pytorch#26313)

Summary:
# Problem

If the RPC agent thread pool does not have enough threads, work items with circular dependencies can deadlock.

The current workaround for this deadlock is to provide an abundant number of threads.

# Solution

As titled: make ProcessGroupAgent take num_threads as a constructor argument.
Pull Request resolved: pytorch#26313

Differential Revision: D17405491

Pulled By: xush6528

fbshipit-source-id: e9b34db50b0ab614ebcb3414ae615ad1cde259ce

@facebook-github-bot
Contributor

@xush6528 merged this pull request in b0b0f2c.

laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
…ment (pytorch#26313)

Summary:
# Problem

If the RPC agent thread pool does not have enough threads, work items with circular dependencies can deadlock.

The current workaround for this deadlock is to provide an abundant number of threads.

# Solution

As titled: make ProcessGroupAgent take num_threads as a constructor argument.
Pull Request resolved: pytorch#26313

Differential Revision: D17405491

Pulled By: xush6528

fbshipit-source-id: a1d9b6a84db0371cd4b63328fa00f651c0808485

Labels

Merged · module: rpc (Related to RPC, distributed autograd, RRef, and distributed optimizer) · oncall: distributed (Add this issue/PR to distributed oncall triage queue) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)


6 participants