Skip to content

Create a new process group if group of TorchDistributedTrial is None.#4268

Closed
toshihikoyanase wants to merge 2 commits intooptuna:masterfrom
toshihikoyanase:always-create-gloo-process-group-if-none
Closed

Create a new process group if group of TorchDistributedTrial is None.#4268
toshihikoyanase wants to merge 2 commits intooptuna:masterfrom
toshihikoyanase:always-create-gloo-process-group-if-none

Conversation

@toshihikoyanase
Copy link
Copy Markdown
Member

@toshihikoyanase toshihikoyanase commented Dec 22, 2022

Motivation

This is a follow-up PR for #4106.

Always create a new process group

Currently, behavior of TorchDistributedTrial with group=None is inconsistent depending on a default process group (i.e., process groups for neural network training). The process group is reused if the backend of the default process group is gloo, while a new process group is created if the backend is nccl.
I think we can separate the communication of optuna and neural network training if we create a new process group for gloo.

And also, the current implementation does not take care of mpi backend.
https://pytorch.org/docs/stable/distributed.html

Create a process group for Optuna locally

The current master creates a process group as a (package) global variable (i.e., _g_pg) silently. Users may have difficulty to release such hidden variables. So, I'd like to change it to a member of the TorchDistributedTrial class, so that it can be released when a TorchDistributedTrial object is removed.
This change may take some time to recreate gloo process groups trial by trial, but users can reuse the process group by the group argument.

Description of the changes

  • Create a new process group if group=None
  • Create a new process group as a member of TorchDistributedTrial
  • Update docstring and fix the annotations

@toshihikoyanase toshihikoyanase added the enhancement Change that does not break compatibility and not affect public interfaces, but improves performance. label Dec 22, 2022
@github-actions github-actions bot added the optuna.integration Related to the `optuna.integration` submodule. This is automatically labeled by github-actions. label Dec 22, 2022
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Dec 22, 2022

Codecov Report

Attention: Patch coverage is 66.66667% with 1 line in your changes missing coverage. Please review.

Project coverage is 90.06%. Comparing base (d881144) to head (5e2dd48).
Report is 3677 commits behind head on master.

Files with missing lines Patch % Lines
optuna/integration/pytorch_distributed.py 66.66% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #4268   +/-   ##
=======================================
  Coverage   90.05%   90.06%           
=======================================
  Files         183      183           
  Lines       14088    14081    -7     
=======================================
- Hits        12687    12682    -5     
+ Misses       1401     1399    -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@contramundum53
Copy link
Copy Markdown
Member

@not522 @HideakiImamura Could you review this PR?

Copy link
Copy Markdown
Member

@HideakiImamura HideakiImamura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@HideakiImamura HideakiImamura removed their assignment Dec 23, 2022
@github-actions
Copy link
Copy Markdown
Contributor

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale Exempt from stale bot labeling. label Dec 29, 2022
@not522 not522 removed the stale Exempt from stale bot labeling. label Dec 29, 2022
@not522
Copy link
Copy Markdown
Member

not522 commented Jan 4, 2023

I want to merge #4301 first because the PyTorch distributed integration is not tested.

@github-actions
Copy link
Copy Markdown
Contributor

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale Exempt from stale bot labeling. label Jan 12, 2023
@toshihikoyanase
Copy link
Copy Markdown
Member Author

@not522 #4301 was merged today. So, could you resume your review, please?

@not522
Copy link
Copy Markdown
Member

not522 commented Jan 15, 2023

Could you merge the master branch? In my local environment, test_pytorch_distributed.py failed.

@github-actions github-actions bot removed the stale Exempt from stale bot labeling. label Jan 15, 2023
@toshihikoyanase toshihikoyanase force-pushed the always-create-gloo-process-group-if-none branch from 9170c40 to 4d500ef Compare January 17, 2023 02:35
@toshihikoyanase
Copy link
Copy Markdown
Member Author

I rebased master, and the tests-mpi jobs succeeded. Could you share the error message, please?

BTW, I'm concerned about the removal of process groups. In the DDP tutorial, the process group was removed explicitly using dist.destroy_process_group.
https://pytorch.org/tutorials/intermediate/ddp_tutorial.html#basic-use-case
This PR will create process groups trial by trial, and the process groups of previous trials may be harmful if they are not released automatically.

c.f., pytorch/pytorch#48203

@toshihikoyanase
Copy link
Copy Markdown
Member Author

toshihikoyanase commented Jan 17, 2023

In the DDP tutorial, the process group was removed explicitly using dist.destroy_process_group.

If I add the following code to TorchDistributedTrial, I guess we can remove unnecessary process groups when distributed trials are destroyed. What do you think of it?

    def __del__(self):
        if self._group:
            dist.destroy_process_group(self._group)

@github-actions
Copy link
Copy Markdown
Contributor

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale Exempt from stale bot labeling. label Jan 24, 2023
@HideakiImamura HideakiImamura assigned c-bata and unassigned not522 Jan 30, 2023
@HideakiImamura
Copy link
Copy Markdown
Member

@c-bata Could you review this PR?

@github-actions github-actions bot removed the stale Exempt from stale bot labeling. label Jan 30, 2023
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Feb 7, 2023

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale Exempt from stale bot labeling. label Feb 7, 2023
@c-bata c-bata removed the stale Exempt from stale bot labeling. label Feb 8, 2023
@github-actions
Copy link
Copy Markdown
Contributor

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale Exempt from stale bot labeling. label Feb 15, 2023
@c-bata c-bata removed the stale Exempt from stale bot labeling. label Feb 17, 2023
@github-actions
Copy link
Copy Markdown
Contributor

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale Exempt from stale bot labeling. label Feb 26, 2023
@github-actions
Copy link
Copy Markdown
Contributor

This pull request was closed automatically because it had not seen any recent activity. If you want to discuss it, you can reopen it freely.

@github-actions github-actions bot closed this Mar 12, 2023
@c-bata c-bata removed the stale Exempt from stale bot labeling. label Mar 13, 2023
@c-bata c-bata reopened this Mar 13, 2023
@c-bata
Copy link
Copy Markdown
Member

c-bata commented Mar 13, 2023

I'm really sorry for the delay. I'll review this PR today 🙇

@toshihikoyanase toshihikoyanase force-pushed the always-create-gloo-process-group-if-none branch from 4d500ef to 5e2dd48 Compare March 14, 2023 01:41
@toshihikoyanase toshihikoyanase marked this pull request as draft March 14, 2023 06:29
@toshihikoyanase
Copy link
Copy Markdown
Member Author

I think we need to remove the process groups properly, so let me mark this PR as draft.

@github-actions
Copy link
Copy Markdown
Contributor

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale Exempt from stale bot labeling. label Mar 21, 2023
@github-actions github-actions bot removed the stale Exempt from stale bot labeling. label Apr 4, 2023
@github-actions
Copy link
Copy Markdown
Contributor

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale Exempt from stale bot labeling. label Apr 11, 2023
@github-actions
Copy link
Copy Markdown
Contributor

This pull request was closed automatically because it had not seen any recent activity. If you want to discuss it, you can reopen it freely.

@github-actions github-actions bot closed this Apr 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement Change that does not break compatibility and not affect public interfaces, but improves performance. optuna.integration Related to the `optuna.integration` submodule. This is automatically labeled by github-actions. stale Exempt from stale bot labeling.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants