Create a new process group if group of TorchDistributedTrial is None. #4268

toshihikoyanase wants to merge 2 commits into optuna:master
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff            @@
##           master    #4268     +/- ##
=========================================
  Coverage   90.05%   90.06%
=========================================
  Files         183      183
  Lines       14088    14081       -7
=========================================
- Hits        12687    12682       -5
+ Misses       1401     1399       -2

☔ View full report in Codecov by Sentry.
@not522 @HideakiImamura Could you review this PR?
This pull request has not seen any recent activity.
I want to merge #4301 first because the PyTorch distributed integration is not tested.
Could you merge the master branch? In my local environment,
toshihikoyanase force-pushed from 9170c40 to 4d500ef
I rebased. BTW, I'm concerned about the removal of process groups. In the DDP tutorial, the process group was removed explicitly (c.f. pytorch/pytorch#48203). What if I add the following code?

```python
def __del__(self):
    if self._group:
        dist.destroy_process_group(self._group)
```
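The cleanup concern above can be sketched in plain Python, without torch. `FakeGroup`, `destroy_process_group`, and `TrialLike` below are illustrative stand-ins for a process group, `dist.destroy_process_group`, and the trial class (not Optuna's or PyTorch's actual code); the point is that the object destroys only the group it created itself:

```python
class FakeGroup:
    """Stand-in for a torch.distributed process group."""

    def __init__(self):
        self.destroyed = False


def destroy_process_group(group):
    # Stand-in for dist.destroy_process_group(group).
    group.destroyed = True


class TrialLike:
    """Sketch of a trial that owns its private group and releases it in __del__."""

    def __init__(self, group=None):
        # Create a private group only when the caller did not pass one.
        self._group = group if group is not None else FakeGroup()
        self._owns_group = group is None

    def __del__(self):
        # Destroy the group only if this object created it; a
        # caller-provided group stays under the caller's control.
        if self._owns_group and self._group is not None:
            destroy_process_group(self._group)


t = TrialLike()
owned = t._group
del t  # in CPython the refcount hits zero here, so __del__ runs
assert owned.destroyed
```

Relying on `__del__` alone is best-effort (finalization order at interpreter shutdown is not guaranteed), which is part of why the explicit `destroy_process_group` call in the DDP tutorial is worth keeping in mind.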
@c-bata Could you review this PR?
This pull request was closed automatically because it had not seen any recent activity. If you want to discuss it, you can reopen it freely.
I'm really sorry for the delay. I'll review this PR today 🙇 |
toshihikoyanase force-pushed from 4d500ef to 5e2dd48
I think we need to remove the process groups properly, so let me mark this PR as draft.
This pull request was closed automatically because it had not seen any recent activity. If you want to discuss it, you can reopen it freely.
Motivation

This is a follow-up PR for #4106.

Always create a new process group

Currently, the behavior of TorchDistributedTrial with group=None is inconsistent depending on the default process group (i.e., the process group used for neural network training). The default process group is reused if its backend is gloo, while a new process group is created if the backend is nccl. I think we can separate Optuna's communication from the neural network training if we also create a new process group for gloo. In addition, the current implementation does not take care of the mpi backend.

https://pytorch.org/docs/stable/distributed.html
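The separation described above can be sketched with a dedicated gloo group for Optuna's traffic, independent of the default group. This is an illustrative single-process setup (the real integration runs across multiple ranks), not the PR's actual implementation:

```python
import tempfile

import torch
import torch.distributed as dist

# Single-process gloo rendezvous via a file store, purely for illustration.
with tempfile.NamedTemporaryFile() as f:
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{f.name}",
        rank=0,
        world_size=1,
    )

    # A dedicated group for Optuna's communication, separate from the
    # default group used for neural network training.
    optuna_group = dist.new_group(backend="gloo")

    t = torch.zeros(1)
    dist.broadcast(t, src=0, group=optuna_group)

    # Tear down the dedicated group first, then the default group.
    dist.destroy_process_group(optuna_group)
    dist.destroy_process_group()
```

With `group=None` always creating such a dedicated group, gloo and nccl users get the same behavior, and Optuna's collectives never interleave with training collectives on the default group.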
Create a process group for Optuna locally

The current master silently creates a process group as a package-level global variable (i.e., _g_pg). Users may have difficulty releasing such hidden variables, so I'd like to change it to a member of the TorchDistributedTrial class, so that it can be released when a TorchDistributedTrial object is removed. This change may add overhead because a gloo process group is recreated trial by trial, but users can reuse a process group via the group argument.

Description of the changes

- Create a new process group when group=None.
- Hold the process group as a member of TorchDistributedTrial.