Create a new process group if group of TorchDistributedTrial is None. #4268

toshihikoyanase wants to merge 2 commits into optuna:master
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff            @@
##           master    #4268     +/- ##
=========================================
  Coverage   90.05%   90.06%
=========================================
  Files         183      183
  Lines       14088    14081       -7
=========================================
- Hits        12687    12682       -5
+ Misses       1401     1399       -2

☔ View full report in Codecov by Sentry.
@not522 @HideakiImamura Could you review this PR?
This pull request has not seen any recent activity.
I want to merge #4301 first because the PyTorch distributed integration is not tested.
Could you merge the master branch? In my local environment,
toshihikoyanase force-pushed from 9170c40 to 4d500ef
I rebased. BTW, I'm concerned about the removal of process groups. In the DDP tutorial, the process group was removed explicitly (c.f. pytorch/pytorch#48203). What if I add the following code?

```python
def __del__(self):
    if self._group:
        dist.destroy_process_group(self._group)
```
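The cleanup concern above can be sketched in plain Python, without torch. `FakeGroup`, `destroy_process_group`, and `TrialLike` below are illustrative stand-ins for a process group, `dist.destroy_process_group`, and the trial class (not Optuna's or PyTorch's actual code); the point is that the object destroys only the group it created itself:

```python
class FakeGroup:
    """Stand-in for a torch.distributed process group."""

    def __init__(self):
        self.destroyed = False


def destroy_process_group(group):
    # Stand-in for dist.destroy_process_group(group).
    group.destroyed = True


class TrialLike:
    """Sketch of a trial that owns its private group and releases it in __del__."""

    def __init__(self, group=None):
        # Create a private group only when the caller did not pass one.
        self._group = group if group is not None else FakeGroup()
        self._owns_group = group is None

    def __del__(self):
        # Destroy the group only if this object created it; a
        # caller-provided group stays under the caller's control.
        if self._owns_group and self._group is not None:
            destroy_process_group(self._group)


t = TrialLike()
owned = t._group
del t  # in CPython the refcount hits zero here, so __del__ runs
assert owned.destroyed
```

Relying on `__del__` alone is best-effort (finalization order at interpreter shutdown is not guaranteed), which is part of why the explicit `destroy_process_group` call in the DDP tutorial is worth keeping in mind.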
@c-bata Could you review this PR?
This pull request was closed automatically because it had not seen any recent activity. If you want to discuss it, you can reopen it freely.
I'm really sorry for the delay. I'll review this PR today 🙇 |
toshihikoyanase force-pushed from 4d500ef to 5e2dd48
I think we need to remove the process groups properly, so let me mark this PR as draft.
This pull request was closed automatically because it had not seen any recent activity. If you want to discuss it, you can reopen it freely.
Motivation

This is a follow-up PR for #4106.

Always create a new process group

Currently, the behavior of TorchDistributedTrial with group=None is inconsistent depending on the default process group (i.e., the process group used for neural network training). The default process group is reused if its backend is gloo, while a new process group is created if the backend is nccl. I think we can separate Optuna's communication from the neural network training if we also create a new process group for gloo. In addition, the current implementation does not take care of the mpi backend.

https://pytorch.org/docs/stable/distributed.html
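The separation described above can be sketched with a dedicated gloo group for Optuna's traffic, independent of the default group. This is an illustrative single-process setup (the real integration runs across multiple ranks), not the PR's actual implementation:

```python
import tempfile

import torch
import torch.distributed as dist

# Single-process gloo rendezvous via a file store, purely for illustration.
with tempfile.NamedTemporaryFile() as f:
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{f.name}",
        rank=0,
        world_size=1,
    )

    # A dedicated group for Optuna's communication, separate from the
    # default group used for neural network training.
    optuna_group = dist.new_group(backend="gloo")

    t = torch.zeros(1)
    dist.broadcast(t, src=0, group=optuna_group)

    # Tear down the dedicated group first, then the default group.
    dist.destroy_process_group(optuna_group)
    dist.destroy_process_group()
```

With `group=None` always creating such a dedicated group, gloo and nccl users get the same behavior, and Optuna's collectives never interleave with training collectives on the default group.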
Create a process group for Optuna locally

The current master silently creates a process group as a package-level global variable (i.e., _g_pg). Users may have difficulty releasing such hidden variables, so I'd like to change it to a member of the TorchDistributedTrial class, so that it can be released when a TorchDistributedTrial object is removed. This change may add overhead because a gloo process group is recreated trial by trial, but users can reuse a process group via the group argument.

Description of the changes

- Create a new process group when group=None.
- Hold the process group as a member of TorchDistributedTrial.