TorchDistributedTrial uses group as parameter instead of device #4106
toshihikoyanase merged 16 commits into optuna:master
Conversation
@toshihikoyanase @not522 Please review this PR, thanks.
Codecov Report. Attention: Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
## master #4106 +/- ##
==========================================
- Coverage 89.76% 89.74% -0.03%
==========================================
Files 162 162
Lines 12598 12607 +9
==========================================
+ Hits 11309 11314 +5
- Misses 1289 1293 +4
@reyoung Thank you for your PR. Before diving into the details of the code, could you share minimal reproducible code for the problem? I think we need it to confirm that your change resolves the problem.
gloo seems to be used for distributed CPU training. So, in my understanding, this PR uses the gloo backend for Optuna's communication while NCCL is used for the communication of PyTorch's model training. Please tell me if I'm wrong.
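For context, PyTorch allows several process groups with different backends to coexist in one job. The following is a minimal single-process sketch of that mechanism (gloo stands in for the NCCL training world so the sketch runs on CPU; the names `optuna_group` and `payload` are illustrative, not from the PR):

```python
import os
import tempfile

import torch.distributed as dist

# Single-process sketch (world_size=1) of the two-backend idea: the default
# group plays the role of the training world, and a second group is created
# for CPU-side control traffic. In a real job the default group would use
# NCCL; gloo is used for both here so the sketch runs without a GPU.
store_path = os.path.join(tempfile.mkdtemp(), "store")
dist.init_process_group(
    backend="gloo",
    init_method=f"file://{store_path}",
    rank=0,
    world_size=1,
)

# A process group with its own backend, independent of the default group.
optuna_group = dist.new_group(ranks=[0], backend="gloo")

# Small Python objects (e.g. suggested hyperparameters) go through the
# CPU-capable group; broadcast_object_list pickles them internally.
payload = [{"lr": 0.01}]
dist.broadcast_object_list(payload, src=0, group=optuna_group)

dist.destroy_process_group()
```

With world_size=1 the broadcast is a no-op, but the same calls work unchanged in a multi-rank NCCL job where `optuna_group` is a gloo group.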
I checked the implementation of optuna/integration/pytorch_distributed.py (lines 290 to 295 and lines 300 to 302 at commit 2b8befe).
I see. However, moving tensors between CPU and GPU seems unnecessary and slow, and it is easier just to create a separate gloo process group.
In summary, I think there are two possible ways to enhance this API.
I personally prefer option 1 because it is fast and easier to understand, but it makes the
Yes, you're right. There could be many process groups when using PyTorch Distributed. In this PR, a new gloo process group is created when the world's backend is NCCL, and it is used for all of Optuna's communication, while the rest of the training communication still goes through NCCL.
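The backend check described above can be sketched as a small helper (pure-Python mock; `pick_optuna_group`, `make_group`, and `fake_new_group` are illustrative stand-ins for `torch.distributed.get_backend()` and `torch.distributed.new_group()`):

```python
from typing import Callable, Optional

def pick_optuna_group(world_backend: str,
                      make_group: Callable[[str], object]) -> Optional[object]:
    """Return a CPU-capable process group for Optuna's collectives.

    Only when the training world uses NCCL (which moves CUDA tensors
    exclusively) do we create a separate gloo group; otherwise None is
    returned, which torch.distributed treats as "use the default group".
    """
    if world_backend.lower() == "nccl":
        return make_group("gloo")
    return None

# Mock factory standing in for torch.distributed.new_group.
def fake_new_group(backend: str) -> str:
    return f"group<{backend}>"

print(pick_optuna_group("nccl", fake_new_group))  # group<gloo>
print(pick_optuna_group("gloo", fake_new_group))  # None
```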
@toshihikoyanase Please review this PR.
@toshihikoyanase Could you review this PR?
HideakiImamura left a comment
Thanks for the PR. I haven't thought too much about the removal of the device argument and the addition of the group argument yet, but I'll leave a couple of quick comments first.
I would like to test the validity of this change. Is it possible to add unit tests? Note that Optuna does not run tests on GPUs. optuna-examples confirms that TorchDistributedTrial works on CPU here. If possible, could you provide a script to easily check the behavior on Google Colab or something similar?
    "Use :func:`~optuna.integration.TorchDistributedTrial.suggest_float` instead."
)

_g_pg: List[Optional["torch.distributed.ProcessGroup"]] = [None]
I think we should not use a global variable with state. Please remove _g_pg and have TorchDistributedTrial hold explicit state.
We cannot.
Since TorchDistributedTrial is created multiple times during training, we need a global variable to store the gloo group. A process group is an application-scoped resource, while TorchDistributedTrial is a per-trial (objective-local) object.
See https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_distributed_simple.py#L100
In a distributed optimization setup, even global variables are not shared in memory across processes, right? I think this is equivalent to having each TorchDistributedTrial store the information needed to identify the process group.
Regarding test code: I tested this in my internal GPU cluster and it works well with multi-GPU DDP. However, writing unit tests would require a GPU CI system, which Optuna does not have, right?
For example, please see optuna/optuna-examples#145. Actually, the PyTorch distributed environment variables should not be set manually; you can use a launcher for that instead.
Let me confirm one thing. Doesn't the current PyTorch DDP integration in Optuna already support GPU DDP?
This pull request has not seen any recent activity.
@reyoung @HideakiImamura I'm sorry for the delayed response. |
This pull request has not seen any recent activity. |
Yes, you can run GPU DDP with the current TorchDistributedTrial, but tensors are copied between CPU and GPU for communication.
It is not a common way to set the device explicitly, and I think it is a really bad design to use device as the parameter here.
Thank you for your explanation. But, as you mentioned, the implementation will be straightforward if we use PyTorch's process group. I'll check the change with optuna-examples.
HideakiImamura left a comment
LGTM. The following are follow-ups:
- Fix optuna-examples.
- Reconsider the _g_pg variable.
- Add a test case.
toshihikoyanase left a comment
I confirmed that the change worked with optuna/optuna-examples#150.
Possible follow-up tasks are as follows:
- Replace _g_pg with self._group
- Always create a process group for Optuna trials
- Remove device
- Add tests for group
- Fix the docstring for group
We'll work on them after merging this PR.
@reyoung Thank you for your contribution!
Motivation
Make the PyTorch Distributed integration support GPU distributed data parallel.
Description of the changes
Tensors must be on a CUDA device when using NCCL as the process group backend, so when using multi-GPU distributed data parallel, optuna.TorchDistributedTrial will throw an exception.
This PR creates a global gloo process group when the world uses the NCCL backend, and carries the _g_pg group in all dist calls.
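The "carrying the group for all dist calls" part can be sketched with a small wrapper (pure-Python mock; `make_broadcaster` and `fake_dist_broadcast` are illustrative stand-ins, the latter for `torch.distributed.broadcast_object_list`):

```python
# Pure-Python mock of "carrying the group for all dist calls": a wrapper
# threads the cached gloo group through every collective as `group=`,
# leaving the default (NCCL) group untouched for model training.
calls = []

def fake_dist_broadcast(obj_list, src, group):
    # Stand-in for torch.distributed.broadcast_object_list; records how
    # it was called so we can see the group being threaded through.
    calls.append((tuple(obj_list), src, group))
    return obj_list

def make_broadcaster(dist_broadcast, group):
    def broadcast(obj_list, src=0):
        return dist_broadcast(obj_list, src=src, group=group)
    return broadcast

bcast = make_broadcaster(fake_dist_broadcast, group="optuna-gloo-group")
bcast(["suggested-params"])
print(calls[0])  # (('suggested-params',), 0, 'optuna-gloo-group')
```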