
TorchDistributedTrial uses group as parameter instead of device#4106

Merged
toshihikoyanase merged 16 commits into optuna:master from reyoung:master on Dec 21, 2022

Conversation

@reyoung
Contributor

@reyoung reyoung commented Oct 31, 2022

Motivation

Make PyTorch Distributed integration support GPU distributed data-parallel.

Description of the changes

Tensors must be on a CUDA device when NCCL is the process group backend. So, under multi-GPU distributed data parallel, optuna.TorchDistributedTrial throws an exception.

This PR creates a global Gloo process group when the WORLD group uses the NCCL backend, and passes that group (_g_pg) to all dist calls.
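The idea can be sketched as follows. This is a hypothetical sketch, not the PR's exact code; the helper name get_optuna_group is made up here.

```python
import torch.distributed as dist


def get_optuna_group():
    # Hypothetical helper: create a secondary Gloo process group for
    # Optuna's small, infrequent CPU-tensor communication when the
    # default WORLD group uses the NCCL backend.
    # Assumes dist.init_process_group(...) has already been called;
    # note that new_group must be invoked collectively by every rank.
    if dist.get_backend() == "nccl":
        return dist.new_group(backend="gloo")
    return None  # None means "use the default (WORLD) group"
```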

@github-actions github-actions bot added the optuna.integration Related to the `optuna.integration` submodule. This is automatically labeled by github-actions. label Oct 31, 2022
@reyoung
Contributor Author

reyoung commented Oct 31, 2022

@toshihikoyanase @not522 Please take a look, thanks.

@codecov-commenter

codecov-commenter commented Oct 31, 2022

Codecov Report

Attention: Patch coverage is 11.53846% with 23 lines in your changes missing coverage. Please review.

Project coverage is 89.74%. Comparing base (7e4a3d1) to head (fd7d01c).
Report is 4411 commits behind head on master.

Files with missing lines Patch % Lines
optuna/integration/pytorch_distributed.py 11.53% 23 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4106      +/-   ##
==========================================
- Coverage   89.76%   89.74%   -0.03%     
==========================================
  Files         162      162              
  Lines       12598    12607       +9     
==========================================
+ Hits        11309    11314       +5     
- Misses       1289     1293       +4     


@toshihikoyanase
Member

@reyoung Thank you for your PR. Before diving into the details of the code, could you share a minimum reproducible example of the problem? I guess we need it to confirm that your change resolves the problem.

Tensors must be on a CUDA device when using NCCL as a process group. So when using multiple GPU distributed data parallel, optuna.TorchDistributedTrial will throw an exception.

Creating a global gloo process group

gloo seems to be used for distributed CPU training. So, in my understanding, this PR uses the Gloo backend for Optuna communication while it uses NCCL for the communication of PyTorch's model training. Could you tell me if I'm wrong?

Use the Gloo backend for distributed CPU training.
https://pytorch.org/docs/stable/distributed.html#which-backend-to-use

@toshihikoyanase
Member

toshihikoyanase commented Oct 31, 2022

I checked the implementation of TorchDistributedTrial. It moves data to devices (e.g., GPU memory) before communication so that NCCL can broadcast Optuna-related data. So, I think TorchDistributedTrial is designed for distributed GPU training.
What do you think?

if rank == 0:
    buffer = _to_tensor(value)
    size_buffer[0] = buffer.shape[0]
if self._device is not None:
    size_buffer = size_buffer.to(self._device)
dist.broadcast(size_buffer, src=0)  # type: ignore

if self._device is not None:
    buffer = buffer.to(self._device)
dist.broadcast(buffer, src=0)  # type: ignore

@reyoung
Contributor Author

reyoung commented Nov 1, 2022

I see. It seems that the device needs to be passed to TorchDistributedTrial (I missed the device parameter of TorchDistributedTrial).

However, moving tensors between CPU and GPU seems unnecessary and slow, and it is easier to just create a Gloo backend when the WORLD group uses NCCL.

Also, device is not needed, because we can use dist.get_backend to detect the global process group's backend. When it is NCCL, just set device to cuda:0?

  • The device parameter is not really about the tensors' device; it actually encodes a process-group limitation. What if some users have an internal process group implementation that can handle both CPU and GPU communication (which is common inside some companies)?

In summary, I think there are two possible ways to enhance this API.

  1. How about removing the device parameter, adding a group parameter, and creating a Gloo backend when the WORLD group uses NCCL?
    • That is, add group=None as a TorchDistributedTrial parameter; when group=None, TorchDistributedTrial tries to create a process group itself.
    • The PR has been modified to implement this option.
  2. How about defaulting device to 'cuda:0' if dist.get_backend(dist.WORLD) == 'nccl' else None?

I personally prefer option 1 because it is faster and easier to understand, but it makes the device parameter unnecessary.
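Option 2 above could be sketched like this (a hypothetical helper name, not proposed API; assumes the default process group has already been initialized):

```python
import torch.distributed as dist


def default_device():
    # Hypothetical sketch of option 2: derive a default device from the
    # WORLD group's backend instead of requiring the caller to pass one.
    return "cuda:0" if dist.get_backend() == "nccl" else None
```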


gloo seems to be used for distributed CPU training. So, in my understanding, this PR uses the Gloo backend for Optuna communication while it uses NCCL for the communication of PyTorch's model training. Could you tell me if I'm wrong?

Yes. You're right.

There can be many process groups in PyTorch Distributed. We can use Gloo for TorchDistributedTrial and NCCL for merging gradients.

This PR creates a NEW Gloo process group when the WORLD group uses the NCCL backend.

All training communication other than TorchDistributedTrial still uses NCCL in this PR.
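That split could be sketched as follows (a hypothetical function name; assumes a Gloo group has been created alongside the NCCL WORLD group):

```python
import torch
import torch.distributed as dist


def broadcast_optuna_data(t: torch.Tensor, gloo_group) -> torch.Tensor:
    # CPU tensors are fine here because this group uses the Gloo backend;
    # gradient all-reduce keeps running on the default (NCCL) group.
    dist.broadcast(t, src=0, group=gloo_group)
    return t
```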

@reyoung reyoung changed the title PyTorchDistributed: create a gloo process group when WORLD is NCCL PyTorchDistributed use group as parameter instead of device Nov 1, 2022
@reyoung
Contributor Author

reyoung commented Nov 1, 2022

@toshihikoyanase Please take a look.

@HideakiImamura
Member

@toshihikoyanase Could you review this PR?

Member

@HideakiImamura HideakiImamura left a comment


Thanks for the PR. I haven't really thought too much about the removal of the device argument and the addition of the group argument yet, but I'll leave a couple of quick comments first.

I would like to test the validity of this change. Is it possible to add unit tests? Note that Optuna's CI does not test on GPUs. optuna-examples confirms that TorchDistributedTrial works on CPU here. If possible, could you please provide a script to easily check the behavior on Google Colab or something similar?

"Use :func:`~optuna.integration.TorchDistributedTrial.suggest_float` instead."
)

_g_pg: List[Optional["torch.distributed.ProcessGroup"]] = [None]
Member


I think we should not use a global variable with state. Please remove _g_pg and give TorchDistributedTrial explicit state.

Contributor Author


We cannot.

Since TorchDistributedTrial is created multiple times during training, we need a global variable to store the Gloo group by default: a process group is an application-scoped object, while TorchDistributedTrial is a per-trial (objective-local) object.

See https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_distributed_simple.py#L100

Member


In a distributed optimization setup, even global variables are not shared in memory across processes, right? I think it is equivalent to storing information to properly identify the process group for each TorchDistributedTrial.

@reyoung
Contributor Author

reyoung commented Nov 14, 2022

For testing code:

Actually, I modified huggingface/transformers to use this PR.

Code here

I tested it on my internal GPU cluster, and it works well with multi-GPU DDP.

However, writing unit tests needs a GPU CI system, which Optuna does not have, right?

@reyoung
Contributor Author

reyoung commented Nov 16, 2022

@HideakiImamura

For example please see optuna/optuna-examples#145.

Actually, the PyTorch distributed environment variables should not be set manually. You can use torchrun to set them all.
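For example, a typical single-node launch could look like this (the script name is from optuna-examples; the worker count is illustrative):

```shell
# torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and
# MASTER_PORT for each worker process automatically.
torchrun --nproc_per_node=4 pytorch_distributed_simple.py
```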

@HideakiImamura
Member

Let me confirm one thing: does the current PyTorch DDP integration in Optuna's master not support GPU distributed data-parallel training? If it already supports it, we don't have to change the code. It would be great if you could share examples (scripts and environment information) where GPU distributed data-parallel training fails.

@github-actions
Contributor

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale Exempt from stale bot labeling. label Nov 27, 2022
@toshihikoyanase
Member

Does the current PyTorch DDP integration in Optuna's master not support GPU distributed data-parallel training?

@reyoung @HideakiImamura I'm sorry for the delayed response.
I confirmed PytorchDistributedTrial in the current master supported multi-node configuration.
Could you check the following PR in optuna/optuna-exampels, please?

optuna/optuna-examples#150

@github-actions github-actions bot removed the stale Exempt from stale bot labeling. label Nov 29, 2022
@github-actions
Contributor

github-actions bot commented Dec 7, 2022

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale Exempt from stale bot labeling. label Dec 7, 2022
@reyoung
Contributor Author

reyoung commented Dec 9, 2022

Does the current PyTorch DDP integration in Optuna's master not support GPU distributed data-parallel training?

Yes, you can run GPU DDP with the current Optuna code, BUT:

  • device must be passed to TorchDistributedTrial, yet device is not related to distributed communication at all. They are NOT concepts at the same level.
    • A ProcessGroup is decoupled from devices. Some ProcessGroup implementations can only communicate tensors on certain devices.
    • However, you cannot assume that this limitation is bound to a device, because you can write a ProcessGroup that supports both GPU and CPU communication.
  • It must copy between GPU and CPU memory, which is slow and unnecessary.
  • It cannot accept a custom ProcessGroup for communication, as PyTorch itself always can.
    • Some cloud clusters have their own optimized ProcessGroup, like AWS SageMaker.

It is not common to configure communication with a device rather than a ProcessGroup.

I think IT IS A REALLY BAD DESIGN to use device rather than ProcessGroup in TorchDistributedTrial.

@github-actions github-actions bot removed the stale Exempt from stale bot labeling. label Dec 11, 2022
@toshihikoyanase
Member

Thank you for your explanation.

The current design of TorchDistributedTrial is based on ChainerMNStudy, which uses the communication paths for weights and gradients to broadcast Optuna's data. So, we intentionally use the device parameter in TorchDistributedTrial to reuse the process group for weights and gradients. Also, the overhead of copying between GPU and CPU is limited, since Optuna's communication frequency is quite low compared with that of training.

But, as you mentioned, the implementation will be more straightforward if we use PyTorch's process groups. I'll check the change with pytorch_distributed_simple.py in optuna-examples.

Member

@HideakiImamura HideakiImamura left a comment


LGTM. The following are follow-up tasks.

  • Fix optuna-examples.
  • Reconsider the _g_pg variable.
  • Add a test case.

Member

@toshihikoyanase toshihikoyanase left a comment


I confirmed that the change worked with optuna/optuna-examples#150.

Possible follow-up tasks are as follows:

  • Replace _g_pg with self._group
  • Always create process group for optuna trials
  • Remove device
  • Add tests for group
  • Fix docstring for group

We'll work on them after merging this PR.

@toshihikoyanase toshihikoyanase added the feature Change that does not break compatibility, but affects the public interfaces. label Dec 21, 2022
@toshihikoyanase toshihikoyanase merged commit 5ad1736 into optuna:master Dec 21, 2022
@toshihikoyanase toshihikoyanase added this to the v3.1.0 milestone Dec 21, 2022
@toshihikoyanase
Member

@reyoung Thank you for your contribution!

@toshihikoyanase toshihikoyanase changed the title PyTorchDistributed use group as parameter instead of device TorchDistributedTrial uses group as parameter instead of device Dec 22, 2022

Labels

feature Change that does not break compatibility, but affects the public interfaces. optuna.integration Related to the `optuna.integration` submodule. This is automatically labeled by github-actions.
