
TorchDistributedTrial uses group as parameter instead of device#4106

Merged
toshihikoyanase merged 16 commits into optuna:master from reyoung:master on Dec 21, 2022

Conversation

@reyoung
Contributor

@reyoung reyoung commented Oct 31, 2022

Motivation

Make PyTorch Distributed integration support GPU distributed data-parallel.

Description of the changes

Tensors must be on a CUDA device when NCCL is the process group backend. So, under multi-GPU distributed data parallel, optuna.TorchDistributedTrial throws an exception.

This PR creates a global Gloo process group when the WORLD group uses the NCCL backend, and passes that group (_g_pg) to all dist calls.
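The idea can be sketched as follows. This is a hypothetical sketch, not the PR's exact code; the helper name get_optuna_group is made up here.

```python
import torch.distributed as dist


def get_optuna_group():
    # Hypothetical helper: create a secondary Gloo process group for
    # Optuna's small, infrequent CPU-tensor communication when the
    # default WORLD group uses the NCCL backend.
    # Assumes dist.init_process_group(...) has already been called;
    # note that new_group must be invoked collectively by every rank.
    if dist.get_backend() == "nccl":
        return dist.new_group(backend="gloo")
    return None  # None means "use the default (WORLD) group"
```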

@github-actions github-actions bot added the optuna.integration Related to the `optuna.integration` submodule. This is automatically labeled by github-actions. label Oct 31, 2022
@reyoung
Contributor Author

reyoung commented Oct 31, 2022

@toshihikoyanase @not522 Please take a look, thanks.

@codecov-commenter

codecov-commenter commented Oct 31, 2022

Codecov Report

Attention: Patch coverage is 11.53846% with 23 lines in your changes missing coverage. Please review.

Project coverage is 89.74%. Comparing base (7e4a3d1) to head (fd7d01c).
Report is 4411 commits behind head on master.

Files with missing lines Patch % Lines
optuna/integration/pytorch_distributed.py 11.53% 23 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4106      +/-   ##
==========================================
- Coverage   89.76%   89.74%   -0.03%     
==========================================
  Files         162      162              
  Lines       12598    12607       +9     
==========================================
+ Hits        11309    11314       +5     
- Misses       1289     1293       +4     


@toshihikoyanase
Member

@reyoung Thank you for your PR. Before diving into the details of the code, could you share a minimum reproducible example of the problem? I guess we need it to confirm that your change resolves the problem.

Tensors must be on a CUDA device when using NCCL as a process group. So when using multiple GPU distributed data parallel, optuna.TorchDistributedTrial will throw an exception.

Creating a global gloo process group

gloo seems to be used for distributed CPU training. So, in my understanding, this PR uses the Gloo backend for Optuna communication while it uses NCCL for the communication of PyTorch's model training. Could you tell me if I'm wrong?

Use the Gloo backend for distributed CPU training.
https://pytorch.org/docs/stable/distributed.html#which-backend-to-use

@toshihikoyanase
Member

toshihikoyanase commented Oct 31, 2022

I checked the implementation of TorchDistributedTrial. It moves data to devices (e.g., GPU memory) before communication so that NCCL can broadcast Optuna-related data. So, I think TorchDistributedTrial is designed for distributed GPU training.
What do you think?

if rank == 0:
    buffer = _to_tensor(value)
    size_buffer[0] = buffer.shape[0]
if self._device is not None:
    size_buffer = size_buffer.to(self._device)
dist.broadcast(size_buffer, src=0)  # type: ignore

if self._device is not None:
    buffer = buffer.to(self._device)
dist.broadcast(buffer, src=0)  # type: ignore

@reyoung
Contributor Author

reyoung commented Nov 1, 2022

I see. It seems that the device needs to be passed to TorchDistributedTrial (I missed the device parameter of TorchDistributedTrial).

However, moving tensors between CPU and GPU seems unnecessary and slow, and it is easier to just create a Gloo backend when the WORLD group uses NCCL.

Also, device is not needed, because we can use dist.get_backend to detect the global process group's backend. When it is NCCL, just set device to cuda:0?

  • The device parameter is not really about the tensors' device; it actually encodes a process-group limitation. What if some users have an internal process group implementation that can handle both CPU and GPU communication (which is common inside some companies)?

In summary, I think there are two possible ways to enhance this API.

  1. How about removing the device parameter, adding a group parameter, and creating a Gloo backend when the WORLD group uses NCCL?
    • That is, add group=None as a TorchDistributedTrial parameter; when group=None, TorchDistributedTrial tries to create a process group itself.
    • The PR has been modified to implement this option.
  2. How about defaulting device to 'cuda:0' if dist.get_backend(dist.WORLD) == 'nccl' else None?

I personally prefer option 1 because it is faster and easier to understand, but it makes the device parameter unnecessary.
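Option 2 above could be sketched like this (a hypothetical helper name, not proposed API; assumes the default process group has already been initialized):

```python
import torch.distributed as dist


def default_device():
    # Hypothetical sketch of option 2: derive a default device from the
    # WORLD group's backend instead of requiring the caller to pass one.
    return "cuda:0" if dist.get_backend() == "nccl" else None
```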


gloo seems to be used for distributed CPU training. So, in my understanding, this PR uses the Gloo backend for Optuna communication while it uses NCCL for the communication of PyTorch's model training. Could you tell me if I'm wrong?

Yes. You're right.

There can be many process groups in PyTorch Distributed. We can use Gloo for TorchDistributedTrial and NCCL for merging gradients.

This PR creates a NEW Gloo process group when the WORLD group uses the NCCL backend.

All training communication other than TorchDistributedTrial still uses NCCL in this PR.
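That split could be sketched as follows (a hypothetical function name; assumes a Gloo group has been created alongside the NCCL WORLD group):

```python
import torch
import torch.distributed as dist


def broadcast_optuna_data(t: torch.Tensor, gloo_group) -> torch.Tensor:
    # CPU tensors are fine here because this group uses the Gloo backend;
    # gradient all-reduce keeps running on the default (NCCL) group.
    dist.broadcast(t, src=0, group=gloo_group)
    return t
```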

@reyoung reyoung changed the title PyTorchDistributed: create a gloo process group when WORLD is NCCL PyTorchDistributed use group as parameter instead of device Nov 1, 2022
@reyoung
Contributor Author

reyoung commented Nov 1, 2022

@toshihikoyanase Please take a look.

@HideakiImamura
Member

@toshihikoyanase Could you review this PR?

Member

@HideakiImamura HideakiImamura left a comment


Thanks for the PR. I haven't really thought too much about the removal of the device argument and the addition of the group argument yet, but I'll leave a couple of quick comments first.

I would like to test the validity of this change. Is it possible to add unit tests? Note that Optuna's CI does not test on GPUs. optuna-examples confirms that TorchDistributedTrial works on CPU here. If possible, could you please provide a script to easily check the behavior on Google Colab or something similar?

"Use :func:`~optuna.integration.TorchDistributedTrial.suggest_float` instead."
)

_g_pg: List[Optional["torch.distributed.ProcessGroup"]] = [None]
Member


I think we should not use a global variable with state. Please remove _g_pg and give TorchDistributedTrial explicit state.

Contributor Author


We cannot.

Since TorchDistributedTrial is created multiple times during training, we need a global variable to store the Gloo group by default: a process group is an application-scoped object, while TorchDistributedTrial is a per-trial (objective-local) object.

See https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_distributed_simple.py#L100

Member


In a distributed optimization setup, even global variables are not shared in memory across processes, right? I think it is equivalent to storing information to properly identify the process group for each TorchDistributedTrial.

@reyoung
Contributor Author

reyoung commented Nov 14, 2022

For testing code:

Actually, I modified huggingface/transformers to use this PR.

Code here

I tested it on my internal GPU cluster, and it works well with multi-GPU DDP.

However, writing unit tests needs a GPU CI system, which Optuna does not have, right?

@reyoung
Contributor Author

reyoung commented Nov 16, 2022

@HideakiImamura

For example please see optuna/optuna-examples#145.

Actually, the PyTorch distributed environment variables should not be set manually. You can use torchrun to set them all.
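For example, a typical single-node launch could look like this (the script name is from optuna-examples; the worker count is illustrative):

```shell
# torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and
# MASTER_PORT for each worker process automatically.
torchrun --nproc_per_node=4 pytorch_distributed_simple.py
```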

@HideakiImamura
Member

Let me confirm one thing: does the current PyTorch DDP integration in Optuna's master not support GPU distributed data-parallel training? If it already supports it, we don't have to change the code. It would be great if you could share examples (scripts and environment information) where GPU distributed data-parallel training fails.

@github-actions
Contributor

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale Exempt from stale bot labeling. label Nov 27, 2022
@toshihikoyanase
Member

Does the current PyTorch DDP integration in Optuna's master not support GPU distributed data-parallel training?

@reyoung @HideakiImamura I'm sorry for the delayed response.
I confirmed PytorchDistributedTrial in the current master supported multi-node configuration.
Could you check the following PR in optuna/optuna-exampels, please?

optuna/optuna-examples#150

@github-actions github-actions bot removed the stale Exempt from stale bot labeling. label Nov 29, 2022
@github-actions
Contributor

github-actions bot commented Dec 7, 2022

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale Exempt from stale bot labeling. label Dec 7, 2022
@reyoung
Contributor Author

reyoung commented Dec 9, 2022

Does the current PyTorch DDP integration in Optuna's master not support GPU distributed data-parallel training?

Yes, you can run GPU DDP with the current Optuna code, BUT:

  • device must be passed to TorchDistributedTrial, yet device is not related to distributed communication at all. They are NOT concepts at the same level.
    • A ProcessGroup is decoupled from devices. Some ProcessGroup implementations can only communicate tensors on certain devices.
    • However, you cannot assume that this limitation is bound to a device, because you can write a ProcessGroup that supports both GPU and CPU communication.
  • It must copy between GPU and CPU memory, which is slow and unnecessary.
  • It cannot accept a custom ProcessGroup for communication, as PyTorch itself always can.
    • Some cloud clusters have their own optimized ProcessGroup, like AWS SageMaker.

It is not common to configure communication with a device rather than a ProcessGroup.

I think IT IS A REALLY BAD DESIGN to use device rather than ProcessGroup in TorchDistributedTrial.

@github-actions github-actions bot removed the stale Exempt from stale bot labeling. label Dec 11, 2022
@toshihikoyanase
Member

Thank you for your explanation.

The current design of TorchDistributedTrial is based on ChainerMNStudy, which uses the communication paths for weights and gradients to broadcast Optuna's data. So, we intentionally use the device parameter in TorchDistributedTrial to reuse the process group for weights and gradients. Also, the overhead of copying between GPU and CPU is limited, since Optuna's communication frequency is quite low compared with that of training.

But, as you mentioned, the implementation will be more straightforward if we use PyTorch's process groups. I'll check the change with pytorch_distributed_simple.py in optuna-examples.

Member

@HideakiImamura HideakiImamura left a comment


LGTM. The following are follow-up tasks.

  • Fix optuna-examples.
  • Reconsider the _g_pg variable.
  • Add a test case.

Member

@toshihikoyanase toshihikoyanase left a comment


I confirmed that the change worked with optuna/optuna-examples#150.

Possible follow-up tasks are as follows:

  • Replace _g_pg with self._group
  • Always create process group for optuna trials
  • Remove device
  • Add tests for group
  • Fix docstring for group

We'll work on them after merging this PR.

@toshihikoyanase toshihikoyanase added the feature Change that does not break compatibility, but affects the public interfaces. label Dec 21, 2022
@toshihikoyanase toshihikoyanase merged commit 5ad1736 into optuna:master Dec 21, 2022
@toshihikoyanase toshihikoyanase added this to the v3.1.0 milestone Dec 21, 2022
@toshihikoyanase
Member

@reyoung Thank you for your contribution!

@toshihikoyanase toshihikoyanase changed the title PyTorchDistributed use group as parameter instead of device TorchDistributedTrial uses group as parameter instead of device Dec 22, 2022

Labels

feature Change that does not break compatibility, but affects the public interfaces. optuna.integration Related to the `optuna.integration` submodule. This is automatically labeled by github-actions.
