Introduce PartialState as the device handler in the Trainer#22752
pacman100 left a comment:
Hello, my main question is: why aren't we creating, or using an already passed, Accelerator object? My vision was that after creating the accelerator object here muellerzr@12dcc13#diff-bfceaff300c851b8e24fc50dc6638482abaec8f7d2a718e877c3828c166bcf79R1489 you could fill in the local rank and all other distributed setup variables from the accelerator state.
@pacman100 I'm starting small with what is minimally needed for the API. E.g. the AcceleratorState isn't needed until we get into parts such as mixed precision. The device handling can be done separately altogether, as it doesn't need to rely on an accelerator object the user passes in. This may eventually be the
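As a rough illustration of that idea (purely a sketch with made-up names, not accelerate's actual implementation), a single shared state object can own device resolution, so the Trainer no longer branches on local_rank itself:

```python
# Illustrative sketch only: `TinyPartialState` is a hypothetical
# stand-in for accelerate's PartialState, showing the "device handler
# lives in one state object" idea rather than the real API.
class TinyPartialState:
    def __init__(self, cuda_available=False, local_rank=-1):
        if cuda_available and local_rank >= 0:
            # Distributed launch: each process pins its own GPU.
            self.device = f"cuda:{local_rank}"
        elif cuda_available:
            # Single-process run with a GPU available.
            self.device = "cuda:0"
        else:
            self.device = "cpu"

print(TinyPartialState().device)                                   # cpu
print(TinyPartialState(cuda_available=True, local_rank=1).device)  # cuda:1
```

The Trainer would then simply read `state.device` instead of computing it from `args.local_rank`.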
sgugger left a comment:
There is a big problem in the changes to the tests with local_rank: local_rank == 0 means the training was launched in a distributed fashion but with only one process (this could be a setup with multiple nodes and one process per node, for instance). Those tests need to be switched to use PartialState.distributed_type and compare it to DistributedType.NO, not to compare local_rank with 0.
```diff
         )
-        if args.local_rank == -1:
+        if args.local_rank == 0:
```
Those checks should be on the distributed mode, i.e. PartialState.distributed_type == DistributedType.NO, as this did not error before when a user launched a distributed training with one process.
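A minimal sketch of the suggested check, using a stand-in enum so it runs without accelerate installed (real code would read `PartialState().distributed_type` and accelerate's own `DistributedType`):

```python
from enum import Enum

# Stand-in for accelerate's DistributedType enum, declared locally so
# this sketch is self-contained; only the comparison logic matters.
class DistributedType(Enum):
    NO = "NO"
    MULTI_GPU = "MULTI_GPU"

def training_is_local(distributed_type):
    # Key off the distributed mode, not args.local_rank: local_rank
    # can be 0 in a distributed launch with one process per node,
    # so comparing local_rank to -1 or 0 misclassifies that setup.
    return distributed_type == DistributedType.NO

print(training_is_local(DistributedType.NO))         # True
print(training_is_local(DistributedType.MULTI_GPU))  # False
```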
sgugger left a comment:
Looking good! Before we merge this, can you run all trainer tests (on single and multi GPU) as well as the DeepSpeed tests to make sure this doesn't break anything?
Force-pushed from aa28856 to 4e30eef
The PR is good for final review; tested on single GPU, multi GPU, and DeepSpeed, and everything seemed to work fine. Ignore the
Force-pushed from 9140c7a to 762b809
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Force-pushed from 71a1fe8 to 6bf4774
I don't know if this is the right place to report, but I think this merge is causing an error in Seq2SeqTrainingArguments. Before this merge, I was able to run this code. Error messages:
@rennn2002 Could you try updating your version of accelerate, i.e.
```python
if (
    torch.distributed.is_available()
    and torch.distributed.is_initialized()
    and self.distributed_state.distributed_type != DistributedType.NO
```
Should this be `self.distributed_state.distributed_type == DistributedType.NO`?
Good catch! Fixing in a follow up. Thanks! :)
```python
    and self.distributed_state.distributed_type != DistributedType.NO
):
    logger.warning(
        "torch.distributed process group is initialized, but parallel_mode == ParallelMode.DISTRIBUTED. "
```
Should this be `parallel_mode != ParallelMode.DISTRIBUTED`?
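For clarity, a toy sketch of the condition the two review comments above are pointing at, with plain strings standing in for DistributedType (this assumes the fix the reviewer suggests, which per the thread was applied in a follow-up):

```python
# Hypothetical helper (not transformers code) showing the intent of
# the corrected guard: warn only when torch.distributed has an
# initialized process group but the Trainer's state says training is
# NOT distributed -- an inconsistent setup worth flagging.
def should_warn(dist_available, dist_initialized, distributed_type):
    return (
        dist_available
        and dist_initialized
        and distributed_type == "NO"  # the review suggests == rather than !=
    )

print(should_warn(True, True, "NO"))         # True: inconsistent, warn
print(should_warn(True, True, "MULTI_GPU"))  # False: consistent setup
print(should_warn(False, False, "NO"))       # False: no process group at all
```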
…ingface#22752)

* Use accelerate for device management
* Add accelerate to setup

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
What does this PR do?
This PR is the start of the `Trainer` integration of Accelerate. The integration will follow multiple stages to ensure small changes happen iteratively. This first one simply changes the device handler/setter to be the `PartialState` in Accelerate and nothing more. In a follow-up PR, I will start to include more utilities utilizing it.

Before submitting
* Did you read the contributor guideline, Pull Request section?
* Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
* Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@sgugger @pacman100