
Introduce PartialState as the device handler in the Trainer #22752

Merged
muellerzr merged 27 commits into main from muellerzr-accelerate-device
Apr 17, 2023

Conversation

@muellerzr
Contributor

What does this PR do?

This PR is the start of the Trainer integration of Accelerate

The integration will follow multiple stages to ensure small changes happen iteratively. This first one simply changes the device handler/setter to be the PartialState in Accelerate and nothing more. In a follow-up PR, I will start to include more utilities utilizing it.
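To illustrate the idea (not Accelerate's actual implementation), here is a toy stand-in showing how a PartialState-style object can own device selection based on the launcher's environment. The class name and logic below are simplified assumptions for illustration; the real `accelerate.PartialState` also initializes the process group and handles TPU/MPS/NPU backends:

```python
import os

class MiniPartialState:
    """Toy stand-in for accelerate's PartialState (illustration only).

    The real class is a shared singleton that also initializes the
    distributed process group; here we only derive a device string.
    """

    def __init__(self):
        # Distributed launchers (torchrun, accelerate launch) export LOCAL_RANK.
        self.local_rank = int(os.environ.get("LOCAL_RANK", -1))
        self.distributed = self.local_rank != -1
        # In the distributed case, each process binds to its own GPU.
        self.device = f"cuda:{self.local_rank}" if self.distributed else "cpu"

state = MiniPartialState()
print(state.device)
```

The point of the refactor is that the Trainer can ask this one object for its device instead of re-implementing rank/device bookkeeping itself.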

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sgugger @pacman100

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Apr 13, 2023

The documentation is not available anymore as the PR was closed or merged.

Contributor

@pacman100 pacman100 left a comment


Hello, my main question is: why aren't we creating, or using an already passed, Accelerator object? My vision was that, after creating the accelerator object here muellerzr@12dcc13#diff-bfceaff300c851b8e24fc50dc6638482abaec8f7d2a718e877c3828c166bcf79R1489,

you could fill in the local rank and all the other distributed setup variables from the accelerator state.

@muellerzr
Contributor Author

muellerzr commented Apr 14, 2023

@pacman100 I'm starting small with what is minimally needed from the API. E.g., the AcceleratorState isn't needed until we get into parts such as mixed precision. The device handling can be done separately altogether, as it doesn't need to rely on an accelerator object the user passes in.

This may eventually become the AcceleratorState, but for now starting with just the PartialState seemed okay to me.

Collaborator

@sgugger sgugger left a comment


There is a big problem in the changes to the tests with local_rank: local_rank == 0 can also mean the training was launched in a distributed fashion but with only one process (for instance, a setup with multiple nodes and one process per node). Those tests need to be switched to use PartialState.distributed_type and compare it to DistributedType.NO, not compare local_rank with 0.

Comment thread: src/transformers/trainer.py (outdated)
    )
-   if args.local_rank == -1:
+   if args.local_rank == 0:
Collaborator


Those checks should be on the distributed mode, i.e. PartialState.distributed_type == DistributedType.NO, as this did not error before if a user was launching a distributed training with one process.
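As a sketch of the suggested check, using simplified local stand-ins for accelerate's DistributedType enum and state object (in the real code these come from accelerate.utils and accelerate.PartialState):

```python
from dataclasses import dataclass
from enum import Enum

class DistributedType(str, Enum):
    # Simplified subset of accelerate.utils.DistributedType
    NO = "NO"
    MULTI_GPU = "MULTI_GPU"

@dataclass
class FakeState:
    """Stand-in for the piece of PartialState the check needs."""
    distributed_type: DistributedType

def is_single_process(state) -> bool:
    # The robust check compares the distributed mode, not local_rank,
    # so a distributed launch with one process is still detected correctly
    # (local_rank would be 0 there, just as on a plain single-GPU run).
    return state.distributed_type == DistributedType.NO

print(is_single_process(FakeState(DistributedType.NO)))         # True
print(is_single_process(FakeState(DistributedType.MULTI_GPU)))  # False
```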

Collaborator

@sgugger sgugger left a comment


Looking good! Before we merge this, can you run all trainer tests (on single and multi GPU) as well as the DeepSpeed tests to make sure this doesn't break anything?

Comment thread src/transformers/trainer.py Outdated
@muellerzr muellerzr force-pushed the muellerzr-accelerate-device branch from aa28856 to 4e30eef Compare April 14, 2023 19:33
@muellerzr muellerzr requested review from pacman100 and sgugger April 14, 2023 20:21
@muellerzr
Contributor Author

The PR is good for final review; tested on single GPU, multi GPU, and DeepSpeed, and everything seemed to work fine. Ignore the examples_tensorflow failure, as that's actually a good failure (if one could exist).

@muellerzr muellerzr force-pushed the muellerzr-accelerate-device branch from 9140c7a to 762b809 Compare April 17, 2023 16:35
@muellerzr muellerzr force-pushed the muellerzr-accelerate-device branch from 71a1fe8 to 6bf4774 Compare April 17, 2023 18:26
@muellerzr muellerzr merged commit 0346287 into main Apr 17, 2023
@muellerzr muellerzr deleted the muellerzr-accelerate-device branch April 17, 2023 19:09
@rennn2002

@muellerzr

I don't know if this is the right place to report this, but I think this merge is causing an error in Seq2SeqTrainingArguments.

Before this merge, I was able to run this code:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./train_test",  # change to a repo name of your choice
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=5,
    max_steps=40,
    gradient_checkpointing=True,
    fp16=True,
    group_by_length=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=10,
    eval_steps=10,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
)

Error messages

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
[<ipython-input-25-7693116680cc>](https://localhost:8080/#) in <cell line: 3>()
      1 from transformers import Seq2SeqTrainingArguments
      2 
----> 3 training_args = Seq2SeqTrainingArguments(
      4     output_dir="./whisper-large-ja-7",  # change to a repo name of your choice
      5     per_device_train_batch_size=8,

[/usr/local/lib/python3.9/dist-packages/transformers/training_args_seq2seq.py](https://localhost:8080/#) in __init__(self, output_dir, overwrite_output_dir, do_train, do_eval, do_predict, evaluation_strategy, prediction_loss_only, per_device_train_batch_size, per_device_eval_batch_size, per_gpu_train_batch_size, per_gpu_eval_batch_size, gradient_accumulation_steps, eval_accumulation_steps, eval_delay, learning_rate, weight_decay, adam_beta1, adam_beta2, adam_epsilon, max_grad_norm, num_train_epochs, max_steps, lr_scheduler_type, warmup_ratio, warmup_steps, log_level, log_level_replica, log_on_each_node, logging_dir, logging_strategy, logging_first_step, logging_steps, logging_nan_inf_filter, save_strategy, save_steps, save_total_limit, save_safetensors, save_on_each_node, no_cuda, use_mps_device, seed, data_seed, jit_mode_eval, use_ipex, bf16, fp16, fp16_opt_level, half_precision_backend, bf16_full_eval, fp16_full_eval, tf32, local_rank, xpu_backend, tpu_num_cores, tpu_metrics_debug, debug, dataloader_drop_last, eval_steps, dataloader_num_workers, past_index, run_name, disable_tqdm, remove_unused_columns, label_names, load_best_model_at_end, metric_for_best_model, greater_is_better, ignore_data_skip, sharded_ddp, fsdp, fsdp_min_num_params, fsdp_config, fsdp_transformer_layer_cls_to_wrap, deepspeed, label_smoothing_factor, optim, optim_args, adafactor, group_by_length, length_column_name, report_to, ddp_find_unused_parameters, ddp_bucket_cap_mb, dataloader_pin_memory, skip_mem...

[/usr/local/lib/python3.9/dist-packages/transformers/training_args.py](https://localhost:8080/#) in __post_init__(self)
   1253             self.framework == "pt"
   1254             and is_torch_available()
-> 1255             and (self.device.type != "cuda")
   1256             and (get_xla_device_type(self.device) != "GPU")
   1257             and (self.fp16 or self.fp16_full_eval)

[/usr/local/lib/python3.9/dist-packages/transformers/training_args.py](https://localhost:8080/#) in device(self)
   1613         """
   1614         requires_backends(self, ["torch"])
-> 1615         return self._setup_devices
   1616 
   1617     @property

[/usr/local/lib/python3.9/dist-packages/transformers/utils/generic.py](https://localhost:8080/#) in __get__(self, obj, objtype)
     52         cached = getattr(obj, attr, None)
     53         if cached is None:
---> 54             cached = self.fget(obj)
     55             setattr(obj, attr, cached)
     56         return cached

[/usr/local/lib/python3.9/dist-packages/transformers/training_args.py](https://localhost:8080/#) in _setup_devices(self)
   1547             device = self.distributed_state.device
   1548         else:
-> 1549             self.distributed_state = PartialState(backend=self.xpu_backend)
   1550             device = self.distributed_state.device
   1551             self._n_gpu = 1

NameError: name 'PartialState' is not defined

@amyeroberts
Contributor

@rennn2002 Could you try updating your version of accelerate, i.e. `pip install --upgrade accelerate`?
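The NameError above typically appears when transformers' guarded import of PartialState was skipped because accelerate was missing (or too old to export the class). A quick stdlib-only way to check availability before building TrainingArguments; the helper name here is illustrative, not a transformers API:

```python
import importlib.util

def accelerate_is_installed() -> bool:
    # find_spec returns None when the package cannot be imported at all,
    # which is exactly the case where PartialState stays undefined.
    return importlib.util.find_spec("accelerate") is not None

if not accelerate_is_installed():
    print("accelerate not found; run: pip install --upgrade accelerate")
```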

Contributor

@pacman100 pacman100 left a comment


Hello, thank you @zach for adding this 🚀. Left a couple of comments.

if (
    torch.distributed.is_available()
    and torch.distributed.is_initialized()
    and self.distributed_state.distributed_type != DistributedType.NO
Contributor


Should this be self.distributed_state.distributed_type == DistributedType.NO?

Contributor Author


Good catch! Fixing in a follow up. Thanks! :)

    and self.distributed_state.distributed_type != DistributedType.NO
):
    logger.warning(
        "torch.distributed process group is initialized, but parallel_mode == ParallelMode.DISTRIBUTED. "
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be parallel_mode != ParallelMode.DISTRIBUTED?

novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023
…ingface#22752)

* Use accelerate for device management

* Add accelerate to setup


Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>


6 participants