Fix no_trainer examples to properly calculate the number of samples by muellerzr · Pull Request #17046 · huggingface/transformers

muellerzr · 2022-05-02T15:17:03Z

Fix number of samples for `no_trainer` scripts

What does this add?

This PR fixes all of the no_trainer scripts so that they use the correct number of training steps after accelerator.prepare changes the length of the dataloader.

Why is it needed?

Currently, in a multi-process setup, the number of training steps is computed before accelerator.prepare changes the length of the dataloaders, and it is never updated afterwards. As a result, the break condition still uses the original step count, and the progress bar still shows the old total.

Simplified example:

If the dataloader starts with 128 batches and 2 GPUs are used, then each process's dataloader has 64 batches. As a result, the progress bar should use 64, and the break condition also needs to know there are only 64. Both currently still use 128.
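The arithmetic in this example can be sketched in plain Python. This is only an illustration, assuming the batches are split evenly across processes (rounded up when they do not divide evenly); `batches_per_process` is a hypothetical helper and not part of Accelerate.

```python
import math


def batches_per_process(total_batches: int, num_processes: int) -> int:
    """Hypothetical helper: batches each process sees once the dataloader is sharded.

    Assumes an even split, rounded up when the count does not divide evenly.
    """
    return math.ceil(total_batches / num_processes)


total_batches = 128
num_processes = 2

# Value the progress bar and break condition used before the fix.
stale_total = total_batches
# Value they should use once the dataloader has been sharded.
correct_total = batches_per_process(total_batches, num_processes)

print(stale_total, correct_total)
```

With a single process the two values coincide, which is why the problem only shows up in multi-process runs.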

What parts of the API does this impact?

User-facing:

All scripts now recalculate max_train_steps after accelerator.prepare.

Basic Usage Example(s):

    # Prepare everything with our `accelerator`.
    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
    )

    # We need to recalculate our total training steps
    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
    args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
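To see why the recalculation matters for the break condition, the loop structure can be simulated without any training. The sketch below is a simplified stand-in for the scripts' training loop, not code taken from them; `run_training` and its parameters are hypothetical names, and the numbers reuse the 128-batch, 2-GPU example with an assumed 3 epochs.

```python
def run_training(batches_per_epoch, num_train_epochs, max_train_steps, gradient_accumulation_steps=1):
    """Simulate the loop: count optimizer updates and stop at max_train_steps.

    Returns (completed_steps, hit_break).
    """
    completed_steps = 0
    for _epoch in range(num_train_epochs):
        for step in range(batches_per_epoch):
            # An update happens every `gradient_accumulation_steps` batches,
            # or on the last batch of the epoch.
            if (step + 1) % gradient_accumulation_steps == 0 or step == batches_per_epoch - 1:
                completed_steps += 1
            if completed_steps >= max_train_steps:
                return completed_steps, True
    return completed_steps, False


# Each process sees 64 batches per epoch after sharding (128 batches, 2 processes).
stale = run_training(batches_per_epoch=64, num_train_epochs=3, max_train_steps=3 * 128)
fixed = run_training(batches_per_epoch=64, num_train_epochs=3, max_train_steps=3 * 64)

print(stale)  # the break condition is never reached with the stale total
print(fixed)  # the break condition fires exactly when training finishes
```

With the stale total, the loop ends after 192 updates while the progress bar expects 384; with the recalculated total, the two agree.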

When would I use it, and when wouldn't I?

While the recalculation always runs, it is technically only needed when training with more than one process.

sgugger

Thanks for fixing! LGTM with one nit to propagate!

examples/pytorch/image-classification/run_image_classification_no_trainer.py

HuggingFaceDocBuilderDev · 2022-05-02T15:34:27Z

The documentation is not available anymore as the PR was closed or merged.

kowndinya-renduchintala · 2022-05-30T13:15:52Z

Hi @muellerzr, @sgugger, if I specify the argument max_train_steps instead of num_train_epochs when launching the training script, I need to recalculate num_train_epochs after accelerator.prepare instead of max_train_steps, right? Am I missing something?

muellerzr · 2022-05-31T13:26:56Z

@kowndinya-renduchintala we already do this for you 😄

https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_glue_no_trainer.py#L425
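One way to keep both values consistent is sketched below. This is an illustration under assumptions, not the code at the linked line: `recalculate_schedule` is a hypothetical helper, and it assumes the step budget is derived from the epoch count only when max_train_steps was not supplied, after which the epoch count is recomputed from the step budget.

```python
import math


def recalculate_schedule(num_batches, gradient_accumulation_steps, num_train_epochs, max_train_steps=None):
    """Sketch: derive a consistent (max_train_steps, num_train_epochs) pair.

    `num_batches` stands for len(train_dataloader) after accelerator.prepare.
    """
    num_update_steps_per_epoch = math.ceil(num_batches / gradient_accumulation_steps)
    if max_train_steps is None:
        # Only the epoch count was given: derive the step budget from it.
        max_train_steps = num_train_epochs * num_update_steps_per_epoch
    # In either case, make the epoch count cover the step budget.
    num_train_epochs = math.ceil(max_train_steps / num_update_steps_per_epoch)
    return max_train_steps, num_train_epochs


# Epoch count given, step budget derived from it.
print(recalculate_schedule(64, 1, num_train_epochs=3))
# Step budget given, epoch count derived from it.
print(recalculate_schedule(64, 1, num_train_epochs=3, max_train_steps=100))
```

In the second call the user-supplied step budget wins, and the epoch count is only adjusted so that the outer loop runs long enough to reach it.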

Update all examples to properly calculate progress bar

93c8901

muellerzr added the Examples and PyTorch labels May 2, 2022

muellerzr requested a review from sgugger May 2, 2022 15:17

sgugger approved these changes May 2, 2022

examples/pytorch/image-classification/run_image_classification_no_trainer.py (outdated, resolved)

Propogate nit

c71dc6d

muellerzr merged commit f275e59 into main May 2, 2022

muellerzr deleted the muellerzr-fix_num_samples branch May 2, 2022 15:56

stevhliu pushed a commit to stevhliu/transformers that referenced this pull request May 3, 2022

Fix no_trainer examples to properly calculate the number of samples (h…

f923885

…uggingface#17046) * Update all examples to properly calculate progress bar

elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022

Fix no_trainer examples to properly calculate the number of samples (h…

c6d70d1

…uggingface#17046) * Update all examples to properly calculate progress bar
