Add patience argument to Trainer #4186
Conversation
This supersedes #2840, where I added patience to the outdated

Looking good! Can you add a reference to your original post that this closes #4894? Thanks
julien-c left a comment
Looks good! Small suggestions there
best_eval_loss = None
evals_without_improvement = 0
nit: prefix those with patience_ as they're specific to this feature
Suggested change:
- best_eval_loss = None
- evals_without_improvement = 0
+ patience_best_eval_loss = None
+ patience_evals_without_improvement = 0
+ patience_should_stop = False
logger.info(
    f"Patience threshold ({self.args.patience}) exceeded, stopping training"
)
Suggested change:
- logger.info(
-     f"Patience threshold ({self.args.patience}) exceeded, stopping training"
- )
+ patience_should_stop = True
+ logger.info(
+     f"Patience threshold ({self.args.patience}) exceeded, stopping training"
+ )
if ((self.args.max_steps > 0 and global_step > self.args.max_steps) or
        (self.args.patience > 0 and evals_without_improvement >= self.args.patience)):
Suggested change:
- if ((self.args.max_steps > 0 and global_step > self.args.max_steps) or
-         (self.args.patience > 0 and evals_without_improvement >= self.args.patience)):
+ if ((self.args.max_steps > 0 and global_step > self.args.max_steps) or
+         patience_should_stop):
  break
- if self.args.max_steps > 0 and global_step > self.args.max_steps:
+ if ((self.args.max_steps > 0 and global_step > self.args.max_steps) or
+         (self.args.patience > 0 and evals_without_improvement >= self.args.patience)):
Hello, when will this feature be merged? I would like to use it. Thank you.
There are some changes requested that @thesamuel should fix before this can be merged.
|
Bump. Early stopping is critical for an automated Trainer that reliably gives us the best model. The current way of choosing the stopping point seems to be specifying a static train_epochs, but how long a model should train depends on far too many factors (learning rate, data complexity, model, model size, optimizer, and so on) for it to be reasonable to ask the user to specify the number of epochs in advance.
|
I would like to use this early stopping on downstream training. I would also like to add a feature that stores the model each time the monitored metric improves and then optionally loads that model after training, so that later evaluation can be done on this "best" model. @thesamuel @julien-c @kevin-yauris what do you think?
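The "keep the best model" idea above could look roughly like this. This is a hypothetical sketch, not code from this PR or from transformers; the `track_best` helper and the dict-based model state are stand-ins for a real checkpointing mechanism:

```python
import copy

def track_best(state, metric, best):
    """Return the (metric, state) pair with the lowest metric seen so far.
    `best` is the previous (best_metric, best_state); lower is better.
    Hypothetical helper for illustration only."""
    best_metric, best_state = best
    if best_metric is None or metric < best_metric:
        # Snapshot the weights at the new best point.
        return (metric, copy.deepcopy(state))
    return (best_metric, best_state)

best = (None, None)
# Fake per-evaluation (model_state, eval_loss) pairs.
for state, loss in [({"w": 1}, 0.9), ({"w": 2}, 0.5), ({"w": 3}, 0.6)]:
    best = track_best(state, loss, best)
# After training, "load" the best snapshot for final evaluation.
print(best)  # → (0.5, {'w': 2})
```

In a real Trainer this would save a checkpoint to disk instead of deep-copying state in memory, but the bookkeeping is the same.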
|
I plan to work on this once I'm finished with the Funnel Transformer model @PhilipMay (so end of this week, beginning of the next).
@sgugger That would be awesome. Maybe you want to get some inspiration from the FARM training loop which is pretty nice IMO: https://github.com/deepset-ai/FARM/blob/master/farm/train.py#L262-L370
|
I just found this PR that was already merged: #7431
|
Not quite, but it makes implementing it easier.
Yes - you are right. The patience part is still missing.
|
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
|
@sgugger Should we keep this open? You wrote in this thread that you would work on this if you find the time, but I am not sure whether you plan to use another PR for that.
|
There has been a PR merged adding the
|
Thanks @cbrochtrup @sgugger! Sorry I didn't get around to this...
|
You're welcome, happy to help!
This closes #4894.
Summary
Often, we want to stop training if loss does not improve for a number of epochs. This PR adds a "patience" argument, which is a limit on the number of times we can get a non-improving eval loss before stopping training early.
It is implemented by other NLP frameworks, such as AllenNLP (see trainer.py and metric_tracker.py).
Motivation
This feature allows faster fine-tuning by breaking out of the training loop early, and saves users the toil of checking metrics on TensorBoard.
Caveats
Often, models are evaluated once per epoch, but run_lm_finetuning.py has an option to evaluate after a set number of model update steps (dictated by --logging_steps if --evaluate_during_training is true). Because of this, I've elected to tie patience to the number of evaluations without improvement in loss, rather than the number of epochs.
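For reference, the patience logic described in this PR can be sketched as a standalone snippet. This is an illustration only, not the actual Trainer code; the `EarlyStopping` class and its names are hypothetical:

```python
class EarlyStopping:
    """Sketch of patience-based early stopping: stop after `patience`
    consecutive evaluations without an improvement in eval loss.
    Illustrative only -- not the implementation from this PR."""

    def __init__(self, patience):
        self.patience = patience
        self.best_eval_loss = None
        self.evals_without_improvement = 0

    def should_stop(self, eval_loss):
        # An improvement resets the counter; otherwise it grows.
        if self.best_eval_loss is None or eval_loss < self.best_eval_loss:
            self.best_eval_loss = eval_loss
            self.evals_without_improvement = 0
        else:
            self.evals_without_improvement += 1
        # patience <= 0 disables early stopping entirely.
        return 0 < self.patience <= self.evals_without_improvement

stopper = EarlyStopping(patience=2)
# Loss improves twice, then fails to improve for two evaluations in a row.
print([stopper.should_stop(loss) for loss in [0.9, 0.7, 0.8, 0.75]])
# → [False, False, False, True]
```

The check runs once per evaluation (not per epoch), matching the caveat above about --evaluate_during_training.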