more early stopping options (convergence and divergence threshold) #6868
Conversation
check_finite: Stops training when the monitor becomes NaN or infinite. Set this argument to ``False``
if this behavior is undesired.
do we want an option to turn this on/off at all?
I would say off. We will add support for occasional NaN loss training soon.
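For illustration, the `check_finite` behavior under discussion can be sketched as follows. This is not the actual Lightning implementation; the function name is made up, and `math.isfinite` stands in for the tensor-aware check the callback would use.

```python
import math

def should_stop_non_finite(current: float, check_finite: bool = True) -> bool:
    """Return True when training should stop because the monitored value
    is NaN or infinite, unless the user opted out via check_finite=False."""
    return check_finite and not math.isfinite(current)
```

With `check_finite=False`, a NaN monitor value no longer triggers an immediate stop, which is the escape hatch the comment above asks about.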
Codecov Report
@@ Coverage Diff @@
## master #6868 +/- ##
=======================================
- Coverage 92% 87% -5%
=======================================
Files 196 196
Lines 12571 12597 +26
=======================================
- Hits 11594 10968 -626
- Misses 977 1629 +652
if should_stop:
    self.stopped_epoch = trainer.current_epoch
    if reason:
        log.info(f"[{trainer.global_rank}] {reason}")
This is suboptimal.
If we log on rank zero only, and the user has sync_dist=False for logging, then we might not see the reason being logged because it could be rank > 0 that decided to stop.
If we log on all ranks and the user has sync_dist=True for logging, we will show the same message N times.
Should we perhaps broadcast the message and log only on rank 0?
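The broadcast-then-log idea can be sketched in plain Python. This is only an illustration of the control flow, not Lightning's API: `gathered_reasons` stands in for the result of a collective gather across ranks, and `log_fn` for the logger call.

```python
def log_stop_reason_once(rank, gathered_reasons, log_fn):
    """Log the stopping reason exactly once, on rank 0, even when a
    different rank decided to stop.

    gathered_reasons: one reason-or-None per rank, as if all-gathered.
    """
    # Pick the first non-empty reason any rank produced.
    reason = next((r for r in gathered_reasons if r), None)
    if rank == 0 and reason is not None:
        log_fn(f"Early stopping: {reason}")
    return reason
```

Every rank learns the shared reason (so all ranks can stop consistently), but only rank 0 emits the log line, avoiding both failure modes described above.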
stopping_threshold: Stop training immediately once the monitored quantity reaches this threshold.
divergence_threshold: Stop training as soon as the monitored quantity becomes worse than this threshold.
we could use `stop_limit` and `stop_loss` to follow common financial terms
@jlperla what do you think of this name suggestion?
But it might not be a loss, and most people don't know finance. These are basically optimizer settings, which is more universal.
I think sticking with optimizer-style lingo is ideal. "Divergence" is safe and says what it means. Normally one would call the success criteria tolerances for optimizers, but that is because they are always comparing something (e.g. a value itself, changes in that value, or first-order conditions) to zero.
Since this could presumably compare stopping against something other than close to zero (especially if you are tracking something where a larger number is better), I think "threshold" is probably more general. But open-minded of course.
Could also go super simple and go with min_threshold and max_threshold
No, because that implies a direction to them.
IMO, the names are good right now.
this is not clear to me: you have some converging sequence, and it stops when it starts to diverge again? Should there be some patience to account for noise?
Or, alternatively, it could observe the training and validation measures and stop on overfitting, when these two start to diverge.
If something is diverging, it is because you are in some sort of local minimum or outside of an attractor, and it could only return with some massive jumps (i.e. as in simulated annealing, you are way off in the boonies for your optimum... in theory it could come back, but it might take months; you are better off just restarting). So patience isn't the right thing to think of for that.
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
carmocca left a comment
Should patience and delta apply to these thresholds?
elif self.divergence_threshold is not None and self.monitor_op(-current, -self.divergence_threshold):
    should_stop = True
    reason = (
        f"Divergence: {self.monitor} = {current} > {self.divergence_threshold}."
Minor point: whether it is
f"Divergence: {self.monitor} = {current} > {self.divergence_threshold}."
or
f"Divergence: {self.monitor} = {current} < {self.divergence_threshold}."
should depend on the monitor_op, right? Maybe have a f"Divergence: {self.monitor} = {current} {op_to_string(self.monitor_op)} {self.divergence_threshold}."
or something like that, where you fill op_to_string with whatever you need to turn it into a >, etc.?
Similarly, I think you could do the same with the "successful" convergence below/above the target:
f"Below tolerance: {self.monitor} = {current} {op_to_string(self.monitor_op)} {self.stopping_threshold}." where you would have to ensure the order is correct, as it is going from the other direction.
As I said though, minor.
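The `op_to_string` helper suggested here is hypothetical; a sketch of it might look like the following, using `operator.lt`/`operator.gt` as stand-ins for the torch comparison ops the callback stores in `monitor_op`:

```python
import operator

def op_to_string(monitor_op) -> str:
    """Map a comparison op to its symbol (hypothetical helper)."""
    return {operator.lt: "<", operator.gt: ">"}.get(monitor_op, "?")

def divergence_reason(monitor, current, monitor_op, divergence_threshold) -> str:
    """Build a reason string whose comparison symbol matches the check.

    The divergence check is monitor_op(-current, -threshold), i.e. "worse
    than", so the symbol shown to the user is the flipped one.
    """
    flipped = ">" if op_to_string(monitor_op) == "<" else "<"
    return f"Divergence: {monitor} = {current} {flipped} {divergence_threshold}."
```

With `monitor_op = operator.lt` (a minimized metric), a loss of 9.0 against a divergence threshold of 5.0 produces `"Divergence: val_loss = 9.0 > 5.0."`, which now reads correctly for either direction of the monitor.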
|
@tchaton @awaelchli This all looks great to me. I put in one minor comment about the "reason" strings only being correct for one of the monitor_ops, but I also think that could wait and be done as a separate issue later. I personally am unlikely to use the other direction for the monitor_op anytime soon.
What does this PR do?
Part of #6795
Adds two thresholds after which we stop training immediately (no patience).
Divergence threshold: the monitor has reached a value from which we believe it cannot recover -> stop training
Stopping threshold: the monitor has reached a target value that is close to optimal, and we do not care about further improvement -> stop training
Now that we have multiple stopping criteria, it's best we report the reason for stopping too.
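The two immediate-stop criteria described above can be sketched as follows. This is an illustration only, not the PR's actual code: `operator.lt` stands in for the torch comparison op, and the function and reason strings are made up.

```python
import operator

def check_early_stop(current, monitor_op=operator.lt,
                     stopping_threshold=None, divergence_threshold=None):
    """Return (should_stop, reason) for the two threshold checks.

    With a minimized metric (monitor_op = "less than"):
      - stopping_threshold fires when the metric gets good enough;
      - divergence_threshold fires when the metric gets worse than the
        threshold, checked via monitor_op(-current, -threshold).
    Both stop immediately, with no patience.
    """
    if stopping_threshold is not None and monitor_op(current, stopping_threshold):
        return True, f"reached stopping_threshold {stopping_threshold}"
    if divergence_threshold is not None and monitor_op(-current, -divergence_threshold):
        return True, f"crossed divergence_threshold {divergence_threshold}"
    return False, None
```

Negating both sides reuses the same comparison op for the "worse than" check, so a single op covers both the min and max monitor modes.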
TODO:
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃