more early stopping options (convergence and divergence threshold) #6868
Conversation
check_finite: Stops training when the monitor becomes NaN or infinite. Set this argument to ``False``
if this behavior is undesired.
do we want an option to turn this on/off at all?
I would say off. We will add support for occasional NaN loss training soon.
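For illustration, the `check_finite` behavior under discussion can be sketched as follows. This is not the actual Lightning implementation; the function name is made up, and `math.isfinite` stands in for the tensor-aware check the callback would use.

```python
import math

def should_stop_non_finite(current: float, check_finite: bool = True) -> bool:
    """Return True when training should stop because the monitored value
    is NaN or infinite, unless the user opted out via check_finite=False."""
    return check_finite and not math.isfinite(current)
```

With `check_finite=False`, a NaN monitor value no longer triggers an immediate stop, which is the escape hatch the comment above asks about.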
Codecov Report
@@ Coverage Diff @@
## master #6868 +/- ##
=======================================
- Coverage 92% 87% -5%
=======================================
Files 196 196
Lines 12571 12597 +26
=======================================
- Hits 11594 10968 -626
- Misses 977 1629 +652
if should_stop:
    self.stopped_epoch = trainer.current_epoch
    if reason:
        log.info(f"[{trainer.global_rank}] {reason}")
This is suboptimal.
If we log on rank zero only, and the user has sync_dist=False for logging, then we might not see the reason being logged because it could be rank > 0 that decided to stop.
If we log on all ranks and the user has sync_dist=True for logging, we will show the same message N times.
Should we perhaps broadcast the message and log only on rank 0?
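The broadcast-then-log idea can be sketched in plain Python. This is only an illustration of the control flow, not Lightning's API: `gathered_reasons` stands in for the result of a collective gather across ranks, and `log_fn` for the logger call.

```python
def log_stop_reason_once(rank, gathered_reasons, log_fn):
    """Log the stopping reason exactly once, on rank 0, even when a
    different rank decided to stop.

    gathered_reasons: one reason-or-None per rank, as if all-gathered.
    """
    # Pick the first non-empty reason any rank produced.
    reason = next((r for r in gathered_reasons if r), None)
    if rank == 0 and reason is not None:
        log_fn(f"Early stopping: {reason}")
    return reason
```

Every rank learns the shared reason (so all ranks can stop consistently), but only rank 0 emits the log line, avoiding both failure modes described above.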
stopping_threshold: Stop training immediately once the monitored quantity reaches this threshold.
divergence_threshold: Stop training as soon as the monitored quantity becomes worse than this threshold.
we could use `stop_limit` and `stop_loss` to follow common financial terms
@jlperla what do you think of this name suggestion?
But it might not be a loss, and most people don't know finance. These are basically optimizer settings, which is more universal.
I think sticking with optimizer-style lingo is ideal. "Divergence" is safe and says what it means. Normally one would call the success criteria tolerances for optimizers, but that is because they are always comparing something (e.g. a value itself, changes in that value, or first-order conditions) to zero.
Since this could presumably compare stopping against something other than close to zero (especially if you are tracking something where a larger number is better), I think "threshold" is probably more general. But open-minded of course.
Could also go super simple and go with min_threshold and max_threshold
No, because that implies a direction to them.
IMO, the names are good right now.
this is not clear to me: you have some converging sequence, and it stops when it starts to diverge again? Should there be some patience to account for noise?
Or, alternatively, it could observe the training and validation measures and stop on overfitting, when these two start to diverge.
If something is diverging, it is because you are in some sort of local minimum or outside of an attractor, and it could only return with some massive jumps (i.e. as in simulated annealing, you are way off in the boonies for your optimum... in theory it could come back, but it might take months; you are better off just restarting). So patience isn't the right thing to think of for that.
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
carmocca left a comment
Should patience and delta apply to these thresholds?
elif self.divergence_threshold is not None and self.monitor_op(-current, -self.divergence_threshold):
    should_stop = True
    reason = (
        f"Divergence: {self.monitor} = {current} > {self.divergence_threshold}."
Minor point: whether it is
f"Divergence: {self.monitor} = {current} > {self.divergence_threshold}."
or
f"Divergence: {self.monitor} = {current} < {self.divergence_threshold}."
should depend on the monitor_op, right? Maybe have a f"Divergence: {self.monitor} = {current} {op_to_string(self.monitor_op)} {self.divergence_threshold}."
or something like that, where you fill op_to_string with whatever you need to turn it into a >, etc.?
Similarly, I think you could do the same with the "successful" convergence below/above the target:
f"Below tolerance: {self.monitor} = {current} {op_to_string(self.monitor_op)} {self.stopping_threshold}." where you would have to ensure the order is correct, as it is going from the other direction.
As I said though, minor.
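The `op_to_string` helper suggested here is hypothetical; a sketch of it might look like the following, using `operator.lt`/`operator.gt` as stand-ins for the torch comparison ops the callback stores in `monitor_op`:

```python
import operator

def op_to_string(monitor_op) -> str:
    """Map a comparison op to its symbol (hypothetical helper)."""
    return {operator.lt: "<", operator.gt: ">"}.get(monitor_op, "?")

def divergence_reason(monitor, current, monitor_op, divergence_threshold) -> str:
    """Build a reason string whose comparison symbol matches the check.

    The divergence check is monitor_op(-current, -threshold), i.e. "worse
    than", so the symbol shown to the user is the flipped one.
    """
    flipped = ">" if op_to_string(monitor_op) == "<" else "<"
    return f"Divergence: {monitor} = {current} {flipped} {divergence_threshold}."
```

With `monitor_op = operator.lt` (a minimized metric), a loss of 9.0 against a divergence threshold of 5.0 produces `"Divergence: val_loss = 9.0 > 5.0."`, which now reads correctly for either direction of the monitor.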
|
@tchaton @awaelchli This all looks great to me. I put in one minor comment about the "reason" strings only being correct for one of the monitor_ops, but I also think that could wait and be done as a separate issue later. I personally am unlikely to use the other direction for the monitor_op anytime soon.
What does this PR do?
Part of #6795
Adds two thresholds after which we stop training immediately (no patience).
Divergence threshold: the monitor has reached a value from which we believe it cannot recover -> stop training
Stopping threshold: the monitor has reached a target value that is close to optimal, and we do not care about further improvement -> stop training
Now that we have multiple stopping criteria, it's best we report the reason for stopping too.
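The two immediate-stop criteria described above can be sketched as follows. This is an illustration only, not the PR's actual code: `operator.lt` stands in for the torch comparison op, and the function and reason strings are made up.

```python
import operator

def check_early_stop(current, monitor_op=operator.lt,
                     stopping_threshold=None, divergence_threshold=None):
    """Return (should_stop, reason) for the two threshold checks.

    With a minimized metric (monitor_op = "less than"):
      - stopping_threshold fires when the metric gets good enough;
      - divergence_threshold fires when the metric gets worse than the
        threshold, checked via monitor_op(-current, -threshold).
    Both stop immediately, with no patience.
    """
    if stopping_threshold is not None and monitor_op(current, stopping_threshold):
        return True, f"reached stopping_threshold {stopping_threshold}"
    if divergence_threshold is not None and monitor_op(-current, -divergence_threshold):
        return True, f"crossed divergence_threshold {divergence_threshold}"
    return False, None
```

Negating both sides reuses the same comparison op for the "worse than" check, so a single op covers both the min and max monitor modes.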
TODO:
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃