Fix SequentialLR initialization #72856
Conversation
✅ No failures (0 pending) as of commit e505c16 (Dr. CI).
Is there an issue associated with this?

I did not open an issue for this. It's just something I noticed and decided to contribute a fix for. Should I open an issue just for reference?

We usually discuss these things on issues before working on an implementation.

Issue created here: #72874
Going back to the tests: another test that is failing is this one (Lines 1429 to 1439 in 84729ce). The optimizer is set as in Lines 921 to 923 in 84729ce, and the expected target for epoch 0 is set as shown there. Is there something I'm missing here?
@albanD Just pinging to get some help on this |
albanD
left a comment
The fix sounds generally ok.
I'm wondering if we can refactor this to reduce duplicated code (the bisect_right in particular).
Also this will need testing!
Is this still ok in the case where 0 is passed to the step() function below?
It should be. If 0 is passed in, then last_epoch is set in step() anyways. So, this decrement shouldn't have any effect in that case.
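The decrement-then-step idea being discussed can be illustrated with a pure-Python toy model (a sketch only, not the actual PyTorch source — `TinyScheduler` and `TinySequential` are hypothetical stand-ins):

```python
# Toy model of the approach: each scheduler's __init__ implicitly calls
# step(), so the sequential wrapper first decrements every child's
# last_epoch to undo that implicit step, then steps only the scheduler
# that should be active at epoch 0.

class TinyScheduler:
    """Stand-in for an _LRScheduler: step() bumps last_epoch and writes
    this scheduler's lr into the shared optimizer dict."""
    def __init__(self, optimizer, lr):
        self.optimizer = optimizer
        self.lr = lr
        self.last_epoch = -1
        self.step()  # mirrors the implicit step in _LRScheduler.__init__

    def step(self):
        self.last_epoch += 1
        self.optimizer["lr"] = self.lr


class TinySequential:
    def __init__(self, optimizer, schedulers):
        self.schedulers = schedulers
        # Undo the epoch side effect of each child's __init__ ...
        for s in schedulers:
            s.last_epoch -= 1
        # ... then re-run the initial step only on the first scheduler,
        # so it is the one that sets the optimizer's lr for epoch 0.
        schedulers[0].step()


opt = {"lr": 0.1}
warmup = TinyScheduler(opt, lr=0.01)   # should govern epoch 0
main = TinyScheduler(opt, lr=0.1)      # constructed last, previously "won"
seq = TinySequential(opt, [warmup, main])
print(opt["lr"])  # 0.01
```

After the wrapper's initialization, the first scheduler's value is the one in effect, and the `last_epoch` counters are the same as they would have been post-step.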
Force-pushed 841aa88 to aa3b75b
@albanD Thanks for the initial review! What are your thoughts on the comments I made above about the current tests not being correct?

Not sure about the test tbh.

@albanD Just checking up on this PR again. What's the right course of action here? Should I modify those tests to reflect what I think should be correct values?
Sorry for the delay - I can look into this. Will post findings shortly. |
jbschlosser
left a comment
@antoniojkim I don't see this fix addressing the issue - your reproduction code still gives the same invalid results after the fix.
This may have been mentioned already, but the core of the issue is that the _LRSchedulers (specifically ConstantLR in this case) mutate the optimizer during initialization. This happens regardless of SequentialLR usage.
```python
import torch
optimizer = torch.optim.SGD([torch.tensor(0.5)], lr=0.1)
print(optimizer.param_groups[0]["lr"])  # 0.1
torch.optim.lr_scheduler.ConstantLR(optimizer, factor=0.1)
print(optimizer.param_groups[0]["lr"])  # 0.01
torch.optim.lr_scheduler.ConstantLR(optimizer, factor=0.1)
print(optimizer.param_groups[0]["lr"])  # 0.001
```

Not sure what a proper fix is yet tbh and unfortunately I don't have time to dig into this more atm. Happy to review any possible fixes you come up with, although I will be out next week so there will be some delay.
Force-pushed 08def64 to ed79f21
@jbschlosser I just added a
Force-pushed ed79f21 to 4b545c2
@jbschlosser I fixed the

@jbschlosser gentle reminder to please review this PR again
Force-pushed c6b1b7e to 565220e
Unless I'm mistaken, I don't think the failing tests are related to the changes I made. Can I get another review @jbschlosser?
jbschlosser
left a comment
Nice work, looks correct now wrt tests! One minor API thing and I think we can finally get it merged
```python
self.step()

@contextmanager
def init_optimizer_lr(self):
```
Need to name this privately (_init_optimizer_lr) and we're good to go :)
- This is primarily to prevent the changes from affecting ChainedScheduler
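As an illustrative guess at the shape of such a helper (hypothetical names and structure — the actual `_init_optimizer_lr` in the PR may differ), a context manager can snapshot each param group's `lr` before initialization side effects and restore it afterwards, so constructing a child scheduler does not permanently mutate the optimizer:

```python
from contextlib import contextmanager


class FakeOptimizer:
    """Minimal stand-in exposing param_groups like torch.optim.Optimizer."""
    def __init__(self, lr):
        self.param_groups = [{"lr": lr}]


class SketchScheduler:
    def __init__(self, optimizer):
        self.optimizer = optimizer

    @contextmanager
    def _init_optimizer_lr(self):
        # Snapshot each param group's lr before any init-time mutation ...
        saved = [group["lr"] for group in self.optimizer.param_groups]
        try:
            yield
        finally:
            # ... and restore it afterwards, undoing the side effects.
            for group, lr in zip(self.optimizer.param_groups, saved):
                group["lr"] = lr


opt = FakeOptimizer(0.1)
sched = SketchScheduler(opt)
with sched._init_optimizer_lr():
    opt.param_groups[0]["lr"] = 0.01  # simulated init-time mutation
print(opt.param_groups[0]["lr"])  # 0.1
```

Scoping the restore to the wrapper's own initialization is what keeps the change from leaking into other schedulers such as ChainedScheduler.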
@jbschlosser There was a bug in the

I added a test that verifies the example case that I outlined in issue #72874, and it appears to be passing. So I believe the behaviour is correct now. Can you please check the values in the failing sequential_lr tests? I'm suspicious of those target values.
Hey @antoniojkim, I just checked the values for the first failing test: Lines 1473 to 1486 in 2ede287; Lines 941 to 943 in 2ede287. Given a
@jbschlosser I found the issue. It was related to the

All tests green now!
@pytorchbot merge

@pytorchbot successfully started a merge job. Check the current status here.

@antoniojkim your PR has been successfully merged.
Hey @antoniojkim. |
Awesome. Thanks for all your help with this! |
Summary: What was happening is that when we have multiple learning rate schedulers, the order in which they are initialized is not taken into account. This is a problem if they are initialized in sequential order (as one might intuitively do). Each scheduler calls `step()` on initialization and sets the `lr` in its optimizer's `param_groups`. However, this means that step 0 will be using the `lr` that was set by the very last scheduler (in the case of initializing schedulers sequentially) instead of the first scheduler.

The fix in this PR addresses the above bug by performing a call to the appropriate scheduler on initialization, after decrementing the `last_epoch` values in order to keep them the same post-step. This ensures that the correct scheduler is the one setting the `lr` values for the optimizer's `param_groups`.

Pull Request resolved: #72856
Approved by: https://github.com/jbschlosser
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/765b6a8fab02a55cf85058037f9ab89fb00a5b23
Reviewed By: atalman
Differential Revision: D37327411
fbshipit-source-id: dc461f1de257fe13f54fede2b093644bc6e51a89
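The pre-fix behavior described in the summary can be modeled without PyTorch in a few lines (a toy model only — the real `_LRScheduler` classes are more involved):

```python
# Every scheduler's __init__ calls step() and overwrites the optimizer's
# lr, so with sequential initialization the *last* scheduler constructed
# determines the lr used at step 0, not the first one.
class BuggyScheduler:
    def __init__(self, optimizer, lr_at_epoch0):
        self.optimizer = optimizer
        self.lr_at_epoch0 = lr_at_epoch0
        self.step()  # implicit step on construction

    def step(self):
        self.optimizer["lr"] = self.lr_at_epoch0


optimizer = {"lr": 0.1}
first = BuggyScheduler(optimizer, lr_at_epoch0=0.01)   # should apply first
second = BuggyScheduler(optimizer, lr_at_epoch0=0.1)   # but this one "wins"
print(optimizer["lr"])  # 0.1 instead of the intended 0.01
```

The fix makes the wrapper re-establish the first scheduler's value after all children have been constructed, so step 0 uses the intended `lr`.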
@jbschlosser I think @Queuecumber wanted this fix in the next PyTorch release if possible. Is there still time to do that?

Hey @antoniojkim, unfortunately the release branch cut for 1.12 was a decent while back. The fix will make it into 1.13 though.