🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88106.
Note: Links to docs will display an error until the docs builds have been completed.
❌ 15 New Failures, 5 Unrelated Failures as of commit c71c362 with merge base 64077ce.
NEW FAILURES: the following jobs have failed.
FLAKY: the following jobs failed but were likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed 81c21fc to 4894875.
Hi @datumbox,
@federicopozzi33 Thanks for the ping! I'm a bit swamped this week. Can I get back to you by the middle of next week?
datumbox left a comment:
Thanks for the work @federicopozzi33. I've added a few comments/questions below, let me know your thoughts. Also, is there a reference implementation we should be crediting here?
@frgfm I was wondering if you'd have the chance to take a look as well, given that you've previously implemented it, to help ensure validity.
torch/optim/lars.py (outdated):
Nit: Why not set a default value similar to other optimizers?
I agree, 1e-3 is being used on the others 👍
Do you mean to set a default value for the learning rate? I didn't because SGD, Adam, and other optimizers don't have one.
Adam has lr=1e-3 as @frgfm mentioned. I agree it would be good to have a default here.
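To make the suggestion concrete, here is a hypothetical sketch of what the constructor defaults could look like (class name, argument order, and values are illustrative, not the PR's final API):

```python
# Hypothetical sketch only: mirrors Adam's lr=1e-3 default, with the
# LARS-specific trust_coefficient and eps folded into the defaults dict.
class LARSSketch:
    def __init__(self, params, lr=1e-3, momentum=0.0, weight_decay=0.0,
                 trust_coefficient=0.001, eps=1e-8):
        if lr < 0.0:
            raise ValueError(f"Invalid learning rate: {lr}")
        self.params = list(params)
        self.defaults = dict(lr=lr, momentum=momentum,
                             weight_decay=weight_decay,
                             trust_coefficient=trust_coefficient, eps=eps)
```

With this shape, `LARSSketch(model_params)` works out of the box while still allowing an explicit, larger LR as the paper suggests.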
torch/optim/lars.py (outdated):
func is not declared if scripting. Is this implementation complete?
Only the single_tensor implementation is complete. The multi_tensor one is still missing.
I'm temporarily raising an exception when scripting.
EDIT: I looked at the SGD implementation more carefully, and it seems that scripting + foreach is simply not supported. So, I think that the original implementation was OK.
torch/optim/lars.py (outdated):
Nit: why not put the trust_coefficient and eps in the defaults above?
torch/optim/lars.py (outdated):
Does it make sense to grab these from the group similar to other params here?
Sure, I'll take a look by tomorrow!
frgfm left a comment:
Thanks for the PR 🙏 I understand it's a work in progress so take my comments without too much final consideration!
I'll need to take a look at the latest implementation of SGD with the single tensor & multi tensor functional API to make a comprehensive review. Let me know what you think
torch/optim/lars.py (outdated):
I agree, 1e-3 is being used on the others 👍
torch/optim/lars.py (outdated):
I suggest keeping the same arg order as other optimizers as well
I followed SGD's order; Adam's is quite different (due to different parameters too).
torch/optim/lars.py (outdated):
Open suggestion but perhaps list comprehensions would be better?
Using list comprehensions I would need to loop more than once. This way, I loop only once, which is more efficient.
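The single-pass point can be illustrated with a toy sketch (plain dicts stand in for parameters here; this is not the PR's actual code):

```python
# Sketch: one loop fills both lists at once; building each list with its
# own comprehension would iterate over params twice.
def init_group(params):
    params_with_grad, grads = [], []
    for p in params:
        if p["grad"] is not None:
            params_with_grad.append(p)
            grads.append(p["grad"])
    return params_with_grad, grads
```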
torch/optim/lars.py (outdated):
For readability, I suggest moving that before the functional API call
I don't think it would be the same: in the functional call, momentum_buffer_list is updated in place.
torch/optim/lars.py (outdated):
the global LR is not applied to the momentum part to the best of my understanding
Thank you, I will check the paper again and some online implementations soon.
torch/optim/lars.py (outdated):
Not sure the eps is required: if p_norm is zero, then the whole thing is zero; if g_norm is zero, then the basic update term is zero (same effect as using a local LR of 1 if either of those is zero, but we avoid the eps imprecision :))
I think it's required to avoid zero division when both terms are zero.
p_norm and g_norm are banned from being 0 with the conditional, no?
That said, I am not sure why lightning included an eps.
Suggested change:
     if p_norm * g_norm > 0:
-        lars_lr = trust_coefficient * p_norm / (g_norm + p_norm * weight_decay + eps)
+        lars_lr = trust_coefficient * p_norm / max(g_norm + p_norm * weight_decay, eps)
This would avoid unnecessarily biasing the denominator.
I checked the code again, and I think you're right.
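The agreed-upon change can be sketched in plain Python (scalar norms for illustration; the real code operates on tensor norms):

```python
def lars_local_lr(p_norm, g_norm, weight_decay, trust_coefficient, eps):
    # Clamp the denominator with max(..., eps) instead of adding eps,
    # so a healthy denominator is not biased by the epsilon term.
    denom = g_norm + p_norm * weight_decay
    return trust_coefficient * p_norm / max(denom, eps)
```

With p_norm=2, g_norm=4, no weight decay, and trust_coefficient=0.001 this gives 0.001 * 2 / 4 = 0.0005; the eps only kicks in when the denominator is (near) zero.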
@federicopozzi33 there are still some conflicts, could you resolve them?
Force-pushed 8e5c45d to 0e90e08.
Force-pushed 18932c7 to fa0912c.
group.setdefault("maximize", False)
group.setdefault("differentiable", False)

def _init_group(self, group, params_with_grad, grads, momentum_buffer_list):
I notice that has_sparse_grad is excluded from this implementation so far. What are your plans with the variable?
Wanted to give you a heads up that we are planning to deprecate has_sparse_grad, because our main use of it is to determine whether we can use the foreach implementation. Instead of barring all params from the foreach implementation because of one sparse grad, the new ideal is to group the tensors by device and put everything that cannot be foreach'd into the CPU bucket, so that we can maximally enjoy the speed of foreach.
To be honest, I skipped it. My idea was to complete at least the basic implementation, then gradually implement more "advanced" parameters, like foreach.
Does it sound good to you?
yes, I wanted to give you more context regarding the deprecation of has_sparse_grad (as we plan to remove it for all optims as well)
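The bucketing idea described above can be sketched generically (plain dicts stand in for tensors; this is an illustration of the grouping strategy, not PyTorch's internal helper):

```python
from collections import defaultdict

# Sketch: group candidate tensors by device; dense grads on each device can
# take the fast foreach path, while sparse grads go into a per-device
# fallback bucket instead of disabling foreach for the whole param list.
def bucket_for_foreach(tensors):
    buckets = defaultdict(lambda: {"foreach": [], "fallback": []})
    for t in tensors:
        kind = "fallback" if t["is_sparse"] else "foreach"
        buckets[t["device"]][kind].append(t)
    return dict(buckets)
```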
Args:
    params (iterable): iterable of parameters to optimize or dicts defining parameter groups
    lr (float): learning rate
Unless with LARS it is recommended to decide on an initial LR? From the paper, it looks like they specifically use bigger LRs.
Ah, maybe we should just up the default to 1.0, with a comment that larger LRs are used with LARS
LARS.__doc__ = r"""Implements LARS algorithm.

For further details regarding the algorithm we refer to `Large Batch Training of Convolutional Networks`_.
We would want to include the latex version of the algorithm here. I just took some time to read the arxiv linked for the LARS introduction. Is it just me or are there several key typos in their algorithm (e.g., the trust coefficient should be used in the local lr update but it isn't?)
I noticed it too.
Do you think I should just copy the algorithm they reported in the paper?
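Rather than copying the paper's algorithm verbatim, one possible LaTeX sketch for the docstring, with the trust coefficient restored in the local-LR line, could look like the following (symbols assumed here: gamma is the global LR, m the momentum, beta the weight decay, eta the trust coefficient):

```latex
\begin{aligned}
  &\lambda^{(l)} \leftarrow \eta \,
      \frac{\lVert w_t^{(l)} \rVert}
           {\lVert \nabla L(w_t^{(l)}) \rVert + \beta \lVert w_t^{(l)} \rVert} \\
  &v_{t+1}^{(l)} \leftarrow m \, v_t^{(l)}
      + \lambda^{(l)} \bigl( \nabla L(w_t^{(l)}) + \beta \, w_t^{(l)} \bigr) \\
  &w_{t+1}^{(l)} \leftarrow w_t^{(l)} - \gamma \, v_{t+1}^{(l)}
\end{aligned}
```

This is only a sketch of one common formulation; the exact placement of the global LR and weight decay varies across implementations and should match whatever the final code does.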
torch/optim/lars.py (outdated):
Suggested change:
-    p_norm = torch.norm(param.data)
-    g_norm = torch.norm(d_p.data)
+    p_norm = torch.norm(param)
+    g_norm = torch.norm(d_p)
.data is deprecated + should not be used
torch/optim/lars.py (outdated):

    if weight_decay != 0:
        # LARS scaling:
        if p_norm * g_norm > 0:

Suggested change:
-    if p_norm * g_norm > 0:
+    if p_norm != 0 and g_norm != 0:
This should be a cheaper check.
Actually the check is probably unnecessary (reason being that they'll be floats and comparing floats to 0 is often silly). I would vouch for getting rid of this check entirely and keeping the eps term.
I moved the two norm computations inside the if condition, since they are used only there.
torch/optim/lars.py (outdated):

    lars_lr = trust_coefficient * p_norm / (g_norm + p_norm * weight_decay + eps)

    d_p = d_p.add(param, alpha=weight_decay)
    d_p.mul_(lars_lr)

Suggested change:
-    d_p.mul_(lars_lr)
+    d_p = d_p * lars_lr
avoid inplace updates
test/test_optim.py (outdated):

    optimizer_ctor([torch.empty((), device="cuda")], differentiable=True, fused=True)

    def test_lars(self):
        # ASK: What's the reason behind two identical calls? (See SGD tests)
Which two are you referring to here? The optim tests are certainly up for a refactor, so there might just be actual redundancies we can get rid of
I'm referring to the first two test cases defined here (line 455 in 8615a4c).
@federicopozzi33 I've gone over it with a more thorough review. Generally, the math looks consistent :D I've also read the paper and realized, a bit anticlimactically, that the entire "layer-wise" portion of the optimizer is just a local multiplier that seeks to balance out the weight-norm-to-grad-norm ratio. Thankfully that means this optimizer isn't too crazy to implement. 😛
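For reference, the per-layer math being reviewed can be condensed into a pure-Python sketch (lists stand in for tensors; the formulation follows the discussion in this thread and is illustrative, not the PR's final code):

```python
import math

def lars_step(param, grad, buf, *, lr=1.0, momentum=0.9,
              weight_decay=0.0, trust_coefficient=0.001, eps=1e-8):
    # Layer-wise trust ratio: balances the weight norm against the grad norm.
    p_norm = math.sqrt(sum(p * p for p in param))
    g_norm = math.sqrt(sum(g * g for g in grad))
    if p_norm > 0.0 and g_norm > 0.0:
        local_lr = trust_coefficient * p_norm / max(
            g_norm + p_norm * weight_decay, eps)
    else:
        local_lr = 1.0
    # Scaled update direction (weight decay applied to the gradient).
    d_p = [local_lr * (g + weight_decay * p) for g, p in zip(grad, param)]
    # Heavy-ball momentum on the scaled update, then the global LR step.
    new_buf = [momentum * b + d for b, d in zip(buf, d_p)]
    new_param = [p - lr * b for p, b in zip(param, new_buf)]
    return new_param, new_buf
```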
Hi @janeyx99,
@federicopozzi33 I wanted to update you on our side--sorry for the delay. We are currently in the middle of planning and discussing how we want to offer and incorporate new optimizers + optimizer features, including a test revamp that should enable safer test coverage for new optims. Thus, let's put a pause on this PR until we reach a consensus on conclusions from our side.
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as |
@janeyx99 @federicopozzi33 Too bad to see this optimiser didn't make it into PyTorch, despite its popularity. Any chance we can kick this off again?
Followup to #6323.
Addition of LARS optimizer (maximize, differentiable, foreach).
Reference implementations: [1]
cc @vincentqb @jbschlosser @albanD @janeyx99 @crcrpar @gujinghui @PenghuiCheng @XiaobingSuper @jianyuh @jgong5 @mingfeima @sanchitintel @ashokei @jingxu10 @min-jean-cho @yanbing-j @Guobing-Chen @Xia-Weiwen @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @desertfire