
WIP: feat: LARS optimizer #88106

Draft
federicopozzi33 wants to merge 7 commits into pytorch:main from federicopozzi33:feature/lars-optimizer

Conversation

@federicopozzi33
Contributor

@federicopozzi33 federicopozzi33 commented Oct 31, 2022

Follow-up to #6323.

Addition of LARS optimizer.

  • LARS optimizer
  • Tests
  • Documentation
  • Multi-Tensor support
  • Extra params (e.g., maximize, differentiable, foreach)
  • .pyi

Reference implementations: [1]

cc @vincentqb @jbschlosser @albanD @janeyx99 @crcrpar @gujinghui @PenghuiCheng @XiaobingSuper @jianyuh @jgong5 @mingfeima @sanchitintel @ashokei @jingxu10 @min-jean-cho @yanbing-j @Guobing-Chen @Xia-Weiwen @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @desertfire
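For reviewers skimming the thread, the per-layer update being implemented can be sketched roughly as follows (an illustrative single-tensor sketch following Algorithm 1 of the paper, not this PR's actual code; all names and defaults here are assumptions):

```python
import torch

def lars_step(param, grad, momentum_buf, lr=1.0, momentum=0.9,
              weight_decay=1e-4, trust_coefficient=0.001, eps=1e-8):
    # Layer-wise "trust ratio": scale the step by ||w|| / (||g|| + wd * ||w||).
    p_norm = torch.norm(param)
    g_norm = torch.norm(grad)
    lars_lr = trust_coefficient * p_norm / (g_norm + weight_decay * p_norm + eps)
    d_p = grad.add(param, alpha=weight_decay)      # add weight decay to the gradient
    momentum_buf.mul_(momentum).add_(d_p, alpha=lr * lars_lr)
    param.sub_(momentum_buf)

w = torch.ones(4)
g = torch.full((4,), 0.5)
buf = torch.zeros(4)
lars_step(w, g, buf)   # w moves slightly below 1.0
```

The open review questions below (eps handling, the norm check, in-place updates) all concern the middle three lines of this sketch.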

@pytorch-bot

pytorch-bot bot commented Oct 31, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88106

Note: Links to docs will display an error until the docs builds have been completed.

❌ 15 New Failures, 5 Unrelated Failures

As of commit c71c362 with merge base 64077ce:

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla

linux-foundation-easycla bot commented Oct 31, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

@federicopozzi33 federicopozzi33 force-pushed the feature/lars-optimizer branch 3 times, most recently from 81c21fc to 4894875 on November 6, 2022 10:00
@federicopozzi33
Contributor Author

Hi @datumbox,
I think you can start reviewing what I've done so far.

@datumbox
Contributor

@federicopozzi33 Thanks for the ping! I'm a bit swamped this week. Can I get back to you by the middle of next one?

Contributor

@datumbox datumbox left a comment


Thanks for the work @federicopozzi33. I've added a few comments/questions below, let me know your thoughts. Also is there a reference implementation we should be crediting here?

@frgfm I was wondering if you'd have the chance to also take a look, given you've previously implemented it, to help ensure its validity.

Contributor

Nit: Why not set a default value similar to other optimizers?

Contributor

I agree, 1e-3 is being used on the others 👍

Contributor Author

@federicopozzi33 federicopozzi33 Dec 10, 2022

Do you mean to set a default value for the learning rate? I didn't because SGD, Adam, and other optimizers don't have one.

Contributor

Adam has lr=1e-3 as @frgfm mentioned. I agree it would be good to have a default here.

Contributor

func is not declared if scripting. Is this implementation complete?

Contributor Author

@federicopozzi33 federicopozzi33 Dec 10, 2022

Only the single_tensor implementation is complete. The multi_tensor one is still missing.

I'm temporarily raising an exception when scripting.

EDIT: I looked at the SGD implementation more carefully, and it seems that scripting combined with foreach is simply not supported. So I think the original implementation was OK.

Comment on lines 45 to 46
Contributor

Nit: why not put the trust_coefficient and eps in the defaults above?

Contributor Author

Fixed.

Comment on lines 85 to 86
Contributor

Does it make sense to grab these from the group similar to other params here?

Contributor Author

Yes!

@frgfm
Contributor

frgfm commented Nov 28, 2022

> @frgfm I was wondering if you'd have the chance to also take a look, given you've previously implemented it, to help ensure its validity.

Sure, I'll take a look by tomorrow!

Contributor

@frgfm frgfm left a comment

Thanks for the PR 🙏 I understand it's a work in progress, so don't treat my comments as final!

I'll need to take a look at the latest implementation of SGD with the single tensor & multi tensor functional API to make a comprehensive review. Let me know what you think

Contributor

I agree, 1e-3 is being used on the others 👍

Comment on lines 16 to 24
Contributor

I suggest keeping the same arg order as other optimizers as well

Contributor Author

@federicopozzi33 federicopozzi33 Dec 10, 2022

I followed SGD's order; Adam's is quite different (due to different parameters too).

Comment on lines 63 to 75
Contributor

Open suggestion but perhaps list comprehensions would be better?

Contributor Author

@federicopozzi33 federicopozzi33 Dec 10, 2022

With a list comprehension I would need to loop more than once; this way I loop only once, which is more efficient.
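The trade-off being discussed can be shown in isolation (a purely illustrative sketch with placeholder data, not the PR's code): one loop fills several lists in a single traversal, whereas comprehensions traverse the params once per list.

```python
# Hypothetical stand-in for a param group: each entry has a grad and some state.
params = [{"grad": i, "state": i * 10} for i in range(3)]

# Single loop: one traversal, appending to both lists.
grads, states = [], []
for p in params:
    grads.append(p["grad"])
    states.append(p["state"])

# Equivalent comprehensions: arguably clearer, but two traversals.
grads2 = [p["grad"] for p in params]
states2 = [p["state"] for p in params]

assert grads == grads2 and states == states2
```

For a handful of parameter lists the difference is small either way, so this is mostly a readability call.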

Comment on lines 90 to 93
Contributor

For readability, I suggest moving that before the functional API call

Contributor Author

@federicopozzi33 federicopozzi33 Dec 10, 2022

I don't think it would be the same: in the functional call, momentum_buffer_list is updated in place.

Contributor

Agreed it would be after

Contributor

To the best of my understanding, the global LR is not applied to the momentum part.

Contributor Author

@federicopozzi33 federicopozzi33 Dec 10, 2022

Thank you, I will check the paper and some online implementations again soon.

Comment on lines 151 to 156
Contributor

Not sure the eps is required: if p_norm is zero, then the whole thing is zero; if g_norm is zero, then the basic update term is zero (same effect as using local LR = 1 if either of those is zero, but we avoid the eps imprecision :))

Contributor Author

@federicopozzi33 federicopozzi33 Dec 10, 2022

I think it's required to avoid zero division when both terms are zero.

Contributor

p_norm and g_norm are banned from being 0 with the conditional, no?

That said, I am not sure why lightning included an eps.

Contributor

Suggested change
-if p_norm * g_norm > 0:
-    lars_lr = trust_coefficient * p_norm / (g_norm + p_norm * weight_decay + eps)
+lars_lr = trust_coefficient * p_norm / max(g_norm + p_norm * weight_decay, eps)

This would avoid unnecessarily biasing the denominator.
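The difference between the two guards is easy to see numerically (illustrative values, reusing the variable names from the snippet above):

```python
trust_coefficient, weight_decay, eps = 0.001, 0.0, 1e-8
p_norm, g_norm = 1.0, 1.0

# Additive eps always perturbs the denominator slightly...
biased = trust_coefficient * p_norm / (g_norm + p_norm * weight_decay + eps)

# ...while max() leaves it untouched whenever it is already >= eps.
unbiased = trust_coefficient * p_norm / max(g_norm + p_norm * weight_decay, eps)

assert biased < trust_coefficient      # bias introduced even in the benign case
assert unbiased == trust_coefficient   # exact local LR recovered
```

Both forms still protect against a zero denominator; only the non-degenerate case differs.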

Contributor Author

@federicopozzi33 federicopozzi33 May 5, 2023

I checked the code again, and I think you're right.

@federicopozzi33 federicopozzi33 requested review from datumbox and frgfm and removed request for datumbox and frgfm December 10, 2022 18:16
@datumbox
Contributor

@federicopozzi33 there are still some conflicts; could you resolve them?

@github-actions github-actions bot added the module: mkldnn Related to Intel IDEEP or oneDNN (a.k.a. mkldnn) integration label Dec 18, 2022
group.setdefault("maximize", False)
group.setdefault("differentiable", False)

def _init_group(self, group, params_with_grad, grads, momentum_buffer_list):
Contributor

I notice that has_sparse_grad is excluded from this implementation so far. What are your plans with the variable?

Wanted to give you the heads up that we are planning on deprecating has_sparse_grad, because our main use of it is to determine whether we could use the foreach implementation or not for it. Instead of barring all params from going through the foreach implementation because of one sparse grad, the new ideal is to group the tensors by device and to put everything that cannot be foreach'd into the cpu bucket, so that we can maximally enjoy the speed of foreach.
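The bucketing idea described above can be sketched in plain Python (a rough sketch of the intent using dicts as stand-ins for tensors, not PyTorch's actual grouping helper):

```python
from collections import defaultdict

def group_for_foreach(tensors):
    """Hypothetical sketch: bucket tensors so that dense, same-device groups
    can go through fast foreach kernels, while anything that cannot be
    foreach'd (e.g. sparse grads) falls into a fallback bucket."""
    buckets = defaultdict(list)
    for t in tensors:
        key = ("fallback",) if t["is_sparse"] else (t["device"], t["dtype"])
        buckets[key].append(t["name"])
    return dict(buckets)

tensors = [
    {"name": "w1", "device": "cuda:0", "dtype": "float32", "is_sparse": False},
    {"name": "w2", "device": "cuda:0", "dtype": "float32", "is_sparse": False},
    {"name": "emb", "device": "cuda:0", "dtype": "float32", "is_sparse": True},
]
groups = group_for_foreach(tensors)
# w1 and w2 share a foreach bucket; emb goes to the fallback path alone.
```

This is why a single global `has_sparse_grad` flag becomes unnecessary: the decision moves into per-bucket dispatch.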

Contributor Author

@federicopozzi33 federicopozzi33 May 5, 2023

To be honest, I skipped it. My idea was to complete at least the basic implementation, then gradually implement more "advanced" parameters, like foreach.

Does it sound good to you?

Contributor

yes, I wanted to give you more context regarding the deprecation of has_sparse_grad (as we plan to remove it for all optims as well)

Args:
params (iterable): iterable of parameters to optimize or dicts defining
parameter groups
lr (float): learning rate
Contributor

Add default 1e-3 here

Contributor

Unless with LARS it is recommended to decide on an initial LR? From the paper, it looks like they specifically use bigger LRs.

Contributor Author

@federicopozzi33 federicopozzi33 May 5, 2023

Yes, they experiment with much bigger LRs in the paper. That's why I didn't initially set a default value.

In one reference implementation (here), LR=1.0 is set, whereas another (here) sets no default value.

Contributor

Ah, maybe we should just up the default to 1.0, with a comment that larger LRs are used with LARS
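If that route were taken, the defaults handling might look like this (a hypothetical sketch, mirroring SGD's argument order as discussed above; `lars_init` and every default value here are assumptions, not the merged API):

```python
def lars_init(params, lr=1.0, momentum=0.0, dampening=0.0,
              weight_decay=0.0, trust_coefficient=0.001, eps=1e-8):
    # Note: LARS is typically run with much larger base LRs than SGD/Adam,
    # hence a default of 1.0 rather than the usual 1e-3.
    if lr < 0.0:
        raise ValueError(f"Invalid learning rate: {lr}")
    return dict(lr=lr, momentum=momentum, dampening=dampening,
                weight_decay=weight_decay,
                trust_coefficient=trust_coefficient, eps=eps)

defaults = lars_init([], lr=1.0)
```

A comment next to the default, as suggested, would keep users from blindly copying SGD-style LRs.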


LARS.__doc__ = r"""Implements LARS algorithm.

For further details regarding the algorithm we refer to `Large Batch Training of Convolutional Networks`_.
Contributor

We would want to include the LaTeX version of the algorithm here. I just took some time to read the arxiv linked for the LARS introduction. Is it just me, or are there several key typos in their algorithm (e.g., the trust coefficient should be used in the local lr update but isn't)?

Contributor Author

I noticed it too.

Do you think I should just copy the algorithm they reported in the paper?
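For what it's worth, following the paper's Algorithm 1 with the trust coefficient restored in the local-LR line, the docstring's LaTeX might sketch the per-layer update like this (my reading of the paper, to be double-checked against it):

```latex
\begin{aligned}
  \lambda^{(l)} &= \eta \,
    \frac{\lVert w^{(l)}_t \rVert}
         {\lVert \nabla L(w^{(l)}_t) \rVert + \beta \lVert w^{(l)}_t \rVert} \\
  v^{(l)}_{t+1} &= m \, v^{(l)}_t
    + \gamma_t \, \lambda^{(l)} \left( \nabla L(w^{(l)}_t) + \beta \, w^{(l)}_t \right) \\
  w^{(l)}_{t+1} &= w^{(l)}_t - v^{(l)}_{t+1}
\end{aligned}
```

Here \(\eta\) is the trust coefficient, \(\beta\) the weight decay, \(m\) the momentum, and \(\gamma_t\) the global LR; note the global LR scales the new gradient term but not the accumulated momentum, matching the earlier review comment.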

Comment on lines +192 to +193
p_norm = torch.norm(param.data)
g_norm = torch.norm(d_p.data)
Contributor

Suggested change
-p_norm = torch.norm(param.data)
-g_norm = torch.norm(d_p.data)
+p_norm = torch.norm(param)
+g_norm = torch.norm(d_p)

.data is deprecated + should not be used
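The same value is available without touching `.data`, since optimizer steps already run under `torch.no_grad()` (a small illustrative check, not the PR's code):

```python
import torch

p = torch.ones(3, requires_grad=True)

with torch.no_grad():           # optimizer steps already run in this context
    n = torch.norm(p)

legacy = torch.norm(p.data)     # deprecated style, same numeric result

assert torch.equal(n, legacy)   # identical values, no autograd bypass needed
```

So dropping `.data` changes nothing numerically; it just avoids the deprecated autograd escape hatch.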


if weight_decay != 0:
# LARS scaling:
if p_norm * g_norm > 0:
Contributor

Suggested change
-if p_norm * g_norm > 0:
+if p_norm != 0 and g_norm != 0:

This should be a cheaper check



Contributor

@janeyx99 janeyx99 May 1, 2023

Actually the check is probably unnecessary (reason being that they'll be floats and comparing floats to 0 is often silly). I would vouch for getting rid of this check entirely and keeping the eps term.

Contributor Author

I moved the two norm computations inside the if condition, since they are used only there.

lars_lr = trust_coefficient * p_norm / (g_norm + p_norm * weight_decay + eps)

d_p = d_p.add(param, alpha=weight_decay)
d_p.mul_(lars_lr)
Contributor

Suggested change
-d_p.mul_(lars_lr)
+d_p = d_p * lars_lr

avoid inplace updates

Contributor Author

@federicopozzi33 federicopozzi33 May 5, 2023

Ok. There are some other in-place updates. Should I change them too?

I'm referring to:

buf.mul_(momentum).add_(d_p, alpha=1 - dampening)

and

param.add_(d_p, alpha=-lr)
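The aliasing risk behind the suggestion can be shown in a few lines (a standalone sketch; in the real step, `d_p` can alias the user-visible gradient when the weight-decay branch is skipped):

```python
import torch

grad = torch.full((3,), 0.5)
d_p = grad                       # no weight-decay branch: d_p aliases the gradient

d_p.mul_(0.1)                    # in-place scale silently rewrites `grad` too
aliased_value = grad[0].item()   # no longer the original 0.5

grad2 = torch.full((3,), 0.5)
d_p2 = grad2 * 0.1               # out-of-place: the gradient is left intact
intact_value = grad2[0].item()   # still 0.5
```

By contrast, `buf.mul_(momentum)` and `param.add_(d_p, alpha=-lr)` mutate the optimizer's own state and the parameters on purpose, so those in-place calls are expected.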

optimizer_ctor([torch.empty((), device="cuda")], differentiable=True, fused=True)

def test_lars(self):
# ASK: What's the reason behind two identical calls? (See SGD tests)
Contributor

@janeyx99 janeyx99 May 1, 2023

Which two are you referring to here? The optim tests are certainly up for a refactor, so there might just be actual redundancies we can get rid of

Contributor Author

@federicopozzi33 federicopozzi33 May 5, 2023

I'm referring to the first two test cases defined here

def test_sgd(self):

@janeyx99
Contributor

janeyx99 commented May 1, 2023

@federicopozzi33 I've gone over it with a more thorough review. Generally, the math looks consistent :D

I've also read the paper and realized a bit anticlimactically that the entire "layer-wise" portion of the optimizer is just a local multiplier that seeks to balance out the weight norm to grad norm ratio. Thankfully that means that this optimizer isn't too crazy to implement. 😛

@federicopozzi33
Contributor Author

federicopozzi33 commented May 2, 2023

> @federicopozzi33 I've gone over it with a more thorough review. Generally, the math looks consistent :D
>
> I've also read the paper and realized a bit anticlimactically that the entire "layer-wise" portion of the optimizer is just a local multiplier that seeks to balance out the weight norm to grad norm ratio. Thankfully that means that this optimizer isn't too crazy to implement. 😛

Hi @janeyx99,
thanks for the review. I will go through your comments in the next few days.

@federicopozzi33 federicopozzi33 requested a review from janeyx99 May 21, 2023 16:25
@janeyx99
Contributor

@federicopozzi33 I wanted to update you on our side--sorry for the delay. We are currently in the middle of planning and discussing how we want to offer and incorporate new optimizers + optimizer features, including a test revamp that should enable safer test coverage for new optims. Thus, let's put a pause on this PR until we reach a consensus on conclusions from our side.

@github-actions
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Aug 27, 2023
@github-actions github-actions bot closed this Sep 26, 2023
@datumbox
Contributor

@janeyx99 @federicopozzi33 Too bad to see this optimiser didn't make it into PyTorch, despite its popularity. Any chance we can kick this off again?

@federicopozzi33
Contributor Author

federicopozzi33 commented Oct 1, 2023

Hi @datumbox,
I'm waiting for instructions from @janeyx99 to get back to it.

See the latest messages for more details.

@janeyx99 janeyx99 reopened this Oct 1, 2023
@janeyx99 janeyx99 removed the Stale label Oct 1, 2023
@github-actions
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Nov 30, 2023
@janeyx99 janeyx99 removed the Stale label Nov 30, 2023
@github-actions
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Jan 30, 2024
@janeyx99 janeyx99 added no-stale and removed Stale labels Jan 30, 2024

Labels

ciflow/inductor module: mkldnn Related to Intel IDEEP or oneDNN (a.k.a. mkldnn) integration module: optimizer Related to torch.optim no-stale open source release notes: nn release notes category


5 participants