Conversation
@dasoto: Any idea when this will be merged? I tried to implement the prime_optimizer function, but it looks like it requires some of the changes on the torch_xla/csrc/init_python_bindings.cpp side.
Collaborator (Author):
Hi @dasoto, I've found that this approach will not guarantee the same sharding in the optimizer state as running a full training step (this is due to sharding propagation decisions in the compiler). I believe the adagrad unit test broke after an openxla pin update, for example. Since this is an experimental feature, I would be OK to merge after a rebase. cc @JackCaoG @alanwaketan
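A hedged sketch of how one might check that concern: collect the sharding annotation of every optimizer state tensor once after priming and once after a genuine forward/backward/step on the same model, then diff the two maps. The helper name `optimizer_state_shardings` is hypothetical; `_get_xla_sharding_spec` is an internal torch_xla binding (used in torch_xla's own tests), and its availability may vary by version.

```python
# Sketch: dump the XLA sharding spec of each optimizer state tensor so the
# sharding produced by priming can be compared against a real training step.
# optimizer_state_shardings is a hypothetical helper; _get_xla_sharding_spec
# is an internal torch_xla binding.
import torch
import torch_xla


def optimizer_state_shardings(optimizer):
    """Map (group index, param index, state key) -> XLA sharding spec string."""
    specs = {}
    for g, group in enumerate(optimizer.param_groups):
        for i, param in enumerate(group['params']):
            for key, value in optimizer.state.get(param, {}).items():
                if isinstance(value, torch.Tensor):
                    specs[(g, i, key)] = torch_xla._XLAC._get_xla_sharding_spec(value)
    return specs
```

Any mismatch between the two maps would surface the divergence described above.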
See also: #6546
The optimizer state must be primed before it can be restored: optimizer state isn't materialized until the first optim.step call, so a dummy step is needed to restore optimizer state before resuming training. This PR introduces the prime_optimizer API, which runs a dummy optimizer step with zeroed gradients. The gradient sharding is copied from the parameters so that the resulting optimizer state is sharded consistently with the parameters.
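A minimal usage sketch of the API described above. The import path (torch_xla.experimental.distributed_checkpoint) and the surrounding model/optimizer setup are assumptions for illustration, not verbatim from this PR:

```python
# Sketch: prime the optimizer so its state exists before a checkpoint restore.
# The import path is an assumption; model and optimizer choices are placeholders.
import torch
import torch_xla.core.xla_model as xm
from torch_xla.experimental.distributed_checkpoint import prime_optimizer

device = xm.xla_device()
model = torch.nn.Linear(16, 16).to(device)
optim = torch.optim.Adagrad(model.parameters(), lr=0.1)

# Optimizer state (e.g. Adagrad's per-parameter accumulators) only exists
# after the first optim.step(). prime_optimizer runs a dummy step with
# zeroed gradients whose sharding is copied from the parameters, so the
# state tensors are created (and sharded) before a checkpoint is loaded
# into them.
prime_optimizer(optim)

# optim.state_dict() now contains real tensors that a distributed
# checkpoint loader can restore in place.
```

Without the priming step, optim.state_dict() would be empty and there would be nothing for the checkpoint loader to restore into.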