[feat][OSS] elastic and pytorch compatible checkpoints #310
blefaudeux merged 42 commits into master from
Conversation
Incompatibility with old Fairscale checkpoints is fine with me -- but if we could support checkpoints from any arbitrary ADAM optimizer, that would be a huge win! It doesn't even have to be in this class - a separate translation script could work if that's easier.
Should be doable with the same PR; I'll have a look at that.
@blefaudeux Let me know when this is ready for review.
Thank you for the ping, @blefaudeux! @sgugger did the fairscale integration, so we should probably check with him. (I only wrote the docs and added some basic tests.) I had a quick look and I don't see us doing anything special for fairscale checkpoint-wise, just straight pytorch. On the other hand, we don't really have any tests at the moment that exercise checkpointing with fairscale (or deepspeed, for that matter), so it's possible it has never worked properly in the first place and you're doing us a favor with this fix. Yay!
Thanks for the feedback! To be clear, it works right now; this is not really a fix but a change to (a) align with pytorch (a plain pytorch checkpoint should load) and (b) make it possible to change the number of ranks between checkpoints (save with 8 ranks and reload with 16, for instance).
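To illustrate point (b), here is a minimal, hypothetical sketch of the elastic reload idea: if the consolidated optimizer state is keyed by flat parameter indices rather than by rank, it can be re-partitioned for any world size at load time. The `greedy_partition` helper and the per-parameter sizes below are illustrative assumptions, not fairscale's actual implementation.

```python
def greedy_partition(param_sizes, world_size):
    """Assign each parameter index to the currently least-loaded rank (by element count)."""
    buckets = [[] for _ in range(world_size)]
    loads = [0] * world_size
    for index, size in enumerate(param_sizes):
        rank = loads.index(min(loads))  # least-loaded rank so far
        buckets[rank].append(index)
        loads[rank] += size
    return buckets

# A checkpoint saved with 2 ranks can be re-partitioned for 4 ranks at load time:
sizes = [100, 80, 60, 40, 20, 10]
print(greedy_partition(sizes, 2))  # → [[0, 3, 4], [1, 2, 5]]
print(greedy_partition(sizes, 4))  # → [[0], [1], [2, 5], [3, 4]]
```

Because the partitioning is recomputed from the parameter list at load time, the saved state never needs to know how many ranks will consume it.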
I don't think the compatibility issue is a big matter on our side: for us, checkpoints are used to resume training if it was interrupted for some reason, so I don't expect users will have updated their envs in between.
Alright, perfect, and thanks for dropping by! I'll remember to keep you in the loop @sgugger, but no big change is expected on OSS after this one.
The optimizer state contents changed in torch 1.6+, and I'm not handling 1.5 well right now (this change was new to me). I'll update that and then put the PR up for review.
Ready for review. Figuring out something not too ugly that works for both torch 1.5 (where some state map indices are id(param)) and above (plain sequential indices: 0, 1, 2, ...) was interesting; I think it's mostly OK now.
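The version gap described above can be sketched as a small key-normalization step (an illustrative assumption, not the actual fairscale code): whatever key type the torch version used for the optimizer `state` dict, remap it onto the parameter's global position so checkpoints compare equal across versions.

```python
def normalize_state_keys(state, param_groups):
    """Re-key `state` so every entry uses the param's global positional index.

    Assumes each group's "params" list holds the same key type that `state`
    uses (id(param)-style keys on old torch, positional ints on newer torch).
    """
    # Build the global positional index for every param, in param_groups order.
    position = {}
    i = 0
    for group in param_groups:
        for key in group["params"]:
            position[key] = i
            i += 1
    # Remap id-style keys; already-positional keys map onto themselves.
    return {position.get(key, key): value for key, value in state.items()}

# id(param)-style keys (fake ids for illustration) become positions 0..N-1:
groups = [{"params": [1111, 2222]}, {"params": [3333]}]
print(normalize_state_keys({1111: "s0", 3333: "s2"}, groups))  # → {0: 's0', 2: 's2'}
```

With both layouts normalized to positional keys, the rest of the load path only has to deal with one indexing scheme.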
Ping reviewers: happy to change anything if it means this can land somehow; it's hard to keep old PRs alive.
This fixes the linting that #342 apparently broke, even though it didn't touch these files.
* adding fairseq's gha
* adding py3.9, removing submodules
* yaml linting
authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
msbaines
left a comment
Can you add a test for multiple param_groups? There was a bug related to that in an earlier iteration of this change. Other than that, LGTM.
msbaines
left a comment
Please add a test for multiple param_groups before submitting.
The ddp_parity test takes 2 groups now, with different learning rates; let me know if you had something else in mind.
global_id_map was incorrect in an earlier iteration of the patch for the case of multiple param_groups, but no tests were failing. It would be nice to have a test that verifies that the new load_state_dict and state_dict logic works when there are multiple param_groups. Not sure if the ddp_parity test handles that.
    optimizer_settings["momentum"] = 0.9
    ...
    sharded_optimizer = optim.OSS(
        params=oss_trainable_params,
Yes it does, it checks that:
There's also another test which checks that stepping OSS alone, then rolling back (loading an old state dict) and stepping again gives the same result. That one has only one param group; it's always possible to add more, of course.
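The multi-group indexing pitfall discussed above can be sketched in a few lines (the `build_global_id_map` name is a stand-in mirroring the `global_id_map` mentioned in the review, not the exact fairscale code): the global index of a parameter must keep counting across group boundaries rather than restarting at 0 for each group.

```python
def build_global_id_map(param_groups):
    """Map (group_index, local_index) -> global param index across all groups."""
    global_id_map = {}
    next_id = 0
    for g, group in enumerate(param_groups):
        for local, _param in enumerate(group["params"]):
            global_id_map[(g, local)] = next_id
            next_id += 1
    return global_id_map

# Two groups (e.g. with different learning rates): the second group's params
# continue the numbering instead of restarting at 0.
groups = [{"params": ["w1", "w2"]}, {"params": ["w3"]}]
print(build_global_id_map(groups))  # → {(0, 0): 0, (0, 1): 1, (1, 0): 2}
```

A map that restarted at 0 per group would silently collide the state of `w1` and `w3`, which is exactly the kind of bug a dedicated multi-group state_dict/load_state_dict test would catch.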
* [chore] Fix lint errors that broke master (#348)
* [fix] ShardedDDP - CPU test fix, remove Gloo/CPU (#350): no idea about the root issue, but it proved to be fairly narrow (gloo + cpu + python 3.8 + no cuda installed), so that's out of scope for fairscale
* [feat][OSS] elastic and pytorch compatible checkpoints (#310): adding a test proving inter-operability with upstream pytorch, changelog update, eager state pruning, pytorch 1.5 compat
* [fix] ShardedDDP - properly handle post device change (#353): adding .to(device) support + unit testing, doc update
* [feat] Add AdaScaleWrapper (#347): enables a different API for wrapping an optimizer with AdaScale, and enables AdaScale to be wrapped by OSS; however, OSS wrapping AdaScale results in different optimization, and future research is needed to study its effects
* [refactor] Refactor and enable multiprocess nn.Pipe benchmarks (#319)
* [refactor] pipe: simplify balance and module checks (#346)
* [chore] v0.1.5 (#355)
* [chore] disheartening switch-off of an OSS cpu test (#356): precise skip, only if the agent is cpu-only
* [feat][minor] OSS Benchmark - regression test + background testing of new optims (#352): restores the regression test, adds a test of the for_each optims, removes unused flags
* [refactor] multiprocess_pipe: cleanup __init__ (#357)
* [refactor] multiprocess_pipe: remove the retain_graph __init__ param (#358): not currently used, so removing it simplifies the interface
* [refactor] multiprocess_pipe: focus on LazyModule usage (#360)
* [feat] ShardedDDP: adding a proper DDP parity / AMP unit test, overdue (#361): catches non-AMP pytorch
* [perf][OSS] Clip grad norm: minor obvious speedup (#363): cache this iterator, easy speedup
* [refactor] multiprocess_pipe: remove pipelined_backward (#362)
* [perf] ShardedDDP - small memory use reduction, minor speedup (#366)
* [fix] repro + fix (#365): fixes a broken earlier commit which only worked for the first step
* [refactor] OSS: only use flat buffers (#371): flat params all along, way simpler; docstring update
* [refactor] AsyncPipe: do not sub-class MultiProcessPipe (#370)
* [refactor] remove multiprocess dependency on async (#373)
* [fix] Workaround need for pip --no-build-isolation (#375)
* Add fairscale.nn.misc.checkpoint_activations and fairscale.utils.containers (#376)
* [chore] v0.1.6 (#377)

Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>, anj-s <32556631+anj-s@users.noreply.github.com>, Benjamin Lefaudeux <blefaudeux@users.noreply.github.com>, msbaines <35972327+msbaines@users.noreply.github.com>, Leonard Lausen <leonard@lausen.nl>, Myle Ott <myleott@fb.com>, Min Xu <24926999+min-xu-ai@users.noreply.github.com>, Sam Shleifer <sshleifer@gmail.com>
Before submitting
What does this PR do?
Fixes #164 and makes the saved state pytorch-compliant (no extra keywords). The number of ranks can change between save and load; the optimizer automatically adapts by repartitioning the state at load time. Adds a new unit test which checks reproducibility (cc @joshim5).
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
@joshim5 @stas00 @SeanNaren this breaks compatibility with old checkpoints, is that a big issue? I could add some scaffolding to move old checkpoints to the new form.
@mannatsingh I know you mentioned this a long time ago; we're finally there. Not sure how you would rate the complexity of doing this (masking the sharding when rebuilding a pytorch-compatible state dict), but it now works out of the box with this PR.
Did you have fun?
Make sure you had fun coding 🙃