[feat] OSS Benchmark - regression test + background testing new optims #352
blefaudeux merged 4 commits into master
Conversation
```diff
- "reference_speed": 1430,
- "reference_memory": 1220,
- "reference_loss": 0.006,
+ "reference_speed": 660,
```
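These reference values feed the regression check. A minimal, hypothetical sketch of how such a check might compare measured numbers against the golden values (function name, dict keys, and margins are illustrative, not fairscale's actual API):

```python
def check_regression(measured, reference, speed_margin=0.05, memory_margin=0.05, loss_margin=0.01):
    """Fail if the measured run regresses beyond the given margins.

    Both arguments are dicts with keys "speed" (higher is better),
    "memory" (lower is better) and "loss" (lower is better).
    """
    assert measured["speed"] >= reference["speed"] * (1 - speed_margin), "speed regression"
    assert measured["memory"] <= reference["memory"] * (1 + memory_margin), "memory regression"
    assert measured["loss"] <= reference["loss"] + loss_margin, "loss regression"

# Golden values matching the CI numbers quoted above:
reference = {"speed": 660, "memory": 930, "loss": 0.023}
check_regression({"speed": 680, "memory": 910, "loss": 0.020}, reference)  # passes
```

A run that falls below the speed floor (or above the memory/loss ceilings) raises an `AssertionError`, which fails the CI job.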
the numbers on the left must have been measured on a devfair machine; the numbers on the right are the ones we used on CircleCI previously
```diff
  command: |
    python benchmarks/oss.py --world_size 4 --epochs 2
-   python benchmarks/oss.py --check_regression --world_size 4 --optim_type oss_sharded_ddp --reference_speed 660 --reference_memory 930 --reference_loss 0.023
+   python benchmarks/oss.py --check_regression --world_size 4 --optim_type oss_sharded_ddp
```
the flags were not actually used; the "golden config" is used instead
```python
parser.add_argument("--epochs", action="store", default=10, type=int)
parser.add_argument("--batch_size", action="store", default=256, type=int)
parser.add_argument("--check_regression", action="store_true", default=False)
parser.add_argument("--reference_speed", action="store", default=1430, type=float)
```
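Since the per-metric flags are unused, one way the "golden config" approach could look is a table of reference numbers keyed by optimizer type, so the CI command only needs `--check_regression` and `--optim_type`. This is a hypothetical sketch, not fairscale's actual code (`GOLDEN_CONFIGS` and its keys are made up for illustration):

```python
import argparse

# Hypothetical golden config: reference numbers per optimizer type,
# replacing the individual --reference_* CLI flags.
GOLDEN_CONFIGS = {
    "oss_sharded_ddp": {"reference_speed": 660, "reference_memory": 930, "reference_loss": 0.023},
}

parser = argparse.ArgumentParser()
parser.add_argument("--optim_type", action="store", default="oss_sharded_ddp", type=str)
parser.add_argument("--check_regression", action="store_true", default=False)

args = parser.parse_args(["--check_regression"])
golden = GOLDEN_CONFIGS[args.optim_type]
assert args.check_regression and golden["reference_speed"] == 660
```

Centralizing the references this way keeps the CI config short and makes the golden values reviewable in one place.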
not used anymore anyway, not changed by this PR
The numbers are slightly different for now with the new optims; this is being fixed with pytorch/pytorch#48223. The end plan is even faster with complete 1:1 bit parity, so I think we could start recommending them.
```python
logging.basicConfig(level=logging.INFO if not args.debug else logging.DEBUG)

use_multi_tensor = args.multi_tensor_optim and hasattr(torch.optim, "_multi_tensor")
OPTIM = torch.optim._multi_tensor.RMSprop if use_multi_tensor else torch.optim.RMSprop  # type: ignore # attr is checked but mypy misses that
```
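The hunk above uses `hasattr` to feature-detect the experimental `_multi_tensor` namespace and fall back to the default optimizer. The same pattern can be written as a small helper; this sketch uses a toy stand-in module instead of `torch.optim` so the fallback behaviour is visible without a torch install (`pick_optimizer` and `fake_optim` are illustrative names, not fairscale code):

```python
import types

def pick_optimizer(optim_module, name="RMSprop", prefer_multi_tensor=True):
    """Return the _multi_tensor variant of `name` when available, else the default."""
    multi = getattr(optim_module, "_multi_tensor", None)
    if prefer_multi_tensor and multi is not None and hasattr(multi, name):
        return getattr(multi, name)
    return getattr(optim_module, name)

# Tiny stand-in for torch.optim to show the fallback:
fake_optim = types.SimpleNamespace(RMSprop="default-rmsprop")
assert pick_optimizer(fake_optim) == "default-rmsprop"  # no _multi_tensor: fallback

fake_optim._multi_tensor = types.SimpleNamespace(RMSprop="foreach-rmsprop")
assert pick_optimizer(fake_optim) == "foreach-rmsprop"  # experimental variant picked
```

Feature-detecting rather than version-checking keeps the benchmark working across pytorch releases where `_multi_tensor` may or may not exist.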
Have we tested this with OSS? I couldn't find an example.
no, sorry if I was not clear, the whole point of this PR is to test that :)
I guess I was confused by the PR description, which said "adding tests for new optims", but I guess you meant regression benchmarks.
If you want to test the new optims, you could do a golden weight check, which is a different test from the regression benchmark you are adding. I am not sure we need to add a regression benchmark (mainly because it adds to the time taken to run all checks on GH). Given that I don't have context about this use case, I'll leave you to decide :)
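The "golden weight check" mentioned here can be sketched as follows: run the baseline and the candidate optimizer on identical data and assert the final weights match. This toy example uses scalar SGD steps in place of the real optimizers (in the actual test they would be `torch.optim.RMSprop` vs `torch.optim._multi_tensor.RMSprop`); all names and values are illustrative:

```python
def sgd_step(w, grad, lr=0.1):
    """Baseline update rule."""
    return w - lr * grad

def sgd_step_candidate(w, grad, lr=0.1):
    """Candidate implementation under test (here intentionally identical)."""
    return w - lr * grad

def run(step_fn, w0, grads):
    """Apply a sequence of gradient steps and return the final weight."""
    w = w0
    for g in grads:
        w = step_fn(w, g)
    return w

grads = [0.5, -0.2, 0.1]
baseline = run(sgd_step, 1.0, grads)
candidate = run(sgd_step_candidate, 1.0, grads)
assert abs(baseline - candidate) < 1e-12  # 1:1 parity within tolerance
```

Until the multi-tensor optims reach bit parity upstream, the tolerance would need to be loosened; once pytorch/pytorch#48223 lands, it can be tightened to exact equality.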
```yaml
- run:
    name: Run OSS with Torch AMP and ForEach optimizer
    command: |
      python benchmarks/oss.py --amp --epochs 3 --optim_type oss_sharded_ddp --multi_tensor_optim
```
Why do we want to add a regression benchmark? Would tests suffice?
in that case it's not a regression test, it just checks that it runs?
Replied below: you can do a golden weight comparison check, which makes sure that the results match those of the baseline optimizer (if that is what you want).
no, so there are two things in this PR, not completely related:
- restore the regression test for ShardedDDP, which was broken
- start a test job with these new optimizers. Right now they are not 1:1 correct; that is being changed as I write, so it should soon be the case. I thought that having a test job on the topic could be useful to check parity and catch any eventual issues early.
@anj-s ok to land? I can change something if you'd like
* [chore] Fix lint errors that broke master (#348)
* [fix] ShardedDDP - cpu test fix - remove Gloo/CPU (#350): no idea about the root issue, but it proved to be fairly narrow (gloo + cpu + python 3.8 + no cuda installed), so I guess that's out of scope for fairscale
* [feat][OSS] elastic and pytorch compatible checkpoints (#310): adding a test to prove the interoperability with upstream pytorch, updating the changelog, eager state pruning, pytorch 1.5 compat
* [fix] ShardedDDP - properly handle post device change (#353): adding the .to(device) support + unit testing, doc update
* [feat] Add AdaScaleWrapper (#347): enables a different API for wrapping an optimizer with AdaScale, and enables AdaScale to be wrapped by OSS. However, OSS wrapping AdaScale results in different optimization; future research will be needed to study its effects. Testing: add unit tests.
* [refactor] Refactor and enable multiprocess nn.Pipe benchmarks (#319): mp cleanup, multiprocess refactoring, test golden run, print cuda and wps stats, set world size to available gpus, use synthetic loaders for intermediate pipeline stages, dataloader and logging fixes, rank checks, enable verification, modify golden data, fix lint errors, add comments, fix benchmarks
* [refactor] pipe: simplify balance and module checks (#346)
* [chore] v0.1.5 (#355)
* [chore] disheartening switch off of a OSS cpu test (#356): precise skip, only if the agent has only cpu
* [feat][minor] OSS Benchmark - regression test + background testing new optims (#352): restoring the regression test, adding a test of the for_each optims, fixing the regression test on circleci, removing unused flags
* [refactor] multiprocess_pipe: cleanup __init__ (#357)
* [refactor] multiprocess_pipe: remove retain_graph __init__ param (#358): it is not currently being used, so we can simplify the interface by removing it
* [refactor] multiprocess_pipe: focus on LazyModule usage (#360)
* [feat] ShardedDDP: adding a proper DDP parity / AMP unit test, overdue (#361): catch non-AMP pytorch
* [perf][OSS] Clip grad norm: minor obvious speedup (#363): cache this iterator, easy speedup
* [refactor] multiprocess_pipe: remove pipelined_backward (#362)
* [perf] ShardedDDP - small memory use reduction - minor speedup (#366)
* [fix] repro + fix (#365): fix a broken earlier commit, which only worked for the first step
* [refactor] OSS only use flat buffers (#371): flat params all along, way simpler; updating the docstring
* [refactor] AsyncPipe: do not sub-class MultiProcessPipe (#370)
* [refactor] remove multiprocess dependency on async (#373)
* [fix] Workaround need for pip --no-build-isolation (#375)
* Add fairscale.nn.misc.checkpoint_activations (#376): also adds fairscale.utils.containers
* [chore] v0.1.6 (#377)

Co-authored-by: anj-s <32556631+anj-s@users.noreply.github.com>, Benjamin Lefaudeux <blefaudeux@users.noreply.github.com>, Anjali Sridhar <anj@devfair0443.h2.fair>, msbaines <35972327+msbaines@users.noreply.github.com>, Leonard Lausen <leonard@lausen.nl>, Myle Ott <myleott@fb.com>, Sam Shleifer <sshleifer@gmail.com>, Min Xu <24926999+min-xu-ai@users.noreply.github.com>
Before submitting
What does this PR do?
Fixes the regression test not being run anymore, and adds a background test for the new multi-tensor optims (just checking that it runs, and getting an idea of the perf difference for now).
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.
cc @izdeby
Did you have fun?
Make sure you had fun coding 🙃