[fix] ShardedDDP - cpu testfix - remove Gloo/CPU by blefaudeux · Pull Request #350 · facebookresearch/fairscale

blefaudeux · 2021-02-01T20:25:19Z

Before submitting

Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
Did you read the contributor guideline?
Did you make sure to update the docs?
Did you write any new necessary tests?

What does this PR do?

Fixes intermittent issues with newer Python versions, which I could not reproduce locally. I don't think CPU use is very important here, so we are not losing much.

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

min-xu-ai · 2021-02-01T21:30:09Z

Is it only failing on py38? If so, perhaps skip it only on py38? The CPU test can still be useful for testing things when multiple GPUs are not available. What do you think?

blefaudeux · 2021-02-01T21:43:14Z

> Is it only failing on py38? If so, perhaps skip it only on py38? The CPU test can still be useful for testing things when multiple GPUs are not available. What do you think?

It seems to fail only on py3.8, and only on CPU-only installs for some reason. The same test was also run on CPU on GPU-enabled hosts and never had any problem (see for instance https://app.circleci.com/pipelines/github/facebookresearch/fairscale/1504/workflows/9859ca20-6552-44e2-befa-642eb07f4a8e). It only fails randomly on the new CPU-only tests, and I can't even reproduce it locally; I guess it's tied to the specific torch that we install. I'm sure it can be investigated further, but that takes a lot of time and I'm not sure it's worth it right now. Besides, how would you skip this test only on py38? The job is not really visible from the Python codebase.

min-xu-ai · 2021-02-01T22:22:14Z

skipping based on version should be simple. I had it at one point but never merged. I did the following:

add skip_if_py38 in testing.py, something like:

import sys
import pytest

if sys.version_info.major == 3 and sys.version_info.minor == 8:
    pytest.skip()

then, just use @skip_if_py38 to annotate the test.
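For reference, the suggestion above can be sketched as a reusable pytest marker. This is a minimal sketch and not the code that was merged: only the name skip_if_py38 and the version condition come from the comment, while the reason string, the IS_PY38 constant, and the placeholder test are assumptions made for illustration.

```python
import sys

import pytest

# Same condition as in the comment: true only when running on Python 3.8.x.
IS_PY38 = sys.version_info.major == 3 and sys.version_info.minor == 8

# pytest.mark.skipif builds a marker; any test decorated with it is skipped
# when the condition is true. The reason text here is an assumed wording.
skip_if_py38 = pytest.mark.skipif(IS_PY38, reason="intermittent failure on Python 3.8")


@skip_if_py38
def test_placeholder():
    # Hypothetical test body, standing in for the real ShardedDDP test.
    assert 1 + 1 == 2
```

Because the marker is evaluated by pytest at collection time, the decision depends only on the interpreter running the tests, which sidesteps the concern that the CI job is not visible from the Python codebase.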

min-xu-ai · 2021-02-01T22:23:45Z

Another way is to skip the assertion that checks whether the model differs between ranks when running on a specific Python version. I think that should minimize our debugging overhead while still keeping some portion of the code covered across versions.
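A rough sketch of this second option, under stated assumptions: the function name check_ranks_in_sync, its arguments, and the use of plain lists in place of model parameters are all hypothetical and chosen only to keep the example self-contained. The real test compares torch models across ranks.

```python
import sys


def check_ranks_in_sync(params_per_rank, version_info=sys.version_info):
    """Assert that every rank holds the same parameters, except on Python 3.8.

    params_per_rank: one list of parameter values per rank (hypothetical
    stand-in for the real model state).
    Returns True if the check ran and passed, False if it was skipped.
    """
    if tuple(version_info[:2]) == (3, 8):
        # Only the flaky assertion is skipped; the code that produced
        # params_per_rank has still been exercised on this version.
        return False

    reference = params_per_rank[0]
    for rank, params in enumerate(params_per_rank[1:], start=1):
        assert params == reference, f"rank {rank} differs from rank 0"
    return True
```

The trade-off is that a Python 3.8 run still executes the training steps, so crashes would be caught there, but a silent divergence between ranks would not.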

blefaudeux · 2021-02-02T01:14:53Z

> sys.version_info.major == 3 and sys.version_info.minor == 8

Done, with something similar. I would not have guessed that it was the preferred solution, but that's completely fine by me.

min-xu-ai

awesome!

* [chore] Fix lint errors that broke master (#348)
  authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
* [fix] ShardedDDP - cpu testfix - remove Gloo/CPU (#350)
  * no idea about the root issue, but it proved to be fairly narrowed (gloo+cpu+python3.8+no cuda installed) so I guess that's out of scope for fairscale
* [feat][OSS] elastic and pytorch compatible checkpoints (#310)
  * adding a test to prove the inter operability with upstream pytorch
  * updating the changelog
  * eager state pruning
  * pytorch 1.5 compat
* [fix] ShardedDDP - properly handle post device change (#353)
  * adding the .to(device) support + unit testing
  * doc update
* [feat] Add AdaScaleWrapper (#347)
  * [feat] Add AdaScaleWrapper
    - This enables a different API for wrapping an optimizer with AdaScale.
    - This also enables AdaScale to be wrapped by OSS.
    - However, OSS wrapping AdaScale results in different optimization, which future research will be needed to study its effects.
    testing: add unit tests.
  * addressed comment: typo
* [refactor] Refactor and enable multiprocess nn.Pipe benchmarks. (#319)
  * mp cleanup
  * round of multiprocess refactoring
  * test golden run
  * print cuda stats
  * fix lint errors
  * enable multiprocess pipe benchmarks
  * set world size to be available gpus
  * more changes
  * use synthetic loaders for intermediate pipeline stages
  * merged master
  * fix for the devices property
  * dataloader fix
  * modify rank check
  * print wps stats
  * enable verification
  * fix logging
  * fix flag name
  * fix flag name
  * check for rank
  * fix indent
  * pass args
  * pass args
  * modify golden data
  * remove unused print messsage
  * fix lint errors
  * add comments
  * fix benchmarks
  Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
* [refactor] pipe: simplify balance and module checks (#346)
* [chore] v0.1.5 (#355)
* [chore] disheartening switch off of a OSS cpu test (#356)
  * precise skip, only if agent has only cpu
* [feat][minor] OSS Benchmark - regression test + background testing new optims (#352)
  * restoring the regression test, adding a test of the for_each optims
  * fix the regression test on circleci
  * removing unused flags
* [refactor] multiprocess_pipe: cleanup __init__ (#357)
* [refactor] multiprocess_pipe: remove retain_graph __init__ param (#358)
  It is not currently being used so we can simplify the interface by removing it.
* [refactor] multiprocess_pipe: focus on LazyModule usage (#360)
* [feat] ShardedDDP : Adding a proper DDP parity / AMP unit test, overdue (#361)
  * Adding a proper ddp parity / AMP unit test, overdue
  * catch non-AMP pytorch
* [perf][OSS] Clip grad norm : minor obvious speedup (#363)
  cache this iterator, easy speed up
* [refactor] multiprocess_pipe: remove pipelined_backward (#362)
* [perf] ShardedDDP - small memory use reduction - minor speedup (#366)
  * minor
  * minor
* [fix] repro+fix (#365)
  fix a broken earlier commit, only worked for the first step
* [refactor] OSS only use flat buffers (#371)
  * flat params all along, way simpler
  * updating the docstring
* [refactor] AsyncPipe: do not sub-class MultiProcessPipe (#370)
* [refactor] remove multiprocess dependency on async (#373)
* [fix] Workaround need for pip --no-build-isolation (#375)
* Add fairscale.nn.misc.checkpoint_activations (#376)
  * Add fairscale.utils.containers
    Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
  * Add fairscale.nn.misc.checkpoint_activations
    Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
  Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
  Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
* [chore] v0.1.6 (#377)
  * v0.1.6

Co-authored-by: anj-s <32556631+anj-s@users.noreply.github.com>
Co-authored-by: Benjamin Lefaudeux <blefaudeux@users.noreply.github.com>
Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
Co-authored-by: msbaines <35972327+msbaines@users.noreply.github.com>
Co-authored-by: Leonard Lausen <leonard@lausen.nl>
Co-authored-by: Myle Ott <myleott@fb.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>

blefaudeux added 2 commits February 1, 2021 12:10

removing the cpu test, not too interesting anyway

24713da

removing gloo for now

2dabbcc

facebook-github-bot added the CLA Signed label Feb 1, 2021

blefaudeux requested review from anj-s, min-xu-ai and msbaines February 1, 2021 20:25

review, thanks Min

f37fef2

now with a reason

08beb60

min-xu-ai approved these changes Feb 2, 2021


blefaudeux merged commit c2dd6c3 into master Feb 2, 2021

blefaudeux deleted the shardedddp-cpu-testfix branch February 2, 2021 02:14

blefaudeux merged 4 commits into master from shardedddp-cpu-testfix


3 participants
