[ShardedDDP] Sync buffers + small cleanup (#112)
Conversation
Long-lived branch, so it was rebased a couple of times against master; most of the commits listed above are unrelated.
```python
# Check that the optimization process makes sense (ie. loss goes down for the same data)
optimizer.step()
new_eval = ddp(input_tensor).abs().sum() / input_tensor.numel()
# assert new_eval.item() < output.item()
```
This would be LR- and random-seed-dependent, so it was not a very good test; it was not actually being used.
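A single-step comparison like the one commented out above is brittle for exactly that reason. One less LR-sensitive variant (illustrative only, not from the PR; a toy quadratic objective stands in for the real DDP forward pass) is to compare the loss averaged over a window of steps:

```python
def loss(w):
    # Toy quadratic objective standing in for the real forward pass
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

def mean_loss_over_steps(w, lr, n_steps):
    # Average the loss over a window of steps instead of comparing two
    # single evaluations, which is less sensitive to LR and random init.
    total = 0.0
    for _ in range(n_steps):
        total += loss(w)
        w -= lr * grad(w)
    return total / n_steps, w

first, w = mean_loss_over_steps(0.0, lr=0.1, n_steps=10)
second, _ = mean_loss_over_steps(w, lr=0.1, n_steps=10)
assert second < first  # averaged loss went down across windows
```

The windowed average decreases monotonically for any reasonable LR on a convex toy problem, so the assertion is far less likely to flake than a one-step check.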
| """ | ||
| assert self.module.training, "Cannot call reduce in eval" | ||
|
|
||
| def reduce_params(params: List[Parameter], params_rank: int) -> None: |
Some of these cases were never used here; I assume they came from a copy-paste where this function was more generic than just reducing grads.
Yeah, that was me; I mis-named it before.
Ping reviewers: @msbaines @min-xu-ai
min-xu-ai left a comment
Looks good. Feel free to add a TODO to address my comments on potential perf optimization. Sorry for the delay.
```python
map(
    lambda x: x.wait(),
    map(
        lambda x: dist.broadcast(x, self.authoritative_rank, self.process_group, async_op=True),
```
Does it make sense to combine some buffers and reduce the total number of broadcasts, like what we do above for reduce_grad?
Yes, I think it would be an interesting option; I have an old branch doing that which I should revive. There's a trade-off though, because bucketing requires extra copies. In practice, once the broadcast became async I saw a big speed bump, so I wonder whether the bucketing strategy is still worth following. I'll add a TODO, good idea.
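For the TODO, the bucketing side of that trade-off could look something like this greedy packing step (a hypothetical helper sketched with plain element counts instead of real tensors; the real version would flatten each bucket's buffers into one tensor, issue a single `dist.broadcast` per bucket, then copy back):

```python
from typing import List

def bucket_by_size(numels: List[int], max_bucket_numel: int) -> List[List[int]]:
    # Greedily pack buffer sizes into buckets so that each bucket needs a
    # single collective instead of one broadcast per buffer. A buffer larger
    # than the cap simply gets a bucket of its own.
    buckets: List[List[int]] = []
    current: List[int] = []
    current_numel = 0
    for n in numels:
        if current and current_numel + n > max_bucket_numel:
            buckets.append(current)
            current, current_numel = [], 0
        current.append(n)
        current_numel += n
    if current:
        buckets.append(current)
    return buckets

# Five buffers packed under a 100-element cap -> three broadcasts instead of five
print(bucket_by_size([40, 50, 20, 90, 10], max_bucket_numel=100))
# → [[40, 50], [20], [90, 10]]
```

The copies in and out of the flat bucket tensor are the cost mentioned above; with async broadcasts already overlapping, bucketing only pays off when the per-collective launch overhead dominates.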
| """ | ||
| assert self.module.training, "Cannot call reduce in eval" | ||
|
|
||
| def reduce_params(params: List[Parameter], params_rank: int) -> None: |
There was a problem hiding this comment.
yeah, it was me mis-named it before.