Add Tuple input and token support to all-gather and reduce-scatter. #58377
hjm-aws wants to merge 2 commits into tensorflow:master
Conversation
Committer: Junmin Hao <junminh@amazon.com>
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.

@hjm-aws Can you please resolve conflicts? Thank you!

@hjm-aws Can you please sign the CLA? Thank you!

@hjm-aws It still shows the CLA is pending; can you please sign it? Thank you!
cheshire left a comment
Thanks a lot! Overall this makes sense, let me also check internally.
QQ: this is not decomposable, right? E.g. changes inside XLA could not be split from builder changes?
    Shape inferred_shape,
    ShapeInference::InferAllGatherShape({operand_shape},
                                        all_gather_dimension, shard_count));
std::vector<const Shape*> operand_shapes;
Could you also update the semantics documentation in operation_semantics.md?
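For reference, the hunk above is where the builder's shape-inference call gets generalized from a single operand to a list of operand shapes. A minimal sketch of the assumed post-change form (the exact signature merged in this PR may differ):

```cpp
// Sketch: the single-operand inference call becomes variadic over the
// collected tuple-element shapes. `operand_shapes` is filled from either
// the tuple elements or the lone non-tuple operand.
TF_ASSIGN_OR_RETURN(
    Shape inferred_shape,
    ShapeInference::InferAllGatherShape(operand_shapes,
                                        all_gather_dimension, shard_count));
```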
Thanks, overall this looks like a very good change! Added @Kariddi and @blakehechtman for clarifications.
            HasSubstr("Replica groups expected to be of uniform size"));
}

TEST_F(HloVerifierTest, ReduceScatterTwoTokens) {
I think there is some confusion here between tokens and tuples. I think the intent is to add tuple support, so we should remove any mention of tokens in test names or in the change description.
                        all_gather_dimension, shard_count));
std::vector<const Shape*> operand_shapes;
std::vector<XlaOp> operands;
if (operand_shape->IsTuple()) {
This looks like it was copy/pasted from all-reduce. Is it possible to factor it out into a helper function?
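One possible factoring, as a hedged sketch (the helper name and exact signature here are assumptions, not necessarily what the PR ends up with): collect the per-element shapes and operands once, so the tuple variants of all-reduce, all-gather, and reduce-scatter can share the logic.

```cpp
// Hypothetical helper: unpack a (possibly tuple-shaped) operand into
// parallel vectors of element shapes and XlaOps.
static void CollectTupleOperands(XlaOp operand, const Shape& operand_shape,
                                 std::vector<const Shape*>* operand_shapes,
                                 std::vector<XlaOp>* operands) {
  if (operand_shape.IsTuple()) {
    for (int64_t i = 0; i < operand_shape.tuple_shapes_size(); ++i) {
      operand_shapes->push_back(&operand_shape.tuple_shapes(i));
      operands->push_back(GetTupleElement(operand, i));
    }
  } else {
    operand_shapes->push_back(&operand_shape);
    operands->push_back(operand);
  }
}
```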
      ShapeUtil::Equal(root->shape(), ShapeUtil::MakeShape(F32, {4, 64})));
}

TEST_F(XlaBuilderTest, AllGatherWithToken) {

      ShapeUtil::Equal(root->shape(), ShapeUtil::MakeShape(F32, {4, 8})));
}

TEST_F(XlaBuilderTest, ReduceScatterWithToken) {
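(These tests were later renamed to *WithTuple; see the commit list at the end.) A hedged sketch of what such a tuple-input builder test could look like, with assumed shapes and an assumed BuildHloModule test helper:

```cpp
// Sketch: two differently-shaped operands are gathered by one op;
// the result is a tuple of the gathered shapes.
TEST_F(XlaBuilderTest, AllGatherWithTuple) {
  XlaBuilder b(TestName());
  XlaOp p0 = Parameter(&b, 0, ShapeUtil::MakeShape(F32, {4}), "p0");
  XlaOp p1 = Parameter(&b, 1, ShapeUtil::MakeShape(F32, {16, 4}), "p1");
  AllGather(Tuple(&b, {p0, p1}), /*all_gather_dimension=*/0,
            /*shard_count=*/4);
  TF_ASSERT_OK_AND_ASSIGN(auto module, BuildHloModule(&b));
  auto root = module->entry_computation()->root_instruction();
  // With 4 shards, f32[4] gathers to f32[16] and f32[16,4] to f32[64,4].
  EXPECT_TRUE(ShapeUtil::Equal(
      root->shape(),
      ShapeUtil::MakeTupleShape({ShapeUtil::MakeShape(F32, {16}),
                                 ShapeUtil::MakeShape(F32, {64, 4})})));
}
```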
TF_RET_CHECK(ag->operand_count() >= 1);

int64_t shard_count;
// There can be one token in the input Tuple. The token is a scalar or
This is confusing IMO. Can we get clarification on the intent of the token input to all-gather? Also, treating a scalar as a token is confusing.
Overall it seems this is attempting to add tuple support for all-gather and reduce-scatter, as well as an optional dummy token input to all-gather, the purpose of which is unclear. I think we should split this into two PRs: one for tuple support, and a separate discussion of token-type support in all-gather before adding it.
@hjm-aws Any update on this PR? Please. Thank you!
This PR is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you.

This PR was closed because it has been inactive for 14 days since being marked as stale. Please reopen if you'd like to work on this further.
@gbaned @jurahul, the revived PR on openxla is openxla/xla#5740. The new PR has a description that hopefully answers your questions.
Imported from GitHub PR openxla/xla#5740

This PR adds tuple input support to all-gather and reduce-scatter. It is a revival of part of tensorflow/tensorflow#58377, to be used in conjunction with pytorch/xla#5624.

In FSDP, different layers' weights need to be all-gathered/reduce-scattered during training. If some layers are small, multiple layers' weights can be aggregated for more efficient data transfer (the same concept as bucket_cap_mb in DDP). With the existing all-gather and reduce-scatter in PyTorch/XLA, the bucketing and decomposing have to be done outside of the operation. This PR enables multiple different tensors to be all-gathered/reduce-scattered while keeping their original shapes, so bucketing and decomposing optimizations can happen inside the operation.

The original PR also had token support, like the token used for all-reduce to ensure ordering between collective ops. That will be a separate PR if needed.

Copybara import of the project:

-- 7ea1159a1464efddebe9384e87ed6df504d89b2e by Junmin Hao <junminh@amazon.com>: Add Tuple input and token support to all-gather and reduce-scatter. (Committer: Junmin Hao <junminh@amazon.com>)
-- cdb873e6d97f5f12b3d3c587bb5782d58e3554c5 by Junmin Hao <junminh@amazon.com>: lint fix
-- aad352117ba950ac5ae62330e3980f4b5898a701 by Jeffrey Huynh <jthuynh@amazon.com>: Fix hlo_verifier_test failure due to changed expectation
-- 32e814524b88a474af5e4e904c0dd19841430b86 by Jeffrey Huynh <jthuynh@amazon.com>: Separate the token change out into a separate PR with RFC.
-- b301c2a2a5b52180f9e9626173e6b67a78782960 by Jeffrey Huynh <jthuynh@amazon.com>: Change *WithToken tests to *WithTuple
-- 5890278fc16c9f900782d32a92d40ecf548aea85 by Jeffrey Huynh <jthuynh@amazon.com>: Fix missing parenthesis

Merging this change closes #5740

PiperOrigin-RevId: 573976449
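As a hedged illustration of the bucketing use case described above (the header paths, shapes, and names below are assumptions; AllGather, ReduceScatter, and CreateScalarAddComputation are existing XLA client entry points):

```cpp
#include "tensorflow/compiler/xla/client/lib/arithmetic.h"  // CreateScalarAddComputation
#include "tensorflow/compiler/xla/client/xla_builder.h"

// Sketch: two differently-shaped weight tensors are reduce-scattered in a
// single op, instead of the caller concatenating them into one bucket
// beforehand and splitting the result afterwards.
xla::XlaBuilder b("bucketed_reduce_scatter");
xla::XlaOp w0 =
    xla::Parameter(&b, 0, xla::ShapeUtil::MakeShape(xla::F32, {1024}), "w0");
xla::XlaOp w1 =
    xla::Parameter(&b, 1, xla::ShapeUtil::MakeShape(xla::F32, {256, 8}), "w1");
xla::XlaComputation add = xla::CreateScalarAddComputation(xla::F32, &b);
// With tuple support, one op handles both tensors; the result is a tuple of
// per-shard shapes: (f32[256], f32[64,8]) for shard_count=4.
xla::XlaOp result = xla::ReduceScatter(xla::Tuple(&b, {w0, w1}), add,
                                       /*scatter_dimension=*/0,
                                       /*shard_count=*/4);
```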