[DTensor] decomposed sharding propagation #130887
tianyu-l wants to merge 2 commits into gh/tianyu-l/2/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/130887
Note: Links to docs will display an error until the docs builds have been completed. ❌ 2 New Failures, 1 Cancelled Job as of commit 7446907 with merge base df59193. NEW FAILURES - The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR adds the feature of sharding propagation via op decomposition.

#TODO: summary to be added

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o
wanchaol left a comment:

Nice work! This looks reasonably good already; I only have some minor comments.
@@ -0,0 +1,26 @@
# mypy: allow-untyped-defs

Please rebase and make this a private module.
LINEAR_REDUCTION_OP_MAP = {
    aten.all.default: "sum",
    aten.all.dim: "sum",
    aten.amax.default: "max",

There should be some tests that can be enabled in test_dtensor_ops.py, given that we enabled additional ops here?
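As a hedged, illustrative-only sketch of what such a map enables (op names are strings here for readability; the real table keys are `aten` op overloads, as in the diff above): each decomposed reduction op is tagged with the linear reduction it performs, so one shared reduction strategy can serve all of them.

```python
# Illustrative sketch mirroring the diff above: map each reduction op to the
# linear reduction kind that drives its sharding strategy. The real map keys
# are aten op overloads, not strings.
LINEAR_REDUCTION_OP_MAP = {
    "aten.all.default": "sum",
    "aten.all.dim": "sum",
    "aten.amax.default": "max",
}


def reduction_kind(op_name: str) -> str:
    # Unknown ops raise KeyError rather than silently guessing a reduction.
    return LINEAR_REDUCTION_OP_MAP[op_name]
```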
for strtg in node_output_strategy.strategies:
    if strtg.input_specs is None:
        assert isinstance(strtg.output_specs, DTensorSpec)
    for idx, input_strtg in enumerate(

Let's just name this input_strategy, as it's not getting shortened that much.
node_to_spec[node] = node_output_spec
elif node.op == "output":
    output_node = node.args[0]
    graph_output_specs = [node_to_spec[node] for node in output_node]

Hmm, I think here you only handled the case where the output is a list of tensors; we should probably also handle the cases where the output is a single tensor or a tuple of tensors. You can refer to the wrap_output_spec/wrap logic to see how to handle those cases.

IIRC this handles both cases of a single tensor and a tuple of tensors. See tests on both aten.aminmax (tuple of two tensors) and aten._log_softmax (single tensor). In other words, output_node would be a list of results (possibly a singleton) regardless of the designated output type of the function.
)
all_possible_schema.append(possible_arg_specs)
else:
    all_possible_schema.append((arg_spec,))

I guess the reason it appends a tuple here for a non-tensor arg is to allow product later?
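That reading seems right, and can be made concrete with a minimal, hypothetical sketch (placement strings stand in for DTensorSpecs): wrapping a non-tensor arg in a singleton tuple keeps every entry of `all_possible_schema` iterable, so `itertools.product` can enumerate full schemas uniformly.

```python
import itertools

# Hypothetical sketch: each tensor arg contributes several candidate specs,
# while a non-tensor arg contributes exactly one fixed "choice". Wrapping the
# latter in a singleton tuple makes every entry iterable, so product() can
# combine them without special-casing.
all_possible_schema = [
    ("Shard(0)", "Replicate()"),  # tensor arg: two candidate placements
    (42,),                        # non-tensor arg: single fixed choice
]
schemas = list(itertools.product(*all_possible_schema))
# -> [("Shard(0)", 42), ("Replicate()", 42)]
```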
y_dt = torch.nn.functional.log_softmax(x_dt, dim=softmax_dim)

self.assertTrue(y_dt.placements[0].is_replicate())
# TODO(lty): numerical test doesn't work -- similar to the complex mul bug

Hmm, I wonder why? IIRC the complex mul bug is specific to handling complex numbers, but softmax/log_softmax does not involve complex numbers?

Looks like you are comparing numerics for log_softmax and regular softmax -- if they are both log, this seems fine.
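On the numerics point: `log_softmax(x)` and `log(softmax(x))` should indeed agree up to floating-point tolerance. A stdlib-only sketch (hypothetical helper names, not the DTensor test code) illustrating the identity:

```python
import math


def softmax(xs):
    # subtract the max for numerical stability
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def log_softmax(xs):
    # stable form: x - (max + log(sum(exp(x - max))))
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]


xs = [1.0, 2.0, 3.0]
direct = log_softmax(xs)
via_log = [math.log(p) for p in softmax(xs)]
```

Both paths should agree to within ~1e-9 on well-scaled inputs, so comparing them in a test is reasonable.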
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as
Following @tianyu-l's #130887, this adds support for ops with no sharding prop strategy but a registered decomposition. Now if sharding prop sees a decomposable op, it:

1. Runs the decomposed op under a custom TorchDispatchMode, which propagates the placements as side information (initially used a make_fx implementation, but this required a threading lock as it relies on [global state](https://github.com/pytorch/pytorch/blob/2a26c9a32661ee2b4b049e3bd1b889fc3af30880/torch/fx/_symbolic_trace.py#L1167)).
2. Enumerates potential input placement combinations based on the actual input placements, on a single-dim mesh; then, for each of them, propagates through torch_dispatch via sharding prop, while banning any intermediate redistributions.
3. Returns the expanded full-mesh strategy from the filtered strategies.

Some caveats:

- Since the dispatch mode runs sharding prop, the shard prop cache should kick in, both in the normal case (running the same op twice) and when we recursively decompose (if op1 -> op2 -> some decomp, running op1 caches for op2).
- One common failure case is decompositions calling factory methods (e.g. [torch.ones, torch.arange](https://github.com/pytorch/pytorch/blob/41f42a0fc3ea1fbfdf05b4c030d7df815bdfe19d/torch/_decomp/decompositions.py#L818-L821)). The main problem is assigning placements to these tensors: it's not obvious what their placements should be, especially when they might take in sharded sizes, and we can't completely detect when this is the case. For now, intermediate shard prop will fail (no sharding strategy; they don't take DTensor inputs), but a potential future improvement is to permit the full-Replicate case for these graphs.
- Sharding prop is currently via a `propagate_op_sharding` call, on explicit placement types. Once [single-dim strategy](#167677) coverage is broader, this should be doable on _ShardPlaceholders instead, making the enumeration & propagation process cheaper, though maybe more manual.
- (Maybe hackily) uses a fake 1-rank 1d mesh to do single-dim propagation.

Removes the following xfails (plus some more aten ops with decomp coverage, but still-failing tests):

```
__rsub__ addmv addr alias_copy all any count_nonzero dist expand_copy fill floor_divide index_select linalg.vecdot masked_fill mv nn.functional.celu nn.functional.channel_shuffle nn.functional.elu nn.functional.hardsigmoid nn.functional.hardswish nn.functional.hardtanh nn.functional.leaky_relu nn.functional.logsigmoid nn.functional.margin_ranking_loss nn.functional.mish nn.functional.multilabel_soft_margin_loss nn.functional.pairwise_distance nn.functional.pixel_shuffle nn.functional.pixel_unshuffle nn.functional.prelu nn.functional.relu6 nn.functional.selu nn.functional.softplus nn.functional.softshrink nn.functional.triplet_margin_loss nn.functional.triplet_margin_with_distance_loss permute_copy rsub t_copy trace vdot view_copy
```

Pull Request resolved: #171652
Approved by: https://github.com/wconstab
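The enumerate-then-filter shape of step 2 above can be sketched roughly as follows (hypothetical names; the real code works with DTensorSpecs and the op's actual input placements, not strings, and the filter runs the decomposed op under the dispatch mode with redistribution banned):

```python
import itertools

# Candidate single-mesh-dim placements to try for each DTensor input
# (stand-ins for Replicate/Shard placement objects).
CANDIDATES = ["Replicate", "Shard(0)", "Shard(1)"]


def enumerate_schemas(num_tensor_inputs):
    # Cross product of per-input candidates: one schema per combination.
    return list(itertools.product(CANDIDATES, repeat=num_tensor_inputs))


def filter_schemas(schemas, accepts):
    # `accepts` stands in for propagating the decomposed op under the
    # dispatch mode and checking it succeeds without redistribution.
    return [s for s in schemas if accepts(s)]


# e.g. a pointwise-like op might only accept matching input placements:
ok = filter_schemas(enumerate_schemas(2), lambda s: s[0] == s[1])
```

The surviving schemas are then what gets expanded back into a full-mesh strategy in step 3.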