
[WIP][DTensor] CuTe layout composition for view ops sharding propagation #178454

Draft
weifengpy wants to merge 9 commits into gh/weifengpy/99/base from gh/weifengpy/99/head

Conversation

weifengpy (Contributor) commented Mar 26, 2026

Stack from ghstack (oldest at bottom):

CuTe's layout algebra (logical_divide, composition) replaces
_ViewShardingPropagator's Phase 2 stateful algorithm for view ops.

The key insight: represent tensor sharding as a CuTe layout where sharded
dims have (local, gpu) sub-modes via logical_divide. A view op doesn't
change data — it reinterprets coordinates — so compose_view just routes
sub-modes from input dims to output dims based on DimMap rules. There is
no Shard vs _StridedShard distinction; both are GPU modes at different
strides.
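
A minimal, self-contained sketch of the idea (a toy re-implementation for illustration, not the PR's actual code): a (2, 4) tensor with Shard(1) on a 2-GPU mesh, where logical_divide splits the sharded dim into (local, gpu) sub-modes, and flattening moves the gpu mode to an interior stride:

```python
# Toy CuTe-style layout: maps a per-mode coordinate to a linear index.
def layout_fn(shape, stride):
    def f(*coord):
        assert len(coord) == len(shape)
        return sum(c * d for c, d in zip(coord, stride))
    return f

# Global tensor (2, 4), row-major strides (4, 1). Shard(1) on a mesh of
# 2 GPUs: logical_divide splits dim 1 into (local=2, gpu=2) sub-modes
# with strides (1, 2). Modes below: (row, local, gpu).
sharded = layout_fn((2, 2, 2), (4, 1, 2))

# After a view to (8,), the flat dim decomposes as (local, gpu, row) with
# strides (1, 2, 4): the gpu mode sits at stride 2 *inside* the flat dim,
# which DTensor would spell _StridedShard. Same layout either way.
for gpu in range(2):
    owned = sorted(sharded(row, local, gpu) for row in range(2) for local in range(2))
    print(f"rank {gpu} owns global offsets {owned}")
# rank 0 owns global offsets [0, 1, 4, 5]
# rank 1 owns global offsets [2, 3, 6, 7]
```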

Gated behind USE_CUTE_VIEW_PROPAGATION=1 (env var, off by default).
Falls back to Phase 2 for unsupported cases (symbolic shapes, uneven
sharding in flatten ranges, incompatible multi-mesh stride decompositions).
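
A sketch of what that gating-plus-fallback shape could look like (names other than compose_view and the env var are stand-ins, not the PR's internals):

```python
import os

_USE_CUTE = os.environ.get("USE_CUTE_VIEW_PROPAGATION", "0") == "1"

def _try_compose_view(dim_map, input_spec):
    # Stand-in for the CuTe compose_view path; returns None on the
    # unsupported cases above so the caller can fall back.
    return None

def _phase2_propagate(dim_map, input_spec):
    # Stand-in for the existing Phase 2 algorithm.
    return input_spec

def propagate_view_sharding(dim_map, input_spec):
    if _USE_CUTE:
        result = _try_compose_view(dim_map, input_spec)
        if result is not None:
            return result
    return _phase2_propagate(dim_map, input_spec)
```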

The full test_view_ops.py suite passes with CuTe enabled (40 tests).

Authored with Claude.

…opagation

Demonstrates that CuTe's layout algebra (logical_divide, composition) can
replace _ViewShardingPropagator's two-phase stateful algorithm for view ops.

The key insight: represent tensor sharding as a CuTe layout where sharded
dims have (local, gpu) sub-modes via logical_divide. A view op doesn't
change data — it reinterprets coordinates — so compose_view just routes
sub-modes from input dims to output dims based on DimMap rules. There is
no Shard vs _StridedShard distinction; both are GPU modes at different
strides.

22 tests covering flatten, unflatten, partial flatten, general reshape,
2D mesh, and round-trips. All pass.
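
For flavor, one such round-trip test might look roughly like this (a sketch assuming a 2-rank torchrun launch with the gloo backend; the real suite uses the DTensorTestBase harness in test_view_ops.py):

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

mesh = init_device_mesh("cpu", (2,))
x = torch.arange(16, dtype=torch.float32).reshape(2, 8)
dx = distribute_tensor(x, mesh, [Shard(1)])  # shard the last dim

flat = dx.view(16)       # sharding propagation picks the output placement
back = flat.view(2, 8)   # round-trip back to the original shape
assert torch.equal(back.full_tensor(), x)
```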

Design doc: https://fburl.com/gdoc/w7ay78d0

Authored with Claude.

[ghstack-poisoned]
pytorch-bot commented Mar 26, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/178454

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 1 Unrelated Failure

As of commit 77bbb78 with merge base 3edbad8:

NEW FAILURES - The following jobs have failed:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot added the ciflow/dtensor and topic: not user facing labels on Mar 26, 2026
weifengpy added a commit that referenced this pull request Mar 26, 2026
ghstack-source-id: 8c01203
Pull Request resolved: #178454
stmcgovern (Collaborator) commented:

Design doc: https://fburl.com/gdoc/w7ay78d0

Any publicly available information?

weifengpy added a commit that referenced this pull request Mar 26, 2026
ghstack-source-id: 8e849d3
Pull Request resolved: #178454
weifengpy added a commit that referenced this pull request Mar 26, 2026
ghstack-source-id: abf9e7e
Pull Request resolved: #178454
pytorch-bot added the ciflow/inductor, ciflow/torchtitan, and release notes: distributed (dtensor) labels on Mar 26, 2026
weifengpy added a commit that referenced this pull request Mar 26, 2026
ghstack-source-id: b16c6f6
Pull Request resolved: #178454
…opagation"

CuTe's layout algebra (logical_divide, composition) replaces
_ViewShardingPropagator's Phase 2 stateful algorithm for view ops.

The key insight: represent tensor sharding as a CuTe layout where sharded
dims have (local, gpu) sub-modes via logical_divide. A view op doesn't
change data — it reinterprets coordinates — so compose_view just routes
sub-modes from input dims to output dims based on DimMap rules. There is
no Shard vs _StridedShard distinction; both are GPU modes at different
strides.

Gated behind USE_CUTE_VIEW_PROPAGATION=1 (env var, off by default).
Falls back to Phase 2 for unsupported cases (symbolic shapes, uneven
sharding in flatten ranges, incompatible multi-mesh stride decompositions).

The full test_view_ops.py suite passes with CuTe enabled (40 tests).

Design doc: https://fburl.com/gdoc/w7ay78d0

Authored with Claude.

[ghstack-poisoned]
weifengpy changed the title from "[DTensor] Prototype: CuTe layout composition for view ops sharding propagation" to "[DTensor] CuTe layout composition for view ops sharding propagation" Mar 26, 2026
weifengpy added a commit that referenced this pull request Mar 26, 2026
ghstack-source-id: 14b7e48
Pull Request resolved: #178454
weifengpy changed the title from "[DTensor] CuTe layout composition for view ops sharding propagation" to "[WIP][DTensor] CuTe layout composition for view ops sharding propagation" Mar 26, 2026
weifengpy marked this pull request as draft March 26, 2026 13:13
weifengpy (Contributor, Author) commented:

Design doc: https://fburl.com/gdoc/w7ay78d0

Any publicly available information?

@stmcgovern I committed the .md. Everything is Claude-generated at this moment. It seems to fall back to the regular strided-shard code path a lot; nothing too serious yet.

stmcgovern (Collaborator) commented:

@weifengpy Thanks for sharing! Interesting prototype idea here and I was curious to know more.

…ng propagation"


CuTe's layout algebra (logical_divide, composition) replaces
_ViewShardingPropagator's Phase 2 stateful algorithm for view ops.

The key insight: represent tensor sharding as a CuTe layout where sharded
dims have (local, gpu) sub-modes via logical_divide. A view op doesn't
change data — it reinterprets coordinates — so compose_view just routes
sub-modes from input dims to output dims based on DimMap rules. There is
no Shard vs _StridedShard distinction; both are GPU modes at different
strides.

Gated behind USE_CUTE_VIEW_PROPAGATION=1 (env var, off by default).
Falls back to Phase 2 for unsupported cases (symbolic shapes, uneven
sharding in flatten ranges, incompatible multi-mesh stride decompositions).

The full test_view_ops.py suite passes with CuTe enabled (40 tests).


Authored with Claude.

[ghstack-poisoned]
weifengpy added a commit that referenced this pull request Mar 26, 2026
CuTe's layout algebra (logical_divide, composition) replaces
_ViewShardingPropagator's Phase 2 stateful algorithm for view ops.

The key insight: represent tensor sharding as a CuTe layout where sharded
dims have (local, gpu) sub-modes via logical_divide. A view op doesn't
change data — it reinterprets coordinates — so compose_view just routes
sub-modes from input dims to output dims based on DimMap rules. There is
no Shard vs _StridedShard distinction; both are GPU modes at different
strides.

CuTe is now the sole code path (no env var gating, no Phase 2 fallback).
For cases that can't be represented as CuTe layouts (uneven sharding,
multi-mesh-same-dim, symbolic shapes), a lightweight rule-tracing path
(_trace_multi_mesh_placement, _symbolic_rewrite_output_placements) traces
placements through the DimMap rule tree without constructing layouts.
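
To illustrate the rule-tracing idea with toy stand-ins (the real DimSpec classes live in torch.distributed.tensor._ops._view_ops; trace_shard below is hypothetical, not the PR's helper):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InputDim:
    dim: int

@dataclass(frozen=True)
class Flatten:
    input_dims: tuple

# view((2, 3, 4) -> (6, 4)) as a DimMap: output dim 0 flattens input
# dims 0 and 1; output dim 1 passes input dim 2 through unchanged.
dim_map = (Flatten((InputDim(0), InputDim(1))), InputDim(2))

def trace_shard(dim_map, sharded_input_dim):
    """Find the output dim a Shard(d) lands on by walking the rule
    tree, without constructing any layout."""
    for out_dim, rule in enumerate(dim_map):
        if isinstance(rule, InputDim) and rule.dim == sharded_input_dim:
            return out_dim
        if isinstance(rule, Flatten) and any(
            d.dim == sharded_input_dim for d in rule.input_dims
        ):
            return out_dim  # outermost flattened dim keeps a plain Shard
    return None

print(trace_shard(dim_map, 0))  # 0: Shard(0) stays on output dim 0
print(trace_shard(dim_map, 2))  # 1: Shard(2) becomes Shard(1)
```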

The full test_view_ops.py suite passes (all 22 DTensor tests).

Design doc: https://fburl.com/gdoc/w7ay78d0

Authored with Claude.

ghstack-source-id: 355bd5e
Pull Request resolved: #178454
…ng propagation"


CuTe's layout algebra (logical_divide, composition) replaces
_ViewShardingPropagator's Phase 2 stateful algorithm for view ops.

The key insight: represent tensor sharding as a CuTe layout where sharded
dims have (local, gpu) sub-modes via logical_divide. A view op doesn't
change data — it reinterprets coordinates — so compose_view just routes
sub-modes from input dims to output dims based on DimMap rules. There is
no Shard vs _StridedShard distinction; both are GPU modes at different
strides.

Gated behind USE_CUTE_VIEW_PROPAGATION=1 (env var, off by default).
Falls back to Phase 2 for unsupported cases (symbolic shapes, uneven
sharding in flatten ranges, incompatible multi-mesh stride decompositions).

The full test_view_ops.py suite passes with CuTe enabled (40 tests).


Authored with Claude.

[ghstack-poisoned]
weifengpy added a commit that referenced this pull request Mar 26, 2026
ghstack-source-id: a162583
Pull Request resolved: #178454
…ng propagation"


CuTe's layout algebra (logical_divide, composition) replaces
_ViewShardingPropagator's Phase 2 stateful algorithm for view ops.

The key insight: represent tensor sharding as a CuTe layout where sharded
dims have (local, gpu) sub-modes via logical_divide. A view op doesn't
change data — it reinterprets coordinates — so compose_view just routes
sub-modes from input dims to output dims based on DimMap rules. There is
no Shard vs _StridedShard distinction; both are GPU modes at different
strides.

Gated behind USE_CUTE_VIEW_PROPAGATION=1 (env var, off by default).
Falls back to Phase 2 for unsupported cases (symbolic shapes, uneven
sharding in flatten ranges, incompatible multi-mesh stride decompositions).

The full test_view_ops.py suite passes with CuTe enabled (40 tests).


Authored with Claude.

[ghstack-poisoned]
weifengpy added a commit that referenced this pull request Mar 26, 2026
ghstack-source-id: 34d56f4
Pull Request resolved: #178454

Labels

ciflow/dtensor (Run DTensor specific tests)
ciflow/inductor
ciflow/torchtitan (Run TorchTitan integration tests)
release notes: distributed (dtensor) (release notes category)
topic: not user facing (topic category)
