
[WIP][DTensor] CuTe layout composition for view ops sharding propagation #178454

Draft
weifengpy wants to merge 9 commits into gh/weifengpy/99/base from gh/weifengpy/99/head

Conversation

weifengpy (Contributor) commented Mar 26, 2026

Stack from ghstack (oldest at bottom):

CuTe's layout algebra (logical_divide, composition) replaces
_ViewShardingPropagator's Phase 2 stateful algorithm for view ops.

The key insight: represent tensor sharding as a CuTe layout where sharded
dims have (local, gpu) sub-modes via logical_divide. A view op doesn't
change data — it reinterprets coordinates — so compose_view just routes
sub-modes from input dims to output dims based on DimMap rules. There is
no Shard vs _StridedShard distinction; both are GPU modes at different
strides.
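
A minimal, self-contained sketch of the idea (a toy re-implementation for illustration, not the PR's actual code): a (2, 4) tensor with Shard(1) on a 2-GPU mesh, where logical_divide splits the sharded dim into (local, gpu) sub-modes, and flattening moves the gpu mode to an interior stride:

```python
# Toy CuTe-style layout: maps a per-mode coordinate to a linear index.
def layout_fn(shape, stride):
    def f(*coord):
        assert len(coord) == len(shape)
        return sum(c * d for c, d in zip(coord, stride))
    return f

# Global tensor (2, 4), row-major strides (4, 1). Shard(1) on a mesh of
# 2 GPUs: logical_divide splits dim 1 into (local=2, gpu=2) sub-modes
# with strides (1, 2). Modes below: (row, local, gpu).
sharded = layout_fn((2, 2, 2), (4, 1, 2))

# After a view to (8,), the flat dim decomposes as (local, gpu, row) with
# strides (1, 2, 4): the gpu mode sits at stride 2 *inside* the flat dim,
# which DTensor would spell _StridedShard. Same layout either way.
for gpu in range(2):
    owned = sorted(sharded(row, local, gpu) for row in range(2) for local in range(2))
    print(f"rank {gpu} owns global offsets {owned}")
# rank 0 owns global offsets [0, 1, 4, 5]
# rank 1 owns global offsets [2, 3, 6, 7]
```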

Gated behind USE_CUTE_VIEW_PROPAGATION=1 (env var, off by default).
Falls back to Phase 2 for unsupported cases (symbolic shapes, uneven
sharding in flatten ranges, incompatible multi-mesh stride decompositions).
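
A sketch of what that gating-plus-fallback shape could look like (names other than compose_view and the env var are stand-ins, not the PR's internals):

```python
import os

_USE_CUTE = os.environ.get("USE_CUTE_VIEW_PROPAGATION", "0") == "1"

def _try_compose_view(dim_map, input_spec):
    # Stand-in for the CuTe compose_view path; returns None on the
    # unsupported cases above so the caller can fall back.
    return None

def _phase2_propagate(dim_map, input_spec):
    # Stand-in for the existing Phase 2 algorithm.
    return input_spec

def propagate_view_sharding(dim_map, input_spec):
    if _USE_CUTE:
        result = _try_compose_view(dim_map, input_spec)
        if result is not None:
            return result
    return _phase2_propagate(dim_map, input_spec)
```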

The full test_view_ops.py suite passes with CuTe enabled (40 tests).

Authored with Claude.

…opagation

Demonstrates that CuTe's layout algebra (logical_divide, composition) can
replace _ViewShardingPropagator's two-phase stateful algorithm for view ops.

The key insight: represent tensor sharding as a CuTe layout where sharded
dims have (local, gpu) sub-modes via logical_divide. A view op doesn't
change data — it reinterprets coordinates — so compose_view just routes
sub-modes from input dims to output dims based on DimMap rules. There is
no Shard vs _StridedShard distinction; both are GPU modes at different
strides.

22 tests covering flatten, unflatten, partial flatten, general reshape,
2D mesh, and round-trips. All pass.
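
For flavor, one such round-trip test might look roughly like this (a sketch assuming a 2-rank torchrun launch with the gloo backend; the real suite uses the DTensorTestBase harness in test_view_ops.py):

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

mesh = init_device_mesh("cpu", (2,))
x = torch.arange(16, dtype=torch.float32).reshape(2, 8)
dx = distribute_tensor(x, mesh, [Shard(1)])  # shard the last dim

flat = dx.view(16)       # sharding propagation picks the output placement
back = flat.view(2, 8)   # round-trip back to the original shape
assert torch.equal(back.full_tensor(), x)
```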

Design doc: https://fburl.com/gdoc/w7ay78d0

Authored with Claude.

[ghstack-poisoned]
pytorch-bot commented Mar 26, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/178454

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 1 Unrelated Failure

As of commit 77bbb78 with merge base 3edbad8:

NEW FAILURES - The following jobs have failed:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot added the ciflow/dtensor and topic: not user facing labels on Mar 26, 2026
weifengpy added a commit that referenced this pull request Mar 26, 2026
ghstack-source-id: 8c01203
Pull Request resolved: #178454
stmcgovern (Collaborator) commented:

Design doc: https://fburl.com/gdoc/w7ay78d0

Any publicly available information?

weifengpy added a commit that referenced this pull request Mar 26, 2026
ghstack-source-id: 8e849d3
Pull Request resolved: #178454
weifengpy added a commit that referenced this pull request Mar 26, 2026
ghstack-source-id: abf9e7e
Pull Request resolved: #178454
pytorch-bot added the ciflow/inductor, ciflow/torchtitan, and release notes: distributed (dtensor) labels on Mar 26, 2026
weifengpy added a commit that referenced this pull request Mar 26, 2026
ghstack-source-id: b16c6f6
Pull Request resolved: #178454
…opagation"

CuTe's layout algebra (logical_divide, composition) replaces
_ViewShardingPropagator's Phase 2 stateful algorithm for view ops.

The key insight: represent tensor sharding as a CuTe layout where sharded
dims have (local, gpu) sub-modes via logical_divide. A view op doesn't
change data — it reinterprets coordinates — so compose_view just routes
sub-modes from input dims to output dims based on DimMap rules. There is
no Shard vs _StridedShard distinction; both are GPU modes at different
strides.

Gated behind USE_CUTE_VIEW_PROPAGATION=1 (env var, off by default).
Falls back to Phase 2 for unsupported cases (symbolic shapes, uneven
sharding in flatten ranges, incompatible multi-mesh stride decompositions).

The full test_view_ops.py suite passes with CuTe enabled (40 tests).

Design doc: https://fburl.com/gdoc/w7ay78d0

Authored with Claude.

[ghstack-poisoned]
weifengpy changed the title from "[DTensor] Prototype: CuTe layout composition for view ops sharding propagation" to "[DTensor] CuTe layout composition for view ops sharding propagation" Mar 26, 2026
weifengpy added a commit that referenced this pull request Mar 26, 2026
ghstack-source-id: 14b7e48
Pull Request resolved: #178454
weifengpy changed the title from "[DTensor] CuTe layout composition for view ops sharding propagation" to "[WIP][DTensor] CuTe layout composition for view ops sharding propagation" Mar 26, 2026
weifengpy marked this pull request as draft March 26, 2026 13:13
weifengpy (Contributor, Author) commented:

Design doc: https://fburl.com/gdoc/w7ay78d0

Any publicly available information?

@stmcgovern I committed the .md. Everything is Claude-generated at this moment. It seems to fall back to the regular strided-shard code path a lot; nothing too serious yet.

stmcgovern (Collaborator) commented:

@weifengpy Thanks for sharing! Interesting prototype idea here and I was curious to know more.

…ng propagation"


CuTe's layout algebra (logical_divide, composition) replaces
_ViewShardingPropagator's Phase 2 stateful algorithm for view ops.

The key insight: represent tensor sharding as a CuTe layout where sharded
dims have (local, gpu) sub-modes via logical_divide. A view op doesn't
change data — it reinterprets coordinates — so compose_view just routes
sub-modes from input dims to output dims based on DimMap rules. There is
no Shard vs _StridedShard distinction; both are GPU modes at different
strides.

Gated behind USE_CUTE_VIEW_PROPAGATION=1 (env var, off by default).
Falls back to Phase 2 for unsupported cases (symbolic shapes, uneven
sharding in flatten ranges, incompatible multi-mesh stride decompositions).

The full test_view_ops.py suite passes with CuTe enabled (40 tests).


Authored with Claude.

[ghstack-poisoned]
weifengpy added a commit that referenced this pull request Mar 26, 2026
CuTe's layout algebra (logical_divide, composition) replaces
_ViewShardingPropagator's Phase 2 stateful algorithm for view ops.

The key insight: represent tensor sharding as a CuTe layout where sharded
dims have (local, gpu) sub-modes via logical_divide. A view op doesn't
change data — it reinterprets coordinates — so compose_view just routes
sub-modes from input dims to output dims based on DimMap rules. There is
no Shard vs _StridedShard distinction; both are GPU modes at different
strides.

CuTe is now the sole code path (no env var gating, no Phase 2 fallback).
For cases that can't be represented as CuTe layouts (uneven sharding,
multi-mesh-same-dim, symbolic shapes), a lightweight rule-tracing path
(_trace_multi_mesh_placement, _symbolic_rewrite_output_placements) traces
placements through the DimMap rule tree without constructing layouts.
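
To illustrate the rule-tracing idea with toy stand-ins (the real DimSpec classes live in torch.distributed.tensor._ops._view_ops; trace_shard below is hypothetical, not the PR's helper):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InputDim:
    dim: int

@dataclass(frozen=True)
class Flatten:
    input_dims: tuple

# view((2, 3, 4) -> (6, 4)) as a DimMap: output dim 0 flattens input
# dims 0 and 1; output dim 1 passes input dim 2 through unchanged.
dim_map = (Flatten((InputDim(0), InputDim(1))), InputDim(2))

def trace_shard(dim_map, sharded_input_dim):
    """Find the output dim a Shard(d) lands on by walking the rule
    tree, without constructing any layout."""
    for out_dim, rule in enumerate(dim_map):
        if isinstance(rule, InputDim) and rule.dim == sharded_input_dim:
            return out_dim
        if isinstance(rule, Flatten) and any(
            d.dim == sharded_input_dim for d in rule.input_dims
        ):
            return out_dim  # outermost flattened dim keeps a plain Shard
    return None

print(trace_shard(dim_map, 0))  # 0: Shard(0) stays on output dim 0
print(trace_shard(dim_map, 2))  # 1: Shard(2) becomes Shard(1)
```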

The full test_view_ops.py suite passes (all 22 DTensor tests).

Design doc: https://fburl.com/gdoc/w7ay78d0

Authored with Claude.

ghstack-source-id: 355bd5e
Pull Request resolved: #178454
…ng propagation"


CuTe's layout algebra (logical_divide, composition) replaces
_ViewShardingPropagator's Phase 2 stateful algorithm for view ops.

The key insight: represent tensor sharding as a CuTe layout where sharded
dims have (local, gpu) sub-modes via logical_divide. A view op doesn't
change data — it reinterprets coordinates — so compose_view just routes
sub-modes from input dims to output dims based on DimMap rules. There is
no Shard vs _StridedShard distinction; both are GPU modes at different
strides.

Gated behind USE_CUTE_VIEW_PROPAGATION=1 (env var, off by default).
Falls back to Phase 2 for unsupported cases (symbolic shapes, uneven
sharding in flatten ranges, incompatible multi-mesh stride decompositions).

The full test_view_ops.py suite passes with CuTe enabled (40 tests).


Authored with Claude.

[ghstack-poisoned]
weifengpy added a commit that referenced this pull request Mar 26, 2026
ghstack-source-id: a162583
Pull Request resolved: #178454
…ng propagation"


CuTe's layout algebra (logical_divide, composition) replaces
_ViewShardingPropagator's Phase 2 stateful algorithm for view ops.

The key insight: represent tensor sharding as a CuTe layout where sharded
dims have (local, gpu) sub-modes via logical_divide. A view op doesn't
change data — it reinterprets coordinates — so compose_view just routes
sub-modes from input dims to output dims based on DimMap rules. There is
no Shard vs _StridedShard distinction; both are GPU modes at different
strides.

Gated behind USE_CUTE_VIEW_PROPAGATION=1 (env var, off by default).
Falls back to Phase 2 for unsupported cases (symbolic shapes, uneven
sharding in flatten ranges, incompatible multi-mesh stride decompositions).

The full test_view_ops.py suite passes with CuTe enabled (40 tests).


Authored with Claude.

[ghstack-poisoned]
weifengpy added a commit that referenced this pull request Mar 26, 2026
ghstack-source-id: 34d56f4
Pull Request resolved: #178454

Labels

ciflow/dtensor (Run DTensor specific tests)
ciflow/inductor
ciflow/torchtitan (Run TorchTitan integration tests)
release notes: distributed (dtensor) (release notes category)
topic: not user facing (topic category)
