[WIP][DTensor] CuTe layout composition for view ops sharding propagation #178454
weifengpy wants to merge 9 commits into gh/weifengpy/99/base
Conversation
…opagation

Demonstrates that CuTe's layout algebra (logical_divide, composition) can replace _ViewShardingPropagator's two-phase stateful algorithm for view ops.

The key insight: represent tensor sharding as a CuTe layout where sharded dims have (local, gpu) sub-modes via logical_divide. A view op doesn't change data — it reinterprets coordinates — so compose_view just routes sub-modes from input dims to output dims based on DimMap rules. There is no Shard vs _StridedShard distinction; both are GPU modes at different strides.

22 tests covering flatten, unflatten, partial flatten, general reshape, 2D mesh, and round-trips. All pass.

Design doc: https://fburl.com/gdoc/w7ay78d0

Authored with Claude.
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/178454

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 1 Unrelated Failure

As of commit 77bbb78 with merge base 3edbad8:

NEW FAILURES - The following jobs have failed: (job list omitted)

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk: (job list omitted)

This comment was automatically generated by Dr. CI and updates every 15 minutes.
Any publicly available information?
…sharding propagation" Demonstrates that CuTe's layout algebra (logical_divide, composition) can replace _ViewShardingPropagator's two-phase stateful algorithm for view ops. The key insight: represent tensor sharding as a CuTe layout where sharded dims have (local, gpu) sub-modes via logical_divide. A view op doesn't change data — it reinterprets coordinates — so compose_view just routes sub-modes from input dims to output dims based on DimMap rules. There is no Shard vs _StridedShard distinction; both are GPU modes at different strides. 22 tests covering flatten, unflatten, partial flatten, general reshape, 2D mesh, and round-trips. All pass. Design doc: https://fburl.com/gdoc/w7ay78d0 Authored with Claude. [ghstack-poisoned]
…opagation Demonstrates that CuTe's layout algebra (logical_divide, composition) can replace _ViewShardingPropagator's two-phase stateful algorithm for view ops. The key insight: represent tensor sharding as a CuTe layout where sharded dims have (local, gpu) sub-modes via logical_divide. A view op doesn't change data — it reinterprets coordinates — so compose_view just routes sub-modes from input dims to output dims based on DimMap rules. There is no Shard vs _StridedShard distinction; both are GPU modes at different strides. 22 tests covering flatten, unflatten, partial flatten, general reshape, 2D mesh, and round-trips. All pass. Design doc: https://fburl.com/gdoc/w7ay78d0 Authored with Claude. ghstack-source-id: 8e849d3 Pull Request resolved: #178454
…sharding propagation" Demonstrates that CuTe's layout algebra (logical_divide, composition) can replace _ViewShardingPropagator's two-phase stateful algorithm for view ops. The key insight: represent tensor sharding as a CuTe layout where sharded dims have (local, gpu) sub-modes via logical_divide. A view op doesn't change data — it reinterprets coordinates — so compose_view just routes sub-modes from input dims to output dims based on DimMap rules. There is no Shard vs _StridedShard distinction; both are GPU modes at different strides. 22 tests covering flatten, unflatten, partial flatten, general reshape, 2D mesh, and round-trips. All pass. Design doc: https://fburl.com/gdoc/w7ay78d0 Authored with Claude. [ghstack-poisoned]
…opagation Demonstrates that CuTe's layout algebra (logical_divide, composition) can replace _ViewShardingPropagator's two-phase stateful algorithm for view ops. The key insight: represent tensor sharding as a CuTe layout where sharded dims have (local, gpu) sub-modes via logical_divide. A view op doesn't change data — it reinterprets coordinates — so compose_view just routes sub-modes from input dims to output dims based on DimMap rules. There is no Shard vs _StridedShard distinction; both are GPU modes at different strides. 22 tests covering flatten, unflatten, partial flatten, general reshape, 2D mesh, and round-trips. All pass. Design doc: https://fburl.com/gdoc/w7ay78d0 Authored with Claude. ghstack-source-id: abf9e7e Pull Request resolved: #178454
…sharding propagation" Demonstrates that CuTe's layout algebra (logical_divide, composition) can replace _ViewShardingPropagator's two-phase stateful algorithm for view ops. The key insight: represent tensor sharding as a CuTe layout where sharded dims have (local, gpu) sub-modes via logical_divide. A view op doesn't change data — it reinterprets coordinates — so compose_view just routes sub-modes from input dims to output dims based on DimMap rules. There is no Shard vs _StridedShard distinction; both are GPU modes at different strides. 22 tests covering flatten, unflatten, partial flatten, general reshape, 2D mesh, and round-trips. All pass. Design doc: https://fburl.com/gdoc/w7ay78d0 Authored with Claude. [ghstack-poisoned]
…opagation Demonstrates that CuTe's layout algebra (logical_divide, composition) can replace _ViewShardingPropagator's two-phase stateful algorithm for view ops. The key insight: represent tensor sharding as a CuTe layout where sharded dims have (local, gpu) sub-modes via logical_divide. A view op doesn't change data — it reinterprets coordinates — so compose_view just routes sub-modes from input dims to output dims based on DimMap rules. There is no Shard vs _StridedShard distinction; both are GPU modes at different strides. 22 tests covering flatten, unflatten, partial flatten, general reshape, 2D mesh, and round-trips. All pass. Design doc: https://fburl.com/gdoc/w7ay78d0 Authored with Claude. ghstack-source-id: b16c6f6 Pull Request resolved: #178454
…sharding propagation" Demonstrates that CuTe's layout algebra (logical_divide, composition) can replace _ViewShardingPropagator's two-phase stateful algorithm for view ops. The key insight: represent tensor sharding as a CuTe layout where sharded dims have (local, gpu) sub-modes via logical_divide. A view op doesn't change data — it reinterprets coordinates — so compose_view just routes sub-modes from input dims to output dims based on DimMap rules. There is no Shard vs _StridedShard distinction; both are GPU modes at different strides. 22 tests covering flatten, unflatten, partial flatten, general reshape, 2D mesh, and round-trips. All pass. Design doc: https://fburl.com/gdoc/w7ay78d0 Authored with Claude. [ghstack-poisoned]
…opagation" CuTe's layout algebra (logical_divide, composition) replaces _ViewShardingPropagator's Phase 2 stateful algorithm for view ops. The key insight: represent tensor sharding as a CuTe layout where sharded dims have (local, gpu) sub-modes via logical_divide. A view op doesn't change data — it reinterprets coordinates — so compose_view just routes sub-modes from input dims to output dims based on DimMap rules. There is no Shard vs _StridedShard distinction; both are GPU modes at different strides. Gated behind USE_CUTE_VIEW_PROPAGATION=1 (env var, off by default). Falls back to Phase 2 for unsupported cases (symbolic shapes, uneven sharding in flatten ranges, incompatible multi-mesh stride decompositions). The full test_view_ops.py suite passes with CuTe enabled (40 tests). Design doc: https://fburl.com/gdoc/w7ay78d0 Authored with Claude. [ghstack-poisoned]
@stmcgovern i committed the .md. all things are claude generated at this moment. it seems to fall back to the regular strided shard code path a lot. nothing too serious yet
@weifengpy Thanks for sharing! Interesting prototype idea here and I was curious to know more.
…ng propagation" CuTe's layout algebra (logical_divide, composition) replaces _ViewShardingPropagator's Phase 2 stateful algorithm for view ops. The key insight: represent tensor sharding as a CuTe layout where sharded dims have (local, gpu) sub-modes via logical_divide. A view op doesn't change data — it reinterprets coordinates — so compose_view just routes sub-modes from input dims to output dims based on DimMap rules. There is no Shard vs _StridedShard distinction; both are GPU modes at different strides. Gated behind USE_CUTE_VIEW_PROPAGATION=1 (env var, off by default). Falls back to Phase 2 for unsupported cases (symbolic shapes, uneven sharding in flatten ranges, incompatible multi-mesh stride decompositions). The full test_view_ops.py suite passes with CuTe enabled (40 tests). Authored with Claude. [ghstack-poisoned]
CuTe's layout algebra (logical_divide, composition) replaces _ViewShardingPropagator's Phase 2 stateful algorithm for view ops.

The key insight: represent tensor sharding as a CuTe layout where sharded dims have (local, gpu) sub-modes via logical_divide. A view op doesn't change data — it reinterprets coordinates — so compose_view just routes sub-modes from input dims to output dims based on DimMap rules. There is no Shard vs _StridedShard distinction; both are GPU modes at different strides.

CuTe is now the sole code path (no env var gating, no Phase 2 fallback). For cases that can't be represented as CuTe layouts (uneven sharding, multi-mesh-same-dim, symbolic shapes), a lightweight rule-tracing path (_trace_multi_mesh_placement, _symbolic_rewrite_output_placements) traces placements through the DimMap rule tree without constructing layouts.

The full test_view_ops.py suite passes (all 22 DTensor tests).

Design doc: https://fburl.com/gdoc/w7ay78d0

Authored with Claude.
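For reference, the DimMap rule tree being traced here is the one DTensor's existing view-op machinery already builds. A quick way to inspect it, assuming the private module path of recent PyTorch (torch.distributed.tensor._ops._view_ops; not code from this patch):

```python
# Sketch only: peek at the DimMap rules that the rule-tracing path walks.
# The module is private and may move between PyTorch releases.
from torch.distributed.tensor._ops._view_ops import view_groups

# Flattening (8, 6) -> (48,): output dim 0 is a Flatten over input dims 0, 1.
print(view_groups((8, 6), (48,)))
# roughly: (Flatten(input_dims=(InputDim(0), InputDim(1))),)

# Unflattening (48,) -> (8, 6): output dims are Splits of input dim 0.
print(view_groups((48,), (8, 6)))
# roughly: (Split(InputDim(0), (8, 6), 0), Split(InputDim(0), (8, 6), 1))
```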
…ng propagation" CuTe's layout algebra (logical_divide, composition) replaces _ViewShardingPropagator's Phase 2 stateful algorithm for view ops. The key insight: represent tensor sharding as a CuTe layout where sharded dims have (local, gpu) sub-modes via logical_divide. A view op doesn't change data — it reinterprets coordinates — so compose_view just routes sub-modes from input dims to output dims based on DimMap rules. There is no Shard vs _StridedShard distinction; both are GPU modes at different strides. Gated behind USE_CUTE_VIEW_PROPAGATION=1 (env var, off by default). Falls back to Phase 2 for unsupported cases (symbolic shapes, uneven sharding in flatten ranges, incompatible multi-mesh stride decompositions). The full test_view_ops.py suite passes with CuTe enabled (40 tests). Authored with Claude. [ghstack-poisoned]
CuTe's layout algebra (logical_divide, composition) replaces _ViewShardingPropagator's Phase 2 stateful algorithm for view ops. The key insight: represent tensor sharding as a CuTe layout where sharded dims have (local, gpu) sub-modes via logical_divide. A view op doesn't change data — it reinterprets coordinates — so compose_view just routes sub-modes from input dims to output dims based on DimMap rules. There is no Shard vs _StridedShard distinction; both are GPU modes at different strides. CuTe is now the sole code path (no env var gating, no Phase 2 fallback). For cases that can't be represented as CuTe layouts (uneven sharding, multi-mesh-same-dim, symbolic shapes), a lightweight rule-tracing path (_trace_multi_mesh_placement, _symbolic_rewrite_output_placements) traces placements through the DimMap rule tree without constructing layouts. The full test_view_ops.py suite passes (all 22 DTensor tests). Design doc: https://fburl.com/gdoc/w7ay78d0 Authored with Claude. ghstack-source-id: a162583 Pull Request resolved: #178454
…ng propagation" CuTe's layout algebra (logical_divide, composition) replaces _ViewShardingPropagator's Phase 2 stateful algorithm for view ops. The key insight: represent tensor sharding as a CuTe layout where sharded dims have (local, gpu) sub-modes via logical_divide. A view op doesn't change data — it reinterprets coordinates — so compose_view just routes sub-modes from input dims to output dims based on DimMap rules. There is no Shard vs _StridedShard distinction; both are GPU modes at different strides. Gated behind USE_CUTE_VIEW_PROPAGATION=1 (env var, off by default). Falls back to Phase 2 for unsupported cases (symbolic shapes, uneven sharding in flatten ranges, incompatible multi-mesh stride decompositions). The full test_view_ops.py suite passes with CuTe enabled (40 tests). Authored with Claude. [ghstack-poisoned]
CuTe's layout algebra (logical_divide, composition) replaces _ViewShardingPropagator's Phase 2 stateful algorithm for view ops. The key insight: represent tensor sharding as a CuTe layout where sharded dims have (local, gpu) sub-modes via logical_divide. A view op doesn't change data — it reinterprets coordinates — so compose_view just routes sub-modes from input dims to output dims based on DimMap rules. There is no Shard vs _StridedShard distinction; both are GPU modes at different strides. CuTe is now the sole code path (no env var gating, no Phase 2 fallback). For cases that can't be represented as CuTe layouts (uneven sharding, multi-mesh-same-dim, symbolic shapes), a lightweight rule-tracing path (_trace_multi_mesh_placement, _symbolic_rewrite_output_placements) traces placements through the DimMap rule tree without constructing layouts. The full test_view_ops.py suite passes (all 22 DTensor tests). Design doc: https://fburl.com/gdoc/w7ay78d0 Authored with Claude. ghstack-source-id: 34d56f4 Pull Request resolved: #178454
Stack from ghstack (oldest at bottom):
CuTe's layout algebra (logical_divide, composition) replaces
_ViewShardingPropagator's Phase 2 stateful algorithm for view ops.
The key insight: represent tensor sharding as a CuTe layout where sharded
dims have (local, gpu) sub-modes via logical_divide. A view op doesn't
change data — it reinterprets coordinates — so compose_view just routes
sub-modes from input dims to output dims based on DimMap rules. There is
no Shard vs _StridedShard distinction; both are GPU modes at different
strides.
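A minimal sketch of the sub-mode representation, using the pycute layout helpers that ship with NVIDIA's CUTLASS Python tooling (the PR may rely on a vendored equivalent; names here are from pycute, not from this patch):

```python
# Sketch only: (local, gpu) sub-modes via logical_divide, using pycute
# from the CUTLASS repo. Not the code in this PR.
from pycute import Layout, logical_divide

dim0 = Layout(8, 1)  # a size-8 tensor dim with unit stride

# Contiguous shard across 2 GPUs (Shard): each GPU owns a block of 4,
# so the gpu sub-mode carries the coarse stride.
blocked = logical_divide(dim0, Layout(4, 1))
print(blocked)       # (4,2):(1,4)  -- (local, gpu) sub-modes

# Interleaved shard (_StridedShard-like): GPUs take alternating elements,
# so the gpu sub-mode carries the fine stride.
interleaved = logical_divide(dim0, Layout(4, 2))
print(interleaved)   # (4,2):(2,1)
```

Both placements are the same (local, gpu) factorization; only the stride carried by the gpu sub-mode differs, which is what lets the propagation treat Shard and _StridedShard uniformly.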
Gated behind USE_CUTE_VIEW_PROPAGATION=1 (env var, off by default).
Falls back to Phase 2 for unsupported cases (symbolic shapes, uneven
sharding in flatten ranges, incompatible multi-mesh stride decompositions).
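A hypothetical usage sketch of the gate (the env var name comes from this PR; everything else is standard DTensor API), to be run under torchrun with 2 ranks:

```python
# Hypothetical usage sketch: run as `torchrun --nproc-per-node 2 demo.py`.
# USE_CUTE_VIEW_PROPAGATION is the gate added by this PR and must be set
# before sharding propagation runs; exact semantics may differ.
import os
os.environ["USE_CUTE_VIEW_PROPAGATION"] = "1"

import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

mesh = init_device_mesh("cpu", (2,))
x = distribute_tensor(torch.randn(8, 6), mesh, [Shard(0)])

# Flatten view: sharding propagation for this op goes through the CuTe
# path when the gate is on, falling back to Phase 2 if unsupported.
y = x.view(48)
print(y.placements)  # expected: (Shard(dim=0),)
```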
The full test_view_ops.py suite passes with CuTe enabled (40 tests).
Authored with Claude.