[DTensor] constant_pad_nd non-replicate strategy#175656
pianpwk wants to merge 6 commits into gh/pianpwk/101/base
Conversation
[ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/175656
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 2 Unrelated Failures. As of commit 30cbb04 with merge base ea9fce2:
NEW FAILURE - The following job has failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
[ghstack-poisoned]
]
)
@register_single_dim_strategy(aten.constant_pad_nd.default)
def constant_pad_nd_single_dim_strategy(
from claude review:
● Found the issues. Here's my review:
---
1. BUG: Boolean logic error in padded_dims detection (line 675)
if not guard_or_false(pad[i * 2] == 0) and guard_or_false(pad[i * 2 + 1] == 0):
Due to Python operator precedence (not binds tighter than and), this evaluates as:
(not guard_or_false(pad_left == 0)) and (guard_or_false(pad_right == 0))
Which means: "left is non-zero AND right IS zero" — i.e. only detects left-only padding. It misses:
┌──────────┬───────────┬─────────────────────┬────────────────┐
│ pad_left │ pad_right │ Current result      │ Correct result │
├──────────┼───────────┼─────────────────────┼────────────────┤
│ 0        │ 0         │ not added (correct) │ not added      │
│ 1        │ 0         │ added               │ added          │
│ 0        │ 1         │ not added (wrong)   │ added          │
│ 1        │ 1         │ not added (wrong)   │ added          │
└──────────┴───────────┴─────────────────────┴────────────────┘
The most common case ([1, 1] — pad both sides) is not detected! This means sharding is incorrectly allowed on padded
dimensions.
The fix should be:
if not (guard_or_false(pad[i * 2] == 0) and guard_or_false(pad[i * 2 + 1] == 0)):
Note: the unit test test_constant_pad_nd_bans_shard_on_padded_dim uses pad=[1,1], which would fail with this bug —
it asserts assertNotIn(1, shard_dims) but the buggy code would include dim 1.
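The precedence issue is easy to see in isolation. Here is a minimal sketch (with `guard_or_false` replaced by a plain boolean check and the pad-to-dim mapping simplified, so this is illustrative rather than the actual strategy code):

```python
def padded_dims(pad, ndim, fixed):
    """Return dims with nonzero padding; `pad` pairs run from the last dim inward."""
    dims = set()
    for i in range(len(pad) // 2):
        dim = ndim - 1 - i
        if fixed:
            # negate the whole "both sides are zero" conjunction
            if not (pad[i * 2] == 0 and pad[i * 2 + 1] == 0):
                dims.add(dim)
        else:
            # buggy: `not` binds tighter than `and`
            if not pad[i * 2] == 0 and pad[i * 2 + 1] == 0:
                dims.add(dim)
    return dims

print(padded_dims([1, 1], ndim=2, fixed=False))  # set() -> dim 1 missed
print(padded_dims([1, 1], ndim=2, fixed=True))   # {1}   -> dim 1 correctly banned
```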
2. Missing Partial rules — P(avg), P(max), P(min) work for ANY pad value
The code only generates [Partial(), Partial()] (i.e. P(sum)→P(sum)) when value==0. But the analysis is incomplete:
For any pad value v, consider rank-local computation pad(A_i, pad, v):
- At non-padded positions: reduce(output_i[j]) = reduce(A_i[j]) — correct by definition of P(x)
- At padded positions: every rank writes the same constant v, so:
- avg(v, v, ..., v) = v — P(avg) works for any v
- max(v, v, ..., v) = v — P(max) works for any v
- min(v, v, ..., v) = v — P(min) works for any v
- sum(v, v, ..., v) = N*v ≠ v — P(sum) only works when v=0
So the complete rules are:
- Always (any value): P(avg)→P(avg), P(max)→P(max), P(min)→P(min)
- Only when value=0: P(sum)→P(sum)
The current code misses the first group entirely. This won't produce incorrect results (inputs with those placements
would fall back to Replicate), but it's a missed optimization — unnecessary all-gathers.
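This is easy to sanity-check numerically without DTensor by simulating two ranks locally (a sketch; `F.pad` plays the role of each rank's local pad):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
a0, a1 = torch.randn(2, 3), torch.randn(2, 3)  # local shards of a Partial tensor
pad, v = [1, 1], 5.0  # nonzero fill value on the last dim

# Idempotent reductions commute with padding for ANY fill value:
for red in (torch.maximum, torch.minimum):
    pad_then_reduce = red(F.pad(a0, pad, value=v), F.pad(a1, pad, value=v))
    reduce_then_pad = F.pad(red(a0, a1), pad, value=v)
    assert torch.equal(pad_then_reduce, reduce_then_pad)

# avg: (v + v) / 2 == v at padded positions, so it also holds for any v
assert torch.equal((F.pad(a0, pad, value=v) + F.pad(a1, pad, value=v)) / 2,
                   F.pad((a0 + a1) / 2, pad, value=v))

# sum: v + v != v unless v == 0
assert not torch.equal(F.pad(a0, pad, value=v) + F.pad(a1, pad, value=v),
                       F.pad(a0 + a1, pad, value=v))
assert torch.equal(F.pad(a0, pad) + F.pad(a1, pad), F.pad(a0 + a1, pad))  # value=0 case
```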
3. Minor: the value==0 comment is imprecise
Line 685 says "Linearity: pad(a+b, 0) == pad(a, 0) + pad(b, 0)". This is correct for P(sum), but the linearity
framing obscures the avg/max/min cases which hold for different reasons (idempotency of the reduce on identical
values, not linearity).
---
Summary: One correctness bug (#1 — wrong boolean logic lets sharding through on padded dims), one completeness gap
(#2 — missing P(avg/max/min) rules that hold unconditionally).
also i dunno if the sharding validator runs. i tried to run it but my pytorch needs a rebuild. if you can run it, and also ask claude to analyze any 'missing rules', that's a good thing to paste on these PRs by default
Just updated the PR, I guess the min/max/avg rules with padding are interesting.
There are a lot of false positives when the output tensor is 0-sized (allclose reports True); I'll put that fix up in a followup:
(pytorch-3048) [pianpwk@devvm3048.dkl0 /data/users/pianpwk/pytorch (95594238)]$ python -m torch.distributed.tensor._ops.strategy_validation --op constant_pad_nd
Testing ops: aten.constant_pad_nd
Device: cuda, Dtype: torch.float32, World size: 2
[1/1] aten.constant_pad_nd — Samples: 35 (16 skipped), Combinations: 2280
----------------------------------------------------------------------
Possibly missing (valid in ground truth but no DTensor rule)
[aten.constant_pad_nd.default]
P(avg) -> P(max)
P(avg) -> P(min)
P(avg) -> P(sum)
P(avg) -> R
P(avg) -> S(0)
P(avg) -> S(1)
P(max) -> P(avg)
P(max) -> P(min)
P(max) -> P(sum)
P(max) -> R
P(max) -> S(0)
P(max) -> S(1)
P(min) -> P(avg)
P(min) -> P(max)
P(min) -> P(sum)
P(min) -> R
P(min) -> S(0)
P(min) -> S(1)
P(sum) -> P(avg)
P(sum) -> P(max)
P(sum) -> P(min)
P(sum) -> P(sum)
P(sum) -> R
P(sum) -> S(0)
P(sum) -> S(1)
S(0) -> S(0)
======================================================================
Summary
======================================================================
Op Correct Incorrect Missing Time
---------------------------------------------------------
aten.constant_pad_nd 149 0 26 38.2s
---------------------------------------------------------
Total 149 0 26 38.2s
Once that's added, the rules are correct:
(pytorch-3048) [pianpwk@devvm3048.dkl0 /data/users/pianpwk/pytorch (95594238)]$ python -m torch.distributed.tensor._ops.strategy_validation --op constant_pad_nd
Testing ops: aten.constant_pad_nd
Device: cuda, Dtype: torch.float32, World size: 2
[1/1] aten.constant_pad_nd — Samples: 33 (18 skipped), Combinations: 2196
----------------------------------------------------------------------
======================================================================
Summary
======================================================================
Op Correct Incorrect Missing Time
---------------------------------------------------------
aten.constant_pad_nd 143 0 0 35.3s
---------------------------------------------------------
Total 143 0 0 35.3s
tensor_metas = tuple(
    TensorMeta(shape=t.shape, stride=t.stride(), dtype=t.dtype) for _, t in tensors
)
args_meta = tensor_metas + non_tensor_args
seems we need these for the padding amounts and value
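For context, `constant_pad_nd` takes one tensor plus non-tensor args (the pad widths and the fill value), so `args_meta` ends up looking roughly like this (a hypothetical illustration, not the actual call site):

```python
import torch
from torch.distributed.tensor._dtensor_spec import TensorMeta  # import path may vary by version

t = torch.randn(4, 8)
non_tensor_args = ([1, 1], 0.0)  # pad widths and fill value for constant_pad_nd
tensor_metas = (TensorMeta(shape=t.shape, stride=t.stride(), dtype=t.dtype),)
args_meta = tensor_metas + non_tensor_args  # tensor metadata first, then pad/value
```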
yea, take a look at #175821 too; not sure whether it helps to land mine first and remove this, or land yours first and rebase mine.
ah, I think yours is closer to landing
|
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Merge failed. Reason: 1 job has failed: trunk / linux-jammy-rocm-py3.10 / test (distributed, 1, 3, linux.rocm.gpu.gfx950.4). Details for Dev Infra team: Raised by workflow job
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 3 checks: inductor / inductor-cpu-test / test (cpu_inductor_torchbench, 1, 2, linux.2xlarge.amx, unstable), inductor / unit-test / inductor-test / test (inductor, 1, 2, linux.g5.4xlarge.nvidia.gpu), trunk / linux-jammy-rocm-py3.10 / test (distributed, 1, 3, linux.rocm.gpu.gfx950.4). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Upstreaming from autoparallel: https://github.com/meta-pytorch/autoparallel/blob/454780d2a27456a380c0d8e997c8fc2cf82ef5d8/autoparallel/shardings/propagation_rules.py#L630

The previous strategy required full-Replicate: we can pass through on non-padded dims, and allow Partial inputs when the pad value = 0 (arguable whether we should fix this). Rewritten as a single-dim strategy.

Pull Request resolved: pytorch#175656
Approved by: https://github.com/wconstab
ghstack dependencies: pytorch#175776
…_layer_norm, and native_layer_norm_backward

These three rules were carried as local overrides in autoparallel while upstream PyTorch lacked proper handling:
- constant_pad_nd: non-replicate strategy filtering on padded dims (upstreamed in pytorch/pytorch#175656)
- native_layer_norm forward: correct per-output shapes and contiguous strides (upstreamed in pytorch/pytorch#175652)
- native_layer_norm backward: contiguous stride handling for grad_input (upstreamed in a companion PR to pytorch/pytorch)

With all three fixes now in upstream PyTorch, the overrides can be removed and autoparallel defers to the upstream register_op_strategy implementations.

Authored with Claude.
Stack from ghstack (oldest at bottom):
Upstreaming from autoparallel: https://github.com/meta-pytorch/autoparallel/blob/454780d2a27456a380c0d8e997c8fc2cf82ef5d8/autoparallel/shardings/propagation_rules.py#L630
The previous strategy required full-Replicate: we can pass through on non-padded dims, and allow Partial inputs when the pad value = 0 (arguable whether we should fix this). Rewritten as a single-dim strategy.
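Putting the review discussion together, the resulting rule set looks roughly like this (an illustrative sketch of the logic with placeholder names, not the upstream code):

```python
def constant_pad_nd_rules(ndim, padded_dims, pad_value):
    """Input -> output placement pairs, per the discussion above (illustrative)."""
    rules = [("Replicate", "Replicate")]
    # Shard passthrough on any tensor dim that receives no padding:
    for d in range(ndim):
        if d not in padded_dims:
            rules.append((f"Shard({d})", f"Shard({d})"))
    # Idempotent reductions commute with padding for any fill value:
    for red in ("avg", "max", "min"):
        rules.append((f"Partial({red})", f"Partial({red})"))
    if pad_value == 0:
        rules.append(("Partial(sum)", "Partial(sum)"))  # linearity only at value == 0
    return rules

# pad=[1, 1] pads the last dim of a 2-D tensor, so Shard(1) is banned:
print(constant_pad_nd_rules(ndim=2, padded_dims={1}, pad_value=0.0))
```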