[DTensor] Enable Dijkstra search in sharding propagation #175999

Open

wconstab wants to merge 9 commits into gh/wconstab/551/base from gh/wconstab/551/head

Conversation

@wconstab
Contributor

@wconstab wconstab commented Feb 27, 2026

Stack from ghstack (oldest at bottom):

Wire _dijkstra_expand_single_dim_strategy_to_mesh into the sharding
propagation path. For ops with single-dim strategies, try the PQ search
first; fall back to full O(S^N) expansion when it returns None
(StridedShard, symbolic shapes, or TupleStrategy inputs).

Authored with Claude.
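The try-first, fall-back dispatch described above can be sketched as follows. This is an illustrative toy, not the actual PyTorch internals; `dijkstra_expand`, `full_expand`, and `propagate` are hypothetical stand-ins for the real functions:

```python
from itertools import product

def full_expand(choices, mesh_ndim):
    # Exhaustive O(S^N) cross-product: S per-dim placement choices
    # enumerated over N mesh dimensions.
    return list(product(choices, repeat=mesh_ndim))

def dijkstra_expand(choices, mesh_ndim, supported=True):
    # Stand-in for the PQ search: returns None for unsupported inputs
    # (StridedShard, symbolic shapes, TupleStrategy in the real code).
    if not supported:
        return None
    return [tuple(choices[:1] * mesh_ndim)]

def propagate(choices, mesh_ndim, supported=True):
    # Try the cheap search first; fall back to full expansion on None.
    result = dijkstra_expand(choices, mesh_ndim, supported)
    if result is None:
        result = full_expand(choices, mesh_ndim)
    return result

choices = ["Shard(0)", "Replicate"]  # S = 2
assert len(propagate(choices, 3, supported=False)) == 2 ** 3  # O(S^N) path
```

The point of the pattern is that the search is a pure optimization: any input class it cannot handle degrades to the existing exhaustive path.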

@pytorch-bot

pytorch-bot bot commented Feb 27, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/175999

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures, 2 Unrelated Failures

As of commit 8700de5 with merge base 1011d3f:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wconstab added a commit that referenced this pull request Feb 27, 2026
Wire _dijkstra_expand_single_dim_strategy_to_mesh into the sharding
propagation path. For ops with single-dim strategies, try the PQ search
first; fall back to full O(S^N) expansion when it returns None
(StridedShard, symbolic shapes, or TupleStrategy inputs).

Authored with Claude.

ghstack-source-id: da66f22
Pull Request resolved: #175999
wconstab added a commit that referenced this pull request Feb 27, 2026
ghstack-source-id: b105032
Pull Request resolved: #175999
wconstab added a commit that referenced this pull request Feb 27, 2026
ghstack-source-id: 949e0c9
Pull Request resolved: #175999
@wconstab wconstab added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 3, 2026
@wconstab
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorch-auto-revert

@pytorchbot revert -m "Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable" -c autorevert

This PR is attributed as having caused regressions in:

Please investigate and fix the issues.
@claude Can you please read this revert comment, follow the links and read the errors, and then give a brief diagnosis of the cause of the error? If you judge the error to be a legitimate reason for a revert, please provide brief guidance on how the author could fix it.

@claude

claude bot commented Mar 10, 2026

Claude finished @pytorch-auto-revert[bot]'s task in 8m 26s


Revert Diagnostics for PR #175999

  • Read the PR changes and source code
  • Read the Dijkstra expansion implementation
  • Analyze the failing test patterns
  • Diagnose the root cause
  • Provide fix guidance

Failing Tests

  • test_dtensor_op_db_baddbmm_cpu_float32 (in TestLocalDTensorOps)
  • test_compiled_dtensor_op_db_nn_functional_poisson_nll_loss_cpu_float32 (in TestCompiledDTensorOps)

Is the Revert Legitimate?

Yes. The Dijkstra-based search path introduced by this PR has behavioral differences from the full O(S^N) expansion that cause real test failures.

Root Cause Analysis

The PR wires _dijkstra_expand_single_dim_strategy_to_mesh into the sharding propagation path at _sharding_prop.py:697-703, attempting it first and falling back to full expansion only when the Dijkstra function returns None. There are two classes of issues:

1. Hard failure: AssertionError when no valid strategy is found

The Dijkstra function at single_dim_strategy.py:936-940 raises AssertionError("No valid strategy found...") when the priority queue is exhausted without finding a matching state. By contrast, the full expansion path in expand_to_full_mesh_op_strategy (utils.py:559) returns OpStrategy(all_strategies) which may be an empty list — the downstream _select_min_cost_strategy would then crash on min() of an empty sequence, but the error handling is different. More importantly, the full expansion path filters strategies more gracefully through multiple checks (shardability, inplace, out-variant, mixed-partial, etc.) and the strategies that pass form a valid set. The Dijkstra search space exploration may miss valid strategies due to:

  • try_propagate shardability check is too strict (single_dim_strategy.py:389-393): It calls is_tensor_shardable(spec.tensor_meta.shape, spec) without passing allow_unbacked_sharding or checking allow_uneven_sharding. The full expansion uses is_tensor_shardable(..., allow_unbacked_sharding=allow_unbacked_sharding) and additionally has an allow_uneven_sharding fallback (utils.py:518-528). For baddbmm, which is registered with allow_unbacked_sharding=True, this mismatch could cause valid strategies to be incorrectly rejected.

  • Neighbor generation limitations (_get_neighbor_placements): The transition rules at single_dim_strategy.py:654-706 model one-shot placement changes. If the optimal strategy requires multi-step transitions that pass through states rejected by is_tensor_shardable, the search may fail to find it.
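The predicate mismatch in the first bullet can be modeled with a toy (the `shardable` function and its flags are hypothetical; they track the `is_tensor_shardable` / `allow_uneven_sharding` semantics only loosely):

```python
# Toy model of the shardability mismatch described above: a search that
# filters candidates with a stricter predicate than the exhaustive path
# will reject strategies the exhaustive path would have accepted.
def shardable(dim_size, num_shards, allow_uneven=False):
    # Require even divisibility unless uneven sharding is permitted.
    return dim_size % num_shards == 0 or allow_uneven

# Full-expansion path passes the relaxed flag and accepts the candidate;
# the search path omits it and prunes the same candidate, so the two
# paths can disagree on which strategies are valid.
assert shardable(10, 4, allow_uneven=True)   # accepted by full expansion
assert not shardable(10, 4)                  # pruned by the search
```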

2. Indirect failures via decomposition

For poisson_nll_loss: this op is not registered with register_single_dim_strategy at all — it's not in the DTensor ops registry. It decomposes at the ATen level into constituent ops (mul, log, exp, etc.), some of which DO use single-dim strategy and would now go through the Dijkstra path. The failure would propagate up from a decomposed sub-op hitting the issues described above.
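As a rough sketch of why a decomposed op exercises the new path: with `log_input=False` and no Stirling term, Poisson NLL reduces to elementwise sub/mul/log (formula per the torch.nn.functional docs; the code below is plain Python, not DTensor):

```python
import math

# Elementwise decomposition of poisson_nll_loss (log_input=False, no
# Stirling approximation): loss_i = input_i - target_i * log(input_i + eps).
# Each constituent op (sub, mul, log) carries its own sharding strategy,
# so a failure in one sub-op surfaces as a failure of the composite op.
def poisson_nll(inputs, targets, eps=1e-8):
    return [i - t * math.log(i + eps) for i, t in zip(inputs, targets)]

losses = poisson_nll([1.0, 2.0], [1.0, 1.0])
assert abs(losses[0] - 1.0) < 1e-6  # 1 - log(1 + eps) ~= 1.0
```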

Suggested Fixes

  1. Pass allow_unbacked_sharding and handle allow_uneven_sharding in try_propagate: The _PreparedSingleDimStrategy already stores allow_unbacked_sharding and allow_uneven_sharding. The try_propagate method at line 389-393 should use them:
# In try_propagate, replace lines 389-394:
if not all(
    is_tensor_shardable(
        spec.tensor_meta.shape, spec,
        allow_unbacked_sharding=self.allow_unbacked_sharding
    )
    or (
        self.allow_uneven_sharding
        and input_spec.placements == spec.placements
    )
    for spec, input_spec in zip(arg_specs, input_specs)
    if spec.tensor_meta is not None
):
    return None
  2. Convert the hard AssertionError to return None: Instead of raising at line 936-940, return None to let the caller fall back to full expansion. This ensures the Dijkstra path is a pure optimization that gracefully degrades:
# Replace lines 936-940:
# No strategy found via Dijkstra; fall back to full expansion
return None
  3. Add the failing ops to test coverage: After fixing, verify that both baddbmm and poisson_nll_loss pass with the Dijkstra path enabled, and also that FORCE_FULLY_EXPAND_SINGLE_DIM = True produces the same results (to validate equivalence).

Fix #2 is the most conservative and least risky change — it turns Dijkstra misses into graceful fallbacks rather than hard failures. Fix #1 addresses the root cause of why valid strategies are being rejected.
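The graceful-miss behavior suggested above can be shown with a minimal Dijkstra over a toy graph (not DTensor placements): exhaust the priority queue, then return None instead of asserting, so the caller can fall back:

```python
import heapq

# Minimal Dijkstra that signals a miss with None rather than raising,
# matching the "pure optimization that gracefully degrades" behavior.
def dijkstra(graph, start, goal):
    pq = [(0, start)]          # (cost-so-far, node) priority queue
    seen = set()
    while pq:
        cost, node = heapq.heappop(pq)
        if node == goal:
            return cost
        if node in seen:
            continue
        seen.add(node)
        for nbr, weight in graph.get(node, []):
            heapq.heappush(pq, (cost + weight, nbr))
    return None  # queue exhausted: let the caller fall back

g = {"a": [("b", 1)], "b": [("c", 2)]}
assert dijkstra(g, "a", "c") == 3
assert dijkstra(g, "a", "zzz") is None  # graceful miss, no AssertionError
```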


@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Mar 10, 2026
…5999)"

This reverts commit f13ba06.

Reverted #175999 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable
@pytorchmergebot
Collaborator

@wconstab your PR has been successfully reverted.

@pytorchmergebot pytorchmergebot added Reverted ci-no-td Do not run TD on this PR labels Mar 10, 2026
wconstab added a commit that referenced this pull request Mar 11, 2026
ghstack-source-id: 9867cd0
Pull Request resolved: #175999
sandy-gags pushed a commit to sandy-gags/pytorch that referenced this pull request Mar 12, 2026
ghstack-source-id: e264f02
Pull Request resolved: pytorch/pytorch#175999
sandy-gags pushed a commit to sandy-gags/pytorch that referenced this pull request Mar 12, 2026
ghstack-source-id: dbbb2c1
Pull Request resolved: pytorch/pytorch#175999
wconstab added a commit that referenced this pull request Mar 12, 2026
ghstack-source-id: 31ac13d
Pull Request resolved: #175999
@wconstab
Contributor Author

@pytorchbot rebase

@wconstab
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to

Aborting rebase because rebasing the branch resulted in the same sha as the target branch.
This usually happens because the PR has already been merged.  Please rebase locally and push.

Raised by https://github.com/pytorch/pytorch/actions/runs/23052976057

wconstab added a commit that referenced this pull request Mar 13, 2026
ghstack-source-id: 1e047be
Pull Request resolved: #175999

Labels

ci-no-td (Do not run TD on this PR), ciflow/inductor, ciflow/torchtitan (Run TorchTitan integration tests), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, release notes: distributed (dtensor), Reverted


4 participants