
[shard prop] OpInfo strategy validation suite #176258

Open

pianpwk wants to merge 5 commits into gh/pianpwk/107/base from gh/pianpwk/107/head

Conversation


@pianpwk pianpwk commented Mar 3, 2026

[ghstack-poisoned]

pytorch-bot Bot commented Mar 3, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/176258

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure

As of commit 16b8c81 with merge base ff91f31:

NEW FAILURES - The following jobs have failed:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pianpwk added a commit that referenced this pull request Mar 3, 2026
pianpwk added a commit that referenced this pull request Mar 6, 2026
ghstack-source-id: 6ac85a9
Pull Request resolved: #176258

pianpwk commented Mar 6, 2026

@wconstab changed validation to skip non-1-1 entries; the list of failures is much smaller now.

@pianpwk pianpwk marked this pull request as ready for review March 6, 2026 22:32
@pianpwk pianpwk requested a review from wconstab March 6, 2026 22:32
[ghstack-poisoned]
pianpwk added a commit that referenced this pull request Mar 6, 2026
ghstack-source-id: 268ef27
Pull Request resolved: #176258

```diff
 if not torch.allclose(
-    gt, full_output, atol=1e-5, rtol=1e-5, equal_nan=True
+    gt, full_output, atol=1e-3, rtol=1e-5, equal_nan=True
```
Contributor commented:

Was this still needed after you xfailed the to_copy ops?

Contributor Author commented:

Surprisingly, this was for the baddbmm op.
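
For context on why relaxing atol matters here: torch.allclose passes elementwise when |a - b| <= atol + rtol * |b|, so atol alone decides the outcome wherever the reference values sit near zero. A self-contained illustration (not the suite's actual harness):

```python
import torch

gt = torch.tensor([0.0, 1.0])
full_output = gt + 5e-4  # small absolute drift, e.g. from accumulation order

# Near zero the rtol * |b| term vanishes, so atol decides the outcome.
print(torch.allclose(gt, full_output, atol=1e-5, rtol=1e-5))  # False
print(torch.allclose(gt, full_output, atol=1e-3, rtol=1e-5))  # True
```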

```python
# Ops like inner (permute, view, mm, view) decompose into multiple
# aten calls; validating the high-level sample against one captured
# op produces wrong results.
with _CaptureAtenOp() as _probe:
```
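
For readers unfamiliar with the probe: a capture context like this can be built on TorchDispatchMode, which observes the aten ops a composite call decomposes into. A minimal sketch with a hypothetical class name; the PR's actual _CaptureAtenOp may differ in detail:

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class CaptureAtenOps(TorchDispatchMode):
    """Record every aten op dispatched inside the context (sketch only)."""

    def __init__(self):
        super().__init__()
        self.ops = []

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        self.ops.append(func)
        return func(*args, **(kwargs or {}))

a, b = torch.randn(3, 4), torch.randn(5, 4)
with CaptureAtenOps() as probe:
    torch.inner(a, b)

# Composite ops like inner surface as several aten calls here, which
# is exactly the non-1-1 case the validator has to detect.
print([str(op) for op in probe.ops])
```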
Contributor commented:

Wonder if we can push this check upstream so that we determine our fate earlier. compare_operator calls _discover_aten_op, which perhaps could be more assertive (assert that a single aten op is found),

and then maybe get_aten_op_for_sample can raise the skip_reason[non-1-1-mapping] right inside if it sees more than one op in the graph.

  • this path is more critical than _discover_aten_op, since it runs once per sample and each sample can give a different aten op / graph
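
A minimal sketch of that upstream check, reusing the CaptureAtenOps sketch from above; the signatures and skip-reason plumbing here are hypothetical and may not match the script:

```python
class SkipReason(Exception):
    """Hypothetical stand-in for the script's skip mechanism."""

def get_aten_op_for_sample(op_call, sample):
    # Runs once per sample: each sample can produce a different
    # aten op / graph, so the 1-1 check has to happen here.
    with CaptureAtenOps() as probe:
        op_call(sample.input, *sample.args, **sample.kwargs)
    unique_ops = sorted({str(op) for op in probe.ops})
    if len(unique_ops) != 1:
        # Determine our fate early: more than one aten op in the
        # graph means the OpInfo-aten mapping is not 1-1.
        raise SkipReason(f"non-1-1-mapping: {unique_ops}")
    return probe.ops[0]
```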

Contributor Author commented:

just updated

pianpwk added a commit that referenced this pull request Mar 9, 2026
ghstack-source-id: a9d984e
Pull Request resolved: #176258
```python
aten_op = _discover_aten_op(opinfos, device, dtype)
if aten_op is None or not _has_dtensor_support(aten_op):
    if verbose:
        print(f" ATEN_OP_MAP: {op_name} -> {aten_op} [no_support]")
```
Collaborator commented:

These should all be logging not printing!

Contributor commented:

That's my bad; I noticed this at one point but didn't clean it up. The script uses print consistently, at least. I am supportive of a PR to change it to use logging and make it nicely configurable.

```python
dtype: torch.dtype = torch.float32,
world_size: int = 2,
max_samples: int | None = None,
verbose: bool = False,
```
Collaborator commented:

Verbosity here could be refactored as a logging.LEVEL, lol.
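
A possible shape for that refactor, with hypothetical helper names: keep one module-level logger and map the old verbose flag onto a level once at entry, instead of threading a boolean through every signature:

```python
import logging

logger = logging.getLogger(__name__)

def configure_logging(verbose: bool) -> None:
    # Map the old boolean onto a log level once, at script entry.
    logging.basicConfig(level=logging.DEBUG if verbose else logging.INFO)

def report_no_support(op_name: str, aten_op) -> None:
    # Replaces `if verbose: print(...)` at each call site.
    logger.debug("ATEN_OP_MAP: %s -> %s [no_support]", op_name, aten_op)
```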

```python
dtype,
args.world_size,
args.max_samples,
verbose=True,
```
Collaborator commented:

Sigh...

pianpwk added a commit that referenced this pull request Mar 17, 2026
… 1-1 OpInfo-aten entries"


Taking changes from #176258

One source of sharding-validator false positives/negatives has been OpInfo entries that run multiple aten ops underneath. This misleads the choice of aten op to check, the number of inputs/outputs, etc.

By default we now only run if the OpInfo-aten op mapping is 1-1, and use the aten op inputs (ignore the top-level inputs).

Alternatively, run with `--exhaustive` to validate ALL underlying aten ops for the OpInfo entry.

Also relaxes atol to 1e-3 to reduce false negatives.

Authored with Claude.

[ghstack-poisoned]
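
A sketch of how the default-vs-exhaustive gating described above could look; the --exhaustive flag name comes from the commit message, while the helper below is hypothetical:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--exhaustive",
    action="store_true",
    help="validate ALL underlying aten ops, not only 1-1 OpInfo-aten mappings",
)

def ops_to_validate(captured_ops, exhaustive: bool):
    # Deduplicate while preserving capture order.
    unique = list(dict.fromkeys(captured_ops))
    if exhaustive:
        return unique
    # Default: only validate when the OpInfo-aten mapping is 1-1.
    return unique if len(unique) == 1 else []
```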
pianpwk added a commit that referenced this pull request Mar 17, 2026
… entries"
pianpwk added a commit that referenced this pull request Mar 20, 2026
… 1-1 OpInfo-aten entries"
pianpwk added a commit that referenced this pull request Mar 20, 2026
… entries"
pytorchmergebot pushed a commit that referenced this pull request Mar 30, 2026
…177595)

Taking changes from #176258

One source of sharding-validator false positives/negatives has been OpInfo entries that run multiple aten ops underneath. This misleads the choice of aten op to check, the number of inputs/outputs, etc.

By default we now only run if the OpInfo-aten op mapping is 1-1, and use the aten op inputs (ignore the top-level inputs).

Alternatively, run with `--allow-composite` to validate ALL underlying aten ops for the OpInfo entry.

Authored with Claude.
Pull Request resolved: #177595
Approved by: https://github.com/wconstab
AaronWang04 pushed a commit to AaronWang04/pytorch that referenced this pull request Mar 31, 2026
…ytorch#177595)
pytorch-bot Bot pushed a commit that referenced this pull request Apr 2, 2026
…177595)
IvanKobzarev pushed a commit to IvanKobzarev/pytorch that referenced this pull request Apr 3, 2026
…ytorch#177595)
nklshy-aws pushed a commit to nklshy-aws/pytorch that referenced this pull request Apr 7, 2026
…ytorch#177595)
bobrenjc93 pushed a commit to bobrenjc93/pytorch that referenced this pull request Apr 9, 2026
…ytorch#177595)
bobrenjc93 pushed a commit to bobrenjc93/pytorch that referenced this pull request Apr 10, 2026
…ytorch#177595)