
[shard prop] OpInfo strategy validation suite #176258

Open

pianpwk wants to merge 5 commits into gh/pianpwk/107/base from gh/pianpwk/107/head

Conversation


@pianpwk pianpwk commented Mar 3, 2026

[ghstack-poisoned]

pytorch-bot Bot commented Mar 3, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/176258

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure

As of commit 16b8c81 with merge base ff91f31:

NEW FAILURES - The following jobs have failed:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pianpwk added a commit that referenced this pull request Mar 3, 2026
pianpwk added a commit that referenced this pull request Mar 6, 2026
ghstack-source-id: 6ac85a9
Pull Request resolved: #176258

pianpwk commented Mar 6, 2026

@wconstab changed validation to skip non-1-1 entries; the list of failures is much smaller now.

@pianpwk pianpwk marked this pull request as ready for review March 6, 2026 22:32
@pianpwk pianpwk requested a review from wconstab March 6, 2026 22:32
[ghstack-poisoned]
pianpwk added a commit that referenced this pull request Mar 6, 2026
ghstack-source-id: 268ef27
Pull Request resolved: #176258

```diff
 if not torch.allclose(
-    gt, full_output, atol=1e-5, rtol=1e-5, equal_nan=True
+    gt, full_output, atol=1e-3, rtol=1e-5, equal_nan=True
```
Contributor commented:

Was this still needed after you xfailed the to_copy ops?

Contributor Author commented:

Surprisingly, this was for the baddbmm op.
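
For context on why relaxing atol matters here: torch.allclose passes elementwise when |a - b| <= atol + rtol * |b|, so atol alone decides the outcome wherever the reference values sit near zero. A self-contained illustration (not the suite's actual harness):

```python
import torch

gt = torch.tensor([0.0, 1.0])
full_output = gt + 5e-4  # small absolute drift, e.g. from accumulation order

# Near zero the rtol * |b| term vanishes, so atol decides the outcome.
print(torch.allclose(gt, full_output, atol=1e-5, rtol=1e-5))  # False
print(torch.allclose(gt, full_output, atol=1e-3, rtol=1e-5))  # True
```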

```python
# Ops like inner (permute, view, mm, view) decompose into multiple
# aten calls; validating the high-level sample against one captured
# op produces wrong results.
with _CaptureAtenOp() as _probe:
```
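
For readers unfamiliar with the probe: a capture context like this can be built on TorchDispatchMode, which observes the aten ops a composite call decomposes into. A minimal sketch with a hypothetical class name; the PR's actual _CaptureAtenOp may differ in detail:

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class CaptureAtenOps(TorchDispatchMode):
    """Record every aten op dispatched inside the context (sketch only)."""

    def __init__(self):
        super().__init__()
        self.ops = []

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        self.ops.append(func)
        return func(*args, **(kwargs or {}))

a, b = torch.randn(3, 4), torch.randn(5, 4)
with CaptureAtenOps() as probe:
    torch.inner(a, b)

# Composite ops like inner surface as several aten calls here, which
# is exactly the non-1-1 case the validator has to detect.
print([str(op) for op in probe.ops])
```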
Contributor commented:

Wonder if we can push this check upstream so that we determine our fate earlier. compare_operator calls _discover_aten_op, which perhaps could be more assertive (assert that a single aten op is found),

and then maybe get_aten_op_for_sample can raise the skip_reason[non-1-1-mapping] right inside if it sees more than one op in the graph.

  • this path is more critical than _discover_aten_op, since it runs once per sample and each sample can give a different aten op / graph
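
A minimal sketch of that upstream check, reusing the CaptureAtenOps sketch from above; the signatures and skip-reason plumbing here are hypothetical and may not match the script:

```python
class SkipReason(Exception):
    """Hypothetical stand-in for the script's skip mechanism."""

def get_aten_op_for_sample(op_call, sample):
    # Runs once per sample: each sample can produce a different
    # aten op / graph, so the 1-1 check has to happen here.
    with CaptureAtenOps() as probe:
        op_call(sample.input, *sample.args, **sample.kwargs)
    unique_ops = sorted({str(op) for op in probe.ops})
    if len(unique_ops) != 1:
        # Determine our fate early: more than one aten op in the
        # graph means the OpInfo-aten mapping is not 1-1.
        raise SkipReason(f"non-1-1-mapping: {unique_ops}")
    return probe.ops[0]
```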

Contributor Author commented:

just updated

pianpwk added a commit that referenced this pull request Mar 9, 2026
ghstack-source-id: a9d984e
Pull Request resolved: #176258
```python
aten_op = _discover_aten_op(opinfos, device, dtype)
if aten_op is None or not _has_dtensor_support(aten_op):
    if verbose:
        print(f" ATEN_OP_MAP: {op_name} -> {aten_op} [no_support]")
```
Collaborator commented:

These should all be logging not printing!

Contributor commented:

That's my bad; I noticed this at one point but didn't clean it up. The script uses print consistently, at least. I am supportive of a PR to change it to use logging and make it nicely configurable.

```python
dtype: torch.dtype = torch.float32,
world_size: int = 2,
max_samples: int | None = None,
verbose: bool = False,
```
Collaborator commented:

Verbosity here could be refactored as a logging.LEVEL, lol.
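
A possible shape for that refactor, with hypothetical helper names: keep one module-level logger and map the old verbose flag onto a level once at entry, instead of threading a boolean through every signature:

```python
import logging

logger = logging.getLogger(__name__)

def configure_logging(verbose: bool) -> None:
    # Map the old boolean onto a log level once, at script entry.
    logging.basicConfig(level=logging.DEBUG if verbose else logging.INFO)

def report_no_support(op_name: str, aten_op) -> None:
    # Replaces `if verbose: print(...)` at each call site.
    logger.debug("ATEN_OP_MAP: %s -> %s [no_support]", op_name, aten_op)
```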

```python
dtype,
args.world_size,
args.max_samples,
verbose=True,
```
Collaborator commented:

Sigh...

pianpwk added a commit that referenced this pull request Mar 17, 2026
… 1-1 OpInfo-aten entries"


Taking changes from #176258

One source of sharding-validator false positives/negatives has been OpInfo entries that run multiple aten ops underneath. This misleads the choice of aten op to check, the number of inputs/outputs, etc.

By default we now only run if the OpInfo-aten op mapping is 1-1, and use the aten op inputs (ignore the top-level inputs).

Alternatively, run with `--exhaustive` to validate ALL underlying aten ops for the OpInfo entry.

Also relaxes atol to 1e-3 to reduce false negatives.

Authored with Claude.

[ghstack-poisoned]
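
A sketch of how the default-vs-exhaustive gating described above could look; the --exhaustive flag name comes from the commit message, while the helper below is hypothetical:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--exhaustive",
    action="store_true",
    help="validate ALL underlying aten ops, not only 1-1 OpInfo-aten mappings",
)

def ops_to_validate(captured_ops, exhaustive: bool):
    # Deduplicate while preserving capture order.
    unique = list(dict.fromkeys(captured_ops))
    if exhaustive:
        return unique
    # Default: only validate when the OpInfo-aten mapping is 1-1.
    return unique if len(unique) == 1 else []
```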
pianpwk added a commit that referenced this pull request Mar 17, 2026
… entries"
pianpwk added a commit that referenced this pull request Mar 20, 2026
… 1-1 OpInfo-aten entries"
pianpwk added a commit that referenced this pull request Mar 20, 2026
… entries"
pytorchmergebot pushed a commit that referenced this pull request Mar 30, 2026
…177595)

Taking changes from #176258

One source of sharding-validator false positives/negatives has been OpInfo entries that run multiple aten ops underneath. This misleads the choice of aten op to check, the number of inputs/outputs, etc.

By default we now only run if the OpInfo-aten op mapping is 1-1, and use the aten op inputs (ignore the top-level inputs).

Alternatively, run with `--allow-composite` to validate ALL underlying aten ops for the OpInfo entry.

Authored with Claude.
Pull Request resolved: #177595
Approved by: https://github.com/wconstab
AaronWang04 pushed a commit to AaronWang04/pytorch that referenced this pull request Mar 31, 2026
…ytorch#177595)
pytorch-bot Bot pushed a commit that referenced this pull request Apr 2, 2026
…177595)
IvanKobzarev pushed a commit to IvanKobzarev/pytorch that referenced this pull request Apr 3, 2026
…ytorch#177595)
nklshy-aws pushed a commit to nklshy-aws/pytorch that referenced this pull request Apr 7, 2026
…ytorch#177595)
bobrenjc93 pushed a commit to bobrenjc93/pytorch that referenced this pull request Apr 9, 2026
…ytorch#177595)
bobrenjc93 pushed a commit to bobrenjc93/pytorch that referenced this pull request Apr 10, 2026
…ytorch#177595)