
[shard prop] default sharding validator to 1-1 OpInfo-aten entries#177595

Closed
pianpwk wants to merge 4 commits into gh/pianpwk/123/base from gh/pianpwk/123/head

Conversation

@pianpwk
Contributor

@pianpwk pianpwk commented Mar 16, 2026

Stack from ghstack (oldest at bottom):

Taking changes from #176258

One source of sharding validator false positives/negatives has been OpInfo entries which run multiple aten ops underneath. This makes it ambiguous which aten op to check, how many inputs/outputs to expect, etc.

By default we now only run if the OpInfo-aten op mapping is 1-1, and use the aten op inputs (ignore the top-level inputs).

Alternatively, run with --allow-composite to validate ALL underlying aten ops for the OpInfo entry.

Authored with Claude.

[ghstack-poisoned]
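The 1-1 check described above can be sketched with PyTorch's `TorchDispatchMode`, which intercepts every aten op that an operation dispatches underneath. This is an illustrative sketch, not the PR's actual code; helper names like `captured_aten_ops` and `is_one_to_one` are made up here.

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class AtenOpRecorder(TorchDispatchMode):
    """Record every aten op dispatched while the mode is active."""
    def __init__(self):
        self.ops = []

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        # func is the aten OpOverload actually being dispatched
        self.ops.append(func)
        return func(*args, **(kwargs or {}))

def captured_aten_ops(fn, *args, **kwargs):
    """Run fn under the recorder and return the aten ops it dispatched."""
    with AtenOpRecorder() as rec:
        fn(*args, **kwargs)
    return rec.ops

def is_one_to_one(fn, *args, **kwargs):
    # "1-1" in the sense above: the entry dispatches exactly one aten op,
    # so the validator knows unambiguously which op (and which inputs) to check.
    return len(captured_aten_ops(fn, *args, **kwargs)) == 1
```

Composite entries (those dispatching several aten ops) would be skipped by default under this scheme, and only validated when something like `--allow-composite` is passed.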
@pytorch-bot

pytorch-bot Bot commented Mar 16, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177595

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit ccd65dd with merge base 985f9ee:

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pianpwk added a commit that referenced this pull request Mar 16, 2026
Validate sharding rules at the aten op level using captured args/kwargs
instead of OpInfo SampleInput pytrees. Adds an exhaustive mode that
validates each supported aten op individually for decomposed ops,
eliminating silent skips for non-1:1 aten mappings. Relaxes atol to
1e-3 to accommodate accumulated floating point error.

Authored with Claude.


ghstack-source-id: 2078207
Pull-Request: #177595
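The relaxed-tolerance comparison mentioned in the commit message (atol 1e-3) could look like the following sketch. It is an assumption about the shape of the check, not the PR's code; `matches_ground_truth` is a hypothetical name.

```python
import torch

def matches_ground_truth(sharded_result: torch.Tensor,
                         ground_truth: torch.Tensor,
                         atol: float = 1e-3, rtol: float = 1e-5) -> bool:
    # Accumulated floating point error across sharded execution motivates
    # the looser atol=1e-3 (torch.allclose's default atol is 1e-8).
    return torch.allclose(sharded_result, ground_truth, atol=atol, rtol=rtol)
```

With the default atol, a result like 1.0005 vs. a ground truth of 1.0 would be flagged as a mismatch (a false negative); atol=1e-3 accepts it.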
@pytorch-bot pytorch-bot Bot added ciflow/dtensor Run DTensor specific tests ciflow/inductor ciflow/torchtitan Run TorchTitan integration tests release notes: distributed (dtensor) release notes category labels Mar 16, 2026
@pianpwk pianpwk changed the title [dtensor] Aten-level strategy validation with exhaustive mode [shard prop] default sharding validator to 1-1 OpInfo-aten entries Mar 16, 2026
return tensors


def validate_aten_combination(
Contributor


should we decompose the other validate fn into calls into this fn? (mainly to dedup code) - or would that not be as easy as it sounds?

Contributor


saw you addressed this, thanks

… entries"


Taking changes from #176258

One source of sharding validator false positives/negatives has been OpInfo entries which run multiple aten ops underneath. This makes it ambiguous which aten op to check, how many inputs/outputs to expect, etc.

By default we now only run if the OpInfo-aten op mapping is 1-1, and use the aten op inputs (ignore the top-level inputs).

Alternatively, run with `--exhaustive` to validate ALL underlying aten ops for the OpInfo entry.

Also relaxes atol to 1e-3 to reduce false negatives.

Authored with Claude.

[ghstack-poisoned]
pianpwk added a commit that referenced this pull request Mar 17, 2026

ghstack-source-id: d08a6bb
Pull Request resolved: #177595
return None


def _check_ground_truth(
Contributor


bad name? doesn't actually check the ground truth, checks whether the result is usable as a ground truth!

Contributor

@wconstab wconstab left a comment


wondering, how useful is exhaustive mode? and how much smaller/cleaner would this diff be if instead of maintaining both modes we just made the only mode the 1:1 mode?

there are a few places that seem to duplicate codepaths between modes. like preparing mitigations, and compare_ harnesses. ideally i'd rather the 'exhaustive' mode be a simple layer on top of the 1:1 mode, nothing more.

@pianpwk
Contributor Author

pianpwk commented Mar 19, 2026

wondering, how useful is exhaustive mode? and how much smaller/cleaner would this diff be if instead of maintaining both modes we just made the only mode the 1:1 mode?

hmm are you able to access all aten ops with only 1-1 mode? say entry A runs aten ops B, C, D (where C is what we care about testing) - there might not be another entry that 1-1 matches C?

@wconstab
Contributor

hmm. if you want to test C, is it even obvious how to figure out which op A to tell the validator to validate so that you get coverage for C?

Is it more useful to 'validate A', or would it be more useful to say 'find ops that decompose to C and gather their opinfo samples to directly validate C'? (which is more useful to you as a user i mean)

@pianpwk
Contributor Author

pianpwk commented Mar 19, 2026

Is it more useful to 'validate A', or would it be more useful to say 'find ops that decompose to C and gather their opinfo samples to directly validate C'? (which is more useful to you as a user i mean)

I think by "validating A", you have a good idea that it's C you want, e.g. adaptive_avg_pool2d -> aten::adaptive_avg_pool2d, even if B & D are random clone/view/detach ops.

The gathering needs some OpInfo -> aten op mapping :\
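A rough sketch of what such an OpInfo -> aten mapping could look like: trace each entry's sample inputs with `TorchDispatchMode` (a real PyTorch API) and invert the result, so that for a target aten op C one can look up every entry whose decomposition hits C. The `(name, fn, sample_args)` entry shape and the helper names here are hypothetical.

```python
import collections
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class _Recorder(TorchDispatchMode):
    """Collect the set of aten ops dispatched while active."""
    def __init__(self):
        self.ops = set()

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        self.ops.add(func)
        return func(*args, **(kwargs or {}))

def build_aten_index(entries):
    """entries: iterable of (name, fn, sample_args) tuples standing in for
    OpInfo entries. Returns {aten_op: [entry names that dispatch it]}."""
    index = collections.defaultdict(list)
    for name, fn, sample_args in entries:
        with _Recorder() as rec:
            fn(*sample_args)
        for op in rec.ops:
            index[op].append(name)
    return dict(index)
```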

Contributor

@wconstab wconstab left a comment


ok- maybe --exhaustive isn't the most informative name. --allow-composite ?

@pianpwk pianpwk added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 20, 2026
pianpwk added a commit that referenced this pull request Mar 20, 2026

ghstack-source-id: de365fb
Pull Request resolved: #177595
pianpwk added a commit that referenced this pull request Mar 30, 2026

ghstack-source-id: ff1b917
Pull Request resolved: #177595
@pianpwk
Contributor Author

pianpwk commented Mar 30, 2026

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

AaronWang04 pushed a commit to AaronWang04/pytorch that referenced this pull request Mar 31, 2026
…ytorch#177595)

Taking changes from pytorch#176258

One source of sharding validator false positives/negatives has been OpInfo entries which run multiple aten ops underneath. This makes it ambiguous which aten op to check, how many inputs/outputs to expect, etc.

By default we now only run if the OpInfo-aten op mapping is 1-1, and use the aten op inputs (ignore the top-level inputs).

Alternatively, run with `--allow-composite` to validate ALL underlying aten ops for the OpInfo entry.

Authored with Claude.
Pull Request resolved: pytorch#177595
Approved by: https://github.com/wconstab
pytorch-bot Bot pushed a commit that referenced this pull request Apr 2, 2026
IvanKobzarev pushed a commit to IvanKobzarev/pytorch that referenced this pull request Apr 3, 2026
nklshy-aws pushed a commit to nklshy-aws/pytorch that referenced this pull request Apr 7, 2026
bobrenjc93 pushed a commit to bobrenjc93/pytorch that referenced this pull request Apr 9, 2026
bobrenjc93 pushed a commit to bobrenjc93/pytorch that referenced this pull request Apr 10, 2026
@github-actions github-actions Bot deleted the gh/pianpwk/123/head branch April 30, 2026 02:26

Labels

ciflow/dtensor Run DTensor specific tests ciflow/inductor ciflow/torchtitan Run TorchTitan integration tests ciflow/trunk Trigger trunk jobs on your pull request Merged release notes: distributed (dtensor) release notes category

3 participants