[shard prop] OpInfo strategy validation suite #176258
pianpwk wants to merge 5 commits into gh/pianpwk/107/base
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/176258
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 1 Unrelated Failure. As of commit 16b8c81 with merge base ff91f31:
NEW FAILURES - The following jobs have failed:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@wconstab changed validation to skip non-1-1 entries, the list of failures is much smaller now
  if not torch.allclose(
-     gt, full_output, atol=1e-5, rtol=1e-5, equal_nan=True
+     gt, full_output, atol=1e-3, rtol=1e-5, equal_nan=True
was this still needed after you xfailed the to_copy ops?
surprisingly this was for the baddbmm op
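For context, a rough way to see why baddbmm can exceed a 1e-5 atol is to sum partial bmms over the reduction dim, which mimics what a sharded reduction does to accumulation order. The shapes and seed below are illustrative, not the validator's actual samples:

```python
import torch

torch.manual_seed(0)
inp = torch.randn(8, 64, 64)
b1 = torch.randn(8, 64, 128)
b2 = torch.randn(8, 128, 64)

full = torch.baddbmm(inp, b1, b2)

# Mimic a reduction sharded over K: each "rank" computes a partial bmm
# and the partials are summed afterwards, changing accumulation order.
partial = (
    inp
    + torch.bmm(b1[..., :64], b2[:, :64, :])
    + torch.bmm(b1[..., 64:], b2[:, 64:, :])
)

# The tight tolerance may or may not hold depending on shapes/values;
# the relaxed atol leaves plenty of headroom for accumulation error.
print(torch.allclose(full, partial, atol=1e-5, rtol=1e-5, equal_nan=True))
print(torch.allclose(full, partial, atol=1e-3, rtol=1e-5, equal_nan=True))
```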
# Ops like inner (permute, view, mm, view) decompose into multiple
# aten calls — validating the high-level sample against one captured
# op produces wrong results.
with _CaptureAtenOp() as _probe:
wonder if we can push this check upstream so that we determine our fate earlier. compare_operator calls _discover_aten_op, which perhaps could be more assertive (assert a single aten op is found),
and then maybe get_aten_op_for_sample could raise the skip_reason [non-1-1-mapping] right there if it sees more than one op in the graph
- this path is more critical than _discover_aten_op, since it runs once per sample, and each sample can give a different aten op / graph
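A rough sketch of what that upstream check could look like, building on the PR's _CaptureAtenOp; the `captured_ops` attribute and the exception type here are assumptions, not the actual code:

```python
class NonOneToOneMapping(Exception):
    """Signals that an OpInfo sample lowered to more than one aten op."""


def get_aten_op_for_sample(opinfo, sample):
    # Capture every aten call the sample triggers, not just the first one.
    # `captured_ops` is an assumed attribute of the PR's _CaptureAtenOp.
    with _CaptureAtenOp() as probe:
        opinfo.op(sample.input, *sample.args, **sample.kwargs)
    seen = set(probe.captured_ops)
    if len(seen) != 1:
        # Surface the skip reason here so compare_operator can bail out
        # before it tries to validate the sample against a single op.
        raise NonOneToOneMapping(
            f"{opinfo.name}: lowered to {sorted(str(o) for o in seen)}"
        )
    return next(iter(seen))
```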
aten_op = _discover_aten_op(opinfos, device, dtype)
if aten_op is None or not _has_dtensor_support(aten_op):
    if verbose:
        print(f"  ATEN_OP_MAP: {op_name} -> {aten_op} [no_support]")
These should all be logging, not printing!
that's my bad, I noticed this at one point but didn't clean it up. The script is using print consistently, at least. I am supportive of a PR to change it to use logging and make it nicely configurable.
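For reference, a minimal sketch of a logging version of the snippet above; the level choice is an assumption, and op_name / aten_op come from the surrounding loop:

```python
import logging

logger = logging.getLogger(__name__)

# Replaces the `if verbose: print(...)` pattern; verbosity is then
# controlled via the logging level rather than a boolean.
logger.debug("ATEN_OP_MAP: %s -> %s [no_support]", op_name, aten_op)
```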
dtype: torch.dtype = torch.float32,
world_size: int = 2,
max_samples: int | None = None,
verbose: bool = False,
Verboseness here could be refactored as a logging.LEVEL lol
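A sketch of that refactor, assuming the boolean flag stays as the CLI entry point (the helper name is made up):

```python
import logging

def configure_logging(verbose: bool) -> None:
    # Translate the existing boolean into a log level once at startup;
    # a counted -v flag could map onto WARNING/INFO/DEBUG instead.
    logging.basicConfig(
        level=logging.DEBUG if verbose else logging.INFO,
        format="%(levelname)s %(message)s",
    )
```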
dtype,
args.world_size,
args.max_samples,
verbose=True,
… 1-1 OpInfo-aten entries" Taking changes from #176258. One source of sharding validator false positives/negatives has been OpInfo entries which run multiple aten ops underneath. This misleads the aten op to check, the # of inputs/outputs, etc. By default we now only run if the OpInfo-aten op mapping is 1-1, and use the aten op inputs (ignoring the top-level inputs). Alternatively, run with `--exhaustive` to validate ALL underlying aten ops for the OpInfo entry. Also relaxes atol to 1e-3 to reduce false negatives. Authored with Claude. [ghstack-poisoned]
…177595) Taking changes from #176258. One source of sharding validator false positives/negatives has been OpInfo entries which run multiple aten ops underneath. This misleads the aten op to check, the # of inputs/outputs, etc. By default we now only run if the OpInfo-aten op mapping is 1-1, and use the aten op inputs (ignoring the top-level inputs). Alternatively, run with `--allow-composite` to validate ALL underlying aten ops for the OpInfo entry. Authored with Claude. Pull Request resolved: #177595 Approved by: https://github.com/wconstab
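A condensed sketch of the gating the commit message describes; the helper name and signature are assumptions about the script, not its actual API:

```python
def aten_ops_to_validate(captured_aten_ops, allow_composite=False):
    # Default: only validate when the OpInfo entry maps 1-1 onto a single
    # aten op, so the captured op's own inputs/outputs drive the check.
    if len(captured_aten_ops) == 1 or allow_composite:
        return list(captured_aten_ops)
    # Non-1-1 mapping without --allow-composite: skip the entry.
    return []
```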
Stack from ghstack (oldest at bottom):