[DTensor] Strategy Validation (2/3): partial input creation and validation engine #174799
wconstab wants to merge 15 commits into gh/wconstab/529/base
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/174799
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (4 Unrelated Failures) As of commit a0da973 with merge base 003e05b.
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```python
    ground_truth: torch.Tensor,
    world_size: int = 2,
    mesh=None,
) -> tuple[bool, str]:
```
if you want to delete what's introduced in #172990, that's fine with me
i will look into it later; it's a good idea to unify the codepaths
```python
    is_valid,
    "Expected True (false positive) for all-zero output, showing "
    "why compare_operator must skip such samples",
)
```
generally wondering if there can be fewer tests, but looks fine
i'm partially sympathetic: this is a lot of LOC. but i also feel it's a questionable use of time to prove that deleting one test is safe by verifying it is covered by another test, and this whole test suite runs in a couple of seconds, so i'm probably going to ignore this
ok i went ahead and removed ones that were pretty easy to argue were covered by the exhaustive test. also added P(min) to the exhaustive test in service of this.
…n and validation engine" Adds the validation engine that tests whether a sharding rule is correct by simulating distributed execution on a single machine using LocalTensor. For each placement combination, it creates local tensors that would reduce to the original (e.g., for P(sum), splits values across ranks so they sum back), runs the op on those local tensors, wraps the output as a DTensor, redistributes to Replicate, and compares against ground truth. The main challenge is avoiding false positives where a rule appears valid on a specific input but is actually incorrect. Several techniques are used: - Asymmetric splits for P(sum)/P(avg): instead of splitting evenly (tensor/2 per rank), uses a 60/40 ratio (varied by tensor index) so that ops which are not truly linear don't accidentally produce matching outputs. - Sign-varying offsets for P(sum)/P(avg): adds an offset that alternates sign across elements, so local tensors have mixed positive and negative values. Without this, proportional splits preserve the sign pattern of the original tensor, causing non-linear ops like abs to falsely validate P(sum)->P(sum). - Distinct magnitudes for P(min) vs P(max): P(min) offsets non-holding ranks by +0.7 while P(max) offsets by -1.3. Using different magnitudes prevents accidental cancellation when min and max placements appear in the same combination. - Alternating rank ownership for P(min)/P(max): a mask alternating by element (shifted by tensor index) controls which rank holds the true value vs the offset value. This prevents the degenerate case where one rank always holds all true values. Authored with Claude. [ghstack-poisoned]
…n and validation engine" Adds the validation engine that tests whether a sharding rule is correct by simulating distributed execution on a single machine using LocalTensor. For each placement combination, it creates local tensors that would reduce to the original (e.g., for P(sum), splits values across ranks so they sum back), runs the op on those local tensors, wraps the output as a DTensor, redistributes to Replicate, and compares against ground truth. The main challenge is avoiding false positives where a rule appears valid on a specific input but is actually incorrect. Several techniques are used: - Asymmetric splits for P(sum)/P(avg): instead of splitting evenly (tensor/2 per rank), uses a 60/40 ratio (varied by tensor index) so that ops which are not truly linear don't accidentally produce matching outputs. - Sign-varying offsets for P(sum)/P(avg): adds an offset that alternates sign across elements, so local tensors have mixed positive and negative values. Without this, proportional splits preserve the sign pattern of the original tensor, causing non-linear ops like abs to falsely validate P(sum)->P(sum). - Distinct magnitudes for P(min) vs P(max): P(min) offsets non-holding ranks by +0.7 while P(max) offsets by -1.3. Using different magnitudes prevents accidental cancellation when min and max placements appear in the same combination. - Alternating rank ownership for P(min)/P(max): a mask alternating by element (shifted by tensor index) controls which rank holds the true value vs the offset value. This prevents the degenerate case where one rank always holds all true values. Authored with Claude. [ghstack-poisoned]
…n and validation engine" Adds the validation engine that tests whether a sharding rule is correct by simulating distributed execution on a single machine using LocalTensor. For each placement combination, it creates local tensors that would reduce to the original (e.g., for P(sum), splits values across ranks so they sum back), runs the op on those local tensors, wraps the output as a DTensor, redistributes to Replicate, and compares against ground truth. The main challenge is avoiding false positives where a rule appears valid on a specific input but is actually incorrect. Several techniques are used: - Asymmetric splits for P(sum)/P(avg): instead of splitting evenly (tensor/2 per rank), uses a 60/40 ratio (varied by tensor index) so that ops which are not truly linear don't accidentally produce matching outputs. - Sign-varying offsets for P(sum)/P(avg): adds an offset that alternates sign across elements, so local tensors have mixed positive and negative values. Without this, proportional splits preserve the sign pattern of the original tensor, causing non-linear ops like abs to falsely validate P(sum)->P(sum). - Distinct magnitudes for P(min) vs P(max): P(min) offsets non-holding ranks by +0.7 while P(max) offsets by -1.3. Using different magnitudes prevents accidental cancellation when min and max placements appear in the same combination. - Alternating rank ownership for P(min)/P(max): a mask alternating by element (shifted by tensor index) controls which rank holds the true value vs the offset value. This prevents the degenerate case where one rank always holds all true values. Authored with Claude. [ghstack-poisoned]
…n and validation engine" Adds the validation engine that tests whether a sharding rule is correct by simulating distributed execution on a single machine using LocalTensor. For each placement combination, it creates local tensors that would reduce to the original (e.g., for P(sum), splits values across ranks so they sum back), runs the op on those local tensors, wraps the output as a DTensor, redistributes to Replicate, and compares against ground truth. The main challenge is avoiding false positives where a rule appears valid on a specific input but is actually incorrect. Several techniques are used: - Asymmetric splits for P(sum)/P(avg): instead of splitting evenly (tensor/2 per rank), uses a 60/40 ratio (varied by tensor index) so that ops which are not truly linear don't accidentally produce matching outputs. - Sign-varying offsets for P(sum)/P(avg): adds an offset that alternates sign across elements, so local tensors have mixed positive and negative values. Without this, proportional splits preserve the sign pattern of the original tensor, causing non-linear ops like abs to falsely validate P(sum)->P(sum). - Distinct magnitudes for P(min) vs P(max): P(min) offsets non-holding ranks by +0.7 while P(max) offsets by -1.3. Using different magnitudes prevents accidental cancellation when min and max placements appear in the same combination. - Alternating rank ownership for P(min)/P(max): a mask alternating by element (shifted by tensor index) controls which rank holds the true value vs the offset value. This prevents the degenerate case where one rank always holds all true values. Authored with Claude. [ghstack-poisoned]
…n and validation engine" Adds the validation engine that tests whether a sharding rule is correct by simulating distributed execution on a single machine using LocalTensor. For each placement combination, it creates local tensors that would reduce to the original (e.g., for P(sum), splits values across ranks so they sum back), runs the op on those local tensors, wraps the output as a DTensor, redistributes to Replicate, and compares against ground truth. The main challenge is avoiding false positives where a rule appears valid on a specific input but is actually incorrect. Several techniques are used: - Asymmetric splits for P(sum)/P(avg): instead of splitting evenly (tensor/2 per rank), uses a 60/40 ratio (varied by tensor index) so that ops which are not truly linear don't accidentally produce matching outputs. - Sign-varying offsets for P(sum)/P(avg): adds an offset that alternates sign across elements, so local tensors have mixed positive and negative values. Without this, proportional splits preserve the sign pattern of the original tensor, causing non-linear ops like abs to falsely validate P(sum)->P(sum). - Distinct magnitudes for P(min) vs P(max): P(min) offsets non-holding ranks by +0.7 while P(max) offsets by -1.3. Using different magnitudes prevents accidental cancellation when min and max placements appear in the same combination. - Alternating rank ownership for P(min)/P(max): a mask alternating by element (shifted by tensor index) controls which rank holds the true value vs the offset value. This prevents the degenerate case where one rank always holds all true values. Authored with Claude. [ghstack-poisoned]
```python
# But NOT:
# - R + P(sum) -> P(sum) for add (R gets added on each rank, then summed)
VALID_RULES = {
    torch.add: [
```
are we leaving out avg intentionally? from my understanding, the only time it was failing was when we had an empty scalar?
i forgot whether we said we don't care about avg in general. i think someone proposed we should delete it, but i should probably just add it for now.
i ended up pulling both Pavg and Pmin into a special branch for TEST_WITH_SLOW since they add considerable runtime and i don't think they are that important for iterative development, but they will at least run in CI
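A minimal sketch of that gating pattern (TEST_WITH_SLOW is the existing flag in torch.testing._internal.common_utils; the list contents here are illustrative, not the PR's actual code):

```python
from torch.testing._internal.common_utils import TEST_WITH_SLOW
from torch.distributed.tensor import Partial

# Fast iterative runs exercise P(sum)/P(max); P(avg)/P(min) add considerable
# runtime, so they only run when the slow-test flag is set (e.g. in CI).
partial_placements = [Partial("sum"), Partial("max")]
if TEST_WITH_SLOW:
    partial_placements += [Partial("avg"), Partial("min")]
```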
| "S(1),S(1)->S(1)", | ||
| # Partial sum * Replicate -> Partial sum (multiplicative linearity) | ||
| # r * (p0+p1) = r*p0 + r*p1 where pi are per-rank | ||
| "P(sum),R->P(sum)", |
again not sure if avg is being intentionally left out, but it should work here as well
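A quick numeric check of the multiplicative-linearity identity the excerpt states (an ad-hoc sketch; `p0`/`p1` stand in for the per-rank P(sum) shards):

```python
import torch

x = torch.randn(4)          # global value of the P(sum) operand
r = torch.randn(4)          # replicated operand
p0, p1 = 0.6 * x, 0.4 * x   # per-rank shards that sum back to x
# mul distributes over the sum reduction: r*(p0+p1) == r*p0 + r*p1,
# which is why P(sum),R -> P(sum) is a valid rule for mul.
assert torch.allclose(r * x, r * p0 + r * p1)
```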
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Merge failed. Reason: 1 job has failed, first few of them are: trunk / win-vs2022-cuda12.8-py3 / build. Details for Dev Infra team: raised by workflow job
@pytorchbot revert -m "Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable" -c autorevert
This PR is attributed to have caused a regression in: Please investigate and fix the issues.
Claude finished @pytorch-auto-revert[bot]'s task in 53s. View job: Analyzing Revert Request
Diagnosis: The revert is NOT caused by code in this PR. The CI failure is a lintrunner infrastructure issue, not a problem with the PR's code changes. The error from the logs:
Explanation: This error indicates that the pyrefly type checker itself produced malformed JSON output that the lintrunner couldn't parse. The error "Extra data: line 16 column 2 (char 542)" means pyrefly emitted multiple JSON objects instead of a single valid JSON array/object. This is NOT a legitimate reason for reverting this PR because:
Recommendation for the Author: Since this is an infrastructure issue rather than a code issue, the author should:
The code changes in this PR are unrelated to the pyrefly JSON parsing failure.
@pytorchbot successfully started a revert job. Check the current status here.
…nd validation engine (#174799)" This reverts commit 701313b. Reverted #174799 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](#174799 (comment)))
@wconstab your PR has been successfully reverted.
Starting merge as part of PR stack under #174800
… and CLI (#174800)

Adds the orchestrator (compare_operator) that ties everything together: queries DTensor for its claimed sharding rules via three strategy paths (single-dim, op_strategy, decomp), computes ground truth validity for each placement combination, and reports discrepancies (incorrect rules and missing rules). Includes false positive mitigations (sign negation for P(min)/P(max), non-rounded variants for rounding_mode ops), a CLI entry point for running validation on individual ops or all registered ops, and end-to-end tests.

### Example Usage:
`python -m torch.distributed.tensor._ops.strategy_validation --op add,mul --max 1 --show-repro`

```
Testing ops: aten.add, aten.mul
Device: cuda, Dtype: torch.float32, World size: 2

[1/2] aten.add — Samples: 1, Combinations: 120
----------------------------------------------------------------------
Possibly missing (valid in ground truth but no DTensor rule) [aten.add.Tensor]
  P(avg), R -> P(avg)  Repro: self=tensor(-6.7103, device='cuda:0'), other=tensor(2.1750, device='cuda:0')
  P(max), R -> P(max)  Repro: self=tensor(-6.7103, device='cuda:0'), other=tensor(2.1750, device='cuda:0')
  P(min), R -> P(min)  Repro: self=tensor(-6.7103, device='cuda:0'), other=tensor(2.1750, device='cuda:0')
  R, P(avg) -> P(avg)  Repro: self=tensor(-6.7103, device='cuda:0'), other=tensor(2.1750, device='cuda:0')
  R, P(max) -> P(max)  Repro: self=tensor(-6.7103, device='cuda:0'), other=tensor(2.1750, device='cuda:0')
  R, P(min) -> P(min)  Repro: self=tensor(-6.7103, device='cuda:0'), other=tensor(2.1750, device='cuda:0')

[2/2] aten.mul — Samples: 1, Combinations: 120
----------------------------------------------------------------------
Possibly missing (valid in ground truth but no DTensor rule) [aten.mul.Tensor]
  R, P(avg) -> P(avg)  Repro: self=tensor(-6.7103, device='cuda:0'), other=tensor(2.1750, device='cuda:0')
  R, P(sum) -> P(sum)  Repro: self=tensor(-6.7103, device='cuda:0'), other=tensor(2.1750, device='cuda:0')

======================================================================
Summary
======================================================================
Op        Correct  Incorrect  Missing  Time
---------------------------------------------
aten.add        2          0        6  1.9s
aten.mul        2          0        2  1.6s
---------------------------------------------
Total           4          0        8  3.5s
```

### Basic design:
(design diagram: https://github.com/user-attachments/assets/2aa61698-1816-41d8-8923-fa24cc104365)

**DTensor Incorrect** should be reliably detected: any report of incorrect by this tool should be a DTensor bug.

**DTensor Missing** rules is inherently less reliable:
- these are detected by finding cases where a particular placement gives correct outputs, and this can be data-dependent (can trigger false positives) - e.g. if the sample input was all 0, we could expect any partial placement to work.
- This PR already includes significant work towards de-noising partials - it creates local values that are not equal to each other but reduce to the correct global value. This weeds out many false positives in my limited testing.
- This de-noising infra can continue to be enhanced as new cases are encountered.
### CLI: `python -m torch.distributed.tensor._ops.strategy_validation -h`

```
usage: strategy_validation.py [-h] [--op OP] [--all-registered] [--incorrect-only]
                              [--device DEVICE] [--dtype DTYPE] [--world-size WORLD_SIZE]
                              [--max-samples MAX_SAMPLES] [--show-repro [N]]

Compare DTensor rules against ground truth

options:
  -h, --help            show this help message and exit
  --op OP               Operator name(s) to compare (comma-separated, supports glob
                        patterns, e.g., "relu,add" or "nn.functional.*")
  --all-registered      Test all ops with DTensor sharding rules registered
  --incorrect-only      Only test DTensor's claimed rules (faster, skips missing detection)
  --device DEVICE       Device to use
  --dtype DTYPE         Dtype to use
  --world-size WORLD_SIZE
                        Simulated world size
  --max-samples MAX_SAMPLES
                        Max samples to test
  --show-repro [N]      Show N sample repros per rule (default 1 if flag given, -1 for all)
```

Authored with Claude.
Pull Request resolved: #174800
Approved by: https://github.com/weifengpy, https://github.com/zpcore
ghstack dependencies: #174799
Support multi-output ops like split, unbind, topk, sort. Tested for these ops and things look reasonable (not an exhaustive test of all multi-output ops):
- unbind: 0 true positives because its strategy unshards the unbind dimension, so all non-trivial rules involve Replicate inputs → skipped. This is correct behavior (the validator only tests non-fully-replicated combos).
- topk: 14 true positives, 0 false positives
- sort: 102 true positives, 0 false positives
- split_with_sizes: 24 true positives, 0 false positives
- chunk: 18 true positives, 0 false positives

No unexpected issues with any of the multi-output operators. The implementation handles all of them correctly: single-output and multi-output ops with varying tuple sizes (unbind's dynamic N outputs, topk/sort's 2-element tuples, split's variable chunks).

Pull Request resolved: #174995
Approved by: https://github.com/pianpwk, https://github.com/zpcore
ghstack dependencies: #174799, #174800
…ation engine (pytorch#174799)
Pull Request resolved: pytorch#174799
Approved by: https://github.com/weifengpy, https://github.com/pianpwk, https://github.com/zpcore
Stack from ghstack (oldest at bottom):
Adds the validation engine that tests whether a sharding rule is correct
by simulating distributed execution on a single machine using LocalTensor.
For each placement combination, it creates local tensors that would
reduce to the original (e.g., for P(sum), splits values across ranks so
they sum back), runs the op on those local tensors, wraps the output as
a DTensor, redistributes to Replicate, and compares against ground
truth.
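As a minimal sketch of that flow for the P(sum) case (assuming world_size=2 and a unary op; `validate_psum_rule` is a hypothetical name, and the real engine goes through LocalTensor and DTensor redistribution rather than plain tensors):

```python
import torch

def validate_psum_rule(op, x: torch.Tensor, world_size: int = 2) -> bool:
    """Check whether `op` preserves P(sum): split x into per-rank locals
    that sum back to x, apply op per rank, reduce the local outputs by
    sum (the claimed output placement), and compare with op(x)."""
    # Naive even split: each rank holds x / world_size (sums back to x).
    locals_ = [x / world_size for _ in range(world_size)]
    ground_truth = op(x)
    # Simulate the claimed rule P(sum) -> P(sum): sum the local outputs.
    reduced = torch.stack([op(l) for l in locals_]).sum(dim=0)
    return torch.allclose(reduced, ground_truth)

# torch.neg is linear, so P(sum) -> P(sum) validates:
assert validate_psum_rule(torch.neg, torch.randn(8))
# torch.abs is NOT linear, yet the naive proportional split lets it pass,
# since abs(x/2) + abs(x/2) == abs(x); the mitigations below fix this.
assert validate_psum_rule(torch.abs, torch.randn(8))
```

The final assert illustrates the false-positive problem that the techniques below are designed to defeat.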
The main challenge is avoiding false positives where a rule appears
valid on a specific input but is actually incorrect. Several techniques
are used:
- Asymmetric splits for P(sum)/P(avg): instead of splitting evenly
  (tensor/2 per rank), uses a 60/40 ratio (varied by tensor index) so
  that ops which are not truly linear don't accidentally produce
  matching outputs (see the P(sum)/P(avg) sketch after this list).
- Sign-varying offsets for P(sum)/P(avg): adds an offset that
  alternates sign across elements, so local tensors have mixed positive
  and negative values. Without this, proportional splits preserve the
  sign pattern of the original tensor, causing non-linear ops like abs
  to falsely validate P(sum)->P(sum).
- Distinct magnitudes for P(min) vs P(max): P(min) offsets non-holding
  ranks by +0.7 while P(max) offsets by -1.3. Using different
  magnitudes prevents accidental cancellation when min and max
  placements appear in the same combination.
- Alternating rank ownership for P(min)/P(max): a mask alternating by
  element (shifted by tensor index) controls which rank holds the true
  value vs the offset value. This prevents the degenerate case where
  one rank always holds all true values (see the P(min)/P(max) sketch
  after this list).
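A sketch of the P(sum)/P(avg) input-creation tricks under the same two-rank assumption (`psum_locals` and the exact constants are illustrative, not the PR's code):

```python
import torch

def psum_locals(x: torch.Tensor, tensor_idx: int = 0) -> list[torch.Tensor]:
    """Create two per-rank locals that sum back to x exactly, using an
    asymmetric (roughly 60/40) split plus a sign-alternating offset."""
    ratio = 0.6 + 0.01 * tensor_idx  # asymmetric split, varied per tensor
    # Offset alternates sign element-by-element; it is added on rank 0 and
    # subtracted on rank 1, so it cancels under the sum reduction.
    idx = torch.arange(x.numel(), device=x.device).reshape(x.shape)
    offset = torch.where(idx % 2 == 0, 1.0, -1.0) * 0.5
    local0 = ratio * x + offset
    local1 = (1.0 - ratio) * x - offset
    return [local0, local1]

x = torch.tensor([0.3, -0.7, 1.1, -0.2])
l0, l1 = psum_locals(x)
assert torch.allclose(l0 + l1, x)  # still reduces to the global value
# abs no longer falsely validates: |l0| + |l1| != |x| in general.
assert not torch.allclose(l0.abs() + l1.abs(), x.abs())
```

And a sketch of the P(min)/P(max) tricks (again with illustrative names; the +0.7 / -1.3 offsets follow the description above):

```python
import torch

def pminmax_locals(x: torch.Tensor, kind: str, tensor_idx: int = 0):
    """Create two per-rank locals whose elementwise min (or max) is x.
    An alternating mask decides which rank holds the true value; the other
    rank is pushed away from the reduction (+0.7 for min, -1.3 for max)."""
    offset = 0.7 if kind == "min" else -1.3
    idx = torch.arange(x.numel(), device=x.device).reshape(x.shape)
    mask = (idx + tensor_idx) % 2 == 0  # alternating rank ownership
    local0 = torch.where(mask, x, x + offset)
    local1 = torch.where(mask, x + offset, x)
    return [local0, local1]

x = torch.tensor([0.3, -0.7, 1.1, -0.2])
l0, l1 = pminmax_locals(x, "min")
assert torch.equal(torch.minimum(l0, l1), x)  # min-reduces back to x
l0, l1 = pminmax_locals(x, "max")
assert torch.equal(torch.maximum(l0, l1), x)  # max-reduces back to x
```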
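Because the offset magnitudes differ (+0.7 vs -1.3) and the ownership mask shifts with the tensor index, combinations that mix P(min) and P(max) operands cannot cancel each other out by coincidence.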
Authored with Claude.