
[DTensor] make debugmode print optimized transforminfos#173436

Closed
wconstab wants to merge 9 commits into gh/wconstab/506/base from gh/wconstab/506/head

Conversation

Contributor

@wconstab wconstab commented Jan 26, 2026

[ghstack-poisoned]
@pytorch-bot

pytorch-bot Bot commented Jan 26, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/173436

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 6 Unrelated Failures

As of commit 1100b4f with merge base 7754b55:

NEW FAILURE - The following job has failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wconstab added a commit that referenced this pull request Jan 26, 2026
ghstack-source-id: 5fd1002
Pull Request resolved: #173436
# Remove mesh dims in order (from shard_order_dict perspective)
for _ in mesh_dims_to_update:
    shard_order_dict[src_dim].pop()
Member

Can we add an assert that x = shard_order_dict[src_dim].pop() satisfies x in mesh_dims_to_update? Same for dst_dim_placement. Just to be safe that the optimized transforminfo is correct.

Contributor Author

done

cur_placement[transform_info.mesh_dim] = dst_dim_placement
# Add mesh dims in order
for mesh_dim in mesh_dims_to_update:
    shard_order_dict[dst_dim].append(mesh_dim)
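A minimal, self-contained sketch of the bookkeeping being discussed (hypothetical names, not the actual DTensor code): shard_order_dict is modeled as {tensor_dim: [mesh_dim, ...]}, with the assert suggested above added to the pop loop.

```python
# Minimal sketch (not the actual DTensor implementation): move a group of
# mesh dims from src_dim's shard order to dst_dim's, asserting that every
# popped mesh dim is one we expected to move.
def move_mesh_dims(shard_order_dict, src_dim, dst_dim, mesh_dims_to_update):
    # Remove mesh dims in order (from shard_order_dict perspective)
    for _ in mesh_dims_to_update:
        x = shard_order_dict[src_dim].pop()
        assert x in mesh_dims_to_update, f"unexpected mesh dim {x}"
    # Add mesh dims in order
    for mesh_dim in mesh_dims_to_update:
        shard_order_dict.setdefault(dst_dim, []).append(mesh_dim)
    return shard_order_dict

print(move_mesh_dims({0: [1, 2]}, src_dim=0, dst_dim=1,
                     mesh_dims_to_update=[2, 1]))
```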
Member

The order here is related to #172610 (comment). Let's address that one first.

Member

Nit: suggest using something like '-->' (double dash) to show that these are optimized transforms.

Contributor Author

done

wconstab added a commit that referenced this pull request Jan 27, 2026
ghstack-source-id: 3f4a9fd
Pull Request resolved: #173436
wconstab added a commit that referenced this pull request Jan 27, 2026
ghstack-source-id: d792492
Pull Request resolved: #173436
self.assertExpectedInline(
    trace_str,
    """S(0)[0]S(0)[1]_S(0, 3)->S(0)[0]S(0)[1]R->S(0)[0]RR->RRR->RS(0)[1]R->RS(0)[1]S(0)[2]""",  # noqa: B950
    """S(0)[0]S(0)[1]_S(0, 3)->S(0)[0]S(0)[1]R->S(0)RR->RRR->RS(0)R->RS(0)[0]S(0)[1]""",  # noqa: B950
Contributor Author

@zpcore do you buy this explanation?

● I see the issue. The old code had a bug where it didn't handle _StridedShard in the pop logic (since _StridedShard.is_shard() returns
False). This caused it to produce an incorrect shard_order, and the test's expected traces were written to match that incorrect
behavior.

Now that I've fixed the code to properly handle _StridedShard, the correct output differs from the expected. I need to update the
test's expected traces to match the correct behavior.

Let me run the test with EXPECTTEST_ACCEPT=1 to see what all the correct traces should be.
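For reference, expecttest's accept mode is driven by an environment variable; a minimal invocation might look like this (the test file path below is illustrative, not taken from this PR):

```shell
# Rewrites the assertExpectedInline strings in the test file in place
# to match the actual output. Substitute the real test file path.
EXPECTTEST_ACCEPT=1 python test/distributed/tensor/debug/test_debug_mode.py
```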

Member

@zpcore zpcore Jan 27, 2026

I see the issue: I didn't add the order string for _StridedShard in the output...

The fix from Claude still has some missing order for _StridedShard.

Contributor Author

The fix from Claude still has some missing order for _StridedShard.

can you say more about this?

Contributor Author

oh- just that we don't have the [i] after _S in the string repr. fixing that.

wconstab added a commit that referenced this pull request Jan 27, 2026
ghstack-source-id: 84d1a92
Pull Request resolved: #173436
@wconstab wconstab changed the title WIP fix debugmode issue [DTensor] make debugmode print optimized transforminfos Jan 27, 2026
wconstab added a commit that referenced this pull request Jan 27, 2026
ghstack-source-id: 5bf4130
Pull Request resolved: #173436
Member

@zpcore zpcore left a comment

I think we can fix the _StridedShard order later, since _StridedShard is not that important now.

wconstab added a commit that referenced this pull request Jan 27, 2026
ghstack-source-id: 8179eac
Pull Request resolved: #173436
wconstab added a commit that referenced this pull request Jan 28, 2026
ghstack-source-id: 142eec6
Pull Request resolved: #173436
wconstab added a commit that referenced this pull request Jan 28, 2026
ghstack-source-id: 5b87217
Pull Request resolved: #173436
wconstab added a commit that referenced this pull request Jan 28, 2026
ghstack-source-id: f265a09
Pull Request resolved: #173436
@wconstab
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot Bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jan 28, 2026
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / win-vs2022-cpu-py3 / build

Details for Dev Infra team Raised by workflow job

@wconstab
Contributor Author

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 1 checks: trunk / win-vs2022-cpu-py3 / build

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

kapilsh pushed a commit to kapilsh/pytorch that referenced this pull request Feb 2, 2026
@facebook-github-bot
Contributor

@pytorchbot revert -m="Diff reverted internally" -c="ghfirst"

This Pull Request has been reverted by a revert inside Meta. To re-land this change, please open another pull request, assign the same reviewers, fix the CI failures that caused the revert, and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk).

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Feb 3, 2026

This reverts commit 47260be.

Reverted #173436 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](#173436 (comment)))
@pytorchmergebot
Collaborator

@wconstab your PR has been successfully reverted.

@pytorchmergebot pytorchmergebot added Reverted ci-no-td Do not run TD on this PR labels Feb 3, 2026
@wconstab
Contributor Author

wconstab commented Feb 9, 2026

relanding via #174630

@wconstab wconstab closed this Feb 9, 2026
pytorchmergebot pushed a commit that referenced this pull request Feb 10, 2026
Reland of #172610: same code as previous land except:
- includes #173873 (credit @bdhirsh)
- includes #173790 (credit @IvanKobzarev)
- includes #173436
- adds disable contextmanager + test

Ensures that when possible (when such a flattened mesh exists), DTensor will find and use it to avoid more costly sequential comms, and particularly for reduce comms, also avoids the risk of different reduction orders causing divergent results. (See [this doc](https://docs.google.com/document/d/1hJsnodQmHfs1QosNgR39HZNiOOzfnZ6bnALqonDpcDs/edit?userstoinvite=rrathaur@redhat.com&sharingaction=manageaccess&role=reader&tab=t.0) for more info.)

Example: For a (2,2,2) mesh with dims (A,B,C) and placements
when redistributing from (Psum, Replicate, Psum) -> (Replicate,
Replicate, Replicate) - the original behavior would be 2 separate
all_reduces.  After this PR, if the user flattens dims A,C, this becomes
one larger all_reduce.
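The equivalence behind that example can be checked in plain Python (no torch; a toy model where each rank on a (2, 2, 2) mesh holds one partial value): summing over mesh dims A and C separately gives the same result as one sum over the flattened (A, C) group, which is why the two all_reduces can collapse into one.

```python
# Toy model of all_reduce on a (2, 2, 2) mesh: each rank (a, b, c) holds a
# partial value; reducing over A then C must equal one reduce over (A, C).
from itertools import product

local = {(a, b, c): a * 4 + b * 2 + c for a, b, c in product(range(2), repeat=3)}

def reduce_over(vals, dims):
    # all_reduce over `dims`: every rank ends with the sum over those mesh dims
    sums = {}
    for coord, v in vals.items():
        key = tuple(c for i, c in enumerate(coord) if i not in dims)
        sums[key] = sums.get(key, 0) + v
    return {coord: sums[tuple(c for i, c in enumerate(coord) if i not in dims)]
            for coord in vals}

two_step = reduce_over(reduce_over(local, {0}), {2})  # reduce over A, then C
one_step = reduce_over(local, {0, 2})                 # one reduce over flattened (A, C)
assert two_step == one_step
```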

Compared with earlier attempt #172119, this PR
- includes optimization for comms other than all_reduce
- explicitly bans mixed partial types: (Psum, Pmax) is not a valid placement, so we don't have to worry about optimizing around it
- therefore uses a simpler implementation involving grouping adjacent transforminfos and then merging like kinds
- Warns once per mesh shape for missing flattened meshes
- Won't optimize reduce_scatters when they shard an uneven sized tensor dim
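The "grouping adjacent transforminfos and then merging like kinds" step can be sketched with itertools.groupby (names hypothetical, not the actual DTensor classes): adjacent transform infos with the same comm kind collapse into one entry over the combined mesh dims.

```python
# Sketch: merge runs of adjacent like-kind collectives. Non-adjacent groups
# are intentionally NOT combined, matching the limitation described above.
from itertools import groupby

transforms = [
    ("all_reduce", "A"),
    ("all_reduce", "C"),
    ("all_gather", "B"),
]

merged = []
for kind, group in groupby(transforms, key=lambda t: t[0]):
    dims = tuple(dim for _, dim in group)
    # one collective over the flattened mesh dims instead of several
    merged.append((kind, dims))

print(merged)  # -> [('all_reduce', ('A', 'C')), ('all_gather', ('B',))]
```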

Details/Limitations
- all_to_all is never merged (left for possible future work, but not obvious how to do it in general)
- reduce_scatter is only merged when the outermost partial shape is evenly divisible by the flattened mesh - otherwise, warns
- reduce_scatter and all_gather are only merged when the shards are in left-to-right (ascending) order, since DeviceMesh only supports flattening in ascending order and the mesh ordering impacts correctness.
- groups of like-kind collectives are NOT combined if they are not adjacent in the transform_info list
- flattened device-meshes are not automatically created due to preference of explicit creation and ensuring torch.compile works, but warnings prompt the user to create them when it would help allow an optimization
- DOES support merging mixed Partial (sum, avg) reductions, using the product of the avg dim sizes to scale after performing a sum reduction on the merged mesh.  Refuses to merge any other combinations of mixed partials.
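The mixed Partial(sum)/Partial(avg) rule above can be checked numerically (an illustration of the description, not the actual implementation): averaging over a mesh dim of size n is summing then dividing by n, so one sum-reduction over the merged mesh followed by division by the product of the avg dims' sizes matches the sequential result.

```python
# Check: merged sum-reduce scaled by the product of the avg dim sizes equals
# sequential avg/sum reductions, for equal partial values on every rank.
import math

avg_dim_sizes = [2, 4]   # mesh dims that carried Partial(avg)
sum_dim_sizes = [3]      # mesh dims that carried Partial(sum)
ranks = math.prod(avg_dim_sizes + sum_dim_sizes)

local_value = 5.0
merged_sum = local_value * ranks              # one all_reduce(sum) on merged mesh
result = merged_sum / math.prod(avg_dim_sizes)

# sequential reference: avg over each avg dim, then sum over each sum dim
ref = local_value
for n in avg_dim_sizes:
    ref = (ref * n) / n                       # avg of n equal values
for n in sum_dim_sizes:
    ref = ref * n                             # sum of n equal values
assert result == ref                          # both give 15.0
```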

Fixes #171916

Note: an initial attempt used a stable sort with a __lt__ method on TransformInfo comparing the comm type key, but this was not correct because sorting a local (no-comm) operation like chunking before or after a comm operation on the same mesh dim affects results.

Differential Revision: D92540256

Pull Request resolved: #174630
Approved by: https://github.com/zpcore
radeksm pushed a commit to radeksm/pytorch that referenced this pull request Feb 20, 2026
libohao1201 pushed a commit to libohao1201/pytorch that referenced this pull request Mar 2, 2026
@github-actions github-actions Bot deleted the gh/wconstab/506/head branch March 12, 2026 02:22

Labels

ci-no-td (Do not run TD on this PR), ciflow/inductor, ciflow/trunk (Trigger trunk jobs on your pull request), Merged, release notes: distributed (dtensor), Reverted
