[DTensor] Optimize redistribute comms using flattened meshes by wconstab · Pull Request #174630 · pytorch/pytorch

wconstab · 2026-02-09T23:05:11Z

Reland of #172610: same code as previous land except:

includes fix redistribute() handling for finding flattened device mesh dims under compile #173873 (credit @bdhirsh)
includes [dtensor] fix flatten mesh dims arg relative to submesh #173790 (credit @IvanKobzarev)
includes [DTensor] make debugmode print optimized transforminfos #173436
adds disable contextmanager + test

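When a suitable flattened mesh exists, DTensor will find and use it to avoid more costly sequential comms. For reduce comms in particular, this also avoids the risk of different reduction orders causing divergent results. (See [this doc](https://docs.google.com/document/d/1hJsnodQmHfs1QosNgR39HZNiOOzfnZ6bnALqonDpcDs/edit?userstoinvite=rrathaur@redhat.com&sharingaction=manageaccess&role=reader&tab=t.0) for more info.)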

Example: on a (2,2,2) mesh with dims (A,B,C), redistributing from (Psum, Replicate, Psum) -> (Replicate, Replicate, Replicate) originally issued 2 separate all_reduces. After this PR, if the user has flattened dims A and C, this becomes one larger all_reduce.
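To make the example concrete, here is a minimal pure-Python simulation (no torch.distributed involved). The helper `all_reduce_sum` and the rank values are illustrative assumptions, not code from this PR. It checks that two sequential sum all_reduces over dims A and C leave every rank with the same value as one all_reduce over the flattened (A, C) group. Integers are used so the comparison is exact; with floating point the two orders can round differently, which is the divergence risk mentioned above.

```python
from itertools import product

# Simulated (2, 2, 2) mesh with dims (A, B, C); a rank is a coordinate (a, b, c).
MESH_SHAPE = (2, 2, 2)
RANKS = list(product(*(range(n) for n in MESH_SHAPE)))
A, B, C = 0, 1, 2


def all_reduce_sum(values, dims):
    """Simulate a sum all_reduce over the given mesh dims (hypothetical helper).

    Ranks that agree on every coordinate outside `dims` form one group,
    and every rank in a group ends up holding the group's total.
    """
    def group_key(rank):
        return tuple(c for i, c in enumerate(rank) if i not in dims)

    totals = {}
    for rank, value in values.items():
        key = group_key(rank)
        totals[key] = totals.get(key, 0) + value
    return {rank: totals[group_key(rank)] for rank in values}


# Each rank holds a distinct partial value (1..8).
local = {rank: 1 + 4 * rank[0] + 2 * rank[1] + rank[2] for rank in RANKS}

# Original behavior: one all_reduce per Partial mesh dim, run back to back.
sequential = all_reduce_sum(all_reduce_sum(local, {A}), {C})

# With a flattened (A, C) mesh: a single all_reduce over groups of 4 ranks.
flattened = all_reduce_sum(local, {A, C})
```

Both plans produce the same per-rank totals here; the flattened plan simply gets there with one collective instead of two.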

Compared with earlier attempt #172119, this PR:

includes optimization for comms other than all_reduce
explicitly bans mixed partial types: (Psum, Pmax) is not a valid placement, so we don't have to worry about optimizing around it
therefore uses a simpler implementation involving grouping adjacent transforminfos and then merging like kinds
warns once per mesh shape for missing flattened meshes
won't optimize reduce_scatters when they shard an unevenly sized tensor dim
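The "group adjacent transforminfos, then merge like kinds" idea can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: `Step`, `merge_adjacent`, and the `flattened_meshes` set are hypothetical names, and the eligibility restrictions described elsewhere in this PR are not modeled.

```python
from itertools import groupby
from typing import NamedTuple


class Step(NamedTuple):
    """Hypothetical stand-in for a transform info: one collective on one mesh dim."""
    kind: str
    mesh_dim: int


def merge_adjacent(steps, flattened_meshes):
    """Merge runs of adjacent same-kind steps when a flattened mesh exists.

    `flattened_meshes` is a set of ascending mesh-dim tuples that the user
    has flattened. Returns (plan, missing): `plan` is a list of
    (kind, mesh_dims) entries, and `missing` lists the dim tuples whose
    flattened mesh was absent (where a warning could be issued).
    """
    plan, missing = [], []
    # groupby only groups *consecutive* equal keys, so non-adjacent steps
    # of the same kind are never combined.
    for kind, group in groupby(steps, key=lambda s: s.kind):
        dims = tuple(s.mesh_dim for s in group)
        key = tuple(sorted(dims))
        if len(dims) > 1 and key in flattened_meshes:
            plan.append((kind, key))
        else:
            if len(dims) > 1:
                missing.append(key)
            plan.extend((kind, (d,)) for d in dims)
    return plan, missing


steps = [
    Step("all_reduce", 0),
    Step("all_reduce", 2),
    Step("all_gather", 1),
    Step("all_reduce", 3),  # same kind as the first two, but not adjacent
]

merged_plan, merged_missing = merge_adjacent(steps, {(0, 2)})
unmerged_plan, unmerged_missing = merge_adjacent(steps, set())
```

With the (0, 2) flattened mesh available, the first two all_reduces collapse into one entry while the trailing all_reduce on dim 3 stays separate because it is not adjacent. Without the flattened mesh, nothing is merged and (0, 2) is reported as missing.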

Details/Limitations

all_to_all is never merged (left for possible future work, but not obvious how to do it in general)
reduce_scatter is only merged when the outermost partial shape is evenly divisible by the flattened mesh size; otherwise, it warns
reduce_scatter and all_gather are only merged when the shards are in left-to-right (ascending) order, since DeviceMesh only supports flattening in ascending order and the mesh ordering impacts correctness.
groups of like-kind collectives are NOT combined if they are not adjacent in the transform_info list
flattened device-meshes are not automatically created due to preference of explicit creation and ensuring torch.compile works, but warnings prompt the user to create them when it would help allow an optimization
DOES support merging mixed Partial (sum, avg) reductions, using the product of the avg dim sizes to scale after performing a sum reduction on the merged mesh. Refuses to merge any other combinations of mixed partials.
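The mixed (sum, avg) case in the last item can be checked with a small simulation. The sketch below is an illustration under assumed names (`reduce_over_dim`, a (2, 4) mesh with Partial(sum) on the first dim and Partial(avg) on the second); it is not code from this PR. It compares reducing dim by dim against a single sum over the merged mesh followed by division by the product of the avg dim sizes.

```python
from fractions import Fraction
from itertools import product
from math import prod

MESH_SHAPE = (2, 4)          # mesh dims (A, B)
REDUCE_OPS = ("sum", "avg")  # assumed placements: Partial(sum) on A, Partial(avg) on B

ranks = list(product(*(range(n) for n in MESH_SHAPE)))
# Exact rational values so the two plans can be compared with ==.
local = {rank: Fraction(3 * rank[0] + rank[1] + 1) for rank in ranks}


def reduce_over_dim(values, dim, op):
    """Simulate a sum or avg all_reduce along a single mesh dim (hypothetical helper)."""
    def key(rank):
        return rank[:dim] + rank[dim + 1:]

    groups = {}
    for rank, value in values.items():
        groups.setdefault(key(rank), []).append(value)

    def combine(vs):
        total = sum(vs)
        return total / len(vs) if op == "avg" else total

    return {rank: combine(groups[key(rank)]) for rank in values}


# Unmerged plan: one reduction per mesh dim, each with its own op.
sequential = local
for dim, op in enumerate(REDUCE_OPS):
    sequential = reduce_over_dim(sequential, dim, op)

# Merged plan: one sum over the whole flattened mesh, then scale by the
# product of the sizes of the avg dims.
total = sum(local.values())
avg_scale = prod(size for size, op in zip(MESH_SHAPE, REDUCE_OPS) if op == "avg")
merged = {rank: total / avg_scale for rank in ranks}
```

The two plans agree because averaging over a dim is a sum over that dim divided by its size, so the divisions can be deferred and applied once after the merged sum.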

Fixes #171916

Note: the initial attempt used a stable sort with a __lt__ method in TransformInfo comparing the comm type key, but this was not correct because moving a local (no-comm) operation like chunking before or after a comm operation on the same mesh dim affects results.
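A toy simulation shows why reordering matters. It assumes, purely for illustration, a Shard(0) -> Replicate -> Shard(1) sequence on one mesh dim of size 2, modeled as an all_gather followed by a local chunk; the helper names are hypothetical and the tensors are plain nested lists.

```python
# Full 2x2 tensor, initially sharded on tensor dim 0 across 2 ranks.
FULL = [[0, 1], [10, 11]]
shards = {0: [FULL[0]], 1: [FULL[1]]}  # rank -> local rows


def all_gather_rows(local):
    """Simulated all_gather along tensor dim 0: every rank gets all rows in rank order."""
    gathered = [row for rank in sorted(local) for row in local[rank]]
    return {rank: [list(row) for row in gathered] for rank in local}


def chunk_cols(local):
    """Local (no-comm) chunk along tensor dim 1: rank r keeps only column r."""
    return {rank: [[row[rank]] for row in rows] for rank, rows in local.items()}


# Planned order: gather the rows first, then chunk the columns locally.
correct = chunk_cols(all_gather_rows(shards))

# Order a comm-type sort could produce: the local chunk moved ahead of the comm.
reordered = all_gather_rows(chunk_cols(shards))
```

In the planned order each rank ends up with one full column of the tensor. In the reordered version each rank chunks its own row first, so the gather mixes column 0 from rank 0 with column 1 from rank 1 and every rank holds the wrong data.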

Differential Revision: D92540256

pytorch-bot · 2026-02-09T23:05:15Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/174630

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (6 Unrelated Failures)

As of commit c170b91 with merge base 4674618:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

inductor / inductor-cpu-test / test (cpu_inductor_torchbench, 2, 2, linux.2xlarge.amx) (gh) (trunk failure)
pytorch_CycleGAN_and_pix2pix
inductor / inductor-test / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu) (gh) (trunk failure)
pytorch_CycleGAN_and_pix2pix
inductor / inductor-test-cuda13 / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu) (gh) (trunk failure)
pytorch_CycleGAN_and_pix2pix
inductor / unit-test / inductor-test / test (inductor, 1, 2, linux.g5.4xlarge.nvidia.gpu) (gh) (trunk failure)
test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32
trunk / linux-jammy-rocm-py3.10 / test (default, 1, 6, linux.rocm.gpu.gfx950.1) (gh) (trunk failure)
test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_embedding_backward_dynamic_shapes_large_grid_cuda
trunk / macos-py3-arm64 / test (default, 1, 3, macos-m1-stable) (gh) (trunk failure)
test/export/test_schema.py::TestSchema::test_schema_check

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-codesync · 2026-02-09T23:05:18Z

@wconstab has exported this pull request. If you are a Meta employee, you can view the originating Diff in D92540256.

zpcore

LGTM!


facebook-github-bot · 2026-02-10T21:41:59Z

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

pytorchmergebot · 2026-02-10T21:44:14Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorch-bot Bot added ciflow/inductor release notes: distributed (dtensor) release notes category labels Feb 9, 2026

meta-codesync Bot added fb-exported meta-exported labels Feb 9, 2026

zpcore approved these changes Feb 9, 2026


pytorch-bot Bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Feb 9, 2026

facebook-github-bot force-pushed the export-D92540256 branch from 870de79 to c170b91 on February 10, 2026 14:34

pytorchmergebot added the merging label Feb 10, 2026

pytorchmergebot added the Merged label Feb 10, 2026

pytorchmergebot closed this in 87ddb5d Feb 10, 2026

pytorchmergebot removed the merging label Feb 10, 2026

github-actions Bot deleted the export-D92540256 branch March 13, 2026 02:23

wconstab wants to merge 1 commit into main from export-D92540256
