
[FSDP][Collectives] skipping reduce_scatter when world size is 1 #160136

Closed
anshul-si wants to merge 6 commits into gh/anshul-si/17/base from gh/anshul-si/17/head

Conversation

@anshul-si
Contributor

@anshul-si anshul-si commented Aug 7, 2025

Summary: In its current state, FSDP collectives issue CUDA synchronizations and communication ops regardless of the world size. However, now that replicate will use FSDP, there will be cases where the group size is 1 and these synchronizations and ops run needlessly. I have updated fsdp_collectives to skip the reduce_scatter in the foreach_reduce API when world_size == 1. I have edited a test that uses CommDebugMode to verify that the reduce_scatter is no longer issued, and I also updated an affected test that used 1-way FSDP by adjusting its CommDebugMode assertions. I have also added a test command.
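Conceptually, the change is an early exit before the collective: a reduce-scatter over a single-rank group is an identity, so the local gradient can be used directly and the CUDA synchronization and c10d call are skipped. A minimal sketch of the idea (hypothetical `reduce_grad` helper, not the actual `foreach_reduce` code):

```python
import torch
import torch.distributed as dist


def reduce_grad(grad_flat: torch.Tensor, group: dist.ProcessGroup) -> torch.Tensor:
    """Illustrative sketch of the world_size == 1 fast path."""
    world_size = group.size()
    if world_size == 1:
        # Single rank: there is nothing to reduce or scatter, so skip the
        # collective (and the synchronization it would require) entirely.
        return grad_flat
    # Multi-rank case: each rank receives its 1/world_size shard of the
    # averaged gradient, as in regular FSDP gradient reduction.
    output = grad_flat.new_empty(grad_flat.numel() // world_size)
    dist.reduce_scatter_tensor(output, grad_flat, op=dist.ReduceOp.AVG, group=group)
    return output
```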

Test Cases

  1. pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_single_worldsize1
  2. pytest test/distributed/_composable/test_composability/test_2d_composability.py -k test_tp_with_fsdp_offloading
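For the verification side, the pattern is to run a training step under CommDebugMode and assert that no reduce-scatter was recorded. A rough, self-contained sketch of that check (the CommDebugMode import path is the current public DTensor debug location and may differ by release; the model and input are placeholders):

```python
import torch
from torch.distributed.tensor.debug import CommDebugMode


def assert_no_reduce_scatter(model: torch.nn.Module, inp: torch.Tensor) -> None:
    comm_mode = CommDebugMode()
    with comm_mode:
        model(inp).sum().backward()
    # get_comm_counts() maps collective ops to call counts; with a world size
    # of 1 the backward pass should record no reduce_scatter calls.
    counts = comm_mode.get_comm_counts()
    reduce_scatter_calls = sum(
        n for op, n in counts.items() if "reduce_scatter" in str(op)
    )
    assert reduce_scatter_calls == 0, f"unexpected reduce_scatter calls: {counts}"
```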

Stack from ghstack (oldest at bottom):

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta

@pytorch-bot

pytorch-bot bot commented Aug 7, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160136

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 8 Unrelated Failures

As of commit 421bed3 with merge base e6aa728:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were already present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the ciflow/inductor, oncall: distributed, and release notes: distributed (fsdp) labels Aug 7, 2025
anshul-si added a commit that referenced this pull request Aug 7, 2025
@anshul-si anshul-si requested review from mori360 and weifengpy August 7, 2025 20:41
@anshul-si
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Aug 20, 2025
…vice movements (#160147)

**Summary:** In order to ensure that replicate acts as intended (a specialized version of HSDP), we need to make sure that it can pass the same training tests that fully_shard can. To this end, I have added three test cases: one to test input device movement and the other two to test parameter registration during the forward and backward passes of a model.

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_root_move_forward_input_to_device
2. pytest test/distributed/_composable/test_replicate_training.py -k TestReplicateRegisteredParams

Pull Request resolved: #160147
Approved by: https://github.com/weifengpy
ghstack dependencies: #160135, #160136
@jithunnair-amd
Collaborator

@pytorchbot revert -m "Sorry, but looks like this broke ROCm distributed CI" -c nosignal

@pragupta can provide some more triage details

@jithunnair-amd jithunnair-amd added the ciflow/periodic label Aug 21, 2025
@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Aug 21, 2025
…input device movements (#160147)"

This reverts commit a3a82e3.

Reverted #160147 on behalf of https://github.com/jithunnair-amd due to "Sorry, but looks like this broke ROCm distributed CI" (comment on #160136)
pytorchmergebot added a commit that referenced this pull request Aug 21, 2025
…s 1 (#160136)"

This reverts commit 3d126e1.

Reverted #160136 on behalf of https://github.com/jithunnair-amd due to "Sorry, but looks like this broke ROCm distributed CI" (comment on #160136)
@pytorchmergebot
Collaborator

@anshul-si your PR has been successfully reverted.

@pytorchmergebot pytorchmergebot added the Reverted and ci-no-td labels Aug 21, 2025
anshul-si added a commit to anshul-si/pytorch that referenced this pull request Aug 21, 2025
can-gaa-hou pushed a commit to can-gaa-hou/pytorch that referenced this pull request Aug 22, 2025
…orch#160136)

**Summary:** In its current state, FSDP collectives issue CUDA synchronizations and communication ops regardless of the world size. However, now that replicate will use FSDP, there will be cases where the group size is 1 and these synchronizations and ops run needlessly. I have updated fsdp_collectives to skip the reduce_scatter in the foreach_reduce API when world_size == 1. I have edited a test that uses CommDebugMode to verify that the reduce_scatter is no longer issued, and I also updated an affected test that used 1-way FSDP by adjusting its CommDebugMode assertions. I have also added a test command.

**Test Cases**
1. pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_single_worldsize1
2. pytest test/distributed/_composable/test_composability/test_2d_composability.py -k test_tp_with_fsdp_offloading

Pull Request resolved: pytorch#160136
Approved by: https://github.com/weifengpy
ghstack dependencies: pytorch#160135
anshul-si added a commit to anshul-si/pytorch that referenced this pull request Aug 26, 2025
anshul-si added a commit to anshul-si/pytorch that referenced this pull request Aug 26, 2025
anshul-si added a commit to anshul-si/pytorch that referenced this pull request Sep 2, 2025
anshul-si added a commit to anshul-si/pytorch that referenced this pull request Sep 2, 2025
@anshul-si anshul-si closed this Sep 2, 2025
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
@github-actions github-actions bot deleted the gh/anshul-si/17/head branch October 3, 2025 02:08

Labels

ci-no-td, ciflow/inductor, ciflow/periodic, ciflow/trunk, Merged, oncall: distributed, release notes: distributed (fsdp), Reverted
