
Use SymmMem for reduce-scatter in FSDP #177111

Closed
kwen2501 wants to merge 5 commits into gh/kwen2501/325/base from gh/kwen2501/325/head

Conversation

kwen2501 (Collaborator) commented Mar 11, 2026

Stack from ghstack (oldest at bottom):

Summary

This change enables symmetric memory optimizations for reduce-scatter collectives, matching the behavior already available for all-gather.

Changes

1. **Added `SymmMemReduceScatter` class**: similar to `SymmMemAllGather`, this class:
   - Allocates tensors from the symmetric memory pool (via `SymmMemAllocMixin`)
   - Rendezvouses both input and output tensors before calling `reduce_scatter_tensor`, which allows NCCL to detect symmetric memory tensors and use the optimized symmetric kernel

2. **Updated `set_symm_mem()`**: now sets both all-gather and reduce-scatter to use symmetric memory implementations when `set_symm_mem_for_comm()` is called.

3. **Enhanced testing**: added a parametrized test to verify that the `ReduceOp.SUM` reduction mode works with symmetric memory.
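The allocate-then-rendezvous flow in item 1 can be sketched in plain Python. This is a toy model only: the pool, rendezvous check, and kernel-selection return value below are illustrative stand-ins, not the actual `torch.distributed` APIs (the real code uses `SymmMemAllocMixin`, a multi-rank rendezvous, and NCCL's `reduce_scatter_tensor`).

```python
class ToySymmMemPool:
    """Toy stand-in for a symmetric-memory allocator: tracks which
    buffers were handed out from the 'symmetric' pool."""

    def __init__(self):
        self._registered = set()

    def alloc(self, nbytes):
        buf = bytearray(nbytes)
        self._registered.add(id(buf))
        return buf

    def is_symmetric(self, buf):
        return id(buf) in self._registered


class ToySymmMemReduceScatter:
    """Mirrors the PR's flow: rendezvous BOTH the input and the output
    buffer before issuing the collective, so the backend can detect
    symmetric buffers and pick the optimized kernel, falling back to
    the regular path otherwise."""

    def __init__(self, pool):
        self.pool = pool

    def _rendezvous(self, buf):
        # The real implementation exchanges memory handles across ranks
        # here; the toy version only checks pool membership.
        return self.pool.is_symmetric(buf)

    def reduce_scatter(self, inp, out):
        if self._rendezvous(inp) and self._rendezvous(out):
            return "symmetric-kernel"  # e.g. ReduceScatter_LDMC in the log below
        return "regular-kernel"
```

The point of the sketch is the ordering: both tensors must come from the symmetric pool and be rendezvoused before the collective call, or the backend silently takes the regular path.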

Testing

NCCL INFO ReduceScatter [Symmetric]: 100681728 Bytes -> Kernel ReduceScatter_LDMC nchannels 2 nthreads 

Notes

  - Today, the symmetric kernel is enabled for `ReduceOp.SUM` only (when `set_force_sum_reduction_for_comms(True)` is used). For other `ReduceOp` values, NCCL falls back to the regular kernel.
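The fallback behavior in this note can be summarized as a small decision function. This is a hypothetical sketch of the dispatch, which in reality happens inside NCCL; the symmetric kernel name is taken from the NCCL log above, while the fallback name here is made up for illustration.

```python
def pick_reduce_scatter_kernel(reduce_op: str, force_sum_for_comms: bool) -> str:
    """Mirror the note: the symmetric kernel applies only to ReduceOp.SUM,
    and only when set_force_sum_reduction_for_comms(True) is in effect;
    every other ReduceOp takes the regular kernel."""
    if reduce_op == "SUM" and force_sum_for_comms:
        return "ReduceScatter_LDMC"  # symmetric kernel seen in the NCCL log
    return "ReduceScatter_regular"   # hypothetical name for the fallback path
```

For example, `pick_reduce_scatter_kernel("AVG", True)` takes the fallback branch even with force-sum enabled, matching the SUM-only restriction described above.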

[ghstack-poisoned]
pytorch-bot commented Mar 11, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177111


❌ 1 New Failure, 4 Pending, 4 Unrelated Failures

As of commit e833582 with merge base 1ef51a6:

NEW FAILURE - The following job has failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot added the ciflow/inductor, ciflow/torchtitan (Run TorchTitan integration tests), and release notes: distributed (fsdp) (release notes category) labels Mar 11, 2026
kwen2501 added a commit that referenced this pull request Mar 11, 2026
ghstack-source-id: 84c806c
Pull-Request: #177111
kwen2501 requested a review from weifengpy March 11, 2026 03:51
kwen2501 added the module: symm_mem (Issues and PRs of Symmetric Memory) label Mar 11, 2026
[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Mar 11, 2026
ghstack-source-id: cf9be9a
Pull-Request: #177111
[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Mar 12, 2026
ghstack-source-id: e96a995
Pull-Request: #177111
kwen2501 added the ciflow/trunk (Trigger trunk jobs on your pull request) label Mar 12, 2026
kwen2501 (Collaborator, Author) commented:

@pytorchbot merge -f "Failures are from Inductor tests and unrelated"

pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


pytorchmergebot added a commit that referenced this pull request Mar 13, 2026
This reverts commit 332e4c7.

Reverted #177111 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](#176613 (comment)))
pytorchmergebot (Collaborator) commented:

@kwen2501 your PR has been reverted as part of the stack under #176613.

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Mar 16, 2026
ghstack-source-id: 29270ca
Pull-Request: #177111
kwen2501 (Collaborator, Author) commented:

@pytorchbot merge -i

pytorchmergebot added a commit that referenced this pull request Mar 17, 2026
This reverts commit 0ae127b.

Reverted #177111 on behalf of https://github.com/yangw-dev due to internal test failed due to AssertionError: Unexpected methods found in class: {'set_symm_mem_for_comm'}, Missing methods: set(), please ask internal folks to help add set_symm_mem_for_comm in expected method D96767236 ([comment](#176613 (comment)))
pytorchmergebot (Collaborator) commented:

@kwen2501 your PR has been reverted as part of the stack under #176613.

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Mar 19, 2026
ghstack-source-id: c2f7600
Pull-Request: #177111
kwen2501 (Collaborator, Author) commented:

@pytorchbot merge -f "Failure is unrelated"

pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
Pull Request resolved: pytorch#177111
Approved by: https://github.com/Skylion007, https://github.com/weifengpy
ghstack dependencies: pytorch#176613
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
AaronWang04 pushed a commit to AaronWang04/pytorch that referenced this pull request Mar 31, 2026
AaronWang04 pushed a commit to AaronWang04/pytorch that referenced this pull request Mar 31, 2026
AaronWang04 pushed a commit to AaronWang04/pytorch that referenced this pull request Mar 31, 2026
nklshy-aws pushed a commit to nklshy-aws/pytorch that referenced this pull request Apr 7, 2026

Labels

- ci-no-td (Do not run TD on this PR)
- ciflow/inductor
- ciflow/torchtitan (Run TorchTitan integration tests)
- ciflow/trunk (Trigger trunk jobs on your pull request)
- Merged
- module: symm_mem (Issues and PRs of Symmetric Memory)
- open source
- release notes: distributed (fsdp) (release notes category)
- Reverted


5 participants