Use SymmMem for reduce-scatter in FSDP#177111
kwen2501 wants to merge 5 commits into `gh/kwen2501/325/base`
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177111

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 4 Pending, 4 Unrelated Failures as of commit e833582 with merge base 1ef51a6.

- NEW FAILURE - The following job has failed:
- FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
- BROKEN TRUNK - The following job failed but was present on the merge base: 👉 Rebase onto the `viable/strict` branch to avoid these failures
- UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot merge -f "Failures are from Inductor tests and unrelated" |
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This reverts commit 332e4c7. Reverted #177111 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](#176613 (comment)))
@pytorchbot merge -i |
Merge started. Your change will be merged while ignoring the following 5 checks: inductor / inductor-cpu-test / test (cpu_inductor_torchbench, 1, 2, linux.2xlarge.amx, unstable), inductor / unit-test / inductor-test / test (inductor_cpp_wrapper, 2, 2, linux.g5.4xlarge.nvidia.gpu), inductor / unit-test / inductor-test / test (inductor_cpp_wrapper, 1, 2, linux.g5.4xlarge.nvidia.gpu), inductor / inductor-test / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu), trunk / linux-jammy-rocm-py3.10 / test (default, 5, 6, linux.rocm.gpu.gfx950.1). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This reverts commit 0ae127b. Reverted #177111 on behalf of https://github.com/yangw-dev due to internal test failed due to AssertionError: Unexpected methods found in class: {'set_symm_mem_for_comm'}, Missing methods: set(), please ask intenral folks to help add set_symm_mem_for_comm in expected method D96767236 ([comment](#176613 (comment)))
@pytorchbot merge -f "Failure is unrelated" |
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
## Summary

This change enables symmetric memory optimizations for reduce-scatter collectives, matching the behavior already available for all-gather.

## Changes

1. **Added `SymmMemReduceScatter` class**: Similar to `SymmMemAllGather`, this class:
   - Allocates tensors from the symmetric memory pool (via `SymmMemAllocMixin`)
   - Rendezvouses both input and output tensors before calling `reduce_scatter_tensor`
   - This allows NCCL to detect symmetric memory tensors and use the optimized symmetric kernel
2. **Updated `set_symm_mem()`**: Now sets both all-gather and reduce-scatter to use symmetric memory implementations when `set_symm_mem_for_comm()` is called.
3. **Enhanced testing**: Added a parametrized test to verify that the `ReduceOp.SUM` reduction mode works with symmetric memory.

## Testing

```
NCCL INFO ReduceScatter [Symmetric]: 100681728 Bytes -> Kernel ReduceScatter_LDMC nchannels 2 nthreads
```

## Notes

- Today the symmetric kernel is enabled for `ReduceOp.SUM` only (when `set_force_sum_reduction_for_comms(True)` is used). For other `ReduceOp` values, NCCL falls back to the regular kernel.

Pull Request resolved: pytorch#177111
Approved by: https://github.com/Skylion007, https://github.com/weifengpy
ghstack dependencies: pytorch#176613
Stack from ghstack (oldest at bottom):