Preliminary registered-buffer collective support via Inductor by yifuwang · Pull Request #138029 · pytorch/pytorch

yifuwang · 2024-10-15T22:31:52Z

Stack from ghstack (oldest at bottom):

NOTE [lowering-time collective optimization]

In collective communication libraries such as NCCL, every rank maintains
communication buffers that are remotely accessible by some peers. Depending
on the underlying transport, remote accessibility may be established via
mechanisms such as ib_reg_mr, CUDA P2P, or CUDA multicast. Typically, these
buffers are private to the communication library by default, and
communication ops copy user data in and out of these buffers.

To prevent these copies, an optimization commonly known as "user buffer
registration" can be employed. This allows direct establishment of remote
accessibility on user buffers, eliminating the need for copying. However,
this optimization introduces stringent usage requirements, which are
typically hard to satisfy without being intrusive to the user code:

- Establishing remote accessibility is expensive and often done ahead of
time. In such implementations, all ranks must agree on the set of allocations
used for every collective op. Failing to meet this requirement can
lead to runtime errors or even silent correctness issues.
- Even if the collective communication library supports gracefully falling
back to "unregistered" implementations, the fallback mechanism would nullify
the optimization.
- Some communication mechanisms impose stricter requirements than others. For
example, CUDA's multicast + multi-mem instructions require all ranks to agree
not only on the allocations used for every collective but also on the offsets
within these allocations.
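
The two agreement levels in the list above can be illustrated with a toy check. This is only a sketch: the `(allocation_id, offset)` pair and the `agreement_level` helper are hypothetical names made up for illustration, and real libraries track registrations differently.

```python
from typing import Dict, Tuple

# (logical allocation id, byte offset) that one rank uses for one collective call.
# Both fields are illustrative assumptions, not a real library's data model.
BufferRef = Tuple[int, int]


def agreement_level(per_rank: Dict[int, BufferRef]) -> str:
    """Classify how strongly the ranks agree on the buffer for one collective."""
    refs = list(per_rank.values())
    same_alloc = len({alloc for alloc, _ in refs}) == 1
    same_offset = len({offset for _, offset in refs}) == 1
    if same_alloc and same_offset:
        # Strictest level: what multicast-style mechanisms are described as needing.
        return "allocation+offset"
    if same_alloc:
        # Enough when only the set of allocations has to match.
        return "allocation"
    # No agreement: the library would have to fall back or fail.
    return "none"


print(agreement_level({0: (7, 0), 1: (7, 0)}))
print(agreement_level({0: (7, 0), 1: (7, 256)}))
print(agreement_level({0: (7, 0), 1: (9, 0)}))
```

Targeting the `"allocation+offset"` level for every call is what lets one scheme serve all of the mechanisms listed.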

To support all different mechanisms with optimal results, we aim to satisfy
the strictest requirement for this family of optimizations - we ensure that
every collective op invocation operates on the same
allocation, at the same offset, in every iteration.

For eligible collective ops, we identify communication buffers at lowering
time and optionally choose to lower the op to a different kernel
(communication libraries like NCCL handle both registered and non-registered
buffers transparently within the same op, though some may require different
ops for different cases). Later, the codegen will perform "persistent
allocation" to satisfy the aforementioned constraints, and optionally,
perform buffer planning to optimize overall memory usage.
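
As a rough illustration of what "persistent allocation" buys, the sketch below hands every callsite the same slot on every iteration. `PersistentCommBuffers` and its fields are hypothetical stand-ins invented for this example; they are not the generated wrapper code from this PR.

```python
from typing import Dict, Tuple


class PersistentCommBuffers:
    """Toy model of persistent allocation: one fixed slot per callsite."""

    def __init__(self) -> None:
        # callsite name -> (allocation id, offset, size in bytes)
        self._slots: Dict[str, Tuple[int, int, int]] = {}
        self._next_alloc_id = 0

    def get(self, callsite: str, nbytes: int) -> Tuple[int, int]:
        slot = self._slots.get(callsite)
        if slot is None:
            # First use: create a dedicated allocation and remember it.
            slot = (self._next_alloc_id, 0, nbytes)
            self._next_alloc_id += 1
            self._slots[callsite] = slot
        elif slot[2] != nbytes:
            # A size change would break the "same allocation, same offset" promise.
            raise ValueError(f"{callsite}: size changed from {slot[2]} to {nbytes}")
        return slot[0], slot[1]


bufs = PersistentCommBuffers()
for iteration in range(3):
    # Every iteration sees the same (allocation id, offset) per callsite.
    print(iteration, bufs.get("all_reduce_0", 4096), bufs.get("all_reduce_1", 1024))
```

Because each rank runs the same compiled program, the same sequence of slots would be produced on every rank, which is the property the constraints above ask for.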

Changes

Created comm_lowering.py for the lowerings of _c10d_functional ops. This is to prevent cluttering lowering.py as we add more lowering-time collective optimizations. This PR moved the lowerings for all_reduce and all_reduce_ to the file.
Added comm_buffer_type: Dict[str, str] to GraphLowering to track whether a buffer is a comm buffer and the type of the comm buffer.
Added codegen allocation support for comm buffers of type "symm_mem".
Added support for auto-lowering _c10d_functional.all_reduce_ to symm_mem.one_shot_all_reduce.
Added an Inductor config for collective optimizations in general (config._collective).
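
The lowering-time decision described in the changes above might look roughly like the following. This is a simplified sketch, not the code in `comm_lowering.py`: `CollectiveConfig`, its fields, and the size threshold are assumptions made for illustration, while the `"symm_mem"` tag, the `comm_buffer_type` mapping, and the two op names come from the PR description.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CollectiveConfig:
    # Hypothetical knobs; not the actual fields of config._collective.
    enabled: bool = True
    max_one_shot_bytes: int = 128 * 1024


def lower_all_reduce(
    buf_name: str,
    nbytes: int,
    cfg: CollectiveConfig,
    comm_buffer_type: Dict[str, str],
) -> str:
    """Pick a kernel for one all_reduce_ callsite and record the buffer type."""
    if cfg.enabled and nbytes <= cfg.max_one_shot_bytes:
        # Mark the buffer so codegen can allocate it as a comm buffer later.
        comm_buffer_type[buf_name] = "symm_mem"
        return "symm_mem.one_shot_all_reduce"
    # Not eligible: keep the regular functional collective.
    return "_c10d_functional.all_reduce_"


comm_buffer_type: Dict[str, str] = {}
cfg = CollectiveConfig()
print(lower_all_reduce("buf0", 4096, cfg, comm_buffer_type))
print(lower_all_reduce("buf1", 64 * 1024 * 1024, cfg, comm_buffer_type))
print(comm_buffer_type)
```

Keeping the buffer type in a side table means the allocation strategy can be chosen during codegen without changing how the buffer is represented during lowering.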

Limitation

Currently, each persistently allocated comm buffer is dedicated to a single callsite. This is not viable in terms of memory usage. However, this is a necessary intermediate state before we tackle memory planning for comm buffers.
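
To make the memory cost concrete, the sketch below compares one dedicated buffer per callsite against the peak of simultaneously live buffers, which is a lower bound for any sharing scheme. The callsite sizes and live ranges are made-up numbers, and the PR itself does not implement any reuse.

```python
from typing import List, Tuple

# (size in bytes, first step, last step): one callsite's comm buffer and live range.
Callsite = Tuple[int, int, int]

MIB = 1024 * 1024


def dedicated_bytes(callsites: List[Callsite]) -> int:
    """Memory used when every callsite owns its buffer for the whole program."""
    return sum(nbytes for nbytes, _, _ in callsites)


def peak_live_bytes(callsites: List[Callsite]) -> int:
    """Lower bound on memory if buffers could be shared when not live together."""
    steps = {step for _, first, last in callsites for step in (first, last)}
    return max(
        (
            sum(nbytes for nbytes, first, last in callsites if first <= step <= last)
            for step in steps
        ),
        default=0,
    )


# Hypothetical program with three collective callsites.
callsites = [(4 * MIB, 0, 1), (4 * MIB, 2, 3), (2 * MIB, 3, 4)]
print(dedicated_bytes(callsites) // MIB, "MiB dedicated")
print(peak_live_bytes(callsites) // MIB, "MiB peak live")
```

The gap between the two numbers grows with the number of callsites, which is why memory planning for comm buffers is called out as follow-up work.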

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @XilunWu

[ghstack-poisoned]

pytorch-bot · 2024-10-15T22:31:56Z

✅ No Failures

As of commit f2f8e1f with merge base 23d590e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

torch/_inductor/graph.py

torch/_inductor/comm_lowering.py

ghstack-source-id: aa9608c Pull Request resolved: #138029

ghstack-source-id: 64d7f48 Pull Request resolved: #138029

Chillee

Mostly LGTM!

Chillee · 2024-10-18T00:02:22Z

torch/_inductor/codegen/wrapper.py

+        try:
+            # Only add empty_strided_p2p() if distributed and SymmetricMemory
+            # is available
+            from torch._C._distributed_c10d import _SymmetricMemory  # noqa: F401


I feel like it'd be good for us to refactor this code in some manner so that we only import the dependencies we need.
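
One way to read this suggestion is to probe for the optional dependency once and only emit the extra import when the probe succeeds. The helper below is a hypothetical sketch, not code from this PR; it relies only on the module path and the `_SymmetricMemory` name shown in the diff above.

```python
import importlib


def symmetric_memory_available() -> bool:
    """Return True only if the private SymmetricMemory binding can be imported."""
    try:
        mod = importlib.import_module("torch._C._distributed_c10d")
    except ImportError:
        # torch is missing or was built without distributed support.
        return False
    return hasattr(mod, "_SymmetricMemory")


# A code generator could consult this before emitting the p2p allocation import.
print("emit empty_strided_p2p import:", symmetric_memory_available())
```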

torch/_inductor/codegen/wrapper.py

test/distributed/test_symmetric_memory.py

Chillee · 2024-10-18T21:40:02Z

torch/_inductor/comm_lowering.py

+    )
+
+
+_bufs_to_skip_wait: OrderedSet[Tuple[int, str]] = OrderedSet()


What is this for?

pytorchmergebot · 2024-10-23T18:28:56Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2024-10-23T18:29:31Z

Merge failed

Reason: 17 jobs have failed, first few of them are: inductor / linux-jammy-cpu-py3.9-gcc11-inductor / test (cpu_inductor_torchbench, 2, 2, linux.12xlarge), inductor / linux-jammy-cpu-py3.9-gcc11-inductor / test (cpu_inductor_freezing_torchbench, 2, 2, linux.12xlarge), inductor / linux-jammy-cpu-py3.9-gcc11-inductor / test (cpu_inductor_amp_freezing_torchbench, 2, 2, linux.16xlarge.spr), inductor / linux-jammy-cpu-py3.9-gcc11-inductor / test (dynamic_cpu_inductor_torchbench, 2, 2, linux.12xlarge), inductor / linux-jammy-cpu-py3.9-gcc11-inductor / test (cpu_aot_inductor_freezing_torchbench, 2, 2, linux.12xlarge)

yifuwang · 2024-10-24T03:53:25Z

@pytorchbot merge

pytorchmergebot · 2024-10-24T03:55:04Z

Merge started

pytorchmergebot · 2024-10-24T09:53:46Z

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job waited more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

ghstack-source-id: ac656e2 Pull Request resolved: #138029

yifuwang · 2024-10-30T03:49:19Z

@pytorchbot rebase

pytorchmergebot · 2024-10-30T03:50:48Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

[ghstack-poisoned]

pytorchmergebot · 2024-10-30T03:51:03Z

Successfully rebased gh/yifuwang/148/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/138029)

ghstack-source-id: c474a11 Pull Request resolved: #138029

yifuwang · 2024-10-30T18:08:21Z

@pytorchbot merge

pytorchmergebot · 2024-10-30T18:10:39Z

Merge started

huydhn · 2024-10-31T03:55:53Z

@yifuwang I think this change is failing on ROCm, could you help take a look at the failure?

distributed/test_symmetric_memory.py::LoweringTest::test_lowering_one_shot_all_reduce GH job link HUD commit link

@yifuwang

#139414) I'm not sure this is expected to run if it requires buffer-registration support CC @yifuwang @huydhn @syed-ahmed #138029 Pull Request resolved: #139414 Approved by: https://github.com/huydhn, https://github.com/yifuwang

…h#138029) Pull Request resolved: pytorch#138029 Approved by: https://github.com/Chillee ghstack dependencies: pytorch#138028

Preliminary registered-buffer collective support via Inductor

5ff39a4

[ghstack-poisoned]

yifuwang mentioned this pull request Oct 15, 2024

get_symm_mem_workspace(): print helpful error during graph capture #138028

Closed

pytorch-bot bot added the ciflow/inductor, module: inductor, and oncall: distributed labels Oct 15, 2024

yifuwang requested a review from Chillee October 15, 2024 22:33

yifuwang added the topic: not user facing label Oct 15, 2024

Chillee reviewed Oct 16, 2024

torch/_inductor/graph.py Outdated

torch/_inductor/comm_lowering.py Outdated

yifuwang pushed a commit that referenced this pull request Oct 16, 2024

Preliminary registered-buffer collective support via Inductor

fb16791

ghstack-source-id: aa9608c Pull Request resolved: #138029

yifuwang mentioned this pull request Oct 17, 2024

debug #138166

Closed

yifuwang pushed a commit that referenced this pull request Oct 17, 2024

Preliminary registered-buffer collective support via Inductor

cb01009

ghstack-source-id: 64d7f48 Pull Request resolved: #138029

Chillee mentioned this pull request Oct 17, 2024

Refactor FlexibleLayout to separate out "this stride can be changed" and "how this buffer is allocated can be changed" #138280

Open

Chillee requested changes Oct 18, 2024

Chillee approved these changes Oct 18, 2024

pytorchmergebot added the merging label Oct 23, 2024

pytorchmergebot removed the merging label Oct 23, 2024

pytorchmergebot added the merging label Oct 24, 2024

yifuwang pushed a commit that referenced this pull request Oct 25, 2024

Preliminary registered-buffer collective support via Inductor

750b2ee

ghstack-source-id: ac656e2 Pull Request resolved: #138029

Update

f2f8e1f

[ghstack-poisoned]

pytorchmergebot pushed a commit that referenced this pull request Oct 30, 2024

Preliminary registered-buffer collective support via Inductor

1de6077

ghstack-source-id: c474a11 Pull Request resolved: #138029

pytorchmergebot added the Merged label Oct 30, 2024

pytorchmergebot closed this in 7765d1e Oct 30, 2024

pytorchmergebot removed the merging label Oct 30, 2024

eqy mentioned this pull request Oct 31, 2024

[ROCM][CUDA][NCCL] Disable test_lowering_one_shot_all_reduce on ROCM #139414

Closed

github-actions bot deleted the gh/yifuwang/148/head branch December 1, 2024 02:20

This was referenced Jan 13, 2026

[inductor] Basic Comm Buffer Reuse for Symmetric Memory #171909

Closed

[Inductor] Track deterministic alloc_id assignment across cached compilation artifacts #172475

Open
