Skip to content

Preliminary registered-buffer collective support via Inductor#138029

Closed
yifuwang wants to merge 14 commits intogh/yifuwang/148/basefrom
gh/yifuwang/148/head
Closed

Preliminary registered-buffer collective support via Inductor#138029
yifuwang wants to merge 14 commits intogh/yifuwang/148/basefrom
gh/yifuwang/148/head

Conversation

@yifuwang
Copy link
Copy Markdown
Collaborator

@yifuwang yifuwang commented Oct 15, 2024

Stack from ghstack (oldest at bottom):

NOTE [lowering-time collective optimization]

In collective communication libraries such as NCCL, every rank maintains
communication buffers that are remotely accessible by some peers. Depending
on the underlying transport, remote accessibility may be established via
mechanisms such as ib_reg_mr, CUDA P2P, or CUDA multicast. Typically, these
buffers are private to the communication library by default, and
communication ops copy user data in and out of these buffers.

To prevent these copies, an optimization commonly known as "user buffer
registration" can be employed. This allows direct establishment of remote
accessibility on user buffers, eliminating the need for copying. However,
this optimization introduces stringent usage requirements, which are
typically hard to satisfy without being intrusive to the user code:

- Establishing remote accessibility is expensive and often done ahead of
time. In such implementations, all ranks must agree on the set of allocations
used for every collective op. Failing to meet this requirement can
lead to runtime errors or even silent correctness issues.
- Even if the collective communication library supports gracefully falling
back to "unregistered" implementations, the fallback mechanism would nullify
the optimization.
- Some communication mechanisms impose stricter requirements than others. For
example, CUDA's multicast + multi-mem instructions require all ranks to agree
not only on the allocations used for every collective but also on the offsets
within these allocations.

To support all different mechanisms with optimal results, we aim to satisfy
the strictest requirement for this family of optimizations - we ensures that
every collective op invocation is guaranteed to operate on the same
allocation, at the same offset, in every iteration.

For eligible collective ops, we identify communication buffers at lowering
time and optionally choose to lower the op to a different kernel
(ommunication libraries like NCCL handle both registered and non-registered
buffers transparently within the same op, though some may require different
ops for different cases). Later, the codegen will perform "persistent
allocation" to satisfy the aforementioned constraints, and optionally,
perform buffer planning to optimize overall memory usage.

Changes

  • Created comm_lowering.py for the lowerings of _c10d_functional ops. This is to prevent cluttering lowering.py as we add more lowering-time collective optimizations. This PR moved the lowerings for all_reduce and all_reduce_ to the file.
  • Added comm_buffer_type: Dict[str, str] to GraphLowering to track whether a buffer is a comm buffer and the type of the comm buffer.
  • Added codegen allocation support for comm buffers of type "symm_mem".
  • Added support for auto-lowering _c10d_functional.all_reduce_ to symm_mem.one_shot_all_reduce.
  • Added an Inductor config for collective optimizations in general (config._collective).

Limitation

Currently, each persistently allocated comm buffer is dedicated to a single callsite. This is not viable in terms of memory usage. However, this is a neccesary intermediate state before we tackle memory planning for comm buffers.

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @XilunWu

@pytorch-bot
Copy link
Copy Markdown

pytorch-bot bot commented Oct 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138029

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f2f8e1f with merge base 23d590e (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added ciflow/inductor module: inductor oncall: distributed Add this issue/PR to distributed oncall triage queue labels Oct 15, 2024
@yifuwang yifuwang requested a review from Chillee October 15, 2024 22:33
@yifuwang yifuwang added the topic: not user facing topic category label Oct 15, 2024
…tor"


```
NOTE [lowering-time collective optimization]

In collective communication libraries such as NCCL, every rank maintains
communication buffers that are remotely accessible by some peers. Depending
on the underlying transport, remote accessibility may be established via
mechanisms such as ib_reg_mr, CUDA P2P, or CUDA multicast. Typically, these
buffers are private to the communication library by default, and
communication ops copy user data in and out of these buffers.

To prevent these copies, an optimization commonly known as "user buffer
registration" can be employed. This allows direct establishment of remote
accessibility on user buffers, eliminating the need for copying. However,
this optimization introduces stringent usage requirements, which are
typically hard to satisfy without being intrusive to the user code:

- Establishing remote accessibility is expensive and often done ahead of
time. In such implementations, all ranks must agree on the set of allocations
used for every collective op. Failing to meet this requirement can
lead to runtime errors or even silent correctness issues.
- Even if the collective communication library supports gracefully falling
back to "unregistered" implementations, the fallback mechanism would nullify
the optimization.
- Some communication mechanisms impose stricter requirements than others. For
example, CUDA's multicast + multi-mem instructions require all ranks to agree
not only on the allocations used for every collective but also on the offsets
within these allocations.

To support all different mechanisms with optimal results, we aim to satisfy
the strictest requirement for this family of optimizations - we ensures that
every collective op invocation is guaranteed to operate on the same
allocation, at the same offset, in every iteration.

For eligible collective ops, we identify communication buffers at lowering
time and optionally choose to lower the op to a different kernel
(ommunication libraries like NCCL handle both registered and non-registered
buffers transparently within the same op, though some may require different
ops for different cases). Later, the codegen will perform "persistent
allocation" to satisfy the aforementioned constraints, and optionally,
perform buffer planning to optimize overall memory usage.
```

### Changes
- Created `comm_lowering.py` for the lowerings of `_c10d_functional` ops. This is to prevent cluttering `lowering.py` as we add more lowering-time collective optimizations. This PR moved the lowerings for `all_reduce` and `all_reduce_` to the file.
- Added `comm_buffer_type: Dict[str, str]` to `GraphLowering` to track whether a buffer is a comm buffer and the type of the comm buffer.
- Added codegen allocation support for comm buffers of type "symm_mem".
- Added support for auto-lowering `_c10d_functional.all_reduce_` to `symm_mem.one_shot_all_reduce`.
- Added an Inductor config for collective optimizations in general (`config._collective`).

### Limitation
Currently, each persistently allocated comm buffer is dedicated to a single callsite. This is not viable in terms of memory usage. However, this is a neccesary intermediate state before we tackle memory planning for comm buffers.

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
…tor"


```
NOTE [lowering-time collective optimization]

In collective communication libraries such as NCCL, every rank maintains
communication buffers that are remotely accessible by some peers. Depending
on the underlying transport, remote accessibility may be established via
mechanisms such as ib_reg_mr, CUDA P2P, or CUDA multicast. Typically, these
buffers are private to the communication library by default, and
communication ops copy user data in and out of these buffers.

To prevent these copies, an optimization commonly known as "user buffer
registration" can be employed. This allows direct establishment of remote
accessibility on user buffers, eliminating the need for copying. However,
this optimization introduces stringent usage requirements, which are
typically hard to satisfy without being intrusive to the user code:

- Establishing remote accessibility is expensive and often done ahead of
time. In such implementations, all ranks must agree on the set of allocations
used for every collective op. Failing to meet this requirement can
lead to runtime errors or even silent correctness issues.
- Even if the collective communication library supports gracefully falling
back to "unregistered" implementations, the fallback mechanism would nullify
the optimization.
- Some communication mechanisms impose stricter requirements than others. For
example, CUDA's multicast + multi-mem instructions require all ranks to agree
not only on the allocations used for every collective but also on the offsets
within these allocations.

To support all different mechanisms with optimal results, we aim to satisfy
the strictest requirement for this family of optimizations - we ensures that
every collective op invocation is guaranteed to operate on the same
allocation, at the same offset, in every iteration.

For eligible collective ops, we identify communication buffers at lowering
time and optionally choose to lower the op to a different kernel
(ommunication libraries like NCCL handle both registered and non-registered
buffers transparently within the same op, though some may require different
ops for different cases). Later, the codegen will perform "persistent
allocation" to satisfy the aforementioned constraints, and optionally,
perform buffer planning to optimize overall memory usage.
```

### Changes
- Created `comm_lowering.py` for the lowerings of `_c10d_functional` ops. This is to prevent cluttering `lowering.py` as we add more lowering-time collective optimizations. This PR moved the lowerings for `all_reduce` and `all_reduce_` to the file.
- Added `comm_buffer_type: Dict[str, str]` to `GraphLowering` to track whether a buffer is a comm buffer and the type of the comm buffer.
- Added codegen allocation support for comm buffers of type "symm_mem".
- Added support for auto-lowering `_c10d_functional.all_reduce_` to `symm_mem.one_shot_all_reduce`.
- Added an Inductor config for collective optimizations in general (`config._collective`).

### Limitation
Currently, each persistently allocated comm buffer is dedicated to a single callsite. This is not viable in terms of memory usage. However, this is a neccesary intermediate state before we tackle memory planning for comm buffers.

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
yifuwang pushed a commit that referenced this pull request Oct 16, 2024
…tor"


```
NOTE [lowering-time collective optimization]

In collective communication libraries such as NCCL, every rank maintains
communication buffers that are remotely accessible by some peers. Depending
on the underlying transport, remote accessibility may be established via
mechanisms such as ib_reg_mr, CUDA P2P, or CUDA multicast. Typically, these
buffers are private to the communication library by default, and
communication ops copy user data in and out of these buffers.

To prevent these copies, an optimization commonly known as "user buffer
registration" can be employed. This allows direct establishment of remote
accessibility on user buffers, eliminating the need for copying. However,
this optimization introduces stringent usage requirements, which are
typically hard to satisfy without being intrusive to the user code:

- Establishing remote accessibility is expensive and often done ahead of
time. In such implementations, all ranks must agree on the set of allocations
used for every collective op. Failing to meet this requirement can
lead to runtime errors or even silent correctness issues.
- Even if the collective communication library supports gracefully falling
back to "unregistered" implementations, the fallback mechanism would nullify
the optimization.
- Some communication mechanisms impose stricter requirements than others. For
example, CUDA's multicast + multi-mem instructions require all ranks to agree
not only on the allocations used for every collective but also on the offsets
within these allocations.

To support all different mechanisms with optimal results, we aim to satisfy
the strictest requirement for this family of optimizations - we ensures that
every collective op invocation is guaranteed to operate on the same
allocation, at the same offset, in every iteration.

For eligible collective ops, we identify communication buffers at lowering
time and optionally choose to lower the op to a different kernel
(ommunication libraries like NCCL handle both registered and non-registered
buffers transparently within the same op, though some may require different
ops for different cases). Later, the codegen will perform "persistent
allocation" to satisfy the aforementioned constraints, and optionally,
perform buffer planning to optimize overall memory usage.
```

### Changes
- Created `comm_lowering.py` for the lowerings of `_c10d_functional` ops. This is to prevent cluttering `lowering.py` as we add more lowering-time collective optimizations. This PR moved the lowerings for `all_reduce` and `all_reduce_` to the file.
- Added `comm_buffer_type: Dict[str, str]` to `GraphLowering` to track whether a buffer is a comm buffer and the type of the comm buffer.
- Added codegen allocation support for comm buffers of type "symm_mem".
- Added support for auto-lowering `_c10d_functional.all_reduce_` to `symm_mem.one_shot_all_reduce`.
- Added an Inductor config for collective optimizations in general (`config._collective`).

### Limitation
Currently, each persistently allocated comm buffer is dedicated to a single callsite. This is not viable in terms of memory usage. However, this is a neccesary intermediate state before we tackle memory planning for comm buffers.

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
@yifuwang yifuwang mentioned this pull request Oct 17, 2024
…tor"


```
NOTE [lowering-time collective optimization]

In collective communication libraries such as NCCL, every rank maintains
communication buffers that are remotely accessible by some peers. Depending
on the underlying transport, remote accessibility may be established via
mechanisms such as ib_reg_mr, CUDA P2P, or CUDA multicast. Typically, these
buffers are private to the communication library by default, and
communication ops copy user data in and out of these buffers.

To prevent these copies, an optimization commonly known as "user buffer
registration" can be employed. This allows direct establishment of remote
accessibility on user buffers, eliminating the need for copying. However,
this optimization introduces stringent usage requirements, which are
typically hard to satisfy without being intrusive to the user code:

- Establishing remote accessibility is expensive and often done ahead of
time. In such implementations, all ranks must agree on the set of allocations
used for every collective op. Failing to meet this requirement can
lead to runtime errors or even silent correctness issues.
- Even if the collective communication library supports gracefully falling
back to "unregistered" implementations, the fallback mechanism would nullify
the optimization.
- Some communication mechanisms impose stricter requirements than others. For
example, CUDA's multicast + multi-mem instructions require all ranks to agree
not only on the allocations used for every collective but also on the offsets
within these allocations.

To support all different mechanisms with optimal results, we aim to satisfy
the strictest requirement for this family of optimizations - we ensures that
every collective op invocation is guaranteed to operate on the same
allocation, at the same offset, in every iteration.

For eligible collective ops, we identify communication buffers at lowering
time and optionally choose to lower the op to a different kernel
(ommunication libraries like NCCL handle both registered and non-registered
buffers transparently within the same op, though some may require different
ops for different cases). Later, the codegen will perform "persistent
allocation" to satisfy the aforementioned constraints, and optionally,
perform buffer planning to optimize overall memory usage.
```

### Changes
- Created `comm_lowering.py` for the lowerings of `_c10d_functional` ops. This is to prevent cluttering `lowering.py` as we add more lowering-time collective optimizations. This PR moved the lowerings for `all_reduce` and `all_reduce_` to the file.
- Added `comm_buffer_type: Dict[str, str]` to `GraphLowering` to track whether a buffer is a comm buffer and the type of the comm buffer.
- Added codegen allocation support for comm buffers of type "symm_mem".
- Added support for auto-lowering `_c10d_functional.all_reduce_` to `symm_mem.one_shot_all_reduce`.
- Added an Inductor config for collective optimizations in general (`config._collective`).

### Limitation
Currently, each persistently allocated comm buffer is dedicated to a single callsite. This is not viable in terms of memory usage. However, this is a neccesary intermediate state before we tackle memory planning for comm buffers.

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
…tor"


```
NOTE [lowering-time collective optimization]

In collective communication libraries such as NCCL, every rank maintains
communication buffers that are remotely accessible by some peers. Depending
on the underlying transport, remote accessibility may be established via
mechanisms such as ib_reg_mr, CUDA P2P, or CUDA multicast. Typically, these
buffers are private to the communication library by default, and
communication ops copy user data in and out of these buffers.

To prevent these copies, an optimization commonly known as "user buffer
registration" can be employed. This allows direct establishment of remote
accessibility on user buffers, eliminating the need for copying. However,
this optimization introduces stringent usage requirements, which are
typically hard to satisfy without being intrusive to the user code:

- Establishing remote accessibility is expensive and often done ahead of
time. In such implementations, all ranks must agree on the set of allocations
used for every collective op. Failing to meet this requirement can
lead to runtime errors or even silent correctness issues.
- Even if the collective communication library supports gracefully falling
back to "unregistered" implementations, the fallback mechanism would nullify
the optimization.
- Some communication mechanisms impose stricter requirements than others. For
example, CUDA's multicast + multi-mem instructions require all ranks to agree
not only on the allocations used for every collective but also on the offsets
within these allocations.

To support all different mechanisms with optimal results, we aim to satisfy
the strictest requirement for this family of optimizations - we ensures that
every collective op invocation is guaranteed to operate on the same
allocation, at the same offset, in every iteration.

For eligible collective ops, we identify communication buffers at lowering
time and optionally choose to lower the op to a different kernel
(ommunication libraries like NCCL handle both registered and non-registered
buffers transparently within the same op, though some may require different
ops for different cases). Later, the codegen will perform "persistent
allocation" to satisfy the aforementioned constraints, and optionally,
perform buffer planning to optimize overall memory usage.
```

### Changes
- Created `comm_lowering.py` for the lowerings of `_c10d_functional` ops. This is to prevent cluttering `lowering.py` as we add more lowering-time collective optimizations. This PR moved the lowerings for `all_reduce` and `all_reduce_` to the file.
- Added `comm_buffer_type: Dict[str, str]` to `GraphLowering` to track whether a buffer is a comm buffer and the type of the comm buffer.
- Added codegen allocation support for comm buffers of type "symm_mem".
- Added support for auto-lowering `_c10d_functional.all_reduce_` to `symm_mem.one_shot_all_reduce`.
- Added an Inductor config for collective optimizations in general (`config._collective`).

### Limitation
Currently, each persistently allocated comm buffer is dedicated to a single callsite. This is not viable in terms of memory usage. However, this is a neccesary intermediate state before we tackle memory planning for comm buffers.

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
…tor"


```
NOTE [lowering-time collective optimization]

In collective communication libraries such as NCCL, every rank maintains
communication buffers that are remotely accessible by some peers. Depending
on the underlying transport, remote accessibility may be established via
mechanisms such as ib_reg_mr, CUDA P2P, or CUDA multicast. Typically, these
buffers are private to the communication library by default, and
communication ops copy user data in and out of these buffers.

To prevent these copies, an optimization commonly known as "user buffer
registration" can be employed. This allows direct establishment of remote
accessibility on user buffers, eliminating the need for copying. However,
this optimization introduces stringent usage requirements, which are
typically hard to satisfy without being intrusive to the user code:

- Establishing remote accessibility is expensive and often done ahead of
time. In such implementations, all ranks must agree on the set of allocations
used for every collective op. Failing to meet this requirement can
lead to runtime errors or even silent correctness issues.
- Even if the collective communication library supports gracefully falling
back to "unregistered" implementations, the fallback mechanism would nullify
the optimization.
- Some communication mechanisms impose stricter requirements than others. For
example, CUDA's multicast + multi-mem instructions require all ranks to agree
not only on the allocations used for every collective but also on the offsets
within these allocations.

To support all different mechanisms with optimal results, we aim to satisfy
the strictest requirement for this family of optimizations - we ensures that
every collective op invocation is guaranteed to operate on the same
allocation, at the same offset, in every iteration.

For eligible collective ops, we identify communication buffers at lowering
time and optionally choose to lower the op to a different kernel
(ommunication libraries like NCCL handle both registered and non-registered
buffers transparently within the same op, though some may require different
ops for different cases). Later, the codegen will perform "persistent
allocation" to satisfy the aforementioned constraints, and optionally,
perform buffer planning to optimize overall memory usage.
```

### Changes
- Created `comm_lowering.py` for the lowerings of `_c10d_functional` ops. This is to prevent cluttering `lowering.py` as we add more lowering-time collective optimizations. This PR moved the lowerings for `all_reduce` and `all_reduce_` to the file.
- Added `comm_buffer_type: Dict[str, str]` to `GraphLowering` to track whether a buffer is a comm buffer and the type of the comm buffer.
- Added codegen allocation support for comm buffers of type "symm_mem".
- Added support for auto-lowering `_c10d_functional.all_reduce_` to `symm_mem.one_shot_all_reduce`.
- Added an Inductor config for collective optimizations in general (`config._collective`).

### Limitation
Currently, each persistently allocated comm buffer is dedicated to a single callsite. This is not viable in terms of memory usage. However, this is a neccesary intermediate state before we tackle memory planning for comm buffers.

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
…tor"


```
NOTE [lowering-time collective optimization]

In collective communication libraries such as NCCL, every rank maintains
communication buffers that are remotely accessible by some peers. Depending
on the underlying transport, remote accessibility may be established via
mechanisms such as ib_reg_mr, CUDA P2P, or CUDA multicast. Typically, these
buffers are private to the communication library by default, and
communication ops copy user data in and out of these buffers.

To prevent these copies, an optimization commonly known as "user buffer
registration" can be employed. This allows direct establishment of remote
accessibility on user buffers, eliminating the need for copying. However,
this optimization introduces stringent usage requirements, which are
typically hard to satisfy without being intrusive to the user code:

- Establishing remote accessibility is expensive and often done ahead of
time. In such implementations, all ranks must agree on the set of allocations
used for every collective op. Failing to meet this requirement can
lead to runtime errors or even silent correctness issues.
- Even if the collective communication library supports gracefully falling
back to "unregistered" implementations, the fallback mechanism would nullify
the optimization.
- Some communication mechanisms impose stricter requirements than others. For
example, CUDA's multicast + multi-mem instructions require all ranks to agree
not only on the allocations used for every collective but also on the offsets
within these allocations.

To support all different mechanisms with optimal results, we aim to satisfy
the strictest requirement for this family of optimizations - we ensures that
every collective op invocation is guaranteed to operate on the same
allocation, at the same offset, in every iteration.

For eligible collective ops, we identify communication buffers at lowering
time and optionally choose to lower the op to a different kernel
(ommunication libraries like NCCL handle both registered and non-registered
buffers transparently within the same op, though some may require different
ops for different cases). Later, the codegen will perform "persistent
allocation" to satisfy the aforementioned constraints, and optionally,
perform buffer planning to optimize overall memory usage.
```

### Changes
- Created `comm_lowering.py` for the lowerings of `_c10d_functional` ops. This is to prevent cluttering `lowering.py` as we add more lowering-time collective optimizations. This PR moved the lowerings for `all_reduce` and `all_reduce_` to the file.
- Added `comm_buffer_type: Dict[str, str]` to `GraphLowering` to track whether a buffer is a comm buffer and the type of the comm buffer.
- Added codegen allocation support for comm buffers of type "symm_mem".
- Added support for auto-lowering `_c10d_functional.all_reduce_` to `symm_mem.one_shot_all_reduce`.
- Added an Inductor config for collective optimizations in general (`config._collective`).

### Limitation
Currently, each persistently allocated comm buffer is dedicated to a single callsite. This is not viable in terms of memory usage. However, this is a neccesary intermediate state before we tackle memory planning for comm buffers.

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
Copy link
Copy Markdown
Collaborator

@Chillee Chillee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM!

try:
# Only add empty_strided_p2p() if distributed and SymmetricMemory
# is available
from torch._C._distributed_c10d import _SymmetricMemory # noqa: F401
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like it'd be good for us to refactor this code in some manner so that we only import the dependencies we need.

)


_bufs_to_skip_wait: OrderedSet[Tuple[int, str]] = OrderedSet()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this for?

@pytorchmergebot
Copy link
Copy Markdown
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

…tor"


```
NOTE [lowering-time collective optimization]

In collective communication libraries such as NCCL, every rank maintains
communication buffers that are remotely accessible by some peers. Depending
on the underlying transport, remote accessibility may be established via
mechanisms such as ib_reg_mr, CUDA P2P, or CUDA multicast. Typically, these
buffers are private to the communication library by default, and
communication ops copy user data in and out of these buffers.

To prevent these copies, an optimization commonly known as "user buffer
registration" can be employed. This allows direct establishment of remote
accessibility on user buffers, eliminating the need for copying. However,
this optimization introduces stringent usage requirements, which are
typically hard to satisfy without being intrusive to the user code:

- Establishing remote accessibility is expensive and often done ahead of
time. In such implementations, all ranks must agree on the set of allocations
used for every collective op. Failing to meet this requirement can
lead to runtime errors or even silent correctness issues.
- Even if the collective communication library supports gracefully falling
back to "unregistered" implementations, the fallback mechanism would nullify
the optimization.
- Some communication mechanisms impose stricter requirements than others. For
example, CUDA's multicast + multi-mem instructions require all ranks to agree
not only on the allocations used for every collective but also on the offsets
within these allocations.

To support all different mechanisms with optimal results, we aim to satisfy
the strictest requirement for this family of optimizations - we ensures that
every collective op invocation is guaranteed to operate on the same
allocation, at the same offset, in every iteration.

For eligible collective ops, we identify communication buffers at lowering
time and optionally choose to lower the op to a different kernel
(ommunication libraries like NCCL handle both registered and non-registered
buffers transparently within the same op, though some may require different
ops for different cases). Later, the codegen will perform "persistent
allocation" to satisfy the aforementioned constraints, and optionally,
perform buffer planning to optimize overall memory usage.
```

### Changes
- Created `comm_lowering.py` for the lowerings of `_c10d_functional` ops. This is to prevent cluttering `lowering.py` as we add more lowering-time collective optimizations. This PR moved the lowerings for `all_reduce` and `all_reduce_` to the file.
- Added `comm_buffer_type: Dict[str, str]` to `GraphLowering` to track whether a buffer is a comm buffer and the type of the comm buffer.
- Added codegen allocation support for comm buffers of type "symm_mem".
- Added support for auto-lowering `_c10d_functional.all_reduce_` to `symm_mem.one_shot_all_reduce`.
- Added an Inductor config for collective optimizations in general (`config._collective`).

### Limitation
Currently, each persistently allocated comm buffer is dedicated to a single callsite. This is not viable in terms of memory usage. However, this is a neccesary intermediate state before we tackle memory planning for comm buffers.

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov XilunWu

[ghstack-poisoned]
@yifuwang
Copy link
Copy Markdown
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Copy link
Copy Markdown
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Copy Markdown
Collaborator

The merge job was canceled or timed out. This most often happen if two merge requests were issued for the same PR, or if merge job was waiting for more than 6 hours for tests to finish. In later case, please do not hesitate to reissue the merge command
For more information see pytorch-bot wiki.

…tor"


```
NOTE [lowering-time collective optimization]

In collective communication libraries such as NCCL, every rank maintains
communication buffers that are remotely accessible by some peers. Depending
on the underlying transport, remote accessibility may be established via
mechanisms such as ib_reg_mr, CUDA P2P, or CUDA multicast. Typically, these
buffers are private to the communication library by default, and
communication ops copy user data in and out of these buffers.

To prevent these copies, an optimization commonly known as "user buffer
registration" can be employed. This allows direct establishment of remote
accessibility on user buffers, eliminating the need for copying. However,
this optimization introduces stringent usage requirements, which are
typically hard to satisfy without being intrusive to the user code:

- Establishing remote accessibility is expensive and often done ahead of
time. In such implementations, all ranks must agree on the set of allocations
used for every collective op. Failing to meet this requirement can
lead to runtime errors or even silent correctness issues.
- Even if the collective communication library supports gracefully falling
back to "unregistered" implementations, the fallback mechanism would nullify
the optimization.
- Some communication mechanisms impose stricter requirements than others. For
example, CUDA's multicast + multi-mem instructions require all ranks to agree
not only on the allocations used for every collective but also on the offsets
within these allocations.

To support all different mechanisms with optimal results, we aim to satisfy
the strictest requirement for this family of optimizations - we ensures that
every collective op invocation is guaranteed to operate on the same
allocation, at the same offset, in every iteration.

For eligible collective ops, we identify communication buffers at lowering
time and optionally choose to lower the op to a different kernel
(ommunication libraries like NCCL handle both registered and non-registered
buffers transparently within the same op, though some may require different
ops for different cases). Later, the codegen will perform "persistent
allocation" to satisfy the aforementioned constraints, and optionally,
perform buffer planning to optimize overall memory usage.
```

### Changes
- Created `comm_lowering.py` for the lowerings of `_c10d_functional` ops. This is to prevent cluttering `lowering.py` as we add more lowering-time collective optimizations. This PR moved the lowerings for `all_reduce` and `all_reduce_` to the file.
- Added `comm_buffer_type: Dict[str, str]` to `GraphLowering` to track whether a buffer is a comm buffer and the type of the comm buffer.
- Added codegen allocation support for comm buffers of type "symm_mem".
- Added support for auto-lowering `_c10d_functional.all_reduce_` to `symm_mem.one_shot_all_reduce`.
- Added an Inductor config for collective optimizations in general (`config._collective`).

### Limitation
Currently, each persistently allocated comm buffer is dedicated to a single callsite. This is not viable in terms of memory usage. However, this is a neccesary intermediate state before we tackle memory planning for comm buffers.

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov XilunWu

[ghstack-poisoned]
yifuwang pushed a commit that referenced this pull request Oct 25, 2024
@yifuwang
Copy link
Copy Markdown
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Copy link
Copy Markdown
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

[ghstack-poisoned]
@pytorchmergebot
Copy link
Copy Markdown
Collaborator

Successfully rebased gh/yifuwang/148/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/138029)

pytorchmergebot pushed a commit that referenced this pull request Oct 30, 2024
@yifuwang
Copy link
Copy Markdown
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Copy link
Copy Markdown
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@huydhn
Copy link
Copy Markdown
Contributor

huydhn commented Oct 31, 2024

@yifuwang I think this change is failing on ROCm, could you help take a look at the failure?

distributed/test_symmetric_memory.py::LoweringTest::test_lowering_one_shot_all_reduce GH job link HUD commit link

pytorchmergebot pushed a commit that referenced this pull request Nov 1, 2024
#139414)

I'm not sure this is expected to run if it requires buffer-registration support CC @yifuwang @huydhn @syed-ahmed #138029

Pull Request resolved: #139414
Approved by: https://github.com/huydhn, https://github.com/yifuwang
rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
…h#138029)

```
NOTE [lowering-time collective optimization]

In collective communication libraries such as NCCL, every rank maintains
communication buffers that are remotely accessible by some peers. Depending
on the underlying transport, remote accessibility may be established via
mechanisms such as ib_reg_mr, CUDA P2P, or CUDA multicast. Typically, these
buffers are private to the communication library by default, and
communication ops copy user data in and out of these buffers.

To prevent these copies, an optimization commonly known as "user buffer
registration" can be employed. This allows direct establishment of remote
accessibility on user buffers, eliminating the need for copying. However,
this optimization introduces stringent usage requirements, which are
typically hard to satisfy without being intrusive to the user code:

- Establishing remote accessibility is expensive and often done ahead of
time. In such implementations, all ranks must agree on the set of allocations
used for every collective op. Failing to meet this requirement can
lead to runtime errors or even silent correctness issues.
- Even if the collective communication library supports gracefully falling
back to "unregistered" implementations, the fallback mechanism would nullify
the optimization.
- Some communication mechanisms impose stricter requirements than others. For
example, CUDA's multicast + multi-mem instructions require all ranks to agree
not only on the allocations used for every collective but also on the offsets
within these allocations.

To support all different mechanisms with optimal results, we aim to satisfy
the strictest requirement for this family of optimizations - we ensures that
every collective op invocation is guaranteed to operate on the same
allocation, at the same offset, in every iteration.

For eligible collective ops, we identify communication buffers at lowering
time and optionally choose to lower the op to a different kernel
(ommunication libraries like NCCL handle both registered and non-registered
buffers transparently within the same op, though some may require different
ops for different cases). Later, the codegen will perform "persistent
allocation" to satisfy the aforementioned constraints, and optionally,
perform buffer planning to optimize overall memory usage.
```

### Changes
- Created `comm_lowering.py` for the lowerings of `_c10d_functional` ops. This is to prevent cluttering `lowering.py` as we add more lowering-time collective optimizations. This PR moved the lowerings for `all_reduce` and `all_reduce_` to the file.
- Added `comm_buffer_type: Dict[str, str]` to `GraphLowering` to track whether a buffer is a comm buffer and the type of the comm buffer.
- Added codegen allocation support for comm buffers of type "symm_mem".
- Added support for auto-lowering `_c10d_functional.all_reduce_` to `symm_mem.one_shot_all_reduce`.
- Added an Inductor config for collective optimizations in general (`config._collective`).

### Limitation
Currently, each persistently allocated comm buffer is dedicated to a single callsite. This is not viable in terms of memory usage. However, this is a neccesary intermediate state before we tackle memory planning for comm buffers.

Pull Request resolved: pytorch#138029
Approved by: https://github.com/Chillee
ghstack dependencies: pytorch#138028
rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
pytorch#139414)

I'm not sure this is expected to run if it requires buffer-registration support CC @yifuwang @huydhn @syed-ahmed pytorch#138029

Pull Request resolved: pytorch#139414
Approved by: https://github.com/huydhn, https://github.com/yifuwang
@github-actions github-actions bot deleted the gh/yifuwang/148/head branch December 1, 2024 02:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/inductor ciflow/trunk Trigger trunk jobs on your pull request Merged module: inductor oncall: distributed Add this issue/PR to distributed oncall triage queue topic: not user facing topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants