
[core][compiled graphs] Support experimental_compile(_default_communicator=comm)#50023

Merged
stephanie-wang merged 33 commits into ray-project:master from ruisearch42:explicit_comm2
Feb 5, 2025

Conversation

@ruisearch42
Contributor

@ruisearch42 ruisearch42 commented Jan 23, 2025

Why are these changes needed?

Currently, with_tensor_transport(transport=arg) supports three types of arg: "nccl", "auto", or a Communicator. If "nccl" is used, or if "auto" is used and resolves to NCCL, Ray creates a communicator internally and uses it.
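For context, the three accepted transport values can be dispatched roughly as follows. This is an illustrative sketch, not Ray's actual implementation; `Communicator` and `resolve_transport` are hypothetical stand-ins:

```python
# Sketch only: models how the three `transport` values are interpreted.
# `Communicator` is a minimal stand-in for Ray's communicator class.
class Communicator:
    """Hypothetical stand-in for a user-provided communicator object."""


def resolve_transport(transport, gpu_transfer_possible=True):
    """Map a with_tensor_transport(transport=...) argument to a backend."""
    if isinstance(transport, Communicator):
        return "custom"  # the user's communicator is used as-is
    if transport == "nccl":
        return "nccl"  # Ray creates an internal NCCL communicator
    if transport == "auto":
        # "auto" resolves to NCCL when GPU-to-GPU transfer applies;
        # otherwise tensors go through the object store.
        return "nccl" if gpu_transfer_possible else "object_store"
    raise ValueError(f"unsupported transport: {transport!r}")
```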

In order to better manage the communicator lifecycle, we make the following changes:

  • Add an API, experimental_compile(_default_communicator=default_comm), where default_comm is a user-provided default communicator.
  • _default_communicator can also be "create", which explicitly tells Ray to create a default communicator and use it at all with_tensor_transport() sites where a specific communicator was not provided.
  • If _default_communicator is not specified, Ray will neither create a default communicator nor reuse communicators passed in at other with_tensor_transport() sites; instead it throws an error when a communicator is needed. Note that this is backward incompatible.
  • For collectives, a custom communicator specified at a specific site takes precedence; otherwise, if a default is provided, we use the default; otherwise, if the default is "create", we create a communicator per collective op, reusing a communicator when the actor set is the same. For every passed-in communicator, we check that its actor set matches that of the collective operation.
  • For "create", a single p2p communicator is created for all involved actors without a passed-in communicator. A previously created collective communicator is reused if its actor set includes all p2p actors.
  • _init_communicator() is called on the involved actors for every communicator (passed in or created), while _destroy_communicator() is only called for the ones created by Compiled Graph.
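The selection precedence in the bullets above can be summarized in a small sketch. This is hedged pseudologic, not Ray's code; `select_communicator` and the `"ray-created"` sentinel are invented names for illustration:

```python
# Sketch of the per-op communicator selection precedence described above:
# site-specific custom comm > user-provided default > "create" > error.
CREATE = "create"


def select_communicator(site_comm, default_comm):
    """Pick the communicator for one operation."""
    if site_comm is not None:
        return site_comm  # custom communicator at the call site wins
    if default_comm is None:
        # Backward-incompatible behavior: no default means an error
        # whenever a communicator is actually needed.
        raise ValueError(
            "A communicator is required but _default_communicator was not set"
        )
    if default_comm == CREATE:
        return "ray-created"  # Ray allocates a communicator itself
    return default_comm  # the user-provided default communicator
```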

By default, experimental_compile(_default_communicator="create") is used, therefore:

  • Any custom communicator passed in (via with_tensor_transport() for p2p, or collective.allreduce.bind() for collectives) takes precedence and is used in the corresponding operations. These communicators are not reused at other sites; users must opt in explicitly if they want reuse.
  • For each collective operation without a custom communicator, a new communicator is created, and it is reused by all other collective operations without a custom communicator that have the same set of actors.
  • For all p2p communications without a custom communicator, a single communicator is created for all involved actors. Creation is skipped if a collective operation already created a communicator whose actor set includes all p2p actors; in that case the collective communicator is reused for p2p.
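The reuse rules under the default "create" mode can be modeled as a small planning step: one communicator per distinct collective actor set, with p2p reusing any collective communicator that covers its actors. A hedged sketch, with `plan_created_communicators` and the communicator-id strings being invented for illustration:

```python
# Sketch of communicator reuse under _default_communicator="create".
def plan_created_communicators(collective_actor_sets, p2p_actors):
    """Return (actor_set -> comm_id mapping, comm_id used for p2p)."""
    comms = {}  # frozenset of actors -> communicator id
    for actors in collective_actor_sets:
        key = frozenset(actors)
        if key not in comms:  # reuse when the actor set is identical
            comms[key] = f"collective-{len(comms)}"
    p2p = frozenset(p2p_actors)
    for key, comm_id in comms.items():
        if p2p <= key:  # a collective comm already covers all p2p actors
            return comms, comm_id
    comms[p2p] = "p2p"  # otherwise create a single p2p communicator
    return comms, "p2p"
```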

This PR also refactors CompiledDAG._preprocess() so that the code is better organized.

Related issue number

Closes #47540

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
@ruisearch42 ruisearch42 changed the title wip [core][compiled graphs] Support Support experimental_compile(_default_communicator=comm) Jan 23, 2025
@ruisearch42 ruisearch42 marked this pull request as ready for review January 23, 2025 22:18
@ruisearch42 ruisearch42 added the go add ONLY when ready to merge, run all tests label Jan 24, 2025
@ruisearch42 ruisearch42 changed the title [core][compiled graphs] Support Support experimental_compile(_default_communicator=comm) [core][compiled graphs] Support experimental_compile(_default_communicator=comm) Jan 30, 2025
@ruisearch42 ruisearch42 removed the go add ONLY when ready to merge, run all tests label Jan 30, 2025
@ruisearch42 ruisearch42 added the go add ONLY when ready to merge, run all tests label Jan 31, 2025
Member

@kevin85421 kevin85421 left a comment


overall looks good

Member

@kevin85421 kevin85421 left a comment


Leave some comments. Overall looks good to me. Because this is an API change, it'd be helpful to get a review from @stephanie-wang.

@stephanie-wang stephanie-wang merged commit a0631d7 into ray-project:master Feb 5, 2025
2 checks passed
xsuler pushed a commit to antgroup/ant-ray that referenced this pull request Mar 4, 2025
…cator=comm) (ray-project#50023)

park12sj pushed a commit to park12sj/ray that referenced this pull request Mar 18, 2025
…cator=comm) (ray-project#50023)


Labels

community-backlog go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[aDAG] Support experimental_compile(_custom_nccl_group= nccl_group) API

4 participants