[core][compiled graphs] Execute multiple p2p NCCL transfers with collective ops #47938
Status: Open
Labels
P2 (important issue, but not time-critical), community-backlog, compiled-graphs, core, enhancement, performance
Description
Right now, if an actor sends one or more torch tensors to other actors, each transfer always executes as a separate p2p op. This can be much slower than performing the equivalent communication with NCCL collective ops. For example:
```python
with InputNode() as inp:
    t = a.foo.bind(inp)
    t = t.with_type_hint(TorchTensorType(transport="nccl"))
    t_b = b.foo.bind(t)
    t_c = c.foo.bind(t)
    dag = a.update(t, t_b, t_c)
```

This will execute 4 p2p transfers (a->b, a->c, b->a, c->a) when it could be executed in 2 collective ops: a broadcast followed by a gather. Ideally we should try to identify such cases and replace the individual transfers with collective ops.
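A minimal sketch of the detection step this issue proposes, not Ray's actual API: a pass over the compiled graph's transfer list that groups any one-to-many fan-out from a single sender into one broadcast candidate, leaving single-receiver edges as plain p2p sends. The `plan_transfers` function and the edge-tuple representation are hypothetical, chosen only to illustrate the pattern-matching idea.

```python
# Hypothetical sketch (not Ray's actual API): detect one-to-many tensor
# transfers in a DAG and group them into a single collective candidate.
from collections import defaultdict

def plan_transfers(edges):
    """edges: list of (sender, receiver) p2p transfers.

    Returns a plan where a sender with multiple receivers is replaced by
    one broadcast, instead of N individual p2p sends.
    """
    fanout = defaultdict(list)
    for src, dst in edges:
        fanout[src].append(dst)
    plan = []
    for src, dsts in fanout.items():
        if len(dsts) > 1:
            # One collective instead of len(dsts) p2p ops.
            plan.append(("broadcast", src, tuple(dsts)))
        else:
            plan.append(("p2p", src, dsts[0]))
    return plan

# The example above: a sends t to both b and c, so the two p2p sends
# collapse into one broadcast. The b->a and c->a return transfers would
# analogously be candidates for a gather rooted at a.
print(plan_transfers([("a", "b"), ("a", "c")]))
# [('broadcast', 'a', ('b', 'c'))]
```

The same fan-in analysis (many senders, one receiver) would identify the gather case; the real implementation would additionally need to check that all participants share a NCCL communicator.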
Use case
No response