[core][compiled graphs] Execute multiple p2p NCCL transfers with collective ops #47938

@stephanie-wang

Description

Right now, if an actor sends torch tensor(s) to other actor(s), each transfer always executes as a separate p2p op. This can be much slower than performing the same communication with a single NCCL collective. For example:

```python
import ray
from ray.dag import InputNode
from ray.experimental.channel.torch_tensor_type import TorchTensorType

with InputNode() as inp:
  t = a.foo.bind(inp)
  t = t.with_type_hint(TorchTensorType(transport="nccl"))
  t_b = b.foo.bind(t)
  t_c = c.foo.bind(t)
  dag = a.update.bind(t, t_b, t_c)
```

This will execute 4 p2p transfers (a->b, a->c, b->a, c->a), when the same data movement could be done with 2 collective ops: a broadcast followed by a gather. Ideally we should identify such cases and replace the individual transfers with collective ops.
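The grouping step could be sketched roughly as follows. This is a hypothetical illustration, not the Ray compiled-graphs API: given the DAG's tensor-transfer edges, a pass could collapse one-sender/many-receiver groups into a broadcast and many-sender/one-receiver groups into a gather, leaving unmatched edges as plain p2p sends.

```python
# Hypothetical sketch (names and structure are assumptions, not Ray's API):
# collapse fan-out edges into a broadcast and fan-in edges into a gather.
from collections import defaultdict

def plan_collectives(edges):
    """edges: list of (sender, receiver) p2p transfers.
    Returns a list of ops with fan-out/fan-in groups replaced by collectives."""
    by_sender = defaultdict(list)
    by_receiver = defaultdict(list)
    for s, r in edges:
        by_sender[s].append(r)
        by_receiver[r].append(s)
    ops, grouped = [], set()
    for s, receivers in by_sender.items():
        if len(receivers) > 1:  # one sender, many receivers -> broadcast
            ops.append(("broadcast", s, tuple(receivers)))
            grouped.update((s, r) for r in receivers)
    for r, senders in by_receiver.items():
        if len(senders) > 1:  # many senders, one receiver -> gather
            ops.append(("gather", tuple(senders), r))
            grouped.update((s, r) for s in senders)
    # Any transfer not covered by a collective stays a p2p send.
    ops.extend(("p2p", s, r) for s, r in edges if (s, r) not in grouped)
    return ops

# The example DAG's 4 transfers reduce to 2 collective ops:
print(plan_collectives([("a", "b"), ("a", "c"), ("b", "a"), ("c", "a")]))
```

A real implementation would also have to check that the grouped transfers carry the same tensor (for broadcast) and land in the same execution step, which this sketch ignores.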

Use case

No response


Labels

P2 (important issue, but not time-critical), community-backlog, compiled-graphs, core (issues that should be addressed in Ray Core), enhancement (request for new feature and/or capability), performance
