[RFC]: Future plans for improving Mooncake EP #1225

@UNIDY2002

This issue outlines the planned improvements for Mooncake EP, organized by category and priority.

Functionalities

  • Support torch.distributed.send / recv (P0) → #1236
  • Support dynamic membership for the EP Buffer (P1) → #1630
  • Support additional collective primitives (e.g., gather, scatter, reduce) (P2) → #1469
  • Support full reduction ops for allreduce (e.g., product, min, max) (P2) → #1440
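For readers unfamiliar with the reduction ops named in the last item, here is a minimal in-process sketch of what an allreduce produces under each one. It is pure Python and does not touch Mooncake or Torch: `simulated_allreduce` and the `REDUCE_OPS` table are hypothetical names for illustration only, with op names chosen to mirror the SUM, PRODUCT, MIN, and MAX members of `torch.distributed.ReduceOp`.

```python
from functools import reduce
import operator

# Hypothetical op table for illustration; the names mirror the
# SUM / PRODUCT / MIN / MAX members of torch.distributed.ReduceOp.
REDUCE_OPS = {
    "sum": operator.add,
    "product": operator.mul,
    "min": min,
    "max": max,
}


def simulated_allreduce(per_rank_values, op="sum"):
    """Simulate allreduce semantics in a single process.

    per_rank_values holds one equal-length list per rank. The result
    holds one list per rank, each containing the element-wise
    reduction across all ranks.
    """
    if op not in REDUCE_OPS:
        raise ValueError(f"unsupported reduction op: {op}")
    if len({len(values) for values in per_rank_values}) > 1:
        raise ValueError("all ranks must contribute the same number of elements")

    combine = REDUCE_OPS[op]
    # zip(*...) groups the i-th element of every rank together.
    reduced = [reduce(combine, column) for column in zip(*per_rank_values)]
    # After an allreduce, every rank holds its own copy of the result.
    return [list(reduced) for _ in per_rank_values]


if __name__ == "__main__":
    ranks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    for name in REDUCE_OPS:
        print(name, simulated_allreduce(ranks, name)[0])
```

This only shows the expected results; how a real backend schedules the communication is a separate concern and is not modeled here.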

Performance

  • Improve performance of EP dispatch/combine (P0)
  • Improve performance of isend/irecv collective primitives (P1) → #1533

Maintainability

  • Make CUDA support future-proof (e.g., support CUDA 13) (P0)
  • Split the Torch Distributed backend from Mooncake EP into a separate directory (Mooncake PG, i.e., process group) (P1) → #1387, #1401
  • Avoid indexing SegmentDesc::buffers to obtain peer memory locations; transfer them through Torch's rendezvous store instead (P2)
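The last item above could look roughly like the following sketch, in which each rank publishes its buffer location under a per-rank key and peers look it up by rank. Everything here is an assumption made for illustration: `InMemoryStore` is a stand-in that only imitates the `set`/`get` shape of a key-value rendezvous store, and the key layout (`ep_buffer/<rank>`), the `(address, length)` encoding, and the helper names are hypothetical rather than Mooncake's actual design.

```python
import struct


class InMemoryStore:
    """Stand-in for a key-value rendezvous store (bytes in, bytes out).

    Unlike a real distributed store, get() here does not block until
    the key appears; it raises KeyError if the key is missing.
    """

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = bytes(value)

    def get(self, key):
        return self._data[key]


# Hypothetical wire format: two little-endian unsigned 64-bit integers
# holding the buffer's base address and its length in bytes.
_BUFFER_FMT = "<QQ"


def publish_buffer(store, rank, addr, length):
    """Publish this rank's buffer location under a per-rank key."""
    store.set(f"ep_buffer/{rank}", struct.pack(_BUFFER_FMT, addr, length))


def lookup_buffer(store, rank):
    """Return the (address, length) pair that a peer rank published."""
    return struct.unpack(_BUFFER_FMT, store.get(f"ep_buffer/{rank}"))


if __name__ == "__main__":
    store = InMemoryStore()
    publish_buffer(store, rank=0, addr=0x7F0000000000, length=1 << 20)
    publish_buffer(store, rank=1, addr=0x7F0010000000, length=1 << 20)
    print(lookup_buffer(store, 1))
```

Keying by rank means a peer no longer depends on the position of an entry inside a shared descriptor, which is the coupling the item proposes to remove.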

Maintained by UNIDY2002's OpenClaw
