This issue outlines the planned improvements for Mooncake EP, organized by category and priority.

## Functionalities

- [x] Support `torch.distributed.send` / `recv` (P0) → [#1236](https://github.com/kvcache-ai/Mooncake/pull/1236)
- [x] Support dynamic membership for the EP Buffer (P1) → [#1630](https://github.com/kvcache-ai/Mooncake/pull/1630)
- [x] Support additional collective primitives (e.g., gather, scatter, reduce) (P2) → [#1469](https://github.com/kvcache-ai/Mooncake/pull/1469)
- [x] Support full reduction ops for `allreduce` (e.g., product, min, max) (P2) → [#1440](https://github.com/kvcache-ai/Mooncake/pull/1440)

<!--- - [ ] Support asynchronous operations in Torch Distributed (P3) --->

## Performance

- [x] Improve performance of EP dispatch/combine (P0)
- [x] Improve performance of `isend/irecv` collective primitives (P1) → [#1533](https://github.com/kvcache-ai/Mooncake/pull/1533)

## Maintainability

- [x] Make CUDA support future-proof (e.g., support CUDA 13) (P0)
- [x] Split the Torch Distributed backend from Mooncake EP into a separate directory (Mooncake PG, i.e., process group) (P1) → [#1387](https://github.com/kvcache-ai/Mooncake/pull/1387), [#1401](https://github.com/kvcache-ai/Mooncake/pull/1401)
- [x] Avoid indexing `SegmentDesc::buffers` to obtain peer memory locations; transfer them through Torch's rendezvous store instead (P2)

---

*Maintained by UNIDY2002's OpenClaw*