
[Misc] Mooncake EP & Mooncake Backend#805

Merged
alogfans merged 91 commits into main from sunxun/mooncake-backend-dev
Sep 26, 2025

Conversation

@UNIDY2002
Collaborator

In this PR, we propose Mooncake EP and the Mooncake Backend.

Mooncake EP is an adaptation of DeepEP that supports fault tolerance for large-scale MoE inference. It remains API-compatible with DeepEP, with an extra broken_ranks tensor to track failed ranks.
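
As a rough usage illustration (a sketch only, not the actual API: the buffer construction, tensor shapes, and dtypes below are assumptions based on the `dispatch` signature in `mooncake_ep_buffer.py`):

```python
import torch

# Assumed to exist: `buffer` (a Mooncake EP buffer), `x` (input tokens), and
# `topk_idx` (per-token expert assignments), set up as in DeepEP.
num_ranks = 8
broken_ranks = torch.zeros(num_ranks, dtype=torch.int32, device="cuda")  # assumed dtype

packed_recv, recv_count, handle, event, hook = buffer.dispatch(
    x, topk_idx, broken_ranks,
    num_max_dispatch_tokens_per_rank=128,  # illustrative values
    num_experts=256,
    timeout_us=1_000_000,
)
# Ranks flagged in `broken_ranks` are treated as failed, so the caller can
# drop them from the subsequent combine step instead of hanging on a dead peer.
```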

Mooncake Backend is a PyTorch distributed backend, designed as a fault-tolerant replacement for NCCL and Gloo. It can continue to perform collective communication under rank failures and reports them to upper layers for graceful handling.
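
For example, a minimal sketch of how it might be used (assuming the package registers the backend with PyTorch under the name "mooncake" on import; the actual registration entry point may differ):

```python
import os
import torch
import torch.distributed as dist

import mooncake  # assumption: importing this registers the "mooncake" backend

rank = int(os.environ["RANK"])             # standard torchrun environment
world_size = int(os.environ["WORLD_SIZE"])
dist.init_process_group(backend="mooncake", rank=rank, world_size=world_size)

x = torch.ones(1024, device="cuda")
dist.all_reduce(x)  # unlike with NCCL, this should complete even if a peer
                    # rank fails, with the failure reported for graceful handling
```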

Read more at doc/en/ep-backend.md.


Tests

Since the C++ APIs are not intended for direct use, no C++ unit tests are provided. Instead, three Python unit tests are included under mooncake-wheel/tests/:

  • test_mooncake_ep.py: Adapted from DeepEP’s test_low_latency.py. Verifies the correctness of the EP APIs and includes a basic performance test.
  • test_mooncake_backend.py: Validates the correctness of the Mooncake Backend.
  • test_mooncake_backend_perf.py: Compares the performance of the Mooncake Backend against NCCL and Gloo.

Performance

Tested on a single node with 8× H100 GPUs.

Mooncake EP (pure RDMA)

| Impl     | Dispatch bandwidth | Dispatch latency | Combine bandwidth | Combine latency |
| -------- | ------------------ | ---------------- | ----------------- | --------------- |
| Mooncake | 41 GB/s            | 184 us           | 38 GB/s           | 387 us          |
| DeepEP   | 46 GB/s            | 163 us           | 46 GB/s           | 318 us          |

Mooncake Backend

Here are preliminary performance results for the Mooncake Backend; further optimizations are planned.

All latencies are in microseconds.

Mooncake vs. Gloo

Allgather

| Data Size | Mooncake | Gloo   |
| --------- | -------- | ------ |
| 1K        | 94       | 681    |
| 4K        | 125      | 834    |
| 16K       | 288      | 1121   |
| 64K       | 928      | 6253   |
| 256K      | 3715     | 8163   |
| 1M        | 7929     | 37067  |
| 4M        | 31239    | 142334 |

Allreduce

| Data Size | Mooncake | Gloo  |
| --------- | -------- | ----- |
| 1K        | 87       | 1334  |
| 4K        | 163      | 1358  |
| 16K       | 476      | 1482  |
| 64K       | 1623     | 1606  |
| 256K      | 6382     | 2202  |
| 1M        | 23194    | 5324  |
| 4M        | 92664    | 15734 |

Broadcast

| Data Size | Mooncake | Gloo  |
| --------- | -------- | ----- |
| 1K        | 61       | 101   |
| 4K        | 87       | 129   |
| 16K       | 142      | 177   |
| 64K       | 389      | 449   |
| 256K      | 1389     | 1130  |
| 1M        | 1662     | 2759  |
| 4M        | 7876     | 11559 |

Mooncake vs. NCCL

Allgather

| Data Size | Mooncake | NCCL |
| --------- | -------- | ---- |
| 1K        | 67       | 93   |
| 4K        | 69       | 88   |
| 16K       | 78       | 93   |
| 64K       | 122      | 84   |
| 256K      | 293      | 81   |
| 1M        | 1038     | 178  |
| 4M        | 4158     | 521  |

Allreduce

| Data Size | Mooncake | NCCL |
| --------- | -------- | ---- |
| 1K        | 57       | 34   |
| 4K        | 60       | 30   |
| 16K       | 77       | 31   |
| 64K       | 122      | 30   |
| 256K      | 300      | 31   |
| 1M        | 1112     | 53   |
| 4M        | 14421    | 119  |

Broadcast

| Data Size | Mooncake | NCCL |
| --------- | -------- | ---- |
| 1K        | 50       | 28   |
| 4K        | 38       | 26   |
| 16K       | 47       | 27   |
| 64K       | 100      | 28   |
| 256K      | 246      | 34   |
| 1M        | 834      | 28   |
| 4M        | 3196     | 68   |

Comment thread doc/en/ep-backend.md Outdated
@whybeyoung
Collaborator

Amazing work!

Comment thread mooncake-wheel/setup.py
Comment on lines +112 to +148
if int(os.getenv("BUILD_WITH_EP", "0")):
    import torch
    from torch.utils.cpp_extension import BuildExtension, CUDAExtension

    abi_flag = int(torch._C._GLIBCXX_USE_CXX11_ABI)
    current_dir = os.path.abspath(os.path.dirname(__file__))
    ext_modules = [
        CUDAExtension(
            name="mooncake.ep",
            include_dirs=[
                os.path.join(current_dir, "../mooncake-ep/include"),
                os.path.join(current_dir, "../mooncake-transfer-engine/include"),
            ],
            sources=["../mooncake-integration/ep/ep_py.cpp"],
            extra_compile_args={
                "cxx": [f"-D_GLIBCXX_USE_CXX11_ABI={abi_flag}", "-std=c++20"],
                "nvcc": [f"-D_GLIBCXX_USE_CXX11_ABI={abi_flag}", "-std=c++20"],
            },
            libraries=["ibverbs", "mlx5"],
            extra_objects=[
                os.path.join(current_dir, "../build/mooncake-ep/src/libmooncake_ep.a"),
                os.path.join(current_dir, "mooncake/engine.so"),
            ],
        )
    ]
    setup(
        distclass=BinaryDistribution,
        cmdclass={
            "bdist_wheel": CustomBdistWheel,
            "build_ext": BuildExtension,
        },
        ext_modules=ext_modules,
    )
else:
    setup(
        distclass=BinaryDistribution,
        cmdclass={"bdist_wheel": CustomBdistWheel},
    )
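
(For reference, this extension-build path is taken when the wheel is built with `BUILD_WITH_EP=1`, e.g. something like `BUILD_WITH_EP=1 python setup.py bdist_wheel`; the exact entry point used by the project's build scripts may differ.)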
Collaborator

Is -std=c++20 the minimum required version? cc: @xiaguan

Collaborator

Mooncake Store needs C++20; the other components could probably use a lower C++ standard like C++17.

Collaborator Author

It seems that a C++20 feature (std::string::starts_with) is used here:

if (server_name.starts_with("[")) {

Comment thread mooncake-wheel/mooncake/mooncake_ep_buffer.py Outdated
def dispatch(self, x: torch.Tensor, topk_idx: torch.Tensor, broken_ranks: torch.Tensor,
             num_max_dispatch_tokens_per_rank: int, num_experts: int, timeout_us: int,
             use_fp8: bool = True, async_finish: bool = False, return_recv_hook: bool = False) -> \
        Tuple[Tuple[torch.Tensor, torch.Tensor], torch.Tensor, Tuple, EventOverlap, Callable]:
Collaborator

This should be fixed as well.

Collaborator Author

Changed `Tuple[torch.Tensor, torch.Tensor]` to `Union[Tuple[torch.Tensor, torch.Tensor], torch.Tensor]`.
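
For reference, the annotation after the fix presumably reads as follows (a sketch; the placement of the Union is inferred from this reply, and EventOverlap is assumed to come from the same module):

```python
from typing import Callable, Tuple, Union
import torch

def dispatch(self, x: torch.Tensor, topk_idx: torch.Tensor, broken_ranks: torch.Tensor,
             num_max_dispatch_tokens_per_rank: int, num_experts: int, timeout_us: int,
             use_fp8: bool = True, async_finish: bool = False, return_recv_hook: bool = False) -> \
        Tuple[Union[Tuple[torch.Tensor, torch.Tensor], torch.Tensor],
              torch.Tensor, Tuple, EventOverlap, Callable]:
    # With use_fp8=True the first element is presumably a (data, scales) pair;
    # with use_fp8=False it is a single tensor -- hence the Union.
    ...
```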

Comment thread mooncake-wheel/mooncake/mooncake_ep_buffer.py Outdated
@ShangmingCai
Collaborator

I have another urgent PR that needs testing and review today; I will continue with this PR tomorrow.

@alogfans Please take some time to review this PR as well.

Comment thread mooncake-wheel/mooncake/mooncake_ep_buffer.py Outdated
TORCH_CHECK(tensorSize * meta->size < kBufferSize, "Too large!");
auto future = c10::make_intrusive<c10::ivalue::Future>(
c10::ListType::create(c10::TensorType::get()));
int taskId = cpuTaskCount % 2;
Collaborator

Maybe this needs a comment for clarification?

Collaborator Author

A comment has been added.

.attr("__version__")
.attr("split")("+")
.cast<std::vector<std::string>>()[0];
TORCH_CHECK(version == "2.8.0", "Mooncake Backend requires torch==2.8.0");
Collaborator

Should we use >= in case SGLang/vLLM require a newer version of PyTorch?

Collaborator Author

I'm afraid a strict equality check is required here, as the Mooncake library must match the libtorch C++ ABI.

If SGLang/vLLM require a newer version of PyTorch, we would have to recompile Mooncake against the corresponding PyTorch version. (Or, to be optimistic, we may figure out a better solution in future versions.)
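
A hypothetical pre-import guard in Python, mirroring the TORCH_CHECK above for users who want a clearer error before the extension is loaded (not part of this PR):

```python
import torch

# Same check as the C++ side: strip any local-version suffix (e.g. "+cu121")
# and require an exact match, since the prebuilt library is tied to the
# libtorch C++ ABI of torch 2.8.0.
base_version = torch.__version__.split("+")[0]
if base_version != "2.8.0":
    raise RuntimeError(
        f"Mooncake Backend was built against torch==2.8.0; found {base_version}")
```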

Comment thread mooncake-ep/include/mooncake_ep_buffer.h Outdated
@ShangmingCai
Collaborator

This is a huge PR. I have finished several rounds of basic review and found only some easy-to-fix problems. I think we can merge this first, after addressing the above comments, to see if we can get some user feedback. CC: @alogfans, better take a look before merging this PR.

@UNIDY2002
Collaborator Author

@ShangmingCai Thanks for your review and valuable feedback! I'll fix the issues.

@alogfans
Collaborator

I agree with @ShangmingCai, merge it first.

@alogfans merged commit c5829aa into main on Sep 26, 2025
13 checks passed
@UNIDY2002 deleted the sunxun/mooncake-backend-dev branch on September 29, 2025, 12:36
wanyue-wy pushed a commit to wanyue-wy/Mooncake that referenced this pull request Dec 14, 2025
* Initialize a mooncake backend

* Add pybind

* Fix incorrect backend registration

* Fix wheel building of mooncake_ep

* Add a fake allreduce implementation

* Introduce transfer_engine to mooncake_backend

* Add a basic CPU proxy execution framework

* Implement a seemingly working allgather

* Remove mooncake_ep's dependency on etcd

* Implement `_allgather_base`

* Implement `allreduce`

* Implement `alltoall`

* Use an even-odd pattern for data transfer

* Add a `set_host_ip` method

* Switch to an extended-API implementation of the Mooncake backend

* Implement `broadcast`

* Implement `barrier`

* Extend Mooncake backend to CPU

* Support more operations for reduction

* Fix the backend-worker coordination logic

* Optimize CPU worker with a callback pattern

* Add a timeout-based broken-ranks detection

* Merge EP module into Mooncake's build system

* Share transfer buffer across all worker instances

* Switch to a more robust approach to detect broken ranks

* Specify CUDA device for test_mooncake_backend.py

* Explicitly stop mooncake worker

* Use transfer engine's notifications to implement collective signals

* Remove the unused `all_reduce_without` API

* Switch to mooncake backend for test_mooncake_ep.py

* Support both IB and RoCE

* Fix EP unit test

* Pass the auto-detected nic_id to EP Buffer

* Fix CMake conditional branches when `PYTORCH_CMAKE_PATH` is not set

* Fix ibgda syncing for RoCE

* Revert "Share transfer buffer across all worker instances"

This reverts commit 964e0a9

* Implement `_reduce_scatter_base`

* Make CPU backends aware of broken ranks

* Fix .typos.toml

* Add a perf test for mooncake backend

* Support more dtypes for reduction

* Revert "Use transfer engine's notifications to implement collective signals"

This reverts commit f20ffb2

* Share worker thread among all process groups

* Share transfer engine among all process groups

* Fix unit tests

* Add a warmup phase for transfer engine

* Fix transfer engine buffer locations

* Fix incorrect calculation of mooncake ep buffer

* Do not use timeout detection in mooncake_ep tests

* Update mooncake backend perf test

* Demangle per-group buffer offset from the shared taskId

* Stop allocating the useless `cuda_counter_buffer` and `cuda_data_buffer`

* Split the task list into a CPU region and a CUDA region

* Add a warmup for test_mooncake_backend_perf.py

* Switch from raw cudaEvent to `torch::Event`

* Fix MooncakeWorkCuda::wait() to make it compatible with cuda graphs

* Add doc

* Fix perf test

* Implement all-gather for perf test

* Move impl of `MooncakeEpBuffer`'s member functions to .cpp

* Change `gathered_experts` to `broken_nodes` to make the API more consistent

* `broken_nodes` should be `broken_ranks`

* API rename

* Fix format

* Enable WITH_EP option in CI

* Try installing torch in advance in CI

* Set `TORCH_CUDA_ARCH_LIST` in CMakeLists.txt

* Install required dependencies in the CI CUDA environment

* [CI] Add the matching PyTorch

* [CI] Add a workaround for missing `CUDA::nvToolsExt`

* Remove unused pybind base class declaration of `MooncakeBackendOptions`

* Support `set_device_filter`

* Remove unused headers for ep_py.cpp

* Build the EP-wheel with setuptools on CI

* [CI] Add the build-with-ep process to release.yaml

* Minor format fix

* Update build guide

* Fix docs

* Only build EP wheel with torch==2.8.0

* Add a torch version assertion for Mooncake Backend

* Fix some python typing

* Use the correct group for EP's initial data sharing

* API: invert `broken_ranks` and change into `active_ranks`

* Followup fix for inverting the API

* Fix format

* Bug-fix in mooncake_ep_kernel.cu

* Mooncake EP has to be built with USE_CUDA on

* Fixed some issues according to the review

* Fix bug
JasonZhang517 pushed a commit to JasonZhang517/Mooncake that referenced this pull request Feb 9, 2026