[Store] High Availability V2: Client Failover by ykwd · Pull Request #501 · kvcache-ai/Mooncake

ykwd · 2025-06-16T12:25:05Z

Following #451, this PR serves as the 2nd step towards store's high availability.

This PR mainly focuses on the failover on the client side. More specifically, this PR adds the following features:

When the master has not heard from a certain client through Pings for a pre-defined time period, either because the client terminated or because of a network partition, the master will unmount all segments on this client.
When the client regains its connection to the master after a network partition, it can automatically remount all its local segments.
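
The heartbeat-expiry behavior described above can be sketched roughly as follows. This is only an illustration under assumptions: `ClientMonitor`, `OnPing`, and `CollectExpired` are hypothetical names, not the actual master code, and client IDs are plain strings here instead of UUIDs.

```cpp
#include <chrono>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical names throughout; this is not the Mooncake master API.
using Clock = std::chrono::steady_clock;

class ClientMonitor {
public:
    explicit ClientMonitor(std::chrono::milliseconds ttl) : ttl_(ttl) {}

    // Record a ping. Returns true if the client was unknown, i.e. it is new
    // or was previously expired, and therefore has to remount its segments.
    bool OnPing(const std::string& client_id, Clock::time_point now) {
        return last_ping_.insert_or_assign(client_id, now).second;
    }

    // Remove and return every client whose last ping is older than the TTL.
    // The caller would then unmount all segments owned by these clients.
    std::vector<std::string> CollectExpired(Clock::time_point now) {
        std::vector<std::string> expired;
        for (auto it = last_ping_.begin(); it != last_ping_.end();) {
            if (now - it->second > ttl_) {
                expired.push_back(it->first);
                it = last_ping_.erase(it);
            } else {
                ++it;
            }
        }
        return expired;
    }

private:
    std::chrono::milliseconds ttl_;
    std::unordered_map<std::string, Clock::time_point> last_ping_;
};
```

Passing the current time in explicitly keeps the sketch easy to test; a real monitor would presumably read the clock itself and run the expiry check on a background thread.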

There are many corner cases, such as:

When a client crashes and then restarts, it may attempt to mount a segment with the same name as before, e.g., using its hostname as the segment name. Though the two segments have the same name, the old one is no longer available. The master must identify this situation, unmount the old segment, and mount the new one.
When a client gets partitioned from the master, the master unmounts its segments but has not yet finished deleting all the replicas allocated on those segments. If the client reconnects to the master at this point and tries to remount its segment, the segment must not be mounted or have memory allocated to new key-values, because the old key-values that reside on it may still exist.
There is a leader view change. The client needs to remount all the segments to the new leader.
etc.

To deal with all the corner cases:

A UUID is associated with each client instance and each segment.
Segment unmounting is divided into two phases: prepare and commit. In the prepare phase, the segment's buffer allocator is deleted and the segment is marked as unmounting. In the commit phase, after all the associated replicas have been deleted, the segment is fully deleted. A segment cannot be mounted while a segment with the same UUID is still unmounting.
On the master side, the client status and segment status shall be updated together in an atomic way to avoid state inconsistency.
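
A minimal sketch of the two-phase unmount described above, assuming a single in-memory table keyed by segment UUID. All names here (`SegmentTable`, `PrepareUnmount`, `CommitUnmount`, `live_replicas`) are hypothetical and only illustrate the state machine. Locking is omitted, so the atomic update of client and segment status is not shown.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical sketch; names and types are illustrative, not Mooncake's.
enum class SegmentState { kMounted, kUnmounting };

struct SegmentEntry {
    std::string name;
    SegmentState state = SegmentState::kMounted;
    int live_replicas = 0;  // replicas still allocated on this segment
};

class SegmentTable {
public:
    // Mount is refused while a segment with the same UUID is still tracked,
    // whether it is mounted or in the middle of being unmounted.
    bool Mount(const std::string& uuid, const std::string& name) {
        if (segments_.count(uuid) != 0) return false;
        SegmentEntry entry;
        entry.name = name;
        segments_.emplace(uuid, entry);
        return true;
    }

    // Allocation only succeeds on a fully mounted segment.
    bool Allocate(const std::string& uuid) {
        auto it = segments_.find(uuid);
        if (it == segments_.end() ||
            it->second.state != SegmentState::kMounted) {
            return false;
        }
        ++it->second.live_replicas;
        return true;
    }

    // Phase 1 (prepare): stop new allocations and mark as unmounting.
    bool PrepareUnmount(const std::string& uuid) {
        auto it = segments_.find(uuid);
        if (it == segments_.end()) return false;
        it->second.state = SegmentState::kUnmounting;
        return true;
    }

    // Called whenever one replica on the segment has been deleted.
    void OnReplicaDeleted(const std::string& uuid) {
        auto it = segments_.find(uuid);
        if (it != segments_.end() && it->second.live_replicas > 0) {
            --it->second.live_replicas;
        }
    }

    // Phase 2 (commit): only succeeds once no replica remains.
    bool CommitUnmount(const std::string& uuid) {
        auto it = segments_.find(uuid);
        if (it == segments_.end() ||
            it->second.state != SegmentState::kUnmounting ||
            it->second.live_replicas > 0) {
            return false;
        }
        segments_.erase(it);
        return true;
    }

private:
    std::unordered_map<std::string, SegmentEntry> segments_;
};
```

The point of the intermediate state is that a reconnecting client cannot have new key-values placed on a segment whose old replicas have not all been cleaned up yet.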

Currently, there is no distributed testing or chaos testing for the HA features. Thus, the HA features shall be considered as highly unstable. So this PR still does not change the default behavior of the system. HA features will be toggled on only when users explicitly specify --enable-ha=true. We will design and run more testing, including integration testing, distributed testing, and chaos testing, in the subsequent PRs. Only after passing all these tests would it be safe to introduce the HA features in the docs.

xiaguan

Great work! Just a couple of minor issues.
I just skimmed through the code and didn’t look too closely at the logic yet.

staryxchen · 2025-06-19T04:31:27Z

Hi, did the previous PR #451 make etcd a required dependency even when USE_ETCD is set to OFF?
I see the following error even though USE_ETCD is OFF:

go: github.com/kvcache-ai/Mooncake/mooncake-common/etcd imports
        go.etcd.io/etcd/client/v3: go.etcd.io/etcd/client/v3@v3.5.21: Get "https://proxy.golang.org/go.etcd.io/etcd/client/v3/@v/v3.5.21.zip": dial tcp ***1:443: i/o timeout

The immediate cause could be due to a network issue, but I don't need ETCD, so why can't I bypass this error by setting USE_ETCD to OFF? Is this as expected?

ykwd · 2025-06-19T04:43:23Z

Sorry for the inconvenience. There is no need to install etcd. This is a Go package used in mooncake-store. Do you want to compile only the transfer engine, or both the engine and the store? In the former case, setting WITH_STORE=OFF will bypass this dependency.

staryxchen · 2025-06-19T04:48:58Z

So now there is no way to compile the store without etcd? Can we make it an optional feature?

ykwd · 2025-06-19T04:54:31Z

Thanks for the suggestion. I will try to make a hot fix for this.

ykwd · 2025-06-19T06:21:52Z

Updates

Modifications suggested by reviews.
On the client side, 1) add a restriction to only mount segments with the same name as the localhost; 2) allow mounting multiple non-overlapping segments with the same name.
Add a tool, clientctl, to make manual e2e tests easier.

ykwd · 2025-06-19T07:50:13Z

@staryxchen Just submitted a hotfix. #520

xiaguan

I think unifying the client and segment abstractions would make sense. The current client -> segments setup adds a lot of unnecessary complexity. Unifying them could simplify things quite a bit on both the master and client sides. Plus, it would make future changes cleaner, since we would no longer need to handle the one-client-to-multiple-segments case.

xiaguan · 2025-06-20T05:09:29Z

+#include <string>
+#include <string_view>
+#include <unordered_map>
+#include <variant>


Remove the unused header.

OK. Removed the variant.

xiaguan · 2025-06-20T05:49:05Z

+    for (auto& segment_id : it->second) {
+        auto segment_it = segment_manager_->mounted_segments_.find(segment_id);
+        if (segment_it != segment_manager_->mounted_segments_.end()) {
+            segments.push_back(segment_it->second.segment);


Use emplace_back instead of push_back where possible to avoid creating temporary objects. This can improve performance, especially for complex types.
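
For context, a small sketch of when the two calls actually differ. `Segment` here is a hypothetical stand-in type, not the one in the diff. When the argument is an already existing object, as it appears to be in the hunk above, both calls copy-construct the element once, so the gain there is likely small. `emplace_back` mainly helps when the element is built from constructor arguments.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a segment record.
struct Segment {
    std::string name;
    std::size_t size;
    Segment(std::string n, std::size_t s) : name(std::move(n)), size(s) {}
};

std::vector<Segment> BuildSegments() {
    std::vector<Segment> segments;
    segments.reserve(3);

    // push_back needs a Segment: a temporary is built, then moved in.
    segments.push_back(Segment("host-a", 1024));

    // emplace_back forwards the arguments and constructs in place.
    segments.emplace_back("host-b", 2048);

    // With an existing object, both calls copy-construct exactly once.
    Segment existing("host-c", 4096);
    segments.push_back(existing);

    return segments;
}
```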

xiaguan · 2025-06-20T05:51:54Z

+     * @return client status from the master
+     * @return ErrorCode indicating success/failure
+     */
+    [[nodiscard]] PingResponse Ping(const UUID& client_id);


Even though both functions are named 'ping', they actually have different meanings. We should probably rename them to better reflect what each one does.

By "both functions", are there two functions named 'ping'?

one for heartbeat, and the other to check if the master is available?

OK, I found it. That is weird. I wonder where the duplicate ping comes from. Perhaps it came from a merge. I will remove the duplicate one.

xiaguan · 2025-06-20T05:55:18Z

@@ -677,46 +709,44 @@ void Client::PingThreadFunc(int current_version) {
    const int fail_ping_interval_ms = 1000;


We should probably turn these into configurable options too in the future.

Unlike the master, currently it is very hard to configure the client, especially adding configuration options to the client. Additionally, I do not see the reason to make this configurable for users.

We're planning to add a client config module in the future.

The ping interval does seem a bit long—do you think setting the default to 100ms would work better?

We're planning to add a client config module in the future.

That would be great. Looking forward to seeing it.

The ping interval does seem a bit long—do you think setting the default to 100ms would work better?

Perhaps not. Currently, if the leader crashes, it takes several seconds (perhaps 5 to 10+ seconds) before the new leader begins to serve. After a leader crash, it takes 3 failed pings to trigger the client to query etcd for the new leader address. With the 1000 ms interval this takes 3 seconds, which fits the leader change timespan well. Additionally, too many pings would also put a burden on the master.
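
The arithmetic in this reply can be written out as a small sketch. The threshold of 3 failures and the 1000 ms failure interval come from the thread. Everything else is an assumption made for illustration: the names `PingPolicy` and `TimeToLeaderLookupMs`, the interval after a successful ping, and the rule that a success resets the counter.

```cpp
#include <vector>

// Hypothetical sketch of the client-side failure counter described above.
struct PingPolicy {
    int fail_ping_interval_ms = 1000;  // value shown in the quoted hunk
    int max_failures = 3;              // threshold stated in the reply
};

// Feed ping results in order; returns the elapsed milliseconds at which the
// client would start querying etcd for a new leader, or -1 if it never does.
// Assumes each failed ping is followed by one fail_ping_interval_ms wait and
// that a successful ping resets the counter and costs success_interval_ms.
int TimeToLeaderLookupMs(const std::vector<bool>& ping_ok,
                         const PingPolicy& policy,
                         int success_interval_ms = 1000) {
    int elapsed_ms = 0;
    int consecutive_failures = 0;
    for (bool ok : ping_ok) {
        if (ok) {
            consecutive_failures = 0;
            elapsed_ms += success_interval_ms;
            continue;
        }
        elapsed_ms += policy.fail_ping_interval_ms;
        if (++consecutive_failures >= policy.max_failures) {
            return elapsed_ms;
        }
    }
    return -1;
}
```

Under these assumptions, three consecutive failures at 1000 ms give 3000 ms, matching the 3 seconds mentioned above. With a 100 ms interval the lookup would start after 300 ms, well before a new leader is expected to be serving.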

xiaguan · 2025-06-20T06:11:14Z

+        return allocators_by_name_;
+    }
+
+    std::vector<std::shared_ptr<BufferAllocator>>& getAllocators() {


xiaguan · 2025-06-20T06:11:37Z

+          allocators_(allocators),
+          lock_(mutex) {}
+
+    std::unordered_map<std::string,


return a const ref
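
A short sketch of what the suggestion might look like. `AllocatorView` and the `BufferAllocator` stub are hypothetical stand-ins, and the accessor name is only modeled on the hunk above. Returning a const reference avoids copying the map and prevents callers from mutating it through the accessor.

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the real allocator type.
struct BufferAllocator {
    std::string segment_name;
};

class AllocatorView {
public:
    using ByName =
        std::unordered_map<std::string,
                           std::vector<std::shared_ptr<BufferAllocator>>>;

    explicit AllocatorView(const ByName& by_name) : by_name_(by_name) {}

    // Read-only access: no copy of the map, and callers cannot modify it.
    const ByName& getAllocatorsByName() const { return by_name_; }

private:
    const ByName& by_name_;  // the view does not own the map
};
```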

Major changes include: Client failover. Refactor SegmentManager. On the client side, limit the segment name to be equal to the localhost name. Add clientctl for manual e2e tests.


ykwd added 30 commits May 28, 2025 08:59

A temp version. Better to continue development after merging the latest main branch

2896379

Resolve merge conflicts

483762f

Temp version to merge the latest main branch

b1abc1f

merge main

8c87ecb

Allow optional use HA mode, in default use non-HA mode. Fix a minor metrics bug.

0bbcef9

Refactor the etcd_helper

fb55405

refactor ha_helper

6e10b11

Add some unit tests. Refactor the code

778f2c4

Update cmakelists: build etcd_wrapper in default

aa60561

Merge main

3df656c

Fix ci problems. Compile etcd wrapper only when use_etcd or with_store are set.

bc5e1ad

Update python config relating to mooncake-store client

aa312a1

Merge remote-tracking branch 'origin' into feature/ha

30f1870

make some blocking etcd helper function cancellable.

a73f5da

bug fix: add string name of new errors that will be used in tostring.

Refactor etcd related code

def6e68

Bug fix

984c6bd

Add basic masterviewhelper unit tests

b5b36d2

In ci flow, install and start etcd to run HA feature unit test.

124157d

Merge lastest main branch

60cafd8

Fix a ci bug

9f0d1e1

Reuse master_server_address parameter and remove enable_ha parameter.

d28d6be

Format the code. Fix a minor bug.

5ffa05e

Handle the error case: the coro server may fail to start or return internal error.

3fa316c

Merge main

20b5aaa

Unmount the segment when heartbeat expires (a temp version)

ab0937e

A basic complete version

efda989

Refactor the code: add SegmentManager

42d60d8

Bug fix and minor updates

b4f5547

Merge main

ed9a8a0

Fix bugs and typos. Format the code

5c07ca0

ykwd marked this pull request as ready for review June 17, 2025 09:59

xiaguan reviewed Jun 17, 2025

james0zan mentioned this pull request Jun 18, 2025

[Store] Enable Client SSD Offload And Storage Persistence #437

Merged

ykwd added 4 commits June 18, 2025 06:30

Address review comments

2c2c42c

Fix a bug; Add clientctl, an small tool to do manual e2e test.

55b6c75

Fix typo

450e9d7

Resolve the issue: in Store, the segment name must be equal to the localhost name

b560855

Fix a metrics bug. Improve the clientctl and add two e2e test cases.

dce6f73

Merge remote-tracking branch 'origin/main'

0b0e6a2

ykwd mentioned this pull request Jun 19, 2025

[Build] Skip etcd go package compilation by default #520

Merged

xiaguan self-requested a review June 20, 2025 05:03

xiaguan reviewed Jun 20, 2025

ykwd added 2 commits June 20, 2025 08:12

Update according to review

16796ae

Merge main

42ebdba

xiaguan approved these changes Jun 20, 2025

ykwd merged commit 91c5778 into kvcache-ai:main Jun 20, 2025
10 checks passed

ykwd deleted the feature/ha2 branch July 10, 2025 06:36

ykwd mentioned this pull request Aug 21, 2025

[RoadMap] Mooncake Store V2 #378

Open

29 tasks

3 participants
