[Store] High Availability 1: Master Failover by ykwd · Pull Request #451 · kvcache-ai/Mooncake

ykwd · 2025-06-05T07:50:17Z

Currently, the high availability goal consists of the following:

1. The client can detect when the master goes down and automatically remount segments after the master restarts. This way, if the master crashes, only the master needs to be restarted, without having to restart the entire cluster.
2. The master service contains multiple instances, one of which is the leader and serves client requests. When the current leader fails, other instances can campaign to become the new leader and continue providing service. However, since the previous metadata is lost, this is equivalent to the cluster's key-value store being cleared, and the cache has to be repopulated.
3. Tolerate client failure. The master can detect when a client fails and mark its associated data as invalid. When the client comes back online, it can re-register with the master.
4. KV metadata failover. Persist or replicate the KV metadata so that the newly elected leader can sync to the latest state and the previously existing key-value data is preserved.
5. Add fault-tolerant logic to the existing code to make the system more robust.
6. Add more tests to validate the robustness of the system under various failures.

This PR is the first step of a series of attempts to make mooncake-store achieve high availability. This PR achieves the 1st and 2nd goals listed above.

More precisely, this PR adds an optional HA mode that depends on an etcd service. In HA mode, users can deploy multiple master instances; at any time, only one instance is the leader and serves client requests. If N instances are deployed, the cluster can tolerate the failure of up to N-1 instances as long as the etcd service is alive. Non-HA mode is still supported and remains the default; in that mode the system behaves as it did before.

Because high availability depends on etcd, building mooncake-store now requires installing Go and compiling etcd-wrapper. mooncake-store and transferengine each have their own independent etcd client instance, so they won't interfere with each other.

Additionally, to run the corresponding unit tests in CI, we also need to deploy an etcd service in the CI environment.


maobaolong

@ykwd Thanks for this great, meaningful feature! I have not finished reviewing this PR yet.

I left some inline comments first.

maobaolong · 2025-06-07T01:32:46Z

     * @brief Internal helper functions for initialization and data transfer
     */
-    ErrorCode ConnectToMaster(const std::string& master_addr);
+    ErrorCode ConnectToMaster(const std::string& master_entries, bool enable_ha);


Do we need the extra enable_ha argument if we can detect that multiple addresses are given in master_entries?

Good question! I am also unsure how HA mode should be specified.

One way is to use a prefix to distinguish the mode: e.g. etcd://0.0.0.0:2345 denotes HA mode (or more precisely, in the current situation, denotes using etcd to find the master address), while 0.0.0.0:2345 denotes non-HA mode, similar to transfer engine. This way, existing commands or scripts that start the client do not need any modification.

Furthermore, this can also decouple HA mode from etcd. Some HA features, such as detecting client liveness, simply rely on regular Pings from clients to the master and do not need etcd. So even if etcd is not set, we can enable the client liveness detection feature. On the other hand, this also means HA features become default settings, i.e. users do not need to explicitly set enable_ha=true.

Is this ok? Or any suggestions?

On the other hand, since HA mode is still unstable and some users may not need it, enabling HA mode only when the etcd prefix is set may not be a bad idea.

What is certain is that we had better keep the default behavior unchanged. So using etcd://xxx:yyy is a possible way for now.

But if we want to lead users to HA mode by default, a common approach is to publish a notice for a period (maybe one month, or several minor releases).

Some HA features such as detecting the liveness of clients simply relies on the regular Pings from clients to master, which do not need to rely on etcd

Yeah, there are two concepts introduced by HA.

1. Multiple masters, or master discovery.

2. Client-side liveness detection of remote services.

For 1, it would be an optional feature; the original non-HA use case has no motivation to adopt HA mode, since that means introducing etcd and extra master servers.

For 2, it could exist all the time, and we can go a step further and turn the liveness ping thread into a heartbeat thread. This goes beyond the concept of HA: with it, the master can serve more operations for users, implemented as commands carried in the heartbeat replies, which the client side then executes.

Let me summarize the discussion. Correct me if I misunderstood something.

1. In the configuration, reuse master_server_address instead of changing it to master_server_entries. If the prefix is etcd://, then use etcd to get the real master address.

2. Remove the enable_ha parameter on the client side. Currently, if the prefix of master_server_address is etcd://, then use HA mode; otherwise keep the default behavior unchanged.

3. Publish a notice and introduce HA mode to users.

@ykwd Yeah, the summary is very accurate.
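The prefix rule agreed on above could be sketched roughly as follows. This is only an illustration, not the code from this PR: `MasterAddressConfig` and `ParseMasterServerAddress` are hypothetical names, and the sketch assumes that everything after `etcd://` is passed on unchanged as the etcd endpoint list.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical result type; not part of the Mooncake code base.
struct MasterAddressConfig {
    bool use_etcd;          // true when the "etcd://" scheme is present (HA mode)
    std::string endpoints;  // etcd endpoints in HA mode, master address otherwise
};

// Splits a master_server_address value into a mode flag and the address part.
// Values without the "etcd://" prefix are returned unchanged, so existing
// non-HA configurations keep their current meaning.
inline MasterAddressConfig ParseMasterServerAddress(std::string_view value) {
    constexpr std::string_view kEtcdScheme = "etcd://";
    if (value.substr(0, kEtcdScheme.size()) == kEtcdScheme) {
        return {true, std::string(value.substr(kEtcdScheme.size()))};
    }
    return {false, std::string(value)};
}
```

With this rule, a value such as `localhost:8081` stays in non-HA mode, while `etcd://localhost:2379` selects HA mode with `localhost:2379` as the etcd endpoint.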

maobaolong · 2025-06-07T01:41:27Z

 }

+ErrorCode Client::ConnectToMaster(const std::string& master_entries, bool enable_ha) {
+    if (enable_ha) {


maobaolong · 2025-06-07T01:44:14Z

+
+        // Start Ping thread to monitor master view changes and remount segments if needed
+        ping_running_ = true;
+        ping_thread_ = std::thread(&Client::PingThreadFunc, this, master_version);


This can also serve as a heartbeat thread: if the master stops receiving heartbeats for a while, it can unmount this client's segments.

Do you intend to make this thread a daemon thread?

Yes. These regular Pings can be used by the master to detect client liveness. We will implement this feature in the next PR.

After a short offline discussion, this topic can be closed: stopping the thread in the destructor is close to best practice.
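A minimal sketch of the "stop the thread in the destructor" pattern mentioned above, assuming a periodic callback stands in for the real Ping RPC. `PingLoop` is a hypothetical class, not the `Client` implementation in this PR; the condition variable is there so the destructor does not have to wait out a full ping interval.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical helper: calls `ping` every `interval` until destroyed.
class PingLoop {
public:
    PingLoop(std::function<void()> ping, std::chrono::milliseconds interval)
        : ping_(std::move(ping)), interval_(interval),
          thread_([this] { Run(); }) {}

    // Signal the loop to stop, wake it if it is sleeping, then join.
    ~PingLoop() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            running_ = false;
        }
        cv_.notify_all();
        thread_.join();
    }

    PingLoop(const PingLoop&) = delete;
    PingLoop& operator=(const PingLoop&) = delete;

private:
    void Run() {
        std::unique_lock<std::mutex> lock(mu_);
        while (running_) {
            lock.unlock();
            ping_();  // run the callback without holding the lock
            lock.lock();
            // Sleep for one interval, but return early on shutdown.
            cv_.wait_for(lock, interval_, [this] { return !running_; });
        }
    }

    std::function<void()> ping_;
    std::chrono::milliseconds interval_;
    std::mutex mu_;
    std::condition_variable cv_;
    bool running_ = true;
    std::thread thread_;  // declared last so it starts after the other members
};
```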

maobaolong

I left some questions. It would be nice if you could provide an introduction covering:

How to avoid split-brain
How to avoid waiting forever
How to fall back to recompute performance if some unrecoverable issue happens.

maobaolong · 2025-06-09T03:16:56Z

            "protocol": "tcp",
            "device_name": "",
-            "master_server_address": "localhost:8081"
+            "master_server_entries": "localhost:8081",


Do we need to keep backward compatibility? Maybe we can reuse master_server_address to identify HA mode?

"master_server_address": "localhost:8081" --- non-HA mode, since only one host is configured.

"master_server_address": "localhost:8081,localhost:8082,localhost:8083" --- HA mode, since three addresses are configured.

In HA mode, the master server address is not given by input parameters but is obtained from etcd. So the input address is the list of etcd endpoints.

But considering backward compatibility, maybe we can reuse master_server_address, though there might be a slight ambiguity.

there might be a slight ambiguity.
@ykwd Thanks for the explanation. It would definitely be ambiguous if master_server_address stood for etcd endpoints directly. But we can use a scheme to distinguish whether it stands for a standard master address or etcd endpoint addresses.

"master_server_address": "localhost:8081" --- non-HA mode, since only one host is configured.

"master_server_address": "etcd://localhost:8081,localhost:8082,localhost:8083" --- HA mode, since the scheme is etcd.

Is this better than the above approach?

Sounds good!

maobaolong · 2025-06-09T03:51:18Z

+}
+
+void MasterViewHelper::ElectLeader(const std::string& master_address, ViewVersionId& version, EtcdLeaseId& lease_id) {
+    while (true) {


Could this loop run forever? For example, if all masters have failed.

maobaolong · 2025-06-09T03:55:41Z

+        auto ret = EtcdHelper::Get(MASTER_VIEW_KEY, strlen(MASTER_VIEW_KEY), current_master, current_version);
+        if (ret != ErrorCode::OK && ret != ErrorCode::ETCD_KEY_NOT_EXIST) {
+            LOG(ERROR) << "Failed to get current leader: " << ret;
+            std::this_thread::sleep_for(std::chrono::seconds(1));


Maybe we can introduce exponential backoff for retries to avoid a retry storm?

There are only a few master instances, e.g. three, and each instance retries once per second, so there will not be much overhead for the etcd cluster.

Sounds good.
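For reference, the exponential backoff that was suggested (and judged unnecessary here, given the small number of masters) could look roughly like this. `BackoffDelay` is a hypothetical helper, not something this PR adds; the PR keeps the fixed 1-second sleep. Jitter is left out to keep the sketch short.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Hypothetical helper: delay to wait before retry number `attempt` (0-based).
// The delay doubles on every attempt and is capped at `cap`.
inline std::chrono::milliseconds BackoffDelay(unsigned attempt,
                                              std::chrono::milliseconds base,
                                              std::chrono::milliseconds cap) {
    assert(base.count() > 0 && cap >= base);
    std::chrono::milliseconds delay = base;
    for (unsigned i = 0; i < attempt && delay < cap; ++i) {
        delay = std::min(delay * 2, cap);
    }
    return delay;
}
```

With a 1-second base and a 30-second cap, the delays would be 1 s, 2 s, 4 s, 8 s, 16 s, and then 30 s for every later attempt.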

ykwd · 2025-06-09T09:05:31Z

Thanks for the review.

How to avoid split-brain
The kv-pair (MASTER_VIEW_KEY, leader address) in etcd determines which instance is the leader. When this key is deleted by lease expiration, all instances are notified and campaign to elect a new leader. The old leader closes its local master service when notified. However, since the notification may be delayed and closing the master service may take some time, there can be a short window in which split-brain happens, which is very hard to prevent.
References:
What kind of safety guarantee does the leader election primitives provide? etcd-io/etcd#10125
https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html

To mitigate this, we can let the new leader begin serving requests only after a long enough time, e.g. 10 seconds. In the future, this problem can be solved using the following strategy:
The leader records a local lease expiration time Exp.
Every ttl / 2 seconds, the leader does the following steps:

The leader records the current time T.
The leader sends a KeepAliveOnce request to etcd to extend the lease for ttl seconds. If the KeepAliveOnce request succeeds, the leader knows the lease will not expire until T+x, so it is safe to update Exp to T+x.
For every request, the leader returns the OK result to the client only if the request finishes before Exp.
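The bookkeeping in this strategy can be sketched as below. This is an illustration under stated assumptions, not code from this PR: `LeaderLease` is a hypothetical class, the `keep_alive_once` callback stands in for the etcd KeepAliveOnce request, and the sketch assumes x = ttl, which is safe because T is recorded before the request is sent. The class is not thread-safe as written.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <utility>

using LeaseClock = std::chrono::steady_clock;

// Hypothetical sketch of the leader's local lease expiration time Exp.
class LeaderLease {
public:
    // keep_alive_once: stand-in for the etcd KeepAliveOnce request,
    //                  returns true when the lease was extended.
    // now:             injectable clock, defaults to the steady clock.
    LeaderLease(std::chrono::seconds ttl,
                std::function<bool()> keep_alive_once,
                std::function<LeaseClock::time_point()> now =
                    [] { return LeaseClock::now(); })
        : ttl_(ttl),
          keep_alive_once_(std::move(keep_alive_once)),
          now_(std::move(now)) {
        assert(ttl_.count() > 0);
    }

    // Intended to be called every ttl / 2. T is recorded before the request
    // is sent, so T + ttl never exceeds the real expiration time in etcd.
    bool Renew() {
        const LeaseClock::time_point t = now_();
        if (!keep_alive_once_()) return false;  // keep the old Exp on failure
        exp_ = t + ttl_;
        return true;
    }

    // A finished request may be acknowledged with OK only before Exp.
    bool CanAcknowledge() const { return now_() < exp_; }

private:
    std::chrono::seconds ttl_;
    std::function<bool()> keep_alive_once_;
    std::function<LeaseClock::time_point()> now_;
    LeaseClock::time_point exp_{};  // clock epoch, i.e. treated as expired
};
```

A request handler would check `CanAcknowledge()` after finishing its work and return an error instead of OK once the local lease has run out.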

How to avoid forever wait
Forever loop in leader election: the master process is supposed to run forever. As long as the process is not terminated and the local master is not the leader, it keeps trying to elect itself as leader.
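A campaign loop of this kind can stay open-ended and still be stoppable on shutdown. The sketch below is hypothetical and much simpler than `MasterViewHelper::ElectLeader`: `try_create` stands in for "create MASTER_VIEW_KEY with a lease if it does not exist", and the loop simply sleeps between attempts instead of watching the key in etcd.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Campaigns until this instance becomes leader or `stop` is set.
// Returns true if leadership was won, false if the loop was cancelled.
inline bool CampaignUntilLeader(const std::function<bool()>& try_create,
                                const std::atomic<bool>& stop,
                                std::chrono::milliseconds retry_interval) {
    while (!stop.load()) {
        if (try_create()) return true;  // this instance is now the leader
        std::this_thread::sleep_for(retry_interval);
    }
    return false;  // cancelled, e.g. because the process is shutting down
}
```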
How to fall back to recompute performance if some unrecoverable issue happens
This PR does not introduce additional overhead for requests such as get or put, but it also doesn't address existing issues, if there are any. If ensuring this property is important, we can address it in a new PR.

xiaguan · 2025-06-09T08:27:56Z

+        std::string current_master;
+        auto ret = EtcdHelper::Get(MASTER_VIEW_KEY, strlen(MASTER_VIEW_KEY), current_master, current_version);
+        if (ret != ErrorCode::OK && ret != ErrorCode::ETCD_KEY_NOT_EXIST) {
+            LOG(ERROR) << "Failed to get current leader: " << ret;


Let's improve this log message to indicate that the failure is likely due to an etcd issue.
This should help users diagnose the problem more quickly.

In this case, the returned errorcode is ETCD_OPERATION_ERROR. The LOG output is:
ha_helper.cpp:16] Failed to get current leader: ETCD_OPERATION_ERROR

xiaguan · 2025-06-09T08:30:35Z

+        ret = EtcdHelper::CreateWithLease(MASTER_VIEW_KEY, strlen(MASTER_VIEW_KEY),
+            master_address.c_str(), master_address.size(), lease_id, version);
+        if (ret == ErrorCode::ETCD_TRANSACTION_FAIL) {
+            LOG(INFO) << "Failed to elect self as leader";


Same as above, e.g. "Etcd transaction failed xxxx".

OK. Will fix this.

xiaguan · 2025-06-09T08:34:38Z

+        if (mv_helper.ConnectToEtcd(etcd_endpoints_) != ErrorCode::OK) {
+            LOG(ERROR) << "Failed to connect to etcd endpoints: " << etcd_endpoints_;
+            return -1;
+        }


We could move this into the MasterViewHelper constructor and throw an exception if it fails; a failure basically means the master can't start up properly anyway.

OK. Sounds reasonable.

On reflection, MasterViewHelper::ConnectToEtcd is invoked in Client::ConnectToMaster(). If we move it into the MasterViewHelper constructor, the connection logic is also partially moved into the client's constructor. Besides, it is only invoked in HA mode. In non-HA mode there is no etcd, and this function should not be called.

xiaguan · 2025-06-09T08:37:33Z

+    ETCD_OPERATION_ERROR = -1000,  ///< etcd operation failed.
+    ETCD_KEY_NOT_EXIST = -1001,  ///< key not found in etcd.
+    ETCD_TRANSACTION_FAIL = -1002,  ///< etcd transaction failed.
+    ETCD_CTX_CANCELLED = -1003,  ///< etcd context cancelled.


update toString also

Sure. This is updated in type.cpp
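For illustration, a name mapping for the four new codes might look like the sketch below. It is not the contents of type.cpp: the enum here contains only the values shown in the diff above, and the underlying type, the `toString` signature, and the `UNKNOWN_ERROR` fallback are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Only the four etcd codes from the diff; the real enum has more entries.
enum class ErrorCode : int32_t {
    ETCD_OPERATION_ERROR = -1000,   // etcd operation failed
    ETCD_KEY_NOT_EXIST = -1001,     // key not found in etcd
    ETCD_TRANSACTION_FAIL = -1002,  // etcd transaction failed
    ETCD_CTX_CANCELLED = -1003,     // etcd context cancelled
};

// Maps an error code to its name so that log lines stay readable.
inline std::string toString(ErrorCode code) {
    switch (code) {
        case ErrorCode::ETCD_OPERATION_ERROR:
            return "ETCD_OPERATION_ERROR";
        case ErrorCode::ETCD_KEY_NOT_EXIST:
            return "ETCD_KEY_NOT_EXIST";
        case ErrorCode::ETCD_TRANSACTION_FAIL:
            return "ETCD_TRANSACTION_FAIL";
        case ErrorCode::ETCD_CTX_CANCELLED:
            return "ETCD_CTX_CANCELLED";
    }
    return "UNKNOWN_ERROR";  // assumed fallback for values without a name
}
```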

xiaguan · 2025-06-09T08:39:37Z

+ErrorCode Client::ConnectToMaster(const std::string& master_entries, bool enable_ha) {
+    if (enable_ha) {
+        // Get master address from ETCD
+        auto err = master_view_helper_.ConnectToEtcd(master_entries);


maybe we can introduce a new interface like get_current_leader_master_address()?

maybe we could move all of that down into MasterClient to keep Client clean and simple?

maybe we can introduce a new interface like get_current_leader_master_address()?

There is already an interface to get the leader address, i.e., master_view_helper_.GetMasterView.

maybe we could move all of that down into MasterClient to keep Client clean and simple?

The HA logic, such as ping thread, requires resources and information of the client, such as mounted_segments_.

xiaguan · 2025-06-09T09:11:19Z

By the way, since we're getting more and more configuration and constants, I think it's a good idea to create a separate config module. I'll go ahead and open a new issue for that

ykwd · 2025-06-10T02:44:01Z

Updates:

For client-side parameters: reuse master_server_address, remove enable_ha;
In the CI flow: install and start the etcd service, run high_availability_test;
Format code style;
Fix some bugs and address some minor issues.

To do in future PRs:

More reliable split-brain prevention.
Documentation (will be added together with the client failure auto-detection feature).
Add chaos testing to test system stability.

xiaguan

LGTM

xiaguan · 2025-06-11T06:17:04Z

@maobaolong PTAL

maobaolong

LGTM for the code. It would be nice for users if we could add related documentation. As this PR is already big enough, you can add it in another PR.


SpecterCipher · 2025-08-04T13:36:37Z

May I kindly ask whether there is a planned timeline for supporting KV metadata failover? @ykwd

ykwd · 2025-08-04T15:12:34Z

Thank you for your interest! Supporting KV metadata failover is indeed on our roadmap, but it's not a top priority at the moment. At least for August, we don’t plan to work on it yet.

SpecterCipher · 2025-08-05T01:12:21Z

OK, thank you for your response.


ykwd added 17 commits May 28, 2025 08:59

A temp version. Better to continue development after merging the latest main branch

2896379

Resolve merge conflicts

483762f

Temp version to merge the latest main branch

b1abc1f

merge main

8c87ecb

Allow optional use HA mode, in default use non-HA mode. Fix a minor metrics bug.

0bbcef9

Refactor the etcd_helper

fb55405

refactor ha_helper

6e10b11

Add some unit tests. Refactor the code

778f2c4

Update cmakelists: build etcd_wrapper in default

aa60561

Merge main

3df656c

Fix ci problems. Compile etcd wrapper only when use_etcd or with_store are set.

bc5e1ad

Update python config relating to mooncake-store client

aa312a1

Merge remote-tracking branch 'origin' into feature/ha

30f1870

make some blocking etcd helper function cancellable.

a73f5da

bug fix: add string name of new errors that will be used in tostring.

Refactor etcd related code

def6e68

Bug fix

984c6bd

Add basic masterviewhelper unit tests

b5b36d2

maobaolong reviewed Jun 7, 2025


maobaolong reviewed Jun 9, 2025


In ci flow, install and start etcd to run HA feature unit test.

124157d

ykwd marked this pull request as ready for review June 9, 2025 05:57

ykwd added 2 commits June 9, 2025 06:10

Merge lastest main branch

60cafd8

Fix a ci bug

9f0d1e1

xiaguan assigned xiaguan and unassigned xiaguan Jun 9, 2025

xiaguan self-requested a review June 9, 2025 08:12

xiaguan reviewed Jun 9, 2025


Reuse master_server_address parameter and remove enable_ha parameter.

d28d6be

ykwd added 2 commits June 9, 2025 12:09

Format the code. Fix a minor bug.

5ffa05e

Handle the error case: the coro server may fail to start or return internal error.

3fa316c

ykwd requested review from maobaolong and xiaguan June 10, 2025 02:44

Merge main

20b5aaa

xiaguan approved these changes Jun 11, 2025


maobaolong approved these changes Jun 11, 2025


xiaguan merged commit 41b1df7 into kvcache-ai:main Jun 11, 2025
10 checks passed

ykwd mentioned this pull request Jun 16, 2025

[Store] High Availability V2: Client Failover #501

Merged

ykwd deleted the feature/ha branch July 10, 2025 06:36

SgtPepperr mentioned this pull request Aug 4, 2025

[Store]feat: Migrate Persistence Metadata from Client to Master Service #690

Merged

ykwd mentioned this pull request Aug 21, 2025

[RoadMap] Mooncake Store V2 #378

Open

29 tasks
