[Store] High Availability 1: Master Failover#451

Merged: xiaguan merged 24 commits into kvcache-ai:main from ykwd:feature/ha on Jun 11, 2025
Collaborator

@ykwd ykwd commented Jun 5, 2025

Currently, the high availability goal consists of the following:

  1. The client can detect when the master goes down and automatically remount segments after the master restarts. This way, if the master crashes, only the master needs to be restarted without having to restart the entire cluster.

  2. The master service contains multiple instances, one of which is the leader and serves the client requests. When the current leader fails, other instances can campaign to become the new leader and continue providing services. However, since the previous metadata is lost, it's equivalent to the cluster's key-value store being cleared, requiring the cache to be repopulated.

  3. Tolerate client failure. The master can detect when a client fails and mark its associated data as invalid. When the client comes back online, it can re-register with the master.

  4. KV metadata failover. Persist or replicate the kvs' metadata so that the newly elected leader can sync to the latest state, and the previously existing key-value data is still preserved.

  5. Add fault-tolerant logic to the existing code to make the system more robust.

  6. Add more tests to validate the robustness of the system under various failures.

This PR is the first step of a series of attempts to make mooncake-store achieve high availability. This PR achieves the 1st and 2nd goals listed above.

More precisely, this PR adds an optional HA mode that depends on an etcd service. In HA mode, users can deploy multiple master instances, and at any time only one instance is the leader that serves client requests. If N instances are deployed, the cluster can tolerate the failure of up to N-1 instances as long as the etcd service is alive. Non-HA mode is still supported and remains the default; in that mode the system behaves as it did before.

As high availability depends on the etcd service, installing Go and compiling etcd-wrapper become necessary whenever mooncake-store is compiled. mooncake-store and the transfer engine each have an independent etcd client instance, so they won't interfere with each other.

Additionally, to run the corresponding unit tests in CI, we also need to deploy an etcd service in the CI environment.

Collaborator

@maobaolong maobaolong left a comment

@ykwd Thanks for this great and meaningful feature! I have not finished reviewing this PR yet.

Left some comments inline first.

Comment thread mooncake-store/include/client.h Outdated
* @brief Internal helper functions for initialization and data transfer
*/
- ErrorCode ConnectToMaster(const std::string& master_addr);
+ ErrorCode ConnectToMaster(const std::string& master_entries, bool enable_ha);
Collaborator

Do we need the extra enable_ha argument if we can detect that multiple addresses are given in master_entries?

Collaborator Author

Good question! I am also hesitating about how to specify using high availability mode.

One way is to use the prefix to distinguish the mode, e.g. etcd://0.0.0.0:2345 denotes HA mode (or more precisely, in the current situation, denotes using etcd to find the master address), while 0.0.0.0:2345 denotes non-HA mode, similar to the transfer engine. This way, existing commands or scripts that start the client do not need any modification.

Furthermore, it can also decouple HA mode from etcd. Some HA features, such as detecting the liveness of clients, simply rely on the regular Pings from clients to the master and do not need etcd. So even if etcd is not set, we can enable the client liveness detection feature. On the other hand, this also means HA features become default settings, i.e. users do not need to explicitly set enable_ha=true.

Is this ok? Or any suggestions?

Collaborator Author

On the other hand, as the HA mode is still unstable, or some users may not need HA mode, only enabling HA mode when etcd prefix is set may not be a bad idea.

Collaborator

What is certain is that we had better keep the default behavior unchanged. So using etcd://xxx:yyy is a workable approach for now.

But if we want to steer users toward HA mode by default, a common approach is to publish a notice for a period (perhaps one month, or several minor releases).

> Some HA features such as detecting the liveness of clients simply relies on the regular Pings from clients to master, which do not need to rely on etcd

Yeah, there are two concepts introduced by HA.

    1. Multiple masters, or master discovery.
    2. Client-side liveness detection of remote services.

For 1, it would be an optional feature. Users of the original non-HA use case may have no motivation to adopt HA mode, since it introduces etcd and extra master servers.

For 2, it could be enabled all the time, and we can go a step further and turn the liveness ping thread into a heartbeat thread. This goes beyond the concept of HA: with it, the master can serve more operations for users by attaching commands to its heartbeat replies, and the client can execute the commands it receives over the heartbeat channel.

Collaborator Author

Let me summarize the discussion. Correct me if I misunderstood something.

  1. In the configuration, reuse master_server_address instead of changing it to master_server_entries. If the prefix is etcd://, use etcd to get the real master address.
  2. Remove the enable_ha parameter on the client side. For now, if the prefix of master_server_address is etcd://, use HA mode; otherwise keep the default behavior unchanged.
  3. Publish a notice and introduce HA mode to users.
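
For illustration, the scheme-based selection summarized above could be sketched roughly as follows. This is a minimal sketch, not the code merged in this PR: `MasterAddressSpec` and `ParseMasterServerAddress` are hypothetical names, and the only behavior assumed is that an `etcd://` prefix selects HA mode while anything else is treated as a plain master address.

```cpp
#include <string>
#include <string_view>

// Hypothetical result type for illustration; not part of the mooncake-store API.
struct MasterAddressSpec {
    bool use_etcd;          // true when the etcd:// scheme is present (HA mode)
    std::string endpoints;  // etcd endpoints in HA mode, master address otherwise
};

// Hypothetical helper: strips an optional "etcd://" scheme from the configured
// master_server_address and reports which mode the value selects.
inline MasterAddressSpec ParseMasterServerAddress(std::string_view address) {
    constexpr std::string_view kEtcdScheme = "etcd://";
    if (address.substr(0, kEtcdScheme.size()) == kEtcdScheme) {
        address.remove_prefix(kEtcdScheme.size());
        return {true, std::string(address)};
    }
    // No scheme: keep the default (non-HA) behavior unchanged.
    return {false, std::string(address)};
}
```

With this shape, existing configurations such as `localhost:8081` keep working untouched, and only values that opt in with the `etcd://` scheme take the HA path.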

Collaborator

@ykwd Yeah, the summary is very accurate.

Comment thread mooncake-store/src/client.cpp Outdated
}

ErrorCode Client::ConnectToMaster(const std::string& master_entries, bool enable_ha) {
if (enable_ha) {
Collaborator

ditto

Comment thread mooncake-store/src/client.cpp Outdated

// Start Ping thread to monitor master view changes and remount segments if needed
ping_running_ = true;
ping_thread_ = std::thread(&Client::PingThreadFunc, this, master_version);
Collaborator

This could also serve as a heartbeat thread: if the master receives no heartbeat from a client for a while, it can unmount that client's segments.

Collaborator

Do you plan to make this thread a daemon thread?

Collaborator Author

Yes. These regular Pings can be used by master to detect the liveness of clients. We will implement this feature in the next PR.

Collaborator

After a short discussion offline, this topic can be closed, as it is almost the best practice to close the thread in the destructor.

Collaborator

@maobaolong maobaolong left a comment

I left some questions. It would be nice if you could provide an introduction that covers:

  • How to avoid split-brain
  • How to avoid waiting forever
  • How to fall back to re-compute performance if an unrecoverable issue occurs.

"protocol": "tcp",
"device_name": "",
"master_server_address": "localhost:8081"
"master_server_entries": "localhost:8081",
Collaborator

Do we need to keep backward compatibility? Maybe we can reuse master_server_address to identify HA mode?

  • "master_server_address": "localhost:8081" --- non-HA mode, since only one host is configured.
  • "master_server_address": "localhost:8081,localhost:8082,localhost:8083" --- HA mode, since three addresses are configured.

Collaborator Author

In HA mode, the master server address is not given by input parameters but is obtained from etcd. So the input address is the list of etcd endpoints.

But considering backward-compatibility, maybe we can re-use the master_server_address, though there might be a slight ambiguity.

Collaborator

> there might be a slight ambiguity.

@ykwd Thanks for the explanation. It would definitely be ambiguous if master_server_address stood for etcd endpoints directly. But we can use a scheme to distinguish whether it stands for a standard master address or for etcd endpoint addresses.

  • "master_server_address": "localhost:8081" --- non-HA mode, since only one host is configured.
  • "master_server_address": "etcd://localhost:8081,localhost:8082,localhost:8083" --- HA mode, since the scheme is etcd.

Is this better than the above approach?

Collaborator Author

Sounds good!

}

void MasterViewHelper::ElectLeader(const std::string& master_address, ViewVersionId& version, EtcdLeaseId& lease_id) {
while (true) {
Collaborator

Could this become an infinite loop? For example, when all masters have failed.

auto ret = EtcdHelper::Get(MASTER_VIEW_KEY, strlen(MASTER_VIEW_KEY), current_master, current_version);
if (ret != ErrorCode::OK && ret != ErrorCode::ETCD_KEY_NOT_EXIST) {
LOG(ERROR) << "Failed to get current leader: " << ret;
std::this_thread::sleep_for(std::chrono::seconds(1));
Collaborator

Maybe we can introduce exponential backoff on retry to avoid a retry storm?

Collaborator Author

There are only a few master instances (e.g., three), and each instance retries once per second, so there will not be much overhead for the etcd cluster.

Collaborator

Sounds good.
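
For reference, the exponential backoff that was suggested (and not adopted here, since a fixed one-second retry was considered cheap enough) might look roughly like the sketch below. `ExponentialBackoff` is a hypothetical helper, and the initial and maximum delays are illustrative values rather than anything from this PR.

```cpp
#include <algorithm>
#include <chrono>

// Hypothetical helper; the code in this PR sleeps for a fixed one second instead.
class ExponentialBackoff {
public:
    ExponentialBackoff(std::chrono::milliseconds initial,
                       std::chrono::milliseconds max)
        : initial_(initial), max_(max), next_(initial) {}

    // Returns the delay to sleep before the next retry, then doubles the
    // stored delay for the following call, capped at max.
    std::chrono::milliseconds NextDelay() {
        const auto delay = next_;
        next_ = std::min(next_ * 2, max_);
        return delay;
    }

    // Call after a successful operation so later failures start small again.
    void Reset() { next_ = initial_; }

private:
    std::chrono::milliseconds initial_;
    std::chrono::milliseconds max_;
    std::chrono::milliseconds next_;
};
```

A retry loop would call `std::this_thread::sleep_for(backoff.NextDelay())` after each failed etcd request and `backoff.Reset()` after a success. Adding random jitter to each delay is a common refinement when many callers retry at once.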

@ykwd ykwd marked this pull request as ready for review June 9, 2025 05:57
@xiaguan xiaguan assigned xiaguan and unassigned xiaguan Jun 9, 2025
@xiaguan xiaguan self-requested a review June 9, 2025 08:12
Collaborator Author

ykwd commented Jun 9, 2025

> Left some questions, and it would be nice if you could provide an introduction that covers:
>
>   • How to avoid split-brain
>   • How to avoid waiting forever
>   • How to fall back to re-compute performance if an unrecoverable issue occurs.

Thanks for the review.

  • How to avoid split-brain
    To avoid split-brain, we can let the new leader begin serving requests only after a long enough time, e.g., 10 seconds. In the future, this problem can be solved using the following strategy:
    The leader records a local lease expiration time Exp.
    Every ttl / 2 seconds, the leader does the following steps:

      1. The leader records the current time T.
      2. The leader sends a KeepAliveOnce request to etcd to extend the lease for ttl seconds. If the KeepAliveOnce request succeeds, the leader knows the lease will not expire until T+x, so it is safe to update Exp to T+x.

    For every request, the leader returns the OK result to the client only if the request finishes before Exp.

  • How to avoid waiting forever
    Infinite loop in leader election: the master process is supposed to run forever. As long as the process is not terminated and the local master is not the leader, it keeps trying to elect itself as leader.

  • How to fall back to re-compute performance if an unrecoverable issue occurs
    This PR does not introduce additional overhead for requests such as get or put, but it also doesn't address existing issues, if there are any. If ensuring this property is important, we can address it in a new PR.
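
The lease-based strategy proposed above for a future PR could be sketched roughly as follows. This is only an illustration under stated assumptions: `LeaderLeaseGuard` is a hypothetical class, the etcd KeepAliveOnce call itself is left to the caller, and the sketch assumes the x in T+x equals the lease ttl, which the description does not spell out.

```cpp
#include <algorithm>
#include <chrono>
#include <mutex>

// Hypothetical helper sketching the proposed strategy; not code from this PR.
class LeaderLeaseGuard {
public:
    using Clock = std::chrono::steady_clock;

    // The caller records T (sent_at) just before sending KeepAliveOnce and
    // calls this only if the request succeeded. Exp then moves to T + ttl
    // (assumption: the x in "T+x" is the lease ttl).
    void OnKeepAliveSuccess(Clock::time_point sent_at, Clock::duration ttl) {
        std::lock_guard<std::mutex> lock(mutex_);
        expiration_ = std::max(expiration_, sent_at + ttl);
    }

    // A request may be answered with OK only if it finished before Exp.
    bool CanAcknowledge(Clock::time_point finished_at) const {
        std::lock_guard<std::mutex> lock(mutex_);
        return finished_at < expiration_;
    }

private:
    mutable std::mutex mutex_;
    Clock::time_point expiration_{};  // clock epoch: no valid lease yet
};
```

Measuring from T, the moment before the request is sent, rather than from the moment the reply arrives keeps the local expiration conservative: the leader never believes its lease lasts longer than etcd does. A monotonic clock is used so that wall-clock adjustments cannot extend Exp.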

std::string current_master;
auto ret = EtcdHelper::Get(MASTER_VIEW_KEY, strlen(MASTER_VIEW_KEY), current_master, current_version);
if (ret != ErrorCode::OK && ret != ErrorCode::ETCD_KEY_NOT_EXIST) {
LOG(ERROR) << "Failed to get current leader: " << ret;
Collaborator

Let's improve this log message to indicate that the failure is likely due to an etcd issue.
This should help users diagnose the problem more quickly.

Collaborator Author

In this case, the returned errorcode is ETCD_OPERATION_ERROR. The LOG output is:
ha_helper.cpp:16] Failed to get current leader: ETCD_OPERATION_ERROR

Comment thread mooncake-store/src/ha_helper.cpp Outdated
ret = EtcdHelper::CreateWithLease(MASTER_VIEW_KEY, strlen(MASTER_VIEW_KEY),
master_address.c_str(), master_address.size(), lease_id, version);
if (ret == ErrorCode::ETCD_TRANSACTION_FAIL) {
LOG(INFO) << "Failed to elect self as leader";
Collaborator

Same as above, e.g. "Etcd transaction failed xxxx".

Collaborator Author

OK. Will fix this

Comment on lines +105 to +108
if (mv_helper.ConnectToEtcd(etcd_endpoints_) != ErrorCode::OK) {
LOG(ERROR) << "Failed to connect to etcd endpoints: " << etcd_endpoints_;
return -1;
}
Collaborator

We could move this into the MasterViewHelper constructor and throw an exception if it fails. A failure here basically means the master can't start up properly anyway.

Collaborator Author

OK. Sounds reasonable.

Collaborator Author

On reflection, MasterViewHelper::ConnectToEtcd is invoked in Client::ConnectToMaster(). If we move this into the MasterViewHelper constructor, the connection logic is partially moved into the client's constructor as well. Besides, it should only be invoked in HA mode. In non-HA mode there is no etcd, and this function should not be called.

ETCD_OPERATION_ERROR = -1000, ///< etcd operation failed.
ETCD_KEY_NOT_EXIST = -1001, ///< key not found in etcd.
ETCD_TRANSACTION_FAIL = -1002, ///< etcd transaction failed.
ETCD_CTX_CANCELLED = -1003, ///< etcd context cancelled.
Collaborator

update toString also

Collaborator Author

Sure. This is updated in type.cpp

Comment thread mooncake-store/src/client.cpp Outdated
ErrorCode Client::ConnectToMaster(const std::string& master_entries, bool enable_ha) {
if (enable_ha) {
// Get master address from ETCD
auto err = master_view_helper_.ConnectToEtcd(master_entries);
Collaborator

maybe we can introduce a new interface like get_current_leader_master_address()?

Collaborator

maybe we could move all of that down into MasterClient to keep Client clean and simple?

Collaborator Author

@ykwd ykwd Jun 9, 2025

> maybe we can introduce a new interface like get_current_leader_master_address()?

There is already an interface to get the leader address, i.e., master_view_helper_.GetMasterView.

> maybe we could move all of that down into MasterClient to keep Client clean and simple?

The HA logic, such as the ping thread, requires resources and information of the client, such as mounted_segments_.

Collaborator

xiaguan commented Jun 9, 2025

By the way, since we're getting more and more configuration options and constants, I think it's a good idea to create a separate config module. I'll go ahead and open a new issue for that.

Collaborator Author

ykwd commented Jun 10, 2025

Updates:

  • For client-side parameters: reuse master_server_address and remove enable_ha;
  • In the CI flow: install and start the etcd service, and run high_availability_test;
  • Format the code style;
  • Fix some bugs and address some minor issues.

To do in future PRs:

  • More reliable split-brain prevention.
  • Documentation (will be added together with the automatic client-failure detection feature).
  • Add chaos testing to test system stability.

@ykwd ykwd requested review from maobaolong and xiaguan June 10, 2025 02:44
Collaborator

@xiaguan xiaguan left a comment

LGTM

Collaborator

xiaguan commented Jun 11, 2025

@maobaolong PTAL

Collaborator

@maobaolong maobaolong left a comment

LGTM for the code. It would be nice for users if we could add related documentation. As this PR is already big enough, you can add it in another PR.

@xiaguan xiaguan merged commit 41b1df7 into kvcache-ai:main Jun 11, 2025
10 checks passed
xiaguan added a commit to xiaguan/Mooncake that referenced this pull request Jun 11, 2025
commit 38c435f
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Wed Jun 11 16:50:29 2025 +0800

    Revert "[TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)" (kvcache-ai#469)

    This reverts commit ffaad6a.

commit 41b1df7
Author: ykwd <oneday117@qq.com>
Date:   Wed Jun 11 16:37:05 2025 +0800

    [Store] Add initial support for master high availability failover (kvcache-ai#451)

    * A temp version. Better to continue development after merging the latest main branch

    * Temp version to merge the latest main branch

    * Allow optional use HA mode, in default use non-HA mode. Fix a minor metrics bug.

    * Refactor the etcd_helper

    * refactor ha_helper

    * Add some unit tests. Refactor the code

    * Update cmakelists: build etcd_wrapper in default

    * Fix ci problems. Compile etcd wrapper only when use_etcd or with_store are set.

    * Update python config relating to mooncake-store client

    * make some blocking etcd helper function cancellable.
    bug fix: add string name of new errors that will be used in tostring.

    * Refactor etcd related code

    * Bug fix

    * Add basic masterviewhelper unit tests

    * In ci flow, install and start etcd to run HA feature unit test.

    * Fix a ci bug

    * Reuse master_server_address parameter and remove enable_ha parameter.

    * Format the code. Fix a minor bug.

    * Handle the error case: the coro server may fail to start or return internal error.

commit ffaad6a
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Wed Jun 11 16:02:41 2025 +0800

    [TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)

    * [TransferEngine] Fix compilation bug in NVLink xport

    * [TransferEngine] Fix minor bugs in nvlink benchmark
xiaguan added a commit that referenced this pull request Jun 12, 2025
…tch optimization (#455)

* feat(client): add transfer submitter for optimized data transfer

Signed-off-by: Jinyang Su <751080330@qq.com>

* feat(store): implement async memcpy task execution with worker pool

Add `MemcpyWorkerPool` to manage asynchronous execution of memcpy tasks. Refactor `BatchGet` and `BatchPut` methods for parallel execution and enhance logging for better traceability.

* Squashed commit of the following:

commit 38c435f
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Wed Jun 11 16:50:29 2025 +0800

    Revert "[TransferEngine] Fix minor bugs in NVLink transport and benchmark (#468)" (#469)

    This reverts commit ffaad6a.

commit 41b1df7
Author: ykwd <oneday117@qq.com>
Date:   Wed Jun 11 16:37:05 2025 +0800

    [Store] Add initial support for master high availability failover (#451)

commit ffaad6a
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Wed Jun 11 16:02:41 2025 +0800

    [TransferEngine] Fix minor bugs in NVLink transport and benchmark (#468)

    * [TransferEngine] Fix compilation bug in NVLink xport

    * [TransferEngine] Fix minor bugs in nvlink benchmark

Signed-off-by: Jinyang Su <751080330@qq.com>

---------

Signed-off-by: Jinyang Su <751080330@qq.com>
@ykwd ykwd deleted the feature/ha branch July 10, 2025 06:36
@SpecterCipher
Contributor

May I kindly inquire if there is a planned timeline for supporting KV metadata failover? @ykwd

Collaborator Author

ykwd commented Aug 4, 2025

> May I kindly inquire if there is a planned timeline for supporting KV metadata failover? @ykwd

Thank you for your interest! Supporting KV metadata failover is indeed on our roadmap, but it's not a top priority at the moment. At least for August, we don’t plan to work on it yet.

@SpecterCipher
Contributor

> > May I kindly inquire if there is a planned timeline for supporting KV metadata failover? @ykwd
>
> Thank you for your interest! Supporting KV metadata failover is indeed on our roadmap, but it's not a top priority at the moment. At least for August, we don’t plan to work on it yet.

OK, thank you for your response.

@ykwd ykwd mentioned this pull request Aug 21, 2025
29 tasks
wanyue-wy pushed a commit to wanyue-wy/Mooncake that referenced this pull request Dec 14, 2025
…cache-ai#451)

wanyue-wy pushed a commit to wanyue-wy/Mooncake that referenced this pull request Dec 14, 2025
…tch optimization (kvcache-ai#455)

* feat(client): add transfer submitter for optimized data transfer

Signed-off-by: Jinyang Su <751080330@qq.com>

* feat(store): implement async memcpy task execution with worker pool

Add `MemcpyWorkerPool` to manage asynchronous execution of memcpy tasks. Refactor `BatchGet` and `BatchPut` methods for parallel execution and enhance logging for better traceability.

* Squashed commit of the following:

commit 6b07418
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Wed Jun 11 16:50:29 2025 +0800

    Revert "[TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)" (kvcache-ai#469)

    This reverts commit 506e204.

commit 60567fc
Author: ykwd <oneday117@qq.com>
Date:   Wed Jun 11 16:37:05 2025 +0800

    [Store] Add initial support for master high availability failover (kvcache-ai#451)

    * A temp version. Better to continue development after merging the latest main branch

    * Temp version to merge the latest main branch

    * Allow optional use HA mode, in default use non-HA mode. Fix a minor metrics bug.

    * Refactor the etcd_helper

    * refactor ha_helper

    * Add some unit tests. Refactor the code

    * Update cmakelists: build etcd_wrapper in default

    * Fix ci problems. Compile etcd wrapper only when use_etcd or with_store are set.

    * Update python config relating to mooncake-store client

    * make some blocking etcd helper function cancellable.
    bug fix: add string name of new errors that will be used in tostring.

    * Refactor etcd related code

    * Bug fix

    * Add basic masterviewhelper unit tests

    * In ci flow, install and start etcd to run HA feature unit test.

    * Fix a ci bug

    * Reuse master_server_address parameter and remove enable_ha parameter.

    * Format the code. Fix a minor bug.

    * Handle the error case: the coro server may fail to start or return internal error.

commit 506e204
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Wed Jun 11 16:02:41 2025 +0800

    [TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)

    * [TransferEngine] Fix compilation bug in NVLink xport

    * [TransferEngine] Fix minor bugs in nvlink benchmark

Signed-off-by: Jinyang Su <751080330@qq.com>

---------

Signed-off-by: Jinyang Su <751080330@qq.com>
JasonZhang517 pushed a commit to JasonZhang517/Mooncake that referenced this pull request Feb 9, 2026
…cache-ai#451)

JasonZhang517 pushed a commit to JasonZhang517/Mooncake that referenced this pull request Feb 9, 2026
…tch optimization (kvcache-ai#455)
