[Store] High Availability 1: Master Failover #451
bug fix: add string name of new errors that will be used in tostring.
maobaolong
left a comment
@ykwd Thanks for this great, meaningful feature! I have not finished looking at this PR yet.
Left some comments inline first.
| * @brief Internal helper functions for initialization and data transfer | ||
| */ | ||
| ErrorCode ConnectToMaster(const std::string& master_addr); | ||
| ErrorCode ConnectToMaster(const std::string& master_entries, bool enable_ha); |
There was a problem hiding this comment.
Do we need the extra enable_ha argument if we can detect multiple addresses in master_entries?
There was a problem hiding this comment.
Good question! I am also unsure how users should select high availability mode.
One way is to use the prefix to distinguish the mode, e.g. etcd://0.0.0.0:2345 denotes HA mode (or more precisely, in the current situation, denotes using etcd to find the master address), while 0.0.0.0:2345 denotes non-HA mode, similar to the transfer engine. In this way, existing commands or scripts that start the client do not need any modification.
Furthermore, it can also decouple HA mode from etcd. Some HA features such as detecting the liveness of clients simply relies on the regular Pings from clients to master, which do not need to rely on etcd. So even if etcd is not set, we can enable the client liveness detection feature. On the other hand, this also means HA features become default settings, e.g. users do not need to explicitly set enable_ha=true.
Is this ok? Or any suggestions?
There was a problem hiding this comment.
On the other hand, as the HA mode is still unstable, or some users may not need HA mode, only enabling HA mode when etcd prefix is set may not be a bad idea.
There was a problem hiding this comment.
What is certain is that we had better keep the default behavior unchanged. So using etcd://xxx:yyy is a possible way for now.
But if we want to lead users to use HA mode by default, a common approach is to publish a notice for a period (maybe one month, or several minor releases).
Some HA features such as detecting the liveness of clients simply relies on the regular Pings from clients to master, which do not need to rely on etcd
Yeah, there are two concepts introduced by HA.
1. Multiple masters, or master discovery.
2. Client-side liveness detection of remote services.
For 1, it would be an optional feature; the original non-HA use case has no motivation to adopt HA mode, since that means introducing etcd and extra master servers.
For 2, it could exist all the time, and we can go a step further and turn the liveness ping thread into a heartbeat thread. This goes beyond the concept of HA: with it, the master can serve more operations for users by attaching commands to its heartbeat replies, and the client can execute the commands it receives through the heartbeat communication.
There was a problem hiding this comment.
Let me summarize the discussion. Correct me if I misunderstood something.
- In configuration, reuse master_server_address instead of changing to master_server_entries. If the prefix is etcd://, then use etcd to get the real master address.
- Remove the enable_ha mode parameter in client side. Currently, if the master_server_address's prefix is etcd://, then use HA mode, otherwise keep the default behavior unchanged.
- Publish a notice and introduce HA mode for users.
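For reference, the prefix rule summarized above could be sketched roughly like this. This is only an illustration: `MasterEndpoint` and `ParseMasterServerAddress` are hypothetical names, not code from this PR, and the only assumption taken from the discussion is that an `etcd://` prefix selects HA mode.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper, not taken from the PR: decides the connection mode
// from the scheme prefix of master_server_address.
struct MasterEndpoint {
    bool use_etcd;        // true when the value starts with "etcd://"
    std::string address;  // etcd endpoints (HA) or the master address (non-HA)
};

inline MasterEndpoint ParseMasterServerAddress(std::string_view raw) {
    constexpr std::string_view kEtcdScheme = "etcd://";
    // substr clamps the length, so short inputs are handled safely.
    if (raw.substr(0, kEtcdScheme.size()) == kEtcdScheme) {
        // HA mode: the remainder is the comma-separated etcd endpoint list.
        return {true, std::string(raw.substr(kEtcdScheme.size()))};
    }
    // Non-HA mode: the value is the master address itself, as before.
    return {false, std::string(raw)};
}
```

With this shape, `"localhost:8081"` keeps the existing behavior, while `"etcd://localhost:8081,localhost:8082"` yields the etcd endpoints to query for the current master.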
| } | ||
|
|
||
| ErrorCode Client::ConnectToMaster(const std::string& master_entries, bool enable_ha) { | ||
| if (enable_ha) { |
|
|
||
| // Start Ping thread to monitor master view changes and remount segments if needed | ||
| ping_running_ = true; | ||
| ping_thread_ = std::thread(&Client::PingThreadFunc, this, master_version); |
There was a problem hiding this comment.
This can also be a heartbeat thread: if the master stops receiving heartbeats from a client for a while, it can unmount that client's segments.
There was a problem hiding this comment.
Do you intend to make this thread a daemon thread?
There was a problem hiding this comment.
Yes. These regular Pings can be used by master to detect the liveness of clients. We will implement this feature in the next PR.
There was a problem hiding this comment.
After a short discussion offline, this topic can be closed, as it is almost the best practice to close the thread in the destructor.
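A minimal sketch of the "stop the thread in the destructor" pattern mentioned above, assuming a periodic background task. The `Pinger` class and its members are hypothetical and only stand in for the client's ping thread; the real ping RPC is replaced by a counter.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the client's ping thread.
class Pinger {
public:
    explicit Pinger(std::chrono::milliseconds interval)
        : interval_(interval), thread_(&Pinger::Run, this) {}

    // The destructor signals the loop to stop and joins the thread, so the
    // thread never outlives the object it reads from.
    ~Pinger() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_all();
        if (thread_.joinable()) thread_.join();
    }

    Pinger(const Pinger&) = delete;
    Pinger& operator=(const Pinger&) = delete;

    int ping_count() const { return pings_.load(); }

private:
    void Run() {
        std::unique_lock<std::mutex> lock(mutex_);
        while (!stop_) {
            ++pings_;  // placeholder for the real ping RPC
            // Sleeps for one interval, but wakes immediately on shutdown.
            cv_.wait_for(lock, interval_, [this] { return stop_; });
        }
    }

    const std::chrono::milliseconds interval_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool stop_ = false;
    std::atomic<int> pings_{0};
    std::thread thread_;  // declared last so it starts after the other members
};
```

Waiting on a condition variable instead of a plain sleep means destruction does not have to wait out a full ping interval. A real implementation would likely release the lock while the RPC is in flight.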
maobaolong
left a comment
Left some questions. It would be nice if you could provide an introduction that covers:
- How to avoid split-brain
- How to avoid waiting forever
- How to fall back to re-compute performance if an unrecoverable issue happens.
| "protocol": "tcp", | ||
| "device_name": "", | ||
| "master_server_address": "localhost:8081" | ||
| "master_server_entries": "localhost:8081", |
There was a problem hiding this comment.
Do we need to keep backward compatibility? Maybe we can reuse master_server_address to identify the HA mode?
- "master_server_address": "localhost:8081" --- non-HA mode, since only one host is configured here.
- "master_server_address": "localhost:8081,localhost:8082,localhost:8083" --- HA mode, since three addresses are configured here.
There was a problem hiding this comment.
In HA mode, the master server address is not given by the input parameters but is obtained from etcd. So the input address is the set of etcd endpoints.
But considering backward compatibility, maybe we can reuse master_server_address, though there might be a slight ambiguity.
There was a problem hiding this comment.
there might be a slight ambiguity.
@ykwd Thanks for the explanation. It would definitely be ambiguous if we used master_server_address to stand for etcd endpoints directly. But we can use a scheme to distinguish whether it stands for a standard master address or for etcd endpoint addresses.
- "master_server_address": "localhost:8081" --- non-HA mode, since only one host is configured here.
- "master_server_address": "etcd://localhost:8081,localhost:8082,localhost:8083" --- HA mode, since the scheme is etcd.
Is this better than the above approach?
| } | ||
|
|
||
| void MasterViewHelper::ElectLeader(const std::string& master_address, ViewVersionId& version, EtcdLeaseId& lease_id) { | ||
| while (true) { |
There was a problem hiding this comment.
Is it possible for this to loop forever? For example, if all masters have failed.
| auto ret = EtcdHelper::Get(MASTER_VIEW_KEY, strlen(MASTER_VIEW_KEY), current_master, current_version); | ||
| if (ret != ErrorCode::OK && ret != ErrorCode::ETCD_KEY_NOT_EXIST) { | ||
| LOG(ERROR) << "Failed to get current leader: " << ret; | ||
| std::this_thread::sleep_for(std::chrono::seconds(1)); |
There was a problem hiding this comment.
Maybe we can introduce an exponential backoff and retry approach to avoid a retry storm?
There was a problem hiding this comment.
There are only a few master instances, e.g. three, and each instance retries once per second. So there will not be much overhead for the etcd cluster.
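For reference, the exponential backoff idea raised above, combined with a bounded number of attempts so the loop cannot spin forever, could look roughly like the sketch below. `RetryWithBackoff` and its parameters are hypothetical and not part of this PR; the PR itself keeps the fixed 1-second sleep.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: calls `op` until it returns true or the attempts run
// out. The delay doubles after each failure, capped at `max_delay`.
inline bool RetryWithBackoff(const std::function<bool()>& op,
                             int max_attempts,
                             std::chrono::milliseconds initial_delay,
                             std::chrono::milliseconds max_delay) {
    assert(max_attempts > 0);
    std::chrono::milliseconds delay = initial_delay;
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        if (op()) return true;
        if (attempt == max_attempts) break;  // no sleep after the last try
        std::this_thread::sleep_for(delay);
        delay = std::min(delay * 2, max_delay);
    }
    return false;  // caller decides how to surface the failure
}
```

A production version would usually add random jitter to the delay so that several instances do not retry in lockstep.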
Thanks for the review.
To avoid this, we can let the new leader begin serving requests only after a long enough time, e.g., 10 seconds. In the future, this problem can be solved using the following strategy:
|
| std::string current_master; | ||
| auto ret = EtcdHelper::Get(MASTER_VIEW_KEY, strlen(MASTER_VIEW_KEY), current_master, current_version); | ||
| if (ret != ErrorCode::OK && ret != ErrorCode::ETCD_KEY_NOT_EXIST) { | ||
| LOG(ERROR) << "Failed to get current leader: " << ret; |
There was a problem hiding this comment.
Let's improve this log message to indicate that the failure is likely due to an etcd issue.
This should help users diagnose the problem more quickly.
There was a problem hiding this comment.
In this case, the returned errorcode is ETCD_OPERATION_ERROR. The LOG output is:
ha_helper.cpp:16] Failed to get current leader: ETCD_OPERATION_ERROR
| ret = EtcdHelper::CreateWithLease(MASTER_VIEW_KEY, strlen(MASTER_VIEW_KEY), | ||
| master_address.c_str(), master_address.size(), lease_id, version); | ||
| if (ret == ErrorCode::ETCD_TRANSACTION_FAIL) { | ||
| LOG(INFO) << "Failed to elect self as leader"; |
There was a problem hiding this comment.
Same as above: "Etcd transaction failed xxxx"
| if (mv_helper.ConnectToEtcd(etcd_endpoints_) != ErrorCode::OK) { | ||
| LOG(ERROR) << "Failed to connect to etcd endpoints: " << etcd_endpoints_; | ||
| return -1; | ||
| } |
There was a problem hiding this comment.
We could move this into the MasterViewHelper constructor and throw an exception if this fails, it basically means the Master can't start up properly anyway.
There was a problem hiding this comment.
OK. Sounds reasonable.
There was a problem hiding this comment.
In retrospect, MasterViewHelper::ConnectToEtcd is invoked in Client::ConnectToMaster(). If we move this into the MasterViewHelper constructor, the connection logic is also partially moved into the client's constructor. Besides, it is only invoked in HA mode. In non-HA mode there is no etcd and this function should not be called.
| ETCD_OPERATION_ERROR = -1000, ///< etcd operation failed. | ||
| ETCD_KEY_NOT_EXIST = -1001, ///< key not found in etcd. | ||
| ETCD_TRANSACTION_FAIL = -1002, ///< etcd transaction failed. | ||
| ETCD_CTX_CANCELLED = -1003, ///< etcd context cancelled. |
There was a problem hiding this comment.
Sure. This is updated in type.cpp
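As an illustration of what mapping the new codes to string names can look like, here is a minimal sketch. Only the four etcd values quoted in the diff above come from this PR; the `OK = 0` member, the underlying type, and the `ToString` name are assumptions made for the example, and the project's real enum has more members.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Reduced enum for illustration; OK = 0 and int32_t are assumptions.
enum class ErrorCode : int32_t {
    OK = 0,
    ETCD_OPERATION_ERROR = -1000,   // etcd operation failed
    ETCD_KEY_NOT_EXIST = -1001,     // key not found in etcd
    ETCD_TRANSACTION_FAIL = -1002,  // etcd transaction failed
    ETCD_CTX_CANCELLED = -1003,     // etcd context cancelled
};

// Hypothetical name: returns a readable name so that log lines print
// "ETCD_OPERATION_ERROR" instead of a bare number.
inline std::string ToString(ErrorCode code) {
    switch (code) {
        case ErrorCode::OK: return "OK";
        case ErrorCode::ETCD_OPERATION_ERROR: return "ETCD_OPERATION_ERROR";
        case ErrorCode::ETCD_KEY_NOT_EXIST: return "ETCD_KEY_NOT_EXIST";
        case ErrorCode::ETCD_TRANSACTION_FAIL: return "ETCD_TRANSACTION_FAIL";
        case ErrorCode::ETCD_CTX_CANCELLED: return "ETCD_CTX_CANCELLED";
    }
    return "UNKNOWN_ERROR";  // value outside the known set
}
```

Keeping the fallback outside the `switch` lets the compiler warn when a newly added enumerator is missing a case.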
| ErrorCode Client::ConnectToMaster(const std::string& master_entries, bool enable_ha) { | ||
| if (enable_ha) { | ||
| // Get master address from ETCD | ||
| auto err = master_view_helper_.ConnectToEtcd(master_entries); |
There was a problem hiding this comment.
maybe we can introduce a new interface like get_current_leader_master_address()?
There was a problem hiding this comment.
maybe we could move all of that down into MasterClient to keep Client clean and simple?
There was a problem hiding this comment.
maybe we can introduce a new interface like get_current_leader_master_address()?
There is already an interface to get the leader address, i.e., master_view_helper_.GetMasterView.
maybe we could move all of that down into MasterClient to keep Client clean and simple?
The HA logic, such as ping thread, requires resources and information of the client, such as mounted_segments_.
|
By the way, since we're getting more and more configuration and constants, I think it's a good idea to create a separate config module. I'll go ahead and open a new issue for that.
|
Updates:
To do in the future PR:
|
|
@maobaolong PTAL
maobaolong
left a comment
LGTM for the code. It would be nice for users if we could add related documentation. As this PR is already big enough, you can add it in another PR.
commit 38c435f
Author: Feng Ren <alogfans@users.noreply.github.com>
Date: Wed Jun 11 16:50:29 2025 +0800
Revert "[TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)" (kvcache-ai#469)
This reverts commit ffaad6a.

commit 41b1df7
Author: ykwd <oneday117@qq.com>
Date: Wed Jun 11 16:37:05 2025 +0800
[Store] Add initial support for master high availability failover (kvcache-ai#451)
* A temp version. Better to continue development after merging the latest main branch
* Temp version to merge the latest main branch
* Allow optional use HA mode, in default use non-HA mode. Fix a minor metrics bug.
* Refactor the etcd_helper
* refactor ha_helper
* Add some unit tests. Refactor the code
* Update cmakelists: build etcd_wrapper in default
* Fix ci problems. Compile etcd wrapper only when use_etcd or with_store are set.
* Update python config relating to mooncake-store client
* make some blocking etcd helper function cancellable. bug fix: add string name of new errors that will be used in tostring.
* Refactor etcd related code
* Bug fix
* Add basic masterviewhelper unit tests
* In ci flow, install and start etcd to run HA feature unit test.
* Fix a ci bug
* Reuse master_server_address parameter and remove enable_ha parameter.
* Format the code. Fix a minor bug.
* Handle the error case: the coro server may fail to start or return internal error.

commit ffaad6a
Author: Feng Ren <alogfans@users.noreply.github.com>
Date: Wed Jun 11 16:02:41 2025 +0800
[TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)
* [TransferEngine] Fix compilation bug in NVLink xport
* [TransferEngine] Fix minor bugs in nvlink benchmark
…tch optimization (#455) * feat(client): add transfer submitter for optimized data transfer Signed-off-by: Jinyang Su <751080330@qq.com> * feat(store): implement async memcpy task execution with worker pool Add `MemcpyWorkerPool` to manage asynchronous execution of memcpy tasks. Refactor `BatchGet` and `BatchPut` methods for parallel execution and enhance logging for better traceability. * Squashed commit of the following: commit 38c435f Author: Feng Ren <alogfans@users.noreply.github.com> Date: Wed Jun 11 16:50:29 2025 +0800 Revert "[TransferEngine] Fix minor bugs in NVLink transport and benchmark (#468)" (#469) This reverts commit ffaad6a. commit 41b1df7 Author: ykwd <oneday117@qq.com> Date: Wed Jun 11 16:37:05 2025 +0800 [Store] Add initial support for master high availability failover (#451) * A temp version. Better to continue development after merging the latest main branch * Temp version to merge the latest main branch * Allow optional use HA mode, in default use non-HA mode. Fix a minor metrics bug. * Refactor the etcd_helper * refactor ha_helper * Add some unit tests. Refactor the code * Update cmakelists: build etcd_wrapper in default * Fix ci problems. Compile etcd wrapper only when use_etcd or with_store are set. * Update python config relating to mooncake-store client * make some blocking etcd helper function cancellable. bug fix: add string name of new errors that will be used in tostring. * Refactor etcd related code * Bug fix * Add basic masterviewhelper unit tests * In ci flow, install and start etcd to run HA feature unit test. * Fix a ci bug * Reuse master_server_address parameter and remove enable_ha parameter. * Format the code. Fix a minor bug. * Handle the error case: the coro server may fail to start or return internal error. 
commit ffaad6a Author: Feng Ren <alogfans@users.noreply.github.com> Date: Wed Jun 11 16:02:41 2025 +0800 [TransferEngine] Fix minor bugs in NVLink transport and benchmark (#468) * [TransferEngine] Fix compilation bug in NVLink xport * [TransferEngine] Fix minor bugs in nvlink benchmark Signed-off-by: Jinyang Su <751080330@qq.com> --------- Signed-off-by: Jinyang Su <751080330@qq.com>
|
May I kindly inquire whether there is a planned timeline for supporting KV metadata failover? @ykwd
Thank you for your interest! Supporting KV metadata failover is indeed on our roadmap, but it's not a top priority at the moment. At least for August, we don't plan to work on it yet.
OK, thank you for your response.
Currently, the high availability goal consists of the following:
1. The client can detect when the master goes down and automatically remount segments after the master restarts. This way, if the master crashes, only the master needs to be restarted, without having to restart the entire cluster.
2. The master service contains multiple instances, one of which is the leader and serves the client requests. When the current leader fails, other instances can campaign to become the new leader and continue providing services. However, since the previous metadata is lost, this is equivalent to the cluster's key-value store being cleared, and the cache has to be repopulated.
3. Tolerate client failure. The master can detect when a client fails and mark its associated data as invalid. When the client comes back online, it can re-register with the master.
4. KV metadata failover. Persist or replicate the KV metadata so that the newly elected leader can sync to the latest state and the previously existing key-value data is preserved.
5. Add fault-tolerant logic to the existing code to make the system more robust.
6. Add more tests to validate the robustness of the system under various failures.
This PR is the first step of a series of attempts to make mooncake-store achieve high availability. This PR achieves the 1st and 2nd goals listed above.
More precisely, this PR adds an optional HA mode that depends on the etcd service. In HA mode, users can deploy multiple master instances, and at any time only one instance is the leader that serves client requests. If N instances are deployed, the cluster can tolerate the failure of up to N-1 instances as long as the etcd service is alive. Non-HA mode is still supported and remains the default, in which the system behaves the same as before.
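To illustrate the single-leader property described above, here is a toy sketch that replaces etcd with an in-memory map. Everything in it is hypothetical: `FakeKv`, `Campaign`, and the `"master_view"` key are made up for the example, and erasing the key only imitates a leader's lease expiring. The one idea it shares with the PR is that a candidate becomes leader only if creating the key succeeds.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <optional>
#include <string>

// In-memory stand-in for etcd, for illustration only.
class FakeKv {
public:
    // Creates the key only if it is absent, like a create-if-not-exists
    // transaction. Returns true when this caller created it.
    bool CreateIfAbsent(const std::string& key, const std::string& value) {
        std::lock_guard<std::mutex> lock(mutex_);
        return data_.emplace(key, value).second;
    }

    // Imitates the current leader's lease expiring.
    void Erase(const std::string& key) {
        std::lock_guard<std::mutex> lock(mutex_);
        data_.erase(key);
    }

    std::optional<std::string> Get(const std::string& key) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = data_.find(key);
        if (it == data_.end()) return std::nullopt;
        return it->second;
    }

private:
    mutable std::mutex mutex_;
    std::map<std::string, std::string> data_;
};

// A candidate wins the election only if it manages to create the key.
inline bool Campaign(FakeKv& kv, const std::string& self_address) {
    return kv.CreateIfAbsent("master_view", self_address);
}
```

In this toy model, a second candidate's `Campaign` call fails while the key exists and succeeds once the key has been erased, which mirrors a standby instance taking over after the leader fails.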
As the high availability depends on etcd service, the installation of go and the compilation of etcd-wrapper become necessary if mooncake-store is to be compiled. Both mooncake-store and transferengine have an independent etcd client instance, so they won't interfere with each other.
Additionally, to do the corresponding unit tests in CI, we also need to deploy etcd service in the CI environment.