[Store] High Availability V2: Client Failover #501
bug fix: add string name of new errors that will be used in tostring.
xiaguan left a comment:
Great work! Just a couple of minor issues.
I just skimmed through the code and didn’t look too closely at the logic yet.
Hi, did the previous PR #451 make ETCD a required dependency even when USE_ETCD is set to OFF? The immediate cause could be a network issue, but I don't need ETCD, so why can't I bypass this error by setting USE_ETCD to OFF? Is this expected?
Sorry for the inconvenience. There is no need to install etcd. The dependency is a Go package used in mooncake-store. Do you want to compile the transfer engine only, or both the engine and the store? In the former case, setting WITH_STORE=OFF will bypass this dependency.
So there is currently no way to compile the store without ETCD? Can we make it an optional feature?
Thanks for the suggestion. I will try to make a hotfix for this.
Updates
@staryxchen Just submitted a hotfix: #520
xiaguan left a comment:
I think unifying the client and segment abstractions would make sense. The current client -> segments setup adds a lot of unnecessary complexity. Unifying them could simplify things quite a bit on both the master and client sides. It would also make future changes cleaner, since we would no longer need to handle the one-client-to-multiple-segments case.
#include <string>
#include <string_view>
#include <unordered_map>
#include <variant>
OK. Removed the variant.
for (auto& segment_id : it->second) {
    auto segment_it = segment_manager_->mounted_segments_.find(segment_id);
    if (segment_it != segment_manager_->mounted_segments_.end()) {
        segments.push_back(segment_it->second.segment);
Use emplace_back instead of push_back where possible to avoid creating temporary objects. This can improve performance, especially for complex types.
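To make the trade-off concrete, here is a minimal sketch using a hypothetical `Segment` stand-in (not the type from this PR) that counts copy and move constructions. It suggests that `emplace_back` saves work when the element is built from constructor arguments, whereas for an object that already exists, as in the loop above, `push_back` and `emplace_back` each perform one copy.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Global counters so the effect of each insertion style is observable.
int g_copies = 0;
int g_moves = 0;

// Hypothetical stand-in for a segment record; not the type used in the PR.
struct Segment {
    std::string name;
    explicit Segment(std::string n) : name(std::move(n)) {}
    Segment(const Segment& other) : name(other.name) { ++g_copies; }
    Segment(Segment&& other) noexcept : name(std::move(other.name)) { ++g_moves; }
};

struct Counts {
    int copies;
    int moves;
};

// push_back with a temporary: the temporary is built, then moved into the vector.
Counts PushTemporary() {
    g_copies = g_moves = 0;
    std::vector<Segment> v;
    v.reserve(1);
    v.push_back(Segment("node-a"));
    return {g_copies, g_moves};
}

// emplace_back with constructor arguments: the element is built in place.
Counts EmplaceFromArgs() {
    g_copies = g_moves = 0;
    std::vector<Segment> v;
    v.reserve(1);
    v.emplace_back("node-a");
    return {g_copies, g_moves};
}

// push_back with an existing object: one copy.
Counts PushExisting() {
    const Segment existing("node-a");
    g_copies = g_moves = 0;
    std::vector<Segment> v;
    v.reserve(1);
    v.push_back(existing);
    return {g_copies, g_moves};
}

// emplace_back with an existing object: also one copy.
Counts EmplaceExisting() {
    const Segment existing("node-a");
    g_copies = g_moves = 0;
    std::vector<Segment> v;
    v.reserve(1);
    v.emplace_back(existing);
    return {g_copies, g_moves};
}
```

Calling the four helpers from a `main` and printing the counters is enough to compare the cases. For cheap-to-move types the difference between the first two is usually small.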
 * @return client status from the master
 * @return ErrorCode indicating success/failure
 */
[[nodiscard]] PingResponse Ping(const UUID& client_id);
Even though both functions are named 'ping', they actually have different meanings. We should probably rename them to better reflect what each one does.
By "both functions", do you mean there are two functions named 'ping'?
Yes: one for the heartbeat, and the other to check whether the master is available?
OK, I found it. That is weird. I wonder where the duplicate ping comes from; perhaps it came from a merge. I will remove the duplicate.
@@ -677,46 +709,44 @@ void Client::PingThreadFunc(int current_version) {
    const int fail_ping_interval_ms = 1000;
We should probably turn these into configurable options too in the future.
Unlike the master, the client is currently very hard to configure, and adding new configuration options to it is harder still. Additionally, I do not see a reason to make this configurable for users.
We're planning to add a client config module in the future.
The ping interval does seem a bit long. Do you think setting the default to 100ms would work better?
> We're planning to add a client config module in the future.
That would be great. Looking forward to seeing it.
> The ping interval does seem a bit long. Do you think setting the default to 100ms would work better?
Perhaps not. Currently, if the leader crashes, it takes several seconds (perhaps 5 to 10+ seconds) before the new leader begins to serve. It takes three failed pings to trigger the client to query etcd for the new leader address. With a 1000 ms interval that is about 3 seconds, which fits the leader-change timespan well. Additionally, pinging too frequently would put extra load on the master.
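The arithmetic behind this reply can be sketched as follows. This is a minimal illustration, not the client code in this PR: `PingFailureTracker` and `DetectionDelay` are hypothetical names, and resetting the counter once the threshold is reached is an assumption. Only the 1000 ms failure interval and the three-failure threshold come from the discussion above.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: counts consecutive ping failures and reports when the
// client should look up the leader address again.
class PingFailureTracker {
public:
    explicit PingFailureTracker(int max_failures) : max_failures_(max_failures) {}

    // Records one ping outcome. Returns true when the number of consecutive
    // failures reaches the threshold; the counter then starts over (assumed).
    bool Record(bool ping_ok) {
        if (ping_ok) {
            consecutive_failures_ = 0;
            return false;
        }
        ++consecutive_failures_;
        if (consecutive_failures_ < max_failures_) {
            return false;
        }
        consecutive_failures_ = 0;
        return true;
    }

private:
    int max_failures_;
    int consecutive_failures_ = 0;
};

// Rough time until a dead leader is noticed: failures needed times the
// interval between failed pings (ignores per-ping timeouts).
constexpr std::chrono::milliseconds DetectionDelay(
    int max_failures, std::chrono::milliseconds fail_ping_interval) {
    return max_failures * fail_ping_interval;
}
```

With three failures and a 1000 ms interval the estimate is 3000 ms, in line with the 3 seconds quoted above. At a 100 ms interval it would drop to 300 ms, well below the 5 to 10+ seconds a new leader reportedly needs before it can serve.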
    return allocators_by_name_;
}

std::vector<std::shared_ptr<BufferAllocator>>& getAllocators() {
    allocators_(allocators),
    lock_(mutex) {}

std::unordered_map<std::string,
Major changes include: Client failover. Refactor SegmentManager. On the client side, limit the segment name to be equal to the localhost name. Add clientctl for manual e2e tests.
commit d530800 Author: ykwd <oneday117@qq.com> Date: Fri Jun 20 16:59:55 2025 +0800 [Store] High Availability V2: Client Failover (kvcache-ai#501)
Following #451, this PR is the second step toward high availability for the store.
This PR mainly focuses on the failover on the client side. More specifically, this PR adds the following features:
There are many corner cases, such as:
To deal with all the corner cases:
Currently, there is no distributed testing or chaos testing for the HA features, so they should be considered highly unstable. For that reason, this PR still does not change the default behavior of the system: HA features are enabled only when users explicitly specify --enable-ha=true. We will design and run more tests, including integration, distributed, and chaos testing, in subsequent PRs. Only after all these tests pass will it be safe to introduce the HA features in the docs.