[Store] High Availability V2: Client Failover #501

Merged
ykwd merged 39 commits into kvcache-ai:main from ykwd:feature/ha2
Jun 20, 2025

@ykwd
Collaborator

@ykwd ykwd commented Jun 16, 2025

Following #451, this PR is the second step towards high availability for the store.

This PR mainly focuses on the failover on the client side. More specifically, this PR adds the following features:

  • When the master has not received a Ping from a client for a pre-defined time period, either because the client terminated or because of a network partition, the master unmounts all segments on this client.
  • When the client regains its connection to the master after a network partition, it can automatically remount all of its local segments.
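
The Ping-based liveness check described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `ClientLivenessTracker`, `OnPing`, and `CollectExpired` are hypothetical names, the client UUID is modelled as a plain string, and the caller supplies the current time so the logic is easy to test.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the client UUID used in the PR.
using ClientId = std::string;
using Clock = std::chrono::steady_clock;

class ClientLivenessTracker {
   public:
    explicit ClientLivenessTracker(std::chrono::milliseconds ttl) : ttl_(ttl) {
        assert(ttl.count() > 0);
    }

    // Records a ping. Returns true if the client was unknown to the master,
    // which is the case where the client would need to remount its segments.
    bool OnPing(const ClientId& id, Clock::time_point now) {
        auto it = last_seen_.find(id);
        if (it == last_seen_.end()) {
            last_seen_.emplace(id, now);
            return true;
        }
        it->second = now;
        return false;
    }

    // Removes and returns every client that has been silent for longer than
    // the TTL. The caller would then unmount the segments of those clients.
    std::vector<ClientId> CollectExpired(Clock::time_point now) {
        std::vector<ClientId> expired;
        for (auto it = last_seen_.begin(); it != last_seen_.end();) {
            if (now - it->second > ttl_) {
                expired.push_back(it->first);
                it = last_seen_.erase(it);
            } else {
                ++it;
            }
        }
        return expired;
    }

   private:
    std::chrono::milliseconds ttl_;
    std::unordered_map<ClientId, Clock::time_point> last_seen_;
};
```

In this sketch a client that pings again after being expired is reported as unknown, which is the signal for it to remount.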

There are many corner cases, such as:

  • When a client crashes and then restarts, it may attempt to mount a segment with the same name as before, e.g., using its hostname as the segment name. Though the two segments have the same name, the old one is no longer available. The master must identify this situation, unmount the old segment, and mount the new one.
  • When a client gets partitioned from the master, the master unmounts its segments but may not yet have finished deleting all the replicas allocated on them. If the client reconnects at this point and tries to remount its segment, the segment must not be mounted, and its memory must not be allocated to new key-values, because old key-values that reside on it may still exist.
  • There is a leader view change. The client needs to remount all the segments to the new leader.
  • etc.

To deal with all the corner cases:

  • A UUID is associated with each client instance and each segment.
  • Segment unmounting is divided into two phases: prepare and commit. In the prepare phase, the segment's buffer allocator is deleted and the segment is marked as unmounting. After all the associated replicas are deleted, the commit phase fully deletes the segment. A segment cannot be mounted while a segment with the same UUID is still unmounting.
  • On the master side, the client status and segment status must be updated together atomically to avoid state inconsistency.
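
The two-phase unmount above can be sketched as a small state machine. This is a simplified illustration, not the SegmentManager from this PR: `SegmentTable`, `SegmentEntry`, and the replica counter are hypothetical, the UUID is modelled as a string, and locking is left out.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the segment UUID.
using SegmentId = std::string;

enum class SegmentState { kMounted, kUnmounting };

struct SegmentEntry {
    SegmentState state = SegmentState::kMounted;
    int live_replicas = 0;  // replicas still allocated on this segment
};

class SegmentTable {
   public:
    // Fails if a segment with this id is mounted or is still unmounting.
    bool Mount(const SegmentId& id) {
        return segments_.emplace(id, SegmentEntry{}).second;
    }

    // New replicas may only be placed on a mounted segment.
    bool AddReplica(const SegmentId& id) {
        auto it = segments_.find(id);
        if (it == segments_.end() ||
            it->second.state != SegmentState::kMounted) {
            return false;
        }
        ++it->second.live_replicas;
        return true;
    }

    void RemoveReplica(const SegmentId& id) {
        auto it = segments_.find(id);
        assert(it != segments_.end() && it->second.live_replicas > 0);
        --it->second.live_replicas;
    }

    // Phase 1 (prepare): stop new allocations and mark the segment.
    bool PrepareUnmount(const SegmentId& id) {
        auto it = segments_.find(id);
        if (it == segments_.end() ||
            it->second.state != SegmentState::kMounted) {
            return false;
        }
        it->second.state = SegmentState::kUnmounting;
        return true;
    }

    // Phase 2 (commit): allowed only once every replica has been deleted.
    bool CommitUnmount(const SegmentId& id) {
        auto it = segments_.find(id);
        if (it == segments_.end() ||
            it->second.state != SegmentState::kUnmounting ||
            it->second.live_replicas != 0) {
            return false;
        }
        segments_.erase(it);
        return true;
    }

   private:
    std::unordered_map<SegmentId, SegmentEntry> segments_;
};
```

Keeping the entry in the table during the unmounting state is what blocks a remount with the same UUID until the commit phase has run.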

Currently, there is no distributed testing or chaos testing for the HA features, so they should be considered highly unstable. This PR therefore does not change the default behavior of the system: HA features are enabled only when users explicitly specify --enable-ha=true. We will design and run more tests, including integration, distributed, and chaos tests, in subsequent PRs. Only after all these tests pass will it be safe to introduce the HA features in the docs.

ykwd added 30 commits May 28, 2025 08:59
bug fix: add string name of new errors that will be used in tostring.
@ykwd ykwd marked this pull request as ready for review June 17, 2025 09:59
Collaborator

@xiaguan xiaguan left a comment

Great work! Just a couple of minor issues.
I just skimmed through the code and didn’t look too closely at the logic yet.

Comment thread mooncake-store/include/master_metric_manager.h
Comment thread mooncake-store/include/master_service.h
Comment thread mooncake-store/include/segment.h
Comment thread mooncake-store/include/segment.h
Comment thread mooncake-store/include/types.h
Comment thread mooncake-store/include/types.h
Comment thread mooncake-store/src/client.cpp
Comment thread mooncake-store/src/client.cpp Outdated
@staryxchen
Collaborator

staryxchen commented Jun 19, 2025

Hi, did the previous PR #451 make ETCD a required dependency even when USE_ETCD is set to OFF?
I saw an error even if USE_ETCD is OFF:

go: github.com/kvcache-ai/Mooncake/mooncake-common/etcd imports
        go.etcd.io/etcd/client/v3: go.etcd.io/etcd/client/v3@v3.5.21: Get "https://proxy.golang.org/go.etcd.io/etcd/client/v3/@v/v3.5.21.zip": dial tcp ***1:443: i/o timeout

The immediate cause could be due to a network issue, but I don't need ETCD, so why can't I bypass this error by setting USE_ETCD to OFF? Is this as expected?

@ykwd
Collaborator Author

ykwd commented Jun 19, 2025

Sorry for the inconvenience. There is no need to install etcd. The error comes from a Go package used in mooncake-store. Do you want to compile the transfer engine only, or both the engine and the store? In the former case, setting WITH_STORE=OFF will bypass this dependency.

@staryxchen
Collaborator

So now there is no way to compile the store without ETCD? Can we make it an optional feature?

@ykwd
Collaborator Author

ykwd commented Jun 19, 2025

Thanks for the suggestion. I will try to make a hot fix for this.

@ykwd
Collaborator Author

ykwd commented Jun 19, 2025

Updates

  • Modifications suggested by reviews.
  • On the client side, 1) restrict mounting to segments whose name equals the local hostname; 2) allow mounting multiple non-overlapping segments with the same name.
  • Add a tool, clientctl, to make manual e2e tests easier.
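
The non-overlap rule for segments that share a name can be illustrated with a half-open interval check. This is a sketch under assumptions: `Range`, `Overlaps`, and `TryAddRange` are hypothetical helpers, and a segment is reduced to a base address and a size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one mounted memory range: [base, base + size).
struct Range {
    std::uintptr_t base;
    std::size_t size;
};

// Two half-open ranges overlap iff each one starts before the other ends.
inline bool Overlaps(const Range& a, const Range& b) {
    return a.base < b.base + b.size && b.base < a.base + a.size;
}

// Adds the range only if it overlaps none of the already mounted ranges.
// Returns whether the range was added.
inline bool TryAddRange(std::vector<Range>& mounted, const Range& r) {
    assert(r.size > 0);
    for (const Range& m : mounted) {
        if (Overlaps(m, r)) return false;
    }
    mounted.push_back(r);
    return true;
}
```

Adjacent ranges such as [0x1000, 0x2000) and [0x2000, 0x3000) are accepted by this check, while a range starting at 0x1800 would be rejected against the first one.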

@ykwd
Collaborator Author

ykwd commented Jun 19, 2025

@staryxchen Just submitted a hotfix. #520

@xiaguan xiaguan self-requested a review June 20, 2025 05:03
Collaborator

@xiaguan xiaguan left a comment

I think unifying the client and segment abstractions would make sense. The current client -> segments setup adds a lot of unnecessary complexity. Unifying them could simplify things quite a bit on both the master and client sides. Plus, it'll make future changes cleaner since we won't need to handle the one-client-to-multiple-segments case anymore.

Comment thread mooncake-store/include/segment.h Outdated
#include <string>
#include <string_view>
#include <unordered_map>
#include <variant>
Collaborator

Remove the unused header.

Collaborator Author

OK. Removed the variant.

Comment thread mooncake-store/src/segment.cpp Outdated
for (auto& segment_id : it->second) {
    auto segment_it = segment_manager_->mounted_segments_.find(segment_id);
    if (segment_it != segment_manager_->mounted_segments_.end()) {
        segments.push_back(segment_it->second.segment);
Collaborator

Use emplace_back instead of push_back where possible to avoid creating temporary objects. This can improve performance, especially for complex types.

Collaborator Author

OK
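
As a side note on the suggestion above, the following sketch shows where `emplace_back` actually saves work. `Tracked` and `RunDemo` are hypothetical names used only for counting constructor calls. When the argument is an already existing object, as in the snippet under review, `emplace_back` and `push_back` both perform one copy, so the gain appears only when the element can be built from constructor arguments.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Counters for the hypothetical Tracked type below.
struct Counts {
    int constructed;
    int copied;
    int moved;
};
Counts g_counts{0, 0, 0};

struct Tracked {
    std::string name;
    explicit Tracked(std::string n) : name(std::move(n)) {
        ++g_counts.constructed;
    }
    Tracked(const Tracked& other) : name(other.name) { ++g_counts.copied; }
    Tracked(Tracked&& other) noexcept : name(std::move(other.name)) {
        ++g_counts.moved;
    }
};

inline Counts RunDemo() {
    g_counts = Counts{0, 0, 0};
    std::vector<Tracked> v;
    v.reserve(3);  // avoid reallocation so only the insertions are counted

    v.push_back(Tracked("a"));  // constructs a temporary, then moves it in
    v.emplace_back("b");        // constructs the element in place

    Tracked existing("c");
    v.emplace_back(existing);   // copies, exactly like push_back(existing)

    assert(v.size() == 3);
    return g_counts;
}
```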

* @return client status from the master
* @return ErrorCode indicating success/failure
*/
[[nodiscard]] PingResponse Ping(const UUID& client_id);
Collaborator

Even though both functions are named 'ping', they actually have different meanings. We should probably rename them to better reflect what each one does.

Collaborator Author

By "both functions", do you mean there are two functions named 'ping'?

Collaborator

yes

Collaborator

one for heartbeat, and the other to check if the master is available?

Collaborator Author

OK, I found it. That is weird. I wonder where the duplicate ping came from. Perhaps it is from a merge. I will remove the duplicate one.

@@ -677,46 +709,44 @@ void Client::PingThreadFunc(int current_version) {
const int fail_ping_interval_ms = 1000;
Collaborator

We should probably turn these into configurable options too in the future

Collaborator Author

Unlike the master, the client is currently very hard to configure, and adding new configuration options to it is especially hard. Additionally, I do not see a reason to make this configurable for users.

Collaborator

We're planning to add a client config module in the future.

The ping interval does seem a bit long—do you think setting the default to 100ms would work better?

Collaborator Author

We're planning to add a client config module in the future.

That would be great. Looking forward to seeing it.

The ping interval does seem a bit long—do you think setting the default to 100ms would work better?

Perhaps not. Currently, if the leader crashes, it takes several seconds (perhaps 5 to 10+ seconds) before the new leader begins to serve. It takes 3 failed pings to trigger the client to query etcd for the new leader address. This takes 3 seconds, which fits the leader change timespan well. Additionally, too many pings would also put extra load on the master.
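
The trigger described in this reply (three consecutive failed pings before the client looks up the new leader) can be sketched as a small counter. `PingFailureCounter` and `OnPingResult` are hypothetical names, not the logic in `Client::PingThreadFunc`; the threshold of 3 comes from the comment above, and with a 1000 ms interval between failed pings it corresponds to roughly 3 seconds.

```cpp
#include <cassert>

// Hypothetical helper: counts consecutive ping failures and reports when the
// client should query etcd for the current leader address.
class PingFailureCounter {
   public:
    explicit PingFailureCounter(int threshold) : threshold_(threshold) {
        assert(threshold > 0);
    }

    // Feed the outcome of one ping. Returns true exactly when the number of
    // consecutive failures reaches the threshold.
    bool OnPingResult(bool ok) {
        if (ok) {
            consecutive_failures_ = 0;  // any success resets the streak
            return false;
        }
        if (++consecutive_failures_ < threshold_) return false;
        consecutive_failures_ = 0;  // start a fresh streak after the lookup
        return true;
    }

   private:
    int threshold_;
    int consecutive_failures_ = 0;
};
```

Resetting the streak on every successful ping means an occasional lost ping does not cause a leader lookup.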

Comment thread mooncake-store/include/segment.h Outdated
return allocators_by_name_;
}

std::vector<std::shared_ptr<BufferAllocator>>& getAllocators() {
Collaborator

add const

Collaborator Author

ok

Comment thread mooncake-store/include/segment.h Outdated
allocators_(allocators),
lock_(mutex) {}

std::unordered_map<std::string,
Collaborator

return a const ref

Collaborator Author

ok
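
The two accessor suggestions in these threads (const-qualify the getter, return a const reference) might look roughly like the following. This is an illustrative sketch only: `AllocatorRegistry`, the stand-in `BufferAllocator` struct, and the map's value type are assumptions, not the types in `segment.h`.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Stand-in for the real BufferAllocator class, reduced to a name.
struct BufferAllocator {
    std::string segment_name;
};

class AllocatorRegistry {
   public:
    using AllocatorPtr = std::shared_ptr<BufferAllocator>;
    using AllocatorMap =
        std::unordered_map<std::string, std::vector<AllocatorPtr>>;

    void Add(AllocatorPtr allocator) {
        assert(allocator != nullptr);
        by_name_[allocator->segment_name].push_back(allocator);
        all_.push_back(std::move(allocator));
    }

    // Const member functions returning const references: callers can read
    // the containers without copying them and cannot modify them.
    const std::vector<AllocatorPtr>& getAllocators() const { return all_; }
    const AllocatorMap& getAllocatorsByName() const { return by_name_; }

   private:
    std::vector<AllocatorPtr> all_;
    AllocatorMap by_name_;
};
```

Returning by const reference avoids copying the containers on every call, at the cost that the reference is only valid while the registry is alive and, in concurrent code, while the appropriate lock is held.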

@ykwd ykwd merged commit 91c5778 into kvcache-ai:main Jun 20, 2025
10 checks passed
alogfans pushed a commit to alogfans/Mooncake that referenced this pull request Jul 1, 2025
Major changes include: Client failover. Refactor SegmentManager. On the client side, limit the segment name to be equal to the localhost name. Add clientctl for manual e2e tests.
@ykwd ykwd deleted the feature/ha2 branch July 10, 2025 06:36
alogfans added a commit to alogfans/Mooncake that referenced this pull request Jul 14, 2025
commit 08fcdc8
Author: Feng Ren <alogfans@gmail.com>
Date:   Mon Jul 14 08:35:05 2025 +0000

    Reformat code

commit f99cae8
Author: Feng Ren <alogfans@gmail.com>
Date:   Fri Jul 11 07:46:02 2025 +0000

    Optimize the use of CQ and QP in RDMA workers

commit 12f6e41
Author: Feng Ren <alogfans@gmail.com>
Date:   Thu Jul 10 09:23:54 2025 +0000

    Cache remote segment

commit d335449
Author: Feng Ren <alogfans@gmail.com>
Date:   Thu Jul 10 08:08:41 2025 +0000

    Move generatePostPath in async threads

commit 4a24c5d
Author: Feng Ren <alogfans@gmail.com>
Date:   Thu Jul 10 07:13:17 2025 +0000

    Avoid pointer copy in transfer_engine.h

commit 22c018b
Author: Feng Ren <alogfans@gmail.com>
Date:   Thu Jul 10 06:57:36 2025 +0000

    Rename Segment Tracker

commit 0cdc522
Author: Feng Ren <alogfans@gmail.com>
Date:   Thu Jul 10 06:47:12 2025 +0000

    Use thread local storage

commit b993750
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 9 12:24:05 2025 +0000

    Revert all modifcations

commit 6ee61c8
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 9 12:23:06 2025 +0000

    Update

commit 68ae198
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 9 12:17:07 2025 +0000

    Update

commit 1c3b6fc
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 9 12:12:45 2025 +0000

    Update

commit 27420b4
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 9 11:15:01 2025 +0000

    Revert back

commit e17d3fc
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 9 11:04:38 2025 +0000

    Optimize allocateBatch and freeBatch

commit baf39d2
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 9 10:42:32 2025 +0000

    Test

commit 945aaf7
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 9 10:39:59 2025 +0000

    Test

commit f54dcd9
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 9 10:35:06 2025 +0000

    Revert

commit 8602c19
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 9 10:33:08 2025 +0000

    Test

commit 7bd251a
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 9 10:30:35 2025 +0000

    Test

commit 9b0c6e4
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 9 10:28:35 2025 +0000

    Upload test code

commit 36965d1
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 9 08:44:33 2025 +0000

    Hack

commit a096f1c
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 9 08:18:17 2025 +0000

    Update

commit 53b4946
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 9 08:07:15 2025 +0000

    Update slab allocator

commit ebcedf5
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 9 07:38:37 2025 +0000

    Use Slab instead of new/delete

commit 87482cc
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 9 05:48:03 2025 +0000

    Update

commit e1da71c
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 9 05:46:22 2025 +0000

    Fix

commit c07df31
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 9 05:45:54 2025 +0000

    Fix failed

commit 18115b0
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 9 05:42:58 2025 +0000

    Add log

commit 590b02b
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 9 05:38:16 2025 +0000

    Log

commit f3ae54a
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 9 03:32:43 2025 +0000

    Add message

commit 3cfac7c
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 9 03:27:44 2025 +0000

    Update

commit 211412c
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 9 03:25:04 2025 +0000

    Add assert

commit bb3e2b7
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 9 03:17:28 2025 +0000

    Add trace

commit ea713a2
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 9 02:42:00 2025 +0000

    Fix local transfer via RDMA

commit 2d337ed
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 9 02:36:23 2025 +0000

    Add logs

commit 2f3640f
Author: Feng Ren <alogfans@gmail.com>
Date:   Tue Jul 8 09:24:28 2025 +0000

    Add notify message in stderr

commit 2c2cc36
Author: Feng Ren <alogfans@gmail.com>
Date:   Tue Jul 8 09:11:41 2025 +0000

    Add backoff in metadata

commit d8cea58
Author: Feng Ren <alogfans@gmail.com>
Date:   Mon Jul 7 05:38:22 2025 +0000

    Add an auto-generated doc file of new test bench

commit 323d0ea
Author: Feng Ren <alogfans@gmail.com>
Date:   Mon Jul 7 03:28:04 2025 +0000

    Add CXL support in SHM transport

commit f0138dd
Author: Feng Ren <alogfans@gmail.com>
Date:   Fri Jul 4 07:58:00 2025 +0000

    Update MNNVL

    fix

    Fix bug

    Add GDS build

    refactor gds transport

    add include

    Fix

    Final fix

    Final fix

    Update

    Merge all modifications about GDS

commit 97d8ca3
Author: Feng Ren <alogfans@gmail.com>
Date:   Fri Jul 4 07:16:15 2025 +0000

    Add MNNVL to default build

commit 9af7604
Author: Feng Ren <alogfans@gmail.com>
Date:   Fri Jul 4 06:53:48 2025 +0000

    fix cuda runtime error

commit 646d570
Author: Feng Ren <alogfans@gmail.com>
Date:   Thu Jul 3 13:06:33 2025 +0000

    extract thread pool for all transports

commit 0c680e3
Author: Feng Ren <alogfans@gmail.com>
Date:   Thu Jul 3 12:49:47 2025 +0000

    Fix SHM problem

commit 2db2ca7
Author: Feng Ren <alogfans@gmail.com>
Date:   Thu Jul 3 05:58:40 2025 +0000

    fix rpc

commit 13fc066
Author: Feng Ren <alogfans@gmail.com>
Date:   Thu Jul 3 03:00:54 2025 +0000

    Fix rpc

commit 2d3b8bb
Author: Feng Ren <alogfans@gmail.com>
Date:   Thu Jul 3 02:27:58 2025 +0000

    add benchmark features

commit 594ddd9
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jul 2 03:42:54 2025 +0000

    add tcp notify

    update bench v1

commit 9d3d1a4
Author: Feng Ren <alogfans@gmail.com>
Date:   Tue Jul 1 14:22:14 2025 +0000

    Add MNNVL support

commit 8084ccf
Author: qicosmos <qicosmos@linux.alibaba.com>
Date:   Tue Jul 1 14:18:33 2025 +0800

    [cmake]fix cmake for centos (kvcache-ai#573)

    * fix cmake for centos

    * remove find_package jsoncpp

    * update

    * add cmake file

commit f2f5950
Author: Sgt.Pepper <1303471564@qq.com>
Date:   Tue Jul 1 14:03:59 2025 +0800

    fix Naming errors in doc transfer-engine-python.md (kvcache-ai#508)

commit 7892711
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Tue Jul 1 13:52:28 2025 +0800

    [P2P Store] Add cuda link option when it is installed (kvcache-ai#560)

commit 3638313
Author: SCDESPERTATE <74419971+SCDESPERTATE@users.noreply.github.com>
Date:   Tue Jul 1 13:49:17 2025 +0800

    Optimize slice handling to accelerate the large batch transfer operation (kvcache-ai#557)

    * kick off transfer first when there are too many slices to post in `RdmaTransport::submitTransferTask`

    * allow the last slice of a `TransferRequest` to be larger to reduce wr&&slice related overhead

    * add configs && docs explanation

    * remove a unnecessary reset operation

commit 88f75f2
Author: ykwd <oneday117@qq.com>
Date:   Tue Jul 1 11:40:01 2025 +0800

    [Store] Add Chaos Tests and Fix Bugs (kvcache-ai#568)

    - Add some chaos tests to verify the system's failover ability, including chaos_test, chaos_rand_test, e2e_rand_test.
    - Add chaosctl to do a manual and configurable chaos test.
    - Fix bugs found in the tests.
    - Add readme file for the e2e and chaos tests.

commit 96da7a6
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Tue Jul 1 11:17:40 2025 +0800

    [TransferEngine] Add support to force MNNVL transport by MC_FORCE_MNNVL (kvcache-ai#572)

    * [TransferEngine] Add support to force MNNVL transport by MC_FORCE_MNNVL

    * Change default parameters in test code

commit c5b2524
Author: Stary <151037142+staryxchen@users.noreply.github.com>
Date:   Tue Jul 1 10:53:38 2025 +0800

    [TransferEngine] ensure proper socket closure in destructor (kvcache-ai#566)

commit c5cc9ec
Author: JinYan Su <751080330@qq.com>
Date:   Mon Jun 30 17:10:58 2025 +0800

    feat(store): add zero copy batch put and get for python binding (kvcache-ai#551)

    * feat(store): add zero copy batch put and get for python binding

    * Add comprehensive tests for batch_get_into and batch_put_from operations

    - Add test_batch_get_into_operations: Tests batch zero-copy read operations
      * Validates interface correctness with multiple keys and buffer sizes
      * Tests data integrity for 3 different-sized objects (2.3KB, 4.6KB, 3.5KB)
      * Verifies error handling for mismatched array sizes and empty inputs
      * Ensures proper buffer registration and management

    - Add test_batch_put_from_operations: Tests batch zero-copy write operations
      * Validates interface correctness with multiple keys and buffer sizes
      * Tests data integrity for 3 different-sized objects (1.8KB, 3.6KB, 2.7KB)
      * Verifies error handling for mismatched array sizes and empty inputs
      * Confirms stored data can be retrieved correctly

    Both tests follow the existing test patterns and include comprehensive
    error case coverage while keeping the interface validation simple and focused.

    * Fix CI test failures for batch operations

    - Fix buffer size allocation in batch_put_from test: allocate buffer_size = len(data) + 1024
      for registration but use len(data) for actual put operation to avoid buffer registration
      errors (-600) seen in CI

    - Fix mismatched array size tests: ensure all arrays (keys, buffer_ptrs, buffer_sizes)
      have consistent slice lengths to properly test error handling

    - Both tests now pass locally and should resolve CI failures related to buffer
      registration and size validation

commit 10f588c
Author: xinranwang17 <87713897+xinranwang17@users.noreply.github.com>
Date:   Mon Jun 30 00:07:49 2025 +0800

    [Store] feat: support batch put/get api in python module (kvcache-ai#556)

    * support batch put/get api in python module

    * feat: refine put_batch python API

    use put_batch(list, list) instead of put_batch(dict)

    * Delete mooncake-integration/store/test/uc_store.py

commit 8ff2efc
Author: Teng Ma <teng-ma@linux.alibaba.com>
Date:   Fri Jun 27 14:24:43 2025 +0800

    [Integration] feat: expose batch reg API (kvcache-ai#558)

commit 8bc4b6f
Author: doujiang24 <doujiang24@gmail.com>
Date:   Fri Jun 27 12:19:35 2025 +0800

    [TransferEngine] fix segfault when create cq failed. (kvcache-ai#535)

    Signed-off-by: doujiang24 <doujiang24@gmail.com>

commit 3f0a784
Author: Wenjie <1186093704@qq.com>
Date:   Fri Jun 27 12:18:23 2025 +0800

    [TransferEngine]: fix compilation warning (kvcache-ai#550)

    Signed-off-by: swj <1186093704@qq.com>

commit d505516
Author: JinYan Su <751080330@qq.com>
Date:   Wed Jun 25 16:22:14 2025 +0800

    chore: checkout specific version of yalantinglibs in script (kvcache-ai#555)

commit 7d01004
Author: Shangming Cai <caishangming@linux.alibaba.com>
Date:   Wed Jun 25 15:57:27 2025 +0800

    chore: bump version to 0.3.4.post2 in pyproject.toml (kvcache-ai#554)

    Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

commit f2da050
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Wed Jun 25 15:46:48 2025 +0800

    [TransferEngine] Fix side effect of wild location registration (kvcache-ai#552)

commit cc286b0
Author: JinYan Su <751080330@qq.com>
Date:   Tue Jun 24 14:34:24 2025 +0800

    feat(store): add batch exist support for master (kvcache-ai#542)

    * feat(store): add batch exist support for master

    Signed-off-by: Jinyang Su <751080330@qq.com>

    * refactor(client): simplify BatchIsExist logic by removing duplicate checks

    Co-authored-by: Xinran Wang <wangxinran.wxr@antgroup.com>
    Co-authored-by: Yongke Zhang <yongke.zyk@antgroup.com>

    ---------

    Signed-off-by: Jinyang Su <751080330@qq.com>
    Co-authored-by: Xinran Wang <wangxinran.wxr@antgroup.com>
    Co-authored-by: Yongke Zhang <yongke.zyk@antgroup.com>

commit fa7fc23
Author: Star <151037142+staryxchen@users.noreply.github.com>
Date:   Mon Jun 23 19:46:11 2025 +0800

    [TransferEngine] support redis authentication and select db index (kvcache-ai#512)

    Signed-off-by: staryxchen <staryxchen@tencent.com>

commit c324966
Author: Shangming Cai <caishangming@linux.alibaba.com>
Date:   Mon Jun 23 19:26:45 2025 +0800

    chore: bump version to 0.3.4.post1 in pyproject.toml (kvcache-ai#544)

    Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

commit 07516f5
Author: Teng Ma <teng-ma@linux.alibaba.com>
Date:   Mon Jun 23 19:20:57 2025 +0800

    revert: disable pr 483 (kvcache-ai#543)

commit 34dddea
Author: haobayuxi <whaohit@gmail.com>
Date:   Mon Jun 23 17:53:33 2025 +0800

    add notify support (kvcache-ai#528)

    * add notify support

    * update transfer_engine_c.h wrapper

    * modification

    * remove vector.h

commit 4635724
Author: JinYan Su <751080330@qq.com>
Date:   Mon Jun 23 15:57:56 2025 +0800

    feat(master): support rpc server address parameter (kvcache-ai#530)

    * feat(master): support rpc server address parameter

    * docs: Add documentation for rpc_conn_timeout parameter

    - Add inline comment to MasterServiceSupervisor constructor explaining that rpc_conn_timeout=0 means no timeout (infinite)
    - Add clarifying comment in master.cpp about timeout behavior
    - Addresses PR feedback requesting documentation for timeout parameter semantics

commit 6c482da
Author: JinYan Su <751080330@qq.com>
Date:   Mon Jun 23 12:43:11 2025 +0800

    feat(store): add thread safety analysis with clang annotations (kvcache-ai#538)

    * feat(store): add thread safety analysis with clang annotations

    - Add GUARDED_BY annotations to metadata hash table
    - Use MutexLocker instead of std::unique_lock for better thread safety analysis
    - Add NO_THREAD_SAFETY_ANALYSIS annotations where needed
    - Enable clang thread safety checking in build system
    - Fix all thread safety warnings in master service

    * Update mooncake-store/src/master_service.cpp

    Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

    ---------

    Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

commit 73a15dd
Author: Shangming Cai <caishangming@linux.alibaba.com>
Date:   Fri Jun 20 17:54:32 2025 +0800

    chore: bump version to 0.3.4 in pyproject.toml (kvcache-ai#533)

    Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

commit 4fa0038
Author: JinYan Su <751080330@qq.com>
Date:   Fri Jun 20 17:46:24 2025 +0800

    feat(store): add zero-copy operations for python binding (kvcache-ai#532)

    * feat(store): add zero-copy operations for python binding

    * test: rename dict fuzz e2e test to run last

    * test: remove obsolete test_multicards.py from repository

    * chore(tests): remove multicards test execution from script

commit d530800
Author: ykwd <oneday117@qq.com>
Date:   Fri Jun 20 16:59:55 2025 +0800

    [Store] High Availability V2: Client Failover  (kvcache-ai#501)

    Major changes include: Client failover. Refactor SegmentManager. On the client side, limit the segment name to be equal to the localhost name. Add clientctl for manual e2e tests.

commit 8028747
Author: Francis <38564764+ssssnow@users.noreply.github.com>
Date:   Fri Jun 20 16:36:16 2025 +0800

    add support for batch transfer to accelerate transfer operation (kvcache-ai#499)

    * add support for batch transfer to accelerate transfer operation

    * fix tcp port exhausted issue

    * add more info and fix double free

    * rm unused freeBatchID

    * [chore] add TODO comment for batchTransfer

    * [Fix] reset 0 to transfer bytes

    ---------

    Co-authored-by: Francis <Francis>

commit bffda70
Author: Shangming Cai <caishangming@linux.alibaba.com>
Date:   Fri Jun 20 15:46:35 2025 +0800

    [Build] Optimize store build control for wheel and local build (kvcache-ai#531)

    * [Build] Optimize store build control for wheel and local build

    Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

    * fix typo

    Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

    * fix rm

    Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

    ---------

    Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

commit 41e8fa6
Author: doujiang24 <doujiang24@gmail.com>
Date:   Fri Jun 20 15:15:14 2025 +0800

    use kWildcardLocation instead of hardcode "cpu:0" to recognize cpu numa node automatically. (kvcache-ai#527)

    Signed-off-by: doujiang24 <doujiang24@gmail.com>

commit c743172
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Fri Jun 20 15:02:19 2025 +0800

    [TransferEngine] Disabling auto-delete QP trying to avoid the availabilty problem (kvcache-ai#483)

    need a proper fix in the feature.

commit 85ba034
Author: Shangming Cai <caishangming@linux.alibaba.com>
Date:   Fri Jun 20 14:19:41 2025 +0800

    [Build] Deprecate stale adaptor usage to reduce whl package size (kvcache-ai#529)

    Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

commit 29532e8
Author: Ke Yang <oneday117@qq.com>
Date:   Fri Jun 20 02:40:31 2025 +0000

    Update readme

commit 9688967
Author: Ke Yang <oneday117@qq.com>
Date:   Thu Jun 19 07:46:39 2025 +0000

    Add a message in cmakelists

commit 21aacb1
Author: Ke Yang <oneday117@qq.com>
Date:   Thu Jun 19 07:25:17 2025 +0000

    Skip etcd go package compilation in default.

commit 4f9a379
Author: Shangming Cai <caishangming@linux.alibaba.com>
Date:   Thu Jun 19 21:30:30 2025 +0800

    [Bugfix] Fix missing option and sglang integration doc (kvcache-ai#526)

commit e815d66
Author: Shangming Cai <caishangming@linux.alibaba.com>
Date:   Thu Jun 19 20:51:52 2025 +0800

    [TransferEngine] Change option use_nvlink to use_mnnvl to clarify the usage (kvcache-ai#525)

    Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

commit 3c823a9
Author: Shangming Cai <caishangming@linux.alibaba.com>
Date:   Thu Jun 19 20:00:09 2025 +0800

    [Build] Add allocator class to support nvlink for more use-cases (kvcache-ai#524)

    Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

commit 6c57e34
Author: Shangming Cai <caishangming@linux.alibaba.com>
Date:   Thu Jun 19 19:51:52 2025 +0800

    [Build] Optimize nvlink allocator build logic and fix name issue (kvcache-ai#523)

    Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

commit c2793df
Author: shangmingc <caishangming@linux.alibaba.com>
Date:   Wed Jun 18 21:12:13 2025 +0800

    [Build] add nvlink hook into python package dir for local build (kvcache-ai#517)

    Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

commit eebd666
Author: Teng Ma <teng-ma@linux.alibaba.com>
Date:   Wed Jun 18 19:04:09 2025 +0800

    [Build] add TE bench into wheel package (kvcache-ai#514)

commit d8e94e1
Author: dong <guodong9211@gmail.com>
Date:   Tue Jun 17 16:41:43 2025 +0800

    [MooncakeIntegration] Fix find class id (kvcache-ai#500)

commit caa1c4f
Author: JinYan Su <751080330@qq.com>
Date:   Tue Jun 17 00:44:41 2025 +0800

    fix(transfer-task): fix error handling logic in transfer task (kvcache-ai#503)

commit e68d8ba
Author: shangmingc <caishangming@linux.alibaba.com>
Date:   Mon Jun 16 20:13:40 2025 +0800

    chore: bump version to 0.3.3.post2 in pyproject.toml (kvcache-ai#498)

commit 8b75f03
Author: Shangming Cai <caishangming@linux.alibaba.com>
Date:   Mon Jun 16 18:54:57 2025 +0800

    [TransferEngine] Optimize custom allocator function name

    Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

commit 7743789
Author: Feng Ren <alogfans@gmail.com>
Date:   Tue Jul 1 06:53:53 2025 +0000

    update

commit e0e56ed
Author: Feng Ren <alogfans@gmail.com>
Date:   Tue Jul 1 03:28:33 2025 +0000

    add same machine check

commit 2c62ebc
Author: Feng Ren <alogfans@gmail.com>
Date:   Mon Jun 30 12:57:46 2025 +0000

    change memory alloc apis

commit 00426cb
Author: Feng Ren <alogfans@gmail.com>
Date:   Mon Jun 30 09:01:11 2025 +0000

    Update rpc

commit f9542d7
Author: Feng Ren <alogfans@gmail.com>
Date:   Mon Jun 30 06:40:09 2025 +0000

    support buffer for multiple xports

commit 32191ab
Author: Feng Ren <alogfans@gmail.com>
Date:   Thu Jun 26 14:38:32 2025 +0000

    new local segment helper

commit a6cc8d1
Author: Feng Ren <alogfans@gmail.com>
Date:   Thu Jun 26 13:09:44 2025 +0000

    Stage

commit bb218aa
Author: Feng Ren <alogfans@gmail.com>
Date:   Mon Jun 23 07:44:52 2025 +0000

    Add memory allocation APIs

commit 54fa77b
Author: Feng Ren <alogfans@gmail.com>
Date:   Mon Jun 23 07:09:30 2025 +0000

    add tcp transport

commit 0c336b5
Author: Feng Ren <alogfans@gmail.com>
Date:   Fri Jun 20 09:26:29 2025 +0000

    add allocator APIs for each transport

commit 0040915
Author: Feng Ren <alogfans@gmail.com>
Date:   Fri Jun 20 08:28:53 2025 +0000

    Move IP functions

commit 1bf9e6f
Author: Feng Ren <alogfans@gmail.com>
Date:   Thu Jun 19 09:40:22 2025 +0000

    Update conf

commit f20ca5a
Author: Feng Ren <alogfans@gmail.com>
Date:   Thu Jun 19 08:13:24 2025 +0000

    Update metadata passing logic

commit fb4a929
Author: Feng Ren <alogfans@gmail.com>
Date:   Thu Jun 19 06:49:39 2025 +0000

    Update

commit b6b5912
Author: Feng Ren <alogfans@gmail.com>
Date:   Thu Jun 19 05:08:09 2025 +0000

    Update config manager

commit 0281569
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jun 18 02:36:03 2025 +0000

    generalize slab allocator

commit 7e589e3
Author: Feng Ren <alogfans@gmail.com>
Date:   Wed Jun 18 02:02:00 2025 +0000

    finalize Status report

commit f2c91fe
Author: Feng Ren <alogfans@gmail.com>
Date:   Tue Jun 17 08:11:06 2025 +0000

    Change return value type

commit 805c824
Author: Feng Ren <alogfans@gmail.com>
Date:   Tue Jun 17 03:32:46 2025 +0000

    Update status report string

commit af7b2e9
Author: Feng Ren <alogfans@gmail.com>
Date:   Mon Jun 16 03:11:58 2025 +0000

    Pack common dependencies to common.h

commit 081e92b
Author: Feng Ren <alogfans@gmail.com>
Date:   Mon Jun 16 02:41:16 2025 +0000

    Rebase code to keep both v0 and v1 separately

commit c126276
Merge: 5640086 b414934
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Mon Jun 16 10:29:41 2025 +0800

    Merge branch 'kvcache-ai:main' into main

commit 5640086
Merge: f7eaf85 20829bc
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Thu Jun 12 10:35:40 2025 +0800

    Merge branch 'kvcache-ai:main' into main

commit f7eaf85
Merge: deee23d f09c501
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Wed Jun 11 11:29:50 2025 +0800

    Merge branch 'kvcache-ai:main' into main

commit deee23d
Author: Feng Ren <alogfans@gmail.com>
Date:   Tue Jun 10 03:28:56 2025 +0000

    [TransferEngine] Fix compilation bug in NVLink xport
201341 pushed a commit to 201341/Mooncake that referenced this pull request Jul 22, 2025
Major changes include: Client failover. Refactor SegmentManager. On the client side, limit the segment name to be equal to the localhost name. Add clientctl for manual e2e tests.
@ykwd ykwd mentioned this pull request Aug 21, 2025
29 tasks
wanyue-wy pushed a commit to wanyue-wy/Mooncake that referenced this pull request Dec 14, 2025
JasonZhang517 pushed a commit to JasonZhang517/Mooncake that referenced this pull request Feb 9, 2026

3 participants