Skip to content

feat(client): Abstract client-side data transmission for async and batch optimization#455

Merged
xiaguan merged 3 commits intokvcache-ai:mainfrom
xiaguan:transfer_builder
Jun 12, 2025
Merged

feat(client): Abstract client-side data transmission for async and batch optimization#455
xiaguan merged 3 commits intokvcache-ai:mainfrom
xiaguan:transfer_builder

Conversation

@xiaguan
Copy link
Copy Markdown
Collaborator

@xiaguan xiaguan commented Jun 6, 2025

Overview

This PR introduces a comprehensive abstraction layer for client-side data transmission.

  1. Async Memcpy: Foundation for making memory copy operations asynchronous
  2. Advanced Batching: Enhanced batch processing

Technical Details

Architecture Changes

// Before: Direct transfer engine calls
engine_.submitTransfer(batch_id, requests);

// After: Abstracted through TransferSubmitter
auto future = transfer_submitter_->submit(handles, slices, op_code);
ErrorCode result = future->get();

Future Roadmap

This abstraction enables several planned optimizations:

  1. Thread Pool Integration: Ready for introducing thread pools for parallel processing
  2. Performance Monitoring: Built-in hooks for transfer performance metrics

@xiaguan xiaguan force-pushed the transfer_builder branch 2 times, most recently from 45d4fd0 to 2913e35 Compare June 10, 2025 02:37
@xiaguan xiaguan requested review from Copilot and stmatengss June 10, 2025 02:49
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a new abstraction layer for client-side data transfers, centralizing sync and async logic into TransferSubmitter and related classes.

  • Adds TransferSubmitter, TransferFuture, and OperationState to encapsulate transfer strategies.
  • Refactors Client to use the new submitter API and removes inline fast-path memcpy.
  • Updates RDMA header guard and CMakeLists to include the new module.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
mooncake-transfer-engine/include/transport/rdma_transport/rdma_transport.h Renamed header guard to RDMA_TRANSPORT_H_
mooncake-store/include/transfer_task.h Introduces abstractions: OperationState, TransferFuture, etc.
mooncake-store/src/transfer_task.cpp Implements transfer logic, strategies, and async polling mechanics
mooncake-store/src/client.cpp Integrates TransferSubmitter, removes legacy memcpy fast paths
mooncake-store/include/client.h Declares transfer_submitter_ member
mooncake-store/src/CMakeLists.txt Adds transfer_task.cpp to build sources
Comments suppressed due to low confidence (4)

mooncake-store/include/transfer_task.h:131

  • The doc comment indicates a TaskCompletionStatus return, but the method is void. Update the documentation to match the signature.
void check_task_status();

mooncake-store/src/transfer_task.cpp:151

  • New async submission API is introduced here; consider adding unit tests covering success, failure, and strategy-selection paths.
std::optional<TransferFuture> TransferSubmitter::submit(

mooncake-store/src/transfer_task.cpp:301

  • [nitpick] Error message key invalid_partition_count is unclear. Use a more descriptive message, e.g., "Invalid transfer parameter: handles_size=...".
LOG(ERROR) << "invalid_partition_count handles_size=" << handles.size()

mooncake-store/include/transfer_task.h:84

  • assert is used but <cassert> is not included, which may lead to a compilation error. Add #include <cassert>.
assert(!result_.has_value());

Comment thread mooncake-store/src/transfer_task.cpp Outdated
Comment thread mooncake-store/src/transfer_task.cpp Outdated
Comment thread mooncake-store/include/transfer_task.h
@stmatengss
Copy link
Copy Markdown
Collaborator

One more thing: we can remove transfer task class to TE because it's general for async transferring.

Comment thread mooncake-store/include/transfer_task.h Outdated
Copy link
Copy Markdown
Collaborator

@stmatengss stmatengss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@xiaguan xiaguan force-pushed the transfer_builder branch from 2913e35 to 4bcfc06 Compare June 10, 2025 12:08
@xiaguan xiaguan self-assigned this Jun 10, 2025
@xiaguan xiaguan force-pushed the transfer_builder branch from 4bcfc06 to fc3ff31 Compare June 11, 2025 09:19
xiaguan added 2 commits June 11, 2025 18:07
Signed-off-by: Jinyang Su <751080330@qq.com>
Add `MemcpyWorkerPool` to manage asynchronous execution of memcpy tasks. Refactor `BatchGet` and `BatchPut` methods for parallel execution and enhance logging for better traceability.
@xiaguan xiaguan force-pushed the transfer_builder branch 2 times, most recently from 64a4435 to f8a2dd3 Compare June 11, 2025 10:09
commit 38c435f
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Wed Jun 11 16:50:29 2025 +0800

    Revert "[TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)" (kvcache-ai#469)

    This reverts commit ffaad6a.

commit 41b1df7
Author: ykwd <oneday117@qq.com>
Date:   Wed Jun 11 16:37:05 2025 +0800

    [Store] Add initial support for master high availability failover (kvcache-ai#451)

    * A temp version. Better to continue development after merging the latest main branch

    * Temp version to merge the latest main branch

    * Allow optional use HA mode, in default use non-HA mode. Fix a minor metrics bug.

    * Refactor the etcd_helper

    * refactor ha_helper

    * Add some unit tests. Refactor the code

    * Update cmakelists: build etcd_wrapper in default

    * Fix ci problems. Compile etcd wrapper only when use_etcd or with_store are set.

    * Update python config relating to mooncake-store client

    * make some blocking etcd helper function cancellable.
    bug fix: add string name of new errors that will be used in tostring.

    * Refactor etcd related code

    * Bug fix

    * Add basic masterviewhelper unit tests

    * In ci flow, install and start etcd to run HA feature unit test.

    * Fix a ci bug

    * Reuse master_server_address parameter and remove enable_ha parameter.

    * Format the code. Fix a minor bug.

    * Handle the error case: the coro server may fail to start or return internal error.

commit ffaad6a
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Wed Jun 11 16:02:41 2025 +0800

    [TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)

    * [TransferEngine] Fix compilation bug in NVLink xport

    * [TransferEngine] Fix minor bugs in nvlink benchmark

Signed-off-by: Jinyang Su <751080330@qq.com>
@xiaguan xiaguan force-pushed the transfer_builder branch from f8a2dd3 to 765301c Compare June 11, 2025 10:32
@xiaguan xiaguan merged commit 1a53701 into kvcache-ai:main Jun 12, 2025
10 checks passed
wanyue-wy pushed a commit to wanyue-wy/Mooncake that referenced this pull request Dec 14, 2025
…tch optimization (kvcache-ai#455)

* feat(client): add transfer submitter for optimized data transfer

Signed-off-by: Jinyang Su <751080330@qq.com>

* feat(store): implement async memcpy task execution with worker pool

Add `MemcpyWorkerPool` to manage asynchronous execution of memcpy tasks. Refactor `BatchGet` and `BatchPut` methods for parallel execution and enhance logging for better traceability.

* Squashed commit of the following:

commit 6b07418
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Wed Jun 11 16:50:29 2025 +0800

    Revert "[TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)" (kvcache-ai#469)

    This reverts commit 506e204.

commit 60567fc
Author: ykwd <oneday117@qq.com>
Date:   Wed Jun 11 16:37:05 2025 +0800

    [Store] Add initial support for master high availability failover (kvcache-ai#451)

    * A temp version. Better to continue development after merging the latest main branch

    * Temp version to merge the latest main branch

    * Allow optional use HA mode, in default use non-HA mode. Fix a minor metrics bug.

    * Refactor the etcd_helper

    * refactor ha_helper

    * Add some unit tests. Refactor the code

    * Update cmakelists: build etcd_wrapper in default

    * Fix ci problems. Compile etcd wrapper only when use_etcd or with_store are set.

    * Update python config relating to mooncake-store client

    * make some blocking etcd helper function cancellable.
    bug fix: add string name of new errors that will be used in tostring.

    * Refactor etcd related code

    * Bug fix

    * Add basic masterviewhelper unit tests

    * In ci flow, install and start etcd to run HA feature unit test.

    * Fix a ci bug

    * Reuse master_server_address parameter and remove enable_ha parameter.

    * Format the code. Fix a minor bug.

    * Handle the error case: the coro server may fail to start or return internal error.

commit 506e204
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Wed Jun 11 16:02:41 2025 +0800

    [TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)

    * [TransferEngine] Fix compilation bug in NVLink xport

    * [TransferEngine] Fix minor bugs in nvlink benchmark

Signed-off-by: Jinyang Su <751080330@qq.com>

---------

Signed-off-by: Jinyang Su <751080330@qq.com>
JasonZhang517 pushed a commit to JasonZhang517/Mooncake that referenced this pull request Feb 9, 2026
…tch optimization (kvcache-ai#455)

* feat(client): add transfer submitter for optimized data transfer

Signed-off-by: Jinyang Su <751080330@qq.com>

* feat(store): implement async memcpy task execution with worker pool

Add `MemcpyWorkerPool` to manage asynchronous execution of memcpy tasks. Refactor `BatchGet` and `BatchPut` methods for parallel execution and enhance logging for better traceability.

* Squashed commit of the following:

commit 6e154d0
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Wed Jun 11 16:50:29 2025 +0800

    Revert "[TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)" (kvcache-ai#469)

    This reverts commit 4675e9d.

commit a2ca348
Author: ykwd <oneday117@qq.com>
Date:   Wed Jun 11 16:37:05 2025 +0800

    [Store] Add initial support for master high availability failover (kvcache-ai#451)

    * A temp version. Better to continue development after merging the latest main branch

    * Temp version to merge the latest main branch

    * Allow optional use HA mode, in default use non-HA mode. Fix a minor metrics bug.

    * Refactor the etcd_helper

    * refactor ha_helper

    * Add some unit tests. Refactor the code

    * Update cmakelists: build etcd_wrapper in default

    * Fix ci problems. Compile etcd wrapper only when use_etcd or with_store are set.

    * Update python config relating to mooncake-store client

    * make some blocking etcd helper function cancellable.
    bug fix: add string name of new errors that will be used in tostring.

    * Refactor etcd related code

    * Bug fix

    * Add basic masterviewhelper unit tests

    * In ci flow, install and start etcd to run HA feature unit test.

    * Fix a ci bug

    * Reuse master_server_address parameter and remove enable_ha parameter.

    * Format the code. Fix a minor bug.

    * Handle the error case: the coro server may fail to start or return internal error.

commit 4675e9d
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Wed Jun 11 16:02:41 2025 +0800

    [TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)

    * [TransferEngine] Fix compilation bug in NVLink xport

    * [TransferEngine] Fix minor bugs in nvlink benchmark

Signed-off-by: Jinyang Su <751080330@qq.com>

---------

Signed-off-by: Jinyang Su <751080330@qq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants