[TransferEngine] Fix minor bugs in NVLink transport and benchmark#468
Merged
ShangmingCai merged 3 commits into kvcache-ai:main on Jun 11, 2025
xiaguan added a commit to xiaguan/Mooncake that referenced this pull request on Jun 11, 2025

commit 38c435f
Author: Feng Ren <alogfans@users.noreply.github.com>
Date: Wed Jun 11 16:50:29 2025 +0800

Revert "[TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)" (kvcache-ai#469)

This reverts commit ffaad6a.

commit 41b1df7
Author: ykwd <oneday117@qq.com>
Date: Wed Jun 11 16:37:05 2025 +0800

[Store] Add initial support for master high availability failover (kvcache-ai#451)

* A temp version. Better to continue development after merging the latest main branch
* Temp version to merge the latest main branch
* Allow optional use HA mode, in default use non-HA mode. Fix a minor metrics bug.
* Refactor the etcd_helper
* refactor ha_helper
* Add some unit tests. Refactor the code
* Update cmakelists: build etcd_wrapper in default
* Fix ci problems. Compile etcd wrapper only when use_etcd or with_store are set.
* Update python config relating to mooncake-store client
* make some blocking etcd helper function cancellable. bug fix: add string name of new errors that will be used in tostring.
* Refactor etcd related code
* Bug fix
* Add basic masterviewhelper unit tests
* In ci flow, install and start etcd to run HA feature unit test.
* Fix a ci bug
* Reuse master_server_address parameter and remove enable_ha parameter.
* Format the code. Fix a minor bug.
* Handle the error case: the coro server may fail to start or return internal error.

commit ffaad6a
Author: Feng Ren <alogfans@users.noreply.github.com>
Date: Wed Jun 11 16:02:41 2025 +0800

[TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)

* [TransferEngine] Fix compilation bug in NVLink xport
* [TransferEngine] Fix minor bugs in nvlink benchmark
xiaguan added a commit to xiaguan/Mooncake that referenced this pull request on Jun 11, 2025

commit 38c435f
Author: Feng Ren <alogfans@users.noreply.github.com>
Date: Wed Jun 11 16:50:29 2025 +0800

Revert "[TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)" (kvcache-ai#469)

This reverts commit ffaad6a.

commit 41b1df7
Author: ykwd <oneday117@qq.com>
Date: Wed Jun 11 16:37:05 2025 +0800

[Store] Add initial support for master high availability failover (kvcache-ai#451)

* A temp version. Better to continue development after merging the latest main branch
* Temp version to merge the latest main branch
* Allow optional use HA mode, in default use non-HA mode. Fix a minor metrics bug.
* Refactor the etcd_helper
* refactor ha_helper
* Add some unit tests. Refactor the code
* Update cmakelists: build etcd_wrapper in default
* Fix ci problems. Compile etcd wrapper only when use_etcd or with_store are set.
* Update python config relating to mooncake-store client
* make some blocking etcd helper function cancellable. bug fix: add string name of new errors that will be used in tostring.
* Refactor etcd related code
* Bug fix
* Add basic masterviewhelper unit tests
* In ci flow, install and start etcd to run HA feature unit test.
* Fix a ci bug
* Reuse master_server_address parameter and remove enable_ha parameter.
* Format the code. Fix a minor bug.
* Handle the error case: the coro server may fail to start or return internal error.

commit ffaad6a
Author: Feng Ren <alogfans@users.noreply.github.com>
Date: Wed Jun 11 16:02:41 2025 +0800

[TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)

* [TransferEngine] Fix compilation bug in NVLink xport
* [TransferEngine] Fix minor bugs in nvlink benchmark

Signed-off-by: Jinyang Su <751080330@qq.com>
xiaguan added a commit that referenced this pull request on Jun 12, 2025

…tch optimization (#455)

* feat(client): add transfer submitter for optimized data transfer

Signed-off-by: Jinyang Su <751080330@qq.com>

* feat(store): implement async memcpy task execution with worker pool

Add `MemcpyWorkerPool` to manage asynchronous execution of memcpy tasks. Refactor `BatchGet` and `BatchPut` methods for parallel execution and enhance logging for better traceability.

* Squashed commit of the following:

commit 38c435f
Author: Feng Ren <alogfans@users.noreply.github.com>
Date: Wed Jun 11 16:50:29 2025 +0800

Revert "[TransferEngine] Fix minor bugs in NVLink transport and benchmark (#468)" (#469)

This reverts commit ffaad6a.

commit 41b1df7
Author: ykwd <oneday117@qq.com>
Date: Wed Jun 11 16:37:05 2025 +0800

[Store] Add initial support for master high availability failover (#451)

* A temp version. Better to continue development after merging the latest main branch
* Temp version to merge the latest main branch
* Allow optional use HA mode, in default use non-HA mode. Fix a minor metrics bug.
* Refactor the etcd_helper
* refactor ha_helper
* Add some unit tests. Refactor the code
* Update cmakelists: build etcd_wrapper in default
* Fix ci problems. Compile etcd wrapper only when use_etcd or with_store are set.
* Update python config relating to mooncake-store client
* make some blocking etcd helper function cancellable. bug fix: add string name of new errors that will be used in tostring.
* Refactor etcd related code
* Bug fix
* Add basic masterviewhelper unit tests
* In ci flow, install and start etcd to run HA feature unit test.
* Fix a ci bug
* Reuse master_server_address parameter and remove enable_ha parameter.
* Format the code. Fix a minor bug.
* Handle the error case: the coro server may fail to start or return internal error.

commit ffaad6a
Author: Feng Ren <alogfans@users.noreply.github.com>
Date: Wed Jun 11 16:02:41 2025 +0800

[TransferEngine] Fix minor bugs in NVLink transport and benchmark (#468)

* [TransferEngine] Fix compilation bug in NVLink xport
* [TransferEngine] Fix minor bugs in nvlink benchmark

Signed-off-by: Jinyang Su <751080330@qq.com>

---------

Signed-off-by: Jinyang Su <751080330@qq.com>
wanyue-wy pushed a commit to wanyue-wy/Mooncake that referenced this pull request on Dec 14, 2025

…cache-ai#468)

* [TransferEngine] Fix compilation bug in NVLink xport
* [TransferEngine] Fix minor bugs in nvlink benchmark
wanyue-wy pushed a commit to wanyue-wy/Mooncake that referenced this pull request on Dec 14, 2025

…mark (kvcache-ai#468)" (kvcache-ai#469)

This reverts commit 506e204.
wanyue-wy pushed a commit to wanyue-wy/Mooncake that referenced this pull request on Dec 14, 2025

…tch optimization (kvcache-ai#455)

* feat(client): add transfer submitter for optimized data transfer

Signed-off-by: Jinyang Su <751080330@qq.com>

* feat(store): implement async memcpy task execution with worker pool

Add `MemcpyWorkerPool` to manage asynchronous execution of memcpy tasks. Refactor `BatchGet` and `BatchPut` methods for parallel execution and enhance logging for better traceability.

* Squashed commit of the following:

commit 6b07418
Author: Feng Ren <alogfans@users.noreply.github.com>
Date: Wed Jun 11 16:50:29 2025 +0800

Revert "[TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)" (kvcache-ai#469)

This reverts commit 506e204.

commit 60567fc
Author: ykwd <oneday117@qq.com>
Date: Wed Jun 11 16:37:05 2025 +0800

[Store] Add initial support for master high availability failover (kvcache-ai#451)

* A temp version. Better to continue development after merging the latest main branch
* Temp version to merge the latest main branch
* Allow optional use HA mode, in default use non-HA mode. Fix a minor metrics bug.
* Refactor the etcd_helper
* refactor ha_helper
* Add some unit tests. Refactor the code
* Update cmakelists: build etcd_wrapper in default
* Fix ci problems. Compile etcd wrapper only when use_etcd or with_store are set.
* Update python config relating to mooncake-store client
* make some blocking etcd helper function cancellable. bug fix: add string name of new errors that will be used in tostring.
* Refactor etcd related code
* Bug fix
* Add basic masterviewhelper unit tests
* In ci flow, install and start etcd to run HA feature unit test.
* Fix a ci bug
* Reuse master_server_address parameter and remove enable_ha parameter.
* Format the code. Fix a minor bug.
* Handle the error case: the coro server may fail to start or return internal error.

commit 506e204
Author: Feng Ren <alogfans@users.noreply.github.com>
Date: Wed Jun 11 16:02:41 2025 +0800

[TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)

* [TransferEngine] Fix compilation bug in NVLink xport
* [TransferEngine] Fix minor bugs in nvlink benchmark

Signed-off-by: Jinyang Su <751080330@qq.com>

---------

Signed-off-by: Jinyang Su <751080330@qq.com>
JasonZhang517 pushed a commit to JasonZhang517/Mooncake that referenced this pull request on Feb 9, 2026

…cache-ai#468)

* [TransferEngine] Fix compilation bug in NVLink xport
* [TransferEngine] Fix minor bugs in nvlink benchmark
JasonZhang517 pushed a commit to JasonZhang517/Mooncake that referenced this pull request on Feb 9, 2026

…mark (kvcache-ai#468)" (kvcache-ai#469)

This reverts commit 4675e9d.
JasonZhang517 pushed a commit to JasonZhang517/Mooncake that referenced this pull request on Feb 9, 2026

…tch optimization (kvcache-ai#455)

* feat(client): add transfer submitter for optimized data transfer

Signed-off-by: Jinyang Su <751080330@qq.com>

* feat(store): implement async memcpy task execution with worker pool

Add `MemcpyWorkerPool` to manage asynchronous execution of memcpy tasks. Refactor `BatchGet` and `BatchPut` methods for parallel execution and enhance logging for better traceability.

* Squashed commit of the following:

commit 6e154d0
Author: Feng Ren <alogfans@users.noreply.github.com>
Date: Wed Jun 11 16:50:29 2025 +0800

Revert "[TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)" (kvcache-ai#469)

This reverts commit 4675e9d.

commit a2ca348
Author: ykwd <oneday117@qq.com>
Date: Wed Jun 11 16:37:05 2025 +0800

[Store] Add initial support for master high availability failover (kvcache-ai#451)

* A temp version. Better to continue development after merging the latest main branch
* Temp version to merge the latest main branch
* Allow optional use HA mode, in default use non-HA mode. Fix a minor metrics bug.
* Refactor the etcd_helper
* refactor ha_helper
* Add some unit tests. Refactor the code
* Update cmakelists: build etcd_wrapper in default
* Fix ci problems. Compile etcd wrapper only when use_etcd or with_store are set.
* Update python config relating to mooncake-store client
* make some blocking etcd helper function cancellable. bug fix: add string name of new errors that will be used in tostring.
* Refactor etcd related code
* Bug fix
* Add basic masterviewhelper unit tests
* In ci flow, install and start etcd to run HA feature unit test.
* Fix a ci bug
* Reuse master_server_address parameter and remove enable_ha parameter.
* Format the code. Fix a minor bug.
* Handle the error case: the coro server may fail to start or return internal error.

commit 4675e9d
Author: Feng Ren <alogfans@users.noreply.github.com>
Date: Wed Jun 11 16:02:41 2025 +0800

[TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)

* [TransferEngine] Fix compilation bug in NVLink xport
* [TransferEngine] Fix minor bugs in nvlink benchmark

Signed-off-by: Jinyang Su <751080330@qq.com>

---------

Signed-off-by: Jinyang Su <751080330@qq.com>
No description provided.