[TransferEngine] Fix minor bugs in NVLink transport and benchmark #468

Merged
ShangmingCai merged 3 commits into kvcache-ai:main from alogfans:fix-bench-code
Jun 11, 2025

Conversation

@alogfans
Collaborator

No description provided.

@alogfans alogfans changed the title from "Fix minor bugs in nvlink benchmark" to "[TransferEngine] Fix minor bugs in NVLink transport and benchmark" Jun 11, 2025
Collaborator

@ShangmingCai ShangmingCai left a comment

LGTM

@ShangmingCai ShangmingCai merged commit ffaad6a into kvcache-ai:main Jun 11, 2025
10 checks passed
alogfans added a commit that referenced this pull request Jun 11, 2025
xiaguan added a commit to xiaguan/Mooncake that referenced this pull request Jun 11, 2025
commit 38c435f
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Wed Jun 11 16:50:29 2025 +0800

    Revert "[TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)" (kvcache-ai#469)

    This reverts commit ffaad6a.

commit 41b1df7
Author: ykwd <oneday117@qq.com>
Date:   Wed Jun 11 16:37:05 2025 +0800

    [Store] Add initial support for master high availability failover (kvcache-ai#451)

    * A temp version. Better to continue development after merging the latest main branch

    * Temp version to merge the latest main branch

    * Allow optional use HA mode, in default use non-HA mode. Fix a minor metrics bug.

    * Refactor the etcd_helper

    * refactor ha_helper

    * Add some unit tests. Refactor the code

    * Update cmakelists: build etcd_wrapper in default

    * Fix ci problems. Compile etcd wrapper only when use_etcd or with_store are set.

    * Update python config relating to mooncake-store client

    * make some blocking etcd helper function cancellable.
    bug fix: add string name of new errors that will be used in tostring.

    * Refactor etcd related code

    * Bug fix

    * Add basic masterviewhelper unit tests

    * In ci flow, install and start etcd to run HA feature unit test.

    * Fix a ci bug

    * Reuse master_server_address parameter and remove enable_ha parameter.

    * Format the code. Fix a minor bug.

    * Handle the error case: the coro server may fail to start or return internal error.

commit ffaad6a
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Wed Jun 11 16:02:41 2025 +0800

    [TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)

    * [TransferEngine] Fix compilation bug in NVLink xport

    * [TransferEngine] Fix minor bugs in nvlink benchmark
xiaguan added a commit to xiaguan/Mooncake that referenced this pull request Jun 11, 2025
commit 38c435f
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Wed Jun 11 16:50:29 2025 +0800

    Revert "[TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)" (kvcache-ai#469)

    This reverts commit ffaad6a.

commit 41b1df7
Author: ykwd <oneday117@qq.com>
Date:   Wed Jun 11 16:37:05 2025 +0800

    [Store] Add initial support for master high availability failover (kvcache-ai#451)

    * A temp version. Better to continue development after merging the latest main branch

    * Temp version to merge the latest main branch

    * Allow optional use HA mode, in default use non-HA mode. Fix a minor metrics bug.

    * Refactor the etcd_helper

    * refactor ha_helper

    * Add some unit tests. Refactor the code

    * Update cmakelists: build etcd_wrapper in default

    * Fix ci problems. Compile etcd wrapper only when use_etcd or with_store are set.

    * Update python config relating to mooncake-store client

    * make some blocking etcd helper function cancellable.
    bug fix: add string name of new errors that will be used in tostring.

    * Refactor etcd related code

    * Bug fix

    * Add basic masterviewhelper unit tests

    * In ci flow, install and start etcd to run HA feature unit test.

    * Fix a ci bug

    * Reuse master_server_address parameter and remove enable_ha parameter.

    * Format the code. Fix a minor bug.

    * Handle the error case: the coro server may fail to start or return internal error.

commit ffaad6a
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Wed Jun 11 16:02:41 2025 +0800

    [TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)

    * [TransferEngine] Fix compilation bug in NVLink xport

    * [TransferEngine] Fix minor bugs in nvlink benchmark

Signed-off-by: Jinyang Su <751080330@qq.com>
xiaguan added a commit that referenced this pull request Jun 12, 2025
…tch optimization (#455)

* feat(client): add transfer submitter for optimized data transfer

Signed-off-by: Jinyang Su <751080330@qq.com>

* feat(store): implement async memcpy task execution with worker pool

Add `MemcpyWorkerPool` to manage asynchronous execution of memcpy tasks. Refactor `BatchGet` and `BatchPut` methods for parallel execution and enhance logging for better traceability.

* Squashed commit of the following:

commit 38c435f
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Wed Jun 11 16:50:29 2025 +0800

    Revert "[TransferEngine] Fix minor bugs in NVLink transport and benchmark (#468)" (#469)

    This reverts commit ffaad6a.

commit 41b1df7
Author: ykwd <oneday117@qq.com>
Date:   Wed Jun 11 16:37:05 2025 +0800

    [Store] Add initial support for master high availability failover (#451)

    * A temp version. Better to continue development after merging the latest main branch

    * Temp version to merge the latest main branch

    * Allow optional use HA mode, in default use non-HA mode. Fix a minor metrics bug.

    * Refactor the etcd_helper

    * refactor ha_helper

    * Add some unit tests. Refactor the code

    * Update cmakelists: build etcd_wrapper in default

    * Fix ci problems. Compile etcd wrapper only when use_etcd or with_store are set.

    * Update python config relating to mooncake-store client

    * make some blocking etcd helper function cancellable.
    bug fix: add string name of new errors that will be used in tostring.

    * Refactor etcd related code

    * Bug fix

    * Add basic masterviewhelper unit tests

    * In ci flow, install and start etcd to run HA feature unit test.

    * Fix a ci bug

    * Reuse master_server_address parameter and remove enable_ha parameter.

    * Format the code. Fix a minor bug.

    * Handle the error case: the coro server may fail to start or return internal error.

commit ffaad6a
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Wed Jun 11 16:02:41 2025 +0800

    [TransferEngine] Fix minor bugs in NVLink transport and benchmark (#468)

    * [TransferEngine] Fix compilation bug in NVLink xport

    * [TransferEngine] Fix minor bugs in nvlink benchmark

Signed-off-by: Jinyang Su <751080330@qq.com>

---------

Signed-off-by: Jinyang Su <751080330@qq.com>
wanyue-wy pushed a commit to wanyue-wy/Mooncake that referenced this pull request Dec 14, 2025
…cache-ai#468)

* [TransferEngine] Fix compilation bug in NVLink xport

* [TransferEngine] Fix minor bugs in nvlink benchmark
wanyue-wy pushed a commit to wanyue-wy/Mooncake that referenced this pull request Dec 14, 2025
wanyue-wy pushed a commit to wanyue-wy/Mooncake that referenced this pull request Dec 14, 2025
…tch optimization (kvcache-ai#455)

* feat(client): add transfer submitter for optimized data transfer

Signed-off-by: Jinyang Su <751080330@qq.com>

* feat(store): implement async memcpy task execution with worker pool

Add `MemcpyWorkerPool` to manage asynchronous execution of memcpy tasks. Refactor `BatchGet` and `BatchPut` methods for parallel execution and enhance logging for better traceability.

* Squashed commit of the following:

commit 6b07418
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Wed Jun 11 16:50:29 2025 +0800

    Revert "[TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)" (kvcache-ai#469)

    This reverts commit 506e204.

commit 60567fc
Author: ykwd <oneday117@qq.com>
Date:   Wed Jun 11 16:37:05 2025 +0800

    [Store] Add initial support for master high availability failover (kvcache-ai#451)

    * A temp version. Better to continue development after merging the latest main branch

    * Temp version to merge the latest main branch

    * Allow optional use HA mode, in default use non-HA mode. Fix a minor metrics bug.

    * Refactor the etcd_helper

    * refactor ha_helper

    * Add some unit tests. Refactor the code

    * Update cmakelists: build etcd_wrapper in default

    * Fix ci problems. Compile etcd wrapper only when use_etcd or with_store are set.

    * Update python config relating to mooncake-store client

    * make some blocking etcd helper function cancellable.
    bug fix: add string name of new errors that will be used in tostring.

    * Refactor etcd related code

    * Bug fix

    * Add basic masterviewhelper unit tests

    * In ci flow, install and start etcd to run HA feature unit test.

    * Fix a ci bug

    * Reuse master_server_address parameter and remove enable_ha parameter.

    * Format the code. Fix a minor bug.

    * Handle the error case: the coro server may fail to start or return internal error.

commit 506e204
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Wed Jun 11 16:02:41 2025 +0800

    [TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)

    * [TransferEngine] Fix compilation bug in NVLink xport

    * [TransferEngine] Fix minor bugs in nvlink benchmark

Signed-off-by: Jinyang Su <751080330@qq.com>

---------

Signed-off-by: Jinyang Su <751080330@qq.com>
JasonZhang517 pushed a commit to JasonZhang517/Mooncake that referenced this pull request Feb 9, 2026
…cache-ai#468)

* [TransferEngine] Fix compilation bug in NVLink xport

* [TransferEngine] Fix minor bugs in nvlink benchmark
JasonZhang517 pushed a commit to JasonZhang517/Mooncake that referenced this pull request Feb 9, 2026
JasonZhang517 pushed a commit to JasonZhang517/Mooncake that referenced this pull request Feb 9, 2026
…tch optimization (kvcache-ai#455)

* feat(client): add transfer submitter for optimized data transfer

Signed-off-by: Jinyang Su <751080330@qq.com>

* feat(store): implement async memcpy task execution with worker pool

Add `MemcpyWorkerPool` to manage asynchronous execution of memcpy tasks. Refactor `BatchGet` and `BatchPut` methods for parallel execution and enhance logging for better traceability.

* Squashed commit of the following:

commit 6e154d0
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Wed Jun 11 16:50:29 2025 +0800

    Revert "[TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)" (kvcache-ai#469)

    This reverts commit 4675e9d.

commit a2ca348
Author: ykwd <oneday117@qq.com>
Date:   Wed Jun 11 16:37:05 2025 +0800

    [Store] Add initial support for master high availability failover (kvcache-ai#451)

    * A temp version. Better to continue development after merging the latest main branch

    * Temp version to merge the latest main branch

    * Allow optional use HA mode, in default use non-HA mode. Fix a minor metrics bug.

    * Refactor the etcd_helper

    * refactor ha_helper

    * Add some unit tests. Refactor the code

    * Update cmakelists: build etcd_wrapper in default

    * Fix ci problems. Compile etcd wrapper only when use_etcd or with_store are set.

    * Update python config relating to mooncake-store client

    * make some blocking etcd helper function cancellable.
    bug fix: add string name of new errors that will be used in tostring.

    * Refactor etcd related code

    * Bug fix

    * Add basic masterviewhelper unit tests

    * In ci flow, install and start etcd to run HA feature unit test.

    * Fix a ci bug

    * Reuse master_server_address parameter and remove enable_ha parameter.

    * Format the code. Fix a minor bug.

    * Handle the error case: the coro server may fail to start or return internal error.

commit 4675e9d
Author: Feng Ren <alogfans@users.noreply.github.com>
Date:   Wed Jun 11 16:02:41 2025 +0800

    [TransferEngine] Fix minor bugs in NVLink transport and benchmark (kvcache-ai#468)

    * [TransferEngine] Fix compilation bug in NVLink xport

    * [TransferEngine] Fix minor bugs in nvlink benchmark

Signed-off-by: Jinyang Su <751080330@qq.com>

---------

Signed-off-by: Jinyang Su <751080330@qq.com>
2 participants