[Store] Expose is_local_disk_replica() to Python + enable offload RPC in standalone mooncake_client by zhewenl · Pull Request #2083 · kvcache-ai/Mooncake

zhewenl · 2026-05-11T21:31:57Z

Summary

Two small follow-ups to #2004 (LOCAL_DISK read-path fix) and to #1857 (which introduced the start_offload_rpc_server switch) that complete end-to-end disk-tier readback when the Mooncake owner is launched via the standalone mooncake_client binary path (e.g. through start_mooncake_owner.sh in vLLM integrations).

Without these two, even with #2004 + #1857 already merged:

- writes succeed and .bucket files land on SSD, but
- every disk-tier read returns -900 RPC_FAIL
- replica descriptors round-trip to Python as unknown (neither is_memory_replica() nor is_disk_replica() matches a LocalDiskDescriptor)

These two commits were originally part of ivanium/Mooncake#5 (the consolidated end-to-end disk-read fix for GB200 + Kimi-K2.5-NVFP4 testing). #2004 picked up the in-process read-path fixes (Bug B / D / E + replica selection). This PR upstreams the two remaining pieces.

Bugs fixed

Commit 1 — `is_local_disk_replica()` Python binding

Replica::Descriptor is a 3-way std::variant {MemoryDescriptor, DiskDescriptor, LocalDiskDescriptor} with one predicate per variant on the C++ side. The Python wrapper bound is_memory_replica and is_disk_replica, but not is_local_disk_replica.

Mooncake's offload pipeline (NotifyOffloadSuccess) constructs LocalDiskDescriptor exclusively. Without the third predicate, any Python caller iterating replicas returned from the master — e.g. batch_get_replica_desc() used by vLLM's MooncakeStoreConnector tier-classification logging — sees every offloaded replica fail both checks and gets classified as unknown.

Pure additive: 3-line .def(...) addition in mooncake-integration/store/store_py.cpp. No behavior change for existing callers.
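
For illustration, a minimal Python sketch of how a caller might classify replicas with all three predicates. Only the three method names come from the bindings described above; `classify_replica`, `FakeDescriptor`, and `OldFakeDescriptor` are hypothetical stand-ins, and mapping local-disk replicas to the "disk" tier is an assumption (it is consistent with the `disk_keys=N` tier-log output quoted in Validation, but the actual vLLM-side logic is not shown here).

```python
def classify_replica(desc):
    """Return 'memory', 'disk', or 'unknown' for a replica descriptor."""
    if desc.is_memory_replica():
        return "memory"
    if desc.is_disk_replica():
        return "disk"
    # is_local_disk_replica() is the binding added by this PR; builds that
    # predate it do not expose the method, so look it up defensively.
    is_local_disk = getattr(desc, "is_local_disk_replica", None)
    if is_local_disk is not None and is_local_disk():
        return "disk"  # assumption: local-disk replicas count toward the disk tier
    return "unknown"


class OldFakeDescriptor:
    """Hypothetical stand-in for a descriptor from a build without this PR."""

    def __init__(self, kind):
        self.kind = kind  # 'memory', 'disk', or 'local_disk'

    def is_memory_replica(self):
        return self.kind == "memory"

    def is_disk_replica(self):
        return self.kind == "disk"


class FakeDescriptor(OldFakeDescriptor):
    """Hypothetical stand-in for a descriptor from a build with this PR."""

    def is_local_disk_replica(self):
        return self.kind == "local_disk"
```

With the old stand-in, a `local_disk` descriptor fails both existing checks and falls through to `unknown`, which is the misclassification described above; with the new one it is recognized.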

Commit 2 — `start_offload_rpc_server` default in standalone binary

#1857 added the start_offload_rpc_server parameter to setup_internal() and updated the Python-binding caller (setup_real) to pass true. The standalone mooncake_client binary's main() in real_client_main.cpp was missed, so it continued to use the default value (false).

Net effect: when an owner is launched via the binary path, local_rpc_addr (sent to master via NotifyOffloadSuccess.transport_endpoint) falls back to <host>:<FLAGS_port> — which only has the IPC abstract socket @mooncake_client_<port>.sock, no TCP listener. Worker disk-tier reads then see RPC_FAIL (-900) on every attempt.

Fix: add DEFINE_bool(start_offload_rpc_server, true, ...) to real_client_main.cpp and pass it to setup_internal(). Default true matches the Python-binding behavior; the flag is preserved as a seam for future write-only owner / test-isolation setups (per the #1857 design intent).
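
To check for the symptom described above from a worker host, one option is to probe whether anything accepts TCP connections at the endpoint the owner advertised. This is a diagnostic sketch only, not part of Mooncake: `has_tcp_listener` is a hypothetical helper, and the host and port to pass are whatever the owner reported in your deployment.

```python
import socket


def has_tcp_listener(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        # create_connection raises OSError (refused, unreachable, timeout)
        # when nothing is listening on the address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns False for the advertised address while the owner process is running, that matches the case where only the IPC abstract socket exists and no offload RPC server was started.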

Validation

Branch built on top of current main (6caf4128) and tested end-to-end on a single GB200 node (Ubuntu 24.04, glibc 2.39, CUDA 13, vLLM dev674+gb896ec108 + MooncakeStoreConnector).

Workload: Qwen3-8B DP=4, 100 conversations × 3 turns × 32 concurrent, 16K input, --gpu-memory-utilization 0.4 --num-gpu-blocks-override 1024 to force GPU prefix-cache eviction into Mooncake; owner configured with 4 GiB CPU pool + 1 TB SSD to force CPU→disk spillover.

| Signal | Pre-this-PR (just #2004) | Post-this-PR |
|---|---|---|
| writes succeed | yes (`NotifyOffloadSuccess` events fire) | yes |
| `.bucket` files on disk | yes (310 / 77 GB) | yes |
| tier-log classification | every entry `unknown_keys=N` | `disk_keys=N` (correct tier) |
| per-key load result codes | every key `-900 RPC_FAIL` | `success_keys=N`, no `-900` |

Sample tier-log line, pre-this-PR (writes work, reads broken):

batch_keys=165 memory_keys=0 disk_keys=0 unknown_keys=165
success_keys=0 failed_keys=165 bytes_by_tier={'memory': 0, 'disk': 0, 'unknown': 0}

Sample tier-log line, post-this-PR (matches the format the vLLM consumer expects):

batch_keys=101 memory_keys=0 disk_keys=101 unknown_keys=0
success_keys=101 failed_keys=0 bytes_by_tier={'memory': 0, 'disk': 238288896, 'unknown': 0}
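
The two samples above can be checked mechanically. A minimal sketch, assuming each tier-log summary is available as a single string containing the `key=value` counters and the `bytes_by_tier={...}` dict exactly as quoted; `parse_tier_summary` and `looks_healthy` are hypothetical helper names, not part of vLLM or Mooncake.

```python
import ast
import re

_COUNTER = re.compile(r"(\w+)=(\d+)")
_BYTES = re.compile(r"bytes_by_tier=(\{[^}]*\})")


def parse_tier_summary(text):
    """Split a tier-log summary into (integer counters, bytes_by_tier dict)."""
    counters = {key: int(value) for key, value in _COUNTER.findall(text)}
    match = _BYTES.search(text)
    # The dict is printed as a Python literal, so literal_eval can read it.
    bytes_by_tier = ast.literal_eval(match.group(1)) if match else {}
    return counters, bytes_by_tier


def looks_healthy(counters):
    """True when no key failed to load and no replica was left unclassified."""
    return (counters.get("failed_keys", 0) == 0
            and counters.get("unknown_keys", 0) == 0)
```

Applied to the samples, the pre-PR line parses to `unknown_keys=165` and `failed_keys=165` and is flagged as unhealthy, while the post-PR line passes.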

Related

- [Store] Support SSD offload via Python setup() interface #1857: added start_offload_rpc_server to setup_internal() and enabled it on the Python setup_real path; this PR is the missed companion for the standalone-binary path.
- [Store] Fix disk replica read paths for GPU KV cache (LOCAL_DISK zero-copy, DISK temp-buf scatter) #2004: LOCAL_DISK read-path fix in real_client.cpp / replica selection (merged 2026-05-09); this PR is the small follow-up that makes the read path actually reachable when the owner is the standalone binary.
- [Store] Fix end-to-end LOCAL_DISK read-back regression (#1857 follow-up) ivanium/Mooncake#5: the original consolidated patch series these two commits came from.
- mooncake: per-tier load logging + mndp config for disk-offload bench ivanium/vllm#32: the vLLM-side consumer (VLLM_MOONCAKE_STORE_TIER_LOG=1 tier-summary logging) that exposes the success/failure signals quoted above.

Files changed

| File | Change |
|---|---|
| `mooncake-integration/store/store_py.cpp` | Add `.def("is_local_disk_replica", ...)` |
| `mooncake-store/src/real_client_main.cpp` | Add `DEFINE_bool(start_offload_rpc_server, true, ...)`, pass to `setup_internal()` |

Module

Mooncake Store (mooncake-store)

Type of Change

Bug fix

Replica::Descriptor is a 3-way std::variant {MemoryDescriptor, DiskDescriptor, LocalDiskDescriptor} with a corresponding C++ predicate per type. The Python wrapper was missing is_local_disk_replica.

Mooncake's offload pipeline (NotifyOffloadSuccess) constructs LocalDiskDescriptor exclusively. With only is_memory_replica / is_disk_replica exposed to Python, every LOCAL_DISK descriptor returned by the master to a Python caller would test False on both predicates and be misclassified (e.g. as "unknown" in tier diagnostics) regardless of whether the actual load succeeded.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…che-ai#1857 follow-up)

PR kvcache-ai#1857 added the start_offload_rpc_server parameter to setup_internal() and updated only the Python-binding caller (setup_real). The standalone mooncake_client binary in real_client_main.cpp was not updated, so it kept the default value (false) and never started the dedicated TCP RPC server for batch_get_offload_object / release_offload_buffer.

Net effect: when an owner is launched via the binary path (e.g. via start_mooncake_owner.sh), local_rpc_addr falls back to <host>:<FLAGS_port>, which has no TCP listener (only the IPC abstract socket @mooncake_client_<port>.sock lives at that label). Workers attempting disk-tier reads see RPC_FAIL (-900) on every attempt, and the disk tier silently delivers zero hits across the run despite NotifyOffloadSuccess events showing up on the master and bucket files growing on disk.

This commit:

- Adds DEFINE_bool(start_offload_rpc_server, true, ...) so the standalone binary defaults match the Python-binding behavior.
- Passes the flag through to setup_internal so users can disable it via --start_offload_rpc_server=false for write-only owner setups (or keep the matching test seam exposed).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request adds a Python binding for is_local_disk_replica and introduces a start_offload_rpc_server flag to control the TCP RPC server for offload operations. Feedback suggests removing a redundant comment in real_client_main.cpp that duplicates the flag's help text to improve maintainability.

I am having trouble creating individual review comments, so my feedback is listed here.

mooncake-store/src/real_client_main.cpp (113-122)

This comment is very detailed and helpful, but it largely duplicates the help text for the start_offload_rpc_server flag defined on line 22. To improve maintainability and avoid having to update documentation in two places, consider removing this comment. The flag's help text is the more appropriate location for this level of detail.

zhewenl · 2026-05-11T21:45:13Z

cc @LujhCoconut @zhangzuo21

codecov-commenter · 2026-05-11T22:09:09Z

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| mooncake-integration/store/store_py.cpp | 0.00% | 1 Missing ⚠️ |
| mooncake-store/src/real_client_main.cpp | 0.00% | 1 Missing ⚠️ |

LujhCoconut

LGTM.

zhewenl · 2026-05-12T09:31:53Z

Thanks for the review. @stmatengss @ykwd, could you help merge this PR? 🙏

… for amd64)

The previous source-build path worked on both arches but added ~10 min to every Docker build (cmake + dependencies.sh + Go toolchain + ninja). Most of that is now redundant on arm64: kvcache-ai/Mooncake#2083 landed on main (commit da9dfea3), so the upstream tree has everything needed, and a pre-built aarch64 wheel is available on the vllm-wheels S3 bucket.

This change:

- For TARGETPLATFORM=linux/arm64: install from https://vllm-wheels.s3.amazonaws.com/mooncake/mooncake_transfer_engine-0.3.10.post2-cp312-cp312-manylinux_2_39_aarch64.whl (~225 MB, ~30 s download vs ~10 min source build). The manylinux_2_39 tag requires that the FINAL_BASE_IMAGE have glibc >= 2.39 (Ubuntu 24.04+). On the default 22.04 base, override with --build-arg UBUNTU_VERSION=24.04.
- For TARGETPLATFORM=linux/amd64: preserve the existing source build at the new pinned MOONCAKE_REF=da9dfea3 (post-vllm-project#2083 merge). The upstream PyPI x86_64 wheel is unsuitable (built with WITH_NVIDIA_PEERMEM=ON, lacks the dmabuf-only registration path vLLM KV caches need).

Validated end-to-end with the S3 wheel on a GB200 node (Qwen3-8B DP=4, 100 conv x 3 turns x 32 concurrent, 16K input): 94/100 conversations, 234 tier-log lines all with disk_keys>0 success_keys=N failed_keys=0, 57.75 GB read back from SSD, 0 RPC_FAIL / INVALID_PARAMS. Matches the expected post-vllm-project#2083 disk-tier readback signal.

…og + perf fixes

Squashed forward-port of feat/mooncake-store-int-owner-client + PR-32 (tier-log) + PR-31 (segment readiness probe) + PR-36 (enable_offload gate + count-trigger removal) + PR-37 (preferred_segment env) + bind_gpu_block_pool MultiConnector fix, on top of the rebased feat/mooncake-store-connector.

What lands
==========

Owner-client topology
- scripts/mooncake/: master + owner launcher + RDMA auto-detection helpers
- mndp.yaml: Kimi-K2.5-NVFP4 / Qwen3-8B DP=4 vigil recipe
- vllm/distributed/.../mooncake/rdma_utils.py: RNIC + GID detection
- requester-only setup refactor: vLLM ranks pass global_segment_size=0; the separate mooncake_client owner contributes the CPU pool + SSD tier

Disk offload (re-introduced after intentional revert on PR-40900)
- batch budget tracking with backpressure
- batch-splitting for disk-tier loads
- LookupKeyServer wiring (restored after cherry-pick drop)
- store_py.cpp + standalone-binary fixes are companion in kvcache-ai#2083

PR-32 observability
- VLLM_MOONCAKE_STORE_TIER_LOG=1: per-batch "Mooncake load tier summary" lines showing memory_keys/disk_keys/unknown_keys/success_keys/failed_keys and bytes_by_tier breakdown

PR-36 perf
- _get_disk_offload_buffer_budget_bytes(enable_offload) returns None when off
- enable_offload field on MooncakeStoreConfig (read from JSON or MOONCAKE_ENABLE_OFFLOAD env)
- dropped redundant count-based split trigger that was firing on every batch with ≥2 keys, doubling owner GET-RPCs

PR-37 preferred_segment env
- MOONCAKE_PREFERRED_SEGMENT env var falls through to rdma_utils.py
- run_vllm_with_mooncake_owner.sh exports it for the spawned vLLM

MultiConnector bind_gpu_block_pool proxy
- allow simple_cpu_backend-style child connectors to bind to the GPU block pool when wrapped by MultiConnector (PD-disaggregated setups)

Validation
==========

End-to-end mndp run (Qwen3-8B DP=4, 100 conv × 3 turns × 32 concurrent, 16K input, --gpu-memory-utilization 0.4 --num-gpu-blocks-override 1024, 4 GiB owner CPU pool to force SSD spillover):
- 99/100 conversations completed
- 219 tier-log lines, 100% with disk_keys>0, 0 failed_keys
- 59.5 GB read back from SSD
- After PR-36 layered on: Mean TTFT -61%, Mean E2EL -58%, tier-log batches halved (219 → 115) confirming count-trigger removal

Layout
======

The owner-client commits assumed flat module layout (mooncake_store_*.py) while feat/mooncake-store-connector kept store/ subdir. Reconciled to flat layout to match upstream PR-40900 author's pre-revert state.

Related
=======

- vllm-project#40900: parent PR (basic connector); this stacks on it
- ivanium#31: segment-port readiness probe (folded in)
- ivanium#32: VLLM_MOONCAKE_STORE_TIER_LOG (folded in)
- ivanium#36: enable_offload gate + count-trigger removal (folded in)
- ivanium#37: MOONCAKE_PREFERRED_SEGMENT env (folded in)
- kvcache-ai/Mooncake#2083: Mooncake-side standalone-binary disk-read fix

github-actions Bot added run-ci Store labels May 11, 2026

zhewenl force-pushed the zhewen/disk-read-local-fixes branch from 04235fb to 3e3181c Compare May 11, 2026 21:42

gemini-code-assist Bot reviewed May 11, 2026

zhewenl marked this pull request as ready for review May 11, 2026 21:44

zhewenl requested review from XucSh, YiXR, stmatengss and ykwd as code owners May 11, 2026 21:44

zhewenl mentioned this pull request May 12, 2026

[Mooncake] Forward-port owner-client topology + disk-offload + tier-log + perf fixes (squashed, rebased) ivanium/vllm#47

Closed

LujhCoconut approved these changes May 12, 2026

stmatengss merged commit da9dfea into kvcache-ai:main May 12, 2026
19 checks passed

zhewenl mentioned this pull request May 13, 2026

disk offloading ivanium/vllm#48

Open

4 tasks

stmatengss merged 2 commits into kvcache-ai:main from zhewenl:zhewen/disk-read-local-fixes
