[Store] Expose is_local_disk_replica() to Python + enable offload RPC in standalone mooncake_client #2083
Replica::Descriptor is a 3-way std::variant {MemoryDescriptor,
DiskDescriptor, LocalDiskDescriptor} with a corresponding C++ predicate
per type. The Python wrapper was missing is_local_disk_replica.
Mooncake's offload pipeline (NotifyOffloadSuccess) constructs
LocalDiskDescriptor exclusively. With only is_memory_replica /
is_disk_replica exposed to Python, every LOCAL_DISK descriptor returned
by the master to a Python caller would test False on both predicates
and be misclassified (e.g. as "unknown" in tier diagnostics) regardless
of whether the actual load succeeded.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
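The misclassification described above can be illustrated with a minimal Python sketch. The three predicate names come from the commit message; `FakeDescriptor`, `classify_tier`, and `summarize` are hypothetical stand-ins written for this example and are not part of Mooncake's Python API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FakeDescriptor:
    """Hypothetical stand-in for a replica descriptor returned to Python."""

    kind: str  # "memory", "disk", or "local_disk"

    def is_memory_replica(self) -> bool:
        return self.kind == "memory"

    def is_disk_replica(self) -> bool:
        return self.kind == "disk"

    def is_local_disk_replica(self) -> bool:
        return self.kind == "local_disk"


def classify_tier(desc) -> str:
    """Map a descriptor to a tier label using the three predicates."""
    if desc.is_memory_replica():
        return "memory"
    if desc.is_disk_replica():
        return "disk"
    # Builds without the new binding have no such attribute, so look it
    # up defensively instead of calling it directly.
    is_local_disk = getattr(desc, "is_local_disk_replica", None)
    if is_local_disk is not None and is_local_disk():
        return "local_disk"
    return "unknown"


def summarize(descs) -> Counter:
    """Count descriptors per tier label."""
    return Counter(classify_tier(d) for d in descs)
```

With only the first two predicates available, a LOCAL_DISK descriptor falls through to `"unknown"`, which is the symptom the commit describes.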
…che-ai#1857 follow-up)

PR kvcache-ai#1857 added the start_offload_rpc_server parameter to setup_internal() and updated only the Python-binding caller (setup_real). The standalone mooncake_client binary in real_client_main.cpp was not updated, so it kept the default value (false) and never started the dedicated TCP RPC server for batch_get_offload_object / release_offload_buffer.

Net effect: when an owner is launched via the binary path (e.g. via start_mooncake_owner.sh), local_rpc_addr falls back to <host>:<FLAGS_port> which has no TCP listener (only the IPC abstract socket @mooncake_client_<port>.sock lives at that label). Workers attempting disk-tier reads see RPC_FAIL (-900) on every attempt, and the disk tier silently delivers zero hits across the run despite NotifyOffloadSuccess events showing up on the master and bucket files growing on disk.

This commit:
- Adds DEFINE_bool(start_offload_rpc_server, true, ...) so the standalone binary defaults match the Python-binding behavior.
- Passes the flag through to setup_internal so users can disable it via --start_offload_rpc_server=false for write-only owner setups (or keep the matching test seam exposed).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
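One way to confirm the "no TCP listener" symptom from the outside is a plain connection probe. The sketch below uses only the Python standard library; `has_tcp_listener` is a hypothetical helper name, not a Mooncake function, and the host and port to probe are whatever local_rpc_addr resolves to in a given deployment.

```python
import socket


def has_tcp_listener(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A False result for the owner's address is consistent with the offload RPC server not having been started, though it can also indicate a firewall or a wrong address.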
04235fb to 3e3181c
Code Review
This pull request adds a Python binding for is_local_disk_replica and introduces a start_offload_rpc_server flag to control the TCP RPC server for offload operations. Feedback suggests removing a redundant comment in real_client_main.cpp that duplicates the flag's help text to improve maintainability.
mooncake-store/src/real_client_main.cpp (113-122)
This comment is very detailed and helpful, but it largely duplicates the help text for the start_offload_rpc_server flag defined on line 22. To improve maintainability and avoid having to update documentation in two places, consider removing this comment. The flag's help text is the more appropriate location for this level of detail.
Thanks for the review! @stmatengss @ykwd, could you help with merging this PR? 🙏
… for amd64)

The previous source-build path worked on both arches but added ~10 min to every Docker build (cmake + dependencies.sh + Go toolchain + ninja). Most of that is now redundant on arm64: kvcache-ai/Mooncake#2083 landed on main (commit da9dfea3) so the upstream tree has everything needed, and a pre-built aarch64 wheel is available on the vllm-wheels S3 bucket.

This change:
- For TARGETPLATFORM=linux/arm64: install from https://vllm-wheels.s3.amazonaws.com/mooncake/mooncake_transfer_engine-0.3.10.post2-cp312-cp312-manylinux_2_39_aarch64.whl (~225 MB, ~30 s download vs ~10 min source build). The manylinux_2_39 tag requires the FINAL_BASE_IMAGE have glibc >= 2.39 (Ubuntu 24.04+). On the default 22.04 base, override with --build-arg UBUNTU_VERSION=24.04.
- For TARGETPLATFORM=linux/amd64: preserve the existing source build at the new pinned MOONCAKE_REF=da9dfea3 (post-vllm-project#2083 merge). The upstream PyPI x86_64 wheel is unsuitable (built with WITH_NVIDIA_PEERMEM=ON, lacks the dmabuf-only registration path vLLM KV caches need).

Validated end-to-end with the S3 wheel on a GB200 node (Qwen3-8B DP=4, 100 conv x 3 turns x 32 concurrent, 16K input): 94/100 conversations, 234 tier-log lines all with disk_keys>0 success_keys=N failed_keys=0, 57.75 GB read back from SSD, 0 RPC_FAIL / INVALID_PARAMS. Matches the expected post-vllm-project#2083 disk-tier readback signal.
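The per-platform choice and the glibc constraint above can be sketched as a small decision function. This is an illustration only: `parse_version`, `glibc_at_least`, and `install_plan` are hypothetical names, and the real logic lives in the Dockerfile, which is not shown here.

```python
def parse_version(text: str) -> tuple[int, int]:
    """Parse 'major.minor[.patch]' into a (major, minor) integer pair."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def glibc_at_least(required: str, found: str) -> bool:
    """Compare versions numerically so that 2.40 counts as newer than 2.39."""
    return parse_version(found) >= parse_version(required)


def install_plan(target_platform: str, glibc_version: str) -> str:
    """Pick the install path described in the commit message."""
    if target_platform == "linux/arm64":
        # The manylinux_2_39 wheel needs glibc >= 2.39 in the base image.
        if not glibc_at_least("2.39", glibc_version):
            return "error: glibc too old for the manylinux_2_39 wheel"
        return "wheel"
    if target_platform == "linux/amd64":
        # The commit keeps the source build on amd64.
        return "source"
    return "error: unsupported platform"
```

On a glibc system, `platform.libc_ver()` from the standard library returns a (name, version) pair whose version string can be passed to `glibc_at_least`. Ubuntu 22.04 ships glibc 2.35, which is why the commit asks for the 24.04 override on arm64.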
…og + perf fixes

Squashed forward-port of feat/mooncake-store-int-owner-client + PR-32 (tier-log) + PR-31 (segment readiness probe) + PR-36 (enable_offload gate + count-trigger removal) + PR-37 (preferred_segment env) + bind_gpu_block_pool MultiConnector fix, on top of the rebased feat/mooncake-store-connector.

What lands
==========

Owner-client topology
- scripts/mooncake/: master + owner launcher + RDMA auto-detection helpers
- mndp.yaml: Kimi-K2.5-NVFP4 / Qwen3-8B DP=4 vigil recipe
- vllm/distributed/.../mooncake/rdma_utils.py: RNIC + GID detection
- requester-only setup refactor: vLLM ranks pass global_segment_size=0; the separate mooncake_client owner contributes the CPU pool + SSD tier

Disk offload (re-introduced after intentional revert on PR-40900)
- batch budget tracking with backpressure
- batch-splitting for disk-tier loads
- LookupKeyServer wiring (restored after cherry-pick drop)
- store_py.cpp + standalone-binary fixes are companion in kvcache-ai#2083

PR-32 observability
- VLLM_MOONCAKE_STORE_TIER_LOG=1: per-batch "Mooncake load tier summary" lines showing memory_keys/disk_keys/unknown_keys/success_keys/failed_keys and bytes_by_tier breakdown

PR-36 perf
- _get_disk_offload_buffer_budget_bytes(enable_offload) returns None when off
- enable_offload field on MooncakeStoreConfig (read from JSON or MOONCAKE_ENABLE_OFFLOAD env)
- dropped redundant count-based split trigger that was firing on every batch with ≥2 keys, doubling owner GET-RPCs

PR-37 preferred_segment env
- MOONCAKE_PREFERRED_SEGMENT env var falls through to rdma_utils.py
- run_vllm_with_mooncake_owner.sh exports it for the spawned vLLM

MultiConnector bind_gpu_block_pool proxy
- allow simple_cpu_backend-style child connectors to bind to the GPU block pool when wrapped by MultiConnector (PD-disaggregated setups)

Validation
==========

End-to-end mndp run (Qwen3-8B DP=4, 100 conv × 3 turns × 32 concurrent, 16K input, --gpu-memory-utilization 0.4 --num-gpu-blocks-override 1024, 4 GiB owner CPU pool to force SSD spillover):
- 99/100 conversations completed
- 219 tier-log lines, 100% with disk_keys>0, 0 failed_keys
- 59.5 GB read back from SSD
- After PR-36 layered on: Mean TTFT -61%, Mean E2EL -58%, tier-log batches halved (219 → 115) confirming count-trigger removal

Layout
======

The owner-client commits assumed flat module layout (mooncake_store_*.py) while feat/mooncake-store-connector kept store/ subdir. Reconciled to flat layout to match upstream PR-40900 author's pre-revert state.

Related
=======
- vllm-project#40900: parent PR (basic connector); this stacks on it
- ivanium#31: segment-port readiness probe (folded in)
- ivanium#32: VLLM_MOONCAKE_STORE_TIER_LOG (folded in)
- ivanium#36: enable_offload gate + count-trigger removal (folded in)
- ivanium#37: MOONCAKE_PREFERRED_SEGMENT env (folded in)
- kvcache-ai/Mooncake#2083: Mooncake-side standalone-binary disk-read fix
Summary
Two small follow-ups to #2004 (LOCAL_DISK read-path fix) and to #1857 (which introduced the start_offload_rpc_server switch) that complete end-to-end disk-tier readback when the Mooncake owner is launched via the standalone mooncake_client binary path (e.g. through start_mooncake_owner.sh in vLLM integrations).
Without these two, even with #2004 + #1857 already merged:
- .bucket files land on SSD, but
- worker disk-tier reads fail with -900 RPC_FAIL, and
- Python callers classify the offloaded replicas as unknown (neither is_memory_replica() nor is_disk_replica() matches a LocalDiskDescriptor)
These two commits were originally part of ivanium/Mooncake#5 (the consolidated end-to-end disk-read fix for GB200 + Kimi-K2.5-NVFP4 testing). #2004 picked up the in-process read-path fixes (Bug B / D / E + replica selection). This PR upstreams the two remaining pieces.
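The unknown misclassification described above can be illustrated with a minimal Python sketch. It does not use the real Mooncake bindings: OldBindingReplica, NewBindingReplica, and classify_tier are hypothetical stand-ins that only mimic the three predicate names from this PR, and mapping a local-disk replica to the "disk" label is an assumption based on the disk_keys counter in the tier log.

```python
class OldBindingReplica:
    """Hypothetical stand-in for a descriptor exposed by the pre-PR binding."""

    def __init__(self, kind: str) -> None:
        self.kind = kind  # "memory", "disk", or "local_disk"

    def is_memory_replica(self) -> bool:
        return self.kind == "memory"

    def is_disk_replica(self) -> bool:
        return self.kind == "disk"


class NewBindingReplica(OldBindingReplica):
    """Same stand-in, plus the predicate this PR exposes."""

    def is_local_disk_replica(self) -> bool:
        return self.kind == "local_disk"


def classify_tier(replica) -> str:
    """Map a replica descriptor to a tier label for diagnostics (assumed labels)."""
    if replica.is_memory_replica():
        return "memory"
    if replica.is_disk_replica():
        return "disk"
    # Older bindings lack this predicate, so look it up defensively.
    is_local_disk = getattr(replica, "is_local_disk_replica", None)
    if is_local_disk is not None and is_local_disk():
        return "disk"
    return "unknown"


if __name__ == "__main__":
    print(classify_tier(OldBindingReplica("local_disk")))  # unknown
    print(classify_tier(NewBindingReplica("local_disk")))  # disk
```

With only two predicates available, a LOCAL_DISK descriptor falls through both checks regardless of whether the load itself succeeded; the getattr lookup is one way a caller could stay compatible with builds that predate the new binding.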
Bugs fixed
Commit 1: is_local_disk_replica() Python binding
Replica::Descriptor is a 3-way std::variant {MemoryDescriptor, DiskDescriptor, LocalDiskDescriptor} with one predicate per variant on the C++ side. The Python wrapper bound is_memory_replica and is_disk_replica, but not is_local_disk_replica.
Mooncake's offload pipeline (NotifyOffloadSuccess) constructs LocalDiskDescriptor exclusively. Without the third predicate, any Python caller iterating replicas returned from the master (e.g. batch_get_replica_desc() used by vLLM's MooncakeStoreConnector tier-classification logging) sees every offloaded replica fail both checks and classifies it as unknown.
Purely additive: a 3-line .def(...) addition in mooncake-integration/store/store_py.cpp. No behavior change for existing callers.
Commit 2: start_offload_rpc_server default in standalone binary
#1857 added the start_offload_rpc_server parameter to setup_internal() and updated the Python-binding caller (setup_real) to pass true. The standalone mooncake_client binary's main() in real_client_main.cpp was missed, so it continued to use the default value (false).
Net effect: when an owner is launched via the binary path, local_rpc_addr (sent to master via NotifyOffloadSuccess.transport_endpoint) falls back to <host>:<FLAGS_port>, which only has the IPC abstract socket @mooncake_client_<port>.sock and no TCP listener. Worker disk-tier reads then see RPC_FAIL (-900) on every attempt.
Fix: add DEFINE_bool(start_offload_rpc_server, true, ...) to real_client_main.cpp and pass it to setup_internal(). The default of true matches the Python-binding behavior; the flag is preserved as a seam for future write-only owner / test-isolation setups (per the #1857 design intent).
Validation
Branch built on top of current main (6caf4128) and tested end-to-end on a single GB200 node (Ubuntu 24.04, glibc 2.39, CUDA 13, vLLM dev674+gb896ec108 + MooncakeStoreConnector).
Workload: Qwen3-8B DP=4, 100 conversations × 3 turns × 32 concurrent, 16K input, --gpu-memory-utilization 0.4 --num-gpu-blocks-override 1024 to force GPU prefix-cache eviction into Mooncake; owner configured with 4 GiB CPU pool + 1 TB SSD to force CPU→disk spillover.
- NotifyOffloadSuccess events fire and .bucket files appear on disk both before and after this PR (writes already worked).
- Tier classification: unknown_keys=N before, disk_keys=N (correct tier) after.
- Disk-tier reads: -900 RPC_FAIL before, success_keys=N with no -900 after.
Sample tier-log line, pre-this-PR (writes work, reads broken):
Sample tier-log line, post-this-PR (matches the format the vLLM consumer expects):
Related
- #1857: added start_offload_rpc_server to setup_internal() and enabled it on the Python setup_real path; this PR is the missed companion for the standalone-binary path.
- #2004: LOCAL_DISK read-path fixes in real_client.cpp / replica selection (merged 2026-05-09); this PR is the small follow-up that makes the read path actually reachable when the owner is the standalone binary.
- vLLM-side consumer (VLLM_MOONCAKE_STORE_TIER_LOG=1 tier-summary logging) that exposes the success/failure signals quoted above.
Files changed
- mooncake-integration/store/store_py.cpp: .def("is_local_disk_replica", ...)
- mooncake-store/src/real_client_main.cpp: DEFINE_bool(start_offload_rpc_server, true, ...), pass to setup_internal()
Module
mooncake-store
Type of Change