[Store] Expose is_local_disk_replica() to Python + enable offload RPC in standalone mooncake_client #2083

Merged: stmatengss merged 2 commits into kvcache-ai:main from zhewenl:zhewen/disk-read-local-fixes on May 12, 2026

@zhewenl zhewenl commented May 11, 2026

Contributor

Summary

Two small follow-ups to #2004 (LOCAL_DISK read-path fix) and to #1857 (which introduced the start_offload_rpc_server switch) that complete end-to-end disk-tier readback when the Mooncake owner is launched via the standalone mooncake_client binary path (e.g. through start_mooncake_owner.sh in vLLM integrations).

Without these two, even with #2004 + #1857 already merged:

  • writes succeed and .bucket files land on SSD, but
  • every disk-tier read returns RPC_FAIL (-900), and
  • replica descriptors round-trip to Python as unknown (neither is_memory_replica() nor is_disk_replica() matches a LocalDiskDescriptor)

These two commits were originally part of ivanium/Mooncake#5 (the consolidated end-to-end disk-read fix for GB200 + Kimi-K2.5-NVFP4 testing). #2004 picked up the in-process read-path fixes (Bug B / D / E + replica selection). This PR upstreams the two remaining pieces.

Bugs fixed

Commit 1 — is_local_disk_replica() Python binding

Replica::Descriptor is a 3-way std::variant {MemoryDescriptor, DiskDescriptor, LocalDiskDescriptor} with one predicate per variant on the C++ side. The Python wrapper bound is_memory_replica and is_disk_replica, but not is_local_disk_replica.

Mooncake's offload pipeline (NotifyOffloadSuccess) constructs LocalDiskDescriptor exclusively. Without the third predicate, any Python caller iterating replicas returned from the master — e.g. batch_get_replica_desc() used by vLLM's MooncakeStoreConnector tier-classification logging — sees every offloaded replica fail both checks and gets classified as unknown.

Pure additive: 3-line .def(...) addition in mooncake-integration/store/store_py.cpp. No behavior change for existing callers.
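
To illustrate the classification problem, here is a minimal sketch of how a Python caller might bucket replica descriptors by tier once all three predicates are available. `FakeDescriptor`, `classify_replica`, and `tier_counts` are hypothetical names written for this example; only the three predicate names come from the binding described above. Counting a local-disk replica under the "disk" tier is an assumption about how a consumer might report it.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FakeDescriptor:
    """Hypothetical stand-in for a replica descriptor returned to Python."""

    kind: str  # "memory", "disk", or "local_disk"

    def is_memory_replica(self) -> bool:
        return self.kind == "memory"

    def is_disk_replica(self) -> bool:
        return self.kind == "disk"

    def is_local_disk_replica(self) -> bool:
        return self.kind == "local_disk"


def classify_replica(desc) -> str:
    """Map a descriptor to a tier label using the three predicates."""
    if desc.is_memory_replica():
        return "memory"
    if desc.is_disk_replica():
        return "disk"
    # Guard with getattr so the sketch also runs against a binding that
    # does not yet expose the third predicate (assumed fallback behavior).
    is_local = getattr(desc, "is_local_disk_replica", None)
    if is_local is not None and is_local():
        return "disk"
    return "unknown"


def tier_counts(descs) -> Counter:
    """Count descriptors per tier label."""
    return Counter(classify_replica(d) for d in descs)
```

Without the third predicate, the `local_disk` descriptors in this sketch would fall through to "unknown", which is the symptom described above.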

Commit 2 — start_offload_rpc_server default in standalone binary

#1857 added the start_offload_rpc_server parameter to setup_internal() and updated the Python-binding caller (setup_real) to pass true. The standalone mooncake_client binary's main() in real_client_main.cpp was missed, so it continued to use the default value (false).

Net effect: when an owner is launched via the binary path, local_rpc_addr (sent to master via NotifyOffloadSuccess.transport_endpoint) falls back to <host>:<FLAGS_port> — which only has the IPC abstract socket @mooncake_client_<port>.sock, no TCP listener. Worker disk-tier reads then see RPC_FAIL (-900) on every attempt.

Fix: add DEFINE_bool(start_offload_rpc_server, true, ...) to real_client_main.cpp and pass it to setup_internal(). Default true matches the Python-binding behavior; the flag is preserved as a seam for future write-only owner / test-isolation setups (per the #1857 design intent).
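
One way to confirm the symptom from a worker host is to probe whether anything accepts TCP connections at the endpoint the owner advertised. The helper below is a generic diagnostic sketch, not part of Mooncake; `has_tcp_listener` is a hypothetical name, and the host and port to pass in are whatever the owner reported as its RPC address.

```python
import socket


def has_tcp_listener(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

If this returns False for the advertised address while the owner process is running, that matches the case described above where only the IPC socket exists and no TCP listener is bound.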

Validation

Branch built on top of current main (6caf4128) and tested end-to-end on a single GB200 node (Ubuntu 24.04, glibc 2.39, CUDA 13, vLLM dev674+gb896ec108 + MooncakeStoreConnector).

Workload: Qwen3-8B DP=4, 100 conversations × 3 turns × 32 concurrent, 16K input, --gpu-memory-utilization 0.4 --num-gpu-blocks-override 1024 to force GPU prefix-cache eviction into Mooncake; owner configured with 4 GiB CPU pool + 1 TB SSD to force CPU→disk spillover.

| Signal | Pre-this-PR (just #2004) | Post-this-PR |
| --- | --- | --- |
| writes succeed | yes (NotifyOffloadSuccess events fire) | yes |
| .bucket files on disk | yes (310 / 77 GB) | yes |
| tier-log classification | every entry unknown_keys=N | disk_keys=N (correct tier) |
| per-key load result codes | every key -900 RPC_FAIL | success_keys=N, no -900 |

Sample tier-log line, pre-this-PR (writes work, reads broken):

batch_keys=165 memory_keys=0 disk_keys=0 unknown_keys=165
success_keys=0 failed_keys=165 bytes_by_tier={'memory': 0, 'disk': 0, 'unknown': 0}

Sample tier-log line, post-this-PR (matches the format the vLLM consumer expects):

batch_keys=101 memory_keys=0 disk_keys=101 unknown_keys=0
success_keys=101 failed_keys=0 bytes_by_tier={'memory': 0, 'disk': 238288896, 'unknown': 0}
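
For checking many such lines at once, a small parser can be handy. The sketch below assumes the format shown in the two samples above (`name=integer` fields plus a `bytes_by_tier={...}` dict literal); `parse_tier_log` and `is_healthy` are hypothetical helpers written for this example, not part of vLLM or Mooncake.

```python
import ast
import re

# Integer fields such as batch_keys=101; bytes_by_tier is skipped here
# because "=" is followed by "{" rather than a digit.
_COUNT_RE = re.compile(r"(\w+)=(\d+)")
_BYTES_RE = re.compile(r"bytes_by_tier=(\{[^}]*\})")


def parse_tier_log(text: str) -> dict:
    """Parse a tier-log entry (one or two lines) into a dict."""
    fields = {key: int(value) for key, value in _COUNT_RE.findall(text)}
    match = _BYTES_RE.search(text)
    if match:
        # The dict is printed with Python repr syntax, so literal_eval reads it.
        fields["bytes_by_tier"] = ast.literal_eval(match.group(1))
    return fields


def is_healthy(fields: dict) -> bool:
    """True when no key was unclassified and no load failed."""
    return fields.get("unknown_keys", 0) == 0 and fields.get("failed_keys", 0) == 0
```

Applied to the two samples, `is_healthy` would be False for the pre-this-PR entry and True for the post-this-PR entry.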

Related

Files changed

| File | Change |
| --- | --- |
| mooncake-integration/store/store_py.cpp | Add .def("is_local_disk_replica", ...) |
| mooncake-store/src/real_client_main.cpp | Add DEFINE_bool(start_offload_rpc_server, true, ...), pass to setup_internal() |

Module

  • Mooncake Store (mooncake-store)

Type of Change

  • Bug fix

Replica::Descriptor is a 3-way std::variant {MemoryDescriptor,
DiskDescriptor, LocalDiskDescriptor} with a corresponding C++ predicate
per type. The Python wrapper was missing is_local_disk_replica.

Mooncake's offload pipeline (NotifyOffloadSuccess) constructs
LocalDiskDescriptor exclusively. With only is_memory_replica /
is_disk_replica exposed to Python, every LOCAL_DISK descriptor returned
by the master to a Python caller would test False on both predicates
and be misclassified (e.g. as "unknown" in tier diagnostics) regardless
of whether the actual load succeeded.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…che-ai#1857 follow-up)

PR kvcache-ai#1857 added the start_offload_rpc_server parameter to setup_internal()
and updated only the Python-binding caller (setup_real). The standalone
mooncake_client binary in real_client_main.cpp was not updated, so it
kept the default value (false) and never started the dedicated TCP RPC
server for batch_get_offload_object / release_offload_buffer.

Net effect: when an owner is launched via the binary path (e.g. via
start_mooncake_owner.sh), local_rpc_addr falls back to <host>:<FLAGS_port>
which has no TCP listener (only the IPC abstract socket
@mooncake_client_<port>.sock lives at that label). Workers attempting
disk-tier reads see RPC_FAIL (-900) on every attempt, and the disk tier
silently delivers zero hits across the run despite NotifyOffloadSuccess
events showing up on the master and bucket files growing on disk.

This commit:
  - Adds DEFINE_bool(start_offload_rpc_server, true, ...) so the standalone
    binary defaults match the Python-binding behavior.
  - Passes the flag through to setup_internal so users can disable it via
    --start_offload_rpc_server=false for write-only owner setups (or
    keep the matching test seam exposed).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zhewenl zhewenl force-pushed the zhewen/disk-read-local-fixes branch from 04235fb to 3e3181c Compare May 11, 2026 21:42

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request adds a Python binding for is_local_disk_replica and introduces a start_offload_rpc_server flag to control the TCP RPC server for offload operations. Feedback suggests removing a redundant comment in real_client_main.cpp that duplicates the flag's help text to improve maintainability.

I am having trouble creating individual review comments. Click here to see my feedback.

mooncake-store/src/real_client_main.cpp (113-122)

medium

This comment is very detailed and helpful, but it largely duplicates the help text for the start_offload_rpc_server flag defined on line 22. To improve maintainability and avoid having to update documentation in two places, consider removing this comment. The flag's help text is the more appropriate location for this level of detail.

@zhewenl zhewenl marked this pull request as ready for review May 11, 2026 21:44

zhewenl commented May 11, 2026

Contributor Author

cc @LujhCoconut @zhangzuo21

@codecov-commenter

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| mooncake-integration/store/store_py.cpp | 0.00% | 1 Missing ⚠️ |
| mooncake-store/src/real_client_main.cpp | 0.00% | 1 Missing ⚠️ |

@LujhCoconut LujhCoconut left a comment

Collaborator

LGTM.


zhewenl commented May 12, 2026

Contributor Author

Thanks for the review. @stmatengss @ykwd, could you help with merging this PR? 🙏

@stmatengss stmatengss merged commit da9dfea into kvcache-ai:main May 12, 2026
19 checks passed
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request May 12, 2026
… for amd64)

The previous source-build path worked on both arches but added ~10 min to
every Docker build (cmake + dependencies.sh + Go toolchain + ninja). Most
of that is now redundant on arm64 — kvcache-ai/Mooncake#2083 landed
on main (commit da9dfea3) so the upstream tree has everything needed,
and a pre-built aarch64 wheel is available on the vllm-wheels S3 bucket.

This change:
- For TARGETPLATFORM=linux/arm64: install from
    https://vllm-wheels.s3.amazonaws.com/mooncake/mooncake_transfer_engine-0.3.10.post2-cp312-cp312-manylinux_2_39_aarch64.whl
  (~225 MB, ~30 s download vs ~10 min source build). The manylinux_2_39
  tag requires the FINAL_BASE_IMAGE have glibc >= 2.39 — Ubuntu 24.04+.
  On the default 22.04 base, override with --build-arg UBUNTU_VERSION=24.04.
- For TARGETPLATFORM=linux/amd64: preserve the existing source build at
  the new pinned MOONCAKE_REF=da9dfea3 (post-kvcache-ai/Mooncake#2083 merge). The
  upstream PyPI x86_64 wheel is unsuitable (built with WITH_NVIDIA_PEERMEM=ON,
  lacks the dmabuf-only registration path vLLM KV caches need).

Validated end-to-end with the S3 wheel on a GB200 node (Qwen3-8B DP=4,
100 conv x 3 turns x 32 concurrent, 16K input): 94/100 conversations,
234 tier-log lines all with disk_keys>0 success_keys=N failed_keys=0,
57.75 GB read back from SSD, 0 RPC_FAIL / INVALID_PARAMS. Matches the
expected post-kvcache-ai/Mooncake#2083 disk-tier readback signal.
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request May 12, 2026
…og + perf fixes

Squashed forward-port of feat/mooncake-store-int-owner-client + PR-32 (tier-log)
+ PR-31 (segment readiness probe) + PR-36 (enable_offload gate + count-trigger
removal) + PR-37 (preferred_segment env) + bind_gpu_block_pool MultiConnector
fix, on top of the rebased feat/mooncake-store-connector.

What lands
==========

Owner-client topology
- scripts/mooncake/: master + owner launcher + RDMA auto-detection helpers
- mndp.yaml: Kimi-K2.5-NVFP4 / Qwen3-8B DP=4 vigil recipe
- vllm/distributed/.../mooncake/rdma_utils.py: RNIC + GID detection
- requester-only setup refactor: vLLM ranks pass global_segment_size=0; the
  separate mooncake_client owner contributes the CPU pool + SSD tier

Disk offload (re-introduced after intentional revert on PR-40900)
- batch budget tracking with backpressure
- batch-splitting for disk-tier loads
- LookupKeyServer wiring (restored after cherry-pick drop)
- store_py.cpp + standalone-binary fixes are companion in kvcache-ai#2083

PR-32 observability
- VLLM_MOONCAKE_STORE_TIER_LOG=1: per-batch "Mooncake load tier summary" lines
  showing memory_keys/disk_keys/unknown_keys/success_keys/failed_keys and
  bytes_by_tier breakdown

PR-36 perf
- _get_disk_offload_buffer_budget_bytes(enable_offload) returns None when off
- enable_offload field on MooncakeStoreConfig (read from JSON or
  MOONCAKE_ENABLE_OFFLOAD env)
- dropped redundant count-based split trigger that was firing on every batch
  with ≥2 keys, doubling owner GET-RPCs

PR-37 preferred_segment env
- MOONCAKE_PREFERRED_SEGMENT env var falls through to rdma_utils.py
- run_vllm_with_mooncake_owner.sh exports it for the spawned vLLM

MultiConnector bind_gpu_block_pool proxy
- allow simple_cpu_backend-style child connectors to bind to the GPU block
  pool when wrapped by MultiConnector (PD-disaggregated setups)

Validation
==========
End-to-end mndp run (Qwen3-8B DP=4, 100 conv × 3 turns × 32 concurrent,
16K input, --gpu-memory-utilization 0.4 --num-gpu-blocks-override 1024,
4 GiB owner CPU pool to force SSD spillover):

- 99/100 conversations completed
- 219 tier-log lines, 100% with disk_keys>0, 0 failed_keys
- 59.5 GB read back from SSD
- After PR-36 layered on: Mean TTFT -61%, Mean E2EL -58%, tier-log batches
  halved (219 → 115) confirming count-trigger removal

Layout
======
The owner-client commits assumed flat module layout (mooncake_store_*.py)
while feat/mooncake-store-connector kept store/ subdir. Reconciled to flat
layout to match upstream PR-40900 author's pre-revert state.

Related
=======
- vllm-project#40900 — parent PR (basic connector); this stacks on it
- ivanium#31 — segment-port readiness probe (folded in)
- ivanium#32 — VLLM_MOONCAKE_STORE_TIER_LOG (folded in)
- ivanium#36 — enable_offload gate + count-trigger removal (folded in)
- ivanium#37 — MOONCAKE_PREFERRED_SEGMENT env (folded in)
- kvcache-ai/Mooncake#2083 — Mooncake-side standalone-binary disk-read fix
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request May 13, 2026
…og + perf fixes

zhewenl pushed a commit to zhewenl/vllm that referenced this pull request May 13, 2026
… for amd64)

zhewenl pushed a commit to zhewenl/vllm that referenced this pull request May 13, 2026
…og + perf fixes

Squashed forward-port of feat/mooncake-store-int-owner-client + PR-32 (tier-log)
+ PR-31 (segment readiness probe) + PR-36 (enable_offload gate + count-trigger
removal) + PR-37 (preferred_segment env) + bind_gpu_block_pool MultiConnector
fix, on top of the rebased feat/mooncake-store-connector.

What lands
==========

Owner-client topology
- scripts/mooncake/: master + owner launcher + RDMA auto-detection helpers
- mndp.yaml: Kimi-K2.5-NVFP4 / Qwen3-8B DP=4 vigil recipe
- vllm/distributed/.../mooncake/rdma_utils.py: RNIC + GID detection
- requester-only setup refactor: vLLM ranks pass global_segment_size=0; the
  separate mooncake_client owner contributes the CPU pool + SSD tier

Disk offload (re-introduced after intentional revert on PR-40900)
- batch budget tracking with backpressure
- batch-splitting for disk-tier loads
- LookupKeyServer wiring (restored after cherry-pick drop)
- store_py.cpp + standalone-binary fixes are companion in kvcache-ai#2083

PR-32 observability
- VLLM_MOONCAKE_STORE_TIER_LOG=1: per-batch "Mooncake load tier summary" lines
  showing memory_keys/disk_keys/unknown_keys/success_keys/failed_keys and
  bytes_by_tier breakdown

PR-36 perf
- _get_disk_offload_buffer_budget_bytes(enable_offload) returns None when off
- enable_offload field on MooncakeStoreConfig (read from JSON or
  MOONCAKE_ENABLE_OFFLOAD env)
- dropped redundant count-based split trigger that was firing on every batch
  with ≥2 keys, doubling owner GET-RPCs

PR-37 preferred_segment env
- MOONCAKE_PREFERRED_SEGMENT env var falls through to rdma_utils.py
- run_vllm_with_mooncake_owner.sh exports it for the spawned vLLM

MultiConnector bind_gpu_block_pool proxy
- allow simple_cpu_backend-style child connectors to bind to the GPU block
  pool when wrapped by MultiConnector (PD-disaggregated setups)

Validation
==========
End-to-end mndp run (Qwen3-8B DP=4, 100 conv × 3 turns × 32 concurrent,
16K input, --gpu-memory-utilization 0.4 --num-gpu-blocks-override 1024,
4 GiB owner CPU pool to force SSD spillover):

- 99/100 conversations completed
- 219 tier-log lines, 100% with disk_keys>0, 0 failed_keys
- 59.5 GB read back from SSD
- After PR-36 layered on: Mean TTFT -61%, Mean E2EL -58%, tier-log batches
  halved (219 → 115) confirming count-trigger removal

Layout
======
The owner-client commits assumed flat module layout (mooncake_store_*.py)
while feat/mooncake-store-connector kept store/ subdir. Reconciled to flat
layout to match upstream PR-40900 author's pre-revert state.

Related
=======
- vllm-project#40900 — parent PR (basic connector); this stacks on it
- ivanium#31 — segment-port readiness probe (folded in)
- ivanium#32 — VLLM_MOONCAKE_STORE_TIER_LOG (folded in)
- ivanium#36 — enable_offload gate + count-trigger removal (folded in)
- ivanium#37 — MOONCAKE_PREFERRED_SEGMENT env (folded in)
- kvcache-ai/Mooncake#2083 — Mooncake-side standalone-binary disk-read fix
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request May 13, 2026
…og + perf fixes

Squashed forward-port of feat/mooncake-store-int-owner-client + PR-32 (tier-log)
+ PR-31 (segment readiness probe) + PR-36 (enable_offload gate + count-trigger
removal) + PR-37 (preferred_segment env) + bind_gpu_block_pool MultiConnector
fix, on top of the rebased feat/mooncake-store-connector.

What lands
==========

Owner-client topology
- scripts/mooncake/: master + owner launcher + RDMA auto-detection helpers
- mndp.yaml: Kimi-K2.5-NVFP4 / Qwen3-8B DP=4 vigil recipe
- vllm/distributed/.../mooncake/rdma_utils.py: RNIC + GID detection
- requester-only setup refactor: vLLM ranks pass global_segment_size=0; the
  separate mooncake_client owner contributes the CPU pool + SSD tier

Disk offload (re-introduced after intentional revert on PR-40900)
- batch budget tracking with backpressure
- batch-splitting for disk-tier loads
- LookupKeyServer wiring (restored after cherry-pick drop)
- store_py.cpp + standalone-binary fixes are companion in kvcache-ai#2083

PR-32 observability
- VLLM_MOONCAKE_STORE_TIER_LOG=1: per-batch "Mooncake load tier summary" lines
  showing memory_keys/disk_keys/unknown_keys/success_keys/failed_keys and
  bytes_by_tier breakdown

PR-36 perf
- _get_disk_offload_buffer_budget_bytes(enable_offload) returns None when off
- enable_offload field on MooncakeStoreConfig (read from JSON or
  MOONCAKE_ENABLE_OFFLOAD env)
- dropped redundant count-based split trigger that was firing on every batch
  with ≥2 keys, doubling owner GET-RPCs

PR-37 preferred_segment env
- MOONCAKE_PREFERRED_SEGMENT env var falls through to rdma_utils.py
- run_vllm_with_mooncake_owner.sh exports it for the spawned vLLM

MultiConnector bind_gpu_block_pool proxy
- allow simple_cpu_backend-style child connectors to bind to the GPU block
  pool when wrapped by MultiConnector (PD-disaggregated setups)

Validation
==========
End-to-end mndp run (Qwen3-8B DP=4, 100 conv × 3 turns × 32 concurrent,
16K input, --gpu-memory-utilization 0.4 --num-gpu-blocks-override 1024,
4 GiB owner CPU pool to force SSD spillover):

- 99/100 conversations completed
- 219 tier-log lines, 100% with disk_keys>0, 0 failed_keys
- 59.5 GB read back from SSD
- After PR-36 layered on: Mean TTFT -61%, Mean E2EL -58%, tier-log batches
  halved (219 → 115) confirming count-trigger removal

Layout
======
The owner-client commits assumed flat module layout (mooncake_store_*.py)
while feat/mooncake-store-connector kept store/ subdir. Reconciled to flat
layout to match upstream PR-40900 author's pre-revert state.

Related
=======
- vllm-project#40900 — parent PR (basic connector); this stacks on it
- ivanium#31 — segment-port readiness probe (folded in)
- ivanium#32 — VLLM_MOONCAKE_STORE_TIER_LOG (folded in)
- ivanium#36 — enable_offload gate + count-trigger removal (folded in)
- ivanium#37 — MOONCAKE_PREFERRED_SEGMENT env (folded in)
- kvcache-ai/Mooncake#2083 — Mooncake-side standalone-binary disk-read fix
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request May 13, 2026
…og + perf fixes

Squashed forward-port of feat/mooncake-store-int-owner-client + PR-32 (tier-log)
+ PR-31 (segment readiness probe) + PR-36 (enable_offload gate + count-trigger
removal) + PR-37 (preferred_segment env) + bind_gpu_block_pool MultiConnector
fix, on top of the rebased feat/mooncake-store-connector.

What lands
==========

Owner-client topology
- scripts/mooncake/: master + owner launcher + RDMA auto-detection helpers
- mndp.yaml: Kimi-K2.5-NVFP4 / Qwen3-8B DP=4 vigil recipe
- vllm/distributed/.../mooncake/rdma_utils.py: RNIC + GID detection
- requester-only setup refactor: vLLM ranks pass global_segment_size=0; the
  separate mooncake_client owner contributes the CPU pool + SSD tier

Disk offload (re-introduced after intentional revert on PR-40900)
- batch budget tracking with backpressure
- batch-splitting for disk-tier loads
- LookupKeyServer wiring (restored after cherry-pick drop)
- store_py.cpp + standalone-binary fixes are companion in kvcache-ai#2083

PR-32 observability
- VLLM_MOONCAKE_STORE_TIER_LOG=1: per-batch "Mooncake load tier summary" lines
  showing memory_keys/disk_keys/unknown_keys/success_keys/failed_keys and
  bytes_by_tier breakdown

PR-36 perf
- _get_disk_offload_buffer_budget_bytes(enable_offload) returns None when off
- enable_offload field on MooncakeStoreConfig (read from JSON or
  MOONCAKE_ENABLE_OFFLOAD env)
- dropped redundant count-based split trigger that was firing on every batch
  with ≥2 keys, doubling owner GET-RPCs

PR-37 preferred_segment env
- MOONCAKE_PREFERRED_SEGMENT env var falls through to rdma_utils.py
- run_vllm_with_mooncake_owner.sh exports it for the spawned vLLM

MultiConnector bind_gpu_block_pool proxy
- allow simple_cpu_backend-style child connectors to bind to the GPU block
  pool when wrapped by MultiConnector (PD-disaggregated setups)

Validation
==========
End-to-end mndp run (Qwen3-8B DP=4, 100 conv × 3 turns × 32 concurrent,
16K input, --gpu-memory-utilization 0.4 --num-gpu-blocks-override 1024,
4 GiB owner CPU pool to force SSD spillover):

- 99/100 conversations completed
- 219 tier-log lines, 100% with disk_keys>0, 0 failed_keys
- 59.5 GB read back from SSD
- After PR-36 layered on: Mean TTFT -61%, Mean E2EL -58%, tier-log batches
  halved (219 → 115) confirming count-trigger removal

Layout
======
The owner-client commits assumed flat module layout (mooncake_store_*.py)
while feat/mooncake-store-connector kept store/ subdir. Reconciled to flat
layout to match upstream PR-40900 author's pre-revert state.

Related
=======
- vllm-project#40900 — parent PR (basic connector); this stacks on it
- ivanium#31 — segment-port readiness probe (folded in)
- ivanium#32 — VLLM_MOONCAKE_STORE_TIER_LOG (folded in)
- ivanium#36 — enable_offload gate + count-trigger removal (folded in)
- ivanium#37 — MOONCAKE_PREFERRED_SEGMENT env (folded in)
- kvcache-ai/Mooncake#2083 — Mooncake-side standalone-binary disk-read fix
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request May 13, 2026
…og + perf fixes

Squashed forward-port of feat/mooncake-store-int-owner-client + PR-32 (tier-log)
+ PR-31 (segment readiness probe) + PR-36 (enable_offload gate + count-trigger
removal) + PR-37 (preferred_segment env) + bind_gpu_block_pool MultiConnector
fix, on top of the rebased feat/mooncake-store-connector.

What lands
==========

Owner-client topology
- scripts/mooncake/: master + owner launcher + RDMA auto-detection helpers
- mndp.yaml: Kimi-K2.5-NVFP4 / Qwen3-8B DP=4 vigil recipe
- vllm/distributed/.../mooncake/rdma_utils.py: RNIC + GID detection
- requester-only setup refactor: vLLM ranks pass global_segment_size=0; the
  separate mooncake_client owner contributes the CPU pool + SSD tier

Disk offload (re-introduced after intentional revert on PR-40900)
- batch budget tracking with backpressure
- batch-splitting for disk-tier loads
- LookupKeyServer wiring (restored after cherry-pick drop)
- store_py.cpp + standalone-binary fixes are companion in kvcache-ai#2083

PR-32 observability
- VLLM_MOONCAKE_STORE_TIER_LOG=1: per-batch "Mooncake load tier summary" lines
  showing memory_keys/disk_keys/unknown_keys/success_keys/failed_keys and
  bytes_by_tier breakdown

PR-36 perf
- _get_disk_offload_buffer_budget_bytes(enable_offload) returns None when off
- enable_offload field on MooncakeStoreConfig (read from JSON or
  MOONCAKE_ENABLE_OFFLOAD env)
- dropped redundant count-based split trigger that was firing on every batch
  with ≥2 keys, doubling owner GET-RPCs

PR-37 preferred_segment env
- MOONCAKE_PREFERRED_SEGMENT env var falls through to rdma_utils.py
- run_vllm_with_mooncake_owner.sh exports it for the spawned vLLM

MultiConnector bind_gpu_block_pool proxy
- allow simple_cpu_backend-style child connectors to bind to the GPU block
  pool when wrapped by MultiConnector (PD-disaggregated setups)

Validation
==========
End-to-end mndp run (Qwen3-8B DP=4, 100 conv × 3 turns × 32 concurrent,
16K input, --gpu-memory-utilization 0.4 --num-gpu-blocks-override 1024,
4 GiB owner CPU pool to force SSD spillover):

- 99/100 conversations completed
- 219 tier-log lines, 100% with disk_keys>0, 0 failed_keys
- 59.5 GB read back from SSD
- After PR-36 layered on: Mean TTFT -61%, Mean E2EL -58%, tier-log batches
  halved (219 → 115) confirming count-trigger removal

Layout
======
The owner-client commits assumed a flat module layout (mooncake_store_*.py),
while feat/mooncake-store-connector kept a store/ subdir. Reconciled to the
flat layout to match the upstream PR-40900 author's pre-revert state.

Related
=======
- vllm-project#40900 — parent PR (basic connector); this stacks on it
- ivanium#31 — segment-port readiness probe (folded in)
- ivanium#32 — VLLM_MOONCAKE_STORE_TIER_LOG (folded in)
- ivanium#36 — enable_offload gate + count-trigger removal (folded in)
- ivanium#37 — MOONCAKE_PREFERRED_SEGMENT env (folded in)
- kvcache-ai/Mooncake#2083 — Mooncake-side standalone-binary disk-read fix
@zhewenl zhewenl mentioned this pull request May 13, 2026
4 tasks