[KV Connector] Support disk offloading in MooncakeStoreConnector by zhewenl · Pull Request #42689 · vllm-project/vllm

zhewenl · 2026-05-14T23:18:12Z

Summary

For latest usage, please refer to docs/features/mooncake_store_connector_usage.md.

Adds disk-tier KV offload to MooncakeStoreConnector that landed in #40900. Stacks cleanly on top of #40900 with no impact on the existing CPU-only path — operators upgrade by adding "enable_offload": true to mooncake_config.json.

End-to-end validated on a 4×GB200 node with Qwen3-8B + 4 DP: 49.6 GB read back from SSD across 74 tier-summary lines, 0 failed_keys, 21,027 disk_keys + 236 memory_keys.

NOTE: Please use a Mooncake build from after commit da9dfea38703c9380093e4b95cc1dc3670848a51.

Architecture

        ┌──────────────────────────────────────────────────┐
        │             Mooncake Master (control)            │
        └──────────────────────────────────────────────────┘
              ▲ TCP RPC                  ▲
              │                          │
   ┌──────────┴────────┐    ┌────────────┴───────────────┐
   │ vLLM rank 0..N    │◀──▶│ mooncake_client (owner)    │
   │ (requesters)      │RDMA│ CPU pool + optional SSD    │
   │ GPU KV buffers    │    │ (DirectIO via uring)       │
   └───────────────────┘    └────────────────────────────┘

Two supported topologies, selected by an explicit mode field:

Mode	Description
`embedded` (default)	PR-40900 baseline. Each vLLM rank contributes its own CPU segment. No separate owner process. No SSD tier.
`standalone-store`	vLLM ranks contribute zero CPU; a separately-launched `mooncake_client` owns the CPU pool + optional SSD.

mode and global_segment_size are validated for consistency at startup — wrong combinations raise ValueError with an actionable message instead of failing silently.
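
The consistency rule above can be sketched as a dataclass check. This is a minimal illustration rather than the connector's actual code: the class name `StoreConfigSketch` is hypothetical, and sizes are plain integers here, whereas the real config also accepts strings such as "4GB".

```python
from dataclasses import dataclass
from typing import Literal, get_args

Mode = Literal["embedded", "standalone-store"]


@dataclass
class StoreConfigSketch:
    """Hypothetical stand-in for the connector's config object."""

    mode: str = "embedded"
    global_segment_size: int = 4 * 1024**3  # bytes contributed by this rank
    local_buffer_size: int = 4 * 1024**3  # bytes

    def __post_init__(self) -> None:
        allowed = get_args(Mode)
        if self.mode not in allowed:
            raise ValueError(f"mode must be one of {allowed}, got {self.mode!r}")
        if self.local_buffer_size <= 0:
            raise ValueError("local_buffer_size must be > 0")
        if self.mode == "embedded" and self.global_segment_size <= 0:
            raise ValueError(
                "mode='embedded' requires global_segment_size > 0; "
                "set a size or switch to mode='standalone-store'"
            )
        if self.mode == "standalone-store" and self.global_segment_size != 0:
            raise ValueError(
                "mode='standalone-store' requires global_segment_size == 0; "
                "the external mooncake_client owns the pool"
            )
```

Raising in `__post_init__` means a wrong combination surfaces when the config object is built, before any connection to the master is attempted.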

Files changed

File	Description
`vllm/.../mooncake/store/worker.py`	disk-offload split path, dual-mode config + validation, tier-summary log, `preferred_segment` wiring, mode-detection log
`vllm/.../mooncake/rdma_utils.py`	(new) per-rank RNIC + hostname + preferred-segment helpers
`vllm/envs.py`	registers `VLLM_MOONCAKE_STORE_TIER_LOG`
`tests/v1/kv_connector/unit/test_mooncake_store_worker.py`	~30 new tests covering split path, validation matrix, topology recipes, tier log, RNIC selection
`tests/v1/kv_connector/unit/test_mooncake_store_connector.py`	one test renamed to match upstream's tip-of-main
`docs/features/mooncake_store_connector_usage.md`	disk-offload section + env-var table

How to use disk offload

Step 1. Start the Mooncake master with disk-offload enabled

mooncake_master \
  -rpc_port=50051 \
  -enable_http_metadata_server=true \
  -http_metadata_server_host=0.0.0.0 \
  -http_metadata_server_port=8080 \
  -enable_offload=true \
  -default_kv_lease_ttl=30000 \
  -eviction_high_watermark_ratio=0.95 \
  -eviction_ratio=0.1 \
  -logtostderr

Wait until the RPC port is reachable (e.g., nc -z 127.0.0.1 50051).
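
If the wait needs to be scripted from Python rather than with nc, a small polling helper is enough. This is a sketch using only the standard library; the function name `wait_for_port` and its defaults are illustrative and not part of vLLM or Mooncake.

```python
import socket
import time


def wait_for_port(
    host: str, port: int, timeout_s: float = 30.0, interval_s: float = 0.5
) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)
    return False
```

For the master started in Step 1, the call would be `wait_for_port("127.0.0.1", 50051)`. A successful connect only shows that the port is open, not that the master has finished initializing.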

Step 2. Start the per-node owner with a CPU pool + SSD tier

# Choose a directory on local NVMe with O_DIRECT support
export MOONCAKE_OFFLOAD_FILE_STORAGE_PATH=/mnt/nvme/mooncake_offload
export MOONCAKE_OFFLOAD_STORAGE_BACKEND_DESCRIPTOR=bucket_storage_backend
export MOONCAKE_USE_URING=true

mooncake_client \
  --master_server_address=127.0.0.1:50051 \
  --metadata_server=http://127.0.0.1:8080/metadata \
  --host=127.0.0.1:18001 \
  --port=50052 \
  --protocol=rdma \
  --device_names="mlx5_0,mlx5_1,mlx5_2,mlx5_3" \
  --global_segment_size=200GB \
  --enable_offload=true \
  --threads=16

Verify the owner registered with the master (the master's admin metrics line should show Clients: 1, Mem Storage: 0 B / 200 GB).

Step 3. Write `mooncake_config.json` for standalone-store mode

{
  "mode": "standalone-store",
  "metadata_server": "http://127.0.0.1:8080/metadata",
  "master_server_address": "127.0.0.1:50051",
  "global_segment_size": 0,
  "local_buffer_size": "4GB",
  "protocol": "rdma",
  "device_name": "mlx5_0,mlx5_1,mlx5_2,mlx5_3",
  "enable_offload": true
}

In standalone-store mode:

mode=standalone-store declares the intent.
global_segment_size=0 means this vLLM rank contributes zero CPU to the pool; the owner provides all of it.
enable_offload=true tells the connector to use the disk-aware recv-side batching path.
device_name is a comma-separated list; entry [i] is used by the rank with physical GPU i.
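
The positional device_name rule can be illustrated with a short helper. This is a sketch, not the code in rdma_utils.py: the function name `select_rnic` is hypothetical, and raising on an out-of-range index is an assumption about how a mismatch should be handled.

```python
def select_rnic(device_name: str, physical_gpu_index: int) -> str:
    """Return the RNIC at position physical_gpu_index in a comma-separated list."""
    devices = [d.strip() for d in device_name.split(",") if d.strip()]
    if not devices:
        raise ValueError("device_name is empty")
    if not 0 <= physical_gpu_index < len(devices):
        # Assumption: fail loudly instead of wrapping around or guessing.
        raise ValueError(
            f"GPU index {physical_gpu_index} has no entry in device_name "
            f"({len(devices)} devices listed)"
        )
    return devices[physical_gpu_index]
```

With the config above, `select_rnic("mlx5_0,mlx5_1,mlx5_2,mlx5_3", 2)` returns `"mlx5_2"`.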

Step 4. Launch vLLM

MOONCAKE_CONFIG_PATH=/path/to/mooncake_config.json \
MOONCAKE_PREFERRED_SEGMENT=127.0.0.1:18001 \
VLLM_MOONCAKE_STORE_TIER_LOG=1 \
vllm serve <model> \
  -dp 4 \
  --kv-transfer-config '{
    "kv_connector": "MooncakeStoreConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
      "load_async": true,
      "enable_cross_layers_blocks": true,
      "enable_offload": true
    }
  }'

MOONCAKE_PREFERRED_SEGMENT pins all PUTs to the SSD-bearing owner. Required when more than one segment is reachable from the master.
VLLM_MOONCAKE_STORE_TIER_LOG=1 emits per-batch tier-summary lines so you can verify the disk tier is serving reads.

Step 5. Verify disk offload is firing

On startup, each rank should log one line:

INFO ... [worker.py] Mooncake mode=standalone-store (global_segment_size=0,
     local_buffer_size=4294967296, preferred_segment=127.0.0.1:18001,
     enable_offload=True)

Once the owner's CPU pool fills, the recv thread starts emitting tier-summary lines:

INFO ... Mooncake load tier summary: req_id=chatcmpl-...-conv-17-turn1-...
     batch_keys=412 memory_keys=0 disk_keys=412 unknown_keys=0
     success_keys=412 failed_keys=0
     bytes_by_tier={'memory': 0, 'disk': 972029952, 'unknown': 0}

disk_keys > 0 confirms the SSD path is being read from. failed_keys should stay 0 — non-zero failed_keys with insufficient_space in the master log means the DirectIO budget is too small; raise it:

export MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES=$((4 * 1024**3))  # 4 GiB
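
To check the disk-tier hit split over a whole run, the tier-summary lines can be tallied with a short script. This sketch assumes each summary is emitted as a single log line (it is wrapped above for readability) and that the field names match the sample; the function name `summarize_tier_logs` is illustrative.

```python
import ast
import re
from collections import Counter

_KEY_FIELD = re.compile(r"\b(\w+_keys)=(\d+)")
_BYTES_FIELD = re.compile(r"bytes_by_tier=(\{[^}]*\})")


def summarize_tier_logs(lines):
    """Sum the *_keys counters and bytes_by_tier values over tier-summary lines."""
    key_totals = Counter()
    byte_totals = Counter()
    for line in lines:
        if "Mooncake load tier summary" not in line:
            continue
        for name, value in _KEY_FIELD.findall(line):
            key_totals[name] += int(value)
        match = _BYTES_FIELD.search(line)
        if match:
            # The dict is printed as a Python literal, so literal_eval parses it.
            byte_totals.update(ast.literal_eval(match.group(1)))
    return dict(key_totals), dict(byte_totals)
```

A non-zero `failed_keys` total in the result is the signal to look for insufficient_space in the master log.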

Test plan

.venv/bin/python -m pytest \
  tests/v1/kv_connector/unit/test_mooncake_store_connector.py \
  tests/v1/kv_connector/unit/test_mooncake_store_worker.py -v

Code Review

This pull request introduces disk offloading and dual-mode operation ("real-client" and "owner-client") to the MooncakeStoreConnector, allowing KV cache to be offloaded to CPU memory or disk. Key technical changes include the implementation of staging buffer budget management for load batches, improved RDMA NIC selection utilities, and enhanced logging for cache tiers. Review feedback correctly identified a critical correctness issue where oversized keys could lead to silent data corruption instead of a loud failure, as well as a regression in IPC path generation that compromised multi-user isolation.

gemini-code-assist · 2026-05-14T23:20:00Z

+                    self.set_finished_request(req_id)
+                    self.request_queue.task_done()
+                    return


Marking a request as finished when it was skipped due to an oversized key is a critical correctness issue. The consumer thread will proceed assuming the KV cache has been successfully loaded, but it will actually read stale or uninitialized data from the GPU cache. This leads to silent corruption of model outputs. Since this indicates a fatal configuration error (staging budget too small for a single block), the process should fail loudly instead of proceeding with invalid data.

                    self.request_queue.task_done()
                    raise RuntimeError(
                        f"Mooncake load for request {req_id} failed: key {oversized_key} "
                        f"requires {oversized_key_bytes} staging bytes, exceeding budget "
                        f"{self.disk_offload_buffer_budget_bytes}. Increase "
                        "MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES."
                    )
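
The fail-loudly behavior requested here can be sketched together with the batch splitting that the PR describes. Everything below is an illustration under stated assumptions, not the code in worker.py: the names `staging_bytes` and `split_batches` are hypothetical, the per-key estimate (round up to 4 KiB, then add 8 KiB of padding) is one reading of the PR's "4 KiB alignment + 8 KiB padding" note, and the greedy packing strategy is assumed.

```python
ALIGNMENT = 4096  # DirectIO alignment (4 KiB), per the PR description
PADDING = 2 * ALIGNMENT  # 8 KiB of padding, per the PR description


def staging_bytes(size: int) -> int:
    """Assumed estimate: round size up to the alignment, then add the padding."""
    aligned = -(-size // ALIGNMENT) * ALIGNMENT
    return aligned + PADDING


def split_batches(keys, sizes, raw_budget, usable_ratio=0.9):
    """Greedily pack keys into sub-batches that each fit the usable budget."""
    usable = int(raw_budget * usable_ratio)
    batches, current, used = [], [], 0
    for key, size in zip(keys, sizes):
        need = staging_bytes(size)
        if need > raw_budget:
            # A single key that cannot fit is a configuration error: fail loudly.
            raise RuntimeError(
                f"key {key} requires {need} staging bytes, "
                f"exceeding budget {raw_budget}"
            )
        if current and used + need > usable:
            batches.append(current)
            current, used = [], 0
        current.append(key)
        used += need
    if current:
        batches.append(current)
    return batches
```

A key whose estimate lies between the usable and raw budgets ends up alone in its own sub-batch, so only keys that exceed the raw budget raise.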

gemini-code-assist · 2026-05-14T23:20:00Z

+    hostname = socket.gethostname()
    extra_config = vllm_config.kv_transfer_config.kv_connector_extra_config
    if "lookup_rpc_port" in extra_config:
        rpc_port = extra_config["lookup_rpc_port"]
-    uid = os.getuid()
-    logger.debug("Base URL: %s, RPC Port: %s, UID: %s", base_url, rpc_port, uid)
-    return f"ipc://{base_url}/lookup_rpc_port_{rpc_port}_uid{uid}_dp_rank{dp_rank}"
+    logger.debug("Base URL: %s, RPC Port: %s", base_url, rpc_port)
+    return (
+        f"ipc://{base_url}/lookup_rpc_port_{rpc_port}_host_{hostname}_dp_rank{dp_rank}"


Replacing uid with hostname in the IPC path is a regression that breaks isolation in multi-user environments. If multiple users run vLLM on the same node, they will collide on the same socket path (as hostname and dp_rank will be identical). Additionally, socket.gethostname() can be slow or fail in restricted network environments. Isolation should be maintained using os.getuid(). If hostname is desired for observability, it should be added alongside the UID.

Suggested change

    uid = os.getuid()
    hostname = socket.gethostname()
    extra_config = vllm_config.kv_transfer_config.kv_connector_extra_config
    if "lookup_rpc_port" in extra_config:
        rpc_port = extra_config["lookup_rpc_port"]
    logger.debug("Base URL: %s, RPC Port: %s, UID: %s, Host: %s",
                 base_url, rpc_port, uid, hostname)
    return (
        f"ipc://{base_url}/lookup_rpc_port_{rpc_port}_uid{uid}_host_{hostname}_dp_rank{dp_rank}"
    )

…nfig, observability

Adds disk-tier KV offload support to the MooncakeStoreConnector that landed in vllm-project#40900. Stacks cleanly on top of vllm-project#40900 with zero changes to the existing CPU-only path: operators upgrade simply by adding ``"enable_offload": true`` to their mooncake_config.json and launching ``mooncake_client`` with ``--disk_gb N``.

## What this adds

**Disk-tier offload (recv-side batching)**

* Optional ``enable_offload`` flag in MooncakeStoreConfig. When true, the recv thread allocates a DirectIO staging budget (``MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES``, default 1.28 GiB) and caps each ``batch_get_into_multi_buffers`` call at ``0.9 ×`` that budget, splitting larger batches into sub-batches automatically.
* Per-key staging estimate accounts for the C++ DirectIO 4 KiB alignment + 8 KiB padding so the split never overruns the owner's pinned-memory staging buffer.
* Loads that would individually exceed the raw budget are skipped with a clear warning instead of returning an opaque ``insufficient_space`` error from Mooncake.

**Dual-mode topology**

* New ``mode`` field in MooncakeStoreConfig with ``Literal["real-client", "owner-client"]``:
  * ``real-client`` (default, PR-40900 baseline): every vLLM rank contributes its own CPU segment. No separate owner process.
  * ``owner-client``: vLLM ranks contribute zero CPU; a separately launched ``mooncake_client`` owns the CPU pool + optional SSD tier. This is the topology that makes disk offload work in practice (one node-local SSD tier serves all ranks).
* ``__post_init__`` validation: ``mode`` and ``global_segment_size`` must agree (real-client requires > 0; owner-client requires 0), ``local_buffer_size`` must be > 0, ``mode`` must be one of the two literals. Hard fail with a clear message, no silent footguns.

**Optional ``preferred_segment`` for replicate-config**

* When set on ``kv_connector_extra_config["preferred_segment"]`` (or the ``MOONCAKE_PREFERRED_SEGMENT`` env), PUTs route to a specific owner segment via ``ReplicateConfig.preferred_segment`` rather than the default round-robin allocation. Required when more than one segment is reachable from the master and you want puts to converge on the SSD-bearing one.

**Per-rank RNIC pinning + scratch helpers**

* New ``vllm/distributed/kv_transfer/kv_connector/v1/mooncake/rdma_utils.py`` with three small helpers: (1) parse an operator-provided ``device_name`` CSV (positional, by physical GPU index) and select this rank's RNIC; (2) honour ``MOONCAKE_LOCAL_HOSTNAME`` to override the announced address on multi-NIC hosts; (3) resolve ``preferred_segment`` from ``extra_config`` or env.
* When no ``device_name`` is configured on RDMA, the connector logs a SGLang-style warning explaining how to set it; Mooncake's C++ auto-selection is the only fallback. (No Python-side discovery, which matches vllm-ascend's approach.)

**Observability**

* New ``VLLM_MOONCAKE_STORE_TIER_LOG=1`` env var. When set, the recv thread emits one line per ``batch_get_into_multi_buffers``:

  ```
  Mooncake load tier summary: req_id=X batch_keys=N memory_keys=A disk_keys=B unknown_keys=C success_keys=D failed_keys=E bytes_by_tier={'memory': X, 'disk': Y, 'unknown': Z}
  ```

  Lets operators verify the disk tier is actually serving reads and measure the hit-rate split.

**Startup mode-detection log + soft warnings**

* One info line per rank announcing the resolved mode + relevant fields. Plus warnings for unusual-but-legal hybrid combinations:
  * real-client + ``enable_offload`` + no ``preferred_segment`` → disk tier sees only a fraction of writes.
  * real-client + ``preferred_segment`` set → rank-segments idle.
  * owner-client + ``enable_offload=false`` → disk-batch-splitting disabled; large prefills may hit owner DirectIO budget.

## Files

* ``vllm/distributed/kv_transfer/kv_connector/v1/mooncake/store/worker.py``: disk-offload split, dual-mode config + validation, tier-summary log, ``preferred_segment`` wiring, mode-detection log.
* ``vllm/distributed/kv_transfer/kv_connector/v1/mooncake/rdma_utils.py``: new file; per-rank RNIC + hostname + preferred-segment helpers.
* ``vllm/envs.py``: registers ``VLLM_MOONCAKE_STORE_TIER_LOG``.
* ``tests/v1/kv_connector/unit/test_mooncake_store_worker.py``: ~30 new tests covering split path, validation matrix, topology recipes (real-client + owner-client+disk), tier-summary log, RNIC selection.
* ``tests/v1/kv_connector/unit/test_mooncake_store_connector.py``: one test renamed (``test_worker_role_initializes_store_worker_on_rank0``) to match upstream after rename in tip-of-main.
* ``docs/features/mooncake_store_connector_usage.md``: disk-offload section + env-var table updates.

## Test plan

* ``.venv/bin/python -m pytest tests/v1/kv_connector/unit/test_mooncake_store_connector.py tests/v1/kv_connector/unit/test_mooncake_store_worker.py`` → 53 pass, 0 fail.
* End-to-end validation on a 4×GB200 node with Qwen3-8B + 4 DP, 4 GiB owner CPU pool + 1 TB SSD tier:
  * owner-client + disk: 74 tier-summary lines, 21,027 disk_keys + 236 memory_keys, **0 failed_keys**, 49.6 GB read back from SSD. Split path (``510+290`` and similar) repeatedly exercised.
  * real-client + CPU only: master sees ``Mem Storage: 14.50 GB / 16.00 GB`` (4 ranks × 4 GiB each), 0 SSD, PUTs and evictions flowing.

## Usage

### CPU-only (PR-40900 baseline, unchanged)

```bash
# mooncake_config.json
{
  "mode": "real-client",
  "metadata_server": "http://master:8080/metadata",
  "master_server_address": "master:50051",
  "global_segment_size": "4GB",
  "local_buffer_size": "4GB",
  "protocol": "rdma",
  "device_name": "mlx5_0,mlx5_1,mlx5_2,mlx5_3"
}
```

```bash
MOONCAKE_CONFIG_PATH=mooncake_config.json \
vllm serve <model> \
  --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_both"}'
```

### Owner-client + disk offload

Step 1. Start ``mooncake_master`` (HTTP metadata variant):

```bash
mooncake_master -rpc_port=50051 -enable_http_metadata_server=true \
  -http_metadata_server_port=8080 -enable_offload=true -logtostderr
```

Step 2. Start ``mooncake_client`` as the owner (CPU pool + SSD tier):

```bash
mooncake_client \
  --master_server_address=127.0.0.1:50051 \
  --metadata_server=http://127.0.0.1:8080/metadata \
  --host=127.0.0.1:18001 --port=50052 \
  --protocol=rdma --device_names="$RDMA_DEVICES" \
  --global_segment_size=200GB --enable_offload=true
```

Step 3. Set ``mooncake_config_owner_client.json``:

```bash
{
  "mode": "owner-client",
  "metadata_server": "http://127.0.0.1:8080/metadata",
  "master_server_address": "127.0.0.1:50051",
  "global_segment_size": 0,
  "local_buffer_size": "4GB",
  "protocol": "rdma",
  "device_name": "mlx5_0,mlx5_1,mlx5_2,mlx5_3",
  "enable_offload": true
}
```

Step 4. Launch vLLM:

```bash
MOONCAKE_CONFIG_PATH=mooncake_config_owner_client.json \
MOONCAKE_PREFERRED_SEGMENT=127.0.0.1:18001 \
VLLM_MOONCAKE_STORE_TIER_LOG=1 \
vllm serve <model> \
  --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector",
    "kv_role":"kv_both",
    "kv_connector_extra_config":{"load_async":true,"enable_offload":true}}'
```

The startup logs should show:

```
INFO ... Mooncake mode=owner-client (global_segment_size=0, local_buffer_size=4294967296, preferred_segment=127.0.0.1:18001, enable_offload=True)
```

Once the owner CPU pool fills, tier-summary lines start emitting ``disk_keys > 0``, confirming SSD reads.

## Related

* vllm-project#40900: initial MooncakeStoreConnector PR (CPU only); this PR stacks on top.
* RFC: vllm-project#38474.
* Mooncake project: https://github.com/kvcache-ai/Mooncake

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhewen Li <zhewenli@inferact.ai>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8391623367

chatgpt-codex-connector · 2026-05-14T23:38:54Z

-            protocol=config.get("protocol", "rdma"),
-            device_name=config.get("device_name", ""),
-            master_server_address=config.get("master_server_address", ""),
+            enable_offload=bool(config.get("enable_offload", False)),


Parse enable_offload as a strict boolean

MooncakeStoreConfig.from_file currently coerces enable_offload with bool(...), which treats any non-empty string (including "false"/"0") as True. In JSON generated by templating systems that serialize booleans as strings, this will silently enable disk-offload mode and change runtime behavior (extra buffering/splitting path) instead of honoring the operator intent. Parse this field explicitly (e.g., accept only real booleans or validated string values) to avoid misconfiguration.
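
A strict parser along the lines suggested here could look like the following. This is a sketch: the helper name `parse_strict_bool` and the set of accepted string spellings are assumptions, not the fix that was merged.

```python
def parse_strict_bool(value, field="enable_offload"):
    """Accept real booleans and a small set of explicit strings; reject the rest."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        normalized = value.strip().lower()
        if normalized in {"true", "1"}:
            return True
        if normalized in {"false", "0"}:
            return False
    raise ValueError(f"{field} must be a boolean, got {value!r}")
```

With this helper, `parse_strict_bool("false")` returns `False`, whereas `bool("false")` is `True`. Non-boolean numbers such as `1` are rejected rather than coerced.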

chatgpt-codex-connector · 2026-05-14T23:38:54Z

-    return f"ipc://{base_url}/lookup_rpc_port_{rpc_port}_uid{uid}_dp_rank{dp_rank}"
+    logger.debug("Base URL: %s, RPC Port: %s", base_url, rpc_port)
+    return (
+        f"ipc://{base_url}/lookup_rpc_port_{rpc_port}_host_{hostname}_dp_rank{dp_rank}"


Preserve per-user uniqueness in lookup IPC path

The lookup socket path no longer includes uid, so two vLLM instances from different users on the same host now resolve to the same IPC filename when lookup_rpc_port/dp_rank match. That creates bind/connect collisions in shared machines (one process can fail to create the REP socket or connect to the wrong peer). Keep a per-user discriminator (like uid) in the path to avoid cross-user namespace conflicts.

ivanium

Left some initial comments. Will continue later.

ivanium · 2026-05-15T00:21:01Z

+_DIRECT_IO_PADDING_BYTES = 2 * _DIRECT_IO_ALIGNMENT
+
+
+MooncakeMode = Literal["real-client", "owner-client"]


Let's think about better names.

From their docs, I think we can call them "embedded", "embedded-dummy", and "standalone-store"

ivanium · 2026-05-15T00:21:47Z

+    ``mode`` selects the topology: ``real-client`` (PR-40900 baseline — each
+    rank contributes ``global_segment_size``) or ``owner-client`` (rank
+    contributes 0; an external ``mooncake_client`` owns the pool).


we can revise the comments here a bit too

ivanium · 2026-05-15T04:18:03Z

+            if self.replicate_config is None:
+                res = self.store.batch_put_from_multi_buffers(keys, addrs, sizes)
+            else:
+                res = self.store.batch_put_from_multi_buffers(
+                    keys,
+                    addrs,
+                    sizes,
+                    self.replicate_config,
+                )


after a second look, maybe we can unify this to

res = self.store.batch_put_from_multi_buffers(
    keys,
    addrs,
    sizes,
    self.replicate_config,
)

ivanium · 2026-05-15T05:14:42Z

+export MOONCAKE_ENABLE_OFFLOAD=1
+export MOONCAKE_OFFLOAD_FILE_STORAGE_PATH=/path/to/offload/dir


Claude pointed out that we actually read enable_offload from Mooncake's config JSON here rather than from an env var.

mergify · 2026-05-15T23:41:03Z

Hi @zhewenl, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

…ommit fix

Follow-up to review feedback on PR vllm-project#42689:

1. Pre-commit fixes (the immediate CI failure). Two `raise ValueError(...)` / `elif (...)` sites flagged by `pre-commit run ruff-format --all-files` are now collapsed to single-line form.
2. Rename `DEFAULT_MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE` → `DEFAULT_MOONCAKE_DISK_STAGING_BUFFER_BYTES`. The old name collided conceptually with the unrelated `local_buffer_size` parameter to Mooncake's `setup()` (RDMA scratch buffer, 16 MiB default). The new name says what the constant actually mirrors: `FileStorageConfig::local_buffer_size` at `Mooncake/mooncake-store/include/storage_backend.h:206` (1280 * kMB, the DirectIO staging buffer used inside the disk-tier owner process).
3. Promote `DISK_OFFLOAD_USABLE_BUDGET_RATIO = 0.9` to an env var: `VLLM_MOONCAKE_DISK_STAGING_USABLE_RATIO` (typed `float`, default 0.9). Follows the `VLLM_MOONCAKE_STORE_TIER_LOG` pattern. The 0.9 had no experimental motivation in code; exposing it as a knob lets users tune the per-batch headroom against the owner's staging buffer.
4. Centralize three previously `os.getenv`-only Mooncake env vars in `vllm/envs.py`:
   - `MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES`: shared with the Mooncake C++ owner process; same name, single source of truth.
   - `MOONCAKE_PREFERRED_SEGMENT`: pin replicas to a specific owner segment in standalone-store mode.
   - `MOONCAKE_REQUESTER_LOCAL_HOSTNAME`: renamed from the previous `MOONCAKE_LOCAL_HOSTNAME`. The old name was too generic and collided with a different `local_hostname` knob in Mooncake's own wheel-level `mooncake_config.py`. The new name says what it's for: the vLLM-rank-as-requester's identity.
5. Add concise comments to each module-level constant in worker.py pinning it to its Mooncake C++ source (`file_storage.cpp:512-525` for DirectIO alignment + padding, `storage_backend.h:206` for the staging-buffer default). The drift hazard is now visible at the site of the constant.
6. Add a docstring + signature comments to `_split_disk_offload_load_batches` explaining the three-parallel-list input shape, why the inner type is `list[int]` (scatter-gather across K/V or multi-layer segments), and what the `(batches, oversize_key)` return tuple encodes.
7. Doc updates:
   - `docs/features/mooncake_store_connector_usage.md` env-var table now lists the three centralized vars plus the new ratio knob.
   - `docs/design/mooncake_offload_staging_buffer_explained.html` code snippets updated to the new constant name.

Test plan:

- `pre-commit run --files <touched-files>`: all hooks pass locally.
- `.venv/bin/python -m pytest tests/v1/kv_connector/unit/test_mooncake_store_*`: 53/53 pass.
- E2E validated on GB200 with `recipes/mooncake/verify/mndp_noscripts_p2p.yaml` (Qwen3-8B, standalone-store + disk offload): worker log shows `Mooncake mode=standalone-store ... enable_offload=True`, tier-summary reports disk_keys > 0 with zero failed_keys.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhewen Li <zhewenli@inferact.ai>

Bundle of cleanup changes addressing reviewer comments on the original disk-offload commit:

Naming

- Rename mode strings: real-client → embedded, owner-client → standalone-store. The new names describe what the topology does (embeds the segment in-process vs. uses a standalone owner) rather than how the C++ side labels its clients.
- Rename DEFAULT_MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE → DEFAULT_MOONCAKE_DISK_STAGING_BUFFER_BYTES. The old name collided with Mooncake's unrelated setup() local_buffer_size; the new name says what the constant actually mirrors (FileStorageConfig::local_buffer_size in storage_backend.h:206).

Code simplification

- Unify the batch_put call path: drop the `if self.replicate_config is None` branch in the SendingThread. MooncakeStoreWorker.__init__ now always builds a ReplicateConfig (only sets preferred_segment when configured), so the inner PUT site has a single unconditional call to batch_put_from_multi_buffers.
- Inline trivial helpers (_get_kv_connector_extra_config, _get_disk_offload_buffer_budget_bytes) at their single call sites; they were adding indirection without adding clarity.

Env vars

- Add VLLM_MOONCAKE_DISK_STAGING_USABLE_RATIO (float, default 0.9) to vllm/envs.py; the previously hardcoded usable-vs-raw budget ratio is now a runtime knob.
- Centralize MOONCAKE_PREFERRED_SEGMENT and the renamed MOONCAKE_REQUESTER_LOCAL_HOSTNAME (was the too-generic MOONCAKE_LOCAL_HOSTNAME) in vllm/envs.py.
- Drop the vLLM-side MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES env knob. The vLLM-side budget is now always DEFAULT_MOONCAKE_DISK_STAGING_BUFFER_BYTES; Mooncake's C++ owner still reads its own env var on its side.

Doc fixes

- enable_offload is read from the JSON config, not from MOONCAKE_ENABLE_OFFLOAD (the old doc was wrong: that env var is only referenced in a Mooncake C++ test). Usage doc updated; env-var table no longer claims the env var exists.
- Add concise comments to each module-level constant pinning to its Mooncake C++ source line (file_storage.cpp:512-525 for DirectIO alignment + padding, storage_backend.h:206 for the staging-buffer default).
- Add a docstring + signature comments to _split_disk_offload_load_batches explaining the three-parallel-list input shape, the scatter-gather inner type, and the (batches, oversize_key) return convention.

Test plan

- `pre-commit run --files <touched-files>`: all hooks pass locally.
- `.venv/bin/python -m pytest tests/v1/kv_connector/unit/test_mooncake_store_*`: 50/50 pass.
- E2E validated on GB200 with `recipes/mooncake/verify/mndp_noscripts_p2p.yaml` (Qwen3-8B DP=4, standalone-store + disk offload): 22,019 disk_keys / 0 failed_keys over 87 tier-summary lines; 100 conv × 3 turns bench completes with 298/300 successful requests (the 2 turn-3 failures are the documented PUT-backpressure-skip path, not a regression).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhewen Li <zhewenli@inferact.ai>

ivanium

LGTM. Thanks for the effort!

…m-project#42689) Signed-off-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resolves conflicts in vllm/envs.py introduced by the pydantic-settings refactor. The legacy `if TYPE_CHECKING:` declaration block and `environment_variables: dict[str, Callable]` runtime dict were dropped wholesale; both are already superseded by the pydantic BaseSettings model tree on this branch.

Main-side commits touching vllm/envs.py since the merge base (256dbca..origin/main) and how each was ported:

- ae4f59f (vllm-project#39337): VLLM_USE_V2_MODEL_RUNNER widened from `bool` (default False) to `bool | None` (default None). Already present on the branch as `use_v2_model_runner` on CompilationSettings with a `_parse_use_v2_model_runner` field_validator. Tri-state: unset means "use config default".
- 8a56da3 (vllm-project#42304): adds VLLM_USE_BREAKABLE_CUDAGRAPH. Ported as `use_breakable_cudagraph: bool = False` on CompilationSettings.
- 36e74c9 (vllm-project#42689): adds four KV-connector env vars. Ported on ConnectorSettings as:
  - mooncake_store_tier_log: bool = False
  - mooncake_disk_staging_usable_ratio: float = 0.9
  - preferred_segment: str | None (alias=MOONCAKE_PREFERRED_SEGMENT)
  - requester_local_hostname: str | None (alias=MOONCAKE_REQUESTER_LOCAL_HOSTNAME)

  The last two use `alias=` because they lack the VLLM_ prefix.

Verification:

- grep -n "<<<<<<< |>>>>>>> |=======" vllm/envs.py returns zero hits.
- pre-commit run --files vllm/envs.py passes (ruff, mypy, SPDX, the schema validator that enforces every field has a default and a docstring, etc.).
- Manual override test confirmed pydantic parses both VLLM_-prefixed and unprefixed env vars correctly via the registry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026 +0800 [CI] Fix test_lora_with_spec_decode on V2 model runner (#43314) Signed-off-by: haosdent <haosdent@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> commit fa1ff88b3145d1897558408a9001c030c39383b9 Author: tc-mb <157115220+tc-mb@users.noreply.github.com> Date: Fri May 22 13:44:06 2026 +0800 [Model] Fix MiniCPM-V 4.6 vit_merger qkv weight loading (#43213) Signed-off-by: tc-mb <tianchi_cai@icloud.com> commit e746a2eebf09b1f99beb6b3c60a5ba9d2f8c4875 Author: Furkan F <id+git@yufufi.com> Date: Fri May 22 07:28:23 2026 +0200 [Model] Use `AutoWeightsLoader` for Voyage (#42972) Signed-off-by: Furkan Fidan <dev@yufufi.com> commit 1fe3303983e1829fae25edfb0b93e8cbcfad96e6 Author: haosdent <haosdent@gmail.com> Date: Fri May 22 12:15:22 2026 +0800 [CI] De-flake renderers/test_hf.py::test_resolve_content_format_fallbacks[Qwen/Qwen-VL-string] (#43064) Signed-off-by: haosdent <haosdent@gmail.com> commit 8c8b1825eb26c1ffae776baaab16f2eebf92b7d3 Author: Xiaochang Wu <xiaochang.wu@intel.com> Date: Fri May 22 12:02:51 2026 +0800 [XPU] Enable multiple key kernels for sparse attention (#37888) Signed-off-by: Xiaochang Wu <xiaochang.wu@intel.com> Signed-off-by: Wu, Xiaochang <xiaochang.wu@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> commit 18a27cc9a3641cc1dd3eae5113b75c7ccc029b5f Author: qizixi <22851944+zixi-qi@users.noreply.github.com> Date: Thu May 21 20:36:22 2026 -0700 [Bugfix] Make CuMemAllocator free callback stream-aware (#43020) Signed-off-by: zixi-qi <zixi@inferact.ai> Co-authored-by: Claude <noreply@anthropic.com> commit 0ddd7dd6564f5e403a15bd7c973c7d358ec82454 Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Date: Thu May 21 23:33:16 2026 -0400 [Frontend] DP Supervisor (#40841) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Robert Shaw <robertgshaw2@gmail.com> Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: robertgshaw2-redhat 
<robertgshaw2@gmail.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> commit 60af5c16ee64ea3c1c573d67d0773a713c87a22e Author: ruizhang <rza21.bc@gmail.com> Date: Thu May 21 20:32:31 2026 -0700 [Frontend] Add truncation side to OpenAI endpoints (#43260) Signed-off-by: Rui Zhang <rza21.bc@gmail.com> Signed-off-by: Rui Zhang <rui.zhang@globalrelay.net> Co-authored-by: Rui Zhang <rui.zhang@globalrelay.net> commit 35d0141a0b68a188777e277e372f211098419f58 Author: Divakar Verma <137818590+divakar-amd@users.noreply.github.com> Date: Thu May 21 23:17:54 2026 -0400 [ROCm][CI] add warmup to mem_util test before measurement (#43236) Signed-off-by: Divakar Verma <divakar.verma@amd.com> commit 86ccef7d4400a54441057773d8ffb1f61a20af94 Author: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com> Date: Fri May 22 05:06:40 2026 +0200 [ROCm] Add XGMI backend for MoRI Connector (#41753) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> commit 2998a047aad7d48bf0399f19b36f1a4d749c59c2 Author: Chengze Fan <fancz2002@gmail.com> Date: Thu May 21 19:43:01 2026 -0700 [Bugfix] Fix DSV4 Base model swiglu limit issue in FP8 path (#42855) Signed-off-by: Chengze Fan <chengze@meta.com> Signed-off-by: Chengze Fan <fancz2002@gmail.com> Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com> commit ba369b7eb5a3c6593b55f2005655d6586997fa07 Author: Isotr0py <mozf@mail2.sysu.edu.cn> Date: Fri May 22 10:26:05 2026 +0800 [CI] Fix dockerfile dependency graph failure for pre-commit (#43378) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> commit 39910f2b25aacc09f5e7f166cdf0030b19f8b9e8 Author: Bugen Zhao <i@bugenzhao.com> Date: Fri May 22 08:21:48 2026 +0800 [Rust Frontend] Move code from `vllm-frontend-rs` (#43283) Signed-off-by: Bugen Zhao <i@bugenzhao.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Eric Curtin <eric.curtin@docker.com> Signed-off-by: 
Dev-X25874 <283057883+Dev-X25874@users.noreply.github.com> Signed-off-by: Will.hou <1205157517@qq.com> Signed-off-by: Will.hou <willamhou@ceresman.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Eric Curtin <eric.curtin@docker.com> Co-authored-by: Dev-X25874 <283057883+Dev-X25874@users.noreply.github.com> Co-authored-by: Will.hou <1205157517@qq.com> Co-authored-by: Will.hou <willamhou@ceresman.com> Please see https://github.com/Inferact/vllm-frontend-rs for full original commit history. commit 39d5fa96a7c687f9ed7e14a5a52064965356cede Author: Lanze Liu <86434077+liulanze@users.noreply.github.com> Date: Thu May 21 15:42:42 2026 -0700 [Bugfix] Zero stale is_prefilling in padded CUDA graph rows for Mamba (#41873) Signed-off-by: Lanze Liu <lanzetech@gmail.com> commit 565b745ec5d28dafd14585f1b695b159ba336a04 Author: Nick Hill <nickhill123@gmail.com> Date: Thu May 21 15:42:20 2026 -0700 [BugFix] Use correct logprobs for `logprob_token_ids` (#43125) Signed-off-by: Nick Hill <nickhill123@gmail.com> commit e26e1f09280b6c54e1bc1d1fbc0118f7e309cb10 Author: fangyuchu <fangyuchu@qq.com> Date: Fri May 22 06:42:07 2026 +0800 [Feature] Add `--cpu-distributed-timeout-seconds` CLI Option for CPU Process Group Timeout (#42968) Signed-off-by: fangyuchu <fangyuchu@qq.com> Signed-off-by: zWaNg3 <389750525@qq.com> Co-authored-by: zWaNg3 <389750525@qq.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> commit 0f66623b0d739dc94afddb67863c37d6f5816579 Author: Nick Hill <nickhill123@gmail.com> Date: Thu May 21 15:36:58 2026 -0700 [Frontend] Rework fastokens integration (#43168) Signed-off-by: Nick Hill <nickhill123@gmail.com> commit 0b59fc45dd475f96f6f46f2c3e699d7bc13b3b04 Author: ylangtsou <149562838+ylangtsou@users.noreply.github.com> Date: Fri May 22 06:00:52 2026 +0800 Disable build isolation to bypass CUDA related deps for vllm-tpu (#43038) Signed-off-by: Ylang Tsou <ylangt@google.com> Co-authored-by: Ylang Tsou <ylangt@google.com> 
Co-authored-by: Michael Goin <mgoin64@gmail.com> commit 17b69828a013acb7af0cd1d16d24ecc8d7582094 Author: Zheng Luo <zheluo@nvidia.com> Date: Thu May 21 13:05:01 2026 -0700 [Core] Add native ModelExpress load format (#43105) Signed-off-by: Zheng Luo <zheluo@nvidia.com> Co-authored-by: OpenAI Codex <codex@openai.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> commit b29cbf06525254693f29d98686e038eaf225be8c Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Date: Thu May 21 16:00:29 2026 -0400 [Perf] `zeros` -> `empty` to remove additional fill (#42988) Signed-off-by: yewentao256 <zhyanwentao@126.com> commit 9b54e50e2c1c61ea3b7def032fbafc56dd3179c1 Author: Michael Goin <mgoin64@gmail.com> Date: Thu May 21 15:51:12 2026 -0400 [Deprecation] Mark env vars covered by --moe-backend / --linear-backend (#43148) Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: Michael Goin <mgoin64@gmail.com> commit 1c78f76c29a642379ad0ec953a77af9bc44376b6 Author: anish <145943060+anishesg@users.noreply.github.com> Date: Thu May 21 11:07:46 2026 -0400 [Bugfix] Add early validation to reject incompatible runner types for embedding models (#43079) Signed-off-by: anish <anishesg@users.noreply.github.com> Signed-off-by: Your Name <ak8686@princeton.edu> Signed-off-by: anish <145943060+anishesg@users.noreply.github.com> Co-authored-by: anish <anishesg@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> commit 9b9d5dbaab852a1c615fe83a7f92881d353503db Author: haosdent <haosdent@gmail.com> Date: Thu May 21 22:28:34 2026 +0800 [CI] Fix CPU tests failing on `tl.exp2` import (#43311) Signed-off-by: haosdent <haosdent@gmail.com> commit b730c4635288d75da4788bc28d8d26b5e5c3726c Author: Francesco Fusco <ffu@zurich.ibm.com> Date: Thu May 21 13:50:54 2026 +0200 [Perf] [Hybrid] Fused Triton kernel for GPU-side Mamba state postprocessing (#40172) 
Signed-off-by: Francesco Fusco <ffu@zurich.ibm.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> commit c68c55d43e504745dbfc2d46b552e80acb74d4b9 Author: velonica0 <47554626+velonica0@users.noreply.github.com> Date: Thu May 21 19:50:49 2026 +0800 [CPU][RISC-V] Add VLEN=256 support to RVV attention kernels (#42943) Signed-off-by: velonica0 <like@mail.nankai.edu.cn> Signed-off-by: velonica0 <47554626+velonica0@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Li, Jiang <jiang1.li@intel.com> commit 5ecd8e9c708821916323d25d5f7beddb7f41d22b Author: xiangdong <40376367+zxd1997066@users.noreply.github.com> Date: Thu May 21 18:41:38 2026 +0800 [XPU][CI]Fix Docker image pull-to-run race in Intel GPU CI (#43266) Signed-off-by: zengxian <xiangdong.zeng@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> commit caf69823d61119ac3f4b066f20a910b62078e41c Author: haosdent <haosdent@gmail.com> Date: Thu May 21 18:38:07 2026 +0800 [CI] Pin protoc binary in rust-build stages (#43292) Signed-off-by: haosdent <haosdent@gmail.com> commit 68e07d59161a8d268b773c181fab17994a7c5d0a Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Date: Thu May 21 04:58:09 2026 -0400 [Bug] Fix ci issue `assert output_size is not None` AssertionError (#43261) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Isotr0py <Isotr0py@outlook.com> Co-authored-by: Isotr0py <Isotr0py@outlook.com> commit ebbfb34e3e058bd539db9e5015d0c18b7ce5a5e0 Author: Kevin H. 
Luu <khluu000@gmail.com> Date: Thu May 21 01:57:47 2026 -0700 [Test] Replace zephyr-7b-beta (7B) with SmolLM2-135M in tokenization test (#43085) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> commit edafea35550fab0b185b885711ec048dfd2e1a4d Author: zhangxin81 <115389973+zhangxin81@users.noreply.github.com> Date: Thu May 21 16:17:12 2026 +0800 Fix FlashInfer TRTLLM NvFP4 monolithic MoE routing (#43223) Signed-off-by: zhangxin81 <115389973+zhangxin81@users.noreply.github.com> commit b719b1635b4899e2372905def0badf96d4dd242a Author: zexplorerhj <zhjoneson@163.com> Date: Thu May 21 16:16:27 2026 +0800 Update KDA chunk prefill decay to use exp2 semantics (#43195) Signed-off-by: zexplorerhj <19794632+zexplorerhj@users.noreply.github.com> Co-authored-by: zexplorerhj <19794632+zexplorerhj@users.noreply.github.com> commit 0a54df28471be07b3d668ea21c5e411569d3baea Author: Kunshang Ji <kunshang.ji@intel.com> Date: Thu May 21 07:14:13 2026 +0000 [XPU] add setuptools-rust for xpu dependency (#43287) Signed-off-by: Kunshang Ji <kunshang.ji@intel.com> commit a950e9447e38727fc956afdc242bc6e3796ccb77 Author: haosdent <haosdent@gmail.com> Date: Thu May 21 14:30:14 2026 +0800 [CI] De-flake test_models for bigscience/bloom-560m (#43197) Signed-off-by: haosdent <haosdent@gmail.com> commit 050611a3dd19271a3c729788ff69b3470ccfb238 Author: Yiyang "Ian" Liu <yiyangliu@microsoft.com> Date: Wed May 20 22:58:59 2026 -0700 [Bugfix] Fix glm4_moe_tool_parser._is_string_type for /v1/responses FunctionTool format (#39601) Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com> Signed-off-by: Chauncey <chaunceyjiang@gmail.com> Signed-off-by: sfeng33 <4florafeng@gmail.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> Co-authored-by: sfeng33 <4florafeng@gmail.com> commit 905b97adfaf7b08f3cc95b328579e5336ed6d3b6 Author: yzong-rh <yzong@redhat.com> Date: Thu May 21 01:13:15 2026 -0400 [Benchmark] 
Add num-warmup to vllm bench throughput (#43245) Signed-off-by: Yifan Zong <yzong@redhat.com> commit a6682d1d259cca69a9ae737ea5608fbbe7520031 Author: Daoyuan Li <94409450+DaoyuanLi2816@users.noreply.github.com> Date: Wed May 20 21:35:08 2026 -0700 [Bugfix] Warn when renderer_num_workers has no effect on offline LLM (#42905) Signed-off-by: Daoyuan Li <94409450+DaoyuanLi2816@users.noreply.github.com> commit f2ace1d57d28df8d4c5e973dd62d87f47d628cb3 Author: Nick Hill <nickhill123@gmail.com> Date: Wed May 20 21:24:48 2026 -0700 [Frontend][RFC] Rust front-end integration (#40848) Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Bugen Zhao <i@bugenzhao.com> Co-authored-by: Bugen Zhao <i@bugenzhao.com> commit d97ba29fdcf2538359fac5c644c0f07e59bc1988 Author: 손세정 <maze0717@g.skku.edu> Date: Thu May 21 13:24:08 2026 +0900 [ToolParser][Bugfix] Re-land: Fix anyOf/oneOf/$ref type resolution in Qwen3CoderToolParser (#37831) (#38973) Signed-off-by: AAISSJ <maze0717@g.skku.edu> Signed-off-by: <> Signed-off-by: sejung-son <sejung.son@nhn.com> Signed-off-by: sfeng33 <4florafeng@gmail.com> Co-authored-by: 세덩 <saison@sedeong-ui-MacBookAir.local> Co-authored-by: sejung-son <sejung.son@nhn.com> Co-authored-by: sfeng33 <4florafeng@gmail.com> commit 6441cf4a44856f4eb4dce7d19a51fd69e1b423cf Author: Flora Feng <4florafeng@gmail.com> Date: Thu May 21 00:24:06 2026 -0400 [Refactor] Use shared coerce_to_schema_type in Seed-OSS tool parser (#43140) Signed-off-by: sfeng33 <4florafeng@gmail.com> commit 346cf163a11b55e069aa3143ae2878967393ddc2 Author: Ben Browning <bbrownin@redhat.com> Date: Thu May 21 00:23:47 2026 -0400 [Frontend] Normalize reasoning_content to reasoning for client compatibility (#42664) Signed-off-by: Ben Browning <bbrownin@redhat.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> commit 7e5070934ee5f28103c5b95cb776904a12fc36f5 Author: haosdent <haosdent@gmail.com> Date: Thu May 21 12:22:10 2026 +0800 [CI] Fix 
"test_vit_cudagraph_[image|video][step3_vl]" failure (#43082) Signed-off-by: haosdent <haosdent@gmail.com> commit 2b75a73b8e23f5df6de92d01a191e059424487e3 Author: Luciano Martins <22145370+lucianommartins@users.noreply.github.com> Date: Thu May 21 01:22:06 2026 -0300 [Perf][Gemma4] Batch vision encoder calls for image and video processing (#43169) Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com> commit e45df8c3f77572d03f638feded5b5efbccdbcc05 Author: sonusflow <git@sonusflow.pl> Date: Thu May 21 06:22:01 2026 +0200 [Bugfix] Fix Qwen3.5 GatedDeltaNet in_proj_ba Marlin failure at TP>=2 (#36329) Signed-off-by: Adi McM Sonus Flow <biuro@sonusflow.pl> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> commit ee05e8137ec48b8e7375228a1142b4c5f2e3360c Author: Jee Jee Li <pandaleefree@gmail.com> Date: Thu May 21 12:20:57 2026 +0800 [Minor] Bigger overlap for FI AR (#43103) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> commit 5d041cc1fe5181daabf39943efc7b678380d57bd Author: Louie Tsai <louie.tsai@intel.com> Date: Wed May 20 20:57:48 2026 -0700 update GPU json file based on h200 recipes (#43262) Signed-off-by: louie-tsai <louie.tsai@intel.com> commit 9640970de20b15ade9eb3859825637f64e81ed8c Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Date: Wed May 20 21:00:30 2026 -0400 [Model Runner V2] Fix lora `Triton Error [CUDA]: device-side assert triggered` (#43139) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> commit 63ea11709bd9e9b14669e3973dff92d2dcea3cb1 Author: Ace Eldeib <alexeldeib@gmail.com> Date: Thu May 21 02:36:16 2026 +0200 [CI] Add composed-schema regression tests for DeepSeek V3.2/V4 parsers (#43255) 
Signed-off-by: Ace Eldeib <aeldeib@coreweave.com> Co-authored-by: Flora Feng <4florafeng@gmail.com> commit bde560ed6e1dc889debf68410ccbcb00b749513b Author: akii96 <aakif.nawaz@amd.com> Date: Thu May 21 01:46:51 2026 +0300 [ROCm] Add QuickReduce min-size override and codec threshold (#41675) Signed-off-by: <> commit 6dc0a71843878ef45e29d4732147290b797b70fd Author: Jiangyun Zhu <riverclouds.zhu@qq.com> Date: Thu May 21 05:19:50 2026 +0800 [Misc] downgrade nvidia-cutlass-dsl to 4.5.0 (#43230) Signed-off-by: zjy0516 <riverclouds.zhu@qq.com> commit 5774aad9c5b67c5bb67bb7d306a9652a035ed0aa Author: Michael Goin <mgoin64@gmail.com> Date: Wed May 20 17:13:12 2026 -0400 [Perf][gpt-oss] Downgrade triton_kernels to v3.5.1 (#43135) Signed-off-by: mgoin <mgoin64@gmail.com> commit 452baa860b1169787cc8540a1772c4d96f682c40 Author: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com> Date: Wed May 20 16:10:44 2026 -0500 Add dllehr-amd to CODEOWNERS and committers list (#42772) Signed-off-by: Douglas Lehr <Doug.Lehr@amd.com> commit 2a43b407c5093b1255a172139da6a5151f410b7a Author: Flora Feng <4florafeng@gmail.com> Date: Wed May 20 14:59:12 2026 -0400 [Bugfix][CI] Add missing import of pad_nvfp4_activation_for_cutlass in flashinfer (#43237) Signed-off-by: sfeng33 <4florafeng@gmail.com> commit 53ff50fcd3d2012a406e5053026ea6a46c88b2b6 Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Date: Wed May 20 14:57:42 2026 -0400 [Perf] Optimize `CutlassFP8ScaledMMLinearKernel` when padding needed by pre-weight processing, 13.5% TTFT improvement (#42651) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> commit 363fc84407f8c966c1cee6786e45e9e6ab289684 Author: meena-at-work <80416898+meena-at-work@users.noreply.github.com> Date: Wed May 20 10:21:11 2026 -0700 Integrate flashinfer b12x MoE and FP4 GEMM kernels for SM120/121 (#40082) Signed-off-by: 
Meenakshi Venkataraman <meenakshiv@nvidia.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> commit f2d5e3d3aeac4cb1f6d285e4a567a502ae507777 Author: haosdent <haosdent@gmail.com> Date: Thu May 21 01:00:24 2026 +0800 [CI] Lower granite-4.0-h-tiny gsm8k threshold for Hybrid SSM NixlConnector PD accuracy tests (4 GPUs) (#43186) Signed-off-by: haosdent <haosdent@gmail.com> Signed-off-by: NickLucche <nlucches@redhat.com> Co-authored-by: NickLucche <nlucches@redhat.com> commit 2d6b3489b9a325988ad52507236409747d2098a7 Author: Aaron Hao <ahao@anyscale.com> Date: Wed May 20 09:07:59 2026 -0700 [R3] Add routed experts to openai entrypoint (#38939) Signed-off-by: ahao-anyscale <ahao@anyscale.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> commit 9c78c99995b70726f9ea929ff2e535d6303383d6 Author: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Date: Wed May 20 19:50:24 2026 +0400 [MISC] Fix symm_mem cap-equal gate; log AR backend selection (#42993) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> commit a10d69116cb25c8137eeb3f320add71d4e04fda9 Author: Flora Feng <4florafeng@gmail.com> Date: Wed May 20 10:21:00 2026 -0400 [Bugfix] Use shared coerce_to_schema_type in DeepSeekV32 tool parser (#43019) Signed-off-by: sfeng33 <4florafeng@gmail.com> commit 644b2a28e7eb3b11191f157416cfedebd2da995b Author: Joel Smith <j.smith9103@outlook.com> Date: Wed May 20 15:10:01 2026 +0100 [Bugfix] Use enable_sm120_family for per-tensor FP8 CUTLASS kernels on SM12.1 (#41215) Signed-off-by: j9smith <j.smith9103@outlook.com> Signed-off-by: Joel Smith <j.smith9103@outlook.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> commit ded871201a424dd0d28a00aaf74c5786457a18ee Author: rishitdholakia13 <123388671+rishitdholakia13@users.noreply.github.com> Date: Wed May 20 10:08:58 2026 -0400 [Bug][Structured Outputs] Fix bug that leads to unconstrained generations with structural tags (#42452) Signed-off-by: rishitdholakia13 <rishit+github@cohere.com> 
Co-authored-by: Cursor <cursoragent@cursor.com> commit df84fb07a6e57969941841c6363d1efbac1ba1e8 Author: Dipika Sikka <dipikasikka1@gmail.com> Date: Wed May 20 10:01:45 2026 -0400 Remove additional dead code as a follow-up to #42889 (#43144) Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com> commit 0a508743d42a26786c1432bb7f2e93f8111b6383 Author: Benjamin Chislett <bchislett@nvidia.com> Date: Wed May 20 09:15:52 2026 -0400 [Spec Decode] Support non-MTP speculation for NemotronH (#43130) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> commit 19cf334207ed81d3ed75a473acd1a95c785d9ed3 Author: Kebe <mail@kebe7jun.com> Date: Wed May 20 21:58:30 2026 +0900 [Feature] Support manually enabling the cumem allocator (#33648) Signed-off-by: Kebe <mail@kebe7jun.com> commit 87e31455b056c6ce59bf5dcb3c622155431851db Author: Ray Wang <roguerui6@gmail.com> Date: Wed May 20 02:32:03 2026 -0700 [Doc] Sync CLI guide with actual help modes and launch subcommand (#40326) Signed-off-by: Rui Wang <raygorous@gmail.com> Co-authored-by: Rui Wang <raygorous@gmail.com> commit cb600d1cdbb079ab9432348f128e71c4e2e0a373 Author: hallerite <git@hallerite.com> Date: Wed May 20 10:58:46 2026 +0200 [Frontend] Forward X-data-parallel-rank header on /inference/v1/generate (#42330) Signed-off-by: hallerite <git@hallerite.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> commit 6f21558da1ec7362d2b4f3d012bce2b612a74459 Author: xiangdong <40376367+zxd1997066@users.noreply.github.com> Date: Wed May 20 16:54:58 2026 +0800 [XPU][CI] Add 2 server model test files in Intel GPU CI (#42499) Signed-off-by: zengxian <xiangdong.zeng@intel.com> commit 1cb224430bea0d037b57e24cf91001f47b69ddf3 Author: Artem Perevedentsev <aperevedents@nvidia.com> Date: Wed May 20 11:46:55 2026 +0300 [GDN] Enable FI Blackwell GDN prefill kernel (#40717) Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com> commit 9b343dd4f54a9870f3ba1e41f5a5b3f4a1e25340 Author: Harry Mellor 
<19981378+hmellor@users.noreply.github.com> Date: Wed May 20 17:10:00 2026 +0900 Enable mermaid diagrams in the docs (#43192) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> commit 07aeaf9d4df870a76d5a0dc19d6a7e74b4be5d3b Author: Chris Leonard <chleonar@redhat.com> Date: Wed May 20 03:18:12 2026 -0400 [6/n] Migrate activation kernels, gptq, gguf, non cutlass w8a8 to libtorch stable ABI (continued) (#42663) Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> commit 40651c020772b80f9ca80272aebe749fe01cd38a Author: Nicolò Lucchesi <nlucches@redhat.com> Date: Wed May 20 09:02:36 2026 +0200 [Docs][PD][NIXL] Bidirectional kv-cache transfer (#43097) Signed-off-by: NickLucche <nlucches@redhat.com> commit 7e4bc2cecb3a8aede2d10c86a3a1a4bd98e26100 Author: Nicolò Lucchesi <nlucches@redhat.com> Date: Wed May 20 08:58:25 2026 +0200 [Docs][PD][NIXL] Lease extension mechanism for blocks on P (#43099) Signed-off-by: NickLucche <nlucches@redhat.com> commit 85959567c3e71a9965616ebebe1853ca48d8d20f Author: Kevin H. Luu <khluu000@gmail.com> Date: Tue May 19 23:01:41 2026 -0700 [ci] Revert model executor test back to L4 (#43188) Signed-off-by: Kevin H. 
Luu <khluu000@gmail.com> commit 4f940896a32c9e2a0eba7f50d521bf5f6b4de458 Author: Ronen Schaffer <ronen.schaffer@ibm.com> Date: Wed May 20 06:32:08 2026 +0300 [KV Offload] Pass `OffloadingSpec` instead of `VllmConfig` to secondary tiers (#43076) Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com> commit cd0ff26e7acf2c691a33d4c44276db6980bab24b Author: Michael Goin <mgoin64@gmail.com> Date: Tue May 19 23:21:01 2026 -0400 [CI] Add DSV4-Flash to gsm8k moe-refactor/config-b200.txt (#42111) Signed-off-by: mgoin <mgoin64@gmail.com> commit 2ae910ed88121d7c3acdcb9bab14cd968257b6e6 Author: Izik Golan <47969623+izikgo@users.noreply.github.com> Date: Wed May 20 06:16:07 2026 +0300 [Perf] Avoid forward scan for async output placeholders (#42938) commit fadf5d332c6e9bb6e552c1ca529511bce0f79802 Author: pmaybank <113125070+pmaybank@users.noreply.github.com> Date: Tue May 19 23:16:02 2026 -0400 add enqueue all option to throughput benchmark (#42975) Signed-off-by: Philip Maybank <pmaybank@amd.com> Signed-off-by: pmaybank <113125070+pmaybank@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> commit c628a93a64fb4929c3c11d8e2c7244c4826b4f76 Author: Benjamin Chislett <bchislett@nvidia.com> Date: Tue May 19 23:15:57 2026 -0400 [Perf][Bugfix] Update dflash aux layer indexing (#40727) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> commit 5774aaed0cbeaa74ca7a75d372c1e8bd4aa11cdb Author: Terrence Zhao <32208165+Terrencezzj@users.noreply.github.com> Date: Tue May 19 22:32:06 2026 -0400 [Cohere] Enable Cohere MoE (#43143) Signed-off-by: Terrencezzj <terrence@cohere.ai> commit 39bba710bed5b6018718af3e0fd7984f6082118e Author: Nick Hill <nickhill123@gmail.com> Date: Tue May 19 19:19:05 2026 -0700 [MRV2][BugFix] Fix default-stream CG capture in P/W LoRA case (#43160) Signed-off-by: Nick Hill <nickhill123@gmail.com> commit 
73dd2f33b7a5a8a237fe7296039cec246e4c68bd Author: Aaron Hao <ahao@anyscale.com> Date: Tue May 19 18:01:29 2026 -0700 [bug] fix WeightTransferConfig.backend to allow for all strings (#43121) Signed-off-by: ahao-anyscale <ahao@anyscale.com> commit be16785998087f80ffac08b980603241e5da16ab Author: Fadi Arafeh <115173828+fadara01@users.noreply.github.com> Date: Wed May 20 00:31:15 2026 +0100 [CPU][DOC] Fix installation commands for Arm CPUs (#43115) Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com> commit 117afeea4665367a3066c1df58d4082d07fcc946 Author: Max de Bayser <mbayser@br.ibm.com> Date: Tue May 19 17:27:54 2026 -0400 Fix error in Dynamic NTK scaling (#41277) Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Signed-off-by: Max de Bayser <maxdebayser@gmail.com> Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io> commit 12421962955ac28b6f80a0307f554fad939174dd Author: Doğaç Eldenk <dogacel@gmail.com> Date: Tue May 19 15:39:00 2026 -0500 [Model] Support post-norm architecture for EAGLE-3 supeculators (#42764) Signed-off-by: Doğaç Eldenk <dogacel@gmail.com> commit a65093c1a39a8ddd8455365128ecbe259350e22c Author: Kevin H. Luu <khluu000@gmail.com> Date: Tue May 19 11:51:34 2026 -0700 [ci] Move language models tests (hybrid) back to L4 (#43129) Signed-off-by: Kevin H. 
Luu <khluu000@gmail.com> commit 9aaf83ef502fc37bc647f6e474314d48ba36cd1c Author: Wei Zhao <51183510+wzhao18@users.noreply.github.com> Date: Tue May 19 14:44:32 2026 -0400 [CI failure] Temporarily disable using persistent cache for flashinfer autotune (#43119) Signed-off-by: wzhao18 <wzhao18.sz@gmail.com> Signed-off-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> commit f54721bcc3e072d71b0e09c0b0bd6d692eb06161 Author: tomeras91 <57313761+tomeras91@users.noreply.github.com> Date: Tue May 19 21:43:04 2026 +0300 [Bugfix][MoE] FlashInfer one-sided: workspace union across heterogeneous layers (#42976) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com> commit aed2eb355a9d9136c8e17690b932983b55fb343f Author: Dao007forever <dao007forever@gmail.com> Date: Tue May 19 11:14:43 2026 -0700 [Docs] Fix MooncakeStoreConnector role in disaggregated example (#42994) Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> commit d247a931cc25e7253feccbd6260d48216ff5c081 Author: Dom Brown <3886319+DomBrown@users.noreply.github.com> Date: Tue May 19 17:02:05 2026 +0100 [feat] Add FP8 per-tensor Q scale support to Triton attention backend (#42080) Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> commit 8200fbe1ac73f00a46b1cdd6c4c93bdaf2c33022 Author: Jinzhen Lin <jinzhen.ljz@antgroup.com> Date: Tue May 19 23:36:47 2026 +0800 [Misc] add humming to dependencies (#42540) Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com> commit 42b4f1fdf7269de8aa83755a805555fe78add28b Author: Flora Feng <4florafeng@gmail.com> Date: Tue May 19 11:21:12 2026 -0400 [Refactor] Extract extract_types_from_schema utility from Minimax M2 tool parser (#43025) Signed-off-by: sfeng33 <4florafeng@gmail.com> commit 1c6158083a6fc3aff408660d2defd7602f78f556 Author: Wang Yiwen <121547057+yiwen101@users.noreply.github.com> Date: Tue 
May 19 23:17:42 2026 +0800 [Model] Openvla support (#42654) Signed-off-by: Wang Yiwen <121547057+yiwen101@users.noreply.github.com> commit d740e2c02919cfba5a86a40d1c12439d03f5ac07 Author: Xinyu Chen <xinyu1.chen@intel.com> Date: Tue May 19 23:09:07 2026 +0800 [XPU] update xpu graph usage (#43043) Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com> commit b82e908b4c65a1f162e2d35a8106f09d95d8aa02 Author: Nick Hill <nickhill123@gmail.com> Date: Tue May 19 07:35:54 2026 -0700 [Perf][4/n] Eliminate various GPU<->CPU syncs (#42347) Signed-off-by: Nick Hill <nickhill123@gmail.com> commit a78b842d0e85d287176031334f4721cd96b6e47d Author: Sage <80211083+sagearc@users.noreply.github.com> Date: Tue May 19 13:21:49 2026 +0300 [Bugfix] Fix top logprobs token placeholders in `/inference/v1/generate` (#42887) Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> commit 129019f3342f1b7346ed8f4c1ac9fdefd8fe6ef8 Author: zhanqiuhu <49648934+ZhanqiuHu@users.noreply.github.com> Date: Tue May 19 05:44:33 2026 -0400 [CI] Add MTP + PD disagg test for Qwen3.5 (#42677) Signed-off-by: ZhanqiuHu <zhu@redhat.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> commit ef54a4d604ef3725bd52aa2893f71d671bf5329a Author: Shanshan Shen <467638484@qq.com> Date: Tue May 19 16:43:16 2026 +0800 [Misc][MM] Remove redundant code in CLIPAttention (#43046) Signed-off-by: shen-shanshan <467638484@qq.com> commit 07beaed8422d2df34a20e8ebd22b7924d563a566 Author: Woosuk Kwon <woosuk.kwon@berkeley.edu> Date: Tue May 19 01:12:46 2026 -0700 [Model Refactoring] Rename deepseek_v4.py to model.py [4/N] (#43077) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> commit 056bc2e16646599a96ac94e761c953e680e6fba9 Author: Yifan Qiao <yifanqiao@inferact.ai> Date: Tue May 19 01:07:46 2026 -0700 [KVConnector][DSV4] HMA support for Mooncake store connector (#42828) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> commit f34623bf3cac5b33451a761e802c9531e83d1c68 Author: Aaron Hao <ahao@anyscale.com> Date: Tue May 19 01:06:21 2026 -0700 
[bug] AsyncScheduler drops first post-resume token after pause_generation + clear_cache (#42117) Signed-off-by: hao-aaron <ahao@anyscale.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> commit b14be81c1f63b70668d26d65a377b6383fbca936 Author: Woosuk Kwon <woosuk.kwon@berkeley.edu> Date: Tue May 19 00:52:54 2026 -0700 [Model Refactoring] Move deepseek_v4_ops to models/deepseek_v4 [3/N] (#43073) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> commit 301d986473a0ffc1df563422e01eac4a1efd59e0 Author: wang.yuqi <yuqi.wang@daocloud.io> Date: Tue May 19 15:37:40 2026 +0800 [Frontend] Consolidate beam search by BeamSearchMixin. (#42946) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> commit 257af77bc2b612d5ebd0aecea777139036543af3 Author: wang.yuqi <yuqi.wang@daocloud.io> Date: Tue May 19 14:43:18 2026 +0800 [Docs] Reorganize online serving docs. (#41907) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> commit 4a4fdabe28f3e2c8f9d05bcc80c4bf6d656b1ead Author: Taneem Ibrahim <taneem.ibrahim@gmail.com> Date: Tue May 19 01:16:42 2026 -0500 [Misc] Aligning tokwise pooler heads for consistency (#43041) Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> commit f1e3f0e6d685082bdb313c20914099ac5ede5f14 Author: Chaojun Zhang <chaojun.zhang@intel.com> Date: Tue May 19 14:14:59 2026 +0800 [XPU] Use custom op collective behavior (#41354) Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> commit 9fd8487d2f56468aeec8154123641eb7c2eeacdf Author: Gracie Guo (UX) <114208705+gracie-guo@users.noreply.github.com> Date: Tue May 19 13:50:38 2026 +0800 [Docs] 
Add SVG images for pooling models. (#42626) Signed-off-by: Gracie Guo <gracieguo@Gracies-MacBook-Pro.local> Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Co-authored-by: Gracie Guo <gracieguo@Gracies-MacBook-Pro.local> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io> commit 27f4ba94811ef14bd45bcdc0c0b8e288a7cc6bc6 Author: Junyan Xu <junyanxu5513@gmail.com> Date: Mon May 18 22:29:04 2026 -0700 fix: use keyword arguments for shard_id and expert_id in weight_loade… (#42671) Signed-off-by: junyanxu <junyanxu5513@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> commit 6e889b582b6a0b11f22b3764be174266faa9ff5e Author: Kevin H. Luu <khluu000@gmail.com> Date: Mon May 18 21:58:36 2026 -0700 [ci] Route 28 gpu_1_queue tests to h200_35gb queue (#43030) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> commit fab07e4d0f7f266643c6ac0dc944f9f433ef2140 Author: Qiuyang Yue <yueqiuyang1389@gmail.com> Date: Mon May 18 21:22:33 2026 -0700 [Bugfix][KV Connector] Fix SimpleCPUOffloadScheduler TOCTOU between Phase A and Phase B (#42289) Signed-off-by: Qiuyang Yue <yueqiuyang1389@gmail.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: gemini-code-assist <noreply@google.com> commit 3ca8db2ef88ec5a6686e62ee3ac899afae85c7af Author: gnovack <gnovack@amazon.com> Date: Mon May 18 21:17:56 2026 -0700 add cutedsl dsv4 indexer fp8 kernel (#42899) Signed-off-by: george <george@inferact.ai> Co-authored-by: george <george@inferact.ai> commit 87b08c5f6460cf487e47872c5fbc2595c97e74ef Author: Woosuk Kwon <woosuk.kwon@berkeley.edu> Date: Mon May 18 21:00:58 2026 -0700 [Model Refactoring] Move DeepSeek V4 layers to `models/deepseek_v4/` [2/N] (#43039) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> commit fba010dd74e2f94e4f7223b164ec9097d1b8a6af Author: Nicolò Lucchesi <nlucches@redhat.com> Date: Tue May 19 05:25:41 2026 +0200 [Bugfix][MRV2] Fix KVCache tensor explicit `kernel_block_size` dim (#42766) Signed-off-by: 
NickLucche <nlucches@redhat.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> commit da03e549b34685c4e63a091e973d907aee48a68c Author: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> Date: Tue May 19 11:25:37 2026 +0800 [UX] Add a persistent cache for FlashInfer autotuning (#42537) Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> commit 36dcaf25d8e091ea0f47b9ce7dcfca05de56f16d Author: Kunshang Ji <kunshang.ji@intel.com> Date: Tue May 19 03:17:09 2026 +0000 [XPU] add gptq(int4) support (#37844) Signed-off-by: Kunshang Ji <kunshang.ji@intel.com> commit 8f16c4a5c0feb01f106e5981f22ae8808a94a28b Author: Ofir Zafrir <ofir.zafrir@intel.com> Date: Tue May 19 06:16:07 2026 +0300 [BugFix][CPU][Spec Decode] Fix Eagle implementation on CPU backend (#42468) Signed-off-by: Ofir Zafrir <ofir.zafrir@intel.com> commit afd7b1dce94fed484351fafd5bf5ea6601ac621e Author: Revital Sur <eres@il.ibm.com> Date: Tue May 19 06:12:04 2026 +0300 [Bugfix] Use platform-agnostic device in example_connector load (#42926) Signed-off-by: Revital Sur <eres@il.ibm.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> commit 287471b99442b44c5a16c4d70b0f3e178dd52732 Author: Woosuk Kwon <woosuk.kwon@berkeley.edu> Date: Mon May 18 19:50:02 2026 -0700 [Model Refactoring] Migrate DeepSeek V4 to vllm/models/ [1/N] (#43004) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> commit 239b5ff30cf46f9196149c888a20be2096fdff03 Author: Michael Goin <mgoin64@gmail.com> Date: Mon May 18 20:22:27 2026 -0400 [Frontend] Add --spec-method/--spec-model/--spec-tokens CLI aliases (#42476) Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> commit f85c76d701fc049a722c17b3affd9401380be1bf Author: Artem Perevedentsev <aperevedents@nvidia.com> Date: Tue May 19 02:58:15 2026 +0300 [CI/Build] Bump nvidia-cutlass-dsl to 4.5.1 (#42991) 
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com> commit a171e6b52dff47dc567657e7d51f641bdcb22774 Author: shanjiaz <zsjwpianpian@gmail.com> Date: Mon May 18 19:39:09 2026 -0400 Add parallel drafting to v2 model runner unsupported features (#43010) Signed-off-by: shanjiaz <zsjwpianpian@gmail.com> commit 37ece593c105b5bb818aa94885617b863d390d7f Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Date: Mon May 18 19:38:12 2026 -0400 [Perf] Padded nvfp4 quant kernel to remove additional copy, 2.4%~5.7% e2e performance improvement (#42774) Signed-off-by: yewentao256 <zhyanwentao@126.com> commit 57fef4e0bf0bfaddf117dfdc9367e1fb957b423f Author: Flora Feng <4florafeng@gmail.com> Date: Mon May 18 17:55:39 2026 -0400 [Refactor] Extract shared coerce_to_schema_type utility from Minimax M2 tool parser (#43006) Signed-off-by: sfeng33 <4florafeng@gmail.com> commit 0191354827560fe38f68b4e7207f8824d6152ca3 Author: haosdent <haosdent@gmail.com> Date: Tue May 19 05:29:10 2026 +0800 [Perf][MLA] Enable FULL cudagraph capture for TRITON_MLA decode (#42885) Signed-off-by: haosdent <haosdent@gmail.com> commit cd49a05d5aa3cc296912297b3c2b577efe4183c8 Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Date: Mon May 18 16:41:22 2026 -0400 [Refactor] Remove dead code (#42889) Signed-off-by: yewentao256 <zhyanwentao@126.com> commit 84747489ded65265ee7d43815bfa3373b0d42279 Author: Ronen Schaffer <ronen.schaffer@ibm.com> Date: Mon May 18 22:41:58 2026 +0300 Tier offload followup (#42529) Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com> commit 8fc1c284b94668b60c30737e178cb7e6cd651e89 Author: Tuukka Sarvi <tuukka.sarvi@amd.com> Date: Mon May 18 21:56:22 2026 +0300 [ROCm] Guard AITER GDN decode fast path by layout (#42880) Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com> commit ce88f01c9ac4fcde9dd43a983074d4e893cde65d Author: Amit Portnoy <1131991+amitport@users.noreply.github.com> Date: Mon May 18 21:22:56 2026 +0300 [Docs] update attribution to 
reflect EDEN foundation (#41666) Signed-off-by: amitport <1131991+amitport@users.noreply.github.com> commit 00e20e76f775b88f47469ae9fcb0f1ecd7580bb9 Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Date: Mon May 18 14:14:21 2026 -0400 [Refactor] Remove dead cuda kernels (#42767) Signed-off-by: yewentao256 <zhyanwentao@126.com> commit 9758a6e5c5a556275c030db456d5d434ee999d58 Author: czhu-cohere <conway.zhu@cohere.com> Date: Mon May 18 11:12:06 2026 -0700 [BugFix] support PP for Cohere vision model (#42819) Signed-off-by: <conway.zhu@cohere.com> Signed-off-by: root <conway.zhu@cohere.com> commit a2c8fc66573664395f491a94da1882fdf92e034b Author: Bowen Bao <bowenbao@amd.com> Date: Mon May 18 10:46:13 2026 -0700 [ROCm][Quantization][3/N] Refactor quark_moe w4a4 w/ oracle (#41436) Signed-off-by: Bowen Bao <bowenbao@amd.com> commit 6859ca76159fdd403b687c0c296e5a12850ba24e Author: Jinzhen Lin <jinzhen.ljz@antgroup.com> Date: Tue May 19 01:32:26 2026 +0800 [Bugfix] fix swiglu limit issue for humming backend + deepseek v4 (#42541) Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> commit 67f58ce23f469e118688a50687ef0fbb14a1c028 Author: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> Date: Tue May 19 01:02:01 2026 +0800 [Bugfix] Fix DSV4 MTP after ROCm mHC integration (#42930) Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> commit 8c296de63b47664fc5979831e1ae2d2a14a05b1a Author: Wei Zhao <51183510+wzhao18@users.noreply.github.com> Date: Mon May 18 12:12:27 2026 -0400 [Perf] Re-enable flashinfer autotune by default and cleanup (#42857) Signed-off-by: wzhao18 <wzhao18.sz@gmail.com> commit b12745e4f31ffacf401cc20a97c592d6a49f3269 Author: Harry Mellor <19981378+hmellor@users.noreply.github.com> Date: Tue May 19 00:56:09 2026 +0900 Fix `--convert` passed without `--runner` on causal models (#42935) Signed-off-by: Harry Mellor 
<19981378+hmellor@users.noreply.github.com> commit e26736973a1981dbb4054dc1ac430e78d8006ef2 Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Date: Mon May 18 11:27:21 2026 -0400 [Model Runner V2] Fix prompt logprobs calculation `Sizes of tensors must match` error (#42778) Signed-off-by: yewentao256 <zhyanwentao@126.com> commit 47829b1159335a010521ea3e5361d51744a36b0a Author: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> Date: Mon May 18 18:26:00 2026 +0300 [Bugfix] mamba: run single-token extends as decodes (#42430) Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> commit 4a39b4f55374d48ebaa2ca02312e24639db8e0b8 Author: Blanc Swan <85233612+blancsw@users.noreply.github.com> Date: Mon May 18 17:20:04 2026 +0200 [Model] Add Apertus Tool Parser (#41154) Signed-off-by: Blanc <swan.blanc@infomaniak.com> commit 78e7a7b9b0b9c285bf6978c3fc09eeecea3ff230 Author: Siddharth Bedekar <104613085+bedeks@users.noreply.github.com> Date: Mon May 18 08:02:43 2026 -0700 Refactor AWQ Marlin MoE onto modular WNA16 oracle (#42483) Signed-off-by: Siddharth Bedekar <bedeksid@gmail.com> Signed-off-by: Siddharth Bedekar <104613085+bedeks@users.noreply.github.com> Co-authored-by: Robert Shaw <robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: OpenAI Codex <codex@openai.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> commit f5d3dc7115cf77472ba5e274f6becbbeddbf4bd5 Author: Michael Goin <mgoin64@gmail.com> Date: Mon May 18 10:26:07 2026 -0400 [Model Runner v2] Support update_config (#42783) Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> commit 1ac10f159a09897baada01b14b6a0dd6442aefd6 Author: vllm-agent <claw@inferact.ai> Date: Mon May 18 06:02:51 2026 -0700 Revert "[torch.compile] Add patch for fullgraph compilation" (#42686) 
(#42913) Co-authored-by: Luka Govedič <luka.govedic@gmail.com> Co-authored-by: Zhewen Li <zhewenli@inferact.ai> commit e5417657e55ec2f42809816e4aa5c9753f390cdd Author: liranschour <liranschour@users.noreply.github.com> Date: Mon May 18 15:59:42 2026 +0300 [KV Connector][Offloading] Flush all pending jobs on last step (#42611) Signed-off-by: Liran Schour <lirans@il.ibm.com> Signed-off-by: liranschour <liranschour@users.noreply.github.com> Co-authored-by: Or Ozeri <or@ozery.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> commit 2e40faf08b2cae4ff6e27a255fe10833365de0e8 Author: xiangdong <40376367+zxd1997066@users.noreply.github.com> Date: Mon May 18 20:34:48 2026 +0800 [XPU][CI] Temporarily skip test_moe_lora_align_block_size_mixed_base_and_lora[1] in Intel GPU CI (#42954) Signed-off-by: zengxian <xiangdong.zeng@intel.com> commit 69c91d010a596bb74b553fe157497a1fd6edb47c Author: Nicolò Lucchesi <nlucches@redhat.com> Date: Mon May 18 14:34:16 2026 +0200 [MRv2] Default to MRv1 when a connector is present (#42955) Signed-off-by: NickLucche <nlucches@redhat.com> commit 737bfa3a43ce386bd1894792f3302d9f3f9d73fa Author: roikoren755 <26850796+roikoren755@users.noreply.github.com> Date: Mon May 18 14:54:00 2026 +0300 [Bugfix][Hybrid][NemotronH] Fix mamba_cache_mode=all + speculative decoding crash (#41233) Signed-off-by: Roi Koren <roik@nvidia.com> commit e414e1f1c020108593526b706efaf89e427c05a2 Author: Kfir Toledo <kfir.toledo@ibm.com> Date: Mon May 18 14:36:02 2026 +0300 [Bugfix][KV Offload] count appended GPU blocks in store group_sizes (#42945) Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com> commit df852ed503ac1a79e568271cd6f136a7b2698f5e Author: inisis <desmond.yao@buaa.edu.cn> Date: Mon May 18 18:33:29 2026 +0800 fix: remove unused norm for dpskv4 (#41710) Signed-off-by: inisis <desmond.yao@buaa.edu.cn> Co-authored-by: Yongye Zhu <zyy1102000@gmail.com> commit 88a860d7545aad69661daad7a1c2b04f59c76144 Author: Yuwen Zhou 
<yuwen.zhou@intel.com> Date: Mon May 18 18:04:45 2026 +0800 [CPU] Add MXFP4 W4A16 MoE support (#41922) Signed-off-by: yuwenzho <yuwen.zhou@intel.com> Signed-off-by: Yuwen Zhou <yuwen.zhou@intel.com> commit cac81b6eda418fb5ca86b81197914dd02666353e Author: Tianmu Li <tianmu.li@intel.com> Date: Mon May 18 03:04:41 2026 -0700 [CPU Backend] Improve cpu thread utilization (#42666) Signed-off-by: Li, Tianmu <tianmu.li@intel.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> commit b4601ad43ff7ff2b9e2f52379144481e45bcf6c5 Author: Li, Jiang <jiang1.li@intel.com> Date: Mon May 18 18:04:36 2026 +0800 [CPU] Add fused GDN support for AMX CPU platform (#42707) Signed-off-by: jiang1.li <jiang1.li@intel.com> commit 2267f70070bdee8057b4afae69cba9b847add587 Author: Jee Jee Li <pandaleefree@gmail.com> Date: Mon May 18 18:04:31 2026 +0800 [Kernel] Pack topk id/weights triton kernel (#42527) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> commit 965d076148326f4511b6b832cbe7d974db74dbe9 Author: Tony Lin <tony.lin@intel.com> Date: Mon May 18 17:38:54 2026 +0800 [CPU] Specify required KV cache layout for CPU attention backend (#42740) Signed-off-by: Tony Lin <tony.lin@intel.com> Co-authored-by: Li, Jiang <jiang1.li@intel.com> commit c38bed4248e97e5ed981569777d035d31ace5368 Author: wenjun liu <wenjun.liu@intel.com> Date: Mon May 18 16:36:45 2026 +0800 delete xpu ci (#42582) Signed-off-by: wenjun.liu <wenjun.liu@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> commit 998714b21b413c78db8eb7af7f384dc90c0b10dc Author: Xin Yang <105740670+xyang16@users.noreply.github.com> Date: Mon May 18 01:32:46 2026 -0700 [Perf] Add do_not_specialize in fused FP8 RoPE kernel (#42849) Signed-off-by: Xin Yang <xyangx@amazon.com> commit 9537542537728af9fac418ecf1604ad8e8d9ff93 Author: Harry Mellor <19981378+hmellor@users.noreply.github.com> Date: Mon May 18 17:31:06 2026 +0900 Revert 
checkpoint specific workaround in Transformers modelling backend (#42923) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> commit 5ab6d1b3fd407404cd78488bf6f4cbcde6d912b7 Author: Rishapveer Singh <singhrishapveer@gmail.com> Date: Mon May 18 10:14:36 2026 +0200 [Model] [Perf] Use flatten for Qwen3.5's GDN output projection (#42311) Signed-off-by: Rishapveer Singh <singhrishapveer@gmail.com> commit 7d5b033782681acee274f4f379c9fadc557fd7e8 Author: Jee Jee Li <pandaleefree@gmail.com> Date: Mon May 18 15:22:26 2026 +0800 [LoRA] Support 2D and 3D MoE LoRA adapter at the same time (#42242) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Roger Wang <hey@rogerw.io> commit e3aeee5ff8bf7e89fea231d2a965701248eb43c0 Author: Nguyễn Thế Duy <nduy250299@gmail.com> Date: Mon May 18 14:17:53 2026 +0700 [Bugfix] moe lora align kernel grid (#40131) Signed-off-by: TheDuyIT <nduy250299@gmail.com> Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> Signed-off-by: dtnguyen <dtnguyen@nvidia.com> Co-authored-by: Jee Jee Li <jeejeelee@inferact.ai> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> commit c1f7854342d1e80f7f2406524d242b8ee5476d6d Author: Harry Mellor <19981378+hmellor@users.noreply.github.com> Date: Mon May 18 15:33:32 2026 +0900 Improve logging when docs build is skipped (#42929) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> commit 23c15acd770cf16ed36c6d3fed8e7d78db7d5282 Author: gaozihao-shy <gaozihao3@huawei.com> Date: Mon May 18 13:07:16 2026 +0800 [BugFix] Kimi-K2.5: skip vision tower dtype conversion when using quantization (#42869) Signed-off-by: gaozihao-shy <gaozihao-shy@users.noreply.github.com> Signed-off-by: gaozihao <gaozihao3@huawei.com> commit b50646e5effd7cb5884cd96fdff4c53c18521198 Author: Andreas Karatzas <akaratza@amd.com> Date: Sun May 17 
22:57:59 2026 -0500 [ROCm][CI] Stabilize ROCm pooling and multimodal CI (#42909) Signed-off-by: Andreas Karatzas <akaratza@amd.com> commit 990f49bdcb8ff51c0ceb1d784c3ca16e6c276927 Author: Soyaazz <523420504@qq.com> Date: Mon May 18 11:19:13 2026 +0800 [MM][CG] Enable encoder Cudagraph for Step3VL (#42224) Signed-off-by: JisoLya <523420504@qq.com> Signed-off-by: Soyaazz <523420504@qq.com> commit 107210442da1bc6985bfa615b55e1e5c2dd98958 Author: Alec <35311602+alec-flowers@users.noreply.github.com> Date: Sun May 17 19:11:46 2026 -0700 [CI] Add NIXL EP import canary (#42567) Signed-off-by: Alec Flowers <aflowers@nvidia.com> Co-authored-by: OpenAI Codex <codex@openai.com> commit 03ddc1c9bc5e448e0da6236268a611d7d001dbae Author: Yiliu Dong <91178480+qianlihuang@users.noreply.github.com> Date: Mon May 18 09:57:04 2026 +0800 [Perf] Wire silu_and_mul_per_block_quant into TritonFP8MoE (MiniMax-M2) (#42497) Signed-off-by: qianlihuang <yiliu.dong@qq.com> Signed-off-by: Yiliu Dong <91178480+qianlihuang@users.noreply.github.com> Co-authored-by: qianlihuang <yiliu.dong@qq.com> commit 966903eb93a053a908fbf8b931fcebfb28c4741a Author: Luka Govedič <ProExpertProg@users.noreply.github.com> Date: Sun May 17 15:49:16 2026 -0400 [torch.compile] Add patch for fullgraph compilation (#42686) Signed-off-by: Luka Govedič <luka.govedic@gmail.com> commit 599e75f432e5fd7c77e65dc95587f3441201bdbc Author: TJian <tunjian.tan@embeddedllm.com> Date: Mon May 18 00:18:50 2026 +0800 [ROCm] [Bugfix] Fix DeepSeek V4 Functionality and Accuracy (#42810) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> commit 1c8e9c0399f6a6a98f406dce5947a2ad318e195a Author: Taneem Ibrahim <taneem.ibrahim@gmail.com> Date: Sun May 17 09:40:21 2026 -0500 Refactor: Pass num_labels explicitly to PoolerClassify instead of reading from global config (#42851) Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> commit 0fa888465e5a30b797bdf2cdcd0f57fc77541cef Author: zofia <110436990+zufangzhu@users.noreply.github.com> Date: 
Sun May 17 16:55:10 2026 +0800 [XPU] fix weight scale shape (#42725) Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> commit ff712f6447093d07747c88680b9d006b119f5890 Author: liuzhenwei <zhenweiliu@habana.ai> Date: Sun May 17 12:15:50 2026 +0800 [MRV2][XPU] add Model Runner V2 log (#42710) Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com> commit 504a26ce2be2415118b73966480b4fc04d9b7bf8 Author: Qi Zhou <qizzzh@google.com> Date: Sat May 16 17:54:58 2026 -0700 Support bf16 for mamba ssm cache (#41680) Signed-off-by: Qi Zhou <qizzzh@google.com> commit a94189295b8b9c1d952be438b49ed5793db59159 Author: weizhoublue <45163302+weizhoublue@users.noreply.github.com> Date: Sun May 17 08:54:27 2026 +0800 Fix Weight loading for Qwen3.5-MTP and Qwen3-VL using runai_streamer (#42716) Signed-off-by: weizhoublue <weizhou.lan@daocloud.io> commit 0867497368f390212a3f9684e2e05f698f8d1149 Author: Artem Perevedentsev <aperevedents@nvidia.com> Date: Sun May 17 00:55:12 2026 +0300 [CI/Build] Bump flashinfer to v0.6.11.post2 (#41711) Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com> Co-authored-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> commit 36e74c9ea4feb5ade38ffa1ea96f24dd73316e02 Author: Zhewen Li <zhewenli@meta.com> Date: Sat May 16 13:34:15 2026 -0700 [KV Connector] Support disk offloading in MooncakeStoreConnector (#42689) Signed-off-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> commit 787bc0d0313840c16e403dfa2d135781d41d3614 Author: Taneem Ibrahim <taneem.ibrahim@gmail.com> Date: Sat May 16 14:58:16 2026 -0400 Add unit tests for pooler activation functions (#42824) Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> commit d1586e1a1242754d2f6ac51f4f16680f7d4b129b Author: weizhoublue <45163302+weizhoublue@users.noreply.github.com> Date: Sun May 17 01:02:54 2026 +0800 Fix: Propagate 
pinned model revisions into Ultravox secondary weight loading (#42830) commit 8a56da3845270837424ef4b7ee83ca97a7883025 Author: Jiangyun Zhu <riverclouds.zhu@qq.com> Date: Sat May 16 22:04:12 2026 +0800 [Experimental] Breakable CUDA graph (#42304) Signed-off-by: zjy0516 <riverclouds.zhu@qq.com> commit 4db300e95fd29f5b1a4a7c34f4fbe91b7e9abb24 Author: Andreas Karatzas <akaratza@amd.com> Date: Sat May 16 04:35:05 2026 -0500 [ROCm][CI] Removed problematic command override mechanism (#42807) Signed-off-by: Andreas Karatzas <akaratza@amd.com> commit 657b42b5922d21fef00529144ef5bb5633ad04b1 Author: Zhewen Li <zhewenli@meta.com> Date: Sat May 16 00:26:25 2026 -0700 [Docker][KVConnector] Build mooncake-transfer-engine from source (#42114) Signed-off-by: Zhewen Li <zhewenli@inferact.ai> Signed-off-by: khluu <khluu000@gmail.com> Co-authored-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: khluu <khluu000@gmail.com> commit 32b7177909d1c9928bcedd81de7de5a1fa21d2b3 Author: Jee Jee Li <pandaleefree@gmail.com> Date: Sat May 16 11:22:35 2026 +0800 [LoRA][Bugfix] Dedup LoRA wrapping for modules referenced from multiple attribute paths (MoE gate) (#42757) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> commit 39c67d714ef091df1533181bdc3df82dc9ac3e07 Author: DustHunter <dusthunter@126.com> Date: Sat May 16 09:29:27 2026 +0800 fix: add API key authorization to /v2 endpoints (#42594) Signed-off-by: DustHunter <dusthunter@126.com> Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io> commit 87a2adcb43513ead1434aff03a535d86f56f768b Author: Viktor Pus <viktorpus@tenstorrent.com> Date: Sat May 16 02:44:48 2026 +0200 [Misc] Add common random prefix option to structured-output serving benchmark (#41632) Signed-off-by: Viktor Pus <viktorpus@tenstorrent.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> commit 
852f567444cf8c206219edb7b2c42aec55fc41cf Author: Michael Goin <mgoin64@gmail.com> Date: Fri May 15 20:15:52 2026 -0400 [Bugfix] Respect explicit --kv-cache-dtype over checkpoint kv_cache_scheme (#42782) Signed-off-by: mgoin <mgoin64@gmail.com> commit b2a27b82d970efa0203c06be6dc0d94526edaab0 Author: Michael Goin <mgoin64@gmail.com> Date: Fri May 15 20:07:39 2026 -0400 [Kernel][UX] Add `--linear-backend` arg for linear kernel selection (#39538) Signed-off-by: mgoin <mgoin64@gmail.com> commit d0921bafeff9bbe7a7b4efef6371700e69224702 Author: Keyi Li <94494390+JasonKeyiL@users.noreply.github.com> Date: Fri May 15 16:20:33 2026 -0700 [Bugfix] Unwrap VLM wrappers for EPLB on Model Runner V2 (#42706) commit 1ccdf87507407cb02460ec2e7a3e1a4cac9b0a4a Author: rasdani <73563550+rasdani@users.noreply.github.com> Date: Fri May 15 15:20:53 2026 -0700 [Bugfix] Fix layerwise reload alias-buffer corruption (#42481) Signed-off-by: rasdani <73563550+rasdani@users.noreply.github.com> Co-authored-by: OpenAI Codex <codex@openai.com> Co-authored-by: Roger Wang <hey@rogerw.io> commit bd9dbe60601c986b50260f299fe279d057d7d89f Author: Rita Brugarolas <Rita.BrugarolasBrufau@amd.com> Date: Fri May 15 13:50:03 2026 -0700 [ROCm][Bugfix] Fix fused_mla_dual_rms_norm for AITER API rename _fused_qk_rmsnorm (#42606) Signed-off-by: Rita Brugarolas Brufau <rita.brugarolasbrufau@amd.com> commit de2d76f35239c58202e49469dc5524b6f6fc4ffb Author: Michael Goin <mgoin64@gmail.com> Date: Fri May 15 16:46:16 2026 -0400 [Build] Switch CUDA 12.9 wheel builds to PyTorch manylinux_2_28 base (#41668) Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> commit 9a7a273dfe6a89bbe00639fe99b0d61095fbc40a Author: Sergei Skvortsov <yvorott@gmail.com> Date: Fri May 15 21:01:21 2026 +0100 Add HumanEval and GSM8K benchmarks to datasets (#42648) Signed-off-by: southfreebird <yvorott@gmail.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> commit b2c58ee9427f15563210e184c57a6e530f37e464 
Author: Lanze Liu <86434077+liulanze@users.noreply.github.com> Date: Fri May 15 12:34:59 2026 -0700 [FlashAttn] Fix supports_kv_cache_dtype() accepting unhandled fp8 kv-cache dtype variants (#42685) Signed-off-by: Lanze Liu <lanzetech@gmail.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> commit 4d67d3bde25f94b6199ce16c7ef239ae4412bb8f Author: frida-andersson <fanders…

reneleonhardt · 2026-05-29T15:51:57Z

Wow, great work, thank you! ❤️

…m-project#42689) Signed-off-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Liuweixiong0118 <lwx34158427@gmail.com>

…Backend

Mirrors the disk-offload, ReplicateConfig and tier-logging extensions from upstream vllm PR #42689 (MooncakeStoreConnector) into the ascend MooncakeBackend so DSv4 / DSv3.2 deployments can opt into the SSD tier through `mooncake_config.json` ("enable_offload": true).

Design decisions (vs the upstream single-file connector):

* Backend ABC stays untouched. ReplicateConfig and the disk staging budget are owned by MooncakeBackend internally; KVCacheStore*Thread keeps calling Backend.put / Backend.get with the 3-arg signature so memcache / yuanrong backends are unaffected.
* ReplicateConfig is imported with a try/except fallback. Older mooncake builds that don't ship it fall through to the 3-arg batch_put_from_multi_buffers call -- byte-identical to the pre-#42689 path.
* RNIC selection helpers from upstream rdma_utils.py are skipped: ascend protocol drives RDMA selection via global_te (transfer engine) rather than per-GPU CSV mapping.
* MooncakeStoreConfig appends mode / enable_offload at the end of the dataclass to keep positional-arg back-compat with any existing callers; __post_init__ does *soft* validation (only catches obvious mode/global_segment_size mismatches).
* sub-batch split lives inside MooncakeBackend.get. With enable_offload disabled the budget is None and we still issue a single batch_get_into_multi_buffers, preserving baseline behavior.
* tier logging gated by VLLM_MOONCAKE_STORE_TIER_LOG (default False); when off zero replica_desc / classification work happens.

Default-OFF guarantees (key default-path equivalence):

1. enable_offload defaults False -> disk_offload_buffer_budget_bytes stays None -> single-batch get path
2. VLLM_MOONCAKE_STORE_TIER_LOG defaults False -> no replica probing
3. preferred_segment defaults None -> ReplicateConfig left at default
4. ReplicateConfig unavailable -> 3-arg put fallback

Env vars added (mirror upstream names):

VLLM_MOONCAKE_STORE_TIER_LOG
VLLM_MOONCAKE_DISK_STAGING_USABLE_RATIO (default 0.9)
MOONCAKE_PREFERRED_SEGMENT
MOONCAKE_REQUESTER_LOCAL_HOSTNAME

Reference:

vllm PR #42689: vllm-project/vllm#42689
predecessor: vllm PR #40900
base: v0.20.2rc @ 145e994

Signed-off-by: liuchenbing <chenliumail@163.com>
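The sub-batch split mentioned in the commit message above might look roughly like the following. This is a minimal sketch, not the actual MooncakeBackend code: the helper name `split_by_budget` and its signature are hypothetical, and the only behavior taken from the text is that a budget of None (offload disabled) yields a single batch.

```python
from collections.abc import Iterator, Sequence
from typing import Optional


def split_by_budget(
    sizes: Sequence[int], budget_bytes: Optional[int]
) -> Iterator[list[int]]:
    """Yield batches of key indices whose total size fits the staging budget.

    Hypothetical helper for illustration only. budget_bytes=None models
    enable_offload being disabled: all keys go into one batch.
    """
    if not sizes:
        return
    if budget_bytes is None:
        yield list(range(len(sizes)))
        return

    batch: list[int] = []
    used = 0
    for index, size in enumerate(sizes):
        # Close the current batch before it would exceed the budget.
        if batch and used + size > budget_bytes:
            yield batch
            batch, used = [], 0
        # A single key larger than the budget still gets its own batch.
        batch.append(index)
        used += size
    if batch:
        yield batch
```

Each yielded batch would then correspond to one get call, so the default path (budget None) keeps issuing exactly one call, as the commit message describes.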

…m-project#42689) Signed-off-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>


LCAIZJ · 2026-06-08T11:42:09Z

Hi, I have two questions about this PR:

1. --disk_gb parameter: The Summary mentions "launching mooncake_client with --disk_gb N", but I couldn't find a --disk_gb parameter in the Mooncake codebase. The actual SSD capacity seems to be controlled by the environment variable MOONCAKE_OFFLOAD_TOTAL_SIZE_LIMIT_BYTES (defaulting to 2 TB). Could you clarify whether --disk_gb is a real CLI flag or just a shorthand in the description?
2. Mode naming inconsistency: The PR description uses "mode": "owner-client", but the docs section uses "mode": "standalone-store". Are these referring to the same mode? If so, which is the correct/canonical name? This could be confusing for users trying to follow the setup instructions.

@zhewenl @ivanium

zhewenl · 2026-06-08T12:56:44Z

@LCAIZJ Thanks for pointing out the issues! I've updated the PR description. For further reference, please see docs/features/mooncake_store_connector_usage.md or https://docs.vllm.ai/en/latest/features/mooncake_store_connector_usage

mergify Bot added the documentation, v1, and kv-connector labels May 14, 2026

gemini-code-assist Bot reviewed May 14, 2026

zhewenl force-pushed the mooncake-disk-offload branch 2 times, most recently from f4fd81a to 132a151 Compare May 14, 2026 23:23

zhewenl changed the title ~~[KV Transfer] MooncakeStoreConnector: disk-tier offload, dual-mode config, observability~~ [KV Connector] Support disk offloading in MooncakeStoreConnector May 14, 2026

zhewenl force-pushed the mooncake-disk-offload branch from 132a151 to 8391623 Compare May 14, 2026 23:34

zhewenl marked this pull request as ready for review May 14, 2026 23:35

zhewenl requested review from ApostaC, NickLucche, orozery and xuechendi as code owners May 14, 2026 23:35

claude Bot reviewed May 14, 2026

chatgpt-codex-connector Bot reviewed May 14, 2026

ivanium reviewed May 15, 2026

zhewenl added the ready label May 16, 2026

zhewenl force-pushed the mooncake-disk-offload branch from 3813729 to 04ea397 Compare May 16, 2026 07:03

ivanium approved these changes May 16, 2026

ywang96 merged commit 36e74c9 into vllm-project:main May 16, 2026
70 checks passed

HF-001 mentioned this pull request May 19, 2026: [Feature] Support disk offloading in MooncakeConnector vllm-project/vllm-ascend#9273 (Open)

Dao007forever mentioned this pull request May 27, 2026: [Mooncake] Use all HCAs on multi-NIC hosts instead of GPU-indexed RNIC selection #43799 (Merged, 3 tasks)

LCAIZJ mentioned this pull request Jun 9, 2026: [RFC]: [Roadmap] Mooncake Store Connector Feature Enrichment Roadmap #45036 (Open, 8 tasks)

		_DIRECT_IO_PADDING_BYTES = 2 * _DIRECT_IO_ALIGNMENT


		MooncakeMode = Literal["real-client", "owner-client"]

		export MOONCAKE_ENABLE_OFFLOAD=1
		export MOONCAKE_OFFLOAD_FILE_STORAGE_PATH=/path/to/offload/dir


