[KV Connector] Support disk offloading in MooncakeStoreConnector #42689
Documentation preview: https://vllm--42689.org.readthedocs.build/en/42689/
Code Review
This pull request introduces disk offloading and dual-mode operation ("real-client" and "owner-client") to the MooncakeStoreConnector, allowing KV cache to be offloaded to CPU memory or disk. Key technical changes include the implementation of staging buffer budget management for load batches, improved RDMA NIC selection utilities, and enhanced logging for cache tiers. Review feedback correctly identified a critical correctness issue where oversized keys could lead to silent data corruption instead of a loud failure, as well as a regression in IPC path generation that compromised multi-user isolation.
self.set_finished_request(req_id)
self.request_queue.task_done()
return
Marking a request as finished when it was skipped due to an oversized key is a critical correctness issue. The consumer thread will proceed assuming the KV cache has been successfully loaded, but it will actually read stale or uninitialized data from the GPU cache. This leads to silent corruption of model outputs. Since this indicates a fatal configuration error (staging budget too small for a single block), the process should fail loudly instead of proceeding with invalid data.
self.request_queue.task_done()
raise RuntimeError(
    f"Mooncake load for request {req_id} failed: key {oversized_key} "
    f"requires {oversized_key_bytes} staging bytes, exceeding budget "
    f"{self.disk_offload_buffer_budget_bytes}. Increase "
    "MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES.")

+ hostname = socket.gethostname()
  extra_config = vllm_config.kv_transfer_config.kv_connector_extra_config
  if "lookup_rpc_port" in extra_config:
      rpc_port = extra_config["lookup_rpc_port"]
- uid = os.getuid()
- logger.debug("Base URL: %s, RPC Port: %s, UID: %s", base_url, rpc_port, uid)
- return f"ipc://{base_url}/lookup_rpc_port_{rpc_port}_uid{uid}_dp_rank{dp_rank}"
+ logger.debug("Base URL: %s, RPC Port: %s", base_url, rpc_port)
+ return (
+     f"ipc://{base_url}/lookup_rpc_port_{rpc_port}_host_{hostname}_dp_rank{dp_rank}"
Replacing uid with hostname in the IPC path is a regression that breaks isolation in multi-user environments. If multiple users run vLLM on the same node, they will collide on the same socket path (as hostname and dp_rank will be identical). Additionally, socket.gethostname() can be slow or fail in restricted network environments. Isolation should be maintained using os.getuid(). If hostname is desired for observability, it should be added alongside the UID.
uid = os.getuid()
hostname = socket.gethostname()
extra_config = vllm_config.kv_transfer_config.kv_connector_extra_config
if "lookup_rpc_port" in extra_config:
    rpc_port = extra_config["lookup_rpc_port"]
logger.debug("Base URL: %s, RPC Port: %s, UID: %s, Host: %s",
             base_url, rpc_port, uid, hostname)
return (
    f"ipc://{base_url}/lookup_rpc_port_{rpc_port}_uid{uid}_host_{hostname}_dp_rank{dp_rank}"
)
force-pushed from f4fd81a to 132a151
…nfig, observability

Adds disk-tier KV offload support to the MooncakeStoreConnector that landed in vllm-project#40900. Stacks cleanly on top of vllm-project#40900 with zero changes to the existing CPU-only path: operators upgrade simply by adding ``"enable_offload": true`` to their mooncake_config.json and launching ``mooncake_client`` with ``--disk_gb N``.

## What this adds

**Disk-tier offload (recv-side batching)**

* Optional ``enable_offload`` flag in MooncakeStoreConfig. When true, the recv thread allocates a DirectIO staging budget (``MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES``, default 1.28 GiB) and caps each ``batch_get_into_multi_buffers`` call at ``0.9 ×`` that budget, splitting larger batches into sub-batches automatically.
* Per-key staging estimate accounts for the C++ DirectIO 4 KiB alignment + 8 KiB padding so the split never overruns the owner's pinned-memory staging buffer.
* Loads that would individually exceed the raw budget are skipped with a clear warning instead of returning an opaque ``insufficient_space`` error from Mooncake.

**Dual-mode topology**

* New ``mode`` field in MooncakeStoreConfig with ``Literal["real-client", "owner-client"]``:
  * ``real-client`` (default, PR-40900 baseline): every vLLM rank contributes its own CPU segment. No separate owner process.
  * ``owner-client``: vLLM ranks contribute zero CPU; a separately launched ``mooncake_client`` owns the CPU pool + optional SSD tier. This is the topology that makes disk offload work in practice (one node-local SSD tier serves all ranks).
* ``__post_init__`` validation: ``mode`` and ``global_segment_size`` must agree (real-client requires > 0; owner-client requires 0), ``local_buffer_size`` must be > 0, ``mode`` must be one of the two literals. Hard fail with a clear message, no silent footguns.

**Optional ``preferred_segment`` for replicate-config**

* When set on ``kv_connector_extra_config["preferred_segment"]`` (or the ``MOONCAKE_PREFERRED_SEGMENT`` env), PUTs route to a specific owner segment via ``ReplicateConfig.preferred_segment`` rather than the default round-robin allocation. Required when more than one segment is reachable from the master and you want puts to converge on the SSD-bearing one.

**Per-rank RNIC pinning + scratch helpers**

* New ``vllm/distributed/kv_transfer/kv_connector/v1/mooncake/rdma_utils.py`` with three small helpers: (1) parse an operator-provided ``device_name`` CSV (positional, by physical GPU index) and select this rank's RNIC; (2) honour ``MOONCAKE_LOCAL_HOSTNAME`` to override the announced address on multi-NIC hosts; (3) resolve ``preferred_segment`` from ``extra_config`` or env.
* When no ``device_name`` is configured on RDMA, the connector logs a SGLang-style warning explaining how to set it; Mooncake's C++ auto-selection is the only fallback. (No Python-side discovery, matching vllm-ascend's approach.)

**Observability**

* New ``VLLM_MOONCAKE_STORE_TIER_LOG=1`` env var. When set, the recv thread emits one line per ``batch_get_into_multi_buffers``:

  ```
  Mooncake load tier summary: req_id=X batch_keys=N memory_keys=A disk_keys=B unknown_keys=C success_keys=D failed_keys=E bytes_by_tier={'memory': X, 'disk': Y, 'unknown': Z}
  ```

  Lets operators verify the disk tier is actually serving reads and measure the hit-rate split.

**Startup mode-detection log + soft warnings**

* One info line per rank announcing the resolved mode + relevant fields. Plus warnings for unusual-but-legal hybrid combinations:
  * real-client + ``enable_offload`` + no ``preferred_segment`` → disk tier sees only a fraction of writes.
  * real-client + ``preferred_segment`` set → rank-segments idle.
  * owner-client + ``enable_offload=false`` → disk-batch-splitting disabled; large prefills may hit owner DirectIO budget.

## Files

* ``vllm/distributed/kv_transfer/kv_connector/v1/mooncake/store/worker.py``: disk-offload split, dual-mode config + validation, tier-summary log, ``preferred_segment`` wiring, mode-detection log.
* ``vllm/distributed/kv_transfer/kv_connector/v1/mooncake/rdma_utils.py``: new file; per-rank RNIC + hostname + preferred-segment helpers.
* ``vllm/envs.py``: registers ``VLLM_MOONCAKE_STORE_TIER_LOG``.
* ``tests/v1/kv_connector/unit/test_mooncake_store_worker.py``: ~30 new tests covering split path, validation matrix, topology recipes (real-client + owner-client+disk), tier-summary log, RNIC selection.
* ``tests/v1/kv_connector/unit/test_mooncake_store_connector.py``: one test renamed (``test_worker_role_initializes_store_worker_on_rank0``) to match upstream after rename in tip-of-main.
* ``docs/features/mooncake_store_connector_usage.md``: disk-offload section + env-var table updates.

## Test plan

* ``.venv/bin/python -m pytest tests/v1/kv_connector/unit/test_mooncake_store_connector.py tests/v1/kv_connector/unit/test_mooncake_store_worker.py`` → 53 pass, 0 fail.
* End-to-end validation on a 4×GB200 node with Qwen3-8B + 4 DP, 4 GiB owner CPU pool + 1 TB SSD tier:
  * owner-client + disk: 74 tier-summary lines, 21,027 disk_keys + 236 memory_keys, **0 failed_keys**, 49.6 GB read back from SSD. Split path (``510+290`` and similar) repeatedly exercised.
  * real-client + CPU only: master sees ``Mem Storage: 14.50 GB / 16.00 GB`` (4 ranks × 4 GiB each), 0 SSD, PUTs and evictions flowing.

## Usage

### CPU-only (PR-40900 baseline, unchanged)

```bash
# mooncake_config.json
{
  "mode": "real-client",
  "metadata_server": "http://master:8080/metadata",
  "master_server_address": "master:50051",
  "global_segment_size": "4GB",
  "local_buffer_size": "4GB",
  "protocol": "rdma",
  "device_name": "mlx5_0,mlx5_1,mlx5_2,mlx5_3"
}
```

```bash
MOONCAKE_CONFIG_PATH=mooncake_config.json \
vllm serve <model> \
  --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_both"}'
```

### Owner-client + disk offload

Step 1. Start ``mooncake_master`` (HTTP metadata variant):

```bash
mooncake_master -rpc_port=50051 -enable_http_metadata_server=true \
  -http_metadata_server_port=8080 -enable_offload=true -logtostderr
```

Step 2. Start ``mooncake_client`` as the owner (CPU pool + SSD tier):

```bash
mooncake_client \
  --master_server_address=127.0.0.1:50051 \
  --metadata_server=http://127.0.0.1:8080/metadata \
  --host=127.0.0.1:18001 --port=50052 \
  --protocol=rdma --device_names="$RDMA_DEVICES" \
  --global_segment_size=200GB --enable_offload=true
```

Step 3. Set ``mooncake_config_owner_client.json``:

```bash
{
  "mode": "owner-client",
  "metadata_server": "http://127.0.0.1:8080/metadata",
  "master_server_address": "127.0.0.1:50051",
  "global_segment_size": 0,
  "local_buffer_size": "4GB",
  "protocol": "rdma",
  "device_name": "mlx5_0,mlx5_1,mlx5_2,mlx5_3",
  "enable_offload": true
}
```

Step 4. Launch vLLM:

```bash
MOONCAKE_CONFIG_PATH=mooncake_config_owner_client.json \
MOONCAKE_PREFERRED_SEGMENT=127.0.0.1:18001 \
VLLM_MOONCAKE_STORE_TIER_LOG=1 \
vllm serve <model> \
  --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector", "kv_role":"kv_both", "kv_connector_extra_config":{"load_async":true,"enable_offload":true}}'
```

The startup logs should show:

```
INFO ... Mooncake mode=owner-client (global_segment_size=0, local_buffer_size=4294967296, preferred_segment=127.0.0.1:18001, enable_offload=True)
```

Once the owner CPU pool fills, tier-summary lines start emitting ``disk_keys > 0``, confirming SSD reads.

## Related

* vllm-project#40900: initial MooncakeStoreConnector PR (CPU only); this PR stacks on top.
* RFC: vllm-project#38474.
* Mooncake project: https://github.com/kvcache-ai/Mooncake

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhewen Li <zhewenli@inferact.ai>
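The ``__post_init__`` validation rules listed above (mode must be one of the two literals, ``global_segment_size`` must agree with the mode, ``local_buffer_size`` must be positive) can be sketched as follows. This is a minimal illustration under stated assumptions: ``MooncakeStoreConfigSketch`` is a hypothetical stand-in, sizes are assumed to be already parsed into integer bytes, and the error messages are invented; it is not the connector's actual class.

```python
from dataclasses import dataclass

# The two topology names used in this commit message.
VALID_MODES = ("real-client", "owner-client")


@dataclass
class MooncakeStoreConfigSketch:
    """Hypothetical stand-in for MooncakeStoreConfig (illustration only)."""

    mode: str = "real-client"
    global_segment_size: int = 4 * 1024**3  # bytes; assumed already parsed
    local_buffer_size: int = 4 * 1024**3    # bytes; assumed already parsed
    enable_offload: bool = False

    def __post_init__(self) -> None:
        # Reject unknown modes before checking anything that depends on them.
        if self.mode not in VALID_MODES:
            raise ValueError(
                f"mode must be one of {VALID_MODES}, got {self.mode!r}")
        if self.local_buffer_size <= 0:
            raise ValueError("local_buffer_size must be > 0")
        # real-client ranks contribute a CPU segment, so the size must be > 0.
        if self.mode == "real-client" and self.global_segment_size <= 0:
            raise ValueError("real-client requires global_segment_size > 0")
        # owner-client ranks contribute nothing; the external owner holds the pool.
        if self.mode == "owner-client" and self.global_segment_size != 0:
            raise ValueError("owner-client requires global_segment_size == 0")
```

Failing in ``__post_init__`` means a mismatched config is rejected at construction time, before any rank registers a segment.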
force-pushed from 132a151 to 8391623
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8391623367
protocol=config.get("protocol", "rdma"),
device_name=config.get("device_name", ""),
master_server_address=config.get("master_server_address", ""),
enable_offload=bool(config.get("enable_offload", False)),
Parse enable_offload as a strict boolean
MooncakeStoreConfig.from_file currently coerces enable_offload with bool(...), which treats any non-empty string (including "false"/"0") as True. In JSON generated by templating systems that serialize booleans as strings, this will silently enable disk-offload mode and change runtime behavior (extra buffering/splitting path) instead of honoring the operator intent. Parse this field explicitly (e.g., accept only real booleans or validated string values) to avoid misconfiguration.
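One way to parse the field strictly is sketched below. The helper name ``parse_strict_bool`` and the set of accepted string spellings are assumptions made for illustration; the PR may handle this differently.

```python
def parse_strict_bool(value: object, field: str = "enable_offload") -> bool:
    """Accept real booleans and a small set of explicit string spellings.

    Hypothetical helper: the accepted strings are an assumption, not taken
    from the PR. Anything else raises instead of being coerced.
    """
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in ("true", "1"):
            return True
        if lowered in ("false", "0"):
            return False
    raise ValueError(f"{field} must be a boolean, got {value!r}")


# bool() treats any non-empty string as True, which is the bug described above.
assert bool("false") is True
assert parse_strict_bool("false") is False
```

With this shape, a templated config that emits ``"enable_offload": "false"`` disables offload as intended, and an unrecognized value fails at load time.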
- return f"ipc://{base_url}/lookup_rpc_port_{rpc_port}_uid{uid}_dp_rank{dp_rank}"
+ logger.debug("Base URL: %s, RPC Port: %s", base_url, rpc_port)
+ return (
+     f"ipc://{base_url}/lookup_rpc_port_{rpc_port}_host_{hostname}_dp_rank{dp_rank}"
Preserve per-user uniqueness in lookup IPC path
The lookup socket path no longer includes uid, so two vLLM instances from different users on the same host now resolve to the same IPC filename when lookup_rpc_port/dp_rank match. That creates bind/connect collisions in shared machines (one process can fail to create the REP socket or connect to the wrong peer). Keep a per-user discriminator (like uid) in the path to avoid cross-user namespace conflicts.
ivanium left a comment
left some initial comments. Will continue later.
_DIRECT_IO_PADDING_BYTES = 2 * _DIRECT_IO_ALIGNMENT

MooncakeMode = Literal["real-client", "owner-client"]
Let's think about better names.
From their docs, I think we can call them "embedded", "embedded-dummy", and "standalone-store"
``mode`` selects the topology: ``real-client`` (PR-40900 baseline: each
rank contributes ``global_segment_size``) or ``owner-client`` (rank
contributes 0; an external ``mooncake_client`` owns the pool).
we can revise the comments here a bit too
if self.replicate_config is None:
    res = self.store.batch_put_from_multi_buffers(keys, addrs, sizes)
else:
    res = self.store.batch_put_from_multi_buffers(
        keys,
        addrs,
        sizes,
        self.replicate_config,
    )
after a second look, maybe we can unify this to
res = self.store.batch_put_from_multi_buffers(
    keys,
    addrs,
    sizes,
    self.replicate_config,
)

export MOONCAKE_ENABLE_OFFLOAD=1
export MOONCAKE_OFFLOAD_FILE_STORAGE_PATH=/path/to/offload/dir
Claude pointed out that we actually read enable_offload from Mooncake's config JSON here rather than from an env var.
Hi @zhewenl, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
…ommit fix

Follow-up to review feedback on PR vllm-project#42689:

1. Pre-commit fixes (the immediate CI failure). Two `raise ValueError(...)` / `elif (...)` sites flagged by `pre-commit run ruff-format --all-files` are now collapsed to single-line form.
2. Rename `DEFAULT_MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE` → `DEFAULT_MOONCAKE_DISK_STAGING_BUFFER_BYTES`. The old name collided conceptually with the unrelated `local_buffer_size` parameter to Mooncake's `setup()` (RDMA scratch buffer, 16 MiB default). The new name says what the constant actually mirrors: `FileStorageConfig::local_buffer_size` at `Mooncake/mooncake-store/include/storage_backend.h:206` (1280 * kMB, the DirectIO staging buffer used inside the disk-tier owner process).
3. Promote `DISK_OFFLOAD_USABLE_BUDGET_RATIO = 0.9` to an env var: `VLLM_MOONCAKE_DISK_STAGING_USABLE_RATIO` (typed `float`, default 0.9). Follows the `VLLM_MOONCAKE_STORE_TIER_LOG` pattern. The 0.9 had no experimental motivation in code; exposing it as a knob lets users tune the per-batch headroom against the owner's staging buffer.
4. Centralize three previously `os.getenv`-only Mooncake env vars in `vllm/envs.py`:
   - `MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES`: shared with the Mooncake C++ owner process; same name, single source of truth.
   - `MOONCAKE_PREFERRED_SEGMENT`: pin replicas to a specific owner segment in standalone-store mode.
   - `MOONCAKE_REQUESTER_LOCAL_HOSTNAME`: renamed from the previous `MOONCAKE_LOCAL_HOSTNAME`. The old name was too generic and collided with a different `local_hostname` knob in Mooncake's own wheel-level `mooncake_config.py`. The new name says what it's for: the vLLM-rank-as-requester's identity.
5. Add concise comments to each module-level constant in worker.py pinning it to its Mooncake C++ source (`file_storage.cpp:512-525` for DirectIO alignment + padding, `storage_backend.h:206` for the staging-buffer default). The drift hazard is now visible at the site of the constant.
6. Add a docstring + signature comments to `_split_disk_offload_load_batches` explaining the three-parallel-list input shape, why the inner type is `list[int]` (scatter-gather across K/V or multi-layer segments), and what the `(batches, oversize_key)` return tuple encodes.
7. Doc updates:
   - `docs/features/mooncake_store_connector_usage.md` env-var table now lists the three centralized vars plus the new ratio knob.
   - `docs/design/mooncake_offload_staging_buffer_explained.html` code snippets updated to the new constant name.

Test plan:

- `pre-commit run --files <touched-files>`: all hooks pass locally.
- `.venv/bin/python -m pytest tests/v1/kv_connector/unit/test_mooncake_store_*`: 53/53 pass.
- E2E validated on GB200 with `recipes/mooncake/verify/mndp_noscripts_p2p.yaml` (Qwen3-8B, standalone-store + disk offload): worker log shows `Mooncake mode=standalone-store ... enable_offload=True`, tier-summary reports disk_keys > 0 with zero failed_keys.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhewen Li <zhewenli@inferact.ai>
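The splitting behaviour that `_split_disk_offload_load_batches` is described as having (three parallel lists in, a `(batches, oversize_key)` tuple out, each batch capped at the usable fraction of the staging budget) might look roughly like the sketch below. Everything here is an assumption for illustration: the function names, the choice to align the summed segment sizes once per key, and the greedy packing order are not taken from the PR's code.

```python
from __future__ import annotations

from typing import Optional

DIRECT_IO_ALIGNMENT = 4096                   # 4 KiB alignment, per the PR text
DIRECT_IO_PADDING = 2 * DIRECT_IO_ALIGNMENT  # 8 KiB padding, per the PR text

# One batch: (keys, per-key address lists, per-key size lists).
Batch = tuple[list[str], list[list[int]], list[list[int]]]


def estimate_staging_bytes(segment_sizes: list[int]) -> int:
    """Assumed estimate: round the key's total size up to 4 KiB, add 8 KiB."""
    total = sum(segment_sizes)
    aligned = -(-total // DIRECT_IO_ALIGNMENT) * DIRECT_IO_ALIGNMENT
    return aligned + DIRECT_IO_PADDING


def split_load_batches(
    keys: list[str],
    addrs: list[list[int]],
    sizes: list[list[int]],
    raw_budget: int,
    usable_ratio: float = 0.9,
) -> tuple[list[Batch], Optional[str]]:
    """Greedily pack keys into batches that fit the usable staging budget.

    Returns (batches, None) on success, or ([], key) for the first key whose
    own estimate exceeds the raw budget.
    """
    usable = int(raw_budget * usable_ratio)
    batches: list[Batch] = []
    cur: Batch = ([], [], [])
    cur_bytes = 0
    for key, addr, size in zip(keys, addrs, sizes):
        need = estimate_staging_bytes(size)
        if need > raw_budget:
            return [], key  # a single key can never fit; report it
        if cur[0] and cur_bytes + need > usable:
            batches.append(cur)  # close the current batch, start a new one
            cur, cur_bytes = ([], [], []), 0
        cur[0].append(key)
        cur[1].append(addr)
        cur[2].append(size)
        cur_bytes += need
    if cur[0]:
        batches.append(cur)
    return batches, None
```

With a 100,000-byte raw budget and five keys of 16 KiB each, every key is estimated at 24,576 bytes, so three keys fit under the 90,000-byte usable cap and the sketch yields batches of three and two keys.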
Bundle of cleanup changes addressing reviewer comments on the original disk-offload commit:

Naming

- Rename mode strings: real-client → embedded, owner-client → standalone-store. The new names describe what the topology does (embeds the segment in-process vs. uses a standalone owner) rather than how the C++ side labels its clients.
- Rename DEFAULT_MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE → DEFAULT_MOONCAKE_DISK_STAGING_BUFFER_BYTES. The old name collided with Mooncake's unrelated setup() local_buffer_size; the new name says what the constant actually mirrors (FileStorageConfig::local_buffer_size in storage_backend.h:206).

Code simplification

- Unify the batch_put call path: drop the `if self.replicate_config is None` branch in the SendingThread. MooncakeStoreWorker.__init__ now always builds a ReplicateConfig (only sets preferred_segment when configured), so the inner PUT site has a single unconditional call to batch_put_from_multi_buffers.
- Inline trivial helpers (_get_kv_connector_extra_config, _get_disk_offload_buffer_budget_bytes) at their single call sites; they were adding indirection without adding clarity.

Env vars

- Add VLLM_MOONCAKE_DISK_STAGING_USABLE_RATIO (float, default 0.9) to vllm/envs.py; the previously-hardcoded usable-vs-raw budget ratio is now a runtime knob.
- Centralize MOONCAKE_PREFERRED_SEGMENT and the renamed MOONCAKE_REQUESTER_LOCAL_HOSTNAME (was the too-generic MOONCAKE_LOCAL_HOSTNAME) in vllm/envs.py.
- Drop the vLLM-side MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES env knob. The vLLM-side budget is now always DEFAULT_MOONCAKE_DISK_STAGING_BUFFER_BYTES; Mooncake's C++ owner still reads its own env var on its side.

Doc fixes

- enable_offload is read from the JSON config, not from MOONCAKE_ENABLE_OFFLOAD (the old doc was wrong: that env var is only referenced in a Mooncake C++ test). Usage doc updated; env-var table no longer claims the env var exists.
- Add concise comments to each module-level constant pinning to its Mooncake C++ source line (file_storage.cpp:512-525 for DirectIO alignment + padding, storage_backend.h:206 for the staging-buffer default).
- Add a docstring + signature comments to _split_disk_offload_load_batches explaining the three-parallel-list input shape, the scatter-gather inner type, and the (batches, oversize_key) return convention.

Test plan

- `pre-commit run --files <touched-files>`: all hooks pass locally.
- `.venv/bin/python -m pytest tests/v1/kv_connector/unit/test_mooncake_store_*`: 50/50 pass.
- E2E validated on GB200 with `recipes/mooncake/verify/mndp_noscripts_p2p.yaml` (Qwen3-8B DP=4, standalone-store + disk offload): 22,019 disk_keys / 0 failed_keys over 87 tier-summary lines; 100 conv × 3 turns bench completes with 298/300 successful requests (the 2 turn-3 failures are the documented PUT-backpressure-skip path, not a regression).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhewen Li <zhewenli@inferact.ai>
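The "always build a ReplicateConfig, set preferred_segment only when configured" pattern from the code-simplification notes can be illustrated with a small stand-in. `ReplicateConfigSketch` is a hypothetical dataclass, not Mooncake's real `ReplicateConfig`, and the precedence shown (extra_config first, then the `MOONCAKE_PREFERRED_SEGMENT` env var) is an assumption.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class ReplicateConfigSketch:
    """Hypothetical stand-in for Mooncake's ReplicateConfig."""

    preferred_segment: str = ""  # empty means "let the store choose"


def resolve_preferred_segment(
    extra_config: Mapping[str, object],
    environ: Mapping[str, str],
) -> Optional[str]:
    # Assumed precedence: explicit connector config wins over the env var.
    value = extra_config.get("preferred_segment") or environ.get(
        "MOONCAKE_PREFERRED_SEGMENT")
    return str(value) if value else None


def build_replicate_config(
    extra_config: Mapping[str, object],
    environ: Mapping[str, str],
) -> ReplicateConfigSketch:
    # Always return a config object so the PUT site needs no None check.
    config = ReplicateConfigSketch()
    segment = resolve_preferred_segment(extra_config, environ)
    if segment:
        config.preferred_segment = segment
    return config
```

Because the config object always exists, the caller can pass it to a single unconditional PUT call, which is the simplification the reviewer asked for.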
force-pushed from 3813729 to 04ea397
ivanium left a comment
LGTM. Thanks for the effort!
…m-project#42689) Signed-off-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves conflicts in vllm/envs.py introduced by the pydantic-settings refactor. The legacy `if TYPE_CHECKING:` declaration block and `environment_variables: dict[str, Callable]` runtime dict were dropped wholesale; both are already superseded by the pydantic BaseSettings model tree on this branch.

Main-side commits touching vllm/envs.py since the merge base (256dbca..origin/main) and how each was ported:

- ae4f59f (vllm-project#39337): VLLM_USE_V2_MODEL_RUNNER widened from `bool` (default False) to `bool | None` (default None). Already present on the branch as `use_v2_model_runner` on CompilationSettings with a `_parse_use_v2_model_runner` field_validator. Tri-state: unset means "use config default".
- 8a56da3 (vllm-project#42304): adds VLLM_USE_BREAKABLE_CUDAGRAPH. Ported as `use_breakable_cudagraph: bool = False` on CompilationSettings.
- 36e74c9 (vllm-project#42689): adds four KV-connector env vars. Ported on ConnectorSettings as:
  - mooncake_store_tier_log: bool = False
  - mooncake_disk_staging_usable_ratio: float = 0.9
  - preferred_segment: str | None (alias=MOONCAKE_PREFERRED_SEGMENT)
  - requester_local_hostname: str | None (alias=MOONCAKE_REQUESTER_LOCAL_HOSTNAME)

  The last two use `alias=` because they lack the VLLM_ prefix.

Verification:

- grep -n "<<<<<<< |>>>>>>> |=======" vllm/envs.py returns zero hits.
- pre-commit run --files vllm/envs.py passes (ruff, mypy, SPDX, the schema validator that enforces every field has a default and a docstring, etc.).
- Manual override test confirmed pydantic parses both VLLM_-prefixed and unprefixed env vars correctly via the registry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
commit 843715739b7b555c61dd6190cafb5ab7a44c41f1
Author: Yongye Zhu <zyy1102000@gmail.com>
Date: Fri May 22 13:06:31 2026 -0400
[Refactor] Extract DeepSeek V4 sparse MLA impl into model folder (#43149)
commit b21f3d56d4a2ab5504b56504e87e0475c6d84eb2
Author: Dao007forever <dao007forever@gmail.com>
Date: Fri May 22 09:14:11 2026 -0700
[KV Connector] MooncakeStore: don't co-queue save with load to avoid double delayed-free (#43371)
Signed-off-by: Dao Le <Dao007forever@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit c7624bea5ebba1c688eb4c216bd4ede7a94f2a82
Author: Zhanda Zhu <49645678+zhandaz@users.noreply.github.com>
Date: Fri May 22 12:10:03 2026 -0400
[Bugfix] Source num_qo_heads from Attention layers in Flashinfer/Triton metadata builders (#42650)
Signed-off-by: zhanda <zhandazhu@gmail.com>
Co-authored-by: Shang Wang <shangw@nvidia.com>
commit 91f5b92438a568c89e8b9d6c2c55de5a552291f6
Author: Bugen Zhao <i@bugenzhao.com>
Date: Fri May 22 23:22:11 2026 +0800
[Rust Frontend] [Refactor] Extract a newtype for utility call ID (#43405)
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
commit f0feb15e7fc521544d23c2d23de0e327a509876b
Author: Isotr0py <mozf@mail2.sysu.edu.cn>
Date: Fri May 22 22:31:00 2026 +0800
[Multimodal] Simplify ViT CUDA graph interfaces (#41234)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
commit fb21d8b4f9027f4642637c7bb0acc08c29dce387
Author: sychen52 <41452870+sychen52@users.noreply.github.com>
Date: Fri May 22 07:21:51 2026 -0700
Add NVFP4 MOE support for Deepseek V4. (#42209)
Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
commit a377631d21cc97db678727455d33c4257435f417
Author: haosdent <haosdent@gmail.com>
Date: Fri May 22 22:06:24 2026 +0800
[CI] Fix AMD docker build tests (#43329)
Signed-off-by: haosdent <haosdent@gmail.com>
commit d3a563501bcc6134a348f8458b1a797c94336f1f
Author: Ilya Markov <markovilya197@gmail.com>
Date: Fri May 22 15:43:27 2026 +0200
[EPLB] Change default EPLB communicator (#43110)
Signed-off-by: Markov Ilya <markovilya19@gmail.com>
Co-authored-by: Markov Ilya <markovilya19@gmail.com>
commit 15f7cd33dc8bd4d2270b70ba49d511827d2413ff
Author: Jee Jee Li <pandaleefree@gmail.com>
Date: Fri May 22 21:41:56 2026 +0800
[LoRA] Reduce memory of 2D weights when EP is set (#42737)
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
commit 79ff0ffa98dc8dd14a8651bce36ce6265ff4d35d
Author: Keyi Li <94494390+JasonKeyiL@users.noreply.github.com>
Date: Fri May 22 05:26:41 2026 -0700
[BugFix] wire make_empty_intermediate_tensors on AyaVision and Voxtral (#43118)
Signed-off-by: Keyi Li <likey6688@gmail.com>
Co-authored-by: Keyi Li <likey6688@gmail.com>
commit 4658bf882b881287fc85797a23037aa91740b7a7
Author: Tobias Wasner <wasnertobias@users.noreply.github.com>
Date: Fri May 22 12:54:29 2026 +0200
[Bugfix] Clear P0 mm sender cache on sleep/pause to fix mm_hash desync (#43001)
Signed-off-by: Tobias Wasner <wasnertobias@gmail.com>
commit b3c7ffcab82c2439726f8cb213800f6f38c023d3
Author: Taneem Ibrahim <taneem.ibrahim@gmail.com>
Date: Fri May 22 05:43:33 2026 -0500
[Misc] Replace assert with proper exceptions for security and validation in pooling (#43286)
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
commit d3d1cf6972607c53327b5ce1748e56a95fc41c37
Author: Ma Jian <jian1.ma@intel.com>
Date: Fri May 22 18:22:45 2026 +0800
[XPU]feat: add XPU fallback for MoE topk routing and MXFP4 backend (#42951)
Signed-off-by: Ma Jian <jian1.ma@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
commit 7e1b45a09252a5b513cd83116aa7a2f310220c34
Author: wangxiyuan <wangxiyuan1007@gmail.com>
Date: Fri May 22 17:13:12 2026 +0800
[Attention] Mamba attention module refactor (#41126)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
commit 65b7a812a2dabd212d78c7b5b8a320b4efb9750d
Author: Li, Jiang <jiang1.li@intel.com>
Date: Fri May 22 16:48:17 2026 +0800
[CPU] Experimentally enable Triton and MRV2 (#43225)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
commit 2380bfc2104267914eea36015e2a347b9318c6c0
Author: wang.yuqi <yuqi.wang@daocloud.io>
Date: Fri May 22 16:43:14 2026 +0800
[Docs] Note image preprocessing difference between qwen_vl_utils and vllm. (#43393)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
commit a7616977176e12ddb14c0daab00cd2a2161ba37c
Author: mrjunwan-lang <mrjunwan@google.com>
Date: Fri May 22 01:36:17 2026 -0700
Fix the docker build failure in tpu-inference (#43360)
Signed-off-by: mrjunwan-lang <mrjunwan@google.com>
commit 694d9a81bbb07977e7a72a597acb44f6a848f774
Author: Nick Hill <nickhill123@gmail.com>
Date: Fri May 22 00:25:10 2026 -0700
[BugFix] Fix setuptools-rust dep in requirements files (#43377)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
commit 6bb8753db1076f498c240fffdd88b1ab983b7f40
Author: Weida Hong <wdhongtw@google.com>
Date: Fri May 22 15:21:35 2026 +0800
Correcting the mock classes for MM GC tests (#43321)
Signed-off-by: Weida Hong <wdhongtw@google.com>
commit 025d4f5cd2617bb767663f9e7d62354039887757
Author: haosdent <haosdent@gmail.com>
Date: Fri May 22 15:13:59 2026 +0800
[CI] Fix "test_awq_load[gemma4-moe-*]" failure (#43296)
Signed-off-by: haosdent <haosdent@gmail.com>
commit 5ea76fa89aa2e307f0d9a2e7fc19d13aed65a82f
Author: haosdent <haosdent@gmail.com>
Date: Fri May 22 14:24:18 2026 +0800
[CI] Fix test_lora_with_spec_decode on V2 model runner (#43314)
Signed-off-by: haosdent <haosdent@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
commit fa1ff88b3145d1897558408a9001c030c39383b9
Author: tc-mb <157115220+tc-mb@users.noreply.github.com>
Date: Fri May 22 13:44:06 2026 +0800
[Model] Fix MiniCPM-V 4.6 vit_merger qkv weight loading (#43213)
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
commit e746a2eebf09b1f99beb6b3c60a5ba9d2f8c4875
Author: Furkan F <id+git@yufufi.com>
Date: Fri May 22 07:28:23 2026 +0200
[Model] Use `AutoWeightsLoader` for Voyage (#42972)
Signed-off-by: Furkan Fidan <dev@yufufi.com>
commit 1fe3303983e1829fae25edfb0b93e8cbcfad96e6
Author: haosdent <haosdent@gmail.com>
Date: Fri May 22 12:15:22 2026 +0800
[CI] De-flake renderers/test_hf.py::test_resolve_content_format_fallbacks[Qwen/Qwen-VL-string] (#43064)
Signed-off-by: haosdent <haosdent@gmail.com>
commit 8c8b1825eb26c1ffae776baaab16f2eebf92b7d3
Author: Xiaochang Wu <xiaochang.wu@intel.com>
Date: Fri May 22 12:02:51 2026 +0800
[XPU] Enable multiple key kernels for sparse attention (#37888)
Signed-off-by: Xiaochang Wu <xiaochang.wu@intel.com>
Signed-off-by: Wu, Xiaochang <xiaochang.wu@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
commit 18a27cc9a3641cc1dd3eae5113b75c7ccc029b5f
Author: qizixi <22851944+zixi-qi@users.noreply.github.com>
Date: Thu May 21 20:36:22 2026 -0700
[Bugfix] Make CuMemAllocator free callback stream-aware (#43020)
Signed-off-by: zixi-qi <zixi@inferact.ai>
Co-authored-by: Claude <noreply@anthropic.com>
commit 0ddd7dd6564f5e403a15bd7c973c7d358ec82454
Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Date: Thu May 21 23:33:16 2026 -0400
[Frontend] DP Supervisor (#40841)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: robertgshaw2-redhat <robertgshaw2@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
commit 60af5c16ee64ea3c1c573d67d0773a713c87a22e
Author: ruizhang <rza21.bc@gmail.com>
Date: Thu May 21 20:32:31 2026 -0700
[Frontend] Add truncation side to OpenAI endpoints (#43260)
Signed-off-by: Rui Zhang <rza21.bc@gmail.com>
Signed-off-by: Rui Zhang <rui.zhang@globalrelay.net>
Co-authored-by: Rui Zhang <rui.zhang@globalrelay.net>
commit 35d0141a0b68a188777e277e372f211098419f58
Author: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Date: Thu May 21 23:17:54 2026 -0400
[ROCm][CI] add warmup to mem_util test before measurement (#43236)
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
commit 86ccef7d4400a54441057773d8ffb1f61a20af94
Author: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com>
Date: Fri May 22 05:06:40 2026 +0200
[ROCm] Add XGMI backend for MoRI Connector (#41753)
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
commit 2998a047aad7d48bf0399f19b36f1a4d749c59c2
Author: Chengze Fan <fancz2002@gmail.com>
Date: Thu May 21 19:43:01 2026 -0700
[Bugfix] Fix DSV4 Base model swiglu limit issue in FP8 path (#42855)
Signed-off-by: Chengze Fan <chengze@meta.com>
Signed-off-by: Chengze Fan <fancz2002@gmail.com>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>
commit ba369b7eb5a3c6593b55f2005655d6586997fa07
Author: Isotr0py <mozf@mail2.sysu.edu.cn>
Date: Fri May 22 10:26:05 2026 +0800
[CI] Fix dockerfile dependency graph failure for pre-commit (#43378)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
commit 39910f2b25aacc09f5e7f166cdf0030b19f8b9e8
Author: Bugen Zhao <i@bugenzhao.com>
Date: Fri May 22 08:21:48 2026 +0800
[Rust Frontend] Move code from `vllm-frontend-rs` (#43283)
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Eric Curtin <eric.curtin@docker.com>
Signed-off-by: Dev-X25874 <283057883+Dev-X25874@users.noreply.github.com>
Signed-off-by: Will.hou <1205157517@qq.com>
Signed-off-by: Will.hou <willamhou@ceresman.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Eric Curtin <eric.curtin@docker.com>
Co-authored-by: Dev-X25874 <283057883+Dev-X25874@users.noreply.github.com>
Co-authored-by: Will.hou <1205157517@qq.com>
Co-authored-by: Will.hou <willamhou@ceresman.com>
Please see https://github.com/Inferact/vllm-frontend-rs for full original commit history.
commit 39d5fa96a7c687f9ed7e14a5a52064965356cede
Author: Lanze Liu <86434077+liulanze@users.noreply.github.com>
Date: Thu May 21 15:42:42 2026 -0700
[Bugfix] Zero stale is_prefilling in padded CUDA graph rows for Mamba (#41873)
Signed-off-by: Lanze Liu <lanzetech@gmail.com>
commit 565b745ec5d28dafd14585f1b695b159ba336a04
Author: Nick Hill <nickhill123@gmail.com>
Date: Thu May 21 15:42:20 2026 -0700
[BugFix] Use correct logprobs for `logprob_token_ids` (#43125)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
commit e26e1f09280b6c54e1bc1d1fbc0118f7e309cb10
Author: fangyuchu <fangyuchu@qq.com>
Date: Fri May 22 06:42:07 2026 +0800
[Feature] Add `--cpu-distributed-timeout-seconds` CLI Option for CPU Process Group Timeout (#42968)
Signed-off-by: fangyuchu <fangyuchu@qq.com>
Signed-off-by: zWaNg3 <389750525@qq.com>
Co-authored-by: zWaNg3 <389750525@qq.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit 0f66623b0d739dc94afddb67863c37d6f5816579
Author: Nick Hill <nickhill123@gmail.com>
Date: Thu May 21 15:36:58 2026 -0700
[Frontend] Rework fastokens integration (#43168)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
commit 0b59fc45dd475f96f6f46f2c3e699d7bc13b3b04
Author: ylangtsou <149562838+ylangtsou@users.noreply.github.com>
Date: Fri May 22 06:00:52 2026 +0800
Disable build isolation to bypass CUDA related deps for vllm-tpu (#43038)
Signed-off-by: Ylang Tsou <ylangt@google.com>
Co-authored-by: Ylang Tsou <ylangt@google.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
commit 17b69828a013acb7af0cd1d16d24ecc8d7582094
Author: Zheng Luo <zheluo@nvidia.com>
Date: Thu May 21 13:05:01 2026 -0700
[Core] Add native ModelExpress load format (#43105)
Signed-off-by: Zheng Luo <zheluo@nvidia.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
commit b29cbf06525254693f29d98686e038eaf225be8c
Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Date: Thu May 21 16:00:29 2026 -0400
[Perf] `zeros` -> `empty` to remove additional fill (#42988)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
commit 9b54e50e2c1c61ea3b7def032fbafc56dd3179c1
Author: Michael Goin <mgoin64@gmail.com>
Date: Thu May 21 15:51:12 2026 -0400
[Deprecation] Mark env vars covered by --moe-backend / --linear-backend (#43148)
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
commit 1c78f76c29a642379ad0ec953a77af9bc44376b6
Author: anish <145943060+anishesg@users.noreply.github.com>
Date: Thu May 21 11:07:46 2026 -0400
[Bugfix] Add early validation to reject incompatible runner types for embedding models (#43079)
Signed-off-by: anish <anishesg@users.noreply.github.com>
Signed-off-by: Your Name <ak8686@princeton.edu>
Signed-off-by: anish <145943060+anishesg@users.noreply.github.com>
Co-authored-by: anish <anishesg@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
commit 9b9d5dbaab852a1c615fe83a7f92881d353503db
Author: haosdent <haosdent@gmail.com>
Date: Thu May 21 22:28:34 2026 +0800
[CI] Fix CPU tests failing on `tl.exp2` import (#43311)
Signed-off-by: haosdent <haosdent@gmail.com>
commit b730c4635288d75da4788bc28d8d26b5e5c3726c
Author: Francesco Fusco <ffu@zurich.ibm.com>
Date: Thu May 21 13:50:54 2026 +0200
[Perf] [Hybrid] Fused Triton kernel for GPU-side Mamba state postprocessing (#40172)
Signed-off-by: Francesco Fusco <ffu@zurich.ibm.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit c68c55d43e504745dbfc2d46b552e80acb74d4b9
Author: velonica0 <47554626+velonica0@users.noreply.github.com>
Date: Thu May 21 19:50:49 2026 +0800
[CPU][RISC-V] Add VLEN=256 support to RVV attention kernels (#42943)
Signed-off-by: velonica0 <like@mail.nankai.edu.cn>
Signed-off-by: velonica0 <47554626+velonica0@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
commit 5ecd8e9c708821916323d25d5f7beddb7f41d22b
Author: xiangdong <40376367+zxd1997066@users.noreply.github.com>
Date: Thu May 21 18:41:38 2026 +0800
[XPU][CI]Fix Docker image pull-to-run race in Intel GPU CI (#43266)
Signed-off-by: zengxian <xiangdong.zeng@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
commit caf69823d61119ac3f4b066f20a910b62078e41c
Author: haosdent <haosdent@gmail.com>
Date: Thu May 21 18:38:07 2026 +0800
[CI] Pin protoc binary in rust-build stages (#43292)
Signed-off-by: haosdent <haosdent@gmail.com>
commit 68e07d59161a8d268b773c181fab17994a7c5d0a
Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Date: Thu May 21 04:58:09 2026 -0400
[Bug] Fix ci issue `assert output_size is not None` AssertionError (#43261)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Isotr0py <Isotr0py@outlook.com>
Co-authored-by: Isotr0py <Isotr0py@outlook.com>
commit ebbfb34e3e058bd539db9e5015d0c18b7ce5a5e0
Author: Kevin H. Luu <khluu000@gmail.com>
Date: Thu May 21 01:57:47 2026 -0700
[Test] Replace zephyr-7b-beta (7B) with SmolLM2-135M in tokenization test (#43085)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
commit edafea35550fab0b185b885711ec048dfd2e1a4d
Author: zhangxin81 <115389973+zhangxin81@users.noreply.github.com>
Date: Thu May 21 16:17:12 2026 +0800
Fix FlashInfer TRTLLM NvFP4 monolithic MoE routing (#43223)
Signed-off-by: zhangxin81 <115389973+zhangxin81@users.noreply.github.com>
commit b719b1635b4899e2372905def0badf96d4dd242a
Author: zexplorerhj <zhjoneson@163.com>
Date: Thu May 21 16:16:27 2026 +0800
Update KDA chunk prefill decay to use exp2 semantics (#43195)
Signed-off-by: zexplorerhj <19794632+zexplorerhj@users.noreply.github.com>
Co-authored-by: zexplorerhj <19794632+zexplorerhj@users.noreply.github.com>
commit 0a54df28471be07b3d668ea21c5e411569d3baea
Author: Kunshang Ji <kunshang.ji@intel.com>
Date: Thu May 21 07:14:13 2026 +0000
[XPU] add setuptools-rust for xpu dependency (#43287)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
commit a950e9447e38727fc956afdc242bc6e3796ccb77
Author: haosdent <haosdent@gmail.com>
Date: Thu May 21 14:30:14 2026 +0800
[CI] De-flake test_models for bigscience/bloom-560m (#43197)
Signed-off-by: haosdent <haosdent@gmail.com>
commit 050611a3dd19271a3c729788ff69b3470ccfb238
Author: Yiyang "Ian" Liu <yiyangliu@microsoft.com>
Date: Wed May 20 22:58:59 2026 -0700
[Bugfix] Fix glm4_moe_tool_parser._is_string_type for /v1/responses FunctionTool format (#39601)
Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com>
Signed-off-by: Chauncey <chaunceyjiang@gmail.com>
Signed-off-by: sfeng33 <4florafeng@gmail.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: sfeng33 <4florafeng@gmail.com>
commit 905b97adfaf7b08f3cc95b328579e5336ed6d3b6
Author: yzong-rh <yzong@redhat.com>
Date: Thu May 21 01:13:15 2026 -0400
[Benchmark] Add num-warmup to vllm bench throughput (#43245)
Signed-off-by: Yifan Zong <yzong@redhat.com>
commit a6682d1d259cca69a9ae737ea5608fbbe7520031
Author: Daoyuan Li <94409450+DaoyuanLi2816@users.noreply.github.com>
Date: Wed May 20 21:35:08 2026 -0700
[Bugfix] Warn when renderer_num_workers has no effect on offline LLM (#42905)
Signed-off-by: Daoyuan Li <94409450+DaoyuanLi2816@users.noreply.github.com>
commit f2ace1d57d28df8d4c5e973dd62d87f47d628cb3
Author: Nick Hill <nickhill123@gmail.com>
Date: Wed May 20 21:24:48 2026 -0700
[Frontend][RFC] Rust front-end integration (#40848)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Co-authored-by: Bugen Zhao <i@bugenzhao.com>
commit d97ba29fdcf2538359fac5c644c0f07e59bc1988
Author: 손세정 <maze0717@g.skku.edu>
Date: Thu May 21 13:24:08 2026 +0900
[ToolParser][Bugfix] Re-land: Fix anyOf/oneOf/$ref type resolution in Qwen3CoderToolParser (#37831) (#38973)
Signed-off-by: AAISSJ <maze0717@g.skku.edu>
Signed-off-by: <>
Signed-off-by: sejung-son <sejung.son@nhn.com>
Signed-off-by: sfeng33 <4florafeng@gmail.com>
Co-authored-by: 세덩 <saison@sedeong-ui-MacBookAir.local>
Co-authored-by: sejung-son <sejung.son@nhn.com>
Co-authored-by: sfeng33 <4florafeng@gmail.com>
commit 6441cf4a44856f4eb4dce7d19a51fd69e1b423cf
Author: Flora Feng <4florafeng@gmail.com>
Date: Thu May 21 00:24:06 2026 -0400
[Refactor] Use shared coerce_to_schema_type in Seed-OSS tool parser (#43140)
Signed-off-by: sfeng33 <4florafeng@gmail.com>
commit 346cf163a11b55e069aa3143ae2878967393ddc2
Author: Ben Browning <bbrownin@redhat.com>
Date: Thu May 21 00:23:47 2026 -0400
[Frontend] Normalize reasoning_content to reasoning for client compatibility (#42664)
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit 7e5070934ee5f28103c5b95cb776904a12fc36f5
Author: haosdent <haosdent@gmail.com>
Date: Thu May 21 12:22:10 2026 +0800
[CI] Fix "test_vit_cudagraph_[image|video][step3_vl]" failure (#43082)
Signed-off-by: haosdent <haosdent@gmail.com>
commit 2b75a73b8e23f5df6de92d01a191e059424487e3
Author: Luciano Martins <22145370+lucianommartins@users.noreply.github.com>
Date: Thu May 21 01:22:06 2026 -0300
[Perf][Gemma4] Batch vision encoder calls for image and video processing (#43169)
Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com>
commit e45df8c3f77572d03f638feded5b5efbccdbcc05
Author: sonusflow <git@sonusflow.pl>
Date: Thu May 21 06:22:01 2026 +0200
[Bugfix] Fix Qwen3.5 GatedDeltaNet in_proj_ba Marlin failure at TP>=2 (#36329)
Signed-off-by: Adi McM Sonus Flow <biuro@sonusflow.pl>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit ee05e8137ec48b8e7375228a1142b4c5f2e3360c
Author: Jee Jee Li <pandaleefree@gmail.com>
Date: Thu May 21 12:20:57 2026 +0800
[Minor] Bigger overlap for FI AR (#43103)
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
commit 5d041cc1fe5181daabf39943efc7b678380d57bd
Author: Louie Tsai <louie.tsai@intel.com>
Date: Wed May 20 20:57:48 2026 -0700
update GPU json file based on h200 recipes (#43262)
Signed-off-by: louie-tsai <louie.tsai@intel.com>
commit 9640970de20b15ade9eb3859825637f64e81ed8c
Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Date: Wed May 20 21:00:30 2026 -0400
[Model Runner V2] Fix lora `Triton Error [CUDA]: device-side assert triggered` (#43139)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
commit 63ea11709bd9e9b14669e3973dff92d2dcea3cb1
Author: Ace Eldeib <alexeldeib@gmail.com>
Date: Thu May 21 02:36:16 2026 +0200
[CI] Add composed-schema regression tests for DeepSeek V3.2/V4 parsers (#43255)
Signed-off-by: Ace Eldeib <aeldeib@coreweave.com>
Co-authored-by: Flora Feng <4florafeng@gmail.com>
commit bde560ed6e1dc889debf68410ccbcb00b749513b
Author: akii96 <aakif.nawaz@amd.com>
Date: Thu May 21 01:46:51 2026 +0300
[ROCm] Add QuickReduce min-size override and codec threshold (#41675)
Signed-off-by: <>
commit 6dc0a71843878ef45e29d4732147290b797b70fd
Author: Jiangyun Zhu <riverclouds.zhu@qq.com>
Date: Thu May 21 05:19:50 2026 +0800
[Misc] downgrade nvidia-cutlass-dsl to 4.5.0 (#43230)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
commit 5774aad9c5b67c5bb67bb7d306a9652a035ed0aa
Author: Michael Goin <mgoin64@gmail.com>
Date: Wed May 20 17:13:12 2026 -0400
[Perf][gpt-oss] Downgrade triton_kernels to v3.5.1 (#43135)
Signed-off-by: mgoin <mgoin64@gmail.com>
commit 452baa860b1169787cc8540a1772c4d96f682c40
Author: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com>
Date: Wed May 20 16:10:44 2026 -0500
Add dllehr-amd to CODEOWNERS and committers list (#42772)
Signed-off-by: Douglas Lehr <Doug.Lehr@amd.com>
commit 2a43b407c5093b1255a172139da6a5151f410b7a
Author: Flora Feng <4florafeng@gmail.com>
Date: Wed May 20 14:59:12 2026 -0400
[Bugfix][CI] Add missing import of pad_nvfp4_activation_for_cutlass in flashinfer (#43237)
Signed-off-by: sfeng33 <4florafeng@gmail.com>
commit 53ff50fcd3d2012a406e5053026ea6a46c88b2b6
Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Date: Wed May 20 14:57:42 2026 -0400
[Perf] Optimize `CutlassFP8ScaledMMLinearKernel` when padding needed by pre-weight processing, 13.5% TTFT improvement (#42651)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
commit 363fc84407f8c966c1cee6786e45e9e6ab289684
Author: meena-at-work <80416898+meena-at-work@users.noreply.github.com>
Date: Wed May 20 10:21:11 2026 -0700
Integrate flashinfer b12x MoE and FP4 GEMM kernels for SM120/121 (#40082)
Signed-off-by: Meenakshi Venkataraman <meenakshiv@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
commit f2d5e3d3aeac4cb1f6d285e4a567a502ae507777
Author: haosdent <haosdent@gmail.com>
Date: Thu May 21 01:00:24 2026 +0800
[CI] Lower granite-4.0-h-tiny gsm8k threshold for Hybrid SSM NixlConnector PD accuracy tests (4 GPUs) (#43186)
Signed-off-by: haosdent <haosdent@gmail.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: NickLucche <nlucches@redhat.com>
commit 2d6b3489b9a325988ad52507236409747d2098a7
Author: Aaron Hao <ahao@anyscale.com>
Date: Wed May 20 09:07:59 2026 -0700
[R3] Add routed experts to openai entrypoint (#38939)
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
commit 9c78c99995b70726f9ea929ff2e535d6303383d6
Author: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>
Date: Wed May 20 19:50:24 2026 +0400
[MISC] Fix symm_mem cap-equal gate; log AR backend selection (#42993)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
commit a10d69116cb25c8137eeb3f320add71d4e04fda9
Author: Flora Feng <4florafeng@gmail.com>
Date: Wed May 20 10:21:00 2026 -0400
[Bugfix] Use shared coerce_to_schema_type in DeepSeekV32 tool parser (#43019)
Signed-off-by: sfeng33 <4florafeng@gmail.com>
commit 644b2a28e7eb3b11191f157416cfedebd2da995b
Author: Joel Smith <j.smith9103@outlook.com>
Date: Wed May 20 15:10:01 2026 +0100
[Bugfix] Use enable_sm120_family for per-tensor FP8 CUTLASS kernels on SM12.1 (#41215)
Signed-off-by: j9smith <j.smith9103@outlook.com>
Signed-off-by: Joel Smith <j.smith9103@outlook.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
commit ded871201a424dd0d28a00aaf74c5786457a18ee
Author: rishitdholakia13 <123388671+rishitdholakia13@users.noreply.github.com>
Date: Wed May 20 10:08:58 2026 -0400
[Bug][Structured Outputs] Fix bug that leads to unconstrained generations with structural tags (#42452)
Signed-off-by: rishitdholakia13 <rishit+github@cohere.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
commit df84fb07a6e57969941841c6363d1efbac1ba1e8
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date: Wed May 20 10:01:45 2026 -0400
Remove additional dead code as a follow-up to #42889 (#43144)
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
commit 0a508743d42a26786c1432bb7f2e93f8111b6383
Author: Benjamin Chislett <bchislett@nvidia.com>
Date: Wed May 20 09:15:52 2026 -0400
[Spec Decode] Support non-MTP speculation for NemotronH (#43130)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
commit 19cf334207ed81d3ed75a473acd1a95c785d9ed3
Author: Kebe <mail@kebe7jun.com>
Date: Wed May 20 21:58:30 2026 +0900
[Feature] Support manually enabling the cumem allocator (#33648)
Signed-off-by: Kebe <mail@kebe7jun.com>
commit 87e31455b056c6ce59bf5dcb3c622155431851db
Author: Ray Wang <roguerui6@gmail.com>
Date: Wed May 20 02:32:03 2026 -0700
[Doc] Sync CLI guide with actual help modes and launch subcommand (#40326)
Signed-off-by: Rui Wang <raygorous@gmail.com>
Co-authored-by: Rui Wang <raygorous@gmail.com>
commit cb600d1cdbb079ab9432348f128e71c4e2e0a373
Author: hallerite <git@hallerite.com>
Date: Wed May 20 10:58:46 2026 +0200
[Frontend] Forward X-data-parallel-rank header on /inference/v1/generate (#42330)
Signed-off-by: hallerite <git@hallerite.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit 6f21558da1ec7362d2b4f3d012bce2b612a74459
Author: xiangdong <40376367+zxd1997066@users.noreply.github.com>
Date: Wed May 20 16:54:58 2026 +0800
[XPU][CI] Add 2 server model test files in Intel GPU CI (#42499)
Signed-off-by: zengxian <xiangdong.zeng@intel.com>
commit 1cb224430bea0d037b57e24cf91001f47b69ddf3
Author: Artem Perevedentsev <aperevedents@nvidia.com>
Date: Wed May 20 11:46:55 2026 +0300
[GDN] Enable FI Blackwell GDN prefill kernel (#40717)
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
commit 9b343dd4f54a9870f3ba1e41f5a5b3f4a1e25340
Author: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Date: Wed May 20 17:10:00 2026 +0900
Enable mermaid diagrams in the docs (#43192)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
commit 07aeaf9d4df870a76d5a0dc19d6a7e74b4be5d3b
Author: Chris Leonard <chleonar@redhat.com>
Date: Wed May 20 03:18:12 2026 -0400
[6/n] Migrate activation kernels, gptq, gguf, non cutlass w8a8 to libtorch stable ABI (continued) (#42663)
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Chris Leonard <chleonar@redhat.com>
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
commit 40651c020772b80f9ca80272aebe749fe01cd38a
Author: Nicolò Lucchesi <nlucches@redhat.com>
Date: Wed May 20 09:02:36 2026 +0200
[Docs][PD][NIXL] Bidirectional kv-cache transfer (#43097)
Signed-off-by: NickLucche <nlucches@redhat.com>
commit 7e4bc2cecb3a8aede2d10c86a3a1a4bd98e26100
Author: Nicolò Lucchesi <nlucches@redhat.com>
Date: Wed May 20 08:58:25 2026 +0200
[Docs][PD][NIXL] Lease extension mechanism for blocks on P (#43099)
Signed-off-by: NickLucche <nlucches@redhat.com>
commit 85959567c3e71a9965616ebebe1853ca48d8d20f
Author: Kevin H. Luu <khluu000@gmail.com>
Date: Tue May 19 23:01:41 2026 -0700
[ci] Revert model executor test back to L4 (#43188)
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
commit 4f940896a32c9e2a0eba7f50d521bf5f6b4de458
Author: Ronen Schaffer <ronen.schaffer@ibm.com>
Date: Wed May 20 06:32:08 2026 +0300
[KV Offload] Pass `OffloadingSpec` instead of `VllmConfig` to secondary tiers (#43076)
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
commit cd0ff26e7acf2c691a33d4c44276db6980bab24b
Author: Michael Goin <mgoin64@gmail.com>
Date: Tue May 19 23:21:01 2026 -0400
[CI] Add DSV4-Flash to gsm8k moe-refactor/config-b200.txt (#42111)
Signed-off-by: mgoin <mgoin64@gmail.com>
commit 2ae910ed88121d7c3acdcb9bab14cd968257b6e6
Author: Izik Golan <47969623+izikgo@users.noreply.github.com>
Date: Wed May 20 06:16:07 2026 +0300
[Perf] Avoid forward scan for async output placeholders (#42938)
commit fadf5d332c6e9bb6e552c1ca529511bce0f79802
Author: pmaybank <113125070+pmaybank@users.noreply.github.com>
Date: Tue May 19 23:16:02 2026 -0400
add enqueue all option to throughput benchmark (#42975)
Signed-off-by: Philip Maybank <pmaybank@amd.com>
Signed-off-by: pmaybank <113125070+pmaybank@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit c628a93a64fb4929c3c11d8e2c7244c4826b4f76
Author: Benjamin Chislett <bchislett@nvidia.com>
Date: Tue May 19 23:15:57 2026 -0400
[Perf][Bugfix] Update dflash aux layer indexing (#40727)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
commit 5774aaed0cbeaa74ca7a75d372c1e8bd4aa11cdb
Author: Terrence Zhao <32208165+Terrencezzj@users.noreply.github.com>
Date: Tue May 19 22:32:06 2026 -0400
[Cohere] Enable Cohere MoE (#43143)
Signed-off-by: Terrencezzj <terrence@cohere.ai>
commit 39bba710bed5b6018718af3e0fd7984f6082118e
Author: Nick Hill <nickhill123@gmail.com>
Date: Tue May 19 19:19:05 2026 -0700
[MRV2][BugFix] Fix default-stream CG capture in P/W LoRA case (#43160)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
commit 73dd2f33b7a5a8a237fe7296039cec246e4c68bd
Author: Aaron Hao <ahao@anyscale.com>
Date: Tue May 19 18:01:29 2026 -0700
[bug] fix WeightTransferConfig.backend to allow for all strings (#43121)
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
commit be16785998087f80ffac08b980603241e5da16ab
Author: Fadi Arafeh <115173828+fadara01@users.noreply.github.com>
Date: Wed May 20 00:31:15 2026 +0100
[CPU][DOC] Fix installation commands for Arm CPUs (#43115)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
commit 117afeea4665367a3066c1df58d4082d07fcc946
Author: Max de Bayser <mbayser@br.ibm.com>
Date: Tue May 19 17:27:54 2026 -0400
Fix error in Dynamic NTK scaling (#41277)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
commit 12421962955ac28b6f80a0307f554fad939174dd
Author: Doğaç Eldenk <dogacel@gmail.com>
Date: Tue May 19 15:39:00 2026 -0500
[Model] Support post-norm architecture for EAGLE-3 speculators (#42764)
Signed-off-by: Doğaç Eldenk <dogacel@gmail.com>
commit a65093c1a39a8ddd8455365128ecbe259350e22c
Author: Kevin H. Luu <khluu000@gmail.com>
Date: Tue May 19 11:51:34 2026 -0700
[ci] Move language models tests (hybrid) back to L4 (#43129)
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
commit 9aaf83ef502fc37bc647f6e474314d48ba36cd1c
Author: Wei Zhao <51183510+wzhao18@users.noreply.github.com>
Date: Tue May 19 14:44:32 2026 -0400
[CI failure] Temporarily disable using persistent cache for flashinfer autotune (#43119)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
commit f54721bcc3e072d71b0e09c0b0bd6d692eb06161
Author: tomeras91 <57313761+tomeras91@users.noreply.github.com>
Date: Tue May 19 21:43:04 2026 +0300
[Bugfix][MoE] FlashInfer one-sided: workspace union across heterogeneous layers (#42976)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
commit aed2eb355a9d9136c8e17690b932983b55fb343f
Author: Dao007forever <dao007forever@gmail.com>
Date: Tue May 19 11:14:43 2026 -0700
[Docs] Fix MooncakeStoreConnector role in disaggregated example (#42994)
Signed-off-by: Dao Le <Dao007forever@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
commit d247a931cc25e7253feccbd6260d48216ff5c081
Author: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Date: Tue May 19 17:02:05 2026 +0100
[feat] Add FP8 per-tensor Q scale support to Triton attention backend (#42080)
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
commit 8200fbe1ac73f00a46b1cdd6c4c93bdaf2c33022
Author: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Date: Tue May 19 23:36:47 2026 +0800
[Misc] add humming to dependencies (#42540)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
commit 42b4f1fdf7269de8aa83755a805555fe78add28b
Author: Flora Feng <4florafeng@gmail.com>
Date: Tue May 19 11:21:12 2026 -0400
[Refactor] Extract extract_types_from_schema utility from Minimax M2 tool parser (#43025)
Signed-off-by: sfeng33 <4florafeng@gmail.com>
commit 1c6158083a6fc3aff408660d2defd7602f78f556
Author: Wang Yiwen <121547057+yiwen101@users.noreply.github.com>
Date: Tue May 19 23:17:42 2026 +0800
[Model] Openvla support (#42654)
Signed-off-by: Wang Yiwen <121547057+yiwen101@users.noreply.github.com>
commit d740e2c02919cfba5a86a40d1c12439d03f5ac07
Author: Xinyu Chen <xinyu1.chen@intel.com>
Date: Tue May 19 23:09:07 2026 +0800
[XPU] update xpu graph usage (#43043)
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
commit b82e908b4c65a1f162e2d35a8106f09d95d8aa02
Author: Nick Hill <nickhill123@gmail.com>
Date: Tue May 19 07:35:54 2026 -0700
[Perf][4/n] Eliminate various GPU<->CPU syncs (#42347)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
commit a78b842d0e85d287176031334f4721cd96b6e47d
Author: Sage <80211083+sagearc@users.noreply.github.com>
Date: Tue May 19 13:21:49 2026 +0300
[Bugfix] Fix top logprobs token placeholders in `/inference/v1/generate` (#42887)
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
commit 129019f3342f1b7346ed8f4c1ac9fdefd8fe6ef8
Author: zhanqiuhu <49648934+ZhanqiuHu@users.noreply.github.com>
Date: Tue May 19 05:44:33 2026 -0400
[CI] Add MTP + PD disagg test for Qwen3.5 (#42677)
Signed-off-by: ZhanqiuHu <zhu@redhat.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
commit ef54a4d604ef3725bd52aa2893f71d671bf5329a
Author: Shanshan Shen <467638484@qq.com>
Date: Tue May 19 16:43:16 2026 +0800
[Misc][MM] Remove redundant code in CLIPAttention (#43046)
Signed-off-by: shen-shanshan <467638484@qq.com>
commit 07beaed8422d2df34a20e8ebd22b7924d563a566
Author: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Date: Tue May 19 01:12:46 2026 -0700
[Model Refactoring] Rename deepseek_v4.py to model.py [4/N] (#43077)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
commit 056bc2e16646599a96ac94e761c953e680e6fba9
Author: Yifan Qiao <yifanqiao@inferact.ai>
Date: Tue May 19 01:07:46 2026 -0700
[KVConnector][DSV4] HMA support for Mooncake store connector (#42828)
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
commit f34623bf3cac5b33451a761e802c9531e83d1c68
Author: Aaron Hao <ahao@anyscale.com>
Date: Tue May 19 01:06:21 2026 -0700
[bug] AsyncScheduler drops first post-resume token after pause_generation + clear_cache (#42117)
Signed-off-by: hao-aaron <ahao@anyscale.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit b14be81c1f63b70668d26d65a377b6383fbca936
Author: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Date: Tue May 19 00:52:54 2026 -0700
[Model Refactoring] Move deepseek_v4_ops to models/deepseek_v4 [3/N] (#43073)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
commit 301d986473a0ffc1df563422e01eac4a1efd59e0
Author: wang.yuqi <yuqi.wang@daocloud.io>
Date: Tue May 19 15:37:40 2026 +0800
[Frontend] Consolidate beam search by BeamSearchMixin. (#42946)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
commit 257af77bc2b612d5ebd0aecea777139036543af3
Author: wang.yuqi <yuqi.wang@daocloud.io>
Date: Tue May 19 14:43:18 2026 +0800
[Docs] Reorganize online serving docs. (#41907)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
commit 4a4fdabe28f3e2c8f9d05bcc80c4bf6d656b1ead
Author: Taneem Ibrahim <taneem.ibrahim@gmail.com>
Date: Tue May 19 01:16:42 2026 -0500
[Misc] Aligning tokwise pooler heads for consistency (#43041)
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com>
commit f1e3f0e6d685082bdb313c20914099ac5ede5f14
Author: Chaojun Zhang <chaojun.zhang@intel.com>
Date: Tue May 19 14:14:59 2026 +0800
[XPU] Use custom op collective behavior (#41354)
Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
commit 9fd8487d2f56468aeec8154123641eb7c2eeacdf
Author: Gracie Guo (UX) <114208705+gracie-guo@users.noreply.github.com>
Date: Tue May 19 13:50:38 2026 +0800
[Docs] Add SVG images for pooling models. (#42626)
Signed-off-by: Gracie Guo <gracieguo@Gracies-MacBook-Pro.local>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: Gracie Guo <gracieguo@Gracies-MacBook-Pro.local>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
commit 27f4ba94811ef14bd45bcdc0c0b8e288a7cc6bc6
Author: Junyan Xu <junyanxu5513@gmail.com>
Date: Mon May 18 22:29:04 2026 -0700
fix: use keyword arguments for shard_id and expert_id in weight_loade… (#42671)
Signed-off-by: junyanxu <junyanxu5513@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit 6e889b582b6a0b11f22b3764be174266faa9ff5e
Author: Kevin H. Luu <khluu000@gmail.com>
Date: Mon May 18 21:58:36 2026 -0700
[ci] Route 28 gpu_1_queue tests to h200_35gb queue (#43030)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
commit fab07e4d0f7f266643c6ac0dc944f9f433ef2140
Author: Qiuyang Yue <yueqiuyang1389@gmail.com>
Date: Mon May 18 21:22:33 2026 -0700
[Bugfix][KV Connector] Fix SimpleCPUOffloadScheduler TOCTOU between Phase A and Phase B (#42289)
Signed-off-by: Qiuyang Yue <yueqiuyang1389@gmail.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: gemini-code-assist <noreply@google.com>
commit 3ca8db2ef88ec5a6686e62ee3ac899afae85c7af
Author: gnovack <gnovack@amazon.com>
Date: Mon May 18 21:17:56 2026 -0700
add cutedsl dsv4 indexer fp8 kernel (#42899)
Signed-off-by: george <george@inferact.ai>
Co-authored-by: george <george@inferact.ai>
commit 87b08c5f6460cf487e47872c5fbc2595c97e74ef
Author: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Date: Mon May 18 21:00:58 2026 -0700
[Model Refactoring] Move DeepSeek V4 layers to `models/deepseek_v4/` [2/N] (#43039)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
commit fba010dd74e2f94e4f7223b164ec9097d1b8a6af
Author: Nicolò Lucchesi <nlucches@redhat.com>
Date: Tue May 19 05:25:41 2026 +0200
[Bugfix][MRV2] Fix KVCache tensor explicit `kernel_block_size` dim (#42766)
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
commit da03e549b34685c4e63a091e973d907aee48a68c
Author: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
Date: Tue May 19 11:25:37 2026 +0800
[UX] Add a persistent cache for FlashInfer autotuning (#42537)
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
commit 36dcaf25d8e091ea0f47b9ce7dcfca05de56f16d
Author: Kunshang Ji <kunshang.ji@intel.com>
Date: Tue May 19 03:17:09 2026 +0000
[XPU] add gptq(int4) support (#37844)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
commit 8f16c4a5c0feb01f106e5981f22ae8808a94a28b
Author: Ofir Zafrir <ofir.zafrir@intel.com>
Date: Tue May 19 06:16:07 2026 +0300
[BugFix][CPU][Spec Decode] Fix Eagle implementation on CPU backend (#42468)
Signed-off-by: Ofir Zafrir <ofir.zafrir@intel.com>
commit afd7b1dce94fed484351fafd5bf5ea6601ac621e
Author: Revital Sur <eres@il.ibm.com>
Date: Tue May 19 06:12:04 2026 +0300
[Bugfix] Use platform-agnostic device in example_connector load (#42926)
Signed-off-by: Revital Sur <eres@il.ibm.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
commit 287471b99442b44c5a16c4d70b0f3e178dd52732
Author: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Date: Mon May 18 19:50:02 2026 -0700
[Model Refactoring] Migrate DeepSeek V4 to vllm/models/ [1/N] (#43004)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
commit 239b5ff30cf46f9196149c888a20be2096fdff03
Author: Michael Goin <mgoin64@gmail.com>
Date: Mon May 18 20:22:27 2026 -0400
[Frontend] Add --spec-method/--spec-model/--spec-tokens CLI aliases (#42476)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
commit f85c76d701fc049a722c17b3affd9401380be1bf
Author: Artem Perevedentsev <aperevedents@nvidia.com>
Date: Tue May 19 02:58:15 2026 +0300
[CI/Build] Bump nvidia-cutlass-dsl to 4.5.1 (#42991)
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
commit a171e6b52dff47dc567657e7d51f641bdcb22774
Author: shanjiaz <zsjwpianpian@gmail.com>
Date: Mon May 18 19:39:09 2026 -0400
Add parallel drafting to v2 model runner unsupported features (#43010)
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>
commit 37ece593c105b5bb818aa94885617b863d390d7f
Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Date: Mon May 18 19:38:12 2026 -0400
[Perf] Padded nvfp4 quant kernel to remove additional copy, 2.4%~5.7% e2e performance improvement (#42774)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
commit 57fef4e0bf0bfaddf117dfdc9367e1fb957b423f
Author: Flora Feng <4florafeng@gmail.com>
Date: Mon May 18 17:55:39 2026 -0400
[Refactor] Extract shared coerce_to_schema_type utility from Minimax M2 tool parser (#43006)
Signed-off-by: sfeng33 <4florafeng@gmail.com>
commit 0191354827560fe38f68b4e7207f8824d6152ca3
Author: haosdent <haosdent@gmail.com>
Date: Tue May 19 05:29:10 2026 +0800
[Perf][MLA] Enable FULL cudagraph capture for TRITON_MLA decode (#42885)
Signed-off-by: haosdent <haosdent@gmail.com>
commit cd49a05d5aa3cc296912297b3c2b577efe4183c8
Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Date: Mon May 18 16:41:22 2026 -0400
[Refactor] Remove dead code (#42889)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
commit 84747489ded65265ee7d43815bfa3373b0d42279
Author: Ronen Schaffer <ronen.schaffer@ibm.com>
Date: Mon May 18 22:41:58 2026 +0300
Tier offload followup (#42529)
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
commit 8fc1c284b94668b60c30737e178cb7e6cd651e89
Author: Tuukka Sarvi <tuukka.sarvi@amd.com>
Date: Mon May 18 21:56:22 2026 +0300
[ROCm] Guard AITER GDN decode fast path by layout (#42880)
Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
commit ce88f01c9ac4fcde9dd43a983074d4e893cde65d
Author: Amit Portnoy <1131991+amitport@users.noreply.github.com>
Date: Mon May 18 21:22:56 2026 +0300
[Docs] update attribution to reflect EDEN foundation (#41666)
Signed-off-by: amitport <1131991+amitport@users.noreply.github.com>
commit 00e20e76f775b88f47469ae9fcb0f1ecd7580bb9
Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Date: Mon May 18 14:14:21 2026 -0400
[Refactor] Remove dead cuda kernels (#42767)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
commit 9758a6e5c5a556275c030db456d5d434ee999d58
Author: czhu-cohere <conway.zhu@cohere.com>
Date: Mon May 18 11:12:06 2026 -0700
[BugFix] support PP for Cohere vision model (#42819)
Signed-off-by: <conway.zhu@cohere.com>
Signed-off-by: root <conway.zhu@cohere.com>
commit a2c8fc66573664395f491a94da1882fdf92e034b
Author: Bowen Bao <bowenbao@amd.com>
Date: Mon May 18 10:46:13 2026 -0700
[ROCm][Quantization][3/N] Refactor quark_moe w4a4 w/ oracle (#41436)
Signed-off-by: Bowen Bao <bowenbao@amd.com>
commit 6859ca76159fdd403b687c0c296e5a12850ba24e
Author: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Date: Tue May 19 01:32:26 2026 +0800
[Bugfix] fix swiglu limit issue for humming backend + deepseek v4 (#42541)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
commit 67f58ce23f469e118688a50687ef0fbb14a1c028
Author: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
Date: Tue May 19 01:02:01 2026 +0800
[Bugfix] Fix DSV4 MTP after ROCm mHC integration (#42930)
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
commit 8c296de63b47664fc5979831e1ae2d2a14a05b1a
Author: Wei Zhao <51183510+wzhao18@users.noreply.github.com>
Date: Mon May 18 12:12:27 2026 -0400
[Perf] Re-enable flashinfer autotune by default and cleanup (#42857)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
commit b12745e4f31ffacf401cc20a97c592d6a49f3269
Author: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Date: Tue May 19 00:56:09 2026 +0900
Fix `--convert` passed without `--runner` on causal models (#42935)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
commit e26736973a1981dbb4054dc1ac430e78d8006ef2
Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Date: Mon May 18 11:27:21 2026 -0400
[Model Runner V2] Fix prompt logprobs calculation `Sizes of tensors must match` error (#42778)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
commit 47829b1159335a010521ea3e5361d51744a36b0a
Author: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Date: Mon May 18 18:26:00 2026 +0300
[Bugfix] mamba: run single-token extends as decodes (#42430)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
commit 4a39b4f55374d48ebaa2ca02312e24639db8e0b8
Author: Blanc Swan <85233612+blancsw@users.noreply.github.com>
Date: Mon May 18 17:20:04 2026 +0200
[Model] Add Apertus Tool Parser (#41154)
Signed-off-by: Blanc <swan.blanc@infomaniak.com>
commit 78e7a7b9b0b9c285bf6978c3fc09eeecea3ff230
Author: Siddharth Bedekar <104613085+bedeks@users.noreply.github.com>
Date: Mon May 18 08:02:43 2026 -0700
Refactor AWQ Marlin MoE onto modular WNA16 oracle (#42483)
Signed-off-by: Siddharth Bedekar <bedeksid@gmail.com>
Signed-off-by: Siddharth Bedekar <104613085+bedeks@users.noreply.github.com>
Co-authored-by: Robert Shaw <robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit f5d3dc7115cf77472ba5e274f6becbbeddbf4bd5
Author: Michael Goin <mgoin64@gmail.com>
Date: Mon May 18 10:26:07 2026 -0400
[Model Runner v2] Support update_config (#42783)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit 1ac10f159a09897baada01b14b6a0dd6442aefd6
Author: vllm-agent <claw@inferact.ai>
Date: Mon May 18 06:02:51 2026 -0700
Revert "[torch.compile] Add patch for fullgraph compilation" (#42686) (#42913)
Co-authored-by: Luka Govedič <luka.govedic@gmail.com>
Co-authored-by: Zhewen Li <zhewenli@inferact.ai>
commit e5417657e55ec2f42809816e4aa5c9753f390cdd
Author: liranschour <liranschour@users.noreply.github.com>
Date: Mon May 18 15:59:42 2026 +0300
[KV Connector][Offloading] Flush all pending jobs on last step (#42611)
Signed-off-by: Liran Schour <lirans@il.ibm.com>
Signed-off-by: liranschour <liranschour@users.noreply.github.com>
Co-authored-by: Or Ozeri <or@ozery.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit 2e40faf08b2cae4ff6e27a255fe10833365de0e8
Author: xiangdong <40376367+zxd1997066@users.noreply.github.com>
Date: Mon May 18 20:34:48 2026 +0800
[XPU][CI] Temporarily skip test_moe_lora_align_block_size_mixed_base_and_lora[1] in Intel GPU CI (#42954)
Signed-off-by: zengxian <xiangdong.zeng@intel.com>
commit 69c91d010a596bb74b553fe157497a1fd6edb47c
Author: Nicolò Lucchesi <nlucches@redhat.com>
Date: Mon May 18 14:34:16 2026 +0200
[MRv2] Default to MRv1 when a connector is present (#42955)
Signed-off-by: NickLucche <nlucches@redhat.com>
commit 737bfa3a43ce386bd1894792f3302d9f3f9d73fa
Author: roikoren755 <26850796+roikoren755@users.noreply.github.com>
Date: Mon May 18 14:54:00 2026 +0300
[Bugfix][Hybrid][NemotronH] Fix mamba_cache_mode=all + speculative decoding crash (#41233)
Signed-off-by: Roi Koren <roik@nvidia.com>
commit e414e1f1c020108593526b706efaf89e427c05a2
Author: Kfir Toledo <kfir.toledo@ibm.com>
Date: Mon May 18 14:36:02 2026 +0300
[Bugfix][KV Offload] count appended GPU blocks in store group_sizes (#42945)
Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com>
commit df852ed503ac1a79e568271cd6f136a7b2698f5e
Author: inisis <desmond.yao@buaa.edu.cn>
Date: Mon May 18 18:33:29 2026 +0800
fix: remove unused norm for dpskv4 (#41710)
Signed-off-by: inisis <desmond.yao@buaa.edu.cn>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
commit 88a860d7545aad69661daad7a1c2b04f59c76144
Author: Yuwen Zhou <yuwen.zhou@intel.com>
Date: Mon May 18 18:04:45 2026 +0800
[CPU] Add MXFP4 W4A16 MoE support (#41922)
Signed-off-by: yuwenzho <yuwen.zhou@intel.com>
Signed-off-by: Yuwen Zhou <yuwen.zhou@intel.com>
commit cac81b6eda418fb5ca86b81197914dd02666353e
Author: Tianmu Li <tianmu.li@intel.com>
Date: Mon May 18 03:04:41 2026 -0700
[CPU Backend] Improve cpu thread utilization (#42666)
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit b4601ad43ff7ff2b9e2f52379144481e45bcf6c5
Author: Li, Jiang <jiang1.li@intel.com>
Date: Mon May 18 18:04:36 2026 +0800
[CPU] Add fused GDN support for AMX CPU platform (#42707)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
commit 2267f70070bdee8057b4afae69cba9b847add587
Author: Jee Jee Li <pandaleefree@gmail.com>
Date: Mon May 18 18:04:31 2026 +0800
[Kernel] Pack topk id/weights triton kernel (#42527)
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
commit 965d076148326f4511b6b832cbe7d974db74dbe9
Author: Tony Lin <tony.lin@intel.com>
Date: Mon May 18 17:38:54 2026 +0800
[CPU] Specify required KV cache layout for CPU attention backend (#42740)
Signed-off-by: Tony Lin <tony.lin@intel.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
commit c38bed4248e97e5ed981569777d035d31ace5368
Author: wenjun liu <wenjun.liu@intel.com>
Date: Mon May 18 16:36:45 2026 +0800
delete xpu ci (#42582)
Signed-off-by: wenjun.liu <wenjun.liu@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit 998714b21b413c78db8eb7af7f384dc90c0b10dc
Author: Xin Yang <105740670+xyang16@users.noreply.github.com>
Date: Mon May 18 01:32:46 2026 -0700
[Perf] Add do_not_specialize in fused FP8 RoPE kernel (#42849)
Signed-off-by: Xin Yang <xyangx@amazon.com>
commit 9537542537728af9fac418ecf1604ad8e8d9ff93
Author: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Date: Mon May 18 17:31:06 2026 +0900
Revert checkpoint specific workaround in Transformers modelling backend (#42923)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
commit 5ab6d1b3fd407404cd78488bf6f4cbcde6d912b7
Author: Rishapveer Singh <singhrishapveer@gmail.com>
Date: Mon May 18 10:14:36 2026 +0200
[Model] [Perf] Use flatten for Qwen3.5's GDN output projection (#42311)
Signed-off-by: Rishapveer Singh <singhrishapveer@gmail.com>
commit 7d5b033782681acee274f4f379c9fadc557fd7e8
Author: Jee Jee Li <pandaleefree@gmail.com>
Date: Mon May 18 15:22:26 2026 +0800
[LoRA] Support 2D and 3D MoE LoRA adapter at the same time (#42242)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
commit e3aeee5ff8bf7e89fea231d2a965701248eb43c0
Author: Nguyễn Thế Duy <nduy250299@gmail.com>
Date: Mon May 18 14:17:53 2026 +0700
[Bugfix] moe lora align kernel grid (#40131)
Signed-off-by: TheDuyIT <nduy250299@gmail.com>
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
Signed-off-by: dtnguyen <dtnguyen@nvidia.com>
Co-authored-by: Jee Jee Li <jeejeelee@inferact.ai>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
commit c1f7854342d1e80f7f2406524d242b8ee5476d6d
Author: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Date: Mon May 18 15:33:32 2026 +0900
Improve logging when docs build is skipped (#42929)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
commit 23c15acd770cf16ed36c6d3fed8e7d78db7d5282
Author: gaozihao-shy <gaozihao3@huawei.com>
Date: Mon May 18 13:07:16 2026 +0800
[BugFix] Kimi-K2.5: skip vision tower dtype conversion when using quantization (#42869)
Signed-off-by: gaozihao-shy <gaozihao-shy@users.noreply.github.com>
Signed-off-by: gaozihao <gaozihao3@huawei.com>
commit b50646e5effd7cb5884cd96fdff4c53c18521198
Author: Andreas Karatzas <akaratza@amd.com>
Date: Sun May 17 22:57:59 2026 -0500
[ROCm][CI] Stabilize ROCm pooling and multimodal CI (#42909)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
commit 990f49bdcb8ff51c0ceb1d784c3ca16e6c276927
Author: Soyaazz <523420504@qq.com>
Date: Mon May 18 11:19:13 2026 +0800
[MM][CG] Enable encoder Cudagraph for Step3VL (#42224)
Signed-off-by: JisoLya <523420504@qq.com>
Signed-off-by: Soyaazz <523420504@qq.com>
commit 107210442da1bc6985bfa615b55e1e5c2dd98958
Author: Alec <35311602+alec-flowers@users.noreply.github.com>
Date: Sun May 17 19:11:46 2026 -0700
[CI] Add NIXL EP import canary (#42567)
Signed-off-by: Alec Flowers <aflowers@nvidia.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
commit 03ddc1c9bc5e448e0da6236268a611d7d001dbae
Author: Yiliu Dong <91178480+qianlihuang@users.noreply.github.com>
Date: Mon May 18 09:57:04 2026 +0800
[Perf] Wire silu_and_mul_per_block_quant into TritonFP8MoE (MiniMax-M2) (#42497)
Signed-off-by: qianlihuang <yiliu.dong@qq.com>
Signed-off-by: Yiliu Dong <91178480+qianlihuang@users.noreply.github.com>
Co-authored-by: qianlihuang <yiliu.dong@qq.com>
commit 966903eb93a053a908fbf8b931fcebfb28c4741a
Author: Luka Govedič <ProExpertProg@users.noreply.github.com>
Date: Sun May 17 15:49:16 2026 -0400
[torch.compile] Add patch for fullgraph compilation (#42686)
Signed-off-by: Luka Govedič <luka.govedic@gmail.com>
commit 599e75f432e5fd7c77e65dc95587f3441201bdbc
Author: TJian <tunjian.tan@embeddedllm.com>
Date: Mon May 18 00:18:50 2026 +0800
[ROCm] [Bugfix] Fix DeepSeek V4 Functionality and Accuracy (#42810)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
commit 1c8e9c0399f6a6a98f406dce5947a2ad318e195a
Author: Taneem Ibrahim <taneem.ibrahim@gmail.com>
Date: Sun May 17 09:40:21 2026 -0500
Refactor: Pass num_labels explicitly to PoolerClassify instead of reading from global config (#42851)
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com>
commit 0fa888465e5a30b797bdf2cdcd0f57fc77541cef
Author: zofia <110436990+zufangzhu@users.noreply.github.com>
Date: Sun May 17 16:55:10 2026 +0800
[XPU] fix weight scale shape (#42725)
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
commit ff712f6447093d07747c88680b9d006b119f5890
Author: liuzhenwei <zhenweiliu@habana.ai>
Date: Sun May 17 12:15:50 2026 +0800
[MRV2][XPU] add Model Runner V2 log (#42710)
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
commit 504a26ce2be2415118b73966480b4fc04d9b7bf8
Author: Qi Zhou <qizzzh@google.com>
Date: Sat May 16 17:54:58 2026 -0700
Support bf16 for mamba ssm cache (#41680)
Signed-off-by: Qi Zhou <qizzzh@google.com>
commit a94189295b8b9c1d952be438b49ed5793db59159
Author: weizhoublue <45163302+weizhoublue@users.noreply.github.com>
Date: Sun May 17 08:54:27 2026 +0800
Fix Weight loading for Qwen3.5-MTP and Qwen3-VL using runai_streamer (#42716)
Signed-off-by: weizhoublue <weizhou.lan@daocloud.io>
commit 0867497368f390212a3f9684e2e05f698f8d1149
Author: Artem Perevedentsev <aperevedents@nvidia.com>
Date: Sun May 17 00:55:12 2026 +0300
[CI/Build] Bump flashinfer to v0.6.11.post2 (#41711)
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Co-authored-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>
commit 36e74c9ea4feb5ade38ffa1ea96f24dd73316e02
Author: Zhewen Li <zhewenli@meta.com>
Date: Sat May 16 13:34:15 2026 -0700
[KV Connector] Support disk offloading in MooncakeStoreConnector (#42689)
Signed-off-by: Zhewen Li <zhewenli@inferact.ai>
Co-authored-by: Zhewen Li <zhewenli@inferact.ai>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit 787bc0d0313840c16e403dfa2d135781d41d3614
Author: Taneem Ibrahim <taneem.ibrahim@gmail.com>
Date: Sat May 16 14:58:16 2026 -0400
Add unit tests for pooler activation functions (#42824)
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com>
commit d1586e1a1242754d2f6ac51f4f16680f7d4b129b
Author: weizhoublue <45163302+weizhoublue@users.noreply.github.com>
Date: Sun May 17 01:02:54 2026 +0800
Fix: Propagate pinned model revisions into Ultravox secondary weight loading (#42830)
commit 8a56da3845270837424ef4b7ee83ca97a7883025
Author: Jiangyun Zhu <riverclouds.zhu@qq.com>
Date: Sat May 16 22:04:12 2026 +0800
[Experimental] Breakable CUDA graph (#42304)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
commit 4db300e95fd29f5b1a4a7c34f4fbe91b7e9abb24
Author: Andreas Karatzas <akaratza@amd.com>
Date: Sat May 16 04:35:05 2026 -0500
[ROCm][CI] Removed problematic command override mechanism (#42807)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
commit 657b42b5922d21fef00529144ef5bb5633ad04b1
Author: Zhewen Li <zhewenli@meta.com>
Date: Sat May 16 00:26:25 2026 -0700
[Docker][KVConnector] Build mooncake-transfer-engine from source (#42114)
Signed-off-by: Zhewen Li <zhewenli@inferact.ai>
Signed-off-by: khluu <khluu000@gmail.com>
Co-authored-by: Zhewen Li <zhewenli@inferact.ai>
Co-authored-by: khluu <khluu000@gmail.com>
commit 32b7177909d1c9928bcedd81de7de5a1fa21d2b3
Author: Jee Jee Li <pandaleefree@gmail.com>
Date: Sat May 16 11:22:35 2026 +0800
[LoRA][Bugfix] Dedup LoRA wrapping for modules referenced from multiple attribute paths (MoE gate) (#42757)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
commit 39c67d714ef091df1533181bdc3df82dc9ac3e07
Author: DustHunter <dusthunter@126.com>
Date: Sat May 16 09:29:27 2026 +0800
fix: add API key authorization to /v2 endpoints (#42594)
Signed-off-by: DustHunter <dusthunter@126.com>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
commit 87a2adcb43513ead1434aff03a535d86f56f768b
Author: Viktor Pus <viktorpus@tenstorrent.com>
Date: Sat May 16 02:44:48 2026 +0200
[Misc] Add common random prefix option to structured-output serving benchmark (#41632)
Signed-off-by: Viktor Pus <viktorpus@tenstorrent.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
commit 852f567444cf8c206219edb7b2c42aec55fc41cf
Author: Michael Goin <mgoin64@gmail.com>
Date: Fri May 15 20:15:52 2026 -0400
[Bugfix] Respect explicit --kv-cache-dtype over checkpoint kv_cache_scheme (#42782)
Signed-off-by: mgoin <mgoin64@gmail.com>
commit b2a27b82d970efa0203c06be6dc0d94526edaab0
Author: Michael Goin <mgoin64@gmail.com>
Date: Fri May 15 20:07:39 2026 -0400
[Kernel][UX] Add `--linear-backend` arg for linear kernel selection (#39538)
Signed-off-by: mgoin <mgoin64@gmail.com>
commit d0921bafeff9bbe7a7b4efef6371700e69224702
Author: Keyi Li <94494390+JasonKeyiL@users.noreply.github.com>
Date: Fri May 15 16:20:33 2026 -0700
[Bugfix] Unwrap VLM wrappers for EPLB on Model Runner V2 (#42706)
commit 1ccdf87507407cb02460ec2e7a3e1a4cac9b0a4a
Author: rasdani <73563550+rasdani@users.noreply.github.com>
Date: Fri May 15 15:20:53 2026 -0700
[Bugfix] Fix layerwise reload alias-buffer corruption (#42481)
Signed-off-by: rasdani <73563550+rasdani@users.noreply.github.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
commit bd9dbe60601c986b50260f299fe279d057d7d89f
Author: Rita Brugarolas <Rita.BrugarolasBrufau@amd.com>
Date: Fri May 15 13:50:03 2026 -0700
[ROCm][Bugfix] Fix fused_mla_dual_rms_norm for AITER API rename _fused_qk_rmsnorm (#42606)
Signed-off-by: Rita Brugarolas Brufau <rita.brugarolasbrufau@amd.com>
commit de2d76f35239c58202e49469dc5524b6f6fc4ffb
Author: Michael Goin <mgoin64@gmail.com>
Date: Fri May 15 16:46:16 2026 -0400
[Build] Switch CUDA 12.9 wheel builds to PyTorch manylinux_2_28 base (#41668)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
commit 9a7a273dfe6a89bbe00639fe99b0d61095fbc40a
Author: Sergei Skvortsov <yvorott@gmail.com>
Date: Fri May 15 21:01:21 2026 +0100
Add HumanEval and GSM8K benchmarks to datasets (#42648)
Signed-off-by: southfreebird <yvorott@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
commit b2c58ee9427f15563210e184c57a6e530f37e464
Author: Lanze Liu <86434077+liulanze@users.noreply.github.com>
Date: Fri May 15 12:34:59 2026 -0700
[FlashAttn] Fix supports_kv_cache_dtype() accepting unhandled fp8 kv-cache dtype variants (#42685)
Signed-off-by: Lanze Liu <lanzetech@gmail.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
commit 4d67d3bde25f94b6199ce16c7ef239ae4412bb8f
Author: frida-andersson <fanders…
|
Wow, great work, thank you! ❤️ |
…m-project#42689) Signed-off-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Liuweixiong0118 <lwx34158427@gmail.com>
…Backend
Mirrors the disk-offload, ReplicateConfig and tier-logging extensions
from upstream vllm PR #42689 (MooncakeStoreConnector) into the ascend
MooncakeBackend so DSv4 / DSv3.2 deployments can opt into the SSD tier
through `mooncake_config.json` ("enable_offload": true).
Design decisions (vs the upstream single-file connector):
* Backend ABC stays untouched. ReplicateConfig and the disk staging
budget are owned by MooncakeBackend internally; KVCacheStore*Thread
keeps calling Backend.put / Backend.get with the 3-arg signature so
memcache / yuanrong backends are unaffected.
* ReplicateConfig is imported with a try/except fallback. Older
mooncake builds that don't ship it fall through to the 3-arg
batch_put_from_multi_buffers call -- byte-identical to the
pre-#42689 path.
* RNIC selection helpers from upstream rdma_utils.py are skipped:
ascend protocol drives RDMA selection via global_te
(transfer engine) rather than per-GPU CSV mapping.
* MooncakeStoreConfig appends mode / enable_offload at the end of the
dataclass to keep positional-arg back-compat with any existing
callers; __post_init__ does *soft* validation (only catches obvious
mode/global_segment_size mismatches).
* sub-batch split lives inside MooncakeBackend.get. With
enable_offload disabled the budget is None and we still issue a
single batch_get_into_multi_buffers, preserving baseline behavior.
* tier logging gated by VLLM_MOONCAKE_STORE_TIER_LOG (default False);
when off zero replica_desc / classification work happens.
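The sub-batch split mentioned in the design notes can be sketched roughly as below. This is an illustration only, not the code from the commit: `split_by_budget` and its parameters are hypothetical names, and raising on a single key that exceeds the budget is an assumption (it follows the upstream review feedback about failing loudly rather than skipping the key).

```python
from typing import Optional


def split_by_budget(
    keys: list[str],
    sizes: list[int],
    budget_bytes: Optional[int],
) -> list[list[str]]:
    """Group keys into sub-batches whose total size fits the staging budget.

    budget_bytes=None models the enable_offload=False path: all keys stay
    in one batch, so only a single batched get would be issued.
    """
    if len(keys) != len(sizes):
        raise ValueError("keys and sizes must have the same length")
    if budget_bytes is None:
        return [list(keys)] if keys else []

    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for key, size in zip(keys, sizes):
        if size > budget_bytes:
            # Assumed policy: a key that can never fit is a config error.
            raise RuntimeError(
                f"key {key} needs {size} staging bytes, "
                f"exceeding budget {budget_bytes}"
            )
        if current and used + size > budget_bytes:
            # Current sub-batch is full; start a new one.
            batches.append(current)
            current, used = [], 0
        current.append(key)
        used += size
    if current:
        batches.append(current)
    return batches


print(split_by_budget(["a", "b", "c"], [4, 4, 4], budget_bytes=8))
print(split_by_budget(["a", "b", "c"], [4, 4, 4], budget_bytes=None))
```

Each returned sub-batch would then map to one batched get, so the staging buffer is never asked to hold more than the budget at once.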
Default-OFF guarantees (key default-path equivalence):
1. enable_offload defaults False -> disk_offload_buffer_budget_bytes
stays None -> single-batch get path
2. VLLM_MOONCAKE_STORE_TIER_LOG defaults False -> no replica probing
3. preferred_segment defaults None -> ReplicateConfig left at default
4. ReplicateConfig unavailable -> 3-arg put fallback
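A minimal sketch of guarantee 4, the optional-import fallback. Several things here are assumptions rather than facts from the commit: the `mooncake.store` import path, the `preferred_segment` attribute name, and passing the config as a fourth positional argument. `put_batch` and `FakeStore` are hypothetical helpers for illustration. For brevity the sketch also uses the 3-arg call when no preferred segment is set, whereas the commit message says the config is simply left at its default in that case.

```python
try:
    # Assumed import path; older mooncake builds may not export this name.
    from mooncake.store import ReplicateConfig
except ImportError:
    ReplicateConfig = None


def put_batch(store, keys, buffer_ptrs, sizes, preferred_segment=None):
    """Issue the batched put, attaching a ReplicateConfig only when needed."""
    if ReplicateConfig is None or preferred_segment is None:
        # Default path: the same 3-arg call as before the change.
        return store.batch_put_from_multi_buffers(keys, buffer_ptrs, sizes)
    config = ReplicateConfig()
    config.preferred_segment = preferred_segment  # assumed attribute name
    return store.batch_put_from_multi_buffers(keys, buffer_ptrs, sizes, config)


class FakeStore:
    """Stand-in for the real store; records the arguments of each call."""

    def __init__(self):
        self.calls = []

    def batch_put_from_multi_buffers(self, *args):
        self.calls.append(args)
        return [0] * len(args[0])


store = FakeStore()
put_batch(store, ["k0"], [[0x1000]], [[4096]])
print(len(store.calls[0]))  # number of arguments passed to the store
```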
Env vars added (mirror upstream names):
VLLM_MOONCAKE_STORE_TIER_LOG
VLLM_MOONCAKE_DISK_STAGING_USABLE_RATIO (default 0.9)
MOONCAKE_PREFERRED_SEGMENT
MOONCAKE_REQUESTER_LOCAL_HOSTNAME
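For orientation, the four variables above could be read along these lines. This is a hedged sketch: `MooncakeEnv` and `read_mooncake_env` are hypothetical names, the accepted truthy spellings (`1`, `true`) and the `(0, 1]` range check on the ratio are assumptions, and only the variable names and the two stated defaults come from the commit message.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class MooncakeEnv:
    tier_log: bool
    staging_usable_ratio: float
    preferred_segment: Optional[str]
    requester_local_hostname: Optional[str]


def read_mooncake_env(env: Mapping[str, str] = os.environ) -> MooncakeEnv:
    """Read the env vars, applying the documented defaults when unset."""
    raw_log = env.get("VLLM_MOONCAKE_STORE_TIER_LOG", "0").strip().lower()
    ratio = float(env.get("VLLM_MOONCAKE_DISK_STAGING_USABLE_RATIO", "0.9"))
    if not 0.0 < ratio <= 1.0:
        # Assumed sanity check; the real code may validate differently.
        raise ValueError(f"usable ratio must be in (0, 1], got {ratio}")
    return MooncakeEnv(
        tier_log=raw_log in ("1", "true"),
        staging_usable_ratio=ratio,
        # Empty strings are treated the same as unset.
        preferred_segment=env.get("MOONCAKE_PREFERRED_SEGMENT") or None,
        requester_local_hostname=env.get("MOONCAKE_REQUESTER_LOCAL_HOSTNAME") or None,
    )


print(read_mooncake_env({}))
```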
Reference:
vllm PR #42689: vllm-project/vllm#42689
predecessor: vllm PR #40900
base: v0.20.2rc @ 145e994
Signed-off-by: liuchenbing <chenliumail@163.com>
…m-project#42689) Signed-off-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
…m-project#42689) Signed-off-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Hi, I have two questions about this PR:
|
|
@LCAIZJ Thanks for pointing out the issues! I've updated the PR description and for further reference, please refer to docs/features/mooncake_store_connector_usage.md or https://docs.vllm.ai/en/latest/features/mooncake_store_connector_usage |
Summary
For latest usage, please refer to `docs/features/mooncake_store_connector_usage.md`.
Adds disk-tier KV offload to the `MooncakeStoreConnector` that landed in #40900. Stacks cleanly on top of #40900 with no impact on the existing CPU-only path; operators upgrade by adding `"enable_offload": true` to `mooncake_config.json`.
End-to-end validated on a 4×GB200 node with Qwen3-8B + 4 DP: 49.6 GB read back from SSD across 74 tier-summary lines, 0 failed_keys, 21,027 disk_keys + 236 memory_keys.
NOTE: Please use mooncake after da9dfea38703c9380093e4b95cc1dc3670848a51
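As a small illustration of the upgrade step, the flag can be added to an existing `mooncake_config.json` with a few lines of Python. `enable_disk_offload` is a hypothetical helper, and the temporary file below only stands in for a real config; other keys are left untouched.

```python
import json
import tempfile
from pathlib import Path


def enable_disk_offload(config_path: str) -> dict:
    """Set "enable_offload": true in an existing JSON config file."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    config["enable_offload"] = True
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Demo on a throwaway file so no real config is modified.
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = Path(tmp) / "mooncake_config.json"
    cfg_path.write_text(json.dumps({"protocol": "rdma"}))
    print(enable_disk_offload(str(cfg_path)))
```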
Architecture
Two supported topologies, selected by an explicit `mode` field:
* `embedded` (default)
* `standalone-store`: `mooncake_client` owns the CPU pool + optional SSD.
`mode` and `global_segment_size` are validated for consistency at startup: wrong combinations raise `ValueError` with an actionable message instead of failing silently.
Files changed
* `vllm/.../mooncake/store/worker.py`: `preferred_segment` wiring, mode-detection log
* `vllm/.../mooncake/rdma_utils.py`
* `vllm/envs.py`: `VLLM_MOONCAKE_STORE_TIER_LOG`
* `tests/v1/kv_connector/unit/test_mooncake_store_worker.py`
* `tests/v1/kv_connector/unit/test_mooncake_store_connector.py`
* `docs/features/mooncake_store_connector_usage.md`
How to use disk offload
Step 1. Start the Mooncake master with disk-offload enabled
Wait until the RPC port is reachable (e.g., `nc -z 127.0.0.1 50051`).
Step 2. Start the per-node owner with a CPU pool + SSD tier
Verify the owner registered with the master (the master's admin metrics line should show `Clients: 1`, `Mem Storage: 0 B / 200 GB`).
Step 3. Write `mooncake_config.json` for owner-client mode
    {
      "mode": "owner-client",
      "metadata_server": "http://127.0.0.1:8080/metadata",
      "master_server_address": "127.0.0.1:50051",
      "global_segment_size": 0,
      "local_buffer_size": "4GB",
      "protocol": "rdma",
      "device_name": "mlx5_0,mlx5_1,mlx5_2,mlx5_3",
      "enable_offload": true
    }
In owner-client mode:
* `mode=owner-client` declares the intent.
* `global_segment_size=0` means this vLLM rank contributes zero CPU memory to the pool; the owner provides all of it.
* `enable_offload=true` tells the connector to use the disk-aware recv-side batching path.
* `device_name` is a comma-separated list; entry `[i]` is used by the rank with physical GPU `i`.
Step 4. Launch vLLM
* `MOONCAKE_PREFERRED_SEGMENT` pins all PUTs to the SSD-bearing owner. Required when more than one segment is reachable from the master.
* `VLLM_MOONCAKE_STORE_TIER_LOG=1` emits per-batch tier-summary lines so you can verify the disk tier is serving reads.
Step 5. Verify disk offload is firing
On startup, each rank should log one line:
Once the owner's CPU pool fills, the recv thread starts emitting tier-summary lines:
`disk_keys > 0` confirms the SSD path is being read from. `failed_keys` should stay 0. Non-zero `failed_keys` together with `insufficient_space` in the master log means the DirectIO budget is too small; raise it:
Test plan
Related
`MooncakeStoreConnector` (CPU only)