[Core][Hybrid allocator + kv connector 1/n] Enable hybrid allocator + KV cache connector by KuntaiDu · Pull Request #25712 · vllm-project/vllm

KuntaiDu · 2025-09-25T21:27:19Z

Purpose

Refactor of #25363. This PR enables the hybrid allocator to be used together with a KV cache connector in a backward-compatible way.

Test Script



import os

# Set token chunk size to 256
os.environ["LMCACHE_CHUNK_SIZE"] = "256"
# Enable CPU memory backend
os.environ["LMCACHE_LOCAL_CPU"] = "True"
# Set CPU memory limit to 20 GB
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "20.0"
# Disable vLLM V1 multiprocessing
os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
# Enable LMCache layerwise mode
os.environ["LMCACHE_USE_LAYERWISE"] = "True"


from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Configure KV cache transfer to use LMCache
ktc = KVTransferConfig(
    kv_connector="LMCacheConnectorV1",
    kv_role="kv_both",
)

# Initialize LLM with LMCache configuration
# Adjust gpu_memory_utilization based on your GPU memory
llm = LLM(model="google/gemma-3-4b-it",
          kv_transfer_config=ktc,
          max_model_len=75000,
          gpu_memory_utilization=0.18,
          enforce_eager=True)

# Define sampling parameters
sampling_params = SamplingParams(temperature=0, top_p=0.95, max_tokens=10)

# Run inference
outputs = llm.generate("hi" * 70000 + "\nhow are you?", sampling_params)
generated_text = outputs[0].outputs[0].text
print(f"Generated text: {generated_text!r}")

# This requires loading KV cache and will succeed
outputs = llm.generate("hi" * 10000 + "\nTell me a story.", sampling_params)
generated_text = outputs[0].outputs[0].text
print(f"Generated text: {generated_text!r}")

# Flush the prefix cache out of GPU memory
outputs = llm.generate("1" + "hi" * 70000 + "\nhow are you?", sampling_params)
generated_text = outputs[0].outputs[0].text
print(f"Generated text: {generated_text!r}")

# This requires loading KV cache,
# but the request cannot be executed because vLLM cannot allocate
# space for the long prefix stored by LMCache
outputs = llm.generate("hi" * 70000 + "\nTell me a story.", sampling_params)
generated_text = outputs[0].outputs[0].text
print(f"Generated text: {generated_text!r}")

Test Result

Success.

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

gemini-code-assist

Code Review

This pull request successfully enables the use of the hybrid allocator with the KV cache connector by removing the explicit restriction and adding the necessary logic to handle multiple KV cache groups. The changes are well-structured, introducing a SupportsHMA interface to check for compatibility. My review focuses on improving code quality and performance. I've identified an opportunity to refactor duplicated code for better maintainability and two instances where an expensive deepcopy operation can be replaced with a more efficient shallow copy, which should improve initialization performance.
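The compatibility check mentioned in the review can be pictured as a marker-interface test. The following is a minimal sketch, not vLLM's actual code: only the name `SupportsHMA` comes from the review, while `finish_all_groups`, `LegacyConnector`, `HybridConnector`, and `can_use_hybrid_allocator` are hypothetical names used for illustration.

```python
from abc import ABC, abstractmethod


class SupportsHMA(ABC):
    """Marker interface: the connector can handle multiple KV cache groups."""

    @abstractmethod
    def finish_all_groups(self, block_ids: tuple[list[int], ...]) -> int:
        """Hypothetical hook that receives one block-ID list per KV cache group."""


class LegacyConnector:
    """Hypothetical connector that assumes a single KV cache group."""

    def finish(self, block_ids: list[int]) -> int:
        return len(block_ids)


class HybridConnector(SupportsHMA):
    """Hypothetical connector that opts in to the hybrid allocator."""

    def finish_all_groups(self, block_ids: tuple[list[int], ...]) -> int:
        # Count blocks across every KV cache group.
        return sum(len(group) for group in block_ids)


def can_use_hybrid_allocator(connector: object) -> bool:
    """Allow the hybrid allocator only for connectors that opt in."""
    return isinstance(connector, SupportsHMA)
```

With this shape, existing connectors keep their single-group behavior unchanged, and only connectors that subclass the interface are paired with the hybrid allocator.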

KuntaiDu · 2025-09-25T21:53:10Z

@NickLucche @njhill This is the refactored version of #25363, PTAL


mergify · 2025-10-01T12:54:38Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @KuntaiDu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


… KV cache connector (vllm-project#25712) Signed-off-by: KuntaiDu <kuntai@uchicago.edu> Signed-off-by: Kuntai Du <kuntai@uchicago.edu> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>

Follow on from vllm-project#25712.

`VllmConfig` is explicitly designed as a dataclass containing user-provided configuration and model metadata. It is a global configuration object that lives throughout the entire engine lifetime and is meant to be immutable after `__post_init__()`.

`KVCacheConfig` is worker-specific, runtime-computed state. It has limited lifetime, and its purpose is limited to initializing the KV Cache in the model runner. Even if we add KV cache hints to model config.json in future, this would be parsed into `ModelConfig`, used as input to the `get_kv_cache_configs()` computation, and the resulting `KVCacheConfig` would still be runtime state.

We are currently creating per-worker copies of `VllmConfig` in order to attach the runtime `KVCacheConfig` state. But instead we should just explicitly pass `KVCacheConfig` to the connector.

Make sure to handle backwards compatibility for external connector implementations (loaded via module path) that have the old style constructor signature.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
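One way to handle old-style constructor signatures is to inspect the constructor before calling it. This is a sketch under stated assumptions, not the code from #27887: `create_connector`, `OldConnector`, `NewConnector`, and the argument names `vllm_config`, `role`, and `kv_cache_config` are hypothetical.

```python
import inspect
import warnings


class OldConnector:
    """Hypothetical out-of-tree connector with the old two-argument constructor."""

    def __init__(self, vllm_config, role):
        self.vllm_config = vllm_config
        self.role = role
        self.kv_cache_config = None


class NewConnector:
    """Hypothetical connector that accepts KVCacheConfig explicitly."""

    def __init__(self, vllm_config, role, kv_cache_config=None):
        self.vllm_config = vllm_config
        self.role = role
        self.kv_cache_config = kv_cache_config


def create_connector(connector_cls, vllm_config, role, kv_cache_config):
    """Pass kv_cache_config only if the constructor declares that parameter."""
    params = inspect.signature(connector_cls.__init__).parameters
    if "kv_cache_config" in params:
        return connector_cls(vllm_config, role, kv_cache_config=kv_cache_config)
    # Old-style signature: warn, then fall back to the two-argument call.
    warnings.warn(
        f"{connector_cls.__name__} uses a constructor without kv_cache_config",
        DeprecationWarning,
        stacklevel=2,
    )
    return connector_cls(vllm_config, role)
```

Checking the signature rather than catching `TypeError` avoids masking unrelated errors raised inside a connector's constructor.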


…atures without `KVCacheConfig` (#39832)

The v0.12.0 release contained initial support for HMA in KV connectors. As part of these changes, a `KVCacheConfig` argument was added to KV connector constructors. Backwards-compatibility support for out-of-tree connectors was included in this change, with a very prominent warning. See #25712 and #27887.

Since the warning has been around for over 5 months, we can safely remove this support.

Signed-off-by: yewentao256 <zhyanwentao@126.com>



KuntaiDu added 2 commits September 25, 2025 14:16

Refactor: make sure the API calls are backward compatible

1ded8ae

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

align function signature

42040ba

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

KuntaiDu requested review from ApostaC, NickLucche, ProExpertProg, WoosukKwon, alexm-redhat, comaniac, heheda12345, hmellor, houseroad, mgoin, njhill, robertgshaw2-redhat, simon-mo, tlrmchlsmth, yewentao256, youkaichao and ywang96 as code owners September 25, 2025 21:27

KuntaiDu mentioned this pull request Sep 25, 2025

[Core][Hybrid allocator + connector 1/n] Enable KV cache connector + hybrid allocator #25363

Closed

5 tasks

mergify Bot added v1 kv-connector labels Sep 25, 2025

gemini-code-assist Bot reviewed Sep 25, 2025


Comment thread vllm/v1/core/sched/scheduler.py Outdated

Comment thread vllm/v1/core/sched/scheduler.py

Comment thread vllm/v1/worker/gpu_worker.py Outdated

KuntaiDu added 2 commits September 26, 2025 12:14

fix mypy errors

fbaa51a

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

adjust the signature of block_ids

fae4c82

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

hmellor reviewed Oct 1, 2025


Comment thread vllm/config/__init__.py Outdated

mergify Bot added the needs-rebase label Oct 1, 2025

heheda12345 reviewed Oct 2, 2025


Comment thread vllm/v1/worker/gpu_worker.py Outdated

KuntaiDu added 2 commits October 24, 2025 00:43

fix CI errors

2fac4fb

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

Merge branch 'main' into kuntai-enable-hma-connector

4c724a6

simon-mo disabled auto-merge October 25, 2025 06:34

simon-mo merged commit b853540 into vllm-project:main Oct 25, 2025
50 of 52 checks passed

KuntaiDu mentioned this pull request Oct 27, 2025

[Stability fix] turn off HMA allocator when connector is set #27592

Merged

5 tasks

markmc mentioned this pull request Oct 31, 2025

[KV Connector] Make KVCacheConfig an explicit constructor argument #27887

Merged

KuntaiDu mentioned this pull request Nov 4, 2025

[Hybrid allocator + kv connector] revert connector test changes related to hybrid allocator #28011

Merged

5 tasks

ivanium mentioned this pull request Dec 6, 2025

[Core][Hybrid allocator + connector] Support hybrid allocator + kv cache connector #30166

Merged

5 tasks

HarshavardhanK mentioned this pull request Apr 20, 2026

[Integration] connector_v1: subclass SupportsHMA so PD-disagg works for hybrid models kvcache-ai/Mooncake#1931

Merged

21 tasks

markmc mentioned this pull request Apr 28, 2026

[KV Connector] Remove compat support for pre-v0.12.0 constructor signatures without KVCacheConfig #39832

Merged

simon-mo merged 35 commits into vllm-project:main from KuntaiDu:kuntai-enable-hma-connector


7 participants
