[feat] Support hybrid allocator by KuntaiDu · Pull Request #1436 · LMCache/LMCache

KuntaiDu · 2025-08-26T06:04:18Z

This PR adds support for the hybrid allocator in LMCache. It needs to work together with vllm-project/vllm#23624.

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

gemini-code-assist

Summary of Changes

Hello @KuntaiDu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for a hybrid allocator within LMCache, enabling more flexible and potentially optimized management of KV (Key-Value) cache memory. This change is designed to work in conjunction with related updates in the vLLM project, specifically targeting improvements in how KV cache groups are handled across different layers of a model.

Highlights

Hybrid Allocator Configuration: New fields have been added to LMCacheEngineMetadata to configure layer-to-KV-cache-group mappings, facilitating the setup of a hybrid memory allocation scheme.
KV Cache Group Management: The internal RequestTracker and ReqMeta structures have been updated to manage allocated_block_ids and slot_mappings as dictionaries, allowing for distinct KV cache groups per layer or group of layers.
VLLM Integration Updates: The integration with vLLM has been refined to align with vLLM 0.9.0+ block ID handling, ensuring compatibility and proper propagation of KV cache group information during request processing.
GPU Connector Enhancements: The GPU memory connector now incorporates KV cache group awareness, enabling batched operations (batched_to_gpu, batched_from_gpu) to correctly handle and map memory operations to their respective KV cache groups.

gemini-code-assist

Code Review

This pull request introduces support for a hybrid allocator in LMCache, which involves significant changes across the configuration, vLLM adapter, and GPU connector. The changes are well-structured to handle multiple KV cache groups. I've found one critical issue that could lead to a runtime error, which I've detailed in a specific comment. Overall, the changes are in the right direction to support the new feature.

YaoJiayi

Thanks! Left a few comments:)

YaoJiayi · 2025-08-27T05:02:24Z

        self,
        new_token_ids: list[int],
-        new_block_ids: Union[Optional[tuple[list[int], ...]], list[int]],
+        new_block_ids: Optional[tuple[list[int], ...]],


@maobaolong @chunxiaozheng Do we still need backward compatibility here?
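If backward compatibility does turn out to be needed, one option is to normalize the legacy form at the boundary instead of widening the type hint. The snippet below is only a sketch under that assumption: `normalize_block_ids` is a hypothetical helper, not something in this PR, and it treats a legacy flat `list[int]` as a single KV cache group.

```python
from typing import Optional, Union

# Per-group block IDs: one list of block IDs for each KV cache group.
BlockIds = tuple[list[int], ...]


def normalize_block_ids(
    new_block_ids: Union[Optional[BlockIds], list[int]],
) -> Optional[BlockIds]:
    """Hypothetical helper: coerce legacy input into the per-group tuple form."""
    if new_block_ids is None:
        return None
    if isinstance(new_block_ids, list):
        # Legacy callers pass one flat list; treat it as a single KV cache group.
        return (new_block_ids,)
    # Already in per-group form; copy each group into a fresh list.
    return tuple(list(group) for group in new_block_ids)


legacy = normalize_block_ids([1, 2, 3])
hybrid = normalize_block_ids(([1], [2, 3]))
```

With this shape, the rest of the code only ever sees `Optional[tuple[list[int], ...]]`, which matches the narrowed signature in the diff above.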

YaoJiayi · 2025-08-27T05:04:57Z

            token_ids=token_ids,
-            slot_mapping=slot_mapping,
+            slot_mappings=slot_mappings,
+            slot_mapping=slot_mappings[0],


What's the difference between slot_mapping and slot_mappings?

slot_mapping == slot_mappings[0]. Basically, slot_mapping only contains the first KV cache group. This is for test compatibility.

YaoJiayi · 2025-08-27T05:11:42Z

                    kvcaches=kvcaches,
-                    slot_mapping=slot_mapping[:lmcache_cached_tokens],
+                    # FIXME(Kuntai): need to support multiple kv cache groups
+                    slot_mapping=slot_mappings[0][:lmcache_cached_tokens],


What is this todo here?

The todo is to fully deprecate slot_mapping in all connectors and replace it with slot_mappings.

I am planning to rename it to group_id_to_slot_mapping, as slot_mappings looks too similar to slot_mapping (I had several bugs simply because I misread slot_mappings as slot_mapping). Does that sound good to you, @YaoJiayi?

Just talked to @ApostaC. @YaoJiayi, the plan is:

Write a GPU connector that only handles the hybrid KV cache allocator, and leave all other GPU connectors unchanged. We will refactor the other GPU connectors in future PRs.

Stick to slot_mappings, and make it a tensor with shape [# of KV cache groups, # of tokens].
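To make the proposed layout concrete, here is a minimal sketch of a `[# of KV cache groups, # of tokens]` slot mapping. It uses nested Python lists as a stand-in for the `torch.Tensor` in the PR. `build_slot_mappings` is a hypothetical helper, and it assumes the usual paged-KV convention `slot = block_id * block_size + offset` and that every group has blocks covering all tokens, which a real hybrid allocator may not guarantee.

```python
def build_slot_mappings(
    block_ids_per_group: tuple[list[int], ...],
    num_tokens: int,
    block_size: int,
) -> list[list[int]]:
    """Hypothetical helper: one row of slot indices per KV cache group."""
    slot_mappings = []
    for block_ids in block_ids_per_group:
        # Token t lives in the (t // block_size)-th block of this group,
        # at offset (t % block_size) inside that block.
        row = [
            block_ids[t // block_size] * block_size + t % block_size
            for t in range(num_tokens)
        ]
        slot_mappings.append(row)
    return slot_mappings


# Two KV cache groups, six tokens, four slots per block.
slot_mappings = build_slot_mappings(([0, 2], [5, 1]), num_tokens=6, block_size=4)

# Backward-compatible view: the first KV cache group only.
slot_mapping = slot_mappings[0]
```

Row `g` holds the slots for KV cache group `g`, so the legacy `slot_mapping` is simply row 0.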

YaoJiayi · 2025-08-27T05:13:09Z

+
        self.dtype = kwargs["dtype"]
        self.device = kwargs["device"]
+        self.layer_id_to_kv_cache_group_id = kwargs.get(


Can we have two global mappings, layer_name_to_kv_cache_group_id and layer_id_to_kv_cache_group_id, so that all GPU connectors can benefit from them in the future?

I thought that in the future there may be multiple vLLM instances on the same machine connecting to one LMCache process (a result of @ApostaC's proposal to separate the connector process from the vLLM process), and in that case we would need a different layer_name_to_kv_cache_group_id for each connector. Is that the case, @ApostaC?

@YaoJiayi I confirmed with @ApostaC that the case I mentioned will happen. In that case, I think it's better to keep layer_name_to_kv_cache_group_id local to the GPU connector class. I will still move the logic that converts kv_cache_config to layer_name_to_kv_cache_group_id into a util function.
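A util function of the kind described here could look roughly like the sketch below. The two dataclasses are hypothetical stand-ins that assume kv_cache_config exposes a list of KV cache groups, each carrying its layer names. The function name and the layer names in the example are illustrative and not taken from the PR.

```python
from dataclasses import dataclass, field


@dataclass
class KVCacheGroupSketch:
    """Hypothetical stand-in for one KV cache group."""
    layer_names: list[str]


@dataclass
class KVCacheConfigSketch:
    """Hypothetical stand-in for kv_cache_config."""
    kv_cache_groups: list[KVCacheGroupSketch] = field(default_factory=list)


def build_layer_name_to_kv_cache_group_id(
    kv_cache_config: KVCacheConfigSketch,
) -> dict[str, int]:
    """Map each layer name to the index of the KV cache group that owns it."""
    mapping: dict[str, int] = {}
    for group_id, group in enumerate(kv_cache_config.kv_cache_groups):
        for layer_name in group.layer_names:
            if layer_name in mapping:
                raise ValueError(f"{layer_name} appears in more than one group")
            mapping[layer_name] = group_id
    return mapping


# Illustrative config: full-attention layers in group 0, the rest in group 1.
config = KVCacheConfigSketch(
    kv_cache_groups=[
        KVCacheGroupSketch(layer_names=["layers.0.attn", "layers.2.attn"]),
        KVCacheGroupSketch(layer_names=["layers.1.attn"]),
    ]
)
layer_name_to_kv_cache_group_id = build_layer_name_to_kv_cache_group_id(config)
```

Because the function takes the config as an argument and returns a plain dict, each GPU connector instance can hold its own mapping, which fits the multi-instance case discussed above.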

…v_cache_group_id in new VLLMPagedMemLayerwiseGPUConnectorForHybridAlloc. This is doable because it is separated into a new connector and previous version of vLLM will no longer touch VLLMPagedMemLayerwiseGPUConnectorForHybridAlloc Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

KuntaiDu · 2025-08-28T22:33:31Z

New updates:

Separate the logic of converting kv_cache_config to layer_name_to_kv_cache_group_id into a vLLM util function
Separate the logic of hybrid allocator into a separate GPU connector
Make slot_mappings a tensor

Not covered in this PR (we can do it in future PRs)

Fully deprecate slot_mapping
Add test (test can only be added after vLLM-side PR is merged)

This PR is ready for review.

ApostaC · 2025-08-29T19:07:17Z

+    slot_mappings: torch.Tensor
+    # Slot mapping for backward compatibility


Probably we want a unified interface to get the slot mapping
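One possible reading of this suggestion is a single accessor that both the legacy and the hybrid code paths go through. The sketch below is an assumption about what such an interface could look like, not the design adopted in the PR. `ReqMetaSketch` and `get_slot_mapping` are hypothetical names, and nested lists stand in for the `torch.Tensor`.

```python
from dataclasses import dataclass


@dataclass
class ReqMetaSketch:
    """Hypothetical request metadata holding one slot-mapping row per group."""
    slot_mappings: list[list[int]]

    def get_slot_mapping(self, group_id: int = 0) -> list[int]:
        """Single entry point: return the slot mapping for one KV cache group."""
        if not 0 <= group_id < len(self.slot_mappings):
            raise IndexError(f"unknown KV cache group {group_id}")
        return self.slot_mappings[group_id]

    @property
    def slot_mapping(self) -> list[int]:
        """Legacy view, derived from group 0 rather than stored separately."""
        return self.get_slot_mapping(0)


meta = ReqMetaSketch(slot_mappings=[[0, 1, 2], [8, 9, 10]])
```

Deriving the legacy field from the per-group data keeps a single source of truth, so the two can never disagree.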

KuntaiDu · 2025-08-29T19:18:37Z

TODO:

Maintain old version compatibility
Perf test on default LLMs

matthewygf · 2025-11-19T09:46:23Z

Bump. I would love to use this feature, so let me know if I can help in any way!

KuntaiDu · 2025-12-14T06:27:26Z

This feature is being deprioritized for now :( . This is a pure performance PR and is less urgent than feature PRs (e.g. process disaggregation, P2P backend).

mingfang · 2025-12-19T17:12:51Z

This feature is being deprioritized for now :( . This is a pure performance PR and is less urgent than feature PRs (e.g. process disaggregation, P2P backend).

Currently vLLM crashes when LMCache is enabled for models that require a hybrid KV cache (e.g. Qwen3-Next-80B-A3B-Instruct).

github-actions · 2026-02-18T04:02:03Z

This pull request has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.

mingfang · 2026-02-25T19:51:22Z

vLLM is still crashing with this error:

ValueError: Hybrid KV cache manager is disabled but failed to convert the KV cache specs to one unified type.

Are there any plans to support hybrid models?
@KuntaiDu this is NOT a pure performance PR. vLLM crashes on startup.

bambarambambum · 2026-03-04T12:54:53Z

Hello,
vLLM still crashes on startup.
All modern hybrid models that I have tried to run (qwen3.5, Qwen3-Coder-Next, GLM5) crash with this error.
(EngineCore_DP0 pid=251) WARNING 03-04 12:47:30 [kv_cache_utils.py:1170] Hybrid KV cache manager is disabled for this hybrid model, This means we do not enable any optimizations for saving KV cache memory (e.g., dropping the KV cache outside the sliding window). The compute of layers like sliding window is still saved.
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] EngineCore failed to start.
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] Traceback (most recent call last):
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1019, in run_engine_core
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029]     return func(*args, **kwargs)
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029]     ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 763, in __init__
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029]     super().__init__(
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 114, in __init__
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029]     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=251) Process EngineCore_DP0:
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029]     return func(*args, **kwargs)
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029]     ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 259, in _initialize_kv_caches
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029]     kv_cache_configs = get_kv_cache_configs(
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029]     ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1553, in get_kv_cache_configs
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029]     global_kv_cache_groups = get_kv_cache_groups(vllm_config, merged_kv_cache_specs)
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1231, in get_kv_cache_groups
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029]     unify_hybrid_kv_cache_specs(kv_cache_spec)
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1211, in unify_hybrid_kv_cache_specs
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029]     raise ValueError(
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] ValueError: Hybrid KV cache manager is disabled but failed to convert the KV cache specs to one unified type.
(EngineCore_DP0 pid=251) Traceback (most recent call last):
(EngineCore_DP0 pid=251)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=251)     self.run()
(EngineCore_DP0 pid=251)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=251)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=251)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1033, in run_engine_core
(EngineCore_DP0 pid=251)     raise e
(EngineCore_DP0 pid=251)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1019, in run_engine_core
(EngineCore_DP0 pid=251)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=251)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=251)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=251)     return func(*args, **kwargs)
(EngineCore_DP0 pid=251)     ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=251)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 763, in __init__
(EngineCore_DP0 pid=251)     super().__init__(
(EngineCore_DP0 pid=251)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 114, in __init__
(EngineCore_DP0 pid=251)     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=251)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=251)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=251)     return func(*args, **kwargs)
(EngineCore_DP0 pid=251)     ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=251)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 259, in _initialize_kv_caches
(EngineCore_DP0 pid=251)     kv_cache_configs = get_kv_cache_configs(
(EngineCore_DP0 pid=251)     ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=251)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1553, in get_kv_cache_configs
(EngineCore_DP0 pid=251)     global_kv_cache_groups = get_kv_cache_groups(vllm_config, merged_kv_cache_specs)
(EngineCore_DP0 pid=251)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=251)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1231, in get_kv_cache_groups
(EngineCore_DP0 pid=251)     unify_hybrid_kv_cache_specs(kv_cache_spec)
(EngineCore_DP0 pid=251)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1211, in unify_hybrid_kv_cache_specs
(EngineCore_DP0 pid=251)     raise ValueError(
(EngineCore_DP0 pid=251) ValueError: Hybrid KV cache manager is disabled but failed to convert the KV cache specs to one unified type.
[rank0]:[W304 12:47:31.886063773 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)     ^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 112, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1)     return __asyncio.run(
(APIServer pid=1)     ^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1)     return runner.run(main)
(APIServer pid=1)     ^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)     ^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 471, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 490, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)     ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)     ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 96, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)     ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 137, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)     ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 223, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)     ^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 152, in __init__
(APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)     ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 125, in make_async_mp_client
(APIServer pid=1)     return AsyncMPClient(*client_args)
(APIServer pid=1)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)     ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 839, in __init__
(APIServer pid=1)     super().__init__(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 493, in __init__
(APIServer pid=1)     with launch_core_engines(vllm_config, executor_class, log_stats) as (
(APIServer pid=1)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1)     next(self.gen)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 925, in launch_core_engines
(APIServer pid=1)     wait_for_engine_startup(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 984, in wait_for_engine_startup
(APIServer pid=1)     raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

KuntaiDu · 2026-03-05T00:50:29Z

We are switching to LMCache multiprocess mode, which allows full separation between the vLLM container and the LMCache container. Once multiprocess mode support is finalized, we will circle back to hybrid models.

dotmobo · 2026-03-05T09:01:05Z

Hi, I tried Qwen 3.5 35B A3B FP8 with LMCache and the latest vLLM, and I get the following error:

"Hybrid KV cache manager is disabled but failed to convert the KV cache specs to one unified type"

malaiwah · 2026-03-22T13:33:13Z

Need to work together with vllm-project/vllm#23624.

I noticed that vLLM has continued working on the hybrid allocator in vllm-project/vllm#30166 (and that support is now merged). The PR originally mentioned is now closed.

github-actions · 2026-05-22T02:23:46Z

This pull request has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.

KuntaiDu added 3 commits August 26, 2025 01:45

initial impl for hybrid allocator, slot mapping --> slot mappings

b7a54ca

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

fix the error in layer id --> kv cache group id.

dea2b18

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

remove gpu memory footprint

2789af5

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

gemini-code-assist Bot reviewed Aug 26, 2025

Comment thread lmcache/integration/vllm/vllm_v1_adapter.py Outdated

KuntaiDu added 11 commits August 26, 2025 22:51

remove old code, and allow layer_id_to_kv_cache_group_id to not exist.

4613987

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

fix gemini suggestion

63e677d

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

resolve merge conflict

ab6cf27

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

make ruff happy

6d160e1

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

support slot_mapping arg, for test compatiblity

50ff739

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

maintain backward compatiblity to pass tests

32a4732

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

further edit for backward compatibility

d18ac7d

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

Merge branch 'LMCache:dev' into kuntai-support-hybrid-allocator

4fefbbf

bug fix

2e15dfd

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

bug fix

337ee13

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

bug fix

6045474

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

YaoJiayi reviewed Aug 27, 2025

KuntaiDu added 7 commits August 28, 2025 21:28

separate layer-wise connector to a separate file

0a0b2f1

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

merge and resolve mypy error

3bafe17

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

use helper function from vllm to compute mapping from layer_id/layer_…

e468ea3

…name to kv cache group id Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

annotate type

36979eb

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

put VLLMPagedMemLayerwiseGPUConnectorForHybridAlloc into the list of …

89accf6

…gpu connectors Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

[bugfix] importing VLLMPagedMemLayerwiseGPUConnectorForHybridAlloc in…

256f100

… storage manager Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

KuntaiDu requested a review from ApostaC August 28, 2025 22:33

KuntaiDu added 4 commits August 28, 2025 22:38

[Cleanup] Remove layer_id_to_kv_cache_group_id from CacheEngine Metad…

4c7d1da

…ata --- now it is tied to gpu connector Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

avoid introduce slot_mappings to non-related gpu connnectors

f82d353

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

[Cleanup] reduce unnecessary code diff

d4fff44

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

make slot_mappings a tensor

5f253df

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

[bugfix]: fix indexing issues for tensor

9d04077

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

ApostaC reviewed Aug 29, 2025

KuntaiDu added 4 commits August 30, 2025 00:17

[bugfix] warning_once is not a standard attribute of logger, remove i…

3ebbb17

…t to avoid spamming Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

[Cleanup] move the mapping of layer_id --> kv_cache_group_id mapping …

251d4d0

…back to LMCache, as it needs more careful design to be pushed into vllm Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

[Cleanup] move the mapping of layer_id --> kv_cache_group_id mapping …

e1590b4

…back to LMCache, as it needs more careful design to be pushed into vllm Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

[Bugfix] move kv_cache_config out from vllm_config

3712a45

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

YaoJiayi mentioned this pull request Sep 8, 2025

LMCache Q3 Roadmap #1253

Closed

43 tasks

KuntaiDu added 6 commits September 17, 2025 14:32

resolve merge conflict

2218355

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

[bugfix] adjust the condition to trigger hybridallocatorconnector

62a5f0e

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

Merge branch 'LMCache:dev' into kuntai-support-hybrid-allocator

ee72739

update

150d331

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

Merge branch 'dev' into kuntai-support-hybrid-allocator

28ddd5b

Merge branch 'dev' into kuntai-support-hybrid-allocator

e2cfa0d

markmc mentioned this pull request Oct 21, 2025

[Core][Hybrid allocator + kv connector 1/n] Enable hybrid allocator + KV cache connector vllm-project/vllm#25712

Merged

5 tasks

Merge branch 'LMCache:dev' into kuntai-support-hybrid-allocator

4e5e833

ivanium mentioned this pull request Dec 6, 2025

[Core][Hybrid allocator + connector] Support hybrid allocator + kv cache connector vllm-project/vllm#30166

Merged

5 tasks

github-actions Bot added the stale label Feb 18, 2026

github-actions Bot removed the stale label Feb 26, 2026

github-actions Bot added the stale label May 22, 2026
