
[feat] Support hybrid allocator #1436

Open
KuntaiDu wants to merge 37 commits into LMCache:dev from KuntaiDu:kuntai-support-hybrid-allocator

Conversation

@KuntaiDu

Contributor

This PR adds support for the hybrid allocator in LMCache. It needs to work together with vllm-project/vllm#23624.

PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE


PR Checklist (Click to Expand)

Thank you for your contribution to LMCache! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Please try to classify PRs for easy understanding of the type of changes. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Model] for adding a new model or improving an existing model. Model name should appear in the title.
  • [Core] for changes in the core LMCache logic (e.g., LMCacheEngine, Backend etc.)
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • The code needs to be well-documented so that future contributors can easily understand it.
  • Please include sufficient unit tests to ensure the change stays correct and robust. The unit and integration tests always run, and our comprehensive test is triggered after the "full" label is added to a PR.

What to Expect for the Reviews

We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of KuntaiDu, ApostaC or YaoJiayi.

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Summary of Changes

Hello @KuntaiDu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for a hybrid allocator within LMCache, enabling more flexible and potentially optimized management of KV (Key-Value) cache memory. This change is designed to work in conjunction with related updates in the vLLM project, specifically targeting improvements in how KV cache groups are handled across different layers of a model.

Highlights

  • Hybrid Allocator Configuration: New fields have been added to LMCacheEngineMetadata to configure layer-to-KV-cache-group mappings, facilitating the setup of a hybrid memory allocation scheme.
  • KV Cache Group Management: The internal RequestTracker and ReqMeta structures have been updated to manage allocated_block_ids and slot_mappings as dictionaries, allowing for distinct KV cache groups per layer or group of layers.
  • VLLM Integration Updates: The integration with vLLM has been refined to align with vLLM 0.9.0+ block ID handling, ensuring compatibility and proper propagation of KV cache group information during request processing.
  • GPU Connector Enhancements: The GPU memory connector now incorporates KV cache group awareness, enabling batched operations (batched_to_gpu, batched_from_gpu) to correctly handle and map memory operations to their respective KV cache groups.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces support for a hybrid allocator in LMCache, which involves significant changes across the configuration, vLLM adapter, and GPU connector. The changes are well-structured to handle multiple KV cache groups. I've found one critical issue that could lead to a runtime error, which I've detailed in a specific comment. Overall, the changes are in the right direction to support the new feature.

Comment thread lmcache/integration/vllm/vllm_v1_adapter.py Outdated
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

@YaoJiayi YaoJiayi left a comment

Collaborator

Thanks! Left a few comments:)

self,
new_token_ids: list[int],
new_block_ids: Union[Optional[tuple[list[int], ...]], list[int]],
new_block_ids: Optional[tuple[list[int], ...]],

Collaborator

@maobaolong @chunxiaozheng Do we still need backward compatibility here?
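
For reference, if backward compatibility is still wanted, one option is to normalize the legacy flat `list[int]` form into the per-group tuple at the boundary. This is a minimal sketch only: `normalize_block_ids` is a hypothetical helper, not an existing LMCache or vLLM function, and it assumes a flat list always means a single KV cache group.

```python
from typing import Optional, Union

# One list of block IDs per KV cache group.
BlockIds = tuple[list[int], ...]


def normalize_block_ids(
    new_block_ids: Union[Optional[BlockIds], list[int]],
) -> Optional[BlockIds]:
    """Return block IDs in per-KV-cache-group form (hypothetical helper)."""
    if new_block_ids is None:
        return None
    if isinstance(new_block_ids, list):
        # Assumption: the legacy flat list describes exactly one group.
        return (list(new_block_ids),)
    # Already grouped: copy so callers cannot mutate our view.
    return tuple(list(group) for group in new_block_ids)
```

With a helper like this, the rest of the adapter would only ever see the tuple form.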

token_ids=token_ids,
slot_mapping=slot_mapping,
slot_mappings=slot_mappings,
slot_mapping=slot_mappings[0],

Collaborator

What's the difference between slot_mapping and slot_mappings?

Contributor Author

slot_mapping == slot_mappings[0]. Basically, slot_mapping only contains the slot mapping of the first KV cache group. It is kept for test compatibility.

kvcaches=kvcaches,
slot_mapping=slot_mapping[:lmcache_cached_tokens],
# FIXME(Kuntai): need to support multiple kv cache groups
slot_mapping=slot_mappings[0][:lmcache_cached_tokens],

Collaborator

What is this todo here?

Contributor Author

The todo is to fully deprecate slot_mapping in all connectors and substitute them with slot_mappings.

Contributor Author

I am planning to rename it to group_id_to_slot_mapping, as slot_mappings looks too similar to slot_mapping (I have had several bugs simply because I misread slot_mappings as slot_mapping). Does that sound good to you, @YaoJiayi?

Contributor Author

Just talked to @ApostaC. @YaoJiayi, the plan is:

  • Write a GPU connector that only handles the hybrid KV cache allocator, and leave all other GPU connectors unchanged. We will refactor the other GPU connectors in future PRs.
  • Stick to slot_mappings, and make it a tensor with shape [# of kv cache groups, # of tokens].
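
To make the proposed layout concrete, here is a rough sketch of how a [# of kv cache groups, # of tokens] slot mapping could be derived from per-group block IDs. It is illustrative only. `build_slot_mappings` is a hypothetical name, and the sketch assumes that every group uses the same block size, that every token has a block in every group, and that slots follow the usual paged layout `block_id * block_size + offset`. Nested lists stand in for the tensor to keep the example dependency-free.

```python
def build_slot_mappings(
    block_ids: tuple[list[int], ...],
    num_tokens: int,
    block_size: int,
) -> list[list[int]]:
    """Row g holds, for each token, its slot index in KV cache group g."""
    rows = []
    for group_blocks in block_ids:
        # Token t lives in the (t // block_size)-th block of this group,
        # at offset t % block_size inside that block.
        row = [
            group_blocks[t // block_size] * block_size + t % block_size
            for t in range(num_tokens)
        ]
        rows.append(row)
    return rows


# Two groups, six tokens, block size 4.
slot_mappings = build_slot_mappings(([2, 5], [7, 1]), num_tokens=6, block_size=4)
# The legacy single-group view is simply the first row.
slot_mapping = slot_mappings[0]
```

Keeping a single 2-D container means the legacy `slot_mapping` stays available as row 0 without a second field to keep in sync.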

Comment thread lmcache/v1/gpu_connector.py Outdated

self.dtype = kwargs["dtype"]
self.device = kwargs["device"]
self.layer_id_to_kv_cache_group_id = kwargs.get(

Collaborator

Can we have two global mappings, layer_name_to_kv_cache_group_id and layer_id_to_kv_cache_group_id, so that all GPU connectors can benefit from them in the future?

Contributor Author

I thought that in the future there may be multiple vLLM instances on the same machine connecting to one LMCache process (a result of @ApostaC's proposal to separate the connector process from the vLLM process), in which case we would need a different layer_name_to_kv_cache_group_id for each connector. Is that the case, @ApostaC?

Contributor Author

@YaoJiayi I confirmed with @ApostaC that the case I mentioned will happen. Given that, I think it's better to keep layer_name_to_kv_cache_group_id local to the GPU connector class. I will still move the logic that converts kv_cache_config to layer_name_to_kv_cache_group_id into a util function.
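
A util function along these lines might look like the sketch below. This is not the code in the PR. It assumes the KV cache config exposes a list of groups, each carrying the names of its layers. `GroupSpec`, `CacheConfig` and `build_layer_name_to_kv_cache_group_id` are hypothetical stand-ins so the example runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class GroupSpec:
    """Hypothetical stand-in for one KV cache group."""
    layer_names: list[str] = field(default_factory=list)


@dataclass
class CacheConfig:
    """Hypothetical stand-in for the KV cache config."""
    kv_cache_groups: list[GroupSpec] = field(default_factory=list)


def build_layer_name_to_kv_cache_group_id(kv_cache_config) -> dict[str, int]:
    """Map each layer name to the index of the group that owns it."""
    mapping: dict[str, int] = {}
    for group_id, group in enumerate(kv_cache_config.kv_cache_groups):
        for layer_name in group.layer_names:
            mapping[layer_name] = group_id
    return mapping


config = CacheConfig([
    GroupSpec(["layers.0.attn", "layers.2.attn"]),
    GroupSpec(["layers.1.attn"]),
])
layer_to_group = build_layer_name_to_kv_cache_group_id(config)
```

Because the function is pure, each connector instance can call it with its own config and keep the result as instance state, which matches the decision not to make the mapping global.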

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
…name to kv cache group id

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
…gpu connectors

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
… storage manager

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
…v_cache_group_id in new VLLMPagedMemLayerwiseGPUConnectorForHybridAlloc. This is doable because it is separated into a new connector and previous version of vLLM will no longer touch VLLMPagedMemLayerwiseGPUConnectorForHybridAlloc

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
@KuntaiDu

KuntaiDu commented Aug 28, 2025

Contributor Author

New updates:

  • Separate the logic of converting kv_cache_config to layer_name_to_kv_cache_group_id into a vLLM util function
  • Separate the logic of hybrid allocator into a separate GPU connector
  • Make slot_mappings a tensor

Not covered in this PR (we can do it in future PRs)

  • Fully deprecate slot_mapping
  • Add test (test can only be added after vLLM-side PR is merged)

This PR is ready for review.

@KuntaiDu KuntaiDu requested a review from ApostaC August 28, 2025 22:33
…ata --- now it is tied to gpu connector

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Comment on lines +220 to +221
slot_mappings: torch.Tensor
# Slot mapping for backward compatibility

Contributor

Probably we want a unified interface to get the slot mapping
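
One possible shape for such a unified interface is sketched below. It is purely illustrative: `SlotMappings` and `for_group` are hypothetical names, and plain lists stand in for the tensor so the example has no dependencies.

```python
from dataclasses import dataclass


@dataclass
class SlotMappings:
    """Hypothetical wrapper: rows[g][t] is token t's slot in KV cache group g."""
    rows: list[list[int]]

    @property
    def num_groups(self) -> int:
        return len(self.rows)

    def for_group(self, group_id: int = 0) -> list[int]:
        # Defaulting to group 0 reproduces the legacy single-group view.
        return self.rows[group_id]


mappings = SlotMappings([[8, 9], [4, 5]])
legacy_view = mappings.for_group()
```

Callers that only know about one group keep working through the default argument, while hybrid-aware connectors pass an explicit group ID.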

@KuntaiDu

Contributor Author

TODO:

  • Maintain old version compatibility
  • Perf test on default LLMs

…t to avoid spamming

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
…back to LMCache, as it needs more careful design to be pushed into vllm

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
…back to LMCache, as it needs more careful design to be pushed into vllm

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
@YaoJiayi YaoJiayi mentioned this pull request Sep 8, 2025
43 tasks
@matthewygf


Bump. I would love to use this feature, so let me know if I can help in any way!

@KuntaiDu

KuntaiDu commented Dec 14, 2025

Contributor Author

This feature is being deprioritized for now :( , this is a pure performance PR and is less urgent than feature PRs (e.g. process disaggregation, P2P backend).

@mingfang


This feature is being deprioritized for now :( , this is a pure performance PR and is less urgent than feature PRs (e.g. process disaggregation, P2P backend).

Currently, vLLM crashes when LMCache is enabled for models that require a hybrid KV cache (e.g., Qwen3-Next-80B-A3B-Instruct).

@github-actions

Contributor

This pull request has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.

@github-actions github-actions Bot added the stale label Feb 18, 2026
@mingfang


vLLM is still crashing with this error:

ValueError: Hybrid KV cache manager is disabled but failed to convert the KV cache specs to one unified type.

Is there any plan to support hybrid models?
@KuntaiDu this is NOT a pure performance PR. vLLM crashes on startup.

@github-actions github-actions Bot removed the stale label Feb 26, 2026
@bambarambambum

bambarambambum commented Mar 4, 2026


Hello,
vLLM still crashes on startup.
All modern hybrid models that I have tried to run (qwen3.5, Qwen3-Coder-Next, GLM5) crash with this error.
(EngineCore_DP0 pid=251) WARNING 03-04 12:47:30 [kv_cache_utils.py:1170] Hybrid KV cache manager is disabled for this hybrid model, This means we do not enable any optimizations for saving KV cache memory (e.g., dropping the KV cache outside the sliding window). The compute of layers like sliding window is still saved.
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] EngineCore failed to start.
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] Traceback (most recent call last):
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1019, in run_engine_core
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] return func(*args, **kwargs)
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 763, in __init__
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] super().__init__(
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 114, in __init__
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=251) Process EngineCore_DP0:
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] return func(*args, **kwargs)
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 259, in _initialize_kv_caches
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] kv_cache_configs = get_kv_cache_configs(
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1553, in get_kv_cache_configs
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] global_kv_cache_groups = get_kv_cache_groups(vllm_config, merged_kv_cache_specs)
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1231, in get_kv_cache_groups
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] unify_hybrid_kv_cache_specs(kv_cache_spec)
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1211, in unify_hybrid_kv_cache_specs
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] raise ValueError(
(EngineCore_DP0 pid=251) ERROR 03-04 12:47:30 [core.py:1029] ValueError: Hybrid KV cache manager is disabled but failed to convert the KV cache specs to one unified type.
(EngineCore_DP0 pid=251) Traceback (most recent call last):
(EngineCore_DP0 pid=251) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=251) self.run()
(EngineCore_DP0 pid=251) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=251) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=251) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1033, in run_engine_core
(EngineCore_DP0 pid=251) raise e
(EngineCore_DP0 pid=251) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1019, in run_engine_core
(EngineCore_DP0 pid=251) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=251) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=251) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=251) return func(*args, **kwargs)
(EngineCore_DP0 pid=251) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=251) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 763, in __init__
(EngineCore_DP0 pid=251) super().__init__(
(EngineCore_DP0 pid=251) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 114, in __init__
(EngineCore_DP0 pid=251) num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=251) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=251) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=251) return func(*args, **kwargs)
(EngineCore_DP0 pid=251) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=251) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 259, in _initialize_kv_caches
(EngineCore_DP0 pid=251) kv_cache_configs = get_kv_cache_configs(
(EngineCore_DP0 pid=251) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=251) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1553, in get_kv_cache_configs
(EngineCore_DP0 pid=251) global_kv_cache_groups = get_kv_cache_groups(vllm_config, merged_kv_cache_specs)
(EngineCore_DP0 pid=251) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=251) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1231, in get_kv_cache_groups
(EngineCore_DP0 pid=251) unify_hybrid_kv_cache_specs(kv_cache_spec)
(EngineCore_DP0 pid=251) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1211, in unify_hybrid_kv_cache_specs
(EngineCore_DP0 pid=251) raise ValueError(
(EngineCore_DP0 pid=251) ValueError: Hybrid KV cache manager is disabled but failed to convert the KV cache specs to one unified type.
[rank0]:[W304 12:47:31.886063773 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1) sys.exit(main())
(APIServer pid=1) ^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=1) args.dispatch_function(args)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 112, in cmd
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1) return __asyncio.run(
(APIServer pid=1) ^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1) return runner.run(main)
(APIServer pid=1) ^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 471, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 490, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 96, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 137, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 223, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 152, in __init__
(APIServer pid=1) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 125, in make_async_mp_client
(APIServer pid=1) return AsyncMPClient(*client_args)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 839, in __init__
(APIServer pid=1) super().__init__(
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 493, in __init__
(APIServer pid=1) with launch_core_engines(vllm_config, executor_class, log_stats) as (
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1) next(self.gen)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 925, in launch_core_engines
(APIServer pid=1) wait_for_engine_startup(
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 984, in wait_for_engine_startup
(APIServer pid=1) raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

@KuntaiDu

KuntaiDu commented Mar 5, 2026

Contributor Author

We are switching to LMCache multiprocess mode, which allows full separation between the vLLM container and the LMCache container. Once multiprocess mode support is finalized, we will circle back to hybrid models.

@dotmobo

dotmobo commented Mar 5, 2026


Hi, I tried Qwen 3.5 35B A3B FP8 with LMCache and the latest vLLM, and I get the following error:

"Hybrid KV cache manager is disabled but failed to convert the KV cache specs to one unified type"

@malaiwah


Need to work together with vllm-project/vllm#23624.

I noticed vLLM has continued working on the hybrid allocator in vllm-project/vllm#30166 (and support is now merged). The originally mentioned PR is now closed.

@github-actions

Contributor

This pull request has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.

@github-actions github-actions Bot added the stale label May 22, 2026