
[MP][optimize] optimize save when mla enabled #2935

Merged
ApostaC merged 5 commits into LMCache:dev from chunxiaozheng:mp-optimize-save on Apr 10, 2026

Conversation

chunxiaozheng (Collaborator) commented Apr 2, 2026

When MLA is enabled, store and retrieve requests only need to be sent once across the multiple workers, rather than once per worker, which can greatly reduce the number of requests hitting the server.
This PR only changes store requests; retrieve requests will be changed in a follow-up PR.
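
As a rough illustration of the idea (not the PR's actual code; the function and argument names here are hypothetical), the store path can be gated so that only one rank per TP group issues the request when MLA is enabled:

```python
# Hypothetical sketch: with MLA, all ranks in a TP group hold identical
# KV caches, so only one of them needs to send the store request.
def should_send_store(use_mla: bool, rank: int, tp_size: int) -> bool:
    if not use_mla:
        # Without MLA, each worker owns a distinct KV shard and must store it.
        return True
    # vLLM builds TP groups from consecutive ranks, so the first rank of
    # each TP group satisfies rank % tp_size == 0.
    return rank % tp_size == 0
```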


Note

Medium Risk
Changes how world_size, tp_size, and worker_id are derived for LMCache multiprocess requests, which can affect cache-keying and cross-worker coordination in distributed runs. Backward-compat defaults exist, but misconfigured parallel_strategy could lead to incorrect sharing or runtime errors.

Overview
Adds a new ParallelStrategy dataclass and wires it into LMCacheMPSchedulerAdapter and LMCacheMPWorkerAdapter so that the cache-key fields (world_size, worker_id) and tp_size can be computed from the KV-parallel topology (e.g., MLA excluding TP) rather than from the raw process rank.

Updates adapter constructors with defaults and new properties (world_size, tp_size, worker_id, plus use_mla/is_first_rank_of_pp_group on the worker) and adjusts FREE_LOOKUP_LOCKS tests to construct adapters via parallel_strategy instead of directly setting world_size/tp_size.
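
A minimal sketch of what such a dataclass could look like, assuming the kv_world_size/kv_worker_id naming discussed later in this thread (field names are illustrative, not necessarily the merged API):

```python
from dataclasses import dataclass

@dataclass
class ParallelStrategy:
    """Illustrative sketch of the parallel-topology description."""
    world_size: int        # total number of worker processes
    rank: int              # this worker's global rank
    tp_size: int = 1       # tensor-parallel group size
    use_mla: bool = False  # MLA models share KV caches across TP ranks

    @property
    def kv_world_size(self) -> int:
        # For MLA, every TP rank holds the same KV data, so divide TP out.
        return self.world_size // self.tp_size if self.use_mla else self.world_size

    @property
    def kv_worker_id(self) -> int:
        return self.rank // self.tp_size if self.use_mla else self.rank
```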


gemini-code-assist (Bot) left a comment

Code Review

This pull request adds use_mla and is_first_rank_of_pp_group parameters to the LMCacheMPWorkerAdapter constructor. The feedback indicates that using is_first_rank_of_pp_group may be logically incorrect for KV cache storage and should likely be is_first_rank_of_tp_group to avoid data loss. Additionally, the new fields are currently unused in the class logic, and the constructor is missing the docstring required by the repository style guide.

Comment threads (4, outdated) on lmcache/integration/vllm/vllm_multi_process_adapter.py
ApostaC (Contributor) left a comment

Quick comment: I think right now we have world_size and kv_rank in the initialization args, though their meaning is not super clear. How about we define a new dataclass to describe the parallel strategy and local rank information (and include mla flag there)?

chunxiaozheng (Collaborator, Author) replied:

> Quick comment: I think right now we have world_size and kv_rank in the initialization args, though their meaning is not super clear. How about we define a new dataclass to describe the parallel strategy and local rank information (and include mla flag there)?

@ApostaC Sorry, I just saw the message and I think it's okay, thanks for your review.

chunxiaozheng (Collaborator, Author) commented Apr 7, 2026

@ApostaC I have updated, could you help take a look?

(the corresponding change to LMCacheMPConnector is vllm-project/vllm#38810)

Comment thread lmcache/integration/vllm/vllm_multi_process_adapter.py

ApostaC (Contributor) left a comment

I like this PR! One thing on my mind is how to introduce the PRs on both the LMCache and vLLM sides without breaking CI/CD.

Potential idea:

  • Step 1: introduce ParallelStrategy as an optional argument with default value None, and keep the old world_size and kv_rank parameters with default value None as well (we will probably need some logic to handle both the old and new versions). -- This will not break LMCache CI.
  • Step 2: update the vLLM-side code to use ParallelStrategy and remove the old world_size and kv_rank arguments. Since step 1 made world_size and kv_rank optional with default value None, this will not break LMCache CI either.
  • Step 3: clean up the LMCache-side code to completely remove the old world_size and kv_rank arguments.

I hate this complicated process, so let me know if you have better ideas @chunxiaozheng
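
A minimal sketch of the Step 1 compatibility shim, assuming the argument names from this discussion and the ParallelStrategy sketch above (the merged code may differ):

```python
from typing import Optional

class LMCacheMPWorkerAdapter:
    def __init__(
        self,
        world_size: Optional[int] = None,
        kv_rank: Optional[int] = None,
        parallel_strategy: Optional["ParallelStrategy"] = None,
    ):
        if parallel_strategy is not None:
            # New path: derive cache-key fields from the topology description.
            self._world_size = parallel_strategy.kv_world_size
            self._worker_id = parallel_strategy.kv_worker_id
        elif world_size is not None and kv_rank is not None:
            # Old path: keep accepting the legacy arguments so the
            # pre-existing vLLM-side connector still works.
            self._world_size = world_size
            self._worker_id = kv_rank
        else:
            raise ValueError(
                "Either parallel_strategy or (world_size, kv_rank) is required"
            )
```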

Comment on lines +106 to +107:

> The world size, world_size may not be equal to the actual_world_size,
> in the case of mla, it will 'exclude' the effect of TP.
ApostaC (Contributor) commented:

What does it mean to "exclude" the effect of TP? Does that mean for the MLA TP 8 case we will have:

  • world_size = 1, actual_world_size = 8
  • worker_id = 1, actual_worker_id =

In this case, can we name them kv_world_size and kv_worker_id, which describe how we divide the KV cache? WDYT?

chunxiaozheng (Collaborator, Author) replied:

@ApostaC Good idea, I will update here.

chunxiaozheng (Collaborator, Author) replied:

@ApostaC The meaning of "exclude" the effect of TP is explained by the following annotation in LMCacheMPConnector:

```python
def extract_world_size_and_kv_rank(
    world_size: int,
    rank: int,
    vllm_config: VllmConfig,
) -> tuple[int, int]:
    """
    Convert the world size and rank for MLA models.
    """
    use_mla = mla_enabled(vllm_config.model_config)
    if not use_mla:
        return world_size, rank
    else:
        # Tensor parallel does not change the KV caches for MLA models.
        # So we need to "exclude" the effect of TP on rank and world size.
        tp_size = vllm_config.parallel_config.tensor_parallel_size
        # vLLM constructs TP groups first, and then constructs other
        # parallel groups on top of the TP groups.
        # For example, with TP=4, PP=2:
        # TP groups: [0, 1, 2, 3], [4, 5, 6, 7]
        # PP groups: [0, 4], [1, 5], [2, 6], [3, 7]
        # So we can "exclude" the effect of TP via rank // tp_size.
        return world_size // tp_size, rank // tp_size
```
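
For a concrete illustration of the mapping (assuming the TP=4, PP=2 layout from the annotation above):

```python
tp_size, world_size = 4, 8
for rank in range(world_size):
    print(f"rank {rank} -> kv_rank {rank // tp_size}")
# ranks 0-3 -> kv_rank 0, ranks 4-7 -> kv_rank 1
# kv_world_size = world_size // tp_size = 2
```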

chunxiaozheng (Collaborator, Author) replied:

> I like this PR! One thing on my mind is how to introduce the PRs on both the LMCache and vLLM sides without breaking CI/CD. [...]

@ApostaC Thanks for your suggestion, let me think about this.

chunxiaozheng (Collaborator, Author) commented:

> I like this PR! One thing on my mind is how to introduce the PRs on both the LMCache and vLLM sides without breaking CI/CD. [...]

@ApostaC If the LMCacheMPConnector were also maintained in the LMCache repository, would that simplify things a lot? We would only need to update the code in LMCache; @maobaolong has already submitted some PRs about this:

  1. Copy a snapshot of lmcache_mp_connector.py for vllm 0.18.0 #2887
  2. [KVConnector]: prioritize external connector over internal registry vllm-project/vllm#38301

ApostaC (Contributor) commented Apr 8, 2026

@chunxiaozheng I think the problem is the CI/CD. Right now, we are using the vLLM-side connector implementation to run the CI/CD. In that case, the CI/CD may still turn red.

chunxiaozheng (Collaborator, Author) replied:

> I think the problem is the CI/CD. Right now, we are using the vLLM-side connector implementation to run the CI/CD. In that case, the CI/CD may still turn red.

@ApostaC Okay, I think the suggestion you gave above is a good choice.

Comment thread lmcache/integration/vllm/vllm_multi_process_adapter.py
chunxiaozheng (Collaborator, Author) commented Apr 9, 2026

@ApostaC I have updated this PR and it is now fully compatible with the previous lmcache_mp_connector; could you take another look?

cursor (Bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).


Comment thread lmcache/integration/vllm/vllm_multi_process_adapter.py
chunxiaozheng added the full label (Run comprehensive tests on this PR) on Apr 9, 2026
ApostaC (Contributor) left a comment

LGTM!

ApostaC enabled auto-merge (squash) April 10, 2026 00:42
maobaolong (Collaborator) left a comment

LGTM.

ApostaC merged commit 06981d6 into LMCache:dev on Apr 10, 2026
35 of 36 checks passed
Oasis-Git pushed a commit to Oasis-Git/LMCache that referenced this pull request Apr 13, 2026
ftian1 pushed a commit to ftian1/LMCache that referenced this pull request Apr 20, 2026

Labels

full (Run comprehensive tests on this PR), mp_mode
