[MP][optimize] optimize save when MLA enabled #2935
Conversation
Code Review
This pull request adds `use_mla` and `is_first_rank_of_pp_group` parameters to the `LMCacheMPWorkerAdapter` constructor. Feedback indicates that the use of `is_first_rank_of_pp_group` may be logically incorrect for KV cache storage and should likely be `is_first_rank_of_tp_group` to avoid data loss. Additionally, the new fields are currently unused in the class logic, and the constructor is missing a required docstring per the repository style guide.
ApostaC
left a comment
Quick comment: I think right now we have `world_size` and `kv_rank` in the initialization args, though their meaning is not super clear. How about we define a new dataclass to describe the parallel strategy and local rank information (and include the MLA flag there)?
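A minimal sketch of what such a dataclass might look like (the field and property names here are illustrative, not the merged API):

```python
from dataclasses import dataclass


@dataclass
class ParallelStrategy:
    """Describes the parallel topology and this worker's position in it.

    A sketch only; field names are illustrative, not the final API.
    """
    world_size: int        # total number of worker processes
    rank: int              # this worker's global rank
    tp_size: int = 1       # tensor-parallel group size
    pp_size: int = 1       # pipeline-parallel group size
    use_mla: bool = False  # whether the model uses MLA attention

    @property
    def kv_world_size(self) -> int:
        # With MLA, all TP ranks hold identical KV caches, so the
        # effective KV world size excludes the TP dimension.
        return self.world_size // self.tp_size if self.use_mla else self.world_size

    @property
    def kv_rank(self) -> int:
        return self.rank // self.tp_size if self.use_mla else self.rank
```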
@ApostaC Sorry, I just saw the message and I think it's okay, thanks for your review.
Force-pushed from 6f02316 to f5a3eb1
@ApostaC I have updated, could you help take a look? (the change of …)
ApostaC
left a comment
I like this PR! One thing on my mind is how to introduce the PRs on both the LMCache & vLLM side without breaking the CI/CD.
Potential idea:
- Step 1: introduce `ParallelStrategy` as an optional argument with default value None, and keep the old `world_size` and `kv_rank` parameters with default value None as well. (Probably we need some logic to handle both the old version and the new version — see the sketch below.) This will not break LMCache CI.
- Step 2: update the vLLM-side code to use `ParallelStrategy` and remove the old `world_size` and `kv_rank` arguments. Since in Step 1 we made `world_size` and `kv_rank` optional with a default value of None, it will not break LMCache CI either.
- Step 3: clean up the LMCache-side code to completely remove the old `world_size` and `kv_rank` related arguments.
I hate this complicated process, so let me know if you have better ideas @chunxiaozheng
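For Step 1, the dual-path handling could look roughly like this (a sketch against the hypothetical `ParallelStrategy` above, omitting the adapter's other constructor arguments):

```python
from typing import Optional


class LMCacheMPWorkerAdapter:
    def __init__(
        self,
        world_size: Optional[int] = None,  # legacy argument
        kv_rank: Optional[int] = None,     # legacy argument
        parallel_strategy: Optional[ParallelStrategy] = None,
    ):
        """Accept either the legacy (world_size, kv_rank) pair or the new
        parallel_strategy object while both vLLM versions are in the wild."""
        if parallel_strategy is not None:
            # New-style callers: derive the KV topology from the strategy.
            self._world_size = parallel_strategy.kv_world_size
            self._kv_rank = parallel_strategy.kv_rank
        elif world_size is not None and kv_rank is not None:
            # Old-style callers, e.g. the current vLLM connector.
            self._world_size = world_size
            self._kv_rank = kv_rank
        else:
            raise ValueError(
                "Either parallel_strategy or (world_size, kv_rank) is required")

    @property
    def world_size(self) -> int:
        return self._world_size

    @property
    def worker_id(self) -> int:
        return self._kv_rank
```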
> The world size, world_size may not be equal to the actual_world_size,
> in the case of MLA, it will 'exclude' the effect of TP.
What does it mean by "exclude" the effect of TP? Does that mean for the MLA TP=8 case, we will have:
`world_size` = 1, `actual_world_size` = 8; `worker_id` = 1, `actual_worker_id` = …
In this case, can we name them `kv_world_size` and `kv_worker_id`, which describe how we divide the KV cache? WDYT?
@ApostaC Good idea, I will update here.
@ApostaC The meaning of "exclude" the effect of TP is explained by the following comments in `LMCacheMPConnector`:
```python
def extract_world_size_and_kv_rank(
    world_size: int,
    rank: int,
    vllm_config: VllmConfig,
) -> tuple[int, int]:
    """Convert the world size and rank for MLA models."""
    use_mla = mla_enabled(vllm_config.model_config)
    if not use_mla:
        return world_size, rank
    else:
        # Tensor parallel does not change the KV caches for MLA models,
        # so we need to "exclude" the effect of TP on rank and world size.
        tp_size = vllm_config.parallel_config.tensor_parallel_size
        # vLLM constructs TP groups first, and then constructs other
        # parallel groups on top of the TP groups.
        # For example, with TP=4, PP=2:
        #   TP groups: [0, 1, 2, 3], [4, 5, 6, 7]
        #   PP groups: [0, 4], [1, 5], [2, 6], [3, 7]
        # So we can "exclude" the effect of TP by rank // tp_size.
        return world_size // tp_size, rank // tp_size
```
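Checking the TP=4, PP=2 example with plain values (a standalone illustration of the same rank arithmetic, with the `VllmConfig` lookup replaced by explicit arguments):

```python
def kv_topology(world_size: int, rank: int, tp_size: int, use_mla: bool):
    """Standalone version of the rank math above, for illustration."""
    if not use_mla:
        return world_size, rank
    return world_size // tp_size, rank // tp_size


# TP=4, PP=2 -> world_size=8; TP groups are [0-3] and [4-7].
for rank in range(8):
    print(rank, "->", kv_topology(8, rank, tp_size=4, use_mla=True))
# Ranks 0-3 collapse to kv_rank 0 and ranks 4-7 to kv_rank 1,
# i.e. one KV rank per TP group, with kv_world_size == 2.
```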
@ApostaC Thanks for your suggestion, let me think about this.
@ApostaC If the …
@chunxiaozheng I think the problem is the CI/CD. Right now, we are using the vLLM-side connector implementation to do the CI/CD. In that case, the CI/CD may still become red.
@ApostaC Okay, I think the suggestion you gave above is a good choice.
Signed-off-by: idellzheng <idellzheng@tencent.com>
Force-pushed from 3d4ba18 to 2aff145
@ApostaC I have updated this PR and it will be fully compatible with the previous …
Signed-off-by: idellzheng <idellzheng@tencent.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
Reviewed by Cursor Bugbot for commit ac60e13.
Signed-off-by: idellzheng <idellzheng@tencent.com>

When `MLA` is enabled, `store` or `retrieve` requests only need to be sent once across multiple workers, which can greatly reduce the number of requests to the server. The current PR only modifies `store` requests; `retrieve` requests will be modified in a follow-up PR.
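A sketch of the gating this implies on the worker side (the helper name is hypothetical; note the review feedback above suggests keying on the TP group, not the PP group):

```python
def should_send_store(use_mla: bool, is_first_rank_of_tp_group: bool) -> bool:
    """With MLA, every rank in a TP group holds an identical KV cache,
    so only one rank per TP group needs to issue the store request."""
    if not use_mla:
        return True  # non-MLA: each rank owns a distinct KV shard
    return is_first_rank_of_tp_group
```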
Note
Medium Risk
Changes how `world_size`, `tp_size`, and `worker_id` are derived for LMCache multiprocess requests, which can affect cache-keying and cross-worker coordination in distributed runs. Backward-compat defaults exist, but a misconfigured `parallel_strategy` could lead to incorrect sharing or runtime errors.
Overview
Adds a new `ParallelStrategy` dataclass and wires it into `LMCacheMPSchedulerAdapter` and `LMCacheMPWorkerAdapter` so cache-key fields (`world_size`, `worker_id`) and `tp_size` can be computed from the KV-parallel topology (e.g., MLA excluding TP) rather than the raw process rank.
Updates adapter constructors with defaults and new properties (`world_size`, `tp_size`, `worker_id`, plus `use_mla`/`is_first_rank_of_pp_group` on the worker) and adjusts `FREE_LOOKUP_LOCKS` tests to construct adapters via `parallel_strategy` instead of directly setting `world_size`/`tp_size`.