[LMCache][MP] optimize save when mla enabled #38810
ApostaC merged 6 commits into vllm-project:main
Conversation
Code Review
This pull request adds support for Multi-Head Latent Attention (MLA) to the KV cache transfer mechanism: because the latent KV cache is replicated across ranks, only the first rank of a tensor parallel group saves the cache, avoiding redundant store operations. The review feedback notes that the documentation swapped the tensor parallel (TP) and pipeline parallel (PP) group definitions, and that the variable is_first_rank_of_pp_group should be renamed to is_first_rank_of_tp_group to match the implementation logic. The reviewer also recommended using inspect.signature and getattr to remain backward compatible with older versions of the lmcache package and to prevent potential runtime errors.
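The backward-compatibility suggestion can be sketched as follows. The class and method names below (OldEngine, NewEngine, safe_store) are illustrative stand-ins, not the actual lmcache API; the point is the inspect.signature check before passing a newer keyword argument, and the getattr fallback for attributes that may not exist in older releases.

```python
import inspect

# Hypothetical stand-ins for two lmcache versions; names are
# illustrative, not the real lmcache interface.
class OldEngine:
    def store(self, key, value):
        return f"stored {key}"

class NewEngine:
    def store(self, key, value, is_first_rank_of_tp_group=False):
        return f"stored {key} (first_rank={is_first_rank_of_tp_group})"

def safe_store(engine, key, value, first_rank):
    """Pass the new keyword only if the installed version accepts it."""
    params = inspect.signature(engine.store).parameters
    if "is_first_rank_of_tp_group" in params:
        return engine.store(key, value, is_first_rank_of_tp_group=first_rank)
    # Older versions: fall back to the legacy call signature.
    return engine.store(key, value)

def mla_enabled(config) -> bool:
    # getattr guards against configs from releases that predate the flag.
    return getattr(config, "use_mla", False)
```

This way a single code path works against both old and new package versions instead of raising a TypeError on the unexpected keyword.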
Hi @ApostaC, could you take a look?
Hi @chunxiaozheng, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then commit the changes and push to your branch.
Signed-off-by: idellzheng <idellzheng@tencent.com>
Force-pushed from 21b91dc to 344d0aa
Hi @KuntaiDu, could you take another look?
Signed-off-by: idellzheng <idellzheng@tencent.com>
Co-authored-by: Yihua Cheng <yihua98@uchicago.edu>
Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
When MLA is enabled, store and retrieve requests only need to be sent once across the multiple workers, which greatly reduces the number of requests hitting the cache server. This PR only modifies store requests; retrieve requests will be modified in a follow-up PR.
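The core deduplication idea can be sketched as below. With MLA the latent KV cache is identical across tensor parallel ranks, so only rank 0 needs to issue the store request; with standard multi-head attention each rank holds its own KV shard and must store it. The function names and parameters here (should_save_kv, tp_rank, send_store) are assumptions for illustration, not the actual vLLM/LMCache code.

```python
from typing import Callable

def should_save_kv(is_mla: bool, tp_rank: int) -> bool:
    """Decide whether this TP rank should issue a KV cache store request."""
    if is_mla:
        # Latent KV cache is replicated across TP ranks: one save suffices.
        return tp_rank == 0
    # Standard MHA: the KV cache is sharded, so every rank must save.
    return True

def save_kv_cache(is_mla: bool, tp_rank: int, send_store: Callable[[], None]) -> bool:
    """Invoke the (hypothetical) store RPC only when this rank is responsible."""
    if should_save_kv(is_mla, tp_rank):
        send_store()
        return True
    return False
```

With TP size N, this drops the number of store requests per layer from N to 1 when MLA is enabled, which is the request reduction the PR description refers to.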