Fix to support mla multiple tp failed to read issue #2697

maobaolong merged 3 commits into LMCache:dev from
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a crucial enhancement to the caching system, enabling robust support for scenarios where multiple consumers (e.g., workers in a tensor parallel setup) need to concurrently access the same cached object.
Code Review
This pull request introduces support for multiple readers on a single cache key by adding a count parameter to lock acquisition and release functions, a crucial feature for supporting tensor parallelism scenarios where multiple workers need to access the same data. However, the current implementation of lock management is susceptible to race conditions and Denial of Service (DoS) attacks. Specifically, the dynamic calculation of locks to release based on GPU contexts can lead to premature cache entry deletion or resource leaks, and the use of unvalidated counts in loops while holding global locks presents a DoS risk. It is recommended to track acquired lock counts per request and implement input validation for lock counts. Additionally, there is a suggestion to refactor some duplicated logic in server.py to improve code maintainability.
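To make the recommendation above concrete, here is a minimal sketch of per-request lock-count tracking with input validation. The class and method names are hypothetical illustrations, not LMCache's actual implementation:

```python
import threading
from collections import defaultdict

# Hypothetical cap on how many read locks one request may take;
# the real limit and naming in LMCache may differ.
MAX_READ_LOCK_COUNT = 128


class ReadLockRegistry:
    """Track how many read locks each request acquired per key, so that
    release always frees exactly what was acquired (no premature deletes,
    no leaked locks)."""

    def __init__(self) -> None:
        self._mutex = threading.Lock()
        # (request_id, key) -> number of read locks currently held
        self._held: dict[tuple[str, str], int] = defaultdict(int)

    def acquire(self, request_id: str, key: str, count: int) -> int:
        # Validate the requested count before looping under any global lock.
        if not 0 < count <= MAX_READ_LOCK_COUNT:
            raise ValueError(f"invalid read-lock count: {count}")
        with self._mutex:
            self._held[(request_id, key)] += count
        return count

    def release(self, request_id: str, key: str) -> int:
        """Release everything this request holds on this key."""
        with self._mutex:
            return self._held.pop((request_id, key), 0)
```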
ApostaC left a comment
The big problem is the num_readers calculation. A minor issue is whether to use count or extra_count.
Otherwise LGTM!
    def reserve_read(
        self,
        keys: list[ObjectKey],
        count: int = 1,
Can we make this `extra_count: int = 0`, so that there won't be any misuse of this interface like `reserve_read(keys, count=0)`?
Thanks for this suggestion, it is better.
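For illustration, a minimal sketch of what the agreed-upon signature could look like, using simplified stand-in classes rather than the real ObjectKey and read-lock types: the caller is always counted once, and extra_count only adds readers on top, so extra_count=0 cannot accidentally skip locking.

```python
from dataclasses import dataclass, field


@dataclass
class CacheEntry:
    readers: int = 0  # simplified stand-in for a real read lock


@dataclass
class CacheIndex:
    entries: dict[str, CacheEntry] = field(default_factory=dict)

    def reserve_read(self, keys: list[str], extra_count: int = 0) -> None:
        """Take 1 + extra_count read locks on each key."""
        for key in keys:
            entry = self.entries.setdefault(key, CacheEntry())
            entry.readers += 1 + extra_count

    def finish_read(self, keys: list[str], extra_count: int = 0) -> None:
        """Release 1 + extra_count read locks on each key."""
        for key in keys:
            self.entries[key].readers -= 1 + extra_count
```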
    def finish_read(
        self,
        keys: list[ObjectKey],
        count: int = 1,
Do we need to have a configurable count for finish_read?
Oh, I see the need! Ignore my comment above.
But I would like to have "extra_count", similar to the above.
    num_readers += 1
    if layout_desc is None:
        layout_desc = get_layout_desc(
            self.gpu_contexts[gpu_id],
            self.chunk_size,
        )
We cannot count num_readers this way here. For example, if LMCache is connected to 4 vLLM instances, where each vLLM instance runs TP=2, the current implementation will count num_readers as 8.
The best way to do it is to update the protocol and pass the TP size when calling lookup. @maobaolong It may need a PR on the vLLM side, modifying lmcache_mp_connector.py (we need to initialize the LMCacheMPSchedulerAdapter with the TP information). I can help review the PR there; it should be a very simple one.
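A rough sketch of the idea (the class and argument names are hypothetical; the real lmcache_mp_connector.py and LMCacheMPSchedulerAdapter interfaces may differ): the adapter is constructed with the TP size once and forwards it with every lookup, instead of the server inferring num_readers from attached GPU contexts.

```python
class LMCacheMPSchedulerAdapterSketch:
    """Illustrative only: carries tp_size from the connector into lookups."""

    def __init__(self, mq_client, tp_size: int) -> None:
        self.mq_client = mq_client
        self.tp_size = tp_size

    def lookup(self, key) -> None:
        # The server derives how many readers will follow from tp_size,
        # instead of counting attached GPU contexts.
        self.mq_client.send("LOOKUP", [key, self.tp_size])
```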
@ApostaC Is this failure related to this PR? It is from https://buildkite.com/lmcache/k3-correctness-test/builds/66

@maobaolong The K3 test is the new environment we just set up. The failure should not block the merge of this PR.
    start=start,
    end=end,
    request_id=request_id,
    tp_size=self.tp_size,
I would suggest we have the tp_size in the lookup interface instead of putting it here. The reason is that the IPCCacheEngineKey already has world_size, and having a tp_size here may confuse people.
My proposal:

- Edit `lmcache/v1/multiprocess/protocols/engine.py` and add the `tp_size` to the argument list of the `LOOKUP` operation.
- In the adapter (this file), pass the `tp_size` when calling `send_lmcache_request` for LOOKUP and FREE_LOOKUP_LOCKS. E.g.,

      send_lmcache_request(
          self.mq_client,
          RequestType.LOOKUP,
          [key, self.tp_size],
      )

- In server.py (`lookup` and `free_lookup_locks`), check the `tp_size` argument and the MLA flag, and compute the `extra_count` accordingly (a sketch of this computation follows below).
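A minimal sketch of the last step, assuming the MLA convention discussed later in this thread (MLA keys are shared across TP workers while world_size stays 1); the helper name is illustrative:

```python
def compute_extra_count(is_mla: bool, tp_size: int, world_size: int) -> int:
    """Extra read locks beyond the first reader.

    For MLA, all tp_size workers retrieve the same ObjectKey, so tp_size - 1
    extra readers are expected; for non-MLA models each worker has its own
    key and no extra locking is needed.
    """
    if is_mla and tp_size > world_size:
        return tp_size - 1
    return 0


# Example: MLA with TP=2 and world_size=1 -> 1 extra reader;
# non-MLA with TP=2 -> 0 extra readers.
assert compute_extra_count(True, 2, 1) == 1
assert compute_extra_count(False, 2, 2) == 0
```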
Great idea, addressed.
But I did nothing for the free_lookup_locks operation, since we use lookup_lock_counts to get the extra readers rather than the argument.
MAX_READ_LOCK_COUNT = 128


def _validate_extra_count(extra_count: int) -> int:
    for _ in range(total):
        entry.read_lock.lock()
Can we have a TODO or an issue here? It would be more performant if we supported a count argument in TTLLock.lock() and TTLLock.unlock(). Since TTLLock is implemented in C++ with std::atomic, having a Python for loop here may impact performance.
TODO comment added.
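For illustration, a pure-Python stand-in (not the actual C++ TTLLock API) showing what a count-aware lock would buy: acquiring n readers becomes one guarded addition instead of n separate lock() calls.

```python
import threading


class CountingReadLock:
    """Simplified stand-in for a reader lock that supports bulk acquire."""

    def __init__(self) -> None:
        self._mutex = threading.Lock()
        self._readers = 0

    def lock(self, count: int = 1) -> None:
        # One critical section regardless of count, instead of a Python
        # for-loop calling lock() count times.
        with self._mutex:
            self._readers += count

    def unlock(self, count: int = 1) -> None:
        with self._mutex:
            self._readers -= count
```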
    # tp > world_size → extra_readers = tp - 1 (all
    # TP workers retrieve the same ObjectKey).
    tp = key.tp_size if key.tp_size > 1 else world_size
    extra_readers = tp - 1 if tp > world_size else 0
The TP problem will only occur for MLA; different TP workers share the same object only in MLA.
For non-MLA models, different TP workers have their own "object", and therefore only need to lock once.
One thing we may need to double-check: there is some code on the vLLM side that passes world_size = 1 for MLA models even with TP, and all the LMCache logic actually relies on this to do the correct MLA + TP handling (lookup, store, and retrieve). For example, see the implementation of ipc_keys_to_object_keys, which explodes the keys based on the world_size and expects it to be 1 in the MLA case.
We probably want better semantics for world_size in the future.
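For intuition, a simplified sketch of the key-explosion behavior described above (not the actual ipc_keys_to_object_keys implementation): with world_size reported as 1, as in the MLA case, every TP worker resolves to the same single object key.

```python
def explode_keys(ipc_key: str, world_size: int) -> list[str]:
    """Simplified: one object key per TP rank when world_size > 1,
    a single shared key when world_size == 1 (the MLA convention)."""
    if world_size <= 1:
        return [ipc_key]
    return [f"{ipc_key}@rank{rank}" for rank in range(world_size)]


# Non-MLA, TP=2: two distinct per-worker keys.
assert explode_keys("chunk-0", 2) == ["chunk-0@rank0", "chunk-0@rank1"]
# MLA, world_size reported as 1: both TP workers read the same key.
assert explode_keys("chunk-0", 1) == ["chunk-0"]
```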
@ApostaC Thanks for the previous review, I addressed your suggestions, PTAL.
ApostaC left a comment
Otherwise LGTM as we discussed offline. Giving an approve here
    def free_lookup_locks(
        self,
        key: IPCCacheEngineKey,
    ) -> None:
Let's have tp_size passed here so we don't need lookup_lock_counts
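A minimal sketch of that direction (hypothetical signature and simplified lock bookkeeping): the release count is recomputed from tp_size the same way as on the lookup path, so the server no longer needs a lookup_lock_counts map.

```python
def free_lookup_locks(read_locks: dict[str, int], key: str,
                      is_mla: bool, tp_size: int, world_size: int) -> None:
    """Release 1 + extra_count read locks for key, stateless on the server."""
    extra = tp_size - 1 if is_mla and tp_size > world_size else 0
    read_locks[key] = max(0, read_locks.get(key, 0) - (1 + extra))
```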
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
ApostaC left a comment
Sorry for the back and forth, but can we do the following things:

- Unify the naming of `extra_readers` (used in `finished_read_prefetched`) vs `extra_count` (used in other functions). Let's have `extra_count` in all the places.

Another thing is that there are unit tests using the lookup and free_lookup_locks interfaces. We need to update them, otherwise the UTs will fail (see the sketch after this list). These include:

- `test_mq_handler_helpers.py`
- `test_mq.py`
- `test_cache_server.py`
- `test_blend_server.py`
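An illustrative, hypothetical example of the kind of update the listed tests would need; the real test helpers and handlers differ, but every call site that builds a lookup now has to supply the new tp_size argument.

```python
# Hypothetical stand-in for the server-side handler under test.
def lookup(key: str, tp_size: int = 1) -> int:
    """Pretend handler: returns the number of read locks it will take."""
    return 1 + (tp_size - 1 if tp_size > 1 else 0)


def test_lookup_with_tp_size():
    # Old call sites passed only the key; updated ones pass tp_size as well.
    assert lookup("chunk-0") == 1
    assert lookup("chunk-0", tp_size=2) == 2
```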
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
    tp_size: int,
    world_size: int,
) -> int:
    """Compute extra count for MLA multi-reader locking.
Thanks for the comments!
            )
            break

        layout_desc = self._find_layout_desc(model_name, world_size)
This is only a refactor.
chunxiaozheng left a comment
After discussing offline, LGTM!
@ApostaC @chunxiaozheng Thanks for your review!
* Fix to support mla multiple tp failed to read issue
* Fix style
* Fix ut and renaming

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
Signed-off-by: shaoxiawjc <wjc2800@163.com>

* Fix to support mla multiple tp failed to read issue
* Fix style
* Fix ut and renaming

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
Signed-off-by: Aaron Wu <aaron.wu@dell.com>

* Fix to support mla multiple tp failed to read issue
* Fix style
* Fix ut and renaming

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
What this PR does / why we need it:
Fix the issue when running vLLM and LMCache in MP mode for DeepSeek MLA with TP=2.
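To illustrate the failure mode being fixed, here is a simplified model (not the actual LMCache code paths): with MLA, both TP workers read the same object key, so a reservation made for a single reader is exhausted by the first worker and the second worker's read fails.

```python
class SingleReaderReservation:
    """Simplified: a lookup reserves a fixed number of read slots per key."""

    def __init__(self) -> None:
        self.slots: dict[str, int] = {}

    def lookup(self, key: str, readers: int = 1) -> None:
        self.slots[key] = self.slots.get(key, 0) + readers

    def read(self, key: str) -> bool:
        if self.slots.get(key, 0) <= 0:
            return False  # the "failed to read" seen with MLA TP=2
        self.slots[key] -= 1
        return True


cache = SingleReaderReservation()
cache.lookup("chunk-0")             # scheduler reserves one reader
assert cache.read("chunk-0")        # TP rank 0 succeeds
assert not cache.read("chunk-0")    # TP rank 1 fails without the fix
cache.lookup("chunk-1", readers=2)  # with the fix: reserve tp_size readers
assert cache.read("chunk-1") and cache.read("chunk-1")
```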
Special notes for your reviewers:
If applicable: