[Bugfix] fix some memory leak in cache_engine and eic connector#2544
sammshen merged 11 commits into LMCache:dev
Conversation
Signed-off-by: liubj77 <liubj77@gmail.com>
Summary of Changes (Gemini Code Assist): This pull request addresses critical memory leaks identified in the cache engine and EIC connector.
Code Review
This pull request effectively addresses two memory leaks. The first occurs in the cache_engine when a batched retrieval has partial failures, leading to unused memory objects not being released. The second is in the eic_connector, where a memory object was not released upon a data fetching error. The fixes are correct and crucial for system stability. I have one suggestion for a minor refactoring in cache_engine.py to improve efficiency.
maobaolong left a comment
@liubj77 I left a comment.
| f"eic mget data {key_str} failed, status_code {status_code}" | ||
| " err_code {err_code}" | ||
| ) | ||
| memory_obj.ref_count_down() |
How about the exception path? In other words, `memory_obj` should `ref_count_down` even if an exception occurred.
According to the code design, EIC reports results via status codes and will not raise any exceptions.
That's OK. There is another trick: we can surround the code after `self.memory_allocator.allocate` with a try/except that catches all exceptions; then, whatever happens, we can call `ref_count_down`. The memory object will never leak.
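A minimal sketch of that trick, under stated assumptions: `allocator`, `fetch_into`, and the zero-means-success status code are hypothetical stand-ins, not the actual EIC connector API. The point is that every fallible step after `allocate` sits inside a guard that always drops the reference on failure.

```python
import logging

logger = logging.getLogger(__name__)


def get_data_safely(allocator, fetch_into, key, shape, dtype):
    """Allocate, then guard everything after allocation (illustrative sketch)."""
    memory_obj = allocator.allocate(shape, dtype)
    if memory_obj is None:
        return None
    try:
        # All fallible work lives inside the guarded region.
        status_code = fetch_into(key, memory_obj)  # hypothetical fetch helper
        if status_code != 0:
            logger.error(
                "eic mget data %s failed, status_code %s", key, status_code
            )
            memory_obj.ref_count_down()
            return None
        return memory_obj
    except Exception:
        # Whatever happened, drop our reference so the buffer never leaks.
        memory_obj.ref_count_down()
        raise
```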
Done, the possible exceptions are now handled in a try/except.
@maobaolong Please review this again and see if it's acceptable to merge.
Signed-off-by: liubj77 <liubj77@gmail.com>
@maobaolong @sammshen @deng451e Could you please review this and see if it's acceptable to merge?
LGTM
sammshen left a comment
Need to take another look, sorry.
```python
ret_mask[start:end] = True
used_keys.add(key)

for (key, _, _), memory_obj in zip(blocks, memory_objs, strict=False):
```
Dumb question: why do we need to do this check twice?
@deng451e has a great question, which is: how do you know there is no overlap between these two sets (the complement of `used_keys`, and `end >= last_failed_block_start`)?
DongDongJu left a comment
LGTM, but I am not that familiar with the EIC connector, so I just left one comment from reading the code.
Signed-off-by: liubj77 <liubj77@gmail.com>
Force-pushed from bc85af0 to 57437f7.
@DongDongJu @sammshen @maobaolong @deng451e Can this PR be merged? The EIC modification didn't change the original logic; it just releases memory on failure. I've tested it for about two weeks, and with this patch, LMCache memory leaks have decreased significantly. Of course, there are some other memory leaks that haven't been discovered yet.
Fixes #2017
@DongDongJu @sammshen @ApostaC Please take a look.
This might be related to this bug, because although I didn't use RedisConnector, I looked at the code and it seems that RedisConnector would also have the same memory leak.
After running for a long period of time, this log will appear.
What this PR does / why we need it:
When `async_loading` is not used, memory leaks can occur due to differences in the implementation of some `RemoteConnector` classes, such as whether they support `batched_get`, and differences in the implementation of `batched_get` itself. For example:

`RedisConnector` does not support batched get (`support_batched_get=False`), so it falls back to the logic in `batched_get_blocking`, which queries multiple keys simultaneously. This can lead to a problem: an earlier key may fail while a later key succeeds.

Similarly, although `EicConnector` supports batched get (`support_batched_get=True`), its `batched_get` implementation also queries each key independently in parallel, leading to the same problem as above.
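To make the failure mode concrete, below is a minimal sketch of the engine-side consumption pattern discussed next; `blocks`, `connector`, and the `(key, start, end)` layout are hypothetical names, not the actual LMCache code.

```python
# Illustrative sketch of the leak, not the actual LMCache code.
# batched_get fetched each key independently, so results can look like
# [obj, None, obj, obj] -- a miss followed by later successes.
memory_objs = connector.batched_get(keys)  # hypothetical call

retrieved = []
for (key, start, end), memory_obj in zip(blocks, memory_objs):
    if memory_obj is None:
        break  # truncate at the first miss ...
    retrieved.append((start, end, memory_obj))

# ... but every successful fetch after the miss still holds the reference
# taken at allocation time and is never ref_count_down()'d: it leaks.
```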
However, the function `_process_tokens_internal` in `LMCacheEngine` does not handle this scenario; it only performs `break` and truncate, as sketched above, which leads to a memory leak of `memory_obj`.

Special notes for your reviewers:
As shown in the PR's new code, I added logging for unreleased memory. After running it in the test environment for a long time, I confirmed that some `memory_obj` instances are not being released. This PR also fixes the `memory_obj` leak in `EICConnector` when `get_data` fails.
Note
Medium Risk
Touches retrieval control-flow and reference counting, so mistakes could cause premature frees or missing cache hits; changes are bounded and covered by new unit tests for failure/truncation cases.
Overview
Fixes a KV-cache memory leak during retrieval when `batched_get` returns non-contiguous results (e.g., a later key succeeds after an earlier key fails). `LMCacheEngine._process_tokens_internal` now (1) tracks the earliest failed block across locations, (2) immediately `ref_count_down()`s any successfully fetched but unused blocks after a per-location failure, and (3) truncates returned chunks using `end <= last_failed_block_start` while freeing any already-added chunks that extend past the gap. `EICConnector.get_data` now guarantees allocated `memory_obj`s are released via `ref_count_down()` on both error status codes and unexpected exceptions during `mget`. Adds focused unit tests covering boundary truncation, multi-location failure ordering, freeing of unused successful blocks, and avoiding double-free scenarios.
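A condensed sketch of that control flow, under assumptions: `blocks`, `chunks`, the `(key, start, end)` tuples, and `truncate_and_free` are hypothetical names rather than the PR's exact code.

```python
def truncate_and_free(blocks, memory_objs, chunks):
    """Illustrative sketch of the fixed retrieval path (not the PR's code)."""
    last_failed_block_start = None  # earliest failed block seen so far

    for (key, start, end), memory_obj in zip(blocks, memory_objs):
        if memory_obj is None:
            # Track the earliest gap; results at or past it are unusable.
            if last_failed_block_start is None or start < last_failed_block_start:
                last_failed_block_start = start
        elif last_failed_block_start is not None and end > last_failed_block_start:
            # Fetched successfully but stranded past the gap: free immediately.
            memory_obj.ref_count_down()
        else:
            chunks.append((start, end, memory_obj))

    # Truncate with end <= last_failed_block_start, freeing anything already
    # added that extends past the gap instead of leaking it.
    if last_failed_block_start is not None:
        kept = []
        for start, end, memory_obj in chunks:
            if end <= last_failed_block_start:
                kept.append((start, end, memory_obj))
            else:
                memory_obj.ref_count_down()
        chunks[:] = kept
    return chunks
```

The invariant worth noting: every successfully allocated `memory_obj` has exactly one outcome on every path, either it is returned in `chunks` or its reference count is dropped exactly once, which is what rules out both the leak and a double free.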
Reviewed by Cursor Bugbot for commit 05c4043.