[3/4][MP] L2 Store controller for MP mode#2646
Conversation
…d tests Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the LMCache distributed storage system by introducing a dedicated L2 Store Controller. This new component enables the asynchronous and policy-driven transfer of data from the L1 cache to L2 adapters, improving data persistence and consistency across the caching hierarchy. The changes lay the groundwork for more sophisticated multi-process caching strategies by providing a robust mechanism for managing data movement between cache levels. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
Signed-off-by: ApostaC <yihua98@uchicago.edu>
There was a problem hiding this comment.
Code Review
This pull request introduces the L2 Store Controller, a major component for managing asynchronous data transfer from L1 to L2 cache. The implementation features an event-driven architecture with a dedicated background thread, graceful shutdown handling, and a flexible policy-based approach for storage decisions. However, a critical reliability issue has been identified in the background thread's error handling that could lead to a memory leak and Denial of Service if an unhandled exception occurs. It is recommended to add robust error handling to the background loop to ensure the thread remains active or fails gracefully without leaking resources. Additionally, consider simplifying the InFlightStoreTask data structure by removing a redundant field to improve maintainability.
| def _store_loop(self) -> None: | ||
| """ | ||
| Main event-driven loop running in a background thread. | ||
|
|
||
| Uses select.poll() to wait on: | ||
| - The StoreListener's eventfd (new keys from L1 writes). | ||
| - Each L2 adapter's store eventfd (completed store tasks). | ||
|
|
||
| Exits when the stop flag is set. | ||
| """ | ||
| poller = select.poll() | ||
|
|
||
| listener_efd = self._listener.get_event_fd() | ||
| poller.register(listener_efd, select.POLLIN) | ||
|
|
||
| for efd in self._efd_to_adapter_index: | ||
| poller.register(efd, select.POLLIN) | ||
|
|
||
| while not self._stop_flag.is_set(): | ||
| ready = poller.poll(STORE_LOOP_POLL_TIMEOUT_MS) | ||
|
|
||
| for fd, events in ready: | ||
| if not (events & select.POLLIN): | ||
| continue | ||
|
|
||
| # Consume the eventfd value | ||
| try: | ||
| os.eventfd_read(fd) | ||
| except (OSError, BlockingIOError): | ||
| pass | ||
|
|
||
| if fd == listener_efd: | ||
| keys = self._listener.pop_pending_keys() | ||
| if keys: | ||
| self._process_new_keys(keys) | ||
| elif fd in self._efd_to_adapter_index: | ||
| adapter_index = self._efd_to_adapter_index[fd] | ||
| self._process_completed_tasks(adapter_index) |
There was a problem hiding this comment.
The background thread running the _store_loop lacks a top-level exception handler. If any operation within the loop (such as _process_new_keys, _process_completed_tasks, or calls to L2 adapter methods) raises an unhandled exception, the thread will terminate silently. This not only stops the asynchronous storage process but also leads to a memory leak: the StoreListener will continue to accumulate keys in its _pending_keys list every time a write finishes in L1Manager, but since the consumer thread is dead, these keys will never be popped. Over time, this can lead to memory exhaustion and a Denial of Service (DoS).
There was a problem hiding this comment.
I second this. But maybe not urgent. We can put a TODO on this.
There was a problem hiding this comment.
This makes sense, I will add it in the follow-up PRs
| keys: list[ObjectKey] | ||
| """All keys that were submitted in this store task.""" | ||
|
|
||
| read_locked_keys: list[ObjectKey] | ||
| """The subset of keys for which reserve_read succeeded | ||
| (i.e., keys holding an L1 read lock that must be released).""" |
There was a problem hiding this comment.
The InFlightStoreTask dataclass has keys and read_locked_keys fields which appear to hold identical data. Since a store task is only submitted for keys where reserve_read succeeded, the set of 'keys submitted to the task' is the same as the 'keys that were successfully read-locked'.
This redundancy can be removed to simplify the code. I suggest removing the read_locked_keys field and using the keys field for all related operations.
This would involve:
- Updating
InFlightStoreTaskas suggested below. - In
_process_new_keys, changing the instantiation toInFlightStoreTask(adapter_index=adapter_index, keys=successful_keys). - In
_process_completed_tasksand_cleanup_in_flight_tasks, usingtask.keyswhen callingl1_mgr.finish_read().
| keys: list[ObjectKey] | |
| """All keys that were submitted in this store task.""" | |
| read_locked_keys: list[ObjectKey] | |
| """The subset of keys for which reserve_read succeeded | |
| (i.e., keys holding an L1 read lock that must be released).""" | |
| keys: list[ObjectKey] | |
| """Keys for which reserve_read succeeded and were submitted in this store task. | |
| These keys hold an L1 read lock that must be released.""" |
| """ | ||
| poller = select.poll() | ||
|
|
||
| listener_efd = self._listener.get_event_fd() |
There was a problem hiding this comment.
Maybe make this as a list since later we may need to register other listeners for other purposes. But by promoting to a list of listeners, I am also worry about get_event_fd is not an standard interface of EventListener so maybe just renaming it to _l1_listener is more convenient for now?
There was a problem hiding this comment.
It is fine here. The listener_efd is not a general requirement of the listeners, but a special implementation needed to trigger the store operation (i.e., specialized to the store controller).
royyhuang
left a comment
There was a problem hiding this comment.
Overall LGTM just a few small comments.
| """ | ||
|
|
||
| @abstractmethod | ||
| def select_store_targets( |
There was a problem hiding this comment.
Maybe
select_store_after_l1_writes
| """ | ||
|
|
||
| @abstractmethod | ||
| def select_l1_deletions( |
There was a problem hiding this comment.
select_l1_deletions_after_store
Signed-off-by: Yihua Cheng <yihua98@uchicago.edu>
Signed-off-by: Yihua Cheng <yihua98@uchicago.edu>
Signed-off-by: ApostaC <yihua98@uchicago.edu>
* [Add] helper functions in l2 adapter module for storage controller and tests Signed-off-by: ApostaC <yihua98@uchicago.edu> * add l2 store controller Signed-off-by: ApostaC <yihua98@uchicago.edu> * hook with storage manager and add unit tests Signed-off-by: ApostaC <yihua98@uchicago.edu> * [add] debug log for store operation Signed-off-by: ApostaC <yihua98@uchicago.edu>
* [Add] helper functions in l2 adapter module for storage controller and tests Signed-off-by: ApostaC <yihua98@uchicago.edu> * add l2 store controller Signed-off-by: ApostaC <yihua98@uchicago.edu> * hook with storage manager and add unit tests Signed-off-by: ApostaC <yihua98@uchicago.edu> * [add] debug log for store operation Signed-off-by: ApostaC <yihua98@uchicago.edu> Signed-off-by: Ofer Kiselov Nahman <ofer.kiselovnahman@weka.io>
* [Add] helper functions in l2 adapter module for storage controller and tests Signed-off-by: ApostaC <yihua98@uchicago.edu> * add l2 store controller Signed-off-by: ApostaC <yihua98@uchicago.edu> * hook with storage manager and add unit tests Signed-off-by: ApostaC <yihua98@uchicago.edu> * [add] debug log for store operation Signed-off-by: ApostaC <yihua98@uchicago.edu>
* [Add] helper functions in l2 adapter module for storage controller and tests Signed-off-by: ApostaC <yihua98@uchicago.edu> * add l2 store controller Signed-off-by: ApostaC <yihua98@uchicago.edu> * hook with storage manager and add unit tests Signed-off-by: ApostaC <yihua98@uchicago.edu> * [add] debug log for store operation Signed-off-by: ApostaC <yihua98@uchicago.edu>
* [Add] helper functions in l2 adapter module for storage controller and tests Signed-off-by: ApostaC <yihua98@uchicago.edu> * add l2 store controller Signed-off-by: ApostaC <yihua98@uchicago.edu> * hook with storage manager and add unit tests Signed-off-by: ApostaC <yihua98@uchicago.edu> * [add] debug log for store operation Signed-off-by: ApostaC <yihua98@uchicago.edu> Signed-off-by: shaoxiawjc <wjc2800@163.com>
* [Add] helper functions in l2 adapter module for storage controller and tests Signed-off-by: ApostaC <yihua98@uchicago.edu> * add l2 store controller Signed-off-by: ApostaC <yihua98@uchicago.edu> * hook with storage manager and add unit tests Signed-off-by: ApostaC <yihua98@uchicago.edu> * [add] debug log for store operation Signed-off-by: ApostaC <yihua98@uchicago.edu> Signed-off-by: Aaron Wu <aaron.wu@dell.com>
* [Add] helper functions in l2 adapter module for storage controller and tests Signed-off-by: ApostaC <yihua98@uchicago.edu> * add l2 store controller Signed-off-by: ApostaC <yihua98@uchicago.edu> * hook with storage manager and add unit tests Signed-off-by: ApostaC <yihua98@uchicago.edu> * [add] debug log for store operation Signed-off-by: ApostaC <yihua98@uchicago.edu>
* [Add] helper functions in l2 adapter module for storage controller and tests Signed-off-by: ApostaC <yihua98@uchicago.edu> * add l2 store controller Signed-off-by: ApostaC <yihua98@uchicago.edu> * hook with storage manager and add unit tests Signed-off-by: ApostaC <yihua98@uchicago.edu> * [add] debug log for store operation Signed-off-by: ApostaC <yihua98@uchicago.edu>
What this PR does / why we need it:
Part of #2562
It supports storing to the L2 adapter
Special notes for your reviewers:
If applicable: