[3/4][MP] L2 Store controller for MP mode#2646

Merged
ApostaC merged 7 commits into LMCache:dev from ApostaC:local-dev/l2-adapter-3
Mar 1, 2026

Conversation

Contributor

@ApostaC ApostaC commented Feb 27, 2026

What this PR does / why we need it:
Part of #2562

It supports storing to the L2 adapter

Special notes for your reviewers:

If applicable:

  • this PR contains user facing changes - docs added
  • this PR contains unit tests

[Add] helper functions in l2 adapter module for storage controller and tests

Signed-off-by: ApostaC <yihua98@uchicago.edu>
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the LMCache distributed storage system by introducing a dedicated L2 Store Controller. This new component enables the asynchronous and policy-driven transfer of data from the L1 cache to L2 adapters, improving data persistence and consistency across the caching hierarchy. The changes lay the groundwork for more sophisticated multi-process caching strategies by providing a robust mechanism for managing data movement between cache levels.

Highlights

  • L2 Store Controller Implementation: Introduced a new StoreController responsible for asynchronously copying data from L1 to L2 adapters after write operations are completed in L1. This controller operates in a background thread, listening for L1 write-completion events and managing L2 store tasks.
  • Store Policy Mechanism: Defined a StorePolicy abstract interface and a DefaultStorePolicy implementation. This policy dictates which L2 adapter(s) keys should be stored to and whether keys should be deleted from L1 after a successful L2 store.
  • L1Manager Integration: Integrated the new StoreController into the StorageManager, ensuring that L1 write-completion events trigger the L2 storage process. The StorageManager now initializes and manages the lifecycle of the StoreController and its associated L2 adapters.
  • Debug and Testability Enhancements: Added debug/test-only functions to MockL2Adapter to facilitate testing and monitoring of stored objects and locked keys. Comprehensive unit tests were also added for the StoreController and StorePolicy to validate their functionality and integration.
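To make the policy mechanism concrete, here is a minimal sketch of what the interface described above might look like. The method names select_store_targets and select_l1_deletions appear in code quoted later in the review; the signatures, the AdapterDescriptor fields, and the ObjectKey alias are assumptions for illustration, not LMCache's actual definitions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

# Illustrative stand-in; LMCache's real ObjectKey type differs.
ObjectKey = str

@dataclass
class AdapterDescriptor:
    """Lightweight info about an L2 adapter exposed to the policy (fields assumed)."""
    index: int
    type_name: str

class StorePolicy(ABC):
    @abstractmethod
    def select_store_targets(
        self, keys: list[ObjectKey], adapters: list[AdapterDescriptor]
    ) -> dict[int, list[ObjectKey]]:
        """Map adapter index -> keys that should be stored to that adapter."""

    @abstractmethod
    def select_l1_deletions(self, stored_keys: list[ObjectKey]) -> list[ObjectKey]:
        """Keys that may be deleted from L1 after a successful L2 store."""

class DefaultStorePolicy(StorePolicy):
    def select_store_targets(self, keys, adapters):
        # Default behavior per the summary: store every key to every adapter.
        return {a.index: list(keys) for a in adapters}

    def select_l1_deletions(self, stored_keys):
        # Default behavior per the summary: never delete from L1.
        return []

policy = DefaultStorePolicy()
adapters = [AdapterDescriptor(0, "mock"), AdapterDescriptor(1, "mock")]
print(policy.select_store_targets(["k1"], adapters))
# → {0: ['k1'], 1: ['k1']}
```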


Changelog
  • lmcache/v1/distributed/l2_adapters/__init__.py
    • Added create_l2_adapter function to instantiate L2 adapter instances based on their configuration.
  • lmcache/v1/distributed/l2_adapters/config.py
    • Introduced get_type_name_for_config function for reverse-lookup of registered L2 adapter type names from config instances.
  • lmcache/v1/distributed/l2_adapters/mock_l2_adapter.py
    • Added debug_get_stored_object_count, debug_get_locked_key_count, and debug_has_key methods for testing and debugging purposes.
  • lmcache/v1/distributed/storage_controllers/__init__.py
    • Exported the newly introduced StoreController.
  • lmcache/v1/distributed/storage_controllers/store_controller.py
    • Added the StoreController class, which manages asynchronous L1-to-L2 data storage, including a background event-driven loop.
    • Implemented StoreListener to receive L1 write-completion callbacks and enqueue keys for the StoreController.
    • Defined InFlightStoreTask dataclass to track submitted L2 store tasks.
  • lmcache/v1/distributed/storage_controllers/store_policy.py
    • Introduced the StorePolicy abstract base class for defining L1-to-L2 storage decisions.
    • Provided DefaultStorePolicy which stores all keys to all adapters and never deletes from L1.
    • Added AdapterDescriptor dataclass to provide lightweight information about L2 adapters to the store policy.
  • lmcache/v1/distributed/storage_manager.py
    • Imported necessary L2 adapter and store controller components.
    • Initialized L2 adapters and the StoreController within the StorageManager constructor.
    • Started the StoreController background thread.
    • Updated the close method to properly stop the StoreController and close all L2 adapters.
  • tests/v1/distributed/test_store_controller.py
    • Added comprehensive unit tests for the StoreController, covering its lifecycle, single and multiple adapter interactions, read lock release, and custom policy behavior.
  • tests/v1/distributed/test_store_policy.py
    • Added unit tests for the StorePolicy interface and the DefaultStorePolicy implementation.
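The create_l2_adapter / get_type_name_for_config pairing in the changelog can be sketched as a small registry with a reverse lookup. Everything here besides those two function names (the registry dict, register_l2_adapter, and the mock classes) is an assumption for illustration, not LMCache's actual code.

```python
from dataclasses import dataclass

# Hypothetical registry: type name -> (config class, adapter class).
_REGISTRY: dict[str, tuple[type, type]] = {}

def register_l2_adapter(type_name: str, config_cls: type, adapter_cls: type) -> None:
    _REGISTRY[type_name] = (config_cls, adapter_cls)

def get_type_name_for_config(config) -> str:
    # Reverse lookup: find the registered type name for a config instance.
    for name, (config_cls, _) in _REGISTRY.items():
        if isinstance(config, config_cls):
            return name
    raise KeyError(f"no adapter registered for {type(config).__name__}")

def create_l2_adapter(config):
    # Instantiate the adapter class registered for this config's type.
    _, adapter_cls = _REGISTRY[get_type_name_for_config(config)]
    return adapter_cls(config)

@dataclass
class MockL2Config:
    capacity: int = 0

class MockL2Adapter:
    def __init__(self, config: MockL2Config) -> None:
        self.config = config

register_l2_adapter("mock", MockL2Config, MockL2Adapter)
adapter = create_l2_adapter(MockL2Config(capacity=4))
print(get_type_name_for_config(adapter.config))  # → mock
```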
Activity
  • Implemented the core logic for the L2 Store controller, part of issue #2562 (Support L2 storage in MP mode).
  • Added unit tests to ensure the correctness and reliability of the new L2 store controller and policy mechanisms.

Signed-off-by: ApostaC <yihua98@uchicago.edu>
@ApostaC ApostaC mentioned this pull request Feb 27, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces the L2 Store Controller, a major component for managing asynchronous data transfer from L1 to L2 cache. The implementation features an event-driven architecture with a dedicated background thread, graceful shutdown handling, and a flexible policy-based approach for storage decisions. However, a critical reliability issue has been identified in the background thread's error handling that could lead to a memory leak and Denial of Service if an unhandled exception occurs. It is recommended to add robust error handling to the background loop to ensure the thread remains active or fails gracefully without leaking resources. Additionally, consider simplifying the InFlightStoreTask data structure by removing a redundant field to improve maintainability.

Comment on lines +209 to +246
    def _store_loop(self) -> None:
        """
        Main event-driven loop running in a background thread.

        Uses select.poll() to wait on:
        - The StoreListener's eventfd (new keys from L1 writes).
        - Each L2 adapter's store eventfd (completed store tasks).

        Exits when the stop flag is set.
        """
        poller = select.poll()

        listener_efd = self._listener.get_event_fd()
        poller.register(listener_efd, select.POLLIN)

        for efd in self._efd_to_adapter_index:
            poller.register(efd, select.POLLIN)

        while not self._stop_flag.is_set():
            ready = poller.poll(STORE_LOOP_POLL_TIMEOUT_MS)

            for fd, events in ready:
                if not (events & select.POLLIN):
                    continue

                # Consume the eventfd value
                try:
                    os.eventfd_read(fd)
                except (OSError, BlockingIOError):
                    pass

                if fd == listener_efd:
                    keys = self._listener.pop_pending_keys()
                    if keys:
                        self._process_new_keys(keys)
                elif fd in self._efd_to_adapter_index:
                    adapter_index = self._efd_to_adapter_index[fd]
                    self._process_completed_tasks(adapter_index)
Contributor


Severity: medium (security)

The background thread running the _store_loop lacks a top-level exception handler. If any operation within the loop (such as _process_new_keys, _process_completed_tasks, or calls to L2 adapter methods) raises an unhandled exception, the thread will terminate silently. This not only stops the asynchronous storage process but also leads to a memory leak: the StoreListener will continue to accumulate keys in its _pending_keys list every time a write finishes in L1Manager, but since the consumer thread is dead, these keys will never be popped. Over time, this can lead to memory exhaustion and a Denial of Service (DoS).
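The suggested fix can be illustrated with a generic, self-contained sketch: wrap each iteration of the consumer loop in try/except so one failing item cannot silently kill the background thread and strand the producer's pending queue. The ResilientConsumer class, its method names, and the simulated failure are all illustrative, not LMCache code.

```python
import logging
import queue
import threading
import time

logger = logging.getLogger(__name__)

class ResilientConsumer:
    """Illustrative sketch of per-iteration error handling in a
    background consumer thread (names assumed, not LMCache's)."""

    def __init__(self) -> None:
        self.pending: queue.Queue = queue.Queue()
        self.processed: list[str] = []
        self._stop = threading.Event()

    def _handle(self, item: str) -> None:
        if item == "bad":
            raise ValueError("simulated adapter failure")
        self.processed.append(item)

    def run(self) -> None:
        while not self._stop.is_set():
            try:
                item = self.pending.get(timeout=0.05)
            except queue.Empty:
                continue
            try:
                self._handle(item)
            except Exception:
                # Log and continue: the thread stays alive and keeps
                # draining the queue instead of leaking pending items.
                logger.exception("store iteration failed; continuing")

consumer = ResilientConsumer()
t = threading.Thread(target=consumer.run, daemon=True)
t.start()
for item in ("a", "bad", "b"):
    consumer.pending.put(item)
time.sleep(0.5)
consumer._stop.set()
t.join()
print(consumer.processed)  # the thread survived the failing item
```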

Contributor


I second this. But maybe not urgent. We can put a TODO on this.

Contributor Author

@ApostaC ApostaC Feb 28, 2026


This makes sense; I will add it in a follow-up PR.

Comment on lines +122 to +127
keys: list[ObjectKey]
"""All keys that were submitted in this store task."""

read_locked_keys: list[ObjectKey]
"""The subset of keys for which reserve_read succeeded
(i.e., keys holding an L1 read lock that must be released)."""
Contributor


Severity: medium

The InFlightStoreTask dataclass has keys and read_locked_keys fields which appear to hold identical data. Since a store task is only submitted for keys where reserve_read succeeded, the set of 'keys submitted to the task' is the same as the 'keys that were successfully read-locked'.

This redundancy can be removed to simplify the code. I suggest removing the read_locked_keys field and using the keys field for all related operations.

This would involve:

  1. Updating InFlightStoreTask as suggested below.
  2. In _process_new_keys, changing the instantiation to InFlightStoreTask(adapter_index=adapter_index, keys=successful_keys).
  3. In _process_completed_tasks and _cleanup_in_flight_tasks, using task.keys when calling l1_mgr.finish_read().
Suggested change
-keys: list[ObjectKey]
-"""All keys that were submitted in this store task."""
-
-read_locked_keys: list[ObjectKey]
-"""The subset of keys for which reserve_read succeeded
-(i.e., keys holding an L1 read lock that must be released)."""
+keys: list[ObjectKey]
+"""Keys for which reserve_read succeeded and were submitted in this store task.
+These keys hold an L1 read lock that must be released."""

"""
poller = select.poll()

listener_efd = self._listener.get_event_fd()
Contributor

@royyhuang royyhuang Feb 27, 2026


Maybe make this a list, since later we may need to register other listeners for other purposes. But if we promote it to a list of listeners, I also worry that get_event_fd is not a standard interface of EventListener, so maybe just renaming it to _l1_listener is more convenient for now?

Contributor Author


It is fine here. The listener_efd is not a general requirement of the listeners, but a special implementation needed to trigger the store operation (i.e., specialized to the store controller).

Contributor

@royyhuang royyhuang left a comment


Overall LGTM just a few small comments.

"""

@abstractmethod
def select_store_targets(
Contributor

@KuntaiDu KuntaiDu Feb 27, 2026


Maybe select_store_after_l1_writes

"""

@abstractmethod
def select_l1_deletions(
Contributor

@KuntaiDu KuntaiDu Feb 27, 2026


select_l1_deletions_after_store

@ApostaC ApostaC added the full Run comprehensive tests on this PR label Feb 28, 2026
Contributor

@sammshen sammshen left a comment


LGTM

Contributor

@KuntaiDu KuntaiDu left a comment


LGTM

Signed-off-by: Yihua Cheng <yihua98@uchicago.edu>
Signed-off-by: ApostaC <yihua98@uchicago.edu>
@ApostaC ApostaC merged commit 1985030 into LMCache:dev Mar 1, 2026
24 checks passed
hlin99 pushed a commit to hlin99/LMCache that referenced this pull request Mar 2, 2026
* [Add] helper functions in l2 adapter module for storage controller and tests

Signed-off-by: ApostaC <yihua98@uchicago.edu>

* add l2 store controller

Signed-off-by: ApostaC <yihua98@uchicago.edu>

* hook with storage manager and add unit tests

Signed-off-by: ApostaC <yihua98@uchicago.edu>

* [add] debug log for store operation

Signed-off-by: ApostaC <yihua98@uchicago.edu>
oferki pushed a commit to oferki/LMCache that referenced this pull request Mar 3, 2026
mauryaavinash95 pushed a commit to mauryaavinash95/LMCache that referenced this pull request Mar 7, 2026
shaoxiawjc pushed a commit to shaoxiawjc/LMCache that referenced this pull request Mar 11, 2026
realAaronWu pushed a commit to realAaronWu/LMCache that referenced this pull request Mar 20, 2026
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026

Labels

full Run comprehensive tests on this PR

4 participants