Skip to content

[Core] Add persistence interfaces and nixl persistence#2938

Merged
sammshen merged 10 commits intodevfrom
localdev/l2-persistence
Apr 17, 2026
Merged

[Core] Add persistence interfaces and nixl persistence#2938
sammshen merged 10 commits intodevfrom
localdev/l2-persistence

Conversation

@YaoJiayi
Copy link
Copy Markdown
Collaborator

@YaoJiayi YaoJiayi commented Apr 2, 2026

What this PR does / why we need it:

Special notes for your reviewers:

If applicable:

  • this PR contains user facing changes - docs added
  • this PR contains unit tests

Note

Medium Risk
Adds a new storage adapter and changes L2 config parsing to always include persist settings; persistence and secondary lookup touch on-disk state and capacity accounting, which could affect eviction and restart behavior if incorrect.

Overview
Adds a new nixl_store_dynamic NIXL L2 adapter that stores each key into a deterministic per-key file (opened/registered per operation) and supports persist/recover by retaining files on shutdown and performing secondary on-disk lookup on cache misses.

Introduces PersistConfig (parsed from JSON persist_enabled, default true) into L2AdapterConfigBase, and updates docs to describe persist semantics and the new adapter’s capacity model (max_capacity_gb) and eviction behavior. Also hardens the existing static nixl_store store path to skip already-present keys to avoid leaking pool slots, and adds comprehensive unit tests for the dynamic adapter (store/lookup/load, delete, capacity, and persistence).

Reviewed by Cursor Bugbot for commit 2a2bbec. Bugbot is set up for automated code reviews on this repo. Configure here.

Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Autofix Details

Bugbot Autofix prepared fixes for all 3 issues found in the latest run.

  • ✅ Fixed: Filename helper doesn't sanitize / in model names
    • I updated dynamic filename encoding to replace / with -SEP- and decode it back in the reverse helper so model names no longer create unintended subdirectories.
  • ✅ Fixed: PersistDesc class defined but never used anywhere
    • I removed the unused PersistDesc dataclass from internal_api.py to eliminate the redundant and unreferenced abstraction.
  • ✅ Fixed: os.makedirs crashes when persist_path lacks directory component
    • I guarded directory creation in persist() so os.makedirs is called only when persist_path has a non-empty directory component.

Create PR

Or push these changes by commenting:

@cursor push f8106b8f6f
Preview (f8106b8f6f)
diff --git a/lmcache/v1/distributed/internal_api.py b/lmcache/v1/distributed/internal_api.py
--- a/lmcache/v1/distributed/internal_api.py
+++ b/lmcache/v1/distributed/internal_api.py
@@ -160,11 +160,3 @@
     """The key of the object to be evicted"""
 
 
-@dataclass(frozen=True)
-class PersistDesc:
-    """
-    Descriptor for persist/recover operations on L2 adapters.
-    """
-
-    disk_persist_path: str
-    """ Path on disk where adapter metadata should be persisted/recovered """

diff --git a/lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py b/lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py
--- a/lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py
+++ b/lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py
@@ -47,7 +47,9 @@
 
 logger = init_logger(__name__)
 
+_PATH_SLASH_REPLACEMENT = "-SEP-"
 
+
 # ---------------------------------------------------------------
 # ObjectKey <-> file path helpers
 # ---------------------------------------------------------------
@@ -56,7 +58,8 @@
 def _object_key_to_filename(key: ObjectKey) -> str:
     """Derive a deterministic file name from an ObjectKey."""
     chunk_hex = key.chunk_hash.hex()
-    return f"{key.model_name}_{key.kv_rank:08x}_{chunk_hex}.bin"
+    safe_model_name = key.model_name.replace("/", _PATH_SLASH_REPLACEMENT)
+    return f"{safe_model_name}_{key.kv_rank:08x}_{chunk_hex}.bin"
 
 
 def _filename_to_object_key(filename: str) -> ObjectKey:
@@ -64,7 +67,7 @@
     stem = filename.removesuffix(".bin")
     # Split from the right: last part is chunk_hash, second-last is kv_rank
     parts = stem.rsplit("_", 2)
-    model_name = parts[0]
+    model_name = parts[0].replace(_PATH_SLASH_REPLACEMENT, "/")
     kv_rank = int(parts[1], 16)
     chunk_hash = bytes.fromhex(parts[2])
     return ObjectKey(chunk_hash=chunk_hash, model_name=model_name, kv_rank=kv_rank)
@@ -453,7 +456,9 @@
                     }
                 entries.append(entry)
 
-        os.makedirs(os.path.dirname(config.persist_path), exist_ok=True)
+        persist_dir = os.path.dirname(config.persist_path)
+        if persist_dir:
+            os.makedirs(persist_dir, exist_ok=True)
         with open(config.persist_path, "w") as f:
             json.dump(entries, f)

This Bugbot Autofix run was free. To enable autofix for future PRs, go to the Cursor dashboard.

Comment thread lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py Outdated
Comment thread lmcache/v1/distributed/internal_api.py Outdated
Comment thread lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py Outdated
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a dynamic Nixl L2 adapter supporting on-demand file registration and metadata persistence, with corresponding updates to the base interface and configuration parsing. Feedback highlights a performance bottleneck caused by blocking I/O within a locked section and identifies several style guide violations regarding missing type hints for new functions. It also suggests using strict zip iteration to ensure consistency between keys and memory objects.

Comment thread lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py Outdated
Comment thread lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py
Comment thread lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py
Comment thread lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py
Comment thread lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py
Comment thread lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py
Comment thread lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
Comment thread lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py Outdated
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
Comment thread lmcache/v1/distributed/l2_adapters/nixl_store_l2_adapter.py Outdated
Comment thread lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py
Comment thread lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py Outdated
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
Comment thread lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py
@deng451e deng451e added the full Run comprehensive tests on this PR label Apr 3, 2026
Copy link
Copy Markdown
Contributor

@ApostaC ApostaC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just did a very superficial pass. Quick question: if we use nixl dynamic adapter, do we still need persist and recover? If it's not needed, we don't need to add them since currently most of the other backends do not need them either.

Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Store and secondary lookup race causes permanent _total_bytes overcount
    • I added a post-await key recheck before mutating _memory_objects and _total_bytes in store, preventing duplicate byte accounting when secondary lookup races in, and added a regression test for this interleaving.

Create PR

Or push these changes by commenting:

@cursor push c49599c49c
Preview (c49599c49c)
diff --git a/lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py b/lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py
--- a/lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py
+++ b/lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py
@@ -571,9 +571,12 @@
                     pin_count=1,
                 )
                 with self._lock:
-                    self._memory_objects[key] = store_obj
-                    self._total_bytes += store_obj.size
-                    store_obj.decrease_pin_count()
+                    # Re-check in case lookup/recovery populated this key
+                    # while we awaited dynamic_store_file.
+                    if key not in self._memory_objects:
+                        self._memory_objects[key] = store_obj
+                        self._total_bytes += store_obj.size
+                        store_obj.decrease_pin_count()
                 stored_keys.append(key)
 
         except Exception:

diff --git a/tests/v1/distributed/test_nixl_store_dynamic_l2_adapter.py b/tests/v1/distributed/test_nixl_store_dynamic_l2_adapter.py
--- a/tests/v1/distributed/test_nixl_store_dynamic_l2_adapter.py
+++ b/tests/v1/distributed/test_nixl_store_dynamic_l2_adapter.py
@@ -7,6 +7,7 @@
 """
 
 # Standard
+import asyncio
 import os
 import select
 import shutil
@@ -561,6 +562,47 @@
 
         assert usage_after < usage_before
 
+    def test_store_secondary_lookup_race_does_not_overcount_usage(
+        self, adapter_with_persist
+    ):
+        """Store+recover race should not leave stale bytes after delete."""
+        adpt, buf, _, _, _ = adapter_with_persist
+        key = create_object_key(1001)
+        obj = create_memory_obj(buf, page_index=0)
+        lookup_tasks: list[int] = []
+
+        async def _dynamic_store_with_lookup(mem_indices, file_path, page_size):
+            file_size = len(mem_indices) * page_size
+            with open(file_path, "wb") as f:
+                f.truncate(file_size)
+            lookup_tasks.append(adpt.submit_lookup_and_lock_task([key]))
+            await asyncio.sleep(0)
+
+        try:
+            adpt.nixl_agent.dynamic_store_file = _dynamic_store_with_lookup
+            adpt.submit_store_task([key], [obj])
+            wait_for_event_fd(adpt.get_store_event_fd())
+            wait_for_event_fd(adpt.get_lookup_and_lock_event_fd())
+            adpt.pop_completed_store_tasks()
+
+            bitmap = adpt.query_lookup_and_lock_result(lookup_tasks[0])
+            assert bitmap is not None
+            assert bitmap.test(0)
+
+            adpt.submit_unlock([key])
+            sync_task = adpt.submit_lookup_and_lock_task([])
+            wait_for_event_fd(adpt.get_lookup_and_lock_event_fd())
+            adpt.query_lookup_and_lock_result(sync_task)
+
+            usage_before_delete, _ = adpt.get_usage()
+            adpt.delete([key])
+            usage_after_delete, _ = adpt.get_usage()
+
+            assert usage_before_delete > 0.0
+            assert usage_after_delete == 0.0
+        finally:
+            adpt.close()
+
     def test_store_rejected_when_capacity_exceeded(self):
         """Store should stop when max capacity is reached."""
         tmp_dir = tempfile.mkdtemp(prefix="nixl_dyn_cap_test_")

This Bugbot Autofix run was free. To enable autofix for future PRs, go to the Cursor dashboard.

Comment thread lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py Outdated
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
Comment thread lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py
Copy link
Copy Markdown
Contributor

@ApostaC ApostaC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, here's the current semantics of persist_enabled and recover_enabled

True False
persist_enabled No special thing will happen Will delete all objects from L2 when shutting down
recover_enabled Try read the object from L2 via filename if it's not seen before Skip loading the object if it's not seen (i.e., being stored locally) before

I think we actually want the behavior of persist_enabled = True and recover_enabled = True. In this case, we can just drop the persist/recover-related configs and simplify the code/design doc a bit more.

Are there any cases we want to delete the L2 objects after shutdown, or don't look at L2 physical files when lookup?

@YaoJiayi YaoJiayi requested a review from ApostaC April 15, 2026 08:09
Copy link
Copy Markdown
Contributor

@ApostaC ApostaC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review covering coding quality and correctness issues (error + warning severity).
(From Claude Code)

Comment thread lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py
Copy link
Copy Markdown
Contributor

@ApostaC ApostaC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can keep the current way of config. But here are two problems identified by claude code

Comment thread lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py Outdated
Comment thread lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 70144aa. Configure here.

Comment thread lmcache/v1/distributed/l2_adapters/config.py
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
Copy link
Copy Markdown
Contributor

@ApostaC ApostaC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Copy Markdown
Contributor

@sammshen sammshen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTMQ

@sammshen sammshen enabled auto-merge (squash) April 16, 2026 22:39
@sammshen sammshen merged commit 69787b8 into dev Apr 17, 2026
32 of 34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

full Run comprehensive tests on this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants