[Core] Add persistence interfaces and nixl persistence#2938
Conversation
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 3 potential issues.
Autofix Details
Bugbot Autofix prepared fixes for all 3 issues found in the latest run.
- ✅ Fixed: Filename helper doesn't sanitize
/in model names- I updated dynamic filename encoding to replace
/with-SEP-and decode it back in the reverse helper so model names no longer create unintended subdirectories.
- I updated dynamic filename encoding to replace
- ✅ Fixed:
PersistDescclass defined but never used anywhere- I removed the unused
PersistDescdataclass frominternal_api.pyto eliminate the redundant and unreferenced abstraction.
- I removed the unused
- ✅ Fixed:
os.makedirscrashes when persist_path lacks directory component- I guarded directory creation in
persist()soos.makedirsis called only whenpersist_pathhas a non-empty directory component.
- I guarded directory creation in
Or push these changes by commenting:
@cursor push f8106b8f6f
Preview (f8106b8f6f)
diff --git a/lmcache/v1/distributed/internal_api.py b/lmcache/v1/distributed/internal_api.py
--- a/lmcache/v1/distributed/internal_api.py
+++ b/lmcache/v1/distributed/internal_api.py
@@ -160,11 +160,3 @@
"""The key of the object to be evicted"""
-@dataclass(frozen=True)
-class PersistDesc:
- """
- Descriptor for persist/recover operations on L2 adapters.
- """
-
- disk_persist_path: str
- """ Path on disk where adapter metadata should be persisted/recovered """
diff --git a/lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py b/lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py
--- a/lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py
+++ b/lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py
@@ -47,7 +47,9 @@
logger = init_logger(__name__)
+_PATH_SLASH_REPLACEMENT = "-SEP-"
+
# ---------------------------------------------------------------
# ObjectKey <-> file path helpers
# ---------------------------------------------------------------
@@ -56,7 +58,8 @@
def _object_key_to_filename(key: ObjectKey) -> str:
"""Derive a deterministic file name from an ObjectKey."""
chunk_hex = key.chunk_hash.hex()
- return f"{key.model_name}_{key.kv_rank:08x}_{chunk_hex}.bin"
+ safe_model_name = key.model_name.replace("/", _PATH_SLASH_REPLACEMENT)
+ return f"{safe_model_name}_{key.kv_rank:08x}_{chunk_hex}.bin"
def _filename_to_object_key(filename: str) -> ObjectKey:
@@ -64,7 +67,7 @@
stem = filename.removesuffix(".bin")
# Split from the right: last part is chunk_hash, second-last is kv_rank
parts = stem.rsplit("_", 2)
- model_name = parts[0]
+ model_name = parts[0].replace(_PATH_SLASH_REPLACEMENT, "/")
kv_rank = int(parts[1], 16)
chunk_hash = bytes.fromhex(parts[2])
return ObjectKey(chunk_hash=chunk_hash, model_name=model_name, kv_rank=kv_rank)
@@ -453,7 +456,9 @@
}
entries.append(entry)
- os.makedirs(os.path.dirname(config.persist_path), exist_ok=True)
+ persist_dir = os.path.dirname(config.persist_path)
+ if persist_dir:
+ os.makedirs(persist_dir, exist_ok=True)
with open(config.persist_path, "w") as f:
json.dump(entries, f)This Bugbot Autofix run was free. To enable autofix for future PRs, go to the Cursor dashboard.
There was a problem hiding this comment.
Code Review
This pull request introduces a dynamic Nixl L2 adapter supporting on-demand file registration and metadata persistence, with corresponding updates to the base interface and configuration parsing. Feedback highlights a performance bottleneck caused by blocking I/O within a locked section and identifies several style guide violations regarding missing type hints for new functions. It also suggests using strict zip iteration to ensure consistency between keys and memory objects.
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
ApostaC
left a comment
There was a problem hiding this comment.
Just did a very superficial pass. Quick question: if we use nixl dynamic adapter, do we still need persist and recover? If it's not needed, we don't need to add them since currently most of the other backends do not need them either.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Autofix Details
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Store and secondary lookup race causes permanent
_total_bytesovercount- I added a post-await key recheck before mutating
_memory_objectsand_total_bytesin store, preventing duplicate byte accounting when secondary lookup races in, and added a regression test for this interleaving.
- I added a post-await key recheck before mutating
Or push these changes by commenting:
@cursor push c49599c49c
Preview (c49599c49c)
diff --git a/lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py b/lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py
--- a/lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py
+++ b/lmcache/v1/distributed/l2_adapters/nixl_store_dynamic_l2_adapter.py
@@ -571,9 +571,12 @@
pin_count=1,
)
with self._lock:
- self._memory_objects[key] = store_obj
- self._total_bytes += store_obj.size
- store_obj.decrease_pin_count()
+ # Re-check in case lookup/recovery populated this key
+ # while we awaited dynamic_store_file.
+ if key not in self._memory_objects:
+ self._memory_objects[key] = store_obj
+ self._total_bytes += store_obj.size
+ store_obj.decrease_pin_count()
stored_keys.append(key)
except Exception:
diff --git a/tests/v1/distributed/test_nixl_store_dynamic_l2_adapter.py b/tests/v1/distributed/test_nixl_store_dynamic_l2_adapter.py
--- a/tests/v1/distributed/test_nixl_store_dynamic_l2_adapter.py
+++ b/tests/v1/distributed/test_nixl_store_dynamic_l2_adapter.py
@@ -7,6 +7,7 @@
"""
# Standard
+import asyncio
import os
import select
import shutil
@@ -561,6 +562,47 @@
assert usage_after < usage_before
+ def test_store_secondary_lookup_race_does_not_overcount_usage(
+ self, adapter_with_persist
+ ):
+ """Store+recover race should not leave stale bytes after delete."""
+ adpt, buf, _, _, _ = adapter_with_persist
+ key = create_object_key(1001)
+ obj = create_memory_obj(buf, page_index=0)
+ lookup_tasks: list[int] = []
+
+ async def _dynamic_store_with_lookup(mem_indices, file_path, page_size):
+ file_size = len(mem_indices) * page_size
+ with open(file_path, "wb") as f:
+ f.truncate(file_size)
+ lookup_tasks.append(adpt.submit_lookup_and_lock_task([key]))
+ await asyncio.sleep(0)
+
+ try:
+ adpt.nixl_agent.dynamic_store_file = _dynamic_store_with_lookup
+ adpt.submit_store_task([key], [obj])
+ wait_for_event_fd(adpt.get_store_event_fd())
+ wait_for_event_fd(adpt.get_lookup_and_lock_event_fd())
+ adpt.pop_completed_store_tasks()
+
+ bitmap = adpt.query_lookup_and_lock_result(lookup_tasks[0])
+ assert bitmap is not None
+ assert bitmap.test(0)
+
+ adpt.submit_unlock([key])
+ sync_task = adpt.submit_lookup_and_lock_task([])
+ wait_for_event_fd(adpt.get_lookup_and_lock_event_fd())
+ adpt.query_lookup_and_lock_result(sync_task)
+
+ usage_before_delete, _ = adpt.get_usage()
+ adpt.delete([key])
+ usage_after_delete, _ = adpt.get_usage()
+
+ assert usage_before_delete > 0.0
+ assert usage_after_delete == 0.0
+ finally:
+ adpt.close()
+
def test_store_rejected_when_capacity_exceeded(self):
"""Store should stop when max capacity is reached."""
tmp_dir = tempfile.mkdtemp(prefix="nixl_dyn_cap_test_")This Bugbot Autofix run was free. To enable autofix for future PRs, go to the Cursor dashboard.
ApostaC
left a comment
There was a problem hiding this comment.
IIUC, here's the current semantics of persist_enabled and recover_enabled
| True | False | |
|---|---|---|
| persist_enabled | No special thing will happen | Will delete all objects from L2 when shutting down |
| recover_enabled | Try read the object from L2 via filename if it's not seen before | Skip loading the object if it's not seen (i.e., being stored locally) before |
I think we actually want the behavior of persist_enabled = True and recover_enabled = True. In this case, we can just drop the persist/recover-related configs and simplify the code/design doc a bit more.
Are there any cases we want to delete the L2 objects after shutdown, or don't look at L2 physical files when lookup?
ApostaC
left a comment
There was a problem hiding this comment.
I think we can keep the current way of config. But here are two problems identified by claude code
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 70144aa. Configure here.
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>


What this PR does / why we need it:
Special notes for your reviewers:
If applicable:
Note
Medium Risk
Adds a new storage adapter and changes L2 config parsing to always include persist settings; persistence and secondary lookup touch on-disk state and capacity accounting, which could affect eviction and restart behavior if incorrect.
Overview
Adds a new
nixl_store_dynamicNIXL L2 adapter that stores each key into a deterministic per-key file (opened/registered per operation) and supports persist/recover by retaining files on shutdown and performing secondary on-disk lookup on cache misses.Introduces
PersistConfig(parsed from JSONpersist_enabled, defaulttrue) intoL2AdapterConfigBase, and updates docs to describe persist semantics and the new adapter’s capacity model (max_capacity_gb) and eviction behavior. Also hardens the existing staticnixl_storestore path to skip already-present keys to avoid leaking pool slots, and adds comprehensive unit tests for the dynamic adapter (store/lookup/load, delete, capacity, and persistence).Reviewed by Cursor Bugbot for commit 2a2bbec. Bugbot is set up for automated code reviews on this repo. Configure here.