[Feat][DAX] Optimize staged batched restore path and document modification #2904
ApostaC merged 12 commits into LMCache:dev
Conversation
Add a native batched memcpy helper and switch the DAX backend batched retrieve path to a persistent staged restore pipeline with region workers, a reusable staging slab, and broader correctness coverage.

Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Code Review
This pull request implements a staged batched restore path for the DAX storage backend to optimize retrieval throughput. Key changes include the addition of a batched_memcpy utility (with both C++ and non-CUDA Python implementations), new configuration parameters for tuning restore workers and staging slab sizes, and a redesigned retrieval flow in DaxBackend that utilizes parallel executors and a pinned staging slab. Feedback is provided regarding a missing docstring for the new public method batched_get_blocking and ensuring the correct memory format is used during batched allocation.
```cpp
  }
}

void batched_memcpy(const std::vector<uintptr_t>& src_ptrs,
```
Is the main benefit just avoiding the GIL?
Yes, GIL avoidance is one reason. That helps, but it's only part of the story.
The bigger gain is that the restore path is now batched end-to-end.
batched_memcpy mainly removes Python per-copy overhead on that path and lets the copy loop run in native code with the GIL released, but by itself it would not explain all of the TTFT improvement.
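To make the discussion concrete, here is a minimal sketch of what the Python non-CUDA fallback for such a helper could look like (the function name and signature are illustrative, not the PR's actual implementation; the native C++ version additionally releases the GIL while looping):

```python
import ctypes

def batched_memcpy_fallback(src_ptrs, dst_ptrs, sizes):
    """Hypothetical pure-Python fallback: copy each (src, dst, size)
    triple in turn with ctypes.memmove.

    The native helper does the same loop in C++ with the GIL released,
    removing the per-copy Python dispatch overhead discussed above.
    """
    for src, dst, size in zip(src_ptrs, dst_ptrs, sizes):
        ctypes.memmove(dst, src, size)
```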
Nice optimization! Looking forward to MP mode support as well.
The DAX backend still needs LocalCPUBackend as its allocator backend for restore buffers and staging outputs. So in this config, max_local_cpu_size: 16 still provides the CPU-side allocation budget used by DAX restore, but retrieved chunks are not admitted into the local CPU hot cache. The sample path is effectively DAX -> CPU restore buffers from LocalCPUBackend -> normal GPU connector, rather than DAX -> keep result in the local CPU hot cache.
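A config sketch matching this description might look as follows. This is an assumption for illustration: only max_local_cpu_size: 16 is taken from the discussion above, and whether the dax.* knobs live under extra_config is a guess, not confirmed by the PR.

```yaml
# Hypothetical sample: CPU budget for DAX restore, no CPU hot-cache admission.
local_cpu: false          # retrieved chunks not kept in the local CPU hot cache
max_local_cpu_size: 16    # CPU-side allocation budget used by DAX restore
extra_config:             # placement of these knobs is an assumption
  dax.restore_workers: 4
  dax.restore_max_regions: 64
  dax.retrieve_staging_slab_bytes: 268435456
```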
```python
def _release_restore_resources(
    self,
    restore_slab_ptr: Optional[int] = None,
) -> None:
```
nit: for these newly introduced long helper functions, it would be better to add docstrings describing the args, expected return values, and expected behavior. It will help other maintainers (and AI agents) better understand the code.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 7c5c53a.
```python
        if region.items
    ]
    for future in futures:
        future.result()
```
In-flight restore workers write to freed output buffers
Medium Severity
_run_restore_waves iterates futures sequentially via future.result(). If an earlier future raises, the loop exits immediately without waiting for remaining in-flight futures from the same wave. The caller's except block then calls _cleanup_restore_outputs, which frees output buffers that still-running region workers may be actively writing to via _batched_memcpy. Since the native batched_memcpy releases the GIL, this is a genuine use-after-free race, not protected by CPython's GIL.
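One way to avoid this class of race is to wait for every in-flight future before propagating a failure, so cleanup can never run while a worker is still writing. The following sketch uses hypothetical names and stdlib concurrent.futures; it is not the PR's code:

```python
from concurrent.futures import ThreadPoolExecutor, wait

def run_restore_waves_safely(executor, region_tasks):
    """Sketch: dispatch one wave and wait for ALL futures to finish.

    Unlike looping over future.result() (which exits on the first
    error), wait() blocks until every future completes, so a failure
    in one region cannot leave other workers still writing to output
    buffers when the caller's cleanup path frees them.
    """
    futures = [executor.submit(task) for task in region_tasks]
    done, _ = wait(futures)  # blocks until every future completes
    for f in done:
        exc = f.exception()
        if exc is not None:
            raise exc  # safe now: no worker can touch the buffers
```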
…ation (LMCache#2904)

* Add a native batched memcpy helper and switch the DAX backend batched retrieve path to a persistent staged restore pipeline with region workers, a reusable staging slab, and broader correctness coverage.
* docs: add DAX batched restore usage notes
* [DAX] Remove dead restore cleanup parameter
* [DAX] Share restore dispatch path for blocking get
* [DAX] Document restore helper behavior

Signed-off-by: DongDongJu <commisori28@gmail.com>


This PR optimizes the DAX backend's staged restore path.
The next step is to support higher TP/DP/EP and MP mode, like the raw block backend.
Before this change, DAX retrieval restored cached chunks through serialized per-chunk restaging, which became the main bottleneck on long-context cache-hit workloads. This PR keeps the existing staged retrieve model (DAX -> CPU -> normal GPU connector) but makes the DAX restore path batched and reusable.

Main changes:

* Restore buffers are still allocated from LocalCPUBackend
* batched_memcpy helper with a non-CUDA fallback

In local long_doc_qa validation on Qwen/Qwen3-14B, the optimized staged DAX path reduced mean query TTFT by about 61.6% and mean query-round time by about 33.3% versus the pre-optimization DAX backend.

Special notes for your reviewers:

New tuning knobs introduced with batched restore:

* dax.restore_workers
* dax.restore_max_regions
* dax.retrieve_staging_slab_bytes

Benchmark summary:

* long_doc_qa comparison vs pre-optimization DAX: TTFT -61.6%, query-round -33.3%
* (baseline -> optimized): TTFT -58.4%, query-round -32.5%
* TTFT -62.0%, query-round -36.0%
Note
Medium Risk
Reworks the DAX retrieval path to use new threaded batched restore logic and a shared pinned staging slab, which can affect concurrency, lifecycle/cleanup, and memory-safety if mis-tuned or if edge cases slip through.
Overview
DAX retrieval is refactored from per-key restores into a staged batched restore pipeline. Both blocking and async-prefetch reads now go through a shared dispatcher that reserves readable entries, batches CPU output allocation, coalesces adjacent DAX spans into region/wave copy plans, and restores via persistent thread pools while preserving existing semantics (blocking keeps positional None holes; async returns only the consecutive hit prefix).

Adds new DAX tuning knobs (dax.restore_workers, dax.restore_max_regions, dax.retrieve_staging_slab_bytes) and manages a backend-owned pinned staging slab with explicit shutdown/cleanup on close() and init failures.

Introduces a native batched_memcpy (csrc/mem_alloc.* + pybind) with a Python non-CUDA fallback, and expands DAX unit tests to cover batched restore behavior, locking/close interactions, and the memcpy helper.

Reviewed by Cursor Bugbot for commit 7c5c53a.
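The span-coalescing step mentioned in the overview can be sketched as follows. This is an illustrative helper under assumed names, not the PR's actual code: adjacent (offset, size) reads are merged into contiguous regions so each region becomes one large copy instead of many small ones.

```python
def coalesce_spans(spans):
    """Hypothetical sketch: merge adjacent (offset, size) DAX reads
    into contiguous regions for batched copying.

    Spans are sorted by offset; a span that starts exactly where the
    previous region ends extends that region, otherwise it starts a
    new one.
    """
    regions = []
    for off, size in sorted(spans):
        if regions and regions[-1][0] + regions[-1][1] == off:
            regions[-1][1] += size          # extend the current region
        else:
            regions.append([off, size])     # start a new region
    return [tuple(r) for r in regions]
```

In the real pipeline such regions would then be capped (e.g. by a knob like dax.restore_max_regions) and dispatched to restore workers in waves.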