[Feat][DAX] Optimize staged batched restore path and document modification #2904
ApostaC merged 12 commits into LMCache:dev
Conversation
Add a native batched memcpy helper and switch the DAX backend batched retrieve path to a persistent staged restore pipeline with region workers, a reusable staging slab, and broader correctness coverage.

Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Code Review
This pull request implements a staged batched restore path for the DAX storage backend to optimize retrieval throughput. Key changes include the addition of a batched_memcpy utility (with both C++ and non-CUDA Python implementations), new configuration parameters for tuning restore workers and staging slab sizes, and a redesigned retrieval flow in DaxBackend that utilizes parallel executors and a pinned staging slab. Feedback is provided regarding a missing docstring for the new public method batched_get_blocking and ensuring the correct memory format is used during batched allocation.
```cpp
  }
}

void batched_memcpy(const std::vector<uintptr_t>& src_ptrs,
```
Is the main benefit just avoiding the GIL?
Yes, GIL avoidance is one reason. That helps, but it's only part of the story.
The bigger gain is that the restore path is now batched end-to-end.
batched_memcpy mainly removes Python per-copy overhead on that path and lets the copy loop run in native code with the GIL released, but by itself it would not explain all of the TTFT improvement.
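To make the discussion concrete, here is a minimal sketch of what the Python non-CUDA fallback for such a helper could look like (the function name and signature are illustrative, not the PR's actual implementation; the native C++ version additionally releases the GIL while looping):

```python
import ctypes

def batched_memcpy_fallback(src_ptrs, dst_ptrs, sizes):
    """Hypothetical pure-Python fallback: copy each (src, dst, size)
    triple in turn with ctypes.memmove.

    The native helper does the same loop in C++ with the GIL released,
    removing the per-copy Python dispatch overhead discussed above.
    """
    for src, dst, size in zip(src_ptrs, dst_ptrs, sizes):
        ctypes.memmove(dst, src, size)
```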
Nice optimization! Looking forward to MP mode support as well.
The DAX backend still needs LocalCPUBackend as its allocator backend for restore buffers and staging outputs. So in this config, max_local_cpu_size: 16 still provides the CPU-side allocation budget used by DAX restore, but retrieved chunks are not admitted into the local CPU hot cache. The sample path is effectively DAX -> CPU restore buffers from LocalCPUBackend -> normal GPU connector, rather than DAX -> keep result in the local CPU hot cache.
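A config sketch matching this description might look as follows. This is an assumption for illustration: only max_local_cpu_size: 16 is taken from the discussion above, and whether the dax.* knobs live under extra_config is a guess, not confirmed by the PR.

```yaml
# Hypothetical sample: CPU budget for DAX restore, no CPU hot-cache admission.
local_cpu: false          # retrieved chunks not kept in the local CPU hot cache
max_local_cpu_size: 16    # CPU-side allocation budget used by DAX restore
extra_config:             # placement of these knobs is an assumption
  dax.restore_workers: 4
  dax.restore_max_regions: 64
  dax.retrieve_staging_slab_bytes: 268435456
```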
```python
def _release_restore_resources(
    self,
    restore_slab_ptr: Optional[int] = None,
) -> None:
```
nit: for these newly introduced long helper functions, it would be better to add docstrings describing the args, expected return values, and expected behavior. It will help other maintainers (and AI agents) better understand the code.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 7c5c53a.
```python
        if region.items
    ]
    for future in futures:
        future.result()
```
In-flight restore workers write to freed output buffers
Medium Severity
_run_restore_waves iterates futures sequentially via future.result(). If an earlier future raises, the loop exits immediately without waiting for remaining in-flight futures from the same wave. The caller's except block then calls _cleanup_restore_outputs, which frees output buffers that still-running region workers may be actively writing to via _batched_memcpy. Since the native batched_memcpy releases the GIL, this is a genuine use-after-free race, not protected by CPython's GIL.
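One way to avoid this class of race is to wait for every in-flight future before propagating a failure, so cleanup can never run while a worker is still writing. The following sketch uses hypothetical names and stdlib concurrent.futures; it is not the PR's code:

```python
from concurrent.futures import ThreadPoolExecutor, wait

def run_restore_waves_safely(executor, region_tasks):
    """Sketch: dispatch one wave and wait for ALL futures to finish.

    Unlike looping over future.result() (which exits on the first
    error), wait() blocks until every future completes, so a failure
    in one region cannot leave other workers still writing to output
    buffers when the caller's cleanup path frees them.
    """
    futures = [executor.submit(task) for task in region_tasks]
    done, _ = wait(futures)  # blocks until every future completes
    for f in done:
        exc = f.exception()
        if exc is not None:
            raise exc  # safe now: no worker can touch the buffers
```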
…ation (LMCache#2904)

* Add a native batched memcpy helper and switch the DAX backend batched retrieve path to a persistent staged restore pipeline with region workers, a reusable staging slab, and broader correctness coverage.
* docs: add DAX batched restore usage notes
* [DAX] Remove dead restore cleanup parameter
* [DAX] Share restore dispatch path for blocking get
* [DAX] Document restore helper behavior

Signed-off-by: DongDongJu <commisori28@gmail.com>


This PR optimizes the DAX backend's staged restore path.
The next step is to support higher TP/DP/EP and MP mode, like the raw block backend.
Before this change, DAX retrieval restored cached chunks through serialized per-chunk restaging, which became the main bottleneck on long-context cache-hit workloads. This PR keeps the existing staged retrieve model (DAX -> CPU -> normal GPU connector) but makes the DAX restore path batched and reusable.

Main changes:

* Restore buffers are still allocated from LocalCPUBackend
* batched_memcpy helper with a non-CUDA fallback

In local long_doc_qa validation on Qwen/Qwen3-14B, the optimized staged DAX path reduced mean query TTFT by about 61.6% and mean query-round time by about 33.3% versus the pre-optimization DAX backend.

Special notes for your reviewers:

New tuning knobs introduced with batched restore:

* dax.restore_workers
* dax.restore_max_regions
* dax.retrieve_staging_slab_bytes

Benchmark summary:

* long_doc_qa comparison vs pre-optimization DAX: TTFT -61.6%, query-round -33.3%
* (baseline -> optimized): TTFT -58.4%, query-round -32.5%
* TTFT -62.0%, query-round -36.0%
Note
Medium Risk
Reworks the DAX retrieval path to use new threaded batched restore logic and a shared pinned staging slab, which can affect concurrency, lifecycle/cleanup, and memory-safety if mis-tuned or if edge cases slip through.
Overview
DAX retrieval is refactored from per-key restores into a staged batched restore pipeline. Both blocking and async-prefetch reads now go through a shared dispatcher that reserves readable entries, batches CPU output allocation, coalesces adjacent DAX spans into region/wave copy plans, and restores via persistent thread pools while preserving existing semantics (blocking keeps positional None holes; async returns only the consecutive hit prefix).

Adds new DAX tuning knobs (dax.restore_workers, dax.restore_max_regions, dax.retrieve_staging_slab_bytes) and manages a backend-owned pinned staging slab with explicit shutdown/cleanup on close() and init failures.

Introduces a native batched_memcpy (csrc/mem_alloc.* + pybind) with a Python non-CUDA fallback, and expands DAX unit tests to cover batched restore behavior, locking/close interactions, and the memcpy helper.

Reviewed by Cursor Bugbot for commit 7c5c53a.
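The span-coalescing step mentioned in the overview can be sketched as follows. This is an illustrative helper under assumed names, not the PR's actual code: adjacent (offset, size) reads are merged into contiguous regions so each region becomes one large copy instead of many small ones.

```python
def coalesce_spans(spans):
    """Hypothetical sketch: merge adjacent (offset, size) DAX reads
    into contiguous regions for batched copying.

    Spans are sorted by offset; a span that starts exactly where the
    previous region ends extends that region, otherwise it starts a
    new one.
    """
    regions = []
    for off, size in sorted(spans):
        if regions and regions[-1][0] + regions[-1][1] == off:
            regions[-1][1] += size          # extend the current region
        else:
            regions.append([off, size])     # start a new region
    return [tuple(r) for r in regions]
```

In the real pipeline such regions would then be capped (e.g. by a knob like dax.restore_max_regions) and dispatched to restore workers in waves.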