
[Feat][DAX] Optimize staged batched restore path and document modification#2904

Merged
ApostaC merged 12 commits into LMCache:dev from DongDongJu:feat/devdax_opt on Apr 15, 2026

Conversation

@DongDongJu (Collaborator) commented Mar 30, 2026

This PR optimizes the DAX backend's staged restore path.
The next step is to support higher TP/DP/EP and an MP mode like the raw block backend.

Before this change, DAX retrieval restored cached chunks through serialized per-chunk restaging, which became the main bottleneck on long-context cache-hit workloads. This PR keeps the existing staged retrieve model (DAX -> CPU -> normal GPU connector) but makes the DAX restore path batched and reusable.

Main changes:

  • add a shared batched restore pipeline for blocking and async-prefetch DAX retrieval
  • group restore output allocation from LocalCPUBackend
  • coalesce adjacent DAX spans and execute restore work in regions/waves
  • add a reusable pinned retrieve staging slab and persistent restore executors
  • add a native batched_memcpy helper with non-CUDA fallback
  • document the new DAX restore tuning knobs
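
The non-CUDA fallback mentioned above can be pictured as a plain per-copy loop. This is an illustrative sketch under assumed names, not the repository's actual helper; the real native version runs the equivalent loop in C++ with the GIL released:

```python
import ctypes

def batched_memcpy_fallback(src_ptrs, dst_ptrs, sizes):
    """Copy each (src, dst, size) triple with ctypes.memmove.

    Illustrative stand-in for the non-CUDA fallback: one batched
    call replaces many separate per-chunk Python-level copies.
    """
    if not (len(src_ptrs) == len(dst_ptrs) == len(sizes)):
        raise ValueError("src_ptrs, dst_ptrs and sizes must match in length")
    for src, dst, size in zip(src_ptrs, dst_ptrs, sizes):
        ctypes.memmove(dst, src, size)
```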

In local long_doc_qa validation on Qwen/Qwen3-14B, the optimized staged DAX path reduced mean query TTFT by about
61.6% and mean query-round time by about 33.3% versus the pre-optimization DAX backend.

Special notes for your reviewers:

  • Scope is intentionally limited to the DAX staged retrieve path and the small native copy helper it depends on.
  • This PR does not change the basic store path; stores are still CPU-staged before being written to DAX.
  • Sync and async-prefetch retrieval semantics are preserved; the backend only replaces serialized per-chunk restore
    with batched restore.
  • New DAX knobs documented in this PR:
    • dax.restore_workers
    • dax.restore_max_regions
    • dax.retrieve_staging_slab_bytes
  • Local validation beyond unit tests:
    • long_doc_qa comparison vs pre-optimization DAX: TTFT -61.6%, query-round -33.3%
    • reversed-order control (baseline -> optimized): TTFT -58.4%, query-round -32.5%
    • longer-running control with query round > 1 minute: TTFT -62.0%, query-round -36.0%
Sample test configs:

  chunk_size: 256
  local_cpu: false
  max_local_cpu_size: 16
  cache_policy: LRU
  min_retrieve_tokens: 0
  storage_plugins: ["dax"]
  store_location: dax
  retrieve_locations: ["dax"]
  extra_config:
    storage_plugin.dax.module_path: lmcache.v1.storage_backend.plugins.dax_backend
    storage_plugin.dax.class_name: DaxBackend
    dax.device_path: "/dev/daxxxx"
    dax.max_dax_size: 64
    dax.async_put: false
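
One possible way to set the knobs documented in this PR in the same extra_config block; the values are placeholders for illustration, and the per-knob comments are inferred from the descriptions in this PR rather than taken from the docs:

```yaml
  extra_config:
    dax.restore_workers: 4                       # size of the persistent restore executor pool
    dax.restore_max_regions: 8                   # cap on coalesced regions per restore wave
    dax.retrieve_staging_slab_bytes: 268435456   # 256 MiB pinned retrieve staging slab
```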

  vllm serve Qwen/Qwen3-14B \
    --host 0.0.0.0 \
    --port 8000 \
    --reasoning-parser qwen3 \
    --attention-backend FLASH_ATTN \
    --gpu-memory-utilization 0.5 \
    --max-model-len 32768 \
    --no-enable-prefix-caching \
    --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'

  python benchmarks/long_doc_qa/long_doc_qa.py \
    --host 127.0.0.1 \
    --port 8000 \
    --model Qwen/Qwen3-14B \
    --document-length 10000 \
    --num-documents 30 \
    --output-len 200 \
    --repeat-count 2 \
    --repeat-mode tile \
    --max-inflight-requests 5 \
    --sleep-time-after-warmup 0 \
    --trim-fraction 0.0 \
    --json-output

If applicable:

  • this PR contains user-facing changes (docs added)
  • this PR contains unit tests

Note

Medium Risk
Reworks the DAX retrieval path to use new threaded batched restore logic and a shared pinned staging slab, which can affect concurrency, lifecycle/cleanup, and memory-safety if mis-tuned or if edge cases slip through.

Overview
DAX retrieval is refactored from per-key restores into a staged batched restore pipeline. Both blocking and async-prefetch reads now go through a shared dispatcher that reserves readable entries, batches CPU output allocation, coalesces adjacent DAX spans into region/wave copy plans, and restores via persistent thread pools while preserving existing semantics (blocking keeps positional None holes; async returns only the consecutive hit prefix).
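
The span-coalescing step can be sketched as follows. This is an illustrative simplification; the function name and the flat (offset, length) representation are assumptions, not the backend's actual code:

```python
def coalesce_adjacent_spans(spans):
    """Merge byte spans that sit back-to-back on the DAX device.

    spans: (offset, length) pairs sorted by offset. Returns merged
    (offset, length) regions, so each region needs one large copy
    instead of many small per-chunk copies.
    """
    regions = []
    for offset, length in spans:
        if regions:
            prev_off, prev_len = regions[-1]
            if prev_off + prev_len == offset:
                # Contiguous with the previous region: extend it.
                regions[-1] = (prev_off, prev_len + length)
                continue
        regions.append((offset, length))
    return regions
```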

Adds new DAX tuning knobs (dax.restore_workers, dax.restore_max_regions, dax.retrieve_staging_slab_bytes) and manages a backend-owned pinned staging slab with explicit shutdown/cleanup on close() and init failures.

Introduces a native batched_memcpy (csrc/mem_alloc.* + pybind) with a Python non-CUDA fallback, and expands DAX unit tests to cover batched restore behavior, locking/close interactions, and the memcpy helper.

Reviewed by Cursor Bugbot for commit 7c5c53a.

Add a native batched memcpy helper and switch the DAX backend batched retrieve path to a persistent staged restore pipeline with region workers, a reusable staging slab, and broader correctness coverage.

Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
@gemini-code-assist (Bot) left a comment

Code Review

This pull request implements a staged batched restore path for the DAX storage backend to optimize retrieval throughput. Key changes include the addition of a batched_memcpy utility (with both C++ and non-CUDA Python implementations), new configuration parameters for tuning restore workers and staging slab sizes, and a redesigned retrieval flow in DaxBackend that uses parallel executors and a pinned staging slab. Feedback covers a missing docstring on the new public method batched_get_blocking and ensuring the correct memory format is used during batched allocation.

Comment thread lmcache/v1/storage_backend/plugins/dax_backend.py Outdated
Comment thread csrc/mem_alloc.cpp
}
}

void batched_memcpy(const std::vector<uintptr_t>& src_ptrs,
Contributor:
is the main benefit just avoiding GIL?

Collaborator Author:

Yes, GIL avoidance is one reason. It helps, but it's only part of the story.
The bigger gain is that the restore path is now batched end-to-end.
batched_memcpy mainly removes Python per-copy overhead on that path and lets the copy loop run in native code with the GIL released, but by itself it would not explain all of the TTFT improvement.

@sammshen (Contributor) left a comment

LGTM!

@jayhpark530 (Contributor):

Nice optimization! Looking forward to MP mode support as well.
How does local_cpu: false work in the sample test config?

@DongDongJu (Collaborator, Author):

> Nice optimization! Looking forward to MP mode support as well. How does local_cpu: false work in the sample test config?

The DAX backend still needs LocalCPUBackend as its allocator backend for restore buffers/staging outputs.

So in this config, max_local_cpu_size: 16 is still providing the CPU-side allocation budget used by DAX restore, but retrieved chunks are not being admitted into the local CPU hot cache.

So the sample path is effectively:

DAX -> CPU restore buffers from LocalCPUBackend -> normal GPU connector

rather than

DAX -> keep result in local CPU hot cache.

Comment thread lmcache/v1/storage_backend/plugins/dax_backend.py
Comment thread lmcache/v1/storage_backend/plugins/dax_backend.py
@DongDongJu DongDongJu requested review from ApostaC and maobaolong April 5, 2026 23:46
@ApostaC (Contributor) left a comment

Otherwise LGTM!

Comment thread lmcache/v1/storage_backend/plugins/dax_backend.py Outdated
Comment on lines +782 to +785
def _release_restore_resources(
self,
restore_slab_ptr: Optional[int] = None,
) -> None:
Contributor:

nit: for the newly introduced long helper functions, it would be better to add docstrings describing the args, expected return values, and expected behavior. This will help other maintainers (and AI agents) better understand the code.

Collaborator Author:

addressed here 92b7fa6

Comment thread lmcache/v1/storage_backend/plugins/dax_backend.py Outdated
Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: DongDongJu <commisori28@gmail.com>
@ApostaC (Contributor) left a comment

LGTM!

@ApostaC ApostaC enabled auto-merge (squash) April 14, 2026 20:11
@github-actions github-actions Bot added the full Run comprehensive tests on this PR label Apr 14, 2026
@cursor (Bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


if region.items
]
for future in futures:
future.result()

In-flight restore workers write to freed output buffers

Medium Severity

_run_restore_waves iterates futures sequentially via future.result(). If an earlier future raises, the loop exits immediately without waiting for remaining in-flight futures from the same wave. The caller's except block then calls _cleanup_restore_outputs, which frees output buffers that still-running region workers may be actively writing to via _batched_memcpy. Since the native batched_memcpy releases the GIL, this is a genuine use-after-free race, not protected by CPython's GIL.
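
One way to close the race described above, sketched here under assumptions (run_wave and copy_region are illustrative names standing in for the wave dispatcher and the per-region worker body, not the backend's actual code), is to drain every future in the wave before surfacing the first error:

```python
from concurrent.futures import wait

def run_wave(executor, copy_region, regions):
    """Submit one wave of region copies and block until every
    future has finished, even when one fails, so no worker can
    still be writing when the caller frees output buffers."""
    futures = [executor.submit(copy_region, region) for region in regions]
    wait(futures)  # drain the whole wave before inspecting any result
    for future in futures:
        exc = future.exception()
        if exc is not None:
            raise exc  # surface the first failure only after all copies stopped
```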



@ApostaC ApostaC merged commit 2cedbea into LMCache:dev Apr 15, 2026
31 of 33 checks passed
ftian1 pushed a commit to ftian1/LMCache that referenced this pull request Apr 20, 2026
…ation (LMCache#2904)

* Add a native batched memcpy helper and switch the DAX backend batched retrieve path to a persistent staged restore pipeline with region workers, a reusable staging slab, and broader correctness coverage.

* docs: add DAX batched restore usage notes

* [DAX] Remove dead restore cleanup parameter

* [DAX] Share restore dispatch path for blocking get

* [DAX] Document restore helper behavior

Signed-off-by: DongDongJu <commisori28@gmail.com>

Labels

full Run comprehensive tests on this PR


4 participants