
[Feat][RawBlock] Add TP>1 support and compact batched retrieval path #2948

Merged
DongDongJu merged 12 commits into LMCache:dev from DongDongJu:dongjoo/rawblock-tp4-compact
Apr 11, 2026

Conversation

@DongDongJu (Collaborator) commented Apr 3, 2026

What this PR does / why we need it:

  • Add RustRawBlockBackend support for TP > 1 by allowing explicit per-TP raw block device mapping via extra_config["rust_raw_block.per_tp_device_paths"].
  • Fail fast if multiple TP ranks are configured to use the same raw block device path.
  • Warn when restored on-device metadata appears to belong to a different TP worker.
  • Add a compact backend-native batched retrieval path so get_blocking, batched_get_blocking, and batched_get_non_blocking share the same raw-block prefix-read behavior.
  • Include a TP=4 unit test (does not require GPUs).

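The per-rank device selection and duplicate-path check described above can be sketched as follows. `resolve_tp_device_path` is a hypothetical helper, not the actual LMCache code, but it mirrors the documented behavior: accept int or string YAML keys, and fail fast when two ranks share a device.

```python
# Illustrative sketch only; the real backend logic lives in
# lmcache/v1/storage_backend/plugins/rust_raw_block_backend.py.
from typing import Mapping, Union

def resolve_tp_device_path(
    per_tp_device_paths: Mapping[Union[int, str], str], tp_rank: int
) -> str:
    """Pick this rank's raw block device, accepting int or str YAML keys."""
    # Fail fast if two ranks map to one device: concurrent writers would
    # corrupt each other's on-device metadata.
    paths = [str(v) for v in per_tp_device_paths.values()]
    if len(paths) != len(set(paths)):
        raise ValueError("duplicate raw block device path across TP ranks")
    # YAML may parse keys as int (0) or str ("0"); try both forms.
    path = per_tp_device_paths.get(
        tp_rank, per_tp_device_paths.get(str(tp_rank))
    )
    if path is None:
        raise ValueError(f"no raw block device configured for TP rank {tp_rank}")
    return path
```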
Special notes for your reviewers:

  • This PR is mainly about higher-TP support and correctness for the raw block backend.
  • For TP > 1, each TP worker must be mapped to a distinct raw block partition.
  • The batched retrieval cleanup is intentionally compact: one shared prefix-read path backs the blocking and async raw-block get paths.
  • Validation included TP4 runs with vLLM + LMCache + uv + aiperf, plus LMCache long_doc_qa, on /dev/nvme4n1p1-4.
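A minimal sketch of the shared prefix-read semantics the bullets above describe: read keys in order, stop at the first miss, and release any partial results when a later read fails. `read_one` and `release` are hypothetical stand-ins for the backend's raw-device read and MemoryObj refcount release.

```python
# Sketch of "stop at first miss" batched retrieval with cleanup on error;
# not the actual _batched_get_prefix implementation.
from typing import Callable, List, Optional, Sequence, TypeVar

K = TypeVar("K")
V = TypeVar("V")

def batched_get_prefix(
    keys: Sequence[K],
    read_one: Callable[[K], Optional[V]],
    release: Callable[[V], None],
) -> List[V]:
    objs: List[V] = []
    try:
        for key in keys:
            obj = read_one(key)
            if obj is None:
                break  # first miss: return the prefix read so far
            objs.append(obj)
    except Exception:
        # A later read failed: drop the partially loaded batch so no
        # allocated object leaks, then propagate the error.
        for obj in objs:
            release(obj)
        raise
    return objs
```

Both the blocking and async get paths can then delegate to this one helper, which is the compactness the PR aims for.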

Validation / test results:
TP4 aiperf fair comparison:

  • LMCache + LocalCPU: 10.82 req/s, 723.06 ms
  • LMCache + LocalDisk buffered: 8.55 req/s, 930.88 ms
  • LMCache + rust_raw_block: 7.17 req/s, 1110.23 ms
  • LMCache + LocalDisk O_DIRECT: 1.95 req/s, 4095.91 ms

TP4 long_doc_qa comparison:

  • LMCache + LocalCPU: 0.145 s, 0.256 s
  • LMCache + LocalDisk buffered: 0.350 s, 0.328 s
  • LMCache + rust_raw_block: 0.438 s, 0.343 s
  • LMCache + LocalDisk O_DIRECT: 1.933 s, 0.885 s
Correctness check:

  • Raw block vs vanilla vLLM constrained-output comparison: 6/6 exact matches on content, finish_reason, and usage

Reproduction scripts:

1. Unit tests
cd LMCache

./[your venv]/bin/pytest -q \
  tests/v1/storage_backend/test_rust_raw_block_backend.py

2. TP4 raw-block config

# /tmp/lmcache_rawblock_tp4.yaml
chunk_size: 256
local_cpu: false
max_local_cpu_size: 10.0
lmcache_instance_id: "tp4_rawblock"
storage_plugins:
  - rust_raw_block
store_location: "rust_raw_block"
retrieve_locations:
  - "rust_raw_block"
extra_config:
  storage_plugin.rust_raw_block.module_path: lmcache.v1.storage_backend.plugins.rust_raw_block_backend
  storage_plugin.rust_raw_block.class_name: RustRawBlockBackend
  rust_raw_block.per_tp_device_paths:
    "0": "/dev/nvme4n1p1"
    "1": "/dev/nvme4n1p2"
    "2": "/dev/nvme4n1p3"
    "3": "/dev/nvme4n1p4"
  rust_raw_block.block_align: 4096
  rust_raw_block.header_bytes: 4096
  rust_raw_block.meta_total_bytes: 4194304
  rust_raw_block.meta_enable_periodic: false
  rust_raw_block.use_odirect: true
  rust_raw_block.align_local_cpu_allocator: true

3. TP4 vLLM server

export PYTHONHASHSEED=0
MODEL=/path/to/Qwen2.5-14B-Instruct
for dev in /dev/nvme4n1p1 /dev/nvme4n1p2 /dev/nvme4n1p3 /dev/nvme4n1p4; do
  sudo dd if=/dev/zero of="$dev" bs=1M count=16 conv=fsync status=none
done

LMCACHE_CONFIG_FILE=/tmp/lmcache_rawblock_tp4.yaml \
./.venv-raw-block-tp4/bin/vllm serve "$MODEL" \
  --served-model-name qwen2.5-14b-local \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.70 \
  --max-model-len 8192 \
  --trust-remote-code \
  --enforce-eager \
  --no-enable-prefix-caching \
  --port 18115 \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'

4. aiperf repeated-prefix dataset + run

python3 - <<'PY'
import json
prefix = " ".join(["hi"] * 3584)
with open("/tmp/repeated_prefix_single_turn.jsonl", "w") as f:
    for i in range(64):
        req = {
            "messages": [
                {"role": "user", "content": f"{prefix} unique_suffix_{i}"}
            ]
        }
        f.write(json.dumps(req) + "\n")
PY

PYTHONHASHSEED=0 uv run --directory ../aiperf aiperf \
  profile qwen2.5-14b-local \
  --tokenizer "$MODEL" \
  --endpoint-type chat \
  --custom-dataset-type single-turn \
  --input-file /tmp/repeated_prefix_single_turn.jsonl \
  --dataset-sampling-strategy sequential \
  --request-count 64 \
  --warmup-request-count 8 \
  --concurrency 8 \
  --output-tokens-mean 32 \
  --output-tokens-stddev 0 \
  --use-legacy-max-tokens \
  --ui-type none \
  --no-gpu-telemetry \
  --url http://127.0.0.1:18115 \
  --output-artifact-dir /tmp/aiperf_rawblock_tp4

5. long_doc_qa

PYTHONHASHSEED=0 python3 benchmarks/long_doc_qa/long_doc_qa.py \
  --port 18115 \
  --model qwen2.5-14b-local \
  --document-length 6000 \
  --num-documents 12 \
  --output-len 64 \
  --repeat-count 2 \
  --repeat-mode tile \
  --max-inflight-requests 4 \
  --json-output

If applicable:

- [ ] this PR contains user-facing changes - docs added
- [x] this PR contains unit tests


<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches a storage backend’s device selection and read path; misconfiguration or subtle ordering/error-handling bugs could lead to cross-rank data corruption or retrieval failures, though guarded by validation and extensive tests.
> 
> **Overview**
> Adds **TP>1 support** to `RustRawBlockBackend` by selecting a per-rank raw block `device_path` from `extra_config["rust_raw_block.per_tp_device_paths"]`, accepting both int/string YAML keys, and **failing fast** on duplicate device paths.
> 
> Refactors retrieval to a shared `_batched_get_prefix` implementation backing `get_blocking`, `batched_get_blocking`, and new async `batched_get_non_blocking`, with consistent “stop at first miss” semantics and improved cleanup (refcount release, `_inflight_io_count` reset, LRU touch) on allocation/read errors.
> 
> On restart metadata load, emits a **warning** when the restored index appears to belong to a different TP worker/device mapping, and adds comprehensive unit coverage for TP=4 init/I/O isolation and new batched-get error cases.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit cf276b0b89b7ec7d9e82a7e574b3a3233aec280a.</sup>
<!-- /CURSOR_SUMMARY -->

Daejun7Park and others added 4 commits April 3, 2026 18:09
Add support for Tensor Parallelism (TP > 1) in RustRawBlockBackend
by allowing explicit per-TP device path configuration.

Changes:
- Remove TP > 1 restriction that blocked multi-GPU deployments
- Add support for explicit per-TP device paths via
  rust_raw_block.per_tp_device_paths configuration
- Add comprehensive TP=4 test suite to test_rust_raw_block_backend.py

Configuration example:
  extra_config:
    rust_raw_block.per_tp_device_paths:
      "0": "/dev/nvme0n1p1"
      "1": "/dev/nvme0n1p2"
      "2": "/dev/nvme0n1p3"
      "3": "/dev/nvme0n1p4"

Each TP worker now uses its own partition to avoid metadata
conflicts and data corruption. Partitions must be pre-created
on the device before use.

Tests:
- test_rust_raw_block_backend_tp4_initialization: Tests TP=4 initialization
  with per-TP device paths
- test_rust_raw_block_backend_tp4_comprehensive_io: Comprehensive TP=4 I/O
  test covering roundtrip, multiple operations, and TP isolation

Signed-off-by: Daejun Park <daejun7.park@samsung.com>
Source-Commit: d120963
Add error when a device path is mapped to multiple TP workers.

Tests:
- test_rust_raw_block_backend_tp_paths_must_be_unique: Tests TP=2 with
  the same device paths

Signed-off-by: Dongjin Kim <dongjin_.kim@samsung.com>
Source-Commit: eeba57c
Add warning when the raw block backend already has an entry of another rank.

Tests:
- test_rust_raw_block_backend_warns_on_cross_rank_metadata_load: Tests
  cross-rank device load

Signed-off-by: Dongjin Kim <dongjin_.kim@samsung.com>
Source-Commit: 2f59d3d
Use one shared raw-block prefix read path for blocking and async retrieval, and keep the added TP/raw-block tests compact by sharing the setup helper.

Signed-off-by: DongDongJu <commisori28@gmail.com>

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
@DongDongJu (Collaborator, Author) commented:

I think LocalDisk with O_DIRECT has an issue related to the path; I will fix it.

@gemini-code-assist (Contributor, Bot) left a comment

Code Review

This pull request introduces support for Tensor Parallelism (TP > 1) in the RustRawBlockBackend by implementing per-TP device path configurations and ensuring partition isolation between workers. It also adds batched retrieval capabilities through batched_get_blocking and batched_get_non_blocking methods, supported by a refactored internal prefix-based retrieval logic. Feedback was provided regarding the _batched_get_prefix implementation, highlighting potential resource leaks with _inflight_io_count and memory leaks of allocated MemoryObj instances during exceptions, as well as the need for more robust error handling during memory allocation.
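One conventional way to address the `_inflight_io_count` leak the review points out is a context manager that decrements in a `finally` block, so the counter is released even when raw device setup or a read raises. This is an illustrative pattern, not the PR's actual fix:

```python
# Sketch of leak-proof inflight-IO accounting; the field name
# _inflight_io_count comes from the review, the class is a stand-in.
import threading
from contextlib import contextmanager

class InflightCounter:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.count = 0

    @contextmanager
    def track(self, n: int = 1):
        with self._lock:
            self.count += n
        try:
            yield
        finally:
            # Runs on both success and exception, so the count
            # can never be left elevated by a failed IO.
            with self._lock:
                self.count -= n
```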

Comment thread lmcache/v1/storage_backend/plugins/rust_raw_block_backend.py
Comment thread lmcache/v1/storage_backend/plugins/rust_raw_block_backend.py Outdated
Comment thread lmcache/v1/storage_backend/plugins/rust_raw_block_backend.py
Dongjoo Seo added 2 commits April 3, 2026 20:31
Address review feedback for batched raw-block retrieval by ensuring inflight IO accounting is released when raw device setup fails, handling allocator exhaustion without asserts, and releasing allocated MemoryObj instances on read failures. Add regression tests for each case.

Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Address review feedback by documenting the raw-block batched retrieval overrides, including their prefix-stop return behavior and the raw-device initialization failure they may propagate.

Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Comment thread lmcache/v1/storage_backend/plugins/rust_raw_block_backend.py
Comment thread lmcache/v1/storage_backend/plugins/rust_raw_block_backend.py Outdated
Address review feedback by accepting integer YAML keys in rust_raw_block.per_tp_device_paths and restoring propagation of raw-device read failures while keeping the cleanup path intact. Add regression coverage for integer-key TP init and read-error propagation.

Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Comment thread lmcache/v1/storage_backend/plugins/rust_raw_block_backend.py
Address review feedback by releasing any already-loaded MemoryObj instances when a later raw-block read in the same batch fails. Add regression coverage for a two-key batch that succeeds on the first read and fails on the second while preserving error propagation.

Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
@sammshen (Contributor) left a comment

LGTM!

@ApostaC (Contributor) left a comment

Otherwise LGTM!

Comment thread lmcache/v1/storage_backend/plugins/rust_raw_block_backend.py Outdated
Address review feedback by replacing broad Any typing around rust_raw_block.per_tp_device_paths with an explicit mapping alias and validating the config value before use. This keeps the TP rank lookup logic robust for both string and integer YAML keys while making the helper signatures clearer.

Signed-off-by: DongDongJu <commisori28@gmail.com>

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
@DongDongJu DongDongJu enabled auto-merge (squash) April 10, 2026 15:49
@github-actions (Bot) added the "full" label (Run comprehensive tests on this PR) Apr 10, 2026
@cursor (Bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit f1cf012.

Comment thread lmcache/v1/storage_backend/plugins/rust_raw_block_backend.py
Dongjoo Seo and others added 2 commits April 10, 2026 20:29
Address review feedback by documenting the TP device path helper so its string-versus-integer YAML key fallback is explicit.

Signed-off-by: DongDongJu <commisori28@gmail.com>

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
@DongDongJu DongDongJu merged commit e7dfb09 into LMCache:dev Apr 11, 2026
35 checks passed
Oasis-Git pushed a commit to Oasis-Git/LMCache that referenced this pull request Apr 13, 2026
…MCache#2948)

* feat: Add TP > 1 support for RustRawBlockBackend

Add support for Tensor Parallelism (TP > 1) in RustRawBlockBackend
by allowing explicit per-TP device path configuration.

Changes:
- Remove TP > 1 restriction that blocked multi-GPU deployments
- Add support for explicit per-TP device paths via
  rust_raw_block.per_tp_device_paths configuration
- Add comprehensive TP=4 test suite to test_rust_raw_block_backend.py

Configuration example:
  extra_config:
    rust_raw_block.per_tp_device_paths:
      "0": "/dev/nvme0n1p1"
      "1": "/dev/nvme0n1p2"
      "2": "/dev/nvme0n1p3"
      "3": "/dev/nvme0n1p4"

Each TP worker now uses its own partition to avoid metadata
conflicts and data corruption. Partitions must be pre-created
on the device before use.

Tests:
- test_rust_raw_block_backend_tp4_initialization: Tests TP=4 initialization
  with per-TP device paths
- test_rust_raw_block_backend_tp4_comprehensive_io: Comprehensive TP=4 I/O
  test covering roundtrip, multiple operations, and TP isolation

Signed-off-by: Daejun Park <daejun7.park@samsung.com>
Source-Commit: d120963

* fix: validate duplicate per-TP raw block device paths

Add error when a device path is mapped to multiple TP workers.

Tests:
- test_rust_raw_block_backend_tp_paths_must_be_unique: Tests TP=2 with
  the same device paths

Signed-off-by: Dongjin Kim <dongjin_.kim@samsung.com>
Source-Commit: eeba57c

* fix: warn on cross-rank metadata load in raw block backend

Add warning when the raw block backend already has an entry of another rank.

Tests:
- test_rust_raw_block_backend_warns_on_cross_rank_metadata_load: Tests
  cross-rank device load

Signed-off-by: Dongjin Kim <dongjin_.kim@samsung.com>
Source-Commit: 2f59d3d

* perf: streamline raw block batched retrieval

Use one shared raw-block prefix read path for blocking and async retrieval, and keep the added TP/raw-block tests compact by sharing the setup helper.

Signed-off-by: DongDongJu <commisori28@gmail.com>

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>

* fix: harden raw block batched get cleanup

Address review feedback for batched raw-block retrieval by ensuring inflight IO accounting is released when raw device setup fails, handling allocator exhaustion without asserts, and releasing allocated MemoryObj instances on read failures. Add regression tests for each case.

Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>

* docs: describe raw block batched get semantics

Address review feedback by documenting the raw-block batched retrieval overrides, including their prefix-stop return behavior and the raw-device initialization failure they may propagate.

Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>

* fix: preserve raw block read errors and TP key lookup

Address review feedback by accepting integer YAML keys in rust_raw_block.per_tp_device_paths and restoring propagation of raw-device read failures while keeping the cleanup path intact. Add regression coverage for integer-key TP init and read-error propagation.

Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>

* fix: release partial raw block batch on read error

Address review feedback by releasing any already-loaded MemoryObj instances when a later raw-block read in the same batch fails. Add regression coverage for a two-key batch that succeeds on the first read and fails on the second while preserving error propagation.

Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>

* refactor: tighten raw block TP device path typing

Address review feedback by replacing broad Any typing around rust_raw_block.per_tp_device_paths with an explicit mapping alias and validating the config value before use. This keeps the TP rank lookup logic robust for both string and integer YAML keys while making the helper signatures clearer.

Signed-off-by: DongDongJu <commisori28@gmail.com>

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>

---------

Signed-off-by: Daejun Park <daejun7.park@samsung.com>
Signed-off-by: Dongjin Kim <dongjin_.kim@samsung.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Signed-off-by: DongDongJu <commisori28@gmail.com>
Co-authored-by: Daejun Park <daejun7.park@samsung.com>
Co-authored-by: Dongjin Kim <dongjin_.kim@samsung.com>
Co-authored-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
ftian1 pushed a commit to ftian1/LMCache that referenced this pull request Apr 20, 2026
…MCache#2948)


Labels

full: Run comprehensive tests on this PR


5 participants