
[Feat][RawBlock] Add TP>1 support and compact batched retrieval path #2948

Merged
DongDongJu merged 12 commits into LMCache:dev from DongDongJu:dongjoo/rawblock-tp4-compact
Apr 11, 2026

Conversation

@DongDongJu (Collaborator) commented Apr 3, 2026

What this PR does / why we need it:

  • Add RustRawBlockBackend support for TP > 1 by allowing explicit per-TP raw block device mapping via extra_config["rust_raw_block.per_tp_device_paths"].
  • Fail fast if multiple TP ranks are configured to use the same raw block device path.
  • Warn when restored on-device metadata appears to belong to a different TP worker.
  • Add a compact backend-native batched retrieval path so get_blocking, batched_get_blocking, and batched_get_non_blocking share the same raw-block prefix-read behavior.
  • Include a TP=4 unit test (does not require GPUs).

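The per-rank device selection and duplicate-path check described above can be sketched as follows. `resolve_tp_device_path` is a hypothetical helper, not the actual LMCache code, but it mirrors the documented behavior: accept int or string YAML keys, and fail fast when two ranks share a device.

```python
# Illustrative sketch only; the real backend logic lives in
# lmcache/v1/storage_backend/plugins/rust_raw_block_backend.py.
from typing import Mapping, Union

def resolve_tp_device_path(
    per_tp_device_paths: Mapping[Union[int, str], str], tp_rank: int
) -> str:
    """Pick this rank's raw block device, accepting int or str YAML keys."""
    # Fail fast if two ranks map to one device: concurrent writers would
    # corrupt each other's on-device metadata.
    paths = [str(v) for v in per_tp_device_paths.values()]
    if len(paths) != len(set(paths)):
        raise ValueError("duplicate raw block device path across TP ranks")
    # YAML may parse keys as int (0) or str ("0"); try both forms.
    path = per_tp_device_paths.get(
        tp_rank, per_tp_device_paths.get(str(tp_rank))
    )
    if path is None:
        raise ValueError(f"no raw block device configured for TP rank {tp_rank}")
    return path
```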
Special notes for your reviewers:

  • This PR is mainly about higher-TP support and correctness for the raw block backend.
  • For TP > 1, each TP worker must be mapped to a distinct raw block partition.
  • The batched retrieval cleanup is intentionally compact: one shared prefix-read path backs the blocking and async raw-block get paths.
  • Validation included TP4 runs with vLLM + LMCache + uv + aiperf, plus LMCache long_doc_qa, on /dev/nvme4n1p1-4.
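A minimal sketch of the shared prefix-read semantics the bullets above describe: read keys in order, stop at the first miss, and release any partial results when a later read fails. `read_one` and `release` are hypothetical stand-ins for the backend's raw-device read and MemoryObj refcount release.

```python
# Sketch of "stop at first miss" batched retrieval with cleanup on error;
# not the actual _batched_get_prefix implementation.
from typing import Callable, List, Optional, Sequence, TypeVar

K = TypeVar("K")
V = TypeVar("V")

def batched_get_prefix(
    keys: Sequence[K],
    read_one: Callable[[K], Optional[V]],
    release: Callable[[V], None],
) -> List[V]:
    objs: List[V] = []
    try:
        for key in keys:
            obj = read_one(key)
            if obj is None:
                break  # first miss: return the prefix read so far
            objs.append(obj)
    except Exception:
        # A later read failed: drop the partially loaded batch so no
        # allocated object leaks, then propagate the error.
        for obj in objs:
            release(obj)
        raise
    return objs
```

Both the blocking and async get paths can then delegate to this one helper, which is the compactness the PR aims for.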

Validation / test results:
TP4 aiperf fair comparison:

  • LMCache + LocalCPU: 10.82 req/s, 723.06 ms
  • LMCache + LocalDisk buffered: 8.55 req/s, 930.88 ms
  • LMCache + rust_raw_block: 7.17 req/s, 1110.23 ms
  • LMCache + LocalDisk O_DIRECT: 1.95 req/s, 4095.91 ms

TP4 long_doc_qa comparison:

  • LMCache + LocalCPU: 0.145 s, 0.256 s
  • LMCache + LocalDisk buffered: 0.350 s, 0.328 s
  • LMCache + rust_raw_block: 0.438 s, 0.343 s
  • LMCache + LocalDisk O_DIRECT: 1.933 s, 0.885 s
Correctness check:

  • Raw block vs vanilla vLLM constrained-output comparison: 6/6 exact matches on content, finish_reason, and usage

Reproduction scripts:

1. Unit tests
cd LMCache

./[your venv]/bin/pytest -q \
  tests/v1/storage_backend/test_rust_raw_block_backend.py

2. TP4 raw-block config

# /tmp/lmcache_rawblock_tp4.yaml
chunk_size: 256
local_cpu: false
max_local_cpu_size: 10.0
lmcache_instance_id: "tp4_rawblock"
storage_plugins:
  - rust_raw_block
store_location: "rust_raw_block"
retrieve_locations:
  - "rust_raw_block"
extra_config:
  storage_plugin.rust_raw_block.module_path: lmcache.v1.storage_backend.plugins.rust_raw_block_backend
  storage_plugin.rust_raw_block.class_name: RustRawBlockBackend
  rust_raw_block.per_tp_device_paths:
    "0": "/dev/nvme4n1p1"
    "1": "/dev/nvme4n1p2"
    "2": "/dev/nvme4n1p3"
    "3": "/dev/nvme4n1p4"
  rust_raw_block.block_align: 4096
  rust_raw_block.header_bytes: 4096
  rust_raw_block.meta_total_bytes: 4194304
  rust_raw_block.meta_enable_periodic: false
  rust_raw_block.use_odirect: true
  rust_raw_block.align_local_cpu_allocator: true

3. TP4 vLLM server

export PYTHONHASHSEED=0
MODEL=/path/to/Qwen2.5-14B-Instruct
for dev in /dev/nvme4n1p1 /dev/nvme4n1p2 /dev/nvme4n1p3 /dev/nvme4n1p4; do
  sudo dd if=/dev/zero of="$dev" bs=1M count=16 conv=fsync status=none
done

LMCACHE_CONFIG_FILE=/tmp/lmcache_rawblock_tp4.yaml \
./.venv-raw-block-tp4/bin/vllm serve "$MODEL" \
  --served-model-name qwen2.5-14b-local \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.70 \
  --max-model-len 8192 \
  --trust-remote-code \
  --enforce-eager \
  --no-enable-prefix-caching \
  --port 18115 \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'

4. aiperf repeated-prefix dataset + run

python3 - <<'PY'
import json
prefix = " ".join(["hi"] * 3584)
with open("/tmp/repeated_prefix_single_turn.jsonl", "w") as f:
    for i in range(64):
        req = {
            "messages": [
                {"role": "user", "content": f"{prefix} unique_suffix_{i}"}
            ]
        }
        f.write(json.dumps(req) + "\n")
PY

PYTHONHASHSEED=0 uv run --directory ../aiperf aiperf \
  profile qwen2.5-14b-local \
  --tokenizer "$MODEL" \
  --endpoint-type chat \
  --custom-dataset-type single-turn \
  --input-file /tmp/repeated_prefix_single_turn.jsonl \
  --dataset-sampling-strategy sequential \
  --request-count 64 \
  --warmup-request-count 8 \
  --concurrency 8 \
  --output-tokens-mean 32 \
  --output-tokens-stddev 0 \
  --use-legacy-max-tokens \
  --ui-type none \
  --no-gpu-telemetry \
  --url http://127.0.0.1:18115 \
  --output-artifact-dir /tmp/aiperf_rawblock_tp4

5. long_doc_qa

PYTHONHASHSEED=0 python3 benchmarks/long_doc_qa/long_doc_qa.py \
  --port 18115 \
  --model qwen2.5-14b-local \
  --document-length 6000 \
  --num-documents 12 \
  --output-len 64 \
  --repeat-count 2 \
  --repeat-mode tile \
  --max-inflight-requests 4 \
  --json-output

If applicable:

- [ ] this PR contains user-facing changes - docs added
- [x] this PR contains unit tests


<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches a storage backend’s device selection and read path; misconfiguration or subtle ordering/error-handling bugs could lead to cross-rank data corruption or retrieval failures, though guarded by validation and extensive tests.
> 
> **Overview**
> Adds **TP>1 support** to `RustRawBlockBackend` by selecting a per-rank raw block `device_path` from `extra_config["rust_raw_block.per_tp_device_paths"]`, accepting both int/string YAML keys, and **failing fast** on duplicate device paths.
> 
> Refactors retrieval to a shared `_batched_get_prefix` implementation backing `get_blocking`, `batched_get_blocking`, and new async `batched_get_non_blocking`, with consistent “stop at first miss” semantics and improved cleanup (refcount release, `_inflight_io_count` reset, LRU touch) on allocation/read errors.
> 
> On restart metadata load, emits a **warning** when the restored index appears to belong to a different TP worker/device mapping, and adds comprehensive unit coverage for TP=4 init/I/O isolation and new batched-get error cases.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit cf276b0b89b7ec7d9e82a7e574b3a3233aec280a.</sup>
<!-- /CURSOR_SUMMARY -->

Daejun7Park and others added 4 commits April 3, 2026 18:09
Add support for Tensor Parallelism (TP > 1) in RustRawBlockBackend
by allowing explicit per-TP device path configuration.

Changes:
- Remove TP > 1 restriction that blocked multi-GPU deployments
- Add support for explicit per-TP device paths via
  rust_raw_block.per_tp_device_paths configuration
- Add comprehensive TP=4 test suite to test_rust_raw_block_backend.py

Configuration example:
  extra_config:
    rust_raw_block.per_tp_device_paths:
      "0": "/dev/nvme0n1p1"
      "1": "/dev/nvme0n1p2"
      "2": "/dev/nvme0n1p3"
      "3": "/dev/nvme0n1p4"

Each TP worker now uses its own partition to avoid metadata
conflicts and data corruption. Partitions must be pre-created
on the device before use.

Tests:
- test_rust_raw_block_backend_tp4_initialization: Tests TP=4 initialization
  with per-TP device paths
- test_rust_raw_block_backend_tp4_comprehensive_io: Comprehensive TP=4 I/O
  test covering roundtrip, multiple operations, and TP isolation

Signed-off-by: Daejun Park <daejun7.park@samsung.com>
Source-Commit: d120963
Add error when a device path is mapped to multiple TP workers.

Tests:
- test_rust_raw_block_backend_tp_paths_must_be_unique: Tests TP=2 with
  the same device paths

Signed-off-by: Dongjin Kim <dongjin_.kim@samsung.com>
Source-Commit: eeba57c
Add warning when the raw block backend already has an entry of another rank.

Tests:
- test_rust_raw_block_backend_warns_on_cross_rank_metadata_load: Tests
  cross-rank device load

Signed-off-by: Dongjin Kim <dongjin_.kim@samsung.com>
Source-Commit: 2f59d3d
Use one shared raw-block prefix read path for blocking and async retrieval, and keep the added TP/raw-block tests compact by sharing the setup helper.

Signed-off-by: DongDongJu <commisori28@gmail.com>

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
@DongDongJu (Collaborator, Author) commented:

I think LocalDisk with O_DIRECT has an issue related to the path; I will fix it.

@gemini-code-assist (Contributor, Bot) left a comment

Code Review

This pull request introduces support for Tensor Parallelism (TP > 1) in the RustRawBlockBackend by implementing per-TP device path configurations and ensuring partition isolation between workers. It also adds batched retrieval capabilities through batched_get_blocking and batched_get_non_blocking methods, supported by a refactored internal prefix-based retrieval logic. Feedback was provided regarding the _batched_get_prefix implementation, highlighting potential resource leaks with _inflight_io_count and memory leaks of allocated MemoryObj instances during exceptions, as well as the need for more robust error handling during memory allocation.
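One conventional way to address the `_inflight_io_count` leak the review points out is a context manager that decrements in a `finally` block, so the counter is released even when raw device setup or a read raises. This is an illustrative pattern, not the PR's actual fix:

```python
# Sketch of leak-proof inflight-IO accounting; the field name
# _inflight_io_count comes from the review, the class is a stand-in.
import threading
from contextlib import contextmanager

class InflightCounter:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.count = 0

    @contextmanager
    def track(self, n: int = 1):
        with self._lock:
            self.count += n
        try:
            yield
        finally:
            # Runs on both success and exception, so the count
            # can never be left elevated by a failed IO.
            with self._lock:
                self.count -= n
```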

Comment thread lmcache/v1/storage_backend/plugins/rust_raw_block_backend.py
Comment thread lmcache/v1/storage_backend/plugins/rust_raw_block_backend.py Outdated
Comment thread lmcache/v1/storage_backend/plugins/rust_raw_block_backend.py
Dongjoo Seo added 2 commits April 3, 2026 20:31
Address review feedback for batched raw-block retrieval by ensuring inflight IO accounting is released when raw device setup fails, handling allocator exhaustion without asserts, and releasing allocated MemoryObj instances on read failures. Add regression tests for each case.

Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Address review feedback by documenting the raw-block batched retrieval overrides, including their prefix-stop return behavior and the raw-device initialization failure they may propagate.

Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Comment thread lmcache/v1/storage_backend/plugins/rust_raw_block_backend.py
Comment thread lmcache/v1/storage_backend/plugins/rust_raw_block_backend.py Outdated
Address review feedback by accepting integer YAML keys in rust_raw_block.per_tp_device_paths and restoring propagation of raw-device read failures while keeping the cleanup path intact. Add regression coverage for integer-key TP init and read-error propagation.

Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Comment thread lmcache/v1/storage_backend/plugins/rust_raw_block_backend.py
Address review feedback by releasing any already-loaded MemoryObj instances when a later raw-block read in the same batch fails. Add regression coverage for a two-key batch that succeeds on the first read and fails on the second while preserving error propagation.

Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
@sammshen (Contributor) left a comment

LGTM!

@ApostaC (Contributor) left a comment

Otherwise LGTM!

Comment thread lmcache/v1/storage_backend/plugins/rust_raw_block_backend.py Outdated
Address review feedback by replacing broad Any typing around rust_raw_block.per_tp_device_paths with an explicit mapping alias and validating the config value before use. This keeps the TP rank lookup logic robust for both string and integer YAML keys while making the helper signatures clearer.

Signed-off-by: DongDongJu <commisori28@gmail.com>

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
@DongDongJu DongDongJu enabled auto-merge (squash) April 10, 2026 15:49
@github-actions (Bot) added the "full" label (Run comprehensive tests on this PR) Apr 10, 2026
@cursor (Bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit f1cf012.

Comment thread lmcache/v1/storage_backend/plugins/rust_raw_block_backend.py
Dongjoo Seo and others added 2 commits April 10, 2026 20:29
Address review feedback by documenting the TP device path helper so its string-versus-integer YAML key fallback is explicit.

Signed-off-by: DongDongJu <commisori28@gmail.com>

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
@DongDongJu DongDongJu merged commit e7dfb09 into LMCache:dev Apr 11, 2026
35 checks passed
Oasis-Git pushed a commit to Oasis-Git/LMCache that referenced this pull request Apr 13, 2026
…MCache#2948)

* feat: Add TP > 1 support for RustRawBlockBackend

Add support for Tensor Parallelism (TP > 1) in RustRawBlockBackend
by allowing explicit per-TP device path configuration.

Changes:
- Remove TP > 1 restriction that blocked multi-GPU deployments
- Add support for explicit per-TP device paths via
  rust_raw_block.per_tp_device_paths configuration
- Add comprehensive TP=4 test suite to test_rust_raw_block_backend.py

Configuration example:
  extra_config:
    rust_raw_block.per_tp_device_paths:
      "0": "/dev/nvme0n1p1"
      "1": "/dev/nvme0n1p2"
      "2": "/dev/nvme0n1p3"
      "3": "/dev/nvme0n1p4"

Each TP worker now uses its own partition to avoid metadata
conflicts and data corruption. Partitions must be pre-created
on the device before use.

Tests:
- test_rust_raw_block_backend_tp4_initialization: Tests TP=4 initialization
  with per-TP device paths
- test_rust_raw_block_backend_tp4_comprehensive_io: Comprehensive TP=4 I/O
  test covering roundtrip, multiple operations, and TP isolation

Signed-off-by: Daejun Park <daejun7.park@samsung.com>
Source-Commit: d120963

* fix: validate duplicate per-TP raw block device paths

Add error when a device path is mapped to multiple TP workers.

Tests:
- test_rust_raw_block_backend_tp_paths_must_be_unique: Tests TP=2 with
  the same device paths

Signed-off-by: Dongjin Kim <dongjin_.kim@samsung.com>
Source-Commit: eeba57c

* fix: warn on cross-rank metadata load in raw block backend

Add warning when the raw block backend already has an entry of another rank.

Tests:
- test_rust_raw_block_backend_warns_on_cross_rank_metadata_load: Tests
  cross-rank device load

Signed-off-by: Dongjin Kim <dongjin_.kim@samsung.com>
Source-Commit: 2f59d3d

* perf: streamline raw block batched retrieval

Use one shared raw-block prefix read path for blocking and async retrieval, and keep the added TP/raw-block tests compact by sharing the setup helper.

Signed-off-by: DongDongJu <commisori28@gmail.com>

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>

* fix: harden raw block batched get cleanup

Address review feedback for batched raw-block retrieval by ensuring inflight IO accounting is released when raw device setup fails, handling allocator exhaustion without asserts, and releasing allocated MemoryObj instances on read failures. Add regression tests for each case.

Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>

* docs: describe raw block batched get semantics

Address review feedback by documenting the raw-block batched retrieval overrides, including their prefix-stop return behavior and the raw-device initialization failure they may propagate.

Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>

* fix: preserve raw block read errors and TP key lookup

Address review feedback by accepting integer YAML keys in rust_raw_block.per_tp_device_paths and restoring propagation of raw-device read failures while keeping the cleanup path intact. Add regression coverage for integer-key TP init and read-error propagation.

Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>

* fix: release partial raw block batch on read error

Address review feedback by releasing any already-loaded MemoryObj instances when a later raw-block read in the same batch fails. Add regression coverage for a two-key batch that succeeds on the first read and fails on the second while preserving error propagation.

Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>

* refactor: tighten raw block TP device path typing

Address review feedback by replacing broad Any typing around rust_raw_block.per_tp_device_paths with an explicit mapping alias and validating the config value before use. This keeps the TP rank lookup logic robust for both string and integer YAML keys while making the helper signatures clearer.

Signed-off-by: DongDongJu <commisori28@gmail.com>

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>

---------

Signed-off-by: Daejun Park <daejun7.park@samsung.com>
Signed-off-by: Dongjin Kim <dongjin_.kim@samsung.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Signed-off-by: DongDongJu <commisori28@gmail.com>
Co-authored-by: Daejun Park <daejun7.park@samsung.com>
Co-authored-by: Dongjin Kim <dongjin_.kim@samsung.com>
Co-authored-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
ftian1 pushed a commit to ftian1/LMCache that referenced this pull request Apr 20, 2026
…MCache#2948)


Labels

full: Run comprehensive tests on this PR


5 participants