[Feat][RawBlock] Add TP>1 support and compact batched retrieval path #2948
DongDongJu merged 12 commits into LMCache:dev
Add support for Tensor Parallelism (TP > 1) in RustRawBlockBackend
by allowing explicit per-TP device path configuration.
Changes:
- Remove TP > 1 restriction that blocked multi-GPU deployments
- Add support for explicit per-TP device paths via
rust_raw_block.per_tp_device_paths configuration
- Add comprehensive TP=4 test suite to test_rust_raw_block_backend.py
Configuration example:
  extra_config:
    rust_raw_block.per_tp_device_paths:
      "0": "/dev/nvme0n1p1"
      "1": "/dev/nvme0n1p2"
      "2": "/dev/nvme0n1p3"
      "3": "/dev/nvme0n1p4"
Each TP worker now uses its own partition to avoid metadata
conflicts and data corruption. Partitions must be pre-created
on the device before use.
Tests:
- test_rust_raw_block_backend_tp4_initialization: Tests TP=4 initialization
with per-TP device paths
- test_rust_raw_block_backend_tp4_comprehensive_io: Comprehensive TP=4 I/O
test covering roundtrip, multiple operations, and TP isolation
Signed-off-by: Daejun Park <daejun7.park@samsung.com>
Source-Commit: d120963
Add error when a device path is mapped to multiple TP workers.
Tests:
- test_rust_raw_block_backend_tp_paths_must_be_unique: Tests TP=2 with the same device paths
Signed-off-by: Dongjin Kim <dongjin_.kim@samsung.com>
Source-Commit: eeba57c
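The uniqueness rule described in this commit can be illustrated with a minimal sketch. The function name `check_unique_device_paths`, the alias `PerTpDevicePaths`, and the choice of `ValueError` are assumptions made for illustration; the PR page does not show the actual helper or exception type.

```python
from collections import defaultdict
from collections.abc import Mapping
from typing import Union

# Hypothetical alias; the alias name used in the PR is not shown on this page.
PerTpDevicePaths = Mapping[Union[str, int], str]


def check_unique_device_paths(per_tp_device_paths: PerTpDevicePaths) -> None:
    """Raise ValueError if one device path is assigned to several TP ranks."""
    ranks_by_path = defaultdict(list)
    for rank, path in per_tp_device_paths.items():
        ranks_by_path[path].append(str(rank))

    # Keep only the paths that more than one rank points at.
    shared = {p: r for p, r in ranks_by_path.items() if len(r) > 1}
    if shared:
        details = "; ".join(
            f"{path} -> ranks {', '.join(ranks)}"
            for path, ranks in sorted(shared.items())
        )
        raise ValueError(f"per-TP device paths must be unique: {details}")
```

Calling it with the TP=2 case from the test above, where both ranks name the same partition, would raise; a mapping with one partition per rank passes silently.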
Add warning when the raw block backend already has an entry of another rank.
Tests:
- test_rust_raw_block_backend_warns_on_cross_rank_metadata_load: Tests cross-rank device load
Signed-off-by: Dongjin Kim <dongjin_.kim@samsung.com>
Source-Commit: 2f59d3d
Use one shared raw-block prefix read path for blocking and async retrieval, and keep the added TP/raw-block tests compact by sharing the setup helper.
Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
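A rough sketch of what "one shared prefix read path" can look like: a single synchronous helper that stops at the first miss, wrapped once for blocking callers and once for async callers. The names `read_prefix`, `blocking_get_prefix`, and `async_get_prefix`, and the `read_one` callback, are hypothetical and are not the backend's real signatures.

```python
import asyncio
from typing import Callable, List, Optional, Sequence


def read_prefix(
    keys: Sequence[str], read_one: Callable[[str], Optional[bytes]]
) -> List[bytes]:
    """Read keys in order and stop at the first miss, returning the hit prefix."""
    loaded: List[bytes] = []
    for key in keys:
        value = read_one(key)
        if value is None:  # the first miss ends the prefix
            break
        loaded.append(value)
    return loaded


def blocking_get_prefix(keys, read_one):
    # Blocking callers use the shared helper directly.
    return read_prefix(keys, read_one)


async def async_get_prefix(keys, read_one):
    # Async callers run the same helper in a worker thread (Python 3.9+).
    return await asyncio.to_thread(read_prefix, keys, read_one)
```

Because both wrappers delegate to the same helper, the prefix-stop behavior cannot drift between the blocking and async paths.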
I think localdisk with O_DIRECT has a path-related issue. I will fix it.
Code Review
This pull request introduces support for Tensor Parallelism (TP > 1) in the RustRawBlockBackend by implementing per-TP device path configurations and ensuring partition isolation between workers. It also adds batched retrieval capabilities through batched_get_blocking and batched_get_non_blocking methods, supported by a refactored internal prefix-based retrieval logic. Feedback was provided regarding the _batched_get_prefix implementation, highlighting potential resource leaks with _inflight_io_count and memory leaks of allocated MemoryObj instances during exceptions, as well as the need for more robust error handling during memory allocation.
Address review feedback for batched raw-block retrieval by ensuring inflight IO accounting is released when raw device setup fails, handling allocator exhaustion without asserts, and releasing allocated MemoryObj instances on read failures. Add regression tests for each case.
Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Address review feedback by documenting the raw-block batched retrieval overrides, including their prefix-stop return behavior and the raw-device initialization failure they may propagate.
Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Address review feedback by accepting integer YAML keys in rust_raw_block.per_tp_device_paths and restoring propagation of raw-device read failures while keeping the cleanup path intact. Add regression coverage for integer-key TP init and read-error propagation.
Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Address review feedback by releasing any already-loaded MemoryObj instances when a later raw-block read in the same batch fails. Add regression coverage for a two-key batch that succeeds on the first read and fails on the second while preserving error propagation.
Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
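The cleanup rules from these review rounds (release the inflight counter on every path, stop without asserting when the allocator is exhausted, release every buffer of the batch when a read fails, and still propagate the error) can be sketched as follows. Everything here is a stand-in under stated assumptions: `FakeMemoryObj` is not LMCache's MemoryObj, and `PrefixReader`, its callbacks, and the choice to end the prefix on exhaustion are hypothetical.

```python
import threading


class FakeMemoryObj:
    """Stand-in buffer with a release hook; not LMCache's MemoryObj."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.released = False

    def release(self):
        self.released = True


class PrefixReader:
    """Hypothetical reader; all names here are illustrative."""

    def __init__(self, open_device, allocate, read_into):
        self._open_device = open_device  # may raise if the raw device is unavailable
        self._allocate = allocate        # returns a buffer, or None when exhausted
        self._read_into = read_into      # fills the buffer, may raise on I/O error
        self._lock = threading.Lock()
        self.inflight_io_count = 0

    def batched_get_prefix(self, keys, size):
        allocated = []
        with self._lock:
            self.inflight_io_count += 1
        try:
            self._open_device()
            for key in keys:
                obj = self._allocate(size)
                if obj is None:
                    break  # assumption: exhaustion ends the prefix instead of asserting
                allocated.append(obj)
                self._read_into(key, obj)
            return allocated
        except Exception:
            # Release everything allocated in this batch, including earlier
            # successful reads, then let the original error propagate.
            for obj in allocated:
                obj.release()
            raise
        finally:
            # Runs on success and on every failure path.
            with self._lock:
                self.inflight_io_count -= 1
```

In the two-key case from the regression test description, where the first read succeeds and the second fails, both buffers end up released and the caller still sees the read error.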
Address review feedback by replacing broad Any typing around rust_raw_block.per_tp_device_paths with an explicit mapping alias and validating the config value before use. This keeps the TP rank lookup logic robust for both string and integer YAML keys while making the helper signatures clearer.
Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
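As an illustration of the typing and lookup change, the sketch below validates the config value against an explicit mapping alias and then looks up a rank under both its string and integer form. The alias `PerTpDevicePaths` and the helpers `validate_per_tp_device_paths` and `device_path_for_rank` are assumed names, and the exception types are a guess; the real helpers are not shown on this page.

```python
from collections.abc import Mapping
from typing import Optional, Union

# Hypothetical alias replacing a broad `Any` annotation.
PerTpDevicePaths = Mapping[Union[str, int], str]


def validate_per_tp_device_paths(value: object) -> Optional[PerTpDevicePaths]:
    """Check the raw config value before any rank lookup uses it."""
    if value is None:
        return None
    if not isinstance(value, Mapping):
        raise TypeError("per_tp_device_paths must be a mapping of rank to path")
    for rank, path in value.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(rank, bool) or not isinstance(rank, (str, int)):
            raise TypeError(f"invalid TP rank key: {rank!r}")
        if not isinstance(path, str) or not path:
            raise ValueError(f"invalid device path for rank {rank!r}: {path!r}")
    return value


def device_path_for_rank(paths: PerTpDevicePaths, tp_rank: int) -> str:
    """Look up a rank whether YAML produced "0" (string) or 0 (integer) keys."""
    for key in (str(tp_rank), tp_rank):
        if key in paths:
            return paths[key]
    raise KeyError(f"no device path configured for TP rank {tp_rank}")
```

With this shape, a config written with quoted keys ("0", "1") and one written with bare integer keys (0, 1) resolve to the same device path for a given rank.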
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit f1cf012.
Address review feedback by documenting the TP device path helper so its string-versus-integer YAML key fallback is explicit.
Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
…MCache#2948)
* feat: Add TP > 1 support for RustRawBlockBackend
* fix: validate duplicate per-TP raw block device paths
* fix: warn on cross-rank metadata load in raw block backend
* perf: streamline raw block batched retrieval
* fix: harden raw block batched get cleanup
* docs: describe raw block batched get semantics
* fix: preserve raw block read errors and TP key lookup
* fix: release partial raw block batch on read error
* refactor: tighten raw block TP device path typing
---------
Signed-off-by: Daejun Park <daejun7.park@samsung.com>
Signed-off-by: Dongjin Kim <dongjin_.kim@samsung.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Signed-off-by: DongDongJu <commisori28@gmail.com>
Co-authored-by: Daejun Park <daejun7.park@samsung.com>
Co-authored-by: Dongjin Kim <dongjin_.kim@samsung.com>
Co-authored-by: Dongjoo Seo <dongjoo.seo1@samsung.com>

What this PR does / why we need it:
- Adds RustRawBlockBackend support for TP > 1 by allowing explicit per-TP raw block device mapping via extra_config["rust_raw_block.per_tp_device_paths"].
- get_blocking, batched_get_blocking, and batched_get_non_blocking share the same raw-block prefix-read behavior.
Special notes for your reviewers:
- With TP > 1, each TP worker must be mapped to a distinct raw block partition.
- Validated with uv + aiperf, plus LMCache long_doc_qa, on /dev/nvme4n1p1-4.
Validation / test results:
TP4 aiperf fair comparison:
- 10.82 req/s, 723.06 ms
- 8.55 req/s, 930.88 ms
- 7.17 req/s, 1110.23 ms
- O_DIRECT: 1.95 req/s, 4095.91 ms
TP4 long_doc_qa comparison:
- 0.145 s, 0.256 s
- 0.350 s, 0.328 s
- 0.438 s, 0.343 s
- O_DIRECT: 1.933 s, 0.885 s
Correctness check:
- 6/6 exact matches on content, finish_reason, and usage
Reproduction scripts: