Skip to content

feat(disk): support multi-path local disk backend for multi-device I/O#2801

Merged
DongDongJu merged 1 commit intoLMCache:devfrom
glimchb:local-multi
Apr 8, 2026
Merged

feat(disk): support multi-path local disk backend for multi-device I/O#2801
DongDongJu merged 1 commit intoLMCache:devfrom
glimchb:local-multi

Conversation

@glimchb
Copy link
Copy Markdown
Contributor

@glimchb glimchb commented Mar 17, 2026

Allow local_disk to accept comma-separated paths (e.g. "/mnt/nvme0/,/mnt/nvme1/") so KV cache files are hash-distributed across multiple devices, increasing aggregate disk bandwidth.

  • _parse_local_disk now handles comma-separated paths with per-entry file:// prefix stripping
  • LocalDiskBackend tracks per-path filesystem block sizes for correct O_DIRECT alignment on heterogeneous devices
  • Backward-compatible .path property returns the first path
  • Added tests for multi-path init, key distribution, determinism, block size lookup, and config parser
  • Updated local_storage.rst and configurations.rst docs

What this PR does / why we need it:

Special notes for your reviewers:

If applicable:

  • this PR contains user facing changes - docs added
  • this PR contains unit tests

Note

Medium Risk
Changes disk cache initialization and path selection logic, which can affect where KV files are written/read and O_DIRECT alignment behavior. Risk is moderate due to new config surface (local_disk_path_sharding) and device-id parsing, but behavior is backward-compatible for single-path setups and covered by added tests.

Overview
Enables multi-device local disk offloading by allowing local_disk to be a comma-separated list of cache directories and adding local_disk_path_sharding (default by_gpu) to deterministically choose a per-worker path based on GPU device index.

Updates _parse_local_disk to strip optional file:// prefixes per entry and normalize multi-path values, adjusts LocalDiskBackend to create all configured directories and select the active path for I/O, and adds unit tests plus docs describing multi-path configuration and sharding behavior.

Reviewed by Cursor Bugbot for commit 3101789. Bugbot is set up for automated code reviews on this repo. Configure here.

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement to the local disk backend by enabling support for multiple storage paths. This allows users to leverage several independent disk devices, such as multiple NVMe drives, to distribute KV cache files. The primary goal is to boost aggregate disk I/O bandwidth, which is crucial for performance-sensitive applications. The changes include robust parsing of multi-path configurations, intelligent handling of varying filesystem block sizes for optimal direct I/O, and a deterministic strategy for distributing cache entries across the available paths.

Highlights

  • Multi-Path Local Disk Support: The local_disk configuration now accepts a comma-separated list of paths, enabling KV cache files to be distributed across multiple local disk devices for increased aggregate I/O bandwidth.
  • Intelligent I/O Alignment: The LocalDiskBackend now tracks filesystem block sizes independently for each configured path, ensuring correct O_DIRECT alignment even when using heterogeneous storage devices.
  • Backward Compatibility: A new .path property has been added to LocalDiskBackend which returns the first configured path, maintaining compatibility with existing code that expects a single disk path.
  • Deterministic Key Distribution: Cache keys are deterministically hashed and assigned to one of the configured disk paths, ensuring even distribution of files and I/O load across devices.
  • Enhanced Configuration Parsing: The _parse_local_disk utility has been refactored to correctly handle comma-separated paths and strip file:// prefixes from individual path entries.
  • Comprehensive Testing and Documentation: New unit tests have been added to validate multi-path initialization, key distribution, determinism, block size lookup, and configuration parsing. Documentation for local_storage.rst and configurations.rst has been updated to reflect these new capabilities.
Changelog
  • docs/source/api_reference/configurations.rst
    • Updated the description for the local_disk configuration option to include support for comma-separated paths and multi-device striping.
  • docs/source/kv_cache/storage_backends/local_storage.rst
    • Added a new section titled 'Multi-Path (Multi-Device) Disk Offloading' to explain how to configure and utilize multiple disk paths.
    • Included examples for environment variable and YAML configurations for multi-path setups.
    • Added notes on independent filesystem block size tracking and a tip regarding kernel-level RAID 0.
  • lmcache/v1/config_base.py
    • Refactored the _parse_local_disk function to accept and process comma-separated paths, stripping file:// prefixes from each part.
    • Updated the docstring for _parse_local_disk with detailed examples of input and output formats.
  • lmcache/v1/storage_backend/local_disk_backend.py
    • Modified the __init__ method to store a list of paths instead of a single path.
    • Implemented logic to create directories for all specified paths and to store per-path filesystem block sizes in self.os_disk_bs.
    • Added a @property path to return the first path for backward compatibility.
    • Updated the _key_to_path method to deterministically distribute cache keys across the multiple configured disk paths using hashing.
    • Introduced a new private method _get_block_size to retrieve the appropriate filesystem block size for a given file path.
    • Modified write_file and read_file to dynamically use the correct block size obtained from _get_block_size for O_DIRECT operations.
  • tests/v1/storage_backend/test_local_disk_backend.py
    • Imported _parse_local_disk for testing.
    • Added an assertion in test_init to verify that backend.paths is correctly initialized.
    • Introduced a new test class TestMultiPathDiskBackend to cover multi-path functionality.
    • Added test_init_multi_path to verify correct initialization with multiple paths, backward compatibility of .path, and per-path block size tracking.
    • Implemented test_key_distribution to check that keys are spread across multiple paths.
    • Created test_deterministic_key_to_path to ensure consistent mapping of a key to the same path.
    • Added test_get_block_size to confirm that the correct block size is returned for paths on different filesystems.
    • Introduced a new test class TestParseLocalDisk with various test cases for the _parse_local_disk utility, including handling None, single paths, file:// URIs, comma-separated lists, mixed formats, whitespace, and empty strings.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces multi-path support for the local disk backend, allowing KV cache files to be distributed across multiple storage devices for increased I/O bandwidth. The changes include updates to the configuration parsing, the local disk backend logic to handle multiple paths and heterogeneous block sizes, and comprehensive documentation and tests for the new feature.

My review identifies two high-severity issues. First, the configuration parser for disk paths is too strict and could lead to runtime errors with valid path formats. Second, the key distribution logic uses Python's built-in hash() function, which is not deterministic across process restarts, potentially leading to cache invalidation. I've provided suggestions to address both of these points for improved robustness and correctness.

Comment thread lmcache/v1/config_base.py Outdated
Comment on lines +81 to +84
if re.match(r"file://(.*)/", part):
parsed.append(part[7:])
else:
parsed.append(part)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The regex r"file://(.*)/" used to parse local_disk paths is too strict as it requires a trailing slash on file:// URIs. For example, file:///path/to/cache would not be parsed correctly, and the path would be passed as file:///path/to/cache to filesystem functions, which would fail. This is also inconsistent with the example in docs/source/api_reference/configurations.rst which does not use a trailing slash.

Using startswith("file://") would be more robust and handle both cases (with and without a trailing slash).

Suggested change
if re.match(r"file://(.*)/", part):
parsed.append(part[7:])
else:
parsed.append(part)
if part.startswith("file://"):
parsed.append(part[7:])
else:
parsed.append(part)

Absolute file path for the cached chunk.
"""
filename = key.to_string().replace("/", "-") + ".pt"
idx = hash(filename) % len(self.paths)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The built-in hash() function is not suitable for persistent caches as its output is not guaranteed to be stable across different Python versions or even different process invocations due to hash randomization (enabled by default). This means that after a restart, keys might be mapped to different paths, leading to cache misses and effectively losing cached data until it's evicted and rewritten.

For deterministic distribution, a stable hashing algorithm from the hashlib module should be used. You will need to add import hashlib at the top of the file.

Suggested change
idx = hash(filename) % len(self.paths)
idx = int(__import__('hashlib').md5(filename.encode()).hexdigest(), 16) % len(self.paths)

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c8a80177e3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +222 to +223
if path.startswith(disk_dir):
return bs
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use directory-boundary matching for block size lookup

_get_block_size uses path.startswith(disk_dir), which can select the wrong filesystem entry when one configured cache path is a prefix of another (for example, /mnt/nvme1 vs /mnt/nvme10). In that case reads/writes for files under the longer path can get the shorter path’s block size, and with use_odirect=True this can lead to misaligned direct I/O and runtime EINVAL failures. This was introduced with multi-path support and affects any deployment using overlapping path prefixes.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 30b364dee3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread lmcache/v1/config_base.py Outdated

parsed: list[str] = []
for part in raw_parts:
if re.match(r"file://(.*)/", part):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Strip file:// URIs without requiring trailing slash

_parse_local_disk only removes the file:// prefix when a path matches file://(.*)/, so valid file URIs like file:///mnt/nvme0 (no trailing /) are left unchanged. In the new multi-path flow this causes LocalDiskBackend to treat entries as literal filesystem paths (file:///...) and create/write under the wrong directory tree instead of the intended mount points, which breaks configurations that omit trailing slashes.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

Comment thread lmcache/v1/storage_backend/local_disk_backend.py
Comment thread docs/source/api_reference/configurations.rst Outdated
@sammshen
Copy link
Copy Markdown
Contributor

can you add multipath support first with simple round robin routing? and then we can add extra modules to match workers with disk paths / smarter load balancing later?

@glimchb glimchb force-pushed the local-multi branch 2 times, most recently from d386614 to 8b21b63 Compare April 2, 2026 14:33
@glimchb glimchb closed this Apr 2, 2026
@glimchb glimchb reopened this Apr 2, 2026
@sammshen
Copy link
Copy Markdown
Contributor

sammshen commented Apr 3, 2026

@cursor review

assert result == "/mnt/nvme0/,/mnt/nvme1/"

def test_empty_string(self):
assert _parse_local_disk("") is None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests directly test private function _parse_local_disk

Low Severity

TestParseLocalDisk imports and directly tests _parse_local_disk, a private (underscore-prefixed) module-level function from config_base. Project rules state tests must verify the public interface and docstring contract, not implementation details, and explicitly prohibit tests for private methods. This parsing behavior can be tested through the public LMCacheEngineConfig.from_defaults or from_env API instead.

Additional Locations (1)
Fix in Cursor Fix in Web

Triggered by project rule: LMCache Code Review Style Guide

Comment thread lmcache/v1/storage_backend/local_disk_backend.py Outdated
@DongDongJu DongDongJu added the full Run comprehensive tests on this PR label Apr 6, 2026
@DongDongJu
Copy link
Copy Markdown
Collaborator

can you add multipath support first with simple round robin routing? and then we can add extra modules to match workers with disk paths / smarter load balancing later?

It sounds make sense to me, lets just do the file level interleaving among the folders.

@glimchb
Copy link
Copy Markdown
Contributor Author

glimchb commented Apr 7, 2026

can you add multipath support first with simple round robin routing? and then we can add extra modules to match workers with disk paths / smarter load balancing later?

It sounds make sense to me, lets just do the file level interleaving among the folders.

sure. this is introducing "by_gpu" sharding. same like we did for NIXL #2418 and GDS #2922.
next PR we can introduce round-robin across all 3 backends: local, nixl, gds as well.
adding round robin to this PR is a bit too much for single PR imho

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit ae805f9. Configure here.

Comment thread lmcache/v1/storage_backend/local_disk_backend.py
Allow `local_disk` to accept comma-separated paths (e.g.
"/mnt/nvme0/,/mnt/nvme1/") to use multiple NVMe devices.  Each GPU
worker selects one path at init time via the `local_disk_path_sharding`
strategy (currently only "by_gpu": device_id % num_paths), matching
the GDS backend approach LMCache#2817 and NIXL approach LMCache#2418.

- Path selected once in __init__; _key_to_path, write_file, read_file
  unchanged from upstream
- _parse_local_disk now uses startswith("file://") instead of regex,
  fixing file:// URIs without a trailing slash
- All directories created at startup
- Added local_disk_path_sharding config field (default: "by_gpu")
- Added tests and updated docs

Before this change the only way to increase performance was to use
any of the linux multi-pathing technologies to aggregate IOs.

Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>
os.makedirs(self.path)
logger.info(f"Created local disk cache directory: {self.path}")

# Multi-path support: parse comma-separated paths and select one
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally you create a new module called PathSharder that takes the csv as input and outputs a single path. the policy should be in there

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this is a good idea. thank you for suggestion. I will introduce PathSharder and use it in all 3 storages: local, gds and nixl. This will remove duplication and create nicer abstraction for the sharding strategy

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uploaded draft #2982 still wip

Copy link
Copy Markdown
Contributor

@sammshen sammshen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, approving first but please address the serious code modularity comment

Copy link
Copy Markdown
Collaborator

@DongDongJu DongDongJu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, left just one comment for the community.


local_disk: "/mnt/nvme0/kvcache/,/mnt/nvme1/kvcache/"
local_disk_path_sharding: "by_gpu"
max_local_disk_size: 20.0
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

max_local_disk_size is kinda vague in here.
we should make better naming or param with multi folder structure.

@DongDongJu DongDongJu merged commit 155179c into LMCache:dev Apr 8, 2026
35 checks passed
@glimchb glimchb deleted the local-multi branch April 8, 2026 23:50
Oasis-Git pushed a commit to Oasis-Git/LMCache that referenced this pull request Apr 13, 2026
LMCache#2801)

feat(disk): support multi-path local disk backend with path sharding

Allow `local_disk` to accept comma-separated paths (e.g.
"/mnt/nvme0/,/mnt/nvme1/") to use multiple NVMe devices.  Each GPU
worker selects one path at init time via the `local_disk_path_sharding`
strategy (currently only "by_gpu": device_id % num_paths), matching
the GDS backend approach LMCache#2817 and NIXL approach LMCache#2418.

- Path selected once in __init__; _key_to_path, write_file, read_file
  unchanged from upstream
- _parse_local_disk now uses startswith("file://") instead of regex,
  fixing file:// URIs without a trailing slash
- All directories created at startup
- Added local_disk_path_sharding config field (default: "by_gpu")
- Added tests and updated docs

Before this change the only way to increase performance was to use
any of the linux multi-pathing technologies to aggregate IOs.

Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

full Run comprehensive tests on this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants