Add hipFile support for AIS (AMD Infinity Storage) storage by glimchb · Pull Request #2799 · LMCache/LMCache

glimchb · 2026-03-17T13:59:41Z

This commit adds comprehensive support for AMD hipFile, providing the ROCm equivalent to NVIDIA cuFile for GPU-direct storage operations.

HipFileMemoryAllocator: New memory allocator for AMD GPUs
GDS Backend Integration: Added use_hipfile configuration option
ROCm Device Detection: Properly detects HIP devices via torch.version.hip
Automatic Allocator Selection: Backend chooses between cuFile/hipFile
Seamless integration with existing GDS backend
Configuration via extra_config: {"use_hipfile": true}
Automatic fallback to cuFile for NVIDIA systems
Full test coverage with mock-based tests
Complete documentation with examples
lmcache/v1/memory_management.py: Added HipFileMemoryAllocator
lmcache/v1/storage_backend/gds_backend.py: Added hipFile support
lmcache/v1/config.py: Added use_hipfile config option
lmcache/v1/cache_engine.py: Updated allocator selection logic
tests/v1/utils.py: Added has_hipfile() function
tests/v1/*: Updated tests to support both backends
docs/*: Updated documentation for hipFile support
requirements/common.txt: Added hipfile-python dependency
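
The detection and selection items above can be sketched as follows. This is a minimal illustration, not the PR's code: it assumes a truthy `torch.version.hip` marks a ROCm build (the attribute is `None` on CUDA builds) and that an explicit `use_hipfile` setting overrides auto-detection. The name `select_gds_backend` is hypothetical, and the torch module is passed in as a parameter so the sketch runs without PyTorch installed.

```python
from types import SimpleNamespace
from typing import Any, Optional


def select_gds_backend(torch_module: Any, use_hipfile: Optional[bool] = None) -> str:
    """Return "hipfile" or "cufile" (hypothetical helper, not LMCache API)."""
    if use_hipfile is not None:
        # Assumption: an explicit configuration value wins over auto-detection.
        return "hipfile" if use_hipfile else "cufile"
    # torch.version.hip is a version string on ROCm builds and None otherwise.
    hip_version = getattr(getattr(torch_module, "version", None), "hip", None)
    return "hipfile" if hip_version else "cufile"


# Stand-ins for a ROCm build and a CUDA build of PyTorch.
rocm_torch = SimpleNamespace(version=SimpleNamespace(hip="6.2.0", cuda=None))
cuda_torch = SimpleNamespace(version=SimpleNamespace(hip=None, cuda="12.4"))

print(select_gds_backend(rocm_torch))
print(select_gds_backend(cuda_torch))
```

In real use one would pass the imported `torch` module itself as the first argument.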

extra_config:
    use_hipfile: true

export LMCACHE_EXTRA_CONFIG='{"use_hipfile": true}'
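
As a rough sketch of how the environment-variable form could be interpreted, the snippet below parses the JSON value and reads the `use_hipfile` flag. It assumes the variable holds a JSON object, as in the export line above; the helper name `hipfile_requested` is hypothetical and this is not how LMCache itself necessarily loads its configuration.

```python
import json
import os
from typing import Mapping


def hipfile_requested(env: Mapping[str, str] = os.environ) -> bool:
    """Return True if LMCACHE_EXTRA_CONFIG enables use_hipfile (illustrative only)."""
    raw = env.get("LMCACHE_EXTRA_CONFIG", "")
    if not raw:
        # Variable unset or empty: hipFile was not requested.
        return False
    extra_config = json.loads(raw)
    return bool(extra_config.get("use_hipfile", False))


print(hipfile_requested({"LMCACHE_EXTRA_CONFIG": '{"use_hipfile": true}'}))
```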

The implementation maintains full backward compatibility while enabling GPU-direct storage on AMD ROCm platforms.

Resolves: Support for AMD GPU-direct storage in LMCache

If applicable:

this PR contains user facing changes - docs added
this PR contains unit tests

gemini-code-assist · 2026-03-17T14:00:12Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request extends the LMCache GDS backend to support AMD's hipFile, enabling GPU-direct storage for ROCm platforms. It integrates a new memory allocator, introduces a configuration option for hipFile, and intelligently selects the appropriate GPU-direct storage mechanism (cuFile or hipFile) based on the environment. These changes broaden the compatibility of the GDS backend to include AMD GPUs while maintaining existing functionality and ensuring backward compatibility.

Highlights

AMD hipFile Support: Introduced comprehensive support for AMD hipFile, providing the ROCm equivalent to NVIDIA cuFile for GPU-direct storage operations.
New Memory Allocator: Added a new HipFileMemoryAllocator class for managing GPU memory with hipFile.
Configurable Backend Selection: Implemented a use_hipfile configuration option and automatic allocator selection logic to choose between cuFile and hipFile based on system and configuration.
Updated Documentation and Tests: Expanded documentation to include hipFile usage and updated the test suite to ensure compatibility and proper functioning across both NVIDIA and AMD GDS backends.

Changelog

docs/source/api_reference/configurations.rst
- Updated the description for cufile_buffer_size to clarify its applicability to both cuFile and hipFile operations.
docs/source/kv_cache/storage_backends/gds.rst
- Expanded the GDS overview to explicitly mention support for both NVIDIA cuFile and AMD hipFile.
- Added a new section detailing how to configure and use AMD hipFile, including environment variable and configuration file examples.
docs/source/kv_cache/storage_backends/weka.rst
- Modified comments and section headers to reflect that cufile_buffer_size and related explanations now apply to both cuFile and hipFile.
lmcache/v1/cache_engine.py
- Imported the new HipFileMemoryAllocator class.
lmcache/v1/memory_management.py
- Implemented the HipFileMemoryAllocator class, which inherits from GPUMemoryAllocator and handles hipFileBufRegister and hipFileBufDeregister for AMD GPU memory.
lmcache/v1/storage_backend/gds_backend.py
- Imported HipFileMemoryAllocator.
- Refactored the __init__ method to remove premature allocator initialization.
- Introduced self.use_hipfile and logic to parse use_hipfile from extra_config.
- Updated filesystem type detection and enforcement logic to account for both use_cufile and use_hipfile.
- Modified the GDS backend to dynamically import hipfile when self.use_hipfile is enabled.
- Updated the initialize_allocator method to conditionally return either CuFileMemoryAllocator or HipFileMemoryAllocator based on the use_hipfile flag.
tests/v1/data/gds.yaml
- Explicitly added use_cufile: true to extra_config for clarity in testing.
tests/v1/data/hipfile.yaml
- Added a new YAML configuration file to test the HipFileMemoryAllocator.
tests/v1/data/hipfile_gds.yaml
- Added a new YAML configuration file for GDS backend testing with hipFile enabled.
tests/v1/storage_backend/test_gds_backend.py
- Imported has_hipfile utility function.
- Modified pytest.mark.skipif decorators to allow tests to run if either cuFile or hipFile is available, ensuring broader GDS test coverage.
tests/v1/test_gds.py
- Imported HipFileMemoryAllocator.
- Updated assertions to check for instances of both CuFileMemoryAllocator and HipFileMemoryAllocator for the GDS backend's memory allocator.
tests/v1/utils.py
- Added a new utility function has_hipfile() to detect the presence and loadability of the hipfile Python package and libhipfile.so dynamic library.
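
A check of the kind described for `has_hipfile()` might look like the sketch below. It is an assumption about the approach, not the code from tests/v1/utils.py: it looks for the `hipfile` package with `importlib.util.find_spec` and then tries to load `libhipfile.so` with `ctypes`. The function name `has_hipfile_sketch` is deliberately different from the real utility.

```python
import ctypes
import importlib.util


def has_hipfile_sketch(lib_name: str = "libhipfile.so") -> bool:
    """Return True only if the hipfile package and shared library both load."""
    try:
        if importlib.util.find_spec("hipfile") is None:
            return False  # Python package is not installed
    except (ImportError, ValueError):
        return False
    try:
        ctypes.CDLL(lib_name)  # raises OSError if the library cannot be loaded
    except OSError:
        return False
    return True


print(has_hipfile_sketch())
```

A test module could then combine this with the existing cuFile check in a `pytest.mark.skipif` condition so that tests run when either library is present.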

Activity

The pull request was opened by glimchb to add AMD hipFile support for GPU-direct storage.
Initial code changes were pushed to introduce the HipFileMemoryAllocator, integrate use_hipfile configuration, and update GDS backend logic for allocator selection.
Documentation was updated to reflect the new hipFile capabilities and configuration.
New test configurations and utility functions were added, and existing tests were modified to ensure compatibility and proper functioning with both cuFile and hipFile.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

The pull request successfully integrates AMD hipFile support for GPU-direct storage operations, providing a ROCm equivalent to NVIDIA cuFile. The changes include adding a new HipFileMemoryAllocator, integrating it into the GDS backend with a configuration option (use_hipfile), and updating relevant documentation and tests. The implementation maintains backward compatibility and correctly handles automatic allocator selection. The documentation has been updated to reflect the new configuration options and capabilities.

gemini-code-assist · 2026-03-17T14:02:19Z

+        # HACK: hipfile import is placed here to avoid import errors on
+        # hardware without GPUDirect Storage / hipFile support.
+        from hipfile.bindings import hipFileBufDeregister, hipFileBufRegister


The comment "HACK: hipfile import is placed here to avoid import errors on hardware without GPUDirect Storage / hipFile support" indicates a workaround. While functional, placing imports inside a function can sometimes lead to unexpected behavior or make the code harder to reason about. Consider if a more robust solution, such as a factory pattern or conditional import at the module level with a dummy class/function for unsupported platforms, could be implemented to avoid this "HACK".
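
One way to read the "conditional import at the module level with a dummy" suggestion is sketched below. The import path and the two function names are taken from the diff above; the `HIPFILE_AVAILABLE` flag, the stub, and the error message are assumptions made for illustration.

```python
try:
    # Import path and names as they appear in the diff under review.
    from hipfile.bindings import hipFileBufDeregister, hipFileBufRegister

    HIPFILE_AVAILABLE = True
except (ImportError, OSError):
    # Package missing, or its shared library failed to load.
    HIPFILE_AVAILABLE = False

    def _hipfile_unavailable(*args, **kwargs):
        """Stub that fails loudly if hipFile is used on an unsupported system."""
        raise RuntimeError("hipFile bindings are not available on this system")

    hipFileBufRegister = hipFileBufDeregister = _hipfile_unavailable
```

With this shape the module imports cleanly everywhere, and the failure moves to the first call site, which can check `HIPFILE_AVAILABLE` beforehand.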

gemini-code-assist · 2026-03-17T14:02:19Z

+            # HACK: hipfile import may be buggy on some hardware
+            # (e.g., without GPUDirect), so it's temporarily put here.
+            # Third Party


Similar to the HipFileMemoryAllocator import, the "HACK" comment for hipfile import suggests a temporary solution. It would be beneficial to explore a more permanent design pattern to handle conditional imports for different GDS backends (cuFile, hipFile) to improve code clarity and maintainability.

gemini-code-assist · 2026-03-17T14:02:19Z

+            import hipfile
+
+            self.cudart = None
+            self.cufile = hipfile  # Reuse the same attribute name for compatibility


Reusing the self.cufile attribute name for the hipfile module (i.e., self.cufile = hipfile) might lead to confusion for future maintainers who might expect self.cufile to always refer to the NVIDIA cuFile module. While the comment mentions "compatibility", it could be clearer to use a more generic attribute name, such as self.gds_driver_module, to explicitly indicate that it can hold either cufile or hipfile depending on the active backend. However, a full refactoring would require changes outside this diff.

sbates130272 · 2026-03-17T18:35:48Z

@glimchb thanks for this work. Can I ask you to change the title of this PR from GPU-direct storage to AMD Infinity Storage (AIS)? That is the appropriate name for this technology, and I would not want to confuse users.

DongDongJu

Hello @glimchb,

I have two small requests.

Can we add some documentation on the preconditions for enabling hipFile, e.g., ROCm version, hardware requirements ...

Also, much of the code looks logically redundant with the cuFile usage.
Could we use flags like use_gds=True/False and gds_backend="cufile or hipfile" and implement it based on those?

DongDongJu · 2026-03-23T20:24:32Z

+    extra_config:
+        use_hipfile: true
+
+Note: The ``cufile_buffer_size`` configuration is used for both cuFile and hipFile buffers.


Since this variable is used for the same purpose in both cases,
can we have a thin abstraction class for GDS that covers cuFile and hipFile, each with its own implementation?
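
The thin abstraction asked for here could take roughly the following shape. Everything in this sketch is hypothetical (class names, method names, and the factory); the in-memory `RecordingBackend` only stands in for real cuFile and hipFile implementations so that the example runs anywhere.

```python
from abc import ABC, abstractmethod
from typing import Dict


class GdsBufferBackend(ABC):
    """Hypothetical common interface for cuFile and hipFile buffer handling."""

    name: str

    @abstractmethod
    def register(self, ptr: int, size: int) -> None:
        """Register a GPU buffer with the storage library."""

    @abstractmethod
    def deregister(self, ptr: int) -> None:
        """Release a previously registered buffer."""


class RecordingBackend(GdsBufferBackend):
    """In-memory stand-in; a real subclass would call cuFile or hipFile."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.registered: Dict[int, int] = {}

    def register(self, ptr: int, size: int) -> None:
        self.registered[ptr] = size

    def deregister(self, ptr: int) -> None:
        self.registered.pop(ptr, None)


def make_backend(kind: str) -> GdsBufferBackend:
    """Pick an implementation by name, mirroring a gds_backend-style setting."""
    if kind not in ("cufile", "hipfile"):
        raise ValueError(f"unknown GDS backend: {kind!r}")
    return RecordingBackend(kind)


backend = make_backend("hipfile")
backend.register(0x1000, 4096)
backend.deregister(0x1000)
```

Callers would depend only on `GdsBufferBackend`, so a shared buffer-size setting applies to either library without backend-specific branches.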

DongDongJu · 2026-03-23T20:27:02Z

+            # HACK: hipfile import may be buggy on some hardware
+            # (e.g., without GPUDirect), so it's temporarily put here.
+            # Third Party
+            import hipfile


Then can we catch the exception or raise an error here that includes the description above?
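
A sketch of what such an error could look like: the import is wrapped so that a failure surfaces with an explanation instead of a bare ImportError. The helper name `import_gds_module` and the message wording are assumptions, not the PR's code.

```python
import importlib
from types import ModuleType


def import_gds_module(module_name: str) -> ModuleType:
    """Import a GDS library module, raising a descriptive error on failure."""
    try:
        return importlib.import_module(module_name)
    except (ImportError, OSError) as exc:
        raise RuntimeError(
            f"GDS backend module '{module_name}' could not be imported; "
            "it may not be installed, or this hardware may lack "
            "GPU-direct storage support"
        ) from exc
```

For example, `import_gds_module("hipfile")` would either return the module or fail with the message above, keeping the original exception chained as the cause.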

glimchb · 2026-03-23T21:25:19Z

> Hello @glimchb,
>
> I have two small requests.
>
> Can we add some documentation on the preconditions for enabling hipFile, e.g., ROCm version, hardware requirements

Absolutely, adding this now.

> ...
>
> Also, much of the code looks logically redundant with the cuFile usage. Could we use flags like use_gds=True/False and gds_backend="cufile or hipfile" and implement it based on those?

I also want to fix it, but the code review would be much more complex.
I can make this change as a separate PR AFTER this one, to consolidate the configs, or BEFORE this PR as preparation.
See #2858
Which one do you prefer, @DongDongJu?

Also, do we want to keep the old use_cufile option around for a while for deprecation compatibility, or can we just replace it?

Preparation work for LMCache#2799 (AMD hipFile AIS support).

Refactor the GDS backend configuration to be backend-agnostic:
- Replace extra_config["use_cufile"] with top-level use_gds (bool) config
- Add gds_backend config field ("cufile" default) to select GDS library
- Rename cufile_buffer_size to gds_buffer_size
- Rename internal attributes: self.cufile -> self.gds_module, self._cufile_driver -> self._gds_driver, self.cufile_base_pointer -> self.gds_base_pointer
- Update tests and documentation accordingly

New env vars: LMCACHE_USE_GDS, LMCACHE_GDS_BACKEND, LMCACHE_GDS_BUFFER_SIZE

Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <noreply@cognition.ai>
Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>
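
To make the deprecation question concrete, the sketch below shows one way the flag layout proposed in this thread (use_gds plus gds_backend, with "cufile" as the default) could coexist with the older extra_config flags for a transition period. The function name, the dictionary layout, and the warning text are all assumptions; this is not the behavior of #2858.

```python
import warnings
from typing import Any, Mapping, Tuple


def resolve_gds_settings(config: Mapping[str, Any]) -> Tuple[bool, str]:
    """Return (use_gds, gds_backend), accepting legacy flags with a warning."""
    if "use_gds" in config:
        # New-style configuration takes precedence.
        return bool(config["use_gds"]), str(config.get("gds_backend", "cufile"))

    extra = config.get("extra_config") or {}
    if extra.get("use_hipfile"):
        legacy = (True, "hipfile")
    elif extra.get("use_cufile"):
        legacy = (True, "cufile")
    else:
        return False, "cufile"  # GDS not requested at all

    warnings.warn(
        "extra_config use_cufile/use_hipfile are deprecated; "
        "use use_gds and gds_backend instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return legacy


print(resolve_gds_settings({"use_gds": True, "gds_backend": "hipfile"}))
```

Dropping the legacy branch later would then be a small, isolated change.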

DongDongJu · 2026-03-23T23:03:09Z

> > Hello @glimchb,
> > I have two small requests.
> > Can we add some documentation on the preconditions for enabling hipFile, e.g., ROCm version, hardware requirements
>
> Absolutely, adding this now.

Thanks!

> > ...
> >
> > Also, much of the code looks logically redundant with the cuFile usage. Could we use flags like use_gds=True/False and gds_backend="cufile or hipfile" and implement it based on those?
>
> I also want to fix it, but the code review would be much more complex. I can make this change as a separate PR AFTER this one, to consolidate the configs, or BEFORE this PR as preparation. See #2858. Which one do you prefer, @DongDongJu?
>
> Also, do we want to keep the old use_cufile option around for a while for deprecation compatibility, or can we just replace it?

IMO, if you are going to make a follow-up PR for this, then handling it in that PR is the right direction.

DongDongJu

Code-wise it looks fine now, but I cannot confirm the functionality since I don't have access to an AMD GPU to test this.

DongDongJu · 2026-03-23T23:10:56Z

So maybe @mcgrof can help to eval this?

glimchb · 2026-03-24T10:30:51Z

> Code-wise it looks fine now, but I cannot confirm the functionality since I don't have access to an AMD GPU to test this.

@DongDongJu we tested this on our server. If the code looks good and existing functionality is preserved, then maybe consider approving it? It's the same structure as cuFile.

DongDongJu · 2026-03-24T17:06:19Z

@sammshen @deng451e Could you take a look by any chance?

sammshen

LGTM!

Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>

glimchb · 2026-03-26T00:29:39Z

I have now run uvx pre-commit run --all-files.

Follow up work post LMCache#2799 (AMD hipFile AIS support).

Refactor the GDS backend configuration to be backend-agnostic:
- Replace extra_config["use_cufile"] with top-level use_gds (bool) config
- Add gds_backend config field ("cufile" default) to select GDS library
- Rename cufile_buffer_size to gds_buffer_size
- Rename internal attributes: self.cufile -> self.gds_module, self._cufile_driver -> self._gds_driver, self.cufile_base_pointer -> self.gds_base_pointer
- Update tests and documentation accordingly

New env vars: LMCACHE_USE_GDS, LMCACHE_GDS_BACKEND, LMCACHE_GDS_BUFFER_SIZE

Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>

) feat(gds): Add hipFile support for AIS (AMD Infinity Storage) storage Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>

gemini-code-assist Bot reviewed Mar 17, 2026

glimchb force-pushed the hipFile branch 2 times, most recently from afcf20c to 542bb9b Compare March 17, 2026 14:05

glimchb changed the title ~~Add AMD hipFile support for GPU-direct storage~~ Add AMD hipFile support for AIS (GPU-direct) storage Mar 17, 2026

glimchb changed the title ~~Add AMD hipFile support for AIS (GPU-direct) storage~~ Add AMD hipFile support for AIS (AMD Infinity Storage) storage Mar 17, 2026

glimchb changed the title ~~Add AMD hipFile support for AIS (AMD Infinity Storage) storage~~ Add hipFile support for AIS (AMD Infinity Storage) storage Mar 18, 2026

DongDongJu requested changes Mar 23, 2026

glimchb force-pushed the hipFile branch from 542bb9b to 9157ae8 Compare March 23, 2026 21:32

glimchb mentioned this pull request Mar 23, 2026

[refactor]: Replace use_cufile with use_gds/gds_backend config flags #2858

Merged

2 tasks

DongDongJu reviewed Mar 23, 2026

DongDongJu approved these changes Mar 24, 2026

sammshen approved these changes Mar 25, 2026

glimchb force-pushed the hipFile branch from 9157ae8 to 08366fa Compare March 25, 2026 11:56

DongDongJu enabled auto-merge (squash) March 25, 2026 15:19

auto-merge was automatically disabled March 25, 2026 23:18
Head branch was pushed to by a user without write access

DongDongJu enabled auto-merge (squash) March 25, 2026 23:29

github-actions Bot added the full Run comprehensive tests on this PR label Mar 25, 2026

feat(gds): Add hipFile support for AIS (AMD Infinity Storage) storage

cfc7d2d

Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>

auto-merge was automatically disabled March 26, 2026 00:28
Head branch was pushed to by a user without write access

glimchb force-pushed the hipFile branch from 642a36d to cfc7d2d Compare March 26, 2026 00:28

github-actions Bot removed the full Run comprehensive tests on this PR label Mar 26, 2026

deng451e enabled auto-merge (squash) March 26, 2026 00:43

github-actions Bot added the full Run comprehensive tests on this PR label Mar 26, 2026

deng451e merged commit 18e7a00 into LMCache:dev Mar 26, 2026
36 checks passed

glimchb deleted the hipFile branch March 26, 2026 04:12

gaoikawa mentioned this pull request Apr 8, 2026

[Feature]: Integrate hipFile into LMCache ROCm/hipFile#202

Closed
