Add hipFile support for AIS (AMD Infinity Storage) storage#2799
Add hipFile support for AIS (AMD Infinity Storage) storage#2799deng451e merged 1 commit intoLMCache:devfrom
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request extends the LMCache GDS backend to support AMD's hipFile, enabling GPU-direct storage for ROCm platforms. It integrates a new memory allocator, introduces a configuration option for hipFile, and intelligently selects the appropriate GPU-direct storage mechanism (cuFile or hipFile) based on the environment. These changes broaden the compatibility of the GDS backend to include AMD GPUs while maintaining existing functionality and ensuring backward compatibility. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
The pull request successfully integrates AMD hipFile support for GPU-direct storage operations, providing a ROCm equivalent to NVIDIA cuFile. The changes include adding a new HipFileMemoryAllocator, integrating it into the GDS backend with a configuration option (use_hipfile), and updating relevant documentation and tests. The implementation maintains backward compatibility and correctly handles automatic allocator selection. The documentation has been updated to reflect the new configuration options and capabilities.
| # HACK: hipfile import is placed here to avoid import errors on | ||
| # hardware without GPUDirect Storage / hipFile support. | ||
| from hipfile.bindings import hipFileBufDeregister, hipFileBufRegister |
There was a problem hiding this comment.
The comment "HACK: hipfile import is placed here to avoid import errors on hardware without GPUDirect Storage / hipFile support" indicates a workaround. While functional, placing imports inside a function can sometimes lead to unexpected behavior or make the code harder to reason about. Consider if a more robust solution, such as a factory pattern or conditional import at the module level with a dummy class/function for unsupported platforms, could be implemented to avoid this "HACK".
| # HACK: hipfile import may be buggy on some hardware | ||
| # (e.g., without GPUDirect), so it's temporarily put here. | ||
| # Third Party |
There was a problem hiding this comment.
Similar to the HipFileMemoryAllocator import, the "HACK" comment for hipfile import suggests a temporary solution. It would be beneficial to explore a more permanent design pattern to handle conditional imports for different GDS backends (cuFile, hipFile) to improve code clarity and maintainability.
| import hipfile | ||
|
|
||
| self.cudart = None | ||
| self.cufile = hipfile # Reuse the same attribute name for compatibility |
There was a problem hiding this comment.
Reusing the self.cufile attribute name for the hipfile module (i.e., self.cufile = hipfile) might lead to confusion for future maintainers who might expect self.cufile to always refer to the NVIDIA cuFile module. While the comment mentions "compatibility", it could be clearer to use a more generic attribute name, such as self.gds_driver_module, to explicitly indicate that it can hold either cufile or hipfile depending on the active backend. However, a full refactoring would require changes outside this diff.
afcf20c to
542bb9b
Compare
|
@glimchb thanks for this work. Can I ask you change the title of this PR from GPU-direct storage to AMD Infinity Storage (AIS) since that is the appropriate name for this technology and I would not want to confuse users. |
DongDongJu
left a comment
There was a problem hiding this comment.
Hello @glimchb,
I have two small requests.
Can we add few documentation that what is pre-condition for enabling Hipfile? eg., rocm version, hardware requirements ...
And multiple code looks logically redundant with cufile usage.
Can we using flags like use_gds=True/False and gds_backend="cufile or hipfile" and implement based on that?
| extra_config: | ||
| use_hipfile: true | ||
|
|
||
| Note: The ``cufile_buffer_size`` configuration is used for both cuFile and hipFile buffers. |
There was a problem hiding this comment.
Since this variable are using for the same purpose.
Can we have a thin abstraction class for GDS can cover cufile and hipfile by each implementation?
| # HACK: hipfile import may be buggy on some hardware | ||
| # (e.g., without GPUDirect), so it's temporarily put here. | ||
| # Third Party | ||
| import hipfile |
There was a problem hiding this comment.
Then can we make the exception or raise error in here with above description?
absolutely. adding this now ...
I also want to fix it, but code review will be much more complex. also do we want to keep old option |
Preparation work for LMCache#2799 (AMD hipFile AIS support). Refactor the GDS backend configuration to be backend-agnostic: - Replace extra_config["use_cufile"] with top-level use_gds (bool) config - Add gds_backend config field ("cufile" default) to select GDS library - Rename cufile_buffer_size to gds_buffer_size - Rename internal attributes: self.cufile -> self.gds_module, self._cufile_driver -> self._gds_driver, self.cufile_base_pointer -> self.gds_base_pointer - Update tests and documentation accordingly New env vars: LMCACHE_USE_GDS, LMCACHE_GDS_BACKEND, LMCACHE_GDS_BUFFER_SIZE Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <noreply@cognition.ai> Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>
Thanks!
IMO. If you will make the follow up PR for this then dealing this in that PR is right direction. |
DongDongJu
left a comment
There was a problem hiding this comment.
Code-wise looks no problem now but I can not confirm that functionality since I dont have access to AMD GPU to test this
|
So maybe @mcgrof can help to eval this? |
@DongDongJu we tested this on our server. If code looks good and existing functionality preserved, then maybe consider approving it? It’s same structure as CuFile |
Head branch was pushed to by a user without write access
Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>
Head branch was pushed to by a user without write access
|
i ran |
Preparation work for LMCache#2799 (AMD hipFile AIS support). Refactor the GDS backend configuration to be backend-agnostic: - Replace extra_config["use_cufile"] with top-level use_gds (bool) config - Add gds_backend config field ("cufile" default) to select GDS library - Rename cufile_buffer_size to gds_buffer_size - Rename internal attributes: self.cufile -> self.gds_module, self._cufile_driver -> self._gds_driver, self.cufile_base_pointer -> self.gds_base_pointer - Update tests and documentation accordingly New env vars: LMCACHE_USE_GDS, LMCACHE_GDS_BACKEND, LMCACHE_GDS_BUFFER_SIZE Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <noreply@cognition.ai> Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>
Preparation work for LMCache#2799 (AMD hipFile AIS support). Refactor the GDS backend configuration to be backend-agnostic: - Replace extra_config["use_cufile"] with top-level use_gds (bool) config - Add gds_backend config field ("cufile" default) to select GDS library - Rename cufile_buffer_size to gds_buffer_size - Rename internal attributes: self.cufile -> self.gds_module, self._cufile_driver -> self._gds_driver, self.cufile_base_pointer -> self.gds_base_pointer - Update tests and documentation accordingly New env vars: LMCACHE_USE_GDS, LMCACHE_GDS_BACKEND, LMCACHE_GDS_BUFFER_SIZE Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <noreply@cognition.ai> Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>
Follow up work post LMCache#2799 (AMD hipFile AIS support). Refactor the GDS backend configuration to be backend-agnostic: - Replace extra_config["use_cufile"] with top-level use_gds (bool) config - Add gds_backend config field ("cufile" default) to select GDS library - Rename cufile_buffer_size to gds_buffer_size - Rename internal attributes: self.cufile -> self.gds_module, self._cufile_driver -> self._gds_driver, self.cufile_base_pointer -> self.gds_base_pointer - Update tests and documentation accordingly New env vars: LMCACHE_USE_GDS, LMCACHE_GDS_BACKEND, LMCACHE_GDS_BUFFER_SIZE Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>
Follow up work post LMCache#2799 (AMD hipFile AIS support). Refactor the GDS backend configuration to be backend-agnostic: - Replace extra_config["use_cufile"] with top-level use_gds (bool) config - Add gds_backend config field ("cufile" default) to select GDS library - Rename cufile_buffer_size to gds_buffer_size - Rename internal attributes: self.cufile -> self.gds_module, self._cufile_driver -> self._gds_driver, self.cufile_base_pointer -> self.gds_base_pointer - Update tests and documentation accordingly New env vars: LMCACHE_USE_GDS, LMCACHE_GDS_BACKEND, LMCACHE_GDS_BUFFER_SIZE Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>
Follow up work post LMCache#2799 (AMD hipFile AIS support). Refactor the GDS backend configuration to be backend-agnostic: - Replace extra_config["use_cufile"] with top-level use_gds (bool) config - Add gds_backend config field ("cufile" default) to select GDS library - Rename cufile_buffer_size to gds_buffer_size - Rename internal attributes: self.cufile -> self.gds_module, self._cufile_driver -> self._gds_driver, self.cufile_base_pointer -> self.gds_base_pointer - Update tests and documentation accordingly New env vars: LMCACHE_USE_GDS, LMCACHE_GDS_BACKEND, LMCACHE_GDS_BUFFER_SIZE Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>
Follow up work post LMCache#2799 (AMD hipFile AIS support). Refactor the GDS backend configuration to be backend-agnostic: - Replace extra_config["use_cufile"] with top-level use_gds (bool) config - Add gds_backend config field ("cufile" default) to select GDS library - Rename cufile_buffer_size to gds_buffer_size - Rename internal attributes: self.cufile -> self.gds_module, self._cufile_driver -> self._gds_driver, self.cufile_base_pointer -> self.gds_base_pointer - Update tests and documentation accordingly New env vars: LMCACHE_USE_GDS, LMCACHE_GDS_BACKEND, LMCACHE_GDS_BUFFER_SIZE Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>
Follow up work post LMCache#2799 (AMD hipFile AIS support). Refactor the GDS backend configuration to be backend-agnostic: - Replace extra_config["use_cufile"] with top-level use_gds (bool) config - Add gds_backend config field ("cufile" default) to select GDS library - Rename cufile_buffer_size to gds_buffer_size - Rename internal attributes: self.cufile -> self.gds_module, self._cufile_driver -> self._gds_driver, self.cufile_base_pointer -> self.gds_base_pointer - Update tests and documentation accordingly New env vars: LMCACHE_USE_GDS, LMCACHE_GDS_BACKEND, LMCACHE_GDS_BUFFER_SIZE Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>
Follow up work post LMCache#2799 (AMD hipFile AIS support). Refactor the GDS backend configuration to be backend-agnostic: - Replace extra_config["use_cufile"] with top-level use_gds (bool) config - Add gds_backend config field ("cufile" default) to select GDS library - Rename cufile_buffer_size to gds_buffer_size - Rename internal attributes: self.cufile -> self.gds_module, self._cufile_driver -> self._gds_driver, self.cufile_base_pointer -> self.gds_base_pointer - Update tests and documentation accordingly New env vars: LMCACHE_USE_GDS, LMCACHE_GDS_BACKEND, LMCACHE_GDS_BUFFER_SIZE Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>
Follow up work post LMCache#2799 (AMD hipFile AIS support). Refactor the GDS backend configuration to be backend-agnostic: - Replace extra_config["use_cufile"] with top-level use_gds (bool) config - Add gds_backend config field ("cufile" default) to select GDS library - Rename cufile_buffer_size to gds_buffer_size - Rename internal attributes: self.cufile -> self.gds_module, self._cufile_driver -> self._gds_driver, self.cufile_base_pointer -> self.gds_base_pointer - Update tests and documentation accordingly New env vars: LMCACHE_USE_GDS, LMCACHE_GDS_BACKEND, LMCACHE_GDS_BUFFER_SIZE Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>
Follow up work post LMCache#2799 (AMD hipFile AIS support). Refactor the GDS backend configuration to be backend-agnostic: - Replace extra_config["use_cufile"] with top-level use_gds (bool) config - Add gds_backend config field ("cufile" default) to select GDS library - Rename cufile_buffer_size to gds_buffer_size - Rename internal attributes: self.cufile -> self.gds_module, self._cufile_driver -> self._gds_driver, self.cufile_base_pointer -> self.gds_base_pointer - Update tests and documentation accordingly New env vars: LMCACHE_USE_GDS, LMCACHE_GDS_BACKEND, LMCACHE_GDS_BUFFER_SIZE Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>
Follow up work post LMCache#2799 (AMD hipFile AIS support). Refactor the GDS backend configuration to be backend-agnostic: - Replace extra_config["use_cufile"] with top-level use_gds (bool) config - Add gds_backend config field ("cufile" default) to select GDS library - Rename cufile_buffer_size to gds_buffer_size - Rename internal attributes: self.cufile -> self.gds_module, self._cufile_driver -> self._gds_driver, self.cufile_base_pointer -> self.gds_base_pointer - Update tests and documentation accordingly New env vars: LMCACHE_USE_GDS, LMCACHE_GDS_BACKEND, LMCACHE_GDS_BUFFER_SIZE Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>
Follow up work post LMCache#2799 (AMD hipFile AIS support). Refactor the GDS backend configuration to be backend-agnostic: - Replace extra_config["use_cufile"] with top-level use_gds (bool) config - Add gds_backend config field ("cufile" default) to select GDS library - Rename cufile_buffer_size to gds_buffer_size - Rename internal attributes: self.cufile -> self.gds_module, self._cufile_driver -> self._gds_driver, self.cufile_base_pointer -> self.gds_base_pointer - Update tests and documentation accordingly New env vars: LMCACHE_USE_GDS, LMCACHE_GDS_BACKEND, LMCACHE_GDS_BUFFER_SIZE Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>
Follow up work post LMCache#2799 (AMD hipFile AIS support). Refactor the GDS backend configuration to be backend-agnostic: - Replace extra_config["use_cufile"] with top-level use_gds (bool) config - Add gds_backend config field ("cufile" default) to select GDS library - Rename cufile_buffer_size to gds_buffer_size - Rename internal attributes: self.cufile -> self.gds_module, self._cufile_driver -> self._gds_driver, self.cufile_base_pointer -> self.gds_base_pointer - Update tests and documentation accordingly New env vars: LMCACHE_USE_GDS, LMCACHE_GDS_BACKEND, LMCACHE_GDS_BUFFER_SIZE Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>
Follow up work post LMCache#2799 (AMD hipFile AIS support). Refactor the GDS backend configuration to be backend-agnostic: - Replace extra_config["use_cufile"] with top-level use_gds (bool) config - Add gds_backend config field ("cufile" default) to select GDS library - Rename cufile_buffer_size to gds_buffer_size - Rename internal attributes: self.cufile -> self.gds_module, self._cufile_driver -> self._gds_driver, self.cufile_base_pointer -> self.gds_base_pointer - Update tests and documentation accordingly New env vars: LMCACHE_USE_GDS, LMCACHE_GDS_BACKEND, LMCACHE_GDS_BUFFER_SIZE Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>
Follow up work post LMCache#2799 (AMD hipFile AIS support). Refactor the GDS backend configuration to be backend-agnostic: - Replace extra_config["use_cufile"] with top-level use_gds (bool) config - Add gds_backend config field ("cufile" default) to select GDS library - Rename cufile_buffer_size to gds_buffer_size - Rename internal attributes: self.cufile -> self.gds_module, self._cufile_driver -> self._gds_driver, self.cufile_base_pointer -> self.gds_base_pointer - Update tests and documentation accordingly New env vars: LMCACHE_USE_GDS, LMCACHE_GDS_BACKEND, LMCACHE_GDS_BUFFER_SIZE Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>
Follow up work post LMCache#2799 (AMD hipFile AIS support). Refactor the GDS backend configuration to be backend-agnostic: - Replace extra_config["use_cufile"] with top-level use_gds (bool) config - Add gds_backend config field ("cufile" default) to select GDS library - Rename cufile_buffer_size to gds_buffer_size - Rename internal attributes: self.cufile -> self.gds_module, self._cufile_driver -> self._gds_driver, self.cufile_base_pointer -> self.gds_base_pointer - Update tests and documentation accordingly New env vars: LMCACHE_USE_GDS, LMCACHE_GDS_BACKEND, LMCACHE_GDS_BUFFER_SIZE Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>
…2858) Follow up work post #2799 (AMD hipFile AIS support). Refactor the GDS backend configuration to be backend-agnostic: - Replace extra_config["use_cufile"] with top-level use_gds (bool) config - Add gds_backend config field ("cufile" default) to select GDS library - Rename cufile_buffer_size to gds_buffer_size - Rename internal attributes: self.cufile -> self.gds_module, self._cufile_driver -> self._gds_driver, self.cufile_base_pointer -> self.gds_base_pointer - Update tests and documentation accordingly New env vars: LMCACHE_USE_GDS, LMCACHE_GDS_BACKEND, LMCACHE_GDS_BUFFER_SIZE Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>
This commit adds comprehensive support for AMD hipFile, providing the ROCm equivalent to NVIDIA cuFile for GPU-direct storage operations.
HipFileMemoryAllocator: New memory allocator for AMD GPUs
GDS Backend Integration: Added use_hipfile configuration option
ROCm Device Detection: Properly detects HIP devices via torch.version.hip
Automatic Allocator Selection: Backend chooses between cuFile/hipFile
Seamless integration with existing GDS backend
Configuration via extra_config: {"use_hipfile": true}
Automatic fallback to cuFile for NVIDIA systems
Full test coverage with mock-based tests
Complete documentation with examples
lmcache/v1/memory_management.py: Added HipFileMemoryAllocator
lmcache/v1/storage_backend/gds_backend.py: Added hipFile support
lmcache/v1/config.py: Added use_hipfile config option
lmcache/v1/cache_engine.py: Updated allocator selection logic
tests/v1/utils.py: Added has_hipfile() function
tests/v1/*: Updated tests to support both backends
docs/*: Updated documentation for hipFile support
requirements/common.txt: Added hipfile-python dependency
extra_config:
use_hipfile: true
export LMCACHE_EXTRA_CONFIG='{"use_hipfile": true}'
The implementation maintains full backward compatibility while enabling GPU-direct storage on AMD ROCm platforms.
Resolves: Support for AMD GPU-direct storage in LMCache
What this PR does / why we need it:
Special notes for your reviewers:
If applicable: