[ops][refactor] Add full list of Python fallbacks to run without compiled CUDA extensions#2591
ApostaC merged 59 commits into LMCache:dev from
Conversation
Summary of Changes

Hello @hlin99, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the flexibility and portability of the LMCache library by decoupling it from mandatory compiled CUDA extensions. It does so by providing pure-Python fallback implementations for all core operations and introducing a dynamic backend selection mechanism. This allows LMCache to function on systems without NVIDIA GPUs, broadening its compatibility across diverse hardware environments.
Code Review
This pull request introduces a significant and well-executed refactoring to decouple lmcache from compiled CUDA extensions. By adding pure Python fallbacks for all CUDA operations and implementing a dynamic backend selection mechanism, the library can now be installed and used on machines without NVIDIA GPUs. The changes are extensive and include:
- A new `lmcache/__init__.py` for dynamic backend dispatching.
- A comprehensive `lmcache/non_cuda_equivalents.py` with Python implementations for all `c_ops`.
- An excellent new test suite in `tests/v1/test_non_cuda_equivalents.py` that validates the numerical equivalence between the CUDA and Python backends under various conditions.
- Removal of CUDA-specific import guards throughout the codebase.
The approach is robust, particularly the testing strategy which runs scenarios in different environments and compares results. I have a couple of suggestions to improve safety and test correctness, but overall this is a high-quality contribution.
Hi @hickeyma, thanks for your review. Installing from source, I did not encounter the issue, which might be why people are not complaining about crashes when importing c_ops. From a logic perspective, though, cuda.is_available() does not guarantee that the C library can be safely imported; the import can fail for other reasons. So a fallback path with a try/except block is a safe way to handle this properly.
Force-pushed 6a7496c to a18b286.
Resolved conflicts and rebased onto latest.
…iled CUDA extensions
Decouple LMCache from compiled CUDA extensions by introducing a
complete set of pure-Python (+ NumPy/SciPy) fallback implementations
for every function previously only available through c_ops.
Key changes:
-- Centralize backend selection in lmcache/__init__.py with a
predicate-based registry that probes available backends at import
time and dispatches to the best one (CUDA > fallback).
-- Implement non_cuda_equivalents.py covering all c_ops surfaces:
rotary embedding, CDF calculation, KV-cache reshape/transfer,
encode/decode, pinned/NUMA memory alloc/free, memcpy, etc.
-- Add tests/v1/test_non_cuda_equivalents.py that runs each op
under three backends (CUDA c_ops, non-CUDA with GPU visible,
non-CUDA without GPU) and cross-compares results to ensure
numerical equivalence.
-- Adapt test skip logic to use torch.cuda.is_available() instead
of pytest.importorskip("lmcache.c_ops"), since c_ops import
now always succeeds via automatic fallback.
-- Remove CUDA-only import guards across the codebase so that
   lmcache can be installed and imported on machines without NVIDIA
   GPUs, or where the NVIDIA ops fail to import for any reason
   (CUDA version mismatch or others)
Signed-off-by: Tony Lin <tony.lin@intel.com>
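The CDF calculation listed among the fallback surfaces can be sketched in pure NumPy. This is an illustrative helper only (`compute_cdf` is a hypothetical name, not the actual `non_cuda_equivalents.py` API):

```python
import numpy as np

def compute_cdf(probs, precision=16):
    # Quantize a probability table into integer CDF bins, as arithmetic
    # coders typically require: cdf[0] == 0 and cdf[-1] == 2**precision.
    scale = 1 << precision
    cum = np.concatenate(
        [[0.0], np.cumsum(np.asarray(probs, dtype=np.float64))]
    )
    return np.round(cum / cum[-1] * scale).astype(np.uint32)
```

The resulting table is monotonically non-decreasing, which is what an arithmetic encode/decode pair needs to stay in sync.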
Replace pytest.importorskip("lmcache.c_ops") with a
torch.cuda.is_available() check, as importing c_ops will
always succeed now (either real CUDA ops or fallback).
Non-CUDA backends can still be tested on machines without
CUDA hardware.
Signed-off-by: Tony Lin <tony.lin@intel.com>
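The skip-logic change can be sketched like this, assuming torch and pytest are installed (`requires_cuda` and the test name are illustrative, not the suite's actual names):

```python
import pytest
import torch

# With the fallback in place, `import lmcache.c_ops` always succeeds, so
# hardware-only tests gate on torch.cuda.is_available() instead of
# pytest.importorskip("lmcache.c_ops").
requires_cuda = pytest.mark.skipif(
    not torch.cuda.is_available(), reason="requires CUDA hardware"
)

@requires_cuda
def test_cuda_matches_fallback():
    # Placeholder body: compare compiled-kernel output with the
    # pure-Python fallback here.
    pass
```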
lmcache.c_ops can now be imported safely regardless of CUDA availability. Python fallback is selected automatically on exceptions. Signed-off-by: Tony Lin <tony.lin@intel.com>
@@ -0,0 +1,71 @@
# SPDX-License-Identifier: Apache-2.0
Why do we need this __init__.py? Can we also use it to export an LMCache version in the future? Just curious.
This __init__.py is needed to handle conditional imports: it checks whether the compiled CUDA extensions are available and falls back to the pure-Python implementations if not.
Two main reasons for centralizing it here:
1. To avoid large-scale code changes where importing the C library could throw an exception at each individual call site.
2. Moving forward, this can serve as a single place to hook different kernel implementations per device, hiding device-specific details from the rest of the codebase.
And yes, we could definitely add a version export here in the future! Great idea — I can follow up with a separate PR for that.
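A minimal sketch of that conditional-import idea, with generic names (the real `lmcache/__init__.py` differs in detail):

```python
import importlib
import sys

def select_backend(candidates, alias):
    # Probe candidate backend modules in priority order; any import
    # failure (missing .so, CUDA version mismatch, ...) moves on to the
    # next candidate. The winner is aliased under `alias` so existing
    # `import <alias>` call sites keep working unchanged.
    for name in candidates:
        try:
            module = importlib.import_module(name)
        except Exception:
            continue
        sys.modules[alias] = module
        return module
    raise ImportError(f"no usable backend among {candidates!r}")
```

The `sys.modules` aliasing is the key trick: call sites that do `import lmcache.c_ops as lmc_ops` never need to know which backend won.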
sammshen left a comment:
I like this PR. Let's see if the CI will pass
@hlin99 can you see your unit test failures on AMD? Let me know if you need access to the Buildkite portal.
Hi @sammshen, thanks for pointing this out! I wasn't aware there was a recent API update in these kernels. I'm working on an upgrade now.
Signed-off-by: Tony Lin <tony.lin@intel.com>
- Remove module-level CUDA skip to allow tests in non-CUDA environments
- Add _cuda_available flag to track CUDA availability at module level
- Conditionally run comparison tests only when CUDA is available
- Run no-crash tests (verify execution without comparing results) when CUDA is unavailable
- Maintain backward compatibility: full comparison testing when CUDA is available
Signed-off-by: Tony Lin <tony.lin@intel.com>
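The conditional strategy above can be sketched with a plain-Python helper (hypothetical; the real suite operates on torch tensors and probes `_cuda_available` once at module level):

```python
def check_op(op, reference, inputs, cuda_available):
    # Run the op under test; choose the checking mode based on whether
    # CUDA hardware is available in this environment.
    out = op(*inputs)
    if cuda_available:
        # Full mode: compare against the reference backend's result.
        assert out == reference(*inputs), "backend mismatch"
    else:
        # No-crash mode: only verify the op executes and returns a value.
        assert out is not None
    return out
```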
- Modified scenario_rotary_embedding_k_fused to test both is_neox=True (NeoX-style, contiguous halves) and is_neox=False (GPT-J-style, interleaved)
- Each test case now saves results with distinct suffixes (_neox and _gptj)
- This ensures both code paths in the rotary_embedding_k_fused function are properly tested
- Addresses the critical test coverage gap identified in the coverage analysis
Signed-off-by: Tony Lin <tony.lin@intel.com>
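The two rotary layouts can be sketched in NumPy (hypothetical helpers mirroring the `is_neox` flag; the actual kernel updates torch tensors in place):

```python
import numpy as np

def rotate_neox(x, cos, sin):
    # NeoX style (is_neox=True): the first and second contiguous halves
    # of the head dimension form the rotation pairs.
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    return np.concatenate([x1 * cos - x2 * sin, x2 * cos + x1 * sin], axis=-1)

def rotate_gptj(x, cos, sin):
    # GPT-J style (is_neox=False): even/odd interleaved elements form
    # the rotation pairs.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x2 * cos + x1 * sin
    return out
```

With cos=1, sin=0 both reduce to the identity, and both preserve vector norm for any angle, which makes the two code paths easy to cross-check.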
- Added NB_NL_TWO_BS_NH_HS to format_cases in scenario_multi_layer_kv_transfer
- Updated comments to clarify code paths for different GPUKVFormat enums
- Created comprehensive coverage analysis document
- All applicable GPUKVFormat enums now covered in tests

Signed-off-by: Tony Lin <tony.lin@intel.com>
Thanks @chunxiaozheng for the detailed review.
Thank you @sammshen & @chunxiaozheng!
This PR should be able to merge after #3016.
@maobaolong @chunxiaozheng PTAL!
…thout compiled CUDA extensions LMCache#2591 Signed-off-by: baoloongmao <baoloongmao@tencent.com>
…thout compiled CUDA extensions LMCache#2591 (#16) Signed-off-by: baoloongmao <baoloongmao@tencent.com>
@hlin99 Maybe you forgot to update the
@DongDongJu Would you like to take a look at this PR?
Hi @maobaolong, you're right! I was aware of it. Since this PR was opened there have been some changes to c_ops; I only updated the necessary parts (signatures, enum definitions) here, because I really don't want this PR to grow bigger and bigger. I will raise a separate PR soon to fill the known gaps between c_ops and the non-CUDA equivalents. Anybody who wants to contribute is also welcome.
…iled CUDA extensions (LMCache#2591) Signed-off-by: Tony Lin <tony.lin@intel.com> Co-authored-by: Samuel Shen <slshen@tensormesh.ai> Co-authored-by: Martin Hickey <martin.hickey@ie.ibm.com>

What this PR does / why we need it:
Decouple LMCache from compiled CUDA extensions by introducing a complete set of pure-Python (+ NumPy/SciPy) fallback implementations for every function previously only available through c_ops.
Key changes:
-- Centralize backend selection in lmcache/__init__.py with a predicate-based registry that probes available backends at import
time and dispatches to the best one (CUDA > fallback).
-- Implement non_cuda_equivalents.py covering all c_ops surfaces: rotary embedding, CDF calculation, KV-cache reshape/transfer,
encode/decode, pinned/NUMA memory alloc/free, memcpy, etc.
-- Add tests/v1/test_non_cuda_equivalents.py that runs each op under three backends (CUDA c_ops, non-CUDA with GPU visible,
non-CUDA without GPU) and cross-compares results to ensure numerical equivalence.
-- Adapt test skip logic to use torch.cuda.is_available() instead of pytest.importorskip("lmcache.c_ops"), since c_ops import
now always succeeds via automatic fallback.
-- Remove CUDA-only import guards across the codebase so that lmcache can be installed and imported on machines without
NVIDIA GPUs (e.g., Intel Gaudi / Habana).
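The cross-backend comparison in the test suite can be sketched as follows (hypothetical helper; the actual suite saves per-backend results and compares them across runs):

```python
import numpy as np

def assert_backends_match(results_a, results_b, rtol=1e-5, atol=1e-6):
    # Each backend run records op name -> output array; the parity check
    # cross-compares the two result sets for numerical equivalence.
    assert set(results_a) == set(results_b), "backends ran different ops"
    for name in sorted(results_a):
        np.testing.assert_allclose(
            results_a[name], results_b[name],
            rtol=rtol, atol=atol, err_msg=f"mismatch in {name}",
        )
```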
Special notes for your reviewers:
If applicable:
[ No ] this PR contains user facing changes - docs added
[ Yes ] this PR contains unit tests
Note
Medium Risk
Medium risk because it changes how `lmcache.c_ops` is imported and dispatched at runtime and introduces many new Python implementations of performance- and memory-sensitive kernels (KV transfers, memcpy, arithmetic coding). Behavioral parity is guarded by new tests, but pointer-based tensor views and runtime library loading could surface platform-specific issues.

Overview
LMCache now selects an ops backend dynamically at import time:
`lmcache/__init__.py` probes candidates (currently CUDA) and otherwise falls back to `lmcache.non_cuda_equivalents`, then aliases the chosen module into `sys.modules["lmcache.c_ops"]` so existing `import lmcache.c_ops as lmc_ops` call sites keep working. `lmcache/non_cuda_equivalents.py` is expanded into a full pure-Python/NumPy/Numba fallback surface for previously CUDA-only ops, including pointer-to-tensor views, KV-cache transfer/reshape helpers, arithmetic encode/decode, CDF computation, rotary embedding update, PCI bus ID lookup, and an `lmcache_memcpy_async` fallback that can use `libcudart`/ROCm when available.
CUDA availability guards across the codebase are removed in favor of always importing `lmcache.c_ops` (now backend-dispatched), and tests are updated/added: a new parity suite (`test_non_cuda_equivalents.py` and `test_c_ops_fallback_parity.py`) validates signature/enum compatibility and numerical equivalence across backends, while existing CUDA-kernel tests adjust skip logic to avoid relying on `importorskip("lmcache.c_ops")`. Reviewed by Cursor Bugbot for commit 42c0da9.