[ops][refactor] Add full list of Python fallbacks to run without compiled CUDA extensions#2591
ApostaC merged 59 commits into LMCache:dev from
Conversation
Summary of Changes

Hello @hlin99, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the flexibility and portability of the LMCache library by decoupling it from mandatory compiled CUDA extensions. It does so by providing pure-Python fallback implementations for all core operations and introducing a dynamic backend selection mechanism. This allows LMCache to function on systems without NVIDIA GPUs, broadening its compatibility across diverse hardware environments.
Code Review
This pull request introduces a significant and well-executed refactoring to decouple lmcache from compiled CUDA extensions. By adding pure Python fallbacks for all CUDA operations and implementing a dynamic backend selection mechanism, the library can now be installed and used on machines without NVIDIA GPUs. The changes are extensive and include:
- A new `lmcache/__init__.py` for dynamic backend dispatching.
- A comprehensive `lmcache/non_cuda_equivalents.py` with Python implementations for all `c_ops`.
- An excellent new test suite in `tests/v1/test_non_cuda_equivalents.py` that validates the numerical equivalence between the CUDA and Python backends under various conditions.
- Removal of CUDA-specific import guards throughout the codebase.
The approach is robust, particularly the testing strategy which runs scenarios in different environments and compares results. I have a couple of suggestions to improve safety and test correctness, but overall this is a high-quality contribution.
Hi @hickeyma, thanks for your review. Installing from source, I did not encounter the issue, which might be why people are not complaining about crashes when importing c_ops. From a logic perspective, though, cuda.is_available() does not guarantee that the C library can be safely imported; the import can fail for other reasons. So a fallback path with a try/except block is a safe way to handle this properly.
Force-pushed 6a7496c to a18b286.
Resolved conflicts and rebased onto latest.
…iled CUDA extensions
Decouple LMCache from compiled CUDA extensions by introducing a
complete set of pure-Python (+ NumPy/SciPy) fallback implementations
for every function previously only available through c_ops.
Key changes:
-- Centralize backend selection in lmcache/__init__.py with a
predicate-based registry that probes available backends at import
time and dispatches to the best one (CUDA > fallback).
-- Implement non_cuda_equivalents.py covering all c_ops surfaces:
rotary embedding, CDF calculation, KV-cache reshape/transfer,
encode/decode, pinned/NUMA memory alloc/free, memcpy, etc.
-- Add tests/v1/test_non_cuda_equivalents.py that runs each op
under three backends (CUDA c_ops, non-CUDA with GPU visible,
non-CUDA without GPU) and cross-compares results to ensure
numerical equivalence.
-- Adapt test skip logic to use torch.cuda.is_available() instead
of pytest.importorskip("lmcache.c_ops"), since c_ops import
now always succeeds via automatic fallback.
-- Remove CUDA-only import guards across the codebase so that
   lmcache can be installed and imported on machines without NVIDIA
   GPUs, or where the NVIDIA ops fail to import for any reason
   (CUDA version mismatch or others)
Signed-off-by: Tony Lin <tony.lin@intel.com>
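The CDF calculation listed among the fallback surfaces can be sketched in pure NumPy. This is an illustrative helper only (`compute_cdf` is a hypothetical name, not the actual `non_cuda_equivalents.py` API):

```python
import numpy as np

def compute_cdf(probs, precision=16):
    # Quantize a probability table into integer CDF bins, as arithmetic
    # coders typically require: cdf[0] == 0 and cdf[-1] == 2**precision.
    scale = 1 << precision
    cum = np.concatenate(
        [[0.0], np.cumsum(np.asarray(probs, dtype=np.float64))]
    )
    return np.round(cum / cum[-1] * scale).astype(np.uint32)
```

The resulting table is monotonically non-decreasing, which is what an arithmetic encode/decode pair needs to stay in sync.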
Replace pytest.importorskip("lmcache.c_ops") with a
torch.cuda.is_available() check, as importing c_ops will
always succeed now (either real CUDA ops or fallback).
Non-CUDA backends can still be tested on machines without
CUDA hardware.
Signed-off-by: Tony Lin <tony.lin@intel.com>
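The skip-logic change can be sketched like this, assuming torch and pytest are installed (`requires_cuda` and the test name are illustrative, not the suite's actual names):

```python
import pytest
import torch

# With the fallback in place, `import lmcache.c_ops` always succeeds, so
# hardware-only tests gate on torch.cuda.is_available() instead of
# pytest.importorskip("lmcache.c_ops").
requires_cuda = pytest.mark.skipif(
    not torch.cuda.is_available(), reason="requires CUDA hardware"
)

@requires_cuda
def test_cuda_matches_fallback():
    # Placeholder body: compare compiled-kernel output with the
    # pure-Python fallback here.
    pass
```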
lmcache.c_ops can now be imported safely regardless of CUDA availability. Python fallback is selected automatically on exceptions. Signed-off-by: Tony Lin <tony.lin@intel.com>
@@ -0,0 +1,71 @@
# SPDX-License-Identifier: Apache-2.0
Why do we need this __init__.py? Can we also use it to export an LMCache version in the future? Just curious.
This __init__.py is needed to handle conditional imports: it checks whether the compiled CUDA extensions are available and falls back to the pure-Python implementations if not.
Two main reasons for centralizing it here:
1. To avoid large-scale code changes where importing the C library could throw an exception at each individual call site.
2. Moving forward, this can serve as a single place to hook different kernel implementations per device, hiding device-specific details from the rest of the codebase.
And yes, we could definitely add a version export here in the future! Great idea — I can follow up with a separate PR for that.
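A minimal sketch of that conditional-import idea, with generic names (the real `lmcache/__init__.py` differs in detail):

```python
import importlib
import sys

def select_backend(candidates, alias):
    # Probe candidate backend modules in priority order; any import
    # failure (missing .so, CUDA version mismatch, ...) moves on to the
    # next candidate. The winner is aliased under `alias` so existing
    # `import <alias>` call sites keep working unchanged.
    for name in candidates:
        try:
            module = importlib.import_module(name)
        except Exception:
            continue
        sys.modules[alias] = module
        return module
    raise ImportError(f"no usable backend among {candidates!r}")
```

The `sys.modules` aliasing is the key trick: call sites that do `import lmcache.c_ops as lmc_ops` never need to know which backend won.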
sammshen left a comment:
I like this PR. Let's see if the CI will pass
@hlin99 can you see your unit test failures on AMD? Let me know if you need access to the Buildkite portal.
Hi @sammshen, thanks for pointing this out! I wasn't aware there was a recent API update in these kernels. I'm working on an upgrade now.
Signed-off-by: Tony Lin <tony.lin@intel.com>
- Remove module-level CUDA skip to allow tests in non-CUDA environments
- Add _cuda_available flag to track CUDA availability at module level
- Conditionally run comparison tests only when CUDA is available
- Run no-crash tests (verify execution without comparing results) when CUDA is unavailable
- Maintain backward compatibility: full comparison testing when CUDA is available
Signed-off-by: Tony Lin <tony.lin@intel.com>
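The conditional strategy above can be sketched with a plain-Python helper (hypothetical; the real suite operates on torch tensors and probes `_cuda_available` once at module level):

```python
def check_op(op, reference, inputs, cuda_available):
    # Run the op under test; choose the checking mode based on whether
    # CUDA hardware is available in this environment.
    out = op(*inputs)
    if cuda_available:
        # Full mode: compare against the reference backend's result.
        assert out == reference(*inputs), "backend mismatch"
    else:
        # No-crash mode: only verify the op executes and returns a value.
        assert out is not None
    return out
```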
- Modified scenario_rotary_embedding_k_fused to test both is_neox=True (NeoX-style, contiguous halves) and is_neox=False (GPT-J-style, interleaved)
- Each test case now saves results with distinct suffixes (_neox and _gptj)
- This ensures both code paths in the rotary_embedding_k_fused function are properly tested
- Addresses the critical test coverage gap identified in the coverage analysis
Signed-off-by: Tony Lin <tony.lin@intel.com>
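The two rotary layouts can be sketched in NumPy (hypothetical helpers mirroring the `is_neox` flag; the actual kernel updates torch tensors in place):

```python
import numpy as np

def rotate_neox(x, cos, sin):
    # NeoX style (is_neox=True): the first and second contiguous halves
    # of the head dimension form the rotation pairs.
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    return np.concatenate([x1 * cos - x2 * sin, x2 * cos + x1 * sin], axis=-1)

def rotate_gptj(x, cos, sin):
    # GPT-J style (is_neox=False): even/odd interleaved elements form
    # the rotation pairs.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x2 * cos + x1 * sin
    return out
```

With cos=1, sin=0 both reduce to the identity, and both preserve vector norm for any angle, which makes the two code paths easy to cross-check.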
- Added NB_NL_TWO_BS_NH_HS to format_cases in scenario_multi_layer_kv_transfer
- Updated comments to clarify code paths for different GPUKVFormat enums
- Created comprehensive coverage analysis document
- All applicable GPUKVFormat enums now covered in tests

Signed-off-by: Tony Lin <tony.lin@intel.com>
Thanks @chunxiaozheng for the detailed review.
Thank you @sammshen & @chunxiaozheng!
This PR should be able to merge after #3016.
@maobaolong @chunxiaozheng PTAL!
…thout compiled CUDA extensions LMCache#2591 Signed-off-by: baoloongmao <baoloongmao@tencent.com>
…thout compiled CUDA extensions LMCache#2591 (#16) Signed-off-by: baoloongmao <baoloongmao@tencent.com>
@hlin99 Maybe you forgot to update the
@DongDongJu Would you like to take a look at this PR?
Hi @maobaolong, you're right! I was aware of it. Since this PR was opened there have been some changes to c_ops; I only updated the necessary parts (signatures, enum definitions) here, because I really don't want this PR to grow bigger and bigger. I will raise a separate PR soon to fill the known gaps between c_ops and the non-CUDA equivalents. Anybody who wants to contribute is also welcome.
…iled CUDA extensions (LMCache#2591) Signed-off-by: Tony Lin <tony.lin@intel.com> Co-authored-by: Samuel Shen <slshen@tensormesh.ai> Co-authored-by: Martin Hickey <martin.hickey@ie.ibm.com>

What this PR does / why we need it:
Decouple LMCache from compiled CUDA extensions by introducing a complete set of pure-Python (+ NumPy/SciPy) fallback implementations for every function previously only available through c_ops.
Key changes:
-- Centralize backend selection in lmcache/__init__.py with a predicate-based registry that probes available backends at import
time and dispatches to the best one (CUDA > fallback).
-- Implement non_cuda_equivalents.py covering all c_ops surfaces: rotary embedding, CDF calculation, KV-cache reshape/transfer,
encode/decode, pinned/NUMA memory alloc/free, memcpy, etc.
-- Add tests/v1/test_non_cuda_equivalents.py that runs each op under three backends (CUDA c_ops, non-CUDA with GPU visible,
non-CUDA without GPU) and cross-compares results to ensure numerical equivalence.
-- Adapt test skip logic to use torch.cuda.is_available() instead of pytest.importorskip("lmcache.c_ops"), since c_ops import
now always succeeds via automatic fallback.
-- Remove CUDA-only import guards across the codebase so that lmcache can be installed and imported on machines without
NVIDIA GPUs (e.g., Intel Gaudi / Habana).
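The cross-backend comparison in the test suite can be sketched as follows (hypothetical helper; the actual suite saves per-backend results and compares them across runs):

```python
import numpy as np

def assert_backends_match(results_a, results_b, rtol=1e-5, atol=1e-6):
    # Each backend run records op name -> output array; the parity check
    # cross-compares the two result sets for numerical equivalence.
    assert set(results_a) == set(results_b), "backends ran different ops"
    for name in sorted(results_a):
        np.testing.assert_allclose(
            results_a[name], results_b[name],
            rtol=rtol, atol=atol, err_msg=f"mismatch in {name}",
        )
```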
Special notes for your reviewers:
If applicable:
[ No ] this PR contains user facing changes - docs added
[ Yes ] this PR contains unit tests
Note
Medium Risk
Medium risk because it changes how `lmcache.c_ops` is imported and dispatched at runtime and introduces many new Python implementations of performance- and memory-sensitive kernels (KV transfers, memcpy, arithmetic coding). Behavioral parity is guarded by new tests, but pointer-based tensor views and runtime library loading could surface platform-specific issues.

Overview
LMCache now selects an ops backend dynamically at import time:
`lmcache/__init__.py` probes candidates (currently CUDA) and otherwise falls back to `lmcache.non_cuda_equivalents`, then aliases the chosen module into `sys.modules["lmcache.c_ops"]` so existing `import lmcache.c_ops as lmc_ops` call sites keep working. `lmcache/non_cuda_equivalents.py` is expanded into a full pure-Python/NumPy/Numba fallback surface for previously CUDA-only ops, including pointer-to-tensor views, KV-cache transfer/reshape helpers, arithmetic encode/decode, CDF computation, rotary embedding update, PCI bus ID lookup, and an `lmcache_memcpy_async` fallback that can use `libcudart`/ROCm when available.
CUDA availability guards across the codebase are removed in favor of always importing `lmcache.c_ops` (now backend-dispatched), and tests are updated/added: a new parity suite (`test_non_cuda_equivalents.py` and `test_c_ops_fallback_parity.py`) validates signature/enum compatibility and numerical equivalence across backends, while existing CUDA-kernel tests adjust skip logic to avoid relying on `importorskip("lmcache.c_ops")`. Reviewed by Cursor Bugbot for commit 42c0da9.