
[ops][refactor] Add full list of Python fallbacks to run without compiled CUDA extensions#2591

Merged
ApostaC merged 59 commits into LMCache:dev from hlin99:non-cuda-extension
Apr 13, 2026

Conversation

@hlin99
Contributor

@hlin99 hlin99 commented Feb 12, 2026

What this PR does / why we need it:

Decouple LMCache from compiled CUDA extensions by introducing a complete set of pure-Python (+ NumPy/SciPy) fallback implementations for every function previously only available through c_ops.

Key changes:

-- Centralize backend selection in lmcache/__init__.py with a predicate-based registry that probes available backends at import
time and dispatches to the best one (CUDA > fallback).
-- Implement non_cuda_equivalents.py covering all c_ops surfaces: rotary embedding, CDF calculation, KV-cache reshape/transfer,
encode/decode, pinned/NUMA memory alloc/free, memcpy, etc.
-- Add tests/v1/test_non_cuda_equivalents.py that runs each op under three backends (CUDA c_ops, non-CUDA with GPU visible,
non-CUDA without GPU) and cross-compares results to ensure numerical equivalence.
-- Adapt test skip logic to use torch.cuda.is_available() instead of pytest.importorskip("lmcache.c_ops"), since c_ops import
now always succeeds via automatic fallback.
-- Remove CUDA-only import guards across the codebase so that lmcache can be installed and imported on machines without
NVIDIA GPUs (e.g., Intel Gaudi / Habana).
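The centralized selection described above can be sketched as a small predicate-based registry. This is a minimal illustration of the idea, not the actual PR code; the helper names `select_backend` and `install_alias` are hypothetical:

```python
import importlib
import sys
from typing import Callable, List, Tuple

# Ordered best-first: each entry pairs a module name with a predicate
# that reports whether that backend is usable on this machine.
def select_backend(backends: List[Tuple[str, Callable[[], bool]]]) -> str:
    for name, available in backends:
        if available():
            return name
    raise ImportError("no ops backend available")

def install_alias(module, alias: str = "lmcache.c_ops") -> None:
    # Register the chosen module under the legacy name so existing
    # `import lmcache.c_ops as lmc_ops` call sites keep working.
    sys.modules[alias] = module
```

With this shape, the CUDA backend's predicate would attempt to import the compiled extension inside a try/except, while the pure-Python fallback's predicate simply returns True, making it the guaranteed last resort.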

Special notes for your reviewers:

If applicable:

[ No ] this PR contains user facing changes - docs added
[ Yes ] this PR contains unit tests


Note

Medium Risk
Medium risk because it changes how lmcache.c_ops is imported and dispatched at runtime and introduces many new Python implementations of performance- and memory-sensitive kernels (KV transfers, memcpy, arithmetic coding). Behavioral parity is guarded by new tests, but pointer-based tensor views and runtime library loading could surface platform-specific issues.

Overview
LMCache now selects an ops backend dynamically at import time: lmcache/__init__.py probes candidates (currently CUDA) and otherwise falls back to lmcache.non_cuda_equivalents, then aliases the chosen module into sys.modules["lmcache.c_ops"] so existing import lmcache.c_ops as lmc_ops call sites keep working.

lmcache/non_cuda_equivalents.py is expanded into a full pure-Python/NumPy/Numba fallback surface for previously CUDA-only ops, including pointer-to-tensor views, KV-cache transfer/reshape helpers, arithmetic encode/decode, CDF computation, rotary embedding update, PCI bus ID lookup, and a lmcache_memcpy_async fallback that can use libcudart/ROCm when available.
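Pointer-to-tensor views, one of the trickier fallback surfaces mentioned above, can be built in pure NumPy along these lines. A minimal sketch only: `view_from_ptr` is a hypothetical name, and the real fallback operates on torch tensors rather than plain ndarrays:

```python
import ctypes
import numpy as np

def view_from_ptr(ptr: int, shape, dtype=np.float32) -> np.ndarray:
    """Interpret existing memory at `ptr` as an ndarray without copying."""
    count = int(np.prod(shape))
    nbytes = count * np.dtype(dtype).itemsize
    # Wrap the raw address in a ctypes buffer, then view it as an array.
    buf = (ctypes.c_byte * nbytes).from_address(ptr)
    return np.frombuffer(buf, dtype=dtype).reshape(shape)
```

Because the view shares memory with the source buffer, writes through it mutate the original allocation, which is the behavior pointer-based KV transfer helpers depend on. The caller must keep the underlying allocation alive for the lifetime of the view.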

CUDA availability guards across the codebase are removed in favor of always importing lmcache.c_ops (now backend-dispatched), and tests are updated/added: a new parity suite (test_non_cuda_equivalents.py and test_c_ops_fallback_parity.py) validates signature/enum compatibility and numerical equivalence across backends, while existing CUDA-kernel tests adjust skip logic to avoid relying on importorskip("lmcache.c_ops").

Reviewed by Cursor Bugbot for commit 42c0da9.

@gemini-code-assist
Contributor

Summary of Changes

Hello @hlin99, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the flexibility and portability of the LMCache library by decoupling it from mandatory compiled CUDA extensions. It achieves this by providing robust pure-Python fallback implementations for all core operations and introducing a dynamic backend selection mechanism. This change allows LMCache to function seamlessly on systems without NVIDIA GPUs, broadening its compatibility and ease of use across diverse hardware environments.

Highlights

  • Centralized Backend Selection: Implemented a predicate-based registry in lmcache/__init__.py to dynamically select the best available backend (CUDA > fallback) at import time, ensuring lmcache can be installed and imported without NVIDIA GPUs.
  • Pure-Python Fallback Implementations: Introduced lmcache/non_cuda_equivalents.py with a complete set of pure-Python (+ NumPy/SciPy) fallback implementations for all operations previously exclusive to c_ops, including rotary embedding, CDF calculation, KV-cache reshape/transfer, encode/decode, and pinned/NUMA memory management.
  • Numerical Equivalence Testing: Added tests/v1/test_non_cuda_equivalents.py to rigorously cross-compare results of each operation under CUDA c_ops, non-CUDA with GPU visible, and non-CUDA without GPU, ensuring numerical equivalence across backends.
  • Simplified Import Logic: Removed CUDA-only import guards (if torch.cuda.is_available():) across the codebase, as c_ops import now always succeeds due to automatic fallback, simplifying module dependencies.
  • Updated Test Skip Logic: Adapted test skip logic in tests/v1/test_mem_kernels.py to use torch.cuda.is_available() directly, reflecting the new backend selection mechanism.


Changelog
  • lmcache/__init__.py
    • Added dynamic backend selection logic to prioritize CUDA operations and fall back to Python equivalents.
  • lmcache/non_cuda_equivalents.py
    • Implemented pure-Python fallback functions for multi-layer KV transfer, single-layer KV transfer (vLLM and SGLang formats), flash attention reshape operations, asynchronous memory copy, arithmetic encoding and decoding, CDF calculation, rotary embedding, and GPU PCI bus ID retrieval.
    • Added TransferDirection enum for memory copy operations.
  • lmcache/storage_backend/serde/cachegen_decoder.py
    • Removed conditional import of lmcache.c_ops.
  • lmcache/storage_backend/serde/cachegen_encoder.py
    • Removed conditional import of lmcache.c_ops.
  • lmcache/v1/compute/positional_encoding.py
    • Removed conditional import of lmcache.c_ops.
  • lmcache/v1/gpu_connector/gpu_connectors.py
    • Removed conditional import of lmcache.c_ops.
  • lmcache/v1/gpu_connector/gpu_ops.py
    • Removed conditional import of lmcache.c_ops.
  • lmcache/v1/lazy_memory_allocator.py
    • Removed conditional imports for lmc_ops, now directly importing lmcache.c_ops.
  • lmcache/v1/memory_management.py
    • Removed conditional imports for lmc_ops, now directly importing lmcache.c_ops.
  • tests/v1/test_mem_kernels.py
    • Updated test skip logic to check torch.cuda.is_available() directly instead of pytest.importorskip.
  • tests/v1/test_non_cuda_equivalents.py
    • Added new test suite to compare CUDA and non-CUDA fallback implementations for numerical equivalence across various operations.
    • Implemented a child process execution model for testing different backend modes (CUDA_OPS, NON_CUDA with GPU, NON_CUDA without GPU).
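The child-process execution model listed above can be sketched as follows. The `LMCACHE_TEST_MODE` environment variable and the inline payload are illustrative stand-ins for the real test file, which the child re-runs under each backend mode:

```python
import os
import subprocess
import sys

def run_scenario(mode: str, cuda_visible: str) -> subprocess.CompletedProcess:
    """Run one backend scenario in a fresh interpreter, since backend
    selection happens at import time and must be exercised per mode."""
    env = dict(os.environ)
    env["LMCACHE_TEST_MODE"] = mode            # hypothetical env var
    env["CUDA_VISIBLE_DEVICES"] = cuda_visible  # "" hides all GPUs
    payload = "import os; print(os.environ['LMCACHE_TEST_MODE'])"
    return subprocess.run(
        [sys.executable, "-c", payload],
        env=env, capture_output=True, text=True,
    )
```

Spawning a fresh interpreter per scenario is the key design choice: once `lmcache.c_ops` has been aliased in `sys.modules`, the backend cannot be swapped within the same process.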

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a significant and well-executed refactoring to decouple lmcache from compiled CUDA extensions. By adding pure Python fallbacks for all CUDA operations and implementing a dynamic backend selection mechanism, the library can now be installed and used on machines without NVIDIA GPUs. The changes are extensive and include:

  • A new lmcache/__init__.py for dynamic backend dispatching.
  • A comprehensive lmcache/non_cuda_equivalents.py with Python implementations for all c_ops.
  • An excellent new test suite in tests/v1/test_non_cuda_equivalents.py that validates the numerical equivalence between the CUDA and Python backends under various conditions.
  • Removal of CUDA-specific import guards throughout the codebase.

The approach is robust, particularly the testing strategy which runs scenarios in different environments and compares results. I have a couple of suggestions to improve safety and test correctness, but overall this is a high-quality contribution.

Comment thread lmcache/non_cuda_equivalents.py Outdated
Comment thread tests/v1/test_non_cuda_equivalents.py Outdated
@hlin99
Contributor Author

hlin99 commented Feb 14, 2026

Hi @hickeyma @sammshen, I see you're the creators of the non-CUDA equivalents. Can you take some time to review this PR and let me know your thoughts? Many thanks for your time.

@hickeyma hickeyma self-requested a review February 23, 2026 17:44
Collaborator

@hickeyma hickeyma left a comment


Thanks @hlin99 for pushing this PR. I'll review this shortly.

Collaborator

@hickeyma hickeyma left a comment


@hlin99 What about installing LMCache from source? AFAIK, this requires CUDA?

@hlin99
Contributor Author

hlin99 commented Feb 24, 2026

@hlin99 What about installing LMCache from source? AFAIK, this requires CUDA?

Hi @hickeyma, thanks for your review. I did not encounter the issue when installing from source, which is probably why people are not complaining about crashes when importing c_ops. Viewed purely from a logic perspective, though, cuda.is_available() being True does not guarantee that the C library is present and can be safely imported; there may be other reasons for the import to fail. So a fallback path with a try/except block is a safe way to handle this properly.
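The try/except fallback argued for here can be expressed generically. `import_with_fallback` is an illustrative helper, not LMCache code; note that it catches broad `Exception` rather than just `ModuleNotFoundError`:

```python
import importlib

def import_with_fallback(primary: str, fallback: str):
    """Prefer the compiled module, but fall back on ANY failure:
    a present-but-broken extension can raise ImportError or OSError
    at load time (e.g. a CUDA version mismatch), not just
    ModuleNotFoundError when the module is absent."""
    try:
        return importlib.import_module(primary)
    except Exception:
        return importlib.import_module(fallback)
```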

@hlin99 hlin99 force-pushed the non-cuda-extension branch from 6a7496c to a18b286 Compare March 2, 2026 06:27
@hlin99
Contributor Author

hlin99 commented Mar 2, 2026

resolve conflicts and rebase to latest.

hlin99 added 7 commits March 2, 2026 06:28
…iled CUDA extensions

Decouple LMCache from compiled CUDA extensions by introducing a
complete set of pure-Python (+ NumPy/SciPy) fallback implementations
for every function previously only available through c_ops.

Key changes:

  -- Centralize backend selection in lmcache/__init__.py with a
predicate-based registry that probes available backends at import
time and dispatches to the best one (CUDA > fallback).
  -- Implement non_cuda_equivalents.py covering all c_ops surfaces:
rotary embedding, CDF calculation, KV-cache reshape/transfer,
encode/decode, pinned/NUMA memory alloc/free, memcpy, etc.
  -- Add tests/v1/test_non_cuda_equivalents.py that runs each op
under three backends (CUDA c_ops, non-CUDA with GPU visible,
non-CUDA without GPU) and cross-compares results to ensure
numerical equivalence.
  -- Adapt test skip logic to use torch.cuda.is_available() instead
of pytest.importorskip("lmcache.c_ops"), since c_ops import
now always succeeds via automatic fallback.
  -- Remove CUDA-only import guards across the codebase so that
lmcache can be installed and imported on machines without NVIDIA
GPUs, or where c_ops fails to import for any reason
(CUDA version mismatch or others).

Signed-off-by: Tony Lin <tony.lin@intel.com>
Signed-off-by: Tony Lin <tony.lin@intel.com>
Signed-off-by: Tony Lin <tony.lin@intel.com>
Replace pytest.importorskip("lmcache.c_ops") with a
torch.cuda.is_available() check, as importing c_ops will
always succeed now (either real CUDA ops or fallback).

Non-CUDA backends can still be tested on machines without
CUDA hardware.

Signed-off-by: Tony Lin <tony.lin@intel.com>
Signed-off-by: Tony Lin <tony.lin@intel.com>
lmcache.c_ops can now be imported safely regardless of CUDA availability.
Python fallback is selected automatically on exceptions.

Signed-off-by: Tony Lin <tony.lin@intel.com>
Comment thread lmcache/__init__.py
@@ -0,0 +1,71 @@
# SPDX-License-Identifier: Apache-2.0
Contributor


Why do we need this __init__.py? Can we also use it to export an LMCache version in the future? Just curious.

Contributor Author


This __init__.py is needed to handle conditional imports: it checks whether the compiled CUDA extensions are available and falls back to the pure-Python implementations if not.

Two main reasons for centralizing it here:

1. To avoid large-scale code changes where importing the C library could throw an exception at each individual call site.
2. Moving forward, this can serve as a single place to hook different kernel implementations per device, hiding device-specific details from the rest of the codebase.

And yes, we could definitely add a version export here in the future! Great idea; I can follow up with a separate PR for that.

Copy link
Copy Markdown
Contributor

@sammshen sammshen left a comment


I like this PR. Let's see if the CI will pass

@sammshen
Contributor

sammshen commented Mar 4, 2026

@hlin99 can you see your unit test failures on AMD?


@pytest.fixture(scope="module")
def run_all_children():
    """Launch 3 child processes. Runs once for the entire module."""
    if RESULTS_DIR.exists():
        shutil.rmtree(RESULTS_DIR)

    for mode, cuda_vis in [("CUDA_OPS", "0"), ("NON_CUDA", "0"), ("NON_CUDA", "")]:
        r = run_scenario(mode, cuda_vis)
>       assert r.returncode == 0, (
            f"Scenario {mode}/CUDA_VISIBLE_DEVICES='{cuda_vis}' failed:\n"
            f"{r.stdout}\n{r.stderr}"
        )
E           AssertionError: Scenario CUDA_OPS/CUDA_VISIBLE_DEVICES='0' failed:
E
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[transfer_direction_enum] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             PASSED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[multi_layer_kv_transfer] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             FAILED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[multi_layer_kv_transfer_unilateral] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             FAILED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[single_layer_kv_transfer] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             FAILED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[single_layer_kv_transfer_sgl] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             FAILED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[load_and_reshape_flash] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             PASSED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[reshape_and_cache_back_flash] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             PASSED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[lmcache_memcpy_async] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             PASSED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[encode_fast_new] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             PASSED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[decode_fast_new] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             PASSED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[decode_fast_prefsum] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             PASSED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[calculate_cdf] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             PASSED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[rotary_embedding_k_fused] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             PASSED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[alloc_free_pinned_ptr] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             PASSED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[alloc_free_pinned_numa_ptr] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             PASSED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[alloc_free_numa_ptr] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             PASSED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[get_gpu_pci_bus_id] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             PASSED
E
E             =================================== FAILURES ===================================
E             ____________________ test_scenario[multi_layer_kv_transfer] ____________________
E
E             name = 'multi_layer_kv_transfer'
E
E                 @pytest.mark.parametrize("name", list(SCENARIO_REGISTRY.keys()))
E                 def test_scenario(name):
E             >       SCENARIO_REGISTRY[name]()
E
E             tests/v1/test_non_cuda_equivalents.py:1400:
E             _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
E
E                 def scenario_multi_layer_kv_transfer():
E                     ops, scene_info = get_test_context()
E                     is_cuda_backend = scene_info.startswith("cuda_ops")
E
E                     torch.manual_seed(42)
E                     if torch.cuda.is_available():
E                         torch.cuda.manual_seed_all(42)
E
E                     device = f"cuda:{torch.cuda.current_device()}" if is_cuda_backend else "cpu"
E
E                     num_layers = 2
E                     num_tokens = 4
E                     head_size = 16
E                     page_buffer_size = 10
E                     dtype = torch.float32
E
E                     slot_mapping = torch.tensor(
E                         [0, 2, 5, 9],
E                         dtype=torch.int64,
E                         device=device,
E                     )
E
E                     for direction in [True, False]:
E                         dir_tag = "paged2lmc" if direction else "lmc2paged"
E
E                         # 1. LMCache Tensor
E                         lmc_shape = (2, num_layers, num_tokens, head_size)
E                         key_value = torch.zeros(
E                             lmc_shape,
E                             dtype=dtype,
E                             device=device,
E                         )
E                         if not direction:  # LMC → Paged
E                             for ly in range(num_layers):
E                                 for t in range(num_tokens):
E                                     val = (
E                                         ly * 1000 + t * 10 + torch.arange(head_size, device=device)
E                                     ).to(dtype)
E                                     key_value[0, ly, t] = val
E                                     key_value[1, ly, t] = val + 500
E
E                         # 2. Paged Buffers
E                         page_buffers = []
E                         for ly in range(num_layers):
E                             pb = torch.zeros(
E                                 (2, page_buffer_size, head_size),
E                                 dtype=dtype,
E                                 device=device,
E                             )
E                             if direction:  # Paged → LMC
E                                 for s in range(page_buffer_size):
E                                     val = (
E                                         ly * 2000
E                                         + s * 10
E                                         + torch.arange(
E                                             head_size,
E                                             device=device,
E                                         )
E                                     ).to(dtype)
E                                     pb[0, s] = val
E                                     pb[1, s] = val + 700
E                             page_buffers.append(pb)
E
E                         # 3. Pointer Tensor
E                         key_value_ptrs = torch.tensor(
E                             [pb.data_ptr() for pb in page_buffers],
E                             dtype=torch.int64,
E                             device=device,
E                         )
E
E                         # 4. Execute
E             >           ops.multi_layer_kv_transfer(
E                             key_value,
E                             key_value_ptrs,
E                             slot_mapping,
E                             torch.device(device),
E                             page_buffer_size,
E                             direction,
E                             False,  # use_mla
E                         )
E             E           TypeError: multi_layer_kv_transfer(): incompatible function arguments. The following argument types are supported:
E             E               1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, arg3: torch.device, arg4: typing.SupportsInt, arg5: lmcache.c_ops.TransferDirection, arg6: lmcache.c_ops.GPUKVFormat, arg7: typing.SupportsInt) -> None
E             E
E             E           Invoked with: tensor([[[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
E             E
E             E                    [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]],
E             E
E             E
E             E                   [[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
E             E
E             E                    [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]]],
E             E                  device='cuda:0'), tensor([140133642798592, 140133642801152], device='cuda:0'), tensor([0, 2, 5, 9], device='cuda:0'), device(type='cuda', index=0), 10, True, False
E
E             tests/v1/test_non_cuda_equivalents.py:1093: TypeError
E             ______________ test_scenario[multi_layer_kv_transfer_unilateral] _______________
E
E             name = 'multi_layer_kv_transfer_unilateral'
E
E                 @pytest.mark.parametrize("name", list(SCENARIO_REGISTRY.keys()))
E                 def test_scenario(name):
E             >       SCENARIO_REGISTRY[name]()
E
E             tests/v1/test_non_cuda_equivalents.py:1400:
E             _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
E
E                 def scenario_multi_layer_kv_transfer_unilateral():
E                     ops, scene_info = get_test_context()
E                     is_cuda_backend = scene_info.startswith("cuda_ops")
E
E                     torch.manual_seed(42)
E                     if torch.cuda.is_available():
E                         torch.cuda.manual_seed_all(42)
E
E                     device = f"cuda:{torch.cuda.current_device()}" if is_cuda_backend else "cpu"
E
E                     num_layers = 2
E                     num_tokens = 4
E                     head_size = 16
E                     page_buffer_size = 10
E                     dtype = torch.float32
E
E                     slot_mapping = torch.tensor(
E                         [1, 3, 4, 7],
E                         dtype=torch.int64,
E                         device=device,
E                     )
E
E                     for direction in [True, False]:
E                         dir_tag = "p2l" if direction else "l2p"
E
E                         # LMC Layout: [2, num_layers, num_tokens, head_size]
E                         lmc_shape = (2, num_layers, num_tokens, head_size)
E                         lmc_tensor = torch.zeros(
E                             lmc_shape,
E                             dtype=dtype,
E                             device=device,
E                         )
E
E                         if not direction:  # LMC → Paged
E                             for kv in range(2):
E                                 for ly in range(num_layers):
E                                     for t in range(num_tokens):
E                                         val = (
E                                             kv * 5000
E                                             + ly * 1000
E                                             + t * 10
E                                             + torch.arange(
E                                                 head_size,
E                                                 device=device,
E                                             )
E                                         ).to(dtype)
E                                         lmc_tensor[kv, ly, t] = val
E
E                         # 1. Paged Buffers
E                         buffers = {}
E                         for kv in range(2):
E                             for ly in range(num_layers):
E                                 pb = torch.zeros(
E                                     (page_buffer_size, head_size),
E                                     dtype=dtype,
E                                     device=device,
E                                 )
E                                 if direction:  # Paged → LMC
E                                     val = (
E                                         kv * 7000
E                                         + ly * 2000
E                                         + torch.arange(
E                                             head_size,
E                                             device=device,
E                                         )
E                                     ).to(dtype)
E                                     for s in range(page_buffer_size):
E                                         pb[s] = val + (s * 10)
E                                 buffers[(kv, ly)] = pb
E
E                         # 2. Grouped Pointer Tensor
E                         # C++: ptrs[layer_id] = Key,
E                         #      ptrs[layer_id + num_layers] = Value
E                         ptr_list = []
E                         for ly in range(num_layers):
E                             ptr_list.append(buffers[(0, ly)].data_ptr())
E                         for ly in range(num_layers):
E                             ptr_list.append(buffers[(1, ly)].data_ptr())
E
E                         key_value_ptrs = torch.tensor(
E                             ptr_list,
E                             dtype=torch.int64,
E                             device=device,
E                         ).contiguous()
E
E                         # 3. Execute
E             >           ops.multi_layer_kv_transfer_unilateral(
E                             lmc_tensor,
E                             key_value_ptrs,
E                             slot_mapping,
E                             torch.device(device),
E                             page_buffer_size,
E                             direction,
E                             False,  # use_mla
E                         )
E             E           TypeError: multi_layer_kv_transfer_unilateral(): incompatible function arguments. The following argument types are supported:
E             E               1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, arg3: torch.device, arg4: typing.SupportsInt, arg5: lmcache.c_ops.TransferDirection, arg6: lmcache.c_ops.GPUKVFormat) -> None
E             E
E             E           Invoked with: tensor([[[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
E             E
E             E                    [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]],
E             E
E             E
E             E                   [[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
E             E
E             E                    [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]]],
E             E                  device='cuda:0'), tensor([140133642804224, 140133642805760, 140133642807296, 140133642808320],
E             E                  device='cuda:0'), tensor([1, 3, 4, 7], device='cuda:0'), device(type='cuda', index=0), 10, True, False
E
E             tests/v1/test_non_cuda_equivalents.py:1213: TypeError
E             ___________________ test_scenario[single_layer_kv_transfer] ____________________
E
E             name = 'single_layer_kv_transfer'
E
E                 @pytest.mark.parametrize("name", list(SCENARIO_REGISTRY.keys()))
E                 def test_scenario(name):
E             >       SCENARIO_REGISTRY[name]()
E
E             tests/v1/test_non_cuda_equivalents.py:1400:
E             _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
E
E                 def scenario_single_layer_kv_transfer():
E                     ops, scene_info = get_test_context()
E                     is_cuda_backend = scene_info.startswith("cuda_ops")
E
E                     torch.manual_seed(42)
E                     if torch.cuda.is_available():
E                         torch.cuda.manual_seed_all(42)
E
E                     device = f"cuda:{torch.cuda.current_device()}" if is_cuda_backend else "cpu"
E
E                     num_tokens = 64
E                     num_blocks = 256
E                     block_size = 16
E                     num_heads = 12
E                     head_size = 64
E                     hidden_size = num_heads * head_size
E
E                     slot_mapping = torch.arange(
E                         0,
E                         num_tokens * 2,
E                         2,
E                         device=device,
E                     ).to(torch.int64)
E
E                     # (use_mla, token_major, vllm_two_major, direction)
E                     # direction: False = LMC→vLLM, True = vLLM→LMC
E                     test_cases = [
E                         (False, True, True, False),
E                         (False, False, False, False),
E                         (False, True, True, True),
E                         (True, True, True, False),
E                         (True, True, True, True),
E                     ]
E
E                     for use_mla, token_major, vllm_two_major, direction in test_cases:
E                         dir_tag = "v2l" if direction else "l2v"
E                         case_desc = (
E                             f"MLA={use_mla}, TM={token_major}, 2Maj={vllm_two_major}, Dir={dir_tag}"
E                         )
E
E                         # 1. Setup Shapes
E                         if use_mla:
E                             lmc_shape = (num_tokens, hidden_size)
E                             vllm_shape = (num_blocks, block_size, hidden_size)
E                         else:
E                             lmc_shape = (
E                                 (num_tokens, 2, hidden_size)
E                                 if token_major
E                                 else (2, num_tokens, hidden_size)
E                             )
E                             if vllm_two_major:
E                                 vllm_shape = (
E                                     2,
E                                     num_blocks,
E                                     block_size,
E                                     num_heads,
E                                     head_size,
E                                 )
E                             else:
E                                 vllm_shape = (
E                                     num_blocks,
E                                     2,
E                                     block_size,
E                                     num_heads,
E                                     head_size,
E                                 )
E
E                         # 2. Deterministic Data
E                         lmc_size = 1
E                         for s in lmc_shape:
E                             lmc_size *= s
E                         vllm_size = 1
E                         for s in vllm_shape:
E                             vllm_size *= s
E
E                         lmc_tensor = (
E                             (torch.arange(lmc_size, device=device) % 1000)
E                             .to(torch.float16)
E                             .reshape(lmc_shape)
E                         )
E                         vllm_tensor = (
E                             (torch.arange(vllm_size, device=device) % 1000)
E                             .to(torch.float16)
E                             .reshape(vllm_shape)
E                         )
E
E                         # 3. Golden Reference
E                         lmc_ref = lmc_tensor.clone()
E                         vllm_ref = vllm_tensor.clone()
E                         block_indices = slot_mapping // block_size
E                         block_offsets = slot_mapping % block_size
E
E                         if not direction:  # LMC → vLLM
E                             if use_mla:
E                                 vllm_ref[block_indices, block_offsets, :] = lmc_ref
E                             else:
E                                 src = lmc_ref if token_major else lmc_ref.permute(1, 0, 2)
E                                 src = src.view(
E                                     num_tokens,
E                                     2,
E                                     num_heads,
E                                     head_size,
E                                 )
E                                 if vllm_two_major:
E                                     vllm_ref[0, block_indices, block_offsets] = src[:, 0, :, :]
E                                     vllm_ref[1, block_indices, block_offsets] = src[:, 1, :, :]
E                                 else:
E                                     vllm_ref[block_indices, 0, block_offsets] = src[:, 0, :, :]
E                                     vllm_ref[block_indices, 1, block_offsets] = src[:, 1, :, :]
E                         else:  # vLLM → LMC
E                             if use_mla:
E                                 lmc_ref = vllm_ref[block_indices, block_offsets, :]
E                             else:
E                                 if vllm_two_major:
E                                     k = vllm_ref[0, block_indices, block_offsets]
E                                     v = vllm_ref[1, block_indices, block_offsets]
E                                 else:
E                                     k = vllm_ref[block_indices, 0, block_offsets]
E                                     v = vllm_ref[block_indices, 1, block_offsets]
E                                 combined = torch.stack(
E                                     [k, v],
E                                     dim=1,
E                                 ).view(num_tokens, 2, hidden_size)
E                                 lmc_ref = combined if token_major else combined.permute(1, 0, 2)
E
E                         # 4. Execute
E             >           ops.single_layer_kv_transfer(
E                             lmc_tensor,
E                             vllm_tensor,
E                             slot_mapping,
E                             direction,
E                             token_major,
E                             vllm_two_major,
E                             use_mla,

Let me know if you need access to the Buildkite portal
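All three KV-transfer failures share the same root cause visible in the TypeError messages: the compiled `c_ops` bindings now expect `lmcache.c_ops.TransferDirection` and `lmcache.c_ops.GPUKVFormat` enum values (and, for `multi_layer_kv_transfer`, an extra trailing integer argument), while the test scenarios still pass plain Python bools for `direction` and `use_mla`. A minimal sketch of the kind of shim the scenarios could use to bridge the two conventions — the enum member names and the `to_enum_args` helper below are hypothetical stand-ins, not the actual `lmcache.c_ops` API:

```python
from enum import Enum


# Hypothetical stand-ins for the pybind11 enums exposed by lmcache.c_ops;
# the real member names in the compiled extension may differ.
class TransferDirection(Enum):
    LMC_TO_PAGED = 0  # corresponds to direction=False in the old bool API
    PAGED_TO_LMC = 1  # corresponds to direction=True


class GPUKVFormat(Enum):
    STANDARD = 0  # corresponds to use_mla=False
    MLA = 1       # corresponds to use_mla=True


def to_enum_args(direction: bool, use_mla: bool):
    """Map the scenarios' legacy bool flags onto the enum types the
    new bindings expect, so one call site works with both conventions."""
    d = TransferDirection.PAGED_TO_LMC if direction else TransferDirection.LMC_TO_PAGED
    f = GPUKVFormat.MLA if use_mla else GPUKVFormat.STANDARD
    return d, f
```

With a helper like this, each `ops.multi_layer_kv_transfer(..., direction, use_mla)` call site would unpack `to_enum_args(direction, use_mla)` instead of forwarding the raw bools, which pybind11 refuses to coerce into the enum parameter types.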

@hlin99

hlin99 commented Mar 4, 2026

@hlin99 can you see your unit test failures on AMD?


@pytest.fixture(scope="module")
def run_all_children():
    """Launch 3 child processes. Runs once for the entire module."""
    if RESULTS_DIR.exists():
        shutil.rmtree(RESULTS_DIR)

    for mode, cuda_vis in [("CUDA_OPS", "0"), ("NON_CUDA", "0"), ("NON_CUDA", "")]:
        r = run_scenario(mode, cuda_vis)
>       assert r.returncode == 0, (
            f"Scenario {mode}/CUDA_VISIBLE_DEVICES='{cuda_vis}' failed:\n"
            f"{r.stdout}\n{r.stderr}"
        )
E           AssertionError: Scenario CUDA_OPS/CUDA_VISIBLE_DEVICES='0' failed:
E
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[transfer_direction_enum] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             PASSED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[multi_layer_kv_transfer] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             FAILED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[multi_layer_kv_transfer_unilateral] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             FAILED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[single_layer_kv_transfer] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             FAILED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[single_layer_kv_transfer_sgl] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             FAILED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[load_and_reshape_flash] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             PASSED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[reshape_and_cache_back_flash] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             PASSED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[lmcache_memcpy_async] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             PASSED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[encode_fast_new] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             PASSED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[decode_fast_new] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             PASSED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[decode_fast_prefsum] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             PASSED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[calculate_cdf] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             PASSED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[rotary_embedding_k_fused] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             PASSED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[alloc_free_pinned_ptr] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             PASSED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[alloc_free_pinned_numa_ptr] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             PASSED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[alloc_free_numa_ptr] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             PASSED
E             tests/v1/test_non_cuda_equivalents.py::test_scenario[get_gpu_pci_bus_id] >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             >>> Importing lmcache.c_ops as ops (Mode: CUDA_OPS)
E             PASSED
E
E             =================================== FAILURES ===================================
E             ____________________ test_scenario[multi_layer_kv_transfer] ____________________
E
E             name = 'multi_layer_kv_transfer'
E
E                 @pytest.mark.parametrize("name", list(SCENARIO_REGISTRY.keys()))
E                 def test_scenario(name):
E             >       SCENARIO_REGISTRY[name]()
E
E             tests/v1/test_non_cuda_equivalents.py:1400:
E             _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
E
E                 def scenario_multi_layer_kv_transfer():
E                     ops, scene_info = get_test_context()
E                     is_cuda_backend = scene_info.startswith("cuda_ops")
E
E                     torch.manual_seed(42)
E                     if torch.cuda.is_available():
E                         torch.cuda.manual_seed_all(42)
E
E                     device = f"cuda:{torch.cuda.current_device()}" if is_cuda_backend else "cpu"
E
E                     num_layers = 2
E                     num_tokens = 4
E                     head_size = 16
E                     page_buffer_size = 10
E                     dtype = torch.float32
E
E                     slot_mapping = torch.tensor(
E                         [0, 2, 5, 9],
E                         dtype=torch.int64,
E                         device=device,
E                     )
E
E                     for direction in [True, False]:
E                         dir_tag = "paged2lmc" if direction else "lmc2paged"
E
E                         # 1. LMCache Tensor
E                         lmc_shape = (2, num_layers, num_tokens, head_size)
E                         key_value = torch.zeros(
E                             lmc_shape,
E                             dtype=dtype,
E                             device=device,
E                         )
E                         if not direction:  # LMC → Paged
E                             for ly in range(num_layers):
E                                 for t in range(num_tokens):
E                                     val = (
E                                         ly * 1000 + t * 10 + torch.arange(head_size, device=device)
E                                     ).to(dtype)
E                                     key_value[0, ly, t] = val
E                                     key_value[1, ly, t] = val + 500
E
E                         # 2. Paged Buffers
E                         page_buffers = []
E                         for ly in range(num_layers):
E                             pb = torch.zeros(
E                                 (2, page_buffer_size, head_size),
E                                 dtype=dtype,
E                                 device=device,
E                             )
E                             if direction:  # Paged → LMC
E                                 for s in range(page_buffer_size):
E                                     val = (
E                                         ly * 2000
E                                         + s * 10
E                                         + torch.arange(
E                                             head_size,
E                                             device=device,
E                                         )
E                                     ).to(dtype)
E                                     pb[0, s] = val
E                                     pb[1, s] = val + 700
E                             page_buffers.append(pb)
E
E                         # 3. Pointer Tensor
E                         key_value_ptrs = torch.tensor(
E                             [pb.data_ptr() for pb in page_buffers],
E                             dtype=torch.int64,
E                             device=device,
E                         )
E
E                         # 4. Execute
E             >           ops.multi_layer_kv_transfer(
E                             key_value,
E                             key_value_ptrs,
E                             slot_mapping,
E                             torch.device(device),
E                             page_buffer_size,
E                             direction,
E                             False,  # use_mla
E                         )
E             E           TypeError: multi_layer_kv_transfer(): incompatible function arguments. The following argument types are supported:
E             E               1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, arg3: torch.device, arg4: typing.SupportsInt, arg5: lmcache.c_ops.TransferDirection, arg6: lmcache.c_ops.GPUKVFormat, arg7: typing.SupportsInt) -> None
E             E
E             E           Invoked with: tensor([[[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
E             E
E             E                    [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]],
E             E
E             E
E             E                   [[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
E             E
E             E                    [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
E             E                     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]]],
E             E                  device='cuda:0'), tensor([140133642798592, 140133642801152], device='cuda:0'), tensor([0, 2, 5, 9], device='cuda:0'), device(type='cuda', index=0), 10, True, False
E
E             tests/v1/test_non_cuda_equivalents.py:1093: TypeError
E                                     v = vllm_ref[block_indices, 1, block_offsets]
E                                 combined = torch.stack(
E                                     [k, v],
E                                     dim=1,
E                                 ).view(num_tokens, 2, hidden_size)
E                                 lmc_ref = combined if token_major else combined.permute(1, 0, 2)
E
E                         # 4. Execute
E             >           ops.single_layer_kv_transfer(
E                             lmc_tensor,
E                             vllm_tensor,
E                             slot_mapping,
E                             direction,
E                             token_major,
E                             vllm_two_major,
E                             use_mla,

Let me know if you need access to the Buildkite portal

Hi @sammshen, thanks for pointing this out! I wasn't aware there had been a recent API update in these kernels. I'm working on an upgrade now.
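For readers following along: the MLA branch of the golden-reference logic in the test output above reduces to a slot-mapping scatter/gather between the flat LMC layout and vLLM's paged layout. Here is a minimal NumPy sketch under toy shapes (the names and sizes are illustrative, not the c_ops API):

```python
import numpy as np

# Toy shapes: 8 tokens, hidden size 16, 4 blocks of 4 slots each.
num_tokens, hidden_size = 8, 16
num_blocks, block_size = 4, 4

# Each token is assigned a flat slot index in the paged vLLM cache.
slot_mapping = np.array([0, 1, 2, 3, 8, 9, 10, 11])

lmc = (np.arange(num_tokens * hidden_size) % 1000).astype(np.float16)
lmc = lmc.reshape(num_tokens, hidden_size)
vllm = np.zeros((num_blocks, block_size, hidden_size), dtype=np.float16)

block_indices = slot_mapping // block_size   # which block each token lands in
block_offsets = slot_mapping % block_size    # position within that block

# LMC -> vLLM (direction == False in the test): scatter rows into pages.
vllm[block_indices, block_offsets, :] = lmc

# vLLM -> LMC: the gather is the exact inverse, so the round trip is lossless.
recovered = vllm[block_indices, block_offsets, :]
assert np.array_equal(recovered, lmc)
```

The non-MLA paths in the test add a K/V split and a `(num_heads, head_size)` reshape on top of the same indexing, as visible in the golden-reference section of the traceback.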

hlin99 added 8 commits March 4, 2026 07:29
Signed-off-by: Tony Lin <tony.lin@intel.com>
Signed-off-by: Tony Lin <tony.lin@intel.com>
Signed-off-by: Tony Lin <tony.lin@intel.com>
Signed-off-by: Tony Lin <tony.lin@intel.com>
    - Remove module-level CUDA skip to allow tests in non-CUDA environments
    - Add _cuda_available flag to track CUDA availability at module level
    - Conditionally run comparison tests only when CUDA is available
    - Run no-crash tests (verify execution without comparing results) when CUDA is unavailable
    - Maintain backward compatibility: full comparison testing when CUDA is available

Signed-off-by: Tony Lin <tony.lin@intel.com>
    - Modified scenario_rotary_embedding_k_fused to test both is_neox=True (NeoX-style, contiguous halves) and is_neox=False (GPT-J-style, interleaved)
    - Each test case now saves results with distinct suffixes (_neox and _gptj)
    - This ensures both code paths in the rotary_embedding_k_fused function are properly tested
    - Addresses the critical test coverage gap identified in the coverage analysis

Signed-off-by: Tony Lin <tony.lin@intel.com>
Signed-off-by: Tony Lin <tony.lin@intel.com>
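For context on the two layouts the commit above now covers: NeoX-style rotary rotates the two contiguous halves of each head vector against each other, while GPT-J-style rotates interleaved even/odd pairs. A hedged NumPy sketch of the math (a toy helper, not the `rotary_embedding_k_fused` signature):

```python
import numpy as np

def rotary_numpy(x, cos, sin, is_neox):
    """Apply rotary position embedding to a single head vector.

    x: (head_dim,); cos, sin: (head_dim // 2,) per-frequency factors.
    """
    d = x.shape[-1] // 2
    if is_neox:
        # NeoX style: the two contiguous halves form the rotation pairs.
        x1, x2 = x[:d], x[d:]
        return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])
    # GPT-J style: even/odd interleaved elements form the rotation pairs.
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

x = np.arange(8, dtype=np.float64)
cos, sin = np.ones(4), np.zeros(4)  # zero angle leaves x unchanged
assert np.allclose(rotary_numpy(x, cos, sin, is_neox=True), x)
assert np.allclose(rotary_numpy(x, cos, sin, is_neox=False), x)
```

Both branches apply the same 2-D rotation; only the pairing of elements differs, which is why testing each `is_neox` path separately matters.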
- Added NB_NL_TWO_BS_NH_HS to format_cases in scenario_multi_layer_kv_transfer
- Updated comments to clarify code paths for different GPUKVFormat enums
- Created comprehensive coverage analysis document
- All applicable GPUKVFormat enums now covered in tests

Signed-off-by: Tony Lin <tony.lin@intel.com>
@sammshen
Contributor

thanks @chunxiaozheng for the detailed review

Collaborator

@chunxiaozheng chunxiaozheng left a comment


LGTM!

@chunxiaozheng chunxiaozheng enabled auto-merge (squash) April 12, 2026 06:48
@github-actions github-actions Bot added the full Run comprehensive tests on this PR label Apr 12, 2026
@hlin99
Contributor Author

hlin99 commented Apr 12, 2026

> thanks @chunxiaozheng for the detailed review

thank you @sammshen & @chunxiaozheng

Collaborator

@maobaolong maobaolong left a comment


LGTM

@sammshen
Contributor

[screenshot: 2026-04-13 at 12:34 AM]

@sammshen
Contributor

sammshen commented Apr 13, 2026

The CODEOWNERS file is a little too fine-grained; let me change that.

@sammshen
Contributor

this PR should be able to merge after: #3016

@sammshen
Contributor

@maobaolong @chunxiaozheng PTAL!

maobaolong added a commit to maobaolong/LMCache that referenced this pull request Apr 13, 2026
…thout compiled CUDA extensions LMCache#2591

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
maobaolong added a commit to maobaolong/LMCache that referenced this pull request Apr 13, 2026
…thout compiled CUDA extensions LMCache#2591 (#16)

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
@maobaolong
Collaborator

@hlin99 Maybe you forgot to update `import lmcache.c_ops as lmc_ops` in lmcache/v1/multiprocess/gpu_context.py?
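(For context on why a plain `import lmcache.c_ops` keeps working even where an alias was missed: this PR centralizes backend selection behind a predicate-based registry probed at import time, so the import succeeds and dispatches to the fallback. A minimal sketch with hypothetical names, not the actual LMCache code:)

```python
# Candidates are probed in priority order; the first usable one wins
# (CUDA c_ops > pure-Python fallback).
_BACKENDS = []

def register_backend(name, predicate, impl):
    _BACKENDS.append((name, predicate, impl))

def select_backend():
    for name, predicate, impl in _BACKENDS:
        if predicate():          # e.g. "does the compiled extension load?"
            return name, impl
    raise RuntimeError("no usable backend")

# Pretend the compiled CUDA extension is unavailable on this machine.
register_backend("c_ops", lambda: False, object())
register_backend("non_cuda_equivalents", lambda: True, object())

name, impl = select_backend()
print(name)  # non_cuda_equivalents
```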

@maobaolong
Collaborator

@DongDongJu Would you like to take a look at this PR?

Contributor

@ApostaC ApostaC left a comment


LGTM!

@ApostaC ApostaC disabled auto-merge April 13, 2026 23:48
@ApostaC ApostaC enabled auto-merge (squash) April 13, 2026 23:48
@ApostaC ApostaC disabled auto-merge April 13, 2026 23:48
@ApostaC ApostaC merged commit 41b669e into LMCache:dev Apr 13, 2026
41 checks passed
@github-actions github-actions Bot added full Run comprehensive tests on this PR and removed full Run comprehensive tests on this PR labels Apr 14, 2026
@hlin99
Contributor Author

hlin99 commented Apr 14, 2026

> @hlin99 Maybe you forgot to update `import lmcache.c_ops as lmc_ops` in lmcache/v1/multiprocess/gpu_context.py?

Hi @maobaolong, you're right! I was aware of it. Since this PR was opened there have been some changes to c_ops; I only updated the necessary parts (signatures, enum definitions) here, because I really didn't want this PR to keep growing.

I will raise a separate PR soon to fill the known gaps between c_ops and the non-CUDA equivalents. Anybody who wants to contribute is also welcome.

ekaynar pushed a commit to ekaynar/LMCache that referenced this pull request Apr 15, 2026
…iled CUDA extensions (LMCache#2591)

Signed-off-by: Tony Lin <tony.lin@intel.com>
Co-authored-by: Samuel Shen <slshen@tensormesh.ai>
Co-authored-by: Martin Hickey <martin.hickey@ie.ibm.com>
ftian1 pushed a commit to ftian1/LMCache that referenced this pull request Apr 20, 2026
…iled CUDA extensions (LMCache#2591)

Signed-off-by: Tony Lin <tony.lin@intel.com>
Co-authored-by: Samuel Shen <slshen@tensormesh.ai>
Co-authored-by: Martin Hickey <martin.hickey@ie.ibm.com>
@hlin99 hlin99 deleted the non-cuda-extension branch April 25, 2026 05:21
8 participants