[Hardware] Enable Intel Gaudi (HPU) support #1066
shepark wants to merge 42 commits into LMCache:dev
Conversation
hd_shape = h*d
for i in range(len(kvcaches)):
    kvcaches[i][0].view(b, hd_shape).index_copy_(0, slot_mapping[start:end], tmp_gpu_buffer[0][i])
    kvcaches[i][1].view(b, hd_shape).index_copy_(0, slot_mapping[start:end], tmp_gpu_buffer[1][i])
Instead of having a bunch of if-else branches (also in the following), why not have an HPUConnector?
@YaoJiayi Thank you for the review and suggestion. That'd be good too.
Simply put, the changes here provide a different path for the KV cache transfer when lmc_ops does not exist, as you know.
Right, it looks like a "bunch of" if-else, but the branches are clearly distinguishable and only appear in "to_gpu" and "from_gpu".
If we added a new hpu_connector, there would be much more code to select the hpu_connector in multiple locations.
But this is the first set of changes, so we will definitely consider a separate connector for HPU.
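To illustrate the split, here is a minimal sketch of that pattern using the names from the diff (kvcaches, slot_mapping, tmp_gpu_buffer, b, h, d); the function name is a placeholder and the fast-path kernel call is elided because it is unchanged from the existing connector:

try:
    import lmcache.c_ops as lmc_ops
except (ModuleNotFoundError, ImportError):
    lmc_ops = None

def copy_buffer_to_kv_caches(kvcaches, tmp_gpu_buffer, slot_mapping, start, end, b, h, d):
    if lmc_ops is not None:
        # Fast path: the existing lmcache.c_ops kernel performs the copy
        # (call elided here; unchanged from the current GPU connector).
        ...
    else:
        # HPU fallback: plain index_copy_ on a (b, h*d) view of each layer.
        hd_shape = h * d
        for i in range(len(kvcaches)):
            kvcaches[i][0].view(b, hd_shape).index_copy_(
                0, slot_mapping[start:end], tmp_gpu_buffer[0][i]
            )
            kvcaches[i][1].view(b, hd_shape).index_copy_(
                0, slot_mapping[start:end], tmp_gpu_buffer[1][i]
            )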
@YaoJiayi I see 2 failing checks.
Force-pushed from 0824f9a to 07f2519
Force-pushed from 07f2519 to f50ebc5
@YaoJiayi all tests passed, can you review this PR again?
Force-pushed from 57d6b69 to 1f0a331
@sammshen could you review the PR?
    flattened.extend(elem)
else:
    flattened.append(elem)
new_block_ids = flattened
Can you explain the code change here?
@YaoJiayi Thank you for the review.
Actually, I described it in the issue ticket as well, but I closed it (#1091).
We expect allocated_block_ids to always be a flat list, like [1,2,3,4,5].
But with the original code, which was recently added in #1072, it becomes nested lists when the new_block_ids data type is a list of lists.
So I need to flatten this again for our case.
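For illustration, a self-contained sketch of that flattening (variable names follow the diff; the surrounding function is assumed):

def flatten_block_ids(new_block_ids):
    # new_block_ids may arrive flat ([1, 2, 3]) or nested ([[1, 2], [3]]);
    # normalize it to a flat list of block ids either way.
    flattened = []
    for elem in new_block_ids:
        if isinstance(elem, list):
            flattened.extend(elem)
        else:
            flattened.append(elem)
    return flattened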
    self.use_mla,
)
else:
    if self.gpu_buffer is not None:
Can we move the code from VLLMPagedMemGPUConnectorV2 to something like VLLMPagedMemHPUConnector?
@YaoJiayi yes, I totally agree with your suggestion, but it's the same answer as to your earlier comment about hpu_connector.
We are going to improve this continuously and will definitely consider having a separate connector.
In the future it will be VLLMPagedMemHPUConnector in hpu_connector, not in gpu_connector.
Hi @shepark, would it be ok to do this in this PR? :)
OK, will think about it :)
Force-pushed from ce961f2 to f2849bb
Integration tests are still WIP and do not need to pass for now.
self.gpu_buffer = torch.empty(
    shape, dtype=kwargs["dtype"], device=kwargs["device"]
)
self.store_stream = torch.cuda.Stream()
If torch.device() is set to hpu in the adapter, can we use torch.cuda here? Sorry if I misunderstand.
Removed line 87 as it's not used on HPU.
VLLMPagedMemHPUConnectorV2
]

if use_mla and config.use_layerwise:
Maybe just add one more small check under if config.use_layerwise: if device.type == "hpu", it's not supported yet.
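For illustration, a minimal sketch of that guard (config and device follow the names in the diff; the error type and message are assumptions):

if config.use_layerwise:
    if device.type == "hpu":
        # Layerwise connectors are not wired up for Gaudi yet.
        raise NotImplementedError("Layerwise KV connectors are not supported on HPU yet")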
memory_obj.tensor.copy_(tmp_gpu_buffer, non_blocking=True)

if not memory_obj.tensor.is_cuda:
Same here, can we call is_cuda here? tensor.device.type != "cuda" might be safer.
I removed the if block here since we don't support CUDA streams in hpu_connector.
infinistore
msgspec
numpy
nvtx
nvtx (and cufile-python) are NVIDIA-specific. I agree we should keep them, though, if this helps avoid breaking the installation (since all the code is annotated with nvtx_annotate).
Should cufile-python and nvtx be kept?
Hi @skaulintel @shepark (don't worry about the CI, those will be back up very soon), apologies if this is nitpicky, but my overall comment would just be about the implicit reliance on
Approving first as mostly LGTM
Force-pushed from 300f504 to 18d0780
DongDongJu left a comment
Hello, Happy New Year.
I left a few comments and questions.
One last thing: is there any chance to force cuda.is_available() to return False when hpu.is_available() is True, from the torch side? That would be really helpful.
VLLMPagedMemLayerwiseGPUConnector,
)

if hasattr(torch, "hpu") and torch.hpu.is_available():
IMO, it would be good to have a helper func like is_hpu_available().
cuda.is_available() returns True by design when using PT_HPU_GPU_MIGRATION=1; this is used for running GPU code on HPU with minimal changes. For reference, please check https://docs.habana.ai/en/latest/PyTorch/PyTorch_Model_Porting/GPU_Migration_Toolkit/GPU_Migration_Toolkit.html
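For illustration, a minimal sketch of such a helper, assuming only the hasattr-based check already used in this PR (the function name and the lru_cache caching are suggestions, not part of the diff):

import functools

import torch

@functools.lru_cache(maxsize=1)
def is_hpu_available() -> bool:
    # Gaudi builds of PyTorch expose a torch.hpu module; stock builds do not.
    return hasattr(torch, "hpu") and torch.hpu.is_available()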
torch_dev.set_device(local_rank)
device = torch.device(f"{dev_name}:{local_rank}")
num_gpus = torch_dev.device_count()
local_rank = parallel_config.rank % num_gpus
So basically, Gaudi does not support the parallel feature. Is that correct?
# First Party
import lmcache.c_ops as lmc_ops
except (ModuleNotFoundError, ImportError):
    lmc_ops = None
Do we need this import check for the HPU case? It seems lmc_ops will always be None in this case, so all the following lmc_ops checks are also useless.
Please correct me if I'm wrong.
shape = kv_cache.shape
dtype = kv_cache.dtype

if shape is not None and dtype is not None:
In which case can shape and dtype be None here?
array_type = ctypes.c_uint8 * size
buf = array_type.from_address(ptr)
buffer = torch.frombuffer(buf, dtype=torch.uint8)
buffer = torch.empty(size, dtype=torch.uint8)
I think it should go inside the logic.
This way, the behavior is broken for the numa_mapping case.
That's what I'm saying. It previously worked in a NUMA-aware manner, but it will not anymore with this indent.
I'm okay with something like
if is_hpu_environment():
    return torch.empty(size, dtype=torch.uint8)
The other part of the code should not be touched in this case.
Sure, please post it here or send the log in the community Slack with a DEBUG-level log. Thanks!
@hsubramony will do tomorrow. Thanks for the quick response!
Hey, thanks for the contribution 🙏! I would also like to take a look at this PR over the weekend.
ApostaC left a comment
Thanks for the terrific work! My understanding is that this PR includes at least 3 parts:
- Introduce the HPU connector for passing the KV cache between vLLM and LMCache
- Fix a lot of import issues
- Handle the differences in KV cache data structure between HPU and GPU
Part 1 seems good to me. But we probably want a better way (i.e., with clear function definitions and fewer code changes) to achieve parts 2 and 3.
It would be great and also easier for the code maintainers to review if you can split this into 3 PRs.
# First Party
import lmcache.c_ops as lmc_ops
except (ModuleNotFoundError, ImportError):
    lmc_ops = None
What's wrong if we don't do it here? IIUC, cachegen won't be supported unless we have the kernels on Intel GPUs.
Therefore, functions in this file and in cachegen_encoder.py won't be called anyway.
# required for VLLMPagedMemHPUConnectorV2
if hasattr(torch, "hpu") and torch.hpu.is_available():
    kv_shapes = self.gpu_connector.get_shape(num_tokens)
else:
    kv_shapes = self.metadata.get_shapes(num_tokens)
We did a refactoring in #2284 and a series of related PRs done by @chunxiaozheng.
We should reuse metadata.get_shapes instead of having an if branch here.
if torch.cuda.is_available():
# First Party
if hasattr(torch, "hpu") and torch.hpu.is_available():
I've seen this in many different places. Can we do two things:
- Make hasattr(torch, "hpu") and torch.hpu.is_available() a common util function
- Try to see if there is a way to avoid this if-checking during import
The goal is to avoid code duplication and make it more maintainable.
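One possible shape for such a util, sketched under the assumption of a dedicated helper module (the module path, function name, and constant are hypothetical, not part of this PR): resolve the backend once, so call sites import a value instead of repeating the hasattr check at import time.

# hypothetical lmcache/v1/device_utils.py
import torch

def _detect_backend() -> str:
    # Resolve the accelerator backend once, when this module is first imported.
    if hasattr(torch, "hpu") and torch.hpu.is_available():
        return "hpu"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"

# Call sites do: from lmcache.v1.device_utils import DEVICE_BACKEND
DEVICE_BACKEND = _detect_backend()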
shape, dtype = None, None

if isinstance(kv_cache, (tuple, list)):
    # HPU has a tuple list (K,V) with same shape and dtype
    for tensor in kv_cache:
        if tensor is not None:
            shape = tensor.shape
            dtype = tensor.dtype
            break
Here, you are effectively making shape and dtype optional values (which means they could be None).
Not sure why the linter doesn't complain here, but we should make the code cleaner. Probably define a clear function to extract the dtype and shape for HPU?
We are refactoring this part of the code in #2380. Let's not touch it for now.
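For reference, the kind of helper suggested above might look like the following sketch (the function name and typing are assumptions; the logic mirrors the diff quoted above):

from typing import Optional, Tuple

import torch

def extract_kv_shape_and_dtype(kv_cache) -> Tuple[Optional[torch.Size], Optional[torch.dtype]]:
    # HPU exposes the KV cache as a (K, V) tuple/list of tensors that share
    # one shape and dtype; read them from the first non-None tensor.
    if isinstance(kv_cache, (tuple, list)):
        for tensor in kv_cache:
            if tensor is not None:
                return tensor.shape, tensor.dtype
        return None, None
    return kv_cache.shape, kv_cache.dtype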
if lmc_ops:
    ptr = lmc_ops.alloc_pinned_ptr(size, 0)
    array_type = ctypes.c_uint8 * size
    buf = array_type.from_address(ptr)
    self.buffer = torch.frombuffer(buf, dtype=torch.uint8)
else:
    self.buffer = torch.empty(size, dtype=torch.uint8, pin_memory=True)
Similar to above, creating an extra indent is not ideal and will be confusing to other contributors.
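A minimal sketch of the early-return shape being asked for, assuming an is_hpu_environment()-style helper as suggested earlier (the helper and method names are placeholders; lmc_ops and the pinned-pointer path follow the diff above):

# assumes: import ctypes, torch; lmc_ops and is_hpu_environment() as discussed above
def _allocate_host_buffer(self, size: int):
    if is_hpu_environment():
        # HPU path: plain host tensor, no custom pinned allocator.
        return torch.empty(size, dtype=torch.uint8)
    # Existing CUDA path stays unchanged and unindented.
    ptr = lmc_ops.alloc_pinned_ptr(size, 0)
    array_type = ctypes.c_uint8 * size
    buf = array_type.from_address(ptr)
    return torch.frombuffer(buf, dtype=torch.uint8)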
def close(self):
    if not self._unregistered:
    if lmc_ops and not self._unregistered:
Here, we use whether lmc_ops is None to determine whether it's in the HPU environment (and similar logic is used in multiple places).
This may create confusion and maintenance overhead in the future. We should define an explicit function to determine whether it's HPU or not.
And such a function should have negligible overhead if it is called during runtime.
from lmcache.v1.memory_management import MemoryFormat

MAX_KEY_LENGTH = 150
MAX_KEY_LENGTH = 250
Just wondering why we made this change here?
While checking the log, I noticed that HPU support is not being officially merged into vLLM.
…pstream refactor vllm_v1_adapter.py and manager.py
Signed-off-by: Harish Subramony <hsubramony@habana.ai>
@DongDongJu @sammshen @ApostaC I updated the branch and pushed. Please let me know if there are any other issues. Please help merge. Thanks.
Hello, thanks for the great work.
@DongDongJu thanks for your review. As we don't have heavy changes for LMCache at this stage, do you think this PR can be merged into main?
This reverts commit 5666a1c.
NO_CUDA_EXT=1 BUILD_WITH_HPU=1 PT_HPU_GPU_MIGRATION=1 pip install -e .
This pull request has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.
This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it!
Enable Intel Gaudi HPU support.
There is a corresponding PR (HabanaAI/vllm-fork#1369) in vllm-fork.
The PR in vllm-fork has examples that utilize LMCache (e.g., the PD use case).
We need these changes in.
Install with:
PT_HPU_GPU_MIGRATION=1 pip install -e .