[Hardware] Enable Intel Gaudi (HPU) support #1066
shepark wants to merge 42 commits into LMCache:dev
Conversation
hd_shape = h*d
for i in range(len(kvcaches)):
    kvcaches[i][0].view(b, hd_shape).index_copy_(0, slot_mapping[start:end], tmp_gpu_buffer[0][i])
    kvcaches[i][1].view(b, hd_shape).index_copy_(0, slot_mapping[start:end], tmp_gpu_buffer[1][i])
Instead of having a bunch of if-else branches (also in the following), why not have an HPUConnector?
@YaoJiayi Thank you for the review and suggestion. That'd be good too.
Simply put, the changes here provide a different path for the KV cache transfer when lmc_ops does not exist, as you know.
Right, it looks like a "bunch of" if-else, but the branches are clearly distinguishable and only appear in "to_gpu" and "from_gpu".
If we added a new hpu_connector, there would be much more code to select the hpu_connector in multiple locations.
But this is the first set of changes, so we will definitely consider a separate connector for HPU.
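To illustrate the split, here is a minimal sketch of that pattern using the names from the diff (kvcaches, slot_mapping, tmp_gpu_buffer, b, h, d); the function name is a placeholder and the fast-path kernel call is elided because it is unchanged from the existing connector:

try:
    import lmcache.c_ops as lmc_ops
except (ModuleNotFoundError, ImportError):
    lmc_ops = None

def copy_buffer_to_kv_caches(kvcaches, tmp_gpu_buffer, slot_mapping, start, end, b, h, d):
    if lmc_ops is not None:
        # Fast path: the existing lmcache.c_ops kernel performs the copy
        # (call elided here; unchanged from the current GPU connector).
        ...
    else:
        # HPU fallback: plain index_copy_ on a (b, h*d) view of each layer.
        hd_shape = h * d
        for i in range(len(kvcaches)):
            kvcaches[i][0].view(b, hd_shape).index_copy_(
                0, slot_mapping[start:end], tmp_gpu_buffer[0][i]
            )
            kvcaches[i][1].view(b, hd_shape).index_copy_(
                0, slot_mapping[start:end], tmp_gpu_buffer[1][i]
            )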
@YaoJiayi I see 2 failing checks.
Force-pushed from 0824f9a to 07f2519
Force-pushed from 07f2519 to f50ebc5
@YaoJiayi all tests passed, can you review this PR again?
Force-pushed from 57d6b69 to 1f0a331
@sammshen could you review the PR?
    flattened.extend(elem)
else:
    flattened.append(elem)
new_block_ids = flattened
Can you explain the code change here?
@YaoJiayi Thank you for the review.
Actually, I described it in the issue ticket as well, but I closed it (#1091).
We expect allocated_block_ids to always be a flat list, like [1,2,3,4,5].
But with the original code, which was recently added in #1072, it becomes nested lists when the new_block_ids data type is a list of lists.
So I need to flatten this again for our case.
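For illustration, a self-contained sketch of that flattening (variable names follow the diff; the surrounding function is assumed):

def flatten_block_ids(new_block_ids):
    # new_block_ids may arrive flat ([1, 2, 3]) or nested ([[1, 2], [3]]);
    # normalize it to a flat list of block ids either way.
    flattened = []
    for elem in new_block_ids:
        if isinstance(elem, list):
            flattened.extend(elem)
        else:
            flattened.append(elem)
    return flattened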
    self.use_mla,
)
else:
    if self.gpu_buffer is not None:
Can we move the code from VLLMPagedMemGPUConnectorV2 to something like VLLMPagedMemHPUConnector?
@YaoJiayi yes, I totally agree with your suggestion, but it's the same answer as to your earlier comment about hpu_connector.
We are going to improve this continuously and will definitely consider having a separate connector.
In the future it will be VLLMPagedMemHPUConnector in hpu_connector, not in gpu_connector.
Hi @shepark, would it be ok to do this in this PR? :)
OK, will think about it :)
Force-pushed from ce961f2 to f2849bb
Integration tests are still WIP and do not need to pass for now.
self.gpu_buffer = torch.empty(
    shape, dtype=kwargs["dtype"], device=kwargs["device"]
)
self.store_stream = torch.cuda.Stream()
If torch.device() is set to hpu in the adapter, can we use torch.cuda here? Sorry if I misunderstand.
Removed line 87 as it's not used on HPU.
VLLMPagedMemHPUConnectorV2
]

if use_mla and config.use_layerwise:
Maybe just add one more small check under if config.use_layerwise: if device.type == "hpu", it's not supported yet.
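For illustration, a minimal sketch of that guard (config and device follow the names in the diff; the error type and message are assumptions):

if config.use_layerwise:
    if device.type == "hpu":
        # Layerwise connectors are not wired up for Gaudi yet.
        raise NotImplementedError("Layerwise KV connectors are not supported on HPU yet")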
memory_obj.tensor.copy_(tmp_gpu_buffer, non_blocking=True)

if not memory_obj.tensor.is_cuda:
Same here, can we call is_cuda here? tensor.device.type != "cuda" might be safer.
I removed the if block here since we don't support CUDA streams in hpu_connector.
infinistore
msgspec
numpy
nvtx
nvtx (and cufile-python) are NVIDIA-specific. I agree we should keep them, though, if this helps avoid breaking the installation (since all the code is annotated with nvtx_annotate).
Should cufile-python and nvtx be kept?
Hi @skaulintel @shepark (don't worry about the CI, those will be back up very soon), apologies if this is nitpicky, but my overall comment would just be about the implicit reliance on
Approving first as mostly LGTM
Force-pushed from 300f504 to 18d0780
DongDongJu left a comment
Hello, Happy New Year.
I left a few comments and questions.
One last thing: is there any chance to force cuda.is_available() to return False when hpu.is_available() is True, from the torch side? That would be really helpful.
VLLMPagedMemLayerwiseGPUConnector,
)

if hasattr(torch, "hpu") and torch.hpu.is_available():
IMO, it would be good to have a helper func like is_hpu_available().
cuda.is_available() returns True by design when using PT_HPU_GPU_MIGRATION=1; this is used for running GPU code on HPU with minimal changes. For reference, please check https://docs.habana.ai/en/latest/PyTorch/PyTorch_Model_Porting/GPU_Migration_Toolkit/GPU_Migration_Toolkit.html
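For illustration, a minimal sketch of such a helper, assuming only the hasattr-based check already used in this PR (the function name and the lru_cache caching are suggestions, not part of the diff):

import functools

import torch

@functools.lru_cache(maxsize=1)
def is_hpu_available() -> bool:
    # Gaudi builds of PyTorch expose a torch.hpu module; stock builds do not.
    return hasattr(torch, "hpu") and torch.hpu.is_available()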
torch_dev.set_device(local_rank)
device = torch.device(f"{dev_name}:{local_rank}")
num_gpus = torch_dev.device_count()
local_rank = parallel_config.rank % num_gpus
So basically, Gaudi does not support the parallel feature. Is that correct?
# First Party
import lmcache.c_ops as lmc_ops
except (ModuleNotFoundError, ImportError):
    lmc_ops = None
Do we need this import check for the HPU case? It seems lmc_ops will always be None in this case, so all the following lmc_ops checks are also useless.
Please correct me if I'm wrong.
shape = kv_cache.shape
dtype = kv_cache.dtype

if shape is not None and dtype is not None:
In which case can shape and dtype be None here?
array_type = ctypes.c_uint8 * size
buf = array_type.from_address(ptr)
buffer = torch.frombuffer(buf, dtype=torch.uint8)
buffer = torch.empty(size, dtype=torch.uint8)
I think it should go inside the logic.
This way, the behavior is broken for the numa_mapping case.
That's what I'm saying. It previously worked in a NUMA-aware manner, but it will not anymore with this indent.
I'm okay with something like
if is_hpu_environment():
    return torch.empty(size, dtype=torch.uint8)
The other part of the code should not be touched in this case.
Sure, please post it here or send the log in the community Slack with a DEBUG-level log. Thanks!
@hsubramony will do tomorrow. Thanks for the quick response!
Hey, thanks for the contribution 🙏! I would also like to take a look at this PR over the weekend.
ApostaC left a comment
Thanks for the terrific work! My understanding is that this PR includes at least 3 parts:
- Introduce the HPU connector for passing the KV cache between vLLM and LMCache
- Fix a lot of import issues
- Handle the differences in KV cache data structure between HPU and GPU
Part 1 seems good to me. But we probably want a better way (i.e., with clear function definitions and fewer code changes) to achieve parts 2 and 3.
It would be great and also easier for the code maintainers to review if you can split this into 3 PRs.
# First Party
import lmcache.c_ops as lmc_ops
except (ModuleNotFoundError, ImportError):
    lmc_ops = None
What's wrong if we don't do it here? IIUC, cachegen won't be supported unless we have the kernels on Intel GPUs.
Therefore, functions in this file and in cachegen_encoder.py won't be called anyway.
# required for VLLMPagedMemHPUConnectorV2
if hasattr(torch, "hpu") and torch.hpu.is_available():
    kv_shapes = self.gpu_connector.get_shape(num_tokens)
else:
    kv_shapes = self.metadata.get_shapes(num_tokens)
We did a refactoring in #2284 and a series of related PRs done by @chunxiaozheng.
We should reuse metadata.get_shapes instead of having an if branch here.
if torch.cuda.is_available():
# First Party
if hasattr(torch, "hpu") and torch.hpu.is_available():
I've seen this in many different places. Can we do two things:
- Make hasattr(torch, "hpu") and torch.hpu.is_available() a common util function
- Try to see if there is a way to avoid this if-checking during import
The goal is to avoid code duplication and make it more maintainable.
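One possible shape for such a util, sketched under the assumption of a dedicated helper module (the module path, function name, and constant are hypothetical, not part of this PR): resolve the backend once, so call sites import a value instead of repeating the hasattr check at import time.

# hypothetical lmcache/v1/device_utils.py
import torch

def _detect_backend() -> str:
    # Resolve the accelerator backend once, when this module is first imported.
    if hasattr(torch, "hpu") and torch.hpu.is_available():
        return "hpu"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"

# Call sites do: from lmcache.v1.device_utils import DEVICE_BACKEND
DEVICE_BACKEND = _detect_backend()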
shape, dtype = None, None

if isinstance(kv_cache, (tuple, list)):
    # HPU has a tuple list (K,V) with same shape and dtype
    for tensor in kv_cache:
        if tensor is not None:
            shape = tensor.shape
            dtype = tensor.dtype
            break
Here, you are effectively making shape and dtype optional values (which means they could be None).
Not sure why the linter doesn't complain here, but we should make the code cleaner. Probably define a clear function to extract the dtype and shape for HPU?
We are refactoring this part of the code in #2380. Let's not touch it for now.
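For reference, the kind of helper suggested above might look like the following sketch (the function name and typing are assumptions; the logic mirrors the diff quoted above):

from typing import Optional, Tuple

import torch

def extract_kv_shape_and_dtype(kv_cache) -> Tuple[Optional[torch.Size], Optional[torch.dtype]]:
    # HPU exposes the KV cache as a (K, V) tuple/list of tensors that share
    # one shape and dtype; read them from the first non-None tensor.
    if isinstance(kv_cache, (tuple, list)):
        for tensor in kv_cache:
            if tensor is not None:
                return tensor.shape, tensor.dtype
        return None, None
    return kv_cache.shape, kv_cache.dtype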
if lmc_ops:
    ptr = lmc_ops.alloc_pinned_ptr(size, 0)
    array_type = ctypes.c_uint8 * size
    buf = array_type.from_address(ptr)
    self.buffer = torch.frombuffer(buf, dtype=torch.uint8)
else:
    self.buffer = torch.empty(size, dtype=torch.uint8, pin_memory=True)
Similar to above, creating an extra indent is not ideal and will be confusing to other contributors.
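A minimal sketch of the early-return shape being asked for, assuming an is_hpu_environment()-style helper as suggested earlier (the helper and method names are placeholders; lmc_ops and the pinned-pointer path follow the diff above):

# assumes: import ctypes, torch; lmc_ops and is_hpu_environment() as discussed above
def _allocate_host_buffer(self, size: int):
    if is_hpu_environment():
        # HPU path: plain host tensor, no custom pinned allocator.
        return torch.empty(size, dtype=torch.uint8)
    # Existing CUDA path stays unchanged and unindented.
    ptr = lmc_ops.alloc_pinned_ptr(size, 0)
    array_type = ctypes.c_uint8 * size
    buf = array_type.from_address(ptr)
    return torch.frombuffer(buf, dtype=torch.uint8)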
def close(self):
    if not self._unregistered:
    if lmc_ops and not self._unregistered:
Here, we use whether lmc_ops is None to determine whether it's in the HPU environment (and similar logic is used in multiple places).
This may create confusion and maintenance overhead in the future. We should define an explicit function to determine whether it's HPU or not.
And such a function should have negligible overhead if it is called during runtime.
from lmcache.v1.memory_management import MemoryFormat

MAX_KEY_LENGTH = 150
MAX_KEY_LENGTH = 250
Just wondering why we made this change here?
While checking the log, I noticed that HPU support is not being officially merged into vLLM.
…pstream refactor vllm_v1_adapter.py and manager.py
Signed-off-by: Harish Subramony <hsubramony@habana.ai>
@DongDongJu @sammshen @ApostaC I updated the branch and pushed. Please let me know if there are any other issues. Please help merge. Thanks.
Hello, thanks for the great work.
@DongDongJu thanks for your review. As we don't have heavy changes for LMCache at this stage, do you think this PR can be merged into main?
This reverts commit 5666a1c.
NO_CUDA_EXT=1 BUILD_WITH_HPU=1 PT_HPU_GPU_MIGRATION=1 pip install -e .
This pull request has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.
This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it!
Enable Intel Gaudi HPU support.
There is a corresponding PR (HabanaAI/vllm-fork#1369) in vllm-fork.
The PR in vllm-fork has examples that utilize LMCache (e.g., the PD use case).
We need these changes in.
Install with:
PT_HPU_GPU_MIGRATION=1 pip install -e .