[KVConnector] Support 3FS KVConnector#37636
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀 |
There was a problem hiding this comment.
Code Review
This pull request introduces the HF3FS KVConnector, a significant feature for efficient KV cache offloading. The implementation is extensive, covering the connector itself, a client, a metadata server, and various utilities. Overall, the design is solid, leveraging asynchronous I/O and modern Python features. However, I've identified a few critical issues, including a logic bug in input validation, an error in metrics collection that would lead to a runtime crash, and a couple of instances of duplicated code that should be cleaned up for better maintainability. Addressing these points will improve the robustness and quality of this new connector.
| all( | ||
| [ | ||
| offset < 0 or offset + size > self.size | ||
| for offset, size in zip(offsets, sizes) | ||
| ] | ||
| ), | ||
| all([size > self.bytes_per_page for size in sizes]), |
There was a problem hiding this comment.
The validation logic here is incorrect. Using all() means an error is raised only if all offsets are out of bounds or all tensor sizes exceed the page size. This should be any(), so that an error is raised if any item is invalid. This bug could lead to out-of-bounds memory access or silent data corruption.
any(offset < 0 or offset + size > self.size
for offset, size in zip(offsets, sizes)),
any(size > self.bytes_per_page for size in sizes),There was a problem hiding this comment.
The gemini's comment seems to be correct. Can you check @ibifrost?
There was a problem hiding this comment.
Using all() here was indeed a logic error, I've fix it
| for list_item in transfer_stats_data[counter_item_key]: | ||
| counter_obj[engine_idx].inc(list_item) |
There was a problem hiding this comment.
The observe method incorrectly attempts to iterate over num_failed_save and num_failed_load from transfer_stats_data. These values are integers, not iterables, which will cause a TypeError at runtime when metrics are being observed. The loop should be removed, and inc() should be called directly with the integer value.
counter_obj[engine_idx].inc(transfer_stats_data[counter_item_key])|
|
||
| # When EPLB is enabled, redundant physical expert slots may map to | ||
| # logical experts that belong to other ranks in the default partition. | ||
| # The weight loader needs to see ALL logical expert weights so it can | ||
| # populate these redundant slots. Skip the filter entirely. | ||
| if parallel_config.enable_eplb: | ||
| return |
| # Reuse pre-allocated zero-init output buffer to avoid a memset | ||
| # kernel on every CUDA graph replay. | ||
| # q is 4D: (batch, q_len_per_req, num_heads, head_dim) | ||
| # FlashInfer has a bug where out= validation hardcodes 3D shape | ||
| # (batch, num_heads, kv_lora_rank), but the kernel writes 4D | ||
| # (batch, q_len, num_heads, kv_lora_rank) when q_len > 1. | ||
| # So we can only pass out= for single-token decode (q_len == 1). | ||
| # For q_len > 1, we zero padding slots after the kernel returns. | ||
| # TODO: upstream fix to FlashInfer | ||
| B, q_len_per_req = q.shape[0], q.shape[1] | ||
| out_kwargs: dict[str, torch.Tensor] = {} | ||
| if q_len_per_req == 1: | ||
| dtype = ( | ||
| torch.bfloat16 | ||
| if is_quantized_kv_cache(self.kv_cache_dtype) | ||
| else q.dtype | ||
| ) | ||
| if ( | ||
| self._decode_out is None | ||
| or self._decode_out.shape[0] < B | ||
| or self._decode_out.dtype != dtype | ||
| ): | ||
| self._decode_out = torch.zeros( | ||
| B, | ||
| q.shape[2], | ||
| self.kv_lora_rank, | ||
| dtype=dtype, | ||
| device=q.device, | ||
| ) | ||
| out_kwargs["out"] = self._decode_out[:B] | ||
|
|
8d38bb8 to
c451357
Compare
c451357 to
786ec6c
Compare
| @@ -48,6 +48,11 @@ lora_hf_hub_resolver = "vllm.plugins.lora_resolvers.hf_hub_resolver:register_hf_ | |||
| [tool.setuptools_scm] | |||
| # no extra settings needed, presence enables setuptools-scm | |||
|
|
|||
| [tool.setuptools.package-data] | |||
| "vllm" = [ | |||
There was a problem hiding this comment.
A naive question: why we need to define extra package-data here?
There was a problem hiding this comment.
We rely on runtime compilation via
"
from torch.utils.cpp_extension import load
...
hf3fs_utils = load(
name="hf3fs_utils",
sources=[f"{root}/utils/hf3fs_utils.cpp"],
extra_include_paths=[cuda_include_path],
)
"
in hf3fs_client.py now, that's why the hf3fs_utils.cpp file should be included into pkg here.
There was a problem hiding this comment.
@simon-mo Just wondering is it a common practice to introduce the c/c++ sources like this? Or there are some other common practices?
There was a problem hiding this comment.
package data is fine. but we put those in setup.py
There was a problem hiding this comment.
Sure, I've move the logic to setup.py as suggested.
ApostaC
left a comment
There was a problem hiding this comment.
Did a more detailed pass. There seems to be some small problems.
Please add some unit tests as well. But overall, the structure looks good to me. Thanks for the terrific job!
| all( | ||
| [ | ||
| offset < 0 or offset + size > self.size | ||
| for offset, size in zip(offsets, sizes) | ||
| ] | ||
| ), | ||
| all([size > self.bytes_per_page for size in sizes]), |
There was a problem hiding this comment.
The gemini's comment seems to be correct. Can you check @ibifrost?
| return self._fail_task("Saved", "Write operation failed", request_id, future) | ||
|
|
||
| except Exception as e: | ||
| return self._fail_task("Saved", f"Task execution error: {e}", request_id, future) |
There was a problem hiding this comment.
Do we need to call free_buffer before self._fail_task?
There was a problem hiding this comment.
Yes, we need to release the buffers in error paths. I've added a check before calling _fail_task to ensure resources are cleaned up.
| if key in self.key_metadata: | ||
| key_meta = self.key_metadata[key] | ||
| if key_meta.is_complete() and rank in key_meta.rank_to_page: | ||
| allocation_results[key] = key_meta.rank_to_page[rank] |
There was a problem hiding this comment.
Do we need to free the existing allocation_results[key] before overwriting it? I'm not so sure about the logic here.
There was a problem hiding this comment.
Right, just to clarify the logic here:
When a key is already complete (key_meta.is_complete()), we reuse its existing page from metadata, so we do not need to free that existing page because it's still valid and in use.
However, since we pre-allocated a new page for every key at the start of the function, hitting this branch means we have a newly allocated page that we won't actually use. The original code failed to release this specific unused pre-allocated page, which could lead to a potential leak.
So here need to add logic to keep the existing mapping but release the unused allocated_pages, I'll add a test case to ensure the free page count stays consistent when reusing existing keys.
| """Close the client and clean up resources.""" | ||
| deregister_fd(self.file) | ||
| os.close(self.file) | ||
| del self.ior_r | ||
| del self.ior_w | ||
| del self.iov_r | ||
| del self.iov_w | ||
| self.shm_r.close() | ||
| self.shm_w.close() |
There was a problem hiding this comment.
Do we need to do any check to prevent double-close or double-del?
There was a problem hiding this comment.
You're right, I've moved the cleanup logic into a _release_resources method and added checks to ensures the close operation is idempotent.
db51247 to
3e16a87
Compare
Signed-off-by: wuchenxin <wuchenxin.wcx@alibaba-inc.com>
3e16a87 to
f2f5f88
Compare
Signed-off-by: ibifrost <47308427+ibifrost@users.noreply.github.com>
|
Hi @ibifrost, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
|
@ibifrost can you fix the format? |
Head branch was pushed to by a user without write access
5c200aa to
0d61dc6
Compare
Sure |
Signed-off-by: wuchenxin <wuchenxin.wcx@alibaba-inc.com>
ce544f2 to
9ab5961
Compare
Signed-off-by: wuchenxin <wuchenxin.wcx@alibaba-inc.com> Signed-off-by: ibifrost <47308427+ibifrost@users.noreply.github.com> Co-authored-by: Simon Mo <simon.mo@hey.com>
Signed-off-by: wuchenxin <wuchenxin.wcx@alibaba-inc.com> Signed-off-by: ibifrost <47308427+ibifrost@users.noreply.github.com> Co-authored-by: Simon Mo <simon.mo@hey.com>
Signed-off-by: wuchenxin <wuchenxin.wcx@alibaba-inc.com> Signed-off-by: ibifrost <47308427+ibifrost@users.noreply.github.com> Co-authored-by: Simon Mo <simon.mo@hey.com>
Signed-off-by: wuchenxin <wuchenxin.wcx@alibaba-inc.com> Signed-off-by: ibifrost <47308427+ibifrost@users.noreply.github.com> Co-authored-by: Simon Mo <simon.mo@hey.com>
Signed-off-by: wuchenxin <wuchenxin.wcx@alibaba-inc.com> Signed-off-by: ibifrost <47308427+ibifrost@users.noreply.github.com> Co-authored-by: Simon Mo <simon.mo@hey.com>
Signed-off-by: wuchenxin <wuchenxin.wcx@alibaba-inc.com> Signed-off-by: ibifrost <47308427+ibifrost@users.noreply.github.com> Co-authored-by: Simon Mo <simon.mo@hey.com>
Signed-off-by: wuchenxin <wuchenxin.wcx@alibaba-inc.com> Signed-off-by: ibifrost <47308427+ibifrost@users.noreply.github.com> Co-authored-by: Simon Mo <simon.mo@hey.com>
Signed-off-by: wuchenxin <wuchenxin.wcx@alibaba-inc.com> Signed-off-by: ibifrost <47308427+ibifrost@users.noreply.github.com> Co-authored-by: Simon Mo <simon.mo@hey.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Overview
This PR introduces the implementation of the 3FS KVConnector for vLLM.
The 3FS KVConnector enables efficient offloading and sharing of KV caches across nodes, significantly accelerating long-context inference scenarios. Alongside the core implementation, we provide the 3FS Operator for one-click deployment and a mini3fs setup for easy local verification.
Deployment Operator: aliyun/kvc-3fs-operator
Mini3FS Quick Start: Mini3FS Deployment Guide
Performance Result
We conducted test to evaluate performance in Long-Context QA scenarios.
100 groups of prompts.
Each group: 20K shared prefix + 20 queries.
Query structure: 20K (prefix) + 128 (input) tokens → 64 (output) tokens.
Results
Comparing vLLM with 3FS Connector vs. vLLM with L1 Only:
Note: The "3FS Cold Start" scenario means allowing new nodes to instantly access KV caches generated by other nodes without recomputation or slow network transfer.
cc @ApostaC @KuntaiDu