[Docs] Add KV offloading usage guide (single- and multi-tier) #44415
Documentation preview: https://vllm--44415.org.readthedocs.build/en/44415/
llmd-fs-connector==0.22 (llm-d v0.8 / vLLM v0.22) is the final release of the standalone llm-d FS connector. The filesystem offloading logic is now upstreamed into vLLM as the FS tier of the multi-tier offloading connector (TieringOffloadingSpec); all new features and support continue there.

Add an [!IMPORTANT] banner to the connector README and a short note in the root README's Connectors & Utilities list, linking the vLLM KV offloading guide (vllm-project/vllm#44415).

Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com>
I think we should add, under Filesystem, that PYTHONHASHSEED needs to be set in order to share the KV cache in the FS secondary tier (this is written in the class description, but it wouldn't hurt to put it here too). Otherwise LGTM.
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
…ring Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Done in 449c067.
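The PYTHONHASHSEED point raised above can be illustrated with a small sketch. It only demonstrates a general Python fact: the built-in `hash()` of a string is randomized per process unless `PYTHONHASHSEED` is fixed. It does not reproduce vLLM's block-hashing code, and the helper name `str_hash_in_fresh_process` and the key `"kv-block"` are hypothetical.

```python
import os
import subprocess
import sys


def str_hash_in_fresh_process(text, seed=None):
    """Return hash(text) as computed by a newly started Python interpreter."""
    env = dict(os.environ)
    # Start from a clean slate so the parent's setting does not leak in.
    env.pop("PYTHONHASHSEED", None)
    if seed is not None:
        env["PYTHONHASHSEED"] = str(seed)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


# Two independent processes with the same seed agree on the hash,
# which is the property needed for two processes to agree on a shared key.
same_a = str_hash_in_fresh_process("kv-block", seed=0)
same_b = str_hash_in_fresh_process("kv-block", seed=0)
print(same_a == same_b)

# Without a fixed seed, each process picks its own random seed,
# so the two values will almost always differ.
unseeded_a = str_hash_in_fresh_process("kv-block")
unseeded_b = str_hash_in_fresh_process("kv-block")
print(unseeded_a, unseeded_b)
```

If the keys under which blocks are stored depend on a per-process seed, instances started without a common `PYTHONHASHSEED` would not recognize each other's entries in a shared directory.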
orozery left a comment
Thanks @ronensc!
A couple more things to consider mentioning (can be in a follow-up as well):
- Alpha support for pluggable offloadingspec / secondarytier.
- Supported hardware (GPU, ROCm, XPU).
- CPU size is across all workers, not per-worker.
- Emphasize that offloading is immediate, so for CPU without tiering, the CPU size should be bigger than the GPU size (across all workers).
- Metrics
- KV Events
- Supported eviction policies
- Reset cache
- Use-case for handling preempted requests
- Make sure we cover all possible extra_config (e.g. offload_prompt_only)
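The sizing points in the list above (CPU size is a total across all workers, and it should exceed the total GPU size when offloading is immediate) can be sketched as simple arithmetic. This is a minimal sketch, assuming "GPU size" means the per-worker GPU KV cache capacity; the function name, parameter names, and numbers are hypothetical and are not vLLM configuration keys.

```python
GIB = 1024 ** 3


def cpu_tier_is_large_enough(gpu_kv_bytes_per_worker, num_workers, cpu_bytes_total):
    """Check the rule of thumb: total CPU size > total GPU KV size.

    cpu_bytes_total is the CPU budget shared by all workers, not a
    per-worker figure, so it is compared against the summed GPU capacity.
    """
    total_gpu_kv_bytes = gpu_kv_bytes_per_worker * num_workers
    return cpu_bytes_total > total_gpu_kv_bytes


# Hypothetical deployment: 4 workers with 40 GiB of GPU KV cache each (160 GiB total).
too_small = cpu_tier_is_large_enough(40 * GIB, 4, 128 * GIB)
big_enough = cpu_tier_is_large_enough(40 * GIB, 4, 256 * GIB)
print(too_small, big_enough)
```

A 128 GiB CPU budget looks larger than any single worker's 40 GiB, but it is smaller than the 160 GiB held on GPU across the four workers, which is the comparison the review comment asks the guide to spell out.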
```text
<root_dir>_r<rank>/
  <hhh>/
```
We should explain what h stands for
@@ -0,0 +1,120 @@
# KV Offloading Usage Guide

This guide covers configuration of the [`OffloadingConnector`](disagg_prefill.md), which extends the prefix cache by spilling KV blocks evicted from GPU memory to slower but larger tiers (CPU host memory, plus optional secondary tiers). Hits in the offload tiers are promoted back to GPU on demand.
"Spilling" is confusing, since it suggests that offloading occurs only when the first tier is full.
In our case, we offload immediately.
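The distinction drawn in this comment can be sketched with a toy two-tier cache. This is an illustration only, not the `OffloadingConnector` implementation: the class `TwoTierCache`, its LRU policy, and the unbounded CPU dictionary are all assumptions made for the example.

```python
from collections import OrderedDict


class TwoTierCache:
    """Toy GPU/CPU cache contrasting spill-on-evict with immediate offload."""

    def __init__(self, gpu_capacity, offload_immediately):
        self.gpu = OrderedDict()  # insertion order doubles as LRU order
        self.cpu = {}             # unbounded here, to keep the sketch short
        self.gpu_capacity = gpu_capacity
        self.offload_immediately = offload_immediately

    def store(self, key, value):
        self.gpu[key] = value
        self.gpu.move_to_end(key)
        if self.offload_immediately:
            # Immediate offload: copy to CPU as soon as the block exists.
            self.cpu[key] = value
        while len(self.gpu) > self.gpu_capacity:
            old_key, old_value = self.gpu.popitem(last=False)
            if not self.offload_immediately:
                # Spilling: copy to CPU only when the GPU tier overflows.
                self.cpu[old_key] = old_value


spill = TwoTierCache(gpu_capacity=2, offload_immediately=False)
eager = TwoTierCache(gpu_capacity=2, offload_immediately=True)
for cache in (spill, eager):
    cache.store("a", 1)
    cache.store("b", 2)

# GPU is exactly full, nothing has been evicted yet.
spill_before = sorted(spill.cpu)
eager_before = sorted(eager.cpu)

for cache in (spill, eager):
    cache.store("c", 3)  # forces eviction of "a" from the GPU tier

spill_after = sorted(spill.cpu)
eager_after = sorted(eager.cpu)
print(spill_before, eager_before)
print(spill_after, eager_after)
```

In the immediate mode the CPU tier holds every block the GPU tier holds, plus older ones, which is why the reviewer elsewhere suggests sizing the CPU tier larger than the total GPU capacity.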
…iction Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
…fig keys Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
I've addressed items 2, 3, and 10 in d9cf003.
…roject#44415) Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
…roject#44415) Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>
@yuezhu1 confirmed the Haifa team is retiring the MultiConnector + llmd_fs_backend deployment shape (Path A) in the near future, in favor of the upstream-canonical TieringOffloadingSpec shape (Path B). vllm-project/vllm#41735 already upstreamed the FS storage code; vllm-project/vllm#44415 (merged 2026-06-09) adds the end-user usage docs for it. With Path A no longer a viable migration target, the RFC is narrowed to a single M1.5 path and the Path A scaffolding is removed.

Path A removal:
- §4.3 (was §4a): single M1.5 row (D↔H↔FS/obj via SpyreTieringOffloadingSpec + tiering/{fs,obj}). Path A row, the lengthy Path A vs Path B prose, and the SharedOffloadRegion Path-B-only carve-out all removed.
- §3.5: llmd_fs_backend reframed as historical context only. Note added that maintainers are retiring it; this RFC does not target it.
- §6.4 / §10 Q6: drop Path-B-only qualifiers (Path B is the only path).
- §10 Q7 deleted entirely (existed only to gate Path A).
- §7: drop test_multiconnector_register.py (was the Q7 falsifier).
- §12 acceptance: A1.5.1 is now the SpyreTieringOffloadingSpec + tiering/fs config (was A1.5.2 "forward-looking"); the long Path A acceptance with MultiConnector + llmd_fs_backend is gone. Old A1.5.4 → A1.5.3 (engineering budget).
- §13: llm-d/llm-d-kv-cache reference demoted to "historical context."

Other review-driven cleanups in the same pass (Yue's recent batch):
- §1 motivation (3388839334): added "secondary tiers only interact with primary(CPU)↔secondary(storage) transfers; they never touch device tensors" so readers don't have to trace why no per-tier code.
- §2 non-goals (3388886266): removed; items already covered in §11.
- §4.3 NIXL out-of-scope subsection (3389029011): removed; covered in §11.
- §4 numbering (3389055734): "§4a" → "§4.3" now that the NIXL slot freed up. Subsequent §6 subsections renumbered consistently (6.3a → 6.4, 6.4 → 6.5, 6.5 → 6.6).
- §9 PD-disagg paragraph (3389364673): moved into §11's PD bullet.
- §12 A1.3 format.sh (3389412218): clarified inline that format.sh is the repo lint wrapper around uvx prek.
- §12 A1.5.1 PYTHONHASHSEED bullet (3389437690): removed; that requirement is vLLM's contract, not this RFC's to test.
- §13 (3389446222): added vllm-project/vllm#44415 (upstream user-facing multi-tier usage guide); kept llm-d-kv-cache reference but reframed as historical, per the Path A drop.

Net: 31 insertions, 94 deletions. Doc is materially shorter and points at one M1.5 path instead of two.

Co-authored-by: Yue Zhu <yzhu@us.ibm.com>
Signed-off-by: Chen Wang <Chen.Wang1@ibm.com>
…roject#44415) Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Purpose
Adds a dedicated usage guide for the `OffloadingConnector` covering both single-tier (CPU) and multi-tier (`TieringOffloadingSpec`) configurations, including the filesystem (FS) secondary tier.
The existing entry in `docs/features/disagg_prefill.md` only documented the single-tier CPU example.
This PR adds a new page `docs/features/kv_offloading_usage.md` and a one-line pointer from the `OffloadingConnector` bullet in `disagg_prefill.md`.
This change is documentation-only.
Relates to #33689
Test Plan
Test Result