[Docs] Add KV offloading usage guide (single- and multi-tier) by ronensc · Pull Request #44415 · vllm-project/vllm

ronensc · 2026-06-03T11:32:59Z

Purpose

Adds a dedicated usage guide for the OffloadingConnector covering both single-tier (CPU) and multi-tier (TieringOffloadingSpec) configurations, including the filesystem (FS) secondary tier.

The existing entry in docs/features/disagg_prefill.md only documented the single-tier CPU example.

This PR adds a new page docs/features/kv_offloading_usage.md and a one-line pointer from the OffloadingConnector bullet in disagg_prefill.md.

This change is documentation-only.

Relates to #33689
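
For orientation, the single-tier (CPU) setup mentioned above can be sketched as follows. The field names `kv_connector`, `kv_role`, and `kv_connector_extra_config` are those of vLLM's KV transfer config, and `cpu_bytes_to_use` is the key named later in this thread. The 8 GiB size and the `<model>` placeholder are illustrative assumptions; the merged guide is the authority on the exact keys.

```python
import json
import shlex

# Single-tier (CPU) offloading config, expressed as the JSON string that
# `vllm serve --kv-transfer-config` accepts.
# ASSUMPTION: 8 GiB is an arbitrary illustrative size, not a recommendation.
kv_transfer_config = {
    "kv_connector": "OffloadingConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "cpu_bytes_to_use": 8 * 1024**3,
    },
}

config_json = json.dumps(kv_transfer_config)

# <model> is a placeholder, not a real model name.
command = "vllm serve <model> --kv-transfer-config " + shlex.quote(config_json)
print(command)
```

Building the JSON from a dict and quoting it with `shlex.quote` avoids hand-escaping nested quotes on the command line.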

Test Plan

Test Result

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

mergify · 2026-06-03T11:35:03Z

Documentation preview: https://vllm--44415.org.readthedocs.build/en/44415/

llmd-fs-connector==0.22 (llm-d v0.8 / vLLM v0.22) is the final release of the standalone llm-d FS connector. The filesystem offloading logic is now upstreamed into vLLM as the FS tier of the multi-tier offloading connector (TieringOffloadingSpec); all new features and support continue there.

Add an [!IMPORTANT] banner to the connector README and a short note in the root README's Connectors & Utilities list, linking the vLLM KV offloading guide (vllm-project/vllm#44415).

Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com>

rshavitt · 2026-06-04T09:34:30Z

I think we should add under Filesystem that PYTHONHASHSEED needs to be set in order to share the KV in the FS secondary tier (this is written in the class description, but it wouldn't hurt to put it here too). Otherwise LGTM.
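
The reason a fixed PYTHONHASHSEED matters for cross-process sharing can be illustrated with plain Python. This sketch only demonstrates Python's per-process string-hash randomization; how the FS tier derives its block keys is not shown in this thread, so the link to vLLM's hashing is an assumption. The helper name `string_hash_in_fresh_process` is hypothetical.

```python
import os
import subprocess
import sys


def string_hash_in_fresh_process(seed=None):
    """Return hash('kv-block') as computed by a newly started interpreter."""
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)  # start from an unseeded environment
    if seed is not None:
        env["PYTHONHASHSEED"] = str(seed)
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('kv-block'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


# With the same fixed seed, separate processes agree on the hash value.
seeded = {string_hash_in_fresh_process(seed=0) for _ in range(2)}
print("fixed seed, distinct values:", len(seeded))

# Without a seed, each process draws its own random seed, so the values
# will normally differ between runs.
unseeded = {string_hash_in_fresh_process() for _ in range(2)}
print("no seed, distinct values:", len(unseeded))
```

If two processes computed different keys for the same content, neither could find blocks written by the other, which is why the seed has to match across every process sharing the directory.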

ronensc · 2026-06-04T13:47:24Z

I think we should add under Filesystem that PYTHONHASHSEED needs to be set in order to share the KV in the FS secondary tier (this is written in the class description, but it wouldn't hurt to put it here too). Otherwise LGTM.

done in 449c067

orozery

Thanks @ronensc !
A couple more things to consider mentioning (these can go in a follow-up as well):

Alpha support for pluggable offloadingspec / secondarytier.
Supported hardware (GPU, ROCm, XPU).
CPU size is across all workers, not per-worker.
Emphasize that offloading is immediate, so for CPU without tiering the CPU size should be bigger than the GPU size (across all workers).
Metrics
KV Events
Supported eviction policies
Reset cache
Use-case for handling preempted requests
Make sure we cover all possible extra_config (e.g. offload_prompt_only)
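
The sizing points in the list above (CPU size counted across all workers, and larger than the combined GPU KV cache when there is no secondary tier) reduce to simple arithmetic. The sketch below is illustrative only: the function names and the example sizes are assumptions, and it is not part of vLLM.

```python
GIB = 1024**3


def total_gpu_kv_bytes(gpu_kv_bytes_per_worker: int, num_workers: int) -> int:
    """Combined GPU KV cache size across all workers."""
    return gpu_kv_bytes_per_worker * num_workers


def cpu_tier_is_large_enough(
    cpu_bytes_to_use: int, gpu_kv_bytes_per_worker: int, num_workers: int
) -> bool:
    """True if the CPU tier (sized across all workers) exceeds the GPU total."""
    return cpu_bytes_to_use > total_gpu_kv_bytes(
        gpu_kv_bytes_per_worker, num_workers
    )


# Hypothetical deployment: 4 workers with 20 GiB of GPU KV cache each.
print(total_gpu_kv_bytes(20 * GIB, 4) // GIB)             # 80
print(cpu_tier_is_large_enough(64 * GIB, 20 * GIB, 4))    # False
print(cpu_tier_is_large_enough(128 * GIB, 20 * GIB, 4))   # True
```

A 64 GiB CPU tier would look generous next to a single 20 GiB worker, but it is smaller than the 80 GiB held across all four, which is the comparison that matters here.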

orozery · 2026-06-08T04:46:05Z

+
+```text
+<root_dir>_r<rank>/
+  <hhh>/


We should explain what h stands for

Done in 4fe3791
@rshavitt I updated the tree structure under On-Disk Layout. Could you please verify that it is correct?
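
As a reading aid for the two directory levels visible in the excerpt (`<root_dir>_r<rank>/` and `<hhh>/`), here is a hypothetical path builder. It assumes `<hhh>` is the first three hexadecimal characters of a block hash, which the excerpt does not confirm; the function name and example values are made up, and the deeper levels of the tree are deliberately not reproduced.

```python
from pathlib import PurePosixPath


def tier_directory(root_dir: str, rank: int, block_hash_hex: str) -> PurePosixPath:
    """Build <root_dir>_r<rank>/<hhh> for one block (illustrative only)."""
    # ASSUMPTION: <hhh> is a three-character hex prefix of the block hash.
    prefix = block_hash_hex[:3]
    return PurePosixPath(f"{root_dir}_r{rank}") / prefix


print(tier_directory("/mnt/kv", 0, "a1b2c3d4"))  # /mnt/kv_r0/a1b
```

Under that assumption, the prefix directory would spread files across many subdirectories instead of placing every block in one flat folder.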

orozery · 2026-06-08T04:47:23Z

@@ -0,0 +1,120 @@
+# KV Offloading Usage Guide
+
+This guide covers configuration of the [`OffloadingConnector`](disagg_prefill.md), which extends the prefix cache by spilling KV blocks evicted from GPU memory to slower but larger tiers (CPU host memory, plus optional secondary tiers). Hits in the offload tiers are promoted back to GPU on demand.


"Spilling" is confusing, since it suggests that offloading occurs only once the first tier is full.
In our case, we offload immediately.

Done in 332f1ca
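
The distinction raised in this comment can be made concrete with a toy model. This is not vLLM's implementation: the class, its LRU policy, and the capacities are assumptions chosen only to contrast offloading at store time with spilling on GPU eviction.

```python
from collections import OrderedDict


class ToyTieredCache:
    """Toy model: a small GPU tier (LRU) in front of a larger CPU tier (LRU)."""

    def __init__(self, gpu_capacity, cpu_capacity, offload_immediately):
        self.gpu = OrderedDict()
        self.cpu = OrderedDict()
        self.gpu_capacity = gpu_capacity
        self.cpu_capacity = cpu_capacity
        self.offload_immediately = offload_immediately

    def _put_cpu(self, block):
        self.cpu[block] = True
        self.cpu.move_to_end(block)
        while len(self.cpu) > self.cpu_capacity:
            self.cpu.popitem(last=False)  # drop the least recently used block

    def store(self, block):
        self.gpu[block] = True
        self.gpu.move_to_end(block)
        if self.offload_immediately:
            # Immediate offload: copy to CPU as soon as the block exists.
            self._put_cpu(block)
        while len(self.gpu) > self.gpu_capacity:
            evicted, _ = self.gpu.popitem(last=False)
            if not self.offload_immediately:
                # Spill-on-eviction: copy to CPU only when GPU drops the block.
                self._put_cpu(evicted)


immediate = ToyTieredCache(gpu_capacity=2, cpu_capacity=8, offload_immediately=True)
spill = ToyTieredCache(gpu_capacity=2, cpu_capacity=8, offload_immediately=False)
for block in ["b0", "b1"]:
    immediate.store(block)
    spill.store(block)

print(sorted(immediate.cpu))  # ['b0', 'b1']
print(sorted(spill.cpu))      # []
```

With immediate offload the CPU tier already holds both blocks before the GPU tier is full, which is also why a CPU tier smaller than the GPU tier would start evicting blocks that are still live on the GPU.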

ronensc · 2026-06-08T09:29:37Z

Thanks @ronensc ! A couple more things to consider mentioning (these can go in a follow-up as well):

1. Alpha support for pluggable offloadingspec / secondarytier.
2. Supported hardware (GPU, ROCm, XPU).
3. CPU size is across all workers, not per-worker.
4. Emphasize that offloading is immediate, so for CPU without tiering the CPU size should be bigger than the GPU size (across all workers).
5. Metrics
6. KV Events
7. Supported eviction policies
8. Reset cache
9. Use-case for handling preempted requests
10. Make sure we cover all possible extra_config (e.g. offload_prompt_only)

I've addressed 2, 3, and 10 in d9cf003.
I also added a link to the relevant blog post in 759eedf.

orozery

Thanks @ronensc !

…roject#44415) Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>

…roject#44415) Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>

@yuezhu1 confirmed the Haifa team is retiring the MultiConnector + llmd_fs_backend deployment shape (Path A) in the near future, in favor of the upstream-canonical TieringOffloadingSpec shape (Path B). vllm-project/vllm#41735 already upstreamed the FS storage code; vllm-project/vllm#44415 (merged 2026-06-09) adds the end-user usage docs for it. With Path A no longer a viable migration target, the RFC is narrowed to a single M1.5 path and the Path A scaffolding is removed.

Path A removal:
- §4.3 (was §4a): single M1.5 row (D↔H↔FS/obj via SpyreTieringOffloadingSpec + tiering/{fs,obj}). Path A row, the lengthy Path A vs Path B prose, and the SharedOffloadRegion Path-B-only carve-out all removed.
- §3.5: llmd_fs_backend reframed as historical context only. Note added that maintainers are retiring it; this RFC does not target it.
- §6.4 / §10 Q6: drop Path-B-only qualifiers (Path B is the only path).
- §10 Q7 deleted entirely (existed only to gate Path A).
- §7: drop test_multiconnector_register.py (was the Q7 falsifier).
- §12 acceptance: A1.5.1 is now the SpyreTieringOffloadingSpec + tiering/fs config (was A1.5.2 "forward-looking"); the long Path A acceptance with MultiConnector + llmd_fs_backend is gone. Old A1.5.4 → A1.5.3 (engineering budget).
- §13: llm-d/llm-d-kv-cache reference demoted to "historical context."

Other review-driven cleanups in the same pass (Yue's recent batch):
- §1 motivation (3388839334): added "secondary tiers only interact with primary(CPU)↔secondary(storage) transfers — they never touch device tensors" so readers don't have to trace why no per-tier code.
- §2 non-goals (3388886266): removed; items already covered in §11.
- §4.3 NIXL out-of-scope subsection (3389029011): removed; covered in §11.
- §4 numbering (3389055734): "§4a" → "§4.3" now that the NIXL slot freed up. Subsequent §6 subsections renumbered consistently (6.3a → 6.4, 6.4 → 6.5, 6.5 → 6.6).
- §9 PD-disagg paragraph (3389364673): moved into §11's PD bullet.
- §12 A1.3 format.sh (3389412218): clarified inline that format.sh is the repo lint wrapper around uvx prek.
- §12 A1.5.1 PYTHONHASHSEED bullet (3389437690): removed — that requirement is vLLM's contract, not this RFC's to test.
- §13 (3389446222): added vllm-project/vllm#44415 (upstream user-facing multi-tier usage guide); kept llm-d-kv-cache reference but reframed as historical, per the Path A drop.

Net: 31 insertions, 94 deletions. Doc is materially shorter and points at one M1.5 path instead of two.

Co-authored-by: Yue Zhu <yzhu@us.ibm.com>
Signed-off-by: Chen Wang <Chen.Wang1@ibm.com>

mergify Bot added the documentation label (Improvements or additions to documentation) Jun 3, 2026

kfirtoledo mentioned this pull request Jun 4, 2026

docs(fs-backend): note upstreaming into vLLM multi-tier offloading llm-d/llm-d-kv-cache#632

Merged

effi-ofer mentioned this pull request Jun 4, 2026

tiered-prefix-cache guide changes - multi-tier support and others llm-d/llm-d#1561

Merged

ronensc added 2 commits June 4, 2026 16:45

[Docs] Add KV offloading usage guide (single- and multi-tier)

1d06082

Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>

Address review: document PYTHONHASHSEED for FS-tier cross-process sharing

449c067

Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>

ronensc force-pushed the multi-tier-docs branch from f759748 to 449c067 Compare June 4, 2026 13:46

orozery reviewed Jun 8, 2026

ronensc added 4 commits June 8, 2026 09:48

Address review: clarify KV offload happens immediately, not on GPU eviction

332f1ca

Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>

Address review: explain what h stands for and update tree structure

4fe3791

Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>

Address review: hardware support, cpu_bytes_to_use scope, missing config keys

d9cf003

Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>

Address review: link KV offloading blog post

759eedf

Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>

yuezhu1 mentioned this pull request Jun 8, 2026

RFC: Port the upstream KV Connector experience to spyre-inference torch-spyre/spyre-inference#240

Open

4 tasks

ronensc requested a review from orozery June 9, 2026 06:52

orozery approved these changes Jun 9, 2026

orozery added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jun 9, 2026

Merge branch 'main' into multi-tier-docs

66c41dc

orozery merged commit 3d119f7 into vllm-project:main Jun 9, 2026
8 checks passed

ronensc deleted the multi-tier-docs branch June 10, 2026 06:20

miroslavln mentioned this pull request Jun 10, 2026

fix/issue 656 default block size factor miroslavln/llm-d-kv-cache#1

Closed

orozery merged 7 commits into vllm-project:main from ronensc:multi-tier-docs
