[Docs] Add KV offloading usage guide (single- and multi-tier) #44415

Merged: orozery merged 7 commits into vllm-project:main from ronensc:multi-tier-docs on Jun 9, 2026
@ronensc ronensc commented Jun 3, 2026

Contributor

Purpose

Adds a dedicated usage guide for the OffloadingConnector covering both single-tier (CPU) and multi-tier (TieringOffloadingSpec) configurations, including the filesystem (FS) secondary tier.

The existing entry in docs/features/disagg_prefill.md only documented the single-tier CPU example.

This PR adds a new page docs/features/kv_offloading_usage.md and a one-line pointer from the OffloadingConnector bullet in disagg_prefill.md.

This change is documentation-only.

Relates to #33689
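
For context, the single-tier (CPU) case mentioned above can be sketched as a JSON payload for vLLM's `--kv-transfer-config` flag. This is only an illustration, not text from the new guide: the keys under `kv_connector_extra_config` (`block_size`, `cpu_bytes_to_use`) and their values are assumptions and should be checked against the merged page, `build_serve_command` is a hypothetical helper, and the multi-tier (TieringOffloadingSpec) keys are left out because this conversation does not list them.

```python
import json
import shlex

# Assumed single-tier (CPU) settings. The key names inside
# "kv_connector_extra_config" are assumptions; verify them in the guide.
kv_transfer_config = {
    "kv_connector": "OffloadingConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "block_size": 64,               # assumed: offload block size in tokens
        "cpu_bytes_to_use": 8 * 2**30,  # assumed: CPU budget in bytes
    },
}


def build_serve_command(model: str, config: dict) -> str:
    """Hypothetical helper: return a shell-safe `vllm serve` command line."""
    payload = json.dumps(config, separators=(",", ":"))
    parts = ["vllm", "serve", shlex.quote(model),
             "--kv-transfer-config", shlex.quote(payload)]
    return " ".join(parts)


# "<model>" is a placeholder, not a real model name.
command = build_serve_command("<model>", kv_transfer_config)
print(command)
```

Building the payload with `json.dumps` and quoting it with `shlex.quote` avoids hand-escaping the nested JSON on the command line.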

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

mergify Bot commented Jun 3, 2026

Contributor

Documentation preview: https://vllm--44415.org.readthedocs.build/en/44415/

mergify Bot added the "documentation" label (Improvements or additions to documentation) on Jun 3, 2026
kfirtoledo added a commit to kfirtoledo/llm-d-kv-cache-manager that referenced this pull request Jun 4, 2026
llmd-fs-connector==0.22 (llm-d v0.8 / vLLM v0.22) is the final release of
the standalone llm-d FS connector. The filesystem offloading logic is now
upstreamed into vLLM as the FS tier of the multi-tier offloading connector
(TieringOffloadingSpec); all new features and support continue there.

Add an [!IMPORTANT] banner to the connector README and a short note in the
root README's Connectors & Utilities list, linking the vLLM KV offloading
guide (vllm-project/vllm#44415).

Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com>
rshavitt commented Jun 4, 2026

Contributor

I think we should add under Filesystem that, in order to share the KV in the FS secondary tier, PYTHONHASHSEED needs to be set (it's written in the class description, but it wouldn't hurt to put it here too). Otherwise LGTM.
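
The general Python behavior behind this request can be sketched as follows: unless `PYTHONHASHSEED` is fixed, `hash()` of a string is randomized per interpreter process, so two processes would not agree on hash-derived keys. This is a sketch of that language-level behavior only; how vLLM derives block hashes is not shown in this thread, and the string `kv-block-prefix` and the helper `hash_in_fresh_process` are made up for illustration.

```python
import os
import subprocess
import sys
from typing import Optional

SNIPPET = "print(hash('kv-block-prefix'))"


def hash_in_fresh_process(seed: Optional[str]) -> int:
    """Evaluate hash() of a fixed string in a new interpreter process.

    seed=None leaves PYTHONHASHSEED unset (randomized str hashing);
    any fixed value makes the result reproducible across processes.
    """
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)
    if seed is not None:
        env["PYTHONHASHSEED"] = seed
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


# Two independent processes with the same seed agree on the hash.
fixed_a = hash_in_fresh_process("0")
fixed_b = hash_in_fresh_process("0")

# Unseeded processes pick a random seed each, so they almost always differ.
random_a = hash_in_fresh_process(None)
random_b = hash_in_fresh_process(None)

print("seeded hashes match:  ", fixed_a == fixed_b)
print("unseeded hashes match:", random_a == random_b)
```

Instances that are meant to read each other's files would therefore need the same `PYTHONHASHSEED` value, not merely some value.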

ronensc added 2 commits June 4, 2026 16:45
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
…ring

Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
@ronensc ronensc force-pushed the multi-tier-docs branch from f759748 to 449c067 Compare June 4, 2026 13:46
ronensc commented Jun 4, 2026

Contributor Author

I think we should add under Filesystem that, in order to share the KV in the FS secondary tier, PYTHONHASHSEED needs to be set (it's written in the class description, but it wouldn't hurt to put it here too). Otherwise LGTM.

done in 449c067

liu-cong pushed a commit to llm-d/llm-d-kv-cache that referenced this pull request Jun 5, 2026
)

llmd-fs-connector==0.22 (llm-d v0.8 / vLLM v0.22) is the final release of
the standalone llm-d FS connector. The filesystem offloading logic is now
upstreamed into vLLM as the FS tier of the multi-tier offloading connector
(TieringOffloadingSpec); all new features and support continue there.

Add an [!IMPORTANT] banner to the connector README and a short note in the
root README's Connectors & Utilities list, linking the vLLM KV offloading
guide (vllm-project/vllm#44415).

Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com>

@orozery orozery left a comment

Collaborator

Thanks @ronensc !
A couple more things to consider mentioning (can be in a follow-up as well):

  1. Alpha support for pluggable offloadingspec / secondarytier.
  2. Supported hardware (GPU, ROCM, XPU)
  3. CPU size is across all workers, not per-worker.
  4. Emphasize offloading is immediate so for CPU without tiering, the CPU size should be bigger than GPU size (across all workers)
  5. Metrics
  6. KV Events
  7. Supported eviction policies
  8. Reset cache
  9. Use-case for handling preempted requests
  10. Make sure we cover all possible extra_config (e.g. offload_prompt_only)
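
Items 3 and 4 amount to a simple sizing rule, sketched below. The helper name and the example numbers are hypothetical; the only inputs taken from the list above are that the CPU size is a total across all workers and that, without a secondary tier, it should exceed the total GPU KV cache size.

```python
GIB = 2**30


def cpu_tier_is_large_enough(cpu_bytes_total: int,
                             gpu_kv_bytes_per_worker: int,
                             num_workers: int) -> bool:
    """Hypothetical check: is the CPU tier bigger than all GPU KV memory?

    cpu_bytes_total is the budget across all workers (not per worker),
    so it is compared against the summed GPU KV cache of every worker.
    """
    gpu_kv_bytes_total = gpu_kv_bytes_per_worker * num_workers
    return cpu_bytes_total > gpu_kv_bytes_total


# Example: 4 workers with 40 GiB of GPU KV cache each (160 GiB in total).
print(cpu_tier_is_large_enough(128 * GIB, 40 * GIB, 4))  # False
print(cpu_tier_is_large_enough(256 * GIB, 40 * GIB, 4))  # True
```

Because every block is offloaded immediately, a CPU tier no larger than the GPU KV cache would add little capacity beyond what the GPU already holds.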

Comment thread docs/features/kv_offloading_usage.md Outdated

```text
<root_dir>_r<rank>/
<hhh>/
```

Collaborator

We should explain what h stands for

Contributor Author

Done in 4fe3791
@rshavitt I updated the tree structure under On-Disk Layout. Could you please verify that it is correct?

Comment thread docs/features/kv_offloading_usage.md Outdated
@@ -0,0 +1,120 @@
# KV Offloading Usage Guide

This guide covers configuration of the [`OffloadingConnector`](disagg_prefill.md), which extends the prefix cache by spilling KV blocks evicted from GPU memory to slower but larger tiers (CPU host memory, plus optional secondary tiers). Hits in the offload tiers are promoted back to GPU on demand.

Collaborator

"Spilling" is confusing, since it suggests that offloading happens only when the first tier is full.
In our case, we offload immediately.
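
The distinction can be illustrated with a toy two-tier cache. This is a sketch with made-up names (`TwoTierCache`, `immediate`), not vLLM's implementation: it only contrasts copying a block to the CPU tier when the GPU tier evicts it with copying it as soon as it is stored.

```python
from collections import OrderedDict


class TwoTierCache:
    """Toy LRU GPU tier backed by an unbounded CPU tier (illustration only)."""

    def __init__(self, gpu_capacity: int, immediate: bool) -> None:
        self.gpu = OrderedDict()  # block_id -> data, least recent first
        self.cpu = {}
        self.gpu_capacity = gpu_capacity
        self.immediate = immediate

    def store(self, block_id: str, data: bytes) -> None:
        self.gpu[block_id] = data
        self.gpu.move_to_end(block_id)
        if self.immediate:
            # Immediate offload: copy to CPU as soon as the block exists.
            self.cpu[block_id] = data
        while len(self.gpu) > self.gpu_capacity:
            evicted_id, evicted = self.gpu.popitem(last=False)
            if not self.immediate:
                # Spill-on-eviction: copy only when the GPU tier overflows.
                self.cpu[evicted_id] = evicted


spill = TwoTierCache(gpu_capacity=2, immediate=False)
immediate = TwoTierCache(gpu_capacity=2, immediate=True)
for cache in (spill, immediate):
    cache.store("a", b"block-a")
    cache.store("b", b"block-b")

print(sorted(spill.cpu))      # []: GPU tier not full yet, nothing spilled
print(sorted(immediate.cpu))  # ['a', 'b']: offloaded right away
```

With immediate offload the CPU tier already holds both blocks while the GPU tier still has room, which is the behavior the word "spilling" fails to convey.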

Contributor Author

Done in 332f1ca

ronensc added 4 commits June 8, 2026 09:48
…iction

Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
…fig keys

Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
ronensc commented Jun 8, 2026

Contributor Author

Thanks @ronensc ! A couple more things to consider mentioning (can be in a follow-up as well):

1. Alpha support for pluggable offloadingspec / secondarytier.

2. Supported hardware (GPU, ROCM, XPU)

3. CPU size is across all workers, not per-worker.

4. Emphasize offloading is immediate so for CPU without tiering, the CPU size should be bigger than GPU size (across all workers)

5. Metrics

6. KV Events

7. Supported eviction policies

8. Reset cache

9. Use-case for handling preempted requests

10. Make sure we cover all possible extra_config (e.g. offload_prompt_only)

I've addressed 2, 3, and 10 in d9cf003.
I also added a link to the relevant blog post in 759eedf.

@orozery orozery left a comment

Collaborator

Thanks @ronensc !

@orozery orozery added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) on Jun 9, 2026
@orozery orozery merged commit 3d119f7 into vllm-project:main Jun 9, 2026
8 checks passed
ekagra-ranjan pushed a commit to ekagra-ranjan/vllm that referenced this pull request Jun 9, 2026
…roject#44415)

Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
@ronensc ronensc deleted the multi-tier-docs branch June 10, 2026 06:20
waqahmed-amd-fi pushed a commit to waqahmed-amd-fi/vllm that referenced this pull request Jun 10, 2026
…roject#44415)

Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>
wangchen615 added a commit to wangchen615/spyre-inference that referenced this pull request Jun 10, 2026
@yuezhu1 confirmed the Haifa team is retiring the
MultiConnector + llmd_fs_backend deployment shape (Path A) in the
near future, in favor of the upstream-canonical TieringOffloadingSpec
shape (Path B). vllm-project/vllm#41735 already upstreamed the FS
storage code; vllm-project/vllm#44415 (merged 2026-06-09) adds the
end-user usage docs for it. With Path A no longer a viable migration
target, the RFC is narrowed to a single M1.5 path and the Path A
scaffolding is removed.

Path A removal:

- §4.3 (was §4a): single M1.5 row (D↔H↔FS/obj via SpyreTieringOffloadingSpec
  + tiering/{fs,obj}). Path A row, the lengthy Path A vs Path B prose,
  and the SharedOffloadRegion Path-B-only carve-out all removed.
- §3.5: llmd_fs_backend reframed as historical context only. Note
  added that maintainers are retiring it; this RFC does not target it.
- §6.4 / §10 Q6: drop Path-B-only qualifiers (Path B is the only path).
- §10 Q7 deleted entirely (existed only to gate Path A).
- §7: drop test_multiconnector_register.py (was the Q7 falsifier).
- §12 acceptance: A1.5.1 is now the SpyreTieringOffloadingSpec +
  tiering/fs config (was A1.5.2 "forward-looking"); the long Path A
  acceptance with MultiConnector + llmd_fs_backend is gone. Old A1.5.4
  → A1.5.3 (engineering budget).
- §13: llm-d/llm-d-kv-cache reference demoted to "historical context."

Other review-driven cleanups in the same pass (Yue's recent batch):

- §1 motivation (3388839334): added "secondary tiers only interact
  with primary(CPU)↔secondary(storage) transfers — they never touch
  device tensors" so readers don't have to trace why no per-tier code.
- §2 non-goals (3388886266): removed; items already covered in §11.
- §4.3 NIXL out-of-scope subsection (3389029011): removed; covered in §11.
- §4 numbering (3389055734): "§4a" → "§4.3" now that the NIXL slot
  freed up. Subsequent §6 subsections renumbered consistently
  (6.3a → 6.4, 6.4 → 6.5, 6.5 → 6.6).
- §9 PD-disagg paragraph (3389364673): moved into §11's PD bullet.
- §12 A1.3 format.sh (3389412218): clarified inline that format.sh is
  the repo lint wrapper around uvx prek.
- §12 A1.5.1 PYTHONHASHSEED bullet (3389437690): removed — that
  requirement is vLLM's contract, not this RFC's to test.
- §13 (3389446222): added vllm-project/vllm#44415 (upstream user-facing
  multi-tier usage guide); kept llm-d-kv-cache reference but reframed
  as historical, per the Path A drop.

Net: 31 insertions, 94 deletions. Doc is materially shorter and
points at one M1.5 path instead of two.

Co-authored-by: Yue Zhu <yzhu@us.ibm.com>
Signed-off-by: Chen Wang <Chen.Wang1@ibm.com>
Saddss pushed a commit to Saddss/vllm that referenced this pull request Jun 14, 2026
…roject#44415)

Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>