Always export idle slots to RAM #24190

Merged
ggerganov merged 3 commits into ggml-org:master from fiesh:fix-slot-exporting on Jun 9, 2026
Conversation

@fiesh fiesh commented Jun 5, 2026

Contributor

Without this change, a slot's VRAM cache may never be written to RAM. If that slot later happens to be busy, this triggers needless prompt preprocessing in another slot. I believe the correct behavior is what this PR implements: respect --cache-idle-slots even without a unified KV cache, but do not evict anything from VRAM, since eviction would serve no purpose in that case.

Overview

Make --cache-idle-slots export idle slots to RAM even without a unified KV cache, but without evicting them from VRAM.
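The policy described in the overview can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the diff: `Slot`, `RamCache`, and `handle_idle_slot` are hypothetical names, and the two boolean parameters merely stand in for the --cache-idle-slots option and the unified KV cache setting.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-ins; none of these names come from llama.cpp.
struct Slot {
    int              id;
    bool             busy;
    std::vector<int> tokens;       // tokens whose KV state currently lives in VRAM
    bool             vram_cleared; // set once the VRAM copy has been dropped
};

struct RamCache {
    std::vector<std::vector<int>> prompts; // token sequences exported to RAM

    void save(const std::vector<int> & tokens) {
        assert(!tokens.empty()); // callers only export slots that hold state
        prompts.push_back(tokens);
    }
};

// Policy as read from the PR description (an assumption about intent):
//  - an idle slot that holds state is exported to RAM whenever the option is on
//  - its VRAM copy is dropped only with a unified KV cache; the PR notes that
//    evicting without one would serve no purpose
// Returns true if the slot was exported.
bool handle_idle_slot(Slot & slot, RamCache & cache, bool cache_idle_slots, bool kv_unified) {
    if (!cache_idle_slots || slot.busy || slot.tokens.empty()) {
        return false;
    }
    cache.save(slot.tokens);
    if (kv_unified) {
        slot.tokens.clear();
        slot.vram_cleared = true;
    }
    return true;
}
```

In this sketch, a non-unified setup ends up with the prompt in both places: the slot can still continue from its VRAM state, while another slot could restore the same prefix from RAM if the first one is busy.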

Additional information

This closes #22942, which reports seemingly superfluous prompt preprocessing with parallel slots when the unified KV cache is not used.

Requirements

  • I have read and agree with the contributing guidelines YES
  • AI usage disclosure: YES, Opus used to create the diff

@fiesh fiesh requested review from a team as code owners June 5, 2026 14:12

@ggerganov ggerganov left a comment

Yes, this seems like an improvement.

@ggerganov ggerganov left a comment

I think the proposed logic is incorrect because it will keep saving idle slots to prompt cache over and over again.

fiesh commented Jun 8, 2026

Contributor Author

> I think the proposed logic is incorrect because it will keep saving idle slots to prompt cache over and over again.

Does the code at https://github.com/ggml-org/llama.cpp/blob/master/tools/server/server-task.cpp#L2006 not keep it from being rewritten?

@ggerganov ggerganov left a comment

Yes, I missed that.
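The exchange above turns on a guard that stops an idle slot from being saved repeatedly. The linked line is not reproduced in this thread, so the following is only a sketch of what such a guard could look like, assuming it skips a save when the prompt is already fully contained in a cached entry. `PromptCache`, `save_if_new`, and `common_prefix_len` are hypothetical names, not llama.cpp identifiers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Tokens = std::vector<int>;

// Length of the common prefix of two token sequences.
static std::size_t common_prefix_len(const Tokens & a, const Tokens & b) {
    const std::size_t n = std::min(a.size(), b.size());
    std::size_t i = 0;
    while (i < n && a[i] == b[i]) {
        ++i;
    }
    return i;
}

// Hypothetical RAM prompt cache; the names are illustrative only.
struct PromptCache {
    std::vector<Tokens> entries;

    // Store `prompt` unless an existing entry already contains it as a prefix.
    // Returns true if a new entry was written.
    bool save_if_new(const Tokens & prompt) {
        for (const Tokens & e : entries) {
            if (common_prefix_len(e, prompt) == prompt.size()) {
                return false; // already cached, so a repeated save is a no-op
            }
        }
        entries.push_back(prompt);
        return true;
    }
};
```

With a guard of this shape, exporting the same idle slot on every pass costs only a prefix comparison after the first save, which is consistent with the concern about saving "over and over again" being withdrawn.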

@ggerganov ggerganov added the merge ready label Jun 8, 2026
@ggerganov ggerganov merged commit 961e9a3 into ggml-org:master Jun 9, 2026
25 checks passed
Jcfunk added a commit to Jcfunk/llama.cpp that referenced this pull request Jun 11, 2026
* upstream/HEAD: (329 commits)
  vendor : update LibreSSL to 4.3.2 (ggml-org#24397)
  Remove padding and multiple D2D copies for MTP (ggml-org#24086)
  chat: fix LFM2/LFM2.5 ignoring json_schema (ggml-org#24377)
  CUDA: Fix ssm_scan_f32 data-races (ggml-org#24360)
  ci : bump komac version (ggml-org#24396)
  speculative : fix "ngram-map-k4v" name in logging (ggml-org#24253)
  webui: implement pinned conversations support (ggml-org#21387)
  graph: Fix granite speech model inference by applying embedding scale when deepstack is not used (ggml-org#24357)
  ci : fix windows release (ggml-org#24369)
  ui: add opt-in run_javascript frontend tool (ggml-org#24244)
  mtmd: build_vit batching (ggml-org#24352)
  vulkan: reduce iq1 shared memory usage for mul_mm (ggml-org#24287)
  vulkan: add `v_dot2_f32_f16` support in matrix-matrix multiplication and Flash Attention (ggml-org#24123)
  ui: Fix excessive style recalculation on hover (ggml-org#24243)
  mtmd: refactor video subproc handling (ggml-org#24316)
  server: log prompts to directory (ggml-org#22031)
  ui: fix mobile chat form overflow and bust stale bundle cache (ggml-org#24158)
  ggml : add GGML_OP_COL2IM_1D (ggml-org#24206)
  server : do not clear slots without unified KV cache (ggml-org#24190)
  models : fix plamo2 attention_key/value_length regression (ggml-org#24317)
  ...
turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 11, 2026
* Always export idle slots to RAM

Without this, a slot's VRAM cache may not be written to RAM.  If this
slot happens to be busy then later on, this triggers needless
preprocessing in another slot.

* cont : clean-up

---------

Co-authored-by: Christoph Weiss <weiss@wsoptics.de>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
(cherry picked from commit 961e9a3)
turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 12, 2026
* Always export idle slots to RAM

Without this, a slot's VRAM cache may not be written to RAM.  If this
slot happens to be busy then later on, this triggers needless
preprocessing in another slot.

* cont : clean-up

---------

Co-authored-by: Christoph Weiss <weiss@wsoptics.de>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
(cherry picked from commit 961e9a3)
Labels

examples, merge ready, server

Development

Successfully merging this pull request may close these issues.

server: prompt cache checkpoints are slot-local, missing across slots under -np > 1

3 participants