Always export idle slots to RAM by fiesh · Pull Request #24190 · ggml-org/llama.cpp

fiesh · 2026-06-05T14:12:28Z

Without this, a slot's VRAM cache may never be written to RAM. If that slot happens to be busy later on, this triggers needless preprocessing in another slot. It seems to me the correct behavior is what this PR implements: respect --cache-idle-slots even without a unified KV cache, but do not evict anything from VRAM, since that would serve no purpose.

Overview

Make --cache-idle-slots export idle slots to RAM even without a unified KV cache, but without evicting them from VRAM.
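
The intended behavior can be sketched with a small toy model. This is only an illustration under stated assumptions: `toy_slot`, `toy_server`, and `on_slot_idle` are hypothetical names and are not taken from the llama.cpp sources; the sketch only mirrors the rule described above (always export when --cache-idle-slots is set, clear the slot only with a unified KV cache).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy model; none of these names exist in llama.cpp.
using tokens_t = std::vector<int>;

struct toy_slot {
    int      id;
    bool     idle;
    tokens_t vram_tokens; // tokens whose KV state currently lives in VRAM
};

struct toy_server {
    bool cache_idle_slots;           // stands in for --cache-idle-slots
    bool kv_unified;                 // stands in for the unified KV cache setting
    std::vector<tokens_t> ram_cache; // exported copies held in RAM
};

// Called when a slot becomes idle. Returns true if the slot was exported.
inline bool on_slot_idle(toy_server & srv, toy_slot & slot) {
    assert(slot.idle);
    if (!srv.cache_idle_slots || slot.vram_tokens.empty()) {
        return false; // feature disabled or nothing to export
    }
    // Export to RAM in both modes, so another slot can reuse the state later.
    srv.ram_cache.push_back(slot.vram_tokens);
    // Only a unified KV cache benefits from freeing the slot's cells;
    // without it, clearing the slot would serve no purpose.
    if (srv.kv_unified) {
        slot.vram_tokens.clear();
    }
    return true;
}
```

In this sketch, a non-unified server keeps the slot's tokens in VRAM after the export, so the same slot can continue without reprocessing, while a different slot can restore the exported copy from RAM.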

Additional information

This closes #22942, which reports seemingly superfluous preprocessing with parallel slots when the unified KV cache is not used.

Requirements

I have read and agree with the contributing guidelines: YES
AI usage disclosure: YES, Opus used to create the diff

ggerganov

Yes, this seems like an improvement.

ggerganov

I think the proposed logic is incorrect because it will keep saving idle slots to prompt cache over and over again.

fiesh · 2026-06-08T12:32:53Z

> I think the proposed logic is incorrect because it will keep saving idle slots to prompt cache over and over again.

Does this

https://github.com/ggml-org/llama.cpp/blob/master/tools/server/server-task.cpp#L2006

not keep it from being rewritten?

ggerganov

Yes, I missed that.
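
The kind of guard referred to in this exchange can be sketched as follows. This assumes the linked line is a check that skips saving when the prompt is already fully contained in an existing cache entry; the sketch is not a copy of the code in server-task.cpp, and `toy_prompt_cache`, `save`, and `common_prefix` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy model; not the actual llama.cpp prompt cache.
using tokens_t = std::vector<int>;

// Length of the common prefix of two token sequences.
inline std::size_t common_prefix(const tokens_t & a, const tokens_t & b) {
    const std::size_t n = std::min(a.size(), b.size());
    std::size_t i = 0;
    while (i < n && a[i] == b[i]) {
        ++i;
    }
    assert(i <= n);
    return i;
}

struct toy_prompt_cache {
    std::vector<tokens_t> entries;

    // Returns true if a new entry was stored, false if the save was skipped.
    bool save(const tokens_t & prompt) {
        for (const tokens_t & e : entries) {
            // The prompt is fully covered by an existing entry: skip the save.
            if (common_prefix(e, prompt) == prompt.size()) {
                return false;
            }
        }
        entries.push_back(prompt);
        return true;
    }
};
```

With a guard of this shape, exporting the same idle slot repeatedly is cheap: only the first export stores an entry, and later exports of an unchanged slot are skipped.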

* upstream/HEAD: (329 commits)
  vendor : update LibreSSL to 4.3.2 (ggml-org#24397)
  Remove padding and multiple D2D copies for MTP (ggml-org#24086)
  chat: fix LFM2/LFM2.5 ignoring json_schema (ggml-org#24377)
  CUDA: Fix ssm_scan_f32 data-races (ggml-org#24360)
  ci : bump komac version (ggml-org#24396)
  speculative : fix "ngram-map-k4v" name in logging (ggml-org#24253)
  webui: implement pinned conversations support (ggml-org#21387)
  graph: Fix granite speech model inference by applying embedding scale when deepstack is not used (ggml-org#24357)
  ci : fix windows release (ggml-org#24369)
  ui: add opt-in run_javascript frontend tool (ggml-org#24244)
  mtmd: build_vit batching (ggml-org#24352)
  vulkan: reduce iq1 shared memory usage for mul_mm (ggml-org#24287)
  vulkan: add `v_dot2_f32_f16` support in matrix-matrix multiplication and Flash Attention (ggml-org#24123)
  ui: Fix excessive style recalculation on hover (ggml-org#24243)
  mtmd: refactor video subproc handling (ggml-org#24316)
  server: log prompts to directory (ggml-org#22031)
  ui: fix mobile chat form overflow and bust stale bundle cache (ggml-org#24158)
  ggml : add GGML_OP_COL2IM_1D (ggml-org#24206)
  server : do not clear slots without unified KV cache (ggml-org#24190)
  models : fix plamo2 attention_key/value_length regression (ggml-org#24317)
  ...

* Always export idle slots to RAM

Without this, a slot's VRAM cache may not be written to RAM. If this slot happens to be busy then later on, this triggers needless preprocessing in another slot.

* cont : clean-up

---------

Co-authored-by: Christoph Weiss <weiss@wsoptics.de>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
(cherry picked from commit 961e9a3)

Always export idle slots to RAM

83c897d

Without this, a slot's VRAM cache may not be written to RAM. If this slot happens to be busy then later on, this triggers needless preprocessing in another slot.

fiesh requested review from a team as code owners June 5, 2026 14:12

ggerganov reviewed Jun 5, 2026

ggerganov self-assigned this Jun 5, 2026

github-actions bot added the examples and server labels Jun 5, 2026

ggerganov reviewed Jun 8, 2026

ggerganov approved these changes Jun 8, 2026

ggerganov added the "merge ready" label Jun 8, 2026

ggerganov added 2 commits June 9, 2026 10:06

Merge remote-tracking branch 'origin/master' into pr/24190

dd605e4

cont : clean-up

d4716e1

ggerganov merged commit 961e9a3 into ggml-org:master Jun 9, 2026
25 checks passed

ggerganov merged 3 commits into ggml-org:master from fiesh:fix-slot-exporting