llama : use n_swa + n_ubatch cells for SWA cache #13833
Conversation
Force-pushed from 1bce7e8 to 6468631.
I'll try testing.
Force-pushed from 6468631 to ef5bb61.
tools/server/server.cpp (outdated)

```cpp
const auto pos_min = llama_kv_self_seq_pos_min(ctx, slot.id);
if (pos_min > 0) {
    SLT_WRN(slot, "n_past = %d, cache_tokens.size() = %d, seq_id = %d, pos_min = %d\n", slot.n_past, (int) slot.cache_tokens.size(), slot.id, pos_min);
if (pos_min == -1 || pos_min > slot.n_past - n_swa) {
```
pos_min == -1 means the sequence is empty. In this case, I think setting n_past = 0 is the expected behavior, so we don't necessarily need to log the warning.
If the sequence is not present in the KV cache (i.e. pos_min == -1), but we somehow decided that slot.n_past > 0 (see the condition above), then this is still unexpected. I think we might even want to abort in such cases, because it means there is a bug somewhere.
There does indeed appear to be a bug somewhere: with prompt caching enabled, requests fail due to pos_min being -1, e.g. see mudler/LocalAI#553 (comment). It is tracked as llama.cpp bug #17118.
Force-pushed from ef5bb61 to 4a9253a.
Force-pushed from 8342295 to 855b397.
I got this error just now. Version: Not sure why Nix stable has such an ancient version; will upgrade.
target #13845