server-context: fall back to full seq clear when partial KV eviction is refused #23280
Merged
ggerganov merged 3 commits on May 19, 2026
…is refused
The startup probe in common_context_can_seq_rm only tests a 2-token tail removal on seq 0, so it cannot guarantee that every partial eviction will succeed at any position on any live seq. The previous code aborted the process via GGML_ABORT in common_context_seq_rm whenever the backend refused the partial removal, taking down the server on a recoverable condition. On refusal we now clear the whole seq on both target and draft contexts, reset the prompt cache counters, and let update_slots reprefill from zero on the current iteration. The server stays alive: the slot loses its prefix cache and pays for a single reprefill instead of crashing.
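The fallback described in this commit message can be sketched in isolation. The following is a minimal model, not the code from the PR: `MockSeqMemory`, `MockSlot` and `truncate_or_reset` are hypothetical names, and the mock only imitates a backend that refuses mid-sequence removal while always accepting a full clear.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for one sequence of backend memory (not the llama.cpp API).
struct MockSeqMemory {
    std::vector<int> tokens;     // cached tokens of the sequence
    bool allow_partial = false;  // false models a recurrent-style backend

    // Remove positions [p0, end). Returns false when the backend refuses.
    bool seq_rm(int p0) {
        if (p0 < 0 || p0 > (int) tokens.size()) {
            return false;
        }
        const bool partial = p0 > 0 && p0 < (int) tokens.size();
        if (partial && !allow_partial) {
            return false;  // refusal: state cannot be rewound mid-sequence
        }
        tokens.resize(p0);
        return true;
    }
};

// Hypothetical slot state: only the reuse counter matters for this sketch.
struct MockSlot {
    int n_past = 0;  // number of cached tokens the slot intends to reuse
};

// Try the partial eviction on target and draft. On refusal, clear both
// sequences and reset the counter so the prompt is reprocessed from zero.
// Returns true when the prefix cache was kept.
bool truncate_or_reset(MockSeqMemory & tgt, MockSeqMemory & dft, MockSlot & slot) {
    if (tgt.seq_rm(slot.n_past) && dft.seq_rm(slot.n_past)) {
        return true;
    }
    tgt.seq_rm(0);  // a full clear is always accepted by the mock
    dft.seq_rm(0);
    slot.n_past = 0;
    return false;
}
```

With `allow_partial = false` and `n_past = 2` on a four-token sequence, the call returns `false`, both sequences end up empty and `n_past` is 0, which corresponds to the single reprefill the message mentions. Note that this fallback was later reverted in this PR in favor of the patch discussed below.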
Member
Could you try applying the following patch:
diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
index 0f3fb9efa..7b801eac0 100644
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -2583,9 +2583,9 @@ private:
llama_pos pos_next = slot.prompt.tokens.pos_next(n_past);
// the largest pos_min required for a checkpoint to be useful
- const auto pos_min_thold = std::max(0, pos_next - n_swa);
+ const auto pos_min_thold = std::max(0, pos_next - n_swa - 1);
- if (n_past > 0 && n_past < slot.prompt.n_tokens()) {
+ if (n_past > 0 && n_past <= slot.prompt.n_tokens()) {
const auto pos_min = llama_memory_seq_pos_min(llama_get_memory(ctx_tgt), slot.id);
if (pos_min == -1) {
SLT_ERR(slot, "n_past = %d, slot.prompt.tokens.size() = %d, seq_id = %d, pos_min = %d\n", n_past, (int) slot.prompt.tokens.size(), slot.id, pos_min);
I think this will fix both this problem and also #23223.
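To see what the two changed lines do, the expressions from the patch can be isolated into small functions. This is only an illustration: the function names are hypothetical, and `n_prompt` stands in for `slot.prompt.n_tokens()`.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical wrappers; only the expressions inside mirror the patch.

// Condition guarding the checkpoint / reset branch, before the patch.
inline bool enters_branch_old(int n_past, int n_prompt) {
    return n_past > 0 && n_past < n_prompt;
}

// After the patch: the exact-match case n_past == n_prompt is included.
inline bool enters_branch_new(int n_past, int n_prompt) {
    return n_past > 0 && n_past <= n_prompt;
}

// Largest pos_min for which a checkpoint is still useful, before the patch.
inline int pos_min_thold_old(int pos_next, int n_swa) {
    return std::max(0, pos_next - n_swa);
}

// After the patch: one position lower, still clamped at 0.
inline int pos_min_thold_new(int pos_next, int n_swa) {
    return std::max(0, pos_next - n_swa - 1);
}
```

For an exact match such as `n_past = 490` and `n_prompt = 490`, the old condition is false and the new one is true, so the exact-match request now goes through the same branch as a partial match.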
Contributor
Author
Looks better, I'll try it now!
Contributor
Author
Successfully tested on master plus my sse-replay-buffer PR. It fixes the problem at its source, much better than my fallback. Thanks!
…viction is refused"
This reverts commit fa9770c.
Reproduces on master with hybrid models by asking the assistant to continue its last reply in a multi-turn conversation: the LCP match is perfect, the deep partial seq_rm is refused by the recurrent backend, and common_context_seq_rm aborts the process via GGML_ABORT. The patch by @ggerganov routes the n_past == slot.prompt.n_tokens() case through the existing do_reset path.
Member
Ok, let's also do some testing with non-recurrent models to make sure I am not overlooking something, and then we can merge.
Contributor
Author
Tested with Qwen3 30B A3B, GPT-OSS, and Llama 3.3 (all pure transformers). Multi-turn continuation works as expected, with no regression.
kgrama pushed a commit to kgrama/llama.cpp that referenced this pull request on May 19, 2026
xxmustafacooTR pushed a commit to xxPlayground/llama-cpp-turboquant that referenced this pull request on May 19, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request on May 19, 2026
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request on May 19, 2026
fhnmor21 pushed a commit to fhnmor21/llama-cpp-turboquant that referenced this pull request on May 19, 2026
dbrain pushed a commit to dbrain/hbd-llama-cpp-turboquant that referenced this pull request on May 21, 2026
baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request on May 23, 2026
Jcfunk added a commit to Jcfunk/llama.cpp that referenced this pull request on May 23, 2026
* upstream/HEAD:
  ci : install server kleidiai runner dependencies (ggml-org#23259)
  server-context: guarantee there is at least 1 token to decode (ggml-org#23280)
  server : print graphs reused in slot timings (ggml-org#23279)
  save-load-state : refactor tests and improve readability (ggml-org#23196)
  llama-eval : add per-task summary stats (ggml-org#23151)
  ggml-webgpu : extend GDN for K>1 (ggml-org#23299)
  [SCYL] add chapter for performance reference in SYCL.md (ggml-org#23315)
  convert : filter lora tensor names (ggml-org#23077)
  sycl: add GGML_SYCL_USE_ASYNC_MEM_OP env toggle (ggml-org#22153)
  rpc : keep last_graph_uid in the device context (ggml-org#23273)
srossitto79 pushed a commit to srossitto79/llama.cpp that referenced this pull request on May 23, 2026
jimbothigpen added a commit to jimbothigpen/llama.cpp that referenced this pull request on May 29, 2026
Reverts mainline commit ccee426 (PR ggml-org#23280) in tools/server/server-context.cpp which we picked up via 2026-05-25 forward-sync. The change introduced a KV cache reuse regression on Qwen3.6-35B-A3B (and likely Qwen3.5-35B-A3B-MTP) where a full batch of cached tokens is dropped per turn on multi-turn requests.
Mainline issue: ggml-org#23589
RC + reproducer: orangeswim 2026-05-24
§-RISK: This is a naked revert per the issue author's mainline test; it may reintroduce the hybrid-attention crash that ggml-org#23280 was fixing. Build + smoke verify gated on GPU-lockout-clear; follow-up worker required before FF-merge.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request on May 30, 2026
turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request on Jun 2, 2026
Jcfunk added a commit to Jcfunk/llama.cpp that referenced this pull request on Jun 11, 2026
* upstream/HEAD: (25 commits)
  metal : optimize pad + cpy (ggml-org#23354)
  snapdragon: update toolchain to v0.6 (ggml-org#23369)
  ggml-cuda: tune RDNA3 Q6_K MMVQ nwarps (ggml-org#23349)
  opencl: add MoE support for q4_k, q5_k, q6_k on Adreno (ggml-org#23303)
  hexagon: add MROPE and IMROPE support in HTP rope op (ggml-org#23317)
  refactor: Chat Screen UI rendering (ggml-org#23333)
  github: mention --log-file in issue templates (ggml-org#23277)
  common: fix --help for --verbosity (ggml-org#23278)
  common: fix --fit verbosity with --verbosity 4 (ggml-org#23282)
  convert : update mtp related help (ggml-org#23334)
  hexagon: enable support for NORM op (ggml-org#23319)
  model : clarify MTP layer comment in qwen35.cpp [no ci] (ggml-org#23338)
  llama : MTP clean-up (ggml-org#23269)
  ui: Bump packages + address build warnings (ggml-org#23300)
  ci : install libssl-dev (ggml-org#23325)
  ci : install server kleidiai runner dependencies (ggml-org#23259)
  server-context: guarantee there is at least 1 token to decode (ggml-org#23280)
  server : print graphs reused in slot timings (ggml-org#23279)
  save-load-state : refactor tests and improve readability (ggml-org#23196)
  llama-eval : add per-task summary stats (ggml-org#23151)
  ...
TheTom pushed a commit to TheTom/llama-cpp-turboquant that referenced this pull request on Jun 12, 2026
…odels
When an incoming prompt exactly matches the slot's cached tokens, the server backs off one token (n_past--) to guarantee at least one token is decoded for logits [TAG_PROMPT_LOGITS]. The subsequent truncation then calls seq_rm with p0 = n_past > 0, which recurrent memory cannot satisfy (the state cannot be rewound to a mid-sequence position when the rollback exceeds n_rs_seq), so common_context_seq_rm hits GGML_ABORT and the whole server dies:
common/common.cpp:1472: failed to remove sequence 0 with p0=490, p1=-1
Observed in production with Qwen3.6-35B-A3B (GatedDeltaNet layers): any client re-sending an identical prompt with cache_prompt enabled (regenerate / retry) crashed the server. Reproducible with any recurrent model, e.g. mamba-130m, by sending the same /completion prompt twice.
Fix: include the exact-match case (n_past == slot.prompt.n_tokens()) in the existing checkpoint-restore/full-reprocess branch by relaxing its condition to n_past <= slot.prompt.n_tokens() and extending pos_min_thold by 1 when there are no new tokens, so the upcoming back-off is accounted for. Recurrent/hybrid models now restore a context checkpoint (or fall back to full re-processing) instead of aborting; attention models are unaffected.
This backports upstream ggml-org/llama.cpp PRs ggml-org#23280 and ggml-org#24110 (commits ccee426, 6f3a9f3).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
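The sequence of events in that commit message (exact match, one-token back-off, refused rewind) can be traced with a toy planner. This is a simplified sketch under stated assumptions, not the server code: `common_prefix_len`, `ReusePlan`, `plan_reuse` and the `can_rewind` flag are hypothetical, and checkpoint restore is left out, so a refused rewind always falls back to full reprocessing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Length of the longest common prefix between cached and incoming tokens.
inline int common_prefix_len(const std::vector<int> & cached,
                             const std::vector<int> & incoming) {
    std::size_t n = 0;
    while (n < cached.size() && n < incoming.size() && cached[n] == incoming[n]) {
        n++;
    }
    return (int) n;
}

// Hypothetical result type: how much of the cache is reused.
struct ReusePlan {
    int  n_past;          // tokens kept from the cache
    bool full_reprocess;  // true when the cache had to be dropped entirely
};

// 'can_rewind' models whether the memory accepts a mid-sequence seq_rm
// (assumed true for attention caches, false for recurrent state here).
inline ReusePlan plan_reuse(const std::vector<int> & cached,
                            const std::vector<int> & incoming,
                            bool can_rewind) {
    int n_past = common_prefix_len(cached, incoming);

    // Exact match: back off one token so at least one token is decoded.
    if (n_past > 0 && n_past == (int) incoming.size()) {
        n_past--;
    }

    // Keeping only n_past tokens requires a rewind when cached tokens follow.
    const bool needs_rewind = n_past > 0 && n_past < (int) cached.size();
    if (needs_rewind && !can_rewind) {
        return {0, true};  // no checkpoint in this sketch: reprocess everything
    }
    return {n_past, false};
}
```

Re-sending an identical four-token prompt gives `n_past = 3` when `can_rewind` is true, and a full reprocess when it is false. Appending a new token to the cached prompt needs no rewind in either case, which is why ordinary multi-turn appends were not affected by the crash.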
Overview
To reproduce on master, run llama-server with a recent hybrid attention model such as Qwen3.6-MoE, fill the KV cache with a few conversation turns, then click "Continue" at the end of an assistant reply and watch the server abort on a partial seq_rm refusal.
Additional information
Fix this
Requirements