Skip to content

server: fix ret=-3 on hybrid/recurrent prompt cache and clear sticky stop flag#1673

Merged
ikawrakow merged 1 commit into
ikawrakow:mainfrom
markaalonzo:fix/hybrid-prompt-cache-sticky-stop
Apr 23, 2026
Merged

server: fix ret=-3 on hybrid/recurrent prompt cache and clear sticky stop flag#1673
ikawrakow merged 1 commit into
ikawrakow:mainfrom
markaalonzo:fix/hybrid-prompt-cache-sticky-stop

Conversation

@markaalonzo

@markaalonzo markaalonzo commented Apr 22, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes two issues that together produce llama_decode ret=-3 on hybrid / recurrent architectures (Qwen3.5 MoE, Qwen3.6 MoE, Qwen3-Next, Mamba), matching the symptom in #1576 and reproduced under our Qwen3.6 production workload.

  1. server_context::apply_checkpoint() is not hybrid-aware. The selector was written for transformer KV semantics (per-token pos_min window). Hybrid/recurrent models use a single whole-prefix state snapshot, so the transformer selector can either pick a checkpoint whose pos_max is past the current n_past (restoring state ahead of the decode position) or miss entirely and fall through to do_reset = true, which zeros slot.n_past / slot.n_past_prompt while the recurrent state in the context is still populated. The next decode batch then disagrees with the live recurrent state → ret=-3.

  2. stop_internal_decode is a sticky file-static global. llama_decode_stop() (called on client disconnect) sets it; the decode loop polls it and bails with ret=-3. It is only cleared on one conditional branch of server_slot::release(), so a stop signal that races past a decode that has already returned bleeds into the next llama_decode() call on the same context, producing an immediate ret=-3 with zero work performed.

Changes

examples/server/server-context.cpp

apply_checkpoint gates its logic on llama_model_has_recurrent(llama_get_model(slot.ctx)) — the same helper already used in this file at line 115 and line 1471:

  • Enter the checkpoint path whenever has_recurrent (not just on pos_min > pos_min_thold), so recurrent slots don't silently skip it.
  • Selector (recurrent branch): match the latest checkpoint with cur.pos_max <= slot.n_past && cur.pos_max < pos_next — a whole-prefix snapshot that still leaves at least one token to decode (preserving [TAG_PROMPT_LOGITS]).
  • On miss (recurrent branch): do not zero slot.n_past. update_slots() will continue from the valid n_past_prompt — this is the critical part; zeroing was the proximate trigger for the ret=-3.
  • Erase loop (recurrent branch): drop checkpoints with cur.pos_max > pos_next — stale snapshots ahead of the current decode position.

Transformer behavior is byte-identical to before (has_recurrent is false → all branches fall to the existing logic).

src/llama.cpp

Public llama_decode() entry clears stop_internal_decode before calling llama_decode_internal(). This scopes the stop signal to the in-flight decode it was meant to interrupt. Concurrent llama_decode_stop() during the decode still takes effect on the next loop iteration as before.

How to reproduce the bug (pre-fix)

Issue #1576 has the full reproduction. We hit it on Qwen3.6 (hybrid QWEN35MOE) via the prompt-cache path:

  • warm a conversational prompt on the server (a checkpoint is recorded),
  • send a follow-up prompt whose prefix matches the cached one partially,
  • apply_checkpoint falls through to do_reset = true, zeros n_past,
  • next llama_decode() returns -3 and the slot wedges until restart.

The sticky-flag case reproduces with a client that cancels a streaming completion mid-prefill and then retries on the same context.

Test plan

  • Builds clean with CUDA (sm_86): cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DGGML_IQK_MUL_MAT=ON -DGGML_IQK_FLASH_ATTENTION=ON -DGGML_IQK_FA_ALL_QUANTS=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build --target llama-server
  • No new public APIs; llama_model_has_recurrent is already declared in include/llama.h:684 and defined in src/llama-model.cpp:1872
  • Transformer path is unchanged (reviewed each edit — has_recurrent defaults false for non-hybrid, non-recurrent architectures)
  • Qwen3.6 validation on a live server (will follow this PR once maintainers weigh in on approach — happy to iterate)

Closes #1576

… stop flag

Two related issues that manifest as 'llama_decode ret=-3' on hybrid
architectures (e.g. Qwen3.5/3.6 MoE, Qwen3-Next), matching the symptom
reported in ikawrakow#1576.

1) server_context::apply_checkpoint() was written around transformer KV
   semantics (pos_min / pos_max per-token window). For hybrid and pure
   recurrent models the per-token pos_min threshold does not apply: the
   recurrent state is a single snapshot, and the server-side checkpoint
   is a whole-prefix record. The old selector 'cur.pos_min < pos_min_thold'
   can succeed on a checkpoint whose pos_max is past the current n_past,
   and — more commonly — fall through to do_reset = true, which zeros
   slot.n_past / slot.n_past_prompt. Zeroing in-place while the recurrent
   state in the context is still populated makes the next decode batch
   disagree with the live state, returning ret=-3.

   This change gates the checkpoint path on
   llama_model_has_recurrent(llama_get_model(slot.ctx)):
   - selector uses pos_max <= slot.n_past && pos_max < pos_next
     (whole-prefix match, leaves at least one token to decode);
   - on miss, slot state is preserved rather than zeroed, letting
     update_slots() continue from the already-valid n_past_prompt;
   - the erase loop drops any checkpoint whose pos_max > pos_next,
     matching the rewind semantics for recurrent state.

   Transformer behavior is unchanged.

2) stop_internal_decode is a file-static global in src/llama.cpp, set by
   llama_decode_stop() (called on client disconnect) and polled inside
   the decode loop to bail out with ret=-3. The flag is only cleared on
   one conditional path in server_slot::release(), so a stop signal that
   arrives after the interrupted llama_decode() has already returned
   bleeds into the NEXT decode call and causes an immediate ret=-3 with
   no work performed. Clear it at the top of the public llama_decode()
   entry so the signal is scoped to the in-flight decode it was meant
   for.

Build-verified: llama-server with GGML_CUDA=ON, -DCMAKE_CUDA_ARCHITECTURES=86
(sm_86), IQK flash-attn + matmul enabled. No new APIs introduced —
llama_model_has_recurrent is already public and already used elsewhere in
server-context.cpp.

Closes ikawrakow#1576
"https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055");
slot.n_past = 0;
slot.n_past_prompt = 0;
if (has_recurrent) {

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't understand this change. If we concluded that we need full prompt reprocessing, why would that not apply to a recurrent model?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the change is needed inside apply_checkpoint. At the current state, we only create checkpoint when it's a recurrent model. If it's not a recurrent model, there is no checkpoint to restore.

Comment thread src/llama.cpp
// is intended to interrupt the decode that is currently in flight; without this reset, a stop
// that arrived after the interrupted call returned would bleed into the next decode and cause
// an immediate ret=-3, which servers interpret as a fatal decode failure.
stop_internal_decode = false;

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting comment. Can you tells us how stop_internal_decode became true ?

@ikawrakow

Copy link
Copy Markdown
Owner

@firecoperana

As you have done all the checkpointing stuff, I'll refer this one to you to see if the changes make sense.

@firecoperana

Copy link
Copy Markdown
Collaborator

@markaalonzo stop_internal_decode will be cleared when the slot is released, and it does not invalidate kv cache leading to prompt fully reprocessed. Can you provide a detailed example to demonstrate the issue and whether this PR attempts to fix #1576 where there is no log to show why it occurred?

@FNsi

FNsi commented Apr 22, 2026

Copy link
Copy Markdown

I am currently facing this issue, maybe? From 56k to 86k, repeated 3 times. Didn't do anything special,

CMD I used
./llama-server -fa on -ctv q4_0 -ctk q4_0 -vhad -khad -m /home/ssss/Documents/Qwen3.6-35b-a3b-claude-4.6-opus-reasoning-distilled-t2-Q8_0.gguf --mmproj /home/ssss/Documents/mmproj-Qwen3.6-35B-F16.gguf -ctk-first q8_0,8 -ctk-last q8_0,8 -ctv-first q8_0,8 -ctv-last q8_0,8 --host 0.0.0.0 --port 8081 --chat-template-file /home/ssss/Documents/agent-template-step -c 524288 --rope-scaling yarn --rope-scale 2 --yarn-orig-ctx 262144 --alias Lumi -ub 2048 -b 2048 --no-warmup --jinja --webui llamacpp -t 8 -fdn 1024

I run with hemers-agent, even I don't think there's anything related to it, I was using Hermes to generate music in igpu, while I was running ik_llama in cpu; the tg and pp in the beginning of log is lower than normal, and I have no clue what it..(maybe there's a power competition between igpu and cpu so both going to slow)

**Jinja template: I think that could be my problem part... my failure to ignore the stepfun temple also apply sequence change for thinking block. Still not really make sense for the first 50k, might not fully explain why it trigger re evaluation endlessly after 50k; **

So my case might not really related.


{%- set reasoning_content = content.split("</think>")[0].rstrip("\n").split("<think>")[-1].lstrip("\n") %}
{%- set content = content.split("</think>")[-1].lstrip("\n") %}


======== Prompt cache: cache size: 54784, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.99, cache_ram_similarity: 0.50
 - looking for better prompt, base f_keep = 0.991, sim = 0.987, n_keep = 0, n_discarded_prompt = 0
 - cache state: 1 prompts, 2479.544 MiB (limits: 8192.000 MiB, 0 tokens, 176910 est)
   - prompt 0x63490003ca30:   53547 tokens,       0 discarded, checkpoints: 32,  2479.544 MiB
prompt cache load took 26.92 ms
banned strings is not supported by recurrent model, it will be disabled.
INFO [   launch_slot_with_task] slot is processing task | tid="126920474798784" timestamp=1776865431 id_slot=0 id_task=5863
======== Cache: cache_size = 54784, n_past0 =  17261, n_past1 =  17261, n_past_prompt1 = 17261,  n_past2 =  54248, n_past_prompt2 =  54231
Common part does not match fully
cache :  "thinking": 863227,
    "thinking": false,
    "lm_temperature": 0.9,

prompt:     "thinking": false,
    "lm_temperature": 0.9,
    "lm_cfg_scale": 3.0,

slot apply_checkp: id  0 | task 5863 | n_past = 54248, slot.prompt.tokens.size() = 54784, seq_id = 0, pos_min = 54782
slot apply_checkp: id  0 | task 5863 | restored context checkpoint took  28.89 ms (pos_min = 54245, pos_max = 54245, size = 63.228 MiB)
slot apply_checkp: id  0 | task 5863 | erased invalidated context checkpoint (pos_min = 54681, pos_max = 54681, size = 63.231 MiB)
slot apply_checkp: id  0 | task 5863 | erased invalidated context checkpoint (pos_min = 54687, pos_max = 54687, size = 63.231 MiB)
slot apply_checkp: id  0 | task 5863 | erased invalidated context checkpoint (pos_min = 54782, pos_max = 54782, size = 63.232 MiB)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776865431 id_slot=0 id_task=5863 p0=54246
slot create_check: id  0 | task 5863 | created context checkpoint 30 of 32 (pos_min = 55191, pos_max = 55191, size = 63.235 MiB, took 78.88 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776865461 id_slot=0 id_task=5863 p0=55192
slot create_check: id  0 | task 5863 | created context checkpoint 31 of 32 (pos_min = 55197, pos_max = 55197, size = 63.235 MiB, took 54.32 ms)
INFO [      log_server_request] request | tid="126875254900416" timestamp=1776865492 remote_addr="127.0.0.1" remote_port=51504 status=200 method="POST" path="/v1/chat/completions" params={}
slot print_timing: id  0 | task 5863 | 
prompt eval time =   31698.53 ms /   951 tokens (   33.33 ms per token,    30.00 tokens per second)
       eval time =   29009.61 ms /   110 tokens (  263.72 ms per token,     3.79 tokens per second)
      total time =   60708.14 ms /  1061 tokens
slot create_check: id  0 | task 5863 | created context checkpoint 32 of 32 (pos_min = 55305, pos_max = 55305, size = 63.236 MiB, took 68.72 ms)
INFO [           release_slots] slot released | tid="126920474798784" timestamp=1776865492 id_slot=0 id_task=5863 n_ctx=524288 n_past=55307 n_system_tokens=0 n_cache_tokens=55307 truncated=false
INFO [              slots_idle] all slots are idle | tid="126920474798784" timestamp=1776865492
======== Prompt cache: cache size: 55307, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.98, cache_ram_similarity: 0.50
 - looking for better prompt, base f_keep = 0.982, sim = 0.982, n_keep = 0, n_discarded_prompt = 0
 - cache state: 1 prompts, 2479.544 MiB (limits: 8192.000 MiB, 0 tokens, 176910 est)
   - prompt 0x63490003ca30:   53547 tokens,       0 discarded, checkpoints: 32,  2479.544 MiB
prompt cache load took 23.34 ms
banned strings is not supported by recurrent model, it will be disabled.
INFO [   launch_slot_with_task] slot is processing task | tid="126920474798784" timestamp=1776865552 id_slot=0 id_task=5975
======== Cache: cache_size = 55307, n_past0 =  17261, n_past1 =  17261, n_past_prompt1 = 17261,  n_past2 =  54248, n_past_prompt2 =  54231
Common part does not match fully
cache :  "thinking": 863227,
    "thinking": false,
    "lm_temperature": 0.9,

prompt:     "thinking": false,
    "lm_temperature": 0.9,
    "lm_cfg_scale": 3.0,

slot apply_checkp: id  0 | task 5975 | n_past = 54248, slot.prompt.tokens.size() = 55307, seq_id = 0, pos_min = 55305
slot apply_checkp: id  0 | task 5975 | restored context checkpoint took  93.43 ms (pos_min = 54245, pos_max = 54245, size = 63.228 MiB)
slot apply_checkp: id  0 | task 5975 | erased invalidated context checkpoint (pos_min = 55191, pos_max = 55191, size = 63.235 MiB)
slot apply_checkp: id  0 | task 5975 | erased invalidated context checkpoint (pos_min = 55197, pos_max = 55197, size = 63.235 MiB)
slot apply_checkp: id  0 | task 5975 | erased invalidated context checkpoint (pos_min = 55305, pos_max = 55305, size = 63.236 MiB)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776865552 id_slot=0 id_task=5975 p0=54246
slot create_check: id  0 | task 5975 | created context checkpoint 30 of 32 (pos_min = 55350, pos_max = 55350, size = 63.236 MiB, took 58.26 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776865600 id_slot=0 id_task=5975 p0=55351
slot create_check: id  0 | task 5975 | created context checkpoint 31 of 32 (pos_min = 55356, pos_max = 55356, size = 63.236 MiB, took 99.30 ms)
INFO [      log_server_request] request | tid="126875246507712" timestamp=1776865661 remote_addr="127.0.0.1" remote_port=38662 status=200 method="POST" path="/v1/chat/completions" params={}
slot print_timing: id  0 | task 5975 | 
prompt eval time =   49152.01 ms /  1110 tokens (   44.28 ms per token,    22.58 tokens per second)
       eval time =   59771.09 ms /   220 tokens (  271.69 ms per token,     3.68 tokens per second)
      total time =  108923.10 ms /  1330 tokens
INFO [      log_server_request] request | tid="126875238115008" timestamp=1776865661 remote_addr="127.0.0.1" remote_port=35544 status=200 method="GET" path="/v1/models" params={}
INFO [      log_server_request] request | tid="126875229722304" timestamp=1776865661 remote_addr="127.0.0.1" remote_port=35560 status=200 method="GET" path="/v1/props" params={}
slot create_check: id  0 | task 5975 | created context checkpoint 32 of 32 (pos_min = 55574, pos_max = 55574, size = 63.238 MiB, took 51.96 ms)
INFO [           release_slots] slot released | tid="126920474798784" timestamp=1776865661 id_slot=0 id_task=5975 n_ctx=524288 n_past=55576 n_system_tokens=0 n_cache_tokens=55576 truncated=false
INFO [              slots_idle] all slots are idle | tid="126920474798784" timestamp=1776865661
======== Prompt cache: cache size: 55576, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.98, cache_ram_similarity: 0.50
 - looking for better prompt, base f_keep = 0.979, sim = 0.974, n_keep = 0, n_discarded_prompt = 0
 - cache state: 1 prompts, 2479.544 MiB (limits: 8192.000 MiB, 0 tokens, 176910 est)
   - prompt 0x63490003ca30:   53547 tokens,       0 discarded, checkpoints: 32,  2479.544 MiB
prompt cache load took 22.77 ms
banned strings is not supported by recurrent model, it will be disabled.
INFO [   launch_slot_with_task] slot is processing task | tid="126920474798784" timestamp=1776865662 id_slot=0 id_task=6197
======== Cache: cache_size = 55576, n_past0 =  17261, n_past1 =  17261, n_past_prompt1 = 17261,  n_past2 =  54248, n_past_prompt2 =  54231
Common part does not match fully
cache :  "thinking": 863227,
    "thinking": false,
    "lm_temperature": 0.9,

prompt:     "thinking": false,
    "lm_temperature": 0.9,
    "lm_cfg_scale": 3.0,

slot apply_checkp: id  0 | task 6197 | n_past = 54248, slot.prompt.tokens.size() = 55576, seq_id = 0, pos_min = 55574
slot apply_checkp: id  0 | task 6197 | restored context checkpoint took  20.12 ms (pos_min = 54245, pos_max = 54245, size = 63.228 MiB)
slot apply_checkp: id  0 | task 6197 | erased invalidated context checkpoint (pos_min = 55350, pos_max = 55350, size = 63.236 MiB)
slot apply_checkp: id  0 | task 6197 | erased invalidated context checkpoint (pos_min = 55356, pos_max = 55356, size = 63.236 MiB)
slot apply_checkp: id  0 | task 6197 | erased invalidated context checkpoint (pos_min = 55574, pos_max = 55574, size = 63.238 MiB)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776865662 id_slot=0 id_task=6197 p0=54246
slot create_check: id  0 | task 6197 | created context checkpoint 30 of 32 (pos_min = 56082, pos_max = 56082, size = 63.242 MiB, took 173.24 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776865723 id_slot=0 id_task=6197 p0=56083
slot create_check: id  0 | task 6197 | created context checkpoint 31 of 32 (pos_min = 56088, pos_max = 56088, size = 63.242 MiB, took 39.13 ms)
slot print_timing: id  0 | task 6197 | 
prompt eval time =   61798.89 ms /  1842 tokens (   33.55 ms per token,    29.81 tokens per second)
       eval time =   23766.99 ms /    85 tokens (  279.61 ms per token,     3.58 tokens per second)
      total time =   85565.88 ms /  1927 tokens
INFO [      log_server_request] request | tid="126875221329600" timestamp=1776865748 remote_addr="127.0.0.1" remote_port=35566 status=200 method="POST" path="/v1/chat/completions" params={}
slot create_check: id  0 | task 6197 | created context checkpoint 32 of 32 (pos_min = 56171, pos_max = 56171, size = 63.243 MiB, took 118.00 ms)
INFO [           release_slots] slot released | tid="126920474798784" timestamp=1776865748 id_slot=0 id_task=6197 n_ctx=524288 n_past=56173 n_system_tokens=0 n_cache_tokens=56173 truncated=false
INFO [              slots_idle] all slots are idle | tid="126920474798784" timestamp=1776865748
======== Prompt cache: cache size: 56173, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.97, cache_ram_similarity: 0.50
 - looking for better prompt, base f_keep = 0.969, sim = 0.776, n_keep = 0, n_discarded_prompt = 0
 - cache state: 1 prompts, 2479.544 MiB (limits: 8192.000 MiB, 0 tokens, 176910 est)
   - prompt 0x63490003ca30:   53547 tokens,       0 discarded, checkpoints: 32,  2479.544 MiB
prompt cache load took 24.71 ms
banned strings is not supported by recurrent model, it will be disabled.
INFO [   launch_slot_with_task] slot is processing task | tid="126920474798784" timestamp=1776865748 id_slot=0 id_task=6284
======== Cache: cache_size = 56173, n_past0 =  17261, n_past1 =  17261, n_past_prompt1 = 17261,  n_past2 =  54248, n_past_prompt2 =  54231
Common part does not match fully
cache :  "thinking": 863227,
    "thinking": false,
    "lm_temperature": 0.9,

prompt:     "thinking": false,
    "lm_temperature": 0.9,
    "lm_cfg_scale": 3.0,

slot apply_checkp: id  0 | task 6284 | n_past = 54248, slot.prompt.tokens.size() = 56173, seq_id = 0, pos_min = 56171
slot apply_checkp: id  0 | task 6284 | restored context checkpoint took  51.49 ms (pos_min = 54245, pos_max = 54245, size = 63.228 MiB)
slot apply_checkp: id  0 | task 6284 | erased invalidated context checkpoint (pos_min = 56082, pos_max = 56082, size = 63.242 MiB)
slot apply_checkp: id  0 | task 6284 | erased invalidated context checkpoint (pos_min = 56088, pos_max = 56088, size = 63.242 MiB)
slot apply_checkp: id  0 | task 6284 | erased invalidated context checkpoint (pos_min = 56171, pos_max = 56171, size = 63.243 MiB)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776865749 id_slot=0 id_task=6284 p0=54246
slot create_check: id  0 | task 6284 | created context checkpoint 30 of 32 (pos_min = 56293, pos_max = 56293, size = 63.243 MiB, took 127.78 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776865816 id_slot=0 id_task=6284 p0=56294
slot create_check: id  0 | task 6284 | created context checkpoint 31 of 32 (pos_min = 58341, pos_max = 58341, size = 63.259 MiB, took 61.72 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776865885 id_slot=0 id_task=6284 p0=58342
slot create_check: id  0 | task 6284 | created context checkpoint 32 of 32 (pos_min = 60389, pos_max = 60389, size = 63.275 MiB, took 55.40 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776865955 id_slot=0 id_task=6284 p0=60390
slot create_check: id  0 | task 6284 | erasing old context checkpoint (pos_min = 17970, pos_max = 17970, size = 62.951 MiB)
slot create_check: id  0 | task 6284 | created context checkpoint 32 of 32 (pos_min = 62437, pos_max = 62437, size = 63.290 MiB, took 87.49 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776866028 id_slot=0 id_task=6284 p0=62438
slot create_check: id  0 | task 6284 | erasing old context checkpoint (pos_min = 18060, pos_max = 18060, size = 62.952 MiB)
slot create_check: id  0 | task 6284 | created context checkpoint 32 of 32 (pos_min = 64485, pos_max = 64485, size = 63.306 MiB, took 40.71 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776866100 id_slot=0 id_task=6284 p0=64486
slot create_check: id  0 | task 6284 | erasing old context checkpoint (pos_min = 18169, pos_max = 18169, size = 62.953 MiB)
slot create_check: id  0 | task 6284 | created context checkpoint 32 of 32 (pos_min = 66533, pos_max = 66533, size = 63.322 MiB, took 43.01 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776866161 id_slot=0 id_task=6284 p0=66534
slot create_check: id  0 | task 6284 | erasing old context checkpoint (pos_min = 18175, pos_max = 18175, size = 62.953 MiB)
slot create_check: id  0 | task 6284 | created context checkpoint 32 of 32 (pos_min = 68581, pos_max = 68581, size = 63.337 MiB, took 39.67 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776866223 id_slot=0 id_task=6284 p0=68582
slot create_check: id  0 | task 6284 | erasing old context checkpoint (pos_min = 18269, pos_max = 18269, size = 62.953 MiB)
slot create_check: id  0 | task 6284 | created context checkpoint 32 of 32 (pos_min = 70629, pos_max = 70629, size = 63.353 MiB, took 42.04 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776866286 id_slot=0 id_task=6284 p0=70630
slot create_check: id  0 | task 6284 | erasing old context checkpoint (pos_min = 18436, pos_max = 18436, size = 62.955 MiB)
slot create_check: id  0 | task 6284 | created context checkpoint 32 of 32 (pos_min = 72677, pos_max = 72677, size = 63.368 MiB, took 41.94 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776866350 id_slot=0 id_task=6284 p0=72678
slot create_check: id  0 | task 6284 | erasing old context checkpoint (pos_min = 18442, pos_max = 18442, size = 62.955 MiB)
slot create_check: id  0 | task 6284 | created context checkpoint 32 of 32 (pos_min = 74725, pos_max = 74725, size = 63.384 MiB, took 40.50 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776866415 id_slot=0 id_task=6284 p0=74726
slot create_check: id  0 | task 6284 | erasing old context checkpoint (pos_min = 18684, pos_max = 18684, size = 62.957 MiB)
slot create_check: id  0 | task 6284 | created context checkpoint 32 of 32 (pos_min = 76773, pos_max = 76773, size = 63.400 MiB, took 40.40 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776866481 id_slot=0 id_task=6284 p0=76774
slot create_check: id  0 | task 6284 | erasing old context checkpoint (pos_min = 20733, pos_max = 20733, size = 62.972 MiB)
slot create_check: id  0 | task 6284 | created context checkpoint 32 of 32 (pos_min = 78821, pos_max = 78821, size = 63.415 MiB, took 39.60 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776866549 id_slot=0 id_task=6284 p0=78822
slot create_check: id  0 | task 6284 | erasing old context checkpoint (pos_min = 22781, pos_max = 22781, size = 62.988 MiB)
slot create_check: id  0 | task 6284 | created context checkpoint 32 of 32 (pos_min = 80869, pos_max = 80869, size = 63.431 MiB, took 42.06 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776866619 id_slot=0 id_task=6284 p0=80870
slot create_check: id  0 | task 6284 | erasing old context checkpoint (pos_min = 24829, pos_max = 24829, size = 63.003 MiB)
slot create_check: id  0 | task 6284 | created context checkpoint 32 of 32 (pos_min = 82917, pos_max = 82917, size = 63.447 MiB, took 40.70 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776866690 id_slot=0 id_task=6284 p0=82918
slot create_check: id  0 | task 6284 | erasing old context checkpoint (pos_min = 25062, pos_max = 25062, size = 63.005 MiB)
slot create_check: id  0 | task 6284 | created context checkpoint 32 of 32 (pos_min = 83685, pos_max = 83685, size = 63.452 MiB, took 40.16 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776866717 id_slot=0 id_task=6284 p0=83686
slot create_check: id  0 | task 6284 | erasing old context checkpoint (pos_min = 25340, pos_max = 25340, size = 63.007 MiB)
slot create_check: id  0 | task 6284 | created context checkpoint 32 of 32 (pos_min = 83691, pos_max = 83691, size = 63.453 MiB, took 44.96 ms)
slot create_check: id  0 | task 6284 | erasing old context checkpoint (pos_min = 27110, pos_max = 27110, size = 63.021 MiB)
slot create_check: id  0 | task 6284 | created context checkpoint 32 of 32 (pos_min = 84202, pos_max = 84202, size = 63.456 MiB, took 44.31 ms)
slot print_timing: id  0 | task 6284 | 
prompt eval time =  968668.81 ms / 29445 tokens (   32.90 ms per token,    30.40 tokens per second)
       eval time =   60080.47 ms /   520 tokens (  115.54 ms per token,     8.66 tokens per second)
      total time = 1028749.28 ms / 29965 tokens
INFO [      log_server_request] request | tid="126875212936896" timestamp=1776866777 remote_addr="127.0.0.1" remote_port=38902 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [      log_server_request] request | tid="126875204544192" timestamp=1776866777 remote_addr="127.0.0.1" remote_port=41378 status=200 method="GET" path="/v1/models" params={}
slot create_check: id  0 | task 6284 | erasing old context checkpoint (pos_min = 29158, pos_max = 29158, size = 63.036 MiB)
INFO [      log_server_request] request | tid="126875196151488" timestamp=1776866777 remote_addr="127.0.0.1" remote_port=41386 status=200 method="GET" path="/v1/props" params={}
slot create_check: id  0 | task 6284 | created context checkpoint 32 of 32 (pos_min = 84209, pos_max = 84209, size = 63.456 MiB, took 49.74 ms)
INFO [           release_slots] slot released | tid="126920474798784" timestamp=1776866777 id_slot=0 id_task=6284 n_ctx=524288 n_past=84211 n_system_tokens=0 n_cache_tokens=84211 truncated=false
INFO [              slots_idle] all slots are idle | tid="126920474798784" timestamp=1776866777
======== Prompt cache: cache size: 84211, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.64, cache_ram_similarity: 0.50
 - looking for better prompt, base f_keep = 0.643, sim = 0.643, n_keep = 0, n_discarded_prompt = 0
 - cache state: 1 prompts, 2479.544 MiB (limits: 8192.000 MiB, 0 tokens, 176910 est)
   - prompt 0x63490003ca30:   53547 tokens,       0 discarded, checkpoints: 32,  2479.544 MiB
prompt cache load took 26.74 ms
banned strings is not supported by recurrent model, it will be disabled.
INFO [   launch_slot_with_task] slot is processing task | tid="126920474798784" timestamp=1776866778 id_slot=0 id_task=6820
======== Cache: cache_size = 84211, n_past0 =  17261, n_past1 =  17261, n_past_prompt1 = 17261,  n_past2 =  54248, n_past_prompt2 =  54231
Common part does not match fully
cache :  "thinking": 863227,
    "thinking": false,
    "lm_temperature": 0.9,

prompt:     "thinking": false,
    "lm_temperature": 0.9,
    "lm_cfg_scale": 3.0,

slot apply_checkp: id  0 | task 6820 | n_past = 54248, slot.prompt.tokens.size() = 84211, seq_id = 0, pos_min = 84209
slot apply_checkp: id  0 | task 6820 | restored context checkpoint took  13.36 ms (pos_min = 54245, pos_max = 54245, size = 63.228 MiB)
slot apply_checkp: id  0 | task 6820 | erased invalidated context checkpoint (pos_min = 56293, pos_max = 56293, size = 63.243 MiB)
slot apply_checkp: id  0 | task 6820 | erased invalidated context checkpoint (pos_min = 58341, pos_max = 58341, size = 63.259 MiB)
slot apply_checkp: id  0 | task 6820 | erased invalidated context checkpoint (pos_min = 60389, pos_max = 60389, size = 63.275 MiB)
slot apply_checkp: id  0 | task 6820 | erased invalidated context checkpoint (pos_min = 62437, pos_max = 62437, size = 63.290 MiB)
slot apply_checkp: id  0 | task 6820 | erased invalidated context checkpoint (pos_min = 64485, pos_max = 64485, size = 63.306 MiB)
slot apply_checkp: id  0 | task 6820 | erased invalidated context checkpoint (pos_min = 66533, pos_max = 66533, size = 63.322 MiB)
slot apply_checkp: id  0 | task 6820 | erased invalidated context checkpoint (pos_min = 68581, pos_max = 68581, size = 63.337 MiB)
slot apply_checkp: id  0 | task 6820 | erased invalidated context checkpoint (pos_min = 70629, pos_max = 70629, size = 63.353 MiB)
slot apply_checkp: id  0 | task 6820 | erased invalidated context checkpoint (pos_min = 72677, pos_max = 72677, size = 63.368 MiB)
slot apply_checkp: id  0 | task 6820 | erased invalidated context checkpoint (pos_min = 74725, pos_max = 74725, size = 63.384 MiB)
slot apply_checkp: id  0 | task 6820 | erased invalidated context checkpoint (pos_min = 76773, pos_max = 76773, size = 63.400 MiB)
slot apply_checkp: id  0 | task 6820 | erased invalidated context checkpoint (pos_min = 78821, pos_max = 78821, size = 63.415 MiB)
slot apply_checkp: id  0 | task 6820 | erased invalidated context checkpoint (pos_min = 80869, pos_max = 80869, size = 63.431 MiB)
slot apply_checkp: id  0 | task 6820 | erased invalidated context checkpoint (pos_min = 82917, pos_max = 82917, size = 63.447 MiB)
slot apply_checkp: id  0 | task 6820 | erased invalidated context checkpoint (pos_min = 83685, pos_max = 83685, size = 63.452 MiB)
slot apply_checkp: id  0 | task 6820 | erased invalidated context checkpoint (pos_min = 83691, pos_max = 83691, size = 63.453 MiB)
slot apply_checkp: id  0 | task 6820 | erased invalidated context checkpoint (pos_min = 84202, pos_max = 84202, size = 63.456 MiB)
slot apply_checkp: id  0 | task 6820 | erased invalidated context checkpoint (pos_min = 84209, pos_max = 84209, size = 63.456 MiB)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776866778 id_slot=0 id_task=6820 p0=54246
slot create_check: id  0 | task 6820 | created context checkpoint 15 of 32 (pos_min = 56293, pos_max = 56293, size = 63.243 MiB, took 105.70 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776866841 id_slot=0 id_task=6820 p0=56294
slot create_check: id  0 | task 6820 | created context checkpoint 16 of 32 (pos_min = 58341, pos_max = 58341, size = 63.259 MiB, took 107.02 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776866910 id_slot=0 id_task=6820 p0=58342
slot create_check: id  0 | task 6820 | created context checkpoint 17 of 32 (pos_min = 60389, pos_max = 60389, size = 63.275 MiB, took 110.00 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776866981 id_slot=0 id_task=6820 p0=60390
slot create_check: id  0 | task 6820 | created context checkpoint 18 of 32 (pos_min = 62437, pos_max = 62437, size = 63.290 MiB, took 98.20 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776867054 id_slot=0 id_task=6820 p0=62438
slot create_check: id  0 | task 6820 | created context checkpoint 19 of 32 (pos_min = 64485, pos_max = 64485, size = 63.306 MiB, took 77.37 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776867129 id_slot=0 id_task=6820 p0=64486
slot create_check: id  0 | task 6820 | created context checkpoint 20 of 32 (pos_min = 66533, pos_max = 66533, size = 63.322 MiB, took 70.37 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776867205 id_slot=0 id_task=6820 p0=66534
slot create_check: id  0 | task 6820 | created context checkpoint 21 of 32 (pos_min = 68581, pos_max = 68581, size = 63.337 MiB, took 73.30 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776867283 id_slot=0 id_task=6820 p0=68582
slot create_check: id  0 | task 6820 | created context checkpoint 22 of 32 (pos_min = 70629, pos_max = 70629, size = 63.353 MiB, took 102.95 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776867365 id_slot=0 id_task=6820 p0=70630
slot create_check: id  0 | task 6820 | created context checkpoint 23 of 32 (pos_min = 72677, pos_max = 72677, size = 63.368 MiB, took 91.85 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776867447 id_slot=0 id_task=6820 p0=72678
slot create_check: id  0 | task 6820 | created context checkpoint 24 of 32 (pos_min = 74725, pos_max = 74725, size = 63.384 MiB, took 78.53 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776867531 id_slot=0 id_task=6820 p0=74726
slot create_check: id  0 | task 6820 | created context checkpoint 25 of 32 (pos_min = 76773, pos_max = 76773, size = 63.400 MiB, took 40.02 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776867609 id_slot=0 id_task=6820 p0=76774
slot create_check: id  0 | task 6820 | created context checkpoint 26 of 32 (pos_min = 78821, pos_max = 78821, size = 63.415 MiB, took 36.42 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776867678 id_slot=0 id_task=6820 p0=78822
slot create_check: id  0 | task 6820 | created context checkpoint 27 of 32 (pos_min = 80869, pos_max = 80869, size = 63.431 MiB, took 36.19 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776867748 id_slot=0 id_task=6820 p0=80870
slot create_check: id  0 | task 6820 | created context checkpoint 28 of 32 (pos_min = 82917, pos_max = 82917, size = 63.447 MiB, took 39.81 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776867820 id_slot=0 id_task=6820 p0=82918
slot create_check: id  0 | task 6820 | created context checkpoint 29 of 32 (pos_min = 84276, pos_max = 84276, size = 63.457 MiB, took 39.19 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776867869 id_slot=0 id_task=6820 p0=84277
slot create_check: id  0 | task 6820 | created context checkpoint 30 of 32 (pos_min = 84282, pos_max = 84282, size = 63.457 MiB, took 38.16 ms)
slot print_timing: id  0 | task 6820 | 
prompt eval time = 1092000.39 ms / 30036 tokens (   36.36 ms per token,    27.51 tokens per second)
       eval time =   15106.74 ms /   131 tokens (  115.32 ms per token,     8.67 tokens per second)
      total time = 1107107.13 ms / 30167 tokens
INFO [      log_server_request] request | tid="126875187758784" timestamp=1776867885 remote_addr="127.0.0.1" remote_port=41398 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [      log_server_request] request | tid="126875179366080" timestamp=1776867885 remote_addr="127.0.0.1" remote_port=54934 status=200 method="GET" path="/v1/models" params={}
INFO [      log_server_request] request | tid="126875170973376" timestamp=1776867885 remote_addr="127.0.0.1" remote_port=54944 status=200 method="GET" path="/v1/props" params={}
slot create_check: id  0 | task 6820 | created context checkpoint 31 of 32 (pos_min = 84411, pos_max = 84411, size = 63.458 MiB, took 39.56 ms)
INFO [           release_slots] slot released | tid="126920474798784" timestamp=1776867885 id_slot=0 id_task=6820 n_ctx=524288 n_past=84413 n_system_tokens=0 n_cache_tokens=84413 truncated=false
INFO [              slots_idle] all slots are idle | tid="126920474798784" timestamp=1776867885
======== Prompt cache: cache size: 84413, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.64, cache_ram_similarity: 0.50
 - looking for better prompt, base f_keep = 0.642, sim = 0.642, n_keep = 0, n_discarded_prompt = 0
 - cache state: 1 prompts, 2479.544 MiB (limits: 8192.000 MiB, 0 tokens, 176910 est)
   - prompt 0x63490003ca30:   53547 tokens,       0 discarded, checkpoints: 32,  2479.544 MiB
prompt cache load took 26.37 ms
banned strings is not supported by recurrent model, it will be disabled.
INFO [   launch_slot_with_task] slot is processing task | tid="126920474798784" timestamp=1776867885 id_slot=0 id_task=6967
======== Cache: cache_size = 84413, n_past0 =  17261, n_past1 =  17261, n_past_prompt1 = 17261,  n_past2 =  54248, n_past_prompt2 =  54231
Common part does not match fully
cache :  "thinking": 863227,
    "thinking": false,
    "lm_temperature": 0.9,

prompt:     "thinking": false,
    "lm_temperature": 0.9,
    "lm_cfg_scale": 3.0,

slot apply_checkp: id  0 | task 6967 | n_past = 54248, slot.prompt.tokens.size() = 84413, seq_id = 0, pos_min = 84411
slot apply_checkp: id  0 | task 6967 | restored context checkpoint took  13.10 ms (pos_min = 54245, pos_max = 54245, size = 63.228 MiB)
slot apply_checkp: id  0 | task 6967 | erased invalidated context checkpoint (pos_min = 56293, pos_max = 56293, size = 63.243 MiB)
slot apply_checkp: id  0 | task 6967 | erased invalidated context checkpoint (pos_min = 58341, pos_max = 58341, size = 63.259 MiB)
slot apply_checkp: id  0 | task 6967 | erased invalidated context checkpoint (pos_min = 60389, pos_max = 60389, size = 63.275 MiB)
slot apply_checkp: id  0 | task 6967 | erased invalidated context checkpoint (pos_min = 62437, pos_max = 62437, size = 63.290 MiB)
slot apply_checkp: id  0 | task 6967 | erased invalidated context checkpoint (pos_min = 64485, pos_max = 64485, size = 63.306 MiB)
slot apply_checkp: id  0 | task 6967 | erased invalidated context checkpoint (pos_min = 66533, pos_max = 66533, size = 63.322 MiB)
slot apply_checkp: id  0 | task 6967 | erased invalidated context checkpoint (pos_min = 68581, pos_max = 68581, size = 63.337 MiB)
slot apply_checkp: id  0 | task 6967 | erased invalidated context checkpoint (pos_min = 70629, pos_max = 70629, size = 63.353 MiB)
slot apply_checkp: id  0 | task 6967 | erased invalidated context checkpoint (pos_min = 72677, pos_max = 72677, size = 63.368 MiB)
slot apply_checkp: id  0 | task 6967 | erased invalidated context checkpoint (pos_min = 74725, pos_max = 74725, size = 63.384 MiB)
slot apply_checkp: id  0 | task 6967 | erased invalidated context checkpoint (pos_min = 76773, pos_max = 76773, size = 63.400 MiB)
slot apply_checkp: id  0 | task 6967 | erased invalidated context checkpoint (pos_min = 78821, pos_max = 78821, size = 63.415 MiB)
slot apply_checkp: id  0 | task 6967 | erased invalidated context checkpoint (pos_min = 80869, pos_max = 80869, size = 63.431 MiB)
slot apply_checkp: id  0 | task 6967 | erased invalidated context checkpoint (pos_min = 82917, pos_max = 82917, size = 63.447 MiB)
slot apply_checkp: id  0 | task 6967 | erased invalidated context checkpoint (pos_min = 84276, pos_max = 84276, size = 63.457 MiB)
slot apply_checkp: id  0 | task 6967 | erased invalidated context checkpoint (pos_min = 84282, pos_max = 84282, size = 63.457 MiB)
slot apply_checkp: id  0 | task 6967 | erased invalidated context checkpoint (pos_min = 84411, pos_max = 84411, size = 63.458 MiB)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776867885 id_slot=0 id_task=6967 p0=54246
slot create_check: id  0 | task 6967 | created context checkpoint 15 of 32 (pos_min = 56293, pos_max = 56293, size = 63.243 MiB, took 39.81 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776867938 id_slot=0 id_task=6967 p0=56294
slot create_check: id  0 | task 6967 | created context checkpoint 16 of 32 (pos_min = 58341, pos_max = 58341, size = 63.259 MiB, took 39.41 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776867992 id_slot=0 id_task=6967 p0=58342
slot create_check: id  0 | task 6967 | created context checkpoint 17 of 32 (pos_min = 60389, pos_max = 60389, size = 63.275 MiB, took 38.84 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776868048 id_slot=0 id_task=6967 p0=60390
slot create_check: id  0 | task 6967 | created context checkpoint 18 of 32 (pos_min = 62437, pos_max = 62437, size = 63.290 MiB, took 40.07 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776868105 id_slot=0 id_task=6967 p0=62438
slot create_check: id  0 | task 6967 | created context checkpoint 19 of 32 (pos_min = 64485, pos_max = 64485, size = 63.306 MiB, took 38.00 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776868164 id_slot=0 id_task=6967 p0=64486
slot create_check: id  0 | task 6967 | created context checkpoint 20 of 32 (pos_min = 66533, pos_max = 66533, size = 63.322 MiB, took 38.24 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776868224 id_slot=0 id_task=6967 p0=66534
slot create_check: id  0 | task 6967 | created context checkpoint 21 of 32 (pos_min = 68581, pos_max = 68581, size = 63.337 MiB, took 37.17 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776868285 id_slot=0 id_task=6967 p0=68582
slot create_check: id  0 | task 6967 | created context checkpoint 22 of 32 (pos_min = 70629, pos_max = 70629, size = 63.353 MiB, took 36.97 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776868347 id_slot=0 id_task=6967 p0=70630
slot create_check: id  0 | task 6967 | created context checkpoint 23 of 32 (pos_min = 72677, pos_max = 72677, size = 63.368 MiB, took 36.15 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776868411 id_slot=0 id_task=6967 p0=72678
slot create_check: id  0 | task 6967 | created context checkpoint 24 of 32 (pos_min = 74725, pos_max = 74725, size = 63.384 MiB, took 37.86 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776868477 id_slot=0 id_task=6967 p0=74726
slot create_check: id  0 | task 6967 | created context checkpoint 25 of 32 (pos_min = 76773, pos_max = 76773, size = 63.400 MiB, took 36.92 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776868543 id_slot=0 id_task=6967 p0=76774
slot create_check: id  0 | task 6967 | created context checkpoint 26 of 32 (pos_min = 78821, pos_max = 78821, size = 63.415 MiB, took 36.91 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776868611 id_slot=0 id_task=6967 p0=78822
slot create_check: id  0 | task 6967 | created context checkpoint 27 of 32 (pos_min = 80869, pos_max = 80869, size = 63.431 MiB, took 37.15 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776868681 id_slot=0 id_task=6967 p0=80870
slot create_check: id  0 | task 6967 | created context checkpoint 28 of 32 (pos_min = 82917, pos_max = 82917, size = 63.447 MiB, took 37.30 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776868752 id_slot=0 id_task=6967 p0=82918
slot create_check: id  0 | task 6967 | created context checkpoint 29 of 32 (pos_min = 84464, pos_max = 84464, size = 63.458 MiB, took 37.62 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776868807 id_slot=0 id_task=6967 p0=84465
slot create_check: id  0 | task 6967 | created context checkpoint 30 of 32 (pos_min = 84470, pos_max = 84470, size = 63.458 MiB, took 36.06 ms)
slot print_timing: id  0 | task 6967 | 
prompt eval time =  922416.76 ms / 30224 tokens (   30.52 ms per token,    32.77 tokens per second)
       eval time =   10651.64 ms /    93 tokens (  114.53 ms per token,     8.73 tokens per second)
      total time =  933068.40 ms / 30317 tokens
INFO [      log_server_request] request | tid="126875363940032" timestamp=1776868818 remote_addr="127.0.0.1" remote_port=54958 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [      log_server_request] request | tid="126875280078528" timestamp=1776868818 remote_addr="127.0.0.1" remote_port=50058 status=200 method="GET" path="/v1/models" params={}
INFO [      log_server_request] request | tid="126875271685824" timestamp=1776868818 remote_addr="127.0.0.1" remote_port=50066 status=200 method="GET" path="/v1/props" params={}
slot create_check: id  0 | task 6967 | created context checkpoint 31 of 32 (pos_min = 84561, pos_max = 84561, size = 63.459 MiB, took 37.93 ms)
INFO [           release_slots] slot released | tid="126920474798784" timestamp=1776868818 id_slot=0 id_task=6967 n_ctx=524288 n_past=84563 n_system_tokens=0 n_cache_tokens=84563 truncated=false
INFO [              slots_idle] all slots are idle | tid="126920474798784" timestamp=1776868818
======== Prompt cache: cache size: 84563, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.64, cache_ram_similarity: 0.50
 - looking for better prompt, base f_keep = 0.641, sim = 0.636, n_keep = 0, n_discarded_prompt = 0
 - cache state: 1 prompts, 2479.544 MiB (limits: 8192.000 MiB, 0 tokens, 176910 est)
   - prompt 0x63490003ca30:   53547 tokens,       0 discarded, checkpoints: 32,  2479.544 MiB
prompt cache load took 25.69 ms
banned strings is not supported by recurrent model, it will be disabled.
INFO [   launch_slot_with_task] slot is processing task | tid="126920474798784" timestamp=1776868819 id_slot=0 id_task=7076
======== Cache: cache_size = 84563, n_past0 =  17261, n_past1 =  17261, n_past_prompt1 = 17261,  n_past2 =  54248, n_past_prompt2 =  54231
Common part does not match fully
cache :  "thinking": 863227,
    "thinking": false,
    "lm_temperature": 0.9,

prompt:     "thinking": false,
    "lm_temperature": 0.9,
    "lm_cfg_scale": 3.0,

slot apply_checkp: id  0 | task 7076 | n_past = 54248, slot.prompt.tokens.size() = 84563, seq_id = 0, pos_min = 84561
slot apply_checkp: id  0 | task 7076 | restored context checkpoint took  12.89 ms (pos_min = 54245, pos_max = 54245, size = 63.228 MiB)
slot apply_checkp: id  0 | task 7076 | erased invalidated context checkpoint (pos_min = 56293, pos_max = 56293, size = 63.243 MiB)
slot apply_checkp: id  0 | task 7076 | erased invalidated context checkpoint (pos_min = 58341, pos_max = 58341, size = 63.259 MiB)
slot apply_checkp: id  0 | task 7076 | erased invalidated context checkpoint (pos_min = 60389, pos_max = 60389, size = 63.275 MiB)
slot apply_checkp: id  0 | task 7076 | erased invalidated context checkpoint (pos_min = 62437, pos_max = 62437, size = 63.290 MiB)
slot apply_checkp: id  0 | task 7076 | erased invalidated context checkpoint (pos_min = 64485, pos_max = 64485, size = 63.306 MiB)
slot apply_checkp: id  0 | task 7076 | erased invalidated context checkpoint (pos_min = 66533, pos_max = 66533, size = 63.322 MiB)
slot apply_checkp: id  0 | task 7076 | erased invalidated context checkpoint (pos_min = 68581, pos_max = 68581, size = 63.337 MiB)
slot apply_checkp: id  0 | task 7076 | erased invalidated context checkpoint (pos_min = 70629, pos_max = 70629, size = 63.353 MiB)
slot apply_checkp: id  0 | task 7076 | erased invalidated context checkpoint (pos_min = 72677, pos_max = 72677, size = 63.368 MiB)
slot apply_checkp: id  0 | task 7076 | erased invalidated context checkpoint (pos_min = 74725, pos_max = 74725, size = 63.384 MiB)
slot apply_checkp: id  0 | task 7076 | erased invalidated context checkpoint (pos_min = 76773, pos_max = 76773, size = 63.400 MiB)
slot apply_checkp: id  0 | task 7076 | erased invalidated context checkpoint (pos_min = 78821, pos_max = 78821, size = 63.415 MiB)
slot apply_checkp: id  0 | task 7076 | erased invalidated context checkpoint (pos_min = 80869, pos_max = 80869, size = 63.431 MiB)
slot apply_checkp: id  0 | task 7076 | erased invalidated context checkpoint (pos_min = 82917, pos_max = 82917, size = 63.447 MiB)
slot apply_checkp: id  0 | task 7076 | erased invalidated context checkpoint (pos_min = 84464, pos_max = 84464, size = 63.458 MiB)
slot apply_checkp: id  0 | task 7076 | erased invalidated context checkpoint (pos_min = 84470, pos_max = 84470, size = 63.458 MiB)
slot apply_checkp: id  0 | task 7076 | erased invalidated context checkpoint (pos_min = 84561, pos_max = 84561, size = 63.459 MiB)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776868819 id_slot=0 id_task=7076 p0=54246
slot create_check: id  0 | task 7076 | created context checkpoint 15 of 32 (pos_min = 56293, pos_max = 56293, size = 63.243 MiB, took 39.64 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776868871 id_slot=0 id_task=7076 p0=56294
slot create_check: id  0 | task 7076 | created context checkpoint 16 of 32 (pos_min = 58341, pos_max = 58341, size = 63.259 MiB, took 39.75 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776868926 id_slot=0 id_task=7076 p0=58342
slot create_check: id  0 | task 7076 | created context checkpoint 17 of 32 (pos_min = 60389, pos_max = 60389, size = 63.275 MiB, took 40.67 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776868981 id_slot=0 id_task=7076 p0=60390
slot create_check: id  0 | task 7076 | created context checkpoint 18 of 32 (pos_min = 62437, pos_max = 62437, size = 63.290 MiB, took 37.93 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776869038 id_slot=0 id_task=7076 p0=62438
slot create_check: id  0 | task 7076 | created context checkpoint 19 of 32 (pos_min = 64485, pos_max = 64485, size = 63.306 MiB, took 38.84 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776869097 id_slot=0 id_task=7076 p0=64486
slot create_check: id  0 | task 7076 | created context checkpoint 20 of 32 (pos_min = 66533, pos_max = 66533, size = 63.322 MiB, took 38.58 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776869157 id_slot=0 id_task=7076 p0=66534
slot create_check: id  0 | task 7076 | created context checkpoint 21 of 32 (pos_min = 68581, pos_max = 68581, size = 63.337 MiB, took 40.58 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776869227 id_slot=0 id_task=7076 p0=68582
slot create_check: id  0 | task 7076 | created context checkpoint 22 of 32 (pos_min = 70629, pos_max = 70629, size = 63.353 MiB, took 36.85 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776869306 id_slot=0 id_task=7076 p0=70630
slot create_check: id  0 | task 7076 | created context checkpoint 23 of 32 (pos_min = 72677, pos_max = 72677, size = 63.368 MiB, took 38.13 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776869388 id_slot=0 id_task=7076 p0=72678
slot create_check: id  0 | task 7076 | created context checkpoint 24 of 32 (pos_min = 74725, pos_max = 74725, size = 63.384 MiB, took 39.30 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776869457 id_slot=0 id_task=7076 p0=74726
slot create_check: id  0 | task 7076 | created context checkpoint 25 of 32 (pos_min = 76773, pos_max = 76773, size = 63.400 MiB, took 39.61 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776869524 id_slot=0 id_task=7076 p0=76774
slot create_check: id  0 | task 7076 | created context checkpoint 26 of 32 (pos_min = 78821, pos_max = 78821, size = 63.415 MiB, took 37.95 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776869593 id_slot=0 id_task=7076 p0=78822
slot create_check: id  0 | task 7076 | created context checkpoint 27 of 32 (pos_min = 80869, pos_max = 80869, size = 63.431 MiB, took 38.14 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776869663 id_slot=0 id_task=7076 p0=80870
slot create_check: id  0 | task 7076 | created context checkpoint 28 of 32 (pos_min = 82917, pos_max = 82917, size = 63.447 MiB, took 37.36 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776869735 id_slot=0 id_task=7076 p0=82918
slot create_check: id  0 | task 7076 | created context checkpoint 29 of 32 (pos_min = 84965, pos_max = 84965, size = 63.462 MiB, took 37.62 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776869808 id_slot=0 id_task=7076 p0=84966
slot create_check: id  0 | task 7076 | created context checkpoint 30 of 32 (pos_min = 85750, pos_max = 85750, size = 63.468 MiB, took 36.73 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776869837 id_slot=0 id_task=7076 p0=85751
slot create_check: id  0 | task 7076 | created context checkpoint 31 of 32 (pos_min = 85756, pos_max = 85756, size = 63.468 MiB, took 37.27 ms)
slot print_timing: id  0 | task 7076 | 
prompt eval time = 1018477.30 ms / 31510 tokens (   32.32 ms per token,    30.94 tokens per second)
       eval time =   27096.44 ms /   233 tokens (  116.29 ms per token,     8.60 tokens per second)
      total time = 1045573.73 ms / 31743 tokens
INFO [      log_server_request] request | tid="126875263293120" timestamp=1776869864 remote_addr="127.0.0.1" remote_port=50070 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [      log_server_request] request | tid="126875254900416" timestamp=1776869864 remote_addr="127.0.0.1" remote_port=55028 status=200 method="GET" path="/v1/models" params={}
INFO [      log_server_request] request | tid="126875246507712" timestamp=1776869864 remote_addr="127.0.0.1" remote_port=55036 status=200 method="GET" path="/v1/props" params={}
slot create_check: id  0 | task 7076 | created context checkpoint 32 of 32 (pos_min = 85987, pos_max = 85987, size = 63.470 MiB, took 39.14 ms)
INFO [           release_slots] slot released | tid="126920474798784" timestamp=1776869864 id_slot=0 id_task=7076 n_ctx=524288 n_past=85989 n_system_tokens=0 n_cache_tokens=85989 truncated=false
INFO [              slots_idle] all slots are idle | tid="126920474798784" timestamp=1776869864
======== Prompt cache: cache size: 85989, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.63, cache_ram_similarity: 0.50
 - looking for better prompt, base f_keep = 0.631, sim = 0.630, n_keep = 0, n_discarded_prompt = 0
 - cache state: 1 prompts, 2479.544 MiB (limits: 8192.000 MiB, 0 tokens, 176910 est)
   - prompt 0x63490003ca30:   53547 tokens,       0 discarded, checkpoints: 32,  2479.544 MiB
prompt cache load took 26.21 ms
banned strings is not supported by recurrent model, it will be disabled.
INFO [   launch_slot_with_task] slot is processing task | tid="126920474798784" timestamp=1776869867 id_slot=0 id_task=7326
======== Cache: cache_size = 85989, n_past0 =  17261, n_past1 =  17261, n_past_prompt1 = 17261,  n_past2 =  54248, n_past_prompt2 =  54231
Common part does not match fully
cache :  "thinking": 863227,
    "thinking": false,
    "lm_temperature": 0.9,

prompt:     "thinking": false,
    "lm_temperature": 0.9,
    "lm_cfg_scale": 3.0,

slot apply_checkp: id  0 | task 7326 | n_past = 54248, slot.prompt.tokens.size() = 85989, seq_id = 0, pos_min = 85987
slot apply_checkp: id  0 | task 7326 | restored context checkpoint took  13.93 ms (pos_min = 54245, pos_max = 54245, size = 63.228 MiB)
slot apply_checkp: id  0 | task 7326 | erased invalidated context checkpoint (pos_min = 56293, pos_max = 56293, size = 63.243 MiB)
slot apply_checkp: id  0 | task 7326 | erased invalidated context checkpoint (pos_min = 58341, pos_max = 58341, size = 63.259 MiB)
slot apply_checkp: id  0 | task 7326 | erased invalidated context checkpoint (pos_min = 60389, pos_max = 60389, size = 63.275 MiB)
slot apply_checkp: id  0 | task 7326 | erased invalidated context checkpoint (pos_min = 62437, pos_max = 62437, size = 63.290 MiB)
slot apply_checkp: id  0 | task 7326 | erased invalidated context checkpoint (pos_min = 64485, pos_max = 64485, size = 63.306 MiB)
slot apply_checkp: id  0 | task 7326 | erased invalidated context checkpoint (pos_min = 66533, pos_max = 66533, size = 63.322 MiB)
slot apply_checkp: id  0 | task 7326 | erased invalidated context checkpoint (pos_min = 68581, pos_max = 68581, size = 63.337 MiB)
slot apply_checkp: id  0 | task 7326 | erased invalidated context checkpoint (pos_min = 70629, pos_max = 70629, size = 63.353 MiB)
slot apply_checkp: id  0 | task 7326 | erased invalidated context checkpoint (pos_min = 72677, pos_max = 72677, size = 63.368 MiB)
slot apply_checkp: id  0 | task 7326 | erased invalidated context checkpoint (pos_min = 74725, pos_max = 74725, size = 63.384 MiB)
slot apply_checkp: id  0 | task 7326 | erased invalidated context checkpoint (pos_min = 76773, pos_max = 76773, size = 63.400 MiB)
slot apply_checkp: id  0 | task 7326 | erased invalidated context checkpoint (pos_min = 78821, pos_max = 78821, size = 63.415 MiB)
slot apply_checkp: id  0 | task 7326 | erased invalidated context checkpoint (pos_min = 80869, pos_max = 80869, size = 63.431 MiB)
slot apply_checkp: id  0 | task 7326 | erased invalidated context checkpoint (pos_min = 82917, pos_max = 82917, size = 63.447 MiB)
slot apply_checkp: id  0 | task 7326 | erased invalidated context checkpoint (pos_min = 84965, pos_max = 84965, size = 63.462 MiB)
slot apply_checkp: id  0 | task 7326 | erased invalidated context checkpoint (pos_min = 85750, pos_max = 85750, size = 63.468 MiB)
slot apply_checkp: id  0 | task 7326 | erased invalidated context checkpoint (pos_min = 85756, pos_max = 85756, size = 63.468 MiB)
slot apply_checkp: id  0 | task 7326 | erased invalidated context checkpoint (pos_min = 85987, pos_max = 85987, size = 63.470 MiB)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776869867 id_slot=0 id_task=7326 p0=54246
slot create_check: id  0 | task 7326 | created context checkpoint 15 of 32 (pos_min = 56293, pos_max = 56293, size = 63.243 MiB, took 39.35 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776869921 id_slot=0 id_task=7326 p0=56294
slot create_check: id  0 | task 7326 | created context checkpoint 16 of 32 (pos_min = 58341, pos_max = 58341, size = 63.259 MiB, took 40.10 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776869976 id_slot=0 id_task=7326 p0=58342
slot create_check: id  0 | task 7326 | created context checkpoint 17 of 32 (pos_min = 60389, pos_max = 60389, size = 63.275 MiB, took 40.42 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776870032 id_slot=0 id_task=7326 p0=60390
slot create_check: id  0 | task 7326 | created context checkpoint 18 of 32 (pos_min = 62437, pos_max = 62437, size = 63.290 MiB, took 40.27 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776870089 id_slot=0 id_task=7326 p0=62438
slot create_check: id  0 | task 7326 | created context checkpoint 19 of 32 (pos_min = 64485, pos_max = 64485, size = 63.306 MiB, took 40.91 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776870148 id_slot=0 id_task=7326 p0=64486
slot create_check: id  0 | task 7326 | created context checkpoint 20 of 32 (pos_min = 66533, pos_max = 66533, size = 63.322 MiB, took 38.31 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776870208 id_slot=0 id_task=7326 p0=66534
slot create_check: id  0 | task 7326 | created context checkpoint 21 of 32 (pos_min = 68581, pos_max = 68581, size = 63.337 MiB, took 38.26 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776870271 id_slot=0 id_task=7326 p0=68582
slot create_check: id  0 | task 7326 | created context checkpoint 22 of 32 (pos_min = 70629, pos_max = 70629, size = 63.353 MiB, took 39.53 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776870334 id_slot=0 id_task=7326 p0=70630
slot create_check: id  0 | task 7326 | created context checkpoint 23 of 32 (pos_min = 72677, pos_max = 72677, size = 63.368 MiB, took 37.55 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776870399 id_slot=0 id_task=7326 p0=72678
slot create_check: id  0 | task 7326 | created context checkpoint 24 of 32 (pos_min = 74725, pos_max = 74725, size = 63.384 MiB, took 37.17 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776870465 id_slot=0 id_task=7326 p0=74726
slot create_check: id  0 | task 7326 | created context checkpoint 25 of 32 (pos_min = 76773, pos_max = 76773, size = 63.400 MiB, took 39.68 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776870532 id_slot=0 id_task=7326 p0=76774
slot create_check: id  0 | task 7326 | created context checkpoint 26 of 32 (pos_min = 78821, pos_max = 78821, size = 63.415 MiB, took 38.87 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776870602 id_slot=0 id_task=7326 p0=78822
slot create_check: id  0 | task 7326 | created context checkpoint 27 of 32 (pos_min = 80869, pos_max = 80869, size = 63.431 MiB, took 39.56 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776870673 id_slot=0 id_task=7326 p0=80870
slot create_check: id  0 | task 7326 | created context checkpoint 28 of 32 (pos_min = 82917, pos_max = 82917, size = 63.447 MiB, took 38.66 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776870746 id_slot=0 id_task=7326 p0=82918
slot create_check: id  0 | task 7326 | created context checkpoint 29 of 32 (pos_min = 84965, pos_max = 84965, size = 63.462 MiB, took 37.33 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="126920474798784" timestamp=1776870819 id_slot=0 id_task=7326 p0=84966

@firecoperana

Copy link
Copy Markdown
Collaborator
cache :  "thinking": 863227,
    "thinking": false,
    "lm_temperature": 0.9,

prompt:     "thinking": false,
    "lm_temperature": 0.9,
    "lm_cfg_scale": 3.0,

This is the problematic part. Could be front end related. If the front end keeps "thinking": 863227 in the prompt, it will be fine.

@FNsi

FNsi commented Apr 22, 2026

Copy link
Copy Markdown

cache :  "thinking": 863227,

    "thinking": false,

    "lm_temperature": 0.9,



prompt:     "thinking": false,

    "lm_temperature": 0.9,

    "lm_cfg_scale": 3.0,

This is the problematic part. Could be front end related. If the front end keeps "thinking": 863227 in the prompt, it will be fine.

I think it maybe the ctx length sequence, the 863227s token? I totally have no clue about that...

Sorry I just closed my laptop...cannot restore more detail terminal contents...

@firecoperana

Copy link
Copy Markdown
Collaborator

Add --minilog to check what's sent by the front end and what server generates. It's saved in the log file.

@markaalonzo

Copy link
Copy Markdown
Contributor Author

Thanks for the review. Answering both questions with evidence from the tree (line numbers are from the PR head at 655ab81):

@ikawrakow on "why not full reprocess for recurrent too": the PR still keeps full reprocess as the fallback — we didn't touch the SWA/transformer branch. What we changed is the selector and the reset mechanics, because both encode transformer-window semantics that don't match recurrent state.

  • Old selector cur.pos_min < pos_min_thold (server-context.cpp:3099, now the transformer fallback after the patch) treats a checkpoint as a per-token window. A recurrent checkpoint is a whole-prefix state snapshot, so the valid predicate is cur.pos_max <= slot.n_past && cur.pos_max < pos_next at server-context.cpp:3097 (leaves at least one token to decode).
  • Old fallback slot.n_past = 0; slot.n_past_prompt = 0; (server-context.cpp:3135-3136, now also transformer-only) zeros the bookkeeping while the live recurrent state in ctx is still populated. The next decode batch positions disagree with the state and llama_decode returns ret=-3. For recurrent we preserve slot state instead (the has_recurrent branch at L3124-3130) — update_slots() then continues from the already-valid n_past_prompt without tripping the decode.

@ikawrakow on how stop_internal_decode becomes true: it's a file-static global in src/llama.cpp:147, set by llama_decode_stop() which is called from three disconnect paths in server.cpp: L1121 (non-stream completion, wait_for_all terminated), L1144 (stream completion, first result nullptr), L1473 (embeddings/rerank, same terminated path). It's cleared only by llama_decode_reset() from server_slot::release() (server-context.cpp:499, reset call at L505), and only when state == SLOT_STATE_PROCESSING. If the stop signal fires for a slot that isn't in PROCESSING state — or if another slot's disconnect races first — the flag persists. Being process-global means one slot's disconnect bleeds into the next decode on any slot. Clearing at the llama_decode() entry (src/llama.cpp:7924) scopes the signal to the in-flight decode it was meant for.

@firecoperana on "checkpoints only exist for recurrent": you're right — params_base.do_checkpoint is gated on llama_model_has_recurrent(...) at server-context.cpp:1471. That's exactly why the selector fix matters: apply_checkpoint() is effectively recurrent-only today, yet its predicates assume transformer windows. The transformer branches in this patch are unreachable for now but preserved to keep the function correct if checkpointing is ever enabled for non-recurrent models.

@markaalonzo

Copy link
Copy Markdown
Contributor Author

Follow-up with production trace evidence for the stop-flag half, and an honest recalibration on the checkpoint half.

Stop-flag race, captured live

Production log (Qwen 3.6 under chat workload, timestamps UTC-4, PIDs/thread-ids redacted). The race firing:

14:08:49 llama-server: srv         stop: cancel task, id_task = 4290
14:08:49 llama-server: llama_decode: failed to decode, ret = -3
14:08:49 llama-server: Decode process is cancelled by user.
14:08:49 llama-server: INFO [release_slots] slot released | id_slot=0 id_task=4280 n_past=24048

Stop fires on task 4290 (client disconnect) → sets the file-static stop_internal_decode → in-flight decode on task 4280 polls the flag → bails with ret=-3. Different task. That's the process-global bleed between concurrent tasks on the same context.

Three seconds later the flag gets stuck:

14:08:52 llama-server: srv         stop: cancel task, id_task = 4280

Task 4280 was already released at 14:08:49 (state transitioned to IDLE). llama_decode_stop() sets the flag again, but the only clear path is llama_decode_reset() inside server_slot::release(), gated on state == SLOT_STATE_PROCESSING. Slot was IDLE, so reset never runs. Flag sticks.

Every subsequent decode fails immediately, zero work done, until the service is restarted. Five minutes later at the client's retry cadence:

14:13:19 llama-server: INFO [launch_slot_with_task] slot is processing task | id_task=4474
14:13:19 llama-server: slot apply_checkp: id 0 | task 4474 | forcing full prompt re-processing ...
14:13:19 llama-server: llama_decode: failed to decode, ret = -3
14:13:19 llama-server: INFO [release_slots] slot released | id_slot=0 id_task=4474 n_past=8

Same pattern repeats at 14:18:19, 14:23:20, 14:28:20, 14:33:20 — client retry interval, immediate ret=-3 each time, no forward progress. Service restart at 15:23 is what actually recovered it.

That's race → stuck → repeat, demonstrated end-to-end. The scope-to-llama_decode()-entry change addresses all three failure modes (process-global bleed, state != PROCESSING gap, sticky retry loop).

Recalibration on the apply_checkpoint half

Being straight: the trace above shows the apply_checkpoint recurrent fallback firing ("forcing full prompt re-processing"), but the ret=-3 in both cases was the stop flag — not the zero-n_past bookkeeping mismatch. The fallback code path actually completes when the stop flag isn't in the way (the first trace successfully processed 24,048 tokens before the stop bled over).

@firecoperana's earlier log upthread — selector restoring pos_max = 54245 three tokens behind n_past = 54248 while erasing valid checkpoints ahead at pos 54681/54687/54782 — does corroborate the wrong-semantics claim (per-token pos_min predicate applied to a whole-prefix recurrent snapshot). So the selector fix is semantic correctness with external evidence.

But we can't claim we have a trace where the recurrent zero-n_past fallback on its own causes ret=-3 in our environment. The semantic argument (n_past=0 bookkeeping disagrees with populated recurrent state in ctx) is plausible, but unproven here.

On splitting

Given the asymmetric evidence, we're fully open to splitting this into two PRs if that's cleaner:

  1. llama_decode_stop() sticky-flag fix — lands on its own trace evidence above.
  2. apply_checkpoint recurrent-aware selector + preserve-state branch — semantic correctness; can stand on the code argument + @firecoperana's corroborating log, or we can hold it until we capture a cleaner repro that isolates it from the stop-flag path.

Happy to do either. If you'd prefer the split, we'll open a second PR for (1) against this branch's commit and let this one focus on (2), or vice versa — whichever order you'd rather review.

@FNsi

FNsi commented Apr 23, 2026

Copy link
Copy Markdown

Now caught it! ret=-3 (I didn't do anything interrupt, even the log said cancel by user. )


======== Prompt cache: cache size: 3573, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.92, cache_ram_similarity: 0.50
 - looking for better prompt, base f_keep = 0.920, sim = 0.187, n_keep = 0, n_discarded_prompt = 0
 - cache state: 1 prompts, 548.791 MiB (limits: 8192.000 MiB, 0 tokens, 217730 est)
   - prompt 0x5c1855e302b0:   14586 tokens,       0 discarded, checkpoints:  6,   548.791 MiB
prompt cache load took 3.37 ms
INFO [   launch_slot_with_task] slot is processing task | tid="140251317070528" timestamp=1776920989 id_slot=0 id_task=9719
======== Cache: cache_size = 3573, n_past0 =  3288, n_past1 =  3288, n_past_prompt1 = 3288,  n_past2 =  3288, n_past_prompt2 =  3288
Common part does not match fully
cache : 

<|im_start|>system
Generate a short, descriptive title (3-7 words) for a conversation that starts with the following exchange. The title should capture
prompt: 

<|im_start|>system
# Hermes Agent Persona

<!--
This file defines the agent's personality and tone.
The agent will embody whatever you write here
slot apply_checkp: id  0 | task 9719 | n_past = 3288, slot.prompt.tokens.size() = 3573, seq_id = 0, pos_min = 3571
slot apply_checkp: id  0 | task 9719 | restored context checkpoint took  8.31 ms (pos_min = 2047, pos_max = 2047, size = 62.830 MiB)
slot apply_checkp: id  0 | task 9719 | erased invalidated context checkpoint (pos_min = 3537, pos_max = 3537, size = 62.841 MiB)
slot apply_checkp: id  0 | task 9719 | erased invalidated context checkpoint (pos_min = 3543, pos_max = 3543, size = 62.841 MiB)
slot apply_checkp: id  0 | task 9719 | erased invalidated context checkpoint (pos_min = 3571, pos_max = 3571, size = 62.841 MiB)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140251317070528" timestamp=1776920989 id_slot=0 id_task=9719 p0=2048
llama_decode: failed to decode, ret = -3
Decode process is cancelled by user.
INFO [           release_slots] slot released | tid="140251317070528" timestamp=1776920989 id_slot=0 id_task=9719 n_ctx=524288 n_past=4096 n_system_tokens=0 n_cache_tokens=4096 truncated=false
INFO [              slots_idle] all slots are idle | tid="140251317070528" timestamp=1776920989
======== Prompt cache: cache size: 4096, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 1.00, cache_ram_similarity: 0.50
 - looking for better prompt, base f_keep = 1.000, sim = 0.230, n_keep = 0, n_discarded_prompt = 0
 - cache state: 1 prompts, 548.791 MiB (limits: 8192.000 MiB, 0 tokens, 217730 est)
   - prompt 0x5c1855e302b0:   14586 tokens,       0 discarded, checkpoints:  6,   548.791 MiB
prompt cache load took 3.14 ms
INFO [   launch_slot_with_task] slot is processing task | tid="140251317070528" timestamp=1776920989 id_slot=0 id_task=9721
======== Cache: cache_size = 4096, n_past0 =  4096, n_past1 =  4096, n_past_prompt1 = 4096,  n_past2 =  4096, n_past_prompt2 =  4096
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140251317070528" timestamp=1776920989 id_slot=0 id_task=9721 p0=4096
slot create_check: id  0 | task 9721 | created context checkpoint 2 of 32 (pos_min = 6143, pos_max = 6143, size = 62.861 MiB, took 42.70 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140251317070528" timestamp=1776921007 id_slot=0 id_task=9721 p0=6144
slot create_check: id  0 | task 9721 | created context checkpoint 3 of 32 (pos_min = 8191, pos_max = 8191, size = 62.877 MiB, took 39.48 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140251317070528" timestamp=1776921027 id_slot=0 id_task=9721 p0=8192
slot create_check: id  0 | task 9721 | created context checkpoint 4 of 32 (pos_min = 10239, pos_max = 10239, size = 62.892 MiB, took 37.76 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140251317070528" timestamp=1776921048 id_slot=0 id_task=9721 p0=10240
slot create_check: id  0 | task 9721 | created context checkpoint 5 of 32 (pos_min = 12287, pos_max = 12287, size = 62.908 MiB, took 38.02 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140251317070528" timestamp=1776921071 id_slot=0 id_task=9721 p0=12288
slot create_check: id  0 | task 9721 | created context checkpoint 6 of 32 (pos_min = 14335, pos_max = 14335, size = 62.923 MiB, took 36.03 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140251317070528" timestamp=1776921095 id_slot=0 id_task=9721 p0=14336
slot create_check: id  0 | task 9721 | created context checkpoint 7 of 32 (pos_min = 16383, pos_max = 16383, size = 62.939 MiB, took 37.86 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140251317070528" timestamp=1776921120 id_slot=0 id_task=9721 p0=16384
slot create_check: id  0 | task 9721 | created context checkpoint 8 of 32 (pos_min = 18431, pos_max = 18431, size = 62.955 MiB, took 36.93 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140251317070528" timestamp=1776921146 id_slot=0 id_task=9721 p0=18432

@ikawrakow ikawrakow merged commit 48819da into ikawrakow:main Apr 23, 2026
@firecoperana

Copy link
Copy Markdown
Collaborator

I need to clarify the logic of checkpoint as it's ported from mainline. Checkpoint is just the kv cache of the recurrent layer. It cannot be rolled back, so we need to save as the checkpoint. There is no recurrent layer for transformers, so there is no need to save/restore checkpoint for transformers. The better guard is to check has_recurrent at the very beginning of apply_checkpoint.
Now that it's clear that it only applies to recurrent model, there is no difference between pos_min and pos_max. They are the same value. Using either pos_min or pos_max is the same thing.
In this PR, when it has_recurrent and you don't find the checkpoint, you don't zeros slot.n_past. This is problematic because you will continue with the kv cache that has your later conversation context. The original approach is correct.

@FNsi

FNsi commented Apr 24, 2026

Copy link
Copy Markdown

when it has_recurrent and you don't find the checkpoint, you don't zeros slot.n_past. This is problematic because you will continue with the kv cache that has your later conversation context.

You are right, that will continue with latest conversation, and repeat it in that place

I can confirm current impl indeed has that problem...

For me, time for reprocessing prefill cycles is harder to accept comparing to decode latest assistant message twice, especially in 50k or higher ctx.

Can we find a better approach to solve this dilemma?

@firecoperana

Copy link
Copy Markdown
Collaborator

It's more about correctness than performance. Your log shows that the issue lies with your frontend. Better to submit the bug report there.

@FNsi

FNsi commented Apr 24, 2026

Copy link
Copy Markdown

It's more about correctness than performance. Your log shows that the issue lies with your frontend. Better to submit the bug report there.

I see, but the second log didn't show it up...anyway I haven't store it, only the piece above survived.

This file defines the agent's personality and tone.
The agent will embody whatever you write here
slot apply_checkp: id 0 | task 9719 | n_past = 3288, slot.prompt.tokens.size() = 3573, seq_id = 0, pos_min = 3571
slot apply_checkp: id 0 | task 9719 | restored context checkpoint took 8.31 ms (pos_min = 2047, pos_max = 2047, size = 62.830 MiB)
slot apply_checkp: id 0 | task 9719 | erased invalidated context checkpoint (pos_min = 3537, pos_max = 3537, size = 62.841 MiB)
slot apply_checkp: id 0 | task 9719 | erased invalidated context checkpoint (pos_min = 3543, pos_max = 3543, size = 62.841 MiB)
slot apply_checkp: id 0 | task 9719 | erased invalidated context checkpoint (pos_min = 3571, pos_max = 3571, size = 62.841 MiB)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140251317070528" timestamp=1776920989 id_slot=0 id_task=9719 p0=2048
llama_decode: failed to decode, ret = -3
Decode process is cancelled by user.
INFO [ release_slots] slot released | tid="140251317070528" timestamp=1776920989 id_slot=0 id_task=9719 n_ctx=524288 n_past=4096 n_system_tokens=0 n_cache_tokens=4096 truncated=false
INFO [ slots_idle] all slots are idle | tid="140251317070528" timestamp=1776920989
======== Prompt cache: cache size: 4096, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 1.00, cache_ram_similarity: 0.50

  • looking for better prompt, base f_keep = 1.000, sim = 0.230, n_keep = 0, n_discarded_prompt = 0
  • cache state: 1 prompts, 548.791 MiB (limits: 8192.000 MiB, 0 tokens, 217730 est)
    • prompt 0x5c1855e302b0: 14586 tokens, 0 discarded, checkpoints: 6, 548.791 MiB
      prompt cache load took 3.14 ms
      INFO [ launch_slot_with_task] slot is processing task | tid="140251317070528" timestamp=1776920989 id_slot=0 id_task=9721
      ======== Cache: cache_size = 4096, n_past0 = 4096, n_past1 = 4096, n_past_prompt1 = 4096, n_past2 = 4096, n_past_prompt2 = 4096
      INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140251317070528" timestamp=1776920989 id_slot=0 id_task=9721 p0=4096

SamuelOliveirads pushed a commit to SamuelOliveirads/ik_llama.cpp that referenced this pull request Apr 27, 2026
… stop flag (ikawrakow#1673)

Two related issues that manifest as 'llama_decode ret=-3' on hybrid
architectures (e.g. Qwen3.5/3.6 MoE, Qwen3-Next), matching the symptom
reported in ikawrakow#1576.

1) server_context::apply_checkpoint() was written around transformer KV
   semantics (pos_min / pos_max per-token window). For hybrid and pure
   recurrent models the per-token pos_min threshold does not apply: the
   recurrent state is a single snapshot, and the server-side checkpoint
   is a whole-prefix record. The old selector 'cur.pos_min < pos_min_thold'
   can succeed on a checkpoint whose pos_max is past the current n_past,
   and — more commonly — fall through to do_reset = true, which zeros
   slot.n_past / slot.n_past_prompt. Zeroing in-place while the recurrent
   state in the context is still populated makes the next decode batch
   disagree with the live state, returning ret=-3.

   This change gates the checkpoint path on
   llama_model_has_recurrent(llama_get_model(slot.ctx)):
   - selector uses pos_max <= slot.n_past && pos_max < pos_next
     (whole-prefix match, leaves at least one token to decode);
   - on miss, slot state is preserved rather than zeroed, letting
     update_slots() continue from the already-valid n_past_prompt;
   - the erase loop drops any checkpoint whose pos_max > pos_next,
     matching the rewind semantics for recurrent state.

   Transformer behavior is unchanged.

2) stop_internal_decode is a file-static global in src/llama.cpp, set by
   llama_decode_stop() (called on client disconnect) and polled inside
   the decode loop to bail out with ret=-3. The flag is only cleared on
   one conditional path in server_slot::release(), so a stop signal that
   arrives after the interrupted llama_decode() has already returned
   bleeds into the NEXT decode call and causes an immediate ret=-3 with
   no work performed. Clear it at the top of the public llama_decode()
   entry so the signal is scoped to the in-flight decode it was meant
   for.

Build-verified: llama-server with GGML_CUDA=ON, -DCMAKE_CUDA_ARCHITECTURES=86
(sm_86), IQK flash-attn + matmul enabled. No new APIs introduced —
llama_model_has_recurrent is already public and already used elsewhere in
server-context.cpp.

Closes ikawrakow#1576
@firecoperana

Copy link
Copy Markdown
Collaborator

@markaalonzo Can you provide a easy to reproduce case where ret =-3 causes an issue?

@FNsi

FNsi commented May 3, 2026

Copy link
Copy Markdown

easy to reproduce case where ret =-3 causes an issue?

I think copy past some code to let llm fix in llama-sever, fix and fix, goes to around 60k after multiple dialogue run. After trigger that error, we cannot do anything meaningful, basically cannot be use due to endless reprocessing.

@FNsi

FNsi commented May 3, 2026

Copy link
Copy Markdown

Unrelated here

And the img error, exist very early before that pr. I was tried to cache all img in cache and it works. #1585 was tried to cache and load in disk if possible, while it's not fixed the disk cache part at that time, even it works later, it introduces ton of code, which let me deleted it. But full cache do work with img problem.

@markaalonzo

markaalonzo commented May 10, 2026

Copy link
Copy Markdown
Contributor Author

Apologies for the slow response — concrete reproducer below.

Minimal repro for ret = -3 (hybrid recurrent + abandoned decode)

# Server: any hybrid recurrent model is fine (Qwen3.6-MTP shown)
./llama-server -m Qwen3.6-MTP-IQ4_K_R4.gguf -c 32768 -fa on \
  -ctk q4_0 -ctv q4_0 --jinja --port 8080

Step 1 — start a thinking-mode request and abandon it mid-stream (any client that drops the connection before generation completes):

timeout 3s curl -N http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Solve x^2 + 2x - 7 = 0 step by step with full algebra."}],"max_tokens":2048,"stream":true}'

Step 2 — any subsequent request:

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":50}'

Step 2's server log shows:

llama_decode: failed to decode, ret = -3

Step 1's connection drop sets the sticky stop flag in llama_decode_stop(); step 2's decode polls it on entry and bails immediately. Same path the PR description and the production trace I posted upthread pointed at.

Happy to construct a different angle if useful — concurrent-tasks bleed, or the apply_checkpoint zero-n_past mismatch in isolation.

@firecoperana

Copy link
Copy Markdown
Collaborator

Couldn't reproduce using your request, but I encountered something similar. Test if #1787 fixed it.

ikawrakow pushed a commit that referenced this pull request Jun 12, 2026
…cel cascade) (#1941)

With --parallel 1, a client disconnect/timeout on a *queued* request aborts the
*active* decode of a different client (llama_decode: failed to decode, ret = -3 /
"Decode process is cancelled by user"), releasing the slot with the request
unfinished. To the active client the stream silently stalls and never returns,
while the server reports healthy — easy to misdiagnose as a network/proxy wedge.

Root cause: llama_decode_stop() signals a process-global stop flag that the
active decode loop polls. examples/server/server.cpp calls it *ungated* from the
request reader's connection-closed paths, so any reader closing (including a
queued, not-yet-running task's) trips the global flag against whatever decode is
currently active. Adjacent to #1576/#1673 ("clear sticky stop flag" +
hybrid/recurrent ret=-3), which did not gate these call sites against non-active
readers, so the queued-cancel-kills-active cascade still fires on current main.

Fix (minimal gate): add server_response_reader::any_task_on_slot() and gate the
three llama_decode_stop() sites on it, so the global stop is signalled only when
one of THIS reader's tasks is on a slot (the active decode). A queued task's
disconnect then only drops that queued task. Verified in production under heavy
concurrent, frequently-cancelled load (hundreds of queued-task cancels, zero
active-decode kills). Stdlib-only reproducer in the PR description.

Caveat: any_task_on_slot() reads the slots vector from the reader thread — the
same race class as the existing process-global flag; can be tightened to a
per-context/per-task cancellation if preferred.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: Hybrid models revaluate the prompt invalidating kv

4 participants