server: fix ret=-3 on hybrid/recurrent prompt cache and clear sticky stop flag #1673
Conversation
… stop flag

Two related issues that manifest as 'llama_decode ret=-3' on hybrid architectures (e.g. Qwen3.5/3.6 MoE, Qwen3-Next), matching the symptom reported in ikawrakow#1576.

1) server_context::apply_checkpoint() was written around transformer KV semantics (pos_min / pos_max per-token window). For hybrid and pure recurrent models the per-token pos_min threshold does not apply: the recurrent state is a single snapshot, and the server-side checkpoint is a whole-prefix record. The old selector 'cur.pos_min < pos_min_thold' can succeed on a checkpoint whose pos_max is past the current n_past, and, more commonly, fall through to do_reset = true, which zeros slot.n_past / slot.n_past_prompt. Zeroing in-place while the recurrent state in the context is still populated makes the next decode batch disagree with the live state, returning ret=-3.

This change gates the checkpoint path on llama_model_has_recurrent(llama_get_model(slot.ctx)):
- selector uses pos_max <= slot.n_past && pos_max < pos_next (whole-prefix match, leaves at least one token to decode);
- on miss, slot state is preserved rather than zeroed, letting update_slots() continue from the already-valid n_past_prompt;
- the erase loop drops any checkpoint whose pos_max > pos_next, matching the rewind semantics for recurrent state.

Transformer behavior is unchanged.

2) stop_internal_decode is a file-static global in src/llama.cpp, set by llama_decode_stop() (called on client disconnect) and polled inside the decode loop to bail out with ret=-3. The flag is only cleared on one conditional path in server_slot::release(), so a stop signal that arrives after the interrupted llama_decode() has already returned bleeds into the NEXT decode call and causes an immediate ret=-3 with no work performed. Clear it at the top of the public llama_decode() entry so the signal is scoped to the in-flight decode it was meant for.
Build-verified: llama-server with GGML_CUDA=ON, -DCMAKE_CUDA_ARCHITECTURES=86 (sm_86), IQK flash-attn + matmul enabled. No new APIs introduced — llama_model_has_recurrent is already public and already used elsewhere in server-context.cpp. Closes ikawrakow#1576
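For readers following the checkpoint half, here is a minimal sketch of the whole-prefix selector and the erase rule described in the commit message. It is not the code from server-context.cpp: the Checkpoint struct, both function names, and the newest-to-oldest scan order are assumptions made for illustration. Only the two conditions (pos_max <= n_past && pos_max < pos_next, and dropping pos_max > pos_next) come from the text above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified checkpoint record; the real one also carries state data.
struct Checkpoint {
    int32_t pos_min;
    int32_t pos_max;
};

// Returns the index of the newest checkpoint that is a whole-prefix match for a
// recurrent/hybrid slot, or -1 on a miss.
// Assumption: `cps` is ordered oldest to newest, so we scan from the back.
int select_recurrent_checkpoint(const std::vector<Checkpoint> & cps,
                                int32_t n_past, int32_t pos_next) {
    assert(n_past >= 0 && pos_next >= 0);
    for (int i = (int) cps.size() - 1; i >= 0; --i) {
        const Checkpoint & cur = cps[i];
        // The snapshot must not be ahead of n_past, and must leave at least
        // one token to decode (strictly below pos_next).
        if (cur.pos_max <= n_past && cur.pos_max < pos_next) {
            return i;
        }
    }
    return -1; // miss: the caller keeps the slot state instead of zeroing it
}

// Drops checkpoints that lie ahead of the decode position (pos_max > pos_next).
void erase_stale_checkpoints(std::vector<Checkpoint> & cps, int32_t pos_next) {
    for (auto it = cps.begin(); it != cps.end(); ) {
        if (it->pos_max > pos_next) {
            it = cps.erase(it);
        } else {
            ++it;
        }
    }
}
```

Note that a checkpoint with pos_max equal to pos_next is rejected by the selector, since restoring it would leave nothing to decode.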
ab81818 to
655ab81
Compare
| "https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055"); | ||
| slot.n_past = 0; | ||
| slot.n_past_prompt = 0; | ||
| if (has_recurrent) { |
Don't understand this change. If we concluded that we need full prompt reprocessing, why would that not apply to a recurrent model?
I don't think the change is needed inside apply_checkpoint. At the current state, we only create checkpoint when it's a recurrent model. If it's not a recurrent model, there is no checkpoint to restore.
| // is intended to interrupt the decode that is currently in flight; without this reset, a stop | ||
| // that arrived after the interrupted call returned would bleed into the next decode and cause | ||
| // an immediate ret=-3, which servers interpret as a fatal decode failure. | ||
| stop_internal_decode = false; |
Interesting comment. Can you tell us how stop_internal_decode became true?
|
As you have done all the checkpointing stuff, I'll refer this one to you to see if the changes make sense. |
|
@markaalonzo stop_internal_decode is cleared when the slot is released, and it does not invalidate the KV cache in a way that would force the prompt to be fully reprocessed. Can you provide a detailed example that demonstrates the issue, and say whether this PR attempts to fix #1576, where there is no log showing why it occurred? |
|
I may be hitting this issue right now. It went from 56k to 86k, repeated 3 times. I didn't do anything special. CMD I used: I run with hermes-agent, though I don't think that is related. I was using Hermes to generate music on the iGPU while running ik_llama on the CPU; the tg and pp at the beginning of the log are lower than normal, and I have no clue why (maybe the iGPU and CPU are competing for power, so both slow down). **Jinja template: I think this could be the problematic part... I overlooked that the stepfun template also applies a sequence change for the thinking block. That still doesn't really make sense for the first 50k, and might not fully explain why it triggers re-evaluation endlessly after 50k.** So my case might not really be related. |
This is the problematic part. Could be front end related. If the front end keeps |
I think it may be the ctx length sequence, the 863227s token? I have no clue about that... Sorry, I just closed my laptop, so I cannot restore more detailed terminal output... |
|
Add |
|
Thanks for the review. Answering both questions with evidence from the tree (line numbers are from the PR head at 655ab81): @ikawrakow on "why not full reprocess for recurrent too": the PR still keeps full reprocess as the fallback — we didn't touch the SWA/transformer branch. What we changed is the selector and the reset mechanics, because both encode transformer-window semantics that don't match recurrent state.
@ikawrakow on how @firecoperana on "checkpoints only exist for recurrent": you're right — |
|
Follow-up with production trace evidence for the stop-flag half, and an honest recalibration on the checkpoint half.

Stop-flag race, captured live

Production log (Qwen 3.6 under chat workload, timestamps UTC-4, PIDs/thread-ids redacted). The race firing:

Stop fires on task 4290 (client disconnect) → sets the file-static

Three seconds later the flag gets stuck:

Task 4280 was already released at 14:08:49 (state transitioned to IDLE). Every subsequent decode fails immediately, zero work done, until the service is restarted. Five minutes later at the client's retry cadence:

Same pattern repeats at 14:18:19, 14:23:20, 14:28:20, 14:33:20: client retry interval, immediate ret=-3 each time, no forward progress. Service restart at 15:23 is what actually recovered it.

That's race → stuck → repeat, demonstrated end-to-end. The scope-to-

Recalibration on the
|
|
Caught it now! ret=-3 (I didn't interrupt anything, even though the log said it was cancelled by the user.) |
|
I need to clarify the checkpoint logic, as it's ported from mainline. A checkpoint is just the KV cache of the recurrent layer. That state cannot be rolled back, so we need to save it as a checkpoint. There is no recurrent layer in transformers, so there is no need to save/restore checkpoints for transformers. The better guard is to check |
You are right: it continues with the latest conversation and repeats it at that point. I can confirm the current impl indeed has that problem... For me, the time spent reprocessing prefill cycles is harder to accept than decoding the latest assistant message twice, especially at 50k or higher ctx. Can we find a better approach to solve this dilemma? |
|
It's more about correctness than performance. Your log shows that the issue lies with your frontend. Better to submit the bug report there. |
I see, but the second log didn't show it... Anyway, I didn't save it; only the piece above survived.
|
|
@markaalonzo Can you provide an easy-to-reproduce case where ret=-3 causes an issue? |
I think it goes like this: copy-paste some code for the LLM to fix in llama-server, fix after fix, until the context reaches around 60k after multiple dialogue turns. Once that error triggers, we cannot do anything meaningful; it is basically unusable due to endless reprocessing. |
|
Unrelated here. And the img error existed very early, before that PR. I tried caching all images in the cache and it works. #1585 tried to cache to disk and load from disk where possible, but the disk cache part was not fixed at that time; even if it works later, it introduces a ton of code, which made me delete it. But a full cache does work for the img problem. |
|
Apologies for the slow response — concrete reproducer below. Minimal repro for
|
|
Couldn't reproduce using your request, but I encountered something similar. Test if #1787 fixed it. |
…cel cascade) (#1941)

With --parallel 1, a client disconnect/timeout on a *queued* request aborts the *active* decode of a different client (llama_decode: failed to decode, ret = -3 / "Decode process is cancelled by user"), releasing the slot with the request unfinished. To the active client the stream silently stalls and never returns, while the server reports healthy, which makes it easy to misdiagnose as a network/proxy wedge.

Root cause: llama_decode_stop() signals a process-global stop flag that the active decode loop polls. examples/server/server.cpp calls it *ungated* from the request reader's connection-closed paths, so any reader closing (including a queued, not-yet-running task's) trips the global flag against whatever decode is currently active. Adjacent to #1576/#1673 ("clear sticky stop flag" + hybrid/recurrent ret=-3), which did not gate these call sites against non-active readers, so the queued-cancel-kills-active cascade still fires on current main.

Fix (minimal gate): add server_response_reader::any_task_on_slot() and gate the three llama_decode_stop() sites on it, so the global stop is signalled only when one of THIS reader's tasks is on a slot (the active decode). A queued task's disconnect then only drops that queued task.

Verified in production under heavy concurrent, frequently-cancelled load (hundreds of queued-task cancels, zero active-decode kills). Stdlib-only reproducer in the PR description.

Caveat: any_task_on_slot() reads the slots vector from the reader thread, which is the same race class as the existing process-global flag; can be tightened to a per-context/per-task cancellation if preferred.
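To make the gate in #1941 concrete, here is a small single-threaded sketch of the idea. Only the name any_task_on_slot() comes from the commit message; the Slot and Reader types, the signature, on_connection_closed(), and the stop_requested flag are hypothetical stand-ins, and the sketch deliberately ignores the cross-thread read mentioned in the caveat.

```cpp
#include <atomic>
#include <cassert>
#include <unordered_set>
#include <vector>

// Stands in for the process-global stop flag that the decode loop polls.
static std::atomic<bool> stop_requested{false};

// Hypothetical: a slot records which task it is running, -1 when idle.
struct Slot {
    int task_id = -1;
};

// Hypothetical stand-in for server_response_reader.
struct Reader {
    std::unordered_set<int> task_ids; // tasks owned by this reader

    // True when at least one of this reader's tasks is currently on a slot.
    bool any_task_on_slot(const std::vector<Slot> & slots) const {
        for (const Slot & s : slots) {
            if (s.task_id != -1 && task_ids.count(s.task_id) > 0) {
                return true;
            }
        }
        return false;
    }

    // Connection-closed path: signal the global stop only if this reader
    // owns the active decode; a queued task's disconnect does not signal.
    void on_connection_closed(const std::vector<Slot> & slots) const {
        assert(!task_ids.empty());
        if (any_task_on_slot(slots)) {
            stop_requested.store(true);
        }
    }
};
```

With one slot running task 7, a reader that owns only queued task 8 leaves the flag untouched when it disconnects, while the reader that owns task 7 sets it.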
Summary
Fixes two issues that together produce llama_decode ret=-3 on hybrid / recurrent architectures (Qwen3.5 MoE, Qwen3.6 MoE, Qwen3-Next, Mamba), matching the symptom in #1576 and reproduced under our Qwen3.6 production workload.
1) server_context::apply_checkpoint() is not hybrid-aware. The selector was written for transformer KV semantics (per-token pos_min window). Hybrid/recurrent models use a single whole-prefix state snapshot, so the transformer selector can either pick a checkpoint whose pos_max is past the current n_past (restoring state ahead of the decode position) or miss entirely and fall through to do_reset = true, which zeros slot.n_past / slot.n_past_prompt while the recurrent state in the context is still populated. The next decode batch then disagrees with the live recurrent state → ret=-3.
2) stop_internal_decode is a sticky file-static global. llama_decode_stop() (called on client disconnect) sets it; the decode loop polls it and bails with ret=-3. It is only cleared on one conditional branch of server_slot::release(), so a stop signal that races past a decode that has already returned bleeds into the next llama_decode() call on the same context, producing an immediate ret=-3 with zero work performed.
Changes
examples/server/server-context.cpp
apply_checkpoint gates its logic on llama_model_has_recurrent(llama_get_model(slot.ctx)), the same helper already used in this file at line 115 and line 1471:
- the checkpoint path is gated on has_recurrent (not just on pos_min > pos_min_thold), so recurrent slots don't silently skip it;
- the selector is cur.pos_max <= slot.n_past && cur.pos_max < pos_next: a whole-prefix snapshot that still leaves at least one token to decode (preserving [TAG_PROMPT_LOGITS]);
- on a miss, slot.n_past is not zeroed; update_slots() will continue from the valid n_past_prompt. This is the critical part; zeroing was the proximate trigger for the ret=-3;
- the erase loop drops checkpoints with cur.pos_max > pos_next, i.e. stale snapshots ahead of the current decode position.
Transformer behavior is byte-identical to before (has_recurrent is false → all branches fall to the existing logic).
src/llama.cpp
Public llama_decode() entry clears stop_internal_decode before calling llama_decode_internal(). This scopes the stop signal to the in-flight decode it was meant to interrupt. Concurrent llama_decode_stop() during the decode still takes effect on the next loop iteration as before.
How to reproduce the bug (pre-fix)
Issue #1576 has the full reproduction. We hit it on Qwen3.6 (hybrid QWEN35MOE) via the prompt-cache path:
apply_checkpoint falls through to do_reset = true, zeros n_past, llama_decode() returns -3 and the slot wedges until restart.
The sticky-flag case reproduces with a client that cancels a streaming completion mid-prefill and then retries on the same context.
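The sticky-flag behaviour can be modelled without llama.cpp at all. The sketch below is a toy under stated assumptions: stop_flag, decode_stop(), decode_sticky() and decode_scoped() are hypothetical names standing in for stop_internal_decode, llama_decode_stop() and the public llama_decode() before and after the change, and the stop_at_step parameter only exists to simulate a stop arriving mid-decode.

```cpp
#include <atomic>
#include <cassert>

// Plays the role of the file-static stop_internal_decode flag.
static std::atomic<bool> stop_flag{false};

// Plays the role of llama_decode_stop(): request that the decode bail out.
void decode_stop() { stop_flag.store(true); }

// Simulated decode loop: polls the flag once per step and returns -3 when set.
// `stop_at_step` (hypothetical) injects a stop request at that step; -1 = never.
static int decode_internal(int n_steps, int stop_at_step) {
    assert(n_steps >= 0);
    for (int i = 0; i < n_steps; ++i) {
        if (i == stop_at_step) {
            decode_stop(); // a stop arriving while the decode is in flight
        }
        if (stop_flag.load()) {
            return -3;
        }
    }
    return 0;
}

// Pre-fix shape: the entry point never clears the flag, so a stale stop leaks in.
int decode_sticky(int n_steps, int stop_at_step = -1) {
    return decode_internal(n_steps, stop_at_step);
}

// Post-fix shape: clear at entry, scoping the stop to the in-flight decode.
int decode_scoped(int n_steps, int stop_at_step = -1) {
    stop_flag.store(false);
    return decode_internal(n_steps, stop_at_step);
}
```

A stop that lands between two calls makes decode_sticky() return -3 on its first step, while decode_scoped() ignores it; a stop injected during decode_scoped() still aborts that call, and the following call runs to completion.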
Test plan
- cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DGGML_IQK_MUL_MAT=ON -DGGML_IQK_FLASH_ATTENTION=ON -DGGML_IQK_FA_ALL_QUANTS=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build --target llama-server
- llama_model_has_recurrent is already declared in include/llama.h:684 and defined in src/llama-model.cpp:1872
- … (has_recurrent defaults false for non-hybrid, non-recurrent architectures)
Closes #1576