What task are you trying to do?
We want PawWork to recognize stuck low-yield loops even when every tool call exits successfully. The PR #264 v1 gate currently observes only tool failures, so a model that keeps re-running the same successful but useless command is invisible to it. The user has to interrupt manually.
What do you do today?
Today the only end-of-loop signal in this case is the user typing something like "你死循环了吗?" ("Are you stuck in an infinite loop?"). PR #204 used to inject a reminder after 3 same-input successful repeats, but PR #264 explicitly removed that path in favor of the failure-only signature inputHash | targetHash + errorFingerprint.
What would a good result look like?
The harness should also gate on "N identical-signature successful repeats within one user turn" as a separate, independently-counted signal:
count successful repeats by (tool, inputHash | targetHash) only, with no errorFingerprint requirement
keep the success counter separate from the failure counter so neither pollutes the other's threshold
reuse the existing escalation ladder (reminder → block with autoResume → stop) once the success threshold trips
avoid firing on legitimate batched parallel calls in the same step where the model is fanning out across distinct inputs
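The four requirements above could be prototyped roughly as follows. This is a minimal sketch, not PawWork's implementation: `TurnLoopCounters`, `ToolCallSample`, `actionFor`, and the `signature` field (standing in for inputHash | targetHash) are hypothetical names, and the 3 / 6 / 7 thresholds are assumed to mirror the failure-side ladder.

```typescript
// Hypothetical types and names; none of this is PawWork's actual API.
type Outcome = "success" | "failure";
type Action = "none" | "reminder" | "block" | "stop";

interface ToolCallSample {
  tool: string;
  signature: string; // stands in for inputHash | targetHash
  outcome: Outcome;
  errorFingerprint?: string; // only used for failures
}

// Assumed thresholds, mirroring the 3 / 6 / 7 ladder of the failure side.
const LADDER = { reminder: 3, block: 6, stop: 7 };

function actionFor(count: number): Action {
  if (count >= LADDER.stop) return "stop";
  if (count >= LADDER.block) return "block";
  if (count >= LADDER.reminder) return "reminder";
  return "none";
}

class TurnLoopCounters {
  // Two independent maps, so neither signal advances the other's count.
  private readonly success = new Map<string, number>();
  private readonly failure = new Map<string, number>();

  observe(s: ToolCallSample): Action {
    const base = `${s.tool}|${s.signature}`;
    const isSuccess = s.outcome === "success";
    // Success key has no errorFingerprint; failure key keeps it.
    const key = isSuccess ? base : `${base}|${s.errorFingerprint ?? ""}`;
    const counts = isSuccess ? this.success : this.failure;
    const next = (counts.get(key) ?? 0) + 1;
    counts.set(key, next);
    return actionFor(next);
  }

  // Call at the start of each user turn so counts are per-turn.
  reset(): void {
    this.success.clear();
    this.failure.clear();
  }
}

// Usage: seven identical successful calls walk the whole ladder.
const counters = new TurnLoopCounters();
const repeated: ToolCallSample = { tool: "bash", signature: "h1", outcome: "success" };
const actions = Array.from({ length: 7 }, () => counters.observe(repeated));
// actions: none, none, reminder, reminder, reminder, block, stop
```

Because the key includes the input signature, a parallel fan-out across distinct inputs produces distinct keys and every count stays at 1. Whether identical calls issued inside a single parallel step should count once or several times is a design choice this sketch leaves open.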
Which audience does this matter to most?
Both
Extra context
Repro session
Session: shiny-moon / ses_22ff9de7effeE5A31FJvclUziw, exported 2026-04-27 18:05 local
Model: alibaba-coding-plan-cn/qwen3.6-plus
Total tool calls in the loop region (messages 33-67, after the user asked about a specific @fahdmirza tweet): 66 bash invocations, all state.status === "completed", all state.error === null
Two distinct query strings, run a combined 63 times back-to-back. Each invocation returned the same handful of tweets (or error: unknown command 'user' swallowed via 2>&1), so from the bash tool's perspective every call exited 0 with non-empty stdout.
Why PR #264 did not catch it
packages/opencode/src/session/processor.ts:348-368: SessionDiagnostics.observeToolError is only invoked from failToolCall.
packages/opencode/src/session/processor.ts:223-255: the errorRecords feeder filters to parts that already have a loop.errorFingerprint (or are synthetic block/stop markers). Successful tool parts never enter the feed, so applyLoopGate has zero samples to count and deriveParentLoopState sees an empty parent state.
This matches PR #264's own description: "v1 explicitly removes firing on success, so the test is obsolete." The session above is the failure mode of that tradeoff in production.
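The effect of that filter can be illustrated with a toy model. This is not the code in processor.ts: `LoopPart`, `selectErrorRecords`, and the `synthetic` field are hypothetical stand-ins. Only the filtering rule, which keeps parts with a loop.errorFingerprint or a synthetic block/stop marker, is taken from the description above.

```typescript
// Hypothetical shape of a tool part, reduced to the fields the filter reads.
interface LoopPart {
  status: "completed" | "error";
  loop?: { errorFingerprint?: string };
  synthetic?: "block" | "stop"; // assumed marker for synthetic gate parts
}

// Keep only parts that carry an error fingerprint or are synthetic markers.
function selectErrorRecords(parts: LoopPart[]): LoopPart[] {
  return parts.filter(
    (p) => p.loop?.errorFingerprint !== undefined || p.synthetic !== undefined,
  );
}

// 66 successful bash calls, as in the repro session: none survive the filter,
// so a gate fed from this list has nothing to count.
const completedCalls: LoopPart[] = Array.from({ length: 66 }, () => ({
  status: "completed" as const,
}));
const feed = selectErrorRecords(completedCalls);
// feed.length is 0
```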
Prior art from external reports
The same failure shape is independently reported on Qwen3 and Kimi K2.6 outside PawWork, which suggests this is a model-family pattern rather than a single-session anomaly:
Reddit r/LocalLLaMA — "Qwen3.6-35b stuck in infinite loop" (OP ConfidentSolution737, 12 comments). OP describes "the model keeps responding with a repeated text/tool call without ever stopping". Multiple independent commenters reproduce on different stacks and converge on the harness-side conclusion: "infinite tool call loops are a fundamental issue with reasoning models that don't have a hard stopping condition outside the model itself ... worth adding an external loop guard: a max tool call count per run, or a budget cap that kills the run if it exceeds N steps" (commenter MoistApplication5759). Other commenters report the loop fires under preserve-thinking, that raising presence penalty makes it loop more, and that disabling thinking removes the loop, all of which are sampling-side workarounds rather than fixes.
X — @phonezawphyo, 2026-04-27 on Kimi K2.6: "I was using K2.6 with Hermes agent over the weekend and it burnt through weekly limit too fast and it acted totally dumb — infinite loop on repeated patch/read failures etc. ... I tried Kimi cli today and K2.6 is back to its original form." This one leans toward the failure-repeat side that PR #264 already covers, but it is direct evidence that harness-side loop guarding is a perceived product difference between agent shells running the same model.
Neither report is on Qwen3.6-plus specifically, but the reported behavior shape (success-side complete-but-useless repeats and failure-side repeats) matches what we observe locally and what PR #264 only partially covers.
Acceptance criteria
A new success-repeat signal exists alongside the existing failure signal, keyed by (tool, inputHash | targetHash) with no errorFingerprint.
Success and failure counters are tracked independently; tripping one does not advance the other.
The escalation reuses the reminder → block → stop ladder from PR #264 at the same 3 / 6 / 7 thresholds as the failure side, so the two signals stay symmetric and only one set of thresholds is ever tuned.
Threshold calibration is done by replaying real session exports (the export already contains every parts[].state.input and state.status, so success-repeat distributions can be computed offline with one jq aggregation grouped by (tool, hash(input), status)). Re-tune only if a future stuck session shows a non-bimodal distribution (i.e. counts in the 4-8 range rather than the current ≥30 vs ≤2 split observed in shiny-moon).
A regression test reproduces the shiny-moon shape: ≥7 identical successful bash invocations in one turn produce a stop synthesis and a Chinese stop-summary text part.
Parallel fan-out within the same step (distinct inputs) does not trip the new signal.
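The offline calibration described in the criteria above can be sketched as a small aggregation. The criteria mention a jq one-liner; this TypeScript version is an equivalent illustration under assumptions. `ToolPart`, `RepeatBucket`, and `repeatDistribution` are hypothetical names, the `tool` field name is assumed, and the serialized input stands in for hash(input). Only state.input and state.status are confirmed by the text to be present in the export.

```typescript
// Assumed minimal shape of an exported tool part.
interface ToolPart {
  tool: string; // assumed field name for the tool identifier
  state: { input: unknown; status: string };
}

interface RepeatBucket {
  tool: string;
  input: string; // serialized input, used in place of a hash
  status: string;
  count: number;
}

// Group parts by (tool, serialized input, status) and count each group.
function repeatDistribution(parts: ToolPart[]): RepeatBucket[] {
  const buckets = new Map<string, RepeatBucket>();
  for (const p of parts) {
    // Note: JSON.stringify is sensitive to object key order.
    const input = JSON.stringify(p.state.input ?? null);
    const key = JSON.stringify([p.tool, input, p.state.status]);
    const existing = buckets.get(key);
    if (existing) {
      existing.count += 1;
    } else {
      buckets.set(key, { tool: p.tool, input, status: p.state.status, count: 1 });
    }
  }
  // Largest groups first, so stuck loops surface at the top.
  return Array.from(buckets.values()).sort((a, b) => b.count - a.count);
}
```

Running this over each exported session and inspecting the `count` values for `status === "completed"` would show whether the distribution stays bimodal or starts to fill the 4-8 range.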
Distribution of commands in the loop region
The exported session contains "diagnostics": {}, meaning the PR #264 gate never recorded a single observation. The commands run in the loop region were:
opencli twitter search "fahdmirza code review app results Kimi GLM" --limit 10 2>&1
opencli twitter search "fahdmirza 6 Top Chinese AI Models results" --limit 10 2>&1
opencli twitter search "fahdmirza 6 Top Chinese AI Models" --limit 10 2>&1
opencli twitter search "fahdmirza code review app results" --limit 10 2>&1
opencli twitter user fahdmirza --limit 10 2>&1
Refs #229, #195. Follows up on PR #264.