[Feature] Loop gate also covers low-yield successful repeats, not only failures #279

@Astro-Han

What task are you trying to do?

We want PawWork to recognize stuck low-yield loops even when every tool call exits successfully. The PR #264 v1 gate currently observes only tool failures, so a model that keeps re-running the same successful but useless command is invisible to it. The user has to interrupt manually.

What do you do today?

Today the only end-of-loop signal in this case is the user typing something like "Are you stuck in an infinite loop?" (originally typed in Chinese). PR #204 used to inject a reminder after 3 same-input successful repeats, but PR #264 explicitly removed that path in favor of the failure-only signature inputHash | targetHash + errorFingerprint.

What would a good result look like?

The harness should also gate on "N identical-signature successful repeats within one user turn" as a separate, independently-counted signal:

  • count successful repeats by (tool, inputHash | targetHash) only, with no errorFingerprint requirement
  • keep the success counter separate from the failure counter so neither pollutes the other's threshold
  • reuse the existing escalation ladder (reminder → block with autoResume → stop) once the success threshold trips
  • avoid firing on legitimate batched parallel calls in the same step where the model is fanning out across distinct inputs
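
A minimal sketch of what such a success-side counter could look like, to make the bullets above concrete. Everything here is hypothetical: `SuccessRepeatCounter`, `SuccessObservation`, `LoopAction`, and `THRESHOLDS` are illustrative names, not existing PawWork or opencode APIs, and `signatureHash` stands in for whatever inputHash | targetHash value the harness already computes. The thresholds assume the 3 / 6 / 7 ladder proposed in the acceptance criteria.

```typescript
// Hypothetical sketch; none of these names exist in the codebase.
type LoopAction = "none" | "reminder" | "block" | "stop";

interface SuccessObservation {
  tool: string;
  // Stand-in for inputHash | targetHash; any stable string key works here.
  signatureHash: string;
}

// Assumed thresholds, mirroring the 3 / 6 / 7 ladder proposed in this issue.
const THRESHOLDS = { reminder: 3, block: 6, stop: 7 };

class SuccessRepeatCounter {
  // Only successful calls are recorded here. A failure counter would live in
  // a separate instance or map, so neither signal advances the other.
  private counts = new Map<string, number>();

  // Record one successful call and return the ladder rung for its new count.
  observe(obs: SuccessObservation): LoopAction {
    const key = JSON.stringify([obs.tool, obs.signatureHash]);
    const n = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, n);
    if (n >= THRESHOLDS.stop) return "stop";
    if (n >= THRESHOLDS.block) return "block";
    if (n >= THRESHOLDS.reminder) return "reminder";
    return "none";
  }

  // Assumed to be called at the start of every user turn.
  reset(): void {
    this.counts.clear();
  }
}
```

Because the key includes the input signature, a parallel fan-out over distinct inputs lands in distinct buckets and never accumulates toward a threshold; only repeats of the same (tool, signature) pair do.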

Which audience does this matter to most?

Both

Extra context

Repro session

  • Session: shiny-moon / ses_22ff9de7effeE5A31FJvclUziw, exported 2026-04-27 18:05 local
  • Model: alibaba-coding-plan-cn/qwen3.6-plus
  • Total tool calls in the loop region (messages 33-67, after the user asked about a specific @fahdmirza tweet): 66 bash invocations, all state.status === "completed", all state.error === null
  • Top-of-loop diagnostics field on the exported session: "diagnostics": {}, meaning the PR #264 gate never recorded a single observation

Distribution of commands in the loop region

Count  Command
32     opencli twitter search "fahdmirza code review app results Kimi GLM" --limit 10 2>&1
31     opencli twitter search "fahdmirza 6 Top Chinese AI Models results" --limit 10 2>&1
1      opencli twitter search "fahdmirza 6 Top Chinese AI Models" --limit 10 2>&1
1      opencli twitter search "fahdmirza code review app results" --limit 10 2>&1
1      opencli twitter user fahdmirza --limit 10 2>&1

Two distinct query strings, run a combined 63 times back-to-back. Each invocation returned the same handful of tweets (or error: unknown command 'user' swallowed via 2>&1), so from the bash tool's perspective every call exited 0 with non-empty stdout.

Why PR #264 did not catch it

  • In packages/opencode/src/session/processor.ts:348-368, SessionDiagnostics.observeToolError is only invoked from failToolCall.
  • In packages/opencode/src/session/processor.ts:223-255, the errorRecords feeder filters to parts that already have a loop.errorFingerprint (or are synthetic block/stop markers). Successful tool parts never enter the feed, so applyLoopGate has zero samples to count and deriveParentLoopState sees an empty parent state.

This matches PR #264's own description: "v1 explicitly removes firing on success, so the test is obsolete." The session above is the failure mode of that tradeoff in production.

Prior art from external reports

The same failure shape is independently reported on Qwen3 and Kimi K2.6 outside PawWork, which suggests this is a model-family pattern rather than a single-session anomaly:

  • Reddit r/LocalLLaMA — "Qwen3.6-35b stuck in infinite loop" (OP ConfidentSolution737, 12 comments). OP describes "the model keeps responding with a repeated text/tool call without ever stopping". Multiple independent commenters reproduce on different stacks and converge on the harness-side conclusion: "infinite tool call loops are a fundamental issue with reasoning models that don't have a hard stopping condition outside the model itself ... worth adding an external loop guard: a max tool call count per run, or a budget cap that kills the run if it exceeds N steps" (commenter MoistApplication5759). Other commenters report the loop fires under preserve-thinking, that raising presence penalty makes it loop more, and that disabling thinking removes the loop, all of which are sampling-side workarounds rather than fixes.
  • X — @phonezawphyo, 2026-04-27 on Kimi K2.6: "I was using K2.6 with Hermes agent over the weekend and it burnt through weekly limit too fast and it acted totally dumb — infinite loop on repeated patch/read failures etc. ... I tried Kimi cli today and K2.6 is back to its original form." This one leans into the failure-repeat side that PR #264 already covers, but it is direct evidence that harness-side loop guarding is a perceived product difference between agent shells running the same model.

Neither report is on Qwen3.6-plus specifically, but the reported behavior shape (success-side complete-but-useless repeats and failure-side repeats) matches what we observe locally and what PR #264 only partially covers.

Out of scope for this issue

Acceptance criteria

  • A new success-repeat signal exists alongside the existing failure signal, keyed by (tool, inputHash | targetHash) with no errorFingerprint.
  • Success and failure counters are tracked independently; tripping one does not advance the other.
  • The escalation reuses the reminder → block → stop ladder from PR #264 at the same 3 / 6 / 7 thresholds as the failure side, so the two signals stay symmetric and only one number is ever tuned.
  • Threshold calibration is done by replaying real session exports (the export already contains every parts[].state.input and state.status, so success-repeat distributions can be computed offline with one jq aggregation grouped by (tool, hash(input), status)). Re-tune only if a future stuck session shows a non-bimodal distribution (i.e. counts in the 4-8 range rather than the current ≥30 vs ≤2 split observed in shiny-moon).
  • A regression test reproduces the shiny-moon shape: ≥7 identical successful bash invocations in one turn produce a stop synthesis and a Chinese stop-summary text part.
  • Parallel fan-out within the same step (distinct inputs) does not trip the new signal.
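
For the offline calibration step, the grouping by (tool, hash(input), status) could look roughly like the sketch below, written in TypeScript rather than jq. The `ToolPart` shape is an assumption: the issue only confirms that parts carry state.input and state.status, so the `tool` field name and the flat `parts` array are hypothetical, as is the `repeatDistribution` helper. JSON.stringify is used as a stand-in for a real input hash, which means it is sensitive to key order.

```typescript
// Hypothetical shape of a tool part in a session export (assumed, not confirmed).
interface ToolPart {
  tool: string;
  state: { status: string; input: unknown };
}

interface RepeatRow {
  tool: string;
  input: string; // serialized input, standing in for hash(input)
  status: string;
  count: number;
}

// Group tool parts by (tool, input, status) and return rows sorted by count,
// highest first, so stuck signatures surface at the top.
function repeatDistribution(parts: ToolPart[]): RepeatRow[] {
  const rows = new Map<string, RepeatRow>();
  for (const part of parts) {
    const input = JSON.stringify(part.state.input);
    const key = JSON.stringify([part.tool, input, part.state.status]);
    const row = rows.get(key);
    if (row) {
      row.count += 1;
    } else {
      rows.set(key, { tool: part.tool, input, status: part.state.status, count: 1 });
    }
  }
  return Array.from(rows.values()).sort((a, b) => b.count - a.count);
}
```

Run over the parts of a stuck session, the top rows show whether the distribution is bimodal (a few signatures with very high counts, everything else near 1) or spread across the 4-8 range that would call for re-tuning.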

Refs #229, #195. Follows up on PR #264.

Metadata

    Labels

    P2 (Medium priority), enhancement (New feature or request), harness (Model harness, prompts, tool descriptions, and session mechanics)
