[Feature] Loop gate also covers low-yield successful repeats, not only failures #279

## What task are you trying to do?
We want PawWork to recognize stuck low-yield loops even when every tool call exits successfully. The PR #264 v1 gate currently observes only tool failures, so a model that keeps re-running the same successful but useless command is invisible to it. The user has to interrupt manually.

## What do you do today?
Today the only end-of-loop signal in this case is the user typing something like "Are you stuck in an infinite loop?" (originally in Chinese: "你死循环了吗？"). PR #204 used to inject a reminder after 3 same-input successful repeats, but PR #264 explicitly removed that path in favor of the failure-only signature `inputHash | targetHash + errorFingerprint`.

## What would a good result look like?
The harness should also gate on "N identical-signature successful repeats within one user turn" as a separate, independently-counted signal:

- count successful repeats by `(tool, inputHash | targetHash)` only, with no `errorFingerprint` requirement
- keep the success counter separate from the failure counter so neither pollutes the other's threshold
- reuse the existing escalation ladder (reminder → block with autoResume → stop) once the success threshold trips
- avoid firing on legitimate batched parallel calls in the same step where the model is fanning out across distinct inputs
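The counting rules above can be sketched roughly as follows. This is a minimal illustration, not PawWork's actual API: `ToolObservation`, `TurnRepeatCounters`, and the `signature` field (standing in for `inputHash | targetHash`) are all hypothetical names. It shows two separate maps so that neither counter advances the other, and it shows why fan-out across distinct inputs stays at a count of 1 per key.

```typescript
// Hypothetical names throughout; this is not PawWork's actual API.
interface ToolObservation {
  tool: string;
  signature: string;         // stands in for `inputHash | targetHash`
  errorFingerprint?: string; // assumed to be present only when the call failed
}

class TurnRepeatCounters {
  private readonly success = new Map<string, number>();
  private readonly failure = new Map<string, number>();

  /** Records one finished tool call and returns the updated count of its own counter. */
  observe(obs: ToolObservation): { kind: "success" | "failure"; count: number } {
    if (obs.errorFingerprint === undefined) {
      // Success side: keyed by (tool, signature) only.
      const key = JSON.stringify([obs.tool, obs.signature]);
      const count = (this.success.get(key) ?? 0) + 1;
      this.success.set(key, count);
      return { kind: "success", count };
    }
    // Failure side: the key additionally carries the error fingerprint.
    const key = JSON.stringify([obs.tool, obs.signature, obs.errorFingerprint]);
    const count = (this.failure.get(key) ?? 0) + 1;
    this.failure.set(key, count);
    return { kind: "failure", count };
  }

  /** Clears both counters; the signal is scoped to one user turn. */
  reset(): void {
    this.success.clear();
    this.failure.clear();
  }
}

const counters = new TurnRepeatCounters();

// Parallel fan-out over distinct inputs: every key is seen once.
const fanOut = ["a", "b", "c"].map(
  (signature) => counters.observe({ tool: "bash", signature }).count,
);
console.log(fanOut); // [1, 1, 1]

// The same input repeated three times: the success count climbs.
const repeats = [1, 2, 3].map(
  () => counters.observe({ tool: "bash", signature: "same" }).count,
);
console.log(repeats); // [1, 2, 3]

// A failure with the same input starts its own counter at 1.
const failed = counters.observe({
  tool: "bash",
  signature: "same",
  errorFingerprint: "ENOENT",
});
console.log(failed); // { kind: 'failure', count: 1 }
```

The returned count would then be handed to whatever ladder logic the failure side already uses; that wiring is deliberately left out here because it depends on the existing gate.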

## Which audience does this matter to most?
Both

## Extra context

### Repro session

- Session: `shiny-moon` / `ses_22ff9de7effeE5A31FJvclUziw`, exported `2026-04-27 18:05` local
- Model: `alibaba-coding-plan-cn/qwen3.6-plus`
- Total tool calls in the loop region (messages 33-67, after the user asked about a specific @fahdmirza tweet): **66 bash invocations**, all `state.status === "completed"`, all `state.error === null`
- Top-of-loop diagnostics field on the exported session: `"diagnostics": {}` — the PR #264 gate never recorded a single observation

### Distribution of commands in the loop region

| Count | Command |
|------:|---------|
| 32 | `opencli twitter search "fahdmirza code review app results Kimi GLM" --limit 10 2>&1` |
| 31 | `opencli twitter search "fahdmirza 6 Top Chinese AI Models results" --limit 10 2>&1` |
| 1 | `opencli twitter search "fahdmirza 6 Top Chinese AI Models" --limit 10 2>&1` |
| 1 | `opencli twitter search "fahdmirza code review app results" --limit 10 2>&1` |
| 1 | `opencli twitter user fahdmirza --limit 10 2>&1` |

Two distinct query strings account for 63 of the 66 runs, issued back-to-back. Each invocation returned the same handful of tweets (or `error: unknown command 'user'`, folded into stdout by `2>&1`), so from the bash tool's perspective every call exited 0 with non-empty stdout.

### Why PR #264 did not catch it

`packages/opencode/src/session/processor.ts:348-368` — `SessionDiagnostics.observeToolError` is only invoked from `failToolCall`. `packages/opencode/src/session/processor.ts:223-255` — the `errorRecords` feeder filters to parts that already have a `loop.errorFingerprint` (or are synthetic block/stop markers). Successful tool parts never enter the feed, so `applyLoopGate` has zero samples to count and `deriveParentLoopState` sees an empty parent state.

This matches PR #264's own description: *"v1 explicitly removes firing on success, so the test is obsolete."* The session above is the failure mode of that tradeoff in production.
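The empty feed described above can be modelled with a deliberately simplified sketch. It does not reproduce the real code in `processor.ts`: the `ToolPart` shape, the `synthetic` marker field, and this standalone `errorRecords` function are assumptions made for illustration. Only `state.status`, `state.error`, and `loop.errorFingerprint` are taken from the description above.

```typescript
// Simplified, hypothetical model of the feeder; not the code in processor.ts.
interface ToolPart {
  state: { status: "completed" | "error"; error: string | null };
  loop?: { errorFingerprint?: string };
  synthetic?: "block" | "stop"; // assumed marker for gate-injected parts
}

// Keeps only parts that carry an error fingerprint or are synthetic markers.
function errorRecords(parts: ToolPart[]): ToolPart[] {
  return parts.filter(
    (part) => part.loop?.errorFingerprint !== undefined || part.synthetic !== undefined,
  );
}

// 66 successful bash parts, matching the shape of the repro session.
const loopRegion: ToolPart[] = Array.from(
  { length: 66 },
  (): ToolPart => ({ state: { status: "completed", error: null } }),
);

const samples = errorRecords(loopRegion);
console.log(samples.length); // 0
```

With zero samples in the feed, any threshold downstream is unreachable no matter how many times the same successful command repeats.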

### Prior art from external reports

The same failure shape is independently reported on Qwen3 and Kimi K2.6 outside PawWork, which suggests this is a model-family pattern rather than a single-session anomaly:

- **Reddit `r/LocalLLaMA` — "[Qwen3.6-35b stuck in infinite loop](https://www.reddit.com/r/LocalLLaMA/comments/1sshjm0/qwen3635b_stuck_in_infinite_loop/)"** (OP `ConfidentSolution737`, 12 comments). OP describes "the model keeps responding with a repeated text/tool call without ever stopping". Multiple independent commenters reproduce on different stacks and converge on the harness-side conclusion: *"infinite tool call loops are a fundamental issue with reasoning models that don't have a hard stopping condition outside the model itself ... worth adding an external loop guard: a max tool call count per run, or a budget cap that kills the run if it exceeds N steps"* (commenter `MoistApplication5759`). Other commenters report the loop fires under preserve-thinking, that raising presence penalty makes it loop more, and that disabling thinking removes the loop, all of which are sampling-side workarounds rather than fixes.
- **X — [@phonezawphyo, 2026-04-27](https://x.com/i/status/2048601042578034919)** on Kimi K2.6: *"I was using K2.6 with Hermes agent over the weekend and it burnt through weekly limit too fast and it acted totally dumb — infinite loop on repeated patch/read failures etc. ... I tried Kimi cli today and K2.6 is back to its original form."* This one leans into the failure-repeat side that PR #264 already covers, but it is direct evidence that harness-side loop guarding is a perceived product difference between agent shells running the same model.

Neither report is on Qwen3.6-plus specifically, but the reported behavior shape (success-side complete-but-useless repeats and failure-side repeats) matches what we observe locally and what PR #264 only partially covers.

### Out of scope for this issue

- Target-level grouping across tool families (already tracked separately in #229's acceptance criteria; this issue is the narrower "same tool, same input, all-success" cousin).
- Semantic output-emptiness detection. Stay structural and signature-keyed; do not parse tool stdout for "useful or not".
- Any change to the failure-side ladder shipped in #264. The success counter must be additive and independent.

## Acceptance criteria

- A new success-repeat signal exists alongside the existing failure signal, keyed by `(tool, inputHash | targetHash)` with no `errorFingerprint`.
- Success and failure counters are tracked independently; tripping one does not advance the other.
- The escalation reuses the reminder → block → stop ladder from PR #264 at the same `3 / 6 / 7` thresholds as the failure side, so the two signals stay symmetric and only one number is ever tuned.
- Threshold calibration is done by replaying real session exports (the export already contains every `parts[].state.input` and `state.status`, so success-repeat distributions can be computed offline with one `jq` aggregation grouped by `(tool, hash(input), status)`). Re-tune only if a future stuck session shows a non-bimodal distribution (i.e. counts in the 4-8 range rather than the current ≥30 vs ≤2 split observed in `shiny-moon`).
- A regression test reproduces the `shiny-moon` shape: ≥7 identical successful bash invocations in one turn produce a stop synthesis and a Chinese stop-summary text part.
- Parallel fan-out within the same step (distinct inputs) does not trip the new signal.
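As a rough illustration of the offline calibration and the ladder thresholds, the sketch below replays the command distribution from `shiny-moon`. Several things here are assumptions: the `ExportedPart` shape (only `state.input` and `state.status` are named in this issue, the `tool` field is hypothetical), the use of `JSON.stringify` as a stand-in for the real input hash, and the reading of `3 / 6 / 7` as reminder at 3, block at 6, stop at 7.

```typescript
// Hypothetical export shape: only `state.input` and `state.status` are named
// in this issue; the `tool` field is assumed for illustration.
interface ExportedPart {
  tool: string;
  state: { input: unknown; status: string };
}

type LadderAction = "none" | "reminder" | "block" | "stop";

// Assumed reading of the 3 / 6 / 7 thresholds.
function ladderAction(count: number): LadderAction {
  if (count >= 7) return "stop";
  if (count >= 6) return "block";
  if (count >= 3) return "reminder";
  return "none";
}

// Groups parts by (tool, serialized input, status). JSON.stringify stands in
// for the real input hash, which this sketch does not know.
function repeatDistribution(parts: ExportedPart[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const part of parts) {
    const key = JSON.stringify([part.tool, part.state.input, part.state.status]);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

const bash = (command: string): ExportedPart => ({
  tool: "bash",
  state: { input: { command }, status: "completed" },
});
const times = (n: number, command: string): ExportedPart[] =>
  Array.from({ length: n }, () => bash(command));

// The command counts reported in the distribution table.
const replay: ExportedPart[] = [
  ...times(32, 'opencli twitter search "fahdmirza code review app results Kimi GLM" --limit 10 2>&1'),
  ...times(31, 'opencli twitter search "fahdmirza 6 Top Chinese AI Models results" --limit 10 2>&1'),
  ...times(1, 'opencli twitter search "fahdmirza 6 Top Chinese AI Models" --limit 10 2>&1'),
  ...times(1, 'opencli twitter search "fahdmirza code review app results" --limit 10 2>&1'),
  ...times(1, "opencli twitter user fahdmirza --limit 10 2>&1"),
];

const summary = Array.from(repeatDistribution(replay).values())
  .sort((a, b) => b - a)
  .map((count) => ({ count, action: ladderAction(count) }));

console.log(summary.map((row) => `${row.count}:${row.action}`).join(" "));
// 32:stop 31:stop 1:none 1:none 1:none
```

Under these assumptions the two dominant queries land far past the stop threshold while the three one-off commands never reach the reminder, which is the bimodal split the calibration criterion refers to.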

Refs #229, #195. Follows up on PR #264.


