Skip to content

feat(session): gate repeated successful tool calls#406

Merged
Astro-Han merged 3 commits into
devfrom
codex/i279-loop-gate
May 3, 2026
Merged

feat(session): gate repeated successful tool calls#406
Astro-Han merged 3 commits into
devfrom
codex/i279-loop-gate

Conversation

@Astro-Han

@Astro-Han Astro-Han commented May 3, 2026

Copy link
Copy Markdown
Owner

Summary

Implement a unified structural loop gate for repeated tool calls:

  • Count repeated successful tool calls separately from repeated failures.
  • Use the shared 3 / 4 / 5 occurrence ladder: reminder, synthetic block, then stop.
  • Keep same-step parallel calls observable but non-escalating until the model has a chance to react.
  • Keep failure-loop compatibility while intentionally moving failure escalation to the same 3 / 4 / 5 occurrence ladder.
  • Export success-loop diagnostics while keeping backward-compatible completedFailures.

Why

#279 showed that PawWork could detect repeated failures, but repeated successful low-value tool calls could still loop until the user interrupted. #378 improved provider/tool-call matching, but it does not solve successful-repeat loops.

Related Issue

Closes #279.

Human Review Status

Pending. A human should make the final merge decision after reviewing the final diff and verification evidence.

Review Focus

Please focus on loop-gate semantics:

  • Success and failure counters stay independent.
  • Same assistant-step parallel calls do not trigger block/stop too early.
  • Fallback target hashes do not trigger target-level blocking.
  • Failure reminders remain compatible, while failure block/stop thresholds are intentionally unified with success at 3 / 4 / 5.

Risk Notes

Behavior change in the session tool loop gate. Failure loops now use the same 3 / 4 / 5 occurrence ladder as success loops instead of the older looser thresholds. The main residual risk is real-provider streaming timing around highly parallel tool calls, though local prompt-effect coverage exercises the same-step and cross-step boundaries.

No dependencies, migrations, generated files, credentials, permissions, or visible UI changes.

How To Verify

Focused session tests: 144 passed
bun --cwd packages/opencode test test/session/diagnostics.test.ts test/session/loop-gate.test.ts test/session/loop-renderer.test.ts test/session/prompt-effect.test.ts test/session/export.test.ts

Full session tests: 470 passed, 4 skipped, 1 todo, 0 failed
bun --cwd packages/opencode test test/session

Typecheck: passed
bun run --cwd packages/opencode typecheck

Diff check: no whitespace errors
git diff --check

Final reviewer pass: no findings
Reviewer checked success/failure separation, same-step behavior, failure compatibility, and export compatibility.

Screenshots or Recordings

Not applicable. No visible UI changes.

Checklist

  • Human review status is stated above as pending, approved, or not required
  • I linked the related issue, or stated why there is no issue
  • This PR has type, scope, and priority labels, or I requested maintainer labeling
  • I described the review focus and any meaningful risks
  • I listed the relevant verification steps and the key result for each
  • I did not introduce unrelated refactors, dependencies, generated files, or file changes beyond the stated scope
  • I manually checked visible UI or copy changes when needed, with screenshots or recordings
  • I considered macOS and Windows impact for desktop, packaging, updater, signing, paths, shell, or permissions changes
  • I called out docs, release notes, dependencies, permissions, credentials, deletion behavior, generated content, or local file changes when relevant
  • I reviewed the final diff for unrelated changes and suspicious dependency changes
  • I am targeting dev, and my PR title and commit messages use Conventional Commits in English

Summary by CodeRabbit

  • New Features

    • Richer loop diagnostics with explicit success/failure outcomes, occurrence counts, and reminder/block/stop thresholds; snapshot export now surfaces loop outcome, completedCount and occurrenceCount.
  • Bug Fixes

    • Reminder counting now only tallies successful repeats; gating decisions use per-step context and thresholds to reduce false positives and improve block/stop accuracy.
    • Renderer messages consistently report failure counts.
  • Tests

    • Expanded coverage for outcome-prefixed reminders, gating scenarios, snapshot exports, and renderer behavior.

@coderabbitai

coderabbitai Bot commented May 3, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: aa918de4-257e-44a3-ae15-dc4aa326f772

📥 Commits

Reviewing files that changed from the base of the PR and between b1434a6 and c4340ee.

📒 Files selected for processing (6)
  • packages/opencode/src/session/diagnostics.ts
  • packages/opencode/src/session/export.ts
  • packages/opencode/src/session/processor.ts
  • packages/opencode/test/session/export.test.ts
  • packages/opencode/test/session/loop-gate.test.ts
  • packages/opencode/test/session/prompt-effect.test.ts
🚧 Files skipped from review as they are similar to previous changes (5)
  • packages/opencode/src/session/export.ts
  • packages/opencode/test/session/loop-gate.test.ts
  • packages/opencode/test/session/prompt-effect.test.ts
  • packages/opencode/src/session/processor.ts
  • packages/opencode/src/session/diagnostics.ts

📝 Walkthrough

Walkthrough

Session loop diagnostics were rewritten to track structured success/failure signatures and counters, introduce outcome-prefixed signature keys, add threshold-driven reminder/block/stop decisions, and propagate richer loop metadata through observation, parent-state derivation, gating, synthetic block/stop recording, export, renderer, and tests.

Changes

Loop-Gate Success & Failure Tracking (single cohesive DAG)

Layer / File(s) Summary
Types & Constants
packages/opencode/src/session/diagnostics.ts
Adds LoopOutcome, LOOP_THRESHOLDS, and expands SignatureState, GateDecision, and LoopMetadata to include structured outcome, kind, completedCount, and optional completedFailures/counters.
Signature Key Helpers
packages/opencode/src/session/diagnostics.ts
Adds signatureKey(...) and parseReminderKey() to emit/parse outcome-prefixed outcome:kind:tool:hash reminder keys while preserving legacy formats.
Observation Logic
packages/opencode/src/session/diagnostics.ts
observeToolCall counts successful completed records (excludes failure outcomes and loop-only metadata), emits outcome:"success" diagnostics/reminders using signatureKey; observeToolError builds failure candidate keys, computes max completed-failure counters, emits outcome:"failure" diagnostics, fires reminders at LOOP_THRESHOLDS.reminderAt and records errorFingerprint plus loopRecoverFiredFor.
Parent State Derivation
packages/opencode/src/session/diagnostics.ts
deriveParentLoopState accepts optional successRecords, filters by optional currentStepIndex, rebuilds signature buckets keyed by signatureKey, increments completedCount/completedFailures, preserves recoverEmitted/last input/error, and parses synthetic block sigKeys into outcome/kind.
Gate Decision Logic
packages/opencode/src/session/diagnostics.ts
queryGateAction accepts optional outcome, computes structured inputKey/targetKey, uses completedCount/completedFailures, autoResumeSpent/blockEmitted, and nextOccurrenceCount to return enriched block/stop/observe decisions.
Processor / Loop Context
packages/opencode/src/session/processor.ts
Handle.buildLoopContext now returns successRecords and optional currentStepIndex; loopRecords now include only state.status === "completed" parts and targetHash is optional; buildLoopContext refactored into a single-pass aggregator that computes successRecords, errorRecords, syntheticBlockSigKeys, hasStopped, and per-tool stepIndex.
Synthetic Block/Stop Recording
packages/opencode/src/session/processor.ts
recordSyntheticBlock / recordSyntheticStop accept outcome, completedCount, optional completedFailures, and nextOccurrenceCount; stored loop metadata now includes outcome, loopCompletedCount, normalized loopCompletedFailures, and loopOccurrenceCount.
Gate Application in Prompt
packages/opencode/src/session/prompt.ts
applyLoopGate builds richer parentLoopState, queries queryGateAction for both failure and success, chooses a final decision, formats outcome-specific user messages, and records synthetic block/stop with expanded metadata.
Renderer Output
packages/opencode/src/session/loop-renderer.ts
LoopRenderer.render handles outcome === "success" early and uses completedFailures = state.completedFailures ?? state.completedCount for message interpolation.
Export Snapshot
packages/opencode/src/session/export.ts
Export.Snapshot.diagnostics.loop.last and deriveSnapshotDiagnostics() now optionally include outcome, completedCount, and occurrenceCount; deriveSnapshotDiagnostics reads loop.loopCompletedCount (falling back to failures) and normalizes exported fields.
Tests & E2E
packages/opencode/test/session/*
Updated and added tests (diagnostics, loop-gate, export, loop-renderer, prompt-effect, compaction) to validate outcome-prefixed signature keys, separate success/failure counters, threshold behavior, step-index filtering, renderer text, export fields, and synthetic metadata; test fakes updated to include successRecords.

Sequence Diagram

sequenceDiagram
    participant Tool as Tool
    participant Proc as Processor
    participant Diag as Diagnostics
    participant Gate as GateEvaluator
    participant LLM as Model

    Tool->>Proc: tool call completes (success)
    Proc->>Diag: observeToolCall(success record)
    Diag->>Diag: signatureKey(outcome:"success", kind, tool, hash)
    Diag->>Diag: increment success bucket, compute repeats
    alt repeats >= LOOP_THRESHOLDS.reminderAt
        Diag->>Proc: emit success reminder / record loop outcome:"success"
    end

    Tool->>Proc: tool call errors (failure)
    Proc->>Diag: observeToolError(error record)
    Diag->>Diag: signatureKey(outcome:"failure", kind, tool, hash)
    Diag->>Diag: increment failure bucket, compute reminders
    alt failures >= LOOP_THRESHOLDS.reminderAt
        Diag->>Proc: emit failure reminder / record loop outcome:"failure"
    end

    Proc->>Gate: applyLoopGate(parentLoopState)
    Gate->>Diag: queryGateAction(outcome:"failure")
    Gate->>Diag: queryGateAction(outcome:"success")
    Gate->>Gate: chooseGateDecision
    alt decision.block
        Gate->>Proc: recordSyntheticBlock(outcome, completedCount, ...)
    else decision.stop
        Gate->>Proc: recordSyntheticStop(outcome, completedCount, ...)
    end
    Gate->>LLM: return gating action or allow continuation
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • #264 — Previous loop-gate work that tracked failures; this PR generalizes to also track successes and restructures signature keys/logic.
  • #204 — Earlier success-repeat reminder approach; historical context and tests influenced the reintroduction of success tracking.
  • #234 — Overlaps with export snapshot diagnostics changes; touches the same export fields.

Suggested labels

enhancement, P1

Poem

🐰
I count each hop, repeat and fall,
Success and failure—not ignored,
I nudge, then block, then stop it all,
A soft pawed guard on looping horde,
Hops curbed, the model rests restored.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title directly and clearly describes the main feature: implementing a gate that blocks/stops repeated successful tool calls, which is the core objective.
Description check ✅ Passed The PR description is comprehensive and follows the template structure with all required sections (Summary, Why, Related Issue, Review Focus, Risk Notes, How To Verify, and Checklist) appropriately filled.
Linked Issues check ✅ Passed The PR fully implements the acceptance criteria from #279: independent success-repeat counter keyed by (tool, inputHash|targetHash), symmetric 3/4/5 escalation ladder, same-step parallel call handling, and regression test coverage.
Out of Scope Changes check ✅ Passed All changes are tightly scoped to loop-gating diagnostics and session processing. File modifications include SessionDiagnostics, Export, LoopRenderer, Processor, and Prompt files plus corresponding tests, all aligned with the stated objectives.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codex/i279-loop-gate

Review rate limit: 9/10 reviews remaining, refill in 6 minutes.

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements tracking and gating for repeated successful tool requests, mirroring the existing logic for failed requests. It introduces LoopOutcome and LOOP_THRESHOLDS to standardize the loop detection lifecycle (reminders at 3, blocks at 4, and stops at 5 occurrences). Additionally, the system now tracks step indices to avoid blocking parallel tool calls within the same execution step. A redundant logical condition in the gating logic was identified for simplification.

Comment thread packages/opencode/src/session/diagnostics.ts Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
packages/opencode/test/session/prompt-effect.test.ts (1)

583-617: ⚡ Quick win

Add a mixed success/failure loop-gate regression.

These new cases only exercise failure-only histories. Since the gate now merges independent success and failure counters, add one case where the same signature has a failure-side block candidate and a success-side stop candidate, and assert the turn stops. That would catch the precedence bug above.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/opencode/test/session/prompt-effect.test.ts` around lines 583 - 617,
Add a new regression test alongside "loop gate blocks repeated tool errors
across model steps" that builds a mixed success/failure history: create a
session via Session.Service and send parts via SessionPrompt.Service/llm so the
same tool signature produces both failing invocations that would generate a
failure-side "block" candidate and a successful invocation that would produce a
success-side "stop" candidate; then invoke prompt.loop
(SessionPrompt.Service.loop) and assert the loop decision honors the
success-side stop (i.e., the turn stops), checking
MessageV2.filterCompactedEffect for the tool parts and that the final decision
reflects a "stop" outcome rather than being blocked by the failure-side
candidate. Ensure the test uses the same signature, verifies both candidate
types (loopAction "block" and "stop"), and asserts the overall turn stopped.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/opencode/src/session/processor.ts`:
- Around line 314-325: In buildLoopContext (session/processor.ts) the current
logic assigns parts before the first "step-start" in an assistant message a
lower stepIndex (and observedStepIndex = -1), which lets late "step-start"
events make same-step tool bursts look like prior-step history; fix by tracking
whether this message has seen any "step-start" yet (e.g., a local boolean like
sawStepStart while iterating message.parts) and when handling tool parts for the
active assistant message (message.info.id === ctx.assistantMessage.id) set
observedStepIndex to currentStepIndex if no step-start has been seen (instead of
-1); ensure loopWithStep uses loop.stepIndex ?? observedStepIndex so early tool
parts in the current assistant step get the same-step index rather than being
treated as prior-step.

In `@packages/opencode/src/session/prompt.ts`:
- Around line 137-151: The current merge picks failureDecision over
successDecision which can downgrade a "stop" to a "block"; change the merge
logic after computing failureDecision and successDecision (from
SessionDiagnostics.queryGateAction) to pick the strongest action across both: if
either.failureDecision.action === "stop" || successDecision.action === "stop"
then decision = { ...choose stop decision (prefer the one with stop) }, else if
either has action === "block" then decision = { ...choose block decision }, else
decision = the observe outcome; ensure you preserve other metadata from the
chosen decision object (not just the action).

---

Nitpick comments:
In `@packages/opencode/test/session/prompt-effect.test.ts`:
- Around line 583-617: Add a new regression test alongside "loop gate blocks
repeated tool errors across model steps" that builds a mixed success/failure
history: create a session via Session.Service and send parts via
SessionPrompt.Service/llm so the same tool signature produces both failing
invocations that would generate a failure-side "block" candidate and a
successful invocation that would produce a success-side "stop" candidate; then
invoke prompt.loop (SessionPrompt.Service.loop) and assert the loop decision
honors the success-side stop (i.e., the turn stops), checking
MessageV2.filterCompactedEffect for the tool parts and that the final decision
reflects a "stop" outcome rather than being blocked by the failure-side
candidate. Ensure the test uses the same signature, verifies both candidate
types (loopAction "block" and "stop"), and asserts the overall turn stopped.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: cbcd8d9f-41c0-43b7-97eb-20ab19995c1c

📥 Commits

Reviewing files that changed from the base of the PR and between cd8d47f and 610224a.

📒 Files selected for processing (11)
  • packages/opencode/src/session/diagnostics.ts
  • packages/opencode/src/session/export.ts
  • packages/opencode/src/session/loop-renderer.ts
  • packages/opencode/src/session/processor.ts
  • packages/opencode/src/session/prompt.ts
  • packages/opencode/test/session/compaction.test.ts
  • packages/opencode/test/session/diagnostics.test.ts
  • packages/opencode/test/session/export.test.ts
  • packages/opencode/test/session/loop-gate.test.ts
  • packages/opencode/test/session/loop-renderer.test.ts
  • packages/opencode/test/session/prompt-effect.test.ts

Comment thread packages/opencode/src/session/prompt.ts Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/opencode/src/session/processor.ts (1)

209-218: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep input-hash-only successes in loopRecords.

observeToolCall() still builds its reminder history from loopRecords(parentID), but this helper drops any completed success where loop.targetHash is missing. That means tools that only produce an inputHash never set loopRecoverFiredFor, so their success repeats can never advance to block/stop even though this PR is supposed to gate (tool, inputHash | targetHash) signatures.

Suggested fix
-            if (!loop?.inputHash || !loop.targetHash) return []
+            if (!loop?.inputHash) return []
             if (loop.errorFingerprint || loop.loopAction) return []
             return [
               {
                 sessionID: ctx.sessionID,
                 parentID,
                 tool: part.tool,
                 inputHash: loop.inputHash,
-                targetHash: loop.targetHash,
+                targetHash: loop.targetHash ?? "",
                 metadata: { diagnostics: { loop } },
               } satisfies SessionDiagnostics.ToolCallRecord,
             ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/opencode/src/session/processor.ts` around lines 209 - 218, The
helper loopRecords currently rejects loop entries that lack loop.targetHash,
causing successes that only have loop.inputHash to be dropped; update the
predicate in loopRecords (the check around toolDiagnostics(part)?.loop) to only
require loop.inputHash (i.e., change the condition from "if (!loop?.inputHash ||
!loop.targetHash) return []" to "if (!loop?.inputHash) return []") so entries
with an inputHash but no targetHash are kept for observeToolCall() to build
reminder history and set loopRecoverFiredFor; keep the other exclusions
(errorFingerprint, loopAction, completed status, role/parentID checks)
unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@packages/opencode/src/session/processor.ts`:
- Around line 209-218: The helper loopRecords currently rejects loop entries
that lack loop.targetHash, causing successes that only have loop.inputHash to be
dropped; update the predicate in loopRecords (the check around
toolDiagnostics(part)?.loop) to only require loop.inputHash (i.e., change the
condition from "if (!loop?.inputHash || !loop.targetHash) return []" to "if
(!loop?.inputHash) return []") so entries with an inputHash but no targetHash
are kept for observeToolCall() to build reminder history and set
loopRecoverFiredFor; keep the other exclusions (errorFingerprint, loopAction,
completed status, role/parentID checks) unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 27b157f4-6e2c-4093-aea7-fb50224c723e

📥 Commits

Reviewing files that changed from the base of the PR and between 610224a and b1434a6.

📒 Files selected for processing (5)
  • packages/opencode/src/session/diagnostics.ts
  • packages/opencode/src/session/processor.ts
  • packages/opencode/src/session/prompt.ts
  • packages/opencode/test/session/loop-gate.test.ts
  • packages/opencode/test/session/prompt-effect.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • packages/opencode/test/session/loop-gate.test.ts
  • packages/opencode/src/session/diagnostics.ts

Comment thread packages/opencode/src/session/processor.ts
@Astro-Han

Copy link
Copy Markdown
Owner Author

Handled the non-inline CodeRabbit note in c4340ee: loopRecords now keeps input-hash-only successful records by requiring only inputHash and using an empty fallback target hash. The same commit also persists loopOccurrenceCount and adds the symmetric failure stop coverage.

@Astro-Han Astro-Han merged commit 080261e into dev May 3, 2026
26 of 27 checks passed
@Astro-Han Astro-Han deleted the codex/i279-loop-gate branch May 3, 2026 10:46
Astro-Han added a commit that referenced this pull request May 4, 2026
Fix #406 follow-up false positives in the session loop gate.

Problem
- PR #406 made successful repeated tool calls observable, but same-target success repeats were treated as hard loop evidence.
- Normal sessions can read different ranges of the same file, or rerun the same command after a file mutation, while still making progress.

Policy changes
- success + same_target is only a weak reminder signal and never directly hard-blocks or stops.
- success + exact input only gates when prior completed records have the same output and the same parent-turn mutationEpoch.
- mutationEpoch is parent-turn scoped across assistant messages, not message-local.
- failure gates keep the stricter target/input protections.
- synthetic block/stop diagnostics preserve attemptedInput with size guarding.
- sanitized exports redact loop attemptedInput, and success loop exports no longer pretend completedFailures.

Regression coverage
- read same file with different ranges does not block or stop.
- successful exact input with changed output does not gate.
- successful exact input across mutation epochs does not gate.
- cross-assistant-message parent-turn mutationEpoch reset is covered.
- bash file mutation produces patch tracking and resets successful exact-input gating.
- repeated same-output success signature still blocks, and repeated blocked signature still stops.

Verification
- bun --cwd packages/opencode test test/session/loop-gate.test.ts test/session/diagnostics.test.ts test/session/processor-effect.test.ts test/session/prompt-effect.test.ts
- bun --cwd packages/opencode test test/session
- bun run --cwd packages/opencode typecheck
- git diff --check
- GitHub Actions and CodeRabbit were green before merge.
@coderabbitai coderabbitai Bot mentioned this pull request May 29, 2026
13 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Loop gate also covers low-yield successful repeats, not only failures

1 participant