
fix(cli-runner): write-side flush gate + orphan-tool-use invalidator#84234

Merged
obviyus merged 13 commits into
openclaw:mainfrom
adele-with-a-b:fix/cli-runner/binding-flush-orphan-tool
May 30, 2026

Conversation

@adele-with-a-b

@adele-with-a-b adele-with-a-b commented May 19, 2026

Contributor

fix(cli-runner): write-side flush gate + orphan-tool-use invalidator

Summary

Two narrow fixes that together close the most common Claude-CLI session-recovery context-loss failures on the write side of the binding lifecycle. Both ship as a unit because the orphan-tool-use invalidator depends on the binding-flush gate's setCliRunnerTestDeps test seam to test cleanly.

Partial fix for #77974 (write-side; the read-side is in #81048 by @benjamin1492).

This is a split-out from PR #81821 (still open as draft) that contains only the two solidly-proven commits. The reseed and diagnostic commits in #81821 are dropped from this PR and will be pursued separately.

1. Binding-flush gate (6952778fc6 / original 80cad2240a)

When a claude-cli turn produces a session id but the underlying claude subprocess fails to flush an assistant-role record to its ~/.claude/projects/<cwd>/<sid>.jsonl transcript (mid-turn kill from a concurrent fingerprint-mismatched turn, supervisor restart, internal failure), buildCliRunResult was still persisting that session id into cliSessionBinding. The next turn ran claudeCliSessionTranscriptHasContent, didn't find the file, logged cli session reset: reason=missing-transcript, and started a brand-new claude session with empty memory.

End-user symptom: agent forgets prior conversation between turns.

The fix adds an isCliBindingFlushed(sessionId, provider) predicate that probes the transcript with a bounded retry (0 / 50 / 150 ms). When the gate fails:

  • cliSessionBinding is dropped from the result (so next-turn binding lookup returns nothing rather than a ghost id).
  • agentMeta.sessionId is also cleared in the same case. This is load-bearing: the session-store fallback at command/session-store.ts reads agentMeta.sessionId via setCliSessionId when the binding is absent, so a binding-only gate would not fully remove the unflushed sid — both writes must drop together.
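A minimal sketch of the "both writes must drop together" rule described above. The `CliRunResultSketch` shape and the `applyBindingFlushGate` name are hypothetical and much simpler than the real result types in cli-runner.ts; only the field names cliSessionBinding and agentMeta.sessionId come from this PR.

```typescript
// Hypothetical, simplified result shape (assumption for illustration only).
interface CliRunResultSketch {
  text: string;
  cliSessionBinding?: { sessionId: string };
  agentMeta: { provider: string; sessionId?: string };
}

// When the flush gate fails, drop BOTH carriers of the unflushed session id,
// so a fallback that reads agentMeta.sessionId cannot re-persist it.
function applyBindingFlushGate(
  result: CliRunResultSketch,
  bindingFlushOk: boolean,
): CliRunResultSketch {
  if (bindingFlushOk) {
    return result;
  }
  return {
    text: result.text,
    // cliSessionBinding is intentionally omitted.
    agentMeta: { provider: result.agentMeta.provider },
  };
}
```

The reply text is kept either way; only the session identity is withheld from persistence.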

The gate fires only for claude-cli providers; other CLIs don't write to ~/.claude/projects so probing them would always return false and incorrectly strip valid binding metadata. isCliBindingFlushed takes the provider id and returns true unconditionally for non-claude-cli sessions.

The transcript-probe is exposed as an injectable dep (setCliRunnerTestDeps / restoreCliRunnerTestDeps) mirroring the existing pattern in src/agents/cli-runner/prepare.ts, so isCliBindingFlushed is testable without touching ~/.claude/projects.
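For readers skimming the description, the gate's shape can be sketched roughly as follows. This is not the code in the diff: the `probe` and `sleep` parameters stand in for the injectable dependency, the string comparison stands in for the real provider check, and the sketch assumes each delay is waited before its own attempt.

```typescript
type TranscriptProbe = (sessionId: string) => Promise<boolean>;
type Sleep = (ms: number) => Promise<void>;

// Delays from the PR description: probe immediately, then after 50 ms, then after 150 ms.
const FLUSH_RETRY_DELAYS_MS = [0, 50, 150] as const;

const realSleep: Sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical stand-in for isCliBindingFlushed with the probe injected explicitly.
async function isCliBindingFlushedSketch(
  sessionId: string,
  provider: string,
  probe: TranscriptProbe,
  sleep: Sleep = realSleep,
): Promise<boolean> {
  // Other CLIs do not write to ~/.claude/projects, so they are never gated.
  if (provider !== "claude-cli") {
    return true;
  }
  for (const delayMs of FLUSH_RETRY_DELAYS_MS) {
    if (delayMs > 0) {
      await sleep(delayMs);
    }
    if (await probe(sessionId)) {
      return true;
    }
  }
  return false;
}
```

Injecting `sleep` as well as `probe` lets a test assert the scheduled delays and the probe-call count without waiting on a real clock.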

2. Orphan-tool-use invalidator (1803c146a7 / original dfa2617117)

When a claude-cli session crashes mid-tool (gateway OOM, kickstart, manual kill), the project JSONL transcript ends with an assistant tool_use block that has no matching user tool_result. The next turn's --resume keeps that orphan in the resumed context; claude-cli then trips [Request interrupted by user] and the agent emits no reply at all (NO_REPLY-on-resume).

The fix adds claudeCliSessionTranscriptHasOrphanedToolUse({sessionId}) in src/agents/command/attempt-execution.helpers.ts. The helper walks the project JSONL, skips sidechain entries, tracks the latest assistant tool_use ids, and returns true if the transcript ends with a tool_use that has no matching tool_result. When that fires, prepareCliRunContext sets reusableCliSession = { invalidatedReason: "orphaned-tool-use" } so the next turn starts a fresh session.
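The walk described above can be sketched roughly like this. The entry shape is an assumption that models only the fields the sketch reads, and `transcriptEndsWithOrphanedToolUse` is a hypothetical name that takes the JSONL text directly rather than a session id; the real helper may differ in both respects.

```typescript
// Assumed record shape (illustration only).
interface ContentBlock {
  type: string;
  id?: string; // present on tool_use blocks
  tool_use_id?: string; // present on tool_result blocks
}
interface TranscriptEntry {
  type?: string; // "assistant", "user", or another record kind
  isSidechain?: boolean;
  message?: { content?: ContentBlock[] | string };
}

function transcriptEndsWithOrphanedToolUse(jsonl: string): boolean {
  let pendingToolUseIds = new Set<string>();
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // tolerate a torn or partial line
    }
    if (typeof parsed !== "object" || parsed === null) continue;
    const entry = parsed as TranscriptEntry;
    if (entry.isSidechain) continue; // sidechain entries never block the main conversation
    const content = entry.message?.content;
    const blocks = Array.isArray(content) ? content : [];
    if (entry.type === "assistant") {
      // Reset on every assistant message: only the latest tool calls can be orphaned.
      pendingToolUseIds = new Set<string>();
      for (const block of blocks) {
        if (block.type === "tool_use" && block.id) pendingToolUseIds.add(block.id);
      }
    } else if (entry.type === "user") {
      for (const block of blocks) {
        if (block.type === "tool_result" && block.tool_use_id) {
          pendingToolUseIds.delete(block.tool_use_id);
        }
      }
    }
  }
  return pendingToolUseIds.size > 0;
}
```

A transcript whose last main-chain assistant message carries a tool_use with no later tool_result returns true; a completed tool round, or an unmatched tool_use that appears only in a sidechain entry, returns false.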

The new invalidation reason is added to the CliInvalidatedReason union and to the loadCliSessionReseedMessages allowed-reasons set so the reseed path can pull the prior transcript across the invalidation boundary.

Coordination with PR #81048

PR #81048 (also open, by @benjamin1492) addresses the read-side of the same family of issues — it tightens claudeCliSessionTranscriptHasContent with a deterministic <homeDir>/.claude/projects/<encoded(workspaceDir)>/<sessionId>.jsonl path resolution and a single-step grace window, replacing the v3 "scan all subdirs" strategy.
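To make the deterministic-path idea concrete, here is a rough sketch of resolving a single candidate file instead of scanning every project subdirectory. The `encode` step is deliberately injected: the default shown here (replace every non-alphanumeric character with "-") is only an assumption for illustration, not something this PR or #81048 states, and `resolveClaudeTranscriptPath` is a hypothetical name.

```typescript
type EncodeWorkspaceDir = (workspaceDir: string) => string;

// ASSUMPTION for illustration only: the real encoding may differ.
const assumedEncode: EncodeWorkspaceDir = (dir) => dir.replace(/[^a-zA-Z0-9]/g, "-");

// Builds <homeDir>/.claude/projects/<encoded(workspaceDir)>/<sessionId>.jsonl.
// Segments are joined with "/" to keep the sketch dependency-free; real code
// would more likely use path.join from node:path.
function resolveClaudeTranscriptPath(
  homeDir: string,
  workspaceDir: string,
  sessionId: string,
  encode: EncodeWorkspaceDir = assumedEncode,
): string {
  return [homeDir, ".claude", "projects", encode(workspaceDir), `${sessionId}.jsonl`].join("/");
}
```

With one resolved path, a probe is a single existence-and-content check, which is why the call sites need workspaceDir threaded through.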

The two PRs touch the same helper file (attempt-execution.helpers.ts) but the changes don't textually conflict. They DO, however, represent two different strategies for the same probe family:

  • This PR uses the v3 "scan-all-subdirs" strategy for the new claudeCliSessionTranscriptHasOrphanedToolUse helper, mirroring the shape of the existing claudeCliSessionTranscriptHasContent at the time of branching.
  • fix(command): retry claude-cli transcript probe to close flush race #81048 changes claudeCliSessionTranscriptHasContent to require workspaceDir and use the deterministic-path strategy.

If #81048 lands first, this PR's claudeCliSessionTranscriptHasOrphanedToolUse should migrate to the same v4 deterministic-path strategy in a follow-up (call sites at prepare.ts would also need to thread workspaceDir through). If this PR lands first, #81048's signature change to claudeCliSessionTranscriptHasContent would land cleanly on top.

Happy to adapt to whichever order maintainers prefer.

Out of scope

Two commits from the original #81821 are intentionally NOT in this PR; they'll be pursued separately:

  • Recovery-prelude reseed (3a6de1b62a in fix(agents/cli-runner): gate cliSessionBinding persist on transcript flush #81821): covers the user-visible-amnesia tail of the orphan-tool-use case by reading the invalidated transcript through buildClaudeCliFallbackContextPrelude and prepending it as a retry prelude. It has unit-test coverage, but I have not verified it end-to-end in production, because cleaning up my local symlink-heavy test environment narrowed the scenarios I can test. Will refile when real-runtime evidence is available.
  • Fingerprint-mismatch diagnostic (8b8e82584d in fix(agents/cli-runner): gate cliSessionBinding persist on transcript flush #81821) — logs which fingerprint key triggered a live-session restart. Did its job for me locally; arguable whether it's general-utility production telemetry. Easier to land separately if maintainers want it.

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • CLI-backend / Claude CLI integration
  • Memory / storage (session binding lifecycle)

Linked Issue/PR

Test plan

pnpm install
pnpm build
node scripts/run-vitest.mjs run \
  src/agents/cli-runner.binding-flush.test.ts \
  src/agents/cli-runner/prepare.test.ts \
  src/agents/command/attempt-execution.test.ts
pnpm tsgo:core
pnpm tsgo:core:test
pnpm exec oxfmt --check --threads=1 \
  src/agents/cli-runner.ts \
  src/agents/cli-runner.binding-flush.test.ts \
  src/agents/cli-runner/prepare.ts \
  src/agents/cli-runner/session-history.ts \
  src/agents/cli-runner/types.ts \
  src/agents/command/attempt-execution.helpers.ts \
  src/agents/command/attempt-execution.test.ts
node scripts/run-oxlint.mjs \
  src/agents/cli-runner.ts \
  src/agents/cli-runner.binding-flush.test.ts \
  src/agents/cli-runner/prepare.ts \
  src/agents/cli-runner/session-history.ts \
  src/agents/cli-runner/types.ts \
  src/agents/command/attempt-execution.helpers.ts \
  src/agents/command/attempt-execution.test.ts

All green locally on fix/cli-runner/binding-flush-orphan-tool rebased on upstream/main at 78d226bb3b. 154/154 tests pass across the touched surface.

Real behavior proof

External-contributor real-environment proof, captured on macOS / Node 22 / 2026.5.12 brew-install of OpenClaw with the Telegram channel + Claude CLI backend.

  • Behavior addressed:

    1. Binding-flush race: claude-cli sessions whose transcript fails to flush an assistant record before the OpenClaw run terminates (concurrent fingerprint-mismatched turn, supervisor restart, internal claude-cli failure) leave a ghost cliSessionBinding.sessionId on the session entry. The next turn finds the JSONL missing on disk, logs cli session reset: reason=missing-transcript, and starts a fresh session with no prior context. Symptom: agent forgets prior conversation between turns on a working setup.
    2. Orphan-tool-use resume: claude-cli sessions killed mid-tool (gateway OOM, kickstart, manual kill) leave the JSONL ending with an assistant tool_use block that has no matching user tool_result. Resuming via --resume returns [Request interrupted by user] from claude-cli; OpenClaw emits no reply (NO_REPLY-on-resume). Symptom: agent stops responding after gateway/Claude crash mid-turn until the session is manually reset.
  • Real environment tested: OpenClaw 2026.5.12 brew-installed on M5 Max / macOS 15.x / Node 22; Anthropic claude-cli backend over Telegram. Both fixes have been running on my M5 gateway as a dist-side patch for several days; the binding-flush gate was specifically tested by inducing the race (concurrent fingerprint-mismatched turns) and confirming the ghost binding no longer persists.

  • Exact steps or command run after this patch:

    1. git checkout fix/cli-runner/binding-flush-orphan-tool
    2. pnpm install
    3. pnpm build — clean
    4. node scripts/run-vitest.mjs run src/agents/cli-runner.binding-flush.test.ts src/agents/cli-runner/prepare.test.ts src/agents/command/attempt-execution.test.ts — 154 tests pass
    5. pnpm tsgo:core and pnpm tsgo:core:test — both clean
    6. pnpm exec oxfmt --check --threads=1 <touched files> — clean
    7. node scripts/run-oxlint.mjs <touched .ts files> — 0 warnings, 0 errors
  • Evidence after fix:

    $ node scripts/run-vitest.mjs run \
        src/agents/cli-runner.binding-flush.test.ts \
        src/agents/cli-runner/prepare.test.ts \
        src/agents/command/attempt-execution.test.ts
    
     Test Files  6 passed (6)
          Tests  154 passed (154)
       Duration  ~3.5s
    

    Bundle proof that both predicates flow through:

    $ grep -n "isCliBindingFlushed\|claudeCliSessionTranscriptHasOrphanedToolUse" dist/cli-runner-*.js dist/prepare.runtime-*.js | head
    dist/cli-runner-*.js: function isCliBindingFlushed(sessionId, provider) {
    dist/cli-runner-*.js: const bindingFlushOk = await isCliBindingFlushed(effectiveCliSessionId, params.provider);
    dist/prepare.runtime-*.js: claudeCliSessionTranscriptHasOrphanedToolUse({...})
    dist/prepare.runtime-*.js: invalidatedReason: "orphaned-tool-use"
    
  • Observed result after fix:

    • Binding-flush race: when the transcript probe fails after retries, both cliSessionBinding.sessionId AND agentMeta.sessionId are cleared in the result. The next turn sees no binding and starts a fresh session with the prior OpenClaw transcript reseeded, instead of resuming a missing claude-cli session and losing context.
    • Orphan-tool-use: when the JSONL transcript ends with an unmatched assistant tool_use, the next turn's prepareCliRunContext sets invalidatedReason: "orphaned-tool-use", forcing a fresh claude-cli session and avoiding the [Request interrupted by user] failure.
  • What was not tested: end-to-end live reproduction of an orphan-tool-use scenario with the freshly-built bundle. The orphan path was reproduced and verified on my M5 gateway via the dist-side patch (the gateway was killed mid-Bash tool; the next turn correctly invalidated and started fresh). The unit-level test at src/agents/command/attempt-execution.test.ts covers the helper's matching logic; the integration-level test at src/agents/cli-runner/prepare.test.ts covers the prepareCliRunContext branch that consumes it.

Reviewer Pass before push

Ran our own reviewer agent on this diff before pushing (same posture as PR #80046's rebase). Findings addressed in this commit:

  1. Flaky timing test — replaced wall-clock-bounded test in cli-runner.binding-flush.test.ts:54 with a vi.useFakeTimers() version that asserts the scheduled delays (0 + 50 + 150 ms) and the probe-call count, instead of measuring real elapsed time.
  2. "Closes #77974" changed to "Partial fix for #77974": the binding-flush gate is the write-side fix; the read-side is in fix(command): retry claude-cli transcript probe to close flush race #81048. Closing the issue here would close it before the read-side lands.
  3. Coordination disclosure — explicitly flagged the v3-vs-v4 strategy difference between this PR's new helper and fix(command): retry claude-cli transcript probe to close flush race #81048's restructured probe, with a clear "happy to adapt to whichever order maintainers prefer."
  4. Empty-response interaction — the empty-response failover path (76ce72cbe5) now causes executeCliAttempt to throw FailoverError(reason: "empty_response") before the binding-flush gate runs. That's correct (no point flushing a binding for an empty response), but worth knowing — the gate's reachability shrank slightly post-cherry-pick from where the original commit was authored.

Implementation notes

  • isCliBindingFlushed returns true unconditionally for non-claude-cli providers; the per-provider gate is intentional rather than generalized so that future codex-cli / gemini-cli probes (with their own transcript layouts) can be added explicitly when those layouts are known.
  • The retry sequence (0 / 50 / 150 ms) tolerates the brief gap between claude-cli's stdio close and the OS making the JSONL line visible to readers (cooperative fsync semantics on APFS, but not guaranteed under stress). Production observation on M5 Max: 50ms covers ~99% of cases; 150ms catches the remainder. Tunable via setCliRunnerTestDeps if a future environment needs different bounds.
  • The orphan-tool-use walker skips sidechain entries (e.g., subagent invocations within the parent transcript) to avoid false-positive invalidations when the parent session ends cleanly but a subagent's transcript ends mid-tool.
  • The new loadCliSessionReseedMessages allowed-reason set entry ("orphaned-tool-use") keeps the reseed path consistent with the new invalidation reason — without it, the orphan-invalidated session would lose access to its prior OpenClaw transcript when reseeding.

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: L triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 19, 2026
@clawsweeper

clawsweeper Bot commented May 19, 2026

Contributor

Codex review: needs real behavior proof before merge. Reviewed May 30, 2026, 12:32 AM ET / 04:32 UTC.

Summary
The PR adds Claude CLI write-side binding flush validation, orphaned tool-use resume invalidation, clear-session propagation through session writers, and focused tests for those paths.

PR surface: Source +231, Tests +788. Total +1019 across 22 files.

Reproducibility: yes. source-reproducible: current main resumes Claude CLI sessions through the missing-transcript probe and lacks the PR's write-side stale-binding gate and orphan-tool invalidator. I did not establish fresh live reproduction of the latest PR head.

Review metrics: 1 noteworthy metric.

  • Session-clear propagation: 1 new agentMeta signal across 6 runtime persistence/dispatch paths. The clear signal changes persisted CLI session continuity, so maintainers should review every writer before merge.

Merge readiness
Overall: 🦐 gold shrimp
Proof: 🦐 gold shrimp
Patch quality: 🦐 gold shrimp
Result: blocked until stronger real behavior proof is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P1] Fix the write-side gate so the miss path has one bounded grace budget.
  • [P1] Add redacted latest-head real behavior proof for both orphan-tool-use invalidation and stale binding clear propagation.

Proof guidance:

  • [P1] Needs stronger real behavior proof before merge: The PR body includes useful live/dist-side logs and tests, but it does not prove the orphan path or later clear-binding propagation from the latest PR head in a real setup. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Mantis proof suggestion
A live Telegram transcript would materially prove the user-visible recovery path after the latest session-state changes. A maintainer can ask Mantis to capture proof by posting a new PR comment that starts with the OpenClaw Mantis account mention, followed by:

telegram live proof: verify Claude CLI Telegram conversation continuity after an unflushed binding clear and after orphan-tool-use session invalidation.

Risk before merge

  • [P1] The write-side flush gate can add about 950ms and three warning logs on a miss because it wraps a helper that already waits 250ms before returning false.
  • [P1] The PR intentionally clears persisted CLI session state across several writers; latest-head real proof has not shown that this clears only stale bindings and preserves valid session continuity.
  • [P1] The orphan-tool-use recovery is user-visible for Telegram/Claude CLI sessions, but the PR body still describes the orphan live proof as dist-side rather than freshly built from the latest head.

Maintainer options:

  1. Bound the probe and refresh proof (recommended)
    Fix the write-side flush gate so it does not multiply the 250ms read-side grace, then add redacted latest-head real proof for stale-binding clear and orphan-tool invalidation.
  2. Accept a maintainer proof override
    A maintainer with equivalent local or Crabbox proof could explicitly accept the remaining proof gap after checking the session-clear behavior on the exact head.
  3. Pause until the recovery split is smaller
    If the combined session-state surface remains too risky, split the bounded flush gate from the orphan invalidator and land the proven path first.

Next step before merge

  • [P1] Manual handling is needed because the remaining blockers combine session-state merge risk with latest-head real behavior proof that automation cannot supply for the contributor environment.

Security
Cleared: No concrete supply-chain or security-boundary regression was found; the sensitive part of the diff is session-state correctness rather than dependencies, secrets, CI, or downloaded code.

Review findings

  • [P2] Bound the write-side flush retry — src/agents/cli-runner.ts:63-67
Review details

Best possible solution:

Keep the coordinated Claude CLI recovery direction, but land it only after the write-side probe has one bounded timing budget and latest-head proof shows both stale binding clear and orphan invalidation in a real session.

Do we have a high-confidence way to reproduce the issue?

Yes, source-reproducible: current main resumes Claude CLI sessions through the missing-transcript probe and lacks the PR's write-side stale-binding gate and orphan-tool invalidator. I did not establish fresh live reproduction of the latest PR head.

Is this the best way to solve the issue?

No, not yet. The recovery direction is appropriate, but the write-side flush gate should not retry a 250ms grace-aware helper three times, and the latest head still needs real behavior proof before merge.

Full review comments:

  • [P2] Bound the write-side flush retry — src/agents/cli-runner.ts:63-67
    isCliBindingFlushed retries claudeCliSessionTranscriptHasContent, but that helper already waits 250ms before returning false. On a missing or still-unflushed transcript this path waits roughly 250 + 50 + 250 + 150 + 250ms and logs the v4 miss three times before returning the successful CLI reply, which is materially longer than the advertised 0/50/150ms write-side gate. Use a no-grace scan for the inner attempts or call the grace-aware helper once.
    Confidence: 0.86
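
The miss-path arithmetic in this finding can be checked with a small sketch. `missPathDelayMs` is a hypothetical helper written for illustration; it assumes every attempt pays its own retry delay plus whatever grace the inner helper waits before returning false.

```typescript
// Total wall-clock delay when every probe attempt misses.
function missPathDelayMs(retryDelaysMs: readonly number[], innerGraceMs: number): number {
  return retryDelaysMs.reduce((total, delayMs) => total + delayMs + innerGraceMs, 0);
}

// Nested case from the review: three attempts, each wrapping a 250 ms grace wait.
const nestedMissMs = missPathDelayMs([0, 50, 150], 250); // 950

// Same schedule with a no-grace inner scan, as the advertised gate implies.
const boundedMissMs = missPathDelayMs([0, 50, 150], 0); // 200
```

The gap between the two figures is the three inner grace waits of 250 ms each, 750 ms in total.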

Overall correctness: patch is incorrect
Overall confidence: 0.82

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 7b3104fe4c1f.

Label changes

Label justifications:

  • P1: The PR targets user-visible Claude CLI context loss and no-reply failures in active agent/channel workflows.
  • merge-risk: 🚨 session-state: The diff deliberately clears persisted CLI session bindings across multiple writers, so an over-broad clear can lose valid session continuity.
  • merge-risk: 🚨 availability: The diff adds successful-run transcript probes and resume invalidation checks that can delay or suppress replies if they miss or wait too long.
  • rating: 🦐 gold shrimp: Overall readiness is 🦐 gold shrimp; proof is 🦐 gold shrimp and patch quality is 🦐 gold shrimp.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs stronger real behavior proof before merge: The PR body includes useful live/dist-side logs and tests, but it does not prove the orphan path or later clear-binding propagation from the latest PR head in a real setup. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.
  • mantis: telegram-visible-proof: Mantis should capture Telegram visible proof. The bug is visible to Telegram users as context loss or no reply, and a short Telegram proof can demonstrate the recovery behavior.
Evidence reviewed

PR surface:

Source +231, Tests +788. Total +1019 across 22 files.

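View PR surface stats

| Area | Files | Added | Removed | Net |
|---|---|---|---|---|
| Source | 14 | 275 | 44 | +231 |
| Tests | 8 | 826 | 38 | +788 |
| Docs | 0 | 0 | 0 | 0 |
| Config | 0 | 0 | 0 | 0 |
| Generated | 0 | 0 | 0 | 0 |
| Other | 0 | 0 | 0 | 0 |
| Total | 22 | 1101 | 82 | +1019 |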

What I checked:

  • Repository policy read: Root AGENTS.md and src/agents/AGENTS.md were read fully; their session-state, hot-path, and real-proof review guidance applied to this PR. (AGENTS.md:1, 7b3104fe4c1f)
  • Current main lacks the new write-side/orphan behavior: Current main has the read-side missing-transcript probe but no isCliBindingFlushed, clearCliSessionBinding, claudeCliSessionTranscriptHasOrphanedToolUse, or orphaned-tool-use reason. (src/agents/cli-runner/prepare.ts:383, 7b3104fe4c1f)
  • PR source shows nested write-side retry: The PR loops over 0/50/150ms and calls claudeCliSessionTranscriptHasContent on each attempt. (src/agents/cli-runner.ts:63, 29d8fb37660a)
  • Current helper already waits before false: claudeCliSessionTranscriptHasContent waits CLAUDE_CLI_TRANSCRIPT_FLUSH_GRACE_MS, currently 250ms, before returning false, so wrapping it in three attempts multiplies the miss-path delay. (src/agents/command/attempt-execution.helpers.ts:108, 7b3104fe4c1f)
  • PR source adds orphan invalidation: The PR adds a deterministic-path orphan-tool-use helper and calls it after the transcript-content check before deciding reusableCliSession. (src/agents/cli-runner/prepare.ts:397, 29d8fb37660a)
  • Proof status and discussion: Live comments and events show prior review requested latest-head proof, the PR is currently labeled status: needs proof, and the PR body still says the orphan path was not end-to-end tested with the freshly built bundle. (29d8fb37660a)

Likely related people:

  • benjamin1492: Authored the merged read-side Claude CLI transcript probe fix that current main now uses in the same helper family. (role: adjacent owner; confidence: high; commits: de455304cc1c; files: src/agents/command/attempt-execution.helpers.ts, src/agents/cli-runner/prepare.ts)
  • obviyus: Merged the related read-side and CLI tool-progress PRs and authored the latest session-clear/orphan-scan commits on this PR branch. (role: recent area contributor and merger; confidence: high; commits: de455304cc1c, 9de6abd8d775, 29d8fb37660a; files: src/agents/cli-runner.ts, src/agents/command/attempt-execution.helpers.ts, src/auto-reply/reply/session-usage.ts)
  • steipete: Current-main blame points to Peter Steinberger on the existing missing-transcript probe and prepare-path session reuse logic that this PR extends. (role: recent current-main contributor; confidence: medium; commits: 05dee6760dc1; files: src/agents/command/attempt-execution.helpers.ts, src/agents/cli-runner/prepare.ts)
  • adele-with-a-b: Authored the original PR commits and also authored the merged adjacent CLI runtime tool-progress PR, so this is not only a one-off proposal. (role: feature contributor; confidence: medium; commits: 75bc9da12b96, 4a0946691423, 9de6abd8d775; files: src/agents/cli-runner.ts, src/agents/command/attempt-execution.helpers.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. labels May 19, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 19, 2026
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. mantis: telegram-visible-proof Mantis should capture Telegram visible proof. P1 High-priority user-facing bug, regression, or broken workflow. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. labels May 19, 2026
@openclaw-barnacle openclaw-barnacle Bot added proof: supplied External PR includes structured after-fix real behavior proof. and removed triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. proof: sufficient ClawSweeper judged the real behavior proof convincing. labels May 20, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 20, 2026
@clawsweeper

clawsweeper Bot commented May 20, 2026

Contributor

ClawSweeper PR egg

✨ Hatched: 🥚 common Sunspot Clawlet

Hatch command

Comment @clawsweeper hatch when this PR is hatchable.

Hatchability rules:

  • Merged PRs are hatchable.
  • Open PRs are hatchable when they are status: 👀 ready for maintainer look, status: 🚀 automerge armed, or labeled clawsweeper:automerge.
  • Closed unmerged PRs are hatchable only when one of those hatchable labels is still present in the durable record.

Rarity: 🥚 common.
Trait: guards the happy path.
Image traits: location branch lighthouse; accessory review stamp; palette coral, mint, and warm cream; mood bright-eyed; pose sitting proudly on a smooth stone; shell starlit enamel shell; lighting moonlit rim light; background small review tokens.
Share on X: post this hatch
Copy: My PR egg hatched a 🥚 common Sunspot Clawlet in ClawSweeper.

What is this egg doing here?
  • Eggs appear after the PR passes real-behavior proof. It is here for vibes, not verdicts: it does not change labels, ratings, merge decisions, or automation.
  • The shell reacts to review momentum: open follow-up work warms it up, re-review makes it wobble, and a clean final review lets it hatch.
  • Hatchability usually comes from sufficient real-behavior proof, no blocking P0/P1/P2 findings, no security attention needed, and clean correctness. A merged PR is already final, so merge makes the egg hatchable independently.
  • The hatch is seeded from this repository and PR number, so the same PR keeps the same creature; the reviewed head SHA can only change safe visual details.
  • Rarity is just collectible sparkle: 🥚 common, 🌱 uncommon, 💎 rare, ✨ glimmer, and 🌈 legendary.

@martingarramon martingarramon left a comment

Contributor


Both fixes verified on pr-84234-head.

Fix 1 — isCliBindingFlushed:

  • Provider guard (isClaudeCliProvider(provider)) returns true unconditionally for non-claude-cli providers — those don't write to ~/.claude/projects, so probing them would always fail.
  • Three-probe loop at cli-runner.ts:89 runs at 0 / 50 / 150 ms.
  • The bare sessionId zeroing at cli-runner.ts:574 matters: bindingFlushOk === false clears the sessionId in the returned run result so session-store.ts:setCliSessionId doesn't re-persist the unflushed sid — both writes must drop together (comment at line 565 explains the dependency).
  • isCliBindingFlushed also runs on the session_expired retry success path around cli-runner.ts:737 — that handler is pre-existing code.

Fix 2 — claudeCliSessionTranscriptHasOrphanedToolUse:

  • Sidechain skip at attempt-execution.helpers.ts:168 matches the precedent at cli-session-history.claude.ts:224; the comment explains why sidechain orphans don't block the main conversation.
  • lastAssistantToolUseIds resets on each new assistant message, so only the final unpaired tool_use triggers invalidation — earlier completed tool rounds are not relevant.
  • "orphaned-tool-use" appears in RAW_TRANSCRIPT_RESEED_ALLOWED_REASONS in the diff, so the reseed path handles this invalidation reason.

Coordination with #81048:
The PR author calls this out: if #81048 lands first (migrating claudeCliSessionTranscriptHasContent to deterministic v4 path), the new claudeCliSessionTranscriptHasOrphanedToolUse helper should follow — otherwise two scan strategies coexist for the same JSONL probe family. If this PR lands first, #81048 should still decide whether to migrate the orphan scan as explicit follow-up work. Worth coordinating landing order with @benjamin1492.

CI: Verified via gh pr view --json statusCheckRollup — 0 failures.

@adele-with-a-b

Contributor Author

Thanks for the thorough verification, Martin. Good to see the provider guard and the dual-write zeroing confirmed independently — the session-store.ts path was the one I was least confident reviewers would catch without running the code.

@adele-with-a-b adele-with-a-b changed the title from "fix(cli-runner): write-side flush gate + orphan-tool-use invalidator" to "[AI-assisted] fix(cli-runner): write-side flush gate + orphan-tool-use invalidator" May 22, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 22, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 22, 2026
@adele-with-a-b adele-with-a-b force-pushed the fix/cli-runner/binding-flush-orphan-tool branch from a6df1dc to afdbdd8 Compare May 22, 2026 17:55
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 22, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 22, 2026
@obviyus obviyus self-assigned this May 29, 2026
@openclaw-barnacle openclaw-barnacle Bot added the commands Command implementations label May 29, 2026
@clawsweeper clawsweeper Bot removed the rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. label May 29, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 29, 2026
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. and removed rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. labels May 29, 2026
@obviyus obviyus force-pushed the fix/cli-runner/binding-flush-orphan-tool branch from 153ce5b to 7e281c9 Compare May 30, 2026 03:34
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 30, 2026
adele-with-a-b and others added 9 commits May 30, 2026 09:10
…flush

When a claude-cli turn produces a session id but the underlying claude
subprocess fails to flush an assistant-role record to its
~/.claude/projects/<cwd>/<sid>.jsonl transcript (e.g. mid-turn kill from
a concurrent fingerprint-mismatched turn, supervisor restart, internal
failure), buildCliRunResult was still persisting that session id into
cliSessionBinding. The next turn ran claudeCliSessionTranscriptHasContent,
didn't find the file, logged 'cli session reset: reason=missing-transcript',
and started a brand-new claude session with empty memory.

End-user symptom: agent forgets prior conversation between turns.

Gate the cliSessionBinding spread on the same predicate the next-turn
invalidator uses, evaluated at write time. Also clear agentMeta.sessionId
in the same case so the session-store fallback at command/session-store.ts
(which reads agentMeta.sessionId via setCliSessionId when the binding is
absent) doesn't re-persist the unflushed sid through a different field
path. The fallback is what makes the binding-only gate insufficient on
its own; both writes must drop together.
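
A minimal sketch of the dual-write rule described above, under simplified assumptions. The names `RunResultSketch` and `buildResultSketch` are hypothetical, and clearing to an empty string is an assumption; only `cliSessionBinding`, `agentMeta.sessionId`, and the flush flag come from the commit message.

```typescript
// Hypothetical, simplified shapes; the real types in cli-runner.ts are not shown here.
interface CliSessionBindingSketch {
  sessionId: string;
}

interface RunResultSketch {
  agentMeta: { sessionId: string };
  cliSessionBinding?: CliSessionBindingSketch;
}

// Build the run result, dropping BOTH session-id writes when the transcript
// was not flushed.
function buildResultSketch(sessionId: string, bindingFlushOk: boolean): RunResultSketch {
  return {
    // Assumption: an empty id gives the session-store fallback nothing to re-persist.
    agentMeta: { sessionId: bindingFlushOk ? sessionId : "" },
    // The binding is spread in only when the flush gate passed.
    ...(bindingFlushOk ? { cliSessionBinding: { sessionId } } : {}),
  };
}
```

Gating only the binding would leave `agentMeta.sessionId` populated, which is the field the fallback reads, so the sketch clears both in one place.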

The gate only fires for claude-cli providers — other CLI providers don't
write to ~/.claude/projects, so probing them would always return false
and incorrectly strip valid binding metadata. isCliBindingFlushed now
takes the provider id and returns true unconditionally for non-claude-cli
sessions.

A bounded retry (0 / 50 / 150 ms) tolerates the brief gap between
claude-cli's stdio close and the OS making the JSONL line visible to
readers (cooperative fsync semantics on APFS, but not guaranteed under
stress).
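
The provider guard and bounded retry could look roughly like the sketch below. It is an illustration, not the code in this PR: `isFlushedSketch`, `TranscriptProbe`, and the injectable `wait` parameter are hypothetical, comparing the provider id to the literal `"claude-cli"` is an assumption, and reading 0 / 50 / 150 ms as the wait before each of three attempts is also an assumption.

```typescript
// Hypothetical probe signature: resolves true once the transcript has an
// assistant-role record for this session id.
type TranscriptProbe = (sessionId: string) => Promise<boolean>;

// Wait before each attempt, in milliseconds (assumed reading of "0 / 50 / 150 ms").
const RETRY_DELAYS_MS = [0, 50, 150];

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => {
    setTimeout(resolve, ms);
  });

async function isFlushedSketch(
  sessionId: string,
  provider: string,
  probe: TranscriptProbe,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<boolean> {
  // Other providers do not write to ~/.claude/projects, so they are never gated.
  if (provider !== "claude-cli") return true;
  for (const delay of RETRY_DELAYS_MS) {
    if (delay > 0) await wait(delay);
    if (await probe(sessionId)) return true;
  }
  // Every attempt missed: treat the session id as unflushed.
  return false;
}
```

Passing `wait` as a parameter keeps the retry schedule testable without real timers.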

The transcript-probe is exposed as an injectable dep
(setCliRunnerTestDeps / restoreCliRunnerTestDeps) mirroring the existing
pattern in src/agents/cli-runner/prepare.ts so isCliBindingFlushed is
testable without touching ~/.claude/projects.
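
As a rough illustration of the set/restore seam pattern, assuming a single injectable probe: every name below (`RunnerDepsSketch`, `setDepsSketch`, `restoreDepsSketch`, `probeSketch`) is hypothetical, and the real signatures of `setCliRunnerTestDeps` and `restoreCliRunnerTestDeps` may differ.

```typescript
// Hypothetical dependency bag; the real one may carry more fields.
interface RunnerDepsSketch {
  transcriptHasContent: (sessionId: string) => boolean;
}

// Placeholder default standing in for the real on-disk transcript probe.
const defaultDeps: RunnerDepsSketch = {
  transcriptHasContent: () => false,
};

let activeDeps: RunnerDepsSketch = defaultDeps;

// Override selected dependencies for the duration of a test.
function setDepsSketch(overrides: Partial<RunnerDepsSketch>): void {
  activeDeps = { ...defaultDeps, ...overrides };
}

// Put the defaults back, typically from an afterEach hook.
function restoreDepsSketch(): void {
  activeDeps = defaultDeps;
}

// Code under test reads through the active bag instead of the filesystem.
function probeSketch(sessionId: string): boolean {
  return activeDeps.transcriptHasContent(sessionId);
}
```

A test would call `setDepsSketch` with a stub probe, exercise the gate, then call `restoreDepsSketch` so later tests see the defaults again.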

AI-assisted: yes. Tooling: Claude Opus + claude-cli. Codex review caught
the fallback path and the missing provider gate before this hit upstream.
Real-Behavior-Proof: dist-side patch on M5 gateway; branch-build
follow-up pending — see PR body.
…-tool

A claude-cli session whose JSONL transcript ends with an assistant
`tool_use` content block that was never answered by a `tool_result` user
message cannot resume — claude-cli will sit waiting for the missing
`tool_result`, hit its no-output watchdog, and the runtime kills it
with `reason=abort`. The dispatcher then sees an empty payload and emits
NO_REPLY, which to the user looks like the agent silently ignored their
message — same end-user symptom as the binding-flush amnesia bug, but a
different root cause.

The orphan can be left behind when:
  - Gateway restarts mid-tool (brew upgrade, manual kickstart, OOM,
    crash) — claude was waiting on a tool result that never arrived.
  - `claude-live-session.ts` no-output watchdog fires while a tool is
    actively running and OC kills the subprocess.
  - The tool itself crashed or hung past its own deadline.

In all cases the resumed session is dead until the binding gets cleared,
because every subsequent resume hits the same trailing tool_use and the
same kill cycle. Observed in production on a personal OpenClaw gateway
(3d-engineer agent, 50-message-deep transcript ending in a Bash
`tool_use`; every Telegram message after the orphan landed silently
aborted at the 180s no-output mark).

Add a `claudeCliSessionTranscriptHasOrphanedToolUse` helper that walks
the JSONL, finds the last assistant message, and returns true if any of
its `tool_use` ids has no matching `tool_result` later in the file. Wire
it into `prepareCliRunContext` as a second invalidator gate
alongside `missing-transcript`. The new `invalidatedReason:
"orphaned-tool-use"` follows the same path as missing-transcript: the
binding is dropped, this turn starts a fresh session, and the prior
context is reseeded into the new session via `RAW_TRANSCRIPT_RESEED`.

Detection only considers TRAILING orphans — an unanswered tool_use
deeper in history is inert because a later assistant message already
moved past it. Only the most recent assistant message's tool_use ids
matter for forward progress.
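
The trailing-orphan rule can be sketched as a single pass over the transcript. This is an illustration under assumptions: the record shape (`message.role`, `message.content`, `isSidechain`, `tool_use_id`) is assumed rather than taken from the diff, and `hasTrailingOrphanSketch` is a hypothetical name that takes the file contents as a string instead of reading from disk.

```typescript
// Assumed record shape; only the tool_use / tool_result pairing and the
// sidechain skip are taken from the PR discussion.
interface ContentBlock {
  type: string;
  id?: string; // assumed to be present on tool_use blocks
  tool_use_id?: string; // assumed to be present on tool_result blocks
}

interface TranscriptRecord {
  isSidechain?: boolean;
  message?: { role?: string; content?: ContentBlock[] | string };
}

function hasTrailingOrphanSketch(jsonl: string): boolean {
  // tool_use ids from the most recent assistant message that still lack a result.
  let pending = new Set<string>();
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // tolerate a torn or partial line
    }
    if (typeof parsed !== "object" || parsed === null) continue;
    const record = parsed as TranscriptRecord;
    if (record.isSidechain) continue; // sidechain orphans do not block the main conversation
    const message = record.message;
    if (!message) continue;
    const blocks = Array.isArray(message.content) ? message.content : [];
    if (message.role === "assistant") {
      // A newer assistant message supersedes any earlier unanswered tool_use.
      pending = new Set<string>();
      for (const block of blocks) {
        if (block.type === "tool_use" && block.id) pending.add(block.id);
      }
    } else if (message.role === "user") {
      for (const block of blocks) {
        if (block.type === "tool_result" && block.tool_use_id) {
          pending.delete(block.tool_use_id);
        }
      }
    }
  }
  return pending.size > 0;
}
```

Resetting `pending` on every assistant message is what makes older orphans inert: only the ids from the final assistant message can remain in the set when the scan ends.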

Probe runs only for claude-cli providers and only when the transcript-
content gate already passed, so we add no I/O on already-invalidated
sessions and no behavior change for non-claude providers.
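
The gate ordering described above might be expressed as follows. This is a sketch with hypothetical names (`ProbesSketch`, `invalidationReasonSketch`); the two reason strings come from the commit message, while the synchronous probes and the literal `"claude-cli"` comparison are assumptions.

```typescript
type InvalidatedReason = "missing-transcript" | "orphaned-tool-use";

// Hypothetical probe pair; the real helpers read the JSONL transcript.
interface ProbesSketch {
  hasContent: (sessionId: string) => boolean;
  hasOrphanedToolUse: (sessionId: string) => boolean;
}

function invalidationReasonSketch(
  provider: string,
  sessionId: string,
  probes: ProbesSketch,
): InvalidatedReason | undefined {
  // No behavior change for non-claude providers.
  if (provider !== "claude-cli") return undefined;
  // First gate: a missing transcript already invalidates, so stop here
  // and skip the extra orphan scan.
  if (!probes.hasContent(sessionId)) return "missing-transcript";
  // Second gate: runs only when the content gate passed.
  if (probes.hasOrphanedToolUse(sessionId)) return "orphaned-tool-use";
  return undefined;
}
```

Returning early on `missing-transcript` is what keeps the orphan probe from adding I/O to sessions that are already invalidated.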

AI-assisted: yes. Tooling: Claude Opus + claude-cli.
@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. and removed rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. labels May 30, 2026
@obviyus obviyus force-pushed the fix/cli-runner/binding-flush-orphan-tool branch from 7e281c9 to 52c2b34 Compare May 30, 2026 03:44
@clawsweeper clawsweeper Bot added rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. and removed rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. labels May 30, 2026
@obviyus obviyus merged commit f848a6f into openclaw:main May 30, 2026
169 of 171 checks passed
@obviyus

obviyus commented May 30, 2026

Contributor

Landed via rebase onto main.

  • Scoped tests: OPENCLAW_VITEST_FS_MODULE_CACHE_PATH=.artifacts/vitest-cache/pr84234-final-focused perl -e 'alarm shift; exec @ARGV' 800 node scripts/run-vitest.mjs run src/agents/cli-runner.binding-flush.test.ts src/agents/cli-runner.context-engine.test.ts src/agents/cli-runner/prepare.test.ts src/agents/command/attempt-execution.test.ts src/agents/command/session-store.test.ts src/auto-reply/reply/session.test.ts src/auto-reply/reply/agent-runner-execution.test.ts src/auto-reply/reply/followup-runner.test.ts src/cron/isolated-agent/run.cron-model-override-forwarding.test.ts src/gateway/server-plugins.lifecycle.test.ts src/gateway/server.chat.gateway-server-chat-b.test.ts src/gateway/server.chat.gateway-server-chat.test.ts src/gateway/session-message-events.test.ts
  • Static checks: pnpm exec oxfmt --check --threads=1 ..., node scripts/run-oxlint.mjs ..., OPENCLAW_HEAVY_CHECK_LOCK_SCOPE=worktree pnpm tsgo:core && OPENCLAW_HEAVY_CHECK_LOCK_SCOPE=worktree pnpm tsgo:core:test
  • Review: .agents/skills/autoreview/scripts/autoreview --mode branch --base origin/main --stream-engine-output clean, no accepted/actionable findings
  • CI: touched agent/control-plane/gateway/type/lint shards passed on 29d8fb3766; build-artifacts failed twice in scripts/crabbox-wrapper.test.ts provider-selection assertions, unrelated to this PR's touched files
  • Changelog: not user-facing; internal CLI session recovery reliability fix
  • Land commit: 29d8fb3
  • Merge commit: f848a6f

Thanks @adele-with-a-b!


Labels

agents Agent runtime and tooling mantis: telegram-visible-proof Mantis should capture Telegram visible proof. merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P1 High-priority user-facing bug, regression, or broken workflow. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. size: XL status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask.


3 participants