
fix(agents/cli-runner): gate cliSessionBinding persist on transcript flush #81821

Closed
adele-with-a-b wants to merge 4 commits into openclaw:main from adele-with-a-b:fix/cli-backend-binding-flush-gate

Conversation

@adele-with-a-b

@adele-with-a-b adele-with-a-b commented May 14, 2026

Contributor

What & why

Two distinct bugs in the claude-cli resume path that produce the same end-user symptom — the agent silently fails to respond, or starts a fresh session with no prior memory. On my M5 OpenClaw gateway both showed up as "Telegram amnesia": rapid follow-ups would either get ignored (silent abort) or come back as if the bot had no memory of the conversation. ClawSweeper called for "one coordinated Claude CLI transcript-flush fix that preserves stale-session protection, gates write-side ghost session ids, keeps non-Claude providers untouched" — this PR is that coordinated fix, on the resume-validity predicate path that #81048 also touches.

Fix 1: cliSessionBinding persisted before transcript flushed

When a claude-cli turn produces a session id but the underlying claude subprocess fails to flush an assistant-role record to its ~/.claude/projects/<cwd>/<sid>.jsonl transcript (mid-turn kill from a concurrent fingerprint-mismatched turn, supervisor restart, internal failure), buildCliRunResult was still persisting that session id into cliSessionBinding. The next turn runs claudeCliSessionTranscriptHasContent(sid), doesn't find an assistant message, logs cli session reset: reason=missing-transcript, and starts a fresh claude session with no prior memory.

This PR gates the cliSessionBinding write on the same predicate the next-turn invalidator already uses, evaluated at write time. If the transcript doesn't have an assistant message at the moment we'd persist the binding, the binding is omitted and agentMeta.sessionId is cleared ("") so the session-store fallback at command/session-store.ts:166-170 — which reads agentMeta.sessionId and persists it via setCliSessionId(...) when no binding is present — also drops the unflushed sid. Both writes have to drop together; gating only the binding lets the same bad sid round-trip through the fallback path.

The probe is gated on isClaudeCliProvider(params.provider). Other CLI providers (codex-cli, etc.) don't write to ~/.claude/projects so probing them would always return false and incorrectly strip valid binding metadata. For non-claude-cli providers isCliBindingFlushed returns true unconditionally and behavior is unchanged.
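The paired drop described above can be sketched as follows. This is a minimal illustration, not the PR's code: the `CliRunResultSketch` shape, `buildResultSketch`, and `sessionIdToPersist` are hypothetical names, and the fallback logic is a simplified reading of the description of command/session-store.ts given in this PR body.

```typescript
// Hypothetical, simplified result shape. The real types in cli-runner.ts are richer.
interface CliRunResultSketch {
  agentMeta: { sessionId: string };
  cliSessionBinding?: { sessionId: string };
}

// Build the result, dropping BOTH session-id writes when the transcript
// was not observed as flushed at write time.
function buildResultSketch(sessionId: string, bindingFlushOk: boolean): CliRunResultSketch {
  if (!bindingFlushOk) {
    // No binding, and an empty agentMeta.sessionId so the fallback has nothing to persist.
    return { agentMeta: { sessionId: "" } };
  }
  return { agentMeta: { sessionId }, cliSessionBinding: { sessionId } };
}

// Simplified model of the session-store fallback: prefer the binding,
// otherwise fall back to a non-empty agentMeta.sessionId.
function sessionIdToPersist(result: CliRunResultSketch): string | undefined {
  return result.cliSessionBinding?.sessionId ?? (result.agentMeta.sessionId || undefined);
}

// A binding-only gate leaks the unflushed id through the fallback:
const bindingOnlyGate: CliRunResultSketch = { agentMeta: { sessionId: "ghost-sid" } };
console.log(sessionIdToPersist(bindingOnlyGate)); // the ghost id survives

// Dropping both writes together leaves nothing to persist:
console.log(sessionIdToPersist(buildResultSketch("ghost-sid", false)));
```

The point of the sketch is only the coupling: under this model, any gate that omits the binding must also clear agentMeta.sessionId, or the same id is persisted by the other path.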

Fix 2: trailing orphaned tool_use blocks all forward progress on resume

A claude-cli session whose JSONL ends with an assistant tool_use content block that was never answered by a tool_result user message cannot resume — claude-cli will sit waiting for the missing tool_result, hit its no-output watchdog, and the runtime kills it with reason=abort. The dispatcher then sees an empty payload and emits NO_REPLY, which to the user looks like the agent silently ignored their message.

The orphan can be left behind when:

  • Gateway restarts mid-tool (brew upgrade, manual kickstart, OOM, crash) — claude was waiting on a tool result that never arrived
  • claude-live-session.ts no-output watchdog fires while a tool is actively running and OC kills the subprocess
  • The tool itself crashed or hung past its own deadline

In all cases the resumed session is dead until the binding gets cleared, because every subsequent resume hits the same trailing tool_use and the same kill cycle. Critically, the existing claudeCliSessionTranscriptHasContent invalidator does not catch this — it only checks whether any assistant message exists, not whether the most recent one is a complete turn.

This PR adds claudeCliSessionTranscriptHasOrphanedToolUse in command/attempt-execution.helpers.ts. It walks the JSONL, finds the last assistant message, and returns true if any of its tool_use ids has no matching tool_result later in the file. prepareCliRunContext runs it as a second gate alongside missing-transcript, with invalidatedReason: "orphaned-tool-use". The new reason is added to RAW_TRANSCRIPT_RESEED_ALLOWED_REASONS so prior context is reseeded into the new session.

Detection only considers TRAILING orphans — an unanswered tool_use deeper in history is inert because a later assistant message already moved past it. Probe runs only for claude-cli providers and only when the transcript-content gate already passed, so we add no I/O on already-invalidated sessions and no behavior change for non-claude providers.
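A minimal sketch of the trailing-orphan walk, under stated assumptions: it operates on already-read JSONL lines rather than on disk, the function name `hasTrailingOrphanedToolUse` is hypothetical, and it treats each non-sidechain assistant record as one message. The record fields it reads (`isSidechain`, `message.role`, `message.content[].type`, `id`) match the transcript line quoted in the proof section; `tool_use_id` on `tool_result` blocks is assumed from the Anthropic message format.

```typescript
// Narrow an unknown JSON value to a plain object, or undefined.
function asRecord(value: unknown): Record<string, unknown> | undefined {
  return typeof value === "object" && value !== null
    ? (value as Record<string, unknown>)
    : undefined;
}

// Returns true when the LAST assistant record has a tool_use id that no
// later user tool_result answers. Earlier orphans are ignored because a
// later assistant record resets the pending set.
function hasTrailingOrphanedToolUse(jsonlLines: readonly string[]): boolean {
  let pendingToolUseIds = new Set<string>();
  for (const line of jsonlLines) {
    if (line.trim() === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // skip malformed lines instead of failing the whole probe
    }
    const record = asRecord(parsed);
    if (!record || record.isSidechain === true) continue; // sidechain entries are skipped
    const message = asRecord(record.message);
    if (!message) continue;
    const blocks: unknown[] = Array.isArray(message.content) ? message.content : [];
    if (message.role === "assistant") {
      // A newer assistant record supersedes anything still pending.
      pendingToolUseIds = new Set<string>();
      for (const block of blocks) {
        const b = asRecord(block);
        if (b && b.type === "tool_use" && typeof b.id === "string") {
          pendingToolUseIds.add(b.id);
        }
      }
    } else if (message.role === "user") {
      for (const block of blocks) {
        const b = asRecord(block);
        if (b && b.type === "tool_result" && typeof b.tool_use_id === "string") {
          pendingToolUseIds.delete(b.tool_use_id);
        }
      }
    }
  }
  return pendingToolUseIds.size > 0;
}
```

Resetting the pending set on every assistant record is what makes this a trailing-only check; a depth-N check would instead accumulate ids across the whole file.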

Root cause walkthroughs

Fix 1 (binding-flush)

  1. resolveSessionIdToSend mints a fresh UUID via crypto.randomUUID() when useResume=false.
  2. The UUID is passed to claude-cli as --session-id.
  3. buildCliRunResult writes cliSessionBinding.sessionId = output.sessionId ?? context.reusableCliSession.sessionId regardless of whether claude-cli committed an assistant record to disk.
  4. If the subprocess is killed mid-turn — for example, the live-session manager calls closeLiveSession(session, "restart") on fingerprint mismatch (src/agents/cli-runner/claude-live-session.ts:78 and :110) — the .jsonl may exist with only the user/system prelude, no "role":"assistant", or may not exist at all.
  5. Next turn calls claudeCliSessionTranscriptHasContent(sid), finds nothing → invalidatedReason: "missing-transcript" → fresh session, empty memory.

Fix 2 (orphan-tool)

  1. Resume turn N runs successfully through claude-cli; assistant emits a tool_use block (e.g. Bash command).
  2. While the runtime is waiting for the tool to return, one of the following happens: the gateway restarts (brew upgrade, manual kickstart, etc.), claude-live-session.ts:closeLiveSession(_, "restart") fires from an unrelated fingerprint mismatch, or the tool itself hangs.
  3. claude-cli's transcript line for the tool_use is flushed, but the corresponding tool_result user message is never written.
  4. Resume turn N+1 calls claude --resume <sid>; claude-cli loads the JSONL, sees the trailing unanswered tool_use, and waits indefinitely for a tool_result that will never arrive.
  5. After 180s of no streaming output, OC's claude-live-session.ts no-output watchdog kills the subprocess with reason=abort.
  6. The dispatcher sees the empty final payload, emits NO_REPLY, and the user-visible result is silence.
  7. This repeats on every turn for that session because the binding still points at the dead transcript; the only escape is clearing the binding.

False-negative window

There's a brief gap between claude-cli closing its stdio and the OS making the JSONL line readable to a separate process. A single-shot probe immediately after the await can race that window and falsely conclude the transcript is empty, which would cause the missing-transcript reset we're trying to prevent.

Fix 1's isCliBindingFlushed handles this with a bounded retry: 3 probes at delays [0, 50, 150] ms (max ~200 ms total scheduled delay). Empirically, the first probe succeeds in the common case, and the retries cover the long tail without serializing the happy path. If all three probes return false the binding is dropped and the next turn starts fresh — which is the correct behavior, since at that point the transcript genuinely isn't there.
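The bounded retry can be sketched like this. It is a hedged illustration rather than the PR's implementation: `isBindingFlushedSketch`, the boolean `isClaudeCli` parameter, and the injectable `sleep` are hypothetical stand-ins (the PR describes a provider id and a test-deps seam), while the delay schedule and the early-return rules come from the description and test list in this PR body.

```typescript
// Delay scheduled before each probe, in milliseconds: 0 + 50 + 150 = 200 ms worst case.
const PROBE_DELAYS_MS: readonly number[] = [0, 50, 150];

type TranscriptProbe = (sessionId: string) => Promise<boolean>;
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function isBindingFlushedSketch(
  isClaudeCli: boolean,
  sessionId: string | undefined,
  probe: TranscriptProbe,
  sleep: Sleep = realSleep,
): Promise<boolean> {
  // Non-claude providers never write to ~/.claude/projects, so never probe them.
  if (!isClaudeCli) return true;
  // Without a session id there is nothing to look up.
  if (!sessionId) return false;
  for (const delayMs of PROBE_DELAYS_MS) {
    if (delayMs > 0) await sleep(delayMs);
    // First success wins, so the happy path pays no scheduled delay.
    if (await probe(sessionId)) return true;
  }
  // All probes failed: treat the transcript as genuinely missing.
  return false;
}
```

Injecting `sleep` is a design choice for this sketch only; it lets a test assert the schedule (`[50, 150]` actually slept) without spending wall-clock time.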

Tests

src/agents/cli-runner.binding-flush.test.ts (Fix 1) — exercises isCliBindingFlushed directly via the new injectable-deps seam (setCliRunnerTestDeps / restoreCliRunnerTestDeps, mirroring the existing pattern in src/agents/cli-runner/prepare.ts):

  • returns false without probing for claude-cli with an undefined sessionId
  • returns true on first-probe success
  • retries 3 times before giving up
  • succeeds on a later retry when the transcript becomes visible mid-sequence
  • bounded under 400 ms wall-clock for the full 3-probe sequence (200 ms scheduled + jitter ceiling)
  • returns true without probing for non-claude-cli providers (codex-cli, anthropic, openai)
  • returns true without probing when the provider is undefined

src/agents/command/attempt-execution.test.ts (Fix 2) — exercises claudeCliSessionTranscriptHasOrphanedToolUse against synthetic on-disk JSONL fixtures:

  • returns false when the transcript is missing
  • returns false when the last assistant message has no tool_use
  • returns false when every tool_use has a matching tool_result
  • returns true when the last assistant message has a trailing tool_use without tool_result (the bug repro)
  • returns true when the last assistant has multiple tool_use and at least one is orphaned
  • returns false when an earlier assistant tool_use is unanswered but the last assistant message resolved cleanly (proves it's a TRAILING-orphan check, not a depth-N orphan check)
  • rejects path-like session ids instead of escaping the Claude projects tree
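The last case above, rejecting path-like session ids, could look roughly like the following. This is an assumption-laden sketch: the PR body does not show the actual check, and `isPathLikeSessionId` and `transcriptFileName` are hypothetical names used only for illustration.

```typescript
// Hypothetical guard: a session id is only ever used as a single file name,
// so anything that could change directories is rejected outright.
function isPathLikeSessionId(sessionId: string): boolean {
  return (
    sessionId.length === 0 ||
    sessionId === "." ||
    sessionId === ".." ||
    sessionId.includes("/") ||
    sessionId.includes("\\") ||
    sessionId.includes("\0")
  );
}

// Returns the transcript file name for a safe id, or undefined so the caller
// can treat the transcript as absent instead of reading outside the tree.
function transcriptFileName(sessionId: string): string | undefined {
  return isPathLikeSessionId(sessionId) ? undefined : `${sessionId}.jsonl`;
}

console.log(transcriptFileName("5918a618-513b-40cf-bbb7-24b64d1b6aef"));
console.log(transcriptFileName("../../etc/passwd"));
```

Returning undefined rather than throwing keeps the probe's contract simple in this sketch: an unsafe id behaves like a missing transcript.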

pnpm build && pnpm check are clean locally. pnpm vitest run src/agents/cli-runner src/agents/cli-runner.ts src/agents/cli-runner.binding-flush.test.ts src/agents/cli-runner.test.ts src/agents/command/attempt-execution.test.ts src/agents/command/session-store.test.ts → 32 files / 302 tests pass.

cli-runner.reliability.test.ts is failing on main (Cannot find module '@mariozechner/pi-ai/oauth') — verified pre-existing on origin/main without my changes — so per CONTRIBUTING I'm not chasing that.

Real Behavior Proof

Behavior or issue addressed: Two distinct bugs in the claude-cli resume path on a live OpenClaw gateway. (1) Missing-transcript ghost binding (Fix 1). Concurrent / fingerprint-mismatched turns kill the claude-cli subprocess mid-flight; buildCliRunResult persists the unflushed sessionId; the next turn's invalidator correctly rejects it as missing-transcript and resets the session — user sees amnesia. (2) Orphan-tool stuck resume (Fix 2). Gateway restart mid-tool leaves the JSONL transcript ending in an unanswered assistant tool_use; every subsequent claude --resume hangs waiting for the missing tool_result, hits OC's 180s no-output watchdog, gets killed with reason=abort, and the dispatcher emits NO_REPLY — user sees the agent silently ignoring messages.

Real environment tested: M5 (Apple Silicon, macOS Darwin 25.4.0, Node v26.0.0); OpenClaw 2026.5.7 (homebrew install at /opt/homebrew/lib/node_modules/openclaw/); live gateway service (ai.openclaw.gateway LaunchAgent, port 18789); claude-cli backend via Anthropic Bedrock (CLAUDE_CODE_USE_BEDROCK=1, model claude-opus-4-7); 3d-engineer agent on Telegram (chat -1003753238419, topic 3). Caveat: the after-fix evidence below is from a structurally-equivalent dist-side carry-patch (same predicates, same retry schedule [0, 50, 150] ms, same call sites primary and failover, same isClaudeCliProvider gate, same agentMeta.sessionId clear, same orphan-tool walk algorithm) applied to the running gateway, not from a pnpm build of this PR's branch. I attempted a brew-dist swap from the branch build but the branch-built dist references workspace-internal npm packages (@earendil-works/pi-ai) that aren't in the brew install's node_modules; the gateway refused to boot. I rolled back. A proper branch-build proof requires running gateway from the source clone with its own auth profile, which I'm tracking as a follow-up. Maintainer is welcome to apply proof: override if the structural-equivalence argument is acceptable, or to defer merge until the source-build window is captured.

Exact steps or command run after the patch: Fix 1 was verified by counting missing-transcript reset events in the live gateway log across pre- and post-patch windows. Fix 2 was verified by applying the dist-side patch to a gateway whose 3d-engineer session was already stuck on a trailing tool_use(Bash) (binding 5918a618-... still in ~/.openclaw/agents/3d-engineer/sessions/sessions.json), restarting the gateway, sending a real Telegram message via the iOS Telegram app to 3d-engineer, and watching ~/.openclaw/logs/gateway.log for the claude live session turn and [telegram] sendMessage ok lines:

$ grep "cli session reset: provider=claude-cli reason=missing-transcript" \
    ~/.openclaw/logs/gateway.log \
    | grep -E "2026-05-(05|06|07|08|09|10|11|12)" | wc -l   # Fix 1 pre-patch window
$ grep "cli session reset: provider=claude-cli reason=missing-transcript" \
    ~/.openclaw/logs/gateway.log \
    | grep -E "2026-05-(13|14)" | wc -l                       # Fix 1 post-patch window
$ /opt/homebrew/bin/python3 ~/.local/bin/openclaw-orphan-tool-use-patch.py
$ launchctl kickstart -k "gui/$(id -u)/ai.openclaw.gateway"
$ awk '/^2026-05-14T12:2[0-3]/' ~/.openclaw/logs/gateway.log | grep -E "3d-engineer|claude live|telegram"

Between the kickstart and the awk grep, I sent a real Telegram message to 3d-engineer from the iOS Telegram app to verify the recovery path end-to-end.

Evidence after fix: Live runtime log output from the production gateway, captured by terminal grep/awk on ~/.openclaw/logs/gateway.log — see fenced blocks below.

Fix 1 — pre-patch (May 5–12, 8 days, no patch applied) returned 24 missing-transcript reset events; post-patch (May 13–14, 2 days, dist-side patch applied) returned 0:

24
0

Fix 1 sample raw pre-patch event lines:

2026-05-09T18:55:09.246-04:00 [agent/cli-backend] cli session reset: provider=claude-cli reason=missing-transcript
2026-05-09T19:09:15.007-04:00 [agent/cli-backend] cli session reset: provider=claude-cli reason=missing-transcript
2026-05-09T20:54:14.602-04:00 [agent/cli-backend] cli session reset: provider=claude-cli reason=missing-transcript

Fix 2 stuck transcript on disk that proves the bug existed:

$ tail -1 ~/.claude/projects/<encoded-cwd>/5918a618-513b-40cf-bbb7-24b64d1b6aef.jsonl
{"parentUuid":"7071f993-...","isSidechain":false,"message":{"model":"claude-opus-4-6","id":"msg_bdrk_01QdQnfErtmZkAmcJA9p5xCS","type":"message","role":"assistant","content":[{"type":"tool_use","id":"toolu_bdrk_01Cyhp6vtsDwvEysvorDighT","name":"Bash","input":{"command":"blende...

(Last message is a trailing tool_use(Bash) content block; no tool_result user message after it — the orphan.) Fix 2 pre-patch live-gateway repro of the silent-abort cycle (4 events on May 14 before the patch landed at 12:07):

2026-05-14T10:34:18.084-04:00 [agent/cli-backend] claude live session close: provider=claude-cli model=claude-opus-4-7 reason=abort
2026-05-14T10:34:18.090-04:00 [silent-reply/dispatcher] exact NO_REPLY final payload was skipped before delivery
2026-05-14T10:34:18.090-04:00 [diagnostic] message processed: channel=telegram chatId=telegram:-1003753238419 messageId=1787 sessionKey=agent:3d-engineer:telegram:group:-1003753238419:topic:3 outcome=completed duration=425224ms
2026-05-14T10:38:41.175-04:00 [agent/cli-backend] claude live session close: provider=claude-cli model=claude-opus-4-7 reason=abort
2026-05-14T10:58:27.727-04:00 [agent/cli-backend] claude live session close: provider=claude-cli model=claude-opus-4-7 reason=abort
2026-05-14T11:19:41.706-04:00 [agent/cli-backend] claude live session close: provider=claude-cli model=claude-opus-4-7 reason=abort
2026-05-14T11:19:41.715-04:00 [model-fallback/decision] decision=candidate_failed requested=claude-cli/claude-opus-4-7 reason=timeout next=none detail=CLI produced no output for 180s and was terminated.
2026-05-14T11:19:41.718-04:00 [silent-reply/dispatcher] exact NO_REPLY final payload was skipped before delivery
2026-05-14T11:19:42.839-04:00 [telegram] turn ended without visible final response

Fix 2 post-patch live-gateway recovery (real Telegram message sent after the patch went live at 12:07; same stuck binding 5918a618-... was still in the session-store at the time):

2026-05-14T12:22:00.064-04:00 [agent/cli-backend] claude live session turn: provider=claude-cli model=claude-opus-4-7 durationMs=719440 rawLines=13837
2026-05-14T12:22:01.770-04:00 [diagnostic] message processed: channel=telegram chatId=telegram:-1003753238419 messageId=1801 sessionKey=agent:3d-engineer:telegram:group:-1003753238419:topic:3 outcome=completed duration=721507ms
2026-05-14T12:22:02.151-04:00 [telegram] sendMessage ok chat=-1003753238419 message=1804
2026-05-14T12:23:50.732-04:00 [agent/cli-backend] cli exec: provider=claude-cli model=opus promptChars=929 trigger=user useResume=true session=present resumeSession=3553b3a8c7af reuse=reusable historyPrompt=none

Observed result after fix: Fix 1 — zero missing-transcript reset events across the 2-day post-patch window on the live gateway, vs. 24 events / ~3 per day pre-patch (pre-patch baseline rate gives ~6 expected events for a 2-day window; observed 0). Fix 2 — the 3d-engineer agent had been stuck since 10:34 (every Telegram message hit the same orphan, aborted at the 180s no-output mark, produced silent NO_REPLY — 4 documented events). 15 minutes after the dist-side patch went live (12:07), the next Telegram message (12:22) triggered the new orphan-tool-use invalidator, the binding was dropped, a fresh session started, and the agent produced a 12-minute / 13,837-line streaming reply that was successfully delivered to Telegram (sendMessage ok message=1804). The follow-up message a minute later resumed the new session cleanly (useResume=true reuse=reusable). Same agent, same stuck session-store entry, recovered the moment the patch went live.

What was not tested:

  • Non-claude-cli providers (codex-cli, openai). The fixes are gated on isClaudeCliProvider(params.provider) and only probe ~/.claude/projects/, so I have high confidence they don't change behavior for other providers, but I have not exercised those paths manually.
  • Fix 1's failover-retry branch under a forced live-session restart. I couldn't reliably reproduce the failover code path in my setup; the patched failover branch calls the same isCliBindingFlushed helper and threads bindingFlushOk through buildCliRunResult identically to the primary path.
  • Sidechain transcripts with trailing orphans. Fixture coverage exists (synthetic sidechain-trailing and main-orphan-with-sidechain cases), but my actual stuck transcript happens to have zero sidechain entries, so the sidechain skip is fixture-validated only, not live-validated.
  • A pnpm build of this branch on the live gateway. I attempted it via brew-dist swap, hit ERR_MODULE_NOT_FOUND for @earendil-works/pi-ai, and rolled back; branch-build proof is the outstanding follow-up.

Notes for reviewers

  • Why gate at the binding-write site rather than at the live-session-kill site? The binding write is the single funnel through which session ids reach next-turn state. Gating there catches all causes of unflushed sids (mid-turn kill, claude-cli internal error, OOM). Gating at every kill site would be brittle.
  • Why detect the orphan in prepare.ts rather than recover by injecting a synthetic tool_result? Injection would be a behavior change to the conversation contents — claude-cli would see a tool_result it never actually produced, potentially leading the model down false branches. Invalidating the resume and falling through to a fresh session preserves the model's view of the world; the user keeps the conversation going on a clean session, with the prior context reseeded via RAW_TRANSCRIPT_RESEED.
  • Coordination with #81048 (fix(command): retry claude-cli transcript probe to close flush race). That PR adds a workspaceDir parameter to claudeCliSessionTranscriptHasContent and changes the lookup from "scan all project dirs" to "deterministic single-dir under <homeDir>/.claude/projects/<encoded(workspaceDir)>". My new claudeCliSessionTranscriptHasOrphanedToolUse follows the same shape as the current claudeCliSessionTranscriptHasContent (multi-project scan), and would need an identical workspaceDir thread-through if #81048 lands first. Happy to rebase in either direction depending on what the maintainer team prefers.
  • Why not serialize per-session-key turns? That's the more invasive correctness fix — folds heartbeats, channel auto-replies, and rapid follow-ups into a single serialized stream per session. Plausibly worth doing but it's a bigger surface area and changes turn-ordering semantics. This PR is the minimal coordinated fix for the resume-validity predicate path.
  • Performance. Fix 1's gate calls the same probe the next-turn invalidator already runs (one fs.readdir on ~/.claude/projects/ then up to N JSONL existence-and-content checks); the additional cost is up to 200ms of scheduled setTimeout delay across the bounded retry, only when the transcript hasn't shown up yet. Fix 2's gate adds at most one additional JSONL walk (same I/O pattern) when the content gate already passed; on healthy resumes the JSONL is fully resident in OS page cache from the prior probe so this is essentially a re-walk-in-RAM. I did not microbenchmark.
  • Local carry pattern. I'm carrying both fixes via a Python patch script that re-applies a structurally-equivalent dist-side patch after brew upgrade openclaw. The script auto-detects upstream-fix landing and becomes a no-op once this merges.

AI-assistance

This PR was authored with Claude (Opus + claude-cli). I understand what the code does — root-caused both bugs via live instrumentation of the gateway log on my own M5 install:

  1. Fix 1: caught via repeated cli session reset: reason=missing-transcript events correlating with rapid follow-ups; traced the ghost-id origin to buildCliRunResult in the dist bundle.
  2. Fix 2: caught via repeated silent-reply/dispatcher: exact NO_REPLY final payload was skipped events on a single 3d-engineer session; correlated with claude live session close: reason=abort and CLI produced no output for 180s; manually tail'd the session JSONL to find the trailing tool_use(Bash) block with no tool_result after it.

I ran codex review --base origin/main locally before opening this PR. Codex's first pass on the binding-flush-only diff caught two real bugs: (1) the session-store fallback path at command/session-store.ts:166-170 re-persisted agentMeta.sessionId via setCliSessionId(...), defeating the binding-only gate; (2) the probe ran for every CLI provider, not just claude-cli, which would have stripped valid binding metadata for codex/openai sessions. Both are now fixed in this commit and reflected in the test suite. Re-ran codex review against the updated diff with the orphan-tool fix added before pushing this update.

Session logs are local; happy to share specific excerpts if useful.

@openclaw-barnacle openclaw-barnacle Bot added the labels agents (Agent runtime and tooling), size: S, and triage: needs-real-behavior-proof (Candidate: external PR needs after-fix proof from a real setup.) on May 14, 2026
@clawsweeper

clawsweeper Bot commented May 14, 2026

Contributor

Codex review: needs real behavior proof before merge.

Summary
The branch gates Claude CLI session binding persistence on transcript flush, invalidates trailing orphaned tool_use resumes, reseeds context after invalidation, adds fingerprint mismatch diagnostics, and adds focused tests.

Reproducibility: no. No live current-main reproduction was run in this read-only review. The failure is source-reproducible because current main persists effectiveCliSessionId without a write-side transcript gate, can re-persist the bare session id fallback, and only checks for any assistant message before resume.

Real behavior proof
Needs stronger real behavior proof before merge: The PR body includes live logs from a structurally equivalent dist-side carry-patch, but not a branch-built or current-head run after later reseed/diagnostic commits; the contributor should add redacted current-head logs, terminal output, recording, or linked artifacts and update the PR body to trigger re-review.

Next step before merge
Human review is needed because the PR is draft, overlaps another open transcript-probe fix, and needs current-head real behavior proof rather than an automated repair PR.

Security
Cleared: The diff changes agent session predicates, diagnostics, and tests only; I found no dependency, CI, credential, artifact download, publishing, or permission-surface change.

Review details

Best possible solution:

Land one coordinated Claude CLI resume-validity fix after resolving the overlap with #81048 and requiring current-head proof for flush gating, orphan invalidation, and reseed recovery.

Do we have a high-confidence way to reproduce the issue?

No live current-main reproduction was run in this read-only review. The failure is source-reproducible because current main persists effectiveCliSessionId without a write-side transcript gate, can re-persist the bare session id fallback, and only checks for any assistant message before resume.

Is this the best way to solve the issue?

Mostly yes: the proposed shape targets the Claude CLI resume-validity path, keeps non-Claude providers out of the transcript probe, and closes both the binding and fallback-id paths. The safer merge path is to coordinate with the overlapping transcript-probe PR and add current-head real behavior proof before landing.

What I checked:

  • Current main persists CLI bindings without a flush gate: On current main, buildCliRunResult writes agentMeta.sessionId and cliSessionBinding from effectiveCliSessionId without checking that the Claude CLI transcript has flushed assistant content first. (src/agents/cli-runner.ts:367, 926bf66ee336)
  • Current session-store fallback can re-persist the bare id: When cliSessionBinding is absent, updateSessionStoreAfterAgentRun falls back to agentMeta.sessionId and calls setCliSessionId, so a binding-only gate would not fully remove a ghost CLI session id. (src/agents/command/session-store.ts:162, 926bf66ee336)
  • Current main only checks for any assistant transcript content: prepareCliRunContext invalidates missing Claude CLI transcripts, and claudeCliSessionTranscriptHasContent returns true once any assistant message is found; current main has no trailing orphaned tool_use predicate. (src/agents/cli-runner/prepare.ts:279, 926bf66ee336)
  • PR adds write-side flush gating: The PR patch adds isCliBindingFlushed, skips probing for non-Claude providers, omits cliSessionBinding on a failed Claude transcript probe, and clears agentMeta.sessionId to avoid the fallback path. (src/agents/cli-runner.ts:23, 3a6de1b62ab9)
  • PR adds trailing orphaned tool_use invalidation: The PR patch walks Claude JSONL transcripts, skips sidechain entries, tracks the latest assistant tool_use ids, and adds orphaned-tool-use as a CLI session invalidation reason. (src/agents/command/attempt-execution.helpers.ts:105, 3a6de1b62ab9)
  • PR adds context reseed after invalidation: The latest patch reads the invalidated Claude CLI transcript through buildClaudeCliFallbackContextPrelude and prepends a retry prelude before starting a fresh session. (src/agents/cli-runner/prepare.ts:438, 3a6de1b62ab9)

Likely related people:

  • Ayaan Zaidi: Current-main blame for the CLI runner result metadata, cliSessionBinding write path, transcript helper, session-store fallback, and live-session fingerprint code points to commit 02f2e08. (role: recent area contributor; confidence: medium; commits: 02f2e08493f4; files: src/agents/cli-runner.ts, src/agents/command/attempt-execution.helpers.ts, src/agents/command/session-store.ts)
  • steipete: git log for the central CLI runner files shows repeated recent work and earlier CLI runner session/reuse refactors by Peter Steinberger, including the command cron and CLI runner pipeline history. (role: major refactor and recent adjacent owner; confidence: medium; commits: 686b93e5c710, 48ae97633303, 3f54076d3736; files: src/agents/cli-runner.ts, src/agents/cli-runner/prepare.ts, src/agents/command/attempt-execution.helpers.ts)

Remaining risk / open question:

  • The current PR head still lacks branch-built or current-head live proof for the latest reseed and diagnostic commits.
  • The branch overlaps an open transcript-probe PR, so landing both independently could create conflicting retry/logging behavior in the same helper.
  • This read-only review did not execute the proposed tests or a live Claude CLI resume scenario.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 926bf66ee336.


…flush

When a claude-cli turn produces a session id but the underlying claude
subprocess fails to flush an assistant-role record to its
~/.claude/projects/<cwd>/<sid>.jsonl transcript (e.g. mid-turn kill from
a concurrent fingerprint-mismatched turn, supervisor restart, internal
failure), buildCliRunResult was still persisting that session id into
cliSessionBinding. The next turn ran claudeCliSessionTranscriptHasContent,
didn't find the file, logged 'cli session reset: reason=missing-transcript',
and started a brand-new claude session with empty memory.

End-user symptom: agent forgets prior conversation between turns.

Gate the cliSessionBinding spread on the same predicate the next-turn
invalidator uses, evaluated at write time. Also clear agentMeta.sessionId
in the same case so the session-store fallback at command/session-store.ts
(which reads agentMeta.sessionId via setCliSessionId when the binding is
absent) doesn't re-persist the unflushed sid through a different field
path. The fallback is what makes the binding-only gate insufficient on
its own; both writes must drop together.

The gate only fires for claude-cli providers — other CLI providers don't
write to ~/.claude/projects, so probing them would always return false
and incorrectly strip valid binding metadata. isCliBindingFlushed now
takes the provider id and returns true unconditionally for non-claude-cli
sessions.

A bounded retry (0 / 50 / 150 ms) tolerates the brief gap between
claude-cli's stdio close and the OS making the JSONL line visible to
readers (cooperative fsync semantics on APFS, but not guaranteed under
stress).

The transcript-probe is exposed as an injectable dep
(setCliRunnerTestDeps / restoreCliRunnerTestDeps) mirroring the existing
pattern in src/agents/cli-runner/prepare.ts so isCliBindingFlushed is
testable without touching ~/.claude/projects.

AI-assisted: yes. Tooling: Claude Opus + claude-cli. Codex review caught
the fallback path and the missing provider gate before this hit upstream.
Real-Behavior-Proof: dist-side patch on M5 gateway; branch-build
follow-up pending — see PR body.
…-tool

A claude-cli session whose JSONL transcript ends with an assistant
`tool_use` content block that was never answered by a `tool_result` user
message cannot resume — claude-cli will sit waiting for the missing
`tool_result`, hit its no-output watchdog, and the runtime kills it
with `reason=abort`. The dispatcher then sees an empty payload and emits
NO_REPLY, which to the user looks like the agent silently ignored their
message — same end-user symptom as the binding-flush amnesia bug, but a
different root cause.

The orphan can be left behind when:
  - Gateway restarts mid-tool (brew upgrade, manual kickstart, OOM,
    crash) — claude was waiting on a tool result that never arrived.
  - `claude-live-session.ts` no-output watchdog fires while a tool is
    actively running and OC kills the subprocess.
  - The tool itself crashed or hung past its own deadline.

In all cases the resumed session is dead until the binding gets cleared,
because every subsequent resume hits the same trailing tool_use and the
same kill cycle. Observed in production on a personal OpenClaw gateway
(3d-engineer agent, 50-message-deep transcript ending in a Bash
`tool_use`; every Telegram message after the orphan landed silently
aborted at the 180s no-output mark).

Add `claudeCliSessionTranscriptHasOrphanedToolUse` to the helpers that
walks the JSONL, finds the last assistant message, and returns true if
any of its `tool_use` ids has no matching `tool_result` later in the
file. Wire into `prepareCliRunContext` as a second invalidator gate
alongside `missing-transcript`. The new `invalidatedReason:
"orphaned-tool-use"` follows the same path as missing-transcript: the
binding is dropped, this turn starts a fresh session, and the prior
context is reseeded into the new session via `RAW_TRANSCRIPT_RESEED`.

Detection only considers TRAILING orphans — an unanswered tool_use
deeper in history is inert because a later assistant message already
moved past it. Only the most recent assistant message's tool_use ids
matter for forward progress.
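
The trailing-orphan rule above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: the record shape (`type`, `isSidechain`, `message.content`) is a simplified guess at the transcript schema, and `hasTrailingOrphanedToolUse` is a hypothetical name that takes the JSONL text directly instead of reading it from disk.

```typescript
// Assumed, simplified content-block and record shapes.
type ContentBlock =
  | { type: "tool_use"; id: string }
  | { type: "tool_result"; tool_use_id: string }
  | { type: "text"; text: string };

interface TranscriptRecord {
  type: "assistant" | "user";
  isSidechain?: boolean;
  message: { content: ContentBlock[] };
}

// Parse JSONL, skipping blank or malformed lines and sidechain records.
function parseRecords(jsonl: string): TranscriptRecord[] {
  const records: TranscriptRecord[] = [];
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    try {
      const parsed = JSON.parse(line) as Partial<TranscriptRecord>;
      if (
        (parsed.type === "assistant" || parsed.type === "user") &&
        Array.isArray(parsed.message?.content) &&
        parsed.isSidechain !== true
      ) {
        records.push(parsed as TranscriptRecord);
      }
    } catch {
      // A malformed line is ignored rather than failing the whole probe.
    }
  }
  return records;
}

function hasTrailingOrphanedToolUse(jsonl: string): boolean {
  const records = parseRecords(jsonl);
  const lastIndex = records.map((r) => r.type).lastIndexOf("assistant");
  const lastAssistant = records[lastIndex];
  if (!lastAssistant) return false;

  // Collect the tool_use ids of the most recent assistant message only.
  const pending = new Set<string>();
  for (const block of lastAssistant.message.content) {
    if (block.type === "tool_use") pending.add(block.id);
  }
  // Any tool_result later in the file answers the matching id.
  for (const record of records.slice(lastIndex + 1)) {
    for (const block of record.message.content) {
      if (block.type === "tool_result") pending.delete(block.tool_use_id);
    }
  }
  return pending.size > 0;
}
```

Because only the last assistant record is inspected, an unanswered `tool_use` earlier in the file does not trigger the probe in this sketch.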

Probe runs only for claude-cli providers and only when the transcript-
content gate already passed, so we add no I/O on already-invalidated
sessions and no behavior change for non-claude providers.
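
The gate ordering in the paragraph above can be sketched as below. `resolveInvalidatedReason` and the `TranscriptProbes` interface are hypothetical names for illustration; only the reason strings and the `claude-cli` provider id come from this PR.

```typescript
type InvalidatedReason = "missing-transcript" | "orphaned-tool-use";

// Hypothetical probe pair; in the PR these correspond to the transcript-content
// check and the new orphaned-tool-use check.
interface TranscriptProbes {
  hasContent: (sessionId: string) => boolean;
  hasOrphanedToolUse: (sessionId: string) => boolean;
}

function resolveInvalidatedReason(
  provider: string,
  sessionId: string,
  probes: TranscriptProbes,
): InvalidatedReason | undefined {
  // Non-claude providers are left untouched: no probes run at all.
  if (provider !== "claude-cli") return undefined;
  // The content gate runs first; a failure short-circuits the orphan probe.
  if (!probes.hasContent(sessionId)) return "missing-transcript";
  // Only a transcript that passed the content gate pays for the second walk.
  if (probes.hasOrphanedToolUse(sessionId)) return "orphaned-tool-use";
  return undefined;
}
```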

AI-assisted: yes. Tooling: Claude Opus + claude-cli.
@adele-with-a-b adele-with-a-b force-pushed the fix/cli-backend-binding-flush-gate branch from deed605 to dfa2617 Compare May 14, 2026 16:37
@adele-with-a-b

Contributor Author

@clawsweeper re-review — added orphan-tool-use invalidator, addressed all prior Codex findings (sidechain skip, no record cap), and now have real-behavior proof from a live gateway (see updated PR body)

@openclaw-barnacle openclaw-barnacle Bot added proof: supplied External PR includes structured after-fix real behavior proof. and removed triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 14, 2026
@adele-with-a-b

Contributor Author

Updated PR body to match the Real Behavior Proof schema (behavior, environment, steps, evidence, observedResult, notTested). Proof-gate is now passing. @clawsweeper re-review please.

@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 14, 2026
…ession restart

Prior to this change, the only signal that the claude-cli live session
was being torn down and re-spawned was a `claude live session close:
reason=restart` log line — which doesn't say WHY the fingerprint
changed. Restarts mid-tool are a primary cause of orphaned tool_use
entries in the JSONL transcript (the silent-abort cycle this PR's
recovery path handles), so it matters to understand when and why the
fingerprint differs across consecutive turns.

When fingerprint comparison fails, JSON-parse both fingerprints, list
the top-level keys whose serialized values differ, and emit them at
`info` level. Defensive fallback: emit `<unparseable>` rather than
crashing. No behavior change beyond the new log line; no values are
exposed (we only print key names).
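
A rough sketch of the key-only diff described above, assuming each fingerprint is a JSON-serialized object. `diffFingerprintKeys` and `parseObject` are hypothetical names; the actual code in `claude-live-session.ts` is not shown in this thread.

```typescript
// Parse a fingerprint string into a plain object, or undefined if it is not one.
function parseObject(serialized: string): Record<string, unknown> | undefined {
  try {
    const value: unknown = JSON.parse(serialized);
    if (typeof value === "object" && value !== null && !Array.isArray(value)) {
      return value as Record<string, unknown>;
    }
  } catch {
    // Fall through to the defensive fallback below.
  }
  return undefined;
}

// Return the top-level key names whose serialized values differ.
// Values are never returned, so nothing sensitive reaches the log line.
function diffFingerprintKeys(previous: string, next: string): string[] {
  const a = parseObject(previous);
  const b = parseObject(next);
  if (!a || !b) return ["<unparseable>"];
  const keys = Array.from(new Set([...Object.keys(a), ...Object.keys(b)]));
  return keys.filter((key) => JSON.stringify(a[key]) !== JSON.stringify(b[key])).sort();
}
```

The result can be joined into a single log field, for example `keys=${diffFingerprintKeys(prev, next).join(",")}`.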

Likely candidates for unstable fingerprint dimensions visible from
production gateway logs: systemPrompt (per-message hook context being
accidentally folded into it), env (timestamps or per-turn ids), skills
(mid-session skill set rebuilds). The diagnostic shows which of these
actually differs, with no further code changes needed, so follow-up
fixes can be targeted.

AI-assisted: yes. Tooling: Claude Opus + claude-cli.
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 14, 2026
…riven session reset

When `prepareCliRunContext` invalidates a claude-cli session for ANY
reason — missing-transcript, orphaned-tool-use, system-prompt, mcp,
auth-profile, auth-epoch — the runtime starts a fresh claude-cli
session and the agent loses memory of every prior turn. The OC-side
history reseed (`buildCliSessionHistoryPrompt` /
`loadCliSessionReseedMessages`) only fires on a pre-existing compaction
summary, which agents that haven't run long enough to compact don't
have. The user-visible result: the silent-abort cycle is replaced by
amnesia. Same end-user pain, different shape.

The dead claude-cli transcript still has the full conversation on disk
(it's what we just walked to find the orphan, or what the
missing-transcript / system-prompt / mcp paths were just told to
discard). Read it, build a `priorContextPrelude` from the main-
conversation messages (sidechain already filtered out by the existing
`parseClaudeCliHistoryEntry`), and prepend via the same
`resolveFallbackRetryPrompt` shape the fallback-to-different-provider
path already uses. The fresh session inherits the conversation; the
recovery becomes invisible to the user beyond a one-turn delay.
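
As an illustration of the read-and-prepend step above, a minimal sketch follows. The message shape, the prelude wording, and both function names are assumptions; the shape that `resolveFallbackRetryPrompt` actually produces is not shown in this thread.

```typescript
// Hypothetical message shape, after sidechain records have been filtered out.
interface HistoryMessage {
  role: "user" | "assistant";
  text: string;
}

// Build a plain-text prelude from the dead session's main-conversation messages.
function buildPriorContextPrelude(messages: HistoryMessage[]): string | undefined {
  const lines = messages
    .filter((m) => m.text.trim() !== "")
    .map((m) => `${m.role}: ${m.text.trim()}`);
  if (lines.length === 0) return undefined;
  return ["Prior conversation (recovered from the previous session):", ...lines].join("\n");
}

// Prepend the prelude to the new turn's prompt; no prelude means no change.
function prependPrelude(prompt: string, prelude: string | undefined): string {
  return prelude === undefined ? prompt : `${prelude}\n\n${prompt}`;
}
```

Returning `undefined` for an empty history keeps the reusable-session and empty-transcript cases identical to the prompt that would have been sent anyway.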

Note: `session-expired` retains the sessionId, so it doesn't apply
here — it's already covered by the existing `rawTranscriptReseedReason`
path that runs `buildCliSessionHistoryPrompt`.

Surfaces `buildClaudeCliFallbackContextPrelude` through `prepareDeps`
so tests can stub the on-disk reader without seeding a real
`~/.claude/projects/<encoded-cwd>/<sid>.jsonl`. Tests cover:
  - reseed fires for `missing-transcript` invalidation
  - reseed fires for `orphaned-tool-use` invalidation
  - reseed fires for `system-prompt` invalidation
  - reseed does NOT fire when the session is reusable
  - reseed does NOT fire for non-claude-cli providers

Note: this only mitigates the user-visible amnesia. The upstream cause
of the orphan creation (live-session manager tearing down claude-cli
mid-tool on fingerprint mismatch) and the system-prompt-drift cause
(extraSystemPromptHash changing across turns despite the static-only
hashing intent) are still tracked separately — see the diagnostic in
`claude-live-session.ts` from the prior commit on this branch.

AI-assisted: yes. Tooling: Claude Opus + claude-cli.
@adele-with-a-b adele-with-a-b force-pushed the fix/cli-backend-binding-flush-gate branch from 171d549 to 3a6de1b Compare May 14, 2026 21:16
@adele-with-a-b

Contributor Author

Pausing to gather more evidence before asking for re-review.

What's solid:

  • Binding-flush gate (80cad2240a) — well-tested
  • Orphan-tool-use invalidator (dfa2617117) — verified working in production on my M5 gateway (caught and recovered the exact stuck-tool-use scenario it was designed for)

What's open:

  • The recovery-prelude reseed (3a6de1b62a) has unit-test coverage but I have not verified it preserves session memory in production end-to-end.
  • I'm seeing a separate cli session reset: provider=claude-cli reason=system-prompt event between consecutive turns on the same agent that I haven't root-caused — extraSystemPromptHash flips between turn N and N+1 despite the static-only hashing intent. The reseed branch covers this case via the broadened invalidation gate, but without root-causing the underlying drift I can't claim the fix is complete vs. masking.

Leaving the PR as draft while I run a longer observation window. Diff is here for review of the parts that are solid; happy to split into a smaller PR if maintainers prefer to land just the binding-flush + orphan-tool pieces while the reseed gets more validation.

@adele-with-a-b

Contributor Author

Update on the open question from my prior comment:

The cli session reset: provider=claude-cli reason=system-prompt path I hadn't root-caused turned out to be skills-fingerprint instability in my local environment. Several agents had skill symlinks pointing OUTSIDE their workspace boundary (an agents/<agent>/skills/<name> -> /repo-root/skills/<name> shape), which OC's symlink-escape security check correctly rejects. The rejection itself was non-deterministic between turns, flipping the skillsFingerprint (and downstream systemPromptHash) and triggering live-session restarts. The diagnostic I added in claude-live-session.ts caught one in production: claude live session fingerprint mismatch: keys=skillsFingerprint. Removing the bad symlinks stops the resets at their source.

That's a config issue specific to my setup, not an OC bug — but it does narrow the scope of evidence for this PR. To be honest about what's left:

  • Binding-flush gate (80cad2240a) — solid, well-tested. Stops a real ghost-binding write that exists regardless of skill config. Concurrent fingerprint-mismatched turns and supervisor restarts mid-turn can still produce unflushed sids in any environment.
  • Orphan-tool-use invalidator (dfa2617117) — solid, well-tested. Recovers sessions whose transcript ends mid-tool. Real failure mode whenever the gateway dies mid-tool (brew upgrade, OOM, crash, manual kickstart) — not config-specific, but lower frequency in a clean setup than my symlink-heavy one made it look.
  • Recovery-prelude reseed (3a6de1b62a) — covers the user-visible-amnesia tail of the above. Has unit-test coverage; I have not verified end-to-end in production.
  • Fingerprint diagnostic (8b8e82584d) — did its job for me; happy to remove before merge or keep as production telemetry per maintainer preference.

Happy to split the PR into a smaller "binding-flush + orphan-tool" pair (the two fixes that are solidly proven) and pursue the reseed separately once I have evidence from the cleaner environment. Will leave the PR as draft until I hear what shape the maintainer team wants.

@adele-with-a-b

Contributor Author

Closing this in favor of the split-out PR.

The two solidly-proven commits (binding-flush gate + orphan-tool-use invalidator) are now in #84234, rebased on current upstream/main and reconciled with the recent finalizeCliContextEngineTurn and promptToolNamesHash changes.

Dropped from this PR (and from the new split):

  • Recovery-prelude reseed (3a6de1b62a) — has unit-test coverage but I have not verified it end-to-end in production after cleaning up the local symlink-heavy environment that originally surfaced the failure mode. Will refile when real-runtime evidence is available.
  • Fingerprint-mismatch diagnostic (8b8e82584d) — did its job locally; will refile separately if maintainers want it as production telemetry.

Closing per maintainer-options conversation in the prior review thread (option: "split into binding-flush + orphan-tool, pursue reseed separately"). New PR: #84234.


Labels

agents Agent runtime and tooling proof: supplied External PR includes structured after-fix real behavior proof. size: XL


1 participant