fix(agents/cli-runner): gate cliSessionBinding persist on transcript flush#81821
fix(agents/cli-runner): gate cliSessionBinding persist on transcript flush#81821adele-with-a-b wants to merge 4 commits into
Conversation
|
Codex review: needs real behavior proof before merge. Summary Reproducibility: no. live current-main reproduction was run in this read-only review. The failure is source-reproducible because current main persists effectiveCliSessionId without a write-side transcript gate, can re-persist the bare session id fallback, and only checks for any assistant message before resume. Real behavior proof Next step before merge Security Review detailsBest possible solution: Land one coordinated Claude CLI resume-validity fix after resolving the overlap with #81048 and requiring current-head proof for flush gating, orphan invalidation, and reseed recovery. Do we have a high-confidence way to reproduce the issue? No live current-main reproduction was run in this read-only review. The failure is source-reproducible because current main persists effectiveCliSessionId without a write-side transcript gate, can re-persist the bare session id fallback, and only checks for any assistant message before resume. Is this the best way to solve the issue? Mostly yes: the proposed shape targets the Claude CLI resume-validity path, keeps non-Claude providers out of the transcript probe, and closes both the binding and fallback-id paths. The safer merge path is to coordinate with the overlapping transcript-probe PR and add current-head real behavior proof before landing. What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 926bf66ee336. Re-review progress:
|
…flush When a claude-cli turn produces a session id but the underlying claude subprocess fails to flush an assistant-role record to its ~/.claude/projects/<cwd>/<sid>.jsonl transcript (e.g. mid-turn kill from a concurrent fingerprint-mismatched turn, supervisor restart, internal failure), buildCliRunResult was still persisting that session id into cliSessionBinding. The next turn ran claudeCliSessionTranscriptHasContent, didn't find the file, logged 'cli session reset: reason=missing-transcript', and started a brand-new claude session with empty memory. End-user symptom: agent forgets prior conversation between turns. Gate the cliSessionBinding spread on the same predicate the next-turn invalidator uses, evaluated at write time. Also clear agentMeta.sessionId in the same case so the session-store fallback at command/session-store.ts (which reads agentMeta.sessionId via setCliSessionId when the binding is absent) doesn't re-persist the unflushed sid through a different field path. The fallback is what makes the binding-only gate insufficient on its own; both writes must drop together. The gate only fires for claude-cli providers — other CLI providers don't write to ~/.claude/projects, so probing them would always return false and incorrectly strip valid binding metadata. isCliBindingFlushed now takes the provider id and returns true unconditionally for non-claude-cli sessions. A bounded retry (0 / 50 / 150 ms) tolerates the brief gap between claude-cli's stdio close and the OS making the JSONL line visible to readers (cooperative fsync semantics on APFS, but not guaranteed under stress). The transcript-probe is exposed as an injectable dep (setCliRunnerTestDeps / restoreCliRunnerTestDeps) mirroring the existing pattern in src/agents/cli-runner/prepare.ts so isCliBindingFlushed is testable without touching ~/.claude/projects. AI-assisted: yes. Tooling: Claude Opus + claude-cli. Codex review caught the fallback path and the missing provider gate before this hit upstream. Real-Behavior-Proof: dist-side patch on M5 gateway; branch-build follow-up pending — see PR body.
…-tool
A claude-cli session whose JSONL transcript ends with an assistant
`tool_use` content block that was never answered by a `tool_result` user
message cannot resume — claude-cli will sit waiting for the missing
`tool_result`, hit its no-output watchdog, and the runtime kills it
with `reason=abort`. The dispatcher then sees an empty payload and emits
NO_REPLY, which to the user looks like the agent silently ignored their
message — same end-user symptom as the binding-flush amnesia bug, but a
different root cause.
The orphan can be left behind when:
- Gateway restarts mid-tool (brew upgrade, manual kickstart, OOM,
crash) — claude was waiting on a tool result that never arrived.
- `claude-live-session.ts` no-output watchdog fires while a tool is
actively running and OC kills the subprocess.
- The tool itself crashed or hung past its own deadline.
In all cases the resumed session is dead until the binding gets cleared,
because every subsequent resume hits the same trailing tool_use and the
same kill cycle. Observed in production on a personal OpenClaw gateway
(3d-engineer agent, 50-message-deep transcript ending in a Bash
`tool_use`; every Telegram message after the orphan landed silently
aborted at the 180s no-output mark).
Add `claudeCliSessionTranscriptHasOrphanedToolUse` to the helpers that
walks the JSONL, finds the last assistant message, and returns true if
any of its `tool_use` ids has no matching `tool_result` later in the
file. Wire into `prepareCliRunContext` as a second invalidator gate
alongside `missing-transcript`. The new `invalidatedReason:
"orphaned-tool-use"` follows the same path as missing-transcript: the
binding is dropped, this turn starts a fresh session, and the prior
context is reseeded into the new session via `RAW_TRANSCRIPT_RESEED`.
Detection only considers TRAILING orphans — an unanswered tool_use
deeper in history is inert because a later assistant message already
moved past it. Only the most recent assistant message's tool_use ids
matter for forward progress.
Probe runs only for claude-cli providers and only when the transcript-
content gate already passed, so we add no I/O on already-invalidated
sessions and no behavior change for non-claude providers.
AI-assisted: yes. Tooling: Claude Opus + claude-cli.
deed605 to
dfa2617
Compare
|
@clawsweeper re-review — added orphan-tool-use invalidator, addressed all prior Codex findings (sidechain skip, no record cap), and now have real-behavior proof from a live gateway (see updated PR body) |
|
Updated PR body to match the Real Behavior Proof schema ( |
…ession restart Prior to this change, the only signal that the claude-cli live session was being torn down and re-spawned was a `claude live session close: reason=restart` log line — which doesn't say WHY the fingerprint changed. Restarts mid-tool are a primary cause of orphaned tool_use entries in the JSONL transcript (the silent-abort cycle this PR's recovery path handles), so it matters to understand when and why the fingerprint differs across consecutive turns. When fingerprint comparison fails, JSON-parse both fingerprints, list the top-level keys whose serialized values differ, and emit them at `info` level. Defensive fallback: emits `<unparseable>` rather than crash. No behavior change beyond the new log line; no values are exposed (we only print key names). Likely candidates for unstable fingerprint dimensions visible from production gateway logs: systemPrompt (per-message hook context being accidentally folded into it), env (timestamps or per-turn ids), skills (mid-session skill set rebuilds). The diagnostic surfaces which without further code changes, allowing targeted follow-up fixes. AI-assisted: yes. Tooling: Claude Opus + claude-cli.
…riven session reset When `prepareCliRunContext` invalidates a claude-cli session for ANY reason — missing-transcript, orphaned-tool-use, system-prompt, mcp, auth-profile, auth-epoch — the runtime starts a fresh claude-cli session and the agent loses memory of every prior turn. The OC-side history reseed (`buildCliSessionHistoryPrompt` / `loadCliSessionReseedMessages`) only fires on a pre-existing compaction summary, which agents that haven't been long enough to compact don't have. The user-visible result: the silent-abort cycle is replaced by amnesia. Same end-user pain, different shape. The dead claude-cli transcript still has the full conversation on disk (it's what we just walked to find the orphan, or what the missing-transcript / system-prompt / mcp paths were just told to discard). Read it, build a `priorContextPrelude` from the main- conversation messages (sidechain already filtered out by the existing `parseClaudeCliHistoryEntry`), and prepend via the same `resolveFallbackRetryPrompt` shape the fallback-to-different-provider path already uses. The fresh session inherits the conversation; the recovery becomes invisible to the user beyond a one-turn delay. Note: `session-expired` retains the sessionId, so it doesn't apply here — it's already covered by the existing `rawTranscriptReseedReason` path that runs `buildCliSessionHistoryPrompt`. Surfaces `buildClaudeCliFallbackContextPrelude` through `prepareDeps` so tests can stub the on-disk reader without seeding a real `~/.claude/projects/<encoded-cwd>/<sid>.jsonl`. Tests cover: - reseed fires for `missing-transcript` invalidation - reseed fires for `orphaned-tool-use` invalidation - reseed fires for `system-prompt` invalidation - reseed does NOT fire when the session is reusable - reseed does NOT fire for non-claude-cli providers Note: this only mitigates the user-visible amnesia. The upstream cause of the orphan creation (live-session manager tearing down claude-cli mid-tool on fingerprint mismatch) and the system-prompt-drift cause (extraSystemPromptHash changing across turns despite the static-only hashing intent) are still tracked separately — see the diagnostic in `claude-live-session.ts` from the prior commit on this branch. AI-assisted: yes. Tooling: Claude Opus + claude-cli.
171d549 to
3a6de1b
Compare
|
Pausing to gather more evidence before asking for re-review. What's solid:
What's open:
Leaving the PR as draft while I run a longer observation window. Diff is here for review of the parts that are solid; happy to split into a smaller PR if maintainers prefer to land just the binding-flush + orphan-tool pieces while the reseed gets more validation. |
|
Update on the open question from my prior comment: The That's a config issue specific to my setup, not an OC bug — but it does narrow the scope of evidence for this PR. To be honest about what's left:
Happy to split the PR into a smaller "binding-flush + orphan-tool" pair (the two fixes that are solidly proven) and pursue the reseed separately once I have evidence from the cleaner environment. Will leave the PR as draft until I hear what shape the maintainer team wants. |
|
Closing this in favor of the split-out PR. The two solidly-proven commits (binding-flush gate + orphan-tool-use invalidator) are now in #84234, rebased on current Dropped from this PR (and from the new split):
Closing per maintainer-options conversation in the prior review thread (option: "split into binding-flush + orphan-tool, pursue reseed separately"). New PR: #84234. |
What & why
Two distinct bugs in the claude-cli resume path that produce the same end-user symptom — the agent silently fails to respond, or starts a fresh session with no prior memory. On my M5 OpenClaw gateway both showed up as "Telegram amnesia": rapid follow-ups would either get ignored (silent abort) or come back as if the bot had no memory of the conversation. ClawSweeper called for "one coordinated Claude CLI transcript-flush fix that preserves stale-session protection, gates write-side ghost session ids, keeps non-Claude providers untouched" — this PR is that coordinated fix, on the resume-validity predicate path that #81048 also touches.
Fix 1:
cliSessionBindingpersisted before transcript flushedWhen a claude-cli turn produces a session id but the underlying claude subprocess fails to flush an
assistant-role record to its~/.claude/projects/<cwd>/<sid>.jsonltranscript (mid-turn kill from a concurrent fingerprint-mismatched turn, supervisor restart, internal failure),buildCliRunResultwas still persisting that session id intocliSessionBinding. The next turn runsclaudeCliSessionTranscriptHasContent(sid), doesn't find an assistant message, logscli session reset: reason=missing-transcript, and starts a fresh claude session with no prior memory.This PR gates the
cliSessionBindingwrite on the same predicate the next-turn invalidator already uses, evaluated at write time. If the transcript doesn't have an assistant message at the moment we'd persist the binding, the binding is omitted andagentMeta.sessionIdis cleared ("") so the session-store fallback atcommand/session-store.ts:166-170— which readsagentMeta.sessionIdand persists it viasetCliSessionId(...)when no binding is present — also drops the unflushed sid. Both writes have to drop together; gating only the binding lets the same bad sid round-trip through the fallback path.The probe is gated on
isClaudeCliProvider(params.provider). Other CLI providers (codex-cli, etc.) don't write to~/.claude/projectsso probing them would always return false and incorrectly strip valid binding metadata. For non-claude-cli providersisCliBindingFlushedreturnstrueunconditionally and behavior is unchanged.Fix 2: trailing orphaned
tool_useblocks all forward progress on resumeA claude-cli session whose JSONL ends with an assistant
tool_usecontent block that was never answered by atool_resultuser message cannot resume — claude-cli will sit waiting for the missingtool_result, hit its no-output watchdog, and the runtime kills it withreason=abort. The dispatcher then sees an empty payload and emits NO_REPLY, which to the user looks like the agent silently ignored their message.The orphan can be left behind when:
claude-live-session.tsno-output watchdog fires while a tool is actively running and OC kills the subprocessIn all cases the resumed session is dead until the binding gets cleared, because every subsequent resume hits the same trailing
tool_useand the same kill cycle. Critically, the existingclaudeCliSessionTranscriptHasContentinvalidator does not catch this — it only checks whether any assistant message exists, not whether the most recent one is a complete turn.This PR adds
claudeCliSessionTranscriptHasOrphanedToolUseincommand/attempt-execution.helpers.ts. It walks the JSONL, finds the last assistant message, and returns true if any of itstool_useids has no matchingtool_resultlater in the file.prepareCliRunContextruns it as a second gate alongsidemissing-transcript, withinvalidatedReason: "orphaned-tool-use". The new reason is added toRAW_TRANSCRIPT_RESEED_ALLOWED_REASONSso prior context is reseeded into the new session.Detection only considers TRAILING orphans — an unanswered tool_use deeper in history is inert because a later assistant message already moved past it. Probe runs only for claude-cli providers and only when the transcript-content gate already passed, so we add no I/O on already-invalidated sessions and no behavior change for non-claude providers.
Root cause walkthroughs
Fix 1 (binding-flush)
resolveSessionIdToSendmints a fresh UUID viacrypto.randomUUID()whenuseResume=false.--session-id.buildCliRunResultwritescliSessionBinding.sessionId = output.sessionId ?? context.reusableCliSession.sessionIdregardless of whether claude-cli committed an assistant record to disk.closeLiveSession(session, "restart")on fingerprint mismatch (src/agents/cli-runner/claude-live-session.ts:78and:110) — the.jsonlmay exist with only the user/system prelude, no"role":"assistant", or may not exist at all.claudeCliSessionTranscriptHasContent(sid), finds nothing →invalidatedReason: "missing-transcript"→ fresh session, empty memory.Fix 2 (orphan-tool)
tool_useblock (e.g. Bash command).claude-live-session.ts:closeLiveSession(_, "restart")fires from an unrelated fingerprint mismatch OR the tool itself hangs.tool_useis flushed, but the correspondingtool_resultuser message is never written.claude --resume <sid>; claude-cli loads the JSONL, sees the trailing unansweredtool_use, and waits indefinitely for a tool_result that will never arrive.claude-live-session.tsno-output watchdog kills the subprocess withreason=abort.NO_REPLY, and the user-visible result is silence.False-negative window
There's a brief gap between claude-cli closing its stdio and the OS making the JSONL line readable to a separate process. A single-shot probe immediately after the await can race that window and falsely conclude the transcript is empty, which would cause the missing-transcript reset we're trying to prevent.
Fix 1's
isCliBindingFlushedhandles this with a bounded retry: 3 probes at delays[0, 50, 150] ms(max ~200 ms total scheduled delay). Empirically, the first probe succeeds in the common case, and the retries cover the long tail without serializing the happy path. If all three probes return false the binding is dropped and the next turn starts fresh — which is the correct behavior, since at that point the transcript genuinely isn't there.Tests
src/agents/cli-runner.binding-flush.test.ts(Fix 1) — exercisesisCliBindingFlusheddirectly via the new injectable-deps seam (setCliRunnerTestDeps/restoreCliRunnerTestDeps, mirroring the existing pattern insrc/agents/cli-runner/prepare.ts):falsewithout probing for claude-cli with an undefined sessionIdtrueon first-probe successtruewithout probing for non-claude-cli providers (codex-cli, anthropic, openai)truewithout probing when the provider isundefinedsrc/agents/command/attempt-execution.test.ts(Fix 2) — exercisesclaudeCliSessionTranscriptHasOrphanedToolUseagainst synthetic on-disk JSONL fixtures:falsewhen the transcript is missingfalsewhen the last assistant message has notool_usefalsewhen everytool_usehas a matchingtool_resulttruewhen the last assistant message has a trailingtool_usewithouttool_result(the bug repro)truewhen the last assistant has multipletool_useand at least one is orphanedfalsewhen an earlier assistanttool_useis unanswered but the last assistant message resolved cleanly (proves it's a TRAILING-orphan check, not a depth-N orphan check)pnpm build && pnpm checkclean locally.pnpm vitest run src/agents/cli-runner src/agents/cli-runner.ts src/agents/cli-runner.binding-flush.test.ts src/agents/cli-runner.test.ts src/agents/command/attempt-execution.test.ts src/agents/command/session-store.test.ts→ 32 files / 302 tests pass.cli-runner.reliability.test.tsis failing onmain(Cannot find module '@mariozechner/pi-ai/oauth') — verified pre-existing onorigin/mainwithout my changes — so per CONTRIBUTING I'm not chasing that.Real Behavior Proof
Behavior or issue addressed: Two distinct bugs in the claude-cli resume path on a live OpenClaw gateway. (1) Missing-transcript ghost binding (Fix 1). Concurrent / fingerprint-mismatched turns kill the claude-cli subprocess mid-flight;
buildCliRunResultpersists the unflushed sessionId; the next turn's invalidator correctly rejects it asmissing-transcriptand resets the session — user sees amnesia. (2) Orphan-tool stuck resume (Fix 2). Gateway restart mid-tool leaves the JSONL transcript ending in an unanswered assistanttool_use; every subsequentclaude --resumehangs waiting for the missingtool_result, hits OC's 180s no-output watchdog, gets killed withreason=abort, and the dispatcher emits NO_REPLY — user sees the agent silently ignoring messages.Real environment tested: M5 (Apple Silicon, macOS Darwin 25.4.0, Node v26.0.0); OpenClaw
2026.5.7(homebrew install at/opt/homebrew/lib/node_modules/openclaw/); live gateway service (ai.openclaw.gatewayLaunchAgent, port18789); claude-cli backend via Anthropic Bedrock (CLAUDE_CODE_USE_BEDROCK=1, modelclaude-opus-4-7); 3d-engineer agent on Telegram (chat-1003753238419, topic 3). Caveat: the after-fix evidence below is from a structurally-equivalent dist-side carry-patch (same predicates, same retry schedule[0, 50, 150] ms, same call sites primary and failover, sameisClaudeCliProvidergate, sameagentMeta.sessionIdclear, same orphan-tool walk algorithm) applied to the running gateway, not from apnpm buildof this PR's branch. I attempted a brew-dist swap from the branch build but the branch-built dist references workspace-internal npm packages (@earendil-works/pi-ai) that aren't in the brew install'snode_modules; the gateway refused to boot. I rolled back. A proper branch-build proof requires running gateway from the source clone with its own auth profile, which I'm tracking as a follow-up. Maintainer is welcome to applyproof: overrideif the structural-equivalence argument is acceptable, or to defer merge until the source-build window is captured.Exact steps or command run after the patch: Fix 1 was verified by counting
missing-transcriptreset events in the live gateway log across pre- and post-patch windows. Fix 2 was verified by applying the dist-side patch to a gateway whose 3d-engineer session was already stuck on a trailingtool_use(Bash)(binding5918a618-...still in~/.openclaw/agents/3d-engineer/sessions/sessions.json), restarting the gateway, sending a real Telegram message via the iOS Telegram app to 3d-engineer, and watching~/.openclaw/logs/gateway.logfor theclaude live session turnand[telegram] sendMessage oklines:Between the kickstart and the awk grep, I sent a real Telegram message to 3d-engineer from the iOS Telegram app to verify the recovery path end-to-end.
Evidence after fix: Live runtime log output from the production gateway, captured by terminal grep/awk on
~/.openclaw/logs/gateway.log— see fenced blocks below.Fix 1 — pre-patch (May 5–12, 8 days, no patch applied) returned 24
missing-transcriptreset events; post-patch (May 13–14, 2 days, dist-side patch applied) returned 0:Fix 1 sample raw pre-patch event lines:
Fix 2 stuck transcript on disk that proves the bug existed:
(Last message is a trailing
tool_use(Bash)content block; notool_resultuser message after it — the orphan.) Fix 2 pre-patch live-gateway repro of the silent-abort cycle (4 events on May 14 before the patch landed at 12:07):Fix 2 post-patch live-gateway recovery (real Telegram message sent after the patch went live at 12:07; same stuck binding
5918a618-...was still in the session-store at the time):Observed result after fix: Fix 1 — zero
missing-transcriptreset events across the 2-day post-patch window on the live gateway, vs. 24 events / ~3 per day pre-patch (pre-patch baseline rate gives ~6 expected events for a 2-day window; observed 0). Fix 2 — the 3d-engineer agent had been stuck since 10:34 (every Telegram message hit the same orphan, aborted at the 180s no-output mark, produced silent NO_REPLY — 4 documented events). 15 minutes after the dist-side patch went live (12:07), the next Telegram message (12:22) triggered the new orphan-tool-use invalidator, the binding was dropped, a fresh session started, and the agent produced a 12-minute / 13,837-line streaming reply that was successfully delivered to Telegram (sendMessage ok message=1804). The follow-up message a minute later resumed the new session cleanly (useResume=true reuse=reusable). Same agent, same stuck session-store entry, recovered the moment the patch went live.What was not tested: Non-claude-cli providers (codex-cli, openai) — the fixes are gated on
isClaudeCliProvider(params.provider)and only probe~/.claude/projects/, so I have high confidence they don't change behavior for other providers, but I have not exercised those paths manually. Fix 1's failover-retry branch under a forced live-session restart — couldn't reliably reproduce the failover code path in my setup; the patched failover branch calls the sameisCliBindingFlushedhelper and threadsbindingFlushOkthroughbuildCliRunResultidentically to the primary path. Sidechain transcripts with trailing orphans — fixture coverage exists (syntheticsidechain-trailingandmain-orphan-with-sidechaincases) but my actual stuck transcript happens to have zero sidechain entries, so the sidechain skip is fixture-validated only, not live-validated. Apnpm buildof this branch on the live gateway — attempted via brew-dist swap, hitERR_MODULE_NOT_FOUNDfor@earendil-works/pi-ai, rolled back; branch-build proof is the outstanding follow-up.Notes for reviewers
prepare.tsrather than recover by injecting a synthetictool_result? Injection would be a behavior change to the conversation contents — claude-cli would see atool_resultit never actually produced, potentially leading the model down false branches. Invalidating the resume and falling through to a fresh session preserves the model's view of the world; the user keeps the conversation going on a clean session, with the prior context reseeded viaRAW_TRANSCRIPT_RESEED.workspaceDirparameter toclaudeCliSessionTranscriptHasContentand changes the lookup from "scan all project dirs" to "deterministic single-dir under<homeDir>/.claude/projects/<encoded(workspaceDir)>". My newclaudeCliSessionTranscriptHasOrphanedToolUsefollows the same shape as the currentclaudeCliSessionTranscriptHasContent(multi-project scan), and would need an identicalworkspaceDirthread-through if fix(command): retry claude-cli transcript probe to close flush race #81048 lands first. Happy to rebase in either direction depending on what the maintainer team prefers.fs.readdiron~/.claude/projects/then up to N JSONL existence-and-content checks); the additional cost is up to 200ms of scheduledsetTimeoutdelay across the bounded retry, only when the transcript hasn't shown up yet. Fix 2's gate adds at most one additional JSONL walk (same I/O pattern) when the content gate already passed; on healthy resumes the JSONL is fully resident in OS page cache from the prior probe so this is essentially a re-walk-in-RAM. I did not microbenchmark.brew upgrade openclaw. The script auto-detects upstream-fix landing and becomes a no-op once this merges.AI-assistance
This PR was authored with Claude (Opus + claude-cli). I understand what the code does — root-caused both bugs via live instrumentation of the gateway log on my own M5 install:
cli session reset: reason=missing-transcriptevents correlating with rapid follow-ups; traced the ghost-id origin tobuildCliRunResultin the dist bundle.silent-reply/dispatcher: exact NO_REPLY final payload was skippedevents on a single 3d-engineer session; correlated withclaude live session close: reason=abortandCLI produced no output for 180s; manually tail'd the session JSONL to find the trailingtool_use(Bash)block with notool_resultafter it.I ran
codex review --base origin/mainlocally before opening this PR. Codex's first pass on the binding-flush-only diff caught two real bugs: (1) the session-store fallback path atcommand/session-store.ts:166-170re-persistedagentMeta.sessionIdviasetCliSessionId(...), defeating the binding-only gate; (2) the probe ran for every CLI provider, not just claude-cli, which would have stripped valid binding metadata for codex/openai sessions. Both are now fixed in this commit and reflected in the test suite. Re-rancodex reviewagainst the updated diff with the orphan-tool fix added before pushing this update.Session logs are local; happy to share specific excerpts if useful.