[Bug]: claudeCliSessionTranscriptHasContent races claude-cli's transcript flush, returning false negatives that force cold session starts
Summary
claudeCliSessionTranscriptHasContent in src/agents/command/attempt-execution.helpers.ts checks whether a claude-cli session's project transcript JSONL exists and contains at least one assistant message. The result is used to decide whether the runtime can resume the session via --resume or must start fresh.
The check is racy: claude-cli flushes its transcript JSONL asynchronously after a session-id rotation. There is a sub-100ms window where the JSONL file exists on disk but does not yet contain an assistant message, because claude-cli has only flushed the user-message header. The check returns false, the runtime decides "no transcript, cold start", and the prior turn's context is lost.
This compounds the issue described in the related systemPromptHash bug (which also forces session rotations more often than necessary): every time a fingerprint-driven rotation happens, this race has a chance to also fail the transcript probe and lose the bridge.
Symptom
When the race fires, the docs page gateway/cli-backends describes the intended
behavior:
Stored session ids are verified against an existing readable project transcript
before resume, so phantom bindings are cleared with reason=transcript-missing instead
of silently starting a fresh Claude CLI session under --resume.
In practice, reason=transcript-missing fires even when the transcript exists, just
because the assistant message hasn't been flushed yet.
Why it happens
Current implementation (src/agents/command/attempt-execution.helpers.ts line 80):
export async function claudeCliSessionTranscriptHasContent(params: {…}): Promise<boolean> {
const sessionId = normalizeClaudeCliSessionId(params.sessionId);
if (!sessionId) return false;
const homeDir = params.homeDir?.trim() || process.env.HOME || os.homedir();
const projectsDir = path.join(homeDir, CLAUDE_PROJECTS_RELATIVE_DIR);
let projectEntries;
try {
projectEntries = await fs.readdir(projectsDir, { withFileTypes: true });
} catch {
return false;
}
for (const entry of projectEntries) {
if (!entry.isDirectory()) continue;
const candidate = path.join(projectsDir, entry.name, `${sessionId}.jsonl`);
if (await jsonlFileHasAssistantMessage(candidate)) {
return true;
}
}
return false;
}
Single-pass scan. No retry. No diagnostic on the negative path, so the failure mode is
silent — context loss looks like "the agent forgot," not "the runtime decided the
session had no history."
Suggested fix
Two changes:
-
Scan-with-retry. If the JSONL exists but doesn't yet have an assistant message,
wait ~150ms and re-scan once. Closes the flush window without meaningfully delaying
the no-content path (which doesn't matter for latency since it's the prelude to a
cold start anyway).
-
Diagnostic log on negative. When the probe returns false, log which JSONL files
were inspected (project, fileExists, hasAssistant) so the cause is visible in
gateway logs. Currently this failure is silent.
Sketch:
async function claudeCliSessionTranscriptScan(params: {…}): Promise<{
hasAssistant: boolean; fileExists: boolean; sessionId: string | null;
homeDir: string | null; projects: { project: string; fileExists: boolean; hasAssistant: boolean }[];
}> {
// … existing scan logic, but record fileExists/hasAssistant per project entry
}
export async function claudeCliSessionTranscriptHasContent(params: {…}): Promise<boolean> {
const first = await claudeCliSessionTranscriptScan(params);
if (first.hasAssistant) return true;
if (first.fileExists) {
await new Promise(r => setTimeout(r, 150));
const second = await claudeCliSessionTranscriptScan(params);
if (second.hasAssistant) return true;
cliBackendLog.warn("claude-cli transcript probe negative after 150ms retry",
{ sessionId: second.sessionId, homeDir: second.homeDir, projects: second.projects });
return false;
}
cliBackendLog.warn("claude-cli transcript probe negative (no matching jsonl)",
{ sessionId: first.sessionId, homeDir: first.homeDir, projectCount: first.projects.length });
return false;
}
The 150ms sleep is gated on fileExists so we don't introduce latency for the
genuinely-missing-session case.
Repro
Easier to repro alongside the systemPromptHash fingerprint bug (more rotations =
more chances for the race). Standalone repro requires manually triggering a
session-id rotation right at the moment of the next inbound turn — feasible but
fiddly.
If you want a forced repro: add a setTimeout(50) immediately before the
jsonlFileHasAssistantMessage call and observe transcript-missing resets on every
session rotation in a chat channel.
Affected versions
Confirmed on 2026.5.7. Function has likely been racy since introduction.
Scope
- Affected:
claude-cli backend, all surfaces (more visible on chat channels
because of higher rotation frequency). Resume path only — first-turn sessions don't
hit this.
- Not affected: API backends.
Related
Next step
A PR with the source-level fix and tests will follow shortly.
[Bug]:
claudeCliSessionTranscriptHasContentraces claude-cli's transcript flush, returning false negatives that force cold session startsSummary
claudeCliSessionTranscriptHasContentinsrc/agents/command/attempt-execution.helpers.tschecks whether a claude-cli session's project transcript JSONL exists and contains at least one assistant message. The result is used to decide whether the runtime can resume the session via--resumeor must start fresh.The check is racy: claude-cli flushes its transcript JSONL asynchronously after a session-id rotation. There is a sub-100ms window where the JSONL file exists on disk but does not yet contain an assistant message, because claude-cli has only flushed the user-message header. The check returns
false, the runtime decides "no transcript, cold start", and the prior turn's context is lost.This compounds the issue described in the related
systemPromptHashbug (which also forces session rotations more often than necessary): every time a fingerprint-driven rotation happens, this race has a chance to also fail the transcript probe and lose the bridge.Symptom
When the race fires, the docs page
gateway/cli-backendsdescribes the intendedbehavior:
In practice,
reason=transcript-missingfires even when the transcript exists, justbecause the assistant message hasn't been flushed yet.
Why it happens
Current implementation (
src/agents/command/attempt-execution.helpers.tsline 80):Single-pass scan. No retry. No diagnostic on the negative path, so the failure mode is
silent — context loss looks like "the agent forgot," not "the runtime decided the
session had no history."
Suggested fix
Two changes:
Scan-with-retry. If the JSONL exists but doesn't yet have an assistant message,
wait ~150ms and re-scan once. Closes the flush window without meaningfully delaying
the no-content path (which doesn't matter for latency since it's the prelude to a
cold start anyway).
Diagnostic log on negative. When the probe returns false, log which JSONL files
were inspected (project, fileExists, hasAssistant) so the cause is visible in
gateway logs. Currently this failure is silent.
Sketch:
The 150ms sleep is gated on
fileExistsso we don't introduce latency for thegenuinely-missing-session case.
Repro
Easier to repro alongside the
systemPromptHashfingerprint bug (more rotations =more chances for the race). Standalone repro requires manually triggering a
session-id rotation right at the moment of the next inbound turn — feasible but
fiddly.
If you want a forced repro: add a
setTimeout(50)immediately before thejsonlFileHasAssistantMessagecall and observetranscript-missingresets on everysession rotation in a chat channel.
Affected versions
Confirmed on 2026.5.7. Function has likely been racy since introduction.
Scope
claude-clibackend, all surfaces (more visible on chat channelsbecause of higher rotation frequency). Resume path only — first-turn sessions don't
hit this.
Related
systemPromptHashinbuildClaudeLiveFingerprintcauses excessive rotations. Both should be fixedtogether for the chat-channel context-loss class to be fully closed.
openclaw updaterun mid-turn causes total message loss on Telegram (and likely Discord) #71178, Gateway restart causes session state loss, requiring manual intervention to resume autonomous tasks #62442, CLI backend responses sometimes not delivered to Telegram delivery context #75991 — symptom cluster.Next step
A PR with the source-level fix and tests will follow shortly.