Skip to content

[Bug]: claudeCliSessionTranscriptHasContent races claude-cli's transcript flush, returning false negatives that force cold session starts #81042

@benjamin1492

Description

@benjamin1492

[Bug]: claudeCliSessionTranscriptHasContent races claude-cli's transcript flush, returning false negatives that force cold session starts

Summary

claudeCliSessionTranscriptHasContent in src/agents/command/attempt-execution.helpers.ts checks whether a claude-cli session's project transcript JSONL exists and contains at least one assistant message. The result is used to decide whether the runtime can resume the session via --resume or must start fresh.

The check is racy: claude-cli flushes its transcript JSONL asynchronously after a session-id rotation. There is a sub-100ms window where the JSONL file exists on disk but does not yet contain an assistant message, because claude-cli has only flushed the user-message header. The check returns false, the runtime decides "no transcript, cold start", and the prior turn's context is lost.

This compounds the issue described in the related systemPromptHash bug (which also forces session rotations more often than necessary): every time a fingerprint-driven rotation happens, this race has a chance to also fail the transcript probe and lose the bridge.

Symptom

When the race fires, the docs page gateway/cli-backends describes the intended
behavior:

Stored session ids are verified against an existing readable project transcript
before resume, so phantom bindings are cleared with reason=transcript-missing instead
of silently starting a fresh Claude CLI session under --resume.

In practice, reason=transcript-missing fires even when the transcript exists, just
because the assistant message hasn't been flushed yet.

Why it happens

Current implementation (src/agents/command/attempt-execution.helpers.ts line 80):

export async function claudeCliSessionTranscriptHasContent(params: {}): Promise<boolean> {
  const sessionId = normalizeClaudeCliSessionId(params.sessionId);
  if (!sessionId) return false;
  const homeDir = params.homeDir?.trim() || process.env.HOME || os.homedir();
  const projectsDir = path.join(homeDir, CLAUDE_PROJECTS_RELATIVE_DIR);
  let projectEntries;
  try {
    projectEntries = await fs.readdir(projectsDir, { withFileTypes: true });
  } catch {
    return false;
  }
  for (const entry of projectEntries) {
    if (!entry.isDirectory()) continue;
    const candidate = path.join(projectsDir, entry.name, `${sessionId}.jsonl`);
    if (await jsonlFileHasAssistantMessage(candidate)) {
      return true;
    }
  }
  return false;
}

Single-pass scan. No retry. No diagnostic on the negative path, so the failure mode is
silent — context loss looks like "the agent forgot," not "the runtime decided the
session had no history."

Suggested fix

Two changes:

  1. Scan-with-retry. If the JSONL exists but doesn't yet have an assistant message,
    wait ~150ms and re-scan once. Closes the flush window without meaningfully delaying
    the no-content path (which doesn't matter for latency since it's the prelude to a
    cold start anyway).

  2. Diagnostic log on negative. When the probe returns false, log which JSONL files
    were inspected (project, fileExists, hasAssistant) so the cause is visible in
    gateway logs. Currently this failure is silent.

Sketch:

async function claudeCliSessionTranscriptScan(params: {}): Promise<{
  hasAssistant: boolean; fileExists: boolean; sessionId: string | null;
  homeDir: string | null; projects: { project: string; fileExists: boolean; hasAssistant: boolean }[];
}> {
  // … existing scan logic, but record fileExists/hasAssistant per project entry
}

export async function claudeCliSessionTranscriptHasContent(params: {}): Promise<boolean> {
  const first = await claudeCliSessionTranscriptScan(params);
  if (first.hasAssistant) return true;
  if (first.fileExists) {
    await new Promise(r => setTimeout(r, 150));
    const second = await claudeCliSessionTranscriptScan(params);
    if (second.hasAssistant) return true;
    cliBackendLog.warn("claude-cli transcript probe negative after 150ms retry",
      { sessionId: second.sessionId, homeDir: second.homeDir, projects: second.projects });
    return false;
  }
  cliBackendLog.warn("claude-cli transcript probe negative (no matching jsonl)",
    { sessionId: first.sessionId, homeDir: first.homeDir, projectCount: first.projects.length });
  return false;
}

The 150ms sleep is gated on fileExists so we don't introduce latency for the
genuinely-missing-session case.

Repro

Easier to repro alongside the systemPromptHash fingerprint bug (more rotations =
more chances for the race). Standalone repro requires manually triggering a
session-id rotation right at the moment of the next inbound turn — feasible but
fiddly.

If you want a forced repro: add a setTimeout(50) immediately before the
jsonlFileHasAssistantMessage call and observe transcript-missing resets on every
session rotation in a chat channel.

Affected versions

Confirmed on 2026.5.7. Function has likely been racy since introduction.

Scope

  • Affected: claude-cli backend, all surfaces (more visible on chat channels
    because of higher rotation frequency). Resume path only — first-turn sessions don't
    hit this.
  • Not affected: API backends.

Related

Next step

A PR with the source-level fix and tests will follow shortly.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions