[Bug] doctor --fix archives historical session transcripts as "orphans," silently losing chat history; recommend CI hardening of every "Run openclaw doctor" recommendation #73471

@injinj

[Bug] doctor --fix archives historical session transcripts as "orphans," causing silent loss of conversational history; recommend CI hardening of every "Run openclaw doctor" recommendation

Summary

openclaw doctor --fix classifies all historical primary session transcripts (<uuid>.jsonl) as "orphans" and offers to archive them by renaming to *.deleted.<timestamp>. The phrasing of the prompt makes the operation sound innocuous ("This only renames them"), but until the fix in [fix/include-reset-transcripts-in-discovery] lands, the renamed files are no longer discovered by memory search — i.e., the user's prior conversational history disappears from memory_search results without any indication.

This is closely related to, but distinct from, existing issues #70680 (trajectory .jsonl falsely flagged) and #50248 (fresh cron sessions falsely classified as missing). All three are different mistakes in the same orphan-classification code path. This third variant is the most damaging because it tombstones valid, complete primary transcripts, i.e., the actual user-visible chat history.

How I hit it (developer-induced, narrow trigger)

To be upfront: my trigger was self-inflicted. I rebased a PR branch onto upstream main and rebuilt, ending up with a version skew where the running gateway didn't satisfy the plugin engines.openclaw requirements declared in my config. That produced a startup crash loop, and the gateway's own error message said Run: openclaw doctor --fix. I ran it, hit the orphan-archive prompt, and confirmed it.

This particular trigger is unlikely to affect normal users, so this issue isn't claiming widespread reproduction in the wild. The reason to file it anyway is that:

  1. The orphan classification, prompt language, and lack of recovery are wrong regardless of how the user got to doctor.
  2. There are ~66 distinct "Run openclaw doctor" recommendations in src/, many fired for non-dev reasons (config migrations, channel auth drift, legacy-form configs, etc.). Any of those can route a user into the same archive prompt.
  3. Once the user is at the prompt, the dangerous behavior is the same regardless of trigger.

In other words: the trigger that got me here is rare, but the trap at the end of the path is reachable from many paths.

Reproduction (minimal, no dev setup needed)

  1. Start with an OpenClaw install that has accumulated historical primary .jsonl transcripts under agents/<id>/sessions/. (Any non-trivial install will have this — every prior session leaves a transcript file behind.)
  2. Verify: sessions.json lists ~1–5 keys (e.g. agent:main:main plus active cron run keys), but the sessions directory holds many more <uuid>.jsonl files.
  3. Run openclaw doctor --fix (or openclaw doctor and answer Y at the prompt).
  4. Observe the prompt:
    Found 47 orphan transcript file(s) in ~/.openclaw/agents/main/sessions.
      These .jsonl files are no longer referenced by sessions.json, so they are
      not part of any active session history.
      Doctor can archive them safely by renaming each file to *.deleted.<timestamp>.
    Archive 47 orphan transcript file(s) in ~/.openclaw/agents/main/sessions?
    This only renames them to *.deleted.<timestamp>. [y/N]
    
  5. Confirm. Every historical session transcript is renamed <uuid>.jsonl.deleted.<timestamp>.
  6. On any release prior to fix/include-reset-transcripts-in-discovery, those transcripts are no longer discoverable by memory search.

The prompt itself is the bug surface. Whether the user got there via a dev mistake, a config migration, an upgrade hint, or a tip in a status message is incidental.

Why the classification is wrong

sessions.json is not a registry of all sessions — it is a "currently-active session per session-key" map. Each entry is keyed by something like agent:main:main and holds the most recent sessionId and sessionFile for that key. When a new session starts under the same key, the entry is overwritten; the old transcript file remains on disk.

The orphan detector in src/commands/doctor-state-integrity.ts builds referencedTranscriptPaths from entry.sessionFile for each entry in sessions.json. So the reference set never contains more than N paths, where N is the number of active session keys (typically 1–5).

The detector then walks agents/<id>/sessions/ and flags every <uuid>.jsonl not in that reference set. This means every prior session transcript (i.e., the entire chat history for that agent) is flagged as "orphan."

A genuine orphan would be: a transcript that was never created by a known session-key (corrupted, leftover from a deleted agent, never registered, malformed first record, etc.). The current heuristic cannot distinguish "old but legitimate history" from "actual orphan" because sessions.json was never designed as a registry of "every session that ever existed."
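
The effect of the heuristic described above can be reproduced in a few lines. This is a minimal sketch, not the code in doctor-state-integrity.ts: the SessionEntry and SessionStore shapes, the flagOrphans name, and the file names are all assumptions made for illustration.

```typescript
// Hypothetical, simplified shapes; the real OpenClaw types may differ.
interface SessionEntry {
  sessionId: string;
  sessionFile: string;
}
type SessionStore = Record<string, SessionEntry>;

// Mirrors the heuristic described in this issue: any transcript on disk that
// is not the *current* transcript of some session key is reported as orphan.
function flagOrphans(store: SessionStore, transcriptsOnDisk: string[]): string[] {
  const referenced = new Set(Object.values(store).map((entry) => entry.sessionFile));
  return transcriptsOnDisk.filter((file) => !referenced.has(file));
}

// Three sessions have run under the same key; only the latest is in the store.
const store: SessionStore = {
  "agent:main:main": { sessionId: "c3", sessionFile: "c3.jsonl" },
};
const onDisk = ["a1.jsonl", "b2.jsonl", "c3.jsonl"];

const flagged = flagOrphans(store, onDisk);
console.log(flagged); // the two older, perfectly valid transcripts
```

With one active key and three transcripts on disk, both older transcripts are flagged even though nothing is wrong with them.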

Why it's dangerous regardless of trigger

  1. The prompt language is misleading. "Archive" implies "moved to a recoverable location"; "this only renames them" implies "no real change." In fact, the rename removes the files from active discovery and (pre-fix) from memory search. The actual semantics are those of a tombstone.
  2. The user is usually under stress when they arrive at doctor. Most "Run openclaw doctor" recommendations fire after a startup failure or migration warning. Users will click through prompts to make the noise stop.
  3. No undo path. There is no openclaw doctor --restore-archives or --undo-last; recovery requires the user to manually rename each <uuid>.jsonl.deleted.<timestamp> file back to <uuid>.jsonl, if they realize what happened at all.
  4. Silent failure mode upstream amplifies the trap. loadSessionStore (src/config/sessions/store-load.ts) silently degrades to an empty store {} on any JSON parse error or schema mismatch. The catch block retries but reports nothing:
    } catch {
      if (attempt < maxReadAttempts - 1) { ... continue; }
    }
    No log, no warning, no "consider running doctor." A corrupted sessions.json therefore makes every transcript appear orphaned to doctor on the next run, turning a small sessions.json corruption event into a "47 orphans, please archive?" prompt with no upstream signal that the store itself was the problem.
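
One way to make that failure visible is to return the parse outcome explicitly instead of falling back to {}. The sketch below is an assumption about what such a loader could look like; StoreLoadResult, parseSessionStore, and mayClassifyOrphans are hypothetical names, not the actual loadSessionStore API.

```typescript
// Hypothetical result type; not the actual loadSessionStore signature.
type StoreLoadResult =
  | { status: "ok"; store: Record<string, unknown> }
  | { status: "error"; reason: string };

function parseSessionStore(raw: string): StoreLoadResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    return { status: "error", reason: `sessions.json is not valid JSON: ${detail}` };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { status: "error", reason: "sessions.json is not a JSON object" };
  }
  return { status: "ok", store: parsed as Record<string, unknown> };
}

// A caller such as doctor can then refuse to classify orphans at all when
// the store itself could not be read, instead of treating it as empty.
function mayClassifyOrphans(result: StoreLoadResult): boolean {
  if (result.status === "error") {
    console.warn(`[session-store] ${result.reason}; skipping orphan detection`);
    return false;
  }
  return true;
}
```

The point of the design is that "store unreadable" and "store empty" become two distinguishable states, so a corrupted file can no longer masquerade as 47 orphans.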

Existing related bugs (same code, different mistakes)

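Issues #70680 (trajectory .jsonl falsely flagged) and #50248 (fresh cron sessions falsely classified as missing) are earlier misclassifications in the same orphan-detection code path. The recurrence pattern suggests the orphan/cleanup detection logic needs systematic test coverage that asserts semantics, not just mechanism.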

Recommended fixes

Short-term (correctness)

  1. Tighten orphan classification. A primary .jsonl is only a real orphan if it cannot be parsed as a valid session transcript (no session-start record, truncated header, etc.) or if it explicitly belongs to a deleted agent. Don't equate "not currently in sessions.json" with "orphan."
  2. Reword the prompt to reflect actual semantics. Replace "archive" with "tombstone (hide from discovery)"; surface the count, total bytes, and oldest/newest timestamps; make clear that the files will not appear in chat history listings or (pre-fix) memory search.
  3. Add a recovery path. openclaw doctor --restore-archives [--since <timestamp>] to undo a previous archive run.
  4. Make loadSessionStore parse failures non-silent. On parse error, log a warning and either back up + recreate from disk-walk, or hard-fail with a clear message instead of silently returning {}.
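
As a rough illustration of fix 1, classification could depend on whether the file parses as a transcript rather than on membership in sessions.json. This is a sketch under a stated assumption: that a valid transcript begins with a JSON record whose type is "session" (inferred from the fixture in doctor-state-integrity.test.ts, not from the real transcript format). The function and type names are hypothetical.

```typescript
type TranscriptClass = "active" | "historical" | "orphan-candidate";

// Assumption: a valid transcript starts with a JSON record carrying
// type "session". The real header check may need to be stricter.
function hasSessionHeader(content: string): boolean {
  const firstLine = content.split("\n").find((line) => line.trim() !== "");
  if (firstLine === undefined) return false;
  try {
    const record: unknown = JSON.parse(firstLine);
    return (
      typeof record === "object" &&
      record !== null &&
      (record as { type?: unknown }).type === "session"
    );
  } catch {
    return false; // truncated or malformed header
  }
}

function classifyTranscript(
  file: string,
  content: string,
  referenced: ReadonlySet<string>,
): TranscriptClass {
  if (referenced.has(file)) return "active";
  // Not in sessions.json is no longer enough to be an orphan.
  return hasSessionHeader(content) ? "historical" : "orphan-candidate";
}
```

Under this scheme only files with a missing or unparseable header would ever reach the archive prompt; "belongs to a deleted agent" would need a separate check that is not sketched here.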

Long-term (CI hardening)

The "Run openclaw doctor" surface is a high-leverage user trust boundary. Every site that says "run doctor to fix this" is implicitly a contract: the user trusts that doctor will fix the named problem and not damage anything else. That contract should be tested systematically.

Proposal: doctor recommendation audit + CI test suite.

  1. Inventory. Programmatically extract every "Run openclaw doctor" recommendation in src/. Output: structured list of (trigger condition, recommended command, repair flow, mutation level, has test, test asserts semantics).

  2. Triage by mutation level.

    • Read-only/info: openclaw doctor (no --fix) — low risk
    • Config-write: openclaw doctor --fix for legacy config migrations — medium risk, recoverable from VCS/backup
    • State-rename: orphan archive, session pruning, oauth-dir migrations — high risk, the dangerous tier
    • State-delete: anything actually removing files — critical, audit individually
  3. Per-recommendation test triple for every state-mutating site:

    • Reproduce trigger: set up the exact broken state that causes the "run doctor" recommendation
    • Run repair: invoke the recommended doctor command non-interactively
    • Assert semantics:
      • Repair succeeded
      • Triggering condition is gone
      • No collateral mutation to user-content data (transcripts, memory files, MEMORY.md, config-tracked files): byte-for-byte preservation, or explicit listing of every file mutated
      • Memory search still finds the same chunks for the same queries before vs. after (modulo expected new chunks)
  4. CI gate. Block PR merges that add a new "run doctor" recommendation without a corresponding test in this suite. A small lint rule + tag system (/* @doctor-recommendation tested-by: <test-id> */) would catch new sites.
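
The CI gate in step 4 could start as a plain text scan. The sketch below assumes the proposed tag sits either on the same line as the recommendation string or on the line directly above it; the regular expressions, the function name, and the sample source lines are illustrative, not taken from the OpenClaw codebase.

```typescript
// Matches "Run openclaw doctor" and "Run: openclaw doctor" in message strings.
const RECOMMENDATION = /Run:? openclaw doctor/i;
// Matches the proposed tag and requires a non-empty test id.
const TAG = /@doctor-recommendation\s+tested-by:\s*[\w.-]+/;

// Returns the 1-based line numbers of recommendations that carry no tag.
function findUntaggedRecommendations(source: string): number[] {
  const lines = source.split("\n");
  const untagged: number[] = [];
  lines.forEach((line, index) => {
    if (!RECOMMENDATION.test(line)) return;
    const previous = lines[index - 1] ?? "";
    if (!TAG.test(line) && !TAG.test(previous)) untagged.push(index + 1);
  });
  return untagged;
}

const sample = [
  "/* @doctor-recommendation tested-by: legacy-config-migration */",
  'hint("Run openclaw doctor --fix to migrate the config");',
  'throw new Error("Plugin engine mismatch. Run: openclaw doctor --fix");',
].join("\n");

console.log(findUntaggedRecommendations(sample)); // only line 3 is reported
```

A real lint rule would more likely work on the AST and verify that the referenced test id exists, but even this text-level check would fail a PR that adds an untested recommendation.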

The third bullet under (3) is the test type that doesn't exist today. The current doctor-state-integrity.test.ts creates a fake orphan-session.jsonl containing only {"type":"session"} — it tests the mechanism of archiving, never the semantics of "is this thing actually an orphan, and is the archive operation safe for real user data."
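
The "no collateral mutation" assertion can be expressed as a diff between two snapshots of the user-content tree, taken before and after the repair. This is a sketch only: the snapshot is assumed to be a map from path to content digest that the test harness computes, and the paths, digests, and timestamp suffix below are made up.

```typescript
type FileSnapshot = ReadonlyMap<string, string>; // path -> content digest

interface FileMutation {
  path: string;
  kind: "removed" | "added" | "changed";
}

// Lists every path whose presence or content differs between two snapshots.
function diffSnapshots(before: FileSnapshot, after: FileSnapshot): FileMutation[] {
  const mutations: FileMutation[] = [];
  before.forEach((digest, path) => {
    const now = after.get(path);
    if (now === undefined) mutations.push({ path, kind: "removed" });
    else if (now !== digest) mutations.push({ path, kind: "changed" });
  });
  after.forEach((_digest, path) => {
    if (!before.has(path)) mutations.push({ path, kind: "added" });
  });
  return mutations;
}

// What the orphan-archive step does to one historical transcript.
const before = new Map([
  ["sessions/a1.jsonl", "digest-a1"],
  ["MEMORY.md", "digest-memory"],
]);
const after = new Map([
  ["sessions/a1.jsonl.deleted.20260425", "digest-a1"],
  ["MEMORY.md", "digest-memory"],
]);

const collateral = diffSnapshots(before, after);
console.log(collateral); // one "removed" and one "added" entry for the transcript
```

A semantic test would then require the diff to be empty, or to match an explicit allow-list of mutations for that repair. Run against a realistic fixture, the archive step would fail such a test immediately.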

Workaround (until the indexer fix lands and a recovery command exists)

  • Do not run openclaw doctor --fix blind. Run openclaw doctor (read-only) first to see the proposed changes.
  • If you've already been bitten: archived transcripts are at agents/<id>/sessions/<uuid>.jsonl.deleted.<timestamp>. They are intact and can be restored by:
    cd ~/.openclaw/agents/<agent>/sessions
    for f in *.jsonl.deleted.*; do
      [ -e "$f" ] || continue   # the glob matched nothing
      # -n refuses to overwrite an existing <uuid>.jsonl (e.g. one used by a live session)
      mv -n -- "$f" "${f%.deleted.*}"
    done
    However, this puts them back as plain .jsonl, and because the orphan definition is unchanged, the next doctor --fix will tombstone them again. It is better to leave them as .deleted.* and pick up the indexer fix from fix/include-reset-transcripts-in-discovery, which makes archived files searchable.
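
If a restore is still wanted, for example as the core of the proposed (and currently nonexistent) --restore-archives command, the rename plan can be computed up front so that collisions are reported rather than overwritten. This is a sketch of one possible approach; planRestore and RestorePlan are hypothetical names, and the only assumption about archive names is the <uuid>.jsonl.deleted.<timestamp> pattern described in this issue.

```typescript
interface RestorePlan {
  renames: Array<{ from: string; to: string }>;
  skipped: string[]; // archived files whose original name is already taken
}

const ARCHIVED = /^(.+\.jsonl)\.deleted\..+$/;

// Pure planning step: takes a directory listing, touches nothing on disk.
function planRestore(filesInDir: string[]): RestorePlan {
  const taken = new Set(filesInDir);
  const plan: RestorePlan = { renames: [], skipped: [] };
  for (const file of filesInDir) {
    const target = ARCHIVED.exec(file)?.[1];
    if (target === undefined) continue; // not an archived transcript
    if (taken.has(target)) {
      plan.skipped.push(file);
    } else {
      plan.renames.push({ from: file, to: target });
      taken.add(target); // guards against two archives of the same uuid
    }
  }
  return plan;
}

const plan = planRestore([
  "a1.jsonl.deleted.1",
  "b2.jsonl",
  "b2.jsonl.deleted.2",
  "sessions.json",
]);
console.log(plan);
```

Here a1 is restored, while the archived b2 is skipped because a live b2.jsonl already exists. Separating planning from execution also makes a dry-run listing trivial to offer before anything is renamed.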

Environment

  • OpenClaw 2026.4.22 → 2026.4.25/2026.4.26 (the trap reproduces across versions)
  • Linux (Fedora 43)
  • Original trigger in my case was a dev-induced version skew (rebased PR + stale build), not a path normal users typically take
  • Affected agent: main, ~47 transcripts spanning Feb–Apr archived in a single doctor run
