Skip to content

fix(sessions): preserve corrupt-header transcripts#89050

Merged
steipete merged 5 commits into
openclaw:mainfrom
charles-openclaw:fix-session-corrupt-header
Jun 2, 2026
Merged

fix(sessions): preserve corrupt-header transcripts#89050
steipete merged 5 commits into
openclaw:mainfrom
charles-openclaw:fix-session-corrupt-header

Conversation

@charles-openclaw

@charles-openclaw charles-openclaw commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Recover corrupted first-line session headers in SessionManager.open() before the destructive rewrite path can run.
  • Preserve the original bytes in a same-mode .corrupt-* backup, then rewrite the active transcript as valid JSONL.
  • Keep recovered user-only transcripts through run preparation and first assistant persistence, with regression coverage for the production path.

Linked context

Closes #89037.

Real behavior proof (required for external PRs)

  • Behavior addressed: resuming a session whose first JSONL header line is corrupted no longer truncates persisted transcript rows; recovered user-only transcripts also survive the production SessionManager.open() -> prepareSessionManagerForRun() -> first assistant append path.
  • Real environment tested: local OpenClaw source checkout on the PR branch, using production SessionManager.open() and prepareSessionManagerForRun() against disposable JSONL session files under the OS temp directory.
  • Exact steps or command run after this patch: node --import tsx .artifacts/corrupt-session-proof.ts.
  • Evidence after fix: terminal output from the disposable OpenClaw runtime proof:
corrupt header + assistant resume
active lines: 3
contains user: true
contains assistant: true
backup files: 1
corrupt header + user-only production path
active lines: 3
header id: sess-runtime
contains user: true
assistant appended: true
  • Observed result after fix: the corrupted-header file is rewritten to a valid three-line transcript while preserving the user and assistant rows, one .corrupt-* backup is created, and the recovered user-only production path keeps the user row before appending the first assistant row.
  • What was not tested: no private user transcript was opened; the proof uses disposable local session files with the same corrupted-header JSONL shapes.
  • Before evidence: current main rewrites the corrupted-header transcript to a fresh one-line session header when SessionManager.open() receives a non-empty file whose first parseable entry is not a session header.

Tests and validation

  • pnpm test src/agents/sessions/session-manager.test.ts src/agents/embedded-agent-runner/session-manager-init.test.ts -- --reporter=dot passed 2 files / 6 tests.
  • pnpm tsgo:prod passed.
  • pnpm check:test-types passed.
  • pnpm lint --threads=8 passed.
  • git diff --check -- src/agents/sessions/session-manager.ts src/agents/sessions/session-manager.test.ts src/agents/embedded-agent-runner/session-manager-init.ts src/agents/embedded-agent-runner/session-manager-init.test.ts passed.
  • .agents/skills/autoreview/scripts/autoreview --mode local reported no accepted/actionable findings.

Risk checklist

  • Did user-visible behavior change? Yes, corrupt session-header resumes now fail before destructive reset instead of silently wiping the file.
  • Did config, environment, or migration behavior change? No.
  • Did security, auth, secrets, network, or tool execution behavior change? No.
  • Highest-risk area: preserving the existing first-run session-manager workaround. The guard only reads and validates the existing first nonblank line before the reset; valid pre-created session headers still follow the old reset path.

Current review state

Maintainer fix 30307017a8 pushed and ClawSweeper re-review requested.

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: S triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. labels Jun 1, 2026
@clawsweeper

clawsweeper Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs changes before merge. Reviewed June 2, 2026, 6:57 AM ET / 10:57 UTC.

Summary
The PR recovers corrupt first-line session JSONL headers in SessionManager.open(), preserves the original bytes in a same-mode corrupt backup, and adds session-manager/run-preparation regression coverage for recovered transcripts.

PR surface: Source +103, Tests +224. Total +327 across 4 files.

Reproducibility: yes. The linked issue and PR comments provide a high-confidence source reproduction through SessionManager.open() with a corrupt first header line, and the current main code path shows that malformed first line is skipped before the empty/corrupt rewrite branch.

Review metrics: 1 noteworthy metric.

  • Persistent transcript artifact surface: 1 new .corrupt-*.jsonl sidecar. The added sidecar contains private transcript bytes and crosses existing transcript classifier/scanner behavior before merge.

Merge readiness
Overall: 🧂 unranked krab
Proof: 🦞 diamond lobster
Patch quality: 🧂 unranked krab
Result: blocked by patch quality or review findings.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • Classify generated corrupt backups as recovery/archive artifacts, not primary or usage-counted transcripts.
  • [P2] Add focused src/config/sessions/artifacts.test.ts coverage and rerun the focused session-manager/session-manager-init/artifacts tests.

Risk before merge

  • [P1] Persistent .corrupt-*.jsonl backups contain private transcript bytes but currently match primary/usage-counted transcript helpers, so memory, cost, doctor, and disk-budget paths can process them as live sessions.
  • [P1] Because this is a P0 data-loss recovery path, merge should include artifact classifier proof in addition to the focused recovery tests.

Maintainer options:

  1. Classify corrupt backups before merge (recommended)
    Change the backup naming/classification so generated corrupt backups are non-primary and non-usage-counted recovery artifacts, then add focused artifact tests.
  2. Accept the scanner exposure risk
    Maintainers could intentionally accept that corrupt backups are scanned like live transcripts, but that should be an explicit data-handling decision before merge.
  3. Pause for recovery-artifact policy
    If the team is not ready to define persistent corrupt transcript artifacts, pause this PR and land only after the recovery artifact contract is settled.
Copy recommended automerge instruction
@clawsweeper automerge

Special instructions:
Change corrupt transcript backup naming/classification so generated backups are not treated as primary or usage-counted session transcripts; add focused artifacts tests for the generated corrupt backup name; rerun the focused session-manager, session-manager-init, and artifacts tests plus git diff --check.

Next step before merge

  • [P2] A narrow automated repair can address the remaining blocker by changing corrupt backup naming/classification and adding focused classifier coverage.

Security
Needs attention: The diff creates a persistent copy of private transcript bytes, and its .jsonl name currently falls through live transcript classifiers.

Review findings

  • [P1] Stop corrupt backups from matching live transcript artifacts — src/agents/sessions/session-manager.ts:431
Review details

Best possible solution:

Keep the recovery-in-SessionManager.open() approach, but land it only after corrupt backups use a recognized recovery/archive classification and focused scanner/classifier tests prove they are not treated as live transcripts.

Do we have a high-confidence way to reproduce the issue?

Yes. The linked issue and PR comments provide a high-confidence source reproduction through SessionManager.open() with a corrupt first header line, and the current main code path shows that malformed first line is skipped before the empty/corrupt rewrite branch.

Is this the best way to solve the issue?

No, not yet. Moving recovery into SessionManager.open() is the right layer, but the merge-safe solution also needs the generated corrupt backup to be classified as a recovery/archive artifact rather than a live transcript.

Full review comments:

  • [P1] Stop corrupt backups from matching live transcript artifacts — src/agents/sessions/session-manager.ts:431
    buildCorruptSessionBackupPath() creates backups named <session>.jsonl.corrupt-... .jsonl, but the existing artifact helpers treat any non-archive .jsonl as a primary and usage-counted transcript. That means this new private backup can be scanned or surfaced like a live session instead of a recovery artifact. Please use an existing archive-style suffix or add a corrupt archive classification, with focused tests proving these backups are non-primary and non-usage-counted.
    Confidence: 0.9

Overall correctness: patch is incorrect
Overall confidence: 0.88

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 1db2c2a3e058.

Label changes

Label justifications:

  • P0: The PR addresses a linked data-loss bug where corrupted session headers can wipe persisted transcripts.
  • merge-risk: 🚨 session-state: Merging as-is adds a persistent recovered-transcript sidecar that current session scanners can treat as a live transcript.
  • merge-risk: 🚨 security-boundary: The sidecar preserves private transcript bytes, and its current .jsonl shape can expose it to memory/cost/session tooling instead of recovery-only handling.
  • rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🦞 diamond lobster and patch quality is 🧂 unranked krab.
  • status: ⏳ waiting on author: ClawSweeper has contributor-facing work open and is waiting for author action. Sufficient (terminal): The PR body and follow-up comments include terminal output from disposable filesystem-backed runs through production SessionManager.open() and prepareSessionManagerForRun() showing recovered messages, assistant append, backup creation, and backup mode.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body and follow-up comments include terminal output from disposable filesystem-backed runs through production SessionManager.open() and prepareSessionManagerForRun() showing recovered messages, assistant append, backup creation, and backup mode.
Evidence reviewed

PR surface:

Source +103, Tests +224. Total +327 across 4 files.

View PR surface stats
Area Files Added Removed Net
Source 2 112 9 +103
Tests 2 224 0 +224
Docs 0 0 0 0
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 4 336 9 +327

Security concerns:

  • [medium] Corrupt transcript backup can be scanned as live session data — src/agents/sessions/session-manager.ts:431
    The backup is same-mode, which is good, but it is still named like a normal .jsonl transcript while preserving private transcript bytes. Existing session memory, usage/cost, doctor, and disk-budget paths rely on filename classifiers, so the backup needs explicit recovery/archive classification before merge.
    Confidence: 0.88

Acceptance criteria:

  • [P1] node scripts/run-vitest.mjs src/agents/sessions/session-manager.test.ts src/agents/embedded-agent-runner/session-manager-init.test.ts src/config/sessions/artifacts.test.ts.
  • [P1] git diff --check -- src/agents/sessions/session-manager.ts src/agents/sessions/session-manager.test.ts src/agents/embedded-agent-runner/session-manager-init.ts src/agents/embedded-agent-runner/session-manager-init.test.ts src/config/sessions/artifacts.ts src/config/sessions/artifacts.test.ts.

What I checked:

  • Recovery happens before destructive rewrite: At PR head, setSessionFile() attempts corrupt-header recovery before the old empty/corrupt rewrite branch, then rewrites recovered entries and marks the manager as recovered. (src/agents/sessions/session-manager.ts:775, 30307017a866)
  • Backup naming creates a live-looking JSONL: The new corrupt backup path is ${filePath}.corrupt-<timestamp>-<uuid>.jsonl, so a generated backup still ends in .jsonl and does not use an existing archive suffix. (src/agents/sessions/session-manager.ts:429, 30307017a866)
  • Current classifier treats unknown .jsonl as primary: isPrimarySessionTranscriptFileName() accepts any .jsonl that is not trajectory, checkpoint, or a recognized archive; recognized archive reasons are only deleted, reset, and bak. (src/config/sessions/artifacts.ts:88, 1db2c2a3e058)
  • Scanner callers depend on the classifier: Disk-budget cleanup includes primary transcript files as unreferenced artifacts, usage/cost scanning filters on usage-counted transcript names, and session memory hit parsing treats .jsonl names as transcript identities. (src/config/sessions/disk-budget.ts:287, 1db2c2a3e058)
  • Production entry path checked: The embedded runner opens the session file with SessionManager.open() before prepareSessionManagerForRun(), matching the linked data-loss reproduction path discussed on the PR. (src/agents/embedded-agent-runner/run/attempt.ts:1961, 1db2c2a3e058)
  • Scoped policy read: Root AGENTS.md and src/agents/AGENTS.md were read; the review applied the private session-state, scanner-proof, and whole-path review guidance. Maintainer notes were checked and only a Telegram note existed, so no matching maintainer note applied. (AGENTS.md:1, 1db2c2a3e058)

Likely related people:

  • steipete: Authored the latest PR-head maintainer fix commit and summarized the recovery behavior and local validation in the PR discussion. (role: recent fix author; confidence: high; commits: 30307017a866; files: src/agents/sessions/session-manager.ts, src/agents/embedded-agent-runner/session-manager-init.ts)
  • Abner Shang: Current checkout blame/log show the session manager, init helper, and artifact classifier files entering this checkout through commit 5be282e; history depth limits stronger ownership inference. (role: current-main area contributor; confidence: low; commits: 5be282e459ee; files: src/agents/sessions/session-manager.ts, src/agents/embedded-agent-runner/session-manager-init.ts, src/config/sessions/artifacts.ts)
  • yetval: Provided the linked data-loss issue and rechecked the PR against the real SessionManager.open() path, including the follow-up header-only edge case. (role: reporter and reproduction reviewer; confidence: medium; files: src/agents/sessions/session-manager.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P0 Emergency: data loss, security bypass, crash loop, or unusable core runtime. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. labels Jun 1, 2026
@charles-openclaw

Copy link
Copy Markdown
Contributor Author

Thanks for the review. I cannot honestly provide live production resume proof against a real corrupted user transcript from this contributor environment without touching private/local session data.

What I can confirm publicly is the focused behavior proof in this PR: the regression uses the real filesystem-backed prepareSessionManagerForRun helper with a temp JSONL transcript whose first line is malformed and whose later line contains a persisted user message. After the guarded path rejects, the test asserts the transcript string remains byte-for-byte unchanged, which is the data-preservation behavior this patch is meant to protect.

Verification already run:

  • PNPM_CONFIG_OFFLINE=true corepack pnpm test src/agents/embedded-agent-runner/session-manager-init.test.ts -- --reporter=dot passed the focused shard / 3 tests.
  • git diff --check -- src/agents/embedded-agent-runner/session-manager-init.ts src/agents/embedded-agent-runner/session-manager-init.test.ts passed.

The remaining live proof item needs a maintainer or safe disposable session fixture that can reproduce the production resume path without exposing a real user's transcript.

@yetval

yetval commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Thanks for jumping on this @charles-openclaw — but I think this fix guards the wrong layer, and #89037's data loss is still reproducible on this branch (ffb09b97).

The real truncation happens before prepareSessionManagerForRun runs

The production resume path is:

attempt.ts:1961 sessionManager = guardSessionManager(SessionManager.open(params.sessionFile), …)
 └─ SessionManager.open setSessionFile() // src/agents/sessions/session-manager.ts
 └─ loadEntriesFromFile() returns [] //:384 — corrupt header line is skipped, entries[0]
 // is a message, header check fails []
 └─ "empty or corrupted" branch //:725
 └─ this.rewriteFile() //:731 — TRUNCATES the file to a fresh header
...
attempt.ts:2023 await prepareSessionManagerForRun({ … }) // runs LATER

By the time prepareSessionManagerForRun is reached, SessionManager.open() has already overwritten the corrupt-header file with a brand-new valid header. So assertExistingHeaderIsReadable() reads that fresh, valid header, never throws, and the original transcript is long gone. The guard added here can't fire for the case in the issue.

Proof on this PR branch (ffb09b97)

I ran the exact #89037 repro through the real resume entrypoint (SessionManager.open), not a hand-built object:

// file = [crash-truncated-header, user-msg, assistant-msg] (3 lines)
writeFileSync(file, `${JSON.stringify(header).slice(0, 30)}\n${JSON.stringify(u)}\n${JSON.stringify(a)}\n`);
SessionManager.open(file); // what attempt.ts:1961 does on resume
const text = readFileSync(file, "utf8");
expect(text).toContain("important question"); // FAILS

Output (PR branch):

before: 3 lines
after: 1 lines: [ '{"type":"session","version":3,"id":"019e82c7-…' ] // new id, both messages destroyed
AssertionError: expected '…' to contain 'important question'

Why the new regression test passes anyway

session-manager-init.test.ts builds a fake sessionManager object literal whose fileEntries already contain a header, so it never exercises SessionManager.open() / setSessionFile() / loadEntriesFromFile() — i.e. it skips the layer that actually does the destructive rewriteFile(). That's why it's green while the real bug remains. A faithful regression should drive SessionManager.open(corruptFile) (or go through the attempt.ts resume path) and assert the bytes survive.

Suggested fix location

Move the guard down into setSessionFile()'s empty/corrupt branch (session-manager.ts:725-735), where the data is still intact:

  • Don't rewriteFile() when the file is non-empty on disk but only fails header validation.
  • Before any rewrite, back up the original bytes (e.g. <file>.corrupt-<ts>.jsonl) so history is recoverable.
  • Prefer recovery: if a type:"session" entry exists anywhere in the parsed entries, use it; otherwise synthesize a header and prepend it, keeping the existing message lines.
  • Only take the "fresh start" path when the raw file is genuinely empty (0 bytes / no parseable lines), not merely "no valid header at line 0".

The guard in prepareSessionManagerForRun is still a reasonable belt-and-suspenders for that helper's own fs.writeFile(sessionFile, ""), so it's fine to keep — but it isn't sufficient to close #89037 on its own.

Happy to push the session-manager.ts fix + a real SessionManager.open() regression onto this branch (or a follow-up) if that helps. cc @charles-openclaw

@charles-openclaw

Copy link
Copy Markdown
Contributor Author

Thanks, that reproduction is exactly right. I pushed follow-up commit 73f73255 moving the recovery into SessionManager.open() / setSessionFile(), before the destructive rewrite path can run.

Summary:

  • When an existing session file is non-empty but fails first-line header validation, setSessionFile() now backs up the original bytes to a .corrupt-*.jsonl sidecar before rewriting anything.
  • It recovers a valid session header found later in the parsed entries, or synthesizes a fresh header when only message entries are parseable.
  • It rewrites the active transcript as valid JSONL with the recovered/synthesized header plus the existing non-header entries, so the user/assistant messages survive SessionManager.open(corruptFile).
  • I kept the previous prepareSessionManagerForRun guard as the belt-and-suspenders check for that helper's reset path.

Verification:

  • Added a real SessionManager.open(sessionFile) regression for the corrupted first-line repro shape. The test asserts the user and assistant messages remain in the active transcript and the original bytes are present in a .corrupt-* backup.
  • Ran corepack pnpm exec oxfmt --write --threads=1 src/agents/sessions/session-manager.ts src/agents/sessions/session-manager.test.ts src/agents/embedded-agent-runner/session-manager-init.ts src/agents/embedded-agent-runner/session-manager-init.test.ts.
  • Ran PNPM_CONFIG_OFFLINE=true corepack pnpm test src/agents/sessions/session-manager.test.ts src/agents/embedded-agent-runner/session-manager-init.test.ts -- --reporter=dot -> 2 files / 4 tests passed.
  • Ran git diff --check -- src/agents/sessions/session-manager.ts src/agents/sessions/session-manager.test.ts src/agents/embedded-agent-runner/session-manager-init.ts src/agents/embedded-agent-runner/session-manager-init.test.ts -> passed.

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added the merge-risk: 🚨 security-boundary 🚨 May affect sandboxing, authorization, credentials, or sensitive data. label Jun 1, 2026
@charles-openclaw

Copy link
Copy Markdown
Contributor Author

Thanks, the backup-mode finding was right. I pushed follow-up commit 9d373724 to keep the persistent .corrupt-* sidecar under the same private-mode contract as the source transcript.

Summary:

  • recoverCorruptSessionEntries() now stats the original transcript, writes the backup with that mode, and explicitly chmods the sidecar so process umask does not leave a broader-readable copy.
  • The focused regression now chmods the disposable transcript to 0o600 and asserts the generated corrupt backup is also 0o600 on non-Windows platforms.

Verification:

  • corepack pnpm exec oxfmt --write --threads=1 src/agents/sessions/session-manager.ts src/agents/sessions/session-manager.test.ts
  • PNPM_CONFIG_OFFLINE=true corepack pnpm test src/agents/sessions/session-manager.test.ts src/agents/embedded-agent-runner/session-manager-init.test.ts -- --reporter=dot passed 2 files / 4 tests.
  • git diff --check -- src/agents/sessions/session-manager.ts src/agents/sessions/session-manager.test.ts src/agents/embedded-agent-runner/session-manager-init.ts src/agents/embedded-agent-runner/session-manager-init.test.ts

I still cannot honestly provide a real private production resume recording from this contributor environment without using local/private session state. The disposable filesystem-backed regression now covers the recovered transcript contents plus backup permissions, and the remaining live proof item is still best done by a maintainer or a safe disposable OpenClaw session setup.

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. labels Jun 1, 2026
@yetval

yetval commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Re-checked on 9d373724 — the recovery rework lands the fix where it matters. Verified against the real resume entrypoint (SessionManager.open(), i.e. attempt.ts:1961):

  • [corrupt-header, user-msg, assistant-msg] now survives the resume — both the user prompt and the assistant reply are preserved, and the original bytes are written to a private *.corrupt-<ts>.jsonl backup with the source mode.
  • corrupt-header + user-only assistant reply stays clean (1 header, 1 user).

So #89037's data loss is resolved. One small residual worth tightening before merge:

Header-only corrupt file duplicate header

When the corrupt file has no recoverable messages (e.g. a crash left only a half-written header line), recoverCorruptSessionEntries synthesizes [header], rewriteFile()s it, and sets flushed = true. The first userassistant turn then hits persist()'s no-assistant guard (flushed false) and re-appends the whole buffer, duplicating the header:

writeFileSync(file, `{"type":"session","version":3,"id":"sess`); // only a truncated header, no messages
const sm = SessionManager.open(file);
sm.appendMessage(msg("user", "hello"));
sm.appendMessage(msg("assistant", "hi"));
// final types: ['session','session','message','message'] -> headerCount: 2 (expected 1)

No data is lost here (there were no messages), so it's low severity — just a malformed two-header transcript. It's the same flushed/no-assistant fragility that createBranchedSession already guards against. Mirroring that here closes it:

// after building recoveredEntries
const hasAssistant = recoveredEntries.some(
 (e) => e.type === "message" && e.message.role === "assistant",
);
this.fileEntries = recoveredEntries;
const header = this.fileEntries.find((e) => e.type === "session");
this.sessionId = header?.id?? createSessionId();
this.buildIndex();
if (hasAssistant) {
 this.rewriteFile();
 this.flushed = true;
} else {
 this.flushed = false; // defer to persist(); avoids the duplicate-header re-flush
}
return;

(That also keeps the recovery branch consistent with the deferred-write contract documented in createBranchedSession.) Everything else looks good — nice turnaround. cc @charles-openclaw

@charles-openclaw

Copy link
Copy Markdown
Contributor Author

I added a disposable real-path proof run for the current PR head after the latest review. This did not touch any real user/session state; it used a temp transcript under /tmp with synthetic markers only.

Summary:

  • Ran the production SessionManager.open(disposableCorruptSession, tempDir, "/tmp/disposable-openclaw-proof") path against a malformed first-line session transcript.
  • The active transcript retained both later persisted entries: proof-user-1 and proof-assistant-1.
  • The active transcript still contained both synthetic markers after recovery: DISPOSABLE_PROOF_USER_MARKER_89050 and DISPOSABLE_PROOF_ASSISTANT_MARKER_89050.
  • Exactly one same-directory .corrupt-* backup was created, and its bytes matched the original disposable corrupt transcript.
  • The source transcript, recovered active transcript, and backup all had mode 0600 on this Linux runner.

Command shape used locally:

node --import tsx <<'NODE'
# creates a temp corrupt JSONL transcript, chmods it 0600,
# calls SessionManager.open(...), checks recovered entries and backup mode,
# prints only synthetic/redacted proof data, then removes the temp dir
NODE

Observed proof output:

{
  "tempRoot": "/tmp/<redacted-temp-dir>",
  "openCall": "SessionManager.open(disposableCorruptSession, tempDir, /tmp/disposable-openclaw-proof)",
  "beforeMode": "0600",
  "afterMode": "0600",
  "backupMode": "0600",
  "recoveredEntryIds": ["proof-user-1", "proof-assistant-1"],
  "activeTranscriptHasUserMarker": true,
  "activeTranscriptHasAssistantMarker": true,
  "backupByteLength": 390,
  "backupMatchesOriginal": true,
  "backupFileNamePattern": "disposable-corrupt-session.jsonl.corrupt-<timestamp>.jsonl"
}

The remaining policy question is intentional: this PR preserves the original corrupt bytes in a same-directory, same-mode .corrupt-* sidecar so users have a recovery artifact instead of silent truncation. I think that is the right tradeoff for this data-loss case, but it does need maintainer acceptance because it creates a persistent transcript backup.

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

🦞👀
ClawSweeper picked this up.

Command router queued. I will update this comment with the next step.

Re-review progress:

@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. and removed rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels Jun 1, 2026
@clawsweeper clawsweeper Bot added the status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. label Jun 1, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. label Jun 1, 2026
@openclaw-barnacle openclaw-barnacle Bot added triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. and removed proof: sufficient ClawSweeper judged the real behavior proof convincing. labels Jun 1, 2026
@charles-openclaw

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

Thanks, the header-only recovery edge case was a good catch. I pushed follow-up commit 20ae0d331 to avoid duplicating recovered session headers.

Summary:

  • Corrupt-session recovery now only rewrites immediately when the recovered transcript already includes an assistant message.
  • Header-only or user-only recovered transcripts defer to the normal first-assistant persist path.
  • The deferred full-buffer persist now writes the session buffer instead of appending it, so an old corrupt prefix or already-written header cannot produce a malformed two-header transcript.
  • Added a regression for the exact header-only corrupt file shape: after appending a user and assistant message, the on-disk types are session, message, message with exactly one session header.

Verification:

  • corepack pnpm exec oxfmt --write --threads=1 src/agents/sessions/session-manager.ts src/agents/sessions/session-manager.test.ts
  • PNPM_CONFIG_OFFLINE=true corepack pnpm test src/agents/sessions/session-manager.test.ts src/agents/embedded-agent-runner/session-manager-init.test.ts -- --reporter=dot passed 2 files / 5 tests.
  • git diff --check -- src/agents/sessions/session-manager.ts src/agents/sessions/session-manager.test.ts

@clawsweeper

clawsweeper Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. and removed rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. labels Jun 1, 2026
@steipete steipete self-assigned this Jun 2, 2026
@steipete

steipete commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Maintainer fix pushed: 30307017a8 (fix(sessions): preserve recovered corrupt transcripts).

What changed:

  • SessionManager.open() now rewrites every recovered corrupt-header transcript to valid JSONL immediately after creating the .corrupt-* backup, so the production initializer no longer sees unreadable disk bytes.
  • Recovered corrupt-header sessions now expose an internal recovery marker; prepareSessionManagerForRun() uses it to rewrite the header id/cwd while preserving recovered user-only rows instead of clearing them.
  • Added production-path regression coverage for corrupt header + user-only through SessionManager.open() -> prepareSessionManagerForRun() -> first assistant append.
  • Fixed the stale unused import and typed the new test messages so the prior CI failures are addressed.

Local proof on this branch after the fix:

  • pnpm test src/agents/sessions/session-manager.test.ts src/agents/embedded-agent-runner/session-manager-init.test.ts -- --reporter=dot -> 2 files / 6 tests passed.
  • pnpm tsgo:prod -> passed.
  • pnpm check:test-types -> passed.
  • pnpm lint --threads=8 -> passed.
  • git diff --check -- src/agents/sessions/session-manager.ts src/agents/sessions/session-manager.test.ts src/agents/embedded-agent-runner/session-manager-init.ts src/agents/embedded-agent-runner/session-manager-init.test.ts -> passed.
  • .agents/skills/autoreview/scripts/autoreview --mode local -> clean, no accepted/actionable findings.

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@openclaw-barnacle openclaw-barnacle Bot added proof: supplied External PR includes structured after-fix real behavior proof. and removed triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. labels Jun 2, 2026
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. and removed status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels Jun 2, 2026
@steipete

steipete commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Land-ready maintainer proof for head 30307017a8668bc7dbeec7c5ee5d1bd47743ad1c.

Work done:

  • Preserves corrupt-header transcripts by backing up the original bytes and immediately rewriting recovered active JSONL in SessionManager.open().
  • Preserves recovered user-only rows through prepareSessionManagerForRun() before the first assistant append.
  • Adds focused regression coverage for both recovered assistant transcripts and recovered user-only production init.

Local proof run:

  • pnpm test src/agents/sessions/session-manager.test.ts src/agents/embedded-agent-runner/session-manager-init.test.ts -- --reporter=dot
  • pnpm tsgo:prod
  • pnpm check:test-types
  • pnpm lint --threads=8
  • git diff --check -- src/agents/sessions/session-manager.ts src/agents/sessions/session-manager.test.ts src/agents/embedded-agent-runner/session-manager-init.ts src/agents/embedded-agent-runner/session-manager-init.test.ts
  • .agents/skills/autoreview/scripts/autoreview --mode local reported no accepted/actionable findings.

Real behavior proof:

Known proof gap:

  • Scan changed paths (precise) failed on synthetic merge SHA 26a91536b9582b52072d01b952e1d81033f3822f; the failing findings are outside this PR's four changed files. The PR diff is limited to session manager and session-manager-init files/tests.

@steipete steipete merged commit 2c48dd2 into openclaw:main Jun 2, 2026
190 of 197 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling merge-risk: 🚨 security-boundary 🚨 May affect sandboxing, authorization, credentials, or sensitive data. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P0 Emergency: data loss, security bypass, crash loop, or unusable core runtime. proof: sufficient ClawSweeper judged the real behavior proof convincing. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. size: M status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: resuming a session with a corrupted header line silently wipes the entire transcript (data loss)

3 participants