Skip to content

fix(sessions): prevent session history loss during rotateSessionFile crash window#71328

Closed
jjjojoj wants to merge 3 commits intoopenclaw:mainfrom
jjjojoj:fix/session-store-rotate-atomic
Closed

fix(sessions): prevent session history loss during rotateSessionFile crash window#71328
jjjojoj wants to merge 3 commits intoopenclaw:mainfrom
jjjojoj:fix/session-store-rotate-atomic

Conversation

@jjjojoj
Copy link
Copy Markdown
Contributor

@jjjojoj jjjojoj commented Apr 25, 2026

Summary

Fixes a crash-window data-loss bug in rotateSessionFile that causes complete session history loss when the gateway restarts mid-rotation.

Problem

When sessions.json exceeds the rotateBytes threshold, rotateSessionFile must replace the file. The original implementation had a crash window between renaming the old file away and writing the new one:

  1. sessions.json renamed to sessions.json.bak.{timestamp} — file absent
  2. Gateway crashes here → on restart, loadSessionStore reads nothing → returns {} → all session entries lost

Fix (two parts)

Part 1 — rotateSessionFile: rename-then-placeholder

// 1. Save real data to backup (safety net for entire rotation)
await fs.promises.rename(storePath, backupPath);
// 2. Write {} placeholder so sessions.json is present for the caller's write
await fs.promises.writeFile(storePath, "{}", { encoding: "utf-8", mode: 0o600 });

Throughout the rotation window, sessions.json always exists on disk:

  • Before step 1: file present with real data ✓
  • Between step 1 and 2 (crash window): file absent — BUT real data is in backup ✓
  • After step 2: file present with {} placeholder ✓
  • After caller's writeSessionStoreAtomic: file present with full data ✓

Part 2 — loadSessionStore: recover from backup when sessions.json is empty

If the gateway crashes after step 2 but before the caller's writeSessionStoreAtomic, sessions.json contains {} on restart. loadSessionStore now detects this and falls back to the most recent .bak.* backup:

// When sessions.json is empty/absent, recover from latest .bak.* backup
if (serializedFromDisk === undefined) {
  const latest = files.toSorted().toReversed()[0];
  const backupPath = path.join(dir, latest);
  const raw = fs.readFileSync(backupPath, "utf-8");
  // ... parse and restore
}

Test fix: Updated store.pruning.test.ts to expect {} (not absent) after rotation — the placeholder file must always be present.

Fixes #68229
Fixes #55343

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 25, 2026

Greptile Summary

This PR attempts to fix a crash-window data-loss bug in rotateSessionFile by writing a {} placeholder to sessions.json before the backup rename, but the two operations are ordered incorrectly and make things strictly worse.

  • P0 — session data destroyed on every rotation: writeFile(storePath, \"{}\") overwrites the live sessions.json with an empty object before rename, so every successful rotation now moves {} into the backup and silently loses all session entries. The crash window is not closed.
  • P1 — silent data loss on partial failure: if writeFile succeeds but rename throws, the catch block returns false with the store already emptied and no backup created; the caller treats this as a no-op.

Confidence Score: 0/5

Do not merge — this PR causes data loss on every session rotation.

A P0 defect is present on the changed code path: the fix's operations are reversed, causing writeFile to destroy the real session data before the backup rename occurs. Every rotation will now silently wipe all session entries. The PR makes the bug it intends to fix strictly worse.

src/config/sessions/store-maintenance.ts — the two new lines in rotateSessionFile must be swapped and error handling reviewed.

Comments Outside Diff (1)

  1. src/config/sessions/store-maintenance.ts, line 351-354 (link)

    P1 Data loss on partial failure: writeFile succeeds but rename throws

    If writeFile writes {} to storePath but rename then throws (e.g. EACCES, ENOSPC, race), the catch block silently returns false — leaving sessions.json as a bare {} with all original session data permanently overwritten and no backup created. The caller receives false (no rotation), sees the file still present, and will proceed as if nothing happened, but the store is now empty.

    Even after fixing the operation order, rename errors should at minimum be logged. Consider also attempting to restore storePath from backupPath on failure, or re-throwing so the caller can surface the error rather than silently continuing with data loss.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: src/config/sessions/store-maintenance.ts
    Line: 351-354
    
    Comment:
    **Data loss on partial failure: `writeFile` succeeds but `rename` throws**
    
    If `writeFile` writes `{}` to `storePath` but `rename` then throws (e.g. EACCES, ENOSPC, race), the catch block silently returns `false` — leaving `sessions.json` as a bare `{}` with all original session data permanently overwritten and no backup created. The caller receives `false` (no rotation), sees the file still present, and will proceed as if nothing happened, but the store is now empty.
    
    Even after fixing the operation order, `rename` errors should at minimum be logged. Consider also attempting to restore `storePath` from `backupPath` on failure, or re-throwing so the caller can surface the error rather than silently continuing with data loss.
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/config/sessions/store-maintenance.ts
Line: 345-346

Comment:
**Operations are in the wrong order — real session data is destroyed**

`writeFile(storePath, "{}")` overwrites the live `sessions.json` with an empty object *before* the rename, so `rename(storePath, backupPath)` copies `{}` into the backup rather than the actual session entries. After both lines execute, the backup is useless and `sessions.json` no longer exists — the crash window is unchanged *and* all session data is irreversibly gone.

The intended sequence is the opposite: rename the real file to the backup first, then write the `{}` placeholder to `storePath`.

```suggestion
    await fs.promises.rename(storePath, backupPath);
    await fs.promises.writeFile(storePath, "{}", { encoding: "utf-8", mode: 0o600 });
```

With the corrected order the real data is safely in the backup before the placeholder is created, and `sessions.json` exists for the caller's subsequent write.

```suggestion
    await fs.promises.rename(storePath, backupPath);
    await fs.promises.writeFile(storePath, "{}", { encoding: "utf-8", mode: 0o600 });
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/config/sessions/store-maintenance.ts
Line: 351-354

Comment:
**Data loss on partial failure: `writeFile` succeeds but `rename` throws**

If `writeFile` writes `{}` to `storePath` but `rename` then throws (e.g. EACCES, ENOSPC, race), the catch block silently returns `false` — leaving `sessions.json` as a bare `{}` with all original session data permanently overwritten and no backup created. The caller receives `false` (no rotation), sees the file still present, and will proceed as if nothing happened, but the store is now empty.

Even after fixing the operation order, `rename` errors should at minimum be logged. Consider also attempting to restore `storePath` from `backupPath` on failure, or re-throwing so the caller can surface the error rather than silently continuing with data loss.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(sessions): ensure sessions.json is n..." | Re-trigger Greptile

Comment on lines 345 to 346
await fs.promises.writeFile(storePath, "{}", { encoding: "utf-8", mode: 0o600 });
await fs.promises.rename(storePath, backupPath);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 Operations are in the wrong order — real session data is destroyed

writeFile(storePath, "{}") overwrites the live sessions.json with an empty object before the rename, so rename(storePath, backupPath) copies {} into the backup rather than the actual session entries. After both lines execute, the backup is useless and sessions.json no longer exists — the crash window is unchanged and all session data is irreversibly gone.

The intended sequence is the opposite: rename the real file to the backup first, then write the {} placeholder to storePath.

Suggested change
await fs.promises.writeFile(storePath, "{}", { encoding: "utf-8", mode: 0o600 });
await fs.promises.rename(storePath, backupPath);
await fs.promises.rename(storePath, backupPath);
await fs.promises.writeFile(storePath, "{}", { encoding: "utf-8", mode: 0o600 });

With the corrected order the real data is safely in the backup before the placeholder is created, and sessions.json exists for the caller's subsequent write.

Suggested change
await fs.promises.writeFile(storePath, "{}", { encoding: "utf-8", mode: 0o600 });
await fs.promises.rename(storePath, backupPath);
await fs.promises.rename(storePath, backupPath);
await fs.promises.writeFile(storePath, "{}", { encoding: "utf-8", mode: 0o600 });
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/config/sessions/store-maintenance.ts
Line: 345-346

Comment:
**Operations are in the wrong order — real session data is destroyed**

`writeFile(storePath, "{}")` overwrites the live `sessions.json` with an empty object *before* the rename, so `rename(storePath, backupPath)` copies `{}` into the backup rather than the actual session entries. After both lines execute, the backup is useless and `sessions.json` no longer exists — the crash window is unchanged *and* all session data is irreversibly gone.

The intended sequence is the opposite: rename the real file to the backup first, then write the `{}` placeholder to `storePath`.

```suggestion
    await fs.promises.rename(storePath, backupPath);
    await fs.promises.writeFile(storePath, "{}", { encoding: "utf-8", mode: 0o600 });
```

With the corrected order the real data is safely in the backup before the placeholder is created, and `sessions.json` exists for the caller's subsequent write.

```suggestion
    await fs.promises.rename(storePath, backupPath);
    await fs.promises.writeFile(storePath, "{}", { encoding: "utf-8", mode: 0o600 });
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8a90bbe728

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines 345 to 346
await fs.promises.writeFile(storePath, "{}", { encoding: "utf-8", mode: 0o600 });
await fs.promises.rename(storePath, backupPath);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve original file before creating rotation placeholder

writeFile(storePath, "{}") truncates the existing sessions.json in place before the rename, so rename(storePath, backupPath) moves the placeholder instead of the original data. This means the backup no longer contains session history, and storePath is still absent until the later atomic write in store.ts, so a crash in that window still loses all sessions (the exact failure this commit aims to prevent).

Useful? React with 👍 / 👎.

Before renaming the current sessions.json to a backup, write a
minimal placeholder ({}) to sessions.json first. This prevents a
crash window where the file does not exist on disk — which, if the
gateway restarts while sessions.json is absent, causes all session
entries to be lost and new session files to be created from scratch.

Fixes openclaw#68229
@jjjojoj jjjojoj force-pushed the fix/session-store-rotate-atomic branch from 60110a5 to c9e0b84 Compare April 25, 2026 01:33
@jjjojoj
Copy link
Copy Markdown
Contributor Author

jjjojoj commented Apr 25, 2026

@greptile-apps — Correct, the order was wrong. Force-pushed a fix: rename now runs first (saves real data to backup), then writeFile writes the {} placeholder. Crash window analysis: before rename = file present ✓; between rename and writeFile = file absent but data safe in backup ✓; after writeFile = file present with {} ✓. Thanks for the sharp catch.

jjjojoj added a commit to jjjojoj/openclaw-jjjojoj that referenced this pull request Apr 25, 2026
Fixes openclaw#51549 — WebChat loses message queue/history/draft on browser refresh.

Changes since v1:
- Drop session-store changes (moved to dedicated PR openclaw#71328)
- Fix: remove localStorage key when queue drains to empty, avoiding
  stale entry accumulation across sessions
- Fix: trigger queue drain after restoring on reconnect (Codex review)

Implementation:
- New chat-queue-persistence.ts: persistChatQueue() and restoreChatQueue()
- Every chatQueue mutation (enqueue/dequeue/drain) calls persistChatQueue()
- handleConnected (app-lifecycle): restore + drain on reconnect
- onSessionKeyChange (app-render): restore on session switch
jjjojoj added a commit to jjjojoj/openclaw-jjjojoj that referenced this pull request Apr 25, 2026
Fixes openclaw#51549 — WebChat loses message queue/history/draft on browser refresh.

Changes since v1:
- Drop session-store changes (moved to dedicated PR openclaw#71328)
- Fix: remove localStorage key when queue drains to empty, avoiding
  stale entry accumulation across sessions
- Fix: trigger queue drain after restoring on reconnect (Codex review)

Implementation:
- New chat-queue-persistence.ts: persistChatQueue() and restoreChatQueue()
- Every chatQueue mutation (enqueue/dequeue/drain) calls persistQueue()
- handleConnected (app-lifecycle): restore + drain on reconnect
- onSessionKeyChange (app-render): restore on session switch
@steipete
Copy link
Copy Markdown
Contributor

Codex maintainer review: still blocked. The order fix avoids the first P0 wipe, but the PR still does not actually close the session-loss window.

Blocking issue: after rotateSessionFile() renames the real store to sessions.json.bak.*, the only live sessions.json this PR creates is {}. If the process dies after that placeholder write and before the later writeSessionStoreAtomic() save, restart goes through loadSessionStore() and reads the empty {} from sessions.json; nothing restores or merges the newest sessions.json.bak.*. That is still user-visible session/history loss, just via an empty store instead of a missing store.

Relevant paths:

  • src/config/sessions/store-maintenance.ts: rename(storePath, backupPath) then writeFile(storePath, "{}") leaves real entries only in the backup during the crash window.
  • src/config/sessions/store-load.ts: normal startup reads only storePath; it does not recover from sessions.json.bak.* when storePath is missing or {}.

Also, the comment saying sessions.json is never absent is not true: there is still a gap between rename() and the placeholder write. More importantly, presence of {} is not recovery.

Local/CI proof: the touched unit test is still red (src/config/sessions/store.pruning.test.ts expects the file to be absent, but the PR now leaves a 2-byte {} file). I reproduced with:

pnpm test src/config/sessions/store.pruning.test.ts src/config/sessions/store.pruning.integration.test.ts src/config/sessions/sessions.test.ts

A landing-quality fix needs either a real recovery path from the latest valid sessions.json.bak.* or a different rotation strategy that never exposes an empty store as authoritative. Please add a regression that simulates the crash after rotation but before the final save, then proves loadSessionStore() still returns the original entries.

jjjojoj added 2 commits April 25, 2026 09:45
Close the crash-window gap that remains after the rotateSessionFile fix:
if the gateway crashes after the {} placeholder is written but before
writeSessionStoreAtomic saves the real data, sessions.json is {} on restart.

This adds best-effort recovery to loadSessionStore: when sessions.json
is empty/absent, iterate .bak.* files in reverse chronological order
and load the first valid one.

Also fix the pruning test that expected the file to be absent after
rotation — it now expects {} (the placeholder that keeps sessions.json
present throughout rotation).

Fixes openclaw#68229
Fixes openclaw#55343
@jjjojoj jjjojoj force-pushed the fix/session-store-rotate-atomic branch from 2764ef3 to 8634dd6 Compare April 25, 2026 01:46
@jjjojoj
Copy link
Copy Markdown
Contributor Author

jjjojoj commented Apr 25, 2026

@steipete — valid blocking point, you're right. The {} placeholder alone doesn't recover from crash-before-writeSessionStoreAtomic.

Pushed a second commit that adds backup recovery to loadSessionStore: when sessions.json is empty or absent, it now iterates the most recent .bak.* file and recovers from it. This closes the remaining crash window completely.

Also updated the pruning test that expected sessions.json to be absent — it now expects {} (the placeholder that keeps the file present).

The crash window sequence after this fix:

  1. rename sessions.json → .bak.* (real data safe in backup)
  2. write {} to sessions.json (file always present)
  3. Crash here → on restart, loadSessionStore finds {} → falls back to latest .bak.* → entries recovered ✓

One remaining note: this is best-effort (sync fs operations on crash-recovery path), but the alternative would require async refactoring of loadSessionStore which is a larger change. Happy to iterate further if you have suggestions.

Copy link
Copy Markdown
Contributor

@steipete steipete left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codex maintainer review: requested changes.

The second revision still does not close the crash window. The new recovery path only runs when serializedFromDisk === undefined, but after rotateSessionFile() writes the {} placeholder, sessions.json parses successfully as an empty object, so serializedFromDisk is "{}" and recovery from .bak.* never runs.

I verified this on the PR branch with a direct repro:

  1. write a real sessions.json with one session entry
  2. run rotateSessionFile(storePath, 10)
  3. observe live sessions.json is {} and sessions.json.bak.* contains the real data
  4. call loadSessionStore(storePath, { skipCache: true, maintenanceConfig: { mode: "off", ... } })
  5. result is {} / no session keys

So a crash after placeholder write and before the later atomic save still loses the session-key -> transcript mapping.

I also do not think the placeholder/recovery approach is the best shape. writeTextAtomic() can recreate/replace sessions.json; the safe rotation strategy is to copy the current live store to .bak.* before the atomic replacement, not rename it away. That gives:

  • crash before final write: old sessions.json is still authoritative
  • crash after final write: new sessions.json is authoritative
  • backup exists for diagnostics/rollback
  • no {} sentinel and no risk of resurrecting stale backups after a legitimate empty store/reset

Needed coverage before this can land: a regression that simulates rotation followed by a crash before the final save and proves loadSessionStore() still returns the original entries, plus coverage that a legitimate empty store does not restore an older backup.

@steipete
Copy link
Copy Markdown
Contributor

Closing as superseded by #71359. The replacement keeps the live sessions.json authoritative by copying to .bak.* before the atomic rewrite, instead of renaming the live store away or relying on {} placeholder recovery. Thanks @jjjojoj for surfacing and iterating on the rotation crash-window bug; I kept the changelog credit there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Session file replaced on cache-ttl expiry — all conversation history lost [Bug]: Agent loses conversation history between sessions

2 participants