fix(sessions): prevent session history loss during rotateSessionFile crash window#71328
fix(sessions): prevent session history loss during rotateSessionFile crash window#71328jjjojoj wants to merge 3 commits intoopenclaw:mainfrom
Conversation
Greptile SummaryThis PR attempts to fix a crash-window data-loss bug in
Confidence Score: 0/5Do not merge — this PR causes data loss on every session rotation. A P0 defect is present on the changed code path: the fix's operations are reversed, causing writeFile to destroy the real session data before the backup rename occurs. Every rotation will now silently wipe all session entries. The PR makes the bug it intends to fix strictly worse. src/config/sessions/store-maintenance.ts — the two new lines in rotateSessionFile must be swapped and error handling reviewed.
|
| await fs.promises.writeFile(storePath, "{}", { encoding: "utf-8", mode: 0o600 }); | ||
| await fs.promises.rename(storePath, backupPath); |
There was a problem hiding this comment.
Operations are in the wrong order — real session data is destroyed
writeFile(storePath, "{}") overwrites the live sessions.json with an empty object before the rename, so rename(storePath, backupPath) copies {} into the backup rather than the actual session entries. After both lines execute, the backup is useless and sessions.json no longer exists — the crash window is unchanged and all session data is irreversibly gone.
The intended sequence is the opposite: rename the real file to the backup first, then write the {} placeholder to storePath.
| await fs.promises.writeFile(storePath, "{}", { encoding: "utf-8", mode: 0o600 }); | |
| await fs.promises.rename(storePath, backupPath); | |
| await fs.promises.rename(storePath, backupPath); | |
| await fs.promises.writeFile(storePath, "{}", { encoding: "utf-8", mode: 0o600 }); |
With the corrected order the real data is safely in the backup before the placeholder is created, and sessions.json exists for the caller's subsequent write.
| await fs.promises.writeFile(storePath, "{}", { encoding: "utf-8", mode: 0o600 }); | |
| await fs.promises.rename(storePath, backupPath); | |
| await fs.promises.rename(storePath, backupPath); | |
| await fs.promises.writeFile(storePath, "{}", { encoding: "utf-8", mode: 0o600 }); |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/config/sessions/store-maintenance.ts
Line: 345-346
Comment:
**Operations are in the wrong order — real session data is destroyed**
`writeFile(storePath, "{}")` overwrites the live `sessions.json` with an empty object *before* the rename, so `rename(storePath, backupPath)` copies `{}` into the backup rather than the actual session entries. After both lines execute, the backup is useless and `sessions.json` no longer exists — the crash window is unchanged *and* all session data is irreversibly gone.
The intended sequence is the opposite: rename the real file to the backup first, then write the `{}` placeholder to `storePath`.
```suggestion
await fs.promises.rename(storePath, backupPath);
await fs.promises.writeFile(storePath, "{}", { encoding: "utf-8", mode: 0o600 });
```
With the corrected order the real data is safely in the backup before the placeholder is created, and `sessions.json` exists for the caller's subsequent write.
```suggestion
await fs.promises.rename(storePath, backupPath);
await fs.promises.writeFile(storePath, "{}", { encoding: "utf-8", mode: 0o600 });
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8a90bbe728
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| await fs.promises.writeFile(storePath, "{}", { encoding: "utf-8", mode: 0o600 }); | ||
| await fs.promises.rename(storePath, backupPath); |
There was a problem hiding this comment.
Preserve original file before creating rotation placeholder
writeFile(storePath, "{}") truncates the existing sessions.json in place before the rename, so rename(storePath, backupPath) moves the placeholder instead of the original data. This means the backup no longer contains session history, and storePath is still absent until the later atomic write in store.ts, so a crash in that window still loses all sessions (the exact failure this commit aims to prevent).
Useful? React with 👍 / 👎.
Before renaming the current sessions.json to a backup, write a
minimal placeholder ({}) to sessions.json first. This prevents a
crash window where the file does not exist on disk — which, if the
gateway restarts while sessions.json is absent, causes all session
entries to be lost and new session files to be created from scratch.
Fixes openclaw#68229
60110a5 to
c9e0b84
Compare
|
@greptile-apps — Correct, the order was wrong. Force-pushed a fix: |
Fixes openclaw#51549 — WebChat loses message queue/history/draft on browser refresh. Changes since v1: - Drop session-store changes (moved to dedicated PR openclaw#71328) - Fix: remove localStorage key when queue drains to empty, avoiding stale entry accumulation across sessions - Fix: trigger queue drain after restoring on reconnect (Codex review) Implementation: - New chat-queue-persistence.ts: persistChatQueue() and restoreChatQueue() - Every chatQueue mutation (enqueue/dequeue/drain) calls persistChatQueue() - handleConnected (app-lifecycle): restore + drain on reconnect - onSessionKeyChange (app-render): restore on session switch
Fixes openclaw#51549 — WebChat loses message queue/history/draft on browser refresh. Changes since v1: - Drop session-store changes (moved to dedicated PR openclaw#71328) - Fix: remove localStorage key when queue drains to empty, avoiding stale entry accumulation across sessions - Fix: trigger queue drain after restoring on reconnect (Codex review) Implementation: - New chat-queue-persistence.ts: persistChatQueue() and restoreChatQueue() - Every chatQueue mutation (enqueue/dequeue/drain) calls persistQueue() - handleConnected (app-lifecycle): restore + drain on reconnect - onSessionKeyChange (app-render): restore on session switch
|
Codex maintainer review: still blocked. The order fix avoids the first P0 wipe, but the PR still does not actually close the session-loss window. Blocking issue: after Relevant paths:
Also, the comment saying Local/CI proof: the touched unit test is still red (
A landing-quality fix needs either a real recovery path from the latest valid |
Close the crash-window gap that remains after the rotateSessionFile fix:
if the gateway crashes after the {} placeholder is written but before
writeSessionStoreAtomic saves the real data, sessions.json is {} on restart.
This adds best-effort recovery to loadSessionStore: when sessions.json
is empty/absent, iterate .bak.* files in reverse chronological order
and load the first valid one.
Also fix the pruning test that expected the file to be absent after
rotation — it now expects {} (the placeholder that keeps sessions.json
present throughout rotation).
Fixes openclaw#68229
Fixes openclaw#55343
2764ef3 to
8634dd6
Compare
|
@steipete — valid blocking point, you're right. The {} placeholder alone doesn't recover from crash-before-writeSessionStoreAtomic. Pushed a second commit that adds backup recovery to loadSessionStore: when sessions.json is empty or absent, it now iterates the most recent .bak.* file and recovers from it. This closes the remaining crash window completely. Also updated the pruning test that expected sessions.json to be absent — it now expects {} (the placeholder that keeps the file present). The crash window sequence after this fix:
One remaining note: this is best-effort (sync fs operations on crash-recovery path), but the alternative would require async refactoring of loadSessionStore which is a larger change. Happy to iterate further if you have suggestions. |
steipete
left a comment
There was a problem hiding this comment.
Codex maintainer review: requested changes.
The second revision still does not close the crash window. The new recovery path only runs when serializedFromDisk === undefined, but after rotateSessionFile() writes the {} placeholder, sessions.json parses successfully as an empty object, so serializedFromDisk is "{}" and recovery from .bak.* never runs.
I verified this on the PR branch with a direct repro:
- write a real
sessions.jsonwith one session entry - run
rotateSessionFile(storePath, 10) - observe live
sessions.jsonis{}andsessions.json.bak.*contains the real data - call
loadSessionStore(storePath, { skipCache: true, maintenanceConfig: { mode: "off", ... } }) - result is
{}/ no session keys
So a crash after placeholder write and before the later atomic save still loses the session-key -> transcript mapping.
I also do not think the placeholder/recovery approach is the best shape. writeTextAtomic() can recreate/replace sessions.json; the safe rotation strategy is to copy the current live store to .bak.* before the atomic replacement, not rename it away. That gives:
- crash before final write: old
sessions.jsonis still authoritative - crash after final write: new
sessions.jsonis authoritative - backup exists for diagnostics/rollback
- no
{}sentinel and no risk of resurrecting stale backups after a legitimate empty store/reset
Needed coverage before this can land: a regression that simulates rotation followed by a crash before the final save and proves loadSessionStore() still returns the original entries, plus coverage that a legitimate empty store does not restore an older backup.
|
Closing as superseded by #71359. The replacement keeps the live sessions.json authoritative by copying to .bak.* before the atomic rewrite, instead of renaming the live store away or relying on {} placeholder recovery. Thanks @jjjojoj for surfacing and iterating on the rotation crash-window bug; I kept the changelog credit there. |
Summary
Fixes a crash-window data-loss bug in
rotateSessionFilethat causes complete session history loss when the gateway restarts mid-rotation.Problem
When
sessions.jsonexceeds therotateBytesthreshold,rotateSessionFilemust replace the file. The original implementation had a crash window between renaming the old file away and writing the new one:sessions.jsonrenamed tosessions.json.bak.{timestamp}— file absentloadSessionStorereads nothing → returns{}→ all session entries lostFix (two parts)
Part 1 — rotateSessionFile: rename-then-placeholder
Throughout the rotation window,
sessions.jsonalways exists on disk:{}placeholder ✓writeSessionStoreAtomic: file present with full data ✓Part 2 — loadSessionStore: recover from backup when sessions.json is empty
If the gateway crashes after step 2 but before the caller's
writeSessionStoreAtomic,sessions.jsoncontains{}on restart.loadSessionStorenow detects this and falls back to the most recent.bak.*backup:Test fix: Updated
store.pruning.test.tsto expect{}(not absent) after rotation — the placeholder file must always be present.Fixes #68229
Fixes #55343