fix(sessions): opt out of fsync in session-store atomic write#76881
fix(sessions): opt out of fsync in session-store atomic write#76881mmartoccia wants to merge 1 commit intoopenclaw:mainfrom
Conversation
The session-store registry write goes through writeTextAtomic, which performs a tmp-file fsync and a parent-dir fsync inside the session-write-lock window. Under cron load, fsync queues behind the libuv threadpool and holds the lock for 30s+, starving the event loop and surfacing as the symptoms in openclaw#73655 (session-write-lock timeouts, polling stalls, cron-lane timeouts, sustained ~100% CPU across macOS / Pi5 Linux / 24-core Linux VM). Make fsync opt-out via a new `durable` option on writeTextAtomic and writeJsonAtomic (default `true`, preserving prior behavior for all existing callers). saveSessionStoreUnlocked now passes `durable: false` because the session-store registry is reconstructible from the per-session JSONL transcripts the gateway writes durably elsewhere; the atomic-rename guarantee against partial reads is preserved. Locally verified on macOS arm64 / Node 25.8.2 / 2026.4.29 with ~98 enabled cron jobs: session-write-lock cascade and gateway.err.log silent post-restart, listener pid stable. CPU pressure remains (other leaks per @bryangauvin's triad in openclaw#73655) but lock-driven cascade is gone. Refs: openclaw#73655 Refs: openclaw#74345
|
Codex review: needs changes before merge. Summary Reproducibility: yes. from source inspection: current main awaits temp-file and parent-dir fsync inside the session-store writer path, and the linked field reports show that path causing long lock holds under cron load. I did not run a live heavy-cron reproduction in this read-only review. Next step before merge Security Review findings
Review detailsBest possible solution: Land a narrow version that preserves default durable atomic writes, opts only the session-store registry out, accurately documents the crash-loss tradeoff, includes a changelog entry, and has direct tests for both durable modes. Do we have a high-confidence way to reproduce the issue? Yes, from source inspection: current main awaits temp-file and parent-dir fsync inside the session-store writer path, and the linked field reports show that path causing long lock holds under cron load. I did not run a live heavy-cron reproduction in this read-only review. Is this the best way to solve the issue? Mostly yes: a default-on Full review comments:
Overall correctness: patch is incorrect Acceptance criteria:
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 4e0e6f8ef324. |
|
Gentle ping on this — |
|
Adding a Pi5/ARM64 datapoint to support landing this — full reproduction is on the parent issue (#73655), short version below. Tested on Raspberry Pi 5 / ARM64 native systemd / Node 24.15. Built
Two things worth noting in addition to mmartoccia's macOS / heavy-cron-load reproduction:
Re: clawsweeper's [P2] feedback — agree the JSDoc should reflect that transcript-append is not currently fsync-backed; the accepted tradeoff is "session-store registry recovery from in-memory state on restart, with crash-window loss of recent metadata" rather than "transcript-backed durable recovery." Happy to retest once the rebase + changes-requested round lands; my build artefacts at |
|
Rebase note: current main moved atomic JSON/text writes into |
|
Thanks for putting this together. This is now superseded by the fix that landed on The fs-safe side landed through Current source proof on Closing this PR as superseded by the maintainer-landed version of the same session-store fsync strand. The broader #73655 umbrella should stay open for the remaining Manifest/plugin-lifecycle strands. |
|
Superseded by main; details in the maintainer comment above. |
Problem
Per #73655, the gateway's session-store atomic write holds the session-write-lock across
fsync. Under cron-driven write load, the libuv thread pool can't drainfsynccalls fast enough, lock holds run 30–66 seconds, and the event loop starves — surfacing as polling stalls, cron-lane timeouts, sustained ~100% CPU, and[session-write-lock] releasing lock held for ...mscascades.This has been reported on macOS arm64 + heavy cron (#73655 comment), Pi5 Linux (comment), and 24-core Linux VM (comment). The 24-core VM commenter (@slideshow-dingo) measured
eventLoopUtilization=1.0, identified thetmpHandle.sync()as the hot path, and recommended moving fsync outside the lock.What this PR does
Makes fsync opt-out on
writeTextAtomic/writeJsonAtomicvia a newdurable?: booleanoption (defaulttrue, preserving every existing caller's behavior).saveSessionStoreUnlockedopts out by passingdurable: false.Why this is safe for the session store specifically:
finally { fs.rm(tmp, ...) }) or a metadata-only loss of the latest registry entry. The gateway already writes per-session JSONL transcripts durably; the registry is reconstructible from those.writeTextAtomic/writeJsonAtomiccallers (config files, models, etc.) keepdurable: trueby default, so durability semantics are unchanged for everything outside the session store.Verified locally
Applied the equivalent comment-out patch to the installed dist on macOS arm64 / Node 25.8.2 / 2026.4.29 + ~98 enabled cron jobs:
[session-write-lock] releasing lock held for ...entries since gateway restart: 0 (was several per hour, with 15–66s holds)gateway.err.log: silent for 15+ minutes (was a continuous trickle)openclaw sessions --active 5 --all-agents: ~2s (was 4–12s per [Bug]: Bug Report: 80 Sessions Keep Recreating After Restart #75954)KeepAliveafter the process gave up)Residual ~100% CPU on the listener is not addressed by this change — that's a separate microtask/spread loop (likely
bryangauvin's leak #1 or #2 in the same issue), and I observed the V8 hot path shift away fromBuiltins_CopyDataPropertiesWithExcludedPropertiesafter the patch but the load remains. This PR is one of the three leak strands.Validation
pnpm checkclean (lint + typecheck + 8 other check stages all pass)json-filestests pass; 3 new tests added covering: durable=true (default) calls fsync, durable=false skips fsync but still atomically replaces,writeJsonAtomicthreads the option through.sessions.test.tstests pass (no regressions on the session-store path)fs.openand counting.sync()invocations on the returnedFileHandleDiff
3 files changed, 141 insertions, 11 deletions:
src/infra/json-files.ts:durable?option added with JSDoc; conditional fsync (default behavior unchanged)src/config/sessions/store.ts:writeSessionStoreAtomicpassesdurable: falsewith rationale commentsrc/infra/json-files.test.ts: 3 new testsRefs: #73655
Refs: #74345
Refs: #66646