Skip to content

fix(sessions): opt out of fsync in session-store atomic write#76881

Closed
mmartoccia wants to merge 1 commit intoopenclaw:mainfrom
mmartoccia:fix/sessions-store-async-durability
Closed

fix(sessions): opt out of fsync in session-store atomic write#76881
mmartoccia wants to merge 1 commit intoopenclaw:mainfrom
mmartoccia:fix/sessions-store-async-durability

Conversation

@mmartoccia
Copy link
Copy Markdown

Problem

Per #73655, the gateway's session-store atomic write holds the session-write-lock across fsync. Under cron-driven write load, the libuv thread pool can't drain fsync calls fast enough, lock holds run 30–66 seconds, and the event loop starves — surfacing as polling stalls, cron-lane timeouts, sustained ~100% CPU, and [session-write-lock] releasing lock held for ...ms cascades.

This has been reported on macOS arm64 + heavy cron (#73655 comment), Pi5 Linux (comment), and 24-core Linux VM (comment). The 24-core VM commenter (@slideshow-dingo) measured eventLoopUtilization=1.0, identified the tmpHandle.sync() as the hot path, and recommended moving fsync outside the lock.

What this PR does

Makes fsync opt-out on writeTextAtomic / writeJsonAtomic via a new durable?: boolean option (default true, preserving every existing caller's behavior). saveSessionStoreUnlocked opts out by passing durable: false.

Why this is safe for the session store specifically:

  • The atomic-rename guarantee is preserved — concurrent readers never see a torn write, because the write still goes through tmp-file → rename.
  • Worst case on an OS-level crash mid-write is a partial tmp file (cleaned up by the existing finally { fs.rm(tmp, ...) }) or a metadata-only loss of the latest registry entry. The gateway already writes per-session JSONL transcripts durably; the registry is reconstructible from those.
  • All other writeTextAtomic / writeJsonAtomic callers (config files, models, etc.) keep durable: true by default, so durability semantics are unchanged for everything outside the session store.

Verified locally

Applied the equivalent comment-out patch to the installed dist on macOS arm64 / Node 25.8.2 / 2026.4.29 + ~98 enabled cron jobs:

  • [session-write-lock] releasing lock held for ... entries since gateway restart: 0 (was several per hour, with 15–66s holds)
  • gateway.err.log: silent for 15+ minutes (was a continuous trickle)
  • openclaw sessions --active 5 --all-agents: ~2s (was 4–12s per [Bug]: Bug Report: 80 Sessions Keep Recreating After Restart #75954)
  • Listener pid stable since restart (was respawning every ~55 min from launchd KeepAlive after the process gave up)

Residual ~100% CPU on the listener is not addressed by this change — that's a separate microtask/spread loop (likely bryangauvin's leak #1 or #2 in the same issue), and I observed the V8 hot path shift away from Builtins_CopyDataPropertiesWithExcludedProperties after the patch but the load remains. This PR is one of the three leak strands.

Validation

  • pnpm check clean (lint + typecheck + 8 other check stages all pass)
  • 14 json-files tests pass; 3 new tests added covering: durable=true (default) calls fsync, durable=false skips fsync but still atomically replaces, writeJsonAtomic threads the option through.
  • 27 sessions.test.ts tests pass (no regressions on the session-store path)
  • Verified by spying on fs.open and counting .sync() invocations on the returned FileHandle

Diff

3 files changed, 141 insertions, 11 deletions:

  • src/infra/json-files.ts: durable? option added with JSDoc; conditional fsync (default behavior unchanged)
  • src/config/sessions/store.ts: writeSessionStoreAtomic passes durable: false with rationale comment
  • src/infra/json-files.test.ts: 3 new tests

Refs: #73655
Refs: #74345
Refs: #66646

The session-store registry write goes through writeTextAtomic, which
performs a tmp-file fsync and a parent-dir fsync inside the
session-write-lock window. Under cron load, fsync queues behind the
libuv threadpool and holds the lock for 30s+, starving the event loop
and surfacing as the symptoms in openclaw#73655 (session-write-lock
timeouts, polling stalls, cron-lane timeouts, sustained ~100% CPU
across macOS / Pi5 Linux / 24-core Linux VM).

Make fsync opt-out via a new `durable` option on writeTextAtomic and
writeJsonAtomic (default `true`, preserving prior behavior for all
existing callers). saveSessionStoreUnlocked now passes `durable: false`
because the session-store registry is reconstructible from the
per-session JSONL transcripts the gateway writes durably elsewhere; the
atomic-rename guarantee against partial reads is preserved.

Locally verified on macOS arm64 / Node 25.8.2 / 2026.4.29 with ~98
enabled cron jobs: session-write-lock cascade and gateway.err.log
silent post-restart, listener pid stable. CPU pressure remains (other
leaks per @bryangauvin's triad in openclaw#73655) but lock-driven cascade is
gone.

Refs: openclaw#73655
Refs: openclaw#74345
@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented May 3, 2026

Codex review: needs changes before merge.

Summary
Adds a durable option to atomic JSON/text writes, makes the session-store registry call pass durable: false, and adds focused json-files tests.

Reproducibility: yes. from source inspection: current main awaits temp-file and parent-dir fsync inside the session-store writer path, and the linked field reports show that path causing long lock holds under cron load. I did not run a live heavy-cron reproduction in this read-only review.

Next step before merge
A narrow automated repair can correct the durability wording/test proof and add the missing changelog entry on the PR branch.

Security
Cleared: The diff introduces no dependency, workflow, secret, permission, or supply-chain changes; file mode remains 0o600.

Review findings

  • [P2] Stop overclaiming transcript durability — src/config/sessions/store.ts:506-508
  • [P3] Add the required changelog entry — src/config/sessions/store.ts:509-512
  • [P3] Assert the default fsync calls directly — src/infra/json-files.test.ts:105-118
Review details

Best possible solution:

Land a narrow version that preserves default durable atomic writes, opts only the session-store registry out, accurately documents the crash-loss tradeoff, includes a changelog entry, and has direct tests for both durable modes.

Do we have a high-confidence way to reproduce the issue?

Yes, from source inspection: current main awaits temp-file and parent-dir fsync inside the session-store writer path, and the linked field reports show that path causing long lock holds under cron load. I did not run a live heavy-cron reproduction in this read-only review.

Is this the best way to solve the issue?

Mostly yes: a default-on durable option with a session-registry-only opt-out is the narrowest code shape. The patch should not claim durable transcript recovery unless that is added, and it needs the required changelog entry before merge.

Full review comments:

  • [P2] Stop overclaiming transcript durability — src/config/sessions/store.ts:506-508
    The new rationale says the registry is reconstructible from per-session JSONL transcripts the gateway writes durably, but the current transcript append helper uses fs.writeFile/fs.appendFile without an fsync. Since this is the safety case for disabling fsync on sessions.json, please either add real transcript durability or reword the comment/JSDoc to state the weaker accepted crash-loss tradeoff.
    Confidence: 0.87
  • [P3] Add the required changelog entry — src/config/sessions/store.ts:509-512
    This is a user-visible gateway reliability/performance fix, but the PR does not update CHANGELOG.md. Repo policy requires user-facing fixes/perf changes to be recorded under the active Unreleased section, with contributor credit when known.
    Confidence: 0.9
  • [P3] Assert the default fsync calls directly — src/infra/json-files.test.ts:105-118
    The new default-durability test only proves the file was written and that at least two handles were opened; it does not count sync() calls. A regression that drops the temp-file fsync while still opening the parent directory could pass this test, so mirror the later syncCount spy pattern for the default path too.
    Confidence: 0.76

Overall correctness: patch is incorrect
Overall confidence: 0.82

Acceptance criteria:

  • pnpm test src/infra/json-files.test.ts src/config/sessions/sessions.test.ts
  • pnpm check:changed

What I checked:

Likely related people:

  • steipete: Recent history shows Peter Steinberger authored the session-store writer queue, session-store maintenance changes, and earlier atomic JSON write extraction/fallback work that this PR builds on. (role: recent maintainer and likely follow-up owner; confidence: high; commits: b4437047f477, fb40ed99a7f0, 003f52db98ab; files: src/config/sessions/store.ts, src/config/sessions/store-writer.ts, src/infra/json-files.ts)
  • vincentkoc: Recent API history for src/infra/json-files.ts includes Vincent Koc changes around persisted registry and ledger path handling, which share the atomic JSON helper surface. (role: adjacent infra maintainer; confidence: medium; commits: 9bd348fdec59, f56bf63b0602, 521e75dea003; files: src/infra/json-files.ts)

Remaining risk / open question:

  • The PR intentionally weakens crash durability for the session-store registry; the current comment overstates recovery because transcript appends are not fsync-backed in the inspected source.
  • The exact failing check job log was not available through the public check-run annotation, so CI still needs to be green before merge.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 4e0e6f8ef324.

@mmartoccia
Copy link
Copy Markdown
Author

Gentle ping on this — v2026.5.3 shipped today and the same tmpHandle.sync() / dirHandle.sync() are still on the hot path in src/infra/json-files.ts. On my host the local equivalent of this patch has been holding for ~18 hours straight (no respawn cycle, zero [session-write-lock] entries) where pre-patch it was dying every ~55 minutes. Happy to address review feedback on the durable: false API shape if there's a concern there.

@jsc2304
Copy link
Copy Markdown

jsc2304 commented May 6, 2026

Adding a Pi5/ARM64 datapoint to support landing this — full reproduction is on the parent issue (#73655), short version below.

Tested on Raspberry Pi 5 / ARM64 native systemd / Node 24.15. Built origin/main (103cdd9d, which already contains this PR's 098ace7e), ran the gateway in foreground, no other patches applied:

RPC 2026.5.2 stable (no patch) main with this patch
status 2854–3948 ms 789–1142 ms
config.get 3736–4665 ms 823–1177 ms
openclaw gateway probe Read step (15 s budget) 0/8 ok 8/8 ok
Steady-state CPU 0% 0%
[session-write-lock] releasing lock held for ... markers none none
[diagnostic] liveness warning: event_loop_delay none none

Two things worth noting in addition to mmartoccia's macOS / heavy-cron-load reproduction:

  1. The fix value extends beyond the CPU-peg failure mode. On this host the headline 4.26-4.29 CPU-peg was already addressed by the earlier 5.2 main-fix cluster (Wizard: bound hatch TUI timeout #76241, [Bug]: 4.29 dispatch prep stages take ~73s of synchronous CPU work, blocking event loop #75999, [Bug]: After enabling the gateway, it keeps timing out and reconnecting repeatedly #75944), so the gateway runs at 0% idle CPU and emits zero liveness warning lines. The session-store fsync hold was the only thing left blocking status/config.get aggregation reads on this setup. Without this patch, openclaw gateway probe cannot return Read probe: ok reliably even on an idle gateway.
  2. No durability regression observed in the round-trip. Memory-graph state preserved across stop → patched-main → kill → 4.23-restart (1486 entities / 2626 relations / 113 sources both before and after). The session-store registry rehydrated fine on the 4.23 restart.

Re: clawsweeper's [P2] feedback — agree the JSDoc should reflect that transcript-append is not currently fsync-backed; the accepted tradeoff is "session-store registry recovery from in-memory state on restart, with crash-window loss of recent metadata" rather than "transcript-backed durable recovery." Happy to retest once the rebase + changes-requested round lands; my build artefacts at 103cdd9d are still on disk.

@sallyom sallyom self-assigned this May 6, 2026
@sallyom
Copy link
Copy Markdown
Contributor

sallyom commented May 6, 2026

Rebase note: current main moved atomic JSON/text writes into @openclaw/fs-safe, so the durable: false option needs to land there first. I’m working on that now.

@steipete
Copy link
Copy Markdown
Contributor

steipete commented May 8, 2026

Thanks for putting this together. This is now superseded by the fix that landed on main.

The fs-safe side landed through @openclaw/fs-safe@0.2.0, which added writeTextAtomic(..., { durable: false }), and OpenClaw main now opts the session-store index path out of durable fsync while keeping the atomic rename and 0600 mode:

Current source proof on main: src/config/sessions/store.ts calls writeTextAtomic(params.storePath, params.serialized, { durable: false, mode: 0o600 }).

Closing this PR as superseded by the maintainer-landed version of the same session-store fsync strand. The broader #73655 umbrella should stay open for the remaining Manifest/plugin-lifecycle strands.

@steipete
Copy link
Copy Markdown
Contributor

steipete commented May 8, 2026

Superseded by main; details in the maintainer comment above.

@steipete steipete closed this May 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants