Atomic write helper leaks orphan .tmp files; gateway never sweeps them (51 GB observed in the wild) #89520
Open
Labels
P2, clawsweeper:fix-shape-clear, clawsweeper:queueable-fix, clawsweeper:source-repro, impact:other, impact:session-state, issue-rating: 🦞 diamond lobster
TL;DR
`src/infra/json-files.ts` writes JSON atomically via `<path>.<UUID>.tmp` → `rename`. If the process dies between `writeFile` and `rename` (SIGKILL, OOM kill, hard shutdown, system sleep at the wrong moment), the `finally` block never runs and the `.tmp` is orphaned. There is no startup sweep, so orphans accumulate forever.
On one production user's machine (running an `openclaw-gateway.service` for ~3 months), this leaked 8,662 orphan files = 51 GB in `~/.openclaw/agents/main/sessions/`. The current `sessions.json` was a healthy 54 MB. The 51 GB was 100% orphans matching `sessions.json.<UUID>.tmp`.
Reproduction (the leak)
1. Have the gateway rewrite `sessions.json` frequently (default behavior on an active agent).
2. Any hard kill that lands between `fs.writeFile(tmp, ...)` and `fs.rename(tmp, filePath)` orphans the tmp, for example `kill -9` (e.g. from a watchdog or restart script).
3. Each orphan can be as large as `sessions.json` itself (50+ MB in heavy agents).
Diagnose on any host:
```bash
find ~/.openclaw -name 'sessions.json.*.tmp' -type f | wc -l
du -sh ~/.openclaw/agents/*/sessions
```
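The same check can be scripted, which also gives the total bytes held by orphans. The following is a minimal sketch, not code from the repository: `isOrphanTmpName` and `summarizeOrphans` are hypothetical names, and the regex assumes the legacy `sessions.json.<UUID>.tmp` shape described above.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Assumed name shape (from this issue): sessions.json.<UUID>.tmp
const ORPHAN_RE =
  /^sessions\.json\.[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\.tmp$/i;

// Hypothetical helper: true if the file name looks like a sessions.json tmp file.
export function isOrphanTmpName(name: string): boolean {
  return ORPHAN_RE.test(name);
}

// Hypothetical helper: count matching tmp files in one sessions directory
// and sum their sizes. A missing or unreadable directory yields zeros.
export function summarizeOrphans(sessionsDir: string): { count: number; bytes: number } {
  let names: string[];
  try {
    names = fs.readdirSync(sessionsDir);
  } catch {
    return { count: 0, bytes: 0 };
  }
  let count = 0;
  let bytes = 0;
  for (const name of names) {
    if (!isOrphanTmpName(name)) continue;
    try {
      const st = fs.statSync(path.join(sessionsDir, name));
      if (!st.isFile()) continue;
      count += 1;
      bytes += st.size;
    } catch {
      // The file disappeared between readdir and stat; ignore it.
    }
  }
  return { count, bytes };
}
```

Like the `find` command, this counts every matching tmp file, including one that belongs to a write currently in flight, so a small non-zero count on a busy gateway is not by itself a leak.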
Root cause
`src/infra/json-files.ts` lines 27-57:
```ts
export async function writeTextAtomic(filePath, content, options) {
  // ...
  const tmp = `${filePath}.${randomUUID()}.tmp`;
  try {
    await fs.writeFile(tmp, payload, ...);
    // ...
    await fs.rename(tmp, filePath); // <-- if SIGKILL hits between writeFile and rename...
    // ...
  } finally {
    await fs.rm(tmp, { force: true }).catch(() => undefined); // <-- ...this never runs
  }
}
```
The in-process `finally` cleanup is correct, but the JS runtime cannot run `finally` blocks when the process is hard-killed.
Note: `src/infra/fs-safe.ts:310` and `src/config/io.ts:1291` already use a better naming convention (`${name}.${process.pid}.${randomUUID()}.tmp`, sometimes with a leading dot) — but `json-files.ts` is on the older pattern.
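To make the difference concrete, here is a minimal sketch of the pid-tagged name shape. It is an illustration only: `pidTaggedTmpPath`, `ownerPidOf`, and `isPidAlive` are hypothetical helpers, not functions from `fs-safe.ts` or `config/io.ts`. It also assumes one possible use of the embedded pid, namely letting a sweep check whether the writing process still exists.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical helper: build `<filePath>.<pid>.<UUID>.tmp`.
export function pidTaggedTmpPath(
  filePath: string,
  pid: number = process.pid,
  id: string = randomUUID(),
): string {
  return `${filePath}.${pid}.${id}.tmp`;
}

const PID_TMP_RE =
  /\.(\d+)\.[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\.tmp$/i;

// Hypothetical helper: extract the writer's pid from a pid-tagged tmp name,
// or null for names in the older `<path>.<UUID>.tmp` shape.
export function ownerPidOf(tmpName: string): number | null {
  const m = PID_TMP_RE.exec(tmpName);
  return m ? Number(m[1]) : null;
}

// Signal 0 performs existence and permission checks without delivering a signal.
// EPERM means the process exists but belongs to another user.
export function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    return (err as { code?: string }).code === "EPERM";
  }
}
```

Pids are recycled by the operating system, so a liveness check like this can only be a hint; an age threshold on the file's mtime is still needed before deleting anything.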
Proposed fixes
Fix A — startup sweep (minimum viable)
When the gateway boots (or when the session store first loads), sweep the directory:
```ts
// Sketch for somewhere in src/gateway/boot.ts or session-utils.ts init
import fs from "node:fs/promises";
import path from "node:path";

// Delete sessions.json.<UUID>.tmp files older than maxAgeMs (default: 1 hour).
async function sweepOrphanTmpFiles(sessionsDir: string, maxAgeMs = 60 * 60_000) {
  const entries = await fs.readdir(sessionsDir).catch(() => [] as string[]);
  const now = Date.now();
  for (const name of entries) {
    // Only touch tmp files that were created for sessions.json.
    if (!name.startsWith("sessions.json.") || !name.endsWith(".tmp")) continue;
    const p = path.join(sessionsDir, name);
    const stat = await fs.stat(p).catch(() => null);
    if (!stat || !stat.isFile()) continue;
    // Skip recent files: they may belong to a write that is still in flight.
    if (now - stat.mtimeMs > maxAgeMs) {
      await fs.rm(p, { force: true }).catch(() => undefined);
    }
  }
}
```
Run this once on gateway boot. The default `maxAgeMs` of 1 hour is far longer than any single write should take, so the sweep should not delete a tmp that belongs to an in-flight write.
Fix B — periodic sweep (defense in depth)
The existing `session-reaper.ts` has a perfect throttled-sweep pattern. Adding a parallel tmp-file sweep that runs on the same cron timer tick would catch leaks from long-running gateways that never restart.
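The throttling idea can be sketched independently of the reaper. This is a generic illustration, not the actual `session-reaper.ts` API, which is not shown in this issue: `createThrottledSweep` is a hypothetical name, and the injectable `now` clock exists only to make the behavior easy to test.

```typescript
// Hypothetical helper: wrap a sweep so that calling it on every timer tick
// runs the underlying sweep at most once per minIntervalMs.
export function createThrottledSweep(
  sweep: () => void,
  minIntervalMs: number,
  now: () => number = Date.now,
): () => boolean {
  let lastRunAt = Number.NEGATIVE_INFINITY;
  return () => {
    const t = now();
    // Too soon since the last run: skip this tick.
    if (t - lastRunAt < minIntervalMs) return false;
    lastRunAt = t;
    // If the sweep starts async work, it should handle its own errors.
    sweep();
    return true;
  };
}
```

The returned function reports whether the sweep actually ran, so the caller can log or count sweeps without tracking timestamps itself.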
Fix C — include pid in the tmp name (consistency)
Other call sites in the codebase already do this (`fs-safe.ts:310`, `config/io.ts:1291`). Adopting the same convention in `json-files.ts` would:
Why this matters
51 GB on a 245 GB MacBook is a full-disk-wedge event. The user's only signal is "my Mac is full" — they have no reason to suspect openclaw because the user-facing data (sessions.json) looks fine. We diagnosed it only by walking down from `du -sh ~/.openclaw`.
Happy to draft a PR if a maintainer can point to the preferred call site for the sweep (boot.ts? session-utils.ts module init?).
Environment