Atomic write helper leaks orphan .tmp files; gateway never sweeps them (51 GB observed in the wild) #89520

@gideonblaauw-creator

TL;DR

src/infra/json-files.ts writes JSON atomically via <path>.<UUID>.tmp → rename. If the process dies between writeFile and rename (SIGKILL, OOM kill, hard shutdown, system sleep at the wrong moment), the finally block never runs and the .tmp is orphaned. There is no startup sweep, so orphans accumulate forever.

On one production user's machine (running an openclaw-gateway.service for ~3 months), this leaked 8,662 orphan files = 51 GB in ~/.openclaw/agents/main/sessions/. The current sessions.json was a healthy 54 MB. The 51 GB was 100% orphans matching sessions.json.<UUID>.tmp.

Reproduction (the leak)

  1. Run openclaw-gateway as a long-lived service that frequently rewrites sessions.json (default behavior on an active agent).
  2. Anything that hard-kills the process between fs.writeFile(tmp, ...) and fs.rename(tmp, filePath) orphans the tmp:
    • Force-quit / Activity Monitor → Force Quit
    • System sleep that hits the gateway at exactly the wrong moment
    • OOM kill
    • kill -9 (e.g. from a watchdog or restart script)
  3. The tmp file is left behind. No code path ever removes it.
  4. Over months and thousands of writes, this compounds catastrophically — each tmp is roughly the same size as sessions.json itself (50+ MB in heavy agents).

Diagnose on any host:
```bash
find ~/.openclaw -name 'sessions.json.*.tmp' -type f | wc -l
du -sh ~/.openclaw/agents/*/sessions
```
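`du -sh` on the sessions directory also counts the live sessions.json, so it can help to total only the orphans. The following is a minimal sketch, not part of openclaw: the function name `summarizeOrphanTmpFiles` and the `TmpSummary` type are hypothetical, and it assumes orphans keep the `sessions.json.<UUID>.tmp` shape described above.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

interface TmpSummary {
  count: number;
  bytes: number;
}

// Recursively totals files named like `sessions.json.<something>.tmp`.
// Hypothetical helper for diagnosis only; it never deletes anything.
function summarizeOrphanTmpFiles(root: string): TmpSummary {
  const summary: TmpSummary = { count: 0, bytes: 0 };
  let entries: fs.Dirent[];
  try {
    entries = fs.readdirSync(root, { withFileTypes: true });
  } catch {
    return summary; // missing or unreadable directory: nothing to report
  }
  for (const entry of entries) {
    const full = path.join(root, entry.name);
    if (entry.isDirectory()) {
      const nested = summarizeOrphanTmpFiles(full);
      summary.count += nested.count;
      summary.bytes += nested.bytes;
    } else if (
      entry.isFile() &&
      entry.name.startsWith("sessions.json.") &&
      entry.name.endsWith(".tmp")
    ) {
      // The file may vanish between readdir and stat; skip it in that case.
      const stat = fs.statSync(full, { throwIfNoEntry: false });
      if (stat) {
        summary.count += 1;
        summary.bytes += stat.size;
      }
    }
  }
  return summary;
}

// Example: point it at the state directory named in this issue.
const report = summarizeOrphanTmpFiles(path.join(os.homedir(), ".openclaw"));
console.log(`${report.count} orphan tmp files, ${(report.bytes / 1e9).toFixed(2)} GB`);
```

A freshly created tmp from an in-flight write would be counted too, so a non-zero count on a busy gateway is not by itself proof of a leak; thousands of files are.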

Root cause

src/infra/json-files.ts lines 27-57:

```ts
export async function writeTextAtomic(filePath, content, options) {
  // ...
  const tmp = `${filePath}.${randomUUID()}.tmp`;
  try {
    await fs.writeFile(tmp, payload, ...);
    // ...
    await fs.rename(tmp, filePath); // <-- if SIGKILL hits between writeFile and rename...
    // ...
  } finally {
    await fs.rm(tmp, { force: true }).catch(() => undefined); // <-- ...this never runs
  }
}
```

The in-process `finally` cleanup is correct, but the JS runtime cannot run `finally` blocks when the process is hard-killed.
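The failure mode can be reproduced in isolation. The sketch below is illustrative only and does not use openclaw code: a child Node process follows the same write → rename → `finally` shape, but sends itself SIGKILL between the write and the rename. The names `demoHardKillLeak`, `DEMO_TARGET`, and `DEMO_TMP` are made up for this example.

```typescript
import { spawnSync } from "node:child_process";
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Child script: same shape as writeTextAtomic, with a simulated hard kill
// placed between the write and the rename.
const childSource = `
const fs = require("node:fs");
const target = process.env.DEMO_TARGET;
const tmp = process.env.DEMO_TMP;
try {
  fs.writeFileSync(tmp, "{}");
  process.kill(process.pid, "SIGKILL"); // process dies here
  fs.renameSync(tmp, target);
} finally {
  fs.rmSync(tmp, { force: true }); // not reached after SIGKILL
}
`;

// Runs the child in a scratch directory and reports what it left behind.
function demoHardKillLeak(): { signal: string | null; leaked: string[] } {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "tmp-leak-demo-"));
  const target = path.join(dir, "sessions.json");
  const tmp = `${target}.demo.tmp`;
  const result = spawnSync(process.execPath, ["-e", childSource], {
    env: { ...process.env, DEMO_TARGET: target, DEMO_TMP: tmp },
  });
  const leaked = fs.readdirSync(dir);
  fs.rmSync(dir, { recursive: true, force: true });
  return { signal: result.signal, leaked };
}

console.log(demoHardKillLeak());
```

The scratch directory should contain only the tmp file and no sessions.json, which is the same state the gateway leaves behind after a hard kill.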

Note: `src/infra/fs-safe.ts:310` and `src/config/io.ts:1291` already use a better naming convention (`${name}.${process.pid}.${randomUUID()}.tmp`, sometimes with a leading dot) — but `json-files.ts` is on the older pattern.

Proposed fixes

Fix A — startup sweep (minimum viable)

When the gateway boots (or when the session store first loads), sweep the directory:

```ts
// Pseudocode for somewhere in src/gateway/boot.ts or session-utils.ts init
import fs from "node:fs/promises";
import path from "node:path";

async function sweepOrphanTmpFiles(sessionsDir: string, maxAgeMs = 60 * 60_000) {
  // A missing or unreadable directory is treated as "nothing to sweep".
  const entries = await fs.readdir(sessionsDir).catch(() => []);
  const now = Date.now();
  for (const name of entries) {
    // Only consider files shaped like sessions.json.<UUID>.tmp
    if (!name.endsWith(".tmp") || !name.includes("sessions.json.")) continue;
    const p = path.join(sessionsDir, name);
    const stat = await fs.stat(p).catch(() => null);
    if (!stat) continue;
    // Skip recent files: they may belong to a write that is still in flight.
    if (now - stat.mtimeMs > maxAgeMs) {
      await fs.rm(p, { force: true }).catch(() => undefined);
    }
  }
}
```

Run this once on gateway boot. The default `maxAgeMs` of one hour keeps the sweep from deleting a tmp that belongs to a write still in flight, since no single write should take anywhere near that long.
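Because the sweep deletes files, a stricter name check than `endsWith` plus `includes` may be worth considering. The following is a sketch under the assumption that every tmp written by `json-files.ts` is named `sessions.json.<UUID>.tmp` with a standard 8-4-4-4-12 UUID from `randomUUID()`; the helper names `isSessionsTmpName` and `isSweepable` are hypothetical.

```typescript
// Matches `sessions.json.<UUID>.tmp` exactly, where <UUID> has the 8-4-4-4-12
// hex layout produced by crypto.randomUUID().
const SESSIONS_TMP_RE =
  /^sessions\.json\.[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\.tmp$/i;

function isSessionsTmpName(name: string): boolean {
  return SESSIONS_TMP_RE.test(name);
}

// A tmp is sweepable only when its name matches and it is older than maxAgeMs.
function isSweepable(
  name: string,
  mtimeMs: number,
  nowMs: number,
  maxAgeMs: number = 60 * 60_000,
): boolean {
  return isSessionsTmpName(name) && nowMs - mtimeMs > maxAgeMs;
}
```

With this predicate a file such as `my-sessions.json.notes.tmp` is left alone, whereas the looser check in the pseudocode above would treat it as a sweep candidate.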

Fix B — periodic sweep (defense in depth)

The existing `session-reaper.ts` has a perfect throttled-sweep pattern. Adding a parallel tmp-file sweep that runs on the same cron timer tick would catch leaks from long-running gateways that never restart.
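The shape of such a throttled sweep might look like the sketch below. It does not reflect the actual API of `session-reaper.ts`, which is not shown in this issue; `createThrottledSweep` and its parameters are hypothetical, and the clock is injectable only to make the behavior easy to test.

```typescript
// Wraps a sweep function so that it runs at most once per minIntervalMs,
// no matter how often the timer tick calls it.
function createThrottledSweep(
  sweep: () => Promise<void>,
  minIntervalMs: number,
  now: () => number = Date.now,
): () => boolean {
  let lastRunAt = Number.NEGATIVE_INFINITY;
  return function maybeSweep(): boolean {
    const t = now();
    if (t - lastRunAt < minIntervalMs) return false; // throttled
    lastRunAt = t;
    // Fire and forget: a failed sweep must never break the caller's tick.
    void sweep().catch(() => undefined);
    return true;
  };
}
```

A caller would invoke the returned function on every tick and let the throttle decide; for example, wrapping a directory sweep with a six-hour interval would bound tmp accumulation on a gateway that runs for months without restarting. This sketch does not guard against a sweep that outlasts the interval.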

Fix C — include pid in the tmp name (consistency)

Other call sites in the codebase already do this (`fs-safe.ts:310`, `config/io.ts:1291`). Adopting the same convention in `json-files.ts` would:

  1. Make it possible to identify orphans from previous process generations vs. current
  2. Allow a safer sweep predicate (`pid not in active PIDs`) without time-based heuristics
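A pid-aware naming and sweep predicate could be sketched as follows. The name format follows the `${name}.${process.pid}.${randomUUID()}.tmp` convention quoted above; the helper names (`makeTmpPath`, `parseTmpPid`, `isPidAlive`, `isOrphanByPid`) are hypothetical. Liveness is checked with `process.kill(pid, 0)`, which tests for the process without sending a signal.

```typescript
import { randomUUID } from "node:crypto";

// Builds a tmp path that records which process created it.
function makeTmpPath(filePath: string): string {
  return `${filePath}.${process.pid}.${randomUUID()}.tmp`;
}

// Extracts the pid from a `<name>.<pid>.<UUID>.tmp` file name, or null when
// the name does not follow that convention (e.g. the older pid-less pattern).
function parseTmpPid(name: string): number | null {
  const match =
    /\.(\d+)\.[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\.tmp$/i.exec(name);
  if (!match) return null;
  const pid = Number(match[1]);
  return Number.isSafeInteger(pid) && pid > 0 ? pid : null;
}

// Signal 0 performs the existence check without delivering a signal.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but is owned by another user.
    return (err as NodeJS.ErrnoException).code === "EPERM";
  }
}

// True only when the name carries a pid and that process no longer exists.
function isOrphanByPid(name: string): boolean {
  const pid = parseTmpPid(name);
  return pid !== null && !isPidAlive(pid);
}
```

One caveat with this approach is that operating systems reuse pids, so a tmp from a dead gateway can look live if an unrelated process later receives the same pid. Combining the pid check with the age threshold from Fix A would cover that case.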

Why this matters

51 GB on a 245 GB MacBook is a full-disk-wedge event. The user's only signal is "my Mac is full" — they have no reason to suspect openclaw because the user-facing data (sessions.json) looks fine. We diagnosed it only by walking down from `du -sh ~/.openclaw`.

Happy to draft a PR if a maintainer can point to the preferred call site for the sweep (boot.ts? session-utils.ts module init?).

Environment

  • openclaw release channel: `OPENCLAW_SERVICE_VERSION=2026.3.23-1`
  • Host: macOS, gateway running as user service (`openclaw-gateway.service`)
  • Detected: 2026-06-02

Labels

  • P2: Normal backlog priority with limited blast radius.
  • clawsweeper:fix-shape-clear: ClawSweeper found a clear likely implementation shape for this issue.
  • clawsweeper:queueable-fix: ClawSweeper marked this issue as an existing queue_fix_pr work candidate.
  • clawsweeper:source-repro: ClawSweeper found a high-confidence source-level issue reproduction.
  • impact:other: This issue has meaningful maintainer-visible impact outside the owned taxonomy.
  • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
  • issue-rating: 🦞 diamond lobster: Very strong issue quality with high-confidence source-level or clear reproduction.
