Summary
When the gateway restarts and a cron fires before the full state is loaded, saveCronStore writes a partial in-memory job list over the full on-disk job list, silently wiping every job not present in memory at that moment. This hit us twice in 24 hours, with roughly 86 jobs on disk at the first incident and roughly 52 at the second.
Root Cause
File: src/cron/store.ts (dist: config-runtime-BYNizC50.js)
Function: saveCronStore(storePath, store, opts)
Current write path:
in-memory state → write to .tmp → rename to jobs.json
The function writes whatever is currently in memory as the complete canonical job list. If an isolated cron session fires 1–2 seconds after a gateway restart, only its own jobs exist in its memory scope. When it writes, it replaces all 50+ other jobs with its 1–2 job state.
OpenClaw already uses atomic writes (tmp → rename), which prevent file corruption, but an atomic write of the wrong data still causes silent data loss.
Reproduction
- Create 50+ cron jobs
- Restart the gateway
- Within ~5 seconds of restart, a cron fires in an isolated session
- That session has only 1–2 jobs in memory
- saveCronStore writes those 1–2 jobs to disk
- All other jobs are gone — no error, no warning
This is amplified by any crash loop or rapid restart cycle (e.g., watchdog, config changes).
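The overwrite can be simulated without a running gateway. The following is a minimal sketch under stated assumptions, not OpenClaw's code: `naiveSave` is a hypothetical stand-in that mimics the write path described above (serialize memory, write `.tmp`, rename), and the `Job` type and `{ jobs: [...] }` file shape are assumed for illustration.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical job shape; the real schema in src/cron/store.ts may differ.
interface Job { id: string; schedule: string }
interface CronStore { jobs: Job[] }

// Stand-in for the current write path: memory -> .tmp -> rename.
function naiveSave(storePath: string, store: CronStore): void {
  const tmp = `${storePath}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify(store, null, 2));
  fs.renameSync(tmp, storePath); // atomic, but the payload may be partial
}

function countJobsOnDisk(storePath: string): number {
  const parsed = JSON.parse(fs.readFileSync(storePath, "utf8")) as CronStore;
  return parsed.jobs.length;
}

// Work in a throwaway directory so no real jobs.json is touched.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "cron-race-"));
const storePath = path.join(dir, "jobs.json");

// Fully loaded gateway: 52 jobs end up on disk.
const allJobs: Job[] = Array.from({ length: 52 }, (_, i) => ({
  id: `job-${i}`,
  schedule: "*/5 * * * *",
}));
naiveSave(storePath, { jobs: allJobs });
const before = countJobsOnDisk(storePath);

// Isolated session right after restart: only its own job is in memory.
naiveSave(storePath, { jobs: allJobs.slice(0, 1) });
const after = countJobsOnDisk(storePath);

console.log({ before, after });
```

Here `before` is 52 and `after` is 1. Neither write fails, which matches the "no error, no warning" behavior in the steps above.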
Proposed Fix: Read-Merge-Write Pattern
Instead of writing in-memory state directly, saveCronStore should:
- Read the current jobs.json from disk
- Merge: apply only the in-memory delta (add/modify/remove the specific job that changed)
- Backup: copy the current jobs.json → jobs.json.bak (already done, keep this)
- Write: write the merged result to .tmp, then rename to jobs.json
// Proposed change to saveCronStore
async function saveCronStore(storePath, store, opts) {
  // Read current disk state
  const diskJobs = (await readJobsFromDisk(storePath)) ?? [];

  // Merge: apply delta from in-memory store onto disk state
  const merged = mergeJobStates(diskJobs, store.jobs);

  // Backup + atomic write (existing behavior, preserved)
  await backupAndAtomicWrite(storePath, merged);
}
This ensures:
- A session with 1 job in memory cannot wipe 51 jobs from disk
- Add/modify/delete operations apply as deltas, not full replacements
- Behavior is identical to current for the normal (non-restart-race) case
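The proposal leaves `mergeJobStates` undefined, so here is one possible shape for it, offered as a sketch rather than a definitive implementation. It assumes jobs are keyed by a unique `id`. Unlike the call in the proposal, it takes an explicit delta instead of the full in-memory list. The reason is that a session holding one job cannot tell "deleted" apart from "never loaded" when only two full lists are compared, so deletions have to be recorded explicitly. The `Job` and `JobDelta` types are hypothetical.

```typescript
// Hypothetical job shape; the real schema may carry more fields.
interface Job { id: string; schedule: string }

// Hypothetical delta: only what this session actually changed.
interface JobDelta {
  upserts: Job[];       // jobs added or modified by this session
  removedIds: string[]; // jobs this session explicitly deleted
}

function mergeJobStates(diskJobs: Job[], delta: JobDelta): Job[] {
  // Start from disk so jobs unknown to this session survive.
  const byId = new Map<string, Job>();
  for (const job of diskJobs) byId.set(job.id, job);
  // Apply additions and modifications.
  for (const job of delta.upserts) byId.set(job.id, job);
  // Apply explicit deletions only.
  for (const id of delta.removedIds) byId.delete(id);
  return Array.from(byId.values());
}

// Example: disk holds three jobs; the session edits one, adds one, deletes one.
const diskJobs: Job[] = [
  { id: "a", schedule: "0 * * * *" },
  { id: "b", schedule: "*/5 * * * *" },
  { id: "c", schedule: "0 0 * * *" },
];
const merged = mergeJobStates(diskJobs, {
  upserts: [
    { id: "b", schedule: "*/10 * * * *" },
    { id: "d", schedule: "0 12 * * *" },
  ],
  removedIds: ["c"],
});
console.log(merged.map((j) => j.id)); // ids: a, b, d
```

An empty delta returns the disk state unchanged, which is the property that stops a freshly restarted session from wiping jobs it never loaded.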
Current Workaround
External watchdog (cron-guardian.sh via launchd) running every 5 minutes:
- Detects count regression (< 10 jobs)
- Auto-restores from rotating timestamped backups
- Sends Telegram alert
- Preserves forensics file
This mitigates impact but does not prevent the race. Restoration window is ~5 minutes worst-case.
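The watchdog's decision logic can be sketched as a pure function. This is an illustration only: cron-guardian.sh is a shell script whose contents are not shown here, and the `BackupInfo` shape, the backup file names, and `pickRestoreSource` are all hypothetical. The only value taken from the workaround is the threshold of 10 jobs.

```typescript
// Hypothetical metadata for one timestamped backup of jobs.json.
interface BackupInfo {
  path: string;
  takenAtMs: number; // epoch milliseconds
  jobCount: number;
}

const MIN_EXPECTED_JOBS = 10; // count-regression threshold from the workaround

// Returns the backup to restore from, or null when no restore is needed
// or no healthy backup exists.
function pickRestoreSource(
  currentJobCount: number,
  backups: BackupInfo[],
): BackupInfo | null {
  if (currentJobCount >= MIN_EXPECTED_JOBS) return null;
  // Skip backups that were themselves taken after a wipe.
  const healthy = backups.filter((b) => b.jobCount >= MIN_EXPECTED_JOBS);
  if (healthy.length === 0) return null;
  // Pick the most recent healthy backup.
  return healthy.reduce((newest, b) => (b.takenAtMs > newest.takenAtMs ? b : newest));
}

// Example: the newest backup already contains the wiped state.
const backups: BackupInfo[] = [
  { path: "jobs.json.bak.1", takenAtMs: 1000, jobCount: 52 },
  { path: "jobs.json.bak.2", takenAtMs: 2000, jobCount: 52 },
  { path: "jobs.json.bak.3", takenAtMs: 3000, jobCount: 1 },
];
const choice = pickRestoreSource(1, backups);
console.log(choice ? choice.path : "no restore");
```

Filtering on job count before picking the newest backup matters because a rotating backup taken after the wipe would otherwise be selected and restore the damaged state.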
Environment
- OpenClaw version: 2026.3.23-2
- OS: macOS 15.x (Darwin arm64)
- Gateway: launchd-managed, self-heal watchdog enabled
- Cron jobs at time of loss: ~86 (first incident), ~52 (second incident)
Related