Bug: saveCronStore overwrites jobs.json from partial in-memory state after restart, causing silent job loss #53746

@dustinprojectcoordinator-alt

Description

Summary

When the gateway restarts and a cron job fires before the full state is loaded, saveCronStore writes a partial in-memory job list over the full on-disk job list, silently wiping every job not present in memory at that moment. This hit us twice in 24 hours, with roughly 86 jobs in the store at the first incident and roughly 52 at the second.

Root Cause

File: src/cron/store.ts (dist: config-runtime-BYNizC50.js)
Function: saveCronStore(storePath, store, opts)

Current write path:

in-memory state → write to .tmp → rename to jobs.json

The function writes whatever is currently in memory as the complete canonical job list. If an isolated cron session fires 1–2 seconds after a gateway restart, only its own jobs exist in its memory scope. When it writes, it replaces all 50+ other jobs with its 1–2 job state.

OpenClaw already uses atomic writes (tmp → rename) which prevents file corruption — but atomic writes of the wrong data still cause silent data loss.
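
To make the failure mode concrete, here is a minimal sketch that models jobs.json as a plain object and the current save as a full replacement. The CronJob and CronStore shapes and the saveFullReplace helper are assumptions made for illustration, not OpenClaw's actual types or code.

```typescript
// Hypothetical shapes for illustration; the real store types may differ.
interface CronJob {
  id: string;
  schedule: string;
}

interface CronStore {
  jobs: CronJob[];
}

// Models the current behavior: whatever is in memory becomes the whole file.
// The previous disk contents are ignored entirely.
function saveFullReplace(_disk: CronStore, inMemory: CronStore): CronStore {
  return { jobs: [...inMemory.jobs] };
}

// Disk holds 52 jobs before the restart.
const onDisk: CronStore = {
  jobs: Array.from({ length: 52 }, (_, i) => ({
    id: `job-${i}`,
    schedule: "0 * * * *",
  })),
};

// An isolated session that fired right after the restart knows only 2 jobs.
const partialSession: CronStore = { jobs: onDisk.jobs.slice(0, 2) };

const afterSave = saveFullReplace(onDisk, partialSession);
console.log(`${onDisk.jobs.length} jobs before, ${afterSave.jobs.length} after`);
// prints "52 jobs before, 2 after"
```

The write itself can be perfectly atomic and still produce this result, because the problem is in what gets written, not how.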

Reproduction

  1. Create 50+ cron jobs
  2. Restart the gateway
  3. Within ~5 seconds of restart, a cron fires in an isolated session
  4. That session has only 1–2 jobs in memory
  5. saveCronStore writes those 1–2 jobs to disk
  6. All other jobs are gone — no error, no warning

This is amplified by any crash loop or rapid restart cycle (e.g., watchdog, config changes).

Proposed Fix: Read-Merge-Write Pattern

Instead of writing in-memory state directly, saveCronStore should:

  1. Read current jobs.json from disk
  2. Merge — apply only the in-memory delta (add/modify/remove the specific job that changed)
  3. Backup — copy current jobs.json → jobs.json.bak (already done, keep this)
  4. Write — write merged result to .tmp, then rename to jobs.json

// Proposed change to saveCronStore
async function saveCronStore(storePath, store, opts) {
  // Read current disk state; `?? []` falls back to an empty list when nothing is read
  const diskJobs = (await readJobsFromDisk(storePath)) ?? [];

  // Merge: apply delta from in-memory store onto disk state
  const merged = mergeJobStates(diskJobs, store.jobs);

  // Backup + atomic write (existing behavior, preserved)
  await backupAndAtomicWrite(storePath, merged);
}

This ensures:

  • A session with 1 job in memory cannot wipe 51 jobs from disk
  • Add/modify/delete operations apply as deltas, not full replacements
  • Behavior is identical to current for the normal (non-restart-race) case
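
One possible shape for the merge step is sketched below. A merge that only sees two job lists cannot tell a deleted job from one that was never loaded, so this sketch assumes the session records an explicit delta of what it changed. The JobDelta type and the applyJobDelta name are hypothetical and do not come from the OpenClaw codebase.

```typescript
// Hypothetical job shape for illustration.
interface CronJob {
  id: string;
  schedule: string;
}

// Hypothetical delta: only what this session actually changed.
interface JobDelta {
  upserts: CronJob[]; // jobs added or modified
  removedIds: string[]; // jobs explicitly deleted
}

// Applies a delta to the on-disk job list and returns the merged list.
// Jobs the session never touched are carried over unchanged.
function applyJobDelta(diskJobs: CronJob[], delta: JobDelta): CronJob[] {
  const byId = new Map<string, CronJob>();
  for (const job of diskJobs) byId.set(job.id, job);
  for (const id of delta.removedIds) byId.delete(id);
  for (const job of delta.upserts) byId.set(job.id, job);
  return Array.from(byId.values());
}

const diskJobs: CronJob[] = [
  { id: "a", schedule: "0 * * * *" },
  { id: "b", schedule: "5 * * * *" },
  { id: "c", schedule: "10 * * * *" },
];

// The session modified "b", added "d", and deleted "c". It never loaded "a".
const mergedJobs = applyJobDelta(diskJobs, {
  upserts: [
    { id: "b", schedule: "30 * * * *" },
    { id: "d", schedule: "0 0 * * *" },
  ],
  removedIds: ["c"],
});

console.log(mergedJobs.map((j) => j.id).join(","));
// prints "a,b,d"
```

Job "a" survives even though the session never had it in memory, which is the property the restart race currently violates.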

Current Workaround

External watchdog (cron-guardian.sh via launchd) running every 5 minutes:

  • Detects count regression (< 10 jobs)
  • Auto-restores from rotating timestamped backups
  • Sends Telegram alert
  • Preserves forensics file

This mitigates impact but does not prevent the race. Restoration window is ~5 minutes worst-case.
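
The watchdog's decision logic can be sketched roughly as follows. The real cron-guardian.sh is a shell script and is not shown here, so the Backup shape, the pickRestore name, and the rule of choosing the newest backup that is itself above the threshold are all assumptions for illustration.

```typescript
// Hypothetical record describing one timestamped backup of jobs.json.
interface Backup {
  takenAt: number; // epoch milliseconds
  jobCount: number;
}

// Regression threshold from the workaround: fewer than 10 jobs is suspicious.
const MIN_JOBS = 10;

// Returns the backup to restore from, or null if no restore is needed
// (or no backup is usable).
function pickRestore(currentCount: number, backups: Backup[]): Backup | null {
  if (currentCount >= MIN_JOBS) return null;
  // Assumption: skip backups that were themselves taken after a wipe.
  const healthy = backups.filter((b) => b.jobCount >= MIN_JOBS);
  if (healthy.length === 0) return null;
  // Pick the most recent healthy backup.
  return healthy.reduce((a, b) => (b.takenAt > a.takenAt ? b : a));
}

const backups: Backup[] = [
  { takenAt: 1000, jobCount: 86 },
  { takenAt: 2000, jobCount: 52 },
  { takenAt: 3000, jobCount: 2 }, // taken after the wipe, so not usable
];

const choice = pickRestore(2, backups);
console.log(choice ? choice.takenAt : "no restore");
// prints 2000
```

A check like this only bounds the damage to one watchdog interval; it does nothing about the race itself.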

Environment

  • OpenClaw version: 2026.3.23-2
  • OS: macOS 15.x (Darwin arm64)
  • Gateway: launchd-managed, self-heal watchdog enabled
  • Cron jobs at time of loss: ~86 (first incident), ~52 (second incident)
