Bug: `saveCronStore` overwrites jobs.json from partial in-memory state after restart, causing silent job loss #53746

## Summary

When the gateway restarts and a cron fires before full state is loaded, `saveCronStore` writes a partial in-memory job list over the full on-disk job list, silently wiping every job not present in memory at that moment. This happened to us twice in 24 hours: ~86 jobs were on disk at the first incident and ~52 at the second.

## Root Cause

**File:** `src/cron/store.ts` (dist: `config-runtime-BYNizC50.js`)  
**Function:** `saveCronStore(storePath, store, opts)`

Current write path:
```
in-memory state → write to .tmp → rename to jobs.json
```

The function writes whatever is currently in memory as the complete canonical job list. If an isolated cron session fires 1–2 seconds after a gateway restart, only its own jobs exist in its memory scope. When it writes, it replaces all 50+ other jobs with its 1–2 job state.

OpenClaw already uses atomic writes (tmp → rename) which prevents file corruption — but atomic writes of the *wrong data* still cause silent data loss.
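To make that distinction concrete, here is a minimal sketch of the failure mode. It is not the actual OpenClaw code: `writeStoreAtomically`, `readStore`, `demoPartialOverwrite`, and the `{ version, jobs }` store shape are assumptions made for illustration. Only the tmp → rename pattern is taken from the description above.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Assumed shapes for illustration; the real store schema may differ.
interface CronJob { id: string; schedule: string; }
interface CronStore { version: number; jobs: CronJob[]; }

// Hypothetical stand-in for the current write path: write .tmp, then rename.
function writeStoreAtomically(storePath: string, store: CronStore): void {
  const tmpPath = `${storePath}.tmp`;
  fs.writeFileSync(tmpPath, JSON.stringify(store, null, 2));
  // The rename replaces jobs.json in one step, so readers never see a torn file.
  fs.renameSync(tmpPath, storePath);
}

function readStore(storePath: string): CronStore {
  return JSON.parse(fs.readFileSync(storePath, "utf8")) as CronStore;
}

// Simulates a full store being replaced by a session that knows one job.
function demoPartialOverwrite(): { before: number; after: number } {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "cron-store-"));
  const storePath = path.join(dir, "jobs.json");

  const fullJobs: CronJob[] = Array.from({ length: 52 }, (_, i) => ({
    id: `job-${i}`,
    schedule: "* * * * *",
  }));
  writeStoreAtomically(storePath, { version: 1, jobs: fullJobs });
  const before = readStore(storePath).jobs.length;

  // The partial view is written just as atomically as the full one was.
  writeStoreAtomically(storePath, { version: 1, jobs: fullJobs.slice(0, 1) });
  const after = readStore(storePath).jobs.length;

  fs.rmSync(dir, { recursive: true, force: true });
  return { before, after };
}
```

Calling `demoPartialOverwrite()` returns `{ before: 52, after: 1 }`. The file is valid JSON at every point, which is why nothing downstream reports an error.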

## Reproduction

1. Create 50+ cron jobs
2. Restart the gateway
3. Within ~5 seconds of restart, a cron fires in an isolated session
4. That session has only 1–2 jobs in memory
5. `saveCronStore` writes those 1–2 jobs to disk
6. All other jobs are gone — no error, no warning

This is amplified by any crash loop or rapid restart cycle (e.g., watchdog, config changes).

## Proposed Fix: Read-Merge-Write Pattern

Instead of writing in-memory state directly, `saveCronStore` should:

1. **Read** current `jobs.json` from disk
2. **Merge** — apply only the in-memory delta (add/modify/remove the specific job that changed)
3. **Backup** — copy current `jobs.json` → `jobs.json.bak` (already done, keep this)
4. **Write** — write merged result to `.tmp`, then rename to `jobs.json`

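```ts
// Proposed change to saveCronStore (sketch only: readJobsFromDisk,
// mergeJobStates, and backupAndAtomicWrite are placeholder helper names)
async function saveCronStore(storePath, store, opts) {
  // Read current disk state; treat a missing file as an empty job list
  const diskJobs = (await readJobsFromDisk(storePath)) ?? [];

  // Merge: apply delta from in-memory store onto disk state.
  // A job that is merely absent from store.jobs must not count as deleted.
  const merged = mergeJobStates(diskJobs, store.jobs);

  // Backup + atomic write (existing behavior, preserved)
  await backupAndAtomicWrite(storePath, merged);
}
```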

This ensures:
- A session with 1 job in memory cannot wipe 51 jobs from disk
- Add/modify/delete operations apply as deltas, not full replacements
- Behavior is unchanged in the normal (non-restart-race) case
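One way the merge step could look is sketched below. This is an illustration under assumptions, not a proposal for the exact signature: `JobDelta`, `applyJobDelta`, and the `CronJob` fields are hypothetical names. The delta carries explicit removals because a plain merge of two full lists cannot tell "this job was deleted" apart from "this job was never loaded into this session".

```typescript
// Assumed job shape for illustration; the real schema may have more fields.
interface CronJob { id: string; schedule: string; enabled: boolean; }

// Hypothetical delta: only what this session actually changed.
interface JobDelta {
  upserts: CronJob[];   // jobs that were added or modified
  removedIds: string[]; // jobs that were explicitly deleted
}

// Applies a delta on top of the job list read from disk.
// Jobs the session never touched are carried over unchanged.
function applyJobDelta(diskJobs: CronJob[], delta: JobDelta): CronJob[] {
  const byId = new Map<string, CronJob>();
  for (const job of diskJobs) byId.set(job.id, job);
  // Setting an existing key keeps its position, so disk order is preserved.
  for (const job of delta.upserts) byId.set(job.id, job);
  for (const id of delta.removedIds) byId.delete(id);
  return Array.from(byId.values());
}

// Usage: 51 jobs on disk, a session that only knows about job-0 disables it.
const onDisk: CronJob[] = Array.from({ length: 51 }, (_, i) => ({
  id: `job-${i}`,
  schedule: "0 * * * *",
  enabled: true,
}));
const afterSave = applyJobDelta(onDisk, {
  upserts: [{ id: "job-0", schedule: "0 * * * *", enabled: false }],
  removedIds: [],
});
console.log(afterSave.length); // 51
```

The input array is not mutated, so a failed write can be retried against the same disk snapshot.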

## Current Workaround

External watchdog (`cron-guardian.sh` via launchd) running every 5 minutes:
- Detects count regression (< 10 jobs)
- Auto-restores from rotating timestamped backups
- Sends Telegram alert
- Preserves forensics file

This mitigates impact but does not prevent the race. Restoration window is ~5 minutes worst-case.
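For readers who want the gist of the guardian's decision logic without the shell script, here is a rough TypeScript rendering. It is not the script itself: `BackupInfo`, `GuardDecision`, and `decideGuardAction` are hypothetical names, and choosing the newest backup that still clears the threshold is an assumption about how a restore source could be picked.

```typescript
// Hypothetical description of one rotating backup file.
interface BackupInfo { path: string; takenAtMs: number; jobCount: number; }

type GuardDecision =
  | { action: "ok" }
  | { action: "restore"; from: BackupInfo }
  | { action: "alert-only" };

// Decides what the watchdog should do for the current job count.
// minJobs mirrors the "< 10 jobs" regression threshold described above.
function decideGuardAction(
  currentJobCount: number,
  backups: BackupInfo[],
  minJobs = 10,
): GuardDecision {
  if (currentJobCount >= minJobs) return { action: "ok" };

  // Prefer the newest backup that was taken before the store shrank.
  const candidates = backups
    .filter((b) => b.jobCount >= minJobs)
    .sort((a, b) => b.takenAtMs - a.takenAtMs);
  const newest = candidates[0];

  return newest ? { action: "restore", from: newest } : { action: "alert-only" };
}
```

A fixed threshold like this only catches large drops; it would not notice a store going from 52 jobs to 40.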

## Environment

- OpenClaw version: 2026.3.23-2
- OS: macOS 15.x (Darwin arm64)
- Gateway: launchd-managed, self-heal watchdog enabled
- Cron jobs at time of loss: ~86 (first incident), ~52 (second incident)

## Related

- Issue #53481 — cron.onChange webhook (filed separately — a registry hook would also help detect this faster)
