Bug
When a cron job executes successfully but the gateway restarts before the outcome persist (Phase 3) completes, lastRunAtMs is never written to jobs.json. On restart, the stale lastRunAtMs (potentially days old) causes runMissedJobs to consider the job overdue and execute it again.
Result: Duplicate delivery (e.g., two morning briefings).
Reproduction
- Cron job fires at 08:00; Phase 1 persists runningAtMs ✓
- Job executes and delivers successfully (Phase 2, no lock held)
- Gateway restarts (SIGUSR1, crash, deploy) before Phase 3 persist
- On restart:
  - runningAtMs is cleared (ops.ts:102-114), but lastRunAtMs is still stale
  - isRunnableJob (timer.ts:784-801) sees previousRunAtMs > lastRunAtMs → job re-runs
- User receives a second delivery
This happened in production on 2025-04-05: a daily-summary job delivered at 08:00, then again at 12:06 after a restart cascade.
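The comparison that triggers the re-run can be illustrated with a small, self-contained sketch. It is a simplified model, not the actual isRunnableJob code: the helper names (`previousDailySlotMs`, `looksMissed`), the daily-at-a-fixed-hour schedule shape, and the use of UTC are all assumptions made for illustration.

```typescript
// Hypothetical, simplified model of the missed-job check; the real
// isRunnableJob in timer.ts is not reproduced here.
interface JobState {
  runningAtMs?: number;
  lastRunAtMs?: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Most recent scheduled slot at or before `nowMs` for a job that runs daily
// at `hourUtc` (assumed schedule shape, UTC only for simplicity).
function previousDailySlotMs(nowMs: number, hourUtc: number): number {
  const d = new Date(nowMs);
  const slot = Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate(), hourUtc);
  return slot <= nowMs ? slot : slot - DAY_MS;
}

// A job looks "missed" when its latest slot is newer than its last recorded run.
function looksMissed(state: JobState, previousRunAtMs: number): boolean {
  return previousRunAtMs > (state.lastRunAtMs ?? 0);
}

const restartAt = Date.UTC(2025, 3, 5, 12, 6);  // 2025-04-05 12:06 UTC
const slot = previousDailySlotMs(restartAt, 8); // 2025-04-05 08:00 UTC

// Phase 3 never ran, so lastRunAtMs still points at the previous day's run.
const staleState: JobState = { lastRunAtMs: slot - DAY_MS };
// What the state would look like had Phase 3 persisted.
const freshState: JobState = { lastRunAtMs: slot };

console.log(looksMissed(staleState, slot)); // true: the job is re-run
console.log(looksMissed(freshState, slot)); // false: the job is skipped
```

With the stale value the 08:00 slot looks unserved at 12:06, which matches the second delivery seen in production.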
Root cause
onTimer in timer.ts uses three phases:
| Phase | What it does | Persisted? |
| --- | --- | --- |
| 1 (L611-616) | Sets runningAtMs | ✓ await persist(state) |
| 2 (L624-669) | Executes job (unbounded time, no lock) | ✗ |
| 3 (L675-689) | Calls applyOutcomeToStoredJob → sets lastRunAtMs | ✓ but only if reached |
If the gateway dies between Phase 1 and Phase 3, lastRunAtMs is never updated. The runningAtMs marker is cleaned up on restart (ops.ts:102-114), but nothing prevents the missed-job check from using the stale lastRunAtMs.
The codebase already acknowledges Phase-1 crash scenarios — see the comment at ops.ts:390-392 referencing #17554 — but only for the runningAtMs marker, not for lastRunAtMs.
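The gap between the phases can be reproduced with a minimal in-memory model. This is a sketch under stated assumptions, not the code in timer.ts or ops.ts: `disk`, `load`, `runOnce`, and `cleanupOnRestart` are hypothetical names, a string stands in for jobs.json, and Phase 3 is assumed to clear the running marker.

```typescript
// Hypothetical in-memory model of the three phases; names other than
// runningAtMs and lastRunAtMs are illustrative, not taken from timer.ts.
interface JobState {
  runningAtMs?: number;
  lastRunAtMs?: number;
}

// Stands in for jobs.json: only what is written here survives a restart.
let disk = JSON.stringify({ lastRunAtMs: 1000 });

const persist = (s: JobState): void => { disk = JSON.stringify(s); };
const load = (): JobState => JSON.parse(disk) as JobState;

function runOnce(startedAt: number, crashBeforePhase3: boolean): void {
  const state = load();

  // Phase 1: mark the job as running and persist.
  state.runningAtMs = startedAt;
  persist(state);

  // Phase 2: execute and deliver (side effects happen here, nothing persisted).

  if (crashBeforePhase3) return; // simulated gateway restart

  // Phase 3: record the outcome and persist (clearing the marker is assumed).
  state.lastRunAtMs = startedAt;
  delete state.runningAtMs;
  persist(state);
}

// Startup cleanup as described above: the running marker is dropped, nothing else.
function cleanupOnRestart(): JobState {
  const state = load();
  delete state.runningAtMs;
  persist(state);
  return state;
}

runOnce(2000, true);
const afterCrash = cleanupOnRestart();
console.log(afterCrash.lastRunAtMs); // 1000: the run at 2000 left no trace

runOnce(3000, false);
const afterCleanRun = cleanupOnRestart();
console.log(afterCleanRun.lastRunAtMs); // 3000
```

After the simulated crash the persisted state is indistinguishable from one where the job never started, which is why the missed-job check fires.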
Suggested fix
Persist lastRunAtMs immediately after job execution completes, before the delivery attempt:
    // After executeJobCoreWithTimeout returns, before delivery:
    await locked(state, async () => {
      await ensureLoaded(state, { forceReload: true, skipRecompute: true });
      const stored = state.store.jobs.find(j => j.id === job.id);
      if (stored) {
        stored.state.lastRunAtMs = startedAt;
        // Optionally: stored.state.lastRunStatus = "executed-pending-delivery"
      }
      await persist(state);
    });
This way, even if the gateway crashes during or after delivery, the job won't be considered missed on restart. Delivery status can still be updated in Phase 3.
Alternative: on restart, if a runningAtMs marker is found and cleared, also set lastRunAtMs = runningAtMs before clearing — treating interrupted jobs as "ran but status unknown" rather than "never ran."
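A minimal sketch of this alternative follows. It does not reproduce the existing cleanup in ops.ts: the `StoredJob` shape and the `recoverInterruptedJobs` helper are hypothetical, and the `Math.max` guard is an added assumption so that a newer recorded run is never overwritten.

```typescript
// Hypothetical sketch of the alternative; the real cleanup in ops.ts is not
// reproduced, and the StoredJob shape is assumed.
interface StoredJob {
  id: string;
  state: { runningAtMs?: number; lastRunAtMs?: number };
}

// Clears stale running markers and treats each interrupted run as having run.
// Returns the ids of the jobs that were touched so the caller can persist and log.
function recoverInterruptedJobs(jobs: StoredJob[]): string[] {
  const recovered: string[] = [];
  for (const job of jobs) {
    const startedAt = job.state.runningAtMs;
    if (startedAt === undefined) continue;

    // Never move lastRunAtMs backwards if a newer run was already recorded.
    job.state.lastRunAtMs = Math.max(job.state.lastRunAtMs ?? 0, startedAt);
    delete job.state.runningAtMs;
    recovered.push(job.id);
  }
  return recovered;
}

const jobs: StoredJob[] = [
  { id: "daily-summary", state: { runningAtMs: 2000, lastRunAtMs: 1000 } },
  { id: "idle-job", state: { lastRunAtMs: 1500 } },
];

const recoveredIds = recoverInterruptedJobs(jobs);
console.log(recoveredIds, jobs[0].state.lastRunAtMs);
```

The trade-off is that a job interrupted before it delivered anything is also marked as run, so that slot is skipped rather than retried. This favors at-most-once delivery over at-least-once.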
Affected files
- src/cron/service/timer.ts: onTimer (L572-731), applyJobResult (L296-474), isRunnableJob (L784-801)
- src/cron/service/ops.ts: startup cleanup (L102-114), missed job detection