Skip to content

Cron: lastRunAtMs not persisted until Phase 3 — gateway restart causes duplicate job execution #61343

@daemic24

Description

@daemic24

Bug

When a cron job executes successfully but the gateway restarts before the outcome persist (Phase 3) completes, lastRunAtMs is never written to jobs.json. On restart, the stale lastRunAtMs (potentially days old) causes runMissedJobs to consider the job overdue and execute it again.

Result: Duplicate delivery (e.g., two morning briefings).

Reproduction

  1. Cron job fires at 08:00, Phase 1 persists runningAtMs
  2. Job executes and delivers successfully (Phase 2, no lock held)
  3. Gateway restarts (SIGUSR1, crash, deploy) before Phase 3 persist
  4. On restart: runningAtMs is cleared (ops.ts:102-114), but lastRunAtMs is still stale
  5. isRunnableJob (timer.ts:784-801) sees previousRunAtMs > lastRunAtMs → job re-runs
  6. User receives a second delivery

This happened in production on 2025-04-05: a daily-summary job delivered at 08:00, then again at 12:06 after a restart cascade.

Root cause

onTimer in timer.ts uses three phases:

Phase What it does Persisted?
1 (L611-616) Sets runningAtMs await persist(state)
2 (L624-669) Executes job (unbounded time, no lock)
3 (L675-689) Calls applyOutcomeToStoredJob → sets lastRunAtMs ✓ but only if reached

If gateway dies between Phase 1 and Phase 3, lastRunAtMs is never updated. The runningAtMs marker is cleaned up on restart (ops.ts:102-114), but nothing prevents the missed-job check from using the stale lastRunAtMs.

The codebase already acknowledges Phase-1 crash scenarios — see the comment at ops.ts:390-392 referencing #17554 — but only for the runningAtMs marker, not for lastRunAtMs.

Suggested fix

Persist lastRunAtMs immediately after job execution completes, before delivery attempt:

// After executeJobCoreWithTimeout returns, before delivery:
await locked(state, async () => {
  await ensureLoaded(state, { forceReload: true, skipRecompute: true });
  const stored = state.store.jobs.find(j => j.id === job.id);
  if (stored) {
    stored.state.lastRunAtMs = startedAt;
    // Optionally: stored.state.lastRunStatus = "executed-pending-delivery"
  }
  await persist(state);
});

This way, even if the gateway crashes during or after delivery, the job won't be considered missed on restart. Delivery status can still be updated in Phase 3.

Alternative: on restart, if a runningAtMs marker is found and cleared, also set lastRunAtMs = runningAtMs before clearing — treating interrupted jobs as "ran but status unknown" rather than "never ran."

Affected files

  • src/cron/service/timer.tsonTimer (L572-731), applyJobResult (L296-474), isRunnableJob (L784-801)
  • src/cron/service/ops.ts — startup cleanup (L102-114), missed job detection

Metadata

Metadata

Assignees

Labels

dedupe:parentPrimary canonical item in dedupe cluster

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions