Cron runningAtMs zombie state regression — cleared state returns after gateway restart #59056

@ponchoooPenguin

Problem

Regression of #18120 (closed as fixed): runningAtMs still does not reliably clear when a job completes.

Environment

  • OpenClaw 2026.3.22 (4dcc39c)
  • macOS (arm64), Node v22.22.0 (gateway service) / v25.5.0 (CLI)
  • 45 cron jobs configured, ~14 scraper crons running frequently (every 1-2 hours)

Observed Behavior

Nine cron jobs were stuck at the same time with runningAtMs set to the exact same timestamp (1775045523045 = 2026-04-01 08:12:03 ET), even though their last runs had already finished and recorded a result (status ok or error, each with a valid lastDurationMs value).

The stuck jobs included a mix of:

  • Jobs that last ran 3-12 hours ago and completed normally
  • Jobs with different schedules and timeouts (2700s, 3600s, null)
  • Both haiku and sonnet46 model jobs

The identical runningAtMs across all 9 suggests a batch state write (possibly on scheduler wake/catch-up) that sets runningAtMs without a corresponding session actually starting.
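The clustering can be checked mechanically. The following is a minimal sketch, not OpenClaw code. It assumes each job is a JSON object that carries runningAtMs somewhere inside it; the exact layout of jobs.json is not shown in this report, so the helper searches nested objects. The names find_running_at, group_by_running_at and format_ms are hypothetical, and the sample jobs are made up; only the timestamp comes from the report.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def find_running_at(obj):
    """Return the first non-null runningAtMs found in a nested dict/list, else None."""
    if isinstance(obj, dict):
        value = obj.get("runningAtMs")
        if value is not None:
            return value
        children = obj.values()
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_running_at(child)
        if found is not None:
            return found
    return None


def group_by_running_at(jobs):
    """Map each runningAtMs value to the indices of the jobs that carry it."""
    groups = defaultdict(list)
    for index, job in enumerate(jobs):
        stamp = find_running_at(job)
        if stamp is not None:
            groups[stamp].append(index)
    return dict(groups)


def format_ms(ms, utc_offset_hours=0):
    """Render an epoch-millisecond timestamp at a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(ms / 1000, tz=tz).strftime("%Y-%m-%d %H:%M:%S")


# Made-up sample data; the "state" nesting is an assumption.
sample_jobs = [
    {"name": "job-a", "state": {"runningAtMs": 1775045523045}},
    {"name": "job-b", "state": {"runningAtMs": 1775045523045}},
    {"name": "job-c", "state": {"runningAtMs": None}},
]
groups = group_by_running_at(sample_jobs)
for stamp, members in groups.items():
    # UTC-4 is US Eastern daylight time, in effect on 2026-04-01.
    print(format_ms(stamp, utc_offset_hours=-4), members)
```

Any group with more than one member and a timestamp that matches no actual run start is a candidate for the zombie state described here.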

Impact

Downstream cron jobs (e.g. revenue-loop) stop firing, apparently because the scheduler either considers itself at its concurrency limit or skips scheduling when too many jobs are marked as running.

Workaround Applied

  1. Edited ~/.openclaw/cron/jobs.json to set runningAtMs: null for all stuck jobs
  2. Restarted gateway to pick up clean state
  3. Jobs resumed firing normally
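Step 1 can be scripted. This is a rough sketch under the same caveat: the structure of jobs.json is not documented in this report, so the helper walks the whole document and nulls every runningAtMs key it finds. The function names clear_running_at and clear_file are hypothetical. It clears the flag for all jobs, including any that are genuinely running, so it is best run while nothing is executing and after backing up the file.

```python
import json


def clear_running_at(obj):
    """Set every non-null runningAtMs in a nested structure to None.

    Returns the number of values that were cleared.
    """
    cleared = 0
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == "runningAtMs":
                if value is not None:
                    obj[key] = None  # overwriting an existing key is safe while iterating
                    cleared += 1
            else:
                cleared += clear_running_at(value)
    elif isinstance(obj, list):
        for item in obj:
            cleared += clear_running_at(item)
    return cleared


def clear_file(path):
    """Clear runningAtMs in a JSON file in place; rewrite only if something changed."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    cleared = clear_running_at(data)
    if cleared:
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(data, fh, indent=2)
            fh.write("\n")
    return cleared


# In-memory demonstration with made-up jobs; the "state" nesting is an assumption.
demo = {
    "jobs": [
        {"name": "job-a", "state": {"runningAtMs": 1775045523045}},
        {"name": "job-b", "state": {"runningAtMs": None}},
        {"name": "job-c", "state": {"runningAtMs": 1775045523045}},
    ]
}
print(clear_running_at(demo), "stuck flag(s) cleared")
```

To apply it to the real file, call clear_file with the expanded path of ~/.openclaw/cron/jobs.json, then restart the gateway as in step 2.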

Hypothesis

The fix in #18120 handles the case where a session completes/errors through applyJobResult. But there appears to be a path where runningAtMs gets set without a real session starting — possibly during catch-up scheduling when overdue jobs are detected. The catch-up path may set runningAtMs but then skip actual execution (e.g. due to stale delivery), leaving the flag permanently set.

Evidence: the gateway log shows "skipping stale delivery" warnings for some of these same jobs around the time the zombie state appeared.
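To make the hypothesis concrete, here is a toy model of the suspected path. It is not OpenClaw's scheduler code; Job, catch_up_hypothesized, catch_up_guarded and the is_stale callback are all invented for illustration. The first function marks every overdue job in one batch and then skips stale ones without clearing the flag, which would leave several jobs with an identical runningAtMs. The second marks a job only once it is actually going to run and clears the flag in a finally block.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Job:
    name: str
    running_at_ms: Optional[int] = None


def catch_up_hypothesized(jobs: List[Job], now_ms: int, is_stale: Callable[[Job], bool]) -> None:
    """Suspected behavior: batch-mark first, then skip stale jobs without cleanup."""
    for job in jobs:
        job.running_at_ms = now_ms  # one batch write, same timestamp for every job
    for job in jobs:
        if is_stale(job):
            continue  # skipped, so nothing ever clears the flag
        # Stand-in for running the job and clearing state on completion.
        job.running_at_ms = None


def catch_up_guarded(jobs: List[Job], now_ms: int, is_stale: Callable[[Job], bool]) -> None:
    """One possible guard: mark only jobs that will run, and always clear afterwards."""
    for job in jobs:
        if is_stale(job):
            continue  # never marked as running
        job.running_at_ms = now_ms
        try:
            pass  # a real scheduler would execute the job here
        finally:
            job.running_at_ms = None


def stale_if_a(job: Job) -> bool:
    return job.name.startswith("a")


buggy = [Job("a1"), Job("a2"), Job("b1")]
catch_up_hypothesized(buggy, 1775045523045, stale_if_a)
print([(j.name, j.running_at_ms) for j in buggy])

guarded = [Job("a1"), Job("a2"), Job("b1")]
catch_up_guarded(guarded, 1775045523045, stale_if_a)
print([(j.name, j.running_at_ms) for j in guarded])
```

In the first run the two skipped jobs keep the shared timestamp, matching the observed symptom; in the second run no job is left marked. Whether the real catch-up path behaves this way is unconfirmed.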

Possibly Related

Two different Node.js versions observed in gateway logs (22.22.0 for the service, 25.5.0 for CLI). Unknown if this contributes to state inconsistency.

Labels

dedupe:parent (Primary canonical item in dedupe cluster)
