Problem
Regression of #18120 (closed as fixed). runningAtMs is still not reliably clearing on job completion.
Environment
- OpenClaw 2026.3.22 (4dcc39c)
- macOS (arm64), Node v22.22.0 (gateway service) / v25.5.0 (CLI)
- 45 cron jobs configured, ~14 scraper crons running frequently (every 1-2 hours)
Observed Behavior
9 cron jobs were stuck simultaneously with runningAtMs set to the exact same timestamp (1775045523045 = 2026-04-01 08:12:03 ET), even though their last runs had finished (status: ok or error, each with a proper lastDurationMs value).
The stuck jobs included a mix of:
- Jobs that last ran 3-12 hours ago and completed normally
- Jobs with different schedules and timeouts (2700s, 3600s, null)
- Both haiku and sonnet46 model jobs
The identical runningAtMs across all 9 suggests a batch state write (possibly on scheduler wake/catch-up) that sets runningAtMs without a corresponding session actually starting.
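A quick way to spot this pattern is to group jobs by their runningAtMs value and look for timestamps shared by several jobs. The sketch below is only illustrative: it assumes each job is a dict with a "name" and a "runningAtMs" field, which may not match the real jobs.json layout, and the job names are made up.

```python
from collections import defaultdict
from datetime import datetime, timezone


def group_by_running_at(jobs):
    """Map each non-null runningAtMs value to the names of the jobs carrying it."""
    groups = defaultdict(list)
    for job in jobs:
        ts = job.get("runningAtMs")
        if ts is not None:
            groups[ts].append(job.get("name", "<unnamed>"))
    return dict(groups)


def shared_timestamps(jobs, min_jobs=2):
    """Keep only timestamps shared by at least min_jobs jobs."""
    return {
        ts: names
        for ts, names in group_by_running_at(jobs).items()
        if len(names) >= min_jobs
    }


def ms_to_utc(ms):
    """Render an epoch-milliseconds value as a UTC wall-clock string."""
    dt = datetime.fromtimestamp(ms // 1000, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H:%M:%S")


# Illustrative data only; the flat job layout and names are assumptions.
sample = [
    {"name": "scraper-a", "runningAtMs": 1775045523045},
    {"name": "scraper-b", "runningAtMs": 1775045523045},
    {"name": "scraper-c", "runningAtMs": None},
]
print(shared_timestamps(sample))
print(ms_to_utc(1775045523045))  # 2026-04-01 12:12:03 UTC, i.e. 08:12:03 ET
```

Several unrelated jobs sharing one millisecond-exact value is hard to explain by independent session starts, which is why a single batch write looks more likely.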
Impact
Downstream cron jobs (e.g. revenue-loop) stop firing, apparently because the scheduler believes it is at its concurrency limit or skips scheduling when too many jobs are in the running state.
Workaround Applied
- Edited ~/.openclaw/cron/jobs.json to set runningAtMs: null for all stuck jobs
- Restarted gateway to pick up clean state
- Jobs resumed firing normally
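For anyone hitting the same state, the manual edit can be scripted. This is a rough sketch, not an official tool: because the exact jobs.json schema is not documented here, it walks the whole JSON tree and nulls every key named runningAtMs, wherever it sits. It assumes no job is genuinely running at the time, and it writes a .bak copy first. The function names are hypothetical.

```python
import json
import shutil
from pathlib import Path


def clear_running_flags(node):
    """Set every 'runningAtMs' key found anywhere in the JSON tree to None.

    Returns the number of non-null values that were cleared.
    """
    cleared = 0
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "runningAtMs":
                if value is not None:
                    node[key] = None  # same key, so safe while iterating
                    cleared += 1
            else:
                cleared += clear_running_flags(value)
    elif isinstance(node, list):
        for item in node:
            cleared += clear_running_flags(item)
    return cleared


def reset_jobs_file(path):
    """Back up the jobs file, clear the flags, and write the result back."""
    path = Path(path).expanduser()
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    data = json.loads(path.read_text())
    cleared = clear_running_flags(data)
    path.write_text(json.dumps(data, indent=2) + "\n")
    return cleared
```

Usage would be `reset_jobs_file("~/.openclaw/cron/jobs.json")`, followed by a gateway restart so the clean state is picked up.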
Hypothesis
The fix in #18120 handles the case where a session completes/errors through applyJobResult. But there appears to be a path where runningAtMs gets set without a real session starting — possibly during catch-up scheduling when overdue jobs are detected. The catch-up path may set runningAtMs but then skip actual execution (e.g. due to stale delivery), leaving the flag permanently set.
Evidence: gateway log shows skipping stale delivery warnings for some of these same jobs around the time the zombie state appeared.
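To make the suspected failure mode concrete, here is a toy model of it. This is not OpenClaw's code: run_job_unsafe, run_job_guarded, is_stale, and execute are all hypothetical names, and the real scheduler may be structured quite differently. The first function sets the flag and then returns early on the skip path; the second clears the flag on every exit path.

```python
import time


def run_job_unsafe(job, is_stale, execute):
    """Hypothesized bug: the flag is set, then an early return skips cleanup."""
    job["runningAtMs"] = int(time.time() * 1000)
    if is_stale(job):
        return "skipped"  # runningAtMs is never cleared on this path
    status = execute(job)
    job["runningAtMs"] = None
    return status


def run_job_guarded(job, is_stale, execute):
    """Same flow, but the flag is cleared no matter how the function exits."""
    job["runningAtMs"] = int(time.time() * 1000)
    try:
        if is_stale(job):
            return "skipped"
        return execute(job)
    finally:
        job["runningAtMs"] = None


# A job that gets skipped as stale stays flagged only in the unsafe version.
stuck = {"name": "demo", "runningAtMs": None}
run_job_unsafe(stuck, is_stale=lambda j: True, execute=lambda j: "ok")
print(stuck["runningAtMs"] is not None)  # True: the flag is left set

clean = {"name": "demo", "runningAtMs": None}
run_job_guarded(clean, is_stale=lambda j: True, execute=lambda j: "ok")
print(clean["runningAtMs"])  # None
```

If the real catch-up path resembles the unsafe version, clearing the flag in a finally-style cleanup (or setting it only after the session has actually started) would close the gap.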
Possibly Related
Two different Node.js versions observed in gateway logs (22.22.0 for the service, 25.5.0 for CLI). Unknown if this contributes to state inconsistency.