cron: ghost runs recorded as ok when gateway is down (durationMs < 50ms) #63106
Open
Labels
- `P2`: Normal backlog priority with limited blast radius.
- `clawsweeper:fix-shape-clear`: ClawSweeper found a clear likely implementation shape for this issue.
- `clawsweeper:linked-pr-open`: ClawSweeper found an open linked pull request for this issue.
- `clawsweeper:needs-maintainer-review`: ClawSweeper marked this issue as needing maintainer review before automation.
- `clawsweeper:needs-product-decision`: ClawSweeper marked this issue as needing a product or behavior decision.
- `clawsweeper:no-new-fix-pr`: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
- `clawsweeper:source-repro`: ClawSweeper found a high-confidence source-level issue reproduction.
- `impact:message-loss`: Channel message delivery can be lost, duplicated, or misrouted.
- `issue-rating: 🦞 diamond lobster`: Very strong issue quality with high-confidence source-level or clear reproduction.
Problem
When the OpenClaw gateway goes down (e.g. due to a config error or model routing failure), cron jobs configured with `sessionTarget: "main"` continue to fire on schedule and record `status: "ok"` in the run log. The runs complete in 1–8 ms (`durationMs: 1`) because `executeMainSessionCronJob` returns immediately after calling `enqueueSystemEvent` + `requestHeartbeatNow`; no actual agent turn is awaited.

There is no alerting, no health check failure, and no indication in the log that the jobs are not actually executing. The cron scheduler marks runs as successful even though no agent turn occurred and no script was run.
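The fast-return behavior described above can be illustrated with a small, self-contained model. This is a sketch, not the code in `src/cron/service/timer.ts`: only the names `executeMainSessionCronJob`, `enqueueSystemEvent`, and `requestHeartbeatNow` come from this issue, while the function bodies, the `RunResult` type, the `gatewayHealthy` flag, and the counters are assumptions made for illustration.

```typescript
// Simplified model of the fire-and-forget path. All bodies are hypothetical.
type RunResult = { status: "ok" | "error"; durationMs: number };

const pendingEvents: string[] = [];
let gatewayHealthy = false; // assumption: stands in for a gateway that is down
let agentTurns = 0; // counts agent turns that actually ran

function enqueueSystemEvent(text: string): void {
  // Only queues the event; nothing is processed here.
  pendingEvents.push(text);
}

function requestHeartbeatNow(): void {
  // Hypothetical: a heartbeat drains the queue only when the gateway is up.
  if (!gatewayHealthy) return;
  agentTurns += pendingEvents.length;
  pendingEvents.length = 0;
}

function executeMainSessionCronJob(text: string): RunResult {
  const startedAt = Date.now();
  enqueueSystemEvent(text);
  requestHeartbeatNow();
  // Returns without awaiting an agent turn, so the status is always "ok".
  return { status: "ok", durationMs: Date.now() - startedAt };
}

const result = executeMainSessionCronJob("daily-report");
console.log(result.status, agentTurns, pendingEvents.length);
```

In this model the run reports `ok` even though `agentTurns` stays at 0, which mirrors the mismatch between the run log and what actually executed.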
Reproduction:
- Job configuration: `sessionTarget: "main"`, `payload.kind: "systemEvent"`, `wakeMode: "next-heartbeat"`
- Run-log entry recorded while the gateway is down: `status: "ok"`, `durationMs: 1`

Root Cause
In `src/cron/service/timer.ts`, `executeMainSessionCronJob` enqueues a system event and returns `{ status: "ok" }` without waiting for the agent to actually process the heartbeat. If the gateway or agent session is unhealthy, the event is silently dropped and the cron run is still recorded as `ok`.

The `wakeMode: "now"` path (`runHeartbeatOnce`) at least checks the heartbeat result, but `wakeMode: "next-heartbeat"` (the default for main-session jobs) has no health check at all.

Proposed Fix
Add a ghost-run detector in the `onEvent` handler in `src/gateway/server-cron.ts`: when a finished event has `status: "ok"` and `durationMs < GHOST_RUN_THRESHOLD_MS` (e.g. 50 ms) for a job where `sessionTarget !== "none"` and `payload.kind === "systemEvent"`, log a warning and/or record the run with an additional `warn` flag or summary note so operators can identify silent failures.

At a minimum, the ghost-run check should:
- Apply to `sessionTarget: "main"` + `payload.kind: "systemEvent"` + `wakeMode: "next-heartbeat"` jobs (the fast-return code path)
- Be visible in `openclaw doctor` or `cron logs`

See `src/cron/service/timer.ts` → `executeMainSessionCronJob` and `src/gateway/server-cron.ts` → `onEvent` handler.

Acceptance Criteria
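- A warning is logged for suspected ghost runs (e.g. `cron: possible ghost run detected`)
- The warning includes `jobId`, `durationMs`, `sessionTarget`, and `payloadKind`
- […] `executeMainSessionCronJob` are not broken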
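
The check proposed in this issue could look roughly like the sketch below. It assumes a hypothetical `CronFinishedEvent` shape and an injectable `warn` callback; the real event type and logger in `src/gateway/server-cron.ts` may differ, and the helper names `isPossibleGhostRun` and `formatGhostRunWarning` are invented for illustration. Only `GHOST_RUN_THRESHOLD_MS`, the 50 ms example value, the four logged fields, and the warning text come from the issue.

```typescript
// Hypothetical event shape; the real finished-event type may differ.
interface CronFinishedEvent {
  jobId: string;
  status: "ok" | "error" | "skipped";
  durationMs: number;
  sessionTarget: string;
  payloadKind: string;
}

const GHOST_RUN_THRESHOLD_MS = 50; // example threshold from the issue

// True when a run was recorded as ok but finished too quickly
// to have included a real agent turn.
function isPossibleGhostRun(evt: CronFinishedEvent): boolean {
  return (
    evt.status === "ok" &&
    evt.durationMs < GHOST_RUN_THRESHOLD_MS &&
    evt.sessionTarget !== "none" &&
    evt.payloadKind === "systemEvent"
  );
}

// Builds the warning line with the fields named in the acceptance criteria.
function formatGhostRunWarning(evt: CronFinishedEvent): string {
  const fields = {
    jobId: evt.jobId,
    durationMs: evt.durationMs,
    sessionTarget: evt.sessionTarget,
    payloadKind: evt.payloadKind,
  };
  return "cron: possible ghost run detected " + JSON.stringify(fields);
}

// `warn` is injectable so the sketch does not assume a particular logger.
function onEvent(
  evt: CronFinishedEvent,
  warn: (msg: string) => void = console.warn,
): void {
  if (isPossibleGhostRun(evt)) warn(formatGhostRunWarning(evt));
}

// Usage: one suspiciously fast run and one normal run.
const warnings: string[] = [];
const collect = (msg: string): void => {
  warnings.push(msg);
};
const fastRun: CronFinishedEvent = {
  jobId: "job-1",
  status: "ok",
  durationMs: 1,
  sessionTarget: "main",
  payloadKind: "systemEvent",
};
onEvent(fastRun, collect);
onEvent({ ...fastRun, jobId: "job-2", durationMs: 1200 }, collect);
console.log(warnings);
```

Keeping the predicate separate from the logging makes it possible to reuse the same check when recording a `warn` flag on the run, should that option be chosen instead of (or in addition to) a log line. The sketch does not inspect `wakeMode`, since it is unclear from this issue whether the finished event carries that field.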