cron: ghost runs recorded as ok when gateway is down (durationMs < 50ms) #63106

@liaoandi


Problem

When the OpenClaw gateway goes down (e.g. due to a config error or model routing failure), cron jobs configured with sessionTarget: "main" continue to fire on schedule and record status: "ok" in the run log. The runs complete in 1–8 ms (e.g. durationMs: 1) because executeMainSessionCronJob returns immediately after calling enqueueSystemEvent and requestHeartbeatNow; no actual agent turn is awaited.

There is no alerting, no health check failure, and no indication in the log that the jobs are not actually executing. The cron scheduler marks runs as successful even though no agent turn occurred and no script was run.

Reproduction:

  1. Configure a cron job with sessionTarget: "main", payload.kind: "systemEvent", wakeMode: "next-heartbeat"
  2. Bring the gateway into a state where the main session cannot process requests (e.g. invalid model config, provider auth error)
  3. Observe run log: status: "ok", durationMs: 1
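For reference, a job matching step 1 might look roughly like the sketch below. Only sessionTarget, payload.kind, and wakeMode come from this report; the interface name, the id, schedule, and text fields, and all of their values are hypothetical and may not match the real cron job schema.

```typescript
// Hypothetical job shape for the reproduction. Only sessionTarget,
// payload.kind, and wakeMode are taken from the report; the other
// fields are illustrative assumptions.
interface CronJobSketch {
  id: string;
  schedule: string;
  sessionTarget: "main" | "none";
  wakeMode: "next-heartbeat" | "now";
  payload: { kind: "systemEvent"; text: string };
}

const reproJob: CronJobSketch = {
  id: "ghost-run-repro",            // assumed field
  schedule: "*/5 * * * *",          // assumed field: every five minutes
  sessionTarget: "main",
  wakeMode: "next-heartbeat",
  payload: { kind: "systemEvent", text: "run nightly script" },
};
```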

Root Cause

In src/cron/service/timer.ts, executeMainSessionCronJob enqueues a system event and returns { status: "ok" } without waiting for the agent to actually process the heartbeat. If the gateway or agent session is unhealthy, the event is silently dropped and the cron run is still recorded as ok.

The wakeMode: "now" path (runHeartbeatOnce) at least checks the heartbeat result, but wakeMode: "next-heartbeat" (the default for main-session jobs) has no health check at all.
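The fast-return pattern described above can be illustrated with a small standalone simulation. This is not the code in src/cron/service/timer.ts: every function and variable name below is a hypothetical stand-in, and the queue and health flag are simplifications (here the event is left unprocessed rather than dropped). It only shows why a run can be reported as ok while no consumer ever handles the event.

```typescript
type RunResultSketch = { status: "ok" | "error"; durationMs: number };

// Hypothetical stand-ins for the event queue and gateway health.
const pendingEvents: string[] = [];
let gatewayHealthy = false; // simulate a gateway that cannot run agent turns

function enqueueSystemEventSketch(text: string): void {
  pendingEvents.push(text);
}

function requestHeartbeatSketch(): void {
  // A healthy gateway would eventually drain the queue; an unhealthy one never does.
  if (gatewayHealthy) pendingEvents.length = 0;
}

function executeMainSessionJobSketch(text: string): RunResultSketch {
  const startedAt = Date.now();
  enqueueSystemEventSketch(text);
  requestHeartbeatSketch();
  // Returns without awaiting any agent turn, so the status says nothing
  // about whether the event was processed.
  return { status: "ok", durationMs: Date.now() - startedAt };
}

const result = executeMainSessionJobSketch("run nightly script");
console.log(result.status, pendingEvents.length); // prints: ok 1
```

The run reports ok while the event is still sitting in the queue, which is the mismatch the run log currently hides.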

Proposed Fix

Add a ghost-run detector in the onEvent handler in src/gateway/server-cron.ts. When a finished event has status: "ok" and durationMs < GHOST_RUN_THRESHOLD_MS (e.g. 50 ms) for a job where sessionTarget !== "none" and payload.kind === "systemEvent", log a warning and/or record the run with an additional warn flag or summary note so that operators can identify silent failures.

At a minimum, the ghost-run check should:

  • Apply only to sessionTarget: "main" + payload.kind: "systemEvent" + wakeMode: "next-heartbeat" jobs (the fast-return code path)
  • Not treat legitimately fast jobs (e.g. no-op system events) as errors — use a configurable threshold (default 50 ms)
  • Emit a structured log warning that can be surfaced in openclaw doctor or cron logs

See src/cron/service/timer.ts (executeMainSessionCronJob) and src/gateway/server-cron.ts (the onEvent handler).
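A minimal sketch of the check itself, following the narrower scoping in the bullets above (main session, systemEvent payload, next-heartbeat wake mode). The event shape, the field names on it, and the function name are assumptions for illustration; the real finished-event type in server-cron.ts may differ.

```typescript
// Hypothetical shape of a "finished" cron event; field names are assumed.
interface FinishedEventSketch {
  jobId: string;
  status: "ok" | "error" | "skipped";
  durationMs: number;
  sessionTarget: string;
  payloadKind: string;
  wakeMode: string;
}

const GHOST_RUN_THRESHOLD_MS = 50; // default suggested in this issue

// True when an "ok" run on the fast-return path finished suspiciously quickly.
function isPossibleGhostRun(
  evt: FinishedEventSketch,
  thresholdMs: number = GHOST_RUN_THRESHOLD_MS,
): boolean {
  return (
    evt.status === "ok" &&
    evt.sessionTarget === "main" &&
    evt.payloadKind === "systemEvent" &&
    evt.wakeMode === "next-heartbeat" &&
    evt.durationMs < thresholdMs
  );
}

const suspicious: FinishedEventSketch = {
  jobId: "ghost-run-repro",
  status: "ok",
  durationMs: 1,
  sessionTarget: "main",
  payloadKind: "systemEvent",
  wakeMode: "next-heartbeat",
};

console.log(isPossibleGhostRun(suspicious));                             // true
console.log(isPossibleGhostRun({ ...suspicious, durationMs: 1200 }));    // false
console.log(isPossibleGhostRun({ ...suspicious, wakeMode: "now" }));     // false
```

Because the check only flags runs and does not change their status, a legitimately fast no-op event produces a warning rather than a failure.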

Acceptance Criteria

  • When a main-session cron job completes in < 50 ms, a warning is logged (cron: possible ghost run detected)
  • The warning includes jobId, durationMs, sessionTarget, and payloadKind
  • The behavior is gated behind a threshold that can be adjusted via cron config
  • Existing tests for executeMainSessionCronJob are not broken
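The first three criteria could be exercised with something like the sketch below: a threshold resolved from cron config with a 50 ms fallback, and a structured warning carrying the four required fields. The config key ghostRunThresholdMs, the type names, and both function names are hypothetical; only the message text and the field list come from this issue.

```typescript
// Hypothetical cron config slice; the key name is an assumption.
interface CronConfigSketch {
  ghostRunThresholdMs?: number;
}

interface FinishedRunSketch {
  jobId: string;
  durationMs: number;
  sessionTarget: string;
  payloadKind: string;
}

interface GhostRunWarningSketch extends FinishedRunSketch {
  message: string;
}

// Fall back to 50 ms when the config value is missing or not a usable number.
function resolveGhostRunThresholdMs(config?: CronConfigSketch): number {
  const value = config?.ghostRunThresholdMs;
  return typeof value === "number" && Number.isFinite(value) && value >= 0 ? value : 50;
}

// Returns a structured warning for a too-fast run, or null otherwise.
function buildGhostRunWarning(
  run: FinishedRunSketch,
  thresholdMs: number,
): GhostRunWarningSketch | null {
  if (run.durationMs >= thresholdMs) return null;
  return {
    message: "cron: possible ghost run detected",
    jobId: run.jobId,
    durationMs: run.durationMs,
    sessionTarget: run.sessionTarget,
    payloadKind: run.payloadKind,
  };
}

const warning = buildGhostRunWarning(
  { jobId: "ghost-run-repro", durationMs: 1, sessionTarget: "main", payloadKind: "systemEvent" },
  resolveGhostRunThresholdMs({}),
);
console.log(JSON.stringify(warning));
```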

Metadata

    Labels

      • P2 - Normal backlog priority with limited blast radius.
      • clawsweeper:fix-shape-clear - ClawSweeper found a clear likely implementation shape for this issue.
      • clawsweeper:linked-pr-open - ClawSweeper found an open linked pull request for this issue.
      • clawsweeper:needs-maintainer-review - ClawSweeper marked this issue as needing maintainer review before automation.
      • clawsweeper:needs-product-decision - ClawSweeper marked this issue as needing a product or behavior decision.
      • clawsweeper:no-new-fix-pr - ClawSweeper does not recommend queueing a new automated fix PR for this issue.
      • clawsweeper:source-repro - ClawSweeper found a high-confidence source-level issue reproduction.
      • impact:message-loss - Channel message delivery can be lost, duplicated, or misrouted.
      • issue-rating: 🦞 diamond lobster - Very strong issue quality with high-confidence source-level or clear reproduction.
