[Bug] Cron isolated agentTurn: "already-running" survives restart, run history always empty #68157

@interchainlive

Bug Description

Cron jobs with sessionTarget: isolated and payload.kind: agentTurn enter a stuck "already-running" state and never recover, even after a gateway restart. openclaw cron runs --id <jobId> shows 0 history entries, and a manual openclaw cron run returns {ok: true, ran: false, reason: "already-running"} indefinitely.

This is a confirmed manifestation of issue #43452, with one additional detail: the stuck state survives a gateway restart, which suggests the runningAtMs flag is persisted in the cron scheduler's internal state file and is not cleared on restart.

Environment

  • Version: OpenClaw 2026.4.14 (323493f)
  • Gateway: systemd, local loopback
  • Node: node 24.14.1, Linux 6.17.0-20-generic
  • Affected job: heartbeat-dispatch (sessionTarget=isolated, payload.kind=agentTurn, agentId=qa)

Evidence

1. Stuck running state survives restart

Manual trigger while stuck:

$ openclaw cron run <jobId> --timeout 60000
{ok: true, enqueued: true, runId: "manual:<id>:..."}   <- initial manual trigger
$ openclaw cron run <jobId> --timeout 60000
{ok: true, ran: false, reason: "already-running"}         <- stuck forever

Gateway restart does NOT clear the stuck state. The job re-enters "already-running" within seconds of restart.

2. heartbeat-dispatch.sh executes but scheduler history is empty

The script runs every 30 min and writes to its own log:

[2026-04-17T14:30:09Z] heartbeat-dispatch: Subagent track check: STALE:0 REPORTED:0
[2026-04-17T14:30:09Z] heartbeat-dispatch: No PENDING tasks

Yet openclaw cron runs --id <jobId> returns total: 0. The script executes; the scheduler never records it.

3. STUCK_RUN_MS logic

From jobs-cnkUBFyc.js:

const STUCK_RUN_MS = 7200 * 1e3; // 2 hours
if (typeof runningAt === "number" && nowMs - runningAt > STUCK_RUN_MS) {
  state.deps.log.warn({jobId, runningAtMs: runningAt}, "cron: clearing stuck running marker");
  job.state.runningAtMs = void 0;
  changed = true;
}

The stuck marker is cleared only once it is more than 2 hours old. heartbeat-dispatch fires every 30 minutes, so if each trigger rewrites runningAtMs (not yet confirmed), the marker never ages past the threshold.
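To make the timing argument concrete, here is a minimal simulation. It assumes, which this issue has not confirmed, that each 30-minute fire rewrites runningAtMs; under that assumption the marker's age never exceeds STUCK_RUN_MS. If the marker were written once and left alone, the same check would clear it at the first tick past two hours. simulateStuckCheck and its parameters are hypothetical names, not scheduler code.

```javascript
const STUCK_RUN_MS = 7200 * 1e3; // 2 hours, as in jobs-cnkUBFyc.js
const FIRE_INTERVAL_MS = 30 * 60 * 1e3; // heartbeat-dispatch schedule

// Hypothetical simulation: returns the time (ms) at which the stuck check
// clears the marker, or null if it never does within horizonMs.
function simulateStuckCheck({ refreshOnFire, horizonMs }) {
  let runningAtMs = 0; // marker written at t = 0; the session never starts
  for (let nowMs = FIRE_INTERVAL_MS; nowMs <= horizonMs; nowMs += FIRE_INTERVAL_MS) {
    // Same condition as the scheduler's stuck check.
    if (nowMs - runningAtMs > STUCK_RUN_MS) return nowMs;
    // ASSUMPTION: each fire rewrites the marker with the current time.
    if (refreshOnFire) runningAtMs = nowMs;
  }
  return null;
}

const horizonMs = 24 * 3600 * 1e3; // simulate one day
const clearedWithoutRefresh = simulateStuckCheck({ refreshOnFire: false, horizonMs });
const clearedWithRefresh = simulateStuckCheck({ refreshOnFire: true, horizonMs });
```

Under these assumptions the unrefreshed marker is cleared at the 2.5-hour tick (the 2-hour tick is not strictly greater than the threshold), while the refreshed marker is never cleared within the simulated day.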

Root Cause Hypothesis

The runningAtMs flag is written BEFORE the isolated session executes. If the isolated session fails to start, the flag is never cleared. Subsequent runs see "already-running" immediately. The 2-hour STUCK_RUN_MS threshold exists but the job re-triggers before it expires.
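The failure mode in this hypothesis can be sketched as follows. This is an illustration, not the scheduler's actual code: runJobOnce, startSession, the clearOnFailure option, and the job shape are all hypothetical. The guarded variant shows one way the marker could be released when session startup fails.

```javascript
// Hypothetical sketch of "write marker, then start session".
function runJobOnce(job, startSession, nowMs, { clearOnFailure }) {
  if (typeof job.state.runningAtMs === "number") {
    return { ok: true, ran: false, reason: "already-running" };
  }
  job.state.runningAtMs = nowMs; // marker written BEFORE the session starts
  try {
    startSession(job);
  } catch (err) {
    // Without this release, the marker outlives the failed start.
    if (clearOnFailure) job.state.runningAtMs = undefined;
    return { ok: false, ran: false, reason: String(err.message) };
  }
  job.state.runningAtMs = undefined; // normal completion clears the marker
  return { ok: true, ran: true };
}

const failingStart = () => { throw new Error("session failed to start"); };

// Leaky variant: the first failed start leaves the marker set.
const leaky = { state: {} };
runJobOnce(leaky, failingStart, 1000, { clearOnFailure: false });
const leakyRetry = runJobOnce(leaky, failingStart, 2000, { clearOnFailure: false });

// Guarded variant: the marker is released, so the next run proceeds.
const guarded = { state: {} };
runJobOnce(guarded, failingStart, 1000, { clearOnFailure: true });
const guardedRetry = runJobOnce(guarded, () => {}, 2000, { clearOnFailure: true });
```

In the leaky variant the retry is rejected with reason "already-running", matching the symptom reported above; in the guarded variant the retry runs.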

Recommended Patch (Option C — Self-healing on restart)

On gateway startup, check all isolated agentTurn jobs with runningAtMs set. If the associated isolated session is not actually running, clear the flag immediately. This prevents the "survives restart" behavior and is the lowest-risk fix:

// On cron scheduler init / gateway startup (must run inside an async function).
// NOTE: sessionManager.isJobSessionRunning is a proposed helper, not a confirmed API.
for (const job of Object.values(state.jobs)) {
  if (job.config.sessionTarget === "isolated" &&
      job.config.payload?.kind === "agentTurn" &&
      typeof job.state.runningAtMs === "number") {
    // Check if isolated session is actually running for this job
    const isRunning = await state.deps.sessionManager.isJobSessionRunning(job.id);
    if (!isRunning) {
      state.deps.log.warn({jobId: job.id}, "cron: clearing orphaned runningAtMs on startup");
      job.state.runningAtMs = void 0;
      // Assumes the same `changed` flag as the stuck-run check, so the cleared state is persisted.
      changed = true;
    }
  }
}

Workaround (for users)

Add a pre-flight reset to heartbeat-dispatch.sh:

openclaw cron disable <jobId>
sleep 2
openclaw cron enable <jobId>

Labels: close:duplicate (Closed as duplicate), dedupe:child (Duplicate issue/PR child in dedupe cluster), dedupe:parent (Primary canonical item in dedupe cluster)
