Bug Description
Cron jobs with sessionTarget: isolated and payload.kind: agentTurn enter a stuck "already-running" state and never recover, even after a gateway restart. openclaw cron runs --id <jobId> shows 0 history entries, and a manual cron run returns {ok: true, ran: false, reason: "already-running"} indefinitely.
This is a confirmed manifestation of issue #43452, with one additional detail: the stuck state survives a gateway restart, which suggests the runningAtMs flag is persisted in the cron scheduler's internal state file and is not cleared on restart.
Environment
- Version: OpenClaw 2026.4.14 (323493f)
- Gateway: systemd, local loopback
- Node: node 24.14.1, Linux 6.17.0-20-generic
- Affected job: heartbeat-dispatch (sessionTarget=isolated, payload.kind=agentTurn, agentId=qa)
Evidence
1. Stuck running state survives restart
Manual trigger while stuck:
$ openclaw cron run <jobId> --timeout 60000
{ok: true, enqueued: true, runId: "manual:<id>:..."} <- initial manual trigger
$ openclaw cron run <jobId> --timeout 60000
{ok: true, ran: false, reason: "already-running"} <- stuck forever
Gateway restart does NOT clear the stuck state. The job re-enters "already-running" within seconds of restart.
2. heartbeat-dispatch.sh executes but scheduler history is empty
The script runs every 30 min and writes to its own log:
[2026-04-17T14:30:09Z] heartbeat-dispatch: Subagent track check: STALE:0 REPORTED:0
[2026-04-17T14:30:09Z] heartbeat-dispatch: No PENDING tasks
Yet openclaw cron runs --id <jobId> returns total: 0. The script executes; the scheduler never records it.
3. STUCK_RUN_MS logic
From jobs-cnkUBFyc.js:
    const STUCK_RUN_MS = 7200 * 1e3; // 2 hours
    if (typeof runningAt === "number" && nowMs - runningAt > STUCK_RUN_MS) {
      state.deps.log.warn({jobId, runningAtMs: runningAt}, "cron: clearing stuck running marker");
      job.state.runningAtMs = void 0;
      changed = true;
    }
The stuck marker is cleared only once it is more than 2 hours old. heartbeat-dispatch fires every 30 minutes, so the job is re-triggered several times before that threshold is reached.
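To make the timing concrete, the threshold check from the snippet above can be pulled out into a pure function and evaluated at each 30-minute trigger. This is only a sketch: shouldClearStuckMarker and the sample timestamps are hypothetical names and values chosen for illustration. Only STUCK_RUN_MS and the comparison itself come from jobs-cnkUBFyc.js.

```javascript
// Same constant as in the snippet above: 7200 * 1e3 ms = 2 hours.
const STUCK_RUN_MS = 7200 * 1e3;

// Hypothetical helper that mirrors the condition in the snippet.
function shouldClearStuckMarker(runningAt, nowMs) {
  return typeof runningAt === "number" && nowMs - runningAt > STUCK_RUN_MS;
}

const MINUTE_MS = 60 * 1e3;
const runningAt = 0; // illustrative: marker written at t = 0

// Evaluate the check at triggers 30, 60, 90, 120 and 150 minutes later.
const results = [30, 60, 90, 120, 150].map((minutes) =>
  shouldClearStuckMarker(runningAt, minutes * MINUTE_MS)
);
console.log(results); // [ false, false, false, false, true ]
```

Because the comparison is strict, the trigger at exactly 120 minutes does not clear the marker. In this model the first trigger that would clear it is the one at 150 minutes, and only if runningAt has not been rewritten in the meantime.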
Root Cause Hypothesis
The runningAtMs flag is written BEFORE the isolated session executes. If the isolated session fails to start, the flag is never cleared. Subsequent runs see "already-running" immediately. The 2-hour STUCK_RUN_MS threshold exists but the job re-triggers before it expires.
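The hypothesized sequence can be reproduced in a small model. This is a simplified simulation, not OpenClaw's actual code: makeJob, runJob, and startSession are hypothetical stand-ins, and the only things taken from the report are the runningAtMs field and the "already-running" response shape.

```javascript
// Hypothetical, simplified model of the suspected sequence.
function makeJob() {
  return { state: { runningAtMs: undefined } };
}

// startSession stands in for whatever launches the isolated session.
async function runJob(job, startSession, nowMs) {
  if (typeof job.state.runningAtMs === "number") {
    return { ok: true, ran: false, reason: "already-running" };
  }
  job.state.runningAtMs = nowMs; // marker written BEFORE the session starts
  await startSession(); // if this rejects, the cleanup below never runs
  job.state.runningAtMs = undefined;
  return { ok: true, ran: true };
}

async function demo() {
  const job = makeJob();
  const failingStart = async () => {
    throw new Error("session failed to start");
  };
  // First run: the session fails to start, leaving the marker behind.
  const first = await runJob(job, failingStart, 1000).catch((err) => ({
    ok: false,
    error: err.message,
  }));
  // Second run: a healthy session is never attempted because the marker is still set.
  const second = await runJob(job, async () => {}, 2000);
  return { first, second, marker: job.state.runningAtMs };
}

demo().then((result) => console.log(result));
```

In this model the second call returns reason: "already-running" and the marker keeps its original timestamp, which matches the behavior described in the Evidence section.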
Recommended Patch (Option C — Self-healing on restart)
On gateway startup, check all isolated agentTurn jobs with runningAtMs set. If the associated isolated session is not actually running, clear the flag immediately. This prevents the "survives restart" behavior and is the lowest-risk fix:
    // On cron scheduler init / gateway startup (must run inside an async function):
    for (const job of Object.values(state.jobs)) {
      if (job.config.sessionTarget === "isolated" &&
          job.config.payload?.kind === "agentTurn" &&
          typeof job.state.runningAtMs === "number") {
        // Check whether an isolated session is actually running for this job.
        // isJobSessionRunning is a proposed helper, not an existing API.
        const isRunning = await state.deps.sessionManager.isJobSessionRunning(job.id);
        if (!isRunning) {
          state.deps.log.warn({jobId: job.id}, "cron: clearing orphaned runningAtMs on startup");
          job.state.runningAtMs = void 0;
          // If this code tracks a `changed` flag as the STUCK_RUN_MS block does,
          // set it here so the cleared state is written back to the state file.
        }
      }
    }
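The startup sweep can be exercised in isolation with stubbed dependencies. The sketch below assumes the same state shape as the patch above. clearOrphanedRunningMarkers and makeStubState are hypothetical names, and isJobSessionRunning remains a proposed helper rather than a confirmed API.

```javascript
// Standalone version of the startup sweep. Returns the ids of the jobs it
// cleared so the caller can decide whether state needs to be persisted.
async function clearOrphanedRunningMarkers(state) {
  const cleared = [];
  for (const job of Object.values(state.jobs)) {
    const isIsolatedAgentTurn =
      job.config.sessionTarget === "isolated" &&
      job.config.payload?.kind === "agentTurn";
    if (!isIsolatedAgentTurn || typeof job.state.runningAtMs !== "number") continue;

    // Hypothetical session-manager call, as in the proposed patch.
    const isRunning = await state.deps.sessionManager.isJobSessionRunning(job.id);
    if (!isRunning) {
      state.deps.log.warn({ jobId: job.id }, "cron: clearing orphaned runningAtMs on startup");
      job.state.runningAtMs = undefined;
      cleared.push(job.id);
    }
  }
  return cleared;
}

// Stubbed state for illustration: one orphaned isolated job, one job that is out of scope.
function makeStubState() {
  return {
    jobs: {
      a: { id: "a", config: { sessionTarget: "isolated", payload: { kind: "agentTurn" } }, state: { runningAtMs: 1 } },
      // "other" is a placeholder for any non-isolated sessionTarget value.
      b: { id: "b", config: { sessionTarget: "other", payload: { kind: "agentTurn" } }, state: { runningAtMs: 1 } },
    },
    deps: {
      log: { warn: (fields, msg) => console.warn(msg, fields) },
      sessionManager: { isJobSessionRunning: async () => false }, // no sessions alive after restart
    },
  };
}

clearOrphanedRunningMarkers(makeStubState()).then((cleared) => console.log(cleared));
```

With the stub reporting no live sessions, only job "a" is cleared. Job "b" is left untouched because it does not match the isolated agentTurn filter.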
Workaround (for users)
Add a pre-flight reset to heartbeat-dispatch.sh:
openclaw cron disable <jobId>
sleep 2
openclaw cron enable <jobId>
References