You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Isolated cron agentTurn jobs with an explicit payload.timeoutSeconds: 300 are still being hard-killed at approximately 300,000ms — even after the v2026.4.9 fix for the LLM idle timeout regression.
The job-level setTimeout in executeJobCoreWithTimeout fires at exactly the configured timeoutSeconds × 1000, but the embedded runner's internal work is not completing in time. This is different from the fixed LLM idle timeout bug — it is a separate, still-active issue affecting jobs that require more than ~265 seconds of wall-clock time.
Expected Behavior
A cron job with sessionTarget: "isolated", payload.kind: "agentTurn", and payload.timeoutSeconds: 300 should:
Respect the 300,000ms job-level timeout and kill the job at 300s if it genuinely exceeds that budget
Complete successfully if the embedded runner can finish within 300s
Actual Behavior
The job is killed at approximately 300,000ms (job-level timeout) even when the embedded runner is still actively processing — executing commands, running LLM calls that are succeeding, and making visible progress in the session transcript.
Session transcript (81KB, 49 entries) shows the session bootstraps, the LLM responds, exec commands run, but the job is hard-killed at ~300s with "cron: job execution timed out"
Timeline of Events
v2026.3.31 era: Cron works fine (scheduled, isolated agentTurn)
v2026.4.1 upgrade: Fails at ~60s (LLM idle timeout bug — now fixed)
v2026.4.9 upgrade: LLM idle timeout is fixed (no more 65s kills), but the job-level 300s hard timeout is now the terminal failure. The fix in d9dc75774b addressed the embedded runner's LLM idle watchdog, not the cron runner's executeJobCoreWithTimeout setTimeout.
The session was doing useful work until the moment it was killed. The exec commands were running, tool results were being returned. The job was not hung — it was actively processing.
Technical Findings
Two Independent Timeout Systems
The codebase has two independent timeout mechanisms that can kill a cron job:
Set via cfg.agents.defaults.timeoutSeconds (default 600s)
Independent setTimeout inside the embedded runner
Now fixed in v2026.4.9: inherits agents.defaults.timeoutSeconds and is disabled for unconfigured cron runs
The job-level timeout (300s) fires before the embedded runner's 600s idle watchdog. Both mechanisms are functioning as designed — but the job-level timeout is killing jobs that should be allowed to complete within 300s.
The Fix That Shipped Only Addressed One Layer
The v2026.4.9 release notes state:
"Agents/timeouts: make the LLM idle timeout inherit agents.defaults.timeoutSeconds when configured, disable the unconfigured idle watchdog for cron runs..."
This fixed the embedded runner's idle watchdog (the ~65s killer). It did not address executeJobCoreWithTimeout's job-level setTimeout, which has always been reading payload.timeoutSeconds correctly but fires regardless of whether the embedded runner is making progress.
Additional Observations
When the upstream LLM API returns 529 (overloaded), OpenClaw's retry/failover logic consumes time that counts against the job-level 300s budget. This is expected but compounds the timing pressure.
The same model works fine in manual (non-cron) sessions via the Control UI — direct API calls succeed in <1s.
Is executeJobCoreWithTimeout's setTimeout supposed to fire as a hard ceiling, or should it be extended by time spent in the embedded runner's internal retry loops?
Should payload.timeoutSeconds for isolated agentTurn jobs be increased significantly (600s+) to give the embedded runner enough time, given the job-level timeout is a hard ceiling?
Is there overhead in the isolated session bootstrap that's consuming part of the 300s budget before the LLM call even starts?
Reproduction Steps
Any isolated agentTurn cron job with:
sessionTarget: "isolated"
payload.timeoutSeconds: 300
A moderately complex task (LLM + multiple exec commands)
Will fail at 300s if the task takes longer than ~265s due to isolated session bootstrap overhead, retry time from API errors, or natural processing time.
Bug Description
Isolated cron
agentTurnjobs with an explicitpayload.timeoutSeconds: 300are still being hard-killed at approximately 300,000ms — even after the v2026.4.9 fix for the LLM idle timeout regression.The job-level
setTimeoutinexecuteJobCoreWithTimeoutfires at exactly the configuredtimeoutSeconds × 1000, but the embedded runner's internal work is not completing in time. This is different from the fixed LLM idle timeout bug — it is a separate, still-active issue affecting jobs that require more than ~265 seconds of wall-clock time.Expected Behavior
A cron job with
sessionTarget: "isolated",payload.kind: "agentTurn", andpayload.timeoutSeconds: 300should:Actual Behavior
The job is killed at approximately 300,000ms (job-level timeout) even when the embedded runner is still actively processing — executing commands, running LLM calls that are succeeding, and making visible progress in the session transcript.
Environment
sessionTarget: "isolated",payload.kind: "agentTurn",timeoutSeconds: 300"cron: job execution timed out"Timeline of Events
d9dc75774baddressed the embedded runner's LLM idle watchdog, not the cron runner'sexecuteJobCoreWithTimeoutsetTimeout.Session Transcript Evidence
The session was doing useful work until the moment it was killed. The exec commands were running, tool results were being returned. The job was not hung — it was actively processing.
Technical Findings
Two Independent Timeout Systems
The codebase has two independent timeout mechanisms that can kill a cron job:
Job-level timeout (
executeJobCoreWithTimeout,gateway-cli-*.js)payload.timeoutSeconds(converted to ms)setTimeout(..., jobTimeoutMs)that races againstexecuteJobCorerunAbortController.abort()+ rejection of the Promise race"cron: job execution timed out"Embedded runner's idle timeout (
runEmbeddedPiAgent,pi-embedded-*.js)cfg.agents.defaults.timeoutSeconds(default 600s)setTimeoutinside the embedded runneragents.defaults.timeoutSecondsand is disabled for unconfigured cron runsThe job-level timeout (300s) fires before the embedded runner's 600s idle watchdog. Both mechanisms are functioning as designed — but the job-level timeout is killing jobs that should be allowed to complete within 300s.
The Fix That Shipped Only Addressed One Layer
The v2026.4.9 release notes state:
This fixed the embedded runner's idle watchdog (the ~65s killer). It did not address
executeJobCoreWithTimeout's job-levelsetTimeout, which has always been readingpayload.timeoutSecondscorrectly but fires regardless of whether the embedded runner is making progress.Additional Observations
payload.modelfield IS being respected (session uses MiniMax-M2.7), but the model parameter for isolated cron sessions has a separate bug ([Bug] Cron isolated session model parameter ignored, falls back to default agent model #61294) — it may be ignored depending on configuration format.Relevant Open Issues
Profile *:default timed out. Trying next account...behavior, direct API worksQuestions / Things to Investigate
executeJobCoreWithTimeout'ssetTimeoutsupposed to fire as a hard ceiling, or should it be extended by time spent in the embedded runner's internal retry loops?payload.timeoutSecondsfor isolatedagentTurnjobs be increased significantly (600s+) to give the embedded runner enough time, given the job-level timeout is a hard ceiling?Reproduction Steps
Any isolated
agentTurncron job with:sessionTarget: "isolated"payload.timeoutSeconds: 300Will fail at 300s if the task takes longer than ~265s due to isolated session bootstrap overhead, retry time from API errors, or natural processing time.
Tags
bug regression cron isolated agentTurn timeout