Summary
In v2026.4.11, cron-triggered runs without an explicit agents.defaults.llm.idleTimeoutSeconds (and no agents.defaults.timeoutSeconds) disable the LLM idle watchdog entirely and rely only on the cron outer timeout. This is documented at docs.openclaw.ai/concepts/agent-loop#timeouts:
"Cron-triggered runs with no explicit LLM or agent timeout disable the idle watchdog and rely on the cron outer timeout."
The unintended consequence is that when the primary provider hangs without sending chunks, it consumes the entire cron timeout, and the failover chain cannot advance to the next model before the cron cancellation fires. Every cron job fails with cron: job execution timed out, regardless of how many fallbacks are configured.
Expected behavior
A cron with multiple fallback models should successfully fall back to a working provider if the primary hangs. The failover chain should have enough time budget to attempt each model in sequence.
Actual behavior
Only the primary model is really attempted. Its hang consumes all timeoutSeconds. When the cron cancellation fires, the failover code tries to advance to the next candidate in ~90ms, but every subsequent attempt immediately returns cron: job execution timed out without making a real API request. The whole run ends with FailoverError: LLM request timed out. on the primary only.
Reproduction
- Set up a cron job with:
- Configure
agents.defaults.model with 4 fallbacks:
- Do NOT set
agents.defaults.llm.idleTimeoutSeconds or agents.defaults.timeoutSeconds.
- Cause the primary to hang (e.g., exhaust GitHub Copilot Premium quota — or mock any hanging provider).
- Observe the cron run hits 180s, fails. Manually verify that
openai/gpt-5.4 is healthy via interactive call — it is. Failover chain is broken.
Gateway logs showing the pattern
21:11:28.997 [agent/embedded] pre-prompt provider=github-copilot/claude-opus-4.6
21:14:29.027 [agent/embedded] Profile github-copilot:github timed out. Trying next account...
21:14:29.032 [diagnostic] lane task error durationMs=185973 error="FailoverError: LLM request timed out."
21:14:29.037 [model-fallback/decision] candidate=opus-4.6 next=sonnet detail=LLM request timed out.
21:14:29.124 [model-fallback/decision] candidate=sonnet next=gpt-5.4 detail=cron: job execution timed out
21:14:29.129 [model-fallback/decision] candidate=gpt-5.4 next=gpt-5.2 detail=cron: job execution timed out
21:14:29.135 [model-fallback/decision] candidate=gpt-5.2 next=none detail=cron: job execution timed out
The first candidate's attempt takes 185973ms (the full cron budget plus overhead). The remaining candidates are "attempted" within 100ms total because the cron-cancel has already fired — none of them make real API requests.
Fix / Workaround
Set an explicit LLM idle timeout in agent defaults:
After restart, the same cron job now completes successfully:
23:42:51 [agent/embedded] Profile github-copilot:github timed out. Trying next account...
23:42:51 [diagnostic] durationMs=37294ms error="FailoverError: LLM request timed out."
23:42:51 [model-fallback/decision] candidate=opus-4.6 next=sonnet detail=LLM request timed out.
23:43:22 [agent/embedded] Profile github-copilot:github timed out. (sonnet, 30.7s)
23:43:22 [model-fallback/decision] candidate=sonnet next=openai/gpt-5.4 detail=LLM request timed out.
23:44:31 [model-fallback/decision] candidate=openai/gpt-5.4 decision=candidate_succeeded
23:44:37 cron: finished status=ok durationMs=142986
Total: 143s, within the 180s budget. The failover chain works.
Why this is a bug (not just docs)
The default behavior of "no explicit timeout => disable watchdog" means that a working OpenClaw 2026.4.10 deployment upgraded to 2026.4.11 will see cron jobs start failing silently, with no obvious config change required from the operator. The only way to detect this is reading the docs page carefully (buried in concepts/agent-loop#timeouts) AND correlating with a new explicit config field that wasn't needed before.
Proposed remediation options (not mutually exclusive):
- Change the default: cron runs should have a sane default
idleTimeoutSeconds (e.g., 60s) even when not explicitly set. This restores failover chain viability by default.
- Migration warning on startup: when the gateway detects cron jobs with no explicit LLM timeout configured, log a WARNING at startup with a link to the docs.
- Documentation: add a prominent callout on both docs.openclaw.ai/cli/cron and the cron configuration reference, explaining that cron runs require
agents.defaults.llm.idleTimeoutSeconds to make the failover chain work.
Environment
- OpenClaw: v2026.4.11
- Runtime: Kubernetes (RKE2), node:22-bookworm image
- Provider stack: GitHub Copilot (primary), OpenAI (fallbacks)
- Cron: 9 scheduled jobs, all affected
- First observed: 2026-04-12 ~01:00 CEST (hours after upgrade from 2026.4.10 to 2026.4.11)
- Trigger event: GitHub Copilot Premium quota exhausted —
opus-4.6 began hanging ~180s without returning 429
- Symptoms: all cron jobs failed silently with
cron: job execution timed out, Telegram/Discord interactive paths unaffected
- Detection time: ~22 hours (only noticed because the daily report was missing)
- Remediation time: ~30 minutes of config investigation + single configmap edit
Related
Summary
In v2026.4.11, cron-triggered runs without an explicit
agents.defaults.llm.idleTimeoutSeconds(and noagents.defaults.timeoutSeconds) disable the LLM idle watchdog entirely and rely only on the cron outer timeout. This is documented at docs.openclaw.ai/concepts/agent-loop#timeouts:The unintended consequence is that when the primary provider hangs without sending chunks, it consumes the entire cron timeout, and the failover chain cannot advance to the next model before the cron cancellation fires. Every cron job fails with
cron: job execution timed out, regardless of how many fallbacks are configured.Expected behavior
A cron with multiple fallback models should successfully fall back to a working provider if the primary hangs. The failover chain should have enough time budget to attempt each model in sequence.
Actual behavior
Only the primary model is really attempted. Its hang consumes all
timeoutSeconds. When the cron cancellation fires, the failover code tries to advance to the next candidate in ~90ms, but every subsequent attempt immediately returnscron: job execution timed outwithout making a real API request. The whole run ends withFailoverError: LLM request timed out.on the primary only.Reproduction
{ "schedule": { "kind": "cron", "cron": "0 * * * *" }, "sessionTarget": "isolated", "payload": { "kind": "agentTurn", "message": "Reply OK" }, "timeoutSeconds": 180 }agents.defaults.modelwith 4 fallbacks:{ "primary": "github-copilot/claude-opus-4.6", "fallbacks": [ "github-copilot/claude-sonnet-4.6", "openai/gpt-5.4", "openai/gpt-5.2" ] }agents.defaults.llm.idleTimeoutSecondsoragents.defaults.timeoutSeconds.openai/gpt-5.4is healthy via interactive call — it is. Failover chain is broken.Gateway logs showing the pattern
The first candidate's attempt takes 185973ms (the full cron budget plus overhead). The remaining candidates are "attempted" within 100ms total because the cron-cancel has already fired — none of them make real API requests.
Fix / Workaround
Set an explicit LLM idle timeout in agent defaults:
{ "agents": { "defaults": { "llm": { "idleTimeoutSeconds": 30 } } } }After restart, the same cron job now completes successfully:
Total: 143s, within the 180s budget. The failover chain works.
Why this is a bug (not just docs)
The default behavior of "no explicit timeout => disable watchdog" means that a working OpenClaw 2026.4.10 deployment upgraded to 2026.4.11 will see cron jobs start failing silently, with no obvious config change required from the operator. The only way to detect this is reading the docs page carefully (buried in concepts/agent-loop#timeouts) AND correlating with a new explicit config field that wasn't needed before.
Proposed remediation options (not mutually exclusive):
idleTimeoutSeconds(e.g., 60s) even when not explicitly set. This restores failover chain viability by default.agents.defaults.llm.idleTimeoutSecondsto make the failover chain work.Environment
opus-4.6began hanging ~180s without returning 429cron: job execution timed out, Telegram/Discord interactive paths unaffectedRelated