
[Bug][2026.4.11] Cron runs silently disable LLM idle watchdog by default, hung providers consume full cron timeout and block failover chain #65576

@PabloHurtadoGonzalo86

Summary

In v2026.4.11, cron-triggered runs without an explicit agents.defaults.llm.idleTimeoutSeconds (and no agents.defaults.timeoutSeconds) disable the LLM idle watchdog entirely and rely only on the cron outer timeout. This is documented at docs.openclaw.ai/concepts/agent-loop#timeouts:

"Cron-triggered runs with no explicit LLM or agent timeout disable the idle watchdog and rely on the cron outer timeout."

The unintended consequence is that when the primary provider hangs without sending chunks, it consumes the entire cron timeout, and the failover chain cannot advance to the next model before the cron cancellation fires. Every cron job fails with cron: job execution timed out, regardless of how many fallbacks are configured.

Expected behavior

A cron with multiple fallback models should successfully fall back to a working provider if the primary hangs. The failover chain should have enough time budget to attempt each model in sequence.

Actual behavior

Only the primary model is actually attempted. Its hang consumes the job's entire timeoutSeconds budget. When the cron cancellation fires, the failover code advances to the next candidate within ~90ms, but every subsequent attempt immediately returns cron: job execution timed out without making a real API request. The whole run ends with FailoverError: LLM request timed out., recorded against the primary only.
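To make the failure mode concrete, here is a minimal sketch that replays a failover chain on a virtual clock. It is not OpenClaw code: `simulate_failover` and its parameters are hypothetical, and the model is simplified (a hung candidate burns either the idle timeout or, with the watchdog disabled, the whole remaining cron budget; a healthy candidate streams and is only bounded by the remaining budget).

```python
from typing import List, Optional, Tuple

# A candidate is (name, reply_seconds); reply_seconds=None means the provider hangs.
Candidate = Tuple[str, Optional[float]]


def simulate_failover(
    candidates: List[Candidate],
    cron_timeout_s: float,
    idle_timeout_s: Optional[float],
) -> Tuple[str, float, List[Tuple[str, str]]]:
    """Replay a failover chain on a virtual clock (illustrative model only)."""
    elapsed = 0.0
    log: List[Tuple[str, str]] = []
    for name, reply_s in candidates:
        remaining = cron_timeout_s - elapsed
        if remaining <= 0:
            # Outer cron cancellation already fired: no real request is made.
            log.append((name, "cron: job execution timed out"))
            continue
        if reply_s is None:
            # Hung provider: bounded by the idle watchdog if it is enabled,
            # otherwise by whatever is left of the cron budget.
            limit = remaining if idle_timeout_s is None else min(idle_timeout_s, remaining)
            elapsed += limit
            log.append((name, "LLM request timed out"))
            continue
        if reply_s <= remaining:
            elapsed += reply_s
            log.append((name, "candidate_succeeded"))
            return "ok", elapsed, log
        # Healthy but too slow for what is left of the budget.
        elapsed = cron_timeout_s
        log.append((name, "cron: job execution timed out"))
    return "error", elapsed, log


chain: List[Candidate] = [
    ("opus-4.6", None),   # hangs
    ("sonnet", None),     # hangs
    ("gpt-5.4", 69.0),    # healthy
    ("gpt-5.2", 5.0),     # healthy, never needed
]

print(simulate_failover(chain, cron_timeout_s=180, idle_timeout_s=None))
print(simulate_failover(chain, cron_timeout_s=180, idle_timeout_s=30))
```

With the watchdog disabled the first candidate eats all 180s and the rest are skipped; with a 30s idle timeout the third candidate is reached and succeeds.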

Reproduction

  1. Set up a cron job with:
    {
      "schedule": { "kind": "cron", "cron": "0 * * * *" },
      "sessionTarget": "isolated",
      "payload": { "kind": "agentTurn", "message": "Reply OK" },
      "timeoutSeconds": 180
    }
  2. Configure agents.defaults.model with a primary and 3 fallbacks (4 candidates in total):
    {
      "primary": "github-copilot/claude-opus-4.6",
      "fallbacks": [
        "github-copilot/claude-sonnet-4.6",
        "openai/gpt-5.4",
        "openai/gpt-5.2"
      ]
    }
  3. Do NOT set agents.defaults.llm.idleTimeoutSeconds or agents.defaults.timeoutSeconds.
  4. Cause the primary to hang (e.g., exhaust the GitHub Copilot Premium quota, or mock any hanging provider).
  5. Observe that the cron run hits the 180s limit and fails. A manual interactive call confirms that openai/gpt-5.4 is healthy, yet the failover chain never reaches it.

Gateway logs showing the pattern

21:11:28.997 [agent/embedded] pre-prompt provider=github-copilot/claude-opus-4.6
21:14:29.027 [agent/embedded] Profile github-copilot:github timed out. Trying next account...
21:14:29.032 [diagnostic]     lane task error durationMs=185973 error="FailoverError: LLM request timed out."
21:14:29.037 [model-fallback/decision] candidate=opus-4.6 next=sonnet   detail=LLM request timed out.
21:14:29.124 [model-fallback/decision] candidate=sonnet   next=gpt-5.4  detail=cron: job execution timed out
21:14:29.129 [model-fallback/decision] candidate=gpt-5.4  next=gpt-5.2  detail=cron: job execution timed out
21:14:29.135 [model-fallback/decision] candidate=gpt-5.2  next=none     detail=cron: job execution timed out

The first candidate's attempt takes 185973ms (the full cron budget plus overhead). The remaining candidates are "attempted" within 100ms total because the cron-cancel has already fired — none of them make real API requests.
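The timings can be checked directly from the timestamps in the log excerpt above. This small script (a helper written for this report, not part of OpenClaw) measures how long the primary attempt ran and how quickly the remaining fallback decisions were emitted:

```python
from datetime import datetime

LOG = """\
21:11:28.997 [agent/embedded] pre-prompt provider=github-copilot/claude-opus-4.6
21:14:29.027 [agent/embedded] Profile github-copilot:github timed out. Trying next account...
21:14:29.037 [model-fallback/decision] candidate=opus-4.6 next=sonnet   detail=LLM request timed out.
21:14:29.124 [model-fallback/decision] candidate=sonnet   next=gpt-5.4  detail=cron: job execution timed out
21:14:29.129 [model-fallback/decision] candidate=gpt-5.4  next=gpt-5.2  detail=cron: job execution timed out
21:14:29.135 [model-fallback/decision] candidate=gpt-5.2  next=none     detail=cron: job execution timed out
"""


def ts(line: str) -> datetime:
    """Parse the leading HH:MM:SS.mmm timestamp of a log line."""
    return datetime.strptime(line.split()[0], "%H:%M:%S.%f")


lines = LOG.strip().splitlines()

# Time from the pre-prompt line to the provider timeout line.
primary_s = (ts(lines[1]) - ts(lines[0])).total_seconds()

# Time between the first and last fallback decision.
decisions = [ts(l) for l in lines if "[model-fallback/decision]" in l]
fallback_ms = (decisions[-1] - decisions[0]).total_seconds() * 1000

print(f"primary attempt: {primary_s:.2f}s")              # primary attempt: 180.03s
print(f"all fallback decisions: {fallback_ms:.0f}ms")    # all fallback decisions: 98ms
```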

Fix / Workaround

Set an explicit LLM idle timeout in agent defaults:

{
  "agents": {
    "defaults": {
      "llm": {
        "idleTimeoutSeconds": 30
      }
    }
  }
}

After restart, the same cron job now completes successfully:

23:42:51 [agent/embedded] Profile github-copilot:github timed out. Trying next account...
23:42:51 [diagnostic]     durationMs=37294ms error="FailoverError: LLM request timed out."
23:42:51 [model-fallback/decision] candidate=opus-4.6 next=sonnet detail=LLM request timed out.
23:43:22 [agent/embedded] Profile github-copilot:github timed out. (sonnet, 30.7s)
23:43:22 [model-fallback/decision] candidate=sonnet next=openai/gpt-5.4 detail=LLM request timed out.
23:44:31 [model-fallback/decision] candidate=openai/gpt-5.4 decision=candidate_succeeded
23:44:37 cron: finished status=ok durationMs=142986

Total: 143s, within the 180s budget. The failover chain works.
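As a rough sizing aid, the budget arithmetic can be sketched as below. This is a simplified model, not how OpenClaw computes anything: it assumes each hung candidate burns exactly one idle timeout and ignores per-attempt overhead (the logs above show the first timeout actually took ~37s with a 30s setting). The function names are made up for illustration; the 69s figure is the gap between the 23:43:22 and 23:44:31 log lines.

```python
def max_idle_timeout_s(cron_timeout_s: float, hung_candidates: int, healthy_reply_s: float) -> float:
    """Largest idle timeout that still lets a healthy candidate finish.

    Simplified model: each hung candidate consumes one idle timeout,
    then the healthy candidate needs healthy_reply_s to complete.
    """
    if hung_candidates <= 0:
        raise ValueError("hung_candidates must be positive")
    return (cron_timeout_s - healthy_reply_s) / hung_candidates


def fits(cron_timeout_s: float, idle_timeout_s: float, hung_candidates: int, healthy_reply_s: float) -> bool:
    """True if the chain can reach and finish the healthy candidate in time."""
    return hung_candidates * idle_timeout_s + healthy_reply_s <= cron_timeout_s


# Numbers from this report: 180s cron budget, two hung Copilot models,
# and a gpt-5.4 reply that took about 69s.
print(max_idle_timeout_s(180, 2, 69))  # 55.5
print(fits(180, 30, 2, 69))            # True
print(fits(180, 60, 2, 69))            # False
```

Under this model, any idle timeout needs to be weighed against the cron job's own timeoutSeconds and the number of candidates that may hang before a healthy one is reached; in this particular scenario a 60s value would leave too little room.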

Why this is a bug (not just docs)

Because the default is "no explicit timeout => watchdog disabled", a working OpenClaw 2026.4.10 deployment upgraded to 2026.4.11 can see its cron jobs start failing silently as soon as a provider hangs, even though the operator changed no configuration. The only way to diagnose this is to read the docs page carefully (the note is buried in concepts/agent-loop#timeouts) AND connect it to a config field that did not need to be set explicitly before.

Proposed remediation options (not mutually exclusive):

  1. Change the default: cron runs should have a sane default idleTimeoutSeconds (e.g., 60s) even when not explicitly set. This restores failover chain viability by default.
  2. Migration warning on startup: when the gateway detects cron jobs with no explicit LLM timeout configured, log a WARNING at startup with a link to the docs.
  3. Documentation: add a prominent callout on both docs.openclaw.ai/cli/cron and the cron configuration reference, explaining that cron runs require agents.defaults.llm.idleTimeoutSeconds to make the failover chain work.
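Option 2 could look roughly like the following check. This is only a sketch of the idea: the two config paths are the ones named in this report, but the function names, the logger name, and the way the gateway would count cron jobs are assumptions made for illustration.

```python
import logging
from typing import Any, Mapping, Optional


def _get(cfg: Mapping[str, Any], path: str) -> Optional[Any]:
    """Walk a dotted path through nested mappings; None if any key is missing."""
    cur: Any = cfg
    for key in path.split("."):
        if not isinstance(cur, Mapping) or key not in cur:
            return None
        cur = cur[key]
    return cur


def cron_watchdog_disabled(cfg: Mapping[str, Any]) -> bool:
    """True when neither timeout from this report is set explicitly."""
    return (
        _get(cfg, "agents.defaults.llm.idleTimeoutSeconds") is None
        and _get(cfg, "agents.defaults.timeoutSeconds") is None
    )


def warn_if_unprotected(
    cfg: Mapping[str, Any],
    cron_job_count: int,
    logger: logging.Logger = logging.getLogger("gateway"),  # hypothetical logger name
) -> bool:
    """Log a startup warning when cron jobs would run without an idle watchdog."""
    if cron_job_count > 0 and cron_watchdog_disabled(cfg):
        logger.warning(
            "%d cron job(s) configured but no LLM idle timeout is set; "
            "a hung provider can consume the full cron timeout and block failover. "
            "See docs.openclaw.ai/concepts/agent-loop#timeouts",
            cron_job_count,
        )
        return True
    return False


# Example: the affected deployment (9 cron jobs, no timeouts) vs. the workaround.
print(warn_if_unprotected({"agents": {"defaults": {}}}, cron_job_count=9))
print(warn_if_unprotected(
    {"agents": {"defaults": {"llm": {"idleTimeoutSeconds": 30}}}}, cron_job_count=9
))
```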

Environment

  • OpenClaw: v2026.4.11
  • Runtime: Kubernetes (RKE2), node:22-bookworm image
  • Provider stack: GitHub Copilot (primary), OpenAI (fallbacks)
  • Cron: 9 scheduled jobs, all affected
  • First observed: 2026-04-12 ~01:00 CEST (hours after upgrade from 2026.4.10 to 2026.4.11)
  • Trigger event: GitHub Copilot Premium quota exhausted — opus-4.6 began hanging ~180s without returning 429
  • Symptoms: all cron jobs failed silently with cron: job execution timed out, Telegram/Discord interactive paths unaffected
  • Detection time: ~22 hours (only noticed because the daily report was missing)
  • Remediation time: ~30 minutes of config investigation + single configmap edit
