[Bug][2026.4.11] Cron runs silently disable LLM idle watchdog by default, hung providers consume full cron timeout and block failover chain

### Summary

In v2026.4.11, cron-triggered runs without an explicit `agents.defaults.llm.idleTimeoutSeconds` (and no `agents.defaults.timeoutSeconds`) **disable the LLM idle watchdog entirely** and rely only on the cron outer timeout. This is documented at [docs.openclaw.ai/concepts/agent-loop#timeouts](https://docs.openclaw.ai/concepts/agent-loop#timeouts):

> "Cron-triggered runs with no explicit LLM or agent timeout disable the idle watchdog and rely on the cron outer timeout."

The unintended consequence is that **when the primary provider hangs without sending chunks, it consumes the entire cron timeout, and the failover chain cannot advance to the next model before the cron cancellation fires**. Every cron job fails with `cron: job execution timed out`, regardless of how many fallbacks are configured.

### Expected behavior

A cron with multiple fallback models should successfully fall back to a working provider if the primary hangs. The failover chain should have enough time budget to attempt each model in sequence.

### Actual behavior

Only the primary model is really attempted. Its hang consumes all `timeoutSeconds`. When the cron cancellation fires, the failover code tries to advance to the next candidate in ~90ms, but every subsequent attempt immediately returns `cron: job execution timed out` without making a real API request. The whole run ends with `FailoverError: LLM request timed out.` on the primary only.

### Reproduction

1. Set up a cron job with:
   ```jsonc
   {
     "schedule": { "kind": "cron", "cron": "0 * * * *" },
     "sessionTarget": "isolated",
     "payload": { "kind": "agentTurn", "message": "Reply OK" },
     "timeoutSeconds": 180
   }
   ```
2. Configure `agents.defaults.model` with 4 fallbacks:
   ```jsonc
   {
     "primary": "github-copilot/claude-opus-4.6",
     "fallbacks": [
       "github-copilot/claude-sonnet-4.6",
       "openai/gpt-5.4",
       "openai/gpt-5.2"
     ]
   }
   ```
3. **Do NOT set** `agents.defaults.llm.idleTimeoutSeconds` or `agents.defaults.timeoutSeconds`.
4. Cause the primary to hang (e.g., exhaust GitHub Copilot Premium quota — or mock any hanging provider).
5. Observe the cron run hits 180s, fails. Manually verify that `openai/gpt-5.4` is healthy via interactive call — it is. Failover chain is broken.

### Gateway logs showing the pattern

```
21:11:28.997 [agent/embedded] pre-prompt provider=github-copilot/claude-opus-4.6
21:14:29.027 [agent/embedded] Profile github-copilot:github timed out. Trying next account...
21:14:29.032 [diagnostic]     lane task error durationMs=185973 error="FailoverError: LLM request timed out."
21:14:29.037 [model-fallback/decision] candidate=opus-4.6 next=sonnet   detail=LLM request timed out.
21:14:29.124 [model-fallback/decision] candidate=sonnet   next=gpt-5.4  detail=cron: job execution timed out
21:14:29.129 [model-fallback/decision] candidate=gpt-5.4  next=gpt-5.2  detail=cron: job execution timed out
21:14:29.135 [model-fallback/decision] candidate=gpt-5.2  next=none     detail=cron: job execution timed out
```

The first candidate's attempt takes **185973ms** (the full cron budget plus overhead). The remaining candidates are "attempted" within **100ms** total because the cron-cancel has already fired — none of them make real API requests.

### Fix / Workaround

Set an explicit LLM idle timeout in agent defaults:

```jsonc
{
  "agents": {
    "defaults": {
      "llm": {
        "idleTimeoutSeconds": 30
      }
    }
  }
}
```

After restart, the same cron job now completes successfully:

```
23:42:51 [agent/embedded] Profile github-copilot:github timed out. Trying next account...
23:42:51 [diagnostic]     durationMs=37294ms error="FailoverError: LLM request timed out."
23:42:51 [model-fallback/decision] candidate=opus-4.6 next=sonnet detail=LLM request timed out.
23:43:22 [agent/embedded] Profile github-copilot:github timed out. (sonnet, 30.7s)
23:43:22 [model-fallback/decision] candidate=sonnet next=openai/gpt-5.4 detail=LLM request timed out.
23:44:31 [model-fallback/decision] candidate=openai/gpt-5.4 decision=candidate_succeeded
23:44:37 cron: finished status=ok durationMs=142986
```

Total: 143s, within the 180s budget. The failover chain works.

### Why this is a bug (not just docs)

The default behavior of "no explicit timeout => disable watchdog" means that a **working OpenClaw 2026.4.10 deployment upgraded to 2026.4.11 will see cron jobs start failing silently**, with no obvious config change required from the operator. The only way to detect this is reading the docs page carefully (buried in [concepts/agent-loop#timeouts](https://docs.openclaw.ai/concepts/agent-loop#timeouts)) AND correlating with a new explicit config field that wasn't needed before.

Proposed remediation options (not mutually exclusive):

1. **Change the default**: cron runs should have a sane default `idleTimeoutSeconds` (e.g., 60s) even when not explicitly set. This restores failover chain viability by default.
2. **Migration warning on startup**: when the gateway detects cron jobs with no explicit LLM timeout configured, log a WARNING at startup with a link to the docs.
3. **Documentation**: add a prominent callout on both [docs.openclaw.ai/cli/cron](https://docs.openclaw.ai/cli/cron) and the cron configuration reference, explaining that cron runs require `agents.defaults.llm.idleTimeoutSeconds` to make the failover chain work.

### Environment

- **OpenClaw**: v2026.4.11
- **Runtime**: Kubernetes (RKE2), node:22-bookworm image
- **Provider stack**: GitHub Copilot (primary), OpenAI (fallbacks)
- **Cron**: 9 scheduled jobs, all affected
- **First observed**: 2026-04-12 ~01:00 CEST (hours after upgrade from 2026.4.10 to 2026.4.11)
- **Trigger event**: GitHub Copilot Premium quota exhausted — `opus-4.6` began hanging ~180s without returning 429
- **Symptoms**: all cron jobs failed silently with `cron: job execution timed out`, Telegram/Discord interactive paths unaffected
- **Detection time**: ~22 hours (only noticed because the daily report was missing)
- **Remediation time**: ~30 minutes of config investigation + single configmap edit

### Related

- #63805 — isolated cron agentTurn jobs with explicit timeoutSeconds still hard-killed at 300s (different bug, related theme)
- #62752 — cron jobs timing out in isolated sessions — scheduled runs vs manual cron run differ (closed 2026-04-11)



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bug][2026.4.11] Cron runs silently disable LLM idle watchdog by default, hung providers consume full cron timeout and block failover chain #65576

Summary

Expected behavior

Actual behavior

Reproduction

Gateway logs showing the pattern

Fix / Workaround

Why this is a bug (not just docs)

Environment

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

[Bug][2026.4.11] Cron runs silently disable LLM idle watchdog by default, hung providers consume full cron timeout and block failover chain #65576

Description

Summary

Expected behavior

Actual behavior

Reproduction

Gateway logs showing the pattern

Fix / Workaround

Why this is a bug (not just docs)

Environment

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions