Bug type
Regression (worked before, now fails)
Summary
When the primary LLM provider returns sustained HTTP 500 (Internal Server Error) responses, OpenClaw cron agent jobs exhaust their full timeoutSeconds window (e.g. 180s) on each attempt rather than fast-failing on a definitive API error. This results in multiple consecutive timeout failures and missed data collection windows.
Environment
- OpenClaw Gateway: 2026.3.11
- Primary model: anthropic/claude-haiku-4-5
- Fallback chain: sonnet → gpt-5-mini → kimi → gpt-5.4 → opus
Steps to reproduce
- Create a cron job with timeoutSeconds: 180 running on a 15-minute schedule (see the config sketch after these steps)
- During the cron window, the primary LLM provider (Anthropic) returns api_error: Internal server error repeatedly
- Observe that each cron run spends the full 180 seconds waiting before timing out
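For concreteness, a minimal reconstruction of the failing job definition is shown below, assuming a JSON job schema. The field names are taken from the values quoted later in this report, not from a verified OpenClaw schema, and the schedule expression is left elided exactly as reported:

```jsonc
// Hypothetical reconstruction of the failing cron job. Field names are
// assumed from the values quoted in this report, not a verified schema.
{
  "sessionTarget": "isolated",
  "wakeMode": "now",
  "timeoutSeconds": 180,
  "schedule": "cron 50,5,20,35 ... @ UTC" // elided in the report; fires every 15 minutes
}
```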
Expected behavior
When the LLM provider returns a definitive, non-retryable error (HTTP 500, repeated across the fallback chain), cron jobs should:
- Fast-fail: detect that all fallbacks returned the same definitive error class and abort early
- Retry with backoff: reschedule after a short delay rather than silently dropping the run
- Surface to delivery: if delivery.mode is configured, notify the user that the run was skipped due to a provider outage (see the sketch after this list)
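To make the request concrete, here is a minimal TypeScript sketch of the classification logic being asked for. Every identifier (ProviderResult, classify, runCronAttempt, scheduleRetry, notifyDelivery) is hypothetical, not an OpenClaw API; the point is only that a chain of definitive 500s can be detected in seconds, well inside timeoutSeconds:

```ts
// Sketch only: all identifiers here are illustrative, not OpenClaw APIs.
type ErrorClass = "definitive" | "transient" | "ok";

interface ProviderResult {
  provider: string;
  status?: number; // HTTP status returned by the provider, if any
  body?: string;
}

// HTTP 500 is a definitive server-side failure for this attempt;
// 429/503 are transient (rate limit / overload) and may clear shortly.
function classify(r: ProviderResult): ErrorClass {
  if (r.status === undefined || r.status < 400) return "ok";
  if (r.status === 429 || r.status === 503) return "transient";
  return "definitive";
}

async function runCronAttempt(
  chain: Array<() => Promise<ProviderResult>>,
  scheduleRetry: (delayMs: number) => void,
  notifyDelivery: (msg: string) => void,
): Promise<ProviderResult> {
  const failures: ErrorClass[] = [];
  for (const call of chain) {
    const result = await call(); // each 500 comes back within seconds
    const cls = classify(result);
    if (cls === "ok") return result;
    failures.push(cls);
  }
  // Whole chain failed with definitive errors: abort now instead of
  // burning the rest of timeoutSeconds, reschedule, and surface the skip.
  if (failures.length > 0 && failures.every((c) => c === "definitive")) {
    scheduleRetry(30_000); // retry with backoff after ~30s
    notifyDelivery("cron run skipped: provider outage (HTTP 500 across fallback chain)");
    throw new Error("FastFail: definitive provider errors on all fallbacks");
  }
  throw new Error("fallback chain exhausted");
}
```

At the observed 1-2s per 500 response, a six-model chain would fail definitively in roughly ten seconds rather than blocking the lane for the full 180.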
Actual behavior
Gateway logs show:
[agent/embedded] embedded run agent end: isError=true error=LLM error api_error: Internal server error
[diagnostic] lane task error: lane=cron durationMs=180004 error="FailoverError: LLM request timed out."
4 consecutive cron runs each waited 180 seconds before failing. During a 15-minute cron schedule this meant the entire data collection window was missed (4 × 180s = 12 minutes of blocked lane time).
OpenClaw version
2026.3.12
Operating system
Ubuntu 24.04
Install method
No response
Model
anthropic/claude-haiku-4-5
Provider / routing chain
openclaw (cron agent, isolated session)
→ anthropic/claude-haiku-4-5 (primary, direct API)
→ anthropic/claude-sonnet-4-6 (fallback 1, direct API)
→ openai/gpt-5-mini (fallback 2, direct API)
→ groq/moonshotai/kimi-k2-instruct-0905 (fallback 3, direct API)
→ openai/gpt-5.4 (fallback 4, direct API)
→ anthropic/claude-opus-4-6 (fallback 5, direct API)

No intermediate proxy or AI gateway. Direct to provider APIs.
Config file / key location
- ~/.openclaw/openclaw.json (agents.defaults.model.primary, agents.defaults.model.fallbacks[]; sketched below)
- ~/.openclaw/agents/main/agent/auth-profiles.json (env-sourced: ANTHROPIC_API_KEY, OPENAI_API_KEY, GROQ_API_KEY; no per-agent overrides)
- Cron job state is persisted in the gateway PostgreSQL database (zeroclaw DB), not in a flat config file
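For reference, the relevant openclaw.json fragment would look roughly like this. This is a sketch assembled from the key paths above and the routing chain listed earlier; the exact schema is not verified here:

```jsonc
// Sketch of ~/.openclaw/openclaw.json, assembled from the key paths and
// routing chain given in this report; schema details are assumptions.
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-haiku-4-5",
        "fallbacks": [
          "anthropic/claude-sonnet-4-6",
          "openai/gpt-5-mini",
          "groq/moonshotai/kimi-k2-instruct-0905",
          "openai/gpt-5.4",
          "anthropic/claude-opus-4-6"
        ]
      }
    }
  }
}
```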
Additional provider/model setup details
- No Cloudflare AI Gateway, no vLLM, no OpenRouter — all providers are direct API connections
- ANTHROPIC_API_KEY sourced from Docker environment variable (docker-compose passthrough from /data/openclaw/.env)
- Failing cron job config: sessionTarget: isolated, wakeMode: now, timeoutSeconds: 180, schedule: cron 50,5,20,35 ... @ UTC
- During the incident, Anthropic returned api_error: Internal server error on all Anthropic-provider fallbacks (haiku, sonnet, opus) within 1–2s per attempt. Non-Anthropic fallbacks (openai, groq) were not reached; the gateway timed out the entire cron lane at 180s rather than fast-failing on the repeated definitive 500s
- Anthropic API tier: Tier 1 (50 RPM, 30K input TPM). No rate limiting involved — these were 500 errors, not 429s
- OpenClaw version: 2026.3.12, Docker deployment on Ubuntu 24.04
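The env passthrough mentioned above would typically look like the following docker-compose fragment. The service name and image tag are assumptions; only the env file path and variable names come from this report:

```yaml
# Hypothetical docker-compose fragment. The service name and image tag are
# assumptions; the env file path and variables are from this report.
services:
  openclaw-gateway:
    image: openclaw/gateway:2026.3.12
    env_file:
      - /data/openclaw/.env   # supplies ANTHROPIC_API_KEY, OPENAI_API_KEY, GROQ_API_KEY
```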
Logs, screenshots, and evidence
Impact and severity
No response
Additional information
No response