Bug type
Regression (worked before, now fails)
Summary
When the primary LLM provider returns sustained HTTP 500 (Internal Server Error) responses, OpenClaw cron agent jobs exhaust their full timeoutSeconds window (e.g. 180s) on each attempt rather than fast-failing on a definitive API error. This results in multiple consecutive timeout failures and missed data collection windows.
Environment
- OpenClaw Gateway: 2026.3.11
- Primary model: anthropic/claude-haiku-4-5
- Fallback chain: sonnet → gpt-5-mini → kimi → gpt-5.4 → opus
Steps to reproduce
- Create a cron job with timeoutSeconds: 180 running on a 15-minute schedule (see the config sketch after these steps)
- During the cron window, the primary LLM provider (Anthropic) returns api_error: Internal server error repeatedly
- Observe that each cron run spends the full 180 seconds waiting before timing out
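For concreteness, a minimal reconstruction of the failing job definition is shown below, assuming a JSON job schema. The field names are taken from the values quoted later in this report, not from a verified OpenClaw schema, and the schedule expression is left elided exactly as reported:

```jsonc
// Hypothetical reconstruction of the failing cron job. Field names are
// assumed from the values quoted in this report, not a verified schema.
{
  "sessionTarget": "isolated",
  "wakeMode": "now",
  "timeoutSeconds": 180,
  "schedule": "cron 50,5,20,35 ... @ UTC" // elided in the report; fires every 15 minutes
}
```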
Expected behavior
When the LLM provider returns a definitive, non-retryable error (HTTP 500, repeated across the fallback chain), cron jobs should:
- Fast-fail: detect that all fallbacks returned the same definitive error class and abort early
- Retry with backoff: reschedule after a short delay rather than silently dropping the run
- Surface to delivery: if delivery.mode is configured, notify the user that the run was skipped due to a provider outage (see the sketch after this list)
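To make the request concrete, here is a minimal TypeScript sketch of the classification logic being asked for. Every identifier (ProviderResult, classify, runCronAttempt, scheduleRetry, notifyDelivery) is hypothetical, not an OpenClaw API; the point is only that a chain of definitive 500s can be detected in seconds, well inside timeoutSeconds:

```ts
// Sketch only: all identifiers here are illustrative, not OpenClaw APIs.
type ErrorClass = "definitive" | "transient" | "ok";

interface ProviderResult {
  provider: string;
  status?: number; // HTTP status returned by the provider, if any
  body?: string;
}

// HTTP 500 is a definitive server-side failure for this attempt;
// 429/503 are transient (rate limit / overload) and may clear shortly.
function classify(r: ProviderResult): ErrorClass {
  if (r.status === undefined || r.status < 400) return "ok";
  if (r.status === 429 || r.status === 503) return "transient";
  return "definitive";
}

async function runCronAttempt(
  chain: Array<() => Promise<ProviderResult>>,
  scheduleRetry: (delayMs: number) => void,
  notifyDelivery: (msg: string) => void,
): Promise<ProviderResult> {
  const failures: ErrorClass[] = [];
  for (const call of chain) {
    const result = await call(); // each 500 comes back within seconds
    const cls = classify(result);
    if (cls === "ok") return result;
    failures.push(cls);
  }
  // Whole chain failed with definitive errors: abort now instead of
  // burning the rest of timeoutSeconds, reschedule, and surface the skip.
  if (failures.length > 0 && failures.every((c) => c === "definitive")) {
    scheduleRetry(30_000); // retry with backoff after ~30s
    notifyDelivery("cron run skipped: provider outage (HTTP 500 across fallback chain)");
    throw new Error("FastFail: definitive provider errors on all fallbacks");
  }
  throw new Error("fallback chain exhausted");
}
```

At the observed 1-2s per 500 response, a six-model chain would fail definitively in roughly ten seconds rather than blocking the lane for the full 180.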
Actual behavior
Gateway logs show:
[agent/embedded] embedded run agent end: isError=true error=LLM error api_error: Internal server error
[diagnostic] lane task error: lane=cron durationMs=180004 error="FailoverError: LLM request timed out."
4 consecutive cron runs each waited 180 seconds before failing. During a 15-minute cron schedule this meant the entire data collection window was missed (4 × 180s = 12 minutes of blocked lane time).
OpenClaw version
2026.3.12
Operating system
Ubuntu 24.04
Install method
No response
Model
anthropic/claude-haiku-4-5
Provider / routing chain
openclaw (cron agent, isolated session)
→ anthropic/claude-haiku-4-5 (primary, direct API)
→ anthropic/claude-sonnet-4-6 (fallback 1, direct API)
→ openai/gpt-5-mini (fallback 2, direct API)
→ groq/moonshotai/kimi-k2-instruct-0905 (fallback 3, direct API)
→ openai/gpt-5.4 (fallback 4, direct API)
→ anthropic/claude-opus-4-6 (fallback 5, direct API)

No intermediate proxy or AI gateway. Direct to provider APIs.
Config file / key location
- ~/.openclaw/openclaw.json (agents.defaults.model.primary, agents.defaults.model.fallbacks[]; sketched below)
- ~/.openclaw/agents/main/agent/auth-profiles.json (env-sourced: ANTHROPIC_API_KEY, OPENAI_API_KEY, GROQ_API_KEY; no per-agent overrides)
- Cron job state is persisted in the gateway PostgreSQL database (zeroclaw DB), not in a flat config file
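For reference, the relevant openclaw.json fragment would look roughly like this. This is a sketch assembled from the key paths above and the routing chain listed earlier; the exact schema is not verified here:

```jsonc
// Sketch of ~/.openclaw/openclaw.json, assembled from the key paths and
// routing chain given in this report; schema details are assumptions.
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-haiku-4-5",
        "fallbacks": [
          "anthropic/claude-sonnet-4-6",
          "openai/gpt-5-mini",
          "groq/moonshotai/kimi-k2-instruct-0905",
          "openai/gpt-5.4",
          "anthropic/claude-opus-4-6"
        ]
      }
    }
  }
}
```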
Additional provider/model setup details
- No Cloudflare AI Gateway, no vLLM, no OpenRouter — all providers are direct API connections
- ANTHROPIC_API_KEY sourced from Docker environment variable (docker-compose passthrough from /data/openclaw/.env)
- Failing cron job config: sessionTarget: isolated, wakeMode: now, timeoutSeconds: 180, schedule: cron 50,5,20,35 ... @ UTC
- During the incident, Anthropic returned api_error: Internal server error on all Anthropic-provider fallbacks (haiku, sonnet, opus) within 1–2s per attempt. Non-Anthropic fallbacks (openai, groq) were not reached; the gateway timed out the entire cron lane at 180s rather than fast-failing on the repeated definitive 500s
- Anthropic API tier: Tier 1 (50 RPM, 30K input TPM). No rate limiting involved — these were 500 errors, not 429s
- OpenClaw version: 2026.3.12, Docker deployment on Ubuntu 24.04
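The env passthrough mentioned above would typically look like the following docker-compose fragment. The service name and image tag are assumptions; only the env file path and variable names come from this report:

```yaml
# Hypothetical docker-compose fragment. The service name and image tag are
# assumptions; the env file path and variables are from this report.
services:
  openclaw-gateway:
    image: openclaw/gateway:2026.3.12
    env_file:
      - /data/openclaw/.env   # supplies ANTHROPIC_API_KEY, OPENAI_API_KEY, GROQ_API_KEY
```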
Logs, screenshots, and evidence
Impact and severity
No response
Additional information
No response