[Bug]: Cron agent jobs silently time out during sustained LLM API outages instead of fast-failing on definitive errors #45494

@03marcbluechain

Bug type

Regression (worked before, now fails)

Summary

When the primary LLM provider returns sustained HTTP 500 (Internal Server Error) responses, OpenClaw cron agent jobs exhaust their full timeoutSeconds window (e.g. 180s) on each attempt rather than fast-failing on a definitive API error. This results in multiple consecutive timeout failures and missed data collection windows.

Environment

  • OpenClaw Gateway: 2026.3.11
  • Primary model: anthropic/claude-haiku-4-5
  • Fallback chain: sonnet → gpt-5-mini → kimi → gpt-5.4 → opus

Steps to reproduce

  1. Create a cron job with timeoutSeconds: 180 running on a 15-minute schedule
  2. During the cron window, the primary LLM provider (Anthropic) returns api_error: Internal server error repeatedly
  3. Observe that each cron run spends the full 180 seconds waiting before timing out

Expected behavior

When the LLM provider returns a definitive, non-retryable error (HTTP 500, repeated across the fallback chain), cron jobs should:

  1. Fast-fail — detect all fallbacks returned the same definitive error class and abort early
  2. Retry with backoff — reschedule after a short delay rather than silently dropping
  3. Surface to delivery — if delivery.mode is configured, notify the user the run was skipped due to provider outage
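
The first two expectations above could be sketched roughly as follows. This is a minimal illustration, not OpenClaw's actual failover code: the type names, `classify`, `decide`, and the backoff defaults (30s base, 5 min cap) are all hypothetical.

```typescript
// Hypothetical types and names; none of these come from the OpenClaw codebase.
type ErrorClass = "server_error" | "rate_limited" | "no_response" | "other";

interface AttemptResult {
  model: string;        // e.g. "anthropic/claude-haiku-4-5"
  httpStatus?: number;  // undefined when no HTTP response arrived
  elapsedMs: number;
}

type Decision =
  | { action: "continue" }
  | { action: "abort"; reason: string; retryInMs: number };

// Map an HTTP status to a coarse error class.
function classify(status?: number): ErrorClass {
  if (status === undefined) return "no_response";
  if (status === 429) return "rate_limited";
  if (status >= 500 && status <= 599) return "server_error";
  return "other";
}

// Abort early once every model in the chain has returned a 5xx,
// and propose an exponential backoff delay for the reschedule.
function decide(
  attempts: AttemptResult[],
  chainLength: number,
  retryCount: number,
  baseDelayMs = 30_000,   // assumed default
  maxDelayMs = 300_000,   // assumed cap
): Decision {
  const chainExhausted = attempts.length >= chainLength;
  const allServerErrors =
    attempts.length > 0 &&
    attempts.every((a) => classify(a.httpStatus) === "server_error");

  if (chainExhausted && allServerErrors) {
    const retryInMs = Math.min(baseDelayMs * 2 ** retryCount, maxDelayMs);
    return {
      action: "abort",
      reason: `all ${attempts.length} models returned 5xx`,
      retryInMs,
    };
  }
  return { action: "continue" };
}

// Illustrative data: three attempts that each failed with HTTP 500.
const outage: AttemptResult[] = [
  { model: "anthropic/claude-haiku-4-5", httpStatus: 500, elapsedMs: 1200 },
  { model: "anthropic/claude-sonnet-4-6", httpStatus: 500, elapsedMs: 1500 },
  { model: "anthropic/claude-opus-4-6", httpStatus: 500, elapsedMs: 1800 },
];

const firstTry = decide(outage, 3, 0);
console.log(firstTry);
```

The returned `reason` string is the kind of message that could be forwarded to delivery when delivery.mode is configured, so that a skipped run is visible rather than silent.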

Actual behavior

Gateway logs show:

[agent/embedded] embedded run agent end: isError=true error=LLM error api_error: Internal server error
[diagnostic] lane task error: lane=cron durationMs=180004 error="FailoverError: LLM request timed out."
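
To spot affected runs in a log file, a small parser along these lines may help. It is only a sketch that assumes the diagnostic line keeps the exact shape shown above; `parseLaneError` and `hitTimeout` are hypothetical helper names, not part of OpenClaw.

```typescript
interface LaneError {
  lane: string;
  durationMs: number;
  error: string;
}

// Extract lane, duration, and error text from a "lane task error" log line.
// Returns null for lines that do not match the assumed format.
function parseLaneError(line: string): LaneError | null {
  const m = /lane task error: lane=(\S+) durationMs=(\d+) error="([^"]*)"/.exec(line);
  if (!m) return null;
  return { lane: m[1], durationMs: Number(m[2]), error: m[3] };
}

// True when the run consumed at least the full configured timeout.
function hitTimeout(e: LaneError, timeoutSeconds: number): boolean {
  return e.durationMs >= timeoutSeconds * 1000;
}

const sampleLine =
  '[diagnostic] lane task error: lane=cron durationMs=180004 error="FailoverError: LLM request timed out."';
const parsedSample = parseLaneError(sampleLine);
if (parsedSample) {
  console.log(parsedSample.lane, parsedSample.durationMs, hitTimeout(parsedSample, 180));
}
```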

Four consecutive cron runs each waited the full 180 seconds before failing. On a 15-minute cron schedule, this meant the entire data collection window was missed (4 × 180s = 12 minutes of blocked lane time).

OpenClaw version

2026.3.12

Operating system

Ubuntu 24.04

Install method

No response

Model

anthropic/claude-haiku-4-5

Provider / routing chain

openclaw (cron agent, isolated session)
  → anthropic/claude-haiku-4-5 (primary, direct API)
  → anthropic/claude-sonnet-4-6 (fallback 1, direct API)
  → openai/gpt-5-mini (fallback 2, direct API)
  → groq/moonshotai/kimi-k2-instruct-0905 (fallback 3, direct API)
  → openai/gpt-5.4 (fallback 4, direct API)
  → anthropic/claude-opus-4-6 (fallback 5, direct API)

No intermediate proxy or AI gateway. Direct to provider APIs.

Config file / key location

  • ~/.openclaw/openclaw.json: agents.defaults.model.primary, agents.defaults.model.fallbacks[]
  • ~/.openclaw/agents/main/agent/auth-profiles.json (env-sourced: ANTHROPIC_API_KEY, OPENAI_API_KEY, GROQ_API_KEY; no per-agent overrides)
  • Cron job state is persisted in gateway PostgreSQL (zeroclaw DB), not a flat config file.

Additional provider/model setup details

  • No Cloudflare AI Gateway, no vLLM, no OpenRouter — all providers are direct API connections
  • ANTHROPIC_API_KEY sourced from Docker environment variable (docker-compose passthrough from /data/openclaw/.env)
  • Failing cron job config: sessionTarget: isolated, wakeMode: now, timeoutSeconds: 180, schedule: cron 50,5,20,35 ... @ UTC
  • During the incident, Anthropic returned api_error: Internal server error on all Anthropic-provider fallbacks (haiku, sonnet, opus) within 1–2s per attempt. Non-Anthropic fallbacks (openai, groq) were not reached: the gateway timed out the entire cron lane at 180s rather than fast-failing on the repeated definitive 500s
  • Anthropic API tier: Tier 1 (50 RPM, 30K input TPM). No rate limiting involved — these were 500 errors, not 429s
  • OpenClaw version: 2026.3.12, Docker deployment on Ubuntu 24.04
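
For a rough sense of scale, the figures above can be compared with a fast-fail scenario. The sketch below uses the reported numbers (4 runs, 180s timeout, up to 2s per failed attempt) and assumes, hypothetically, that every model in the six-model chain fails within 2s; the non-Anthropic fallbacks were never reached, so their latency is a guess.

```typescript
// Total lane time blocked across a number of runs.
function blockedSeconds(runs: number, secondsPerRun: number): number {
  return runs * secondsPerRun;
}

const runs = 4;                 // consecutive failed runs reported
const timeoutSeconds = 180;     // configured timeoutSeconds
const chainModels = 6;          // primary + 5 fallbacks
const worstAttemptSeconds = 2;  // upper end of the reported 1–2s per attempt (assumed for all models)

const observedBlocked = blockedSeconds(runs, timeoutSeconds);
const fastFailBlocked = blockedSeconds(runs, chainModels * worstAttemptSeconds);

console.log(observedBlocked);  // 720 (12 minutes)
console.log(fastFailBlocked);  // 48
```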

Logs, screenshots, and evidence

Impact and severity

No response

Additional information

No response


Labels

bug (Something isn't working), regression (Behavior that previously worked and now fails)
