Skip to content

Cron job timeout aborts entire model fallback chain via shared AbortController #37505

@n3moxyz

Description

@n3moxyz

Summary

When a cron job's timeout fires, the fallback model chain never executes. The abort signal from the cron timeout is shared with all fallback provider attempts, causing them to fail instantly (~100-200ms) without making a network request.

Version

OpenClaw 2026.3.2 (85377a2), stable channel

Steps to Reproduce

  1. Configure an agent with a model fallback chain (e.g., openai-codex → anthropic/sonnet → anthropic-b/sonnet → minimax)
  2. Create a cron job with timeoutSeconds shorter than the LLM provider's response time (e.g., 90s timeout when the provider is slow/unresponsive)
  3. Wait for the primary provider to be slow or rate-limited
  4. Observe that the job fails with FailoverError: LLM request timed out without trying any fallback providers

Expected Behavior

When the primary model times out, runWithModelFallback should try the next provider in the chain. Each fallback attempt should get its own abort controller and timeout budget.

Actual Behavior

The cron runner's executeJobCoreWithTimeout (gateway-cli, ~line 5732) creates a single AbortController and uses Promise.race between the job execution and a timeout timer. When the timeout fires:

  1. runAbortController.abort(timeoutErrorMessage()) is called
  2. reject(new Error(timeoutErrorMessage())) immediately resolves the race
  3. The abort signal propagates to runEmbeddedAttempt (~line 93531 in pi-embedded), setting timedOut = true
  4. runWithModelFallback catches the FailoverError and tries the next candidate
  5. The next candidate receives the same already-aborted signal via abortSignal: params.opts.abortSignal (~line 64123)
  6. onAbort() fires instantly on the new attempt (signal is already aborted), killing it in milliseconds

Gateway log evidence:

05:10:36.411Z Profile openai-codex:default timed out. Trying next account...
05:10:36.414Z lane task error: lane=cron durationMs=125 error="FailoverError: LLM request timed out."

The fallback "attempt" completed in 125ms total — no network request was made.

Non-cron sessions are not affected

Interactive sessions (Telegram, DMs) use a different timeout path that creates a fresh AbortController per attempt inside runWithModelFallback. Fallback works correctly for non-cron execution.

Suggested Fix

Option A: In executeJobCoreWithTimeout, don't abort the signal when the race fires — just reject the promise. Let runWithModelFallback handle its own per-attempt timeouts.

Option B: Create a per-attempt AbortController inside runWithModelFallback rather than passing through the caller's shared signal.

Option C: Budget the cron timeout across providers (e.g., jobTimeout / numberOfFallbacks per attempt).

Workaround

We're currently mitigating by:

  • Staggering concurrent cron jobs to reduce rate-limit pressure on the primary provider
  • Setting generous timeouts to give the primary model enough time
  • Splitting complex jobs into smaller units

But the fallback chain is effectively non-functional for all cron jobs.

Environment

  • macOS (Darwin 25.3.0)
  • OpenClaw 2026.3.2 (85377a2)
  • 4 agents, ~40 cron jobs
  • Model chain: Codex → Sonnet → Sonnet-B → MiniMax

Metadata

Metadata

Assignees

No one assigned

    Labels

    staleMarked as stale due to inactivity

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions