Summary
When a cron job's timeout fires, the fallback model chain never executes. The abort signal from the cron timeout is shared with all fallback provider attempts, causing them to fail instantly (~100-200ms) without making a network request.
Version
OpenClaw 2026.3.2 (85377a2), stable channel
Steps to Reproduce
- Configure an agent with a model fallback chain (e.g., openai-codex → anthropic/sonnet → anthropic-b/sonnet → minimax)
- Create a cron job with timeoutSeconds shorter than the LLM provider's response time (e.g., 90s timeout when the provider is slow/unresponsive)
- Wait for the primary provider to be slow or rate-limited
- Observe that the job fails with "FailoverError: LLM request timed out." without trying any fallback providers
Expected Behavior
When the primary model times out, runWithModelFallback should try the next provider in the chain. Each fallback attempt should get its own abort controller and timeout budget.
Actual Behavior
The cron runner's executeJobCoreWithTimeout (gateway-cli, ~line 5732) creates a single AbortController and uses Promise.race between the job execution and a timeout timer. When the timeout fires:
- runAbortController.abort(timeoutErrorMessage()) is called
- reject(new Error(timeoutErrorMessage())) immediately settles the race with a rejection
- The abort signal propagates to runEmbeddedAttempt (~line 93531 in pi-embedded), setting timedOut = true
- runWithModelFallback catches the FailoverError and tries the next candidate
- The next candidate receives the same already-aborted signal via abortSignal: params.opts.abortSignal (~line 64123)
- onAbort() fires instantly on the new attempt (signal is already aborted), killing it in milliseconds
Gateway log evidence:
05:10:36.411Z Profile openai-codex:default timed out. Trying next account...
05:10:36.414Z lane task error: lane=cron durationMs=125 error="FailoverError: LLM request timed out."
The fallback "attempt" completed in 125ms total — no network request was made.
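The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not OpenClaw's actual code: attemptWithSignal is a hypothetical stand-in for a provider attempt, and the only assumption is that an attempt checks its signal before doing any network I/O.

```typescript
// Hypothetical stand-in for a provider attempt; not OpenClaw's real implementation.
function attemptWithSignal(provider: string, signal: AbortSignal): string {
  // An attempt that honors its signal bails out before any network I/O
  // if the signal is already aborted.
  if (signal.aborted) {
    return `${provider}: aborted before request`;
  }
  return `${provider}: request sent`;
}

// One controller shared by the whole fallback chain, as in the cron path.
const shared = new AbortController();

// The primary attempt starts while the signal is still live.
const beforeTimeout = attemptWithSignal("openai-codex", shared.signal);

// The cron timeout fires and aborts the shared signal.
shared.abort(new Error("LLM request timed out."));

// Every later candidate receives the same, already-aborted signal.
const afterTimeout = ["anthropic/sonnet", "anthropic-b/sonnet", "minimax"].map(
  (provider) => attemptWithSignal(provider, shared.signal),
);

console.log(beforeTimeout);
console.log(afterTimeout);
```

An AbortSignal cannot be reset once aborted, so no fallback that receives this signal can ever get as far as sending a request.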
Non-cron sessions are not affected
Interactive sessions (Telegram, DMs) use a different timeout path that creates a fresh AbortController per attempt inside runWithModelFallback. Fallback works correctly for non-cron execution.
Suggested Fix
Option A: In executeJobCoreWithTimeout, don't abort the signal when the race fires — just reject the promise. Let runWithModelFallback handle its own per-attempt timeouts.
Option B: Create a per-attempt AbortController inside runWithModelFallback rather than passing through the caller's shared signal.
Option C: Budget the cron timeout across providers (e.g., jobTimeout / numberOfFallbacks per attempt).
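As a rough illustration of how Options B and C could combine, here is a sketch under stated assumptions. The names runWithPerAttemptAbort, perAttemptBudgetMs, and Attempt are hypothetical and do not reflect the real runWithModelFallback signature. The sketch also assumes each attempt rejects once its signal aborts.

```typescript
// Hypothetical attempt shape: must reject when `signal` aborts.
type Attempt<T> = (provider: string, signal: AbortSignal) => Promise<T>;

// Option C: split the job's total budget evenly across candidates.
function perAttemptBudgetMs(jobTimeoutMs: number, candidates: number): number {
  return Math.floor(jobTimeoutMs / Math.max(1, candidates));
}

// Option B: each candidate gets a fresh AbortController and its own timer,
// so a timeout on one provider cannot poison the next attempt.
async function runWithPerAttemptAbort<T>(
  providers: string[],
  jobTimeoutMs: number,
  attempt: Attempt<T>,
): Promise<T> {
  const budgetMs = perAttemptBudgetMs(jobTimeoutMs, providers.length);
  let lastError: unknown = new Error("no providers configured");

  for (const provider of providers) {
    const controller = new AbortController();
    const timer = setTimeout(() => {
      controller.abort(new Error(`${provider} timed out after ${budgetMs}ms`));
    }, budgetMs);

    try {
      return await attempt(provider, controller.signal);
    } catch (err) {
      lastError = err; // remember the failure and try the next candidate
    } finally {
      clearTimeout(timer); // never leave a timer running past its attempt
    }
  }

  throw lastError;
}
```

With the 90s timeout and four-provider chain from this report, an even split gives each attempt 22.5s. A real fix would probably also need to forward a genuine user-initiated cancellation from the caller's signal to the per-attempt controller, which this sketch leaves out.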
Workaround
We're currently mitigating by:
- Staggering concurrent cron jobs to reduce rate-limit pressure on the primary provider
- Setting generous timeouts to give the primary model enough time
- Splitting complex jobs into smaller units
But the fallback chain is effectively non-functional for all cron jobs.
Environment
- macOS (Darwin 25.3.0)
- OpenClaw 2026.3.2 (85377a2)
- 4 agents, ~40 cron jobs
- Model chain: Codex → Sonnet → Sonnet-B → MiniMax