Cron job timeout aborts entire model fallback chain via shared AbortController

## Summary

When a cron job's timeout fires, the fallback model chain never executes. The abort signal from the cron timeout is shared with all fallback provider attempts, causing them to fail instantly (~100-200ms) without making a network request.

## Version

OpenClaw 2026.3.2 (85377a2), stable channel

## Steps to Reproduce

1. Configure an agent with a model fallback chain (e.g., `openai-codex → anthropic/sonnet → anthropic-b/sonnet → minimax`)
2. Create a cron job with `timeoutSeconds` shorter than the LLM provider's response time (e.g., 90s timeout when the provider is slow/unresponsive)
3. Wait for the primary provider to be slow or rate-limited
4. Observe that the job fails with `FailoverError: LLM request timed out` without trying any fallback providers

## Expected Behavior

When the primary model times out, `runWithModelFallback` should try the next provider in the chain. Each fallback attempt should get its own abort controller and timeout budget.

## Actual Behavior

The cron runner's `executeJobCoreWithTimeout` (gateway-cli, ~line 5732) creates a single `AbortController` and uses `Promise.race` between the job execution and a timeout timer. When the timeout fires:

1. `runAbortController.abort(timeoutErrorMessage())` is called
2. `reject(new Error(timeoutErrorMessage()))` immediately resolves the race
3. The abort signal propagates to `runEmbeddedAttempt` (~line 93531 in pi-embedded), setting `timedOut = true`
4. `runWithModelFallback` catches the `FailoverError` and tries the next candidate
5. **The next candidate receives the same already-aborted signal** via `abortSignal: params.opts.abortSignal` (~line 64123)
6. `onAbort()` fires instantly on the new attempt (signal is already aborted), killing it in milliseconds

Gateway log evidence:
```
05:10:36.411Z Profile openai-codex:default timed out. Trying next account...
05:10:36.414Z lane task error: lane=cron durationMs=125 error="FailoverError: LLM request timed out."
```
The fallback "attempt" completed in 125ms total — no network request was made.

## Non-cron sessions are not affected

Interactive sessions (Telegram, DMs) use a different timeout path that creates a fresh `AbortController` per attempt inside `runWithModelFallback`. Fallback works correctly for non-cron execution.

## Suggested Fix

Option A: In `executeJobCoreWithTimeout`, don't abort the signal when the race fires — just reject the promise. Let `runWithModelFallback` handle its own per-attempt timeouts.

Option B: Create a per-attempt `AbortController` inside `runWithModelFallback` rather than passing through the caller's shared signal.

Option C: Budget the cron timeout across providers (e.g., `jobTimeout / numberOfFallbacks` per attempt).

## Workaround

We're currently mitigating by:
- Staggering concurrent cron jobs to reduce rate-limit pressure on the primary provider
- Setting generous timeouts to give the primary model enough time
- Splitting complex jobs into smaller units

But the fallback chain is effectively non-functional for all cron jobs.

## Environment

- macOS (Darwin 25.3.0)
- OpenClaw 2026.3.2 (85377a2)
- 4 agents, ~40 cron jobs
- Model chain: Codex → Sonnet → Sonnet-B → MiniMax

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Cron job timeout aborts entire model fallback chain via shared AbortController #37505

Summary

Version

Steps to Reproduce

Expected Behavior

Actual Behavior

Non-cron sessions are not affected

Suggested Fix

Workaround

Environment

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Cron job timeout aborts entire model fallback chain via shared AbortController #37505

Description

Summary

Version

Steps to Reproduce

Expected Behavior

Actual Behavior

Non-cron sessions are not affected

Suggested Fix

Workaround

Environment

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions