Problem
When a cron job (or any embedded agent run) exceeds its configured timeoutSeconds, the embedded runner calls abortRun(true), which sets timedOut=true. This triggers the model fallback cascade:
- scheduleAbortTimer(params.timeoutMs) fires
- abortRun(true) → timedOut = true
- Profile marked as timed out → "Profile X timed out. Trying next account..."
- No more auth profiles → escalates to fallback_model
- runWithModelFallback catches the FailoverError → tries next candidate model
- Next candidate gets whatever time remains (~0-30s) → also times out
- Complete failure with 2-3× the API calls
Why this is counterproductive
When a run's own timeout budget fires, the task is out of time regardless of which model handles it. Falling back to another model with near-zero remaining budget is all but guaranteed to fail: it wastes API calls, increases lane occupancy, and delays delivery of the final error.
This is different from legitimate fallback triggers (provider down, rate limited, auth error) where trying another model makes sense.
Observed impact
On a fleet of ~100 cron jobs:
- 30+ model-fallback events per day, almost all triggered by run timeouts
- Each timeout event generates 2-3 fallback attempts (exhausting the candidate chain)
- Lane queue congestion worsens because failed fallback attempts hold lanes longer
Suggested fix
In the embedded runner's abort handler, distinguish between:
- Run timeout (scheduleAbortTimer fired) → skip model fallback, fail immediately with a clear timeout error
- Provider/model failure (API error, rate limit, idle stream timeout) → trigger model fallback as today
This could be as simple as passing a flag from abortRun through to the failover decision path indicating reason=run_timeout_exceeded (vs reason=provider_timeout), and having runWithModelFallback skip candidates when the reason is the run's own budget.
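As a rough illustration of that decision path, here is a minimal TypeScript sketch. It is not the OpenClaw implementation: every name in it (FailureReason, Candidate, shouldTryNextCandidate, runWithModelFallbackSketch) is hypothetical, the "rate_limited" reason code is an added assumption, and failures are modelled as returned values rather than a thrown FailoverError to keep the example short and synchronous.

```typescript
// Hypothetical reason codes. Only run_timeout_exceeded and provider_timeout
// are named in the proposal; rate_limited is added for illustration.
type FailureReason = "run_timeout_exceeded" | "provider_timeout" | "rate_limited";

type AttemptResult =
  | { ok: true; output: string }
  | { ok: false; reason: FailureReason };

// A candidate model plus a function that performs one attempt with it.
interface Candidate {
  model: string;
  attempt: () => AttemptResult;
}

interface FallbackOutcome {
  result: AttemptResult;
  attempted: string[]; // models actually tried, in order
}

// Fallback only helps when the failure is specific to a provider or model.
// If the run's own budget is exhausted, no other candidate can succeed.
function shouldTryNextCandidate(reason: FailureReason): boolean {
  return reason !== "run_timeout_exceeded";
}

function runWithModelFallbackSketch(candidates: Candidate[]): FallbackOutcome {
  const attempted: string[] = [];
  for (const candidate of candidates) {
    attempted.push(candidate.model);
    const result = candidate.attempt();
    const isLast = attempted.length === candidates.length;
    // Stop on success, on the last candidate, or when the run itself timed out.
    if (result.ok || isLast || !shouldTryNextCandidate(result.reason)) {
      return { result, attempted };
    }
  }
  throw new Error("no candidate models configured");
}

// Usage: a run timeout on the first candidate ends the chain immediately.
const demoOutcome = runWithModelFallbackSketch([
  { model: "requested-model", attempt: () => ({ ok: false, reason: "run_timeout_exceeded" }) },
  { model: "fallback-model", attempt: () => ({ ok: true, output: "not reached" }) },
]);
console.log(demoOutcome.attempted); // [ 'requested-model' ]
```

With this shape, a provider_timeout or rate_limited failure still walks the candidate chain as today, while run_timeout_exceeded costs exactly one attempt and surfaces the timeout error right away.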
Environment
- OpenClaw 2026.4.2
- Source reference: pi-embedded-BYdcxQ5A.js line ~38336 (scheduleAbortTimer) and line ~40158 (Profile timed out)
- Fallback chain construction: resolveFallbackCandidates line ~7126
Workaround
Setting agents.defaults.model.fallbacks = [] reduces the cascade from 3-4 candidates to 2 (requested + global primary). Raising cron timeout budgets reduces the trigger frequency. Neither eliminates the issue.