
Don't trigger model fallback when abort reason is the run's own timeout budget #60388

@cinapbot


Problem

When a cron job (or any embedded agent run) exceeds its configured timeoutSeconds, the embedded runner calls abortRun(true), which sets timedOut=true. This triggers the model fallback cascade:

  1. scheduleAbortTimer(params.timeoutMs) fires
  2. abortRun(true) → timedOut = true
  3. Profile marked as timed out → "Profile X timed out. Trying next account..."
  4. No more auth profiles → escalates to fallback_model
  5. runWithModelFallback catches the FailoverError → tries next candidate model
  6. Next candidate gets whatever time remains (~0-30s) → also times out
  7. Complete failure with 2-3× the API calls
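
To make the arithmetic of the cascade concrete, here is a minimal sketch that models it without touching the real runner. Everything in it (`simulateCascade`, `Attempt`, the `leftoverMs` parameter, the candidate names) is hypothetical and only mirrors the steps above; it assumes the task needs the same amount of time on every model.

```typescript
// Hypothetical model of the cascade described above; none of these names
// come from the OpenClaw source.
interface Attempt {
  model: string;
  budgetMs: number; // time this candidate was given
  timedOut: boolean;
}

function simulateCascade(
  candidates: string[],
  timeoutMs: number, // the run's configured budget
  taskMs: number, // assumed time the task needs on any model
  leftoverMs: number, // assumed time left for later candidates
): Attempt[] {
  const attempts: Attempt[] = [];
  let budgetMs = timeoutMs;
  for (const model of candidates) {
    const timedOut = taskMs > budgetMs;
    attempts.push({ model, budgetMs, timedOut });
    if (!timedOut) break; // a successful attempt ends the cascade
    budgetMs = leftoverMs; // later candidates only get the leftover
  }
  return attempts;
}

// A task that needs 400s against a 300s budget, with 30s left over
// for each fallback candidate.
const attempts = simulateCascade(
  ["requested", "fallback-a", "fallback-b"],
  300_000,
  400_000,
  30_000,
);
console.log(attempts.map((a) => `${a.model}: timedOut=${a.timedOut}`));
```

Under these assumptions every candidate is attempted and every attempt times out, which matches the 2-3× API call count in step 7.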

Why this is counterproductive

When a run's own timeout budget fires, the task is out of time regardless of which model handles it. Falling back to another model with near-zero remaining budget is guaranteed to fail, wastes API calls, increases lane occupancy, and delays the final error delivery.

This is different from legitimate fallback triggers (provider down, rate limited, auth error) where trying another model makes sense.

Observed impact

On a fleet of ~100 cron jobs:

  • 30+ model-fallback events per day, almost all triggered by run timeouts
  • Each timeout event generates 2-3 fallback attempts (exhausting the candidate chain)
  • Lane queue congestion worsens because failed fallback attempts hold lanes longer

Suggested fix

In the embedded runner's abort handler, distinguish between:

  • Run timeout (scheduleAbortTimer fired) → skip model fallback, fail immediately with a clear timeout error
  • Provider/model failure (API error, rate limit, idle stream timeout) → trigger model fallback as today

This could be as simple as passing a flag from abortRun through to the failover decision path indicating reason=run_timeout_exceeded (vs reason=provider_timeout), and having runWithModelFallback skip candidates when the reason is the run's own budget.
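
A rough sketch of what that could look like. This is not the actual OpenClaw code: the `AbortReason` type, the shape of `FailoverError`, and the signature of `runWithModelFallback` below are all assumptions made for illustration, and only the two reason strings come from the proposal above.

```typescript
// Hypothetical stand-ins; the real FailoverError and runWithModelFallback
// in OpenClaw may look quite different.
type AbortReason = "run_timeout_exceeded" | "provider_timeout";

class FailoverError extends Error {
  constructor(message: string, readonly reason: AbortReason) {
    super(message);
    this.name = "FailoverError";
  }
}

// True when the error says the run's own budget is spent.
function isRunTimeout(err: unknown): boolean {
  return (
    err instanceof Error &&
    (err as FailoverError).reason === "run_timeout_exceeded"
  );
}

async function runWithModelFallback<T>(
  candidates: string[],
  runOnce: (model: string) => Promise<T>,
): Promise<T> {
  let lastError: unknown = new Error("no candidate models configured");
  for (const model of candidates) {
    try {
      return await runOnce(model);
    } catch (err) {
      // Out of time for the whole run: no other model can help, fail now.
      if (isRunTimeout(err)) throw err;
      // Provider-side failure: remember it and try the next candidate.
      lastError = err;
    }
  }
  throw lastError;
}
```

With this shape, a `run_timeout_exceeded` error surfaces after a single attempt, while a `provider_timeout` still walks the candidate chain as it does today.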

Environment

  • OpenClaw 2026.4.2
  • Source reference: pi-embedded-BYdcxQ5A.js line ~38336 (scheduleAbortTimer) and line ~40158 (Profile timed out)
  • Fallback chain construction: resolveFallbackCandidates line ~7126

Workaround

Setting agents.defaults.model.fallbacks = [] reduces the cascade from 3-4 candidates to 2 (requested + global primary). Raising cron timeout budgets reduces the trigger frequency. Neither eliminates the issue.
