Skip to content

Fallback models not triggered on provider-level errors (e.g., AWS Bedrock model ID changes) #44353

@lucca-alma

Description

@lucca-alma

Summary

Model fallbacks configured in openclaw.json are not triggered when the primary model fails with a provider-level error. The error is treated as terminal, leaving the session/subagent failed rather than retrying with a fallback.

Context

AWS Bedrock recently changed behavior: raw model IDs (anthropic.claude-opus-4-5-20251101-v1:0) no longer work for on-demand throughput. You must now use inference profile IDs (us.anthropic.claude-opus-4-5-20251101-v1:0).

This broke all Bedrock-primary configurations overnight with no recovery via fallback.

Config

{
  "agents": {
    "defaults": {
      "subagents": {
        "model": {
          "primary": "amazon-bedrock/anthropic.claude-opus-4-5-20251101-v1:0",
          "fallbacks": ["github-copilot/claude-sonnet-4"]
        }
      }
    }
  }
}

Error from Bedrock

Invocation of model ID anthropic.claude-opus-4-5-20251101-v1:0 with on-demand throughput 
is not supported. Retry your request with the ID or ARN of an inference profile that 
contains this model.

Logs (relevant excerpts)

embedded run start: runId=da512e2f-... provider=amazon-bedrock model=anthropic.claude-opus-4-5-20251101-v1:0

embedded run agent end: runId=da512e2f-... isError=true error=Invocation of model ID 
anthropic.claude-opus-4-5-20251101-v1:0 with on-demand throughput is not supported...

embedded run done: runId=da512e2f-... durationMs=853 aborted=false

No fallback attempt logged. Run terminates immediately after isError=true.

Expected Behavior

When primary model fails with a provider error:

  1. Log the error
  2. Check if fallbacks are configured
  3. Retry with next fallback model
  4. Only fail permanently if all fallbacks exhausted

Actual Behavior

  • isError=true → run terminates
  • No fallback attempt
  • Subagent reports failure to parent session
  • All work blocked until config manually fixed

Impact

  • Provider-side changes (like AWS tightening model ID requirements) cause total failure
  • Users must manually diagnose and fix config
  • No graceful degradation despite fallback config being present

Possible Causes

  1. Fallback logic only handles rate limits, not provider errors?
  2. This error type classified as non-retryable?
  3. Fallback only implemented for main sessions, not subagents?

Environment

  • OpenClaw: latest npm
  • Provider: amazon-bedrock
  • OS: macOS (arm64)

Metadata

Metadata

Assignees

No one assigned

    Labels

    P1High-priority user-facing bug, regression, or broken workflow.clawsweeper:fix-shape-clearClawSweeper found a clear likely implementation shape for this issue.clawsweeper:queueable-fixClawSweeper marked this issue as an existing queue_fix_pr work candidate.clawsweeper:source-reproClawSweeper found a high-confidence source-level issue reproduction.impact:auth-providerAuth, provider routing, model choice, or SecretRef resolution may break.impact:session-stateSession, memory, transcript, context, or agent state can drift or corrupt.

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions