Embedded agent runs do not use model fallback chain #54698

@noamrazbuilds

Description

Summary

When the primary model returns overloaded (503) errors, the main agent lane correctly falls back to the configured fallback model. However, embedded agent runs (subagents, followups, heartbeats) never trigger the model fallback mechanism; they repeatedly retry the primary model and then fail without ever attempting the fallback.

Environment

  • OpenClaw 2026.3.24 (cff6dc9)
  • Ubuntu 22.04 on DigitalOcean (4 vCPU / 8 GB)
  • Node 22.22.1
  • Gateway: systemd service, loopback bind

Model configuration

  • Primary: anthropic/claude-sonnet-4-6
  • Fallback: openai/gpt-4.1

Steps to reproduce

  1. Configure a primary model and a fallback model with valid API keys
  2. Wait for the primary model's API to return 503 overloaded errors
  3. Send a message via Telegram (or any channel)

Expected behavior

All agent runs (main lane AND embedded) should fall back to the configured fallback model when the primary model is overloaded.

Actual behavior

  • Main lane run: Correctly triggers model_fallback_decision, falls back to openai/gpt-4.1, and succeeds.
  • Embedded agent runs: Only emit embedded_run_agent_end with isError: true and failoverReason: "overloaded". No model_fallback_decision or embedded_run_failover_decision is logged for these runs. They retry the primary model multiple times and then fail without attempting the fallback.

Log evidence

Main lane (working fallback):

model_fallback_decision: candidate_failed (anthropic/claude-sonnet-4-6, overloaded)
model_fallback_decision: candidate_succeeded (openai/gpt-4.1)

Embedded runs (no fallback):

embedded_run_agent_end: isError=true, model=claude-sonnet-4-6, failoverReason=overloaded
embedded_run_agent_end: isError=true, model=claude-sonnet-4-6, failoverReason=overloaded
embedded_run_agent_end: isError=true, model=claude-sonnet-4-6, failoverReason=overloaded
(repeats ~10 times with no fallback attempt)
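The log pattern above suggests two different code paths: the main lane walks the fallback chain, while embedded runs retry only the primary model. A minimal sketch of that suspected divergence (hypothetical names, not OpenClaw's actual internals):

```typescript
type Model = string;
type Caller = (model: Model) => string; // throws on a 503 "overloaded" error

// Main lane (observed behavior): walk the fallback chain until a candidate succeeds.
function runWithFallback(chain: Model[], call: Caller): string {
  for (const model of chain) {
    try {
      return call(model); // candidate_succeeded
    } catch {
      // candidate_failed: try the next model in the chain
    }
  }
  throw new Error("all candidates failed");
}

// Embedded run (observed behavior): retries the primary model only,
// never consulting the fallback chain.
function runEmbedded(primary: Model, call: Caller, retries = 3): string {
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      return call(primary);
    } catch {
      // overloaded: retry the same model
    }
  }
  throw new Error("overloaded");
}

// Simulated providers: primary always overloaded, fallback healthy.
const call: Caller = (model) => {
  if (model === "anthropic/claude-sonnet-4-6") throw new Error("503 overloaded");
  return `ok from ${model}`;
};

const chain = ["anthropic/claude-sonnet-4-6", "openai/gpt-4.1"];
console.log(runWithFallback(chain, call)); // main lane recovers via fallback
try {
  runEmbedded(chain[0], call);
} catch (e) {
  console.log(`embedded run failed: ${(e as Error).message}`);
}
```

With this shape, the same outage makes the main lane succeed and every embedded run fail, matching the logs. The fix would be to route embedded runs through the same chain-walking path as the main lane.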

Impact

During an API outage affecting the primary model, the bot becomes partially non-functional even when a healthy fallback model is configured. The user-facing reply may succeed (via main lane fallback), but embedded runs (followups, heartbeats, tool execution) continue to fail, causing error messages and degraded behavior.

Workaround

Switch the default model itself to the fallback model:

openclaw models set openai/gpt-4.1
openclaw gateway restart

This forces all runs (main and embedded) to use the working model directly rather than relying on the fallback chain.

Date observed

2026-03-25, during Anthropic incident "Elevated errors on Claude Opus 4.6" (status.claude.com)
