Bug: Embedded agent runs do not use model fallback chain
Summary
When the primary model returns overloaded (503) errors, the main agent lane correctly falls back to the configured fallback model. However, embedded agent runs (subagents, followups, heartbeats) do not trigger the model fallback mechanism; they retry the primary model repeatedly until the run fails.
Environment
- OpenClaw 2026.3.24 (cff6dc9)
- Ubuntu 22.04 on DigitalOcean (4 vCPU / 8 GB)
- Node 22.22.1
- Gateway: systemd service, loopback bind
Model configuration
- Primary: anthropic/claude-sonnet-4-6
- Fallback: openai/gpt-4.1
- API keys: auth-profiles.json
Steps to reproduce
- Configure a primary model and a fallback model with valid API keys
- Wait for the primary model's API to return 503 overloaded errors
- Send a message via Telegram (or any channel)
Expected behavior
All agent runs (main lane AND embedded) should fall back to the configured fallback model when the primary model is overloaded.
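The expected behavior can be sketched as a single helper that both the main lane and embedded runs would call. This is a minimal illustration, not OpenClaw's implementation: `runWithFallback`, `isOverloaded`, and the `attempt` callback are hypothetical names, and the only detail taken from this report is the `failoverReason: "overloaded"` marker. The call is synchronous for brevity; a real model call would be asynchronous.

```typescript
// Hypothetical sketch; none of these names come from the OpenClaw codebase.

interface FallbackResult<T> {
  model: string;    // candidate that produced the result
  value: T;         // value returned by that candidate
  failed: string[]; // candidates skipped because they were overloaded
}

// Assumption: an overloaded failure carries failoverReason === "overloaded",
// mirroring the field seen in the embedded_run_agent_end log lines.
function isOverloaded(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { failoverReason?: unknown }).failoverReason === "overloaded"
  );
}

// Try each candidate in order. An overloaded error advances to the next
// candidate instead of retrying the same model; any other error propagates.
function runWithFallback<T>(
  candidates: string[],
  attempt: (model: string) => T,
): FallbackResult<T> {
  const failed: string[] = [];
  for (const model of candidates) {
    try {
      const value = attempt(model);
      return { model, value, failed };
    } catch (err) {
      if (!isOverloaded(err)) throw err;
      failed.push(model);
    }
  }
  throw new Error("all candidates overloaded: " + failed.join(", "));
}
```

With the configuration in this report, the candidate list would be anthropic/claude-sonnet-4-6 followed by openai/gpt-4.1, and an embedded run would reach the second entry after the first overloaded failure.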
Actual behavior
- Main lane run: Correctly triggers model_fallback_decision, falls back to openai/gpt-4.1, and succeeds.
- Embedded agent runs: Only emit embedded_run_agent_end with isError: true and failoverReason: "overloaded". No model_fallback_decision or embedded_run_failover_decision is logged for these runs. They retry the primary model multiple times and then fail without attempting the fallback.
Log evidence
Main lane (working fallback):
model_fallback_decision: candidate_failed (anthropic/claude-sonnet-4-6, overloaded)
model_fallback_decision: candidate_succeeded (openai/gpt-4.1)
Embedded runs (no fallback):
embedded_run_agent_end: isError=true, model=claude-sonnet-4-6, failoverReason=overloaded
embedded_run_agent_end: isError=true, model=claude-sonnet-4-6, failoverReason=overloaded
embedded_run_agent_end: isError=true, model=claude-sonnet-4-6, failoverReason=overloaded
(repeats ~10 times with no fallback attempt)
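To check a longer log for the same pattern, a small script can count events by name. The sketch below assumes each relevant line starts with an event name followed by a colon, as in the excerpts above; real gateway log lines may carry timestamps or other prefixes, in which case the pattern would need adjusting. `summarizeLog` is a hypothetical helper, not part of OpenClaw.

```typescript
// Hypothetical diagnostic helper; the line format is assumed from the excerpts above.

interface LogSummary {
  events: Record<string, number>; // occurrences per event name
  embeddedOverloaded: number;     // embedded runs that ended with failoverReason=overloaded
}

function summarizeLog(lines: string[]): LogSummary {
  const summary: LogSummary = { events: {}, embeddedOverloaded: 0 };
  for (const line of lines) {
    // Match a leading "event_name:" prefix; skip anything else.
    const match = /^\s*([a-z_]+):/.exec(line);
    if (!match) continue;
    const name = match[1];
    if (name === undefined) continue;
    summary.events[name] = (summary.events[name] ?? 0) + 1;
    if (
      name === "embedded_run_agent_end" &&
      /isError=true/.test(line) &&
      /failoverReason=overloaded/.test(line)
    ) {
      summary.embeddedOverloaded += 1;
    }
  }
  return summary;
}
```

A summary with a non-zero embeddedOverloaded count and no embedded_run_failover_decision entry in events matches the behavior described in this report.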
Impact
During an API outage affecting the primary model, the bot becomes partially non-functional even when a healthy fallback model is configured. The user-facing reply may succeed (via main lane fallback), but embedded runs (followups, heartbeats, tool execution) continue to fail, causing error messages and degraded behavior.
Workaround
Switch the default model itself to the fallback model:
openclaw models set openai/gpt-4.1
openclaw gateway restart
This forces all runs (main and embedded) to use the working model directly rather than relying on the fallback chain.
Date observed
2026-03-25, during Anthropic incident "Elevated errors on Claude Opus 4.6" (status.claude.com)