Summary
Model fallback in live sessions is completely non-functional due to LiveSessionModelSwitch logic intercepting every fallback attempt and reverting it back to the failed primary provider.
Environment
- OpenClaw: 2026.3.28 (f9b1079)
- OS: Ubuntu 22.04 / Linux 6.8.0-71-generic (x64)
- Node: v22.22.1
- Config:
agents.defaults.model.fallbacks includes a secondary provider
Reproduction
- Configure a primary model provider and a fallback provider in
agents.defaults.model.fallbacks
- Have an active live session (e.g., Telegram conversation)
- Primary provider starts returning errors (502/503/429)
- Observe gateway logs
Expected behavior
After primary provider fails the retry budget, traffic should seamlessly switch to the fallback provider. User should receive a response (possibly with slightly higher latency) instead of silence.
Actual behavior
The fallback never executes. Every ~50 seconds, the following cycle repeats indefinitely:
1. Primary provider retried 4 times (~10s each), all fail → 502
2. Failover decision: candidate_failed, next=fallback-provider/model
3. LiveSessionModelSwitch immediately intercepts:
"live session model switch detected: fallback-provider/model -> primary-provider/model"
4. Request forced back to failed primary provider
5. Goto 1
Relevant log pattern (provider names sanitized):
[model-fallback/decision] decision=candidate_failed requested=provider-a/model candidate=provider-a/model reason=timeout next=provider-b/model
[agent/embedded] live session model switch detected before attempt for <session-id>: provider-b/model -> provider-a/model
This cycle continued for 20+ minutes (11+ rounds × ~50s/round) on a single runId, with the same session locked in an infinite retry loop. The user received complete silence during this entire period.
Root cause analysis
The LiveSessionModelSwitch check (introduced for cron model conflict prevention in #57155/#57972) appears to fire on all model changes in live sessions, including legitimate fallback transitions. It compares the fallback candidate against the session's persisted model selection and force-reverts it, effectively making the fallback chain unreachable.
Impact
- Critical for reliability: fallback configuration gives a false sense of redundancy — it's configured but never activates in the most common scenario (live user conversation)
- A 2-3 minute upstream blip becomes 20+ minutes of total unavailability
- Users see "typing..." indicator but never receive a response
- The only recovery path is external intervention (provider restart or gateway restart)
Relationship to #57155
#57155 addressed LiveSessionModelSwitchError in cron isolated sessions and was closed as fixed by #57972. This issue is the live session variant — same underlying mechanism, but instead of throwing an error, it silently hijacks the fallback and creates an infinite retry loop. Arguably worse because it fails silently rather than loudly.
Suggested fix
The resolvePersistedLiveSelection() check should distinguish between:
- User/config-initiated model switches (should be validated)
- Failover-initiated model switches (should be allowed to proceed)
A possible approach: pass a source: "failover" flag through the model resolution chain and skip the live session model check when the switch originates from the failover system.
Summary
Model fallback in live sessions is completely non-functional due to
LiveSessionModelSwitchlogic intercepting every fallback attempt and reverting it back to the failed primary provider.Environment
agents.defaults.model.fallbacksincludes a secondary providerReproduction
agents.defaults.model.fallbacksExpected behavior
After primary provider fails the retry budget, traffic should seamlessly switch to the fallback provider. User should receive a response (possibly with slightly higher latency) instead of silence.
Actual behavior
The fallback never executes. Every ~50 seconds, the following cycle repeats indefinitely:
Relevant log pattern (provider names sanitized):
This cycle continued for 20+ minutes (11+ rounds × ~50s/round) on a single runId, with the same session locked in an infinite retry loop. The user received complete silence during this entire period.
Root cause analysis
The
LiveSessionModelSwitchcheck (introduced for cron model conflict prevention in #57155/#57972) appears to fire on all model changes in live sessions, including legitimate fallback transitions. It compares the fallback candidate against the session's persisted model selection and force-reverts it, effectively making the fallback chain unreachable.Impact
Relationship to #57155
#57155 addressed
LiveSessionModelSwitchErrorin cron isolated sessions and was closed as fixed by #57972. This issue is the live session variant — same underlying mechanism, but instead of throwing an error, it silently hijacks the fallback and creates an infinite retry loop. Arguably worse because it fails silently rather than loudly.Suggested fix
The
resolvePersistedLiveSelection()check should distinguish between:A possible approach: pass a
source: "failover"flag through the model resolution chain and skip the live session model check when the switch originates from the failover system.