Skip to content

fix(agents): preserve fallback candidates in live sessions#57227

Closed
goldmar wants to merge 6 commits into
openclaw:mainfrom
goldmar:fix/live-session-fallback-regression
Closed

fix(agents): preserve fallback candidates in live sessions#57227
goldmar wants to merge 6 commits into
openclaw:mainfrom
goldmar:fix/live-session-fallback-regression

Conversation

@goldmar

@goldmar goldmar commented Mar 29, 2026

Copy link
Copy Markdown
Contributor

Summary

  • stop live-session model selection from snapping back to the agent default when there is no explicit session override
  • preserve normal live /model behavior by still honoring stored session overrides from the session store
  • add regression coverage for agent-session fallback so cross-provider retries keep running after a primary rate limit

Root cause

resolveLiveSessionModelSelection() treated the agent default model as the persisted live-session selection whenever an agentId existed. During fallback, the embedded runner compared the current retry candidate against that agent default and threw LiveSessionModelSwitchError, which made internal fallback candidates look like user-requested live model switches.

Fix

Use the current attempt's provider/model as the default live-session selection baseline. Only switch away when the session store has an explicit override.

Validation

  • pnpm --dir /home/openclaw/openclaw exec vitest run src/agents/live-model-switch.test.ts
  • pnpm --dir /home/openclaw/openclaw exec vitest run --config vitest.e2e.config.ts src/agents/model-fallback.run-embedded.e2e.test.ts
  • rebuilt local dist/, restarted the gateway, reverted the live config back to anthropic/claude-sonnet-4-6, and verified a direct agent turn fell back to openai-codex/gpt-5.4
  • local command used for the live check:
    • openclaw agent --agent main --message 'Reply with FALLBACK_FIX_OK and nothing else.' --json
  • observed result for run a2ac9644-34bf-41db-9457-50261e59b067:
    • Anthropic primary returned 429 rate_limit_error
    • fallback then succeeded on openai-codex/gpt-5.4
    • no fresh live session model switch detected before attempt log entries were emitted for that run

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: S labels Mar 29, 2026
@greptile-apps

greptile-apps Bot commented Mar 29, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes a bug in resolveLiveSessionModelSelection() where cross-provider fallback candidates were incorrectly treated as user-requested live model switches for agent sessions. The root cause was that when an agentId was present, the function used the agent's configured default model as the session baseline instead of the current attempt's provider/model — causing hasDifferentLiveSessionModelSelection to fire a LiveSessionModelSwitchError on every fallback attempt.

Changes:

  • live-model-switch.ts: Removes the resolveDefaultModelForAgent call and uses params.defaultProvider/params.defaultModel directly as the baseline. The call site in pi-embedded-runner/run.ts already passes the current attempt's provider and model for these fields, so no change was needed there.
  • live-model-switch.test.ts: Removes the now-unnecessary resolveDefaultModelForAgent mock and adds a dedicated test for the no-override fallback path.
  • model-fallback.run-embedded.e2e.test.ts: Adds an e2e regression test covering cross-provider fallback for agent sessions without explicit session store overrides (agentId: "main").

Confidence Score: 5/5

Safe to merge — targeted, well-understood fix with matching unit and e2e regression coverage.

The change is minimal and surgical: one function call removed, two field accesses simplified to use the already-correct caller-supplied values. The call site in run.ts already passes the current attempt's provider/model, so the fix is properly integrated. New unit test and e2e regression test both cover the exact scenario described in the root cause. No P0 or P1 issues found.

No files require special attention.

Important Files Changed

Filename Overview
src/agents/live-model-switch.ts Core fix: removes resolveDefaultModelForAgent call and uses params.defaultProvider/defaultModel directly as the session baseline, preventing false LiveSessionModelSwitchError during fallback.
src/agents/live-model-switch.test.ts Removes now-unnecessary resolveDefaultModelForAgent mock and adds a dedicated unit test for the no-override fallback path.
src/agents/model-fallback.run-embedded.e2e.test.ts Adds agentId param to runEmbeddedFallback and a new e2e regression test covering cross-provider fallback for agent sessions without explicit session store overrides.

Reviews (1): Last reviewed commit: "fix(agents): preserve fallback candidate..." | Re-trigger Greptile

@goldmar

goldmar commented Mar 29, 2026

Copy link
Copy Markdown
Contributor Author

Follow-up validation after local deploy:

  • rebuilt the patched repo and restarted the local gateway on build 0a9bfa8
  • exercised the previously failing Telegram/live-session path directly against session 27c82782-4e2a-40a3-a626-f7b94b796154
  • command used:
    openclaw agent --agent main --session-id 27c82782-4e2a-40a3-a626-f7b94b796154 --message 'Reply with POST_DEPLOY_TELEGRAM_FALLBACK_OK and nothing else.' --json\n  ```\n- result completed on `openai-codex/gpt-5.4` with payload `POST_DEPLOY_TELEGRAM_FALLBACK_OK`\n- fresh logs for run `f2eded73-809c-4d58-a0ff-17d6316a343f` showed Anthropic `429` on the primary attempt, then `model_fallback_decision` `candidate_succeeded` for `openai-codex/gpt-5.4`\n- importantly, there were no new `live session model switch detected before attempt` lines for that run\n\nSo the follow-up commit fixes the remaining Telegram/live-session regression that still reproduced after the first patch.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 85e00a7252

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/agents/live-model-switch.ts Outdated
Comment on lines +43 to +47
if (authProfileIdSource !== "user") {
return {
authProfileId: undefined,
authProfileIdSource: undefined,
};

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve auto auth overrides in live-selection baseline

Dropping every non-user authProfileOverride here makes persisted session state disagree with the runner’s current state in normal flows where the override source is auto (for example, overrides persisted by resolveSessionAuthProfileOverride during /btw, then forwarded by runAgentAttempt into runEmbeddedPiAgent). In that case resolveLiveSessionModelSelection now returns authProfileId: undefined, hasDifferentLiveSessionModelSelection flags a change before any attempt, and the run throws LiveSessionModelSwitchError, which can skip the primary candidate or fail the turn when no fallback remains.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b6cb6d8cf7

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/agents/live-model-switch.ts Outdated
Comment on lines +40 to +42
const authProfileIdSource =
entry.authProfileOverrideSource ??
(typeof entry.authProfileOverrideCompactionCount === "number" ? "auto" : "user");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep unknown auth override source from forcing live switch

When a session entry has authProfileOverride but no authProfileOverrideSource (legacy/manual store data), this fallback now coerces the source to "user". runEmbeddedPiAgent still builds current from the raw session entry source (often undefined), so hasDifferentLiveSessionModelSelection() treats the same profile ID as a model-switch request and throws LiveSessionModelSwitchError before any attempt runs. On CLI paths that do not catch and rewrite this error, that can cause the turn to fail after exhausting candidates without actually calling a model.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 85037da0e2

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/agents/live-model-switch.ts Outdated
Comment on lines +45 to +49
entry.authProfileOverrideSource === "auto"
? "auto"
: normalizeSelectedAuthProfileSource(
authProfileId,
typeof entry.authProfileOverrideCompactionCount === "number" ? "auto" : "user",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor explicit user source before compaction fallback

This source resolution path can misclassify persisted manual profile locks: if a session entry contains authProfileOverrideSource: "user" plus legacy/stale authProfileOverrideCompactionCount, the explicit user source is ignored and recomputed as "auto", so resolveLiveSessionModelSelection drops the auth profile entirely. In that state, runs that currently have a user-pinned profile can be treated as changed and throw LiveSessionModelSwitchError before attempts. The compaction-count fallback should only run when authProfileOverrideSource is missing, not when it is explicitly "user".

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a82be707f6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +45 to +46
entry.authProfileOverrideSource ??
(typeof entry.authProfileOverrideCompactionCount === "number" ? "auto" : "user");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Handle legacy auto auth overrides without source metadata

This fallback infers "auto" when authProfileOverrideSource is missing but authProfileOverrideCompactionCount exists, and resolvePersistedAuthProfileSelection() then drops the profile entirely. In CLI/agent runs, runEmbeddedPiAgent builds current from the raw session entry source (often still undefined for legacy stores), so hasDifferentLiveSessionModelSelection() treats current authProfileId vs next undefined as a real switch and can throw LiveSessionModelSwitchError before attempts. This is a regression for legacy session rows that have compaction metadata but no source field, causing unnecessary candidate skips/failures during fallback.

Useful? React with 👍 / 👎.

@ether-btc

Copy link
Copy Markdown

Staff Engineer Review: APPROVED

PR: fix(agents): preserve fallback candidates in live sessions
Author: goldmar


Root Cause Analysis: CORRECT

The fix properly addresses the root cause. resolveLiveSessionModelSelection() was using resolveDefaultModelForAgent() as the baseline whenever an agentId existed. During fallback retries, the runner compared the retry candidate (e.g., openai:gpt-5.4) against the agent-default baseline (e.g., anthropic:claude-opus-4-6) and threw LiveSessionModelSwitchError — treating internal retry candidates as user-initiated live switches.

The fix removes resolveDefaultModelForAgent() entirely and uses the passed-in defaultProvider/defaultModel (the current attempt's provider/model) as the baseline. Cross-provider fallback now works because the baseline moves with each retry attempt.


Code Quality: SOUND

normalizeSelectedAuthProfileSource — Clean utility. Correctly normalizes authProfileOverrideSource to auto | user | undefined. No edge cases missed.

resolvePersistedAuthProfileSelection — Correctly filters out auto-selected profiles during fallback. The compaction metadata heuristic is a sound backwards-compat approach for distinguishing legacy data.

hasDifferentLiveSessionModelSelection — The new comparison logic handles all edge cases:

  • Same profile + different source → detected as change
  • Missing → present → detected as change
  • Present → missing → detected as change
  • Both undefined → not a change

Return value consistency — The final return is cleaner: always returns authProfileIdSource directly. resolvePersistedAuthProfileSelection already handles the undefined case, so behavior is preserved.


Test Quality: STRONG

Unit tests cover all auth profile source edge cases including the tricky legacy/stale compaction metadata scenarios. The E2E tests directly validate the bug:

  • "falls back across providers for agent sessions without explicit model overrides" — the core fix
  • "ignores auto-selected persisted auth profiles when crossing providers during fallback" — regression coverage

Mock cleanup is correct: resolveDefaultModelForAgentMock was properly removed since that function is no longer called.


Minor Observation (non-blocking)

The compaction-count heuristic is a reasonable backwards-compat inference, but it's not obvious from the call site. A brief code comment would help future readers. Not worth blocking over.


VERDICT: APPROVED

The fix is correct, the tests are thorough, and the auth profile source handling is more rigorous than before. No blockers.

@goldmar

goldmar commented Mar 29, 2026

Copy link
Copy Markdown
Contributor Author

@joshavant @tyler6204 I addressed the follow-up review comments and the current checks are looking healthy again. This should be ready for another look when you have a minute.

@goldmar

goldmar commented Mar 30, 2026

Copy link
Copy Markdown
Contributor Author

Looks like this was independently addressed in bd89e07 on main — closing this one. Thanks for the quick fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants