fix(agents): skip model fallback for embedded session takeover and session write-lock errors [AI-assisted]#83550
Conversation
|
Codex review: needs maintainer review before merge. Workflow note: Future ClawSweeper reviews update this same comment in place. How this review workflow works
Summary Reproducibility: yes. from source: current main treats these local coordination errors as unclassified candidate failures and continues through configured fallbacks. I did not run a live embedded-session race in this read-only review. PR rating Rank-up moves:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. Real behavior proof Risk before merge Maintainer options:
Next step before merge Security Review detailsBest possible solution: Land the narrow fail-fast guard with the provider-metadata precedence tests intact, then close the linked bug after merge. Do we have a high-confidence way to reproduce the issue? Yes, from source: current main treats these local coordination errors as unclassified candidate failures and continues through configured fallbacks. I did not run a live embedded-session race in this read-only review. Is this the best way to solve the issue? Yes. The final branch is narrow: it fails fast only for unclassified local coordination errors and keeps explicit provider failover metadata authoritative, with regression coverage for both paths. Label justifications:
What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against bf95f762b5df. |
martingarramon
left a comment
There was a problem hiding this comment.
The intended fail-fast direction is right: session write-lock timeouts and embedded session takeovers are local coordination failures, not provider/model failures, so they should not consume the fallback chain.
Two things before LGTM:
1. Provider-wrapper edge case
The new guard runs on the raw error before resolveFailoverClassificationFromError can apply its existing "explicit provider metadata wins" logic. Relevant guard in failover-error.ts:383-407:
if (classification) {
if (hasSessionLock && !hasExplicitFailoverMetadata) {
return null; // session lock wins only when no explicit status/code
}
return classification; // 429 wins over nested session lock
}If a 429 / RESOURCE_EXHAUSTED wrapper happens to contain a nested SessionWriteLockTimeoutError in its cause chain, the current classification logic intentionally returns rate_limit (retryable), but isNonProviderRuntimeCoordinationError would fire first and abort the fallback chain.
Suggested fix: exempt errors that are already FailoverError or carry explicit provider status (typeof inferSignalStatus(normalizeErrorSignal(err)) === "number") from the coordination abort — mirroring the existing guard. Alternatively, a regression test covering the 429-wrapper-with-nested-session-lock path would at least document the intended behavior, even if the fix is deferred.
2. Real behavior proof section
The CI gate (Real behavior proof) is a PR-body parser, not a test runner — .github/workflows/real-behavior-proof.yml runs scripts/github/real-behavior-proof-check.mjs against the PR body text. Adding a ## Real behavior proof section with the standard fields will unblock CI:
## Real behavior proof
Behavior addressed: ...
Real environment tested: ...
Exact steps or command run after this patch: ...
Evidence after fix: ...
Observed result after fix: ...
What was not tested: ...(onError is per-candidate telemetry only, so skipping it for coordination errors is fine as-is.)
…ite-lock errors EmbeddedAttemptSessionTakeoverError and SessionWriteLockTimeoutError are local runtime coordination errors, not provider/model failures. Treating them as candidate failures in runWithModelFallback walked the entire fallback chain — each model hit the same local condition (or, for the last, a 120s idle timeout) — and surfaced a misleading 'All models failed' to the user. Add isNonProviderRuntimeCoordinationError() in failover-error.ts that matches either error directly or via nested cause/reason/error chains. The embedded takeover check uses a name match to avoid an import cycle with pi-embedded-runner. In runWithModelFallback's per-attempt catch block, rethrow immediately when the predicate is true, before the candidate is recorded as failed or any fallback bookkeeping advances. Closes openclaw#83510 AI-assisted.
b6fd9a8 to
12cb2ba
Compare
|
Verification before merge:
Real behavior proof: source-level coordination/fallback behavior is covered by regression tests here. No live installed gateway race was reproduced; |
Summary
EmbeddedAttemptSessionTakeoverError("session file changed while embedded prompt lock was released") orSessionWriteLockTimeoutError,runWithModelFallbacktreats it as a candidate model failure and walks the entire fallback chain. Each model hits the same local condition (or the last one idles out after 120s), and the user sees a misleadingAll models failedsummary.isNonProviderRuntimeCoordinationError(err)insrc/agents/failover-error.tsthat matches either error directly or via nestedcause/reason/errorchains. InrunWithModelFallback's per-attempt catch block, rethrow immediately when the predicate is true — before the candidate is recorded as failed or any fallback bookkeeping advances.attempt.session-lock.ts), and the existinghasSessionWriteLockTimeouttreatment in failover classification all stay as-is.Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Root Cause
runWithModelFallbackonly differentiates between candidate failures (advance to next model) andFailoverError(rethrow). Local runtime coordination errors thrown from inside the candidate run —EmbeddedAttemptSessionTakeoverErrorfrom the embedded prompt lock andSessionWriteLockTimeoutErrorfrom the session writer — fall through the generic "candidate failed" path, so the loop tries every fallback model against the same local condition.failover-error.tsalready special-casesSessionWriteLockTimeoutErrorto avoid misclassifying it as a timeout, but the fallback driver never consulted that signal.EmbeddedAttemptSessionTakeoverErroris a newer (post-SessionWriteLockTimeoutError) error class; the embedded runner catches it locally in some paths (google-prompt-cache.ts,attempt.session-lock.ts) but it can still escape upward into the model fallback driver.Regression Test Plan
src/agents/failover-error.test.ts,src/agents/model-fallback.test.tsisNonProviderRuntimeCoordinationErrorreturns true for directSessionWriteLockTimeoutErrorand directEmbeddedAttemptSessionTakeoverError-named errors, and for both errors wrapped viacause. It returns false for plainTimeoutError, HTTP 429 provider errors, and nullish input.runWithModelFallbackinvoked with arunthat rejects with anEmbeddedAttemptSessionTakeoverError-named error rethrows that exact error on the first attempt and does not invokerunfor any fallback candidate (runcalled exactly once).hasSessionWriteLockTimeouttests only assert it suppresses timeout misclassification, not that the fallback chain aborts.User-visible / Behavior Changes
Embedded agent failed before reply: All models failed (N): .... Provider/model fallback semantics are unchanged for actual provider failures.Security Impact (required)
Verification
node scripts/run-vitest.mjs src/agents/failover-error.test.ts src/agents/model-fallback.test.ts- 3 files passed, 202 tests passed.git diff --check- clean../node_modules/.bin/oxfmt --check --threads=1 src/agents/failover-error.ts src/agents/failover-error.test.ts src/agents/model-fallback.test.ts CHANGELOG.md- clean.node scripts/check-changelog-attributions.mjs- clean.codex review --uncommitted- no blocking/actionable findings after maintainer fixup.Real behavior proof
Behavior addressed: embedded session takeover and session write-lock failures now abort the model fallback chain as local coordination errors, while provider-classified failures with nested session-lock causes still use normal model fallback.
Real environment tested: local macOS source checkout for the model fallback and failover classifier runtime paths.
Exact steps or command run after this patch:
node scripts/run-vitest.mjs src/agents/failover-error.test.ts src/agents/model-fallback.test.ts.Evidence after fix: targeted Vitest run passed with 3 files and 202 tests, including direct embedded takeover abort, direct session write-lock abort, and a provider 429/RESOURCE_EXHAUSTED wrapper with a nested session lock cause falling through to the configured fallback model.
Observed result after fix: pure local session coordination errors are rethrown on the first candidate instead of surfacing as
All models failed, and explicit provider failover metadata remains authoritative so provider rate-limit fallback still advances to the next candidate.What was not tested: a live embedded-session takeover race against a real installed gateway; the proof is source-level regression coverage for the observed fallback routing bug.