Skip to content

fix(agents): skip model fallback for embedded session takeover and session write-lock errors [AI-assisted]#83550

Merged
steipete merged 2 commits into
openclaw:mainfrom
luyao618:fix/embedded-session-takeover-skip-fallback
May 18, 2026
Merged

fix(agents): skip model fallback for embedded session takeover and session write-lock errors [AI-assisted]#83550
steipete merged 2 commits into
openclaw:mainfrom
luyao618:fix/embedded-session-takeover-skip-fallback

Conversation

@luyao618

@luyao618 luyao618 commented May 18, 2026

Copy link
Copy Markdown
Contributor

🤖 AI-assisted (built with Codex via Hermes orchestration). Test level: lightly tested. Prompt summary available on request.

Summary

  • Problem: When an embedded agent turn fails with EmbeddedAttemptSessionTakeoverError ("session file changed while embedded prompt lock was released") or SessionWriteLockTimeoutError, runWithModelFallback treats it as a candidate model failure and walks the entire fallback chain. Each model hits the same local condition (or the last one idles out after 120s), and the user sees a misleading All models failed summary.
  • Why it matters: Interactive embedded turns fail before any reply is produced, fallback model slots are wasted on a non-provider local error, and the final error message points operators at provider/model diagnosis instead of the real cause (session lock coordination). On long-lived or maintenance-heavy sessions this repeats; the reporter saw 61 occurrences in one day.
  • What changed: Add isNonProviderRuntimeCoordinationError(err) in src/agents/failover-error.ts that matches either error directly or via nested cause / reason / error chains. In runWithModelFallback's per-attempt catch block, rethrow immediately when the predicate is true — before the candidate is recorded as failed or any fallback bookkeeping advances.
  • What did NOT change (scope boundary): The error classes themselves, the session-lock acquisition/fence logic, the embedded runner's own internal retry on takeover (attempt.session-lock.ts), and the existing hasSessionWriteLockTimeout treatment in failover classification all stay as-is.

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • Agents

Linked Issue/PR

Root Cause

  • Root cause: runWithModelFallback only differentiates between candidate failures (advance to next model) and FailoverError (rethrow). Local runtime coordination errors thrown from inside the candidate run — EmbeddedAttemptSessionTakeoverError from the embedded prompt lock and SessionWriteLockTimeoutError from the session writer — fall through the generic "candidate failed" path, so the loop tries every fallback model against the same local condition.
  • Missing detection / guardrail: The fallback loop has no notion of "non-provider local error, abort the chain". failover-error.ts already special-cases SessionWriteLockTimeoutError to avoid misclassifying it as a timeout, but the fallback driver never consulted that signal.
  • Contributing context (if known): EmbeddedAttemptSessionTakeoverError is a newer (post-SessionWriteLockTimeoutError) error class; the embedded runner catches it locally in some paths (google-prompt-cache.ts, attempt.session-lock.ts) but it can still escape upward into the model fallback driver.

Regression Test Plan

  • Coverage level that should have caught this:
    • Unit test
  • Target test or file: src/agents/failover-error.test.ts, src/agents/model-fallback.test.ts
  • Scenario the test should lock in:
    • isNonProviderRuntimeCoordinationError returns true for direct SessionWriteLockTimeoutError and direct EmbeddedAttemptSessionTakeoverError-named errors, and for both errors wrapped via cause. It returns false for plain TimeoutError, HTTP 429 provider errors, and nullish input.
    • runWithModelFallback invoked with a run that rejects with an EmbeddedAttemptSessionTakeoverError-named error rethrows that exact error on the first attempt and does not invoke run for any fallback candidate (run called exactly once).
  • Why this is the smallest reliable guardrail: The two assertions pin down both the classifier (which feeds the policy) and the loop behavior (which is the user-visible regression vector). Any future refactor that re-introduces the "treat coordination error as candidate failure" bug fails immediately.
  • Existing test that already covers this (if any): N/A — existing hasSessionWriteLockTimeout tests only assert it suppresses timeout misclassification, not that the fallback chain aborts.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

  • Embedded agent turns that fail with session takeover or session write-lock timeout now surface the original coordination error instead of Embedded agent failed before reply: All models failed (N): .... Provider/model fallback semantics are unchanged for actual provider failures.

Security Impact (required)

  • New permissions/capabilities? No
  • This change only narrows the conditions under which the model fallback loop continues. It does not introduce new I/O, network paths, persistence, or capability grants; the error types involved are produced by the existing local session-lock subsystem.

Verification

  • node scripts/run-vitest.mjs src/agents/failover-error.test.ts src/agents/model-fallback.test.ts - 3 files passed, 202 tests passed.
  • git diff --check - clean.
  • ./node_modules/.bin/oxfmt --check --threads=1 src/agents/failover-error.ts src/agents/failover-error.test.ts src/agents/model-fallback.test.ts CHANGELOG.md - clean.
  • node scripts/check-changelog-attributions.mjs - clean.
  • codex review --uncommitted - no blocking/actionable findings after maintainer fixup.

Real behavior proof

Behavior addressed: embedded session takeover and session write-lock failures now abort the model fallback chain as local coordination errors, while provider-classified failures with nested session-lock causes still use normal model fallback.

Real environment tested: local macOS source checkout for the model fallback and failover classifier runtime paths.

Exact steps or command run after this patch: node scripts/run-vitest.mjs src/agents/failover-error.test.ts src/agents/model-fallback.test.ts.

Evidence after fix: targeted Vitest run passed with 3 files and 202 tests, including direct embedded takeover abort, direct session write-lock abort, and a provider 429/RESOURCE_EXHAUSTED wrapper with a nested session lock cause falling through to the configured fallback model.

Observed result after fix: pure local session coordination errors are rethrown on the first candidate instead of surfacing as All models failed, and explicit provider failover metadata remains authoritative so provider rate-limit fallback still advances to the next candidate.

What was not tested: a live embedded-session takeover race against a real installed gateway; the proof is source-level regression coverage for the observed fallback routing bug.

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: S triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 18, 2026
@clawsweeper

clawsweeper Bot commented May 18, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The PR adds a non-provider runtime coordination error classifier, rethrows embedded session takeover and session write-lock errors before model fallback, and adds regression tests plus a changelog entry.

Reproducibility: yes. from source: current main treats these local coordination errors as unclassified candidate failures and continues through configured fallbacks. I did not run a live embedded-session race in this read-only review.

PR rating
Overall: 🐚 platinum hermit
Proof: 🐚 platinum hermit
Patch quality: 🦞 diamond lobster
Summary: The patch is narrow and well-covered after the fixup, with proof accepted by override rather than live runtime evidence.

Rank-up moves:

  • none
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Override: A maintainer applied proof: override for this PR.

Risk before merge
Why this matters: - The PR body still does not show a live embedded-session race; the proof override makes that a maintainer validation tradeoff rather than a contributor blocker.

Maintainer options:

  1. Decide the mitigation before merge
    Land the narrow fail-fast guard with the provider-metadata precedence tests intact, then close the linked bug after merge.
  2. Pause or close
    Do not merge this PR until maintainers decide whether the risk is worth taking.

Next step before merge
No ClawSweeper repair lane is needed; the branch already contains the narrow code/test fix and only normal maintainer merge handling remains.

Security
Cleared: The diff only changes local error classification, tests, and changelog text; I found no new permissions, dependency, network, persistence, or secret-handling surface.

Review details

Best possible solution:

Land the narrow fail-fast guard with the provider-metadata precedence tests intact, then close the linked bug after merge.

Do we have a high-confidence way to reproduce the issue?

Yes, from source: current main treats these local coordination errors as unclassified candidate failures and continues through configured fallbacks. I did not run a live embedded-session race in this read-only review.

Is this the best way to solve the issue?

Yes. The final branch is narrow: it fails fast only for unclassified local coordination errors and keeps explicit provider failover metadata authoritative, with regression coverage for both paths.

Label justifications:

  • P1: The linked regression can break interactive embedded agent turns before any reply and misdiagnose a local session-state conflict as model fallback exhaustion.

What I checked:

  • Current-main source reproduction: On current main, unrecognized attempt errors are recorded as failed candidates and the loop continues until the final candidate, after which the fallback summary can report all models failed. (src/agents/model-fallback.ts:1269, 40a59420912e)
  • Coordination error source: The embedded session lock raises EmbeddedAttemptSessionTakeoverError when the session file fingerprint changes, and the session write-lock code raises SessionWriteLockTimeoutError for lock timeouts. (src/agents/pi-embedded-runner/run/attempt.session-lock.ts:167, 40a59420912e)
  • PR fail-fast implementation: The PR head checks isNonProviderRuntimeCoordinationError before fallback bookkeeping and rethrows matching local coordination errors immediately. (src/agents/model-fallback.ts:1206, 12cb2bab5ecb)
  • Provider metadata preservation: The final predicate returns false for existing FailoverError objects and for errors with a non-null failover classification, preserving explicit provider status/code routing. (src/agents/failover-error.ts:270, 12cb2bab5ecb)
  • Regression coverage: The PR adds tests for direct embedded takeover, direct session write-lock timeout, and a provider 429/RESOURCE_EXHAUSTED wrapper with a nested session lock cause falling through to the fallback model. (src/agents/model-fallback.test.ts:772, 12cb2bab5ecb)
  • Maintainer fixup provenance: The second PR commit is the targeted provider metadata fixup after the earlier bot review flagged that edge case. (src/agents/failover-error.ts:270, 12cb2bab5ecb)

Likely related people:

  • steipete: Authored the provider metadata preservation commit on the PR head and is assigned in the PR timeline, making them the clearest route for final merge judgment. (role: recent reviewer and fixup owner; confidence: high; commits: 12cb2bab5ecb; files: src/agents/failover-error.ts, src/agents/model-fallback.test.ts, CHANGELOG.md)
  • Ayaan Zaidi: Local current-main blame for the central fallback and coordination-error files points to this recent checkout commit, though the signal is broad and not specific to this bug. (role: current-main blame signal; confidence: low; commits: 6940a01e74b4; files: src/agents/model-fallback.ts, src/agents/failover-error.ts, src/agents/pi-embedded-runner/run/attempt.session-lock.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against bf95f762b5df.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P1 High-priority user-facing bug, regression, or broken workflow. merge-risk: 🚨 auth-provider 🚨 May break OAuth, tokens, provider routing, model choice, or credentials. labels May 18, 2026

@martingarramon martingarramon left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intended fail-fast direction is right: session write-lock timeouts and embedded session takeovers are local coordination failures, not provider/model failures, so they should not consume the fallback chain.

Two things before LGTM:

1. Provider-wrapper edge case

The new guard runs on the raw error before resolveFailoverClassificationFromError can apply its existing "explicit provider metadata wins" logic. Relevant guard in failover-error.ts:383-407:

if (classification) {
  if (hasSessionLock && !hasExplicitFailoverMetadata) {
    return null;  // session lock wins only when no explicit status/code
  }
  return classification;  // 429 wins over nested session lock
}

If a 429 / RESOURCE_EXHAUSTED wrapper happens to contain a nested SessionWriteLockTimeoutError in its cause chain, the current classification logic intentionally returns rate_limit (retryable), but isNonProviderRuntimeCoordinationError would fire first and abort the fallback chain.

Suggested fix: exempt errors that are already FailoverError or carry explicit provider status (typeof inferSignalStatus(normalizeErrorSignal(err)) === "number") from the coordination abort — mirroring the existing guard. Alternatively, a regression test covering the 429-wrapper-with-nested-session-lock path would at least document the intended behavior, even if the fix is deferred.

2. Real behavior proof section

The CI gate (Real behavior proof) is a PR-body parser, not a test runner — .github/workflows/real-behavior-proof.yml runs scripts/github/real-behavior-proof-check.mjs against the PR body text. Adding a ## Real behavior proof section with the standard fields will unblock CI:

## Real behavior proof

Behavior addressed: ...
Real environment tested: ...
Exact steps or command run after this patch: ...
Evidence after fix: ...
Observed result after fix: ...
What was not tested: ...

(onError is per-candidate telemetry only, so skipping it for coordination errors is fine as-is.)

@steipete steipete self-assigned this May 18, 2026
@openclaw-barnacle openclaw-barnacle Bot added size: M triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. and removed size: S triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 18, 2026
luyao618 and others added 2 commits May 18, 2026 14:14
…ite-lock errors

EmbeddedAttemptSessionTakeoverError and SessionWriteLockTimeoutError are
local runtime coordination errors, not provider/model failures. Treating
them as candidate failures in runWithModelFallback walked the entire
fallback chain — each model hit the same local condition (or, for the
last, a 120s idle timeout) — and surfaced a misleading 'All models failed'
to the user.

Add isNonProviderRuntimeCoordinationError() in failover-error.ts that
matches either error directly or via nested cause/reason/error chains.
The embedded takeover check uses a name match to avoid an import cycle
with pi-embedded-runner. In runWithModelFallback's per-attempt catch
block, rethrow immediately when the predicate is true, before the
candidate is recorded as failed or any fallback bookkeeping advances.

Closes openclaw#83510

AI-assisted.
@steipete steipete force-pushed the fix/embedded-session-takeover-skip-fallback branch from b6fd9a8 to 12cb2ba Compare May 18, 2026 13:18
@steipete steipete added the proof: override Maintainer override for the external PR real behavior proof gate. label May 18, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. label May 18, 2026
@clawsweeper clawsweeper Bot added rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. merge-risk: 🚨 auth-provider 🚨 May break OAuth, tokens, provider routing, model choice, or credentials. labels May 18, 2026
@steipete

Copy link
Copy Markdown
Contributor

Verification before merge:

  • node scripts/run-vitest.mjs src/agents/failover-error.test.ts src/agents/model-fallback.test.ts - passed, 3 files / 202 tests.
  • git diff --check - clean.
  • ./node_modules/.bin/oxfmt --check --threads=1 src/agents/failover-error.ts src/agents/failover-error.test.ts src/agents/model-fallback.test.ts CHANGELOG.md - clean.
  • node scripts/check-changelog-attributions.mjs - clean.
  • codex review --uncommitted - no blocking/actionable findings.
  • CI run 26035988392 rerun for checks-node-core-fast - completed success on head 12cb2bab5ecb07e62fa97838646547e6de76787b.

Real behavior proof: source-level coordination/fallback behavior is covered by regression tests here. No live installed gateway race was reproduced; proof: override applied because this is local session coordination/error-classification behavior with deterministic unit coverage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling P1 High-priority user-facing bug, regression, or broken workflow. proof: override Maintainer override for the external PR real behavior proof gate. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. size: M status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Embedded agent failover treats session-file mutation as model failure and exhausts all fallbacks

3 participants