Skip to content

fix(agents): fail fast on session lock fallback#78633

Merged
sallyom merged 1 commit into
openclaw:mainfrom
sallyom:fix/issue-66646
May 7, 2026
Merged

fix(agents): fail fast on session lock fallback#78633
sallyom merged 1 commit into
openclaw:mainfrom
sallyom:fix/issue-66646

Conversation

@sallyom

@sallyom sallyom commented May 6, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Fail fast when model fallback receives a session write-lock timeout instead of trying later model candidates for local file contention.
  • Reuse the existing recursive session-lock detector and add a regression test proving configured fallbacks are not attempted.
  • Add a changelog entry for the user-visible fallback behavior fix.

Fixes #66646

Verification

  • pnpm test src/agents/model-fallback.test.ts src/agents/failover-error.test.ts

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: XS maintainer Maintainer-authored PR labels May 6, 2026
@clawsweeper

clawsweeper Bot commented May 6, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge.

Summary
The PR exports the session write-lock timeout detector, uses it in runWithModelFallback to fail fast on local session lock contention, adds a regression test, and updates CHANGELOG.md.

Reproducibility: yes. A high-confidence source-level path exists: configure a primary plus fallback and have the first candidate throw SessionWriteLockTimeoutError; current main leaves it unclassified and continues because non-final unknown errors advance through fallback.

Real behavior proof
Override: The PR has the proof: override label, so the external-contributor real behavior proof gate is intentionally bypassed for this review.

Next step before merge
Protected maintainer labeling keeps this PR in human review; I found no narrow repair finding for automation to fix.

Security
Cleared: The diff only touches agents fallback code, a colocated regression test, and changelog text; it adds no dependency, CI, secret, install, or publishing surface.

Review details

Best possible solution:

Merge the narrow fallback-loop fail-fast guard after maintainer review and green required checks; handle any broader queue, retry, or backoff policy separately.

Do we have a high-confidence way to reproduce the issue?

Yes. A high-confidence source-level path exists: configure a primary plus fallback and have the first candidate throw SessionWriteLockTimeoutError; current main leaves it unclassified and continues because non-final unknown errors advance through fallback.

Is this the best way to solve the issue?

Yes for this PR's scope. The explicit session-lock escape at the fallback-loop boundary is the narrowest maintainable fix for stopping model substitution while preserving generic unknown-error fallback behavior.

Acceptance criteria:

  • pnpm test src/agents/model-fallback.test.ts src/agents/failover-error.test.ts
  • pnpm check:changed

What I checked:

  • current-main-fallback-gap: Current main normalizes candidate errors, then only rethrows an unrecognized error on the final candidate; unclassified non-final errors are recorded and fallback continues. (src/agents/model-fallback.ts:1096, 372e270871a2)
  • typed-lock-error-source: Session write-lock acquisition timeouts are converted into SessionWriteLockTimeoutError, giving the fallback layer a typed local contention signal. (src/agents/session-write-lock.ts:540, 372e270871a2)
  • classifier-suppresses-lock-timeout: Current main recursively detects session write-lock timeout errors and returns no failover classification when no explicit provider failover metadata is present. (src/agents/failover-error.ts:408, 372e270871a2)
  • pr-fix-boundary: The PR head checks hasSessionWriteLockTimeout(normalized) immediately after normalization and throws the original error before the generic fallback path can advance. (src/agents/model-fallback.ts:1052, e3143af07868)
  • regression-coverage: The PR adds a runWithModelFallback test that throws SessionWriteLockTimeoutError, expects that same error, and asserts the fallback runner was called once. (src/agents/model-fallback.test.ts:559, e3143af07868)
  • changelog-entry: The PR adds a user-facing changelog bullet for failing fast on session write-lock timeouts and links the related report. (CHANGELOG.md:134, e3143af07868)

Likely related people:

  • MonkeyLeeT: Authored the merged prior fix fix(agents): stop treating session lock waits as timeout #68700, which added the session-lock timeout suppression and adjacent failover coverage this PR builds on. (role: introduced partial classifier fix; confidence: high; commits: 8cc38c1b86f4; files: src/agents/failover-error.ts, src/agents/failover-error.test.ts, src/agents/session-write-lock-error.ts)
  • obviyus: Co-authored the merged prior fix that introduced typed session write-lock timeout behavior at the lock boundary and failover classifier suppression. (role: adjacent session-lock maintainer; confidence: medium; commits: 8cc38c1b86f4; files: src/agents/session-write-lock.ts, src/agents/session-write-lock-error.ts, src/agents/failover-error.ts)
  • steipete: Current checkout blame and commit metadata show recent maintenance touching the central fallback, failover, session-lock, and test files immediately before this PR. (role: recent fallback/session maintainer; confidence: medium; commits: 0f9f956bbd04; files: src/agents/model-fallback.ts, src/agents/model-fallback.test.ts, src/agents/failover-error.ts)
  • vincentkoc: Prior ClawSweeper review context identifies recent work on model fallback helper imports and live-session fallback redirects in the same code path, making this a reasonable routing candidate for fallback behavior review. (role: adjacent fallback owner; confidence: medium; commits: 43a003b8a062, 480a3f66c9f7, da3977e6818b; files: src/agents/model-fallback.ts, src/agents/model-fallback.test.ts)

Remaining risk / open question:

  • Some latest-head check-runs were still queued at inspection time, so maintainers should wait for required checks before merge.
  • No live concurrent-lock scenario was run in this read-only review; the reproduction confidence comes from source tracing and the PR's focused regression path.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 372e270871a2.

@sallyom sallyom added the proof: override Maintainer override for the external PR real behavior proof gate. label May 6, 2026
@sallyom sallyom force-pushed the fix/issue-66646 branch 3 times, most recently from 75d26c2 to 2d18c0b Compare May 6, 2026 21:41
Signed-off-by: sallyom <somalley@redhat.com>
@sallyom sallyom force-pushed the fix/issue-66646 branch from 2d18c0b to e3143af Compare May 7, 2026 00:06
@sallyom sallyom merged commit a74894a into openclaw:main May 7, 2026
93 of 94 checks passed
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
Signed-off-by: sallyom <somalley@redhat.com>
rogerdigital pushed a commit to rogerdigital/openclaw that referenced this pull request May 9, 2026
Signed-off-by: sallyom <somalley@redhat.com>
lykeion-dev pushed a commit to lykeion-dev/openclaw--rev that referenced this pull request May 14, 2026
Signed-off-by: sallyom <somalley@redhat.com>
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 24, 2026
Signed-off-by: sallyom <somalley@redhat.com>
@sallyom sallyom deleted the fix/issue-66646 branch May 29, 2026 15:15
jameslcowan pushed a commit to jameslcowan/openclaw that referenced this pull request Jun 2, 2026
Signed-off-by: sallyom <somalley@redhat.com>
sablehead pushed a commit to sablehead/openclaw that referenced this pull request Jun 10, 2026
Signed-off-by: sallyom <somalley@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling maintainer Maintainer-authored PR proof: override Maintainer override for the external PR real behavior proof gate. size: XS

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Session file lock errors cascade through model fallback chain

1 participant