Skip to content

fix(agents): stop treating session lock waits as timeout#68700

Merged
obviyus merged 8 commits into
openclaw:mainfrom
MonkeyLeeT:codex/stop-session-lock-failover
Apr 25, 2026
Merged

fix(agents): stop treating session lock waits as timeout#68700
obviyus merged 8 commits into
openclaw:mainfrom
MonkeyLeeT:codex/stop-session-lock-failover

Conversation

@MonkeyLeeT

@MonkeyLeeT MonkeyLeeT commented Apr 18, 2026

Copy link
Copy Markdown
Contributor

Summary

  • stop treating session lock wait errors as provider timeout failovers
  • keep wrapped AbortError session-lock cases out of timeout classification, including request was aborted
  • add regressions covering direct and wrapped session lock errors

Root Cause

session-write-lock reports local lock contention using the text session file locked (timeout ...).
The failover classifier treated the bare timeout substring as a model/provider timeout, and isTimeoutError() still treated abort-wrapped lock waits as timeouts when the outer abort message matched request was aborted.
That let model fallback continue across candidates for a local file-lock contention issue.

Impact

Session lock contention now fails fast as infrastructure pressure instead of drifting across fallback models, including abort-wrapped lock waits that previously still looked like timeout. This avoids wasting fallback attempts on the same locked session and keeps model fallback focused on actual provider-side failures.

Testing

  • pnpm test src/agents/failover-error.test.ts -t "does not classify session lock wait errors as model timeout failover"
  • pnpm test src/agents/failover-error.test.ts src/agents/session-write-lock.test.ts
  • pnpm check

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: XS labels Apr 18, 2026
@MonkeyLeeT MonkeyLeeT changed the title [codex] fix(agents): stop treating session lock waits as timeout fix(agents): stop treating session lock waits as timeout Apr 19, 2026
@MonkeyLeeT MonkeyLeeT force-pushed the codex/stop-session-lock-failover branch 2 times, most recently from ab4a06b to b76352c Compare April 19, 2026 16:05
@MonkeyLeeT MonkeyLeeT marked this pull request as ready for review April 19, 2026 17:36
@greptile-apps

greptile-apps Bot commented Apr 19, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes a misclassification bug where session-write-lock errors containing the substring timeout (e.g., "session file locked (timeout 10000ms): ...") were incorrectly treated as provider/model timeout failures and triggered model failover cascades. The fix introduces isSessionLockError() — a cycle-safe recursive traversal over .error, .cause, and .reason chains — and applies it as an early-exit guard in both hasTimeoutHint() and resolveFailoverClassificationFromError(), with a regression test covering both the bare and AbortError-wrapped error shapes.

Confidence Score: 5/5

Safe to merge — the fix is well-scoped, correctly placed, and covered by a targeted regression test.

Both changed files are clean: the guard logic correctly short-circuits before any classifier path, cycle detection via seen prevents infinite loops in cyclic error chains, and the .reason traversal in isSessionLockError is intentionally aligned with how isTimeoutError already reads .reason from AbortErrors. The test covers both the direct and AbortError-wrapped cases. No P0 or P1 findings.

No files require special attention.

Reviews (1): Last reviewed commit: "fix(agents): stop treating session lock ..." | Re-trigger Greptile

@MonkeyLeeT MonkeyLeeT force-pushed the codex/stop-session-lock-failover branch 2 times, most recently from 956419a to b9e135e Compare April 20, 2026 01:36

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b9e135ec07

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/agents/failover-error.ts Outdated
@MonkeyLeeT

Copy link
Copy Markdown
Contributor Author

@hxy91819 PTAL? Thanks!

@MonkeyLeeT MonkeyLeeT force-pushed the codex/stop-session-lock-failover branch from 87c1e3e to bab0cfe Compare April 20, 2026 19:18

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 83dc131f86

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/agents/failover-error.ts Outdated
@MonkeyLeeT MonkeyLeeT force-pushed the codex/stop-session-lock-failover branch from 83dc131 to 897c24c Compare April 21, 2026 16:23

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 897c24c2ea

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/agents/failover-error.ts Outdated
@MonkeyLeeT MonkeyLeeT force-pushed the codex/stop-session-lock-failover branch from 897c24c to c10b881 Compare April 21, 2026 21:26

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c10b881ea0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/agents/failover-error.ts Outdated
@MonkeyLeeT MonkeyLeeT force-pushed the codex/stop-session-lock-failover branch from 295c8d8 to 28de0c5 Compare April 22, 2026 16:32

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 28de0c56ce

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/agents/failover-error.ts
@MonkeyLeeT MonkeyLeeT force-pushed the codex/stop-session-lock-failover branch 2 times, most recently from d01d650 to 59cae90 Compare April 22, 2026 21:05

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 59cae9058c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/agents/failover-error.ts Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 70b8bbd183

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/agents/failover-error.ts Outdated
@MonkeyLeeT MonkeyLeeT force-pushed the codex/stop-session-lock-failover branch from 22898de to 8ddbba8 Compare April 23, 2026 03:13
@aisle-research-bot

aisle-research-bot Bot commented Apr 23, 2026

Copy link
Copy Markdown

🔒 Aisle Security Analysis

We found 3 potential security issue(s) in this PR:

# Severity Title
1 🟡 Medium Potential denial-of-service via unbounded recursion in hasSessionWriteLockTimeout() error graph traversal
2 🟡 Medium Failover/timeout classification bypass via spoofable SessionWriteLockTimeoutError code
3 🟡 Medium SessionWriteLockTimeoutError message leaks internal lock path and allows log forging via unsanitized filename
1. 🟡 Potential denial-of-service via unbounded recursion in hasSessionWriteLockTimeout() error graph traversal
Property Value
Severity Medium
CWE CWE-674
Location src/agents/failover-error.ts:203-219

Description

The new hasSessionWriteLockTimeout() helper recursively traverses error, cause, and reason properties of arbitrary error-like objects without a maximum depth limit.

  • Input err is unknown and can be any object (including provider/SDK error objects that may embed nested objects derived from remote responses)
  • The function performs recursive descent on three properties per node (error, cause, reason)
  • There is cycle detection (seen), but no depth/step limit, so an attacker (or untrusted upstream) that can influence the shape of an error object could create a very deep chain and trigger:
    • stack exhaustion (RangeError: Maximum call stack size exceeded)
    • excessive CPU time from large traversals

This helper is called from hasTimeoutHint(), isTimeoutError(), and resolveFailoverClassificationFromErrorInternal(), which are likely executed on error paths that can be triggered repeatedly, amplifying the DoS impact.

Vulnerable code:

return (
  hasSessionWriteLockTimeout(candidate.error, seen) ||
  hasSessionWriteLockTimeout(candidate.cause, seen) ||
  hasSessionWriteLockTimeout(candidate.reason, seen)
);

Recommendation

Add a maximum traversal depth / node budget similar to MAX_FAILOVER_CAUSE_DEPTH, and/or implement an iterative traversal to avoid stack exhaustion.

Example fix (depth-limited + iterative):

const MAX_SESSION_LOCK_CAUSE_DEPTH = 25;

function hasSessionWriteLockTimeout(err: unknown): boolean {
  const seen = new Set<object>();
  const stack: Array<{ value: unknown; depth: number }> = [{ value: err, depth: 0 }];

  while (stack.length) {
    const { value, depth } = stack.pop()!;
    if (isSessionWriteLockTimeoutError(value)) return true;
    if (!value || typeof value !== "object") continue;
    if (seen.has(value)) continue;
    if (depth >= MAX_SESSION_LOCK_CAUSE_DEPTH) continue;
    seen.add(value);

    const c = value as { error?: unknown; cause?: unknown; reason?: unknown };
    if (c.error !== undefined) stack.push({ value: c.error, depth: depth + 1 });
    if (c.cause !== undefined) stack.push({ value: c.cause, depth: depth + 1 });
    if (c.reason !== undefined) stack.push({ value: c.reason, depth: depth + 1 });
  }
  return false;
}

This prevents stack overflow and bounds worst-case work even with attacker-controlled nesting.

2. 🟡 Failover/timeout classification bypass via spoofable SessionWriteLockTimeoutError code
Property Value
Severity Medium
CWE CWE-840
Location src/agents/session-write-lock-error.ts:20-28

Description

isSessionWriteLockTimeoutError() trusts a magic string error code and returns true for any object with a matching code, not just real SessionWriteLockTimeoutError instances.

This is security-relevant because failover-error.ts uses hasSessionWriteLockTimeout() to suppress timeout detection and failover classification:

  • hasTimeoutHint() / isTimeoutError() return false if a session-write-lock timeout is detected
  • resolveFailoverClassificationFromErrorInternal() returns null (no failover) when hasSessionLock is true and there is no “explicit failover metadata”

Since failover classification consumes generic error objects and extracts .code recursively (getErrorCode() uses findErrorProperty() over error/cause), an error originating from outside this module (e.g., provider/network error payloads that include a code field) can be made to look like a session-write-lock timeout simply by setting code to OPENCLAW_SESSION_WRITE_LOCK_TIMEOUT. This can prevent retries/failover and degrade availability (routing manipulation / DoS on fallback).

Vulnerable code:

export function isSessionWriteLockTimeoutError(err: unknown): boolean {
  return (
    err instanceof SessionWriteLockTimeoutError ||
    Boolean(
      err &&
      typeof err === "object" &&
      (err as { code?: unknown }).code === SESSION_WRITE_LOCK_TIMEOUT_CODE,
    )
  );
}

Recommendation

Do not treat untrusted/plain objects as SessionWriteLockTimeoutError based solely on a magic string.

Options:

  1. Strictest (preferred): only accept real instances (or ones that can be validated strongly):
export function isSessionWriteLockTimeoutError(err: unknown): err is SessionWriteLockTimeoutError {
  return err instanceof SessionWriteLockTimeoutError;
}
  1. If cross-boundary serialization is required, add strong branding and validate multiple fields:
  • check name === "SessionWriteLockTimeoutError"
  • require numeric timeoutMs, string owner, string lockPath
  • optionally include a non-guessable brand (e.g., a Symbol.for(...) property set only locally) or wrap/translate external errors into a dedicated internal error type rather than trusting .code.

Also consider scoping the suppression logic in failover-error.ts so only known-local lock-timeout errors (created by your locking code) suppress failover/timeout classification, rather than any nested error object containing a matching code.

3. 🟡 SessionWriteLockTimeoutError message leaks internal lock path and allows log forging via unsanitized filename
Property Value
Severity Medium
CWE CWE-209
Location src/agents/session-write-lock-error.ts:9-12

Description

SessionWriteLockTimeoutError embeds the absolute lockPath and owner (PID) directly in the exception message. This message is later propagated through generic error formatting (e.g., formatErrorMessage()), which is used for diagnostics/telemetry and potentially other user-visible surfaces.

Impact:

  • Information disclosure: reveals internal filesystem layout (absolute session/lock path) and process IDs.
  • Log forging / injection: POSIX filenames can contain newlines/control characters; because lockPath is derived from sessionFile (via path.basename(sessionFile)), an attacker who can influence sessionFile can inject newlines into logs/telemetry entries that include err.message.

Vulnerable code:

super(
  `session file locked (timeout ${params.timeoutMs}ms): ${params.owner} ${params.lockPath}`,
);

Recommendation

Avoid putting sensitive/internal identifiers (absolute paths, PIDs) into the human-readable error message. Keep them as structured fields for debugging, and sanitize/escape before logging.

Suggested change:

export class SessionWriteLockTimeoutError extends Error {
  readonly code = SESSION_WRITE_LOCK_TIMEOUT_CODE;
  readonly timeoutMs: number;
  readonly owner: string;
  readonly lockPath: string;

  constructor(params: { timeoutMs: number; owner: string; lockPath: string }) {
    super(`session file locked (timeout ${params.timeoutMs}ms)`);
    this.name = "SessionWriteLockTimeoutError";
    this.timeoutMs = params.timeoutMs;
    this.owner = params.owner;
    this.lockPath = params.lockPath;
  }
}

When logging, log structured metadata and escape control characters:

const safePath = lockPath.replace(/[\r\n\t]/g, " ");
log.warn({ code, owner, lockPath: safePath }, "session file locked");

If any of these errors can reach HTTP/UI responses, map them to a generic user-safe message and omit internal paths/PIDs.


Analyzed PR: #68700 at commit 6e5a8c7

Last updated on: 2026-04-25T04:07:14Z

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

https://github.com/openclaw/openclaw/blob/8ddbba84931cc9f77a4d2b9e896bc72b947a7d2a/src/agents/failover-error.ts#L287-L289
P2 Badge Apply session-lock suppression before cause-based return

resolveFailoverClassificationFromError returns causeClassification immediately in the !classification || classification.kind === "context_overflow" branch, before the new isSessionLockError(err) suppression runs. That means wrapped errors where session-lock text lives on the wrapper (for example in reason/error) but the cause looks timeout-like are still classified as provider failover (timeout/rate_limit) and continue model fallback instead of failing fast on lock contention. Fresh evidence in this commit is that the new guard was added only after this early return path.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@MonkeyLeeT MonkeyLeeT force-pushed the codex/stop-session-lock-failover branch from d862ffc to 51239ad Compare April 23, 2026 06:58

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 51239ad20e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/agents/failover-error.ts
@obviyus obviyus force-pushed the codex/stop-session-lock-failover branch from 2ef2eb3 to 18d0f66 Compare April 25, 2026 03:57
@obviyus obviyus force-pushed the codex/stop-session-lock-failover branch from 18d0f66 to d6c5004 Compare April 25, 2026 03:58
@obviyus obviyus force-pushed the codex/stop-session-lock-failover branch from d6c5004 to 6e5a8c7 Compare April 25, 2026 04:01

@obviyus obviyus left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified the session write-lock timeout path and confirmed it no longer enters model failover when local transcript lock contention occurs.

Maintainer follow-up: replaced message-regex suppression with a typed session lock timeout error, kept wrapper traversal in the failover classifier, and added the active changelog entry.

Local gate: pnpm check:changed; after final changelog-only rebase, pnpm check:no-conflict-markers.

@obviyus obviyus merged commit 8cc38c1 into openclaw:main Apr 25, 2026
55 of 57 checks passed
@obviyus

obviyus commented Apr 25, 2026

Copy link
Copy Markdown
Contributor

Landed on main.

Thanks @MonkeyLeeT.

@MonkeyLeeT MonkeyLeeT deleted the codex/stop-session-lock-failover branch April 25, 2026 06:00
Angfr95 pushed a commit to Angfr95/openclaw that referenced this pull request Apr 25, 2026
* fix(agents): stop treating session lock waits as timeout

* fix(agents): ignore abort-wrapped session lock waits

* fix(agents): keep explicit failover metadata authoritative

* fix(agents): respect inferred failover metadata

* fix(agents): ignore generic abort codes for lock waits

* fix(agents): suppress cause-based lock wait fallback

* fix(agents): type session lock timeout errors

* fix: stop session lock failover (openclaw#68700) (thanks @MonkeyLeeT)

---------

Co-authored-by: Ayaan Zaidi <hi@obviy.us>
ogt-redknie pushed a commit to ogt-redknie/OPENX that referenced this pull request May 2, 2026
* fix(agents): stop treating session lock waits as timeout

* fix(agents): ignore abort-wrapped session lock waits

* fix(agents): keep explicit failover metadata authoritative

* fix(agents): respect inferred failover metadata

* fix(agents): ignore generic abort codes for lock waits

* fix(agents): suppress cause-based lock wait fallback

* fix(agents): type session lock timeout errors

* fix: stop session lock failover (openclaw#68700) (thanks @MonkeyLeeT)

---------

Co-authored-by: Ayaan Zaidi <hi@obviy.us>
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
* fix(agents): stop treating session lock waits as timeout

* fix(agents): ignore abort-wrapped session lock waits

* fix(agents): keep explicit failover metadata authoritative

* fix(agents): respect inferred failover metadata

* fix(agents): ignore generic abort codes for lock waits

* fix(agents): suppress cause-based lock wait fallback

* fix(agents): type session lock timeout errors

* fix: stop session lock failover (openclaw#68700) (thanks @MonkeyLeeT)

---------

Co-authored-by: Ayaan Zaidi <hi@obviy.us>
globalcaos pushed a commit to globalcaos/tinkerclaw that referenced this pull request May 13, 2026
* fix(agents): stop treating session lock waits as timeout

* fix(agents): ignore abort-wrapped session lock waits

* fix(agents): keep explicit failover metadata authoritative

* fix(agents): respect inferred failover metadata

* fix(agents): ignore generic abort codes for lock waits

* fix(agents): suppress cause-based lock wait fallback

* fix(agents): type session lock timeout errors

* fix: stop session lock failover (openclaw#68700) (thanks @MonkeyLeeT)

---------

Co-authored-by: Ayaan Zaidi <hi@obviy.us>
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 24, 2026
* fix(agents): stop treating session lock waits as timeout

* fix(agents): ignore abort-wrapped session lock waits

* fix(agents): keep explicit failover metadata authoritative

* fix(agents): respect inferred failover metadata

* fix(agents): ignore generic abort codes for lock waits

* fix(agents): suppress cause-based lock wait fallback

* fix(agents): type session lock timeout errors

* fix: stop session lock failover (openclaw#68700) (thanks @MonkeyLeeT)

---------

Co-authored-by: Ayaan Zaidi <hi@obviy.us>
jameslcowan pushed a commit to jameslcowan/openclaw that referenced this pull request Jun 2, 2026
* fix(agents): stop treating session lock waits as timeout

* fix(agents): ignore abort-wrapped session lock waits

* fix(agents): keep explicit failover metadata authoritative

* fix(agents): respect inferred failover metadata

* fix(agents): ignore generic abort codes for lock waits

* fix(agents): suppress cause-based lock wait fallback

* fix(agents): type session lock timeout errors

* fix: stop session lock failover (openclaw#68700) (thanks @MonkeyLeeT)

---------

Co-authored-by: Ayaan Zaidi <hi@obviy.us>
sablehead pushed a commit to sablehead/openclaw that referenced this pull request Jun 10, 2026
* fix(agents): stop treating session lock waits as timeout

* fix(agents): ignore abort-wrapped session lock waits

* fix(agents): keep explicit failover metadata authoritative

* fix(agents): respect inferred failover metadata

* fix(agents): ignore generic abort codes for lock waits

* fix(agents): suppress cause-based lock wait fallback

* fix(agents): type session lock timeout errors

* fix: stop session lock failover (openclaw#68700) (thanks @MonkeyLeeT)

---------

Co-authored-by: Ayaan Zaidi <hi@obviy.us>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants