fix(agents): stop treating session lock waits as timeout#68700
Conversation
ab4a06b to
b76352c
Compare
Greptile SummaryThis PR fixes a misclassification bug where session-write-lock errors containing the substring Confidence Score: 5/5Safe to merge — the fix is well-scoped, correctly placed, and covered by a targeted regression test. Both changed files are clean: the guard logic correctly short-circuits before any classifier path, cycle detection via No files require special attention. Reviews (1): Last reviewed commit: "fix(agents): stop treating session lock ..." | Re-trigger Greptile |
956419a to
b9e135e
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b9e135ec07
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@hxy91819 PTAL? Thanks! |
87c1e3e to
bab0cfe
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 83dc131f86
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
83dc131 to
897c24c
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 897c24c2ea
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
897c24c to
c10b881
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c10b881ea0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
295c8d8 to
28de0c5
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 28de0c56ce
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
d01d650 to
59cae90
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 59cae9058c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 70b8bbd183
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
22898de to
8ddbba8
Compare
🔒 Aisle Security AnalysisWe found 3 potential security issue(s) in this PR:
1. 🟡 Potential denial-of-service via unbounded recursion in hasSessionWriteLockTimeout() error graph traversal
DescriptionThe new
This helper is called from Vulnerable code: return (
hasSessionWriteLockTimeout(candidate.error, seen) ||
hasSessionWriteLockTimeout(candidate.cause, seen) ||
hasSessionWriteLockTimeout(candidate.reason, seen)
);RecommendationAdd a maximum traversal depth / node budget similar to Example fix (depth-limited + iterative): const MAX_SESSION_LOCK_CAUSE_DEPTH = 25;
function hasSessionWriteLockTimeout(err: unknown): boolean {
const seen = new Set<object>();
const stack: Array<{ value: unknown; depth: number }> = [{ value: err, depth: 0 }];
while (stack.length) {
const { value, depth } = stack.pop()!;
if (isSessionWriteLockTimeoutError(value)) return true;
if (!value || typeof value !== "object") continue;
if (seen.has(value)) continue;
if (depth >= MAX_SESSION_LOCK_CAUSE_DEPTH) continue;
seen.add(value);
const c = value as { error?: unknown; cause?: unknown; reason?: unknown };
if (c.error !== undefined) stack.push({ value: c.error, depth: depth + 1 });
if (c.cause !== undefined) stack.push({ value: c.cause, depth: depth + 1 });
if (c.reason !== undefined) stack.push({ value: c.reason, depth: depth + 1 });
}
return false;
}This prevents stack overflow and bounds worst-case work even with attacker-controlled nesting. 2. 🟡 Failover/timeout classification bypass via spoofable SessionWriteLockTimeoutError code
Description
This is security-relevant because
Since failover classification consumes generic error objects and extracts Vulnerable code: export function isSessionWriteLockTimeoutError(err: unknown): boolean {
return (
err instanceof SessionWriteLockTimeoutError ||
Boolean(
err &&
typeof err === "object" &&
(err as { code?: unknown }).code === SESSION_WRITE_LOCK_TIMEOUT_CODE,
)
);
}RecommendationDo not treat untrusted/plain objects as Options:
export function isSessionWriteLockTimeoutError(err: unknown): err is SessionWriteLockTimeoutError {
return err instanceof SessionWriteLockTimeoutError;
}
Also consider scoping the suppression logic in 3. 🟡 SessionWriteLockTimeoutError message leaks internal lock path and allows log forging via unsanitized filename
Description
Impact:
Vulnerable code: super(
`session file locked (timeout ${params.timeoutMs}ms): ${params.owner} ${params.lockPath}`,
);RecommendationAvoid putting sensitive/internal identifiers (absolute paths, PIDs) into the human-readable error message. Keep them as structured fields for debugging, and sanitize/escape before logging. Suggested change: export class SessionWriteLockTimeoutError extends Error {
readonly code = SESSION_WRITE_LOCK_TIMEOUT_CODE;
readonly timeoutMs: number;
readonly owner: string;
readonly lockPath: string;
constructor(params: { timeoutMs: number; owner: string; lockPath: string }) {
super(`session file locked (timeout ${params.timeoutMs}ms)`);
this.name = "SessionWriteLockTimeoutError";
this.timeoutMs = params.timeoutMs;
this.owner = params.owner;
this.lockPath = params.lockPath;
}
}When logging, log structured metadata and escape control characters: const safePath = lockPath.replace(/[\r\n\t]/g, " ");
log.warn({ code, owner, lockPath: safePath }, "session file locked");If any of these errors can reach HTTP/UI responses, map them to a generic user-safe message and omit internal paths/PIDs. Analyzed PR: #68700 at commit Last updated on: 2026-04-25T04:07:14Z |
There was a problem hiding this comment.
💡 Codex Review
https://github.com/openclaw/openclaw/blob/8ddbba84931cc9f77a4d2b9e896bc72b947a7d2a/src/agents/failover-error.ts#L287-L289
Apply session-lock suppression before cause-based return
resolveFailoverClassificationFromError returns causeClassification immediately in the !classification || classification.kind === "context_overflow" branch, before the new isSessionLockError(err) suppression runs. That means wrapped errors where session-lock text lives on the wrapper (for example in reason/error) but the cause looks timeout-like are still classified as provider failover (timeout/rate_limit) and continue model fallback instead of failing fast on lock contention. Fresh evidence in this commit is that the new guard was added only after this early return path.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
d862ffc to
51239ad
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 51239ad20e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
2ef2eb3 to
18d0f66
Compare
18d0f66 to
d6c5004
Compare
d6c5004 to
6e5a8c7
Compare
obviyus
left a comment
There was a problem hiding this comment.
Verified the session write-lock timeout path and confirmed it no longer enters model failover when local transcript lock contention occurs.
Maintainer follow-up: replaced message-regex suppression with a typed session lock timeout error, kept wrapper traversal in the failover classifier, and added the active changelog entry.
Local gate: pnpm check:changed; after final changelog-only rebase, pnpm check:no-conflict-markers.
|
Landed on main. Thanks @MonkeyLeeT. |
* fix(agents): stop treating session lock waits as timeout * fix(agents): ignore abort-wrapped session lock waits * fix(agents): keep explicit failover metadata authoritative * fix(agents): respect inferred failover metadata * fix(agents): ignore generic abort codes for lock waits * fix(agents): suppress cause-based lock wait fallback * fix(agents): type session lock timeout errors * fix: stop session lock failover (openclaw#68700) (thanks @MonkeyLeeT) --------- Co-authored-by: Ayaan Zaidi <hi@obviy.us>
* fix(agents): stop treating session lock waits as timeout * fix(agents): ignore abort-wrapped session lock waits * fix(agents): keep explicit failover metadata authoritative * fix(agents): respect inferred failover metadata * fix(agents): ignore generic abort codes for lock waits * fix(agents): suppress cause-based lock wait fallback * fix(agents): type session lock timeout errors * fix: stop session lock failover (openclaw#68700) (thanks @MonkeyLeeT) --------- Co-authored-by: Ayaan Zaidi <hi@obviy.us>
* fix(agents): stop treating session lock waits as timeout * fix(agents): ignore abort-wrapped session lock waits * fix(agents): keep explicit failover metadata authoritative * fix(agents): respect inferred failover metadata * fix(agents): ignore generic abort codes for lock waits * fix(agents): suppress cause-based lock wait fallback * fix(agents): type session lock timeout errors * fix: stop session lock failover (openclaw#68700) (thanks @MonkeyLeeT) --------- Co-authored-by: Ayaan Zaidi <hi@obviy.us>
* fix(agents): stop treating session lock waits as timeout * fix(agents): ignore abort-wrapped session lock waits * fix(agents): keep explicit failover metadata authoritative * fix(agents): respect inferred failover metadata * fix(agents): ignore generic abort codes for lock waits * fix(agents): suppress cause-based lock wait fallback * fix(agents): type session lock timeout errors * fix: stop session lock failover (openclaw#68700) (thanks @MonkeyLeeT) --------- Co-authored-by: Ayaan Zaidi <hi@obviy.us>
* fix(agents): stop treating session lock waits as timeout * fix(agents): ignore abort-wrapped session lock waits * fix(agents): keep explicit failover metadata authoritative * fix(agents): respect inferred failover metadata * fix(agents): ignore generic abort codes for lock waits * fix(agents): suppress cause-based lock wait fallback * fix(agents): type session lock timeout errors * fix: stop session lock failover (openclaw#68700) (thanks @MonkeyLeeT) --------- Co-authored-by: Ayaan Zaidi <hi@obviy.us>
* fix(agents): stop treating session lock waits as timeout * fix(agents): ignore abort-wrapped session lock waits * fix(agents): keep explicit failover metadata authoritative * fix(agents): respect inferred failover metadata * fix(agents): ignore generic abort codes for lock waits * fix(agents): suppress cause-based lock wait fallback * fix(agents): type session lock timeout errors * fix: stop session lock failover (openclaw#68700) (thanks @MonkeyLeeT) --------- Co-authored-by: Ayaan Zaidi <hi@obviy.us>
* fix(agents): stop treating session lock waits as timeout * fix(agents): ignore abort-wrapped session lock waits * fix(agents): keep explicit failover metadata authoritative * fix(agents): respect inferred failover metadata * fix(agents): ignore generic abort codes for lock waits * fix(agents): suppress cause-based lock wait fallback * fix(agents): type session lock timeout errors * fix: stop session lock failover (openclaw#68700) (thanks @MonkeyLeeT) --------- Co-authored-by: Ayaan Zaidi <hi@obviy.us>
Summary
AbortErrorsession-lock cases out of timeout classification, includingrequest was abortedRoot Cause
session-write-lockreports local lock contention using the textsession file locked (timeout ...).The failover classifier treated the bare
timeoutsubstring as a model/provider timeout, andisTimeoutError()still treated abort-wrapped lock waits as timeouts when the outer abort message matchedrequest was aborted.That let model fallback continue across candidates for a local file-lock contention issue.
Impact
Session lock contention now fails fast as infrastructure pressure instead of drifting across fallback models, including abort-wrapped lock waits that previously still looked like timeout. This avoids wasting fallback attempts on the same locked session and keeps model fallback focused on actual provider-side failures.
Testing
pnpm test src/agents/failover-error.test.ts -t "does not classify session lock wait errors as model timeout failover"pnpm test src/agents/failover-error.test.ts src/agents/session-write-lock.test.tspnpm check