Skip to content

fix(sessions): reclaim orphan same-pid legacy lock files#32081

Merged
steipete merged 2 commits intoopenclaw:mainfrom
bmendonca3:bm/session-lock-timeout-recovery
Mar 2, 2026
Merged

fix(sessions): reclaim orphan same-pid legacy lock files#32081
steipete merged 2 commits intoopenclaw:mainfrom
bmendonca3:bm/session-lock-timeout-recovery

Conversation

@bmendonca3
Copy link

Summary

Describe the problem and fix in 2–5 bullets:

  • Problem: orphan legacy *.jsonl.lock files (no starttime) can block session writes for timeoutMs when container PID is reused (same PID appears alive).
  • Why it matters: agent runs fail before reply with session file locked (timeout 10000ms) and all models on that session are effectively blocked until manual cleanup/restart.
  • What changed: when lock contention is detected, lock reclaim now treats pid===process.pid + missing/invalid starttime + no in-process held lock entry as stale (orphan-self-pid) and reclaims it.
  • What did NOT change (scope boundary): no changes to lock file format, no timeout default changes, no queueing/refactor changes, no broader lock policy changes.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • Session write lock acquisition now self-recovers from orphan same-PID legacy lock files (missing starttime) instead of timing out.
  • Active in-process locks are still preserved (regression tests added for non-reentrant contention).

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: macOS 15.6.1
  • Runtime/container: Node v22.22.0
  • Model/provider: N/A
  • Integration/channel (if any): N/A
  • Relevant config (redacted): N/A

Steps

  1. Create a lock file with current PID and createdAt, but without starttime (legacy shape).
  2. Attempt acquireSessionWriteLock for the same session file with timeoutMs=200 and default staleMs.
  3. Observe behavior before and after patch.

Expected

  • Orphan legacy same-PID lock is reclaimed and lock acquisition succeeds.

Actual

  • Before patch: timed out with session file locked (timeout 200ms): pid=<current> ...session.jsonl.lock.
  • After patch: lock acquisition succeeds under the same setup.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Before patch repro output:

REPRO_OK
session file locked (timeout 200ms): pid=36714 .../session.jsonl.lock

After patch repro output:

UNEXPECTED: lock acquired

Passing tests/commands:

pnpm -s vitest run src/agents/session-write-lock.test.ts
pnpm check

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: reproduced timeout with orphan same-PID legacy lock, then verified acquisition succeeds after fix; verified lock suite passes.
  • Edge cases checked: active in-process lock with missing starttime is not reclaimed when allowReentrant: false; active in-process lock with malformed starttime is not reclaimed.
  • What you did not verify: full Docker gateway end-to-end under live traffic.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps:

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: revert commit sessions: reclaim orphan self-pid lock files.
  • Files/config to restore: src/agents/session-write-lock.ts, src/agents/session-write-lock.test.ts.
  • Known bad symptoms reviewers should watch for: unexpected lock reclaim while a same-process lock is actively held (covered by new tests).

Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write None.

  • Risk: false-positive reclaim in uncommon same-process lock ownership edge cases.
    • Mitigation: reclaim only triggers when PID matches current process, starttime is missing/invalid, and no in-process held lock entry exists; regression tests cover active lock scenarios.

@aisle-research-bot
Copy link

aisle-research-bot bot commented Mar 2, 2026

🔒 Aisle Security Analysis

We found 1 potential security issue(s) in this PR:

# Severity Title
1 🟠 High Race condition: orphan-self-pid reclaim can delete a freshly created lock and allow double acquisition

1. 🟠 Race condition: orphan-self-pid reclaim can delete a freshly created lock and allow double acquisition

Property Value
Severity High
CWE CWE-362
Location src/agents/session-write-lock.ts:372-542

Description

The new shouldTreatAsOrphanSelfLock logic can force a contended lock to be treated as stale: true purely because the lock payload has pid === process.pid and missing/invalid starttime, even if the lock is fresh and the owning process is alive.

Because acquireSessionWriteLock() passes these forced reclaimDetails into shouldReclaimContendedLockFile(), this bypasses the mtime-based safety fallback in shouldReclaimContendedLockFile() and causes immediate deletion of the lock file during contention.

Impact scenarios:

  • Concurrent acquisitions in the same process (race window): if getProcessStartTime() returns null (non-Linux platforms or environments where /proc is unavailable), newly created locks will have no starttime. A second concurrent call that hits EEXIST before HELD_LOCKS is populated will:

    • read a valid JSON payload with pid + createdAt but no starttime
    • classify it as orphanSelfLock
    • delete the lock file even though the first call still holds an open handle
    • then create a new lock file, resulting in two callers simultaneously believing they hold the lock, risking session store corruption.
  • PID-namespace collisions (containers + shared volume): a lock file created by another live container/process using the same PID (common PID=1) but missing starttime (legacy lock format) can be reclaimed by a different container instance, breaking mutual exclusion.

Vulnerable behavior is introduced by forcing stale: true + adding a non-fallback staleReason ("orphan-self-pid"), which makes shouldReclaimContendedLockFile() return true immediately when contended.

Recommendation

Do not immediately reclaim/delete a contended lock solely because pid === process.pid and starttime is missing.

Safer options:

  1. Re-check for in-process ownership during the contention loop (fixes the race where HELD_LOCKS becomes populated after another concurrent call acquires the lock):
// inside the while loop, before attempting fs.open or right after EEXIST
if (allowReentrant) {
  const heldNow = HELD_LOCKS.get(normalizedSessionFile);
  if (heldNow) {
    heldNow.count += 1;
    return { release: async () => await releaseHeldLock(normalizedSessionFile, heldNow) };
  }
}
  1. If you still need legacy reclamation, gate it on age rather than forcing immediate stale:
  • require lock mtime/createdAt to be older than some minimum (or older than the current process start wall-clock time derived from process.uptime()), and/or
  • keep the mtime-fallback path by treating this condition as requiring additional validation, not unconditional staleness.

This prevents deleting a freshly created lock that is merely missing starttime due to platform/proc limitations or transient acquisition timing.


Analyzed PR: #32081 at commit 960ee03

Last updated on: 2026-03-02T20:18:01Z

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a1e919bf5d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +380 to +384
const hasValidStarttime = isValidLockNumber(params.payload?.starttime);
if (hasValidStarttime) {
return false;
}
return !HELD_LOCKS.has(params.normalizedSessionFile);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Guard orphan-self reclaim against namespace PID collisions

This reclaim path can delete an actively-held lock when the session directory is shared across processes in different PID namespaces (for example, two containers where both writers run as PID 1) and the lock file is in legacy shape without a valid starttime. In that case pid === process.pid is true in both processes, HELD_LOCKS is local to the current process, and the second process will remove the first process’s live lock, allowing concurrent writers and possible session corruption; before this change it would block on the existing lock instead of force-reclaiming it.

Useful? React with 👍 / 👎.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 2, 2026

Greptile Summary

Added logic to reclaim orphan lock files when the lock file's PID matches the current process but lacks a valid starttime field and is not actively held in-process.

Key changes:

  • Introduced shouldTreatAsOrphanSelfLock() to detect legacy lock files from previous process instances with recycled PIDs
  • Modified lock acquisition to mark orphan same-PID locks as stale and reclaim them automatically
  • Preserved safety by checking HELD_LOCKS to prevent reclaiming actively-held locks even when starttime is missing or malformed

Testing:

  • Updated existing test to verify orphan lock reclamation succeeds
  • Added regression tests ensuring active in-process locks are never reclaimed, even with missing/malformed starttime
  • All scenarios (orphan detection, active lock protection, PID recycling) covered by comprehensive test cases

Confidence Score: 5/5

  • This PR is safe to merge with high confidence
  • The fix is well-targeted, addressing a specific edge case (orphan same-PID legacy locks) with clear conditions. The logic correctly distinguishes between orphan locks and actively-held locks using the HELD_LOCKS registry. Comprehensive test coverage includes both the fix scenario and regression tests for active lock protection. The implementation preserves existing PID-recycling detection and adds an additional safety check without changing lock file format or timeout behavior.
  • No files require special attention

Last reviewed commit: a1e919b

@steipete steipete force-pushed the bm/session-lock-timeout-recovery branch from a1e919b to 960ee03 Compare March 2, 2026 19:53
@steipete steipete merged commit 0cf533a into openclaw:main Mar 2, 2026
@steipete
Copy link
Contributor

steipete commented Mar 2, 2026

Landed via temp rebase onto main.

  • Gate: pnpm -s vitest run src/agents/session-write-lock.test.ts
  • Land commit: 960ee03
  • Merge commit: 0cf533a

Thanks @bmendonca3!

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 960ee03a80

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +380 to +384
const hasValidStarttime = isValidLockNumber(params.payload?.starttime);
if (hasValidStarttime) {
return false;
}
return !HELD_LOCKS.has(params.normalizedSessionFile);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid reclaiming same-PID locks without owner disambiguation

In the new orphan-self path, a contended lock is reclaimable whenever pid === process.pid, starttime is missing/invalid, and the current process has no in-memory held entry. Because HELD_LOCKS is process-local, this misidentifies live locks when multiple processes share the same lock directory but run in different PID namespaces with overlapping PIDs (for example, two containers both using PID 1): process B will delete process A's active legacy lock and proceed, allowing concurrent writers and possible session corruption. Before this change, that scenario blocked on the existing lock instead of force-reclaiming it.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

session file locked (timeout 10000ms) causes agent to fail before reply

2 participants