Skip to content

fix(sessions): harden recycled PID lock recovery follow-up#31320

Merged
vincentkoc merged 5 commits intoopenclaw:mainfrom
vincentkoc:vincentkoc-code/pr-26443-review
Mar 2, 2026
Merged

fix(sessions): harden recycled PID lock recovery follow-up#31320
vincentkoc merged 5 commits intoopenclaw:mainfrom
vincentkoc:vincentkoc-code/pr-26443-review

Conversation

@vincentkoc
Copy link
Member

Summary

Describe the problem and fix in 2–5 bullets:

  • Problem: session-write-lock could treat stale lock files as live when Linux PID values were recycled, and lock payload parsing accepted loosely-typed numeric values.
  • Why it matters: under container/concurrency workloads this can cause persistent session file locked (timeout 10000ms) failures until stale timeout.
  • What changed: added PID start-time identity checks, stricter lock payload number validation, safer /proc/<pid>/stat parsing, and regression tests.
  • What did NOT change (scope boundary): no auth, policy, sandbox, or network-surface changes.
  • Note: I could not rebase/push directly to metelix:fix/pid-recycling-lock-detection because GitHub denied fork push permissions (403), so this PR carries the same fixes from my branch.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • Stale .jsonl.lock files caused by PID recycling are reclaimed immediately (Linux), instead of waiting for age-based staleness.
  • Lock parsing now ignores malformed pid/starttime values instead of treating them as valid identities.

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: macOS (tests simulate Linux /proc parsing)
  • Runtime/container: Node 22 / pnpm workspace
  • Model/provider: N/A
  • Integration/channel (if any): N/A
  • Relevant config (redacted): default test config

Steps

  1. Create lock file with current live PID and mismatched starttime.
  2. Attempt lock acquisition.
  3. Verify stale lock is reclaimed and new lock is acquired.

Expected

  • Recycled PID lock is reclaimed immediately.

Actual

  • Verified with tests: reclaimed as expected.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: focused tests for pid-alive + session-write-lock, including recycled PID and backward compatibility.
  • Edge cases checked: malformed /proc stat, fractional/invalid pid/starttime values, old lock files without starttime.
  • What you did not verify: full end-to-end container workload on this machine.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps:

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: revert this PR commits.
  • Files/config to restore: src/shared/pid-alive.ts, src/agents/session-write-lock.ts and related tests.
  • Known bad symptoms reviewers should watch for: unexpected stale-lock reclaim under Linux if /proc parsing assumptions are wrong.

Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write None.

  • Risk: /proc/<pid>/stat parser could mis-handle uncommon process names.
    • Mitigation: lastIndexOf(")") split approach + tests for nested parentheses/spaces.
  • Risk: malformed lock payloads might be interpreted as valid identities.
    • Mitigation: integer-only validation for pid/starttime and regression tests.

HirokiKobayashi-R and others added 5 commits March 1, 2026 21:08
The session lock uses isPidAlive() to determine if a lock holder is
still running. In containers, PID recycling can cause a different
process to inherit the same PID, making the lock appear valid when
the original holder is dead.

Record the process start time (field 22 of /proc/pid/stat) in the
lock file and compare it during staleness checks. If the PID is alive
but its start time differs from the recorded value, the lock is
treated as stale and reclaimed immediately.

Backward compatible: lock files without starttime are handled with
the existing PID-alive + age-based logic. Non-Linux platforms skip
the starttime check entirely (getProcessStartTime returns null).
@openclaw-barnacle openclaw-barnacle bot added the agents Agent runtime and tooling label Mar 2, 2026
@openclaw-barnacle openclaw-barnacle bot added size: M maintainer Maintainer-authored PR labels Mar 2, 2026
@vincentkoc vincentkoc marked this pull request as ready for review March 2, 2026 05:34
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 2, 2026

Greptile Summary

This PR addresses a critical session lock issue where Linux PID recycling caused stale lock files to be treated as live, leading to persistent lock timeout failures in containerized environments.

The implementation adds PID start-time tracking by reading field 22 from /proc/<pid>/stat and comparing it with the stored value in lock files. When a mismatch is detected, the lock is immediately reclaimed instead of waiting for age-based timeout.

Key improvements:

  • Added getProcessStartTime() function that safely parses /proc/<pid>/stat with robust handling of process names containing spaces and parentheses
  • Enhanced lock payload validation with isValidLockNumber() to reject fractional/invalid numeric values
  • Implemented PID recycling detection in inspectLockPayload() with new "recycled-pid" stale reason
  • Maintains backward compatibility by treating old locks without starttime as valid (falls back to existing age-based checks)
  • Comprehensive test coverage including edge cases for malformed data and cross-platform mocking

Security hardening:

  • Stricter integer-only validation for pid and starttime fields
  • Safer /proc parsing using lastIndexOf(")") approach to handle complex process names
  • Graceful degradation on non-Linux platforms and error conditions

The changes are well-scoped to the lock recovery mechanism with no impact on auth, policy, or network surfaces.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk - it's a well-tested bug fix with comprehensive edge case handling and backward compatibility
  • Score of 5 reflects: (1) thorough test coverage including edge cases for PID recycling, backward compatibility, and malformed data; (2) robust implementation with proper validation and error handling; (3) well-scoped changes limited to lock recovery with no impact on critical systems; (4) careful attention to cross-platform compatibility and graceful degradation; (5) addresses a real production issue (persistent lock timeouts in containers) with a sound technical approach
  • No files require special attention - all changes are well-implemented with appropriate test coverage

Last reviewed commit: ff4cac2

@vincentkoc vincentkoc merged commit 5a2200b into openclaw:main Mar 2, 2026
30 checks passed
safzanpirani pushed a commit to safzanpirani/clawdbot that referenced this pull request Mar 2, 2026
…31320)

* fix: detect PID recycling in session write lock staleness check

The session lock uses isPidAlive() to determine if a lock holder is
still running. In containers, PID recycling can cause a different
process to inherit the same PID, making the lock appear valid when
the original holder is dead.

Record the process start time (field 22 of /proc/pid/stat) in the
lock file and compare it during staleness checks. If the PID is alive
but its start time differs from the recorded value, the lock is
treated as stale and reclaimed immediately.

Backward compatible: lock files without starttime are handled with
the existing PID-alive + age-based logic. Non-Linux platforms skip
the starttime check entirely (getProcessStartTime returns null).

* shared: harden pid starttime parsing

* sessions: validate lock pid/starttime payloads

* changelog: note recycled PID lock recovery fix

* changelog: credit hiroki and vincent on lock recovery fix

---------

Co-authored-by: HirokiKobayashi-R <hiroki@rhems-japan.co.jp>
@vincentkoc vincentkoc deleted the vincentkoc-code/pr-26443-review branch March 2, 2026 06:18
steipete pushed a commit to Sid-Qin/openclaw that referenced this pull request Mar 2, 2026
…31320)

* fix: detect PID recycling in session write lock staleness check

The session lock uses isPidAlive() to determine if a lock holder is
still running. In containers, PID recycling can cause a different
process to inherit the same PID, making the lock appear valid when
the original holder is dead.

Record the process start time (field 22 of /proc/pid/stat) in the
lock file and compare it during staleness checks. If the PID is alive
but its start time differs from the recorded value, the lock is
treated as stale and reclaimed immediately.

Backward compatible: lock files without starttime are handled with
the existing PID-alive + age-based logic. Non-Linux platforms skip
the starttime check entirely (getProcessStartTime returns null).

* shared: harden pid starttime parsing

* sessions: validate lock pid/starttime payloads

* changelog: note recycled PID lock recovery fix

* changelog: credit hiroki and vincent on lock recovery fix

---------

Co-authored-by: HirokiKobayashi-R <hiroki@rhems-japan.co.jp>
robertchang-ga pushed a commit to robertchang-ga/openclaw that referenced this pull request Mar 2, 2026
…31320)

* fix: detect PID recycling in session write lock staleness check

The session lock uses isPidAlive() to determine if a lock holder is
still running. In containers, PID recycling can cause a different
process to inherit the same PID, making the lock appear valid when
the original holder is dead.

Record the process start time (field 22 of /proc/pid/stat) in the
lock file and compare it during staleness checks. If the PID is alive
but its start time differs from the recorded value, the lock is
treated as stale and reclaimed immediately.

Backward compatible: lock files without starttime are handled with
the existing PID-alive + age-based logic. Non-Linux platforms skip
the starttime check entirely (getProcessStartTime returns null).

* shared: harden pid starttime parsing

* sessions: validate lock pid/starttime payloads

* changelog: note recycled PID lock recovery fix

* changelog: credit hiroki and vincent on lock recovery fix

---------

Co-authored-by: HirokiKobayashi-R <hiroki@rhems-japan.co.jp>
hanqizheng pushed a commit to hanqizheng/openclaw that referenced this pull request Mar 2, 2026
…31320)

* fix: detect PID recycling in session write lock staleness check

The session lock uses isPidAlive() to determine if a lock holder is
still running. In containers, PID recycling can cause a different
process to inherit the same PID, making the lock appear valid when
the original holder is dead.

Record the process start time (field 22 of /proc/pid/stat) in the
lock file and compare it during staleness checks. If the PID is alive
but its start time differs from the recorded value, the lock is
treated as stale and reclaimed immediately.

Backward compatible: lock files without starttime are handled with
the existing PID-alive + age-based logic. Non-Linux platforms skip
the starttime check entirely (getProcessStartTime returns null).

* shared: harden pid starttime parsing

* sessions: validate lock pid/starttime payloads

* changelog: note recycled PID lock recovery fix

* changelog: credit hiroki and vincent on lock recovery fix

---------

Co-authored-by: HirokiKobayashi-R <hiroki@rhems-japan.co.jp>
execute008 pushed a commit to execute008/openclaw that referenced this pull request Mar 2, 2026
…31320)

* fix: detect PID recycling in session write lock staleness check

The session lock uses isPidAlive() to determine if a lock holder is
still running. In containers, PID recycling can cause a different
process to inherit the same PID, making the lock appear valid when
the original holder is dead.

Record the process start time (field 22 of /proc/pid/stat) in the
lock file and compare it during staleness checks. If the PID is alive
but its start time differs from the recorded value, the lock is
treated as stale and reclaimed immediately.

Backward compatible: lock files without starttime are handled with
the existing PID-alive + age-based logic. Non-Linux platforms skip
the starttime check entirely (getProcessStartTime returns null).

* shared: harden pid starttime parsing

* sessions: validate lock pid/starttime payloads

* changelog: note recycled PID lock recovery fix

* changelog: credit hiroki and vincent on lock recovery fix

---------

Co-authored-by: HirokiKobayashi-R <hiroki@rhems-japan.co.jp>
dawi369 pushed a commit to dawi369/davis that referenced this pull request Mar 3, 2026
…31320)

* fix: detect PID recycling in session write lock staleness check

The session lock uses isPidAlive() to determine if a lock holder is
still running. In containers, PID recycling can cause a different
process to inherit the same PID, making the lock appear valid when
the original holder is dead.

Record the process start time (field 22 of /proc/pid/stat) in the
lock file and compare it during staleness checks. If the PID is alive
but its start time differs from the recorded value, the lock is
treated as stale and reclaimed immediately.

Backward compatible: lock files without starttime are handled with
the existing PID-alive + age-based logic. Non-Linux platforms skip
the starttime check entirely (getProcessStartTime returns null).

* shared: harden pid starttime parsing

* sessions: validate lock pid/starttime payloads

* changelog: note recycled PID lock recovery fix

* changelog: credit hiroki and vincent on lock recovery fix

---------

Co-authored-by: HirokiKobayashi-R <hiroki@rhems-japan.co.jp>
sachinkundu pushed a commit to sachinkundu/openclaw that referenced this pull request Mar 6, 2026
…31320)

* fix: detect PID recycling in session write lock staleness check

The session lock uses isPidAlive() to determine if a lock holder is
still running. In containers, PID recycling can cause a different
process to inherit the same PID, making the lock appear valid when
the original holder is dead.

Record the process start time (field 22 of /proc/pid/stat) in the
lock file and compare it during staleness checks. If the PID is alive
but its start time differs from the recorded value, the lock is
treated as stale and reclaimed immediately.

Backward compatible: lock files without starttime are handled with
the existing PID-alive + age-based logic. Non-Linux platforms skip
the starttime check entirely (getProcessStartTime returns null).

* shared: harden pid starttime parsing

* sessions: validate lock pid/starttime payloads

* changelog: note recycled PID lock recovery fix

* changelog: credit hiroki and vincent on lock recovery fix

---------

Co-authored-by: HirokiKobayashi-R <hiroki@rhems-japan.co.jp>
zooqueen pushed a commit to hanzoai/bot that referenced this pull request Mar 6, 2026
…31320)

* fix: detect PID recycling in session write lock staleness check

The session lock uses isPidAlive() to determine if a lock holder is
still running. In containers, PID recycling can cause a different
process to inherit the same PID, making the lock appear valid when
the original holder is dead.

Record the process start time (field 22 of /proc/pid/stat) in the
lock file and compare it during staleness checks. If the PID is alive
but its start time differs from the recorded value, the lock is
treated as stale and reclaimed immediately.

Backward compatible: lock files without starttime are handled with
the existing PID-alive + age-based logic. Non-Linux platforms skip
the starttime check entirely (getProcessStartTime returns null).

* shared: harden pid starttime parsing

* sessions: validate lock pid/starttime payloads

* changelog: note recycled PID lock recovery fix

* changelog: credit hiroki and vincent on lock recovery fix

---------

Co-authored-by: HirokiKobayashi-R <hiroki@rhems-japan.co.jp>
atlastacticalbot pushed a commit to tensakulabs/atlasbot that referenced this pull request Mar 6, 2026
…31320)

* fix: detect PID recycling in session write lock staleness check

The session lock uses isPidAlive() to determine if a lock holder is
still running. In containers, PID recycling can cause a different
process to inherit the same PID, making the lock appear valid when
the original holder is dead.

Record the process start time (field 22 of /proc/pid/stat) in the
lock file and compare it during staleness checks. If the PID is alive
but its start time differs from the recorded value, the lock is
treated as stale and reclaimed immediately.

Backward compatible: lock files without starttime are handled with
the existing PID-alive + age-based logic. Non-Linux platforms skip
the starttime check entirely (getProcessStartTime returns null).

* shared: harden pid starttime parsing

* sessions: validate lock pid/starttime payloads

* changelog: note recycled PID lock recovery fix

* changelog: credit hiroki and vincent on lock recovery fix

---------

Co-authored-by: HirokiKobayashi-R <hiroki@rhems-japan.co.jp>
(cherry picked from commit 5a2200b)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling maintainer Maintainer-authored PR size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Stale session lock files cause startup failure in containerized deployments

2 participants