Skip to content

fix: detect PID recycling in session write lock staleness check#26443

Closed
HirokiKobayashi-R wants to merge 3 commits intoopenclaw:mainfrom
metelix:fix/pid-recycling-lock-detection
Closed

fix: detect PID recycling in session write lock staleness check#26443
HirokiKobayashi-R wants to merge 3 commits intoopenclaw:mainfrom
metelix:fix/pid-recycling-lock-detection

Conversation

@HirokiKobayashi-R
Copy link
Contributor

@HirokiKobayashi-R HirokiKobayashi-R commented Feb 25, 2026

Summary

The session write lock uses isPidAlive() to determine if a lock holder is still running. In containers (Docker/Kubernetes), PID recycling can cause a different process to inherit the same PID number, making a stale lock appear valid. This leads to session file locked (timeout 10000ms) errors that persist until the 30-minute age-based stale threshold is reached.

This PR adds process start time (/proc/pid/stat field 22) comparison to detect PID recycling:

  • Record starttime (clock ticks since boot) in the lock file alongside pid and createdAt
  • During staleness checks, compare the stored starttime with the current starttime of that PID
  • If the PID is alive but its starttime differs from the recorded value, the lock is treated as stale (recycled-pid) and reclaimed immediately

Why this happens in containers

  • Container PID namespaces are small, so PIDs are recycled quickly
  • Process respawning (spawn(detached: true)) frees the parent PID for reuse
  • Background shell jobs ((…) & in entrypoint scripts) can inherit recycled PIDs
  • kill(pid, 0) returns true for the new (unrelated) process, and the lock appears valid

Design

  • Industry standard approach: (pid, starttime) comparison is the technique used by psutil, and recommended by kernel developers for /proc/pid/stat-based process identity verification
  • lastIndexOf(")") parsing: The only correct way to parse /proc/pid/stat — the comm field can contain spaces and nested parentheses (CVE-2017-1000367, CVE-2021-25683)
  • Backward compatible: Lock files without starttime (from older versions) are handled with the existing PID-alive + age-based logic — no false reclamation
  • Non-Linux safe: getProcessStartTime() returns null on non-Linux platforms; the field is omitted from lock files and the recycling check is skipped entirely

Changes

File Change
src/shared/pid-alive.ts Add getProcessStartTime() — reads /proc/pid/stat field 22
src/agents/session-write-lock.ts Record starttime in lock file; detect recycled-pid in inspectLockPayload()
src/shared/pid-alive.test.ts Tests for getProcessStartTime (Linux mock, non-Linux, invalid PIDs, nested-paren comm fields)
src/agents/session-write-lock.test.ts Tests for recycled PID reclamation + backward compatibility (old lock files without starttime)

Test plan

  • All 24 existing + new tests pass (vitest run)
  • Recycled PID: lock with alive PID but mismatched starttime is reclaimed immediately
  • Backward compat: lock without starttime field is NOT treated as recycled
  • Non-Linux: getProcessStartTime returns null, no behavioral change
  • Comm field with spaces and nested parentheses: correctly parsed

Greptile Summary

This PR adds PID recycling detection to the session write lock mechanism by recording and comparing process start times. The implementation follows the industry-standard (pid, starttime) approach used by psutil and recommended by kernel developers.

Key changes:

  • Added getProcessStartTime() function that reads /proc/pid/stat field 22 (starttime in clock ticks since boot)
  • Lock files now store starttime alongside pid and createdAt
  • During staleness checks, compares stored starttime with current process starttime to detect PID recycling
  • Recycled PIDs are reclaimed immediately (no 30-minute wait) with recycled-pid stale reason
  • Backward compatible: old lock files without starttime use existing PID-alive + age-based logic
  • Non-Linux safe: getProcessStartTime() returns null on non-Linux platforms, skipping recycling check

Security considerations:

  • Uses lastIndexOf(")") parsing to handle comm fields with spaces and nested parentheses, avoiding CVE-2017-1000367 and CVE-2021-25683 vulnerabilities
  • Field indexing is correct: field 22 = index 19 after comm-split (field 3 is index 0)

Test coverage:

  • Tests verify recycled PID detection and backward compatibility
  • Comm field parsing tests cover spaces and nested parentheses
  • Comprehensive coverage for Linux, non-Linux, and edge cases

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The implementation is thorough, well-tested, and follows industry best practices. The backward compatibility handling ensures no regressions for existing deployments. The parsing logic correctly handles edge cases documented in security advisories. Tests cover all code paths including PID recycling detection, backward compatibility, non-Linux platforms, and comm field edge cases.
  • No files require special attention

Last reviewed commit: 49a6bf1

The session lock uses isPidAlive() to determine if a lock holder is
still running. In containers, PID recycling can cause a different
process to inherit the same PID, making the lock appear valid when
the original holder is dead.

Record the process start time (field 22 of /proc/pid/stat) in the
lock file and compare it during staleness checks. If the PID is alive
but its start time differs from the recorded value, the lock is
treated as stale and reclaimed immediately.

Backward compatible: lock files without starttime are handled with
the existing PID-alive + age-based logic. Non-Linux platforms skip
the starttime check entirely (getProcessStartTime returns null).
@openclaw-barnacle openclaw-barnacle bot added agents Agent runtime and tooling size: M labels Feb 25, 2026
@vincentkoc
Copy link
Member

Opened follow-up draft PR from my branch with the same fixes and changelog credit update:

Reason: I could not rebase/push directly to metelix:fix/pid-recycling-lock-detection from my account (fork push permission denied / 403).

@vincentkoc vincentkoc closed this Mar 2, 2026
@vincentkoc
Copy link
Member

Cleanup note: this work landed via #31320 (merged), including the changelog credit update (Thanks @HirokiKobayashi-R and @vincentkoc).

@HirokiKobayashi-R
Copy link
Contributor Author

Sorry for the extra work — I'll make sure maintainer push access is enabled on future PRs. Thanks for landing it and the credit!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants