fix(locks): reclaim stale session locks across reboot by liuxiaopai-ai · Pull Request #30525 · openclaw/openclaw

liuxiaopai-ai · 2026-03-01T09:51:23Z

Summary

Describe the problem and fix in 2–5 bullets:

Problem: session lock files only validated pid liveness + age, so after hard reboot a reused PID could make a stale lock look active.
Why it matters: affected sessions could fail for ~timeout windows with session file locked even though no real OpenClaw process held the lock.
What changed: lock payloads now include Linux bootId; lock inspection marks locks stale on boot-id-mismatch; stale cleanup and lock acquisition reclaim those locks.
What did NOT change (scope boundary): no scheduler/session routing logic changed, and non-Linux boot-id behavior remains unchanged.
AI-assisted: yes (implemented with AI assistance, human-verified with targeted tests below).

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

User-visible / Behavior Changes

On Linux, stale session lock files from a previous boot are now reclaimed immediately when the stored boot ID differs from the current boot.

Security Impact (required)

New permissions/capabilities? (Yes/No) No
Secrets/tokens handling changed? (Yes/No) No
New/changed network calls? (Yes/No) No
Command/tool execution surface changed? (Yes/No) No
Data access scope changed? (Yes/No) No
If any Yes, explain risk + mitigation:

Repro + Verification

Environment

OS: macOS (local dev) + Linux boot-id behavior covered by deterministic tests
Runtime/container: Node.js + Vitest
Model/provider: N/A
Integration/channel (if any): N/A
Relevant config (redacted): N/A

Steps

Create a fresh lock file with live pid, fresh createdAt, and mismatched bootId.
Attempt acquireSessionWriteLock for the same session path.
Observe lock reclaimed and replaced with current payload.

Expected

Mismatched-boot lock is treated as stale and reclaimed.

Actual

Verified via new tests: lock is reclaimed on mismatch, and not reclaimed when boot ID matches.

Evidence

Attach at least one:

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording
Perf numbers (if relevant)

pnpm vitest src/agents/session-write-lock.test.ts src/commands/doctor-session-locks.test.ts

✓ src/agents/session-write-lock.test.ts (17 tests)
✓ src/commands/doctor-session-locks.test.ts (2 tests)

Human Verification (required)

What you personally verified (not just CI), and how:

Verified scenarios: boot mismatch reclaim path in both acquisition and cleanup flows; matching-boot lock still blocks as expected.
Edge cases checked: malformed/fresh lock behavior remained unchanged; stale cleanup still removes dead/old locks.
What you did not verify: full gateway E2E reboot workflow on a real Linux host.

Compatibility / Migration

Backward compatible? (Yes/No) Yes
Config/env changes? (Yes/No) No
Migration needed? (Yes/No) No
If yes, exact upgrade steps:

Failure Recovery (if this breaks)

How to disable/revert this change quickly: revert commit d01660fb13.
Files/config to restore: src/agents/session-write-lock.ts, src/agents/session-write-lock.test.ts, CHANGELOG.md.
Known bad symptoms reviewers should watch for: unexpected lock reclamation on active sessions (not expected; guarded by boot mismatch + existing stale criteria).

Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write None.

Risk: Linux-only boot marker unavailable in constrained environments.
- Mitigation: behavior gracefully falls back to existing pid/age logic when boot ID cannot be resolved.

claw-bft · 2026-03-01T09:54:49Z

Nice approach using boot ID to solve the PID reuse problem! This is a classic issue with file-based locking that I've run into before.

A few thoughts on the implementation:

1. Cross-platform boot ID sourcing
The PR mentions reading from /proc/sys/kernel/random/boot_id on Linux. Have you considered what happens in containerized environments where /proc might be masked or the boot ID is inherited from the host? In Docker, the boot_id is often the same across container restarts, which could lead to false negatives (locks not reclaimed when they should be).

One mitigation might be to combine boot_id with a container-specific identifier when available:

const containerId = process.env.HOSTNAME || process.env.CONTAINER_ID;
const effectiveBootId = containerId ? `${bootId}:${containerId}` : bootId;

2. macOS behavior
The PR notes macOS uses deterministic tests since boot_id isn't available. On macOS, you could potentially use kern.boottime from sysctl or the system startup timestamp as a fallback. Not critical, but would make the feature work more consistently across platforms.

3. Lock file format versioning
Since you're changing the lock payload structure, consider adding a version field to the JSON. This makes future migrations easier if the format needs to change again:

interface LockPayload {
  version: 2;  // bumped from implicit v1
  pid: number;
  createdAt: number;
  bootId?: string;
}

4. Edge case: boot_id changes mid-session
Extremely unlikely on bare metal, but in VMs/containers with live migration, the boot_id could theoretically change while OpenClaw is running. The current logic would treat the existing lock as stale on the next acquisition attempt. This is probably the correct behavior (reclaim the lock), but worth documenting.

5. Testing suggestion
Consider adding a test case for the container scenario where boot_id is identical but the lock should still be reclaimed due to PID being dead. This ensures the fallback logic (pid + age) still works when boot_id doesn't differentiate.

Overall LGTM - the boot_id approach is much more robust than PID+age alone. The edge cases I mentioned are mostly theoretical and the fallback behavior handles them gracefully.

greptile-apps · 2026-03-01T09:56:28Z

Greptile Summary

Adds Linux boot ID tracking to session lock files to solve the PID reuse problem after hard reboots. Lock payloads now include bootId on Linux systems, and locks are marked stale when the stored boot ID differs from the current system's boot ID. The implementation gracefully handles:

Non-Linux platforms (boot ID check disabled, falls back to existing pid/age validation)
Backward compatibility (old locks without bootId are not falsely marked as stale)
Edge cases (empty/null boot IDs, malformed payloads)

The change is well-scoped and doesn't modify scheduler/session routing logic. Test coverage includes both acquisition and cleanup flows, verifying that mismatched boot IDs trigger reclamation while matching boot IDs preserve the lock.

Confidence Score: 5/5

Safe to merge with minimal risk - solves a real production issue with proper fallbacks
Implementation is thoroughly tested with comprehensive edge case coverage, maintains backward compatibility, gracefully handles platform differences, and the logic correctly handles all boot ID scenarios (null, empty, mismatch, match) without introducing new failure modes
No files require special attention

_{Last reviewed commit: d01660f}

steipete · 2026-03-02T23:28:43Z

Thanks for the PR!

Multiple PRs address stale lock recovery. Keeping #29118 as the earliest and most scoped submission. Closing this one to reduce noise.

This is an AI-assisted triage review. If we got this wrong, feel free to reopen or start a new PR — happy to revisit.

fix(locks): reclaim stale session locks across reboot

d01660f

openclaw-barnacle bot added agents Agent runtime and tooling size: S labels Mar 1, 2026

steipete closed this Mar 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(locks): reclaim stale session locks across reboot#30525

fix(locks): reclaim stale session locks across reboot#30525
liuxiaopai-ai wants to merge 1 commit intoopenclaw:mainfrom
liuxiaopai-ai:codex/fix-session-lock-boot-id-29927

liuxiaopai-ai commented Mar 1, 2026

Uh oh!

claw-bft commented Mar 1, 2026

Uh oh!

greptile-apps bot commented Mar 1, 2026

Uh oh!

steipete commented Mar 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

liuxiaopai-ai commented Mar 1, 2026

Summary

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

User-visible / Behavior Changes

Security Impact (required)

Repro + Verification

Environment

Steps

Expected

Actual

Evidence

Human Verification (required)

Compatibility / Migration

Failure Recovery (if this breaks)

Risks and Mitigations

Uh oh!

claw-bft commented Mar 1, 2026

Uh oh!

greptile-apps bot commented Mar 1, 2026

Greptile Summary

Confidence Score: 5/5

Uh oh!

steipete commented Mar 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants