Skip to content

fix(locks): reclaim stale session locks across reboot#30525

Closed
liuxiaopai-ai wants to merge 1 commit intoopenclaw:mainfrom
liuxiaopai-ai:codex/fix-session-lock-boot-id-29927
Closed

fix(locks): reclaim stale session locks across reboot#30525
liuxiaopai-ai wants to merge 1 commit intoopenclaw:mainfrom
liuxiaopai-ai:codex/fix-session-lock-boot-id-29927

Conversation

@liuxiaopai-ai
Copy link

Summary

Describe the problem and fix in 2–5 bullets:

  • Problem: session lock files only validated pid liveness + age, so after hard reboot a reused PID could make a stale lock look active.
  • Why it matters: affected sessions could fail for ~timeout windows with session file locked even though no real OpenClaw process held the lock.
  • What changed: lock payloads now include Linux bootId; lock inspection marks locks stale on boot-id-mismatch; stale cleanup and lock acquisition reclaim those locks.
  • What did NOT change (scope boundary): no scheduler/session routing logic changed, and non-Linux boot-id behavior remains unchanged.
  • AI-assisted: yes (implemented with AI assistance, human-verified with targeted tests below).

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • On Linux, stale session lock files from a previous boot are now reclaimed immediately when the stored boot ID differs from the current boot.

Security Impact (required)

  • New permissions/capabilities? (Yes/No) No
  • Secrets/tokens handling changed? (Yes/No) No
  • New/changed network calls? (Yes/No) No
  • Command/tool execution surface changed? (Yes/No) No
  • Data access scope changed? (Yes/No) No
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: macOS (local dev) + Linux boot-id behavior covered by deterministic tests
  • Runtime/container: Node.js + Vitest
  • Model/provider: N/A
  • Integration/channel (if any): N/A
  • Relevant config (redacted): N/A

Steps

  1. Create a fresh lock file with live pid, fresh createdAt, and mismatched bootId.
  2. Attempt acquireSessionWriteLock for the same session path.
  3. Observe lock reclaimed and replaced with current payload.

Expected

  • Mismatched-boot lock is treated as stale and reclaimed.

Actual

  • Verified via new tests: lock is reclaimed on mismatch, and not reclaimed when boot ID matches.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

pnpm vitest src/agents/session-write-lock.test.ts src/commands/doctor-session-locks.test.ts

  • ✓ src/agents/session-write-lock.test.ts (17 tests)
  • ✓ src/commands/doctor-session-locks.test.ts (2 tests)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: boot mismatch reclaim path in both acquisition and cleanup flows; matching-boot lock still blocks as expected.
  • Edge cases checked: malformed/fresh lock behavior remained unchanged; stale cleanup still removes dead/old locks.
  • What you did not verify: full gateway E2E reboot workflow on a real Linux host.

Compatibility / Migration

  • Backward compatible? (Yes/No) Yes
  • Config/env changes? (Yes/No) No
  • Migration needed? (Yes/No) No
  • If yes, exact upgrade steps:

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: revert commit d01660fb13.
  • Files/config to restore: src/agents/session-write-lock.ts, src/agents/session-write-lock.test.ts, CHANGELOG.md.
  • Known bad symptoms reviewers should watch for: unexpected lock reclamation on active sessions (not expected; guarded by boot mismatch + existing stale criteria).

Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write None.

  • Risk: Linux-only boot marker unavailable in constrained environments.
    • Mitigation: behavior gracefully falls back to existing pid/age logic when boot ID cannot be resolved.

@openclaw-barnacle openclaw-barnacle bot added agents Agent runtime and tooling size: S labels Mar 1, 2026
@claw-bft
Copy link

claw-bft commented Mar 1, 2026

Nice approach using boot ID to solve the PID reuse problem! This is a classic issue with file-based locking that I've run into before.

A few thoughts on the implementation:

1. Cross-platform boot ID sourcing
The PR mentions reading from /proc/sys/kernel/random/boot_id on Linux. Have you considered what happens in containerized environments where /proc might be masked or the boot ID is inherited from the host? In Docker, the boot_id is often the same across container restarts, which could lead to false negatives (locks not reclaimed when they should be).

One mitigation might be to combine boot_id with a container-specific identifier when available:

const containerId = process.env.HOSTNAME || process.env.CONTAINER_ID;
const effectiveBootId = containerId ? `${bootId}:${containerId}` : bootId;

2. macOS behavior
The PR notes macOS uses deterministic tests since boot_id isn't available. On macOS, you could potentially use kern.boottime from sysctl or the system startup timestamp as a fallback. Not critical, but would make the feature work more consistently across platforms.

3. Lock file format versioning
Since you're changing the lock payload structure, consider adding a version field to the JSON. This makes future migrations easier if the format needs to change again:

interface LockPayload {
  version: 2;  // bumped from implicit v1
  pid: number;
  createdAt: number;
  bootId?: string;
}

4. Edge case: boot_id changes mid-session
Extremely unlikely on bare metal, but in VMs/containers with live migration, the boot_id could theoretically change while OpenClaw is running. The current logic would treat the existing lock as stale on the next acquisition attempt. This is probably the correct behavior (reclaim the lock), but worth documenting.

5. Testing suggestion
Consider adding a test case for the container scenario where boot_id is identical but the lock should still be reclaimed due to PID being dead. This ensures the fallback logic (pid + age) still works when boot_id doesn't differentiate.

Overall LGTM - the boot_id approach is much more robust than PID+age alone. The edge cases I mentioned are mostly theoretical and the fallback behavior handles them gracefully.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 1, 2026

Greptile Summary

Adds Linux boot ID tracking to session lock files to solve the PID reuse problem after hard reboots. Lock payloads now include bootId on Linux systems, and locks are marked stale when the stored boot ID differs from the current system's boot ID. The implementation gracefully handles:

  • Non-Linux platforms (boot ID check disabled, falls back to existing pid/age validation)
  • Backward compatibility (old locks without bootId are not falsely marked as stale)
  • Edge cases (empty/null boot IDs, malformed payloads)

The change is well-scoped and doesn't modify scheduler/session routing logic. Test coverage includes both acquisition and cleanup flows, verifying that mismatched boot IDs trigger reclamation while matching boot IDs preserve the lock.

Confidence Score: 5/5

  • Safe to merge with minimal risk - solves a real production issue with proper fallbacks
  • Implementation is thoroughly tested with comprehensive edge case coverage, maintains backward compatibility, gracefully handles platform differences, and the logic correctly handles all boot ID scenarios (null, empty, mismatch, match) without introducing new failure modes
  • No files require special attention

Last reviewed commit: d01660f

@steipete
Copy link
Contributor

steipete commented Mar 2, 2026

Thanks for the PR!

Multiple PRs address stale lock recovery. Keeping #29118 as the earliest and most scoped submission. Closing this one to reduce noise.

This is an AI-assisted triage review. If we got this wrong, feel free to reopen or start a new PR — happy to revisit.

@steipete steipete closed this Mar 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Session lock files not cleaned up on hard reboot (pid reuse causes stale lock to appear live)

3 participants