Skip to content

fix(registry): break a stale sandboxes.json lock held by a recycled PID#4922

Merged
cv merged 2 commits into
NVIDIA:mainfrom
latenighthackathon:fix/registry-lock-pid-recycle
Jun 8, 2026
Merged

fix(registry): break a stale sandboxes.json lock held by a recycled PID#4922
cv merged 2 commits into
NVIDIA:mainfrom
latenighthackathon:fix/registry-lock-pid-recycle

Conversation

@latenighthackathon

@latenighthackathon latenighthackathon commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Summary

NemoClaw serializes writes to its sandbox registry (sandboxes.json) with a lock directory next to the file. If the process holding that lock is killed outright (a crash, a power loss, kill -9) it never removes the lock, so the lock is left behind with the dead process's PID recorded in it. NemoClaw is meant to notice the holder is gone and clear the stale lock, but it only checked whether something is running under that PID. Operating systems reuse PIDs, so once an unrelated new process lands on that number the check thinks the original holder is still alive and refuses to clear the lock. From then on every command that touches the registry retries for about 12 seconds and then fails with Failed to acquire lock on .../sandboxes.json, and the only way out is deleting the lock directory by hand.

This teaches the staleness check to tell a recycled PID apart from the real holder: a lock counts as held only if the process under that PID actually started before the lock was taken. A process that started afterward is a reused PID, so the lock is cleared and the command proceeds. When that start time cannot be read, the lock is cleared once it is older than the existing staleness window. This mirrors the recycle check the onboarding lock already uses.

Related Issue

Closes #4921

Changes

  • Add the PID-recycle guard to the registry lock in src/lib/state/registry.ts, extracted into the exported, unit-tested pure classifyExistingLock() (alive owner whose /proc start time post-dates the lock mtime is a reused PID and therefore stale; mtime-window fallback when the PID or start time is unavailable).
  • Preserve the existing re-read-owner-before-breaking race guard so a lock another process already replaced is not clobbered.
  • Add src/lib/state/registry-lock.test.ts covering dead holder, recycled PID, live original (fresh and stale), unreadable owner, and the non-/proc fallback.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

Verification

  • New and existing unit tests pass locally.

Ran: npx vitest run src/lib/state/registry-lock.test.ts (6 passed) plus the registry-touching registry-recovery-action and sandbox-registry-metadata suites (8 passed), npx tsc -p tsconfig.src.json, npm run typecheck:cli, and Biome lint on the changed files, all clean.

Signed-off-by: latenighthackathon latenighthackathon@users.noreply.github.com

Summary by CodeRabbit

  • Bug Fixes

    • Improved detection and handling of stale registry locks (including PID-recycling cases) to reduce deadlocks and make lock reclamation safer and more reliable.
  • Tests

    • Added thorough tests covering a range of lock-staleness and process-state scenarios to validate correct break-vs-wait decisions.

@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1572faa9-1286-4e9d-8752-75506960e244

📥 Commits

Reviewing files that changed from the base of the PR and between 989a4e1 and 15de7ce.

📒 Files selected for processing (1)
  • src/lib/state/registry.ts
💤 Files with no reviewable changes (1)
  • src/lib/state/registry.ts

📝 Walkthrough

Walkthrough

Adds process liveness and /proc-derived process start-time checks and an exported classifyExistingLock to decide whether an existing registry lock should be "break" or "wait". Integrates this decision into acquireLock() with re-stat/re-read safety and adds Vitest cases covering PID-recycle and fallback scenarios.

Changes

Registry lock PID recycling detection

Layer / File(s) Summary
Lock staleness classification utilities
src/lib/state/registry.ts
Exports RegistryLockDecision and classifyExistingLock. Adds PID liveness probe and /proc-based process start-time reader; classification uses owner PID presence/aliveness, process start time vs lock mtime, and lock age fallback.
acquireLock integration
src/lib/state/registry.ts
Refactors acquireLock() EEXIST path: stat lock mtime, parse LOCK_OWNER, compute ownerAlive/processStartMs, call classifyExistingLock, and when breaking a lock re-stat and re-read owner before removing the lock directory.
Test suite for staleness classification
src/lib/state/registry-lock.test.ts
New Vitest tests with timing constants covering dead owner, PID-recycle wedge detection, fresh-lock wait, stale-lock break, and age-based fallback when owner PID or process start time are unavailable.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

v0.0.57

Suggested reviewers

  • prekshivyas

Poem

🐰
I sniffed a stale lock by the sand,
A reused PID held the land.
I checked start times, mtime, and more,
Broke the stale door and hopped to the shore.
Now the registry hops freely once more.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely describes the main change: fixing a registry lock issue caused by recycled PIDs, which is directly reflected in the code changes introducing PID-recycle detection logic.
Linked Issues check ✅ Passed The PR fully addresses all primary coding requirements from issue #4921: implements PID-recycle detection via process start-time comparison, maintains race-guard re-read logic, includes fallback to mtime-based staleness, and provides comprehensive unit tests.
Out of Scope Changes check ✅ Passed All changes are directly scoped to the linked issue: new test file exercises stale-lock scenarios, and registry.ts modifications implement only the PID-recycle staleness check and exported classification function with no unrelated alterations.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/state/registry.ts`:
- Around line 243-257: The stale-lock revalidation currently treats failures to
read LOCK_OWNER (and ownerPid === null) as stillSameOwner=true and may rmSync a
freshly replaced lock; change the logic in the decision === "break" branch so
stillSameOwner defaults to false and only becomes true after you (1)
successfully read and parse LOCK_OWNER and verify it equals the previously
observed ownerPid and (2) re-stat LOCK_DIR (e.g., fs.statSync or fs.lstatSync)
to ensure the directory still exists and has the same identity
(mtime/ctime/inode) you observed earlier; any failure to read LOCK_OWNER or stat
LOCK_DIR must be treated as not-same (do not call fs.rmSync). Use the existing
symbols ownerPid, LOCK_OWNER, LOCK_DIR, stillSameOwner and the recheck step to
implement this conservative verification before deleting the lock.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 593b0b8f-edca-4dce-9ac7-abc11779098a

📥 Commits

Reviewing files that changed from the base of the PR and between 0a5327b and cf6acd6.

📒 Files selected for processing (2)
  • src/lib/state/registry-lock.test.ts
  • src/lib/state/registry.ts

Comment thread src/lib/state/registry.ts
The registry advisory lock only probed the owner with kill(pid, 0). If a holder
crashed without releasing (SIGKILL, power loss) and its PID was later reused by
an unrelated live process, the liveness probe succeeded, so the lock was treated
as held forever: every registry write retried LOCK_MAX_RETRIES times and then
threw Failed to acquire lock, leaving the CLI wedged until the lock dir was
removed by hand.

Confirm the owner started before it took the lock: a live PID whose /proc start
time is after the lock mtime is a recycled PID, so the lock is stale and is
broken. When the owner pid or start time cannot be read, fall back to breaking
the lock once it is older than LOCK_STALE_MS. The staleness decision is extracted
into the exported pure classifyExistingLock() and unit-tested. Mirrors the
onboard-session lock recycle check.

Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
@latenighthackathon latenighthackathon force-pushed the fix/registry-lock-pid-recycle branch from cf6acd6 to 989a4e1 Compare June 8, 2026 02:56
@cv cv added the v0.0.61 Release target label Jun 8, 2026
@wscurran wscurran added area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression labels Jun 8, 2026
@wscurran

wscurran commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

✨ Thanks for submitting this detailed PR about breaking a stale sandboxes.json lock held by a recycled PID. This proposes a way to fix the regression in the sandbox registry.


Related open issues:

@cv cv self-assigned this Jun 8, 2026
@cv cv merged commit 9a738e0 into NVIDIA:main Jun 8, 2026
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression v0.0.61 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Registry advisory lock wedges when a crashed holder's PID is recycled (sandboxes.json)

3 participants