fix(registry): break a stale sandboxes.json lock held by a recycled PID by latenighthackathon · Pull Request #4922 · NVIDIA/NemoClaw

latenighthackathon · 2026-06-08T01:11:33Z

Summary

NemoClaw serializes writes to its sandbox registry (sandboxes.json) with a lock directory next to the file. If the process holding that lock is killed outright (a crash, a power loss, kill -9) it never removes the lock, so the lock is left behind with the dead process's PID recorded in it. NemoClaw is meant to notice the holder is gone and clear the stale lock, but it only checked whether something is running under that PID. Operating systems reuse PIDs, so once an unrelated new process lands on that number the check thinks the original holder is still alive and refuses to clear the lock. From then on every command that touches the registry retries for about 12 seconds and then fails with Failed to acquire lock on .../sandboxes.json, and the only way out is deleting the lock directory by hand.

This teaches the staleness check to tell a recycled PID apart from the real holder: a lock counts as held only if the process under that PID actually started before the lock was taken. A process that started afterward is a reused PID, so the lock is cleared and the command proceeds. When that start time cannot be read, the lock is cleared once it is older than the existing staleness window. This mirrors the recycle check the onboarding lock already uses.

Related Issue

Closes #4921

Changes

Add the PID-recycle guard to the registry lock in src/lib/state/registry.ts, extracted into the exported, unit-tested pure classifyExistingLock() (alive owner whose /proc start time post-dates the lock mtime is a reused PID and therefore stale; mtime-window fallback when the PID or start time is unavailable).
Preserve the existing re-read-owner-before-breaking race guard so a lock another process already replaced is not clobbered.
Add src/lib/state/registry-lock.test.ts covering dead holder, recycled PID, live original (fresh and stale), unreadable owner, and the non-/proc fallback.

Type of Change

Bug fix (non-breaking change which fixes an issue)

Verification

New and existing unit tests pass locally.

Ran: npx vitest run src/lib/state/registry-lock.test.ts (6 passed) plus the registry-touching registry-recovery-action and sandbox-registry-metadata suites (8 passed), npx tsc -p tsconfig.src.json, npm run typecheck:cli, and Biome lint on the changed files, all clean.

Signed-off-by: latenighthackathon latenighthackathon@users.noreply.github.com

Summary by CodeRabbit

Bug Fixes
- Improved detection and handling of stale registry locks (including PID-recycling cases) to reduce deadlocks and make lock reclamation safer and more reliable.
Tests
- Added thorough tests covering a range of lock-staleness and process-state scenarios to validate correct break-vs-wait decisions.

coderabbitai · 2026-06-08T01:11:47Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1572faa9-1286-4e9d-8752-75506960e244

📥 Commits

Reviewing files that changed from the base of the PR and between 989a4e1 and 15de7ce.

📒 Files selected for processing (1)

src/lib/state/registry.ts

💤 Files with no reviewable changes (1)

src/lib/state/registry.ts

📝 Walkthrough

Walkthrough

Adds process liveness and /proc-derived process start-time checks and an exported classifyExistingLock to decide whether an existing registry lock should be "break" or "wait". Integrates this decision into acquireLock() with re-stat/re-read safety and adds Vitest cases covering PID-recycle and fallback scenarios.

Changes

Registry lock PID recycling detection

Layer / File(s)	Summary
Lock staleness classification utilities `src/lib/state/registry.ts`	Exports `RegistryLockDecision` and `classifyExistingLock`. Adds PID liveness probe and `/proc`-based process start-time reader; classification uses owner PID presence/aliveness, process start time vs lock mtime, and lock age fallback.
acquireLock integration `src/lib/state/registry.ts`	Refactors acquireLock() EEXIST path: stat lock mtime, parse `LOCK_OWNER`, compute ownerAlive/processStartMs, call `classifyExistingLock`, and when breaking a lock re-stat and re-read owner before removing the lock directory.
Test suite for staleness classification `src/lib/state/registry-lock.test.ts`	New Vitest tests with timing constants covering dead owner, PID-recycle wedge detection, fresh-lock wait, stale-lock break, and age-based fallback when owner PID or process start time are unavailable.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

v0.0.57

Suggested reviewers

prekshivyas

Poem

🐰
I sniffed a stale lock by the sand,
A reused PID held the land.
I checked start times, mtime, and more,
Broke the stale door and hopped to the shore.
Now the registry hops freely once more.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately and concisely describes the main change: fixing a registry lock issue caused by recycled PIDs, which is directly reflected in the code changes introducing PID-recycle detection logic.
Linked Issues check	✅ Passed	The PR fully addresses all primary coding requirements from issue `#4921`: implements PID-recycle detection via process start-time comparison, maintains race-guard re-read logic, includes fallback to mtime-based staleness, and provides comprehensive unit tests.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to the linked issue: new test file exercises stale-lock scenarios, and registry.ts modifications implement only the PID-recycle staleness check and exported classification function with no unrelated alterations.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/state/registry.ts`:
- Around line 243-257: The stale-lock revalidation currently treats failures to
read LOCK_OWNER (and ownerPid === null) as stillSameOwner=true and may rmSync a
freshly replaced lock; change the logic in the decision === "break" branch so
stillSameOwner defaults to false and only becomes true after you (1)
successfully read and parse LOCK_OWNER and verify it equals the previously
observed ownerPid and (2) re-stat LOCK_DIR (e.g., fs.statSync or fs.lstatSync)
to ensure the directory still exists and has the same identity
(mtime/ctime/inode) you observed earlier; any failure to read LOCK_OWNER or stat
LOCK_DIR must be treated as not-same (do not call fs.rmSync). Use the existing
symbols ownerPid, LOCK_OWNER, LOCK_DIR, stillSameOwner and the recheck step to
implement this conservative verification before deleting the lock.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 593b0b8f-edca-4dce-9ac7-abc11779098a

📥 Commits

Reviewing files that changed from the base of the PR and between 0a5327b and cf6acd6.

📒 Files selected for processing (2)

src/lib/state/registry-lock.test.ts
src/lib/state/registry.ts

The registry advisory lock only probed the owner with kill(pid, 0). If a holder crashed without releasing (SIGKILL, power loss) and its PID was later reused by an unrelated live process, the liveness probe succeeded, so the lock was treated as held forever: every registry write retried LOCK_MAX_RETRIES times and then threw Failed to acquire lock, leaving the CLI wedged until the lock dir was removed by hand. Confirm the owner started before it took the lock: a live PID whose /proc start time is after the lock mtime is a recycled PID, so the lock is stale and is broken. When the owner pid or start time cannot be read, fall back to breaking the lock once it is older than LOCK_STALE_MS. The staleness decision is extracted into the exported pure classifyExistingLock() and unit-tested. Mirrors the onboard-session lock recycle check. Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>

wscurran · 2026-06-08T14:13:16Z

✨ Thanks for submitting this detailed PR about breaking a stale sandboxes.json lock held by a recycled PID. This proposes a way to fix the regression in the sandbox registry.

Related open issues:

#4921 Registry advisory lock wedges when a crashed holder's PID is recycled (sandboxes.json)

coderabbitai Bot reviewed Jun 8, 2026

View reviewed changes

Comment thread src/lib/state/registry.ts

latenighthackathon force-pushed the fix/registry-lock-pid-recycle branch from cf6acd6 to 989a4e1 Compare June 8, 2026 02:56

cv added the v0.0.61 Release target label Jun 8, 2026

wscurran added area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression labels Jun 8, 2026

cv self-assigned this Jun 8, 2026

Merge branch 'main' into fix/registry-lock-pid-recycle

15de7ce

cv approved these changes Jun 8, 2026

View reviewed changes

cv merged commit 9a738e0 into NVIDIA:main Jun 8, 2026
27 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(registry): break a stale sandboxes.json lock held by a recycled PID#4922

fix(registry): break a stale sandboxes.json lock held by a recycled PID#4922
cv merged 2 commits into
NVIDIA:mainfrom
latenighthackathon:fix/registry-lock-pid-recycle

latenighthackathon commented Jun 8, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 8, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

wscurran commented Jun 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

latenighthackathon commented Jun 8, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Related Issue

Changes

Type of Change

Verification

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

wscurran commented Jun 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

latenighthackathon commented Jun 8, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 8, 2026 •

edited

Loading