fix(registry): break a stale sandboxes.json lock held by a recycled PID#4922
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
💤 Files with no reviewable changes (1)
📝 WalkthroughWalkthroughAdds process liveness and /proc-derived process start-time checks and an exported ChangesRegistry lock PID recycling detection
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/state/registry.ts`:
- Around line 243-257: The stale-lock revalidation currently treats failures to
read LOCK_OWNER (and ownerPid === null) as stillSameOwner=true and may rmSync a
freshly replaced lock; change the logic in the decision === "break" branch so
stillSameOwner defaults to false and only becomes true after you (1)
successfully read and parse LOCK_OWNER and verify it equals the previously
observed ownerPid and (2) re-stat LOCK_DIR (e.g., fs.statSync or fs.lstatSync)
to ensure the directory still exists and has the same identity
(mtime/ctime/inode) you observed earlier; any failure to read LOCK_OWNER or stat
LOCK_DIR must be treated as not-same (do not call fs.rmSync). Use the existing
symbols ownerPid, LOCK_OWNER, LOCK_DIR, stillSameOwner and the recheck step to
implement this conservative verification before deleting the lock.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 593b0b8f-edca-4dce-9ac7-abc11779098a
📒 Files selected for processing (2)
src/lib/state/registry-lock.test.tssrc/lib/state/registry.ts
The registry advisory lock only probed the owner with kill(pid, 0). If a holder crashed without releasing (SIGKILL, power loss) and its PID was later reused by an unrelated live process, the liveness probe succeeded, so the lock was treated as held forever: every registry write retried LOCK_MAX_RETRIES times and then threw Failed to acquire lock, leaving the CLI wedged until the lock dir was removed by hand. Confirm the owner started before it took the lock: a live PID whose /proc start time is after the lock mtime is a recycled PID, so the lock is stale and is broken. When the owner pid or start time cannot be read, fall back to breaking the lock once it is older than LOCK_STALE_MS. The staleness decision is extracted into the exported pure classifyExistingLock() and unit-tested. Mirrors the onboard-session lock recycle check. Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
cf6acd6 to
989a4e1
Compare
|
✨ Thanks for submitting this detailed PR about breaking a stale sandboxes.json lock held by a recycled PID. This proposes a way to fix the regression in the sandbox registry. Related open issues: |
Summary
NemoClaw serializes writes to its sandbox registry (
sandboxes.json) with a lock directory next to the file. If the process holding that lock is killed outright (a crash, a power loss,kill -9) it never removes the lock, so the lock is left behind with the dead process's PID recorded in it. NemoClaw is meant to notice the holder is gone and clear the stale lock, but it only checked whether something is running under that PID. Operating systems reuse PIDs, so once an unrelated new process lands on that number the check thinks the original holder is still alive and refuses to clear the lock. From then on every command that touches the registry retries for about 12 seconds and then fails withFailed to acquire lock on .../sandboxes.json, and the only way out is deleting the lock directory by hand.This teaches the staleness check to tell a recycled PID apart from the real holder: a lock counts as held only if the process under that PID actually started before the lock was taken. A process that started afterward is a reused PID, so the lock is cleared and the command proceeds. When that start time cannot be read, the lock is cleared once it is older than the existing staleness window. This mirrors the recycle check the onboarding lock already uses.
Related Issue
Closes #4921
Changes
src/lib/state/registry.ts, extracted into the exported, unit-tested pureclassifyExistingLock()(alive owner whose/procstart time post-dates the lock mtime is a reused PID and therefore stale; mtime-window fallback when the PID or start time is unavailable).src/lib/state/registry-lock.test.tscovering dead holder, recycled PID, live original (fresh and stale), unreadable owner, and the non-/procfallback.Type of Change
Verification
Ran:
npx vitest run src/lib/state/registry-lock.test.ts(6 passed) plus the registry-touchingregistry-recovery-actionandsandbox-registry-metadatasuites (8 passed),npx tsc -p tsconfig.src.json,npm run typecheck:cli, and Biome lint on the changed files, all clean.Signed-off-by: latenighthackathon latenighthackathon@users.noreply.github.com
Summary by CodeRabbit
Bug Fixes
Tests