fix(shields): self-heal expired auto-restore timers#3204
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR centralizes Nemoclaw state path resolution, hardens shields state persistence and timer-marker validation, adds snapshot activation and inline auto-restore, tightens lock/unlock stat verification, adds destroy-time shields cleanup, and expands/tightens timer and cleanup tests. ChangesShields Orchestration Refactor
Sequence Diagram(s)sequenceDiagram
participant CLI
participant shieldsDown
participant TimerProcess
participant SnapshotRestore
participant lockAgentConfig
CLI->>shieldsDown: request temporary unlock
shieldsDown->>TimerProcess: start detached timer with marker (processToken)
TimerProcess-->>shieldsDown: startup failure -> rollback (restore snapshot)
TimerProcess->>SnapshotRestore: on expiry -> attempt restore
SnapshotRestore->>lockAgentConfig: lock config after restore
lockAgentConfig-->>CLI: result
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
|
Optional follow-up: add a lightweight host-side watchdog that scans ~/.nemoclaw/state/shields-timer-*.json and triggers the same restore path when restoreAt is expired and PID is dead. This would enforce fail-secure recovery even if shields status is never run. |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/lib/shields/index.ts (2)
122-133:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't collapse "corrupt state file" into "no state file".
If this file exists but fails to parse or validate, callers get
_hasStateFile: false.shieldsStatus()then reportsNOT CONFIGUREDand skipsrecoverExpiredAutoRestoreInline(), which turns a recoverable stale-unlock case into a silent fail-open. Preserve the fact that the state file exists and make callers surface corruption instead of treating it as fresh state.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/shields/index.ts` around lines 122 - 133, loadShieldsState currently treats a present-but-unparseable state file the same as missing by returning _hasStateFile: false; change it so that when stateFilePath(sandboxName) exists but JSON.parse or isShieldsState validation throws/fails, loadShieldsState still returns _hasStateFile: true and also returns a corruption indicator (for example _isCorrupt: true or _corruptError: string) so callers can surface the corruption; update callers like shieldsStatus and recoverExpiredAutoRestoreInline to check this flag and avoid treating a corrupt file as "not configured" (i.e., surface or fail fast instead of skipping recovery).
218-225:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftValidate
pidbefore callingprocess.kill()inkillTimer().
isTimerMarker()accepts any numericpidwithout range validation, andkillTimer()callsprocess.kill(marker.pid, "SIGTERM")directly. A malformed marker withpid <= 0will signal a process group rather than the detached timer process. AlthoughisProcessAlive()correctly validatespid <= 0,killTimer()does not use it. Add pid validation toisTimerMarker()to require positive integers, or callisProcessAlive()before signaling inkillTimer()to guard against corrupted marker files.Also applies to: 263-272
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/shields/index.ts` around lines 218 - 225, isTimerMarker currently accepts any number for marker.pid allowing non-positive values; update validation so pid must be a positive integer (e.g., require Number.isInteger(value.pid) && value.pid > 0) or alternatively add a guard in killTimer to call isProcessAlive(marker.pid) before invoking process.kill(marker.pid, "SIGTERM"); modify either isTimerMarker (the type guard for TimerMarker) or killTimer (the function that uses marker.pid and calls process.kill) to ensure marker.pid is validated as a positive integer to prevent signaling process groups from corrupted marker files.
🧹 Nitpick comments (1)
src/lib/shields/index.test.ts (1)
310-345: 🏗️ Heavy liftPrefer an execution-level recovery test for NC-3112.
These assertions only prove that the current source text exists; they will still pass if the inline-recovery path stops updating state, clearing the timer marker, or writing the audit entry correctly. For a fail-secure path like this, add at least one test that drives
shieldsStatus()against an expired marker and asserts the side effects.As per coding guidelines, "
src/lib/shields/**: These files control shields down/up, config mutability, audit trail, and auto-restore timer. E2E test recommendation: -shields-config-e2e— shields lifecycle + config get/set/rotate`"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/shields/index.test.ts` around lines 310 - 345, Add an execution-level test that invokes shieldsStatus(...) against an expired auto-restore marker and asserts the actual side-effects rather than just source text: create a fake marker with restoreAtMs < Date.now() and a non-live pid, call shieldsStatus(sandboxName, false), then assert that recoverExpiredAutoRestoreInline (or the inline restore flow) was executed resulting in the marker being cleared, the shields state updated (e.g., shields reported UP/DOWN as expected), and an audit entry written; use mocks/spies for isProcessAlive, activateLockdownFromSnapshot and the audit logger to verify they were called and verify the timer/marker storage no longer contains the expired marker.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/lib/shields/index.ts`:
- Around line 122-133: loadShieldsState currently treats a
present-but-unparseable state file the same as missing by returning
_hasStateFile: false; change it so that when stateFilePath(sandboxName) exists
but JSON.parse or isShieldsState validation throws/fails, loadShieldsState still
returns _hasStateFile: true and also returns a corruption indicator (for example
_isCorrupt: true or _corruptError: string) so callers can surface the
corruption; update callers like shieldsStatus and
recoverExpiredAutoRestoreInline to check this flag and avoid treating a corrupt
file as "not configured" (i.e., surface or fail fast instead of skipping
recovery).
- Around line 218-225: isTimerMarker currently accepts any number for marker.pid
allowing non-positive values; update validation so pid must be a positive
integer (e.g., require Number.isInteger(value.pid) && value.pid > 0) or
alternatively add a guard in killTimer to call isProcessAlive(marker.pid) before
invoking process.kill(marker.pid, "SIGTERM"); modify either isTimerMarker (the
type guard for TimerMarker) or killTimer (the function that uses marker.pid and
calls process.kill) to ensure marker.pid is validated as a positive integer to
prevent signaling process groups from corrupted marker files.
---
Nitpick comments:
In `@src/lib/shields/index.test.ts`:
- Around line 310-345: Add an execution-level test that invokes
shieldsStatus(...) against an expired auto-restore marker and asserts the actual
side-effects rather than just source text: create a fake marker with restoreAtMs
< Date.now() and a non-live pid, call shieldsStatus(sandboxName, false), then
assert that recoverExpiredAutoRestoreInline (or the inline restore flow) was
executed resulting in the marker being cleared, the shields state updated (e.g.,
shields reported UP/DOWN as expected), and an audit entry written; use
mocks/spies for isProcessAlive, activateLockdownFromSnapshot and the audit
logger to verify they were called and verify the timer/marker storage no longer
contains the expired marker.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 13fc9ad0-b7dc-4977-b7a6-e14ca0f016c1
📒 Files selected for processing (2)
src/lib/shields/index.test.tssrc/lib/shields/index.ts
|
✨ Thanks for submitting this PR that fixes the issue with shields remaining DOWN indefinitely due to a detached timer process dying before deadline. This change involves adding fail-secure stale-timer self-healing in Related open issues: |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/shields/index.ts`:
- Around line 151-171: saveShieldsState currently only strips the runtime marker
_hasStateFile before persisting, but loadShieldsState can also return _isCorrupt
and _corruptError which must not be written to disk; update the destructuring in
saveShieldsState (and any related merging logic) to remove _hasStateFile,
_isCorrupt, and _corruptError from current (the result of loadShieldsState) so
the resulting updated ShieldsState contains only persisted fields and then
continue writing that cleaned object to disk.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 69979908-ce69-4308-8e95-0a8503eea82f
📒 Files selected for processing (7)
src/lib/actions/sandbox/destroy.tssrc/lib/shields/audit.tssrc/lib/shields/index.test.tssrc/lib/shields/index.tssrc/lib/shields/timer.tssrc/lib/state/paths.tstest/image-cleanup.test.ts
✅ Files skipped from review due to trivial changes (1)
- src/lib/shields/timer.ts
|
Update: this PR now also includes the #3219 follow-up for destroy-time shields timer neutralization. What was added:
Why this does not conflict with the original #3204 stale-marker work:
|
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/shields/index.ts (1)
1273-1276:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFail closed when the shields state file is corrupt.
loadShieldsState()can now mark unreadable/invalid JSON as_isCorrupt, andshieldsStatus()already treats that as an error.isShieldsDown()ignores the same marker and returnstruebecausederiveShieldsMode()falls back to"mutable_default"whenshieldsDownis missing. That means any caller gating writes onisShieldsDown()will treat a corrupt state file as mutable, which breaks the fail-secure behavior this PR is adding.💡 Minimal safe fix
function isShieldsDown(sandboxName: string): boolean { const state = loadShieldsState(sandboxName); + if (state._isCorrupt) { + return false; + } const mode = deriveShieldsMode(state, state._hasStateFile); return mode !== "locked"; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/shields/index.ts` around lines 1273 - 1276, isShieldsDown currently ignores loadShieldsState's _isCorrupt flag and can return true (allow writes) when the state file is corrupt; update isShieldsDown to check state._isCorrupt immediately after calling loadShieldsState (before calling deriveShieldsMode) and, if _isCorrupt is truthy, return false to fail closed (consistent with shieldsStatus); keep the existing deriveShieldsMode flow for non-corrupt states so only corrupt state handling changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/shields/index.ts`:
- Around line 307-340: killTimer currently trusts only PID from readTimerMarker
and may kill a reused PID; update the TimerMarker shape to include a startTime
(or unique processToken) when the timer is created, then in killTimer (using
readTimerMarker, isProcessAlive, and clearTimerMarker) verify the stored
startTime/token matches the live process (e.g., compare process start time or
token read from /proc/<pid>/stat or from the timer process on creation) before
calling process.kill(marker.pid); if the verification fails treat the marker as
expired, clear it with clearTimerMarker and return without signaling, and
include any verification failure as a warning in the returned warnings array.
---
Outside diff comments:
In `@src/lib/shields/index.ts`:
- Around line 1273-1276: isShieldsDown currently ignores loadShieldsState's
_isCorrupt flag and can return true (allow writes) when the state file is
corrupt; update isShieldsDown to check state._isCorrupt immediately after
calling loadShieldsState (before calling deriveShieldsMode) and, if _isCorrupt
is truthy, return false to fail closed (consistent with shieldsStatus); keep the
existing deriveShieldsMode flow for non-corrupt states so only corrupt state
handling changes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 45bf084a-53f0-4be8-a3da-4dea68701301
📒 Files selected for processing (2)
src/lib/actions/sandbox/destroy.tssrc/lib/shields/index.ts
jyaunches
left a comment
There was a problem hiding this comment.
Architectural / correctness review for #3204
Thanks for updating this PR to cover the destroy-time timer artifact path from #3219. Directionally this is the right fix: destroy should not duplicate timer marker parsing, and shields should own timer lifecycle/neutralization semantics.
I think this PR is worth taking, but I would not merge it as-is yet. One item below is a correctness blocker; the rest are warnings/suggestions that can be handled here or split to follow-ups.
🔴 Blocker: timer process must validate marker identity before restore/write
src/lib/shields/timer.ts now accepts a processToken, and src/lib/shields/index.ts writes that token into shields-timer-<sandbox>.json. killTimer() also verifies process identity before signaling. That is good.
However, the detached timer process itself still does not check that its marker still exists and still matches the current timer invocation before it restores policy or writes shields-<sandbox>.json.
Current flow in timer.ts:
parseTimerArgs()acceptsprocessToken.runRestoreTimer()immediately checks snapshot existence, restores policy, locks config, and updates state.- It never verifies that
args.markerPathstill exists or matches:sandboxNamesnapshotPathrestoreAtIsoprocessToken- ideally
pid === process.pid
This leaves the core #3219 race open:
nemoclaw <sandbox> shields down --timeout ...starts a detached timer.nemoclaw destroy <sandbox>clears the marker and attempts to kill the timer.- If signaling fails or races with timer wake-up, the timer process can still continue.
- Because
timer.tsdoes not treat missing/mismatched marker as invalidation, it can still recreateshields-<sandbox>.jsonafter destroy.
Please add timer-side marker validation before any restore or state write. Missing/mismatched marker should mean: this timer is no longer authorized; exit without restoring or writing state.
Suggested shape:
function readTimerMarker(markerPath: string): UnknownRecord | null {
try {
if (!fs.existsSync(markerPath)) return null;
const parsed = JSON.parse(fs.readFileSync(markerPath, "utf-8"));
return isRecord(parsed) ? parsed : null;
} catch {
return null;
}
}
function markerMatchesCurrentTimer(args: TimerArgs): boolean {
const marker = readTimerMarker(args.markerPath);
return (
marker !== null &&
marker.pid === process.pid &&
marker.sandboxName === args.sandboxName &&
marker.snapshotPath === args.snapshotPath &&
marker.restoreAt === args.restoreAtIso &&
marker.processToken === args.processToken
);
}Then near the top of runRestoreTimer():
if (!markerMatchesCurrentTimer(args)) {
// Optional best-effort audit, but most importantly: do not restore/write state.
return;
}Please also add a behavioral test for at least one of:
- marker missing => timer does not call restore/update state
- process token mismatch => timer does not call restore/update state
- marker PID mismatch => timer does not call restore/update state
This is the key requirement needed for #3204 to truly close the core #3219 stale-destroy-timer bug.
🟡 Warning: tests are too source-pattern-heavy for process lifecycle behavior
Several new tests in src/lib/shields/index.test.ts assert that source code contains particular strings. That proves intent, but not behavior.
For this PR, the important properties are process/timer lifecycle properties. Please prefer behavioral tests with dependency injection/mocks/temp files for the critical paths, especially:
killTimer()reads marker, verifies identity, sendsSIGTERM, clears marker.- identity mismatch clears marker but does not signal the wrong PID.
- missing/mismatched timer marker prevents the detached timer from writing state.
- expired marker + dead PID triggers inline restore from
shields status.
The new test/image-cleanup.test.ts coverage is closer to what we need: real temp files + injected dependencies. The timer tests should follow that style where possible.
🟡 Warning: src/lib/shields/index.ts is becoming a timer/process-control module
This PR adds a lot of responsibility to src/lib/shields/index.ts:
- timer marker parsing
- marker clearing
- PID liveness checks
- process command-line identity verification
- timer neutralization
- inline recovery
- lockdown restoration orchestration
- corrupt-state handling
I would not block this bug fix on a full extraction, but architecturally these helpers should eventually move behind a focused boundary, e.g.:
src/lib/shields/timer-control.ts
owning:
readTimerMarkerclearTimerMarkerkillTimerisProcessAliveverifyTimerMarkerIdentity- marker result types
If you do not want to extract in this PR, please keep the public export surface minimal and consider filing a follow-up refactor issue.
🟡 Warning: PR #3169 overlaps the shields down policy path
PR #3169 changes shields down to use an agent-aware permissive policy path instead of PERMISSIVE_POLICY_PATH. This PR still uses PERMISSIVE_POLICY_PATH in the same area.
Not a blocker for this PR, but merge order matters:
- If #3204 lands first, #3169 needs to rebase.
- If #3169 lands first, #3204 should preserve the agent-aware policy resolver.
Please coordinate so we do not regress Hermes shields-down behavior.
🔵 Suggestion: audit event typing is drifting
src/lib/shields/timer.ts writes audit actions like shields_auto_restore_lock_warning through its local untyped appendAudit(), while src/lib/shields/audit.ts only types:
"shields_down" | "shields_up" | "shields_auto_restore" | "shields_up_failed"This is not a compile blocker because timer.ts bypasses appendAuditEntry(), but it means the audit schema is split. Please consider either widening ShieldsAuditEntry or filing a follow-up to route timer audit writes through the shared audit helper.
🔵 Suggestion: acknowledge state-dir fallback behavior change
The new resolveNemoclawStateDir() uses:
process.env.HOME ?? os.homedir()where existing call sites often used:
process.env.HOME ?? "/tmp"This is probably better, but it is a behavior change when HOME is unset. Please mention it in the PR or add a short comment/test if intentional.
✅ What is good here
- Destroy now calls a shields-owned cleanup helper instead of knowing all timer internals.
killTimer()returns structured warnings instead of silently swallowing everything.- Timer marker shape is stricter, including positive integer PID validation and optional process token.
shields statusself-healing for expired/dead timer markers is the right recovery surface for #3112.- Corrupt shields state now fails closed rather than being treated as “not configured.”
- Centralizing
.nemoclaw/stateresolution is aligned with the #3219 cleanup plan.
Recommendation
Merge after fixing the timer-side marker validation blocker.
Once that blocker is fixed, I think this PR can cover the core #3219 stale destroy/timer correctness bug. The destroy audit-entry and policy snapshot lifecycle items from #3219 can be split into follow-ups rather than blocking this PR.
|
Update (latest): Implemented the architectural follow-up you suggested:
Commit:
Previously addressed blocker for #3204:
Coordination notes:
Validation:
Push note:
|
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/lib/shields/timer-control.ts (1)
4-7: 💤 Low valueInconsistent module syntax: CommonJS
requirein a TypeScript file.This file uses CommonJS
require()syntax while other.tsfiles in this PR (e.g.,timer.ts,audit.ts) use ES moduleimportstatements. For consistency and to align with the project's TypeScript configuration, consider using ES module syntax.♻️ Suggested change to ES module imports
-const fs = require("fs"); -const path = require("path"); -const { execFileSync } = require("child_process"); -const { resolveNemoclawStateDir } = require("../state/paths"); +import fs from "node:fs"; +import path from "node:path"; +import { execFileSync } from "node:child_process"; +import { resolveNemoclawStateDir } from "../state/paths";🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/shields/timer-control.ts` around lines 4 - 7, The file uses CommonJS require() for modules (fs, path, child_process.execFileSync, and resolveNemoclawStateDir) but should use ES module imports to match the project's TypeScript style; update the top of src/lib/shields/timer-control.ts to import fs, path, { execFileSync } from "child_process" and import { resolveNemoclawStateDir } from "../state/paths" (keep the same symbol names so usages of execFileSync and resolveNemoclawStateDir in this file remain unchanged).src/lib/shields/timer.test.ts (1)
100-140: ⚡ Quick winTests should verify marker is NOT deleted when identity mismatches.
These tests verify that
runis not called and state remains unchanged when the marker identity mismatches. However, they don't assert that the marker file still exists after the call. Given the timer authorization model where a process should not clean up markers it doesn't own, consider adding:expect(fs.existsSync(markerPath)).toBe(true);This would help catch implementation bugs where the marker is incorrectly deleted.
Also applies to: 142-175
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/shields/timer.test.ts` around lines 100 - 140, Add an assertion to verify the marker file is not deleted when the marker identity mismatches: after invoking timer.runRestoreTimer with args produced by timer.parseTimerArgs, assert fs.existsSync(markerPath) is true (use the existing markerPath variable). Do the same for the similar test block referenced around lines 142-175 so both tests verify the marker file remains present when processToken/sandbox identity do not match.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/shields/timer.ts`:
- Around line 147-153: The finally block currently always calls
cleanupMarker(args.markerPath) even when markerMatchesCurrentTimer(args)
returned false; move the identity check out of the try or gate the cleanup so we
only remove the marker when we actually owned it: either perform the
markerMatchesCurrentTimer(args) check before entering the try in the surrounding
function (so an early return occurs before try/finally), or introduce a local
boolean (e.g., ownedMarker or shouldCleanup) set when
markerMatchesCurrentTimer(args) is true and then in the finally only call
cleanupMarker(args.markerPath) if that flag is true; reference
markerMatchesCurrentTimer, cleanupMarker, and args.markerPath when making the
change.
---
Nitpick comments:
In `@src/lib/shields/timer-control.ts`:
- Around line 4-7: The file uses CommonJS require() for modules (fs, path,
child_process.execFileSync, and resolveNemoclawStateDir) but should use ES
module imports to match the project's TypeScript style; update the top of
src/lib/shields/timer-control.ts to import fs, path, { execFileSync } from
"child_process" and import { resolveNemoclawStateDir } from "../state/paths"
(keep the same symbol names so usages of execFileSync and
resolveNemoclawStateDir in this file remain unchanged).
In `@src/lib/shields/timer.test.ts`:
- Around line 100-140: Add an assertion to verify the marker file is not deleted
when the marker identity mismatches: after invoking timer.runRestoreTimer with
args produced by timer.parseTimerArgs, assert fs.existsSync(markerPath) is true
(use the existing markerPath variable). Do the same for the similar test block
referenced around lines 142-175 so both tests verify the marker file remains
present when processToken/sandbox identity do not match.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 481312ee-4cfe-466e-81f1-68a7c2e10cc6
📒 Files selected for processing (7)
src/lib/actions/sandbox/destroy.tssrc/lib/shields/audit.tssrc/lib/shields/index.test.tssrc/lib/shields/index.tssrc/lib/shields/timer-control.tssrc/lib/shields/timer.test.tssrc/lib/shields/timer.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- src/lib/actions/sandbox/destroy.ts
- src/lib/shields/index.ts
prekshivyas
left a comment
There was a problem hiding this comment.
Fix shape looks right for #3112: shieldsStatus() self-heals via recoverExpiredAutoRestoreInline(), markerMatchesCurrentTimer() covers the destroy-race, and the NC-3112 tests at src/lib/shields/index.test.ts:331-470 hit both the happy and failure paths.
PR is CONFLICTING — please rebase before this can land.
🟡 isShieldsDown() skips the recovery gate
src/lib/shields/index.ts:1168-1176 just reads state. Callers src/lib/actions/sandbox/doctor.ts:626-628 and src/lib/actions/sandbox/status.ts:139 will keep reporting stale "down" after timer death — same fail-open shape #3112 is fixing, on a different API. Right after a timer-death incident, shields status will report UP (recovered) while doctor / status still say DOWN.
Factor the gate out of recoverExpiredAutoRestoreInline() and route both through it.
🟡 Audit field naming: restored_at vs restore_at
Both live in ShieldsAuditEntry. recoverExpiredAutoRestoreInline writes restored_at: nowIso (when the restore happened); src/lib/shields/timer.ts:113-120 writes restore_at: args.restoreAtIso (the originally-scheduled deadline). One-character difference will bite. Rename restore_at → scheduled_restore_at.
🔵 timer-control.ts uses CommonJS require()
src/lib/shields/timer-control.ts:4-7 uses require("fs") etc. — inconsistent with the rest of src/lib/ and likely Biome-flagged. Switch to import.
731a912 to
cf55cb7
Compare
|
@prekshivyas
Validation run:
Branch updated: |
The vi.mock targets `../policies` and `../sandbox-config` didn't match
timer.ts's actual imports of `../policy` and `../sandbox/config`, so the
mocks never intercepted: vitest fell through to the real `../policy`
which loaded `policy/index.ts` and crashed on `require("../runner")` in
that test environment. All four authorization tests failed without
exercising any assertion.
Correct the mock paths so the tests run against their intended fakes.
Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Co-authored-by: Revant Patel <111788366+ChunkyMonkey11@users.noreply.github.com>
recoverExpiredAutoRestoreInline() previously skipped inline recovery whenever marker.pid was alive. After a host reboot or OOM, that PID can be reassigned to an unrelated live process, which reproduces the NVIDIA#3112 fail-open in a different shape: timer is dead but recycled PID looks alive, recovery is blocked, shields stay DOWN. Treat a live PID as the recorded timer only when its /proc/<pid>/cmdline also matches the shields timer script, sandbox name, and processToken (reusing verifyTimerMarkerIdentity from timer-control). Otherwise proceed with inline restore. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-authored-by: Revant Patel <111788366+ChunkyMonkey11@users.noreply.github.com>
The destroy-time cleanup used a lazy `require("../../shields")` indirection
to fetch killTimer, presumably to avoid a circular import with shields/index.
Now that killTimer lives in shields/timer-control.ts — which has no
dependency on shields/index — the indirection is no longer load-bearing.
Import killTimer statically from timer-control. The runtime behaviour is
identical; the dep is now visible to TypeScript and the next reader.
Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Co-authored-by: Revant Patel <111788366+ChunkyMonkey11@users.noreply.github.com>
|
Pushed three commits to the branch via maintainer push (
Why:
All three carry Not pushed — leaving for your call:
Both pre-existing failures I worked around ( |
|
@ChunkyMonkey11 — Could you append a line like the following to the PR description? (or your real-name + Nvidia-email equivalent, whichever you normally use for DCO). That should turn the check green without needing a re-push. |
|
@prekshivyas Added the Signed-off-by: Revant Patel 111788366+ChunkyMonkey11@users.noreply.github.com |
|
@prekshivyas Thanks — I picked up both optional cleanups you flagged.
Validation:
|
…s-timer # Conflicts: # src/lib/actions/sandbox/destroy.ts
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
ericksoa
left a comment
There was a problem hiding this comment.
Approved. Reviewed current head 00631e3 after resolving the main conflict and fixing the replacement timer-marker ownership bug. Focused shields/config E2E passed: https://github.com/NVIDIA/NemoClaw/actions/runs/25776147771
Summary
shields statusfor temporary unlocks.shields-timer-<sandbox>.jsonmarkers where the timer PID is no longer alive.shields up.Bug Behavior
nemoclaw <name> shields down --timeout Nmdepends on a detached timer process. If that process dies before the deadline, shields can remain DOWN indefinitely whileshields statusshowsAuto-lockdown in: 0m 0s.Fix
src/lib/shields/index.ts,shieldsStatus()now checks stale/expired timer markers duringtemporarily_unlocked.restoreAtis in the past and the marker PID is dead, status triggers inline lockdown restore and then re-renders status.shields_auto_restoreaudit entry and remove the stale marker.shields_up_failed, keep/report DOWN, and print a clear manual recovery warning.shieldsUp()now clears timer marker files on successful completion (including already-locked fast path).Validation
npm run build:clinpx vitest run src/lib/shields/index.test.ts src/lib/shields/audit.test.tsnpx prettier --check src/lib/shields/index.ts src/lib/shields/index.test.tsgit diff --checknpm test(fails due to pre-existing unrelated failures in this environment:test/generate-openclaw-config.test.ts(2 cases),test/nemoclaw-start.test.ts(1 case),test/legacy-path-guard.test.tsenvironment/git-mv failure)Scope Note
This PR intentionally does not implement a persistent host watchdog/supervisor (
systemd/launchd/etc). That follow-up is proposed separately.Addresses #3112
Summary by CodeRabbit
New Features
Bug Fixes
Refactor
Tests
Signed-off-by: Revant revant.h.patel@gmail.com