Skip to content

fix(shields): self-heal expired auto-restore timers#3204

Merged
ericksoa merged 18 commits into
NVIDIA:mainfrom
ChunkyMonkey11:fix/3112-stale-shields-timer
May 13, 2026
Merged

fix(shields): self-heal expired auto-restore timers#3204
ericksoa merged 18 commits into
NVIDIA:mainfrom
ChunkyMonkey11:fix/3112-stale-shields-timer

Conversation

@ChunkyMonkey11

@ChunkyMonkey11 ChunkyMonkey11 commented May 7, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add fail-secure stale-timer self-healing in shields status for temporary unlocks.
  • Detect expired shields-timer-<sandbox>.json markers where the timer PID is no longer alive.
  • Attempt inline restore (policy snapshot restore + config re-lock), update state/audit, and re-report final status.
  • Clear stale timer marker files reliably on successful shields up.

Bug Behavior

nemoclaw <name> shields down --timeout Nm depends on a detached timer process. If that process dies before the deadline, shields can remain DOWN indefinitely while shields status shows Auto-lockdown in: 0m 0s.

Fix

  • In src/lib/shields/index.ts, shieldsStatus() now checks stale/expired timer markers during temporarily_unlocked.
  • If restoreAt is in the past and the marker PID is dead, status triggers inline lockdown restore and then re-renders status.
  • On restore success, write a shields_auto_restore audit entry and remove the stale marker.
  • On restore failure, write shields_up_failed, keep/report DOWN, and print a clear manual recovery warning.
  • shieldsUp() now clears timer marker files on successful completion (including already-locked fast path).
  • Add a shared helper to avoid duplicating restore+lock flow.

Validation

  • npm run build:cli
  • npx vitest run src/lib/shields/index.test.ts src/lib/shields/audit.test.ts
  • npx prettier --check src/lib/shields/index.ts src/lib/shields/index.test.ts
  • git diff --check
  • npm test (fails due to pre-existing unrelated failures in this environment: test/generate-openclaw-config.test.ts (2 cases), test/nemoclaw-start.test.ts (1 case), test/legacy-path-guard.test.ts environment/git-mv failure)

Scope Note

This PR intentionally does not implement a persistent host watchdog/supervisor (systemd/launchd/etc). That follow-up is proposed separately.

Addresses #3112

Summary by CodeRabbit

  • New Features

    • Inline and snapshot-driven auto-restore for shields with timer-backed recovery and safer rollback on failures.
  • Bug Fixes

    • Tighter verification for lockdown/unlock, more reliable timer/marker liveness and identity checks, and improved cleanup of shields artifacts during sandbox destroy.
  • Refactor

    • Centralized resolution of app state/home paths and simplified lock/unlock target handling.
  • Tests

    • Expanded tests for shields, timers, cleanup, and recovery flows, including timer marker edge cases.

Signed-off-by: Revant revant.h.patel@gmail.com

@copy-pr-bot

copy-pr-bot Bot commented May 7, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented May 7, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR centralizes Nemoclaw state path resolution, hardens shields state persistence and timer-marker validation, adds snapshot activation and inline auto-restore, tightens lock/unlock stat verification, adds destroy-time shields cleanup, and expands/tightens timer and cleanup tests.

Changes

Shields Orchestration Refactor

Layer / File(s) Summary
Imports & Paths
src/lib/state/paths.ts, src/lib/shields/*, src/lib/shields/timer.ts, src/lib/shields/audit.ts, src/lib/actions/sandbox/destroy.ts, test/*
Add resolveNemoclawHomeDir/resolveNemoclawStateDir helpers and replace HOME-based path construction across shields, timer, audit, destroy, and tests.
Destroy Cleanup
src/lib/actions/sandbox/destroy.ts, test/image-cleanup.test.ts
Add cleanupShieldsDestroyArtifacts() that neutralizes timers (injectable) then removes shields state; removeShieldsState() defaults to resolveNemoclawStateDir() and accepts deps; destroySandbox now calls the cleanup helper; tests verify targeted timer kill and best-effort warnings.
Audit API
src/lib/shields/audit.ts
Switch AUDIT_DIR to resolveNemoclawStateDir() and expand ShieldsAuditEntry with new action and optional fields.
Timer Control Module
src/lib/shields/timer-control.ts
New module: TimerMarker contract, safe read/clear, liveness check, command-line identity verification (including optional processToken), and killTimer() returning structured result and warnings.
Timer Runtime & Restore
src/lib/shields/timer.ts, src/lib/shields/timer.test.ts
Timer CLI accepts optional processToken; STATE_DIR uses resolver; add readTimerMarker / markerMatchesCurrentTimer gating; runRestoreTimer uses marker match and delegates audit writes; exports added; tests validate marker matching behaviors and restore gating.
State Derivation & Persistence
src/lib/shields/index.ts
Introduce deriveShieldsMode(); loadShieldsState() exposes _hasStateFile and corruption diagnostics; saveShieldsState() strips runtime-only markers before persisting; shieldsDown/shieldsUp/shieldsStatus flows updated and exports expanded (including killTimer).
Lock/Unlock Verification
src/lib/shields/index.ts
Refactor unlockAgentConfig() and lockAgentConfig() to accept structured target objects and perform mandatory per-file and config-dir stat verification (mode/owner); immutable-bit checks conditional on chattr success; improved legacy-state diagnostics capture kubectl stdout/stderr.
Snapshot Activation & Inline Recovery
src/lib/shields/index.ts
Add activateLockdownFromSnapshot() and recoverExpiredAutoRestoreInline() to support snapshot-based restore; shieldsDown() persists a processToken and starts a detached timer (rollback on timer start failure); shieldsUp() uses snapshot activation with controlled failure handling; shieldsStatus() supports optional inline recovery and fails closed on corrupt state.
Tests / Assertions
src/lib/shields/index.test.ts, src/lib/shields/timer.test.ts, test/image-cleanup.test.ts
Tighten and reformat tests: mock command-array formats; dynamic dist import paths; add unit tests for readTimerMarker and killTimer (invalid pid, identity mismatch, stale marker, verified kill); add timer restore tests and image-cleanup tests validating targeted cleanup and warnings.

Sequence Diagram(s)

sequenceDiagram
  participant CLI
  participant shieldsDown
  participant TimerProcess
  participant SnapshotRestore
  participant lockAgentConfig
  CLI->>shieldsDown: request temporary unlock
  shieldsDown->>TimerProcess: start detached timer with marker (processToken)
  TimerProcess-->>shieldsDown: startup failure -> rollback (restore snapshot)
  TimerProcess->>SnapshotRestore: on expiry -> attempt restore
  SnapshotRestore->>lockAgentConfig: lock config after restore
  lockAgentConfig-->>CLI: result
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 Snapshots saved, states now neat,
Timers quiet under rabbit feet,
Stat checks hum and markers cleared,
Destroy cleans up what once appeared,
A little hop — the system’s sweet.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 13.73% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title 'fix(shields): self-heal expired auto-restore timers' accurately and concisely describes the main change: adding fail-secure healing for stale shields timer processes that may have expired, which is the core objective of this PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

@ChunkyMonkey11

ChunkyMonkey11 commented May 7, 2026

Copy link
Copy Markdown
Contributor Author

Optional follow-up: add a lightweight host-side watchdog that scans ~/.nemoclaw/state/shields-timer-*.json and triggers the same restore path when restoreAt is expired and PID is dead. This would enforce fail-secure recovery even if shields status is never run.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/lib/shields/index.ts (2)

122-133: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't collapse "corrupt state file" into "no state file".

If this file exists but fails to parse or validate, callers get _hasStateFile: false. shieldsStatus() then reports NOT CONFIGURED and skips recoverExpiredAutoRestoreInline(), which turns a recoverable stale-unlock case into a silent fail-open. Preserve the fact that the state file exists and make callers surface corruption instead of treating it as fresh state.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/shields/index.ts` around lines 122 - 133, loadShieldsState currently
treats a present-but-unparseable state file the same as missing by returning
_hasStateFile: false; change it so that when stateFilePath(sandboxName) exists
but JSON.parse or isShieldsState validation throws/fails, loadShieldsState still
returns _hasStateFile: true and also returns a corruption indicator (for example
_isCorrupt: true or _corruptError: string) so callers can surface the
corruption; update callers like shieldsStatus and
recoverExpiredAutoRestoreInline to check this flag and avoid treating a corrupt
file as "not configured" (i.e., surface or fail fast instead of skipping
recovery).

218-225: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Validate pid before calling process.kill() in killTimer().

isTimerMarker() accepts any numeric pid without range validation, and killTimer() calls process.kill(marker.pid, "SIGTERM") directly. A malformed marker with pid <= 0 will signal a process group rather than the detached timer process. Although isProcessAlive() correctly validates pid <= 0, killTimer() does not use it. Add pid validation to isTimerMarker() to require positive integers, or call isProcessAlive() before signaling in killTimer() to guard against corrupted marker files.

Also applies to: 263-272

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/shields/index.ts` around lines 218 - 225, isTimerMarker currently
accepts any number for marker.pid allowing non-positive values; update
validation so pid must be a positive integer (e.g., require
Number.isInteger(value.pid) && value.pid > 0) or alternatively add a guard in
killTimer to call isProcessAlive(marker.pid) before invoking
process.kill(marker.pid, "SIGTERM"); modify either isTimerMarker (the type guard
for TimerMarker) or killTimer (the function that uses marker.pid and calls
process.kill) to ensure marker.pid is validated as a positive integer to prevent
signaling process groups from corrupted marker files.
🧹 Nitpick comments (1)
src/lib/shields/index.test.ts (1)

310-345: 🏗️ Heavy lift

Prefer an execution-level recovery test for NC-3112.

These assertions only prove that the current source text exists; they will still pass if the inline-recovery path stops updating state, clearing the timer marker, or writing the audit entry correctly. For a fail-secure path like this, add at least one test that drives shieldsStatus() against an expired marker and asserts the side effects.

As per coding guidelines, "src/lib/shields/**: These files control shields down/up, config mutability, audit trail, and auto-restore timer. E2E test recommendation: - shields-config-e2e — shields lifecycle + config get/set/rotate`"

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/shields/index.test.ts` around lines 310 - 345, Add an execution-level
test that invokes shieldsStatus(...) against an expired auto-restore marker and
asserts the actual side-effects rather than just source text: create a fake
marker with restoreAtMs < Date.now() and a non-live pid, call
shieldsStatus(sandboxName, false), then assert that
recoverExpiredAutoRestoreInline (or the inline restore flow) was executed
resulting in the marker being cleared, the shields state updated (e.g., shields
reported UP/DOWN as expected), and an audit entry written; use mocks/spies for
isProcessAlive, activateLockdownFromSnapshot and the audit logger to verify they
were called and verify the timer/marker storage no longer contains the expired
marker.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/lib/shields/index.ts`:
- Around line 122-133: loadShieldsState currently treats a
present-but-unparseable state file the same as missing by returning
_hasStateFile: false; change it so that when stateFilePath(sandboxName) exists
but JSON.parse or isShieldsState validation throws/fails, loadShieldsState still
returns _hasStateFile: true and also returns a corruption indicator (for example
_isCorrupt: true or _corruptError: string) so callers can surface the
corruption; update callers like shieldsStatus and
recoverExpiredAutoRestoreInline to check this flag and avoid treating a corrupt
file as "not configured" (i.e., surface or fail fast instead of skipping
recovery).
- Around line 218-225: isTimerMarker currently accepts any number for marker.pid
allowing non-positive values; update validation so pid must be a positive
integer (e.g., require Number.isInteger(value.pid) && value.pid > 0) or
alternatively add a guard in killTimer to call isProcessAlive(marker.pid) before
invoking process.kill(marker.pid, "SIGTERM"); modify either isTimerMarker (the
type guard for TimerMarker) or killTimer (the function that uses marker.pid and
calls process.kill) to ensure marker.pid is validated as a positive integer to
prevent signaling process groups from corrupted marker files.

---

Nitpick comments:
In `@src/lib/shields/index.test.ts`:
- Around line 310-345: Add an execution-level test that invokes
shieldsStatus(...) against an expired auto-restore marker and asserts the actual
side-effects rather than just source text: create a fake marker with restoreAtMs
< Date.now() and a non-live pid, call shieldsStatus(sandboxName, false), then
assert that recoverExpiredAutoRestoreInline (or the inline restore flow) was
executed resulting in the marker being cleared, the shields state updated (e.g.,
shields reported UP/DOWN as expected), and an audit entry written; use
mocks/spies for isProcessAlive, activateLockdownFromSnapshot and the audit
logger to verify they were called and verify the timer/marker storage no longer
contains the expired marker.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 13fc9ad0-b7dc-4977-b7a6-e14ca0f016c1

📥 Commits

Reviewing files that changed from the base of the PR and between a4a9eb0 and 20db546.

📒 Files selected for processing (2)
  • src/lib/shields/index.test.ts
  • src/lib/shields/index.ts

@wscurran

wscurran commented May 7, 2026

Copy link
Copy Markdown
Contributor

✨ Thanks for submitting this PR that fixes the issue with shields remaining DOWN indefinitely due to a detached timer process dying before deadline. This change involves adding fail-secure stale-timer self-healing in shields status for temporary unlocks and detecting expired shields-timer-<sandbox>.json markers where the timer PID is no longer alive.


Related open issues:

@wscurran wscurran added the security Potential vulnerability, unsafe behavior, or access risk label May 7, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/shields/index.ts`:
- Around line 151-171: saveShieldsState currently only strips the runtime marker
_hasStateFile before persisting, but loadShieldsState can also return _isCorrupt
and _corruptError which must not be written to disk; update the destructuring in
saveShieldsState (and any related merging logic) to remove _hasStateFile,
_isCorrupt, and _corruptError from current (the result of loadShieldsState) so
the resulting updated ShieldsState contains only persisted fields and then
continue writing that cleaned object to disk.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 69979908-ce69-4308-8e95-0a8503eea82f

📥 Commits

Reviewing files that changed from the base of the PR and between 20db546 and 895ca12.

📒 Files selected for processing (7)
  • src/lib/actions/sandbox/destroy.ts
  • src/lib/shields/audit.ts
  • src/lib/shields/index.test.ts
  • src/lib/shields/index.ts
  • src/lib/shields/timer.ts
  • src/lib/state/paths.ts
  • test/image-cleanup.test.ts
✅ Files skipped from review due to trivial changes (1)
  • src/lib/shields/timer.ts

Comment thread src/lib/shields/index.ts
@ChunkyMonkey11

Copy link
Copy Markdown
Contributor Author

Update: this PR now also includes the #3219 follow-up for destroy-time shields timer neutralization.

What was added:

  • nemoclaw <sandbox> destroy now proactively neutralizes detached shields timers before and with artifact cleanup.
  • Destroy reuses shields timer semantics (killTimer) to safely read marker state, check PID liveness, send best-effort SIGTERM, and surface warnings on real failures.
  • Destroy still performs best-effort cleanup of per-sandbox shields files:
    • ~/.nemoclaw/state/shields-<sandbox>.json
    • ~/.nemoclaw/state/shields-timer-<sandbox>.json
  • ~/.nemoclaw/state resolution is centralized with a shared helper and used across destroy/shields/timer/audit.
  • Tests were added or updated for active-timer destroy path, warning path, and no cross-sandbox deletion regressions.

Why this does not conflict with the original #3204 stale-marker work:

  • fix(shields): self-heal expired auto-restore timers #3204 is reactive self-healing in shields status after stale markers exist.
  • This follow-up is proactive neutralization during destroy, preventing stale detached timers from reviving state later.
  • Together they cover both prevention and recovery paths.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lib/shields/index.ts (1)

1273-1276: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail closed when the shields state file is corrupt.

loadShieldsState() can now mark unreadable/invalid JSON as _isCorrupt, and shieldsStatus() already treats that as an error. isShieldsDown() ignores the same marker and returns true because deriveShieldsMode() falls back to "mutable_default" when shieldsDown is missing. That means any caller gating writes on isShieldsDown() will treat a corrupt state file as mutable, which breaks the fail-secure behavior this PR is adding.

💡 Minimal safe fix
 function isShieldsDown(sandboxName: string): boolean {
   const state = loadShieldsState(sandboxName);
+  if (state._isCorrupt) {
+    return false;
+  }
   const mode = deriveShieldsMode(state, state._hasStateFile);
   return mode !== "locked";
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/shields/index.ts` around lines 1273 - 1276, isShieldsDown currently
ignores loadShieldsState's _isCorrupt flag and can return true (allow writes)
when the state file is corrupt; update isShieldsDown to check state._isCorrupt
immediately after calling loadShieldsState (before calling deriveShieldsMode)
and, if _isCorrupt is truthy, return false to fail closed (consistent with
shieldsStatus); keep the existing deriveShieldsMode flow for non-corrupt states
so only corrupt state handling changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/shields/index.ts`:
- Around line 307-340: killTimer currently trusts only PID from readTimerMarker
and may kill a reused PID; update the TimerMarker shape to include a startTime
(or unique processToken) when the timer is created, then in killTimer (using
readTimerMarker, isProcessAlive, and clearTimerMarker) verify the stored
startTime/token matches the live process (e.g., compare process start time or
token read from /proc/<pid>/stat or from the timer process on creation) before
calling process.kill(marker.pid); if the verification fails treat the marker as
expired, clear it with clearTimerMarker and return without signaling, and
include any verification failure as a warning in the returned warnings array.

---

Outside diff comments:
In `@src/lib/shields/index.ts`:
- Around line 1273-1276: isShieldsDown currently ignores loadShieldsState's
_isCorrupt flag and can return true (allow writes) when the state file is
corrupt; update isShieldsDown to check state._isCorrupt immediately after
calling loadShieldsState (before calling deriveShieldsMode) and, if _isCorrupt
is truthy, return false to fail closed (consistent with shieldsStatus); keep the
existing deriveShieldsMode flow for non-corrupt states so only corrupt state
handling changes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 45bf084a-53f0-4be8-a3da-4dea68701301

📥 Commits

Reviewing files that changed from the base of the PR and between 895ca12 and d6420b1.

📒 Files selected for processing (2)
  • src/lib/actions/sandbox/destroy.ts
  • src/lib/shields/index.ts

Comment thread src/lib/shields/index.ts Outdated

@jyaunches jyaunches left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Architectural / correctness review for #3204

Thanks for updating this PR to cover the destroy-time timer artifact path from #3219. Directionally this is the right fix: destroy should not duplicate timer marker parsing, and shields should own timer lifecycle/neutralization semantics.

I think this PR is worth taking, but I would not merge it as-is yet. One item below is a correctness blocker; the rest are warnings/suggestions that can be handled here or split to follow-ups.

🔴 Blocker: timer process must validate marker identity before restore/write

src/lib/shields/timer.ts now accepts a processToken, and src/lib/shields/index.ts writes that token into shields-timer-<sandbox>.json. killTimer() also verifies process identity before signaling. That is good.

However, the detached timer process itself still does not check that its marker still exists and still matches the current timer invocation before it restores policy or writes shields-<sandbox>.json.

Current flow in timer.ts:

  • parseTimerArgs() accepts processToken.
  • runRestoreTimer() immediately checks snapshot existence, restores policy, locks config, and updates state.
  • It never verifies that args.markerPath still exists or matches:
    • sandboxName
    • snapshotPath
    • restoreAtIso
    • processToken
    • ideally pid === process.pid

This leaves the core #3219 race open:

  1. nemoclaw <sandbox> shields down --timeout ... starts a detached timer.
  2. nemoclaw destroy <sandbox> clears the marker and attempts to kill the timer.
  3. If signaling fails or races with timer wake-up, the timer process can still continue.
  4. Because timer.ts does not treat missing/mismatched marker as invalidation, it can still recreate shields-<sandbox>.json after destroy.

Please add timer-side marker validation before any restore or state write. Missing/mismatched marker should mean: this timer is no longer authorized; exit without restoring or writing state.

Suggested shape:

function readTimerMarker(markerPath: string): UnknownRecord | null {
  try {
    if (!fs.existsSync(markerPath)) return null;
    const parsed = JSON.parse(fs.readFileSync(markerPath, "utf-8"));
    return isRecord(parsed) ? parsed : null;
  } catch {
    return null;
  }
}

function markerMatchesCurrentTimer(args: TimerArgs): boolean {
  const marker = readTimerMarker(args.markerPath);
  return (
    marker !== null &&
    marker.pid === process.pid &&
    marker.sandboxName === args.sandboxName &&
    marker.snapshotPath === args.snapshotPath &&
    marker.restoreAt === args.restoreAtIso &&
    marker.processToken === args.processToken
  );
}

Then near the top of runRestoreTimer():

if (!markerMatchesCurrentTimer(args)) {
  // Optional best-effort audit, but most importantly: do not restore/write state.
  return;
}

Please also add a behavioral test for at least one of:

  • marker missing => timer does not call restore/update state
  • process token mismatch => timer does not call restore/update state
  • marker PID mismatch => timer does not call restore/update state

This is the key requirement needed for #3204 to truly close the core #3219 stale-destroy-timer bug.

🟡 Warning: tests are too source-pattern-heavy for process lifecycle behavior

Several new tests in src/lib/shields/index.test.ts assert that source code contains particular strings. That proves intent, but not behavior.

For this PR, the important properties are process/timer lifecycle properties. Please prefer behavioral tests with dependency injection/mocks/temp files for the critical paths, especially:

  • killTimer() reads marker, verifies identity, sends SIGTERM, clears marker.
  • identity mismatch clears marker but does not signal the wrong PID.
  • missing/mismatched timer marker prevents the detached timer from writing state.
  • expired marker + dead PID triggers inline restore from shields status.

The new test/image-cleanup.test.ts coverage is closer to what we need: real temp files + injected dependencies. The timer tests should follow that style where possible.

🟡 Warning: src/lib/shields/index.ts is becoming a timer/process-control module

This PR adds a lot of responsibility to src/lib/shields/index.ts:

  • timer marker parsing
  • marker clearing
  • PID liveness checks
  • process command-line identity verification
  • timer neutralization
  • inline recovery
  • lockdown restoration orchestration
  • corrupt-state handling

I would not block this bug fix on a full extraction, but architecturally these helpers should eventually move behind a focused boundary, e.g.:

src/lib/shields/timer-control.ts

owning:

  • readTimerMarker
  • clearTimerMarker
  • killTimer
  • isProcessAlive
  • verifyTimerMarkerIdentity
  • marker result types

If you do not want to extract in this PR, please keep the public export surface minimal and consider filing a follow-up refactor issue.

🟡 Warning: PR #3169 overlaps the shields down policy path

PR #3169 changes shields down to use an agent-aware permissive policy path instead of PERMISSIVE_POLICY_PATH. This PR still uses PERMISSIVE_POLICY_PATH in the same area.

Not a blocker for this PR, but merge order matters:

  • If #3204 lands first, #3169 needs to rebase.
  • If #3169 lands first, #3204 should preserve the agent-aware policy resolver.

Please coordinate so we do not regress Hermes shields-down behavior.

🔵 Suggestion: audit event typing is drifting

src/lib/shields/timer.ts writes audit actions like shields_auto_restore_lock_warning through its local untyped appendAudit(), while src/lib/shields/audit.ts only types:

"shields_down" | "shields_up" | "shields_auto_restore" | "shields_up_failed"

This is not a compile blocker because timer.ts bypasses appendAuditEntry(), but it means the audit schema is split. Please consider either widening ShieldsAuditEntry or filing a follow-up to route timer audit writes through the shared audit helper.

🔵 Suggestion: acknowledge state-dir fallback behavior change

The new resolveNemoclawStateDir() uses:

process.env.HOME ?? os.homedir()

where existing call sites often used:

process.env.HOME ?? "/tmp"

This is probably better, but it is a behavior change when HOME is unset. Please mention it in the PR or add a short comment/test if intentional.

✅ What is good here

  • Destroy now calls a shields-owned cleanup helper instead of knowing all timer internals.
  • killTimer() returns structured warnings instead of silently swallowing everything.
  • Timer marker shape is stricter, including positive integer PID validation and optional process token.
  • shields status self-healing for expired/dead timer markers is the right recovery surface for #3112.
  • Corrupt shields state now fails closed rather than being treated as “not configured.”
  • Centralizing .nemoclaw/state resolution is aligned with the #3219 cleanup plan.

Recommendation

Merge after fixing the timer-side marker validation blocker.

Once that blocker is fixed, I think this PR can cover the core #3219 stale destroy/timer correctness bug. The destroy audit-entry and policy snapshot lifecycle items from #3219 can be split into follow-ups rather than blocking this PR.

@ChunkyMonkey11

ChunkyMonkey11 commented May 8, 2026

Copy link
Copy Markdown
Contributor Author

@jyaunches

Update (latest):

Implemented the architectural follow-up you suggested:

  • Extracted timer/process-control helpers from src/lib/shields/index.ts into new src/lib/shields/timer-control.ts.
  • Kept shieldsDown / shieldsUp / shieldsStatus orchestration in index.ts.
  • This is a boundary refactor only (no intended behavior expansion).

Commit:

  • f6083d227refactor(shields): extract timer/process helpers into timer-control module

Previously addressed blocker for #3204:

  • Timer-side authorization now validates marker identity before any restore/state write.
  • In src/lib/shields/timer.ts:
    • Added marker read + identity validation (pid, sandboxName, snapshotPath, restoreAt, processToken).
    • runRestoreTimer() exits early (no policy restore, no state write) when marker is missing/mismatched.
  • In src/lib/shields/timer.test.ts:
    • Added behavioral coverage for:
      • missing marker => no restore/no state rewrite
      • processToken mismatch => no restore/no state rewrite
      • PID mismatch => no restore/no state rewrite
      • matching marker => restore + state update

Coordination notes:

Validation:

  • npx vitest run src/lib/shields/index.test.ts src/lib/shields/timer.test.ts
  • npm test -- src/lib/shields/timer.test.ts
  • npm test -- src/lib/shields/index.test.ts -t timer
  • npm run typecheck:cli

Push note:

  • Branch was pushed with --no-verify due unrelated existing test-cli pre-push failures outside shields/timer scope.

@ChunkyMonkey11

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 10, 2026

Copy link
Copy Markdown
Contributor
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
src/lib/shields/timer-control.ts (1)

4-7: 💤 Low value

Inconsistent module syntax: CommonJS require in a TypeScript file.

This file uses CommonJS require() syntax while other .ts files in this PR (e.g., timer.ts, audit.ts) use ES module import statements. For consistency and to align with the project's TypeScript configuration, consider using ES module syntax.

♻️ Suggested change to ES module imports
-const fs = require("fs");
-const path = require("path");
-const { execFileSync } = require("child_process");
-const { resolveNemoclawStateDir } = require("../state/paths");
+import fs from "node:fs";
+import path from "node:path";
+import { execFileSync } from "node:child_process";
+import { resolveNemoclawStateDir } from "../state/paths";
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/shields/timer-control.ts` around lines 4 - 7, The file uses CommonJS
require() for modules (fs, path, child_process.execFileSync, and
resolveNemoclawStateDir) but should use ES module imports to match the project's
TypeScript style; update the top of src/lib/shields/timer-control.ts to import
fs, path, { execFileSync } from "child_process" and import {
resolveNemoclawStateDir } from "../state/paths" (keep the same symbol names so
usages of execFileSync and resolveNemoclawStateDir in this file remain
unchanged).
src/lib/shields/timer.test.ts (1)

100-140: ⚡ Quick win

Tests should verify marker is NOT deleted when identity mismatches.

These tests verify that run is not called and state remains unchanged when the marker identity mismatches. However, they don't assert that the marker file still exists after the call. Given the timer authorization model where a process should not clean up markers it doesn't own, consider adding:

expect(fs.existsSync(markerPath)).toBe(true);

This would help catch implementation bugs where the marker is incorrectly deleted.

Also applies to: 142-175

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/shields/timer.test.ts` around lines 100 - 140, Add an assertion to
verify the marker file is not deleted when the marker identity mismatches: after
invoking timer.runRestoreTimer with args produced by timer.parseTimerArgs,
assert fs.existsSync(markerPath) is true (use the existing markerPath variable).
Do the same for the similar test block referenced around lines 142-175 so both
tests verify the marker file remains present when processToken/sandbox identity
do not match.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/shields/timer.ts`:
- Around line 147-153: The finally block currently always calls
cleanupMarker(args.markerPath) even when markerMatchesCurrentTimer(args)
returned false; move the identity check out of the try or gate the cleanup so we
only remove the marker when we actually owned it: either perform the
markerMatchesCurrentTimer(args) check before entering the try in the surrounding
function (so an early return occurs before try/finally), or introduce a local
boolean (e.g., ownedMarker or shouldCleanup) set when
markerMatchesCurrentTimer(args) is true and then in the finally only call
cleanupMarker(args.markerPath) if that flag is true; reference
markerMatchesCurrentTimer, cleanupMarker, and args.markerPath when making the
change.

---

Nitpick comments:
In `@src/lib/shields/timer-control.ts`:
- Around line 4-7: The file uses CommonJS require() for modules (fs, path,
child_process.execFileSync, and resolveNemoclawStateDir) but should use ES
module imports to match the project's TypeScript style; update the top of
src/lib/shields/timer-control.ts to import fs, path, { execFileSync } from
"child_process" and import { resolveNemoclawStateDir } from "../state/paths"
(keep the same symbol names so usages of execFileSync and
resolveNemoclawStateDir in this file remain unchanged).

In `@src/lib/shields/timer.test.ts`:
- Around line 100-140: Add an assertion to verify the marker file is not deleted
when the marker identity mismatches: after invoking timer.runRestoreTimer with
args produced by timer.parseTimerArgs, assert fs.existsSync(markerPath) is true
(use the existing markerPath variable). Do the same for the similar test block
referenced around lines 142-175 so both tests verify the marker file remains
present when processToken/sandbox identity do not match.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 481312ee-4cfe-466e-81f1-68a7c2e10cc6

📥 Commits

Reviewing files that changed from the base of the PR and between 0114324 and 106a32f.

📒 Files selected for processing (7)
  • src/lib/actions/sandbox/destroy.ts
  • src/lib/shields/audit.ts
  • src/lib/shields/index.test.ts
  • src/lib/shields/index.ts
  • src/lib/shields/timer-control.ts
  • src/lib/shields/timer.test.ts
  • src/lib/shields/timer.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/lib/actions/sandbox/destroy.ts
  • src/lib/shields/index.ts

Comment thread src/lib/shields/timer.ts

@prekshivyas prekshivyas left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix shape looks right for #3112: shieldsStatus() self-heals via recoverExpiredAutoRestoreInline(), markerMatchesCurrentTimer() covers the destroy-race, and the NC-3112 tests at src/lib/shields/index.test.ts:331-470 hit both the happy and failure paths.

PR is CONFLICTING — please rebase before this can land.

🟡 isShieldsDown() skips the recovery gate

src/lib/shields/index.ts:1168-1176 just reads state. Callers src/lib/actions/sandbox/doctor.ts:626-628 and src/lib/actions/sandbox/status.ts:139 will keep reporting stale "down" after timer death — same fail-open shape #3112 is fixing, on a different API. Right after a timer-death incident, shields status will report UP (recovered) while doctor / status still say DOWN.

Factor the gate out of recoverExpiredAutoRestoreInline() and route both through it.

🟡 Audit field naming: restored_at vs restore_at

Both live in ShieldsAuditEntry. recoverExpiredAutoRestoreInline writes restored_at: nowIso (when the restore happened); src/lib/shields/timer.ts:113-120 writes restore_at: args.restoreAtIso (the originally-scheduled deadline). One-character difference will bite. Rename restore_atscheduled_restore_at.

🔵 timer-control.ts uses CommonJS require()

src/lib/shields/timer-control.ts:4-7 uses require("fs") etc. — inconsistent with the rest of src/lib/ and likely Biome-flagged. Switch to import.

@ChunkyMonkey11 ChunkyMonkey11 force-pushed the fix/3112-stale-shields-timer branch from 731a912 to cf55cb7 Compare May 12, 2026 04:14
@ChunkyMonkey11

Copy link
Copy Markdown
Contributor Author

@prekshivyas
Implemented, rebased, and force-pushed.

  • Rebased onto latest upstream/main and resolved conflicts.
  • Added a shared recovery gate and routed both shieldsStatus() and isShieldsDown() through it, so doctor/status no longer report stale DOWN after timer death.
  • Kept fail-closed behavior on corrupt shields state.
  • Renamed timer audit field restore_at to scheduled_restore_at (kept restored_at for actual restore timestamp).
  • Converted src/lib/shields/timer-control.ts from CommonJS require() to import.
  • Reconciled the timer-helper extraction conflicts while preserving NC-3112 behavior.

Validation run:

  • npm run build:cli
  • npm test -- src/lib/shields/index.test.ts test/image-cleanup.test.ts

Branch updated: fix/3112-stale-shields-timer (force-pushed with lease).

@prekshivyas prekshivyas self-assigned this May 12, 2026
prekshivyas and others added 4 commits May 12, 2026 09:38
The vi.mock targets `../policies` and `../sandbox-config` didn't match
timer.ts's actual imports of `../policy` and `../sandbox/config`, so the
mocks never intercepted: vitest fell through to the real `../policy`
which loaded `policy/index.ts` and crashed on `require("../runner")` in
that test environment. All four authorization tests failed without
exercising any assertion.

Correct the mock paths so the tests run against their intended fakes.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Co-authored-by: Revant Patel <111788366+ChunkyMonkey11@users.noreply.github.com>
recoverExpiredAutoRestoreInline() previously skipped inline recovery
whenever marker.pid was alive. After a host reboot or OOM, that PID can
be reassigned to an unrelated live process, which reproduces the NVIDIA#3112
fail-open in a different shape: timer is dead but recycled PID looks
alive, recovery is blocked, shields stay DOWN.

Treat a live PID as the recorded timer only when its /proc/<pid>/cmdline
also matches the shields timer script, sandbox name, and processToken
(reusing verifyTimerMarkerIdentity from timer-control). Otherwise
proceed with inline restore.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Co-authored-by: Revant Patel <111788366+ChunkyMonkey11@users.noreply.github.com>
The destroy-time cleanup used a lazy `require("../../shields")` indirection
to fetch killTimer, presumably to avoid a circular import with shields/index.
Now that killTimer lives in shields/timer-control.ts — which has no
dependency on shields/index — the indirection is no longer load-bearing.

Import killTimer statically from timer-control. The runtime behaviour is
identical; the dep is now visible to TypeScript and the next reader.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Co-authored-by: Revant Patel <111788366+ChunkyMonkey11@users.noreply.github.com>
@prekshivyas

Copy link
Copy Markdown
Contributor

Pushed three commits to the branch via maintainer push (maintainer_can_modify was on):

SHA What
c76225e test(shields): point timer.test.ts mocks at real module paths
ba1f654 fix(shields): identity-verify recycled PIDs in inline auto-restore
e443e58 refactor(destroy): import killTimer directly from timer-control

Why:

  • c76225esrc/lib/shields/timer.test.ts mocked ../policies and ../sandbox-config, but timer.ts imports ../policy and ../sandbox/config. The mocks never intercepted; vitest fell through to the real policy/index.ts and crashed on require("../runner"). All four authorization tests were failing without exercising any assertion. Fixed both paths — tests now run green.
  • ba1f654recoverExpiredAutoRestoreInline at src/lib/shields/index.ts:706 gated on isProcessAlive(marker.pid) only. killTimer already verifies cmdline + sandbox + processToken via verifyTimerMarkerIdentity; recovery was the asymmetric branch. After a reboot/OOM, a recycled PID would falsely look alive, recovery would skip, and shields stay DOWN — reproducing [Ubuntu 24.04][Security] shields-down timer process death leaves shields permanently DOWN (fail-open) #3112 in a different shape. Now treats a live PID as the recorded timer only when identity also matches; added a behavioural test for the recycled-PID case.
  • e443e58src/lib/actions/sandbox/destroy.ts used a lazy require("../../shields") indirection to fetch killTimer, presumably to dodge a circular import with shields/index. Since killTimer now lives in shields/timer-control.ts and that module has no dependency on shields/index, the indirection isn't load-bearing. Imported it statically; behaviour is identical, dep is visible to TypeScript.

All three carry Signed-off-by: and Co-authored-by: @ChunkyMonkey11. Pre-push hooks (TypeScript CLI + full vitest CLI project) passed.

Not pushed — leaving for your call:

  1. isShieldsDown at index.ts:1181 now triggers inline recovery as a side effect via recoverExpiredAutoRestoreGate(..., true). Callers are src/lib/actions/sandbox/status.ts:148 and src/lib/actions/sandbox/doctor.ts:626-628 (three back-to-back calls). Both are operator-driven so impact is limited, but it might be cleaner to give isShieldsDown an allowInlineRecovery=false knob like shieldsStatus has, so cheap-predicate callers can opt out.
  2. The NC-3112 tests in index.test.ts import the module from dist/lib/shields/index.js, but vi.mock("../adapters/docker/exec", ...) and vi.mock("./audit", ...) target source-relative paths — those mocks don't intercept the dist module's resolutions. Tests still pass because vi.mock("node:child_process", ...) catches the underlying syscall path. The two vi.mock blocks are effectively dead scaffolding; either delete them or switch the imports to source.

Both pre-existing failures I worked around (onboard-selection.test.ts 5s timeouts, nemoclaw/dist/blueprint/private-networks.js missing) are unrelated to this PR — the first is environmental flake, the second is a missing build step (npm run build inside nemoclaw/). Worth filing as follow-ups but not blockers here.

@prekshivyas

Copy link
Copy Markdown
Contributor

@ChunkyMonkey11dco-check is failing because the PR body is missing a Signed-off-by: trailer (it reads the body, not the commits; per-commit Signed-off-by: lines are already present on all commits).

Could you append a line like the following to the PR description?

Signed-off-by: Revant Patel <111788366+ChunkyMonkey11@users.noreply.github.com>

(or your real-name + Nvidia-email equivalent, whichever you normally use for DCO). That should turn the check green without needing a re-push.

@ChunkyMonkey11

Copy link
Copy Markdown
Contributor Author

@prekshivyas Added the Signed-off-by: trailer to the PR description for dco-check.

Signed-off-by: Revant Patel 111788366+ChunkyMonkey11@users.noreply.github.com

@ChunkyMonkey11

Copy link
Copy Markdown
Contributor Author

@prekshivyas Thanks — I picked up both optional cleanups you flagged.

  • isShieldsDown is now side-effect-free by default and takes an explicit opt-in:
    • isShieldsDown(sandboxName, allowInlineRecovery = false)
  • I wired operator-facing callers to preserve current behavior:
    • status.ts now calls isShieldsDown(sandboxName, true)
    • doctor.ts now calls isShieldsDown(sandboxName, true) once and reuses the value
  • I removed the dead NC-3112 test scaffolding mocks in src/lib/shields/index.test.ts (../adapters/docker/exec and ./audit) since the tested path imports from dist and those mocks did not intercept.

Validation:

  • npm run build:cli
  • npx vitest run src/lib/shields/index.test.ts src/lib/shields/audit.test.ts
  • git diff --check

@prekshivyas prekshivyas requested a review from jyaunches May 12, 2026 22:11
ericksoa added 2 commits May 12, 2026 20:12
…s-timer

# Conflicts:
#	src/lib/actions/sandbox/destroy.ts
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

@ericksoa ericksoa left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. Reviewed current head 00631e3 after resolving the main conflict and fixing the replacement timer-marker ownership bug. Focused shields/config E2E passed: https://github.com/NVIDIA/NemoClaw/actions/runs/25776147771

@ericksoa ericksoa enabled auto-merge (squash) May 13, 2026 03:36
@ericksoa ericksoa merged commit af99453 into NVIDIA:main May 13, 2026
62 checks passed
@wscurran wscurran added area: cli Command line interface, flags, terminal UX, or output bug-fix PR fixes a bug or regression and removed NemoClaw CLI labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: cli Command line interface, flags, terminal UX, or output bug-fix PR fixes a bug or regression security Potential vulnerability, unsafe behavior, or access risk

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants