fix(recovery): re-emit guard chain when /tmp is wiped after pod recreate (#2701) by jyaunches · Pull Request #2723 · NVIDIA/NemoClaw

jyaunches · 2026-04-29T23:54:26Z

Summary

Fixes #2701 — After a pod recreate, the sandbox's /tmp is fresh and all NODE_OPTIONS preload guard files are gone. The recovery path previously detected this and warned but launched the gateway anyway — causing a crash-loop from @homebridge/ciao's os.networkInterfaces() call.

This PR adds automatic guard re-emission: when recovery detects missing guards, it calls /usr/local/lib/nemoclaw/emit-guards.sh inside the sandbox via kubectl exec (as root, bypassing Landlock) before launching the gateway.

Changes

scripts/lib/guards/ — 6 JS guard files extracted from nemoclaw-start.sh heredocs into standalone files, baked into the Docker image at /usr/local/lib/nemoclaw/guards/
scripts/lib/emit-guards.sh — Standalone script that installs guards to /tmp with correct permissions (root:root 444) and generates proxy-env.sh dynamically. Callable by both the entrypoint and the recovery path.
Dockerfile — COPY guards/ and emit-guards.sh into the image
src/lib/guard-recovery.ts — New module: checkGuardsPresent() + reEmitGuards() using docker exec → kubectl exec
src/nemoclaw.ts — Wired guard re-emission into recoverSandboxProcesses() before gateway launch
src/lib/sandbox-build-context.ts — Stage new files in optimized build context
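
The detect-then-re-emit flow described for src/lib/guard-recovery.ts can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the `Exec` runner, the `Target` fields, the `ensureGuards` helper, the return types, and the `/tmp/<name>.js` file names are all assumptions made for the example. Only the two function names, the emit-guards.sh path, and the guard names come from the PR text.

```typescript
// Sketch only: the guard file names under /tmp, the Target fields, and the
// Exec runner shape are assumptions, not the PR's actual code.
type ExecResult = { status: number; stdout: string };
type Exec = (argv: string[]) => ExecResult;

interface Target {
  container: string; // Docker container that hosts kubectl (assumed)
  namespace: string;
  pod: string;
}

const EMIT_SCRIPT = "/usr/local/lib/nemoclaw/emit-guards.sh";

// Guard names come from the commit message; the ".js" suffix is assumed.
const REQUIRED_FILES: string[] = [
  "safety-net",
  "ciao-network-guard",
  "http-proxy-fix",
  "nemotron-fix",
  "slack-channel-guard",
  "slack-token-rewriter",
]
  .map((name) => `/tmp/${name}.js`)
  .concat(["/tmp/proxy-env.sh"]);

// Build the nested command: docker exec <container> kubectl exec ... -- <cmd>
function kubectlExecArgv(t: Target, cmd: string[]): string[] {
  return ["docker", "exec", t.container, "kubectl", "exec", "-n", t.namespace, t.pod, "--", ...cmd];
}

// Returns the required files that are missing inside the sandbox.
function checkGuardsPresent(exec: Exec, t: Target): string[] {
  return REQUIRED_FILES.filter(
    (path) => exec(kubectlExecArgv(t, ["test", "-f", path])).status !== 0,
  );
}

// Runs the emit script inside the sandbox; true when it exits with status 0.
function reEmitGuards(exec: Exec, t: Target): boolean {
  return exec(kubectlExecArgv(t, [EMIT_SCRIPT])).status === 0;
}

type GuardState = "present" | "re-emitted" | "failed";

// Hypothetical wrapper: check, re-emit if anything is missing, then re-check.
function ensureGuards(exec: Exec, t: Target): GuardState {
  if (checkGuardsPresent(exec, t).length === 0) return "present";
  if (!reEmitGuards(exec, t)) return "failed";
  return checkGuardsPresent(exec, t).length === 0 ? "re-emitted" : "failed";
}
```

Injecting the runner keeps the sketch testable without a cluster. In a caller, a `"failed"` result would presumably be the point at which the gateway launch is skipped or the old warning is printed.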

Dependencies

Depends on PR #2710 (refactor(cli): wire recoverDashboardChain into checkAndRecoverSandboxProcesses). This branch is rebased on that PR's branch, so #2710 needs to merge first, or the two PRs need to be combined.
test/e2e/test-issue-2478-crash-loop-recovery.sh Phase 4 assertion will need updating: the negative case (proxy-env.sh removed) will now see successful re-emission instead of the WARNING, since our fix restores the file before launching

Verification

npm run typecheck:cli
npx vitest run src/lib/guard-recovery.test.ts src/lib/agent-runtime.test.ts test/sandbox-build-context.test.ts test/nemoclaw-start.test.ts

E2E Validation

Target the issue-2478-crash-loop-recovery-e2e nightly job. The test's Phase 4 negative case will validate that:

Guards are removed → recovery detects missing guards → re-emits → gateway starts clean
Crash-recovery loops (Phase 3) continue to work with guards intact

Once #2710 merges, rebase onto main and update the #2478 E2E test assertions to expect re-emission instead of WARNING.

Follow-up

Phase 2: Refactor nemoclaw-start.sh to source guards from /usr/local/lib/nemoclaw/guards/ (DRY the heredocs) — separate PR
Phase 4: Dedicated test-issue-2701-guard-reemission.sh E2E test

…Processes

Replace the manual gateway-restart + port-forward logic in checkAndRecoverSandboxProcesses with delegation to the Dashboard Delivery Contract's recoverDashboardChain(). This verifies all chain links (gateway, forward, CORS) and only repairs what's broken.

Implements bounded DashboardRecoverDeps in nemoclaw.ts:
- captureForwardList: bounded with OPENSHELL_PROBE_TIMEOUT_MS
- downloadSandboxConfig: bounded with OPENSHELL_DOWNLOAD_TIMEOUT_MS
- stopForward/startForward: bounded with OPENSHELL_OPERATION_TIMEOUT_MS
- executeSandboxCommand: already bounded (15s SSH timeout)
- restartGateway: delegates to existing recoverSandboxProcesses

This is Phase 5 of #2562 and completes the nemoclaw.ts integration that was reverted after the #2398 E2E hang. All openshell calls in the recovery path are now explicitly bounded.

Closes: #2390
Refs: #2562

Address CodeRabbit feedback:
1. Remove early return when gateway is running: always run recoverDashboardChain() so broken forward/CORS links get repaired even when the in-sandbox gateway is healthy (#2042, #1178).
2. Forward port param through restartGateway → recoverSandboxProcesses so the fallback recovery script uses the chain's port instead of hardcoded DASHBOARD_PORT.

The wasRunning field now reflects the actual gateway probe result regardless of whether recovery was attempted.

Refs: #2390
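
The behavior described in the two commit messages above (verify every link, repair only what is broken, never return early on a healthy gateway, and report `wasRunning` from the initial probe) could look roughly like this. It is a sketch under assumptions: `ChainDeps`, `RecoveryResult`, and `recoverChain` are hypothetical names, and the real recoverDashboardChain() and DashboardRecoverDeps may be shaped quite differently.

```typescript
// Sketch only: the dependency shape and every result field other than
// wasRunning are assumptions made for illustration.
type Link = "gateway" | "forward" | "cors";
const LINKS: Link[] = ["gateway", "forward", "cors"];

type ChainStatus = Record<Link, boolean>;

interface ChainDeps {
  verify: () => ChainStatus; // probe every link of the chain
  repair: Record<Link, () => boolean>; // one repair action per link
}

interface RecoveryResult {
  wasRunning: boolean; // gateway probe result before any repair
  repaired: Link[];
  healthy: boolean;
}

function recoverChain(deps: ChainDeps): RecoveryResult {
  const before = deps.verify();
  const repaired: Link[] = [];
  // No early return on a healthy gateway: forward and CORS are still checked.
  for (const link of LINKS) {
    if (!before[link] && deps.repair[link]()) repaired.push(link);
  }
  // Re-verify only when something was actually repaired.
  const after = repaired.length > 0 ? deps.verify() : before;
  return {
    wasRunning: before.gateway,
    repaired,
    healthy: LINKS.every((link) => after[link]),
  };
}
```

With a healthy gateway and a broken forward, this sketch repairs only the forward and still reports `wasRunning: true`.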

…ailures

The E2E tests (sandbox-survival-e2e, skip-permissions-e2e) failed with exit code 124 (timeout hang) because recoverDashboardChain immediately re-verified after restarting the gateway. The gateway HTTP listener hadn't bound yet, so verifyDashboardChain reported it unhealthy.

Add sleepSeconds(3) after a successful restartGateway call to give the gateway time to bind its HTTP port before the chain re-verification. This matches the original recovery code's behavior (sleepSeconds(3) between recoverSandboxProcesses and isSandboxGatewayRunning).

Also removes a redundant type annotation in downloadSandboxConfig.

Refs: #2390
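
The ordering issue in this commit (restart, wait for the listener to bind, then verify) can be illustrated with a small sketch. The `SettleDeps` interface and the `restartAndVerify` helper are hypothetical; only the names `restartGateway` and `sleepSeconds` and the 3-second delay come from the commit message.

```typescript
// Sketch only: the dependency interface and helper name are assumptions.
interface SettleDeps {
  restartGateway: () => boolean; // true when the restart command succeeded
  sleepSeconds: (seconds: number) => void;
  isGatewayRunning: () => boolean; // probe of the gateway HTTP listener
}

function restartAndVerify(deps: SettleDeps, settleSeconds: number = 3): boolean {
  if (!deps.restartGateway()) return false;
  // Give the gateway time to bind its HTTP port before probing it again.
  deps.sleepSeconds(settleSeconds);
  return deps.isGatewayRunning();
}
```

A fixed delay is the simplest option and matches the behavior the commit describes. Polling the probe until a deadline would be an alternative design, but the PR does not say it is used.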

…port

Address Aaron's review feedback:
1. Port derivation: use registry.getSandbox(name)?.dashboardPort instead of agent?.forwardPort to correctly recover multi-sandbox setups with auto-allocated or user-overridden ports (e.g. 18790).
2. Agent gating: only run full dashboard chain recovery (CORS check via /sandbox/.openclaw/openclaw.json) for OpenClaw sandboxes. Non-OpenClaw agents (Hermes) use different config paths and don't expose the OpenClaw control UI, so they fall back to gateway-only recovery via the new checkAndRecoverGatewayOnly() helper.

The gateway-only path preserves the original pre-#2398 behavior for Hermes: check gateway → restart if dead → re-forward → done.

Refs: #2390
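
The two review fixes above amount to a per-sandbox port lookup and a branch on the agent type. A minimal sketch, assuming a plain-object registry, follows. The `Registry` shape, the agent identifier strings, the fallback port value, and the choice to treat a missing agent as OpenClaw are all assumptions, not details from the PR.

```typescript
// Sketch only: the registry shape, agent identifiers, and fallback port are assumptions.
interface SandboxEntry {
  dashboardPort?: number;
  agent?: string;
}
type Registry = { [name: string]: SandboxEntry | undefined };

const FALLBACK_DASHBOARD_PORT = 18789; // placeholder value, not taken from the PR

// Prefer the port recorded for this sandbox over any global default.
function resolveDashboardPort(registry: Registry, name: string): number {
  return registry[name]?.dashboardPort ?? FALLBACK_DASHBOARD_PORT;
}

type RecoveryMode = "dashboard-chain" | "gateway-only";

// Only OpenClaw sandboxes get the full chain (gateway, forward, CORS) recovery.
function chooseRecoveryMode(registry: Registry, name: string): RecoveryMode {
  const agent = registry[name]?.agent ?? "openclaw"; // assumed default
  return agent === "openclaw" ? "dashboard-chain" : "gateway-only";
}
```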

…ate (#2701)

After a pod recreate, the sandbox's /tmp is fresh and the NODE_OPTIONS preload guard chain (safety-net, ciao-network-guard, http-proxy-fix, nemotron-fix, slack-channel-guard, slack-token-rewriter) and the proxy-env.sh aggregator are all gone. The recovery path previously detected this and warned but launched the gateway anyway, causing a crash-loop from @homebridge/ciao's os.networkInterfaces() call.

Fix:
- Extract guard JS files into scripts/lib/guards/ as standalone files
- Add scripts/lib/emit-guards.sh that installs guards to /tmp with correct permissions (root:root 444) and generates proxy-env.sh
- Install both into the Docker image at /usr/local/lib/nemoclaw/
- Add src/lib/guard-recovery.ts with checkGuardsPresent() and reEmitGuards(), which use kubectl exec (via docker exec) to run emit-guards.sh as root inside the sandbox, bypassing Landlock
- Wire guard re-emission into recoverSandboxProcesses() before the gateway launch script runs

The recovery path now: detect missing guards → re-emit via kubectl exec → source the freshly-written proxy-env.sh → launch gateway with guards.

Refs: #2701, #2478
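
To make the "aggregator" role of proxy-env.sh concrete, here is one way such a file could be rendered, assuming its job is to export NODE_OPTIONS with one `--require` preload per guard. The PR does not show the file's contents, so the layout, the `renderProxyEnv` name, and the guard paths in the test are hypothetical. The real file may also carry proxy settings or other variables.

```typescript
// Sketch only: the real proxy-env.sh may contain more than NODE_OPTIONS.
function renderProxyEnv(guardPaths: string[]): string {
  // Node's --require flag preloads a module before the main script runs.
  const requires = guardPaths.map((path) => `--require ${path}`).join(" ");
  return [
    "# Generated file: sourced before the gateway launches (hypothetical layout)",
    `export NODE_OPTIONS="${requires}"`,
    "",
  ].join("\n");
}
```

Because the gateway launch script sources this file, regenerating it after a /tmp wipe is what puts the preload guards back in effect for the next gateway process.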

copy-pr-bot · 2026-04-29T23:54:30Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


coderabbitai · 2026-04-29T23:54:32Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b409f1b3-2a26-4f45-9d00-b498ff9f5405


jyaunches and others added 6 commits April 29, 2026 18:16

Merge branch 'main' into issue-2390-dashboard-recovery-integration

fa774eb

wscurran added the bug, refactor, and dependencies labels Apr 30, 2026

wscurran mentioned this pull request Apr 30, 2026

[DGX Spark] Host reboot bricks sandbox until 5-min rebuild --yes: connect recovery path warns about missing /tmp guards but launches gateway naked → @homebridge/ciao crash loop #2701

Open

jyaunches closed this May 1, 2026

jyaunches deleted the issue-2701-recover-guard-chain-on-pod-recreate branch May 1, 2026 12:19

jyaunches wants to merge 6 commits into main from issue-2701-recover-guard-chain-on-pod-recreate
