
fix(recovery): re-emit guard chain when /tmp is wiped after pod recreate (#2701) #2723

Closed
jyaunches wants to merge 6 commits into main from issue-2701-recover-guard-chain-on-pod-recreate

Conversation

@jyaunches

Contributor

Summary

Fixes #2701. After a pod recreate, the sandbox's /tmp is fresh and all NODE_OPTIONS preload guard files are gone. The recovery path previously detected this and warned, but launched the gateway anyway, which caused a crash loop from @homebridge/ciao's os.networkInterfaces() call.

This PR adds automatic guard re-emission: when recovery detects missing guards, it calls /usr/local/lib/nemoclaw/emit-guards.sh inside the sandbox via kubectl exec (as root, bypassing Landlock) before launching the gateway.

Changes

  • scripts/lib/guards/ — 6 JS guard files extracted from nemoclaw-start.sh heredocs into standalone files, baked into the Docker image at /usr/local/lib/nemoclaw/guards/
  • scripts/lib/emit-guards.sh — Standalone script that installs guards to /tmp with correct permissions (root:root 444) and generates proxy-env.sh dynamically. Callable by both the entrypoint and the recovery path.
  • Dockerfile — COPY guards/ and emit-guards.sh into the image
  • src/lib/guard-recovery.ts — New module: checkGuardsPresent() + reEmitGuards() using docker exec → kubectl exec
  • src/nemoclaw.ts — Wired guard re-emission into recoverSandboxProcesses() before gateway launch
  • src/lib/sandbox-build-context.ts — Stage new files in optimized build context
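
For reviewers who want a feel for the shape of the new module, here is a minimal sketch of the two functions named above. It is not the code in src/lib/guard-recovery.ts: the injected `SandboxExec` callback, the `ExecResult` shape, the return types, and the /tmp file names are all assumptions made for illustration. Only the function names, the guard names, and the emit-guards.sh path come from this PR.

```typescript
// Sketch only: the exec shape, guard file names, and return types are assumptions.
interface ExecResult {
  status: number; // exit code of the command run inside the sandbox
  stdout: string;
}

// Runs argv inside the sandbox (the PR does this via docker exec -> kubectl exec).
type SandboxExec = (argv: string[]) => ExecResult;

// Guard names come from the PR description; the /tmp file names are hypothetical.
const GUARD_PATHS: string[] = [
  "safety-net",
  "ciao-network-guard",
  "http-proxy-fix",
  "nemotron-fix",
  "slack-channel-guard",
  "slack-token-rewriter",
]
  .map((guard) => `/tmp/${guard}.js`)
  .concat("/tmp/proxy-env.sh");

const EMIT_GUARDS_SCRIPT = "/usr/local/lib/nemoclaw/emit-guards.sh";

// Returns the guard paths that are missing (an empty array means all are present).
function checkGuardsPresent(exec: SandboxExec, paths: string[] = GUARD_PATHS): string[] {
  return paths.filter((path) => exec(["test", "-f", path]).status !== 0);
}

// Runs the emit script, then re-checks so a silent failure is not reported as success.
function reEmitGuards(exec: SandboxExec): { ok: boolean; stillMissing: string[] } {
  const result = exec([EMIT_GUARDS_SCRIPT]);
  const stillMissing = checkGuardsPresent(exec);
  return { ok: result.status === 0 && stillMissing.length === 0, stillMissing };
}
```

Passing the exec function in as a parameter keeps the sketch testable without a cluster; a real caller would supply a function that wraps the docker exec and kubectl exec invocation.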

Dependencies

Verification

npm run typecheck:cli
npx vitest run src/lib/guard-recovery.test.ts src/lib/agent-runtime.test.ts test/sandbox-build-context.test.ts test/nemoclaw-start.test.ts

E2E Validation

Target the issue-2478-crash-loop-recovery-e2e nightly job. The test's Phase 4 negative case will validate that:

  1. Guards are removed → recovery detects missing guards → re-emits → gateway starts clean
  2. Crash-recovery loops (Phase 3) continue to work with guards intact

Once #2710 merges, rebase onto main and update the #2478 E2E test assertions to expect re-emission instead of WARNING.

Follow-up

  • Phase 2: Refactor nemoclaw-start.sh to source guards from /usr/local/lib/nemoclaw/guards/ (DRY the heredocs) — separate PR
  • Phase 4: Dedicated test-issue-2701-guard-reemission.sh E2E test

jyaunches and others added 6 commits April 29, 2026 18:16
…Processes

Replace the manual gateway-restart + port-forward logic in
checkAndRecoverSandboxProcesses with delegation to the Dashboard
Delivery Contract's recoverDashboardChain(). This verifies all
chain links (gateway, forward, CORS) and only repairs what's broken.

Implements bounded DashboardRecoverDeps in nemoclaw.ts:
- captureForwardList: bounded with OPENSHELL_PROBE_TIMEOUT_MS
- downloadSandboxConfig: bounded with OPENSHELL_DOWNLOAD_TIMEOUT_MS
- stopForward/startForward: bounded with OPENSHELL_OPERATION_TIMEOUT_MS
- executeSandboxCommand: already bounded (15s SSH timeout)
- restartGateway: delegates to existing recoverSandboxProcesses

This is Phase 5 of #2562 and completes the nemoclaw.ts integration
that was reverted after the #2398 E2E hang. All openshell calls in
the recovery path are now explicitly bounded.

Closes: #2390
Refs: #2562
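
As an illustration of what "bounded" means in the commit message above, the following sketch wraps a child process call with a hard time limit using Node's `spawnSync` `timeout` option. The helper name `runBounded` and the `BoundedResult` shape are hypothetical, and the commit does not say whether the real code uses `spawnSync` or another mechanism; the values behind OPENSHELL_PROBE_TIMEOUT_MS and the other constants are not stated in this PR, so the limit is left as a parameter.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical result shape for a time-limited command.
interface BoundedResult {
  ok: boolean; // exited with status 0 within the limit
  timedOut: boolean; // killed because the limit elapsed
  stdout: string;
}

// Runs a command and gives up after timeoutMs so a hung CLI call cannot stall recovery.
function runBounded(command: string, args: string[], timeoutMs: number): BoundedResult {
  const result = spawnSync(command, args, { encoding: "utf8", timeout: timeoutMs });
  // Node reports an expired timeout as an error with code ETIMEDOUT.
  const timedOut = (result.error as NodeJS.ErrnoException | undefined)?.code === "ETIMEDOUT";
  return {
    ok: !timedOut && result.status === 0,
    timedOut,
    stdout: result.stdout ?? "",
  };
}
```

A caller can then treat `timedOut` as "link state unknown" instead of blocking, which is the failure mode behind the E2E hang mentioned above.
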
Address CodeRabbit feedback:

1. Remove early return when gateway is running — always run
   recoverDashboardChain() so broken forward/CORS links get repaired
   even when the in-sandbox gateway is healthy (#2042, #1178).

2. Forward port param through restartGateway → recoverSandboxProcesses
   so the fallback recovery script uses the chain's port instead of
   hardcoded DASHBOARD_PORT.

The wasRunning field now reflects the actual gateway probe result
regardless of whether recovery was attempted.

Refs: #2390
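
The behavior described in point 1 and in the wasRunning note can be sketched as follows. This is a simplified model, not recoverDashboardChain() itself: the `ChainDeps` interface, the `recoverChain` name, and the result fields other than `wasRunning` are assumptions. Only the three link names (gateway, forward, CORS) come from the commit messages.

```typescript
type ChainLink = "gateway" | "forward" | "cors";

// Hypothetical dependency surface: one probe and one repair per link.
interface ChainDeps {
  isHealthy(link: ChainLink): boolean;
  repair(link: ChainLink): boolean; // true when the repair succeeded
}

interface ChainRecoveryResult {
  wasRunning: boolean; // gateway probe result, recorded before any repair
  repaired: ChainLink[];
  stillBroken: ChainLink[];
}

function recoverChain(deps: ChainDeps): ChainRecoveryResult {
  const order: ChainLink[] = ["gateway", "forward", "cors"];
  const wasRunning = deps.isHealthy("gateway");
  const repaired: ChainLink[] = [];
  const stillBroken: ChainLink[] = [];
  // No early return on a healthy gateway: every link is still checked.
  for (const link of order) {
    if (deps.isHealthy(link)) continue;
    (deps.repair(link) ? repaired : stillBroken).push(link);
  }
  return { wasRunning, repaired, stillBroken };
}
```

With a healthy gateway and a dead port-forward, this model reports `wasRunning: true` and repairs only the forward link, which is the case the early return used to miss.
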
…ailures

The E2E tests (sandbox-survival-e2e, skip-permissions-e2e) failed with
exit code 124 (timeout hang) because recoverDashboardChain immediately
re-verified after restarting the gateway — the gateway HTTP listener
hadn't bound yet, so verifyDashboardChain reported it unhealthy.

Add sleepSeconds(3) after a successful restartGateway call to give the
gateway time to bind its HTTP port before the chain re-verification.
This matches the original recovery code's behavior (sleepSeconds(3)
between recoverSandboxProcesses and isSandboxGatewayRunning).

Also removes a redundant type annotation in downloadSandboxConfig.

Refs: #2390
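
To make the race concrete, here is a small sketch of the restart, settle, verify ordering. The `SettleDeps` interface and the `restartThenVerify` name are hypothetical; `sleepSeconds`, `restartGateway`, and the 3-second delay are taken from the commit message. The dependencies are injected so the timing can be simulated without a real gateway.

```typescript
// Hypothetical dependency surface for the restart-then-verify step.
interface SettleDeps {
  restartGateway(): boolean; // true when the restart command succeeded
  sleepSeconds(seconds: number): void;
  verifyChain(): boolean; // true when every chain link is healthy
}

// Restarts the gateway, waits for its HTTP listener to bind, then re-verifies.
function restartThenVerify(deps: SettleDeps, settleSeconds = 3): boolean {
  if (!deps.restartGateway()) return false;
  // Without this pause the verification can run before the port is bound.
  deps.sleepSeconds(settleSeconds);
  return deps.verifyChain();
}
```

A fixed delay is the simplest option and matches the original recovery code; polling the port with a deadline would be an alternative, but that is not what this commit does.
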
…port

Address Aaron's review feedback:

1. Port derivation: use registry.getSandbox(name)?.dashboardPort instead
   of agent?.forwardPort to correctly recover multi-sandbox setups with
   auto-allocated or user-overridden ports (e.g. 18790).

2. Agent gating: only run full dashboard chain recovery (CORS check via
   /sandbox/.openclaw/openclaw.json) for OpenClaw sandboxes. Non-OpenClaw
   agents (Hermes) use different config paths and don't expose the
   OpenClaw control UI — fall back to gateway-only recovery via new
   checkAndRecoverGatewayOnly() helper.

The gateway-only path preserves the original pre-#2398 behavior for
Hermes: check gateway → restart if dead → re-forward → done.

Refs: #2390
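
The two review points above can be sketched as a pair of small pure functions. The `Registry` and `SandboxEntry` shapes, the function names, and the agent identifier string "openclaw" are assumptions for illustration; `registry.getSandbox(name)?.dashboardPort` and the two recovery paths come from the commit message. Treating a missing agent as OpenClaw is also an assumption.

```typescript
// Hypothetical registry shape; only getSandbox and dashboardPort appear in the PR.
interface SandboxEntry {
  dashboardPort?: number;
  agent?: string;
}

interface Registry {
  getSandbox(name: string): SandboxEntry | undefined;
}

// Prefers the port recorded for this sandbox, falling back to the default.
function resolveDashboardPort(registry: Registry, sandboxName: string, defaultPort: number): number {
  return registry.getSandbox(sandboxName)?.dashboardPort ?? defaultPort;
}

type RecoveryPath = "dashboard-chain" | "gateway-only";

// Full chain recovery (including the CORS check) only applies to OpenClaw sandboxes.
function selectRecoveryPath(agent: string | undefined): RecoveryPath {
  // Assumption: an unset agent means the default OpenClaw agent.
  return agent === undefined || agent === "openclaw" ? "dashboard-chain" : "gateway-only";
}
```

For a second sandbox registered with port 18790, `resolveDashboardPort` returns 18790 regardless of the default, which is the multi-sandbox case the review flagged.
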
…ate (#2701)

After a pod recreate, the sandbox's /tmp is fresh and the NODE_OPTIONS
preload guard chain (safety-net, ciao-network-guard, http-proxy-fix,
nemotron-fix, slack-channel-guard, slack-token-rewriter) and the
proxy-env.sh aggregator are all gone. The recovery path previously
detected this and warned but launched the gateway anyway — causing a
crash-loop from @homebridge/ciao's os.networkInterfaces() call.

Fix:
- Extract guard JS files into scripts/lib/guards/ as standalone files
- Add scripts/lib/emit-guards.sh that installs guards to /tmp with
  correct permissions (root:root 444) and generates proxy-env.sh
- Install both into the Docker image at /usr/local/lib/nemoclaw/
- Add src/lib/guard-recovery.ts with checkGuardsPresent() and
  reEmitGuards() — uses kubectl exec (via docker exec) to run
  emit-guards.sh as root inside the sandbox, bypassing Landlock
- Wire guard re-emission into recoverSandboxProcesses() before
  the gateway launch script runs

The recovery path now: detect missing guards → re-emit via kubectl exec
→ source the freshly-written proxy-env.sh → launch gateway with guards.

Refs: #2701, #2478
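
The ordering in the last paragraph of this commit message can be modeled as below. The `GuardedLaunchDeps` interface, the `recoverWithGuards` name, and the returned trace are hypothetical and exist only to make the sequence checkable. Sourcing proxy-env.sh is assumed to happen inside the launch step. Aborting when re-emission fails is also an assumption; the PR does not state what the real code does in that case.

```typescript
type RecoveryStep = "detect" | "re-emit" | "launch" | "abort";

// Hypothetical dependency surface for the guarded launch.
interface GuardedLaunchDeps {
  missingGuards(): string[]; // paths of guard files that are absent
  reEmit(): boolean; // true when the guards were written again
  launchGateway(): void; // assumed to source proxy-env.sh before starting
}

// Returns the steps taken, in order, so callers and tests can inspect the flow.
function recoverWithGuards(deps: GuardedLaunchDeps): RecoveryStep[] {
  const trace: RecoveryStep[] = ["detect"];
  if (deps.missingGuards().length > 0) {
    trace.push("re-emit");
    if (!deps.reEmit()) {
      // Assumption: launching without guards would crash-loop, so stop here.
      trace.push("abort");
      return trace;
    }
  }
  deps.launchGateway();
  trace.push("launch");
  return trace;
}
```

When the guards are intact, the trace is just detect then launch, which corresponds to the Phase 3 crash-recovery case in the E2E plan.
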
@copy-pr-bot

copy-pr-bot Bot commented Apr 29, 2026


Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Apr 29, 2026

Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@wscurran added the labels bug (Something fails against expected or documented behavior), refactor (PR restructures code without intended behavior change), and dependencies (Pull requests that update a dependency file) on Apr 30, 2026
@jyaunches jyaunches closed this May 1, 2026
@jyaunches jyaunches deleted the issue-2701-recover-guard-chain-on-pod-recreate branch May 1, 2026 12:19
@wscurran added the labels bug-fix (PR fixes a bug or regression) and chore (Build, CI, dependency, or tooling maintenance), and removed the labels bug (Something fails against expected or documented behavior), refactor (PR restructures code without intended behavior change), and chore (Build, CI, dependency, or tooling maintenance) on Jun 3, 2026

Labels

bug-fix: PR fixes a bug or regression
dependencies: Pull requests that update a dependency file

Projects

None yet

3 participants