fix(recovery): re-emit guard chain when /tmp is wiped after pod recreate (#2701)#2723
Closed
jyaunches wants to merge 6 commits into
Closed
fix(recovery): re-emit guard chain when /tmp is wiped after pod recreate (#2701)#2723jyaunches wants to merge 6 commits into
jyaunches wants to merge 6 commits into
Conversation
…Processes Replace the manual gateway-restart + port-forward logic in checkAndRecoverSandboxProcesses with delegation to the Dashboard Delivery Contract's recoverDashboardChain(). This verifies all chain links (gateway, forward, CORS) and only repairs what's broken. Implements bounded DashboardRecoverDeps in nemoclaw.ts: - captureForwardList: bounded with OPENSHELL_PROBE_TIMEOUT_MS - downloadSandboxConfig: bounded with OPENSHELL_DOWNLOAD_TIMEOUT_MS - stopForward/startForward: bounded with OPENSHELL_OPERATION_TIMEOUT_MS - executeSandboxCommand: already bounded (15s SSH timeout) - restartGateway: delegates to existing recoverSandboxProcesses This is Phase 5 of #2562 and completes the nemoclaw.ts integration that was reverted after the #2398 E2E hang. All openshell calls in the recovery path are now explicitly bounded. Closes: #2390 Refs: #2562
Address CodeRabbit feedback: 1. Remove early return when gateway is running — always run recoverDashboardChain() so broken forward/CORS links get repaired even when the in-sandbox gateway is healthy (#2042, #1178). 2. Forward port param through restartGateway → recoverSandboxProcesses so the fallback recovery script uses the chain's port instead of hardcoded DASHBOARD_PORT. The wasRunning field now reflects the actual gateway probe result regardless of whether recovery was attempted. Refs: #2390
…ailures The E2E tests (sandbox-survival-e2e, skip-permissions-e2e) failed with exit code 124 (timeout hang) because recoverDashboardChain immediately re-verified after restarting the gateway — the gateway HTTP listener hadn't bound yet, so verifyDashboardChain reported it unhealthy. Add sleepSeconds(3) after a successful restartGateway call to give the gateway time to bind its HTTP port before the chain re-verification. This matches the original recovery code's behavior (sleepSeconds(3) between recoverSandboxProcesses and isSandboxGatewayRunning). Also removes a redundant type annotation in downloadSandboxConfig. Refs: #2390
…port Address Aaron's review feedback: 1. Port derivation: use registry.getSandbox(name)?.dashboardPort instead of agent?.forwardPort to correctly recover multi-sandbox setups with auto-allocated or user-overridden ports (e.g. 18790). 2. Agent gating: only run full dashboard chain recovery (CORS check via /sandbox/.openclaw/openclaw.json) for OpenClaw sandboxes. Non-OpenClaw agents (Hermes) use different config paths and don't expose the OpenClaw control UI — fall back to gateway-only recovery via new checkAndRecoverGatewayOnly() helper. The gateway-only path preserves the original pre-#2398 behavior for Hermes: check gateway → restart if dead → re-forward → done. Refs: #2390
…ate (#2701) After a pod recreate, the sandbox's /tmp is fresh and the NODE_OPTIONS preload guard chain (safety-net, ciao-network-guard, http-proxy-fix, nemotron-fix, slack-channel-guard, slack-token-rewriter) and the proxy-env.sh aggregator are all gone. The recovery path previously detected this and warned but launched the gateway anyway — causing a crash-loop from @homebridge/ciao's os.networkInterfaces() call. Fix: - Extract guard JS files into scripts/lib/guards/ as standalone files - Add scripts/lib/emit-guards.sh that installs guards to /tmp with correct permissions (root:root 444) and generates proxy-env.sh - Install both into the Docker image at /usr/local/lib/nemoclaw/ - Add src/lib/guard-recovery.ts with checkGuardsPresent() and reEmitGuards() — uses kubectl exec (via docker exec) to run emit-guards.sh as root inside the sandbox, bypassing Landlock - Wire guard re-emission into recoverSandboxProcesses() before the gateway launch script runs The recovery path now: detect missing guards → re-emit via kubectl exec → source the freshly-written proxy-env.sh → launch gateway with guards. Refs: #2701, #2478
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
Contributor
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #2701 — After a pod recreate, the sandbox's
/tmpis fresh and all NODE_OPTIONS preload guard files are gone. The recovery path previously detected this and warned but launched the gateway anyway — causing a crash-loop from@homebridge/ciao'sos.networkInterfaces()call.This PR adds automatic guard re-emission: when recovery detects missing guards, it calls
/usr/local/lib/nemoclaw/emit-guards.shinside the sandbox viakubectl exec(as root, bypassing Landlock) before launching the gateway.Changes
scripts/lib/guards/— 6 JS guard files extracted fromnemoclaw-start.shheredocs into standalone files, baked into the Docker image at/usr/local/lib/nemoclaw/guards/scripts/lib/emit-guards.sh— Standalone script that installs guards to/tmpwith correct permissions (root:root 444) and generatesproxy-env.shdynamically. Callable by both the entrypoint and the recovery path.Dockerfile— COPY guards/ and emit-guards.sh into the imagesrc/lib/guard-recovery.ts— New module:checkGuardsPresent()+reEmitGuards()using docker exec → kubectl execsrc/nemoclaw.ts— Wired guard re-emission intorecoverSandboxProcesses()before gateway launchsrc/lib/sandbox-build-context.ts— Stage new files in optimized build contextDependencies
test/e2e/test-issue-2478-crash-loop-recovery.shPhase 4 assertion will need updating: the negative case (proxy-env.sh removed) will now see successful re-emission instead of the WARNING, since our fix restores the file before launchingVerification
E2E Validation
Target the
issue-2478-crash-loop-recovery-e2enightly job. The test's Phase 4 negative case will validate that:Once #2710 merges, rebase onto main and update the #2478 E2E test assertions to expect re-emission instead of WARNING.
Follow-up
nemoclaw-start.shto source guards from/usr/local/lib/nemoclaw/guards/(DRY the heredocs) — separate PRtest-issue-2701-guard-reemission.shE2E test