[DGX Spark] Host reboot bricks sandbox until 5-min `rebuild --yes`: `connect` recovery path warns about missing /tmp guards but launches gateway naked → @homebridge/ciao crash loop #2701

## TL;DR

On DGX Spark / GB10 / aarch64, after **any pod recreate that doesn't go through `nemoclaw <sandbox> rebuild`** (host reboot, OOM, supervisor crash, manual `kubectl delete pod`), running `nemoclaw <sandbox> connect` puts the gateway into an infinite crash loop. The TUI shows `gateway disconnected: closed | idle`. **The only recovery is `nemoclaw <sandbox> rebuild --yes` — a ~5 minute Docker image rebuild.**

The recovery code path *detects* the condition (it emits `[gateway-recovery] WARNING: /tmp/nemoclaw-proxy-env.sh missing`) but **launches the gateway anyway, naked, with no NODE_OPTIONS preloads**. On aarch64 the `@homebridge/ciao` mDNS package then throws `uv_interface_addresses returned Unknown system error 1` because the OpenShell sandbox netns blocks the syscall, and the gateway crash-loops forever.

## Steps to Reproduce

1. **Onboard a sandbox** with any provider:
   ```bash
   NEMOCLAW_PROVIDER=ollama NEMOCLAW_MODEL=hermes3:8b \
   NEMOCLAW_NON_INTERACTIVE=1 nemoclaw onboard
   ```

2. **Verify guards are present** (the `--require` preloads + the env file that chains them via `NODE_OPTIONS`):
   ```bash
   docker exec openshell-cluster-<gateway> kubectl -n openshell exec <sandbox> -- \
     ls -la /tmp/nemoclaw-proxy-env.sh /tmp/nemoclaw-ciao-network-guard.js
   ```
   Expected: both files exist, are owned by `sandbox:sandbox`, and have mode `0444`.

3. **Force a pod recreate** without going through rebuild. Any of these triggers the bug:
   - **Easy repro:** `docker exec openshell-cluster-<gateway> kubectl -n openshell delete pod <sandbox>`
   - **Realistic on DGX Spark:** reboot the host
   - Agent container OOM
   - Unhandled rejection in the gateway crashes the supervisor

   The openshell-sandbox-controller recreates the pod with a fresh container in ~5–10 s.

4. **Re-verify guard files are gone:**
   ```bash
   docker exec openshell-cluster-<gateway> kubectl -n openshell exec <sandbox> -- \
     ls -la /tmp/nemoclaw-proxy-env.sh /tmp/nemoclaw-ciao-network-guard.js
   ```
   Observed: `No such file or directory` on both.

5. **Run** `nemoclaw <sandbox> connect`. The gateway respawns and immediately enters a crash loop with the ciao stack trace below.

6. TUI displays:
   ```
   gateway disconnected: closed | idle
   ```

7. **Only `nemoclaw <sandbox> rebuild --yes` recovers** — a full image build (~5 min on ARM64 / GB10).
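Steps 2 and 4 run the same presence check, so it can be wrapped in a small helper. The following is a minimal sketch, not part of NemoClaw: `guard_status` is a hypothetical name, and the directory argument exists only so the function can be exercised against a scratch directory. The two file names are the ones listed in step 2.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not a NemoClaw command): report whether the two guard
# files from steps 2 and 4 are present and readable in a directory.
# The directory defaults to /tmp, where the issue says the files live.
guard_status() {
  local dir="${1:-/tmp}" missing=0 f
  for f in nemoclaw-proxy-env.sh nemoclaw-ciao-network-guard.js; do
    if [ -r "$dir/$f" ]; then
      echo "present: $dir/$f"
    else
      echo "MISSING: $dir/$f"
      missing=1
    fi
  done
  # Exit status 0 only when both files are present.
  return "$missing"
}
```

Running it inside the sandbox would mean wrapping it in the `docker exec ... kubectl -n openshell exec ...` prefix from step 2; that wiring is left out here because it depends on which shell is available in the sandbox image.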

## What the user sees

Gateway log (repeats every few seconds, forever):
```
[openclaw] Unhandled promise rejection: SystemError: A system error occurred:
  uv_interface_addresses returned Unknown system error 1 (Unknown system error 1)
    at Object.networkInterfaces (node:os:218:16)
    at Function.assumeNetworkInterfaceNames (.../@homebridge/ciao/src/NetworkManager.ts:527:23)
```

Just before launching, the recovery script also writes `[gateway-recovery] WARNING: /tmp/nemoclaw-proxy-env.sh missing — gateway launching without library guards (#2478)` to the log. The warning goes only to the log, and the user has no actionable next step short of `rebuild --yes`.

## Root Cause (current code on `main`)

`src/lib/agent/runtime.ts → buildOpenClawRecoveryScript()` constructs the shell that runs when `connect` decides to relaunch the gateway. When `/tmp/nemoclaw-proxy-env.sh` is missing it takes the **warn-and-proceed** branch by design:

```sh
if [ -r /tmp/nemoclaw-proxy-env.sh ]; then . /tmp/nemoclaw-proxy-env.sh; _PE_MISSING=0; else _PE_MISSING=1; fi;
[ "$_PE_MISSING" = "1" ] && { ...echo WARNING...; };
[ "$_PE_MISSING" = "0" ] && [ "$_GUARDS_MISSING" = "1" ] && { ...exit 1... };  # only partial-failure hard-fails
launchCommand   # runs even when _PE_MISSING=1
```

The comment in the source spells out the trade-off:
> "A missing env file remains warning-only; a present env file that does not install required guards is a hard failure because launching would create an unguarded gateway."

That trade-off is harmless on x86 cloud hosts, where `os.networkInterfaces()` succeeds, but on aarch64 / DGX Spark it is a **guaranteed crash loop**: the unguarded gateway hits the `uv_interface_addresses` failure on every start.
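The branch logic in the excerpt can be easier to follow as a pure function. This sketch only restates the three outcomes described above; `recovery_decision` and its output strings are hypothetical labels, and the two arguments mirror the `_PE_MISSING` and `_GUARDS_MISSING` flags.

```bash
#!/usr/bin/env bash
# Sketch of the decision made by the quoted recovery script, reduced to a
# pure function. recovery_decision and the printed labels are hypothetical;
# the arguments stand in for _PE_MISSING and _GUARDS_MISSING.
recovery_decision() {
  local pe_missing="$1" guards_missing="$2"
  if [ "$pe_missing" = "1" ]; then
    # Env file absent: warning only, the gateway still launches
    # without any NODE_OPTIONS preloads.
    echo "warn-and-launch-unguarded"
  elif [ "$guards_missing" = "1" ]; then
    # Env file present but guards not installed: the only hard failure.
    echo "hard-fail"
  else
    echo "launch-guarded"
  fi
}
```

Note that `recovery_decision 1 <anything>` never reaches the hard-fail branch, which is the case a pod recreate produces.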

## What landed adjacent on `main` (does NOT solve this)

- **#2777** (merged 2026-05-01) — keeps gateway guard preloads active **after respawn**. Solves a sibling case where the re-exec'd gateway child silently dropped its preloads. Does not restore wiped `/tmp` files.
- **#3109** (merged 2026-05-06) — extracts the five preload bodies from `scripts/nemoclaw-start.sh` heredocs into standalone modules at `nemoclaw-blueprint/scripts/`, baked into the image at `/usr/local/lib/nemoclaw/preloads/`. **This is the prerequisite for the proper fix here** but does not itself wire the recovery path to use them.

## Earlier fix attempts

- **PR #2723** (closed 2026-05-01) — first draft of the proper fix; got tangled and was replaced same-day.
- **PR #2843** (closed 2026-05-07) — re-roll of #2723 (identical 1324+/17 diff). Closed unmerged because #3109 landed mid-flight and changed the preload-module layout that #2843 was extracting; the diff would have needed a meaningful rebase.

## Remaining scope (proposed fix)

1. **Shared `install-preloads.sh`** — factor the preload install logic out of `scripts/nemoclaw-start.sh` so it can be invoked by both the entrypoint and the recovery path. Installs from `/usr/local/lib/nemoclaw/preloads/` (established by #3109) into `/tmp` with correct ownership/mode, and emits `/tmp/nemoclaw-proxy-env.sh` dynamically.
2. **New recovery module** — `src/lib/sandbox/guard-recovery.ts` (or colocate with `src/lib/dashboard/recover.ts` — pick during rebase):
   - `checkGuardsPresent(sandbox)` — kubectl-exec stat.
   - `reEmitGuards(sandbox)` — invokes `install-preloads.sh` inside the sandbox as root (bypasses Landlock) before gateway launch.
3. **Wire into `buildOpenClawRecoveryScript`** — when missing-guards is detected, re-emit *before* launching the gateway instead of warning and proceeding. The "`_PE_MISSING=1` → WARNING" branch becomes "`_PE_MISSING=1` → re-emit → re-source → continue."
4. **Tests**
   - Unit: `guard-recovery.test.ts` — presence check, re-emit invocation, error handling.
   - E2E: update `test/e2e/test-issue-2478-crash-loop-recovery.sh` Phase 4 — the negative case (proxy-env.sh removed) should now assert successful re-emission instead of the WARNING.
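Items 1 and 3 above could look roughly like the following. This is a sketch under stated assumptions, not the proposed implementation: `install_preloads` and `source_guards_or_reemit` are hypothetical function names, the sketch assumes every `*.js` file in the source directory is a preload that keeps its file name when copied, the log message is illustrative, and the ownership change to `sandbox:sandbox` (which needs root) is omitted.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a shared install step plus the re-emit branch.
# Assumption: every *.js in SRC is a preload and keeps its name in DST.
# Ownership (sandbox:sandbox) is NOT set here; that needs root.

# install_preloads SRC DST: copy preloads to DST with mode 0444 and write
# DST/nemoclaw-proxy-env.sh, which chains them via NODE_OPTIONS --require.
install_preloads() {
  local src="$1" dst="$2" env_file="$2/nemoclaw-proxy-env.sh" requires="" f name
  for f in "$src"/*.js; do
    [ -e "$f" ] || continue              # glob matched nothing
    name="$(basename "$f")"
    install -m 0444 "$f" "$dst/$name" || return 1
    requires="$requires --require $dst/$name"
  done
  [ -n "$requires" ] || return 1         # nothing to install: fail loudly
  rm -f "$env_file"
  # ${NODE_OPTIONS:-} stays literal so it expands when the file is sourced.
  printf 'export NODE_OPTIONS="${NODE_OPTIONS:-}%s"\n' "$requires" > "$env_file" || return 1
  chmod 0444 "$env_file"
}

# Re-emit branch: "missing -> re-emit -> re-source -> continue" instead of
# "missing -> WARNING -> launch". The message below is illustrative only.
source_guards_or_reemit() {
  local src="$1" dst="$2" env_file="$2/nemoclaw-proxy-env.sh"
  if [ ! -r "$env_file" ]; then
    echo "[gateway-recovery] $env_file missing; re-emitting guards" >&2
    install_preloads "$src" "$dst" || return 1
  fi
  . "$env_file"
}
```

In the real recovery path the call would presumably be `source_guards_or_reemit /usr/local/lib/nemoclaw/preloads /tmp`, followed by the gateway launch only if it returns 0. Failing when no preloads are found keeps the existing rule that an unguarded gateway must never launch silently.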

## Out of scope (intentional)

- `--repair` flag on `sandbox connect` — recovery should be automatic.
- Further refactor of `nemoclaw-start.sh` beyond extracting `install-preloads.sh`.
- Landlock / permission-model changes.

## Environment

- OS: Ubuntu 24.04 (`Linux <host> 6.17.0-1014-nvidia aarch64`)
- Hardware: NVIDIA GB10 (DGX Spark)
- Docker: Engine 27.x
- Node.js: v22.22.2
- NemoClaw: v0.0.29 (originally reported); **confirmed still present on `main` (≥ v0.0.61) as of 2026-06-09**
- OpenClaw: 2026.4.9
- OpenShell: 0.0.36

## Original logs / artifacts

Stale lock left behind by the crash-looping gateway:
```
$ kubectl -n openshell exec <sandbox> -- cat /tmp/openclaw-998/gateway.<id>.lock
{"pid":257,"createdAt":"...","configPath":"/sandbox/.openclaw/openclaw.json","startTime":9140991}
```
This lock prevents subsequent gateway-start attempts from succeeding cleanly even after `/tmp` is repopulated, until it's removed.
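Any automatic recovery would therefore also need to clear the lock when its owner is gone. The helper below is a hedged sketch of one way to test that; `lock_pid` and `lock_is_stale` are hypothetical names, the check only makes sense inside the sandbox (same PID namespace as the gateway), and it ignores the `startTime` field, so it cannot detect PID reuse.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: decide whether a gateway lock file is stale.
# Must run in the same PID namespace as the gateway (inside the sandbox).
# Limitation: startTime is ignored, so a reused PID looks "alive".

# Print the numeric "pid" field of the one-line JSON lock, without jq.
lock_pid() {
  sed -n 's/.*"pid"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Return 0 (stale) when the lock has no parsable pid or when that pid
# cannot be signalled; return 1 when the owning process is still there.
# Caveat: kill -0 also fails for a live process owned by another user.
lock_is_stale() {
  local pid
  pid="$(lock_pid "$1")"
  [ -n "$pid" ] || return 0
  ! kill -0 "$pid" 2>/dev/null
}
```

A caller could then run `lock_is_stale "$lock" && rm -f "$lock"` before relaunching the gateway, where `$lock` is the path shown above.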

`nemoclaw debug --quick --sandbox <sandbox>` capture archived at [debug-output-2026-04-29-1847.txt](https://github.com/user-attachments/files/27216116/debug-output-2026-04-29-1847.txt).

---

Filed by @camerono 2026-04-29 · re-scoped 2026-05-07 · rewritten for clarity 2026-06-09.

