[DGX Spark] Host reboot bricks sandbox until 5-min rebuild --yes: connect recovery path warns about missing /tmp guards but launches gateway naked → @homebridge/ciao crash loop #2701

@camerono

TL;DR

On DGX Spark / GB10 / aarch64, after any pod recreate that doesn't go through nemoclaw <sandbox> rebuild (host reboot, OOM, supervisor crash, manual kubectl delete pod), running nemoclaw <sandbox> connect puts the gateway into an infinite crash loop. The TUI shows gateway disconnected: closed | idle. The only recovery is nemoclaw <sandbox> rebuild --yes — a ~5 minute Docker image rebuild.

The recovery code path detects the condition (it emits [gateway-recovery] WARNING: /tmp/nemoclaw-proxy-env.sh missing) but launches the gateway anyway, naked, with no NODE_OPTIONS preloads. On aarch64 the @homebridge/ciao mDNS package then throws uv_interface_addresses returned Unknown system error 1 because the OpenShell sandbox netns blocks the syscall, and the gateway crash-loops forever.

Steps to Reproduce

  1. Onboard a sandbox with any provider:

    NEMOCLAW_PROVIDER=ollama NEMOCLAW_MODEL=hermes3:8b \
    NEMOCLAW_NON_INTERACTIVE=1 nemoclaw onboard
  2. Verify guards are present (the --require preloads + the env file that chains them via NODE_OPTIONS):

    docker exec openshell-cluster-<gateway> kubectl -n openshell exec <sandbox> -- \
      ls -la /tmp/nemoclaw-proxy-env.sh /tmp/nemoclaw-ciao-network-guard.js

    Expected: both exist, owned sandbox:sandbox, mode 0444.

  3. Force a pod recreate without going through rebuild. Any of these triggers the bug:

    • Easy repro: docker exec openshell-cluster-<gateway> kubectl -n openshell delete pod <sandbox>
    • Realistic on DGX Spark: reboot the host
    • Agent container OOM
    • Unhandled rejection in the gateway crashes the supervisor

    The openshell-sandbox-controller recreates the pod with a fresh container in ~5–10 s.

  4. Re-verify guard files are gone:

    docker exec openshell-cluster-<gateway> kubectl -n openshell exec <sandbox> -- \
      ls -la /tmp/nemoclaw-proxy-env.sh /tmp/nemoclaw-ciao-network-guard.js

    Observed: No such file or directory on both.

  5. Run nemoclaw <sandbox> connect. The gateway respawns and immediately enters a crash loop with the ciao stack trace below.

  6. TUI displays:

    gateway disconnected: closed | idle
    
  7. Only nemoclaw <sandbox> rebuild --yes recovers — a full image build (~5 min on ARM64 / GB10).
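The manual check in steps 2 and 4 can be wrapped in a small helper. This is a minimal sketch, not NemoClaw code: the function name missing_guards and its directory argument are assumptions made for illustration; only the two file names come from the steps above.

```shell
# Hypothetical helper: print the path of every guard file that is not
# readable under the given directory (defaults to /tmp).
missing_guards() {
  mg_dir="${1:-/tmp}"
  for mg_file in nemoclaw-proxy-env.sh nemoclaw-ciao-network-guard.js; do
    # Print the path only when the file is absent or unreadable.
    [ -r "$mg_dir/$mg_file" ] || printf '%s\n' "$mg_dir/$mg_file"
  done
}
```

When run inside the sandbox (for example through the kubectl exec invocation from step 2), empty output means both guards are in place; any printed path corresponds to the post-recreate state observed in step 4.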

What the user sees

Gateway log (repeats every few seconds, forever):

[openclaw] Unhandled promise rejection: SystemError: A system error occurred:
  uv_interface_addresses returned Unknown system error 1 (Unknown system error 1)
    at Object.networkInterfaces (node:os:218:16)
    at Function.assumeNetworkInterfaceNames (.../@homebridge/ciao/src/NetworkManager.ts:527:23)

Just before launching, the recovery script also writes this warning: [gateway-recovery] WARNING: /tmp/nemoclaw-proxy-env.sh missing — gateway launching without library guards (#2478). It goes only to the log, and the user has no actionable next step short of rebuild --yes.
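To confirm that a captured gateway log shows this failure rather than some other disconnect, the two signatures can be counted together. The sketch below is an assumption-laden convenience, not part of NemoClaw: the function name is hypothetical, and the log file path must be supplied by the caller because the issue does not state where the gateway log lives.

```shell
# Hypothetical helper: summarise the two signatures of this bug in a log file.
crash_loop_signature() {
  cls_log="$1"
  # grep -c exits non-zero when nothing matches, so guard it with "|| true".
  cls_crashes=$(grep -c 'uv_interface_addresses returned Unknown system error' "$cls_log" || true)
  if grep -q 'nemoclaw-proxy-env.sh missing' "$cls_log"; then
    cls_warned=yes
  else
    cls_warned=no
  fi
  echo "crashes=$cls_crashes warned=$cls_warned"
}
```

A result with a growing crash count and warned=yes matches the sequence described here: the warning is logged, then the unguarded gateway fails repeatedly.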

Root Cause (current code on main)

src/lib/agent/runtime.ts → buildOpenClawRecoveryScript() constructs the shell that runs when connect decides to relaunch the gateway. When /tmp/nemoclaw-proxy-env.sh is missing it takes the warn-and-proceed branch by design:

if [ -r /tmp/nemoclaw-proxy-env.sh ]; then . /tmp/nemoclaw-proxy-env.sh; _PE_MISSING=0; else _PE_MISSING=1; fi;
[ "$_PE_MISSING" = "1" ] && { ...echo WARNING...; };
[ "$_PE_MISSING" = "0" ] && [ "$_GUARDS_MISSING" = "1" ] && { ...exit 1... };  # only partial-failure hard-fails
launchCommand   # runs even when _PE_MISSING=1

The comment in the source spells out the trade-off:

"A missing env file remains warning-only; a present env file that does not install required guards is a hard failure because launching would create an unguarded gateway."

That trade-off is fine on x86 cloud (where os.networkInterfaces() succeeds) but guarantees a crash loop on aarch64 / DGX Spark, where the same call throws inside the sandbox netns.
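The branch selection in the excerpt can be restated as a small decision function. This is a sketch for reasoning about the cases only: the function name and the result strings are hypothetical, and the meaning given to _GUARDS_MISSING is inferred from the quoted source comment rather than from code shown in this issue.

```shell
# Hypothetical restatement of the current recovery branches.
# $1 = _PE_MISSING     (1 when /tmp/nemoclaw-proxy-env.sh is not readable)
# $2 = _GUARDS_MISSING (assumed: 1 when the env file was sourced but the
#                       required guards were not installed)
recovery_action() {
  ra_pe_missing="$1"
  ra_guards_missing="$2"
  if [ "$ra_pe_missing" = "1" ]; then
    echo "warn-and-launch"   # current behaviour: gateway starts unguarded
  elif [ "$ra_guards_missing" = "1" ]; then
    echo "hard-fail"         # exit 1 before launching
  else
    echo "launch"            # guards installed, normal start
  fi
}
```

Only the first branch is affected by this issue: every pod recreate that bypasses rebuild lands in warn-and-launch, because the env file is gone along with the guards.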

What landed adjacent on main (does NOT solve this)

Earlier fix attempts

Remaining scope (proposed fix)

  1. Shared install-preloads.sh — factor the preload install logic out of scripts/nemoclaw-start.sh so it can be invoked by both the entrypoint and the recovery path. Installs from /usr/local/lib/nemoclaw/preloads/ (established by refactor(runtime): extract entrypoint preload modules #3109) into /tmp with correct ownership/mode, and emits /tmp/nemoclaw-proxy-env.sh dynamically.
  2. New recovery module: src/lib/sandbox/guard-recovery.ts (or colocate with src/lib/dashboard/recover.ts; pick during rebase):
    • checkGuardsPresent(sandbox) — kubectl-exec stat.
    • reEmitGuards(sandbox) — invokes install-preloads.sh inside the sandbox as root (bypasses Landlock) before gateway launch.
  3. Wire into buildOpenClawRecoveryScript — when missing-guards is detected, re-emit before launching the gateway instead of warning and proceeding. The "_PE_MISSING=1 → WARNING" branch becomes "_PE_MISSING=1 → re-emit → re-source → continue."
  4. Tests
    • Unit: guard-recovery.test.ts — presence check, re-emit invocation, error handling.
    • E2E: update test/e2e/test-issue-2478-crash-loop-recovery.sh Phase 4 — the negative case (proxy-env.sh removed) should now assert successful re-emission instead of the WARNING.
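The "re-emit → re-source → continue" branch from item 3 could look roughly like the following. This is a sketch under stated assumptions, not the planned implementation: ensure_proxy_env is a hypothetical name, the installer is passed in as an arbitrary command because install-preloads.sh does not exist yet, and the message text is illustrative.

```shell
# Hypothetical sketch: ensure_proxy_env ENV_FILE INSTALLER [ARGS...]
# Regenerates the env file with the given installer command when it is
# missing, then sources it. Returns non-zero instead of allowing an
# unguarded launch.
ensure_proxy_env() {
  epe_file="$1"
  shift
  if [ ! -r "$epe_file" ]; then
    echo "proxy env file $epe_file missing; re-emitting guards" >&2
    "$@" || return 1                 # installer failed: refuse to continue
    [ -r "$epe_file" ] || return 1   # installer ran but produced no file
  fi
  . "$epe_file"                      # re-source so NODE_OPTIONS takes effect
}
```

In the recovery script this would sit in front of the gateway launch, for example ensure_proxy_env /tmp/nemoclaw-proxy-env.sh followed by the installer command and then `|| exit 1`, so that a failed re-emit stops the launch rather than producing the crash loop.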

Out of scope (intentional)

  • --repair flag on sandbox connect — recovery should be automatic.
  • Further refactor of nemoclaw-start.sh beyond extracting install-preloads.sh.
  • Landlock / permission-model changes.

Environment

  • OS: Ubuntu 24.04 (Linux <host> 6.17.0-1014-nvidia aarch64)
  • Hardware: NVIDIA GB10 (DGX Spark)
  • Docker: Engine 27.x
  • Node.js: v22.22.2
  • NemoClaw: v0.0.29 (originally reported); confirmed still present on main (≥ v0.0.61) as of 2026-06-09
  • OpenClaw: 2026.4.9
  • OpenShell: 0.0.36

Original logs / artifacts

Stale lock left behind by the crash-looping gateway:

$ kubectl -n openshell exec <sandbox> -- cat /tmp/openclaw-998/gateway.<id>.lock
{"pid":257,"createdAt":"...","configPath":"/sandbox/.openclaw/openclaw.json","startTime":9140991}

This lock prevents subsequent gateway-start attempts from succeeding cleanly even after /tmp is repopulated, until it's removed.
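A stale lock of this shape can be detected by checking whether the recorded pid is still alive. The helper below is a hypothetical sketch, not something NemoClaw or OpenClaw ships: it assumes the lock is single-line JSON with a numeric "pid" field as in the sample above, it ignores the startTime field (so it cannot detect pid reuse), and kill -0 also fails for a live process owned by another user.

```shell
# Hypothetical helper: remove a gateway lock whose recorded pid is not running.
# Returns 0 if the lock is absent or was removed, 1 if no pid could be parsed,
# 2 if the owning process still appears to be alive.
remove_stale_lock() {
  rsl_lock="$1"
  [ -f "$rsl_lock" ] || return 0
  # Extract the numeric value of the "pid" field from the JSON line.
  rsl_pid=$(sed -n 's/.*"pid":[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$rsl_lock")
  [ -n "$rsl_pid" ] || return 1      # unparseable lock: leave it alone
  if kill -0 "$rsl_pid" 2>/dev/null; then
    return 2                         # owner still alive: keep the lock
  fi
  rm -f "$rsl_lock"
}
```

It would be pointed at the lock path shown above, inside the sandbox, before retrying the gateway start.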

nemoclaw debug --quick --sandbox <sandbox> capture archived at debug-output-2026-04-29-1847.txt.


Filed by @camerono 2026-04-29 · re-scoped 2026-05-07 · rewritten for clarity 2026-06-09.

Labels

area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery)
platform: dgx-spark (Affects DGX Spark hardware or workflows)
