You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
On DGX Spark / GB10 / aarch64, after any pod recreate that doesn't go through nemoclaw <sandbox> rebuild (host reboot, OOM, supervisor crash, manual kubectl delete pod), running nemoclaw <sandbox> connect puts the gateway into an infinite crash loop. The TUI shows gateway disconnected: closed | idle. The only recovery is nemoclaw <sandbox> rebuild --yes — a ~5 minute Docker image rebuild.
The recovery code path detects the condition (it emits [gateway-recovery] WARNING: /tmp/nemoclaw-proxy-env.sh missing) but launches the gateway anyway, naked, with no NODE_OPTIONS preloads. On aarch64 the @homebridge/ciao mDNS package then throws uv_interface_addresses returned Unknown system error 1 because the OpenShell sandbox netns blocks the syscall, and the gateway crash-loops forever.
Runnemoclaw <sandbox> connect. The gateway respawns and immediately enters a crash loop with the ciao stack trace below.
TUI displays:
gateway disconnected: closed | idle
Only nemoclaw <sandbox> rebuild --yes recovers — a full image build (~5 min on ARM64 / GB10).
What the user sees
Gateway log (repeats every few seconds, forever):
[openclaw] Unhandled promise rejection: SystemError: A system error occurred:
uv_interface_addresses returned Unknown system error 1 (Unknown system error 1)
at Object.networkInterfaces (node:os:218:16)
at Function.assumeNetworkInterfaceNames (.../@homebridge/ciao/src/NetworkManager.ts:527:23)
The recovery script also writes a [gateway-recovery] WARNING: /tmp/nemoclaw-proxy-env.sh missing — gateway launching without library guards (#2478) to the log just before launching — but only to the log, and the user has no actionable next step short of rebuild --yes.
Root Cause (current code on main)
src/lib/agent/runtime.ts → buildOpenClawRecoveryScript() constructs the shell that runs when connect decides to relaunch the gateway. When /tmp/nemoclaw-proxy-env.sh is missing it takes the warn-and-proceed branch by design:
if [ -r /tmp/nemoclaw-proxy-env.sh ];then. /tmp/nemoclaw-proxy-env.sh; _PE_MISSING=0;else _PE_MISSING=1;fi;
[ "$_PE_MISSING"="1" ] && { ...echo WARNING...; };
[ "$_PE_MISSING"="0" ] && [ "$_GUARDS_MISSING"="1" ] && { ...exit 1... };# only partial-failure hard-fails
launchCommand # runs even when _PE_MISSING=1
The comment in the source spells out the trade-off:
"A missing env file remains warning-only; a present env file that does not install required guards is a hard failure because launching would create an unguarded gateway."
That trade-off is fine on x86 cloud (where os.networkInterfaces() succeeds) and a guaranteed crash loop on aarch64 / DGX Spark.
What landed adjacent on main (does NOT solve this)
refactor(runtime): extract entrypoint preload modules #3109 (merged 2026-05-06) — extracts the five preload bodies from scripts/nemoclaw-start.sh heredocs into standalone modules at nemoclaw-blueprint/scripts/, baked into the image at /usr/local/lib/nemoclaw/preloads/. This is the prerequisite for the proper fix here but does not itself wire the recovery path to use them.
Shared install-preloads.sh — factor the preload install logic out of scripts/nemoclaw-start.sh so it can be invoked by both the entrypoint and the recovery path. Installs from /usr/local/lib/nemoclaw/preloads/ (established by refactor(runtime): extract entrypoint preload modules #3109) into /tmp with correct ownership/mode, and emits /tmp/nemoclaw-proxy-env.sh dynamically.
New recovery module — src/lib/sandbox/guard-recovery.ts (or colocate with src/lib/dashboard/recover.ts — pick during rebase):
checkGuardsPresent(sandbox) — kubectl-exec stat.
reEmitGuards(sandbox) — invokes install-preloads.sh inside the sandbox as root (bypasses Landlock) before gateway launch.
Wire into buildOpenClawRecoveryScript — when missing-guards is detected, re-emit before launching the gateway instead of warning and proceeding. The "_PE_MISSING=1 → WARNING" branch becomes "_PE_MISSING=1 → re-emit → re-source → continue."
Tests
Unit: guard-recovery.test.ts — presence check, re-emit invocation, error handling.
E2E: update test/e2e/test-issue-2478-crash-loop-recovery.sh Phase 4 — the negative case (proxy-env.sh removed) should now assert successful re-emission instead of the WARNING.
Out of scope (intentional)
--repair flag on sandbox connect — recovery should be automatic.
Further refactor of nemoclaw-start.sh beyond extracting install-preloads.sh.
TL;DR
On DGX Spark / GB10 / aarch64, after any pod recreate that doesn't go through
nemoclaw <sandbox> rebuild(host reboot, OOM, supervisor crash, manualkubectl delete pod), runningnemoclaw <sandbox> connectputs the gateway into an infinite crash loop. The TUI showsgateway disconnected: closed | idle. The only recovery isnemoclaw <sandbox> rebuild --yes— a ~5 minute Docker image rebuild.The recovery code path detects the condition (it emits
[gateway-recovery] WARNING: /tmp/nemoclaw-proxy-env.sh missing) but launches the gateway anyway, naked, with no NODE_OPTIONS preloads. On aarch64 the@homebridge/ciaomDNS package then throwsuv_interface_addresses returned Unknown system error 1because the OpenShell sandbox netns blocks the syscall, and the gateway crash-loops forever.Steps to Reproduce
Onboard a sandbox with any provider:
Verify guards are present (the
--requirepreloads + the env file that chains them viaNODE_OPTIONS):Expected: both exist, owned
sandbox:sandbox, mode0444.Force a pod recreate without going through rebuild. Any of these triggers the bug:
docker exec openshell-cluster-<gateway> kubectl -n openshell delete pod <sandbox>The openshell-sandbox-controller recreates the pod with a fresh container in ~5–10 s.
Re-verify guard files are gone:
Observed:
No such file or directoryon both.Run
nemoclaw <sandbox> connect. The gateway respawns and immediately enters a crash loop with the ciao stack trace below.TUI displays:
Only
nemoclaw <sandbox> rebuild --yesrecovers — a full image build (~5 min on ARM64 / GB10).What the user sees
Gateway log (repeats every few seconds, forever):
The recovery script also writes a
[gateway-recovery] WARNING: /tmp/nemoclaw-proxy-env.sh missing — gateway launching without library guards (#2478)to the log just before launching — but only to the log, and the user has no actionable next step short ofrebuild --yes.Root Cause (current code on
main)src/lib/agent/runtime.ts → buildOpenClawRecoveryScript()constructs the shell that runs whenconnectdecides to relaunch the gateway. When/tmp/nemoclaw-proxy-env.shis missing it takes the warn-and-proceed branch by design:The comment in the source spells out the trade-off:
That trade-off is fine on x86 cloud (where
os.networkInterfaces()succeeds) and a guaranteed crash loop on aarch64 / DGX Spark.What landed adjacent on
main(does NOT solve this)/tmpfiles.scripts/nemoclaw-start.shheredocs into standalone modules atnemoclaw-blueprint/scripts/, baked into the image at/usr/local/lib/nemoclaw/preloads/. This is the prerequisite for the proper fix here but does not itself wire the recovery path to use them.Earlier fix attempts
Remaining scope (proposed fix)
install-preloads.sh— factor the preload install logic out ofscripts/nemoclaw-start.shso it can be invoked by both the entrypoint and the recovery path. Installs from/usr/local/lib/nemoclaw/preloads/(established by refactor(runtime): extract entrypoint preload modules #3109) into/tmpwith correct ownership/mode, and emits/tmp/nemoclaw-proxy-env.shdynamically.src/lib/sandbox/guard-recovery.ts(or colocate withsrc/lib/dashboard/recover.ts— pick during rebase):checkGuardsPresent(sandbox)— kubectl-exec stat.reEmitGuards(sandbox)— invokesinstall-preloads.shinside the sandbox as root (bypasses Landlock) before gateway launch.buildOpenClawRecoveryScript— when missing-guards is detected, re-emit before launching the gateway instead of warning and proceeding. The "_PE_MISSING=1→ WARNING" branch becomes "_PE_MISSING=1→ re-emit → re-source → continue."guard-recovery.test.ts— presence check, re-emit invocation, error handling.test/e2e/test-issue-2478-crash-loop-recovery.shPhase 4 — the negative case (proxy-env.sh removed) should now assert successful re-emission instead of the WARNING.Out of scope (intentional)
--repairflag onsandbox connect— recovery should be automatic.nemoclaw-start.shbeyond extractinginstall-preloads.sh.Environment
Linux <host> 6.17.0-1014-nvidia aarch64)main(≥ v0.0.61) as of 2026-06-09Original logs / artifacts
Stale lock left behind by the crash-looping gateway:
This lock prevents subsequent gateway-start attempts from succeeding cleanly even after
/tmpis repopulated, until it's removed.nemoclaw debug --quick --sandbox <sandbox>capture archived at debug-output-2026-04-29-1847.txt.Filed by @camerono 2026-04-29 · re-scoped 2026-05-07 · rewritten for clarity 2026-06-09.