Description
On NemoClaw v0.0.39 (Docker-driver default), nemoclaw onboard succeeds through [1/8] Preflight (including ✓ Docker-driver gateway is healthy) and through the image build/upload, then silently hangs at "sandbox is ready" for 180s before the orphan-cleanup path runs and bails with:
Sandbox 'my-assistant' was created but did not become ready within 180s.
The orphaned sandbox has been removed — you can safely retry.
Root cause: the sandbox container can't TCP-connect from its bridge IP (e.g. 172.19.0.2) to the host bridge gateway (host.openshell.internal = 172.19.0.1:8080) because the host's UFW INPUT chain drops it. Inside the sandbox container, repeated WARN openshell_sandbox: Policy fetch failed, retrying followed by Error: Policy fetch failed after 5 attempts: failed to connect to OpenShell server — the sandbox crash-loops 5x in 180s and never reaches Ready.
This is the same architectural class as #3340 (Brev UFW blocks managed Ollama auth proxy on :11435) — but applied to the gateway port itself, so the blast radius is 100% of sandbox starts on UFW-active hosts, not just the Ollama path.
On v0.0.38 (K3s-driver) the same hosts work because the gateway lived inside the cluster container's network namespace and the sandbox→gateway path never touched the host INPUT chain. v0.0.39's docker-driver moved the gateway to a host process listening on the bridge gateway IP, introducing a new networking requirement that the existing readiness probe in src/lib/onboard/gateway-tcp-readiness.ts doesn't cover (that probe correctly verifies something is listening on 127.0.0.1:8080 — but the failure mode here is the path from sandbox bridge → host, not host loopback).
The symptom is opaque: the wizard prints ✓ Docker-driver gateway is healthy two seconds before sandbox creation starts, then 12 minutes of silence before "did not become ready within 180s". No log line points at the firewall.
Environment
Device: Brev shadeform Ubuntu (NVIDIA flavor)
OS: Ubuntu (kernel 6.11.0-1016-nvidia)
Architecture: x86_64
Node.js: v22.22.2
Docker: 29.4.3 (build 055a478)
OpenShell CLI: openshell 0.0.37
NemoClaw: v0.0.39 (08db33d5b)
Sandbox base: ghcr.io/nvidia/nemoclaw/sandbox-base:latest (revision 47a54a53)
Provider: NVIDIA Endpoints, Nemotron 3 Super 120B
UFW: active; default-deny incoming; only 22/tcp ALLOW (Brev default)
Steps to Reproduce
- Provision a Brev VM with default UFW posture (active,
INPUT policy DROP, only 22/tcp ALLOW).
- Install NemoClaw v0.0.39:
curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash (or build from source at 08db33d5b).
- Configure
NVIDIA_API_KEY and run NEMOCLAW_NON_INTERACTIVE=1 NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 nemoclaw onboard.
- Wait through preflight (passes), gateway startup (passes), sandbox image build (~3 min), upload (~30s), then sandbox-create.
- After 180s the wizard reports "Sandbox … did not become ready within 180s. The orphaned sandbox has been removed — you can safely retry."
- Retry: same result, deterministically.
Expected Result
Either:
- Preflight catches the unreachable path before sandbox creation and prints a clear error with the firewall remediation, OR
- Sandbox reaches Ready within the normal window.
Actual Result
Preflight passes (✓ Docker-driver gateway is healthy), sandbox crash-loops inside on Policy fetch failed, orphan-cleanup deletes it after 180s. Wizard suggests retry which deterministically fails the same way.
Diagnosis
From inside the sandbox container during the crash-loop:
$ docker exec <sandbox> getent hosts host.openshell.internal
172.19.0.1 host.openshell.internal
$ docker exec <sandbox> curl -sv --max-time 5 http://host.openshell.internal:8080/
* Trying 172.19.0.1:8080...
* Connection timed out after 5002 milliseconds
From the host:
$ sudo iptables -L INPUT -n -v | head -3
Chain INPUT (policy DROP 6830 packets, 383K bytes)
$ sudo ufw status
Status: active
22/tcp ALLOW Anywhere
$ curl -sI -o /dev/null -w 'HTTP %{http_code}\n' http://172.19.0.1:8080/
HTTP 404
Gateway is listening on the bridge IP correctly. The host can reach itself. The sandbox cannot — UFW's INPUT chain drops the SYN.
Workaround
SUBNET=$(docker network inspect openshell-docker --format '{{(index .IPAM.Config 0).Subnet}}')
sudo ufw allow from "$SUBNET" to any port 8080 proto tcp
After this rule is added, nemoclaw onboard completes cleanly and the sandbox reaches Ready in the normal window. Confirmed end-to-end including Tavily web-search and Telegram bridge.
Proposed Fix
Add a sibling probe to src/lib/onboard/gateway-tcp-readiness.ts — call it gateway-sandbox-reachability.ts — that runs after the host-side TCP readiness probe and before the wizard proceeds to sandbox image build. The probe:
- Inspects
docker network openshell-docker for the bridge subnet.
- Spawns a short-lived container on that network (e.g.
busybox:latest or whichever base image is already cached / pinned).
- Attempts TCP connect to
host.openshell.internal:8080 (with a sensible timeout — 3-5s is plenty).
- On failure, fails preflight with an actionable error message including the workaround command.
Keeps the probe diagnostic-only — no firewall mutation by the installer (correctly out of scope for a privileged installer).
Related
Description
On NemoClaw
v0.0.39(Docker-driver default),nemoclaw onboardsucceeds through[1/8] Preflight(including✓ Docker-driver gateway is healthy) and through the image build/upload, then silently hangs at "sandbox is ready" for 180s before the orphan-cleanup path runs and bails with:Root cause: the sandbox container can't TCP-connect from its bridge IP (e.g.
172.19.0.2) to the host bridge gateway (host.openshell.internal=172.19.0.1:8080) because the host's UFW INPUT chain drops it. Inside the sandbox container, repeatedWARN openshell_sandbox: Policy fetch failed, retryingfollowed byError: Policy fetch failed after 5 attempts: failed to connect to OpenShell server— the sandbox crash-loops 5x in 180s and never reaches Ready.This is the same architectural class as #3340 (Brev UFW blocks managed Ollama auth proxy on
:11435) — but applied to the gateway port itself, so the blast radius is 100% of sandbox starts on UFW-active hosts, not just the Ollama path.On
v0.0.38(K3s-driver) the same hosts work because the gateway lived inside the cluster container's network namespace and the sandbox→gateway path never touched the host INPUT chain.v0.0.39's docker-driver moved the gateway to a host process listening on the bridge gateway IP, introducing a new networking requirement that the existing readiness probe insrc/lib/onboard/gateway-tcp-readiness.tsdoesn't cover (that probe correctly verifies something is listening on127.0.0.1:8080— but the failure mode here is the path from sandbox bridge → host, not host loopback).The symptom is opaque: the wizard prints
✓ Docker-driver gateway is healthytwo seconds before sandbox creation starts, then 12 minutes of silence before "did not become ready within 180s". No log line points at the firewall.Environment
Steps to Reproduce
INPUT policy DROP, only22/tcp ALLOW).curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash(or build from source at08db33d5b).NVIDIA_API_KEYand runNEMOCLAW_NON_INTERACTIVE=1 NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 nemoclaw onboard.Expected Result
Either:
Actual Result
Preflight passes (
✓ Docker-driver gateway is healthy), sandbox crash-loops inside onPolicy fetch failed, orphan-cleanup deletes it after 180s. Wizard suggests retry which deterministically fails the same way.Diagnosis
From inside the sandbox container during the crash-loop:
From the host:
Gateway is listening on the bridge IP correctly. The host can reach itself. The sandbox cannot — UFW's INPUT chain drops the SYN.
Workaround
After this rule is added,
nemoclaw onboardcompletes cleanly and the sandbox reaches Ready in the normal window. Confirmed end-to-end including Tavily web-search and Telegram bridge.Proposed Fix
Add a sibling probe to
src/lib/onboard/gateway-tcp-readiness.ts— call itgateway-sandbox-reachability.ts— that runs after the host-side TCP readiness probe and before the wizard proceeds to sandbox image build. The probe:docker network openshell-dockerfor the bridge subnet.busybox:latestor whichever base image is already cached / pinned).host.openshell.internal:8080(with a sensible timeout — 3-5s is plenty).Keeps the probe diagnostic-only — no firewall mutation by the installer (correctly out of scope for a privileged installer).
Related
:11435. Different port, same class of bug. Worth noting that:8080(gateway) is on the critical path for every sandbox start, where:11435only mattered for the Ollama provider path.gateway-tcp-readiness.tsdocstring already anticipates a "unified port-liveness probe API" tracked in Epic: unify warnings, advisories, and fatal exits under a single registry #3213 — the new sandbox-bridge probe is a natural addition to that family.