[Brev][Onboard][v0.0.39] Docker-driver gateway: sandbox→host:8080 fails on UFW-active hosts; sandbox crash-loops with "Policy fetch failed" until orphan cleanup #3439

@stevenrick

Description

On NemoClaw v0.0.39 (Docker-driver default), nemoclaw onboard succeeds through [1/8] Preflight (including ✓ Docker-driver gateway is healthy) and through the image build/upload, then silently hangs at "sandbox is ready" for 180s before the orphan-cleanup path runs and bails with:

Sandbox 'my-assistant' was created but did not become ready within 180s.
The orphaned sandbox has been removed — you can safely retry.

Root cause: the sandbox container can't open a TCP connection from its bridge IP (e.g. 172.19.0.2) to the host bridge gateway (host.openshell.internal = 172.19.0.1:8080) because the host's UFW INPUT chain drops it. Inside the sandbox container, the log shows repeated WARN openshell_sandbox: Policy fetch failed, retrying, followed by Error: Policy fetch failed after 5 attempts: failed to connect to OpenShell server. The sandbox crash-loops 5x in 180s and never reaches Ready.

This is the same architectural class as #3340 (Brev UFW blocks managed Ollama auth proxy on :11435) — but applied to the gateway port itself, so the blast radius is 100% of sandbox starts on UFW-active hosts, not just the Ollama path.

On v0.0.38 (K3s-driver) the same hosts work because the gateway lived inside the cluster container's network namespace and the sandbox→gateway path never touched the host INPUT chain. v0.0.39's docker-driver moved the gateway to a host process listening on the bridge gateway IP, introducing a new networking requirement that the existing readiness probe in src/lib/onboard/gateway-tcp-readiness.ts doesn't cover (that probe correctly verifies something is listening on 127.0.0.1:8080 — but the failure mode here is the path from sandbox bridge → host, not host loopback).

The symptom is opaque: the wizard prints ✓ Docker-driver gateway is healthy two seconds before sandbox creation starts, then goes silent for 3 minutes before reporting "did not become ready within 180s". No log line points at the firewall.

Environment

Device:        Brev shadeform Ubuntu (NVIDIA flavor)
OS:            Ubuntu (kernel 6.11.0-1016-nvidia)
Architecture:  x86_64
Node.js:       v22.22.2
Docker:        29.4.3 (build 055a478)
OpenShell CLI: openshell 0.0.37
NemoClaw:      v0.0.39 (08db33d5b)
Sandbox base:  ghcr.io/nvidia/nemoclaw/sandbox-base:latest (revision 47a54a53)
Provider:      NVIDIA Endpoints, Nemotron 3 Super 120B
UFW:           active; default-deny incoming; only 22/tcp ALLOW (Brev default)

Steps to Reproduce

  1. Provision a Brev VM with default UFW posture (active, INPUT policy DROP, only 22/tcp ALLOW).
  2. Install NemoClaw v0.0.39: curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash (or build from source at 08db33d5b).
  3. Configure NVIDIA_API_KEY and run NEMOCLAW_NON_INTERACTIVE=1 NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 nemoclaw onboard.
  4. Wait through preflight (passes), gateway startup (passes), sandbox image build (~3 min), upload (~30s), then sandbox-create.
  5. After 180s the wizard reports "Sandbox … did not become ready within 180s. The orphaned sandbox has been removed — you can safely retry."
  6. Retry: same result, deterministically.

Expected Result

Either:

  • Preflight catches the unreachable path before sandbox creation and prints a clear error with the firewall remediation, OR
  • Sandbox reaches Ready within the normal window.

Actual Result

Preflight passes (✓ Docker-driver gateway is healthy), the sandbox crash-loops on Policy fetch failed, and orphan cleanup deletes it after 180s. The wizard suggests a retry, which deterministically fails the same way.

Diagnosis

From inside the sandbox container during the crash-loop:

$ docker exec <sandbox> getent hosts host.openshell.internal
172.19.0.1      host.openshell.internal

$ docker exec <sandbox> curl -sv --max-time 5 http://host.openshell.internal:8080/
*   Trying 172.19.0.1:8080...
* Connection timed out after 5002 milliseconds

From the host:

$ sudo iptables -L INPUT -n -v | head -3
Chain INPUT (policy DROP 6830 packets, 383K bytes)

$ sudo ufw status
Status: active
22/tcp                     ALLOW       Anywhere

$ curl -sI -o /dev/null -w 'HTTP %{http_code}\n' http://172.19.0.1:8080/
HTTP 404

Gateway is listening on the bridge IP correctly. The host can reach itself. The sandbox cannot — UFW's INPUT chain drops the SYN.
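
The timeout in the sandbox-side curl is itself diagnostic: a dropped SYN shows up as a connect timeout, whereas a port with no listener normally answers with a reset and fails immediately. A minimal TypeScript sketch of that distinction, keyed on curl's exit codes (7 for a failed connect, 28 for a timeout). The names classifyCurlExit, describeReachability and Reachability are hypothetical and not part of NemoClaw.

```typescript
// Hypothetical helper, not part of NemoClaw: interprets the exit code of
// `curl --max-time N http://<gateway>:8080/` run from inside the sandbox.
type Reachability = "reachable" | "refused" | "timed-out" | "unknown";

function classifyCurlExit(exitCode: number): Reachability {
  switch (exitCode) {
    case 0:
      // Any HTTP response counts as reachable, including a 404 for "/".
      return "reachable";
    case 7:
      // curl could not connect; typically the connection was refused.
      return "refused";
    case 28:
      // curl gave up waiting; the SYN was probably never answered.
      return "timed-out";
    default:
      return "unknown";
  }
}

// Human-readable hint for each outcome; the wording is illustrative only.
const HINTS: Record<Reachability, string> = {
  reachable: "gateway is reachable from the sandbox network",
  refused: "connection refused: nothing listening on that address, or a REJECT rule",
  "timed-out": "connect timed out: consistent with a firewall dropping the SYN (e.g. UFW default-deny)",
  unknown: "unrecognized curl exit code; inspect the curl output directly",
};

function describeReachability(reachability: Reachability): string {
  return HINTS[reachability];
}
```

In the session above, the sandbox-side curl timed out after 5 seconds while the host-side curl got an HTTP response, which is the "timed-out" case and points at packet filtering, not at a missing listener.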

Workaround

SUBNET=$(docker network inspect openshell-docker --format '{{(index .IPAM.Config 0).Subnet}}')
sudo ufw allow from "$SUBNET" to any port 8080 proto tcp

After this rule is added, nemoclaw onboard completes cleanly and the sandbox reaches Ready in the normal window. Confirmed end-to-end including Tavily web-search and Telegram bridge.

Proposed Fix

Add a sibling probe to src/lib/onboard/gateway-tcp-readiness.ts — call it gateway-sandbox-reachability.ts — that runs after the host-side TCP readiness probe and before the wizard proceeds to sandbox image build. The probe:

  1. Inspects docker network openshell-docker for the bridge subnet.
  2. Spawns a short-lived container on that network (e.g. busybox:latest or whichever base image is already cached / pinned).
  3. Attempts TCP connect to host.openshell.internal:8080 (with a sensible timeout — 3-5s is plenty).
  4. On failure, fails preflight with an actionable error message including the workaround command.

This keeps the probe diagnostic-only: the installer does not mutate the firewall (correctly out of scope for a privileged installer).
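
The four steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the NemoClaw implementation. probeSandboxReachability, RunCommand and the option names are hypothetical. The command runner is injected so the logic can be exercised without Docker. The probe targets the bridge gateway IP reported by docker network inspect instead of host.openshell.internal, since a throwaway container may not have that name mapped. It also assumes the probe image ships curl, as the sandbox in the diagnosis above does.

```typescript
// Shape of a finished command; an assumption of this sketch.
interface CommandResult {
  status: number | null;
  stdout: string;
}

// Injected runner so the probe can be tested without Docker installed.
type RunCommand = (command: string, args: string[]) => CommandResult;

interface ProbeOptions {
  network?: string; // Docker network name, as used in the workaround
  port?: number; // gateway port
  image?: string; // probe image; assumed to contain curl
  timeoutSeconds?: number;
}

interface ProbeResult {
  ok: boolean;
  message: string;
}

function probeSandboxReachability(run: RunCommand, options: ProbeOptions = {}): ProbeResult {
  const network = options.network ?? "openshell-docker";
  const port = options.port ?? 8080;
  const image = options.image ?? "ghcr.io/nvidia/nemoclaw/sandbox-base:latest";
  const timeoutSeconds = options.timeoutSeconds ?? 5;

  // Step 1: read the bridge subnet and gateway IP of the sandbox network.
  const inspect = run("docker", [
    "network", "inspect", network, "--format",
    "{{(index .IPAM.Config 0).Subnet}} {{(index .IPAM.Config 0).Gateway}}",
  ]);
  const [subnet, gateway] = inspect.stdout.trim().split(/\s+/);
  if (inspect.status !== 0 || !subnet || !gateway) {
    return { ok: false, message: `Could not read subnet and gateway of Docker network '${network}'.` };
  }

  // Steps 2-3: short-lived container on that network tries to reach the gateway.
  // curl exits 0 on any HTTP response, so a 404 still proves the path works.
  const probe = run("docker", [
    "run", "--rm", "--network", network, "--entrypoint", "curl", image,
    "-s", "-o", "/dev/null", "--max-time", String(timeoutSeconds),
    `http://${gateway}:${port}/`,
  ]);
  if (probe.status === 0) {
    return { ok: true, message: `Gateway ${gateway}:${port} is reachable from network '${network}'.` };
  }

  // Step 4: fail with the workaround; the probe itself never touches the firewall.
  return {
    ok: false,
    message: [
      `Containers on '${network}' (${subnet}) cannot reach the gateway at ${gateway}:${port} (curl exit ${probe.status}).`,
      "If a host firewall such as UFW is active, allow the bridge subnet:",
      `  sudo ufw allow from ${subnet} to any port ${port} proto tcp`,
    ].join("\n"),
  };
}
```

In Node, the runner could wrap spawnSync from node:child_process with encoding set to "utf8", which returns an object carrying status and stdout. Injecting it keeps the decision logic separate from process spawning, so the failure message can be unit-tested with a fake runner.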
