[Brev][Onboard][v0.0.39] Docker-driver gateway: sandbox→host:8080 fails on UFW-active hosts; sandbox crash-loops with "Policy fetch failed" until orphan cleanup

## Description

On NemoClaw `v0.0.39` (Docker-driver default), `nemoclaw onboard` succeeds through `[1/8] Preflight` (including `✓ Docker-driver gateway is healthy`) and through the image build/upload, then silently hangs at "sandbox is ready" for 180s before the orphan-cleanup path runs and bails with:

```
Sandbox 'my-assistant' was created but did not become ready within 180s.
The orphaned sandbox has been removed — you can safely retry.
```

Root cause: the sandbox container can't TCP-connect from its bridge IP (e.g. `172.19.0.2`) to the host bridge gateway (`host.openshell.internal` = `172.19.0.1:8080`) because the host's UFW INPUT chain drops it. Inside the sandbox container, repeated `WARN openshell_sandbox: Policy fetch failed, retrying` followed by `Error: Policy fetch failed after 5 attempts: failed to connect to OpenShell server` — the sandbox crash-loops 5x in 180s and never reaches Ready.

This is the same architectural class as #3340 (Brev UFW blocks managed Ollama auth proxy on `:11435`) — but applied to **the gateway port itself**, so the blast radius is 100% of sandbox starts on UFW-active hosts, not just the Ollama path.

On `v0.0.38` (K3s-driver) the same hosts work because the gateway lived inside the cluster container's network namespace and the sandbox→gateway path never touched the host INPUT chain. `v0.0.39`'s docker-driver moved the gateway to a host process listening on the bridge gateway IP, introducing a new networking requirement that the existing readiness probe in `src/lib/onboard/gateway-tcp-readiness.ts` doesn't cover (that probe correctly verifies *something is listening on `127.0.0.1:8080`* — but the failure mode here is *the path from sandbox bridge → host*, not host loopback).

The symptom is opaque: the wizard prints `✓ Docker-driver gateway is healthy` two seconds before sandbox creation starts, then 12 minutes of silence before "did not become ready within 180s". No log line points at the firewall.

## Environment

```
Device:        Brev shadeform Ubuntu (NVIDIA flavor)
OS:            Ubuntu (kernel 6.11.0-1016-nvidia)
Architecture:  x86_64
Node.js:       v22.22.2
Docker:        29.4.3 (build 055a478)
OpenShell CLI: openshell 0.0.37
NemoClaw:      v0.0.39 (08db33d5b)
Sandbox base:  ghcr.io/nvidia/nemoclaw/sandbox-base:latest (revision 47a54a53)
Provider:      NVIDIA Endpoints, Nemotron 3 Super 120B
UFW:           active; default-deny incoming; only 22/tcp ALLOW (Brev default)
```

## Steps to Reproduce

1. Provision a Brev VM with default UFW posture (active, `INPUT policy DROP`, only `22/tcp ALLOW`).
2. Install NemoClaw v0.0.39: `curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash` (or build from source at `08db33d5b`).
3. Configure `NVIDIA_API_KEY` and run `NEMOCLAW_NON_INTERACTIVE=1 NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 nemoclaw onboard`.
4. Wait through preflight (passes), gateway startup (passes), sandbox image build (~3 min), upload (~30s), then sandbox-create.
5. After 180s the wizard reports "Sandbox … did not become ready within 180s. The orphaned sandbox has been removed — you can safely retry."
6. Retry: same result, deterministically.

## Expected Result

Either:
- Preflight catches the unreachable path before sandbox creation and prints a clear error with the firewall remediation, OR
- Sandbox reaches Ready within the normal window.

## Actual Result

Preflight passes (`✓ Docker-driver gateway is healthy`), sandbox crash-loops inside on `Policy fetch failed`, orphan-cleanup deletes it after 180s. Wizard suggests retry which deterministically fails the same way.

## Diagnosis

From inside the sandbox container during the crash-loop:

```
$ docker exec <sandbox> getent hosts host.openshell.internal
172.19.0.1      host.openshell.internal

$ docker exec <sandbox> curl -sv --max-time 5 http://host.openshell.internal:8080/
*   Trying 172.19.0.1:8080...
* Connection timed out after 5002 milliseconds
```

From the host:

```
$ sudo iptables -L INPUT -n -v | head -3
Chain INPUT (policy DROP 6830 packets, 383K bytes)

$ sudo ufw status
Status: active
22/tcp                     ALLOW       Anywhere

$ curl -sI -o /dev/null -w 'HTTP %{http_code}\n' http://172.19.0.1:8080/
HTTP 404
```

Gateway is listening on the bridge IP correctly. The host can reach itself. The sandbox cannot — UFW's INPUT chain drops the SYN.

## Workaround

```bash
SUBNET=$(docker network inspect openshell-docker --format '{{(index .IPAM.Config 0).Subnet}}')
sudo ufw allow from "$SUBNET" to any port 8080 proto tcp
```

After this rule is added, `nemoclaw onboard` completes cleanly and the sandbox reaches Ready in the normal window. Confirmed end-to-end including Tavily web-search and Telegram bridge.

## Proposed Fix

Add a sibling probe to `src/lib/onboard/gateway-tcp-readiness.ts` — call it `gateway-sandbox-reachability.ts` — that runs **after** the host-side TCP readiness probe and **before** the wizard proceeds to sandbox image build. The probe:

1. Inspects `docker network openshell-docker` for the bridge subnet.
2. Spawns a short-lived container on that network (e.g. `busybox:latest` or whichever base image is already cached / pinned).
3. Attempts TCP connect to `host.openshell.internal:8080` (with a sensible timeout — 3-5s is plenty).
4. On failure, fails preflight with an actionable error message including the workaround command.

Keeps the probe diagnostic-only — no firewall mutation by the installer (correctly out of scope for a privileged installer).

## Related

- #3340 — same UFW-blocks-sandbox-bridge pattern, for managed Ollama auth proxy `:11435`. Different port, same class of bug. Worth noting that `:8080` (gateway) is on the critical path for *every* sandbox start, where `:11435` only mattered for the Ollama provider path.
- The existing `gateway-tcp-readiness.ts` docstring already anticipates a "unified port-liveness probe API" tracked in #3213 — the new sandbox-bridge probe is a natural addition to that family.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Brev][Onboard][v0.0.39] Docker-driver gateway: sandbox→host:8080 fails on UFW-active hosts; sandbox crash-loops with "Policy fetch failed" until orphan cleanup #3439

Description

Environment

Steps to Reproduce

Expected Result

Actual Result

Diagnosis

Workaround

Proposed Fix

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[Brev][Onboard][v0.0.39] Docker-driver gateway: sandbox→host:8080 fails on UFW-active hosts; sandbox crash-loops with "Policy fetch failed" until orphan cleanup #3439

Description

Description

Environment

Steps to Reproduce

Expected Result

Actual Result

Diagnosis

Workaround

Proposed Fix

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions