[All Platforms][Onboard] ./install.sh and nemoclaw onboard enter dead loop — sandbox can't reach gateway, "firewall" hint is misleading

## Description

Description
<pre>On a freshly cleaned Linux host, `./install.sh` cloned from NemoClaw main HEAD
(v0.0.39-45-gc517d622c) enters a dead loop between two error states and never
completes onboarding. Reproduced on 6 distinct Linux hosts (ubuntu24, ubuntu24-gpu,
ubuntu26, dgxspark, dgx-station, WSL2 Ubuntu 24.04). The v0.0.38 release tag
(curl|bash install path) works, so the regression was introduced in the ~45
commits between v0.0.38 and main HEAD.
</pre>Environment
<pre>Device: GitLab CI runner host (hostname 2u2g-gen-0642, ubuntu24-gpu runner)
OS: Ubuntu 24.04 (Linux 6.x)
Architecture: x86_64
Node.js: v22.22.2
npm: 10.9.7
Docker: 29.4.0
OpenShell CLI: openshell 0.0.39
NemoClaw: v0.0.39-45-gc517d622c (main HEAD, commit c517d622c on 2026-05-13)
OpenClaw: N/A (onboard never completes)
</pre>Steps to Reproduce
<pre>Clean Linux host (Ubuntu 24.04, GLIBC >= 2.38, docker installed).

1. git clone https://github.com/NVIDIA/NemoClaw.git
2. cd NemoClaw
3. NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 ./install.sh
 -> Fails at [2/8] Starting OpenShell gateway with "host firewall is blocking" error.

4. Follow the on-screen hint:
 sudo ufw allow from 172.18.0.0/16 to any port 8080 proto tcp
 (Even though `ufw status` is inactive, user follows the hint.)

5. Re-run ./install.sh
 -> Passes [2/8] this time (only because the gateway from step 3 is still running).
 -> Fails with: "Existing gateway was started without GPU passthrough. To enable
 GPU, destroy the existing sandbox and gateway, then re-onboard:
 nemoclaw destroy --yes && nemoclaw onboard --gpu"

6. nemoclaw list
 -> "No sandboxes registered", so there is no sandbox to destroy.

7. nemoclaw uninstall
 -> Cleans state (note: prints "Destroyed gateway 'nemoclaw' skipped" even though
 next install.sh observes a clean state).

8. Re-run ./install.sh
 -> Back to step 3's firewall error. LOOP. User can never complete onboarding.
</pre>Expected Result
<pre>./install.sh completes onboarding in one pass: creates the openshell-docker
network, starts the gateway, registers sandbox "my-assistant", and
`nemoclaw my-assistant status` returns Ready.
</pre>Actual Result
<pre>Onboarding never completes; two error states alternate.

State A (after fresh state — steps 3 and 8):
 [2/8] Starting OpenShell gateway
 Starting OpenShell Docker-driver gateway...
 Gateway log: ~/.local/state/nemoclaw/openshell-docker-gateway/openshell-gateway.log
 ✗ Sandbox containers cannot reach the gateway at host.openshell.internal:8080.
 A host firewall is blocking traffic from the sandbox bridge.
 To allow it:
 sudo ufw allow from 172.18.0.0/16 to any port 8080 proto tcp
 Then re-run `nemoclaw onboard`.

State B (gateway from a prior run still up — step 5):
 [2/8] Starting OpenShell gateway
 ✓ Port 8080 already owned by NemoClaw OpenShell Docker gateway
 Existing gateway was started without GPU passthrough.
 To enable GPU, destroy the existing sandbox and gateway, then re-onboard:
 nemoclaw destroy --yes && nemoclaw onboard --gpu

State B's suggested command is not actionable: `nemoclaw list` shows no
registered sandbox, so there is no sandbox to destroy. Only `nemoclaw uninstall`
clears state, and that returns us to State A. LOOP.

ROOT CAUSE (verified by ss -tlnp + docker probe):

Gateway listening sockets:
 127.0.0.1:8080
 172.18.0.1:8080 (openshell-docker bridge gateway IP)

Sandbox container is started with: --add-host=host.openshell.internal:host-gateway
Docker's "host-gateway" magic constant on Linux resolves to the DEFAULT bridge
(docker0) gateway, which is 172.17.0.1, NOT the openshell-docker bridge gateway.
The gateway does NOT listen on 172.17.0.1:8080 -> "Connection refused" -> onboard
reports it as "firewall blocking".

Disabling ufw with `sudo ufw disable` does NOT fix it — verified on all 6 Linux
hosts. It is not a host firewall problem.

Manual reproduction of the connectivity failure (on openshell-docker network):
 $ docker run --rm --network openshell-docker \
 --add-host=host.openshell.internal:host-gateway \
 alpine sh -c 'apk add -q curl >/dev/null; curl -v --max-time 5 \
 http://host.openshell.internal:8080'
 * Trying 172.17.0.1:8080...
 * connect to 172.17.0.1 port 8080 from 172.18.0.2 port 39514 failed: Connection refused

SUB-BUGS COMBINING INTO THE LOOP:

1. Gateway binds only 127.0.0.1 + 172.18.0.1 — should bind 0.0.0.0:8080 (or also
 include 172.17.0.1), OR onboard should pass
 `--add-host=host.openshell.internal:172.18.0.1` (explicit openshell-docker
 bridge gateway) instead of relying on Docker's `host-gateway` magic.

2. The "host firewall is blocking" diagnostic is hard-coded. It does not check
 `ufw status` first and does not query iptables; it assumes a firewall is the
 only possible cause of "connection refused". On hosts with ufw inactive (most
 CI runners), the suggestion is wrong and wastes user/operator time.

3. The "nemoclaw destroy --yes" suggestion is not actionable when no
 sandbox is registered. There is no CLI help that documents "use `nemoclaw
 uninstall` to clear stale gateway state".

4. `nemoclaw uninstall` prints "Destroyed gateway 'nemoclaw' skipped" — wording
 suggests the gateway was NOT destroyed, even though it actually clears state.

AFFECTED HOSTS:

Same dead loop (root cause #1 above):
 ubuntu24, ubuntu24-gpu, ubuntu26, dgxspark, dgx-station, wsl-x86 (WSL2 Ubuntu 24.04)

Independent bugs (NOT the same dead loop):
 ubuntu22 — openshell-gateway binary requires GLIBC 2.38, but Ubuntu 22.04 has
 GLIBC 2.35. Gateway process cannot start. Should be tracked as a separate bug.
 macOS — onboard reports "Docker network 'openshell-docker' not found" at [2/8].
 Different symptom; needs separate triage.

WORKING VERSION:
 v0.0.38 tag installed via curl|bash. CI pipeline 51109488 ran 27 minutes of sanity
 tests on ubuntu22 against v0.0.38. The regression was introduced in the ~45
 commits between v0.0.38 release and main HEAD c517d622c.

DOWNSTREAM IMPACT:
 GitLab QA schedule pipelines 18424 (RUN_MATRIX=true) and 18426 (RUN_MATRIX_SLOW=true)
 use NEMOCLAW_TEST_INSTALL_FROM_URL=https://github.com/NVIDIA/NemoClaw/tree/main
 and have failed 5 consecutive days because of this dead loop. Manual reproductions
 in pipelines 51130981 and 51133458 confirm the same behaviour.
</pre>Logs
<pre>Gateway listening (excerpt from openshell-gateway.log):
 INFO openshell_server::cli: TLS disabled — listening on plaintext HTTP
 INFO openshell_server::cli: Starting OpenShell server bind=127.0.0.1:8080
 INFO openshell_server: Using compute driver driver=docker
 INFO openshell_server: Server listening address=127.0.0.1:8080
 INFO openshell_server: Server listening address=172.18.0.1:8080

ss -tlnp on host:
 LISTEN 0 128 172.18.0.1:8080 0.0.0.0:* users:(("openshell-gatew",pid=3845489,fd=16))
 LISTEN 0 128 127.0.0.1:8080 0.0.0.0:* users:(("openshell-gatew",pid=3845489,fd=14))

Docker networks:
 bridge 172.17.0.0/16 gw 172.17.0.1
 openshell-docker 172.18.0.0/16 gw 172.18.0.1

Suggested fix (in order of impact):
 1. Bind gateway on 0.0.0.0:8080 (or also include 172.17.0.1) — single line in
 openshell-server cli binding logic.
 2. In onboard, replace `--add-host=host.openshell.internal:host-gateway` with
 `--add-host=host.openshell.internal:172.18.0.1` (the openshell-docker bridge
 gateway IP).
 3. Before printing "firewall is blocking", probe `ufw status` / inspect
 iptables, and verify gateway is actually reachable on the expected IP. Report
 the real cause when the test fails on the IP-binding side.
 4. When no sandbox is registered, suggest `nemoclaw uninstall` (or a dedicated
 reset command) instead of `nemoclaw destroy --yes`.
</pre>
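
Suggested fix 3 could look roughly like the sketch below. It is an illustration only, not NemoClaw code: the helper names `listens_on` and `classify_probe` and the result strings are hypothetical. It assumes the probe is done with `curl --max-time`, where exit code 7 means the connection failed (for example "Connection refused") and exit code 28 means the request timed out. A firewall REJECT rule can also produce exit code 7, so the result is a hint rather than a proof.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; they are not part of NemoClaw or OpenShell.

# listens_on SS_OUTPUT IP PORT
# Succeeds if `ss -tln` style output shows a listener on IP:PORT
# or on a wildcard address for that port.
listens_on() {
  local ss_output=$1 ip=$2 port=$3
  printf '%s\n' "$ss_output" | awk -v ip="$ip" -v port="$port" '
    $1 == "LISTEN" {
      addr = $4   # local address:port column when ss is run with -t only
      if (addr == ip ":" port || addr == "0.0.0.0:" port || addr == "*:" port) found = 1
    }
    END { exit (found ? 0 : 1) }'
}

# classify_probe CURL_EXIT LISTENING
# Maps a curl exit code plus "is the gateway bound on the resolved IP" (yes|no)
# to a short diagnosis string.
classify_probe() {
  local curl_exit=$1 listening=$2
  case "$curl_exit" in
    0)  echo "reachable" ;;
    6)  echo "name-resolution-failed" ;;
    7)  if [ "$listening" = "no" ]; then
          echo "gateway-not-bound"        # nothing listens on the resolved IP
        else
          echo "refused-or-rejected"      # bound, yet refused: check firewall REJECT rules
        fi ;;
    28) echo "timeout-possible-firewall" ;; # silently dropped packets look like this
    *)  echo "unknown" ;;
  esac
}

# Sample input copied from the ss output in the Logs section above.
# On a live host this could come from `ss -tln`, and the resolved IP from
# `getent hosts host.openshell.internal` run inside the sandbox container.
ss_sample='LISTEN 0 128 172.18.0.1:8080 0.0.0.0:* users:(("openshell-gatew",pid=3845489,fd=16))
LISTEN 0 128 127.0.0.1:8080 0.0.0.0:* users:(("openshell-gatew",pid=3845489,fd=14))'

# The sandbox resolved host.openshell.internal to 172.17.0.1 and curl exited with 7.
if listens_on "$ss_sample" 172.17.0.1 8080; then bound=yes; else bound=no; fi
classify_probe 7 "$bound"   # prints: gateway-not-bound
```

With the data from this report the sketch prints `gateway-not-bound`, which points at the bind/`host-gateway` mismatch rather than at ufw. The firewall hint would then only be printed for the timeout case.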

## Bug Details

| Field | Value |
|-------|-------|
| Priority | Unprioritized |
| Action | Dev - Open - To fix |
| Disposition | Open issue |
| Module | Machine Learning - NemoClaw |
| Keyword | NemoClaw, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Onboard, NemoClaw_Policy&Network |

---
[NVB#6171920]
