[All Platforms][Onboard] ./install.sh and nemoclaw onboard enter dead loop — sandbox can't reach gateway, "firewall" hint is misleading #3456

@zNeill

Description


On a freshly cleaned Linux host, running `./install.sh` from a clone of NemoClaw main
HEAD (v0.0.39-45-gc517d622c) enters a dead loop between two error states and never
completes onboarding. Reproduced on 6 distinct Linux hosts (ubuntu24, ubuntu24-gpu,
ubuntu26, dgxspark, dgx-station, WSL2 Ubuntu 24.04). The v0.0.38 release tag
(curl|bash install path) works, so the regression was introduced in the ~45 commits
between v0.0.38 and main HEAD.

Environment
Device:        GitLab CI runner host (hostname 2u2g-gen-0642, ubuntu24-gpu runner)
OS:            Ubuntu 24.04 (Linux 6.x)
Architecture:  x86_64
Node.js:       v22.22.2
npm:           10.9.7
Docker:        29.4.0
OpenShell CLI: openshell 0.0.39
NemoClaw:      v0.0.39-45-gc517d622c (main HEAD, commit c517d622c on 2026-05-13)
OpenClaw:      N/A (onboard never completes)

Steps to Reproduce
Clean Linux host (Ubuntu 24.04, GLIBC >= 2.38, docker installed).

1. git clone https://github.com/NVIDIA/NemoClaw.git
2. cd NemoClaw
3. NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 ./install.sh
   -> Fails at [2/8] Starting OpenShell gateway with "host firewall is blocking" error.

4. Follow the on-screen hint:
   sudo ufw allow from 172.18.0.0/16 to any port 8080 proto tcp
   (`ufw status` reports inactive, but the user follows the hint anyway.)

5. Re-run ./install.sh
   -> Passes [2/8] this time (only because the gateway from step 3 is still running).
   -> Fails with: "Existing gateway was started without GPU passthrough. To enable
      GPU, destroy the existing sandbox and gateway, then re-onboard:
        nemoclaw  destroy --yes && nemoclaw onboard --gpu"

6. nemoclaw list
   -> "No sandboxes registered", so there is no sandbox to destroy.

7. nemoclaw uninstall
   -> Cleans state (note: prints "Destroyed gateway 'nemoclaw' skipped" even though
      the next ./install.sh run observes a clean state).

8. Re-run ./install.sh
   -> Back to step 3's firewall error. LOOP. User can never complete onboarding.

Expected Result
./install.sh completes onboarding in one pass: creates the openshell-docker
network, starts the gateway, registers sandbox "my-assistant", and
`nemoclaw my-assistant status` returns Ready.

Actual Result
Onboarding never completes; two error states alternate.

State A (starting from a clean state, steps 3 and 8):
  [2/8] Starting OpenShell gateway
    Starting OpenShell Docker-driver gateway...
    Gateway log: ~/.local/state/nemoclaw/openshell-docker-gateway/openshell-gateway.log
    ✗ Sandbox containers cannot reach the gateway at host.openshell.internal:8080.
      A host firewall is blocking traffic from the sandbox bridge.
      To allow it:
        sudo ufw allow from 172.18.0.0/16 to any port 8080 proto tcp
      Then re-run `nemoclaw onboard`.

State B (gateway from a prior run still up — step 5):
  [2/8] Starting OpenShell gateway
    ✓ Port 8080 already owned by NemoClaw OpenShell Docker gateway
    Existing gateway was started without GPU passthrough.
    To enable GPU, destroy the existing sandbox and gateway, then re-onboard:
      nemoclaw  destroy --yes && nemoclaw onboard --gpu

State B's suggested command is not actionable: `nemoclaw list` shows no
registered sandbox, so there is no sandbox to destroy. Only `nemoclaw uninstall`
clears state, and that returns us to State A. LOOP.
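
The dead end in State B could be avoided by choosing the recovery hint from the registry state instead of always printing the destroy command. A minimal sketch, assuming `nemoclaw list` prints the "No sandboxes registered" message quoted in step 6 when the registry is empty; `recovery_hint` is a hypothetical helper, not NemoClaw's implementation:

```shell
# recovery_hint LIST_OUTPUT
# Prints recovery advice for State B based on the text of `nemoclaw list`.
# Hypothetical helper for illustration; the message wording is an assumption.
recovery_hint() {
  case "$1" in
    *"No sandboxes registered"*)
      # Nothing to destroy, so point at the only command that clears the gateway.
      echo "No sandbox is registered; run 'nemoclaw uninstall' to clear the stale gateway, then re-run onboarding."
      ;;
    *)
      echo "Destroy the existing sandbox and gateway, then re-onboard with GPU enabled."
      ;;
  esac
}
```

It would be called as `recovery_hint "$(nemoclaw list 2>&1)"`. This only removes the unactionable hint; the underlying State A failure still has to be fixed separately.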

ROOT CAUSE (verified by ss -tlnp + docker probe):

Gateway listening sockets:
  127.0.0.1:8080
  172.18.0.1:8080   (openshell-docker bridge gateway IP)

Sandbox container is started with: --add-host=host.openshell.internal:host-gateway
On Linux, Docker's special "host-gateway" value resolves by default to the DEFAULT
bridge (docker0) gateway, which is 172.17.0.1, NOT the openshell-docker bridge
gateway (172.18.0.1). The gateway does NOT listen on 172.17.0.1:8080, so the
connection is refused, and onboard misreports the refusal as "firewall blocking".

Disabling ufw with `sudo ufw disable` does NOT fix it — verified on all 6 Linux
hosts. It is not a host firewall problem.

Manual reproduction of the connectivity failure (on openshell-docker network):
  $ docker run --rm --network openshell-docker \
      --add-host=host.openshell.internal:host-gateway \
      alpine sh -c 'apk add -q curl >/dev/null; curl -v --max-time 5 \
                    http://host.openshell.internal:8080'
  * Trying 172.17.0.1:8080...
  * connect to 172.17.0.1 port 8080 from 172.18.0.2 port 39514 failed: Connection refused
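
The same mismatch can be checked from the host side without starting a container. A minimal sketch, assuming `ss -tln` style output in which the fourth column is the local address; `listens_on` is a hypothetical helper name, not part of NemoClaw or OpenShell:

```shell
# listens_on IP PORT
# Reads `ss -tln` style output on stdin and succeeds if some listener covers
# IP:PORT, either by an exact match or through an IPv4 wildcard bind.
# Hypothetical helper for illustration only.
listens_on() {
  awk -v ip="$1" -v port="$2" '
    $1 == "LISTEN" {
      local_addr = $4
      if (local_addr == (ip ":" port) ||
          local_addr == ("0.0.0.0:" port) ||
          local_addr == ("*:" port)) {
        found = 1
      }
    }
    END { exit(found ? 0 : 1) }
  '
}
```

For example, `ss -tln | listens_on 172.17.0.1 8080 || echo "no listener on 172.17.0.1:8080"` would report the missing listener on the address that `host-gateway` resolves to.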

SUB-BUGS COMBINING INTO THE LOOP:

1. Gateway binds only 127.0.0.1 + 172.18.0.1 — should bind 0.0.0.0:8080 (or also
   include 172.17.0.1), OR onboard should pass
   `--add-host=host.openshell.internal:172.18.0.1` (explicit openshell-docker
   bridge gateway) instead of relying on Docker's `host-gateway` magic.

2. The "host firewall is blocking" diagnostic is hard-coded. It does not check
   `ufw status` first, it does not query iptables, and it assumes the only cause of
   "connection refused" is a firewall. On hosts with ufw inactive (most CI runners),
   the suggestion is wrong and wastes user/operator time.

3. The "nemoclaw  destroy --yes" suggestion is not actionable when no
   sandbox is registered. There is no CLI help that documents "use `nemoclaw
   uninstall` to clear stale gateway state".

4. `nemoclaw uninstall` prints "Destroyed gateway 'nemoclaw' skipped" — wording
   suggests the gateway was NOT destroyed, even though it actually clears state.
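
For sub-bug 2, the diagnostic could classify the probe result before blaming a firewall. A minimal sketch, assuming the reachability probe is a `curl` call (curl exits with 7 when it cannot connect, which includes "Connection refused", and with 28 on timeout) and that `ufw status` prints a `Status: active` line when enabled; `classify_probe` and its labels are hypothetical:

```shell
# classify_probe CURL_EXIT UFW_STATUS_TEXT
# Maps a curl exit code plus the text of `ufw status` to a coarse cause label.
# Hypothetical helper; the labels are illustrative, not NemoClaw output.
classify_probe() {
  case "$1" in
    0)
      echo "reachable"
      ;;
    7)
      # Could not connect. Most often nothing listens on the target IP:port.
      # A firewall REJECT rule can look the same, so treat this as a hint only.
      echo "no-listener"
      ;;
    28)
      # Timed out. Silently dropped packets are consistent with a firewall,
      # but only suspect ufw when it is actually active.
      case "$2" in
        *"Status: active"*) echo "firewall-suspected" ;;
        *) echo "timeout-unknown" ;;
      esac
      ;;
    *)
      echo "unknown"
      ;;
  esac
}
```

With the failure shown in the manual reproduction (connection refused, ufw inactive), this returns `no-listener`, which points at the bind address rather than at ufw.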

AFFECTED HOSTS:

Same dead loop (root cause #1 above):
  ubuntu24, ubuntu24-gpu, ubuntu26, dgxspark, dgx-station, wsl-x86 (WSL2 Ubuntu 24.04)

Independent bugs (NOT the same dead loop):
  ubuntu22 — openshell-gateway binary requires GLIBC 2.38, but Ubuntu 22.04 has
    GLIBC 2.35. Gateway process cannot start. Should be tracked as a separate bug.
  macOS  — onboard reports "Docker network 'openshell-docker' not found" at [2/8].
    Different symptom; needs separate triage.

WORKING VERSION:
  v0.0.38 tag installed via curl|bash. CI pipeline 51109488 ran 27 minutes of sanity
  tests on ubuntu22 against v0.0.38. The regression was introduced in the ~45
  commits between v0.0.38 release and main HEAD c517d622c.

DOWNSTREAM IMPACT:
  GitLab QA schedule pipelines 18424 (RUN_MATRIX=true) and 18426 (RUN_MATRIX_SLOW=true)
  use NEMOCLAW_TEST_INSTALL_FROM_URL=https://github.com/NVIDIA/NemoClaw/tree/main
  and have failed for 5 consecutive days because of this dead loop. Manual reproductions
  in pipelines 51130981 and 51133458 confirm the same behaviour.

Logs
Gateway listening (excerpt from openshell-gateway.log):
  INFO openshell_server::cli: TLS disabled — listening on plaintext HTTP
  INFO openshell_server::cli: Starting OpenShell server bind=127.0.0.1:8080
  INFO openshell_server: Using compute driver driver=docker
  INFO openshell_server: Server listening address=127.0.0.1:8080
  INFO openshell_server: Server listening address=172.18.0.1:8080

ss -tlnp on host:
  LISTEN 0 128 172.18.0.1:8080 0.0.0.0:* users:(("openshell-gatew",pid=3845489,fd=16))
  LISTEN 0 128 127.0.0.1:8080  0.0.0.0:* users:(("openshell-gatew",pid=3845489,fd=14))

Docker networks:
  bridge            172.17.0.0/16  gw 172.17.0.1
  openshell-docker  172.18.0.0/16  gw 172.18.0.1

Suggested fix (in order of impact):
  1. Bind gateway on 0.0.0.0:8080 (or also include 172.17.0.1) — single line in
     openshell-server cli binding logic.
  2. In onboard, replace `--add-host=host.openshell.internal:host-gateway` with
     `--add-host=host.openshell.internal:172.18.0.1` (the openshell-docker bridge
     gateway IP).
  3. Before printing "firewall is blocking", probe `ufw status` / inspect
     iptables, and verify gateway is actually reachable on the expected IP. Report
     the real cause when the test fails on the IP-binding side.
  4. When suggesting `nemoclaw  destroy --yes` in a context where no
     sandbox is registered, suggest `nemoclaw uninstall` (or a dedicated reset
     command) instead.
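
For fix 2, the bridge gateway IP does not have to be hard-coded, because Docker can report it. A minimal sketch, assuming the network is named `openshell-docker` as in the logs above and that its first IPAM config entry is the IPv4 one; both function names are hypothetical:

```shell
# bridge_gateway NETWORK
# Prints the gateway IP of the first IPAM config entry of a Docker network.
bridge_gateway() {
  docker network inspect "$1" --format '{{(index .IPAM.Config 0).Gateway}}'
}

# add_host_arg IP
# Prints the --add-host flag for host.openshell.internal, or fails if IP does
# not look like a dotted-quad IPv4 address (for example an empty lookup result).
add_host_arg() {
  printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$' || return 1
  printf '%s%s\n' '--add-host=host.openshell.internal:' "$1"
}
```

A caller could then pass `"$(add_host_arg "$(bridge_gateway openshell-docker)")"` to `docker run` in place of the `host-gateway` form, and fall back to an explicit error when the lookup returns nothing.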

Bug Details

Priority:     Unprioritized
Action:       Dev - Open - To fix
Disposition:  Open issue
Module:       Machine Learning - NemoClaw
Keyword:      NemoClaw, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Onboard, NemoClaw_Policy&Network

[NVB#6171920]

Metadata

Labels: NV QA (Bugs found by the NVIDIA QA Team), UAT (Issues flagged for User Acceptance Testing)
