Description
On a freshly cleaned Linux host, running `./install.sh` from a clone of NemoClaw
main HEAD (v0.0.39-45-gc517d622c) enters a dead loop between two error states and
never completes onboarding. Reproduced on 6 distinct Linux hosts (ubuntu24,
ubuntu24-gpu, ubuntu26, dgxspark, dgx-station, WSL2 Ubuntu 24.04). The v0.0.38
release tag (curl|bash install path) works, so the regression was introduced in the
~45 commits between v0.0.38 and main HEAD.
Environment
Device: GitLab CI runner host (hostname 2u2g-gen-0642, ubuntu24-gpu runner)
OS: Ubuntu 24.04 (Linux 6.x)
Architecture: x86_64
Node.js: v22.22.2
npm: 10.9.7
Docker: 29.4.0
OpenShell CLI: openshell 0.0.39
NemoClaw: v0.0.39-45-gc517d622c (main HEAD, commit c517d622c on 2026-05-13)
OpenClaw: N/A (onboard never completes)
Steps to Reproduce
Clean Linux host (Ubuntu 24.04, GLIBC >= 2.38, docker installed).
1. git clone https://github.com/NVIDIA/NemoClaw.git
2. cd NemoClaw
3. NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 ./install.sh
-> Fails at [2/8] Starting OpenShell gateway with "host firewall is blocking" error.
4. Follow the on-screen hint:
sudo ufw allow from 172.18.0.0/16 to any port 8080 proto tcp
(Even though `ufw status` is inactive, user follows the hint.)
5. Re-run ./install.sh
-> Passes [2/8] this time (only because the gateway from step 3 is still running).
-> Fails with: "Existing gateway was started without GPU passthrough. To enable
GPU, destroy the existing sandbox and gateway, then re-onboard:
nemoclaw destroy --yes && nemoclaw onboard --gpu"
6. nemoclaw list
-> "No sandboxes registered", so there is no sandbox to destroy.
7. nemoclaw uninstall
-> Cleans state (note: prints "Destroyed gateway 'nemoclaw' skipped" even though
next install.sh observes a clean state).
8. Re-run ./install.sh
-> Back to step 3's firewall error. LOOP. User can never complete onboarding.
Expected Result
./install.sh completes onboarding in one pass: creates the openshell-docker
network, starts the gateway, registers sandbox "my-assistant", and
`nemoclaw my-assistant status` returns Ready.
Actual Result
Onboarding never completes; two error states alternate.
State A (after fresh state — steps 3 and 8):
[2/8] Starting OpenShell gateway
Starting OpenShell Docker-driver gateway...
Gateway log: ~/.local/state/nemoclaw/openshell-docker-gateway/openshell-gateway.log
✗ Sandbox containers cannot reach the gateway at host.openshell.internal:8080.
A host firewall is blocking traffic from the sandbox bridge.
To allow it:
sudo ufw allow from 172.18.0.0/16 to any port 8080 proto tcp
Then re-run `nemoclaw onboard`.
State B (gateway from a prior run still up — step 5):
[2/8] Starting OpenShell gateway
✓ Port 8080 already owned by NemoClaw OpenShell Docker gateway
Existing gateway was started without GPU passthrough.
To enable GPU, destroy the existing sandbox and gateway, then re-onboard:
nemoclaw destroy --yes && nemoclaw onboard --gpu
State B's suggested command is not actionable: `nemoclaw list` shows no
registered sandbox, so there is nothing to destroy. Only `nemoclaw uninstall`
clears state, and that returns us to State A. LOOP.
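One way to make the State B hint actionable is to choose the reset command from what `nemoclaw list` reports. The sketch below is illustrative only: `reset_hint` is a hypothetical helper, not part of NemoClaw, and it assumes the only reliable marker in the `nemoclaw list` output is the string "No sandboxes registered" quoted above.

```shell
# Hypothetical helper: read `nemoclaw list` output on stdin and print the
# reset command that can actually succeed in that state.
reset_hint() {
  if grep -q "No sandboxes registered"; then
    # Nothing to destroy; only uninstall clears the stale gateway state.
    echo "nemoclaw uninstall"
  else
    # A sandbox exists, so the hint printed in State B applies as-is.
    echo "nemoclaw destroy --yes && nemoclaw onboard --gpu"
  fi
}
```

It would be used as `nemoclaw list | reset_hint`.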
ROOT CAUSE (verified by ss -tlnp + docker probe):
Gateway listening sockets:
127.0.0.1:8080
172.18.0.1:8080 (openshell-docker bridge gateway IP)
Sandbox container is started with: --add-host=host.openshell.internal:host-gateway
Docker's "host-gateway" magic constant on Linux resolves to the DEFAULT bridge
(docker0) gateway, which is 172.17.0.1, NOT the openshell-docker bridge gateway.
The gateway does NOT listen on 172.17.0.1:8080 -> "Connection refused" -> onboard
reports it as "firewall blocking".
Disabling ufw with `sudo ufw disable` does NOT fix it — verified on all 6 Linux
hosts. It is not a host firewall problem.
Manual reproduction of the connectivity failure (on openshell-docker network):
$ docker run --rm --network openshell-docker \
--add-host=host.openshell.internal:host-gateway \
alpine sh -c 'apk add -q curl >/dev/null; curl -v --max-time 5 \
http://host.openshell.internal:8080'
* Trying 172.17.0.1:8080...
* connect to 172.17.0.1 port 8080 from 172.18.0.2 port 39514 failed: Connection refused
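The mismatch can also be confirmed from the host side by checking whether any listener covers the address the sandbox dials. The following is a minimal sketch, assuming `ss -tln`-style output where the fourth column is the local `address:port`; `listener_covers` is a hypothetical helper name, and IPv6 wildcard binds (`[::]:8080`) are deliberately ignored.

```shell
# Hypothetical helper: return 0 if the `ss -tln`-style listing on stdin has a
# listener that accepts connections to IP:PORT (exact match or IPv4 wildcard).
listener_covers() {
  awk -v ip="$1" -v port="$2" '
    $1 == "LISTEN" {
      # Column 4 is the local address:port in `ss -tln` output.
      if ($4 == ip ":" port || $4 == "0.0.0.0:" port || $4 == "*:" port) found = 1
    }
    END { exit (found ? 0 : 1) }
  '
}
```

For example, `ss -tln | listener_covers 172.17.0.1 8080 || echo "gateway does not listen on the host-gateway address"` would flag the condition described above.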
SUB-BUGS COMBINING INTO THE LOOP:
1. Gateway binds only 127.0.0.1 + 172.18.0.1 — should bind 0.0.0.0:8080 (or also
include 172.17.0.1), OR onboard should pass
`--add-host=host.openshell.internal:172.18.0.1` (explicit openshell-docker
bridge gateway) instead of relying on Docker's `host-gateway` magic.
2. The "host firewall is blocking" diagnostic is hard-coded. It does not check
`ufw status` first; it does not query iptables; it assumes the only cause of
"connection refused" is firewall. On hosts with ufw inactive (most CI runners),
the suggestion is wrong and wastes user/operator time.
3. The "nemoclaw destroy --yes" suggestion is not actionable when no
sandbox is registered. There is no CLI help that documents "use `nemoclaw
uninstall` to clear stale gateway state".
4. `nemoclaw uninstall` prints "Destroyed gateway 'nemoclaw' skipped" — wording
suggests the gateway was NOT destroyed, even though it actually clears state.
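For sub-bug 2, a probe could distinguish "refused" from "timed out" before blaming a firewall. The sketch below maps curl exit statuses to a coarse diagnosis; `diagnose_probe` is a hypothetical helper, and the interpretation is a heuristic, not a guarantee: exit 7 (failed to connect, which includes "Connection refused") usually means nothing is listening on that address, although a firewall REJECT rule can look the same, while exit 28 (timeout) is more consistent with packets being dropped.

```shell
# Hypothetical helper: map a curl exit status from the gateway probe to a
# coarse diagnosis. 7 = failed to connect, 28 = operation timed out.
diagnose_probe() {
  case "$1" in
    0)  echo "reachable" ;;
    7)  echo "refused" ;;   # check the gateway bind address first
    28) echo "timeout" ;;   # a firewall DROP rule is a plausible cause
    *)  echo "unknown" ;;
  esac
}
```

A caller might run `curl -s -o /dev/null --max-time 5 http://host.openshell.internal:8080; diagnose_probe $?` inside the sandbox and print the ufw hint only for the "timeout" case.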
AFFECTED HOSTS:
Same dead loop (root cause #1 above):
ubuntu24, ubuntu24-gpu, ubuntu26, dgxspark, dgx-station, wsl-x86 (WSL2 Ubuntu 24.04)
Independent bugs (NOT the same dead loop):
ubuntu22 — openshell-gateway binary requires GLIBC 2.38, but Ubuntu 22.04 has
GLIBC 2.35. Gateway process cannot start. Should be tracked as a separate bug.
macOS — onboard reports "Docker network 'openshell-docker' not found" at [2/8].
Different symptom; needs separate triage.
WORKING VERSION:
v0.0.38 tag installed via curl|bash. CI pipeline 51109488 ran 27 minutes of sanity
tests on ubuntu22 against v0.0.38. The regression was introduced in the ~45
commits between v0.0.38 release and main HEAD c517d622c.
DOWNSTREAM IMPACT:
GitLab QA schedule pipelines 18424 (RUN_MATRIX=true) and 18426 (RUN_MATRIX_SLOW=true)
use NEMOCLAW_TEST_INSTALL_FROM_URL=https://github.com/NVIDIA/NemoClaw/tree/main
and have failed 5 consecutive days because of this dead loop. Manual reproductions
in pipelines 51130981 and 51133458 confirm the same behaviour.
Logs
Gateway listening (excerpt from openshell-gateway.log):
INFO openshell_server::cli: TLS disabled — listening on plaintext HTTP
INFO openshell_server::cli: Starting OpenShell server bind=127.0.0.1:8080
INFO openshell_server: Using compute driver driver=docker
INFO openshell_server: Server listening address=127.0.0.1:8080
INFO openshell_server: Server listening address=172.18.0.1:8080
ss -tlnp on host:
LISTEN 0 128 172.18.0.1:8080 0.0.0.0:* users:(("openshell-gatew",pid=3845489,fd=16))
LISTEN 0 128 127.0.0.1:8080 0.0.0.0:* users:(("openshell-gatew",pid=3845489,fd=14))
Docker networks:
bridge 172.17.0.0/16 gw 172.17.0.1
openshell-docker 172.18.0.0/16 gw 172.18.0.1
Suggested fix (in order of impact):
1. Bind gateway on 0.0.0.0:8080 (or also include 172.17.0.1) — single line in
openshell-server cli binding logic.
2. In onboard, replace `--add-host=host.openshell.internal:host-gateway` with
`--add-host=host.openshell.internal:172.18.0.1` (the openshell-docker bridge
gateway IP).
3. Before printing "firewall is blocking", probe `ufw status` / inspect
iptables, and verify gateway is actually reachable on the expected IP. Report
the real cause when the test fails on the IP-binding side.
4. When suggesting `nemoclaw destroy --yes` in a context where no
sandbox is registered, suggest `nemoclaw uninstall` (or a dedicated reset
command) instead.
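Fix 2 could avoid hard-coding 172.18.0.1 by looking up the bridge gateway at run time. This is a sketch under assumptions, not the onboard implementation: `bridge_gateway_ip` and `add_host_arg` are hypothetical helper names, the network is assumed to have a single IPv4 IPAM entry, and the address check is intentionally loose (digits and dots only).

```shell
# Hypothetical helper: print the gateway IP of a Docker network.
# Requires a running Docker daemon; assumes one IPv4 IPAM config entry.
bridge_gateway_ip() {
  docker network inspect --format '{{range .IPAM.Config}}{{.Gateway}}{{end}}' "$1"
}

# Hypothetical helper: build the --add-host flag from an explicit IPv4 address.
# Rejects empty or non-numeric input so a failed lookup cannot silently fall
# back to Docker's host-gateway behaviour.
add_host_arg() {
  case "$1" in
    ""|*[!0-9.]*) echo "add_host_arg: not an IPv4 address: '$1'" >&2; return 1 ;;
  esac
  printf '%s%s\n' '--add-host=host.openshell.internal:' "$1"
}
```

With the networks listed under Logs, `add_host_arg "$(bridge_gateway_ip openshell-docker)"` would be expected to print `--add-host=host.openshell.internal:172.18.0.1`, which can then be passed to `docker run` in place of the `host-gateway` mapping.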
Bug Details
| Field | Value |
| --- | --- |
| Priority | Unprioritized |
| Action | Dev - Open - To fix |
| Disposition | Open issue |
| Module | Machine Learning - NemoClaw |
| Keyword | NemoClaw, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Onboard, NemoClaw_Policy&Network |
[NVB#6171920]