Description
nemoclaw onboard step 2 "Starting OpenShell gateway" uses the wrong liveness check. As long as port 18789 (dashboard/runtime) is occupied by any process, it skips gateway startup and prints [reuse] Skipping gateway (running) / Reusing healthy NemoClaw gateway., even when port 8080 (the actual openshell gateway port) is free and the openshell container no longer exists. Step 4 (sandbox registration) then fails deterministically with Connection refused (os error 111).
Environment
- Device: NVIDIA internal dev box (nemoclaw-qa-ubuntu2404, 64 GB RAM, no GPU) + local ipp1-1203 (local-glennz); 100% reproducible on both
- OS: Ubuntu 24.04
- OpenShell CLI: openshell 0.0.26
- NemoClaw: v0.0.18 (installed via install.sh)
- OpenClaw: bundled with v0.0.18
- Provider: NVIDIA Endpoints (build), model nvidia/nemotron-3-super-120b-a12b
- Test repo commit: feature/onboarding-and-tui-enhance @ 5782ee0
Reproduction Steps
First run a normal onboard to get a healthy sandbox, then manually create the "dashboard alive / openshell gateway container gone" residual state:
- Run nemoclaw onboard --non-interactive to completion. Verify docker ps shows openshell-cluster-nemoclaw (healthy) and ss -ltnp shows both 8080 and 18789 in LISTEN.
- Kill the openshell gateway container but keep port 18789 occupied by something (this repro uses an SSH reverse tunnel; could also be the nemoclaw dashboard itself or any python3 -m http.server 18789):
docker rm -f openshell-cluster-nemoclaw
ss -ltnp | grep -E '8080|18789'
# 8080: nothing; 18789: LISTEN (SSH or other process)
- Run nemoclaw onboard --non-interactive again (same user / same provider).
Actual Result
[1/8] Preflight checks
✓ Port 8080 available (OpenShell gateway) ← itself acknowledges 8080 is free
✓ Port 18789 already owned by healthy NemoClaw runtime (NemoClaw dashboard)
...
[2/8] Starting OpenShell gateway
[reuse] Skipping gateway (running) ← still misjudges as "running"
Reusing healthy NemoClaw gateway.
...
[4/8] Setting up inference provider
✓ Active gateway set to 'nemoclaw'
Error: × transport error
├─▶ tcp connect error
├─▶ tcp connect error
╰─▶ Connection refused (os error 111)
Reproducible on every run; the output is byte-for-byte identical to the QA-machine failure log for T67 ([T5882262]).
Expected Result
Step 2 should verify the actual liveness of the openshell gateway on 8080 (TCP handshake or openshell gateway info returning ready). If dead, it should restart the gateway container so that step 4 can register the sandbox successfully.
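A minimal sketch of the TCP-handshake variant of this check, assuming the gateway listens on 127.0.0.1:8080. The function name `is_tcp_port_accepting` and the one-second timeout are illustrative and are not taken from the NemoClaw source.

```python
import socket


def is_tcp_port_accepting(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True only if a TCP handshake to host:port completes."""
    try:
        # create_connection performs the full connect() and raises on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError (os error 111 on Linux) and timeouts.
        return False


if __name__ == "__main__":
    # Hypothetical usage: probe the gateway port (8080), not the dashboard port.
    print("gateway reachable:", is_tcp_port_accepting("127.0.0.1", 8080))
```

A completed handshake only shows that some process is listening on the port. Confirming that the listener is the openshell gateway would still need the stronger check mentioned above (openshell gateway info returning ready).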
Analysis:
The root cause is the liveness check in step 2: it only checks whether port 18789 (dashboard) is occupied and infers from that alone that the whole "NemoClaw runtime" is healthy, so it skips starting the openshell gateway on 8080. This directly contradicts the output printed by preflight a few lines earlier (✓ Port 8080 available (OpenShell gateway)): preflight already knows 8080 is free, yet step 2 ignores that signal.
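The contradiction can be modeled with a small sketch. Everything here is hypothetical: `PreflightResult`, `reuse_as_observed`, and `reuse_as_proposed` are illustrative names that model the behavior this report describes, not the actual NemoClaw code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreflightResult:
    """Hypothetical container for the two port signals preflight prints."""
    gateway_port_free: bool      # port 8080 (OpenShell gateway)
    dashboard_port_in_use: bool  # port 18789 (NemoClaw dashboard)


def reuse_as_observed(p: PreflightResult) -> bool:
    # Models the reported behavior: dashboard port occupancy alone
    # is treated as proof that the gateway is running.
    return p.dashboard_port_in_use


def reuse_as_proposed(p: PreflightResult) -> bool:
    # Models the expected behavior: reuse only when something is
    # actually listening on the gateway port.
    return not p.gateway_port_free


# The residual state from the reproduction steps: 8080 free, 18789 occupied.
residual = PreflightResult(gateway_port_free=True, dashboard_port_in_use=True)
print("observed reuse:", reuse_as_observed(residual))
print("proposed reuse:", reuse_as_proposed(residual))
```

In the residual state the observed logic chooses reuse while the proposed logic chooses a restart, which matches the mismatch between the preflight output and the step 2 output.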
[NVB# 6090121]