Description
On macOS with Colima freshly booted (or in any state where the Docker daemon has recently restarted), `nemoclaw onboard` cannot verify the gateway container state during step [1/8] preflight and proceeds with cached health status. Step [2/8] then prints "Reusing healthy NemoClaw gateway.", but step [4/8] "Setting up inference provider" immediately fails with "transport error / tcp connect error / Connection refused (os error 61)". At the same moment, `curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/` returns 502: the gateway container is up, but the upstream is not actually serving. This is a regression of NVBug 6090121 / GitHub #2020 ("Provider setup fails with Connection refused despite gateway showing healthy"), which was previously marked Bug - Fixed.
Environment
Device: MacBook (Apple M4)
OS: macOS 26.1 (Darwin 25.1.0, arm64)
Architecture: arm64
Node.js: v23.10.0
npm: 11.3.0
Docker: 27.4.0 (build bde2b89, via colima)
OpenShell CLI: 0.0.36
NemoClaw: v0.0.36
OpenClaw: N/A (onboard fails before sandbox creation)
Steps to Reproduce
1. Stop and restart Colima (or simulate a freshly-booted Docker daemon):
colima stop && colima start
2. Within ~30 seconds (before the gateway upstream fully serves HTTP 200 on 8080), run:
NEMOCLAW_PROVIDER=build nemoclaw onboard --fresh --name reproduce --non-interactive --yes --yes-i-accept-third-party-software --no-gpu
3. Observe that preflight (step [1/8]) falls back to cached health status and step [2/8] prints "Reusing healthy NemoClaw gateway." while, in another shell:
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/
returns 502.
4. Observe step [4/8] "Setting up inference provider" fails with Connection refused.
Expected Result
- Step [1/8] preflight should actually probe http://localhost:8080/ and refuse to report the gateway as healthy when the response is non-2xx.
- If the gateway is still warming up, step [2/8] should wait for actual readiness (HTTP 200) before advancing.
- Onboard should not advance to step [4/8] on a stale cached health verdict; it should either surface an actionable error or retry until the gateway is ready.
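The readiness wait described above could look roughly like the following shell sketch, which can also serve as a manual workaround until the fix lands. It is not NemoClaw's implementation: the function names (`probe_status`, `is_ready`, `wait_for_gateway`), the attempt count, and the delay are illustrative assumptions, and it assumes that a 2xx response on the gateway URL is an adequate readiness signal.

```shell
#!/usr/bin/env bash
# Sketch only: poll the gateway until it answers with a 2xx status.
# All function names and default values here are hypothetical.

GATEWAY_URL="${GATEWAY_URL:-http://localhost:8080/}"

# Print the HTTP status code for a URL.
# curl prints 000 when no HTTP response was received (e.g. connection refused).
probe_status() {
  curl -s -o /dev/null -w "%{http_code}" --max-time 2 "$1" || true
}

# Succeed only for a 2xx status code.
is_ready() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Poll until the gateway is ready or the attempt budget is exhausted.
# Arguments: URL, max attempts (default 30), delay in seconds (default 2).
wait_for_gateway() {
  local url="$1"
  local attempts="${2:-30}"
  local delay="${3:-2}"
  local code
  local i=1
  while [ "$i" -le "$attempts" ]; do
    code="$(probe_status "$url")"
    if is_ready "$code"; then
      echo "gateway ready (HTTP $code) after $i attempt(s)"
      return 0
    fi
    echo "attempt $i/$attempts: HTTP $code, not ready" >&2
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

As a workaround, `wait_for_gateway "$GATEWAY_URL"` can be chained with `&&` in front of the `nemoclaw onboard` command from the reproduction steps, so that onboarding only starts once the probe succeeds.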
Actual Result
[1/8] Preflight checks
Warning: could not verify gateway container state (Docker may be unavailable). Proceeding with cached health status.
✓ Port 8080 available (OpenShell gateway)
✓ Apple GPU detected: Apple M4 (10 cores), 24576 MB unified memory
ⓘ Local NIM unavailable — requires NVIDIA GPU
Warning: could not verify gateway container state (Docker may be unavailable). Proceeding with cached health status.
[2/8] Starting OpenShell gateway
[reuse] Skipping gateway (running)
Reusing healthy NemoClaw gateway.
[3/8] Configuring inference (NIM)
[non-interactive] Provider: build
Chat Completions API available — OpenClaw will use openai-completions.
Using NVIDIA Endpoints with model: nvidia/nemotron-3-super-120b-a12b
[4/8] Setting up inference provider
✓ Active gateway set to 'nemoclaw'
Error: × transport error
├─▶ tcp connect error
├─▶ tcp connect error
╰─▶ Connection refused (os error 61)
Error: × transport error ├─▶ tcp connect error ├─▶ tcp connect error ╰─▶ Connection refused (os error 61)
Logs
At the same time as the failure:
$ curl -s -o /dev/null -w "Gateway HTTP: %{http_code}\n" http://localhost:8080/
Gateway HTTP: 502
Docker container status:
$ docker ps | grep openshell
f838d5ce25fe ghcr.io/nvidia/openshell/cluster:0.0.36 ... Up About a minute (healthy) 0.0.0.0:8080->30051/tcp openshell-cluster-nemoclaw
So Docker reports the container as healthy and binding 8080->30051, but the upstream (port 30051 inside the container) is not yet serving HTTP, hence 502 from the host. Onboard's preflight uses cached gateway health rather than re-probing, so the user is told "healthy" when the gateway is actually unreachable.
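To make the mismatch between Docker's health verdict and actual reachability easier to capture in future reports, the two signals can be checked side by side. The sketch below is a diagnostic aid only; the function names are hypothetical, the container name and URL are taken from the output above, and it assumes the container defines a Docker healthcheck.

```shell
#!/usr/bin/env bash
# Sketch only: compare Docker's health verdict with a live HTTP probe.
# Function names are hypothetical; container name and URL come from this report.

CONTAINER="${CONTAINER:-openshell-cluster-nemoclaw}"
GATEWAY_URL="${GATEWAY_URL:-http://localhost:8080/}"

# Print the container's Docker health status, or "unknown" if docker inspect fails.
container_health() {
  docker inspect --format '{{.State.Health.Status}}' "$1" 2>/dev/null || echo "unknown"
}

# Print the HTTP status code; curl prints 000 when no HTTP response was received.
http_status() {
  curl -s -o /dev/null -w "%{http_code}" --max-time 2 "$1" || true
}

# Combine both signals into one verdict.
# Arguments: Docker health status, HTTP status code.
gateway_verdict() {
  case "$1:$2" in
    healthy:2[0-9][0-9]) echo "ready" ;;
    healthy:*) echo "container healthy but upstream not serving (HTTP $2)" ;;
    *) echo "container not healthy ($1)" ;;
  esac
}

# Probe the real container and URL and print the combined verdict.
report_gateway() {
  gateway_verdict "$(container_health "$CONTAINER")" "$(http_status "$GATEWAY_URL")"
}
```

Under the conditions described in this report (container healthy, host probe returning 502), `report_gateway` would be expected to print the "container healthy but upstream not serving" verdict.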
Related: regression of NVBug 6090121 / GitHub #2020 (was marked Bug - Fixed).
Bug Details
| Field | Value |
| --- | --- |
| Priority | Unprioritized |
| Action | Dev - Open - To fix |
| Disposition | Open issue |
| Module | Machine Learning - NemoClaw |
| Keyword | NemoClaw, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Inference, NemoClaw_Onboard |
[NVB#6158477]