[macOS][Onboard] nemoclaw onboard step [4/8] fails with "Connection refused" while preflight reports gateway healthy — stale cached health (regression of #2020)

## Description

Description
<pre>On macOS with Colima freshly booted (or in any state where the Docker daemon has recently restarted), `nemoclaw onboard` step [1/8] preflight proceeds with cached health status and step [2/8] prints "Reusing healthy NemoClaw gateway.", but step [4/8] "Setting up inference provider" then fails immediately with "transport error / tcp connect error / Connection refused (os error 61)". At the same moment, `curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/` returns 502: the gateway container is up, but its upstream is not yet serving. This is a regression of NVBug 6090121 / GitHub #2020 ("Provider setup fails with Connection refused despite gateway showing healthy"), which was previously marked Bug - Fixed.
</pre>Environment
<pre>Device: MacBook (Apple M4)
OS: macOS 26.1 (Darwin 25.1.0, arm64)
Architecture: arm64
Node.js: v23.10.0
npm: 11.3.0
Docker: 27.4.0 (build bde2b89, via colima)
OpenShell CLI: 0.0.36
NemoClaw: v0.0.36
OpenClaw: N/A (onboard fails before sandbox creation)
</pre>Steps to Reproduce
<pre>1. Stop and restart Colima (or simulate a freshly-booted Docker daemon):
 colima stop && colima start
2. Within ~30 seconds (before the gateway upstream fully serves HTTP 200 on 8080), run:
 NEMOCLAW_PROVIDER=build nemoclaw onboard --fresh --name reproduce --non-interactive --yes --yes-i-accept-third-party-software --no-gpu
3. Observe that step [1/8] preflight warns "Proceeding with cached health status." and step [2/8] prints "Reusing healthy NemoClaw gateway.", while in another shell:
 curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/
 returns 502.
4. Observe step [4/8] "Setting up inference provider" fails with Connection refused.
</pre>Expected Result
<pre>- Step [1/8] preflight should probe http://localhost:8080/ directly and refuse to report "healthy" when the response is non-2xx
- If the gateway is still warming up, step [2/8] should wait for actual readiness (HTTP 200) before advancing
- Onboard should not advance to step [4/8] on a stale cached health verdict; it should either surface an actionable error or retry until the gateway is ready
</pre>Actual Result
<pre>[1/8] Preflight checks
 Warning: could not verify gateway container state (Docker may be unavailable). Proceeding with cached health status.
 ✓ Port 8080 available (OpenShell gateway)
 ✓ Apple GPU detected: Apple M4 (10 cores), 24576 MB unified memory
 ⓘ Local NIM unavailable — requires NVIDIA GPU
 Warning: could not verify gateway container state (Docker may be unavailable). Proceeding with cached health status.

[2/8] Starting OpenShell gateway
 [reuse] Skipping gateway (running)
 Reusing healthy NemoClaw gateway.

[3/8] Configuring inference (NIM)
 [non-interactive] Provider: build
 Chat Completions API available — OpenClaw will use openai-completions.
 Using NVIDIA Endpoints with model: nvidia/nemotron-3-super-120b-a12b

[4/8] Setting up inference provider
 ✓ Active gateway set to 'nemoclaw'
 Error: × transport error
 ├─▶ tcp connect error
 ├─▶ tcp connect error
 ╰─▶ Connection refused (os error 61)

 Error: × transport error ├─▶ tcp connect error ├─▶ tcp connect error ╰─▶ Connection refused (os error 61)
</pre>Logs
<pre>At the same time as the failure:
 $ curl -s -o /dev/null -w "Gateway HTTP: %{http_code}\n" http://localhost:8080/
 Gateway HTTP: 502

Docker container status:
 $ docker ps | grep openshell
 f838d5ce25fe ghcr.io/nvidia/openshell/cluster:0.0.36 ... Up About a minute (healthy) 0.0.0.0:8080->30051/tcp openshell-cluster-nemoclaw

So Docker reports the container as healthy and binding 8080->30051, but the upstream (port 30051 inside the container) is not yet serving HTTP, hence 502 from the host. Onboard's preflight uses cached gateway health rather than re-probing, so the user is told "healthy" when the gateway is actually unreachable.

Related: regression of NVBug 6090121 / GitHub #2020 (was marked Bug - Fixed).
</pre>
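
The readiness check asked for under Expected Result could look roughly like the following shell sketch. It is not NemoClaw's implementation: `probe_status`, `is_ready`, and `wait_for_gateway` are hypothetical names, and the attempt count, delay, and 5-second curl timeout are arbitrary defaults. It assumes, as this report does, that a 2xx response from http://localhost:8080/ means the gateway is actually serving. curl prints `000` as the status code when the TCP connection itself fails.

```shell
#!/usr/bin/env bash
# Sketch only: probe the gateway over HTTP instead of trusting a cached verdict.
# All function names here are hypothetical and not part of nemoclaw or openshell.

# Print the HTTP status code for a URL (curl prints 000 if no response arrives).
probe_status() {
  curl -s -o /dev/null -m 5 -w '%{http_code}' "$1"
}

# Succeed only for 2xx status codes; 502 and 000 both count as "not ready".
is_ready() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Poll until the probe reports 2xx or the attempts run out.
# Args: url [attempts=30] [delay_seconds=2] [probe_function=probe_status]
wait_for_gateway() {
  local url="$1" attempts="${2:-30}" delay="${3:-2}" probe="${4:-probe_status}"
  local i=1 code=""
  while [ "$i" -le "$attempts" ]; do
    # "|| true" keeps a failed curl (connection refused) from aborting the loop.
    code="$("$probe" "$url" || true)"
    if is_ready "$code"; then
      echo "gateway ready (HTTP $code) after $i attempt(s)"
      return 0
    fi
    echo "attempt $i/$attempts: HTTP ${code:-000}, not ready" >&2
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "gateway not ready after $attempts attempts (last HTTP ${code:-000})" >&2
  return 1
}
```

Sourced into a shell, `wait_for_gateway http://localhost:8080/ 30 2` polls every 2 seconds for up to 30 attempts and returns non-zero if the gateway never answers with a 2xx, which is the point at which onboard could surface an actionable error instead of advancing.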

## Bug Details

| Field | Value |
|-------|-------|
| Priority | Unprioritized |
| Action | Dev - Open - To fix |
| Disposition | Open issue |
| Module | Machine Learning - NemoClaw |
| Keyword | NemoClaw, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Inference, NemoClaw_Onboard |

---
[NVB#6158477]

[GitHub #3258]