
[macOS][Onboard] nemoclaw onboard step [4/8] fails with "Connection refused" while preflight reports gateway healthy — stale cached health (regression of #2020) #3258

@hulynn

Description


On macOS with Colima freshly booted (or in any state where the Docker daemon has recently restarted), `nemoclaw onboard` cannot verify the gateway container during step [1/8] preflight and falls back to cached health status, after which step [2/8] prints "Reusing healthy NemoClaw gateway." Step [4/8] "Setting up inference provider" then fails immediately with "transport error / tcp connect error / Connection refused (os error 61)". At the same moment, `curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/` returns 502: the gateway container is up, but the upstream is not actually serving. This is a regression of NVBug 6090121 / GitHub #2020 ("Provider setup fails with Connection refused despite gateway showing healthy"), which was previously marked Bug - Fixed.
Environment
Device:        MacBook (Apple M4)
OS:            macOS 26.1 (Darwin 25.1.0, arm64)
Architecture:  arm64
Node.js:       v23.10.0
npm:           11.3.0
Docker:        27.4.0 (build bde2b89, via colima)
OpenShell CLI: 0.0.36
NemoClaw:      v0.0.36
OpenClaw:      N/A (onboard fails before sandbox creation)
Steps to Reproduce
1. Stop and restart Colima (or simulate a freshly-booted Docker daemon):
   colima stop && colima start
2. Within ~30 seconds (before the gateway upstream fully serves HTTP 200 on 8080), run:
   NEMOCLAW_PROVIDER=build nemoclaw onboard --fresh --name reproduce --non-interactive --yes --yes-i-accept-third-party-software --no-gpu
3. Observe that onboard prints "Reusing healthy NemoClaw gateway." (under step [2/8] in the output below, after the step [1/8] preflight falls back to cached health status), while in another shell:
   curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/
   returns 502.
4. Observe step [4/8] "Setting up inference provider" fails with Connection refused.
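
The timing in steps 2 and 3 is easier to capture with a small sampler running in a second shell. The following is a minimal sketch, not part of NemoClaw: `sample_gateway` is a hypothetical helper name, and the URL is the one used in this report. It only wraps the same `curl` probe in a loop and timestamps each result.

```shell
# Hypothetical helper (not part of NemoClaw or OpenShell).
# sample_gateway URL [COUNT] [INTERVAL_SECONDS]
# Prints one "<UTC time> <HTTP code>" line per sample.
sample_gateway() {
  local url="$1" count="${2:-30}" interval="${3:-1}" i=1 code
  while [ "$i" -le "$count" ]; do
    # curl reports 000 when no HTTP response is received at all.
    code="$(curl -s -o /dev/null --max-time 2 -w '%{http_code}' "$url" || true)"
    printf '%s %s\n' "$(date -u +%H:%M:%S)" "${code:-000}"
    i=$((i + 1))
    sleep "$interval"
  done
}
```

Running `sample_gateway http://localhost:8080/ 60 1` right after `colima start` gives a timeline that can be lined up against the onboard output.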
Expected Result
- Step [1/8] preflight should actually probe http://localhost:8080/ and refuse to print "healthy" when HTTP returns non-2xx
- If the gateway is still warming up, step [2/8] should wait for actual readiness (HTTP 200) before advancing
- Onboard should not advance to step [4/8] with a stale-cached health verdict; either it surfaces an actionable error or it retries until ready
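
The readiness wait described above could look roughly like the sketch below. It is an illustration under assumptions, not the NemoClaw implementation: `probe_status` and `wait_for_gateway` are hypothetical names, and the attempt count and delay are arbitrary defaults. It treats only HTTP 200 as ready, matching the expectation stated in this report.

```shell
# Hypothetical helpers (not part of NemoClaw or OpenShell).

# Print the HTTP status code for a URL; curl prints 000 if the connection fails.
probe_status() {
  curl -s -o /dev/null --max-time 2 -w '%{http_code}' "$1" || true
}

# wait_for_gateway URL [ATTEMPTS] [DELAY_SECONDS]
# Returns 0 as soon as the probe reports 200, 1 if every attempt fails.
wait_for_gateway() {
  local url="$1" attempts="${2:-30}" delay="${3:-2}" i=1 code
  while [ "$i" -le "$attempts" ]; do
    code="$(probe_status "$url")"
    if [ "$code" = "200" ]; then
      echo "gateway ready after $i attempt(s)"
      return 0
    fi
    echo "attempt $i/$attempts: HTTP ${code:-000}, retrying in ${delay}s" >&2
    sleep "$delay"
    i=$((i + 1))
  done
  echo "gateway not ready after $attempts attempts (last HTTP ${code:-000})" >&2
  return 1
}
```

As a manual workaround, something like `wait_for_gateway http://localhost:8080/ 30 2` could be run before `nemoclaw onboard`; a non-zero exit would be the point at which onboard surfaces an actionable error instead of advancing.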
Actual Result
[1/8] Preflight checks
  Warning: could not verify gateway container state (Docker may be unavailable). Proceeding with cached health status.
  ✓ Port 8080 available (OpenShell gateway)
  ✓ Apple GPU detected: Apple M4 (10 cores), 24576 MB unified memory
  ⓘ Local NIM unavailable — requires NVIDIA GPU
  Warning: could not verify gateway container state (Docker may be unavailable). Proceeding with cached health status.

[2/8] Starting OpenShell gateway
  [reuse] Skipping gateway (running)
  Reusing healthy NemoClaw gateway.

[3/8] Configuring inference (NIM)
  [non-interactive] Provider: build
  Chat Completions API available — OpenClaw will use openai-completions.
  Using NVIDIA Endpoints with model: nvidia/nemotron-3-super-120b-a12b

[4/8] Setting up inference provider
  ✓ Active gateway set to 'nemoclaw'
  Error:   × transport error
    ├─▶ tcp connect error
    ├─▶ tcp connect error
    ╰─▶ Connection refused (os error 61)

  Error: × transport error ├─▶ tcp connect error ├─▶ tcp connect error ╰─▶ Connection refused (os error 61)
Logs
At the same time as the failure:
  $ curl -s -o /dev/null -w "Gateway HTTP: %{http_code}\n" http://localhost:8080/
  Gateway HTTP: 502

Docker container status:
  $ docker ps | grep openshell
  f838d5ce25fe ghcr.io/nvidia/openshell/cluster:0.0.36 ... Up About a minute (healthy) 0.0.0.0:8080->30051/tcp openshell-cluster-nemoclaw

So Docker reports the container as healthy and maps host port 8080 to container port 30051, but the upstream on port 30051 inside the container is not yet serving HTTP, hence the 502 from the host. Onboard relies on cached gateway health rather than re-probing, so the user is told "healthy" while the gateway is actually unreachable.
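
The mismatch between Docker's health verdict and the live endpoint can be checked side by side. The sketch below is an assumption-laden illustration rather than NemoClaw code: `classify_gateway` and `gateway_state` are hypothetical names, the verdict labels are invented for this example, and the container name and URL are taken from the output above. It treats any 2xx response as serving.

```shell
# Hypothetical helpers (not part of NemoClaw or OpenShell).

# classify_gateway DOCKER_HEALTH HTTP_CODE
# Combine Docker's health status with a live HTTP status code.
classify_gateway() {
  local docker_health="$1" http_code="$2"
  case "$http_code" in
    2??) echo "ready" ;;
    *)
      if [ "$docker_health" = "healthy" ]; then
        # Container passes its health check, but the endpoint is not serving.
        echo "stale-healthy"
      else
        echo "not-ready"
      fi
      ;;
  esac
}

# gateway_state [CONTAINER] [URL]
# Query Docker and the HTTP endpoint, then print both with the combined verdict.
gateway_state() {
  local container="${1:-openshell-cluster-nemoclaw}" url="${2:-http://localhost:8080/}"
  local health code
  health="$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null || echo unknown)"
  code="$(curl -s -o /dev/null --max-time 2 -w '%{http_code}' "$url" || true)"
  echo "docker=$health http=${code:-000} verdict=$(classify_gateway "$health" "${code:-000}")"
}
```

With the state captured in this report (Docker `healthy`, HTTP 502), `classify_gateway healthy 502` yields `stale-healthy`, which is the condition onboard currently reports as healthy.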

Related: regression of NVBug 6090121 / GitHub #2020 (was marked Bug - Fixed).

Bug Details

Priority:     Unprioritized
Action:       Dev - Open - To fix
Disposition:  Open issue
Module:       Machine Learning - NemoClaw
Keyword:      NemoClaw, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Inference, NemoClaw_Onboard

[NVB#6158477]

Metadata

Labels

- NV QA (Bugs found by the NVIDIA QA Team)
- area: cli (Command line interface, flags, terminal UX, or output)
- platform: macos (Affects macOS, including Apple Silicon)
