[DGX Spark][Inference] NIM container 300s health probe timeout too short for first-time model checkpoint load; onboard falls back to cloud API instead of waiting

## Description

Running `NEMOCLAW_EXPERIMENTAL=1 nemoclaw onboard` with the NIM-local option and `nvidia/nemotron-3-nano-30b-a3b` on DGX Spark, the NIM image pulls successfully, the container starts, and vLLM begins loading model checkpoint shards at ~28s/shard. NemoClaw's health probe gives up after 300s with `NIM did not become healthy within 300s.` while the container is still legitimately mid-load (4/9 shards loaded at the cutoff). Onboard then reports `NIM failed to start. Falling back to cloud API.` followed by `Inference selection did not yield a provider/model.`, which sends the user to a non-NIM provider even though the NIM container was working and would have come up cleanly given ~3-5 more minutes.

Also note: NemoClaw labels the Spark GPU as `NVIDIA JMJWOA-Generic-GPU` (engineering codename) rather than the user-facing `GB10` name. T5937388 expected output says "GPU detected as GB10 with ~128000 MB unified memory"; actual is "NVIDIA JMJWOA-Generic-GPU, 124546 MB". Memory amount is roughly correct (`free -m` based). The name discrepancy is secondary to the timeout issue.

Side note: the T5937388 case still expects an "amd64-only / exec format error" warning at NIM pull on aarch64. In v0.0.46 the NIM image (`nvcr.io/nim/nvidia/nemotron-3-nano:latest`) has an aarch64 variant that runs vLLM cleanly on the Spark JMJWOA GPU, so the aarch64-incompatibility warning is now stale guidance. This needs a separate case-spec update; it is not a product bug.

## Environment

```text
Device:        DGX Spark (spark-8158)
OS:            Ubuntu 24.04.4 LTS
Architecture:  aarch64
Node.js:       v22.x via nvm
npm:           10.9.x
Docker:        29.2.1
OpenShell CLI: 0.0.39
NemoClaw:      v0.0.46
OpenClaw:      N/A (onboard fell back before sandbox creation)
NIM image:     nvcr.io/nim/nvidia/nemotron-3-nano:latest
               (vLLM 0.20.2, model nvidia/nemotron-3-nano fp8, 9 safetensors shards)
```

## Steps to Reproduce

1. nemoclaw v0.0.46 installed on DGX Spark (aarch64).
2. Run:
   ```bash
   NEMOCLAW_EXPERIMENTAL=1 nemoclaw onboard --name nim-spark-sb --fresh -y
   ```
3. Choose option 8 (Local NVIDIA NIM [experimental]), then model 2 (`nemotron-3-nano-30b-a3b`).
4. Paste NGC API key.
5. Wait through image pull (~10 min) and container start.

## Expected Result

- NIM image pulled.
- Container started, vLLM loads checkpoints.
- NemoClaw waits long enough for the engine to become healthy (or shows progress / extends the deadline based on observed shard-load activity).
- Onboard completes with NIM provider; sandbox Ready.
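
One way the "extend the deadline based on observed shard-load activity" behavior could be expressed is sketched below. This is only an illustration, not NemoClaw's actual probe logic: `should_keep_waiting` is a hypothetical helper, and the 120s idle window and 1800s hard cap are assumed values (only the 300s base timeout comes from this report).

```bash
# Hypothetical helper (not NemoClaw code): decide whether a health probe
# should keep waiting for the container.
# Args: elapsed seconds, seconds since progress was last observed,
#       base timeout (default 300, from this report),
#       max idle window (default 120, assumed),
#       hard cap (default 1800, assumed).
should_keep_waiting() {
  local elapsed="$1" idle="$2"
  local base="${3:-300}" max_idle="${4:-120}" hard_cap="${5:-1800}"

  # Never wait past the hard cap, even if progress is still being reported.
  if [ "$elapsed" -ge "$hard_cap" ]; then return 1; fi
  # Inside the base window: always keep waiting.
  if [ "$elapsed" -lt "$base" ]; then return 0; fi
  # Past the base window: continue only while progress is recent.
  [ "$idle" -lt "$max_idle" ]
}

# At the cutoff in this report: 300s elapsed, with shards completing
# roughly every 28s, so the last observed progress is at most ~28s old.
if should_keep_waiting 300 28; then echo "keep waiting"; else echo "give up"; fi
# prints "keep waiting"
```

With a rule like this, a container that is still loading shards gets more time, while one that has been silent for longer than the idle window still fails at the original deadline.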

## Actual Result

NIM image pull succeeds:
```text
Status: Downloaded newer image for nvcr.io/nim/nvidia/nemotron-3-nano:latest
```

NIM container starts: `ca564b3ff760aa42a4de73d964faef3d22d844770009063a115fd86350821c96`

Health probe:
```text
Waiting for NIM to become healthy...
Waiting for NIM health on port 8000 (timeout: 300s)...
NIM did not become healthy within 300s.
NIM failed to start. Falling back to cloud API.
Inference selection did not yield a provider/model.
```

`docker logs nemoclaw-nim-nemoclaw` at the timeout shows vLLM is actively loading:
```text
Starting vLLM v0.20.2
  Model:     nvidia/nemotron-3-nano
  Precision: fp8
Loading safetensors checkpoint shards:  44% Completed | 4/9 [01:52<02:20, 28.16s/it]
```

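i.e. the load was progressing steadily: the log's own ETA shows 02:20 of shard loading remaining, so a few more minutes of legitimate work (the remaining shards plus whatever engine initialization follows) would have finished the load and the container would have served.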
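
The remaining shard-load time can be checked directly from the progress line. The snippet below is a rough sketch: `shard_eta_seconds` is a hypothetical helper, not part of NemoClaw, NIM, or vLLM, and it assumes the tqdm-style `loaded/total [elapsed<remaining, N.NNs/it]` format shown in the log above. It covers shard loading only, not any engine startup after it.

```bash
# Hypothetical helper: estimate the seconds of shard loading left
# from one tqdm-style progress line.
shard_eta_seconds() {
  printf '%s\n' "$1" |
    # Extract "loaded total seconds_per_shard" from the progress line.
    sed -nE 's#.*[^0-9]([0-9]+)/([0-9]+) \[.*, *([0-9.]+)s/it\].*#\1 \2 \3#p' |
    # Remaining shards times the per-shard rate, rounded to whole seconds.
    # Exits non-zero if the line did not look like a progress line.
    awk 'NF == 3 { printf "%.0f\n", ($2 - $1) * $3; found = 1 }
         END { exit !found }'
}

# Progress line captured at the 300s cutoff in this report.
shard_eta_seconds 'Loading safetensors checkpoint shards:  44% Completed | 4/9 [01:52<02:20, 28.16s/it]'
# prints 141
```

Five remaining shards at 28.16s each is about 141s, which matches the `02:20` remaining shown by vLLM.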

**Secondary**: GPU detection output says `NVIDIA JMJWOA-Generic-GPU, 124546 MB` rather than the expected `GB10 with ~128000 MB unified memory`; the engineering codename leaks into user-facing onboard output.

---
[NVB#6194905](https://nvbugspro.nvidia.com/bug/6194905)

