
[DGX Spark][Inference] NIM container 300s health probe timeout too short for first-time model checkpoint load; onboard falls back to cloud API instead of waiting #3886

@wangericnv

Description

When running NEMOCLAW_EXPERIMENTAL=1 nemoclaw onboard with the local NIM option and nvidia/nemotron-3-nano-30b-a3b on DGX Spark, the NIM image pulls successfully, the container starts, and vLLM begins loading model checkpoint shards at ~28s/shard. NemoClaw's health probe gives up after 300s with "NIM did not become healthy within 300s." while the container is still legitimately mid-load (4/9 shards loaded at the cutoff). Onboard then reports "NIM failed to start. Falling back to cloud API." followed by "Inference selection did not yield a provider/model.", which sends the user to a non-NIM provider even though the NIM container was working and would have come up cleanly given ~3-5 more minutes.

Also note: NemoClaw labels the Spark GPU as NVIDIA JMJWOA-Generic-GPU (engineering codename) rather than the user-facing GB10 name. The expected output in test case T5937388 says "GPU detected as GB10 with ~128000 MB unified memory"; the actual output is "NVIDIA JMJWOA-Generic-GPU, 124546 MB". The memory amount is roughly correct (it is based on free -m). The name discrepancy is secondary to the timeout issue.

Side note: test case T5937388 still expects an "amd64-only / exec format error" warning at NIM pull on aarch64. In v0.0.46 the NIM image (nvcr.io/nim/nvidia/nemotron-3-nano:latest) has an aarch64 variant that runs vLLM cleanly on the Spark JMJWOA GPU, so the aarch64-incompatibility warning is now stale guidance. This needs a separate test-case spec update; it is not a product bug.

Environment

Device:        DGX Spark (spark-8158)
OS:            Ubuntu 24.04.4 LTS
Architecture:  aarch64
Node.js:       v22.x via nvm
npm:           10.9.x
Docker:        29.2.1
OpenShell CLI: 0.0.39
NemoClaw:      v0.0.46
OpenClaw:      N/A (onboard fell back before sandbox creation)
NIM image:     nvcr.io/nim/nvidia/nemotron-3-nano:latest
               (vLLM 0.20.2, model nvidia/nemotron-3-nano fp8, 9 safetensors shards)

Steps to Reproduce

  1. nemoclaw v0.0.46 installed on DGX Spark (aarch64).
  2. Run:
    NEMOCLAW_EXPERIMENTAL=1 nemoclaw onboard --name nim-spark-sb --fresh -y
  3. Choose 8 (Local NVIDIA NIM [experimental]); model 2 (nemotron-3-nano-30b-a3b).
  4. Paste NGC API key.
  5. Wait through image pull (~10 min) and container start.

Expected Result

  • NIM image pulled.
  • Container started, vLLM loads checkpoints.
  • NemoClaw waits long enough for the engine to become healthy (or shows progress / extends the deadline based on observed shard-load activity).
  • Onboard completes with NIM provider; sandbox Ready.
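For illustration, the "extends the deadline based on observed shard-load activity" behavior asked for above could look roughly like the Python sketch below. This is not NemoClaw's implementation: wait_for_healthy, is_healthy, read_progress, and the 1200s hard cap are hypothetical names and values chosen for the example, and the two callbacks stand in for whatever probe and log reader the real code uses.

```python
import time
from typing import Callable, Optional


def wait_for_healthy(
    is_healthy: Callable[[], bool],
    read_progress: Callable[[], Optional[int]],
    base_timeout_s: float = 300.0,
    max_timeout_s: float = 1200.0,  # hypothetical hard cap, not from the report
    poll_interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once is_healthy() succeeds, False on timeout.

    The deadline starts at base_timeout_s. Each time read_progress()
    reports a higher value than before (e.g. more shards loaded), the
    deadline is reset to "now + base_timeout_s", but it is never allowed
    to pass max_timeout_s measured from the start of the wait.
    """
    start = clock()
    deadline = start + base_timeout_s
    hard_deadline = start + max_timeout_s
    last_progress: Optional[int] = None

    while True:
        if is_healthy():
            return True
        progress = read_progress()
        if progress is not None and (last_progress is None or progress > last_progress):
            # Forward progress observed: the container is still working.
            last_progress = progress
            deadline = min(clock() + base_timeout_s, hard_deadline)
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

With read_progress always returning None this degrades to the current fixed 300s wait, so the extension only applies when the container is observably making progress. The hard cap keeps a container that is stuck in a progress loop from blocking onboard indefinitely.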

Actual Result

NIM image pull succeeds:

Status: Downloaded newer image for nvcr.io/nim/nvidia/nemotron-3-nano:latest

NIM container starts: ca564b3ff760aa42a4de73d964faef3d22d844770009063a115fd86350821c96

Health probe:

Waiting for NIM to become healthy...
Waiting for NIM health on port 8000 (timeout: 300s)...
NIM did not become healthy within 300s.
NIM failed to start. Falling back to cloud API.
Inference selection did not yield a provider/model.

docker logs nemoclaw-nim-nemoclaw at the timeout shows vLLM is actively loading:

Starting vLLM v0.20.2
  Model:     nvidia/nemotron-3-nano
  Precision: fp8
Loading safetensors checkpoint shards:  44% Completed | 4/9 [01:52<02:20, 28.16s/it]

i.e. the progress bar itself estimated 02:20 of shard loading remaining, so a few more minutes of legitimate work would have finished the load, after which the container would have served.
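The remaining-time arithmetic can be checked directly from the log line above. The Python sketch below parses that progress line and multiplies the shards still to load by the reported seconds per shard. ShardProgress and parse_shard_progress are hypothetical names for this example, and the regex is written only against the one line format shown in this report (it does not handle other rate units or an unknown remaining time).

```python
import re
from typing import NamedTuple, Optional


class ShardProgress(NamedTuple):
    done: int
    total: int
    elapsed_s: int
    seconds_per_shard: float

    @property
    def estimated_remaining_s(self) -> float:
        # Shards still to load times the observed per-shard load time.
        return (self.total - self.done) * self.seconds_per_shard


# Matches e.g. "... shards:  44% Completed | 4/9 [01:52<02:20, 28.16s/it]"
_PROGRESS_RE = re.compile(
    r"Loading safetensors checkpoint shards:\s+\d+% Completed \| "
    r"(\d+)/(\d+) \[([\d:]+)<[^,]*, ([\d.]+)s/it\]"
)


def _clock_to_seconds(text: str) -> int:
    """Convert "MM:SS" or "H:MM:SS" into whole seconds."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def parse_shard_progress(line: str) -> Optional[ShardProgress]:
    """Return the parsed progress, or None if the line is not a progress line."""
    match = _PROGRESS_RE.search(line)
    if match is None:
        return None
    done, total, elapsed, rate = match.groups()
    return ShardProgress(int(done), int(total), _clock_to_seconds(elapsed), float(rate))
```

For the line captured at the timeout this gives 4 of 9 shards, 112s elapsed, and roughly 141s of shard loading left, which agrees with the bar's own 02:20 estimate. It also implies that, if the log was captured right at the 300s cutoff, about 188s had already passed before shard loading began.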

Secondary: GPU detection output says "NVIDIA JMJWOA-Generic-GPU, 124546 MB" rather than the expected "GB10 with ~128000 MB unified memory", so an engineering codename leaks into user-facing onboard output.
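One possible shape for a fix to the naming issue is a small display-name alias applied before the detected GPU name is printed. The Python sketch below is only an illustration: GPU_DISPLAY_ALIASES and display_gpu_name are hypothetical, and the single table entry is the pairing observed in this report (codename string reported on DGX Spark, "GB10" expected by the test case).

```python
# Hypothetical alias table. The only entry is the pairing from this report:
# the codename string seen on DGX Spark and the user-facing name expected.
GPU_DISPLAY_ALIASES = {
    "NVIDIA JMJWOA-Generic-GPU": "GB10",
}


def display_gpu_name(reported_name: str) -> str:
    """Map a reported GPU name to its user-facing name, if an alias is known."""
    name = reported_name.strip()
    # Unknown names pass through unchanged so other GPUs are unaffected.
    return GPU_DISPLAY_ALIASES.get(name, name)
```

Names without an alias are returned as-is, so the mapping only changes output for devices explicitly listed.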


NVB#6194905

Labels: NV QA, area: inference, platform: dgx-spark