Description
Running NEMOCLAW_EXPERIMENTAL=1 nemoclaw onboard with the local NIM provider and nvidia/nemotron-3-nano-30b-a3b on DGX Spark, the NIM image pulls successfully, the container starts, and vLLM begins loading model checkpoint shards at ~28s/shard. NemoClaw's health probe gives up after 300s with "NIM did not become healthy within 300s." while the container is still legitimately mid-load (4/9 shards loaded at the cutoff). Onboard then reports "NIM failed to start. Falling back to cloud API." followed by "Inference selection did not yield a provider/model.", which sends the user to a non-NIM provider even though the NIM container was working and would have come up cleanly given ~3-5 more minutes.
Also note: NemoClaw labels the Spark GPU as NVIDIA JMJWOA-Generic-GPU (engineering codename) rather than the user-facing GB10 name. The T5937388 expected output says "GPU detected as GB10 with ~128000 MB unified memory"; the actual output is "NVIDIA JMJWOA-Generic-GPU, 124546 MB". The memory amount is roughly correct (it is based on free -m). The name discrepancy is secondary to the timeout issue.
Side note: the T5937388 case still expects an "amd64-only / exec format error" warning at NIM pull on aarch64. In v0.0.46 the NIM image (nvcr.io/nim/nvidia/nemotron-3-nano:latest) has an aarch64 variant that runs vLLM cleanly on the Spark JMJWOA GPU, so the aarch64-incompatibility warning is now stale guidance. This calls for a separate case-spec update; it is not a product bug.
Environment
Device: DGX Spark (spark-8158)
OS: Ubuntu 24.04.4 LTS
Architecture: aarch64
Node.js: v22.x via nvm
npm: 10.9.x
Docker: 29.2.1
OpenShell CLI: 0.0.39
NemoClaw: v0.0.46
OpenClaw: N/A (onboard fell back before sandbox creation)
NIM image: nvcr.io/nim/nvidia/nemotron-3-nano:latest
(vLLM 0.20.2, model nvidia/nemotron-3-nano fp8, 9 safetensors shards)
Steps to Reproduce
- nemoclaw v0.0.46 installed on DGX Spark (aarch64).
- Run:
NEMOCLAW_EXPERIMENTAL=1 nemoclaw onboard --name nim-spark-sb --fresh -y
- Choose 8 (Local NVIDIA NIM [experimental]); model 2 (nemotron-3-nano-30b-a3b).
- Paste NGC API key.
- Wait through image pull (~10 min) and container start.
Expected Result
- NIM image pulled.
- Container started, vLLM loads checkpoints.
- NemoClaw waits long enough for the engine to become healthy (or shows progress / extends the deadline based on observed shard-load activity).
- Onboard completes with NIM provider; sandbox Ready.
Actual Result
NIM image pull succeeds:
Status: Downloaded newer image for nvcr.io/nim/nvidia/nemotron-3-nano:latest
NIM container starts: ca564b3ff760aa42a4de73d964faef3d22d844770009063a115fd86350821c96
Health probe:
Waiting for NIM to become healthy...
Waiting for NIM health on port 8000 (timeout: 300s)...
NIM did not become healthy within 300s.
NIM failed to start. Falling back to cloud API.
Inference selection did not yield a provider/model.
docker logs nemoclaw-nim-nemoclaw at the timeout shows vLLM is actively loading:
Starting vLLM v0.20.2
Model: nvidia/nemotron-3-nano
Precision: fp8
Loading safetensors checkpoint shards: 44% Completed | 4/9 [01:52<02:20, 28.16s/it]
i.e. roughly 3-5 more minutes of legitimate work (the progress bar estimates 2m20s for the remaining five shards, plus engine startup) would have finished the load, and the container would have served.
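The remaining shard-load time can be estimated from the progress line itself. The sketch below parses the line quoted in the log above and multiplies the shards left by the reported per-shard rate. The function names are hypothetical, and the result is only a lower bound on time-to-healthy, since engine startup after the last shard is not reflected in this line.

```python
import re

# Progress line copied from the container log in this report.
LINE = ("Loading safetensors checkpoint shards: 44% Completed | "
        "4/9 [01:52<02:20, 28.16s/it]")

# Captures shards done, shards total, and seconds per shard.
# Only the "s/it" form is handled; a faster load would be reported as "it/s".
PROGRESS_RE = re.compile(r"(\d+)/(\d+)\s*\[[^\]]*?,\s*([\d.]+)s/it\]")


def parse_progress(line: str):
    """Return (done, total, seconds_per_shard) or raise ValueError."""
    m = PROGRESS_RE.search(line)
    if m is None:
        raise ValueError("no shard progress found in line")
    return int(m.group(1)), int(m.group(2)), float(m.group(3))


def remaining_seconds(line: str) -> float:
    """Estimate the remaining shard-load time from one progress line."""
    done, total, sec_per_shard = parse_progress(line)
    return (total - done) * sec_per_shard


if __name__ == "__main__":
    # 5 shards left at 28.16 s each
    print(f"{remaining_seconds(LINE):.1f} s remaining")
```

For the line above this gives about 140.8 s, which agrees with the 02:20 estimate printed in the progress bar.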
Secondary: GPU detection output says "NVIDIA JMJWOA-Generic-GPU, 124546 MB" rather than the expected "GB10 with ~128000 MB unified memory"; the engineering codename leaks into user-facing onboard output.
NVB#6194905