Description
On Ubuntu 24.04 with a fresh NemoClaw v0.0.38 install, running `NEMOCLAW_EXPERIMENTAL=1 nemoclaw onboard --fresh` and selecting "Local NVIDIA NIM [experimental]" with the default model nvidia/nemotron-3-super-120b-a12b: the NIM container image pulls successfully, the container starts, but exits 11 seconds later with exit code 0. Inside the container, NIM's model-manifest download fails with "Authentication Error" because NemoClaw stores the pasted NGC API key only in ~/.docker/config.json (used for docker pull) and does NOT inject NGC_API_KEY / NIM_NGC_API_KEY into the NIM container's environment. The wizard then waits the full 300s NIM-health timeout against an already-dead container, prints "NIM did not become healthy within 300s. NIM failed to start. Falling back to cloud API. Inference selection did not yield a provider/model.", and exits leaving no sandbox registered.
Environment
Device: Ubuntu workstation 2u1g-x570-1795 (10.63.136.90)
OS: Ubuntu 24.04.4 LTS (kernel 6.17.0-19-generic)
Architecture: x86_64
GPU: NVIDIA RTX 6000 Ada Generation, 46068 MiB, driver 595.58.03
Node.js: v22.22.2
npm: 10.9.7
Docker: 29.4.3 (build 055a478)
OpenShell CLI: openshell 0.0.36
NemoClaw: v0.0.38
OpenClaw: 2026.4.24 (from installer log; sandbox not created)
NIM image: nvcr.io/nim/nvidia/nemotron-3-super-120b-a12b:latest (NIM v2.0.4)
Steps to Reproduce
1. Fresh Ubuntu 24.04 host with NVIDIA GPU, Docker + nvidia-container-toolkit installed
2. curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash (installs v0.0.38)
3. Ensure no prior nvcr.io login: rm -f ~/.docker/config.json
4. Ensure clean state: rm -f ~/.nemoclaw/onboard-session.json; nemoclaw list shows no sandboxes
5. Run: NEMOCLAW_EXPERIMENTAL=1 nemoclaw onboard --fresh
6. At [3/8] Configuring inference (NIM), select "8" (Local NVIDIA NIM [experimental])
7. At "Choose model [1]:" press Enter (default: nvidia/nemotron-3-super-120b-a12b)
8. At "NGC API Key:" paste a valid personal NGC key in the form nvapi-...
9. Watch wizard pull image, start container nemoclaw-nim-nemoclaw, then poll "Waiting for NIM health on port 8000"
10. After 5 min, wizard prints failure and shell returns
Expected Result
- NemoClaw injects NGC_API_KEY (and/or NIM_NGC_API_KEY as the NIM SDK expects) into the NIM container's environment when launching nemoclaw-nim-* via docker run
- NIM model-manifest download inside the container succeeds
- Container reaches a healthy state on port 8000 and the wizard proceeds to sandbox creation
- nemoclaw list eventually shows a Ready sandbox using the NIM-local provider
- docker inspect nemoclaw-nim-nemoclaw | jq '.[]|.Config.Env' contains NGC_API_KEY=...
- If NGC_API_KEY is intentionally not passed, wizard should detect the dead container within seconds (not wait the full 300s) and surface the in-container auth-error log line to the user
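The first expectation above can be sketched as follows. This is a minimal illustration, not NemoClaw's actual launcher: the helper names `build_nim_run_args` and `build_nim_run_env` are hypothetical, and the flags NemoClaw really passes to docker run are not known from this report. It relies only on the standard Docker behaviour that `-e NAME` (with no value) copies NAME from the caller's environment, which keeps the key out of the process list.

```python
from typing import Dict, List, Mapping


def build_nim_run_args(image: str, container_name: str, host_port: int = 8000) -> List[str]:
    """Build a `docker run` argv that forwards the NGC key variables by name.

    Hypothetical helper: the real NemoClaw launch flags are an assumption here.
    """
    return [
        "docker", "run", "-d",
        "--name", container_name,
        "--gpus", "all",
        # Name-only form: Docker copies the value from the caller's
        # environment, so the secret never appears in the argv.
        "-e", "NGC_API_KEY",
        "-e", "NIM_NGC_API_KEY",
        "-p", f"{host_port}:8000",
        image,
    ]


def build_nim_run_env(base_env: Mapping[str, str], ngc_api_key: str) -> Dict[str, str]:
    """Return a copy of base_env with both NGC key variables set."""
    if not ngc_api_key:
        raise ValueError("NGC API key is empty; refusing to start NIM without it")
    env = dict(base_env)
    env["NGC_API_KEY"] = ngc_api_key
    env["NIM_NGC_API_KEY"] = ngc_api_key
    return env
```

A caller would pass both results to `subprocess.run(args, env=env, check=True)`; with that in place, the docker inspect check in the last-but-one bullet should list NGC_API_KEY.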
Actual Result
Wizard tail output:
Starting NIM container: nemoclaw-nim-nemoclaw
57d6a6263b87abf76bf2c24041830b78885effaf84b31b469d38c5ee23bc8d66
Waiting for NIM to become healthy...
Waiting for NIM health on port 8000 (timeout: 300s)...
NIM did not become healthy within 300s.
NIM failed to start. Falling back to cloud API.
Inference selection did not yield a provider/model.
$ nemoclaw list
No sandboxes registered. Run `nemoclaw onboard` to get started.
$ docker ps -a --filter name=nemoclaw-nim
NAMES STATUS
nemoclaw-nim-nemoclaw Exited (0) 6 minutes ago
Container environment confirms NGC_API_KEY is missing:
$ docker inspect nemoclaw-nim-nemoclaw --format '{{range .Config.Env}}{{println .}}{{end}}' | grep -i ngc
(no output)
The wizard accepted the key and stored it in ~/.docker/config.json (nvcr.io auth visible there), but the NIM container itself has no NGC env var, so in-container model download fails immediately.
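The manual grep above could be turned into an automated post-launch check. The sketch below assumes the caller has already loaded the `Config.Env` list from `docker inspect` JSON output; the function name `missing_ngc_vars` and the choice of required variables are illustrative assumptions based on the two names mentioned in this report.

```python
from typing import Iterable, List, Sequence

# Variable names taken from this report; which one NIM actually reads is an assumption.
REQUIRED_NGC_VARS = ("NGC_API_KEY", "NIM_NGC_API_KEY")


def missing_ngc_vars(
    env_entries: Iterable[str],
    required: Sequence[str] = REQUIRED_NGC_VARS,
) -> List[str]:
    """Return the required variables that are absent or empty in Config.Env.

    env_entries is the list of "NAME=value" strings that docker inspect
    reports under .Config.Env.
    """
    present = set()
    for entry in env_entries:
        name, sep, value = entry.partition("=")
        if sep and value:  # count a variable only if it has a non-empty value
            present.add(name)
    return [name for name in required if name not in present]
```

For the container in this report the function would return both names, since the grep for "ngc" produced no output.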
Compounding (cosmetic) issue: the wizard polls a dead container for the full 300s before bailing, so the user wastes ~5 min on top of the 1-2 min image pull. After the timeout, the "Falling back to cloud API" branch does NOT prompt the user for an alternate provider; it just exits.
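One way to avoid the wasted wait is to check the container state on every poll and stop as soon as it is no longer running. The following is a sketch under stated assumptions: `wait_for_nim` and its callback parameters are hypothetical, and the callbacks stand in for whatever health probe and `docker inspect` call the wizard really uses. The status strings "exited" and "dead" are the standard Docker container states.

```python
import time
from typing import Callable


def wait_for_nim(
    is_healthy: Callable[[], bool],
    container_status: Callable[[], str],
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until healthy, exited, or timed out.

    Returns "healthy", "exited", or "timeout". The sleep and clock
    parameters are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        if is_healthy():
            return "healthy"
        # A stopped container can never become healthy, so bail out early
        # instead of polling until the deadline.
        if container_status() in ("exited", "dead"):
            return "exited"
        sleep(interval_s)
    return "timeout"
```

With a loop shaped like this, the container in this report would be flagged as exited on the first poll after it stopped, rather than after the full 300s.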
Logs
docker logs nemoclaw-nim-nemoclaw (tail):
===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================
NVIDIA Inference Microservice LLM NIM Version 2.0.4
...
ERROR 2026-05-11 08:40:07.840 nim_sdk.py:342] Download failed after 1 attempts. Last exception: Authentication Error
ERROR 2026-05-11 08:40:07.840 model_download.py:68] Failed to download models for profile '929c6303b397a5ef7c1472230fcc8b940d7e78d230d6c3051f2f1837146c767b': Error downloading manifest: Authentication Error
ERROR 2026-05-11 08:40:07.840 actions.py:92] Model download failed: Error downloading manifest: Authentication Error
Shutting down services...
Stopping nginx...
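To surface the in-container failure to the user, the wizard could scan the container log for the error lines shown above. This is a minimal sketch: the function name `find_auth_errors` is hypothetical, and matching on the literal strings "ERROR" and "Authentication Error" is an assumption based only on the log excerpt in this report, not on a documented NIM log format.

```python
from typing import List


def find_auth_errors(log_text: str) -> List[str]:
    """Return log lines that look like NIM authentication failures.

    Matches lines containing both "ERROR" and "Authentication Error",
    which is the pattern seen in this report's log excerpt.
    """
    return [
        line.strip()
        for line in log_text.splitlines()
        if "ERROR" in line and "Authentication Error" in line
    ]
```

The input would be the text captured from `docker logs nemoclaw-nim-nemoclaw`; printing the last matching line is probably enough to point the user at the missing key.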
docker inspect nemoclaw-nim-nemoclaw --format '{{.State.Status}}/{{.State.ExitCode}}/{{.State.StartedAt}}/{{.State.FinishedAt}}':
exited/0/2026-05-11T08:39:57.539935638Z/2026-05-11T08:40:08.794406746Z
Container died 11s after start; the wizard polled health for an additional ~289s anyway.
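The 11s figure can be reproduced from the StartedAt and FinishedAt values above. The sketch below assumes the timestamps are UTC with a trailing "Z" and nanosecond fractions, as in the docker inspect output shown; the helper names are illustrative. Python's datetime keeps microseconds only, so the fraction is truncated to six digits.

```python
from datetime import datetime, timezone


def parse_docker_time(stamp: str) -> datetime:
    """Parse a docker inspect timestamp such as 2026-05-11T08:39:57.539935638Z."""
    base, _, frac = stamp.rstrip("Z").partition(".")
    # Pad or truncate the fraction to exactly six digits (microseconds).
    micros = (frac + "000000")[:6]
    parsed = datetime.strptime(f"{base}.{micros}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def container_lifetime_s(started_at: str, finished_at: str) -> float:
    """Seconds between the container's start and finish timestamps."""
    delta = parse_docker_time(finished_at) - parse_docker_time(started_at)
    return delta.total_seconds()
```

For the two timestamps in this report the result is about 11.25 seconds, which leaves roughly 289 seconds of the 300s timeout spent polling a stopped container.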
Full T560817 day0 test report: /home/lab/day0-automation/20260511/report-T560817.txt
Related upstream / similar symptom (different ecosystem): NVBug 4926446 (NIM nv-embedqa-e5-v5: NGC_API_KEY not getting passed through, same "container exited but job kept polling health" cosmetic) — fix pattern likely transferable.
Bug Details
| Field | Value |
| Priority | Unprioritized |
| Action | Dev - Open - To fix |
| Disposition | Open issue |
| Module | Machine Learning - NemoClaw |
| Keyword | NemoClaw, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Inference, NemoClaw_Onboard |
[NVB#6163749]