[Ubuntu 24.04][Onboard] NIM-local onboard fails: NGC_API_KEY not propagated to NIM container, model manifest download returns Authentication Error #3333

@wangericnv

Description

On Ubuntu 24.04 with a fresh NemoClaw v0.0.38 install, running `NEMOCLAW_EXPERIMENTAL=1 nemoclaw onboard --fresh` and selecting "Local NVIDIA NIM [experimental]" with the default model nvidia/nemotron-3-super-120b-a12b fails. The NIM container image pulls successfully and the container starts, but it exits 11 seconds later with exit code 0.

Inside the container, NIM's model-manifest download fails with "Authentication Error" because NemoClaw stores the pasted NGC API key only in ~/.docker/config.json (used for docker pull) and does NOT inject NGC_API_KEY / NIM_NGC_API_KEY into the NIM container's environment.

The wizard then waits the full 300s NIM-health timeout against an already-dead container, prints "NIM did not become healthy within 300s. NIM failed to start. Falling back to cloud API. Inference selection did not yield a provider/model.", and exits, leaving no sandbox registered.
Environment
Device:        Ubuntu workstation 2u1g-x570-1795 (10.63.136.90)
OS:            Ubuntu 24.04.4 LTS (kernel 6.17.0-19-generic)
Architecture:  x86_64
GPU:           NVIDIA RTX 6000 Ada Generation, 46068 MiB, driver 595.58.03
Node.js:       v22.22.2
npm:           10.9.7
Docker:        29.4.3 (build 055a478)
OpenShell CLI: openshell 0.0.36
NemoClaw:      v0.0.38
OpenClaw:      2026.4.24 (from installer log; sandbox not created)
NIM image:     nvcr.io/nim/nvidia/nemotron-3-super-120b-a12b:latest (NIM v2.0.4)
Steps to Reproduce
1. Fresh Ubuntu 24.04 host with NVIDIA GPU, Docker + nvidia-container-toolkit installed
2. curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash  (installs v0.0.38)
3. Ensure no prior nvcr.io login: rm -f ~/.docker/config.json
4. Ensure clean state: rm -f ~/.nemoclaw/onboard-session.json; nemoclaw list shows no sandboxes
5. Run:  NEMOCLAW_EXPERIMENTAL=1 nemoclaw onboard --fresh
6. At [3/8] Configuring inference (NIM), select "8" (Local NVIDIA NIM [experimental])
7. At "Choose model [1]:" press Enter (default: nvidia/nemotron-3-super-120b-a12b)
8. At "NGC API Key:" paste a valid personal NGC key in the form nvapi-...
9. Watch wizard pull image, start container nemoclaw-nim-nemoclaw, then poll "Waiting for NIM health on port 8000"
10. After 5 min, wizard prints failure and shell returns
Expected Result
- NemoClaw injects NGC_API_KEY (and/or NIM_NGC_API_KEY as the NIM SDK expects) into the NIM container's environment when launching nemoclaw-nim-* via docker run
- NIM model-manifest download inside the container succeeds
- Container reaches a healthy state on port 8000 and the wizard proceeds to sandbox creation
- nemoclaw list eventually shows a Ready sandbox using the NIM-local provider
- docker inspect nemoclaw-nim-nemoclaw | jq '.[]|.Config.Env' contains NGC_API_KEY=...
- If NGC_API_KEY is intentionally not passed, wizard should detect the dead container within seconds (not wait the full 300s) and surface the in-container auth-error log line to the user
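
For illustration only, here is a minimal Python sketch of the first expectation: forwarding the key into the container at launch. NemoClaw's real launch code is not shown in this report, so `build_nim_run_args` and `start_nim` are hypothetical names, and the `--gpus all` and `-p 8000:8000` flags are assumptions based on the GPU host and the port 8000 health check described here. The sketch relies on documented `docker run` behavior: `-e VAR` without a value copies `VAR` from the caller's environment, which keeps the secret off the command line.

```python
import os
import subprocess


def build_nim_run_args(container_name, image, port=8000,
                       key_vars=("NGC_API_KEY", "NIM_NGC_API_KEY")):
    """Build a `docker run` argv that forwards the NGC key variables by name.

    Hypothetical helper: the flag set is an assumption, not NemoClaw's code.
    """
    args = ["docker", "run", "-d", "--name", container_name,
            "--gpus", "all", "-p", f"{port}:{port}"]
    for var in key_vars:
        # `-e VAR` with no `=value` makes docker copy VAR from the caller's
        # environment, so the key never appears in the process argument list.
        args += ["-e", var]
    args.append(image)
    return args


def start_nim(container_name, image, ngc_api_key):
    """Launch the container with the key set only in the child environment."""
    env = dict(os.environ)
    env["NGC_API_KEY"] = ngc_api_key
    env["NIM_NGC_API_KEY"] = ngc_api_key
    return subprocess.run(build_nim_run_args(container_name, image),
                          env=env, check=True, capture_output=True, text=True)
```

With a launch along these lines, the `docker inspect ... .Config.Env` check listed above would be expected to show both variables.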
Actual Result
Wizard tail output:
  Starting NIM container: nemoclaw-nim-nemoclaw
  57d6a6263b87abf76bf2c24041830b78885effaf84b31b469d38c5ee23bc8d66
  Waiting for NIM to become healthy...
  Waiting for NIM health on port 8000 (timeout: 300s)...
  NIM did not become healthy within 300s.
  NIM failed to start. Falling back to cloud API.
  Inference selection did not yield a provider/model.
$ nemoclaw list
  No sandboxes registered. Run `nemoclaw onboard` to get started.
$ docker ps -a --filter name=nemoclaw-nim
  NAMES                    STATUS
  nemoclaw-nim-nemoclaw    Exited (0) 6 minutes ago

Container environment confirms NGC_API_KEY is missing:
$ docker inspect nemoclaw-nim-nemoclaw --format '{{range .Config.Env}}{{println .}}{{end}}' | grep -i ngc
  (no output)

The wizard accepted the key and stored it in ~/.docker/config.json (nvcr.io auth visible there), but the NIM container itself has no NGC env var, so in-container model download fails immediately.
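
The same environment check can be scripted, for example in a day0 regression test. The sketch below parses the JSON array that `docker inspect <container>` prints and reports only variable names, never values. `ngc_env_names` is a hypothetical helper name, and the sample input is made up for illustration.

```python
import json


def ngc_env_names(inspect_json):
    """Return names of NGC-related variables found in `docker inspect` JSON."""
    found = []
    for container in json.loads(inspect_json):
        # Config.Env is a list of "NAME=value" strings (or null if empty).
        for entry in container.get("Config", {}).get("Env") or []:
            name = entry.split("=", 1)[0]
            if "NGC" in name.upper():
                found.append(name)  # report the name only, never the value
    return found


# Made-up sample resembling the failing container: no NGC variables at all.
sample = '[{"Config": {"Env": ["PATH=/usr/bin", "NIM_HTTP_PORT=8000"]}}]'
print(ngc_env_names(sample))
```

An empty list corresponds to the "(no output)" result of the `grep -i ngc` command above.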

Compounding issue: the wizard polls a dead container for the full 300s before giving up, so the user loses ~5 min on top of the 1-2 min image pull. After the timeout, the "Falling back to cloud API" branch does NOT prompt the user for an alternate provider; it just exits.
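
One possible shape for a fix to the polling behavior, sketched in Python under stated assumptions: the health loop checks the container state on every iteration and returns as soon as the container has stopped. `wait_for_nim` and `docker_status` are hypothetical names, the health probe is passed in as a callable because the wizard's actual probe is not shown in this report, and the 2s interval is arbitrary.

```python
import subprocess
import time


def docker_status(name):
    """Return the container's State.Status, e.g. "running" or "exited"."""
    out = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Status}}", name],
        capture_output=True, text=True, check=True)
    return out.stdout.strip()


def wait_for_nim(get_status, is_healthy, timeout=300.0, interval=2.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll until healthy, but stop as soon as the container has stopped.

    Returns "healthy", "exited", or "timeout".
    """
    deadline = clock() + timeout
    while clock() < deadline:
        # Check liveness first: a dead container can never become healthy.
        if get_status() in ("exited", "dead"):
            return "exited"
        if is_healthy():
            return "healthy"
        sleep(interval)
    return "timeout"
```

A caller could pass `lambda: docker_status("nemoclaw-nim-nemoclaw")` as `get_status`. On an "exited" result it could print the tail of `docker logs` right away, which in this run would have surfaced the Authentication Error about 11s after start instead of after 300s.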
Logs
docker logs nemoclaw-nim-nemoclaw (tail):
  ===========================================
  == NVIDIA Inference Microservice LLM NIM ==
  ===========================================
  NVIDIA Inference Microservice LLM NIM Version 2.0.4
  ...
  ERROR 2026-05-11 08:40:07.840 nim_sdk.py:342] Download failed after 1 attempts. Last exception: Authentication Error
  ERROR 2026-05-11 08:40:07.840 model_download.py:68] Failed to download models for profile '929c6303b397a5ef7c1472230fcc8b940d7e78d230d6c3051f2f1837146c767b': Error downloading manifest: Authentication Error
  ERROR 2026-05-11 08:40:07.840 actions.py:92] Model download failed: Error downloading manifest: Authentication Error
  Shutting down services...
    Stopping nginx...

docker inspect nemoclaw-nim-nemoclaw --format '{{.State.Status}}/{{.State.ExitCode}}/{{.State.StartedAt}}/{{.State.FinishedAt}}':
  exited/0/2026-05-11T08:39:57.539935638Z/2026-05-11T08:40:08.794406746Z

The container died 11s after start; the wizard kept polling health for an additional ~289s anyway.
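
The 11s and ~289s figures follow from the two timestamps in the `docker inspect` output above. The Python sketch below reproduces the arithmetic. It assumes that polling began at container start and that the timestamps are UTC with a trailing `Z`, as shown; `parse_docker_time` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def parse_docker_time(stamp):
    """Parse a docker timestamp, truncating nanoseconds to microseconds."""
    body = stamp.rstrip("Z")  # assumes UTC, as in the inspect output above
    if "." in body:
        whole, frac = body.split(".", 1)
        body = f"{whole}.{frac[:6].ljust(6, '0')}"
    return datetime.fromisoformat(body).replace(tzinfo=timezone.utc)


started = parse_docker_time("2026-05-11T08:39:57.539935638Z")
finished = parse_docker_time("2026-05-11T08:40:08.794406746Z")

lifetime = (finished - started).total_seconds()
wasted = 300 - lifetime  # 300s is the wizard's health timeout

print(f"container lifetime: {lifetime:.1f}s, polling after exit: ~{wasted:.0f}s")
```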

Full T560817 day0 test report: /home/lab/day0-automation/20260511/report-T560817.txt

Related bug with a similar symptom (different ecosystem): NVBug 4926446 (NIM nv-embedqa-e5-v5: NGC_API_KEY not passed through, with the same "container exited but job kept polling health" behavior). The fix pattern is likely transferable.

Bug Details

Priority:     Unprioritized
Action:       Dev - Open - To fix
Disposition:  Open issue
Module:       Machine Learning - NemoClaw
Keyword:      NemoClaw, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Inference, NemoClaw_Onboard

[NVB#6163749]

Labels

NV QA: Bugs found by the NVIDIA QA Team
area: integrations: Third-party service integration behavior
platform: ubuntu: Affects Ubuntu Linux environments