Skip to content

fix(nim): restore NGC_API_KEY env passthrough and fast-fail health check#3335

Merged
ericksoa merged 3 commits into
mainfrom
fix/3333-restore-nim-ngc-api-key
May 11, 2026
Merged

fix(nim): restore NGC_API_KEY env passthrough and fast-fail health check#3335
ericksoa merged 3 commits into
mainfrom
fix/3333-restore-nim-ngc-api-key

Conversation

@laitingsheng

@laitingsheng laitingsheng commented May 11, 2026

Copy link
Copy Markdown
Contributor

Summary

NIM-local onboard fails because NGC_API_KEY is not propagated into the NIM container, so model-manifest download returns "Authentication Error" and the container exits within ~10s. The original passthrough from #219 was lost in a string→argv refactor of startNimContainerByName. This restores it and short-circuits the 300s health-wait when the container has already exited.

Related Issue

Fixes #3333

Changes

  • startNimContainerByName accepts opts.ngcApiKey and falls back to NGC_API_KEY / NVIDIA_API_KEY from the credential store; injects -e NGC_API_KEY and -e NIM_NGC_API_KEY for the container.
  • waitForNimHealth accepts an optional container name, polls docker inspect each interval, and surfaces the last 30 log lines on early exit.
  • onboard.ts threads the captured key through both the fresh-login and already-logged-in paths.
  • New dockerLogs adapter; regression tests covering env-flag injection and the health-wait fast-fail.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai tinsonl@nvidia.com

Summary by CodeRabbit

  • New Features

    • NGC API key can now be configured during onboarding for local NIM setup, improving container manifest retrieval.
    • Enhanced NIM container health checking with early state inspection and diagnostic logging.
  • Chores

    • Added Docker logging helper and expanded test coverage for NIM initialization and health monitoring.

Review Change Stack

Fixes #3333

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 11, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented May 11, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e23bad6c-5fe1-4b13-bea6-ff1979ccd50c

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • ✅ Review completed - (🔄 Check again to review again)
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/3333-restore-nim-ngc-api-key

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/inference/nim.ts`:
- Around line 362-364: The current envFlags constant embeds API keys in argv
(envFlags) which exposes secrets; change envFlags to only include the variable
names (e.g. ["-e", "NGC_API_KEY", "-e", "NIM_NGC_API_KEY"]) and remove KEY=value
pairs, then inject the actual ngcApiKey into the child process environment (the
env option passed to the subprocess spawn/exec call) so the real values are not
visible in process arguments; update the corresponding places that build
envFlags (the envFlags declaration and the subprocess invocation that uses it
around lines where envFlags is referenced) to merge process.env with {
NGC_API_KEY: ngcApiKey, NIM_NGC_API_KEY: ngcApiKey } when ngcApiKey is present.
- Line 436: The dockerLogs call used in the fast-fail branch
(dockerLogs(container, { tail: 30, opts: { ignoreError: true } })) can block if
Docker is unresponsive; update that call to include a bounded timeout in the
opts (e.g., opts: { ignoreError: true, timeout: <ms> }) so log capture fails
fast and does not defeat the fast-fail behavior—add the timeout option to the
dockerLogs invocation near dockerLogs(container, ...) and choose a reasonable
value (e.g., 2000–5000ms) consistent with other timeouts in the codebase.

In `@src/lib/onboard.ts`:
- Around line 6820-6827: The branch currently assigns ngcApiKey using
getCredential("NGC_API_KEY") || getCredential("NVIDIA_API_KEY"), which bypasses
the canonical fill-only-if-missing hydration; replace those getCredential calls
with hydrateCredentialEnv so staged legacy/env resolution is applied (e.g., set
ngcApiKey = hydrateCredentialEnv("NGC_API_KEY") ||
hydrateCredentialEnv("NVIDIA_API_KEY")), leaving the rest of the flow (and
startNimContainerByName warning behavior) unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8835dc93-77f9-44e5-89ba-80864a11a55c

📥 Commits

Reviewing files that changed from the base of the PR and between 118541a and 392d126.

📒 Files selected for processing (4)
  • src/lib/adapters/docker/container.ts
  • src/lib/inference/nim.test.ts
  • src/lib/inference/nim.ts
  • src/lib/onboard.ts

Comment thread src/lib/inference/nim.ts Outdated
Comment thread src/lib/inference/nim.ts Outdated
Comment thread src/lib/onboard.ts
Fixes #3333

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@laitingsheng laitingsheng marked this pull request as ready for review May 11, 2026 14:30

@ericksoa ericksoa left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed against current main, including a local no-commit merge after #3346. This restores the NGC key path without putting the key in docker argv, carries the key through the already-logged-in onboarding path, and bounds/surfaces NIM container failure logs so the auth failure is fast and diagnosable. Focused build/tests and visible CI are green.

@ericksoa ericksoa merged commit b25ab53 into main May 11, 2026
22 checks passed
@miyoungc miyoungc mentioned this pull request May 12, 2026
12 tasks
miyoungc added a commit that referenced this pull request May 12, 2026
## Summary
Refreshes the release-prep docs for v0.0.39 based on changes merged
since the Friday 4pm doc refresh. Updates the source docs, bumps the
docs version metadata, and regenerates the NemoClaw user skills from the
refreshed docs.

## Changes
- #3314 -> `docs/get-started/prerequisites.md`,
`docs/get-started/quickstart.md`, `docs/reference/troubleshooting.md`:
Documents installer Docker setup, Docker group activation, and retry
guidance.
- #3317 -> `docs/get-started/quickstart.md`,
`docs/reference/commands.md`: Documents the DGX Spark and DGX Station
express install prompt and `NEMOCLAW_NO_EXPRESS`.
- #3328 and #3329 -> `docs/security/best-practices.md`,
`docs/deployment/sandbox-hardening.md`: Updates sandbox capability
hardening docs for the stricter bounding-set and `setpriv` step-down
behavior.
- #3330, #3335, and #3346 -> `docs/inference/use-local-inference.md`:
Documents Windows-host Ollama relaunch behavior, NIM key passthrough,
early health-fail diagnostics, and mixed-GPU preflight detail.
- #2406, #2883, #3001, #3244, #3267, #3318, #3320, and #3354 ->
`docs/about/release-notes.md`: Adds the v0.0.39 release-prep section
while keeping the v0.0.38 release notes intact.
- Advances the release-prep docs metadata from v0.0.38 to v0.0.39.
- Regenerates `.agents/skills/nemoclaw-user-*` from the updated source
docs.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification
- [x] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes v0.0.39

* **New Features**
  * Host alias management commands for easier configuration
  * Sandbox GPU control options during onboarding
  * Update command with check and confirmation modes

* **Documentation**
* Enhanced Linux installer guidance with Docker and group membership
handling
  * Expanded troubleshooting for permission and connectivity issues
  * Improved capability-dropping security documentation
  * Updated inference model switching commands
  * Brev environment-specific troubleshooting

* **Improvements**
  * DGX Spark/Station express install flow
  * Windows Ollama relay and health-check enhancements
  * NVIDIA NIM preflight GPU reporting

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3375)

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@wscurran wscurran added area: cli Command line interface, flags, terminal UX, or output bug-fix PR fixes a bug or regression and removed NemoClaw CLI labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: cli Command line interface, flags, terminal UX, or output bug-fix PR fixes a bug or regression

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Ubuntu 24.04][Onboard] NIM-local onboard fails: NGC_API_KEY not propagated to NIM container, model manifest download returns Authentication Error

3 participants