fix(nim): restore NGC_API_KEY env passthrough and fast-fail health check#3335
Conversation
Fixes #3333 Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/inference/nim.ts`:
- Around line 362-364: The current envFlags constant embeds API keys in argv
(envFlags) which exposes secrets; change envFlags to only include the variable
names (e.g. ["-e", "NGC_API_KEY", "-e", "NIM_NGC_API_KEY"]) and remove KEY=value
pairs, then inject the actual ngcApiKey into the child process environment (the
env option passed to the subprocess spawn/exec call) so the real values are not
visible in process arguments; update the corresponding places that build
envFlags (the envFlags declaration and the subprocess invocation that uses it
around lines where envFlags is referenced) to merge process.env with {
NGC_API_KEY: ngcApiKey, NIM_NGC_API_KEY: ngcApiKey } when ngcApiKey is present.
- Line 436: The dockerLogs call used in the fast-fail branch
(dockerLogs(container, { tail: 30, opts: { ignoreError: true } })) can block if
Docker is unresponsive; update that call to include a bounded timeout in the
opts (e.g., opts: { ignoreError: true, timeout: <ms> }) so log capture fails
fast and does not defeat the fast-fail behavior—add the timeout option to the
dockerLogs invocation near dockerLogs(container, ...) and choose a reasonable
value (e.g., 2000–5000ms) consistent with other timeouts in the codebase.
In `@src/lib/onboard.ts`:
- Around line 6820-6827: The branch currently assigns ngcApiKey using
getCredential("NGC_API_KEY") || getCredential("NVIDIA_API_KEY"), which bypasses
the canonical fill-only-if-missing hydration; replace those getCredential calls
with hydrateCredentialEnv so staged legacy/env resolution is applied (e.g., set
ngcApiKey = hydrateCredentialEnv("NGC_API_KEY") ||
hydrateCredentialEnv("NVIDIA_API_KEY")), leaving the rest of the flow (and
startNimContainerByName warning behavior) unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8835dc93-77f9-44e5-89ba-80864a11a55c
📒 Files selected for processing (4)
src/lib/adapters/docker/container.tssrc/lib/inference/nim.test.tssrc/lib/inference/nim.tssrc/lib/onboard.ts
Fixes #3333 Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
ericksoa
left a comment
There was a problem hiding this comment.
Reviewed against current main, including a local no-commit merge after #3346. This restores the NGC key path without putting the key in docker argv, carries the key through the already-logged-in onboarding path, and bounds/surfaces NIM container failure logs so the auth failure is fast and diagnosable. Focused build/tests and visible CI are green.
## Summary Refreshes the release-prep docs for v0.0.39 based on changes merged since the Friday 4pm doc refresh. Updates the source docs, bumps the docs version metadata, and regenerates the NemoClaw user skills from the refreshed docs. ## Changes - #3314 -> `docs/get-started/prerequisites.md`, `docs/get-started/quickstart.md`, `docs/reference/troubleshooting.md`: Documents installer Docker setup, Docker group activation, and retry guidance. - #3317 -> `docs/get-started/quickstart.md`, `docs/reference/commands.md`: Documents the DGX Spark and DGX Station express install prompt and `NEMOCLAW_NO_EXPRESS`. - #3328 and #3329 -> `docs/security/best-practices.md`, `docs/deployment/sandbox-hardening.md`: Updates sandbox capability hardening docs for the stricter bounding-set and `setpriv` step-down behavior. - #3330, #3335, and #3346 -> `docs/inference/use-local-inference.md`: Documents Windows-host Ollama relaunch behavior, NIM key passthrough, early health-fail diagnostics, and mixed-GPU preflight detail. - #2406, #2883, #3001, #3244, #3267, #3318, #3320, and #3354 -> `docs/about/release-notes.md`: Adds the v0.0.39 release-prep section while keeping the v0.0.38 release notes intact. - Advances the release-prep docs metadata from v0.0.38 to v0.0.39. - Regenerates `.agents/skills/nemoclaw-user-*` from the updated source docs. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [x] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes v0.0.39 * **New Features** * Host alias management commands for easier configuration * Sandbox GPU control options during onboarding * Update command with check and confirmation modes * **Documentation** * Enhanced Linux installer guidance with Docker and group membership handling * Expanded troubleshooting for permission and connectivity issues * Improved capability-dropping security documentation * Updated inference model switching commands * Brev environment-specific troubleshooting * **Improvements** * DGX Spark/Station express install flow * Windows Ollama relay and health-check enhancements * NVIDIA NIM preflight GPU reporting [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3375) <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
NIM-local onboard fails because NGC_API_KEY is not propagated into the NIM container, so model-manifest download returns "Authentication Error" and the container exits within ~10s. The original passthrough from #219 was lost in a string→argv refactor of
startNimContainerByName. This restores it and short-circuits the 300s health-wait when the container has already exited.Related Issue
Fixes #3333
Changes
startNimContainerByNameacceptsopts.ngcApiKeyand falls back toNGC_API_KEY/NVIDIA_API_KEYfrom the credential store; injects-e NGC_API_KEYand-e NIM_NGC_API_KEYfor the container.waitForNimHealthaccepts an optional container name, pollsdocker inspecteach interval, and surfaces the last 30 log lines on early exit.onboard.tsthreads the captured key through both the fresh-login and already-logged-in paths.dockerLogsadapter; regression tests covering env-flag injection and the health-wait fast-fail.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesmake docsbuilds without warnings (doc changes only)Signed-off-by: Tinson Lai tinsonl@nvidia.com
Summary by CodeRabbit
New Features
Chores