Fix Local NIM onboarding on RTX 3090 / WSL #236

Closed
benwgarton wants to merge 2 commits into NVIDIA:main from benwgarton:codex/nemoclaw-3090-local-nim

Conversation

@benwgarton

Summary

This PR fixes a Local NIM onboarding path that fails on a common consumer-GPU setup: a single RTX 3090 running under WSL.

What this changes

  • adds Local NIM runtime support for NGC_API_KEY
  • avoids requesting a GPU-backed OpenShell sandbox for host-side Local NIM
  • increases the Local NIM health wait so slower first-start paths are not treated as failures
  • applies the runtime overrides needed for meta/llama-3.1-8b-instruct to start successfully on RTX 3090 + WSL:
    • NIM_MODEL_PROFILE=default
    • NIM_RELAX_MEM_CONSTRAINTS=1
    • NIM_MAX_GPU_MEMORY_UTILIZATION_STARTUP=1.0
    • NIM_MAX_MODEL_LEN=32768
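
The overrides listed above can be sketched as a small helper that builds the environment flags for `docker run`. This is an illustrative sketch rather than the code in this PR: `buildNimEnvArgs` and `RTX3090_WSL_OVERRIDES` are hypothetical names, and only the variable names and values come from the list above.

```typescript
// Hypothetical helper: the names are illustrative, not taken from the PR diff.
// The variable names and values are the ones listed in the PR description.
const RTX3090_WSL_OVERRIDES: Record<string, string> = {
  NIM_MODEL_PROFILE: "default",
  NIM_RELAX_MEM_CONSTRAINTS: "1",
  NIM_MAX_GPU_MEMORY_UTILIZATION_STARTUP: "1.0",
  NIM_MAX_MODEL_LEN: "32768",
};

// Builds `-e` flags for `docker run`. NGC_API_KEY is passed by name only
// (`-e NGC_API_KEY`), so Docker copies the value from the caller's
// environment and the secret never appears in the argument list.
function buildNimEnvArgs(
  overrides: Record<string, string>,
  passNgcKey: boolean,
): string[] {
  const args: string[] = [];
  if (passNgcKey) {
    args.push("-e", "NGC_API_KEY");
  }
  for (const [name, value] of Object.entries(overrides)) {
    args.push("-e", `${name}=${value}`);
  }
  return args;
}

const envArgs = buildNimEnvArgs(RTX3090_WSL_OVERRIDES, true);
console.log(envArgs.join(" "));
```

Keeping the overrides in one table makes it easier to apply them only for the host class that needs them, instead of changing the defaults for every GPU.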

Why

During validation on an RTX 3090, the original requested model (nvidia/llama-3.3-nemotron-super-49b-v1) had no runnable profile on this hardware. Switching to the smaller Local NIM image (meta/llama-3.1-8b-instruct) worked, but only after:

  • passing NGC_API_KEY at container runtime
  • selecting the generic vLLM profile
  • reducing max model length so KV cache fits on the card
  • allowing enough time for the slower WSL startup path to finish

Without these changes, onboarding either falls back incorrectly or fails even though a working Local NIM configuration exists for this host class.
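
The longer health wait can be pictured as a deadline-based polling loop. The sketch below is an assumption about the general shape, not the PR's implementation: `waitForHealthy`, its parameters, and the injectable `sleep` are hypothetical.

```typescript
// Hypothetical sketch: `waitForHealthy` and its parameters are assumptions
// made for illustration, not the PR's implementation.
type Probe = () => Promise<boolean>;

async function waitForHealthy(
  probe: Probe,
  timeoutMs: number,
  intervalMs: number,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    // A probe that throws (for example, connection refused while the
    // container is still starting) counts as "not ready yet".
    const ok = await probe().catch(() => false);
    if (ok) return true;
    // Give up once another sleep would take us past the deadline.
    if (Date.now() + intervalMs > deadline) return false;
    await sleep(intervalMs);
  }
}
```

A caller would pass a probe that requests the Local NIM endpoint and a timeout long enough for a first start under WSL. Treating probe errors as "not ready" rather than as failures is what keeps a slow startup from being reported as a broken one.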

Scope

This PR keeps the fix narrow and only changes the Local NIM onboarding/runtime path.

Validation

Validated locally with:

  • OpenShell healthy and GPU-visible
  • Docker GPU passthrough working
  • Local NIM healthy on http://localhost:8000/v1/models
  • NemoClaw sandbox configured to use vllm-local with meta/llama-3.1-8b-instruct
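
The endpoint check in the list above can be made stricter by confirming that the expected model is actually being served. This is a minimal sketch that assumes the endpoint returns the OpenAI-compatible list shape (`{ "data": [{ "id": ... }] }`); `servesModel` is a hypothetical helper and the sample body is made up for illustration.

```typescript
// Assumes the OpenAI-compatible list shape: { "data": [{ "id": "<model>" }] }.
// `servesModel` is a hypothetical helper, not part of the PR.
interface ModelList {
  data?: Array<{ id?: string } | null>;
}

function servesModel(body: string, modelId: string): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(body);
  } catch {
    return false; // not JSON: the server is not ready or is not the expected endpoint
  }
  if (typeof parsed !== "object" || parsed === null) return false;
  const data = (parsed as ModelList).data;
  return Array.isArray(data) && data.some((m) => m?.id === modelId);
}

// Illustrative response body, not captured from a real run.
const sampleBody =
  '{"object":"list","data":[{"id":"meta/llama-3.1-8b-instruct","object":"model"}]}';
console.log(servesModel(sampleBody, "meta/llama-3.1-8b-instruct"));
```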

@cv

cv commented Mar 21, 2026

Collaborator

Hey @benwgarton, appreciate you sorting out the local NIM onboarding on RTX 3090 / WSL — that's a setup a lot of people run into issues with. Just a quick ask: there have been a good number of changes to main since this PR (new CI, features, etc.), and a rebase would help us review this with confidence. Could you update against the latest main whenever you get a chance? Looking forward to checking it out!

mafueee pushed a commit to mafueee/NemoClaw that referenced this pull request Mar 28, 2026
* Updated readme
ericksoa added a commit that referenced this pull request Apr 20, 2026
<!-- markdownlint-disable MD041 -->
## Summary
Local NIM onboarding fails at the image pull step because `docker pull
nvcr.io/nim/...` requires NGC registry authentication. This adds an NGC
API key prompt during onboard that runs `docker login nvcr.io
--password-stdin` before pulling the NIM image. The key is masked during
input and handled securely via stdin.

## Related Issue
Based on the investigation in PR #236.

## Changes
- `src/lib/nim.ts`: Add `isNgcLoggedIn()` to check if Docker is already
authenticated with nvcr.io, and `dockerLoginNgc()` to login securely via
`--password-stdin`.
- `src/lib/onboard.ts`: Prompt for NGC API key before NIM image pull
when not already logged in. Masked input, one retry on failure.
- `test/onboard-selection.test.ts`: Mock `isNgcLoggedIn` in NIM-local
selection test.
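
A rough sketch of the two pieces described above, under stated assumptions: the real `isNgcLoggedIn()` and `dockerLoginNgc()` in `src/lib/nim.ts` may work differently. Here `hasRegistryAuthEntry` and `ngcLoginArgs` are hypothetical names, and the login check assumes Docker records a per-registry entry under `auths` in its `config.json`.

```typescript
const NGC_REGISTRY = "nvcr.io";

// Hypothetical check: returns true when the Docker config JSON has an
// `auths` entry for the registry. With a credential store configured, the
// entry may be an empty object, so only the key's presence is tested.
function hasRegistryAuthEntry(
  dockerConfigJson: string,
  registry: string = NGC_REGISTRY,
): boolean {
  try {
    const cfg = JSON.parse(dockerConfigJson) as
      | { auths?: Record<string, unknown> }
      | null;
    const auths = cfg?.auths;
    return typeof auths === "object" && auths !== null && registry in auths;
  } catch {
    return false; // unreadable config is treated as "not logged in"
  }
}

// Arguments for `docker login`. NGC uses the literal username `$oauthtoken`;
// the API key itself is written to the child process's stdin (for example
// through the `input` option of `spawnSync` from node:child_process), so it
// never appears in the argument list or the shell history.
function ngcLoginArgs(): string[] {
  return ["login", NGC_REGISTRY, "--username", "$oauthtoken", "--password-stdin"];
}

console.log(hasRegistryAuthEntry('{"auths":{"nvcr.io":{}}}'));
console.log(["docker", ...ngcLoginArgs()].join(" "));
```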

## Type of Change

- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification

- [x] `npx prek run --all-files` passes
- [x] `npm test` passes
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only)
- [ ] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

## AI Disclosure
- [x] AI-assisted — tool: Claude Code

---
<!-- DCO sign-off required by CI. Run: git config user.name && git
config user.email -->
Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Setup wizard enforces NGC Docker authentication for NIM model setup:
interactive mode prompts for an NGC API key (one retry); non-interactive
mode prints login instructions and exits.

* **Bug Fixes / Reliability**
* Improved detection and login handling for NGC Docker credentials so
image pulls proceed only after successful authentication and failures
are reported.

* **Tests**
* Added unit tests for NGC auth detection and updated onboarding tests
to cover authenticated flows.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
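
The interactive and non-interactive behavior summarized above can be sketched as a small decision function. This is an assumed shape for illustration only: `ensureNgcAuth`, `AuthDeps`, and the outcome names are hypothetical and do not come from the commit.

```typescript
// Hypothetical sketch of the onboarding auth flow; all names are illustrative.
type AuthOutcome =
  | "already-logged-in"
  | "logged-in"
  | "needs-manual-login"
  | "failed";

interface AuthDeps {
  isLoggedIn: () => boolean;
  promptForKey: () => Promise<string>; // masked prompt in a real wizard
  login: (apiKey: string) => boolean;
}

async function ensureNgcAuth(
  deps: AuthDeps,
  interactive: boolean,
  maxAttempts: number = 2, // first attempt plus one retry
): Promise<AuthOutcome> {
  if (deps.isLoggedIn()) return "already-logged-in";
  // Non-interactive runs cannot prompt, so the caller prints login
  // instructions and exits.
  if (!interactive) return "needs-manual-login";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const key = await deps.promptForKey();
    if (deps.login(key)) return "logged-in";
  }
  return "failed";
}
```

Injecting the three dependencies keeps the flow testable without Docker, in the same spirit as mocking `isNgcLoggedIn` in the selection test.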

---------

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
Co-authored-by: Aaron Erickson 🦞 <aerickson@nvidia.com>
@wscurran

Contributor

Thanks for digging into the RTX 3090 / WSL Local NIM onboarding path — the specific gaps you hit here are real.

The files this PR targets (bin/lib/nim.js, bin/lib/onboard.js) were migrated to TypeScript in PR #1669, so this diff no longer applies directly to the codebase. bin/lib/credentials.js is still present, but nim.js and onboard.js are gone.

If the NIM onboarding issue persists on the current codebase, we'd welcome a resubmit targeting the TypeScript equivalents in src/. The RTX 3090 / WSL path is worth getting right — if you can confirm the issue still exists against main, that's a great starting point.

@wscurran closed this Apr 21, 2026
@wscurran added the area: local-models, area: providers, platform: wsl, and bug-fix labels and removed the priority: medium label Jun 3, 2026
Labels

  • area: local-models (Local model providers, downloads, launch, or connectivity)
  • area: providers (Inference provider integrations and provider behavior)
  • bug-fix (PR fixes a bug or regression)
  • platform: wsl (Affects Windows Subsystem for Linux)

5 participants