fix(onboard): pick Ollama bootstrap model from a memory-aware registry#4132
Conversation
Bootstrap-model selection on unified-memory hosts (DGX Spark, Apple Silicon, Jetson) sized only against totalMemoryMB, so a host with 128 GiB total but another GPU workload eating most of the system pool would still be promoted to the 23 GiB qwen3.6:35b default and crash the Ollama runner mid-load. Move the per-model footprints into src/lib/inference/ollama-model-registry.ts so adding a future model is a one-line registry edit, and have detectGpu populate availableMemoryMB (nvidia-smi memory.free on the primary path, MemAvailable on unified-memory and Tegra fallbacks). The selector keeps every registry entry whose requiredMemoryMB fits available memory and falls back to the smallest model when nothing else fits. Fixes #4113 Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
📝 WalkthroughWalkthroughAdds available GPU memory (availableMemoryMB) to detection, a registry of Ollama models with memory/download metadata, and refactors local model selection, prompts, non-interactive onboarding, and model-size fallbacks to prefer models that fit currently available memory. ChangesAvailable Memory Detection and Registry-Based Model Selection
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
E2E Scenario Advisor RecommendationRequired scenario E2E: None Full scenario advisor summaryE2E Scenario AdvisorBase: Required scenario E2E
Optional scenario E2E
Relevant changed files
|
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
PR Review AdvisorFindings: 1 needs attention, 1 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/inference/nim.ts`:
- Around line 79-84: The macOS/Apple branch in src/lib/inference/nim.ts
currently sets only totalMemoryMB and omits availableMemoryMB, causing
memory-aware sizing to be incorrect on Apple Silicon; update the macOS path in
the memory-probing logic (the function that builds the memory probe result which
includes totalMemoryMB and availableMemoryMB) to compute availableMemoryMB from
host vm statistics (e.g., parse vm_stat or use the appropriate host APIs to sum
free + inactive/available pages), set availableMemoryMB on the returned object
alongside totalMemoryMB, and fall back to totalMemoryMB only if the vm_stat/host
call fails so downstream callers still have a sensible value.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3349ba31-8bf0-453d-9060-3829382be205
📒 Files selected for processing (6)
src/lib/inference/local.test.tssrc/lib/inference/local.tssrc/lib/inference/nim.test.tssrc/lib/inference/nim.tssrc/lib/inference/ollama-model-registry.test.tssrc/lib/inference/ollama-model-registry.ts
Address review feedback on the bootstrap-model selector: - Add downloadSizeBytes alongside requiredMemoryMB in the registry and have model-size.ts read its fallback table from there, removing the duplicated model facts in src/lib/inference/ollama/model-size.ts. - Route the non-interactive NEMOCLAW_MODEL / recovered-session path through a new resolveNonInteractiveOllamaModel helper so an explicit oversized model triggers the same downgrade + warning as the menu-default path. Unknown user-supplied tags stay respected. - Filter the installed-model selection in getDefaultOllamaModel so a previously-pulled large model is not blindly returned on a host that can no longer fit it. - Narrow the macOS scope in the GpuDetection comment: the platform still only reports total memory (no vm_stat probe yet); the registry test is explicit that apple silicon only matches when availableMemoryMB is supplied by the caller. - Update the user-facing docs to describe the new available-memory driven downgrade rather than the old 32 GiB total threshold. - Drop the lingering issue-number references from new code comments. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
🌿 Preview your docs: https://nvidia-preview-pr-4132.docs.buildwithfern.com/nemoclaw |
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
Address second round of review feedback on the bootstrap-model selector: - promptOllamaModel filters its installed-model list through modelFitsAvailableMemory before computing the default index. Without this, a host with only an oversized model installed would surface the registry fallback default at index 0, so pressing Enter would re- select the model the runner is about to crash on. - Warn explicitly when nothing in the registry fits available memory via anyRegistryModelFits; both the interactive menu and the non- interactive resolver now log a "free memory or expect the runner to reject the load" line before returning the smallest fallback. - nim.test.ts now asserts availableMemoryMB on the GB10 / Spark / Orin / Tegra unified-memory paths and adds a parse-failure case where memory.free returns `[N/A]` but memory.total still parses. - Centralise the role aliases: local.ts now derives SMALL_OLLAMA_MODEL from SMALLEST_OLLAMA_MODEL_TAG and asserts that DEFAULT_OLLAMA_MODEL / QWEN3_6_OLLAMA_MODEL still resolve to live registry entries, so a registry edit fails module load instead of silently desyncing. - Add a macOS vm_stat probe so apple silicon hosts also populate availableMemoryMB and the registry filter is no longer a one-way claim for that platform. - Drop the lingering #3510 reference in a new comment; update model-size.ts wording from "re-exported" to "aliased locally". Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-double-onboard-e2e | ⏭️ skipped |
| gpu-e2e | ⏭️ skipped |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
docs/inference/use-local-inference.mdx (1)
47-47: ⚡ Quick winSplit into two sentences, one per line.
The guideline requires one sentence per line for diff readability. This line contains two independent clauses separated by a semicolon that should be on separate lines.
📝 Proposed formatting fix
-On hosts where the larger starter models fit the currently available GPU memory, the starter list includes `qwen3.6:35b` and selects it by default; when another GPU workload is using most of the memory at onboard time, NemoClaw downgrades the menu to the largest model that still fits. +On hosts where the larger starter models fit the currently available GPU memory, the starter list includes `qwen3.6:35b` and selects it by default. +When another GPU workload is using most of the memory at onboard time, NemoClaw downgrades the menu to the largest model that still fits.As per coding guidelines: "One sentence per line in source (makes diffs readable). Flag paragraphs where multiple sentences appear on the same line."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/inference/use-local-inference.mdx` at line 47, Split the single line that contains two independent clauses into two separate sentences, each on its own line: change the line containing "On hosts where the larger starter models fit the currently available GPU memory, the starter list includes `qwen3.6:35b` and selects it by default; when another GPU workload is using most of the memory at onboard time, NemoClaw downgrades the menu to the largest model that still fits." into two lines such as "On hosts where the larger starter models fit the currently available GPU memory, the starter list includes `qwen3.6:35b` and selects it by default." and "When another GPU workload is using most of the memory at onboard time, NemoClaw downgrades the menu to the largest model that still fits." ensuring each sentence occupies its own line for diff readability.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@docs/inference/use-local-inference.mdx`:
- Line 47: Split the single line that contains two independent clauses into two
separate sentences, each on its own line: change the line containing "On hosts
where the larger starter models fit the currently available GPU memory, the
starter list includes `qwen3.6:35b` and selects it by default; when another GPU
workload is using most of the memory at onboard time, NemoClaw downgrades the
menu to the largest model that still fits." into two lines such as "On hosts
where the larger starter models fit the currently available GPU memory, the
starter list includes `qwen3.6:35b` and selects it by default." and "When
another GPU workload is using most of the memory at onboard time, NemoClaw
downgrades the menu to the largest model that still fits." ensuring each
sentence occupies its own line for diff readability.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2c318d35-7d7b-4ed8-bc1b-0804e3e6e8de
📒 Files selected for processing (10)
docs/inference/use-local-inference.mdxsrc/lib/inference/local.test.tssrc/lib/inference/local.tssrc/lib/inference/nim.test.tssrc/lib/inference/nim.tssrc/lib/inference/ollama-model-registry.test.tssrc/lib/inference/ollama-model-registry.tssrc/lib/inference/ollama/model-size.tssrc/lib/inference/ollama/proxy.tssrc/lib/onboard.ts
…probe Address third round of review feedback: - resolveNonInteractiveOllamaModel now surfaces the no-fit warning on the explicit-oversize path too: when NEMOCLAW_MODEL names a known oversized tag and the fallback also exceeds available memory, the user sees both the "falling back to qwen2.5:7b" line and the "no known model fits" line so the second probe failure is not surprising. Add a regression test exercising the <8 GB free case. - New src/lib/inference/ollama/proxy.test.ts exercises the interactive menu installed-model fit filter: an installed-only oversized tag downgrades to a fitting starter, a fitting installed tag stays as the default, and an unknown tag is respected. - nim.test.ts adds macOS coverage: a Darwin mock returning system_profiler + sysctl + vm_stat → expects availableMemoryMB, plus a vm_stat parse-failure case that drops the field cleanly. - docs/inference/use-local-inference.mdx now notes that known oversized NEMOCLAW_MODEL tags are downgraded with a warning while unknown tags pass through to the Ollama runner's own validation, and splits the L47 semicolon-joined sentence into two. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-double-onboard-e2e | ⏭️ skipped |
| gpu-e2e | ⏭️ skipped |
…hold export - promptOllamaModel now declares its parameter as `GpuInfo | null` via a type-only import, so the emitted .d.ts no longer pins the type to the default-value `null`. proxy.test.ts callers (and any other typed consumer) can pass real GpuInfo shapes without tsc complaints. - LARGE_OLLAMA_MIN_MEMORY_MB was only kept around to make existing tests look symmetric after the registry refactor took over the selector. Drop the export and have local.test.ts derive the "large enough to fit everything" memory threshold from the live registry, so a future model change does not silently desync the test fixture. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
Summary
Bootstrap-model selection sized only against
totalMemoryMB, so a 128 GiB host with ~12 GiB actually free still got promoted to the 23 GiBqwen3.6:35bdefault and Ollama crashed mid-load. The non-interactiveNEMOCLAW_MODEL=qwen3.6:35bpath bypassed the menu entirely.Add a memory-aware registry as the single source of truth, populate
availableMemoryMBfromnvidia-smi memory.free/MemAvailable/vm_stat, and route every selection path (menu, default, interactive, explicit env var, recovered session) through one capacity check. Unknown user-supplied tags still pass through.Related Issue
Fixes #4113
Changes
src/lib/inference/ollama-model-registry.ts— new module holding the memory-aware model registry and capacity helpers.src/lib/inference/local.ts— selection helpers delegate to the registry; newresolveNonInteractiveOllamaModelgates the explicit-model path.src/lib/inference/nim.ts—detectGpupopulatesavailableMemoryMBon NVIDIA, unified-memory Linux, Tegra, and macOS.src/lib/inference/ollama/proxy.ts— interactive menu filters installed models through the capacity check.src/lib/inference/ollama/model-size.ts— fallback download-size table now derives from the registry.src/lib/onboard.ts— non-interactive selection delegates toresolveNonInteractiveOllamaModel.docs/inference/use-local-inference.mdx— wording updated for available-memory-based selection.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesmake docsbuilds without warnings (doc changes only)Signed-off-by: Tinson Lai tinsonl@nvidia.com
Summary by CodeRabbit
New Features
Documentation
Tests