fix(onboard): pick Ollama bootstrap model from a memory-aware registry #4132

Merged
cv merged 6 commits into main from fix/ollama-memory-aware-model-4113
May 23, 2026

laitingsheng (Contributor) commented May 23, 2026

Summary

Bootstrap-model selection sized only against totalMemoryMB, so a 128 GiB host with ~12 GiB actually free still got promoted to the 23 GiB qwen3.6:35b default and Ollama crashed mid-load. The non-interactive NEMOCLAW_MODEL=qwen3.6:35b path bypassed the menu entirely.

Add a memory-aware registry as the single source of truth, populate availableMemoryMB from nvidia-smi memory.free / MemAvailable / vm_stat, and route every selection path (menu, default, interactive, explicit env var, recovered session) through one capacity check. Unknown user-supplied tags still pass through.
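
The capacity check described above can be sketched roughly as follows. This is a minimal illustration, not the merged code: the `GpuInfoSketch` and `RegistryEntrySketch` shapes and both helper names are hypothetical, the 23 GiB footprint comes from the summary, and the `qwen2.5:7b` footprint is a placeholder.

```typescript
// Hypothetical shapes; field names follow the PR text, values are illustrative.
interface GpuInfoSketch {
  totalMemoryMB: number;
  availableMemoryMB?: number; // omitted when the probe cannot be parsed
}

interface RegistryEntrySketch {
  tag: string;
  requiredMemoryMB: number;
}

const REGISTRY_SKETCH: RegistryEntrySketch[] = [
  { tag: "qwen2.5:7b", requiredMemoryMB: 8 * 1024 }, // placeholder footprint
  { tag: "qwen3.6:35b", requiredMemoryMB: 23 * 1024 }, // ~23 GiB per the summary
];

// Prefer currently free memory; fall back to total when availability is unknown.
function effectiveMemoryMB(gpu: GpuInfoSketch): number {
  return gpu.availableMemoryMB ?? gpu.totalMemoryMB;
}

// Only registry entries can be rejected; unknown user-supplied tags pass through.
function fitsAvailableMemory(tag: string, gpu: GpuInfoSketch): boolean {
  const entry = REGISTRY_SKETCH.find((candidate) => candidate.tag === tag);
  if (entry === undefined) return true;
  return entry.requiredMemoryMB <= effectiveMemoryMB(gpu);
}

// The failing scenario from the summary: 128 GiB total, about 12 GiB free.
const busyHost: GpuInfoSketch = { totalMemoryMB: 128 * 1024, availableMemoryMB: 12 * 1024 };
console.log(fitsAvailableMemory("qwen3.6:35b", busyHost)); // false
console.log(fitsAvailableMemory("qwen2.5:7b", busyHost)); // true
console.log(fitsAvailableMemory("my-custom:latest", busyHost)); // true
```

Routing every selection path through one predicate like this is what keeps the menu, the default, and the explicit env-var path from disagreeing about what fits.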

Related Issue

Fixes #4113

Changes

  • src/lib/inference/ollama-model-registry.ts — new module holding the memory-aware model registry and capacity helpers.
  • src/lib/inference/local.ts — selection helpers delegate to the registry; new resolveNonInteractiveOllamaModel gates the explicit-model path.
  • src/lib/inference/nim.ts — detectGpu populates availableMemoryMB on NVIDIA, unified-memory Linux, Tegra, and macOS.
  • src/lib/inference/ollama/proxy.ts — interactive menu filters installed models through the capacity check.
  • src/lib/inference/ollama/model-size.ts — fallback download-size table now derives from the registry.
  • src/lib/onboard.ts — non-interactive selection delegates to resolveNonInteractiveOllamaModel.
  • docs/inference/use-local-inference.mdx — wording updated for available-memory-based selection.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

Summary by CodeRabbit

  • New Features

    • GPU detection now reports available (free) memory and uses it to pick starter and default models.
    • Onboarding and model prompts now show only models that fit the host GPU and warn when a requested model won’t fit.
  • Documentation

    • Updated local inference setup to reflect memory-aware starter model selection and non-interactive fallback behavior.
  • Tests

    • Added extensive tests covering model registry, memory-aware selection, and GPU detection.

Bootstrap-model selection on unified-memory hosts (DGX Spark, Apple
Silicon, Jetson) sized only against totalMemoryMB, so a host with
128 GiB total but another GPU workload eating most of the system pool
would still be promoted to the 23 GiB qwen3.6:35b default and crash the
Ollama runner mid-load.

Move the per-model footprints into src/lib/inference/ollama-model-registry.ts
so adding a future model is a one-line registry edit, and have detectGpu
populate availableMemoryMB (nvidia-smi memory.free on the primary path,
MemAvailable on unified-memory and Tegra fallbacks). The selector keeps
every registry entry whose requiredMemoryMB fits available memory and
falls back to the smallest model when nothing else fits.
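
As a rough sketch of the selector behaviour this commit message describes (keep what fits, prefer the largest, fall back to the smallest), assuming a hypothetical `MODEL_ROWS` registry with placeholder footprints apart from the 23 GiB figure quoted above:

```typescript
// Hypothetical registry rows; not the contents of ollama-model-registry.ts.
interface ModelRow {
  tag: string;
  requiredMemoryMB: number;
}

const MODEL_ROWS: ModelRow[] = [
  { tag: "qwen2.5:7b", requiredMemoryMB: 8 * 1024 }, // placeholder footprint
  { tag: "qwen3.6:35b", requiredMemoryMB: 23 * 1024 }, // ~23 GiB per the commit message
];

// Keep every registry entry whose footprint fits the available memory.
function fittableTags(availableMB: number): string[] {
  return MODEL_ROWS.filter((row) => row.requiredMemoryMB <= availableMB).map((row) => row.tag);
}

// Pick the largest fitting model; fall back to the smallest when nothing fits.
function pickBootstrapTag(availableMB: number): string {
  let smallest: ModelRow | undefined;
  let best: ModelRow | undefined;
  for (const row of MODEL_ROWS) {
    if (!smallest || row.requiredMemoryMB < smallest.requiredMemoryMB) smallest = row;
    const fits = row.requiredMemoryMB <= availableMB;
    if (fits && (!best || row.requiredMemoryMB > best.requiredMemoryMB)) best = row;
  }
  const chosen = best ?? smallest;
  if (!chosen) throw new Error("model registry is empty");
  return chosen.tag;
}

console.log(pickBootstrapTag(64 * 1024)); // qwen3.6:35b
console.log(pickBootstrapTag(12 * 1024)); // qwen2.5:7b
console.log(pickBootstrapTag(4 * 1024)); // qwen2.5:7b (nothing fits, smallest fallback)
```

With a table like this, adding a model is a one-row edit and the selection logic does not change.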

Fixes #4113

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

coderabbitai (bot, Contributor) commented May 23, 2026

Walkthrough

Adds available GPU memory (availableMemoryMB) to detection, a registry of Ollama models with memory/download metadata, and refactors local model selection, prompts, non-interactive onboarding, and model-size fallbacks to prefer models that fit currently available memory.

Changes

Available Memory Detection and Registry-Based Model Selection

| Layer | File(s) | Summary |
| --- | --- | --- |
| GPU detection with available memory | src/lib/inference/nim.ts, src/lib/inference/nim.test.ts | detectGpu/VM/host probes expose optional availableMemoryMB from nvidia-smi memory.free, free -m MemAvailable, or vm_stat; tests updated for the new field. |
| Ollama model registry and helpers | src/lib/inference/ollama-model-registry.ts, src/lib/inference/ollama-model-registry.test.ts | New registry exports per-tag requiredMemoryMB and downloadSizeBytes, effectiveGpuMemoryMB(), modelFitsAvailableMemory(), fittableOllamaModelTags(), largestFittableOllamaModelTag(), and download-size fallback map with comprehensive tests. |
| Local selection, prompts, onboarding integration | src/lib/inference/local.ts, src/lib/inference/local.test.ts, src/lib/inference/ollama/model-size.ts, src/lib/inference/ollama/proxy.ts, src/lib/inference/ollama/proxy.test.ts, src/lib/onboard.ts, docs/inference/use-local-inference.mdx | GpuInfo extended with availableMemoryMB?; bootstrap/default selection now uses registry fittable tags; non-interactive resolution downgrades known oversize tags with warnings; installed-model prompts filter by fit; model-size fallback uses registry map; onboarding calls resolver. Tests and docs updated. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Suggested labels

Provider: Ollama, documentation, enhancement: testing, v0.0.50

Suggested reviewers

  • ericksoa
  • cv
  • jyaunches

Poem

🐇 I hopped through GPUs late at night,
Counting free memory by soft moonlight.
When giants don't fit the space at hand,
I nudge you toward a smaller, safer land.
Models load happy — the rabbit's delight.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title 'fix(onboard): pick Ollama bootstrap model from a memory-aware registry' directly and clearly summarizes the main change: routing Ollama model selection through a new memory-aware registry to respect available host memory. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

github-actions (bot, Contributor) commented May 23, 2026

E2E Advisor Recommendation

Required E2E: gpu-e2e
Optional E2E: gpu-double-onboard-e2e, ollama-proxy-e2e, gpu-repo-local-ollama-openclaw

Dispatch hint: gpu-e2e

Auto-dispatched E2E: gpu-e2e via nightly-e2e.yaml at d7688131f358be43eaae025565469f05490226f4 (nightly run)

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • gpu-e2e (high (~30 minutes; self-hosted GPU)): Required because the PR changes real local Ollama onboarding/model selection and GPU-memory detection. This job exercises the source install, NEMOCLAW_PROVIDER=ollama onboarding, Ollama install/start/pull, auth proxy setup, sandbox creation, and live sandbox inference on an NVIDIA GPU runner, catching regressions in the changed detectGpu → model selection → validation path.

Optional E2E

  • gpu-double-onboard-e2e (high (~30 minutes; self-hosted GPU)): Useful adjacent coverage for re-running Ollama onboarding with an existing sandbox/token and recovered model state. The PR changes non-interactive and recovered/requested Ollama model resolution, but proxy token consistency itself is not the primary diff, so this is optional rather than merge-blocking.
  • ollama-proxy-e2e (medium (~15 minutes; installs/pulls small Ollama model)): Useful lower-scope validation because src/lib/inference/ollama/proxy.ts was touched and the local Ollama flow depends on the authenticated proxy. It validates real Ollama inference through the proxy, token enforcement, persistence, recovery, and container reachability, but it does not directly exercise GPU available-memory model selection.
  • gpu-repo-local-ollama-openclaw (high (self-hosted GPU scenario)): Scenario-runner equivalent coverage for the local Ollama OpenClaw profile with smoke, local-ollama-inference, and ollama-proxy suites. Good confidence if validating the newer scenario framework, but overlaps with gpu-e2e for this PR.

New E2E recommendations

  • local Ollama available-memory downgrade (high): Existing GPU E2E normally runs on an idle GPU and validates the happy path, but it is unlikely to exercise the new behavior where total memory is large while currently available memory is too low and onboarding downgrades from qwen3.6:35b/nemotron to qwen2.5:7b.
    • Suggested test: Add a local Ollama onboarding E2E that creates a low-available-memory condition (for example by pre-allocating GPU memory or by using a controlled nvidia-smi/free shim in an E2E harness), runs non-interactive NEMOCLAW_PROVIDER=ollama onboarding, asserts the oversize-model warning/fallback, and verifies sandbox inference succeeds with the fallback model.
  • macOS Apple Silicon local Ollama sizing (medium): The PR adds macOS vm_stat-based availableMemoryMB handling, but the existing macOS scenario is cloud/Docker-optional and does not validate local Ollama model sizing on Apple Silicon.
    • Suggested test: Add a macOS local Ollama onboarding or dry-run E2E/assertion that verifies Apple Silicon available-memory detection influences the starter model menu/default without requiring Docker-dependent sandbox suites on GitHub-hosted macOS.

Dispatch hint

  • Workflow: .github/workflows/nightly-e2e.yaml
  • jobs input: gpu-e2e

github-actions (bot, Contributor) commented May 23, 2026

E2E Scenario Advisor Recommendation

Required scenario E2E: None
Optional scenario E2E: None

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

  • None. No scenario workflow, scenario metadata, scenario runtime, or validation-suite files changed.

Optional scenario E2E

  • None.

Relevant changed files

  • None.

github-actions (bot, Contributor) commented

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26336844880
Target ref: 0d24286bdc1eb22b8c4e66aed63703454238f810
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job Result
gpu-e2e ⏭️ skipped

github-actions (bot, Contributor) commented May 23, 2026

PR Review Advisor

Findings: 1 needs attention, 1 worth checking, 0 nice ideas
Since last review: 0 prior items resolved, 2 still apply, 0 new items found

Review findings

🛠️ Needs attention

  • Inference/onboarding hotspots grew further instead of being extracted: Codebase drift check: the PR patches files that still exist and have active recent history, including inference/network and onboarding host-glue paths. The behavioral change is in-scope, but the implementation continues adding memory-detection and model-selection logic to already-large security-adjacent files rather than extracting focused helpers. This keeps Docker/NIM detection, local inference networking, Ollama proxy behavior, and onboarding selection glue concentrated in monoliths, increasing future review and regression risk.
    • Recommendation: Extract Ollama selection policy and GPU available-memory probing/parsing into focused modules, and move new local/nim/proxy regression coverage out of the large hotspot test files where practical. At minimum, offset this PR's hotspot growth by extracting existing helper code before merge.
    • Evidence: This is the prior advisor hotspot finding and it still applies. Trusted monolith deltas show src/lib/inference/nim.test.ts 1241→1361 (+120), src/lib/inference/local.test.ts 839→953 (+114), src/lib/inference/nim.ts 710→794 (+84), src/lib/inference/local.ts 1014→1089 (+75), and src/lib/inference/ollama/proxy.ts 812→834 (+22). Drift evidence confirms these files still exist and have active recent history; src/lib/onboard.ts also overlaps many open PRs even though this patch is net-zero there.

🔎 Worth checking

  • Missing explicit unified-memory fallback coverage for absent or malformed MemAvailable (src/lib/inference/nim.test.ts:633): The new tests cover availableMemoryMB propagation on primary NVIDIA, GB10/Spark, Orin, Jetson/Tegra, macOS, and a primary-path memory.free parse failure. However, the unified-memory Linux path still has a fixture whose `free -m` output lacks the `available` column without asserting the resulting `availableMemoryMB` contract. A regression in `readHostAvailableMemoryMB` could silently treat malformed output as usable or unintentionally size against an invalid value.
    • Recommendation: Add a focused `detectGpu` test for a unified-memory/Spark or Jetson path where `free -m` lacks or malforms the `available` column, and assert that `availableMemoryMB` is omitted so downstream selection intentionally falls back to `totalMemoryMB`. Keep the positive propagation and primary-path parse-failure assertions already added.
    • Evidence: This prior advisor finding still applies. The mixed unified-memory fixture in `src/lib/inference/nim.test.ts` returns `free -m` output with only `total used free`, but only asserts name/gpus behavior. The PR added `omits availableMemoryMB when memory.free fails to parse on the primary path` and macOS parse-failure coverage, but not the requested unified-memory `MemAvailable` fallback assertion.
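
The contract the advisor asks to pin down can be illustrated with a small parser sketch. `parseFreeAvailableMB` is a hypothetical helper, not the PR's `readHostAvailableMemoryMB`, and it assumes the usual procps `free -m` layout with a header row and a `Mem:` row. It returns `undefined` when the `available` column is missing or malformed, so a caller would omit `availableMemoryMB` and size against `totalMemoryMB`.

```typescript
// Returns the "available" value in MB from `free -m` output, or undefined when
// the column is absent or the value is not a plain non-negative integer.
function parseFreeAvailableMB(output: string): number | undefined {
  const lines = output.split("\n").map((line) => line.trim());
  const header = lines.find((line) => /^total\b/.test(line));
  const memRow = lines.find((line) => /^Mem:/.test(line));
  if (header === undefined || memRow === undefined) return undefined;

  const columnIndex = header.split(/\s+/).indexOf("available");
  if (columnIndex < 0) return undefined;

  // The data row starts with the "Mem:" label, so values are shifted by one.
  const raw = memRow.split(/\s+/)[columnIndex + 1];
  if (raw === undefined || !/^\d+$/.test(raw)) return undefined;
  return Number(raw);
}

const modernOutput = [
  "              total        used        free      shared  buff/cache   available",
  "Mem:         128000       90000       10000         100       28000       12000",
  "Swap:             0           0           0",
].join("\n");

const legacyOutput = [
  "              total        used        free",
  "Mem:         128000       90000       38000",
].join("\n");

console.log(parseFreeAvailableMB(modernOutput)); // 12000
console.log(parseFreeAvailableMB(legacyOutput)); // undefined
```

A test in the style the advisor recommends would feed the legacy fixture through `detectGpu` and assert that the resulting object has no `availableMemoryMB` key.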

🌱 Nice ideas

  • None.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

coderabbitai (bot, Contributor) left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/inference/nim.ts`:
- Around line 79-84: The macOS/Apple branch in src/lib/inference/nim.ts
currently sets only totalMemoryMB and omits availableMemoryMB, causing
memory-aware sizing to be incorrect on Apple Silicon; update the macOS path in
the memory-probing logic (the function that builds the memory probe result which
includes totalMemoryMB and availableMemoryMB) to compute availableMemoryMB from
host vm statistics (e.g., parse vm_stat or use the appropriate host APIs to sum
free + inactive/available pages), set availableMemoryMB on the returned object
alongside totalMemoryMB, and fall back to totalMemoryMB only if the vm_stat/host
call fails so downstream callers still have a sensible value.
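
One way the suggested macOS probe could look, as a sketch only: `parseVmStatAvailableMB` is a hypothetical name, and treating free plus inactive pages as "available" is the approximation the review comment proposes rather than a confirmed detail of the merged code. It assumes the standard `vm_stat` text layout, with the page size in the header and counts terminated by a period.

```typescript
// Approximates available memory in MB from `vm_stat` output by summing free and
// inactive pages. Returns undefined when any required field cannot be parsed,
// so the caller can fall back to total memory.
function parseVmStatAvailableMB(output: string): number | undefined {
  const pageSizeMatch = /page size of (\d+) bytes/.exec(output);
  const freeMatch = /^Pages free:\s+(\d+)\./m.exec(output);
  const inactiveMatch = /^Pages inactive:\s+(\d+)\./m.exec(output);
  if (!pageSizeMatch || !freeMatch || !inactiveMatch) return undefined;

  const pages = Number(freeMatch[1]) + Number(inactiveMatch[1]);
  const bytes = pages * Number(pageSizeMatch[1]);
  return Math.floor(bytes / (1024 * 1024));
}

const sampleVmStat = [
  "Mach Virtual Memory Statistics: (page size of 16384 bytes)",
  "Pages free:                               65536.",
  "Pages active:                            200000.",
  "Pages inactive:                           65536.",
].join("\n");

// 131072 pages * 16384 bytes = 2 GiB
console.log(parseVmStatAvailableMB(sampleVmStat)); // 2048
console.log(parseVmStatAvailableMB("unexpected output")); // undefined
```
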

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3349ba31-8bf0-453d-9060-3829382be205

📥 Commits

Reviewing files that changed from the base of the PR and between 0f48781 and 0d24286.

📒 Files selected for processing (6)
  • src/lib/inference/local.test.ts
  • src/lib/inference/local.ts
  • src/lib/inference/nim.test.ts
  • src/lib/inference/nim.ts
  • src/lib/inference/ollama-model-registry.test.ts
  • src/lib/inference/ollama-model-registry.ts

Comment thread src/lib/inference/nim.ts
Address review feedback on the bootstrap-model selector:

- Add downloadSizeBytes alongside requiredMemoryMB in the registry and
  have model-size.ts read its fallback table from there, removing the
  duplicated model facts in src/lib/inference/ollama/model-size.ts.
- Route the non-interactive NEMOCLAW_MODEL / recovered-session path
  through a new resolveNonInteractiveOllamaModel helper so an explicit
  oversized model triggers the same downgrade + warning as the
  menu-default path. Unknown user-supplied tags stay respected.
- Filter the installed-model selection in getDefaultOllamaModel so a
  previously-pulled large model is not blindly returned on a host that
  can no longer fit it.
- Narrow the macOS scope in the GpuDetection comment: the platform still
  only reports total memory (no vm_stat probe yet); the registry test
  is explicit that apple silicon only matches when availableMemoryMB is
  supplied by the caller.
- Update the user-facing docs to describe the new available-memory
  driven downgrade rather than the old 32 GiB total threshold.
- Drop the lingering issue-number references from new code comments.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

github-actions (bot, Contributor) commented

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26337459706
Target ref: d45a6044cb44585e8386ea4320bcb2e38971561e
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job Result
gpu-e2e ⏭️ skipped

Address second round of review feedback on the bootstrap-model
selector:

- promptOllamaModel filters its installed-model list through
  modelFitsAvailableMemory before computing the default index. Without
  this, a host with only an oversized model installed would surface the
  registry fallback default at index 0, so pressing Enter would re-
  select the model the runner is about to crash on.
- Warn explicitly when nothing in the registry fits available memory
  via anyRegistryModelFits; both the interactive menu and the non-
  interactive resolver now log a "free memory or expect the runner to
  reject the load" line before returning the smallest fallback.
- nim.test.ts now asserts availableMemoryMB on the GB10 / Spark /
  Orin / Tegra unified-memory paths and adds a parse-failure case where
  memory.free returns `[N/A]` but memory.total still parses.
- Centralise the role aliases: local.ts now derives
  SMALL_OLLAMA_MODEL from SMALLEST_OLLAMA_MODEL_TAG and asserts that
  DEFAULT_OLLAMA_MODEL / QWEN3_6_OLLAMA_MODEL still resolve to live
  registry entries, so a registry edit fails module load instead of
  silently desyncing.
- Add a macOS vm_stat probe so apple silicon hosts also populate
  availableMemoryMB and the registry filter is no longer a one-way
  claim for that platform.
- Drop the lingering #3510 reference in a new comment; update
  model-size.ts wording from "re-exported" to "aliased locally".
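
The "registry edit fails module load" idea from the role-alias bullet can be sketched like this. The names `REGISTRY_TAGS`, `requireRegistryTag`, and `QWEN3_6_MODEL_ALIAS` are hypothetical, and mapping the alias to `qwen3.6:35b` is an assumption made for illustration.

```typescript
// Hypothetical list of live registry tags.
const REGISTRY_TAGS: readonly string[] = ["qwen2.5:7b", "qwen3.6:35b"];

// Returns the tag unchanged, or throws when the alias no longer names a registry entry.
function requireRegistryTag(role: string, tag: string): string {
  if (REGISTRY_TAGS.indexOf(tag) === -1) {
    throw new Error(`${role} refers to "${tag}", which is not in the registry`);
  }
  return tag;
}

// Evaluated once at module load: a stale alias throws immediately instead of
// silently drifting away from the registry.
const QWEN3_6_MODEL_ALIAS = requireRegistryTag("QWEN3_6_OLLAMA_MODEL", "qwen3.6:35b");
console.log(QWEN3_6_MODEL_ALIAS); // qwen3.6:35b
```

Failing at import time is a deliberate trade-off: it turns a quiet desync into an error that any test importing the module will surface.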

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

github-actions (bot, Contributor) commented

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26338017309
Target ref: 0f7c55578f48cc365af9d2be82924839169ccdfe
Workflow ref: main
Requested jobs: gpu-e2e,gpu-double-onboard-e2e
Summary: 0 passed, 0 failed, 2 skipped

Job Result
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped

coderabbitai (bot, Contributor) left a comment

🧹 Nitpick comments (1)
docs/inference/use-local-inference.mdx (1)

47-47: ⚡ Quick win

Split into two sentences, one per line.

The guideline requires one sentence per line for diff readability. This line contains two independent clauses separated by a semicolon that should be on separate lines.

📝 Proposed formatting fix
-On hosts where the larger starter models fit the currently available GPU memory, the starter list includes `qwen3.6:35b` and selects it by default; when another GPU workload is using most of the memory at onboard time, NemoClaw downgrades the menu to the largest model that still fits.
+On hosts where the larger starter models fit the currently available GPU memory, the starter list includes `qwen3.6:35b` and selects it by default.
+When another GPU workload is using most of the memory at onboard time, NemoClaw downgrades the menu to the largest model that still fits.

As per coding guidelines: "One sentence per line in source (makes diffs readable). Flag paragraphs where multiple sentences appear on the same line."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/inference/use-local-inference.mdx` at line 47, Split the single line
that contains two independent clauses into two separate sentences, each on its
own line: change the line containing "On hosts where the larger starter models
fit the currently available GPU memory, the starter list includes `qwen3.6:35b`
and selects it by default; when another GPU workload is using most of the memory
at onboard time, NemoClaw downgrades the menu to the largest model that still
fits." into two lines such as "On hosts where the larger starter models fit the
currently available GPU memory, the starter list includes `qwen3.6:35b` and
selects it by default." and "When another GPU workload is using most of the
memory at onboard time, NemoClaw downgrades the menu to the largest model that
still fits." ensuring each sentence occupies its own line for diff readability.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@docs/inference/use-local-inference.mdx`:
- Line 47: Split the single line that contains two independent clauses into two
separate sentences, each on its own line: change the line containing "On hosts
where the larger starter models fit the currently available GPU memory, the
starter list includes `qwen3.6:35b` and selects it by default; when another GPU
workload is using most of the memory at onboard time, NemoClaw downgrades the
menu to the largest model that still fits." into two lines such as "On hosts
where the larger starter models fit the currently available GPU memory, the
starter list includes `qwen3.6:35b` and selects it by default." and "When
another GPU workload is using most of the memory at onboard time, NemoClaw
downgrades the menu to the largest model that still fits." ensuring each
sentence occupies its own line for diff readability.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2c318d35-7d7b-4ed8-bc1b-0804e3e6e8de

📥 Commits

Reviewing files that changed from the base of the PR and between 0d24286 and 0f7c555.

📒 Files selected for processing (10)
  • docs/inference/use-local-inference.mdx
  • src/lib/inference/local.test.ts
  • src/lib/inference/local.ts
  • src/lib/inference/nim.test.ts
  • src/lib/inference/nim.ts
  • src/lib/inference/ollama-model-registry.test.ts
  • src/lib/inference/ollama-model-registry.ts
  • src/lib/inference/ollama/model-size.ts
  • src/lib/inference/ollama/proxy.ts
  • src/lib/onboard.ts

…probe

Address third round of review feedback:

- resolveNonInteractiveOllamaModel now surfaces the no-fit warning on
  the explicit-oversize path too: when NEMOCLAW_MODEL names a known
  oversized tag and the fallback also exceeds available memory, the
  user sees both the "falling back to qwen2.5:7b" line and the
  "no known model fits" line so the second probe failure is not
  surprising. Add a regression test exercising the <8 GB free case.
- New src/lib/inference/ollama/proxy.test.ts exercises the interactive
  menu installed-model fit filter: an installed-only oversized tag
  downgrades to a fitting starter, a fitting installed tag stays as
  the default, and an unknown tag is respected.
- nim.test.ts adds macOS coverage: a Darwin mock returning
  system_profiler + sysctl + vm_stat → expects availableMemoryMB,
  plus a vm_stat parse-failure case that drops the field cleanly.
- docs/inference/use-local-inference.mdx now notes that known
  oversized NEMOCLAW_MODEL tags are downgraded with a warning while
  unknown tags pass through to the Ollama runner's own validation,
  and splits the L47 semicolon-joined sentence into two.
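
The non-interactive behaviour in the first bullet can be sketched as below. This is an illustration under assumptions, not `resolveNonInteractiveOllamaModel` itself: `resolveRequested`, `KNOWN_MODELS`, and the returned `warnings` array are hypothetical, the `qwen2.5:7b` footprint is a placeholder, and the warning strings only paraphrase the lines quoted above.

```typescript
interface KnownModel {
  tag: string;
  requiredMemoryMB: number;
}

// Placeholder footprints, except the ~23 GiB figure quoted in the PR summary.
const KNOWN_MODELS: KnownModel[] = [
  { tag: "qwen2.5:7b", requiredMemoryMB: 8 * 1024 },
  { tag: "qwen3.6:35b", requiredMemoryMB: 23 * 1024 },
];
const FALLBACK_TAG = "qwen2.5:7b"; // fallback named in the commit message

interface Resolution {
  model: string;
  warnings: string[];
}

function resolveRequested(requested: string, availableMB: number): Resolution {
  const known = KNOWN_MODELS.find((m) => m.tag === requested);
  // Unknown tags and known tags that fit are returned unchanged.
  if (!known || known.requiredMemoryMB <= availableMB) {
    return { model: requested, warnings: [] };
  }

  const warnings: string[] = [];
  if (requested !== FALLBACK_TAG) {
    warnings.push(
      `${requested} needs ${known.requiredMemoryMB} MB but ${availableMB} MB is available; falling back to ${FALLBACK_TAG}`,
    );
  }
  // Second warning: even the fallback exceeds what is currently free.
  const anyFits = KNOWN_MODELS.some((m) => m.requiredMemoryMB <= availableMB);
  if (!anyFits) {
    warnings.push("no known model fits; free memory or expect the runner to reject the load");
  }
  return { model: FALLBACK_TAG, warnings };
}

console.log(resolveRequested("qwen3.6:35b", 12 * 1024).model); // qwen2.5:7b
console.log(resolveRequested("qwen3.6:35b", 4 * 1024).warnings.length); // 2
console.log(resolveRequested("my-custom:latest", 1024).model); // my-custom:latest
```

Returning warnings instead of logging them is a choice made here so the sketch is easy to assert on; the PR text describes the real resolver as logging.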

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

github-actions (bot, Contributor) commented

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26338780965
Target ref: a1f500476f26fe4424e3e17aed6e6f1179d56f42
Workflow ref: main
Requested jobs: gpu-e2e,gpu-double-onboard-e2e
Summary: 0 passed, 0 failed, 2 skipped

Job Result
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped

…hold export

- promptOllamaModel now declares its parameter as `GpuInfo | null` via a
  type-only import, so the emitted .d.ts no longer pins the type to the
  default-value `null`. proxy.test.ts callers (and any other typed
  consumer) can pass real GpuInfo shapes without tsc complaints.
- LARGE_OLLAMA_MIN_MEMORY_MB was only kept around to make existing
  tests look symmetric after the registry refactor took over the
  selector. Drop the export and have local.test.ts derive the "large
  enough to fit everything" memory threshold from the live registry,
  so a future model change does not silently desync the test fixture.
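
The typing issue in the first bullet can be illustrated with a small sketch. `GpuInfoShape` and `describeGpu` are hypothetical stand-ins; the real code imports `GpuInfo` with a type-only import, which is replaced here by a local interface to keep the example self-contained.

```typescript
// Local stand-in for the GpuInfo type that the real module imports type-only.
interface GpuInfoShape {
  totalMemoryMB: number;
  availableMemoryMB?: number;
}

// Per the commit message, leaving the parameter unannotated with a `null`
// default pinned the declared type to `null`, so typed callers could not pass
// a real object. An explicit union keeps the default and accepts both.
function describeGpu(gpu: GpuInfoShape | null = null): string {
  if (gpu === null) return "no GPU detected";
  const usableMB = gpu.availableMemoryMB ?? gpu.totalMemoryMB;
  return `${usableMB} MB usable`;
}

console.log(describeGpu()); // no GPU detected
console.log(describeGpu({ totalMemoryMB: 131072, availableMemoryMB: 12288 })); // 12288 MB usable
```
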

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

github-actions (bot, Contributor) commented

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26339106774
Target ref: d7688131f358be43eaae025565469f05490226f4
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job Result
gpu-e2e ⏭️ skipped

laitingsheng added the v0.0.51 (Release target) label May 23, 2026
cv enabled auto-merge (squash) May 23, 2026 19:59
cv merged commit d178b79 into main May 23, 2026
40 of 41 checks passed
wscurran added the area: inference (Inference routing, serving, model selection, or outputs) label Jun 3, 2026
wscurran added the bug-fix (PR fixes a bug or regression) and feature (PR adds or expands user-visible functionality) labels and removed the fix label Jun 3, 2026
wscurran removed the feature (PR adds or expands user-visible functionality) label Jun 9, 2026
Labels

area: inference (Inference routing, serving, model selection, or outputs), bug-fix (PR fixes a bug or regression), v0.0.51 (Release target)

Development

Successfully merging this pull request may close these issues.

[DGX Spark][Ollama] Onboarding selects qwen3.6:35b despite insufficient currently available GPU memory

3 participants