[Brev][Onboard] nemoclaw onboard dead-loops on Ollama model selection after the default model probe times out #4812

@hulynn

Description

On a standard Brev g6.xlarge (NVIDIA L4, 23 GB VRAM / 15 GB RAM), nemoclaw onboard with the Local Ollama provider offers nemotron-3-nano:30b (24 GB) as the default starter model. The model pulls successfully, but the wizard's post-load probe times out ("did not answer the local probe in time") and the wizard dead-loops: it re-offers the same too-large model plus "Other", with no smaller suggestion and no way to abort, so onboard never completes. The model itself is healthy (a warm direct Ollama request returns in ~4.5s). The rejection is a false negative caused by cold-load latency exceeding the 120s probe window, which is extended (to 300s) only on DGX Spark hosts.

Related (not a duplicate): NVB 6216882 / #4178 fixed the same dead-loop symptom for a different trigger (a pre-installed old Ollama 0.6.2 crashing on model load). This issue has a distinct trigger: current Ollama, a healthy model, an oversized default, and a 120s probe timeout with no retry on non-Spark hosts. The dead-loop still occurs on v0.0.59, so the loop-escape/diagnostic added for #4178 does not cover this path.

Environment

Device:        Brev VM brev-hv34p39z2 (g6.xlarge), NVIDIA L4 23034 MiB, 4 vCPU, 15 GiB RAM
OS:            Ubuntu 22.04.5 LTS
Architecture:  x86_64
Node.js:       v22.22.3
npm:           10.9.8
Docker:        29.5.2
OpenShell CLI: 0.0.44
NemoClaw:      v0.0.59
OpenClaw:      N/A (onboard does not complete with the default model)

Steps to Reproduce

  1. On a fresh Brev g6.xlarge (L4) with no Ollama models installed, run nemoclaw onboard --fresh --name brev-ollama
  2. At [3/8] inference, select Local Ollama (Install Ollama on first run).
  3. At "Ollama starter models", accept the default [2] nemotron-3-nano:30b.
  4. Confirm the 22.61 GB download (y). Wait for the pull to finish.

Expected Result

After the model loads, the wizard verifies it and proceeds. If a model cannot be loaded on this host, the wizard should not offer it as the default and/or should fall back to a host-fitting model, and must never loop indefinitely with no way forward.
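
The "never loop indefinitely" requirement could be sketched as follows. This is only an illustration under assumptions: `nextStepAfterProbeFailure` and the `NextStep` type are hypothetical names, not functions from the NemoClaw source. The idea is that a model which already failed the probe is not re-offered, and when nothing else is left the wizard stops with a reason instead of looping.

```typescript
// Hypothetical sketch: these names are illustrative, not NemoClaw's actual API.
type NextStep =
  | { kind: "offer"; models: string[] }
  | { kind: "abort"; reason: string };

// Decide what the wizard does after a probe failure.
// `available` is the current model list; `failed` holds models that already failed the probe.
function nextStepAfterProbeFailure(
  available: string[],
  failed: ReadonlySet<string>,
): NextStep {
  // Never re-offer a model that has already failed on this host.
  const remaining = available.filter((name) => !failed.has(name));
  if (remaining.length === 0) {
    return {
      kind: "abort",
      reason: "No remaining model passed the local probe; choose Other or re-run onboard.",
    };
  }
  return { kind: "offer", models: remaining };
}

// The situation from this report: the only listed model has already failed.
const step = nextStepAfterProbeFailure(
  ["nemotron-3-nano:30b"],
  new Set(["nemotron-3-nano:30b"]),
);
console.log(step.kind);
```

In the reported run, the only listed model is the one that just failed, so this sketch would end the loop with an abort message rather than offering it again.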

Actual Result

Loading Ollama model: nemotron-3-nano:30b
Selected Ollama model 'nemotron-3-nano:30b' did not answer the local probe in time.
It may still be loading, too large for the host, or otherwise unhealthy.
Choose a different Ollama model or select Other.

  Ollama models:
    1) nemotron-3-nano:30b
    2) Other...
  Choose model [1]: 1
Loading Ollama model: nemotron-3-nano:30b
Selected Ollama model 'nemotron-3-nano:30b' did not answer the local probe in time.
... (loops indefinitely; onboard never completes)

Evidence the model is healthy (warm direct probe, ~4.5s):

curl -s --max-time 90 http://localhost:11434/api/generate -d '{"model":"nemotron-3-nano:30b","prompt":"say hi","stream":false}'
 -> {"response":"Hi there! ..."}  total_duration 4493715275 ns  (real 0m4.502s)
ollama ps  -> nemotron-3-nano:30b  25 GB  15%/85% CPU/GPU  CONTEXT 4096
nvidia-smi -> 23034 MiB total, 21332 used, 1232 free  (model spills GPU to CPU)

Logs

Onboard preflight detected the host limits BEFORE offering the 24GB default:
  NVIDIA GPU detected (NVIDIA L4, 23034 MB)
  Memory OK: 15368 MB RAM + 0 MB swap

Root cause (source, NemoClaw v0.0.59):
- nemoclaw/src/lib/inference/local.ts:929 getOllamaProbeCommand — probe is
  `curl -sS --max-time 120 http://localhost:11434/api/generate` (direct :11434).
- nemoclaw/src/lib/inference/local.ts:954+ validateOllamaModel — extended 300s retry
  only fires when sparkHost is true; a non-Spark host (Brev L4) gets only 120s, no retry,
  so a cold load that exceeds 120s reports "did not answer the local probe in time".
- nemoclaw/src/lib/inference/local.ts:85 DEFAULT_OLLAMA_MODEL = nemotron-3-nano:30b (~24GB).
- nemoclaw/src/lib/inference/ollama/proxy.ts (starter list ~line 394) — the starter-model
  list is NOT filtered by modelFitsAvailableMemory() (that filter only applies to the
  installed-models list); after a probe failure the wizard re-offers the same model with
  no host-fitting fallback and no abort.
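
To make the timeout gap concrete, here is a minimal sketch of the probe schedule as described above next to one possible alternative. The function names and the "model larger than VRAM" heuristic are assumptions for illustration only; they are not taken from local.ts. The 120s and 300s values come from this report, and the ~24 GB model size is approximated as 24 * 1024 MB.

```typescript
// Hypothetical sketch: these names are illustrative, not NemoClaw's actual API.
const BASE_PROBE_TIMEOUT_SEC = 120; // probe window reported in this issue
const EXTENDED_PROBE_TIMEOUT_SEC = 300; // Spark-only retry window reported in this issue

// Behavior as described in this issue: only Spark hosts get the extended retry.
function currentProbeTimeouts(sparkHost: boolean): number[] {
  return sparkHost
    ? [BASE_PROBE_TIMEOUT_SEC, EXTENDED_PROBE_TIMEOUT_SEC]
    : [BASE_PROBE_TIMEOUT_SEC];
}

// One possible alternative (an assumption, not the project's fix): also retry when
// the model is larger than VRAM, since it spills to CPU and may cold-load slowly.
function proposedProbeTimeouts(
  sparkHost: boolean,
  modelSizeMb: number,
  vramMb: number,
): number[] {
  const slowColdLoadLikely = sparkHost || modelSizeMb > vramMb;
  return slowColdLoadLikely
    ? [BASE_PROBE_TIMEOUT_SEC, EXTENDED_PROBE_TIMEOUT_SEC]
    : [BASE_PROBE_TIMEOUT_SEC];
}

// Brev L4 from this report: ~24 GB model vs 23034 MB VRAM, not a Spark host.
const brevL4Current = currentProbeTimeouts(false);
const brevL4Proposed = proposedProbeTimeouts(false, 24 * 1024, 23034);
console.log(brevL4Current, brevL4Proposed);
```

Under these assumptions the Brev L4 host would get a second, 300s attempt instead of failing after the single 120s window.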

Workaround verified: pre-pull a host-fitting model (llama3.2:3b) and select it — onboard
completes, and the installed-models list correctly filters out the 24GB model.
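
A sketch of applying the same kind of fit check to the starter list is shown below. `buildStarterOffer`, `StarterModel`, and the memory budget parameter are hypothetical. The signature and logic of the real modelFitsAvailableMemory() are not shown in this report, so a plain size comparison stands in for it. The llama3.2:3b size is an approximate figure assumed for illustration.

```typescript
// Hypothetical sketch: these names are illustrative, not NemoClaw's actual API.
interface StarterModel {
  name: string;
  sizeMb: number; // approximate model size in MB
}

interface StarterOffer {
  offered: StarterModel[]; // models that fit the host, in original order
  defaultModel?: StarterModel; // undefined when nothing fits
}

// Filter the starter list by a memory budget and pick the largest fitting model
// as the default. A plain size comparison stands in for the real fit check.
function buildStarterOffer(
  candidates: StarterModel[],
  memoryBudgetMb: number,
): StarterOffer {
  const offered = candidates.filter((m) => m.sizeMb <= memoryBudgetMb);
  const defaultModel = offered.reduce<StarterModel | undefined>(
    (best, m) => (best === undefined || m.sizeMb > best.sizeMb ? m : best),
    undefined,
  );
  return { offered, defaultModel };
}

// Brev L4 from this report: 23034 MB VRAM used as the budget (an assumption).
const offer = buildStarterOffer(
  [
    { name: "nemotron-3-nano:30b", sizeMb: 24 * 1024 }, // ~24 GB per this issue
    { name: "llama3.2:3b", sizeMb: 2048 }, // approximate size, assumed
  ],
  23034,
);
console.log(offer.offered.map((m) => m.name), offer.defaultModel?.name);
```

With this budget the 24 GB model is dropped from the starter list and the smaller model becomes the default, which matches the behavior the workaround observed for the installed-models list.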

NVB#6272260

Metadata

Labels

- NV QA (Bugs found by the NVIDIA QA Team)
- UAT (Issues flagged for User Acceptance Testing)
- area: inference (Inference routing, serving, model selection, or outputs)
- area: local-models (Local model providers, downloads, launch, or connectivity)
- area: onboarding (Onboarding FSM, provider setup, sandbox launch, or first-run flow)
- platform: brev (Affects Brev hosted development environments)
