Description
On a DGX Spark / GB10 host, fresh onboarding with the DGX Spark local-Ollama model selection can choose and pull qwen3.6:35b, but the model then fails the local probe when the currently available GPU memory is too low.
The host is large enough in aggregate, but another GPU workload was already running. Ollama reported only ~12 GiB currently available for the model and the runner exited during load.
Environment
Device: DGX Spark / NVIDIA GB10
OS: Ubuntu 24.04-family, Linux 6.17.0-1014-nvidia
Architecture: aarch64
Docker: 29.2.1, build a5c7197
OpenShell CLI: 0.0.44
NemoClaw: v0.1.0
Install ref: main
Commit: 7c7f7a428624ad72082d7b11395e25d7ae43daad
Ollama: 0.24.0
Provider: ollama-local
Steps to reproduce
- Start from a fresh NemoClaw install from main.
- Have an existing GPU workload active so available GPU memory is reduced.
- Run non-interactive onboarding with local Ollama and the DGX Spark default/larger model:
NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 \
NEMOCLAW_NON_INTERACTIVE=1 \
NEMOCLAW_PROVIDER=ollama \
NEMOCLAW_MODEL=qwen3.6:35b \
NEMOCLAW_POLICY_MODE=suggested \
NEMOCLAW_YES=1 \
nemoclaw onboard --fresh --non-interactive --yes --yes-i-accept-third-party-software
Observed behavior
The 23 GB model pulled successfully, then onboarding failed during local validation:
Loading Ollama model: qwen3.6:35b
Selected Ollama model 'qwen3.6:35b' failed the local probe: llama runner process has terminated with exit code -1
Ollama logs showed reduced currently available GPU memory and runner failure:
system memory total="121.7 GiB" free="12.4 GiB" free_swap="15.7 GiB"
gpu memory ... available="11.9 GiB" free="12.4 GiB"
...
error loading llama server error="llama runner process has terminated with exit code -1"
nvidia-smi showed another compute workload on the GPU at the time.
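For triage, the memory figures can be pulled out of log lines like the ones above with a small parser. This is only a sketch: it assumes the fields keep the key="N GiB" shape shown in the excerpt, and parse_gib_fields is a hypothetical helper name, not part of Ollama or NemoClaw.

```python
import re

# Matches key="<number> GiB" pairs such as available="11.9 GiB".
# Assumption: the log fields keep the shape seen in the excerpt above.
_GIB_FIELD = re.compile(r'(\w+)="([0-9]+(?:\.[0-9]+)?) GiB"')


def parse_gib_fields(log_line: str) -> dict[str, float]:
    """Return every key="N GiB" field on a log line as {key: N}."""
    return {key: float(value) for key, value in _GIB_FIELD.findall(log_line)}


if __name__ == "__main__":
    line = 'gpu memory ... available="11.9 GiB" free="12.4 GiB"'
    fields = parse_gib_fields(line)
    print(fields["available"])  # 11.9
```

Fields that are not expressed in GiB are ignored, so the same helper also works on the "system memory" line.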
Expected behavior
Onboarding should avoid selecting or pulling a model that is unlikely to load with the currently available memory, or it should clearly warn and either offer or automatically fall back to a smaller tools-capable starter model.
For this machine, retrying with qwen2.5:7b succeeded and completed onboarding.
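The manual retry described here could be automated roughly as follows. This is a minimal sketch, not NemoClaw's actual onboarding code: first_loadable_model and the probe callable are hypothetical names, and the stand-in probe only mimics the failure seen in this report.

```python
from typing import Callable, Optional, Sequence


def first_loadable_model(
    candidates: Sequence[str],
    probe: Callable[[str], bool],
) -> Optional[str]:
    """Return the first candidate whose probe succeeds, or None if all fail.

    `probe` is a hypothetical hook standing in for the local validation
    step; it should return True when the model loads and answers.
    """
    for model in candidates:
        if probe(model):
            return model
        print(f"Model {model!r} failed the local probe; trying the next candidate")
    return None


if __name__ == "__main__":
    # Stand-in probe that reproduces this report: the large model fails.
    def fake_probe(model: str) -> bool:
        return model != "qwen3.6:35b"

    chosen = first_loadable_model(["qwen3.6:35b", "qwen2.5:7b"], fake_probe)
    print(chosen)  # qwen2.5:7b
```

Ordering the candidates from largest to smallest keeps the preferred model as the first attempt and only degrades when the probe actually fails.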
Suggested fix
Consider one or more of:
- Include currently available GPU memory / active GPU workloads in local-Ollama model selection.
- If a large-model probe fails due to resource limits, automatically suggest or retry a smaller tools-capable model such as qwen2.5:7b.
- In non-interactive DGX Spark express/local-Ollama paths, prefer a conservative model when available GPU memory is below the large-model threshold.
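A memory-aware selection step along the lines of the suggestions above might look like this. It is a sketch under stated assumptions: the Candidate type, select_model, and both thresholds are hypothetical placeholders (the 28 GiB figure is just the 23 GB download plus assumed headroom), and how the available-memory value is obtained is left to the caller.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    # Hypothetical minimum currently available GPU memory, in GiB,
    # before this model is worth pulling. Placeholder values only.
    min_available_gib: float


# Ordered from most to least preferred. Thresholds are assumptions.
CANDIDATES: Sequence[Candidate] = (
    Candidate("qwen3.6:35b", 28.0),
    Candidate("qwen2.5:7b", 8.0),
)


def select_model(
    available_gib: float,
    candidates: Sequence[Candidate] = CANDIDATES,
) -> Optional[str]:
    """Return the first candidate that fits in the available memory."""
    for candidate in candidates:
        if available_gib >= candidate.min_available_gib:
            return candidate.name
    return None  # Nothing fits; the caller should warn instead of pulling.


if __name__ == "__main__":
    print(select_model(11.9))  # qwen2.5:7b
    print(select_model(64.0))  # qwen3.6:35b
```

With the 11.9 GiB reported in this issue, the sketch skips the large model before any download starts, which matches the manual workaround that succeeded.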