[DGX Spark][Ollama] Onboarding selects qwen3.6:35b despite insufficient currently available GPU memory

## Description

On a DGX Spark / GB10 host, fresh onboarding through the DGX Spark local-Ollama path can select and pull `qwen3.6:35b` even when the GPU memory currently available is too low to load it. The model then fails the local probe.

The host has enough memory in aggregate, but another GPU workload was already running. Ollama reported only ~12 GiB currently available for the model, and the runner exited during load.

## Environment

```text
Device:        DGX Spark / NVIDIA GB10
OS:            Ubuntu 24.04-family, Linux 6.17.0-1014-nvidia
Architecture:  aarch64
Docker:        29.2.1, build a5c7197
OpenShell CLI: 0.0.44
NemoClaw:      v0.1.0
Install ref:   main
Commit:        7c7f7a428624ad72082d7b11395e25d7ae43daad
Ollama:        0.24.0
Provider:      ollama-local
```

## Steps to reproduce

1. Start from a fresh NemoClaw install from `main`.
2. Have an existing GPU workload active so available GPU memory is reduced.
3. Run non-interactive onboarding with local Ollama and the DGX Spark default/larger model:

```bash
NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 \
NEMOCLAW_NON_INTERACTIVE=1 \
NEMOCLAW_PROVIDER=ollama \
NEMOCLAW_MODEL=qwen3.6:35b \
NEMOCLAW_POLICY_MODE=suggested \
NEMOCLAW_YES=1 \
nemoclaw onboard --fresh --non-interactive --yes --yes-i-accept-third-party-software
```

## Observed behavior

The 23 GB model pulled successfully, then onboarding failed during local validation:

```text
Loading Ollama model: qwen3.6:35b
Selected Ollama model 'qwen3.6:35b' failed the local probe: llama runner process has terminated with exit code -1
```

The Ollama logs showed low currently available GPU memory, followed by the runner failure:

```text
system memory total="121.7 GiB" free="12.4 GiB" free_swap="15.7 GiB"
gpu memory ... available="11.9 GiB" free="12.4 GiB"
...
error loading llama server error="llama runner process has terminated with exit code -1"
```
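For triage, the memory figures can be pulled out of log lines like the ones above with a small helper. This is a minimal sketch that assumes the `key="N GiB"` field format shown in this report; `gib_field` is a hypothetical helper name, not part of Ollama or NemoClaw.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the numeric GiB value of a key="N GiB" field
# from a single Ollama log line. Prints nothing if the field is absent.
# Usage: gib_field <key> <log line>
gib_field() {
  local key="$1" line="$2"
  # A leading space is prepended so the key must follow whitespace; this
  # keeps "free" from matching inside a longer key such as "free_swap".
  printf ' %s\n' "$line" |
    sed -nE "s/.*[[:space:]]${key}=\"([0-9]+(\.[0-9]+)?) GiB\".*/\1/p"
}

line='gpu memory ... available="11.9 GiB" free="12.4 GiB"'
gib_field available "$line"   # prints 11.9
gib_field free "$line"        # prints 12.4
```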

`nvidia-smi` showed another compute workload on the GPU at the time.

## Expected behavior

Onboarding should avoid selecting or pulling a model that is unlikely to load with the memory currently available. Failing that, it should warn clearly and either offer or automatically fall back to a smaller tools-capable starter model.

For this machine, retrying with `qwen2.5:7b` succeeded and completed onboarding.
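The manual retry described here could be automated as an ordered fallback. The sketch below is illustrative only: `first_loadable_model` and `fake_probe` are hypothetical names, and the probe is a stand-in for whatever local validation onboarding actually performs.

```bash
#!/usr/bin/env bash
# Try each candidate model in order and print the first one whose probe
# succeeds. Returns non-zero if every candidate fails.
# Usage: first_loadable_model <probe_function> <model>...
first_loadable_model() {
  local probe="$1"; shift
  local model
  for model in "$@"; do
    if "$probe" "$model"; then
      echo "$model"
      return 0
    fi
    echo "warning: $model failed the local probe, trying next candidate" >&2
  done
  return 1
}

# Stand-in probe for illustration: pretends only the small model loads.
fake_probe() { [ "$1" = "qwen2.5:7b" ]; }

first_loadable_model fake_probe qwen3.6:35b qwen2.5:7b   # prints qwen2.5:7b
```

Ordering candidates from largest to smallest keeps the preferred model as the default while still letting a constrained host finish onboarding.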

## Suggested fix

Consider one or more of:

- Include currently available GPU memory / active GPU workloads in local-Ollama model selection.
- If a large-model probe fails due to resource limits, automatically suggest or retry a smaller tools-capable model such as `qwen2.5:7b`.
- In non-interactive DGX Spark express/local-Ollama paths, prefer a conservative model when available GPU memory is below the large-model threshold.
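A memory-aware selection along these lines might look like the following sketch. The threshold value and all names (`choose_model`, `LARGE_MODEL_MIN_GIB`) are assumptions made for illustration; the real cutoff would need to account for the 23 GB of weights plus runtime overhead such as the KV cache.

```bash
#!/usr/bin/env bash
# Illustrative values only; these are not NemoClaw's actual settings.
LARGE_MODEL="qwen3.6:35b"
SMALL_MODEL="qwen2.5:7b"
LARGE_MODEL_MIN_GIB=28   # assumed: 23 GB of weights plus some headroom

# Print the model to use given the currently available GPU memory in GiB.
# Usage: choose_model <available_gib>
choose_model() {
  local available_gib="$1"
  # awk performs the floating-point comparison that bash arithmetic cannot.
  if awk -v a="$available_gib" -v min="$LARGE_MODEL_MIN_GIB" \
       'BEGIN { exit !(a + 0 >= min + 0) }'; then
    echo "$LARGE_MODEL"
  else
    echo "$SMALL_MODEL"
  fi
}

choose_model 11.9   # prints qwen2.5:7b
choose_model 100    # prints qwen3.6:35b
```

With the 11.9 GiB that Ollama reported on this host, the check would pick the smaller model before any 23 GB pull is attempted.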

