[DGX Spark][Ollama] Onboarding selects qwen3.6:35b despite insufficient currently available GPU memory #4113

@cv

Description

On a DGX Spark / GB10 host, fresh onboarding with the DGX Spark local-Ollama model selection can select and pull qwen3.6:35b, which then fails the local probe when the GPU memory currently available is too low.

The host has enough memory in aggregate (121.7 GiB total), but another GPU workload was already running. Ollama reported only about 12 GiB currently available for the model, and the runner exited during load.

Environment

Device:        DGX Spark / NVIDIA GB10
OS:            Ubuntu 24.04-family, Linux 6.17.0-1014-nvidia
Architecture:  aarch64
Docker:        29.2.1, build a5c7197
OpenShell CLI: 0.0.44
NemoClaw:      v0.1.0
Install ref:   main
Commit:        7c7f7a428624ad72082d7b11395e25d7ae43daad
Ollama:        0.24.0
Provider:      ollama-local

Steps to reproduce

  1. Start from a fresh NemoClaw install from main.
  2. Have an existing GPU workload active so available GPU memory is reduced.
  3. Run non-interactive onboarding with local Ollama and the DGX Spark default/larger model:
NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 \
NEMOCLAW_NON_INTERACTIVE=1 \
NEMOCLAW_PROVIDER=ollama \
NEMOCLAW_MODEL=qwen3.6:35b \
NEMOCLAW_POLICY_MODE=suggested \
NEMOCLAW_YES=1 \
nemoclaw onboard --fresh --non-interactive --yes --yes-i-accept-third-party-software

Observed behavior

The 23 GB model pulled successfully, then onboarding failed during local validation:

Loading Ollama model: qwen3.6:35b
Selected Ollama model 'qwen3.6:35b' failed the local probe: llama runner process has terminated with exit code -1

Ollama logs showed reduced currently available GPU memory and runner failure:

system memory total="121.7 GiB" free="12.4 GiB" free_swap="15.7 GiB"
gpu memory ... available="11.9 GiB" free="12.4 GiB"
...
error loading llama server error="llama runner process has terminated with exit code -1"

nvidia-smi showed another compute workload on the GPU at the time.
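
The memory figures in the Ollama log lines above can be extracted mechanically, for example to compare them against a model's size before loading. A minimal sketch, assuming the `key="<number> GiB"` format shown in this report; `parse_gib_fields` is a hypothetical helper, not part of NemoClaw or Ollama:

```python
import re

# Matches pairs such as available="11.9 GiB" (format assumed from the logs above).
_GIB_FIELD = re.compile(r'(\w+)="([\d.]+) GiB"')


def parse_gib_fields(line: str) -> dict[str, float]:
    """Return every key="<number> GiB" pair on a log line as floats in GiB."""
    return {key: float(value) for key, value in _GIB_FIELD.findall(line)}


gpu_line = 'gpu memory ... available="11.9 GiB" free="12.4 GiB"'
sys_line = 'system memory total="121.7 GiB" free="12.4 GiB" free_swap="15.7 GiB"'

gpu_fields = parse_gib_fields(gpu_line)
sys_fields = parse_gib_fields(sys_line)
print(gpu_fields["available"], sys_fields["total"])
```

Applied to the two lines from this report, the gap between total system memory and GPU memory available to the runner is what the selection step would need to see.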

Expected behavior

Onboarding should avoid selecting and pulling a model that is unlikely to load with the memory currently available. Failing that, it should clearly warn and either offer or automatically fall back to a smaller tools-capable starter model.

For this machine, retrying with qwen2.5:7b succeeded and completed onboarding.

Suggested fix

Consider one or more of:

  • Include currently available GPU memory / active GPU workloads in local-Ollama model selection.
  • If a large-model probe fails due to resource limits, automatically suggest or retry a smaller tools-capable model such as qwen2.5:7b.
  • In non-interactive DGX Spark express/local-Ollama paths, prefer a conservative model when available GPU memory is below the large-model threshold.
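
The suggestions above could be combined roughly as follows. This is a minimal sketch, not NemoClaw's actual selection logic: every function name, the 1.2x headroom factor, and the 4.7 GB size listed for qwen2.5:7b are assumptions for illustration. Only the 23 GB size of qwen3.6:35b and the 11.9 GiB available figure come from this report.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

GIB_PER_GB = 1e9 / 2**30  # convert decimal GB (model sizes) to GiB (Ollama logs)
HEADROOM = 1.2            # assumed safety factor for KV cache and runtime overhead


@dataclass(frozen=True)
class Candidate:
    name: str
    size_gb: float  # model size in GB


# Ordered from most to least preferred.
CANDIDATES: Sequence[Candidate] = (
    Candidate("qwen3.6:35b", 23.0),  # size as reported in this issue
    Candidate("qwen2.5:7b", 4.7),    # assumed size, for illustration only
)


def fits(candidate: Candidate, available_gib: float) -> bool:
    """True if the model plus headroom fits in currently available GPU memory."""
    return candidate.size_gb * GIB_PER_GB * HEADROOM <= available_gib


def select_model(available_gib: float,
                 candidates: Sequence[Candidate] = CANDIDATES) -> Optional[Candidate]:
    """Pick the most preferred candidate that fits, before pulling anything."""
    return next((c for c in candidates if fits(c, available_gib)), None)


def select_with_fallback(available_gib: float,
                         probe: Callable[[str], bool],
                         candidates: Sequence[Candidate] = CANDIDATES) -> Optional[Candidate]:
    """Try each fitting candidate in order; fall back if the local probe fails."""
    for candidate in candidates:
        if fits(candidate, available_gib) and probe(candidate.name):
            return candidate
    return None


chosen = select_model(11.9)  # available GPU memory from the logs in this report
print(chosen.name if chosen else "no model fits")
```

With the 11.9 GiB reported here, the pre-check skips the 23 GB model and lands on the smaller one without a failed pull-and-probe cycle. `select_with_fallback` covers the case where the estimate passes but the probe still fails.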

Labels

area: inference, area: local-models, area: providers, platform: dgx-spark, provider: ollama