[DGX Spark][Agent&Skills] Trivial "hello" agent turn takes ~10s P50 / 17s max on local Ollama (nemotron-3-nano:30b) #2598

@hulynn

Description

On DGX Spark with local Ollama (nemotron-3-nano:30b, fully GPU-resident), a trivial agent turn ("say hello") takes 7-17 seconds end-to-end: P50 is 10.4 s and max is 17.5 s. Pure inference latency for a short greeting on this hardware is expected to be ~1-2 s (not measured in this session), so agent framework overhead (gateway round-trip, prompt/tool-schema assembly, agent loop) appears to inflate the turn 5-10x.
Reproduces the VDR #4 finding SP-10 (originally reported with gemma4:31b at ~8 s) on the latest NemoClaw v0.0.28 + OpenClaw 2026.4.9 build.
Environment
Device:        DGX Spark (host: p4242-0081)
OS:            Ubuntu (host); sandbox = NemoClaw default
Architecture:  aarch64 (DGX Spark / Grace)
Node.js:       Not captured
npm:           Not captured
Docker:        Not captured
OpenShell CLI: openshell 0.0.36
NemoClaw:      v0.0.28
OpenClaw:      2026.4.9 (build 0512059)

Sandbox name:  vdr
Provider:      ollama-local (Ollama 127.0.0.1:11434, proxy on :11435)
Model:         nemotron-3-nano:30b
Model status:  27 GB resident, 100% GPU, context window 262,144
GPU:           1x GPU, 122,543 MB VRAM
Steps to Reproduce
1. Onboard NemoClaw v0.0.28 on DGX Spark; choose Local Ollama → nemotron-3-nano:30b.
2. After sandbox creation, open the sandbox:
     nemoclaw vdr connect
3. Run a one-shot agent turn 10 times and measure wall-clock per iteration:
     for i in $(seq 1 10); do
       T=$( { time openclaw agent --agent main --message "say hello" --json \
                >/tmp/a.$i.json 2>/tmp/a.$i.err ; } 2>&1 | awk '/real/{print $2}')
       printf "iter %2d  agent_total=%s\n" "$i" "$T"
     done
4. Inspect /tmp/a.*.json for the agentMeta block.
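
The loop in step 3 prints bash `time` values in the form `0m17.462s`, while each agent JSON dump carries its own `durationMs` field. A minimal sketch for putting both on one scale is below. `real_to_seconds` and `duration_ms` are hypothetical helper names, not part of openclaw, and the sed pattern assumes `durationMs` appears as a plain integer in the dump. A JSON-aware parser would be more robust.

```shell
# Convert a bash `time` value such as "0m17.462s" into plain seconds.
real_to_seconds() {
  # Split on "m" and "s": field 1 is minutes, field 2 is seconds.
  printf '%s\n' "$1" | LC_ALL=C awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Print the integer durationMs value found in an agent JSON dump.
# Assumption: the key is followed by a bare integer, as in the iter 1 sample.
duration_ms() {
  sed -n 's/.*"durationMs"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

real_to_seconds 0m17.462s   # prints 17.462
```

Comparing the converted wall-clock value with `durationMs / 1000` for the same iteration shows how much time is spent outside the agent's own timer, for example in CLI startup.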
Expected Result
A short greeting (a handful of output tokens) against a fully GPU-resident 30B local model on DGX Spark
should complete end-to-end in < 3 s (inference < 2 s + framework < 1 s).
At minimum 10/10 iterations should be < 5 s.
Actual Result
10/10 iterations exceeded 7 s; 6/10 exceeded 10 s; max 17.5 s.

Raw timings (10 iterations, wall-clock):
  iter  1  17.462 s
  iter  2   8.075 s
  iter  3   8.854 s
  iter  4   7.838 s
  iter  5  15.241 s
  iter  6  10.204 s
  iter  7  10.530 s
  iter  8  14.182 s
  iter  9  11.362 s
  iter 10   7.283 s

Statistics:
  min          7.28 s
  P50 (median) 10.37 s
  P90          15.24 s
  max          17.46 s
  range        2.4x (max/min)
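
The statistics above can be recomputed from the raw timings with sort and awk. This is a sketch under two assumptions: the median is the mean of the two middle values for an even sample count, and P90 uses the nearest-rank method (the ceil(0.9 * n)-th smallest value). Other percentile definitions interpolate and would give a slightly different P90. `summarize` is a hypothetical helper name.

```shell
# Raw wall-clock timings in seconds, copied from the 10 iterations above.
timings="17.462 8.075 8.854 7.838 15.241 10.204 10.530 14.182 11.362 7.283"

summarize() {
  # One value per line, sorted numerically, then order statistics in awk.
  printf '%s\n' "$@" | LC_ALL=C sort -n | LC_ALL=C awk '
    { v[NR] = $1 }
    END {
      n = NR
      # Median: middle value, or mean of the two middle values when n is even.
      if (n % 2) median = v[(n + 1) / 2]
      else       median = (v[n / 2] + v[n / 2 + 1]) / 2
      # P90 by nearest rank: ceil(9n / 10) computed with integer arithmetic.
      rank = int((9 * n + 9) / 10)
      printf "min=%.2f p50=%.2f p90=%.2f max=%.2f range=%.1fx\n",
             v[1], median, v[rank], v[n], v[n] / v[1]
    }'
}

summarize $timings
# prints: min=7.28 p50=10.37 p90=15.24 max=17.46 range=2.4x
```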

Agent JSON instrumentation (iter 1 sample):
  {
    "status": "ok",
    "result": {
      "payloads": [{ "text": "Hello! How can I assist you today?" }],
      "meta": {
        "durationMs": 15050,
        "agentMeta": {
          "provider": "inference",
          "model":    "nemotron-3-nano:30b",
          "lastCallUsage": { "input": 0, "output": 0, "total": 0 }
        }
      }
    }
  }

Note: lastCallUsage is all zeros, meaning the Ollama provider path does not report token usage back to the agent. As a result, this run cannot independently confirm prompt-token explosion (Brev/NIM showed input=18355 for the same prompt). This is likely worth filing as a separate minor instrumentation bug if not already known.
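
To check whether the zero usage holds for every iteration and not only the iter 1 sample, the dumps can be scanned as sketched below. `usage_sum` is a hypothetical helper, and it assumes the `lastCallUsage` object is printed on a single line as in the sample above. A sum of 0 means input, output, and total were all reported as zero.

```shell
# Sum the numeric fields on the first lastCallUsage line of an agent JSON dump.
usage_sum() {
  grep '"lastCallUsage"' "$1" | head -n 1 | LC_ALL=C awk '{
    n = split($0, parts, /[^0-9]+/)   # digit runs become array elements
    sum = 0
    for (i = 1; i <= n; i++) sum += parts[i]
    print sum
  }'
}

for f in /tmp/a.*.json; do
  [ -e "$f" ] || continue             # no dumps present: nothing to report
  printf '%s usage_sum=%s\n' "$f" "$(usage_sum "$f")"
done
```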

GPU state during run:
  Pre-perf:  GPU util 93% (still warming from prior probe)
  Post-perf: GPU util 0%, model still resident (Ollama keep-alive 2 min)
Logs
Suggested attachments (zip and upload after draft is created):
- agent_timings.txt: 10-iter wall-clock per iteration
- agent_iter1.json:  /tmp/a.1.json full dump (systemPromptReport, sandbox info, agentMeta)
- gpu_state.txt:     ollama list + nvidia-smi pre/post

Analysis:
- Pure inference time (raw `openclaw infer model run` for the same prompt)
  was not collected in this session; will be appended once measured.
  Expected ~1-2 s based on Ollama benchmarks for 30B/GPU.
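- Agent framework overhead = agent_total - raw_infer. With the expected
  1-2 s raw inference, this is an estimated 5-6 s on the fastest turn
  (7.28 s) and 8-9 s at P50 (10.37 s), consistent with observations on
  Brev where the agent always builds the full system prompt + 18k+
  tool-schema tokens (Brev/NIM input=18355).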
- For a trivial conversational turn the agent always pays the full
  bootstrap cost; consider a lighter path that skips full tool schema
  for "no tools needed" prompts.
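
The overhead arithmetic in the analysis can be bounded with a short awk helper. This is only a sketch: the raw inference values of 1 s and 2 s are the expected figures quoted in this report, not measurements, and `overhead` is a hypothetical helper name.

```shell
# Print overhead (agent_total - raw_infer) and inflation factor (agent_total / raw_infer).
# $1 = measured agent_total in seconds, $2 = ASSUMED raw inference time in seconds.
overhead() {
  LC_ALL=C awk -v total="$1" -v infer="$2" 'BEGIN {
    printf "overhead=%.2fs factor=%.1fx\n", total - infer, total / infer
  }'
}

overhead 10.37 1   # P50 turn, 1 s inference: overhead=9.37s factor=10.4x
overhead 10.37 2   # P50 turn, 2 s inference: overhead=8.37s factor=5.2x
overhead 7.28 2    # fastest turn, 2 s inference: overhead=5.28s factor=3.6x
```

Once the raw `openclaw infer model run` timing is collected, substituting it for the assumed value gives the measured overhead.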

Related:
- VDR #4 finding SP-10: "Latency ~8s for hello world with gemma4:31b on
  local Ollama" — current finding partially reproduces and confirms the
  issue persists at v0.0.28 / OpenClaw 2026.4.9 (slightly slower).
- BR-7 (Brev): agent turns occasionally hang ~2 min on nvidia-prod NIM.
  Different root cause (remote NIM tail latency) but same observable
  category.
- Minor instrumentation gap: agent JSON shows lastCallUsage.input/output=0
  for Ollama provider; Brev/NIM correctly reports tokens.

Bug Details

Field:        Value
Priority:     Unprioritized
Action:       Dev - Open - To fix
Disposition:  Open issue
Module:       Machine Learning - NemoClaw
Keyword:      NemoClaw, NemoClaw_Agent&Skills, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Inference, NemoClaw-SWQA-RelBlckr-Recommended, NemoClaw-SWQA-VDR

[NVB#6122111]

Metadata

Labels

- NV QA (Bugs found by the NVIDIA QA Team)
- UAT (Issues flagged for User Acceptance Testing.)
- area: cli (Command line interface, flags, terminal UX, or output)
- area: performance (Latency, throughput, resource use, benchmarks, or scaling)
- needs: unblock (Blocked item needs dependency or decision resolved)
- platform: dgx-spark (Affects DGX Spark hardware or workflows)
- provider: ollama (Ollama local model provider behavior)
- v0.0.62 (Release target)