[DGX Spark][Agent&Skills] Trivial "hello" agent turn takes ~10s P50 / 17s max on local Ollama (nemotron-3-nano:30b)

## Description

Description
<pre>On DGX Spark with local Ollama (nemotron-3-nano:30b, fully GPU-resident), a trivial agent turn ("say hello") takes 7-17 s end-to-end: P50 is 10.4 s and max is 17.5 s. Pure inference latency for a short greeting on this hardware is expected to be ~1-2 s (not yet measured in this session), so agent framework overhead (gateway round-trip, prompt/tool-schema assembly, agent loop) appears to inflate it 5-10x.
This partially reproduces VDR #4 finding SP-10 (originally reported with gemma4:31b at ~8 s) on the latest NemoClaw v0.0.28 + OpenClaw 2026.4.9 build.
</pre>

Environment

<pre>Device: DGX Spark (host: p4242-0081)
OS: Ubuntu (host); sandbox = NemoClaw default
Architecture: aarch64 (DGX Spark / Grace)
Node.js: Not captured
npm: Not captured
Docker: Not captured
OpenShell CLI: openshell 0.0.36
NemoClaw: v0.0.28
OpenClaw: 2026.4.9 (build 0512059)

Sandbox name: vdr
Provider: ollama-local (Ollama 127.0.0.1:11434, proxy on :11435)
Model: nemotron-3-nano:30b
Model status: 27 GB resident, 100% GPU, context window 262,144
GPU: 1x GPU, 122,543 MB VRAM
</pre>

Steps to Reproduce

<pre>1. Onboard NemoClaw v0.0.28 on DGX Spark; choose Local Ollama → nemotron-3-nano:30b.
2. After sandbox creation, open the sandbox:
 nemoclaw vdr connect
3. Run a one-shot agent turn 10 times and measure wall-clock per iteration:
 for i in $(seq 1 10); do
   T=$( { time openclaw agent --agent main --message "say hello" --json \
          >/tmp/a.$i.json 2>/tmp/a.$i.err ; } 2>&1 | awk '/real/{print $2}')
   printf "iter %2d agent_total=%s\n" "$i" "$T"
 done
4. Inspect /tmp/a.*.json for the agentMeta block.
</pre>

Expected Result

<pre>A short greeting from a fully GPU-resident 30B local model on DGX Spark
should complete end-to-end in < 3 s (inference < 2 s + framework < 1 s).
At minimum, 10/10 iterations should be < 5 s.
</pre>

Actual Result

<pre>10/10 iterations exceeded 7 s; 6/10 exceeded 10 s; max 17.5 s.

Raw timings (10 iterations, wall-clock):
 iter 1 17.462 s
 iter 2 8.075 s
 iter 3 8.854 s
 iter 4 7.838 s
 iter 5 15.241 s
 iter 6 10.204 s
 iter 7 10.530 s
 iter 8 14.182 s
 iter 9 11.362 s
 iter 10 7.283 s

Statistics:
 min 7.28 s
 P50 (median) 10.37 s
 P90 15.24 s
 max 17.46 s
 range 2.4x (max/min)

Agent JSON instrumentation (iter 1 sample):
 {
   "status": "ok",
   "result": {
     "payloads": [{ "text": "Hello! How can I assist you today?" }],
     "meta": {
       "durationMs": 15050,
       "agentMeta": {
         "provider": "inference",
         "model": "nemotron-3-nano:30b",
         "lastCallUsage": { "input": 0, "output": 0, "total": 0 }
       }
     }
   }
 }

Note: lastCallUsage is all-zero because the Ollama provider does not report token usage back to the agent, so the prompt-token explosion cannot be independently confirmed here (Brev/NIM showed input=18355 for the same prompt). Likely worth filing as a separate minor instrumentation bug if not already known.

GPU state during run:
 Pre-perf: GPU util 93% (still warming from prior probe)
 Post-perf: GPU util 0%, model still resident (Ollama keep-alive 2 min)
</pre>

Logs

<pre>Suggested attachments (zip and upload after draft is created):
- agent_timings.txt: 10-iter wall-clock per iteration
- agent_iter1.json: /tmp/a.1.json full dump (systemPromptReport, sandbox info, agentMeta)
- gpu_state.txt: ollama list + nvidia-smi pre/post

Analysis:
- Pure inference time (raw `openclaw infer model run` for the same prompt)
 was not collected in this session; will be appended once measured.
 Expected ~1-2 s based on Ollama benchmarks for 30B/GPU.
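- Agent framework overhead = agent_total - raw_infer. With the expected
 1-2 s raw_infer this is an estimate, not a measurement: ≈ 5.3-6.3 s for
 the fastest turn (7.28 s) and ≈ 8.4-9.4 s at P50 (10.37 s). This is
 consistent with observations on Brev, where the agent always builds the
 full system prompt plus tool schemas (18k+ input tokens; Brev/NIM input=18355).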
- For a trivial conversational turn the agent always pays the full
 bootstrap cost; consider a lighter path that skips full tool schema
 for "no tools needed" prompts.

Related:
- VDR #4 finding SP-10: "Latency ~8s for hello world with gemma4:31b on
 local Ollama". The current finding partially reproduces it with a different
 model and confirms the issue persists at v0.0.28 / OpenClaw 2026.4.9
 (slightly slower, P50 10.4 s).
- BR-7 (Brev): agent turns occasionally hang ~2 min on nvidia-prod NIM.
 Different root cause (remote NIM tail latency) but same observable
 category.
- Minor instrumentation gap: agent JSON shows lastCallUsage.input/output=0
 for Ollama provider; Brev/NIM correctly reports tokens.
</pre>
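
The statistics under Actual Result can be recomputed from the raw timings. The following is a minimal sketch, assuming bash with standard `sort` and `awk`. `summarize_timings` is a hypothetical helper name, not part of NemoClaw or OpenClaw. It assumes the median is the mean of the two middle values and that P90 uses the nearest-rank method, which matches the 15.24 s reported above. The optional argument is the threshold in seconds used for the "exceeded" count.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize wall-clock timings (seconds, one per line on stdin).
# Usage: summarize_timings [THRESHOLD_SECONDS]   (threshold defaults to 10)
summarize_timings() {
  LC_ALL=C sort -n | LC_ALL=C awk -v limit="${1:-10}" '
    { v[NR] = $1 + 0; if (v[NR] > limit) over++ }
    END {
      n = NR
      if (n == 0) { print "no samples"; exit 1 }
      # Median: middle value, or mean of the two middle values when n is even.
      if (n % 2) median = v[(n + 1) / 2]
      else       median = (v[n / 2] + v[n / 2 + 1]) / 2
      # P90 by nearest rank: rank = ceil(0.9 * n), computed with integer math.
      rank = int((9 * n + 9) / 10)
      printf "min %.2f\nP50 %.2f\nP90 %.2f\nmax %.2f\nrange %.1fx\n", v[1], median, v[rank], v[n], v[n] / v[1]
      printf "over %ss: %d/%d\n", limit, over, n
    }'
}

# Timings copied from the raw results in this report.
printf '%s\n' 17.462 8.075 8.854 7.838 15.241 10.204 10.530 14.182 11.362 7.283 |
  summarize_timings 10
```

Sorting first keeps the awk part to a single pass over an already ordered array, so the same helper can be pointed at agent_timings.txt once it is attached.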

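The overhead formula from the Analysis (agent_total - raw_infer) can be applied per iteration once raw inference time has been measured. This is a sketch only, assuming bash and `awk`. `estimate_overhead` is a hypothetical helper, and the raw_infer value passed in below is a placeholder from the expected 1-2 s range, not a measurement.

```shell
#!/usr/bin/env bash
# Hypothetical helper: apply overhead = agent_total - raw_infer to each timing.
# Usage: estimate_overhead RAW_INFER_SECONDS   (agent_total values on stdin)
estimate_overhead() {
  LC_ALL=C awk -v raw="$1" '
    { printf "iter %2d agent_total=%.3f overhead=%.3f\n", NR, $1, $1 - raw }'
}

# raw_infer was not measured in this session; 2 is a placeholder taken from
# the upper end of the expected 1-2 s range, not a measured value.
printf '%s\n' 17.462 8.075 8.854 7.838 15.241 10.204 10.530 14.182 11.362 7.283 |
  estimate_overhead 2
```

Replacing the placeholder with the measured `openclaw infer model run` time would turn these estimates into the actual per-turn framework overhead.
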
## Bug Details

| Field | Value |
|-------|-------|
| Priority | Unprioritized |
| Action | Dev - Open - To fix |
| Disposition | Open issue |
| Module | Machine Learning - NemoClaw |
| Keyword | NemoClaw, NemoClaw_Agent&Skills, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Inference, NemoClaw-SWQA-RelBlckr-Recommended, NemoClaw-SWQA-VDR |

---
[NVB#6122111]

[#2598]