Description
On DGX Spark with local Ollama (nemotron-3-nano:30b, fully GPU-resident), a trivial agent turn ("say hello") takes 7-17 seconds end-to-end: P50 is 10.4 s and max is 17.5 s. Pure inference latency for a short greeting on this hardware is expected to be ~1-2 s (not yet measured; see Analysis), so agent framework overhead (gateway round-trip, prompt/tool-schema assembly, agent loop) appears to inflate it 5-10x.
Reproduces the VDR #4 finding SP-10 (originally reported with gemma4:31b at ~8 s) on the latest NemoClaw v0.0.28 + OpenClaw 2026.4.9 build.
Environment
Device: DGX Spark (host: p4242-0081)
OS: Ubuntu (host); sandbox = NemoClaw default
Architecture: aarch64 (DGX Spark / Grace)
Node.js: Not captured
npm: Not captured
Docker: Not captured
OpenShell CLI: openshell 0.0.36
NemoClaw: v0.0.28
OpenClaw: 2026.4.9 (build 0512059)
Sandbox name: vdr
Provider: ollama-local (Ollama 127.0.0.1:11434, proxy on :11435)
Model: nemotron-3-nano:30b
Model status: 27 GB resident, 100% GPU, context window 262,144
GPU: 1x GPU, 122,543 MB VRAM
Steps to Reproduce
1. Onboard NemoClaw v0.0.28 on DGX Spark; choose Local Ollama → nemotron-3-nano:30b.
2. After sandbox creation, open the sandbox:
nemoclaw vdr connect
3. Run a one-shot agent turn 10 times and measure wall-clock per iteration:
for i in $(seq 1 10); do
  T=$( { time openclaw agent --agent main --message "say hello" --json \
         >/tmp/a.$i.json 2>/tmp/a.$i.err ; } 2>&1 | awk '/real/{print $2}')
  printf "iter %2d agent_total=%s\n" "$i" "$T"
done
4. Inspect /tmp/a.*.json for the agentMeta block.
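For step 4, the fields of interest can be pulled out of each dump in one pass. This is a minimal sketch, assuming `jq` is installed in the sandbox (not confirmed in this report) and that every dump has the same layout as the iter 1 sample under Actual Result. `summarize_agent_json` is a hypothetical helper name, not part of OpenClaw.

```shell
# Hypothetical helper: print file, durationMs, model and lastCallUsage
# (input, output) as one tab-separated line per agent dump.
# Assumes jq is available and the JSON layout matches the iter 1 sample.
summarize_agent_json() {
  for f in "$@"; do
    jq -r --arg f "$f" \
      '[$f,
        .result.meta.durationMs,
        .result.meta.agentMeta.model,
        .result.meta.agentMeta.lastCallUsage.input,
        .result.meta.agentMeta.lastCallUsage.output] | @tsv' "$f"
  done
}

# Usage inside the sandbox:
#   summarize_agent_json /tmp/a.*.json
```

Fields missing from a dump come out as empty columns, which makes a provider that omits usage data easy to spot.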
Expected Result
A short greeting against a fully GPU-resident 30B local model on DGX Spark
should complete end-to-end in < 3 s (inference < 2 s + framework < 1 s).
At minimum 10/10 iterations should be < 5 s.
Actual Result
10/10 iterations exceeded 7 s; 6/10 exceeded 10 s; max 17.5 s.
Raw timings (10 iterations, wall-clock):
iter 1 17.462 s
iter 2 8.075 s
iter 3 8.854 s
iter 4 7.838 s
iter 5 15.241 s
iter 6 10.204 s
iter 7 10.530 s
iter 8 14.182 s
iter 9 11.362 s
iter 10 7.283 s
Statistics:
min 7.28 s
P50 (median) 10.37 s
P90 15.24 s
max 17.46 s
range 2.4x (max/min)
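The statistics above can be recomputed from the raw timings. This is a sketch rather than the script used for the report: it assumes the median is the mean of the two middle values and that P90 uses the nearest-rank method, which matches the figures listed. `summarize_timings` is a hypothetical helper name.

```shell
# Timings copied from the raw list above (seconds).
timings="17.462 8.075 8.854 7.838 15.241 10.204 10.530 14.182 11.362 7.283"

# Hypothetical helper: min, median, nearest-rank P90, max and max/min ratio.
summarize_timings() {
  printf '%s\n' "$@" | LC_ALL=C sort -n | LC_ALL=C awk '
    { v[NR] = $1 }
    END {
      n = NR
      # Median: middle value, or mean of the two middle values for even n.
      p50 = (n % 2) ? v[(n + 1) / 2] : (v[n / 2] + v[n / 2 + 1]) / 2
      # Nearest-rank P90: rank = ceil(0.9 * n), in integer arithmetic.
      r = int((9 * n + 9) / 10)
      printf "min %.2f p50 %.2f p90 %.2f max %.2f range %.1fx\n",
             v[1], p50, v[r], v[n], v[n] / v[1]
    }'
}

summarize_timings $timings
# prints: min 7.28 p50 10.37 p90 15.24 max 17.46 range 2.4x
```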
Agent JSON instrumentation (iter 1 sample):
{
"status": "ok",
"result": {
"payloads": [{ "text": "Hello! How can I assist you today?" }],
"meta": {
"durationMs": 15050,
"agentMeta": {
"provider": "inference",
"model": "nemotron-3-nano:30b",
"lastCallUsage": { "input": 0, "output": 0, "total": 0 }
}
}
}
}
Note: lastCallUsage is all-zero because the Ollama provider does not report token usage back to the agent, so the prompt-token explosion cannot be confirmed independently here (Brev/NIM showed input=18355 for the same prompt). Likely worth filing as a separate minor instrumentation bug if not already known.
GPU state during run:
Pre-perf: GPU util 93% (still warming from prior probe)
Post-perf: GPU util 0%, model still resident (Ollama keep-alive 2 min)
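A pre/post snapshot like the one above could be captured with a small wrapper. This is a sketch under assumptions: `capture_gpu_state` is a hypothetical helper, and the exact commands used for this report were not recorded beyond "ollama list + nvidia-smi". It uses `ollama ps` because that subcommand lists loaded models with their CPU/GPU split.

```shell
# Hypothetical helper: append a labelled snapshot of model residency and
# GPU utilisation to a log file. A missing tool is noted, not treated as fatal.
capture_gpu_state() {
  label=$1
  out=$2
  {
    echo "== $label $(date -u +%Y-%m-%dT%H:%M:%SZ) =="
    if command -v ollama >/dev/null 2>&1; then
      ollama ps    # loaded models and their CPU/GPU split
    else
      echo "ollama: not found"
    fi
    if command -v nvidia-smi >/dev/null 2>&1; then
      nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv
    else
      echo "nvidia-smi: not found"
    fi
  } >>"$out" 2>&1
}

# Usage around the timing loop:
#   capture_gpu_state pre-perf gpu_state.txt
#   ... run the 10 iterations ...
#   capture_gpu_state post-perf gpu_state.txt
```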
Logs
Suggested attachments (zip and upload after draft is created):
- agent_timings.txt: 10-iter wall-clock per iteration
- agent_iter1.json: /tmp/a.1.json full dump (systemPromptReport, sandbox info, agentMeta)
- gpu_state.txt: ollama list + nvidia-smi pre/post
Analysis:
- Pure inference time (raw `openclaw infer model run` for the same prompt)
was not collected in this session; will be appended once measured.
Expected ~1-2 s based on Ollama benchmarks for 30B/GPU.
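- Agent framework overhead = agent_total - raw_infer. With raw_infer still
unmeasured this is an estimate: ≈ 5-6 s at the fastest turn (7.28 s minus
the expected 1-2 s) and ≈ 8-9 s at P50. This is consistent with
observations on Brev where the agent always builds full system prompt +
18k+ tool-schema tokens (Brev/NIM input=18355).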
- For a trivial conversational turn the agent always pays the full
bootstrap cost; consider a lighter path that skips full tool schema
for "no tools needed" prompts.
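Once raw inference time is collected, the overhead arithmetic in the Analysis can be applied per turn. The sketch below uses a placeholder of 1.5 s for raw_infer, which is an assumption taken from the expected 1-2 s range and not a measurement. `agent_overhead` is a hypothetical helper name.

```shell
# Hypothetical helper: framework overhead and slowdown factor for one turn.
#   $1 = agent_total in seconds (measured)
#   $2 = raw_infer in seconds (NOT measured in this report; placeholder only)
agent_overhead() {
  LC_ALL=C awk -v total="$1" -v raw="$2" 'BEGIN {
    printf "overhead %.2f s, %.1fx raw inference\n", total - raw, total / raw
  }'
}

# Measured P50 with an assumed 1.5 s raw inference time.
agent_overhead 10.37 1.5
# prints: overhead 8.87 s, 6.9x raw inference
```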
Related:
- VDR #4 finding SP-10: "Latency ~8s for hello world with gemma4:31b on
local Ollama" — current finding partially reproduces and confirms the
issue persists at v0.0.28 / OpenClaw 2026.4.9 (slightly slower).
- BR-7 (Brev): agent turns occasionally hang ~2 min on nvidia-prod NIM.
Different root cause (remote NIM tail latency) but same observable
category.
- Minor instrumentation gap: agent JSON shows lastCallUsage.input/output=0
for Ollama provider; Brev/NIM correctly reports tokens.
Bug Details
| Field | Value |
| Priority | Unprioritized |
| Action | Dev - Open - To fix |
| Disposition | Open issue |
| Module | Machine Learning - NemoClaw |
| Keyword | NemoClaw, NemoClaw_Agent&Skills, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Inference, NemoClaw-SWQA-RelBlckr-Recommended, NemoClaw-SWQA-VDR |
[NVB#6122111]