Description
On Brev with the shared nvidia-prod NIM endpoint, a trivial agent turn ("say hello") typically completes in 7-12 seconds, but 1 of 10 turns in this run (10%) stalled for more than 2 minutes with no error or progress indicator. P50 9.4 s; max 128.88 s (≈ 2 min 9 s; with only 10 samples, P99 equals the max). Reproduces VDR #4 finding BR-7 ("Massive delays ~2 min for simple queries") on NemoClaw v0.0.28 + OpenClaw 2026.4.9.
Cross-platform comparison (same agent code, same prompt) shows DGX Spark with local Ollama produces 0/10 outliers >60 s (max 17.5 s) — strongly localizing the tail-latency outliers to the remote NIM path, not the agent framework or prompt-bloat alone.
Two distinct issues observed and worth triaging in the same bug:
(A) Tail latency of ~2 min on the shared nvidia-prod NIM endpoint (primary)
(B) Model-fallback path adds ~12 s per call when the user passes the
provider/model name displayed by `nemoclaw list` to raw inference (secondary)
Environment
Device: Brev (Shadeform), host brev-ydoa5pmhb (shadeform user)
OS: Brev / Ubuntu (Shadeform image)
Architecture: x86_64
Node.js: Not captured
npm: Not captured
Docker: Not captured
OpenShell CLI: openshell 0.0.26
NemoClaw: v0.0.28
OpenClaw: 2026.4.9 (build 0512059)
Sandbox name: aab
Provider: nvidia-prod (shared NIM endpoint)
Model: minimaxai/minimax-m2.5
Default agent: main, session agent:main:main
Steps to Reproduce
1. Onboard NemoClaw v0.0.28 on Brev with NVIDIA Endpoints → minimax-m2.5
(default sandbox name "aab").
2. From the host shell, run 10 one-shot agent turns inside the sandbox:
nemoclaw aab <<'EOF'
for i in $(seq 1 10); do
T=$( { time openclaw agent --agent main --message "say hello" --json \
>/tmp/a.$i.json 2>/tmp/a.$i.err ; } 2>&1 | awk '/real/{print $2}')
printf "iter %2d agent_total=%s\n" "$i" "$T"
done
exit
EOF
3. Observe that ~10% of iterations exceed 60 s (no progress indicator, no error).
Expected Result
"Say hello" → 1-line greeting via shared NIM should complete < 15 s P95
end-to-end. No iteration should exceed 60 s without a user-visible
progress indicator or timeout.
Actual Result
Raw timings (10 iterations, wall-clock):
iter 1 11.542 s
iter 2 11.712 s
iter 3 6.926 s
iter 4 7.322 s
iter 5 128.878 s ← BR-7 reproduces
iter 6 7.194 s
iter 7 12.113 s
iter 8 6.942 s
iter 9 7.716 s
iter 10 11.080 s
Statistics:
min 6.93 s
P50 (median) 9.40 s
P90 12.11 s
P99 / max 128.88 s (≈ 2 min 9 s)
range 18.6x (max/min)
iters > 60 s 1 / 10 (10%)
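The statistics above can be recomputed from the raw timings with a short script. This is a minimal sketch, not the tooling used for this report: `summarize_timings` is a hypothetical helper name, P50 is taken as the mean of the two middle values for an even sample count, and P90 uses the nearest-rank method (an assumption that matches the figures listed here).

```shell
#!/usr/bin/env bash
# summarize_timings: read wall-clock timings (seconds, one per line) on stdin
# and print min, P50, P90, max, max/min ratio, and the count of turns > 60 s.
# Hypothetical helper for illustration; not part of nemoclaw or openclaw.
summarize_timings() {
  LC_ALL=C sort -n | awk '
    { v[NR] = $1 }
    END {
      n = NR
      if (n == 0) { print "no samples"; exit 1 }
      # Median: middle value, or the mean of the two middle values.
      if (n % 2) {
        p50 = v[(n + 1) / 2]
      } else {
        p50 = (v[n / 2] + v[n / 2 + 1]) / 2
      }
      # Nearest-rank P90: rank = ceil(0.9 * n), done in integer arithmetic.
      r = int((9 * n + 9) / 10)
      slow = 0
      for (i = 1; i <= n; i++) if (v[i] > 60) slow++
      printf "min=%.2f p50=%.2f p90=%.2f max=%.2f ratio=%.1fx over60=%d/%d\n", v[1], p50, v[r], v[n], v[n] / v[1], slow, n
    }'
}

# The ten timings from this run.
printf '%s\n' 11.542 11.712 6.926 7.322 128.878 7.194 12.113 6.942 7.716 11.080 |
  summarize_timings
```

With ten samples, any percentile above P90 collapses onto the single outlier, so a larger run would be needed to estimate the true tail rate.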
Agent JSON instrumentation (iter that completed normally):
{
"status": "ok",
"result": {
"payloads": [{ "text": "Hey again! What can I do for you?" }],
"meta": {
"durationMs": 6392,
"agentMeta": {
"provider": "inference",
"model": "minimaxai/minimax-m2.5",
"usage": { "input": 18355, "output": 35, "total": 18390 },
"promptTokens": 18355
},
"aborted": false
}
}
}
Note: Even on normal turns, the input prompt is 18,355 tokens to produce 35 output tokens for a "hello" reply. The agent inflates trivial messages with the full system prompt and tool schemas, which likely amplifies tail latency during NIM congestion.
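To compare prompt size and duration across the saved per-iteration dumps, the fields can be pulled out of each JSON file. The sketch below is an assumption-laden illustration: `json_number` and `report_turn` are hypothetical helpers, and they use plain text matching rather than a JSON parser, so they only work when the keys (`input`, `output`, `durationMs`, as seen in the dump above) map to plain integers.

```shell
#!/usr/bin/env bash
# json_number KEY FILE: print the first integer stored under KEY.
# Text matching only, not a real JSON parser (assumption: integer values).
json_number() {
  grep -o "\"$1\"[[:space:]]*:[[:space:]]*[0-9][0-9]*" "$2" |
    head -n 1 | grep -o '[0-9][0-9]*$'
}

# report_turn FILE: one-line summary of token usage and duration for a turn.
report_turn() {
  local file=$1 tokens_in tokens_out ms
  tokens_in=$(json_number input "$file")
  tokens_out=$(json_number output "$file")
  ms=$(json_number durationMs "$file")
  awk -v i="$tokens_in" -v o="$tokens_out" -v ms="$ms" 'BEGIN {
    ratio = (o > 0) ? i / o : 0   # guard against a turn with no output tokens
    printf "input=%d output=%d ratio=%.0f:1 duration=%.1fs\n", i, o, ratio, ms / 1000
  }'
}
```

Run against the dump shown above, for example `report_turn /tmp/a.1.json`, this would report roughly 524 input tokens per output token.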
(B) Model-fallback overhead seen while debugging raw inference:
When the user passes --model nvidia-prod/minimaxai/minimax-m2.5 (the provider/model string displayed by `nemoclaw list`), the gateway fails twice (~6 s each, two lanes) before falling back to the working name:
[diagnostic] lane task error: lane=main durationMs=6335
error="FailoverError: Unknown model: nvidia-prod/minimaxai/minimax-m2.5"
[diagnostic] lane task error: lane=session:agent:main:main durationMs=6347
error="FailoverError: Unknown model: nvidia-prod/minimaxai/minimax-m2.5"
[model-fallback] Fell back to "inference/minimaxai/minimax-m2.5".
This adds ~12 s per call. Suggested fix: the gateway should resolve the provider/model names listed by `nemoclaw list` directly, avoiding the two failed attempts before fallback. Worth filing as a separate child bug if not in scope here.
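Until the resolver is fixed, a caller could rewrite the name before passing it to `--model`. The following is a hypothetical client-side workaround, not an existing nemoclaw or openclaw feature: `normalize_model` is an illustrative helper, and the only mapping it encodes is the one observed in the fallback log above (`nvidia-prod/...` resolving to `inference/...`). Whether other providers follow the same pattern is not known from this report.

```shell
#!/usr/bin/env bash
# normalize_model NAME: rewrite a provider-qualified model name into the
# "inference/..." form that the fallback log shows the gateway accepting.
# Hypothetical helper; only the nvidia-prod mapping was observed.
normalize_model() {
  local name=$1
  case $name in
    nvidia-prod/*) printf 'inference/%s\n' "${name#nvidia-prod/}" ;;
    *)             printf '%s\n' "$name" ;;   # leave anything else untouched
  esac
}

normalize_model "nvidia-prod/minimaxai/minimax-m2.5"
```

This only sidesteps the ~12 s penalty on the client; it does not address the mismatch between the names that `nemoclaw list` displays and the names the gateway accepts.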
Logs
Suggested attachments (zip and upload after draft is created):
- brev_agent_timings.txt: 10-iter wall-clock timings
- brev_agent_iter1.json: /tmp/a.1.json full dump (18355 token usage)
- brev_fallback_stderr.txt: model-fallback diagnostic stderr (12 s wasted)
Cross-platform comparison (same prompt, identical agent code):
| Metric | Brev (nvidia-prod/minimax-m2.5) | Spark (ollama-local/nemotron-3-nano:30b) |
| P50 | 9.4 s | 10.4 s |
| P90 | 12.1 s | 15.2 s |
| max | 128.88 s | 17.5 s |
| outliers > 60 s | 1/10 | 0/10 |
Same agent code, same prompt → only Brev/NIM produces 2-min outliers.
This localizes the tail-latency root cause to the remote nvidia-prod NIM
endpoint (queue depth / quota throttling / region routing).
Suggested investigation:
- Inference / NIM endpoint team: investigate nvidia-prod tail latency
(queue depth, quota throttling, region affinity, retry policy).
- OpenClaw agent framework team:
* Add a user-visible progress indicator or timeout for turns > 30 s. With the
default NEMOCLAW_AGENT_TIMEOUT=600, hangs shorter than 10 min never trip
the timeout, which is a UX failure.
* Reduce input prompt for trivial turns OR cache tool schemas across turns.
* Fix model resolver so provider/model names listed in `nemoclaw list`
work directly (avoid the 12 s fallback overhead).
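As a stopgap for the missing timeout, a caller can bound each turn from the outside. This is a minimal sketch assuming GNU coreutils `timeout` is available in the sandbox; `run_with_deadline` is a hypothetical wrapper, and the 60 s limit in the usage line is an arbitrary choice taken from the outlier threshold used in this report.

```shell
#!/usr/bin/env bash
# run_with_deadline SECONDS CMD [ARGS...]
# Run CMD and terminate it if it exceeds SECONDS. GNU timeout exits with
# status 124 when the limit is hit; that case is reported on stderr.
# Hypothetical wrapper for illustration; not part of nemoclaw or openclaw.
run_with_deadline() {
  local limit=$1 rc=0
  shift
  timeout "$limit" "$@" || rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "turn exceeded ${limit}s and was terminated" >&2
  fi
  return "$rc"
}
```

Usage inside the sandbox might look like `run_with_deadline 60 openclaw agent --agent main --message "say hello" --json`. This makes the stall visible to the user but does nothing about its cause.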
Related:
- VDR #4 finding BR-7 ("Massive delays ~2 min for simple queries"). This
bug is the formal tracker for that VDR4 finding.
- NVBug 6122111 ([DGX Spark][Agent&Skills] Trivial "hello" agent turn ~10s
P50 / 17s max) — same agent framework slowness floor (~5-7 s) but no
2-min outliers since Spark is local Ollama. Confirms tail issue is
remote-NIM-specific.
Bug Details
| Field | Value |
| Priority | Unprioritized |
| Action | Dev - Open - To fix |
| Disposition | Open issue |
| Module | Machine Learning - NemoClaw |
| Keyword | NemoClaw, NemoClaw_Agent&Skills, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Inference, NemoClaw-SWQA-RelBlckr-Recommended |
[NVB#6122133]