[Brev][Inference] Trivial agent turn occasionally hangs ~2 min on shared nvidia-prod NIM (P50 9s, P99 128s, 10% outliers)

## Description

<pre>On Brev with the shared nvidia-prod NIM endpoint, a trivial agent turn ("say hello") typically completes in 7-12 seconds, but ~10% of turns stall for >2 minutes with no error or progress indicator. P50 9.4 s; P99/max 128.88 s (≈ 2 min 9 s). Reproduces VDR #4 finding BR-7 ("Massive delays ~2 min for simple queries") on NemoClaw v0.0.28 + OpenClaw 2026.4.9.

Cross-platform comparison (same agent code, same prompt): DGX Spark with local Ollama produces 0/10 outliers > 60 s (max 17.5 s). This points to the remote NIM path as the source of the tail-latency outliers, rather than the agent framework or prompt bloat alone. Caveat: the Spark run uses a different model (nemotron-3-nano:30b), and both runs have only 10 samples.

Two distinct issues were observed and are worth triaging in the same bug:
(A) Tail latency of ~2 min on the shared nvidia-prod NIM (primary).
(B) The model-fallback path adds ~12 s per call when the user passes the
    "<provider>/<model>" name displayed by `nemoclaw list` to raw inference (secondary).
</pre>Environment

<pre>Device: Brev (Shadeform), host brev-ydoa5pmhb (shadeform user)
OS: Brev / Ubuntu (Shadeform image)
Architecture: x86_64
Node.js: Not captured
npm: Not captured
Docker: Not captured
OpenShell CLI: openshell 0.0.26
NemoClaw: v0.0.28
OpenClaw: 2026.4.9 (build 0512059)

Sandbox name: aab
Provider: nvidia-prod (shared NIM endpoint)
Model: minimaxai/minimax-m2.5
Default agent: main, session agent:main:main
</pre>Steps to Reproduce

<pre>1. Onboard NemoClaw v0.0.28 on Brev with NVIDIA Endpoints → minimax-m2.5
 (default sandbox name "aab").
2. From the host shell, run 10 one-shot agent turns inside the sandbox:

</pre><pre>nemoclaw aab <<'EOF'
for i in $(seq 1 10); do
  # Capture the wall-clock ("real") time of one agent turn; the agent's own
  # stdout/stderr go to per-iteration files so they do not pollute the timing.
  T=$( { time openclaw agent --agent main --message "say hello" --json \
         >/tmp/a.$i.json 2>/tmp/a.$i.err ; } 2>&1 | awk '/real/{print $2}')
  printf "iter %2d agent_total=%s\n" "$i" "$T"
done
exit
EOF</pre><pre>3. Observe that ~10% of iterations exceed 60 s, with no progress indicator and no error.
</pre>Expected Result

<pre>A "say hello" turn that returns a one-line greeting via the shared NIM
should complete in < 15 s at P95, end-to-end. No iteration should exceed
60 s without a user-visible progress indicator or a timeout.
</pre>Actual Result

<pre>Raw timings (10 iterations, wall-clock):
 iter  1    11.542 s
 iter  2    11.712 s
 iter  3     6.926 s
 iter  4     7.322 s
 iter  5   128.878 s   ← BR-7 reproduces
 iter  6     7.194 s
 iter  7    12.113 s
 iter  8     6.942 s
 iter  9     7.716 s
 iter 10    11.080 s

Statistics (n = 10, so P99 coincides with the max):
 min              6.93 s
 P50 (median)     9.40 s
 P90             12.11 s
 P99 / max      128.88 s (≈ 2 min 9 s)
 max/min ratio   18.6x
 iters > 60 s    1 / 10 (10%)

Agent JSON instrumentation (iter that completed normally):
 {
   "status": "ok",
   "result": {
     "payloads": [{ "text": "Hey again! What can I do for you?" }],
     "meta": {
       "durationMs": 6392,
       "agentMeta": {
         "provider": "inference",
         "model": "minimaxai/minimax-m2.5",
         "usage": { "input": 18355, "output": 35, "total": 18390 },
         "promptTokens": 18355
       },
       "aborted": false
     }
   }
 }

Note: Even on normal turns, the input prompt is 18,355 tokens to produce 35 output tokens for a "hello" reply, because the agent wraps trivial messages in the full system prompt plus tool schemas. This likely amplifies tail latency during NIM congestion.

(B) Model-fallback overhead seen while debugging raw inference:
When the user passes --model nvidia-prod/minimaxai/minimax-m2.5 (the provider/model string displayed by `nemoclaw list`), the gateway fails twice (~6 s each, two lanes) before falling back to the working name:

</pre><pre> [diagnostic] lane task error: lane=main durationMs=6335
 error="FailoverError: Unknown model: nvidia-prod/minimaxai/minimax-m2.5"
 [diagnostic] lane task error: lane=session:agent:main:main durationMs=6347
 error="FailoverError: Unknown model: nvidia-prod/minimaxai/minimax-m2.5"
 [model-fallback] Fell back to "inference/minimaxai/minimax-m2.5".</pre><pre>This adds ~12 s per call. Suspected cause: the gateway does not resolve the provider/model names shown by `nemoclaw list` directly, so each call pays for two failed attempts before the fallback succeeds. Worth filing as a separate child bug if it is out of scope here.
</pre>Logs

<pre>Suggested attachments (zip and upload after draft is created):
- brev_agent_timings.txt: 10-iter wall-clock timings
- brev_agent_iter1.json: /tmp/a.1.json full dump (18355 token usage)
- brev_fallback_stderr.txt: model-fallback diagnostic stderr (12 s wasted)

Cross-platform comparison (same prompt, identical agent code):

 Metric            Brev (nvidia-prod/minimax-m2.5)   Spark (ollama-local/nemotron-3-nano:30b)
 P50               9.4 s                             10.4 s
 P90               12.1 s                            15.2 s
 max               128.88 s                          17.5 s
 outliers > 60 s   1/10                              0/10

Same agent code, same prompt → only Brev/NIM produces 2-min outliers.
This points to the remote nvidia-prod NIM endpoint as the likely source of
the tail latency (candidate causes: queue depth, quota throttling, region routing).

Suggested investigation:
- Inference / NIM endpoint team: investigate nvidia-prod tail latency
 (queue depth, quota throttling, region affinity, retry policy).
- OpenClaw agent framework team:
 * Add a user-visible progress indicator or timeout for turns > 30 s. With
   the default NEMOCLAW_AGENT_TIMEOUT=600, hangs shorter than 10 min never
   trip the timeout, which is a UX failure.
 * Reduce input prompt for trivial turns OR cache tool schemas across turns.
 * Fix model resolver so provider/model names listed in `nemoclaw list`
 work directly (avoid the 12 s fallback overhead).

Related:
- VDR #4 finding BR-7 ("Massive delays ~2 min for simple queries"). This
  bug is the formal tracker for that VDR #4 finding.
- NVBug 6122111 ([DGX Spark][Agent&Skills] Trivial "hello" agent turn ~10s
  P50 / 17s max): same agent-framework latency floor (~5-7 s) but no
  2-min outliers, since Spark uses local Ollama. Consistent with the tail
  issue being specific to the remote NIM.
</pre>
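
The summary statistics in "Actual Result" can be recomputed from the raw timings with a short shell helper. This is a minimal sketch, not part of NemoClaw or OpenClaw: the function name `timing_stats` and its output format are made up for illustration, the median averages the two middle values when the sample count is even, and P90 uses the nearest-rank method.

```shell
#!/usr/bin/env bash
# Summarize wall-clock timings (seconds, one value per line on stdin).
# Hypothetical helper for this report; not a NemoClaw/OpenClaw command.
timing_stats() {
  sort -n | awk -v limit=60 '
    { v[NR] = $1; if ($1 > limit) slow++ }
    END {
      n = NR
      if (n == 0) { print "no samples"; exit 1 }
      # Median: middle value, or mean of the two middle values for even n.
      p50 = (n % 2) ? v[(n + 1) / 2] : (v[n / 2] + v[n / 2 + 1]) / 2
      # Nearest-rank P90: rank = ceil(0.9 * n), done in integer arithmetic.
      r = int((9 * n + 9) / 10)
      printf "min=%.2f p50=%.2f p90=%.2f max=%.2f ratio=%.1f slow=%d/%d\n",
             v[1], p50, v[r], v[n], v[n] / v[1], slow, n
    }'
}

# Raw timings copied from the "Actual Result" section of this report.
printf '%s\n' 11.542 11.712 6.926 7.322 128.878 7.194 12.113 6.942 7.716 11.080 \
  | timing_stats
```

With only 10 samples, any percentile above P90 collapses onto the maximum, so a larger run would be needed to give a meaningful P99.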

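Until the agent itself shows progress or enforces a shorter timeout, a caller can approximate the suggested 30 s behavior on the client side. The following is a rough sketch under stated assumptions: `run_with_deadline` is a hypothetical wrapper (not an OpenClaw feature), it relies on GNU coreutils `timeout`, and the notice interval must be a whole number of seconds.

```shell
#!/usr/bin/env bash
# Hypothetical client-side wrapper: run a command with a hard deadline and
# print a periodic "still waiting" notice to stderr while it runs.
# Usage: run_with_deadline <deadline_s> <notice_every_s> <command> [args...]
run_with_deadline() {
  local deadline=$1 notice_every=$2
  shift 2

  # Background ticker: one notice every $notice_every seconds.
  (
    elapsed=0
    while sleep "$notice_every" >/dev/null 2>&1; do
      elapsed=$((elapsed + notice_every))
      echo "still waiting after ${elapsed}s" >&2
    done
  ) &
  local ticker=$!

  # GNU timeout exits with status 124 when the deadline is hit.
  local rc=0
  timeout "$deadline" "$@" || rc=$?

  # Stop the ticker and reap it quietly.
  kill "$ticker" 2>/dev/null || true
  wait "$ticker" 2>/dev/null || true
  return "$rc"
}

# Example (inside the sandbox), using the command from the repro steps:
#   run_with_deadline 30 5 openclaw agent --agent main --message "say hello" --json
```

An exit status of 124 tells the caller that the turn was cut off at the deadline, as opposed to failing on its own.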
## Bug Details

| Field | Value |
|-------|-------|
| Priority | Unprioritized |
| Action | Dev - Open - To fix |
| Disposition | Open issue |
| Module | Machine Learning - NemoClaw |
| Keyword | NemoClaw, NemoClaw_Agent&Skills, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Inference, NemoClaw-SWQA-RelBlckr-Recommended |

---
[NVB#6122133]

[Brev][Inference] Trivial agent turn occasionally hangs ~2 min on shared nvidia-prod NIM (P50 9s, P99 128s, 10% outliers) #2600