
[Brev][Inference] Trivial agent turn occasionally hangs ~2 min on shared nvidia-prod NIM (P50 9s, P99 128s, 10% outliers) #2600

@hulynn

Description


On Brev with the shared nvidia-prod NIM endpoint, a trivial agent turn ("say hello") typically completes in 7-12 seconds, but 1 of 10 sampled turns (~10%) stalled for more than 2 minutes with no error or progress indicator. P50 is 9.4 s; P99/max is 128.88 s (≈ 2 min 9 s). This reproduces VDR #4 finding BR-7 ("Massive delays ~2 min for simple queries") on NemoClaw v0.0.28 + OpenClaw 2026.4.9.

A cross-platform comparison (same agent code, same prompt) shows that DGX Spark with local Ollama produces 0/10 outliers >60 s (max 17.5 s). This points to the remote NIM path as the source of the tail-latency outliers, rather than the agent framework or prompt bloat alone.

Two distinct issues were observed and are worth triaging in the same bug:
(A) Tail latency of ~2 min on the shared nvidia-prod NIM (primary)
(B) Model-fallback path adds ~12 s per call when the user passes the displayed
    "provider/model" name from `nemoclaw list` to raw inference (secondary)

Environment
Device:        Brev (Shadeform), host brev-ydoa5pmhb (shadeform user)
OS:            Brev / Ubuntu (Shadeform image)
Architecture:  x86_64
Node.js:       Not captured
npm:           Not captured
Docker:        Not captured
OpenShell CLI: openshell 0.0.26
NemoClaw:      v0.0.28
OpenClaw:      2026.4.9 (build 0512059)

Sandbox name:  aab
Provider:      nvidia-prod (shared NIM endpoint)
Model:         minimaxai/minimax-m2.5
Default agent: main, session agent:main:main

Steps to Reproduce
1. Onboard NemoClaw v0.0.28 on Brev with NVIDIA Endpoints → minimax-m2.5
   (default sandbox name "aab").
2. From the host shell, run 10 one-shot agent turns inside the sandbox:

     nemoclaw aab <<'EOF'
     for i in $(seq 1 10); do
       T=$( { time openclaw agent --agent main --message "say hello" --json \
                >/tmp/a.$i.json 2>/tmp/a.$i.err ; } 2>&1 | awk '/real/{print $2}')
       printf "iter %2d  agent_total=%s\n" "$i" "$T"
     done
     exit
     EOF

3. Observe that ~10% of iterations exceed 60 s (no progress indicator, no error).

Expected Result

A "say hello" turn (1-line greeting via the shared NIM) should complete in
< 15 s at P95, end-to-end. No iteration should exceed 60 s without a
user-visible progress indicator or timeout.

Actual Result
Raw timings (10 iterations, wall-clock):
  iter  1   11.542 s
  iter  2   11.712 s
  iter  3    6.926 s
  iter  4    7.322 s
  iter  5  128.878 s   ← BR-7 reproduces
  iter  6    7.194 s
  iter  7   12.113 s
  iter  8    6.942 s
  iter  9    7.716 s
  iter 10   11.080 s

Statistics:
  min            6.93 s
  P50 (median)   9.40 s
  P90           12.11 s
  P99 / max    128.88 s   (≈ 2 min 9 s)
  range         18.6x (max/min)
  iters > 60 s   1 / 10  (10%)
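
For reference, the summary statistics above can be recomputed from the raw timings with a few lines of shell. This is a minimal sketch, not the script used to produce the report: the function name `summarize_timings` is hypothetical, and it assumes the median averages the two middle samples while P90 uses the nearest-rank method, which is consistent with the reported figures.

```shell
# Hypothetical helper: summarize wall-clock timings (seconds, one per line on stdin).
summarize_timings() {
  LC_ALL=C sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) { print "no samples"; exit 1 }
      # Median: middle value, or mean of the two middle values when NR is even.
      med = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
      # Nearest-rank P90: rank = ceil(0.9 * NR), computed with integer arithmetic.
      r = int((NR * 90 + 99) / 100)
      printf "min=%.2f p50=%.2f p90=%.2f max=%.2f ratio=%.1fx\n",
             v[1], med, v[r], v[NR], v[NR] / v[1]
    }'
}
```

Feeding the ten values listed under "Raw timings", one per line, prints `min=6.93 p50=9.40 p90=12.11 max=128.88 ratio=18.6x`. With only 10 samples, P99 is the same as the maximum.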

Agent JSON instrumentation (iter that completed normally):
  {
    "status": "ok",
    "result": {
      "payloads": [{ "text": "Hey again! What can I do for you?" }],
      "meta": {
        "durationMs": 6392,
        "agentMeta": {
          "provider": "inference",
          "model":    "minimaxai/minimax-m2.5",
          "usage":    { "input": 18355, "output": 35, "total": 18390 },
          "promptTokens": 18355
        },
        "aborted": false
      }
    }
  }

Note: Even on normal turns, the input prompt is 18,355 tokens to produce 35 output tokens for a "hello" reply. The agent inflates trivial messages with the full system prompt plus tool schemas, which likely amplifies tail latency during NIM congestion.
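
To make the prompt overhead concrete, the input-to-output token ratio can be read from a saved agent response such as /tmp/a.1.json. The sketch below is illustrative only: `json_int_field` and `prompt_overhead` are hypothetical helper names, and the text-matching extraction assumes each field appears with an integer value as in the dump above. A real JSON parser would be more robust.

```shell
# Hypothetical helper: print the first integer value of a named JSON field from stdin.
# Assumption: the field looks like "input": 18355, as in the agent dump above.
json_int_field() {
  grep -o "\"$1\": *[0-9][0-9]*" | head -n 1 | grep -o '[0-9][0-9]*$'
}

# Hypothetical helper: report input tokens per output token for one response file.
prompt_overhead() {
  local n_in n_out
  n_in=$(json_int_field input < "$1")
  n_out=$(json_int_field output < "$1")
  awk -v i="$n_in" -v o="$n_out" 'BEGIN {
    if (o > 0) printf "%d in / %d out = %.0f input tokens per output token\n", i, o, i / o
    else exit 1   # missing or zero output count
  }'
}
```

For the response shown above (18355 input, 35 output), this reports roughly 524 input tokens per output token.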

(B) Model-fallback overhead seen while debugging raw inference:
When the user passes --model nvidia-prod/minimaxai/minimax-m2.5 (the provider/model string displayed by `nemoclaw list`), the gateway fails twice (~6 s each, two lanes) before falling back to the working name:

  [diagnostic] lane task error: lane=main durationMs=6335
            error="FailoverError: Unknown model: nvidia-prod/minimaxai/minimax-m2.5"
  [diagnostic] lane task error: lane=session:agent:main:main durationMs=6347
            error="FailoverError: Unknown model: nvidia-prod/minimaxai/minimax-m2.5"
  [model-fallback] Fell back to "inference/minimaxai/minimax-m2.5".
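
The time lost to the two failed lanes can be totaled directly from a stderr capture. This is a rough sketch that assumes the diagnostic line format shown above (a `lane task error` line carrying a `durationMs=` field); the function name `fallback_overhead_ms` is hypothetical.

```shell
# Hypothetical helper: sum durationMs over "lane task error" lines.
# Reads the named files, or stdin when no file is given.
fallback_overhead_ms() {
  awk '/lane task error/ {
         # "durationMs=" is 11 characters; the digits follow it.
         if (match($0, /durationMs=[0-9]+/))
           total += substr($0, RSTART + 11, RLENGTH - 11)
       }
       END { print total + 0 }' "$@"
}
```

For the two lane errors above, 6335 ms + 6347 ms gives 12682 ms, which matches the ~12 s per call quoted in this report.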

This adds ~12 s per call. Suspected fix: the gateway should resolve the provider/model names listed by `nemoclaw list` directly, avoiding the failed attempts before fallback. This is worth filing as a separate child bug if it is not in scope here.

Logs

Suggested attachments (zip and upload after draft is created):
- brev_agent_timings.txt: 10-iter wall-clock timings
- brev_agent_iter1.json:  /tmp/a.1.json full dump (18355 token usage)
- brev_fallback_stderr.txt: model-fallback diagnostic stderr (12 s wasted)

Cross-platform comparison (same prompt, identical agent code):

  Metric            Brev (nvidia-prod/minimax-m2.5)   Spark (ollama-local/nemotron-3-nano:30b)
  P50                9.4 s                             10.4 s
  P90               12.1 s                             15.2 s
  max              128.88 s                            17.5 s
  outliers > 60 s    1/10                               0/10

Same agent code, same prompt → only Brev/NIM produces 2-min outliers.
This localizes the tail-latency root cause to the remote nvidia-prod NIM
endpoint (queue depth / quota throttling / region routing).

Suggested investigation:
- Inference / NIM endpoint team: investigate nvidia-prod tail latency
  (queue depth, quota throttling, region affinity, retry policy).
- OpenClaw agent framework team:
  * Add a user-visible progress indicator or timeout for turns > 30 s. The
    default NEMOCLAW_AGENT_TIMEOUT=600 means hangs shorter than 10 min never
    trip the timeout, which is a UX failure.
  * Reduce input prompt for trivial turns OR cache tool schemas across turns.
  * Fix model resolver so provider/model names listed in `nemoclaw list`
    work directly (avoid the 12 s fallback overhead).
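
Until a framework-level timeout exists, a caller can bound a turn from the outside. The following is a possible client-side workaround, not a proposed fix for the framework: it assumes GNU coreutils `timeout` is available in the sandbox (it exits with status 124 when the limit is hit), and `run_with_deadline` is a hypothetical wrapper name.

```shell
# Hypothetical wrapper: run a command with a wall-clock limit and report a
# stall on stderr instead of hanging silently.
# Usage: run_with_deadline SECONDS command [args...]
run_with_deadline() {
  local limit rc
  limit=$1
  shift
  timeout "$limit" "$@"
  rc=$?
  # GNU timeout returns 124 when the command was killed at the limit.
  if [ "$rc" -eq 124 ]; then
    echo "stalled: no result after ${limit}s" >&2
  fi
  return "$rc"
}
```

For example, `run_with_deadline 60 openclaw agent --agent main --message "say hello" --json` would give up after 60 s, the outlier threshold used in this report, rather than waiting out a ~2 min stall.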

Related:
- VDR #4 finding BR-7 ("Massive delays ~2 min for simple queries"). This
  bug is the formal tracker for that VDR #4 finding.
- NVBug 6122111 ([DGX Spark][Agent&Skills] Trivial "hello" agent turn ~10s
  P50 / 17s max) — same agent framework slowness floor (~5-7 s) but no
  2-min outliers since Spark is local Ollama. Confirms tail issue is
  remote-NIM-specific.

Bug Details

Field        Value
Priority     Unprioritized
Action       Dev - Open - To fix
Disposition  Open issue
Module       Machine Learning - NemoClaw
Keyword      NemoClaw, NemoClaw_Agent&Skills, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Inference, NemoClaw-SWQA-RelBlckr-Recommended

[NVB#6122133]

Labels

- NV QA (Bugs found by the NVIDIA QA Team)
- UAT (Issues flagged for User Acceptance Testing)
- area: e2e (End-to-end tests, nightly failures, or validation infrastructure)
- area: inference (Inference routing, serving, model selection, or outputs)
- needs: unblock (Blocked item needs dependency or decision resolved)
- platform: brev (Affects Brev hosted development environments)
- platform: ubuntu (Affects Ubuntu Linux environments)
- provider: nvidia (NVIDIA inference endpoint, NIM, or NVIDIA provider behavior)