[DGX Spark][CLI&UX] openclaw tui shows indefinite spinner with no error when inference endpoint is unreachable #4434

@mercl-lau

Description

When the NVIDIA inference endpoint is unreachable (e.g. blocked by a firewall), openclaw tui shows an indefinite spinner with "connected" status and never surfaces an error message. The user has nothing to act on: no HTTP status, no error cause, no recovery hint. In the reproduction, the spinner ran for more than 3 minutes 42 seconds with no feedback before it was manually cancelled.

Related fixed bug #6226597 covers stale health in nemoclaw status; this bug is specifically about the TUI agent chat pane failing silently on inference errors.

Environment

Device:        DGX Spark (NVIDIA_DGX_Spark), hostname spark-8158
OS:            Ubuntu 24.04.4 LTS (aarch64)
Architecture:  aarch64
Node.js:       v22.22.3
npm:           10.9.8
Docker:        Docker version 29.2.1, build a5c7197
OpenShell CLI: openshell 0.0.44
NemoClaw:      v0.0.53
OpenClaw:      2026.5.22 (a374c3a)

Steps to Reproduce

  1. Run nemoclaw onboard with the NVIDIA Endpoints provider (nvidia-prod) and model nvidia/nemotron-3-super-120b-a12b. Verify that inference is healthy:
    nemoclaw my-assistant status
    Inference: healthy
  2. Block NVIDIA endpoint IPs from Docker containers:
    sudo iptables -I DOCKER-USER -d 75.2.113.119 -j DROP
    sudo iptables -I DOCKER-USER -d 99.83.136.103 -j DROP
  3. nemoclaw my-assistant connect
  4. openclaw tui
  5. Type any prompt (e.g. "hello") and press Enter
  6. Observe the TUI status bar and main pane for up to 4 minutes
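
To make step 2 repeatable and easy to undo, the two rules can be generated from one place. The sketch below is a hypothetical helper (the function name `endpoint_rules` is not part of any tool mentioned here); it only prints the iptables commands, using `-I` to insert and `-D` to delete the same rule specification, and assumes the two IPs from step 2 are still the relevant ones.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (do not run) the iptables commands that block
# or unblock the two endpoint IPs used in step 2 of this report.
endpoint_rules() {
  local flag ip
  case "${1:-}" in
    block)   flag="-I" ;;   # insert the DROP rule
    unblock) flag="-D" ;;   # delete the matching DROP rule
    *) echo "usage: endpoint_rules block|unblock" >&2; return 2 ;;
  esac
  for ip in 75.2.113.119 99.83.136.103; do
    echo "sudo iptables ${flag} DOCKER-USER -d ${ip} -j DROP"
  done
}
```

Review the printed commands first; piping the output of `endpoint_rules unblock` to `sh` after the test would then restore egress.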

Expected Result

TUI surfaces a structured error within the gateway timeout (180s) including:

  • HTTP status or cause (e.g. "HTTP 503 from upstream" or "connection refused")
  • Which layer reported it (gateway proxy / upstream API)
  • One-line recovery hint (e.g. "check egress policy" / "check API key")

The status bar should show "error" (not "connected") when inference fails.
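
As an illustration of the expected behaviour only (not how openclaw is implemented), the sketch below wraps an arbitrary command in a deadline and prints a one-line structured error instead of waiting forever. It assumes GNU coreutils `timeout`, which exits with code 124 when the deadline expires; the function name `run_with_deadline`, the output format, and the hint text are hypothetical. It does not attempt to identify the reporting layer.

```shell
#!/usr/bin/env bash
# Hypothetical watchdog: run a command under a deadline (in seconds) and
# print a structured one-line error if it times out or fails.
run_with_deadline() {
  local deadline="$1"; shift
  local rc=0
  timeout "$deadline" "$@" || rc=$?
  case "$rc" in
    0)   return 0 ;;
    124) # GNU timeout exits with 124 when the deadline expires
         echo "status=error cause=\"no response within ${deadline}s\" hint=\"check egress policy\"" ;;
    *)   echo "status=error cause=\"command exited with code ${rc}\"" ;;
  esac
  return "$rc"
}
```

For example, `run_with_deadline 180 some-inference-request` (a placeholder command) would report an error at the 180s gateway timeout rather than leaving the spinner running.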

Actual Result

TUI shows an indefinite spinner with playful loading text:

  ⠦ flibbertigibbeting… • 3m 42s | connected

Status bar reads "connected" -- NOT "error" or "timeout".
Main pane shows NOTHING -- no error text, no HTTP status, no recovery hint.
The spinner continues indefinitely past the 180s gateway timeout.
User has no way to determine what failed or how to fix it.

Logs

TUI capture (tmux capture-pane after 3m42s):

  ⠦ flibbertigibbeting… • 3m 42s | connected
  agent main | session main | inference/nvidia/nemotron-3-super-120b-a12b | tokens 6.5k/131k (5%)

No error text appeared in the main pane at any point during the 3m42s wait.
The only change was the spinner animation and elapsed time counter.

Verified that inference was actually blocked:
  curl -s --connect-timeout 5 https://integrate.api.nvidia.com/v1/models
  → exit code 28 (timeout)
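
The curl exit code already carries the kind of cause the TUI could surface. The sketch below maps a few well-known curl exit codes to a one-line message; the function name `classify_curl_exit` and the hint wording are hypothetical, and only a handful of codes are covered.

```shell
#!/usr/bin/env bash
# Hypothetical helper: translate a curl exit code into a short cause and hint.
# Usage: curl -s --connect-timeout 5 <url> >/dev/null; classify_curl_exit $?
classify_curl_exit() {
  case "${1:-}" in
    0)  echo "ok: endpoint reachable" ;;
    6)  echo "error: could not resolve host (check DNS)" ;;
    7)  echo "error: failed to connect to host (check egress policy)" ;;
    28) echo "error: operation timed out (check egress policy)" ;;
    *)  echo "error: curl exit code ${1:-unknown} (see EXIT CODES in the curl man page)" ;;
  esac
}
```

With the firewall rules from the reproduction in place, the check above exits with 28, which this helper reports as a timeout.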

NVB#6236510

Metadata

Labels

  • NV QA (Bugs found by the NVIDIA QA Team)
  • area: cli (Command line interface, flags, terminal UX, or output)
  • area: inference (Inference routing, serving, model selection, or outputs)
  • platform: dgx-spark (Affects DGX Spark hardware or workflows)
  • provider: nvidia (NVIDIA inference endpoint, NIM, or NVIDIA provider behavior)
  • v0.0.59 (Release target)