[Brev][Ubuntu 22.04][Onboard][Agent&Skills] NemoHermes onboard step 7 times out — Tirith build-time download_failed not retried at startup #3793

@hulynn

Description

On Brev Ubuntu 22.04 H100, upgrading NemoClaw v0.0.43 → v0.0.45 via curl|bash and running nemoclaw onboard for NemoHermes fails at Step [7/8] "Setting up Hermes Agent inside sandbox" with ✗ Hermes Agent gateway did not respond within 90s.

Root cause: the Tirith binary download fails during sandbox image build (docker build step), but the build proceeds silently — only writing a /sandbox/.hermes/.tirith-install-failed marker with content download_failed. The Hermes start.sh does not check the marker or retry the install at startup, so hermes gateway run never comes up and nemoclaw onboard times out after 90s with no actionable error.

The runtime fallback path in tools.tirith_security does work (re-downloads Tirith and starts the gateway successfully) — proven by nemoclaw <name> connect recovery — so the network is reachable from the running sandbox; only the build-time download is broken.

Environment

Device:        Brev shadecloud (NVIDIA H100 PCIe x1)
OS:            Ubuntu 22.04.5 LTS (Jammy)
Architecture:  x86_64
Kernel:        6.8.0-90-generic
GPU Driver:    570.195.03
Node.js:       v22.22.3
npm:           10.9.8
Docker:        Docker driver; OpenShell gateway compatibility container active (host glibc 2.35 older than gateway requirement 2.39)
OpenShell CLI: 0.0.39
NemoClaw:      v0.0.45 (upgraded from v0.0.43 via curl|bash, resolved install ref = latest)
OpenClaw:      N/A (Hermes Agent gateway did not start during onboard)

Steps to Reproduce

  1. brev shell into a fresh Brev Ubuntu 22.04 H100 instance with nemoclaw v0.0.43 already installed and an existing sandbox.

  2. Upgrade to latest:

    curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash

    (resolves to v0.0.45; installer detects existing onboarding session and resumes)

  3. Wizard runs Steps 1–4 successfully: preflight cached, OpenShell gateway compatibility patch starts the Docker gateway, NVIDIA Endpoints provider + Nemotron 3 Super 120B model selection cached.

  4. Step [5/8] messaging channels — pick Slack (option 4), paste Bot Token + App Token, allowlist Slack member ID U0AR85ATALW.

  5. Sandbox name: lynn-slack

  6. Step [6/8] creating sandbox — Hermes Dockerfile build runs all 42 steps successfully, sandbox reaches Ready, GPU proof passes (nvidia-smi, /proc/<pid>/task/<tid>/comm write, cuInit(0) via libcuda.so.1).

  7. Step [7/8] "Setting up Hermes Agent inside sandbox" waits 90s then fails.

Expected Result

Hermes Agent gateway responds within 90s; onboarding proceeds to Step [8/8] and completes. If Tirith download fails during build, sandbox startup should either retry the download or invoke the runtime fallback path (tools.tirith_security) before the gateway timeout fires, so the gateway still comes up automatically without requiring the user to run nemoclaw <name> connect as a recovery.

Actual Result

[7/8] Setting up Hermes Agent inside sandbox
  ──────────────────────────────────────────────────
  Waiting for Hermes Agent gateway (up to 90s)...
  ✗ Hermes Agent gateway did not respond within 90s
    Check: nemoclaw lynn-slack logs --follow

nemoclaw lynn-slack logs --follow shows only OpenShell supervisor SSH relay open/close events every ~3 seconds during the onboard window: no Hermes startup events, no agent.log entries, no errors.log entries. The agent never started.

Inside the sandbox (via nemoclaw lynn-slack connect which triggers recovery):

$ cat /sandbox/.hermes/.tirith-install-failed
download_failed

$ ls -la /sandbox/.hermes/logs/
total 20
-rw-r--r-- 1 sandbox sandbox 2195 May 19 08:17 agent.log
-rw-r--r-- 1 sandbox sandbox  316 May 19 08:17 errors.log
# Both files have mtime 08:17 — that's the recovery time, not the original onboard at 08:08-08:13.
# During the onboard, /sandbox/.hermes/logs/ was empty — gateway never wrote anything.

$ find /opt/hermes -name '*tirith*'
/opt/hermes/tests/tools/test_tirith_security.py
/opt/hermes/tools/tirith_security.py
# Source files exist; the install step is what's missing.

Logs

agent.log (captured after nemoclaw lynn-slack connect recovery, NOT during original onboard):

2026-05-19 08:17:16,489 INFO hermes_cli.plugins: Plugin 'openai' registered image_gen provider: openai
2026-05-19 08:17:16,491 INFO hermes_cli.plugins: Plugin 'openai-codex' registered image_gen provider: openai-codex
2026-05-19 08:17:16,570 INFO hermes_cli.plugins: Plugin 'xai' registered image_gen provider: xai
2026-05-19 08:17:16,571 INFO hermes_cli.plugins: Plugin discovery complete: 5 found, 3 enabled
2026-05-19 08:17:17,495 INFO tools.tirith_security: tirith not found — downloading latest release for x86_64-unknown-linux-gnu...
2026-05-19 08:17:17,503 INFO gateway.run: Starting Hermes Gateway...
2026-05-19 08:17:17,576 INFO gateway.run: Connecting to api_server...
2026-05-19 08:17:17,578 INFO gateway.platforms.api_server: [Api_Server] API server listening on http://127.0.0.1:18642 (model: hermes-agent)
2026-05-19 08:17:17,578 INFO gateway.run: ✓ api_server connected
2026-05-19 08:17:17,803 INFO gateway.run: Connecting to slack...
2026-05-19 08:17:18,001 INFO gateway.platforms.slack: [Slack] Authenticated as @nemoclawtest in workspace mercuriusSpace (team: T0AR2D4AGP5)
2026-05-19 08:17:18,002 INFO gateway.platforms.slack: [Slack] Socket Mode connected (1 workspace(s))
2026-05-19 08:17:18,003 INFO gateway.run: ✓ slack connected
2026-05-19 08:17:18,004 INFO gateway.run: Gateway running with 2 platform(s)

Note the fifth entry, logged by tools.tirith_security at 08:17:17,495: tirith was not found, so the latest release was downloaded at runtime, and the gateway then started normally. The runtime fallback in tools.tirith_security recovered successfully. If start.sh had checked the marker file and triggered this fallback before launching hermes gateway run, the original onboard would have succeeded.

/tmp/gateway.log (written during recovery):

[gateway-recovery] WARNING: /tmp/nemoclaw-proxy-env.sh missing - gateway launching without library guards (#2478)
WARNING gateway.platforms.api_server: [Api_Server] ⚠️ No API key configured (API_SERVER_KEY / platforms.api_server.key). All requests will be accepted without authentication. Set an API key for production deployments to prevent unauthorized access to sessions, responses, and cron jobs.

The gateway-recovery warning references #2478 (proxy-env missing), which may explain why build-time Tirith download is unreliable on Brev — the build context might not have the same proxy environment as the running sandbox.

Suggested Fix

A) Build-time hardening — retry the Tirith download N times (with backoff) before writing the .tirith-install-failed marker; or vendor the Tirith wheel/binary into the Hermes base image so the build does not depend on external network at all.
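A minimal sketch of the retry-with-backoff idea in (A), assuming the build step downloads Tirith with a single shell command. The helper name retry_with_backoff, the TIRITH_URL variable, and the output path in the usage comment are hypothetical and do not come from the NemoClaw or Hermes sources.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command up to <max_attempts> times,
# doubling the sleep between attempts (exponential backoff).
# Usage: retry_with_backoff <max_attempts> <base_delay_seconds> <command...>
retry_with_backoff() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while true; do
    # Success on any attempt ends the loop immediately.
    if "$@"; then
      return 0
    fi
    # Give up once the attempt budget is exhausted.
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Example use in a build step (TIRITH_URL and the output path are placeholders):
#   retry_with_backoff 5 2 curl -fsSL "$TIRITH_URL" -o /tmp/tirith \
#     || echo download_failed > /sandbox/.hermes/.tirith-install-failed
```

With this shape the marker is written only after every attempt has failed, so a transient network error during docker build no longer produces a permanently broken image.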

B) Startup-time fallback — in agents/hermes/start.sh, before launching hermes gateway run, check for /sandbox/.hermes/.tirith-install-failed. If present, invoke the runtime fallback path that already exists in tools.tirith_security (it works — proven by the recovery flow above). This is a strict improvement: the runtime fallback already exists, the sandbox network is already reachable, and it costs ~2 seconds. After successful install, remove the marker file so subsequent starts skip the check.
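The marker check in (B) could look roughly like the sketch below. It is not the actual start.sh: the function name recover_tirith_install is hypothetical, and the install command is passed in as an argument because this report does not show how the tools.tirith_security fallback is invoked from a shell.

```shell
#!/usr/bin/env bash
# Hypothetical helper for start.sh: if the build-time marker exists,
# run the supplied install command and clear the marker on success.
# Usage: recover_tirith_install <marker_path> <install_command...>
recover_tirith_install() {
  local marker="$1"
  shift
  # No marker means the build-time install succeeded; nothing to do.
  [ -f "$marker" ] || return 0
  echo "[start] tirith build-time install failed ($(cat "$marker")); retrying at startup" >&2
  if "$@"; then
    # Remove the marker so later starts skip this check.
    rm -f "$marker"
    return 0
  fi
  echo "[start] tirith startup install failed; marker left in place" >&2
  return 1
}

# Example use before launching the gateway (install_tirith is a placeholder
# for whatever triggers the existing runtime fallback):
#   recover_tirith_install /sandbox/.hermes/.tirith-install-failed install_tirith || true
#   exec hermes gateway run
```

Returning non-zero on failure, rather than exiting, leaves the decision to the caller: start.sh can still launch the gateway and let the in-process fallback try again, or abort with a clear message.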

C) Diagnostic UX — when nemoclaw onboard Step 7 hits the 90s timeout, have it also cat /sandbox/.hermes/.tirith-install-failed (and any other known marker files) and surface their contents in the error message, so users can diagnose the failure without manually connecting and grepping inside the sandbox.
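As an illustration of (C), the following sketch prints the contents of any install-failure markers found in a directory. The function name report_install_markers and the .*-install-failed naming pattern are assumptions generalized from the single marker seen in this report, and the snippet would have to run inside the sandbox; how nemoclaw onboard executes commands there is not covered here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each install-failure marker and its contents.
# Returns 0 if at least one marker was found, 1 otherwise.
# Usage: report_install_markers <directory>
report_install_markers() {
  local dir="$1" marker found=1
  for marker in "$dir"/.*-install-failed; do
    # An unmatched glob stays literal and is skipped by this check.
    [ -f "$marker" ] || continue
    printf '    Marker %s: %s\n' "$marker" "$(cat "$marker")"
    found=0
  done
  return "$found"
}

# Example use after the 90s timeout fires:
#   report_install_markers /sandbox/.hermes
```

For the failure in this report, the timeout message would then include the .tirith-install-failed path and its download_failed content next to the existing "Check: nemoclaw lynn-slack logs --follow" hint.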

Related Bugs


NVB#6190755

Metadata

Assignees

No one assigned

    Labels

    NV QA (Bugs found by the NVIDIA QA Team), UAT (Issues flagged for User Acceptance Testing), needs: triage (Awaiting maintainer classification)
