Description
On Brev Ubuntu 22.04 H100, upgrading NemoClaw v0.0.43 → v0.0.45 via curl|bash and running nemoclaw onboard for NemoHermes fails at Step [7/8] "Setting up Hermes Agent inside sandbox" with ✗ Hermes Agent gateway did not respond within 90s.
Root cause: the Tirith binary download fails during sandbox image build (docker build step), but the build proceeds silently — only writing a /sandbox/.hermes/.tirith-install-failed marker with content download_failed. The Hermes start.sh does not check the marker or retry the install at startup, so hermes gateway run never comes up and nemoclaw onboard times out after 90s with no actionable error.
The runtime fallback path in tools.tirith_security does work (re-downloads Tirith and starts the gateway successfully) — proven by nemoclaw <name> connect recovery — so the network is reachable from the running sandbox; only the build-time download is broken.
Environment
Device: Brev shadecloud (NVIDIA H100 PCIe x1)
OS: Ubuntu 22.04.5 LTS (Jammy)
Architecture: x86_64
Kernel: 6.8.0-90-generic
GPU Driver: 570.195.03
Node.js: v22.22.3
npm: 10.9.8
Docker: Docker driver; OpenShell gateway compatibility container active (host glibc 2.35 older than gateway requirement 2.39)
OpenShell CLI: 0.0.39
NemoClaw: v0.0.45 (upgraded from v0.0.43 via curl|bash, resolved install ref = latest)
OpenClaw: N/A (Hermes Agent gateway did not start during onboard)
Steps to Reproduce
- brev shell into a fresh Brev Ubuntu 22.04 H100 instance with nemoclaw v0.0.43 already installed and an existing sandbox.
- Upgrade to latest:
  curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash
  (resolves to v0.0.45; installer detects existing onboarding session and resumes)
- Wizard runs Steps 1–4 successfully: preflight cached, OpenShell gateway compatibility patch starts the Docker gateway, NVIDIA Endpoints provider + Nemotron 3 Super 120B model selection cached.
- Step [5/8] messaging channels: pick Slack (option 4), paste Bot Token + App Token, allowlist Slack member ID U0AR85ATALW.
- Sandbox name: lynn-slack
- Step [6/8] creating sandbox: Hermes Dockerfile build runs all 42 steps successfully, sandbox reaches Ready, GPU proof passes (nvidia-smi, /proc/<pid>/task/<tid>/comm write, cuInit(0) via libcuda.so.1).
- Step [7/8] "Setting up Hermes Agent inside sandbox" waits 90s then fails.
Expected Result
Hermes Agent gateway responds within 90s; onboarding proceeds to Step [8/8] and completes. If Tirith download fails during build, sandbox startup should either retry the download or invoke the runtime fallback path (tools.tirith_security) before the gateway timeout fires, so the gateway still comes up automatically without requiring the user to run nemoclaw <name> connect as a recovery.
Actual Result
[7/8] Setting up Hermes Agent inside sandbox
──────────────────────────────────────────────────
Waiting for Hermes Agent gateway (up to 90s)...
✗ Hermes Agent gateway did not respond within 90s
Check: nemoclaw lynn-slack logs --follow
nemoclaw lynn-slack logs --follow shows only OpenShell supervisor SSH relay open/close events every ~3 seconds during the onboard window: no Hermes startup events, no agent.log entries, no errors.log entries. The agent never started.
Inside the sandbox (via nemoclaw lynn-slack connect which triggers recovery):
$ cat /sandbox/.hermes/.tirith-install-failed
download_failed
$ ls -la /sandbox/.hermes/logs/
total 20
-rw-r--r-- 1 sandbox sandbox 2195 May 19 08:17 agent.log
-rw-r--r-- 1 sandbox sandbox 316 May 19 08:17 errors.log
# Both files have mtime 08:17 — that's the recovery time, not the original onboard at 08:08-08:13.
# During the onboard, /sandbox/.hermes/logs/ was empty — gateway never wrote anything.
$ find /opt/hermes -name '*tirith*'
/opt/hermes/tests/tools/test_tirith_security.py
/opt/hermes/tools/tirith_security.py
# Source files exist; the install step is what's missing.
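The manual checks above can be gathered into one pass. The following is a minimal sketch, not part of NemoClaw or Hermes: the function name report_hermes_state is hypothetical, and it relies only on the marker and log paths quoted in this report.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic helper (not an existing NemoClaw command).
# Reports whether the Tirith failure marker exists and whether the
# Hermes log directory contains anything.
# Usage: report_hermes_state [hermes_dir]   (default: /sandbox/.hermes)
report_hermes_state() {
  local hermes_dir="${1:-/sandbox/.hermes}"
  local marker="$hermes_dir/.tirith-install-failed"
  local log_dir="$hermes_dir/logs"

  if [ -f "$marker" ]; then
    echo "tirith marker: present ($(cat "$marker"))"
  else
    echo "tirith marker: absent"
  fi

  # A missing or empty log directory means the gateway never wrote anything.
  if [ -d "$log_dir" ] && [ -n "$(ls -A "$log_dir")" ]; then
    echo "logs: present"
  else
    echo "logs: empty"
  fi
}
```

Run inside the sandbox, a "present" marker combined with empty logs would correspond to the state observed during the original onboard.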
Logs
agent.log (captured after nemoclaw lynn-slack connect recovery, NOT during original onboard):
2026-05-19 08:17:16,489 INFO hermes_cli.plugins: Plugin 'openai' registered image_gen provider: openai
2026-05-19 08:17:16,491 INFO hermes_cli.plugins: Plugin 'openai-codex' registered image_gen provider: openai-codex
2026-05-19 08:17:16,570 INFO hermes_cli.plugins: Plugin 'xai' registered image_gen provider: xai
2026-05-19 08:17:16,571 INFO hermes_cli.plugins: Plugin discovery complete: 5 found, 3 enabled
2026-05-19 08:17:17,495 INFO tools.tirith_security: tirith not found — downloading latest release for x86_64-unknown-linux-gnu...
2026-05-19 08:17:17,503 INFO gateway.run: Starting Hermes Gateway...
2026-05-19 08:17:17,576 INFO gateway.run: Connecting to api_server...
2026-05-19 08:17:17,578 INFO gateway.platforms.api_server: [Api_Server] API server listening on http://127.0.0.1:18642 (model: hermes-agent)
2026-05-19 08:17:17,578 INFO gateway.run: ✓ api_server connected
2026-05-19 08:17:17,803 INFO gateway.run: Connecting to slack...
2026-05-19 08:17:18,001 INFO gateway.platforms.slack: [Slack] Authenticated as @nemoclawtest in workspace mercuriusSpace (team: T0AR2D4AGP5)
2026-05-19 08:17:18,002 INFO gateway.platforms.slack: [Slack] Socket Mode connected (1 workspace(s))
2026-05-19 08:17:18,003 INFO gateway.run: ✓ slack connected
2026-05-19 08:17:18,004 INFO gateway.run: Gateway running with 2 platform(s)
Note the fifth line of the excerpt ("tirith not found", followed by a download of the latest release): the runtime fallback in tools.tirith_security ran and the gateway came up. If start.sh had checked the marker file and triggered this fallback before launching hermes gateway run, the original onboard would have succeeded.
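A pre-launch marker check of this kind could be sketched as below. This is an illustration under stated assumptions, not the actual start.sh: the function name recover_tirith_install and the install_tirith placeholder are hypothetical, because this report does not show how the fallback in tools.tirith_security can be invoked from a shell.

```shell
#!/usr/bin/env bash
# Sketch of a pre-launch check for start.sh (hypothetical, for illustration).
# Usage: recover_tirith_install <marker_path> <install_command> [args...]
# Returns 0 if no marker exists, or if the install command succeeds
# (the marker is then removed). Returns 1 if the install command fails.
recover_tirith_install() {
  local marker="$1"
  shift
  # No marker: the build-time install succeeded, nothing to do.
  [ -f "$marker" ] || return 0

  echo "[start.sh] build-time Tirith install failed ($(cat "$marker")); retrying" >&2
  if "$@"; then
    # Remove the marker so later starts skip this check.
    rm -f "$marker"
    return 0
  fi
  echo "[start.sh] Tirith retry failed; marker kept at $marker" >&2
  return 1
}

# Illustrative call site (install_tirith is a placeholder for whatever
# command ends up triggering the existing runtime fallback):
# recover_tirith_install /sandbox/.hermes/.tirith-install-failed install_tirith
# hermes gateway run
```

Whether the gateway should still launch when the retry fails is a design choice this sketch leaves open.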
/tmp/gateway.log (written during recovery):
[gateway-recovery] WARNING: /tmp/nemoclaw-proxy-env.sh missing - gateway launching without library guards (#2478)
WARNING gateway.platforms.api_server: [Api_Server] ⚠️ No API key configured (API_SERVER_KEY / platforms.api_server.key). All requests will be accepted without authentication. Set an API key for production deployments to prevent unauthorized access to sessions, responses, and cron jobs.
The gateway-recovery warning references #2478 (proxy-env missing), which may explain why build-time Tirith download is unreliable on Brev — the build context might not have the same proxy environment as the running sandbox.
Suggested Fix
A) Build-time hardening — retry the Tirith download N times (with backoff) before writing the .tirith-install-failed marker; or vendor the Tirith wheel/binary into the Hermes base image so the build does not depend on external network at all.
B) Startup-time fallback — in agents/hermes/start.sh, before launching hermes gateway run, check for /sandbox/.hermes/.tirith-install-failed. If present, invoke the runtime fallback path that already exists in tools.tirith_security (it works — proven by the recovery flow above). This is a strict improvement: the runtime fallback already exists, the sandbox network is already reachable, and it costs ~2 seconds. After successful install, remove the marker file so subsequent starts skip the check.
C) Diagnostic UX — when nemoclaw onboard Step 7 hits the 90s timeout, have it also cat /sandbox/.hermes/.tirith-install-failed (and any other known marker files) and surface their contents in the error message, so users can diagnose the failure without manually connecting and grepping inside the sandbox.
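The retry in option A could look roughly like the sketch below. The helper name retry_with_backoff, the attempt count, the initial delay, and the TIRITH_URL placeholder are all assumptions for illustration; the download command actually used in the Hermes Dockerfile is not shown in this report.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command up to N times, doubling the delay
# after each failed attempt.
# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command> [args...]
retry_with_backoff() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    echo "attempt ${attempt}/${max_attempts} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Illustrative use in the build step (TIRITH_URL and the output path are
# placeholders, not values taken from the Hermes Dockerfile):
# retry_with_backoff 4 2 curl -fsSL -o /tmp/tirith.tar.gz "$TIRITH_URL" \
#   || echo download_failed > /sandbox/.hermes/.tirith-install-failed
```

With this shape the marker is written only after every attempt has failed. curl's own --retry and --retry-delay options are an alternative if the build step downloads with curl.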
Related Bugs
start.sh does not retry. Not a duplicate.
NVB#6190755