Description
On a fresh Brev shadecloud Ubuntu 22.04 H100 instance, nemohermes onboard (non-interactive, NVIDIA Endpoints + Nemotron 3 Super 120B) and nemohermes <name> rebuild both fail at Step [7/8] "Setting up Hermes Agent inside sandbox" with ✗ Hermes Agent gateway did not respond within 90s. The actual failure is in the sandbox bootstrap script before any gateway is started.
Root cause (verified by direct sandbox-container inspection): /usr/local/bin/nemoclaw-start calls drop_capabilities (~line 71), which strips CAP_DAC_OVERRIDE. Later (~line 208) it rewrites /sandbox/.bashrc and /sandbox/.profile via mktemp + chmod + mv. /sandbox is drwxr-xr-x sandbox:sandbox, so after the drop root can no longer create files in it, and mktemp /sandbox/..bashrc.tmp.XXXXXX returns Permission denied. The script does not check the exit status, prints the error to /tmp/nemoclaw-start.log, and exits 0. The openshell-sandbox supervisor (PID 1) treats this as success and falls back to sleep infinity. As a result the agent and all bridges never start, the symlink /sandbox/.hermes/channel_directory.json -> runtime/channel_directory.json is left dangling, and the 90s healthcheck times out.
Not a duplicate of #3793 (Tirith download_failed): /sandbox/.hermes/.tirith-install-failed marker does not exist here, and the runtime fallback path that worked for #3793 (nemoclaw <name> connect) does not recover this one — both nemohermes my-hermes connect --probe-only and nemohermes my-hermes recover return Probe failed: ... automatic recovery failed.
Secondary issues observed:
- Silent exit: nemoclaw-start does not propagate the mktemp error. Suggest set -e around the rc-file rewrite section, or at minimum mktemp ... || die.
- Misleading doctor: nemohermes my-hermes doctor reports [ok] Channels: discord, slack, telegram enabled based on config presence, not runtime. With no hermes process running, this is wrong.
- Policy state drift on failed rebuild: after the rebuild failure, nemohermes my-hermes status shows Policies: slack only, but doctor and channels list still say all three are enabled. Persisted policy state and channel config disagree.
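To illustrate the "Misleading doctor" point, here is a minimal sketch of a channel check that looks at runtime state as well as config. It is not the actual doctor implementation: the function names agent_running and channel_health are hypothetical, and matching the agent by the pattern "hermes" in the process list is an assumption.

```shell
# Hypothetical helpers for illustration; not the real doctor code.

# agent_running PS_OUTPUT PATTERN
# Exit 0 if any line of PS_OUTPUT matches the extended regex PATTERN.
agent_running() {
  printf '%s\n' "$1" | grep -Eq -- "$2"
}

# channel_health CONFIGURED_CHANNELS PS_OUTPUT [PATTERN]
# Report [ok] only when the channels are configured AND an agent process
# is present. PATTERN defaults to "hermes", which is an assumption about
# how the agent appears in the process list.
channel_health() {
  local channels="$1" ps_output="$2" pattern="${3:-hermes}"
  if agent_running "$ps_output" "$pattern"; then
    echo "[ok] Channels: ${channels} enabled"
  else
    echo "[fail] Channels: ${channels} configured, but no agent process is running"
    return 1
  fi
}
```

Fed the output of sudo docker exec <container> ps -eo pid,cmd from this report (only openshell-sandbox and sleep infinity), channel_health would print the [fail] line and return non-zero.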
Impact: blocks all NemoHermes messaging-channel end-to-end testing on the Brev Ubuntu 22.04 platform documented in the Hermes quickstart. Slack / Discord / Telegram bot tokens were independently verified valid via direct curl against the vendor APIs (HTTP 200 / ok:true from auth.test, users/@me, getMe) — the failure is entirely on the NemoHermes side.
Environment
Device: Brev shadecloud (hyperstack), NVIDIA H100 PCIe x1, brev instance nemoclaw-0514
OS: Ubuntu 22.04.5 LTS, kernel 6.8.0-90-generic
(host glibc 2.35; OpenShell gateway runs in compatibility container
because gateway requires glibc 2.39)
Architecture: x86_64
Docker: 29.1.3
OpenShell CLI: 0.0.39
NemoClaw: v0.0.46 (NemoHermes)
OpenClaw: N/A (Hermes Agent never started; no in-sandbox shell possible)
Hermes Agent: v2026.4.23 (image only — never executed)
Sandbox image: openshell/sandbox-from:1779268803 (fresh build)
Sandbox name: my-hermes, id 7ed011c0-e767-420b-a4db-9991bb485074
Container: openshell-my-hermes-7ed011c0-e767-420b-a4db-9991bb485074
ENTRYPOINT: /opt/openshell/bin/openshell-sandbox (Dockerfile final USER is root)
NVIDIA_API_KEY: validated (HTTP 200 against integrate.api.nvidia.com)
Steps to Reproduce
- brev shell into a fresh Brev shadecloud Ubuntu 22.04 H100 instance.
- Free disk if needed (box ships with /var/log/syslog* ≈ 28G; clear first).
- Export onboard env:
export NEMOCLAW_AGENT=hermes
export NEMOCLAW_NON_INTERACTIVE=1
export NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1
export NEMOCLAW_SANDBOX_NAME=my-hermes
export NVIDIA_API_KEY=<valid nvapi- key>
- Run the installer:
curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash
- Steps [1/8]–[6/8] all pass (preflight, gateway in compat container, inference, sandbox build, GPU proof).
- Step [7/8] times out at 90s.
- Both nemohermes my-hermes connect --probe-only and nemohermes my-hermes recover fail.
Direct repro of the silent failure inside the live container:
$ sudo docker exec <container> ls -la /sandbox/.bashrc /sandbox/.profile /sandbox
-r--r--r-- 1 root root 169 May 20 09:18 /sandbox/.bashrc
-r--r--r-- 1 root root 169 May 20 09:18 /sandbox/.profile
drwxr-xr-x 1 sandbox sandbox 4096 May 20 09:22 /sandbox
$ sudo docker exec <container> /usr/local/bin/nemoclaw-start
Setting up NemoClaw (Hermes)...
mktemp: failed to create file via template '/sandbox/..bashrc.tmp.XXXXXX': Permission denied
$ echo $?
0
# Capability-intact exec at the SAME path succeeds — proves the failure
# is post-drop_capabilities, not a static FS perm bug:
$ sudo docker exec <container> sh -c 'mktemp /sandbox/..bashrc.tmp.XXXXXX'
/sandbox/..bashrc.tmp.5b39Hz
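The capability explanation can also be checked directly rather than inferred from the mktemp result. The sketch below decodes the CapEff mask from /proc/<pid>/status and tests the CAP_DAC_OVERRIDE bit (capability number 1). The helper names has_cap and current_cap_eff are hypothetical, and the guard shown in the trailing comment is only a suggestion for where such a check could sit in the script.

```shell
# Hypothetical helpers for illustration; not part of nemoclaw-start.

# CAP_DAC_OVERRIDE is capability number 1 in <linux/capability.h>.
CAP_DAC_OVERRIDE=1

# has_cap HEX_MASK CAP_NUMBER
# Exit 0 if bit CAP_NUMBER is set in HEX_MASK (hex digits without a 0x
# prefix, as printed on the CapEff line of /proc/<pid>/status).
has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# current_cap_eff
# Print the effective capability mask of the calling shell.
current_cap_eff() {
  awk '/^CapEff:/ { print $2 }' "/proc/$$/status"
}

# Example guard that could run right before the rc-file rewrite:
#   has_cap "$(current_cap_eff)" "$CAP_DAC_OVERRIDE" \
#     || echo "CAP_DAC_OVERRIDE is not effective; writes to /sandbox may fail" >&2
```

Run after drop_capabilities, a check like this would be expected to report the bit as cleared, while the same check from a plain docker exec shell would report it as set, which matches the two mktemp results above.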
Expected Result
Step [7/8] completes within the healthcheck window. Hermes Agent runs inside the sandbox. Configured Slack / Discord / Telegram channels connect and round-trip messages. doctor reflects runtime state, not just config.
If nemoclaw-start cannot rewrite a hardened root-owned read-only .bashrc / .profile due to dropped CAP_DAC_OVERRIDE, the script should either:
- (a) skip the mutation when the file is already in its locked final form, or
- (b) propagate the failure (non-zero exit) so the supervisor surfaces it instead of silently falling back to sleep infinity and producing a misleading "Ready" sandbox.
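Options (a) and (b) above can be combined in one helper. This is a minimal sketch, not the actual nemoclaw-start code: the function name write_rc_file is hypothetical, the 444 mode is taken from the ls output in this report, and the temp-file template mirrors the one in the error message.

```shell
# Hypothetical helper for illustration; not the real nemoclaw-start code.

# write_rc_file TARGET CONTENT
# Returns 0 if TARGET already holds CONTENT or was replaced successfully,
# non-zero if the replacement could not be made.
write_rc_file() {
  local target="$1" content="$2" dir base tmp
  dir="$(dirname -- "$target")"
  base="$(basename -- "$target")"

  # (a) Skip the mutation when the file is already in its final form.
  if [ -f "$target" ] && [ "$(cat -- "$target")" = "$content" ]; then
    return 0
  fi

  # (b) Propagate the failure instead of continuing silently.
  tmp="$(mktemp "${dir}/.${base}.tmp.XXXXXX")" || {
    echo "write_rc_file: cannot create temp file in ${dir}" >&2
    return 1
  }

  # Write, lock down, and atomically move into place; clean up on any error.
  if ! { printf '%s\n' "$content" > "$tmp" \
         && chmod 444 "$tmp" \
         && mv -f -- "$tmp" "$target"; }; then
    rm -f -- "$tmp"
    return 1
  fi
}
```

A caller would then use write_rc_file /sandbox/.bashrc "$content" || exit 1, so that a Permission denied from mktemp becomes a non-zero script exit that the supervisor can surface.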
Actual Result
[7/8] Setting up Hermes Agent inside sandbox
──────────────────────────────────────────────────
Waiting for Hermes Agent gateway (up to 90s)...
✗ Hermes Agent gateway did not respond within 90s
Check: nemohermes my-hermes logs --follow
Rebuild fails the same way; connect --probe-only and recover both fail with "automatic recovery failed". Container has only the supervisor + sleep infinity — no hermes / node / python process.
Logs
# /tmp/nemoclaw-start.log inside the sandbox (full contents):
Setting up NemoClaw (Hermes)...
mktemp: failed to create file via template '/sandbox/..bashrc.tmp.XXXXXX': Permission denied
# Container process tree:
$ sudo docker exec <container> ps -eo pid,cmd
PID CMD
1 /opt/openshell/bin/openshell-sandbox
111 sleep infinity
# `nemohermes my-hermes logs --since 5m` excerpt — only OpenShell SSH relay
# healthcheck open/close every ~3s for the entire onboard + rebuild window,
# no hermes startup events at all:
[1779269042.988] [sandbox] [OCSF ] NET:OPEN ssh relay open (channel_id=5dfb4634-..., target=unix:/run/openshell/ssh.sock)
[1779269043.160] [sandbox] [OCSF ] NET:CLOSE ssh relay closed (channel_id=5dfb4634-..., ...)
# Doctor (misleading channel state vs. runtime):
Sandbox: [ok] Live sandbox: my-hermes present (Ready)
[ok] Agent version: Hermes Agent v2026.4.23
Messaging: [ok] Channels: discord, slack, telegram enabled; no recent conflict signatures
Gateway: [fail] Docker container: openshell-cluster-nemoclaw not found
[ok] OpenShell status: connected to nemoclaw
# OpenShell gateway log — no nemoclaw-start spawn/exec record, only
# GetSandboxConfig polling every ~10s and periodic connection errors from
# the sandbox container (172.18.0.2). Never any hermes-related grpc traffic.
NVB#6195131