[Brev][Sandbox] NemoHermes onboard Step [7/8] times out: nemoclaw-start exits 0 silently after mktemp Permission denied in /sandbox #3891

@hulynn

Description

On a fresh Brev shadecloud Ubuntu 22.04 H100 instance, nemohermes onboard (non-interactive, NVIDIA Endpoints + Nemotron 3 Super 120B) and nemohermes <name> rebuild both fail at Step [7/8] "Setting up Hermes Agent inside sandbox" with ✗ Hermes Agent gateway did not respond within 90s. The actual failure is in the sandbox bootstrap script before any gateway is started.

Root cause (verified by direct sandbox-container inspection): /usr/local/bin/nemoclaw-start calls drop_capabilities (~line 71), stripping CAP_DAC_OVERRIDE. Later (~line 208) it rewrites /sandbox/.bashrc and /sandbox/.profile via mktemp + chmod + mv. /sandbox is drwxr-xr-x sandbox:sandbox, and without CAP_DAC_OVERRIDE root is subject to the normal permission checks, so post-drop root can't create files in it: mktemp /sandbox/..bashrc.tmp.XXXXXX returns Permission denied. The script does not check mktemp's exit status; the error goes to /tmp/nemoclaw-start.log and the script exits 0. The openshell-sandbox supervisor (PID 1) treats this as success and falls back to sleep infinity. As a result the agent and all bridges never start, /sandbox/.hermes/channel_directory.json -> runtime/channel_directory.json is left as a dangling symlink, and the 90s healthcheck times out.

Not a duplicate of #3793 (Tirith download_failed): /sandbox/.hermes/.tirith-install-failed marker does not exist here, and the runtime fallback path that worked for #3793 (nemoclaw <name> connect) does not recover this one — both nemohermes my-hermes connect --probe-only and nemohermes my-hermes recover return Probe failed: ... automatic recovery failed.

Secondary issues observed:

  1. Silent exit: nemoclaw-start does not propagate the mktemp error. Suggest set -e around the rc-file rewrite section, or at minimum mktemp ... || die.
  2. Misleading doctor: nemohermes my-hermes doctor reports [ok] Channels: discord, slack, telegram enabled based on config presence, not runtime. With no hermes process running, this is wrong.
  3. Policy state drift on failed rebuild: after the rebuild failure, nemohermes my-hermes status shows Policies: slack only, but doctor and channels list still say all three are enabled. Persisted policy state and channel config disagree.
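
The mktemp ... || die suggestion in item 1 could look roughly like the sketch below. The real body of nemoclaw-start is not shown in this report, so the function names rewrite_rc_file and die, the 444 mode, and the way content is passed in are all assumptions made for illustration. Only the mktemp + chmod + mv sequence and the temp-file naming come from the observed log.

```shell
# Hypothetical helpers; not the actual nemoclaw-start code.
die() {
  echo "nemoclaw-start: $*" >&2
  exit 1
}

# Atomically replace an rc file: write a temp file in the same directory,
# make it read-only, then rename it over the target.
# Returns non-zero if any step fails instead of continuing silently.
rewrite_rc_file() {
  target=$1
  content=$2
  dir=$(dirname -- "$target")
  base=$(basename -- "$target")
  # Same naming as the observed template, e.g. /sandbox/..bashrc.tmp.XXXXXX
  tmp=$(mktemp "$dir/.$base.tmp.XXXXXX") || return 1
  if ! printf '%s\n' "$content" > "$tmp" ||
     ! chmod 444 "$tmp" ||
     ! mv -f "$tmp" "$target"; then
    rm -f "$tmp"
    return 1
  fi
}
```

A caller would then write something like rewrite_rc_file /sandbox/.bashrc "$content" || die "cannot rewrite /sandbox/.bashrc", so the supervisor sees a non-zero exit rather than a clean one.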

Impact: blocks all NemoHermes messaging-channel end-to-end testing on the Brev Ubuntu 22.04 platform documented in the Hermes quickstart. Slack / Discord / Telegram bot tokens were independently verified valid via direct curl against the vendor APIs (HTTP 200 / ok:true from auth.test, users/@me, getMe) — the failure is entirely on the NemoHermes side.

Environment

Device:        Brev shadecloud (hyperstack), NVIDIA H100 PCIe x1, brev instance nemoclaw-0514
OS:            Ubuntu 22.04.5 LTS, kernel 6.8.0-90-generic
               (host glibc 2.35; OpenShell gateway runs in compatibility container
               because gateway requires glibc 2.39)
Architecture:  x86_64
Docker:        29.1.3
OpenShell CLI: 0.0.39
NemoClaw:      v0.0.46 (NemoHermes)
OpenClaw:      N/A (Hermes Agent never started; no in-sandbox shell possible)
Hermes Agent:  v2026.4.23 (image only — never executed)
Sandbox image: openshell/sandbox-from:1779268803 (fresh build)
Sandbox name:  my-hermes, id 7ed011c0-e767-420b-a4db-9991bb485074
Container:     openshell-my-hermes-7ed011c0-e767-420b-a4db-9991bb485074
ENTRYPOINT:    /opt/openshell/bin/openshell-sandbox (Dockerfile final USER is root)
NVIDIA_API_KEY: validated (HTTP 200 against integrate.api.nvidia.com)

Steps to Reproduce

  1. brev shell into a fresh Brev shadecloud Ubuntu 22.04 H100 instance.
  2. Free disk if needed (box ships with /var/log/syslog* ≈ 28G; clear first).
  3. Export onboard env:
    export NEMOCLAW_AGENT=hermes
    export NEMOCLAW_NON_INTERACTIVE=1
    export NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1
    export NEMOCLAW_SANDBOX_NAME=my-hermes
    export NVIDIA_API_KEY=<valid nvapi- key>
  4. Run the installer:
    curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash
  5. Steps [1/8]–[6/8] all pass (preflight, gateway in compat container, inference, sandbox build, GPU proof).
  6. Step [7/8] times out at 90s.
  7. Both nemohermes my-hermes connect --probe-only and nemohermes my-hermes recover fail.

Direct repro of the silent failure inside the live container:

$ sudo docker exec <container> ls -la /sandbox/.bashrc /sandbox/.profile /sandbox
-r--r--r-- 1 root    root    169 May 20 09:18 /sandbox/.bashrc
-r--r--r-- 1 root    root    169 May 20 09:18 /sandbox/.profile
drwxr-xr-x 1 sandbox sandbox 4096 May 20 09:22 /sandbox

$ sudo docker exec <container> /usr/local/bin/nemoclaw-start
Setting up NemoClaw (Hermes)...
mktemp: failed to create file via template '/sandbox/..bashrc.tmp.XXXXXX': Permission denied
$ echo $?
0

# Capability-intact exec at the SAME path succeeds — proves the failure
# is post-drop_capabilities, not a static FS perm bug:
$ sudo docker exec <container> sh -c 'mktemp /sandbox/..bashrc.tmp.XXXXXX'
/sandbox/..bashrc.tmp.5b39Hz
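
To confirm the capability state at the point of failure, one option is to read the effective capability mask from inside the script. This is a sketch, not something taken from nemoclaw-start: the function names are hypothetical, and it assumes a Linux /proc filesystem. CAP_DAC_OVERRIDE is capability number 1, so it corresponds to bit value 2 in the CapEff mask.

```shell
# Print the effective capability mask (hex) of the current shell.
# Reads the CapEff: line from /proc/<pid>/status (Linux only).
current_cap_eff() {
  awk '/^CapEff:/ {print $2}' "/proc/$$/status"
}

# Succeeds if the given hex mask includes CAP_DAC_OVERRIDE (capability 1).
has_cap_dac_override() {
  mask=$1
  [ $(( 0x$mask & 2 )) -ne 0 ]
}
```

Logging has_cap_dac_override "$(current_cap_eff)" just before the rc-file rewrite would show directly whether the mktemp call runs with or without the capability.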

Expected Result

Step [7/8] completes within the healthcheck window. Hermes Agent runs inside the sandbox. Configured Slack / Discord / Telegram channels connect and round-trip messages. doctor reflects runtime state, not just config.
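
As an illustration of a doctor check that reflects runtime state, the sketch below reports channels as enabled only when an agent process actually exists. How doctor is implemented is not known from this report; the function names and the process name passed in (for example hermes) are assumptions. It scans /proc directly, so it is Linux-only and does not depend on ps or pgrep being installed in the sandbox image.

```shell
# Succeeds if any process has the given command name.
# Note: /proc/<pid>/comm is truncated to 15 characters by the kernel.
process_running() {
  want=$1
  for f in /proc/[0-9]*/comm; do
    # A process may exit between the glob and the read; skip it if so.
    { read -r name < "$f"; } 2>/dev/null || continue
    [ "$name" = "$want" ] && return 0
  done
  return 1
}

# Hypothetical status line: distinguish "configured" from "actually running".
channel_status() {
  if process_running "$1"; then
    echo "enabled (agent running)"
  else
    echo "configured (agent not running)"
  fi
}
```

With the process tree shown in the Logs section (supervisor plus sleep infinity only), channel_status hermes would print the "configured (agent not running)" line rather than an unconditional [ok].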

If nemoclaw-start cannot rewrite a hardened root-owned read-only .bashrc / .profile due to dropped CAP_DAC_OVERRIDE, the script should either:

  • (a) skip the mutation when the file is already in its locked final form, or
  • (b) propagate the failure (non-zero exit) so the supervisor surfaces it instead of silently falling back to sleep infinity and producing a misleading "Ready" sandbox.
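
Option (a) could be sketched as a pre-check that runs before the rewrite. The helper name rc_file_is_final is hypothetical, "locked final form" is assumed here to mean mode 444 with exactly the expected content, and stat -c is the GNU coreutils form available on Ubuntu 22.04.

```shell
# Succeeds when the rc file already has the expected content and is
# read-only (mode 444), so no rewrite is needed.
rc_file_is_final() {
  target=$1
  expected=$2
  [ -f "$target" ] || return 1
  [ "$(stat -c '%a' "$target")" = "444" ] || return 1
  printf '%s\n' "$expected" | cmp -s - "$target"
}
```

The rewrite step would then be guarded with rc_file_is_final /sandbox/.bashrc "$content" || <rewrite and check its exit status>, which combines (a) and (b): no mutation when nothing needs to change, and a visible failure when a needed change cannot be made.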

Actual Result

[7/8] Setting up Hermes Agent inside sandbox
──────────────────────────────────────────────────
Waiting for Hermes Agent gateway (up to 90s)...
✗ Hermes Agent gateway did not respond within 90s
  Check: nemohermes my-hermes logs --follow

Rebuild fails the same way; connect --probe-only and recover both fail with "automatic recovery failed". Container has only the supervisor + sleep infinity — no hermes / node / python process.

Logs

# /tmp/nemoclaw-start.log inside the sandbox (full contents):
Setting up NemoClaw (Hermes)...
mktemp: failed to create file via template '/sandbox/..bashrc.tmp.XXXXXX': Permission denied

# Container process tree:
$ sudo docker exec <container> ps -eo pid,cmd
    PID CMD
      1 /opt/openshell/bin/openshell-sandbox
    111 sleep infinity

# `nemohermes my-hermes logs --since 5m` excerpt — only OpenShell SSH relay
# healthcheck open/close every ~3s for the entire onboard + rebuild window,
# no hermes startup events at all:
[1779269042.988] [sandbox] [OCSF ] NET:OPEN  ssh relay open  (channel_id=5dfb4634-..., target=unix:/run/openshell/ssh.sock)
[1779269043.160] [sandbox] [OCSF ] NET:CLOSE ssh relay closed (channel_id=5dfb4634-..., ...)

# Doctor (misleading channel state vs. runtime):
Sandbox:    [ok] Live sandbox: my-hermes present (Ready)
            [ok] Agent version: Hermes Agent v2026.4.23
Messaging:  [ok] Channels: discord, slack, telegram enabled; no recent conflict signatures
Gateway:    [fail] Docker container: openshell-cluster-nemoclaw not found
            [ok] OpenShell status: connected to nemoclaw

# OpenShell gateway log — no nemoclaw-start spawn/exec record, only
# GetSandboxConfig polling every ~10s and periodic connection errors from
# the sandbox container (172.18.0.2). Never any hermes-related grpc traffic.

NVB#6195131

Metadata

Labels

  • NV QA (Bugs found by the NVIDIA QA Team)
  • UAT (Issues flagged for User Acceptance Testing.)
  • integration: hermes (Hermes integration behavior)
  • platform: ubuntu (Affects Ubuntu Linux environments)
  • v0.0.50 (Release target)
