fix(hermes): restore Spark GPU recreate startup by ericksoa · Pull Request #3963 · NVIDIA/NemoClaw

ericksoa · 2026-05-21T02:37:20Z

Summary

preserve NemoClaw's env ... nemoclaw-start sandbox command when the Docker GPU patch recreates an OpenShell-managed container
clear stale Hermes runtime/gateway.pid / runtime/gateway.lock state only when no Hermes gateway process is alive
remove orphaned Hermes socat forwarders before launching a fresh gateway

Evidence

Senthil's Spark debug bundle showed the recreated container running /opt/openshell/bin/openshell-sandbox with sleep infinity, while the image entrypoint was /usr/local/bin/nemoclaw-start. A manual nemoclaw-start then failed with "PID file race lost to another gateway instance" because of stale Hermes runtime lock state.

Tests

npm run build:cli
bash -n agents/hermes/start.sh
npx vitest run src/lib/onboard/docker-gpu-patch.test.ts test/hermes-start.test.ts

Summary by CodeRabbit

Release Notes

New Features
- Added support for configuring custom sandbox startup commands in GPU-enabled environments via the OPENSHELL_SANDBOX_COMMAND environment variable.
- Improved Hermes gateway runtime cleanup to automatically remove stale process files and orphaned port-forwarder processes during startup.
Tests
- Added test coverage for gateway runtime cleanup scenarios and GPU sandbox command configuration behavior.

coderabbitai · 2026-05-21T02:37:31Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c40e8024-9562-492c-9e49-f9e25c9e0221

📥 Commits

Reviewing files that changed from the base of the PR and between dc63189 and 92d38d3.

📒 Files selected for processing (6)

agents/hermes/start.sh
src/lib/onboard.ts
src/lib/onboard/docker-gpu-patch.test.ts
src/lib/onboard/docker-gpu-patch.ts
src/lib/onboard/docker-gpu-sandbox-create.ts
test/hermes-start.test.ts

📝 Walkthrough

Walkthrough

This PR adds Hermes gateway runtime cleanup logic to detect and safely remove stale gateway artifacts, and introduces configurability for OpenShell sandbox startup commands through environment-variable injection in Docker GPU patch creation and recreation flows.

Changes

Hermes gateway runtime cleanup

Layer / File(s)	Summary
Hermes cleanup helpers and entrypoint integration `agents/hermes/start.sh`	Adds `cmdline_is_hermes_gateway()`, `has_live_hermes_gateway()`, `cleanup_orphan_socat_forwarders()`, `remove_stale_gateway_file()`, and `cleanup_stale_hermes_gateway_runtime()` helpers; calls cleanup routine in both non-root and root execution paths after `retry_tirith_marker_if_needed` and before gateway startup.
Hermes gateway cleanup test harness and coverage `test/hermes-start.test.ts`	Provides `writeFakeProcCmdline` helper and `runHermesGatewayRuntimeCleanup` test harness to simulate stale gateway state; verifies stale PID/lock removal with legacy symlink preservation, orphan socat termination, and runtime preservation when live gateway is detected.
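
The cleanup rule summarized in the table above (remove stale PID/lock files only when no Hermes gateway is alive) can be sketched as follows. This is a minimal TypeScript illustration, not the shell code in `agents/hermes/start.sh`: the function names, the `runtimeDir` parameter, and the choice to skip anything that is not a regular file are assumptions made for the example, and the orphan socat cleanup is not modeled.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical matcher: a space-joined argv counts as a live gateway when it
// contains "hermes gateway run" as a whole command (bare or path-prefixed).
function looksLikeHermesGateway(cmdline: string): boolean {
  const padded = ` ${cmdline.trim()} `;
  return (
    padded.includes("/hermes gateway run ") ||
    padded.includes(" hermes gateway run ")
  );
}

// Removes gateway.pid and gateway.lock from runtimeDir only when none of the
// given process command lines looks like a running Hermes gateway.
// Returns the names of the files that were removed.
function cleanupStaleGatewayRuntime(
  runtimeDir: string,
  cmdlines: string[],
): string[] {
  if (cmdlines.some(looksLikeHermesGateway)) {
    return []; // a gateway appears to be alive: leave its state alone
  }
  const removed: string[] = [];
  for (const name of ["gateway.pid", "gateway.lock"]) {
    const file = path.join(runtimeDir, name);
    const stat = fs.lstatSync(file, { throwIfNoEntry: false });
    // Assumption for this sketch: only regular files are removed, so
    // symlinks and directories are left untouched.
    if (!stat || !stat.isFile()) continue;
    fs.rmSync(file);
    removed.push(name);
  }
  return removed;
}
```

Passing the process list in as data keeps the decision testable without a real `/proc`, which is similar in spirit to the fake-cmdline harness the PR describes for `test/hermes-start.test.ts`.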

OpenShell sandbox command configurability

Layer / File(s)	Summary
Docker GPU clone options type and helpers `src/lib/onboard/docker-gpu-patch.ts`	Introduces `OPENSHELL_SANDBOX_COMMAND` constant, extends `DockerGpuCloneRunOptions` with optional `openshellSandboxCommand` field, and adds helper to derive env value from command array.
Docker GPU clone args builder implementation `src/lib/onboard/docker-gpu-patch.ts`	Reworks `buildDockerGpuCloneRunArgs` environment and command handling to inject/replace `OPENSHELL_SANDBOX_COMMAND` env var and use provided sandbox command instead of inspected entrypoint + original Cmd.
Docker GPU sandbox recreate and create patch APIs `src/lib/onboard/docker-gpu-patch.ts`, `src/lib/onboard/docker-gpu-sandbox-create.ts`	Extends `recreateOpenShellDockerSandboxWithGpu` and `createDockerGpuSandboxCreatePatch` to accept and thread optional `openshellSandboxCommand` parameter through clone options.
Onboard sandbox command construction and wiring `src/lib/onboard.ts`	Extracts `sandboxStartupCommand` wrapper and passes computed command to GPU sandbox create patch initialization.
Docker GPU patch test coverage `src/lib/onboard/docker-gpu-patch.test.ts`	Updates fixture to include `OPENSHELL_SANDBOX_COMMAND=sleep infinity` baseline; adds tests verifying command replacement during clone args building and env/command injection; extends sandbox recreation test to assert composed env and command arguments.
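
The inject/replace behavior described for `buildDockerGpuCloneRunArgs` can be illustrated roughly as below. This is a simplified sketch under stated assumptions: `CloneInputs` and `buildEnvAndCommand` are hypothetical names, the real function builds a full `docker run` argument list, and the space join mirrors only what the review comments in this thread report about the env value.

```typescript
const OPENSHELL_SANDBOX_COMMAND = "OPENSHELL_SANDBOX_COMMAND";

// Hypothetical, reduced view of what the clone step knows about a container.
interface CloneInputs {
  env: string[]; // "KEY=value" entries from the inspected container
  cmd: string[]; // original command of the inspected container
  sandboxCommand?: string[]; // optional startup command to preserve
}

// Returns the env entries and trailing command for the recreated container.
// With a sandbox command, any existing OPENSHELL_SANDBOX_COMMAND entry is
// replaced and the provided command is used; otherwise nothing changes.
function buildEnvAndCommand(inputs: CloneInputs): {
  env: string[];
  command: string[];
} {
  const { env, cmd, sandboxCommand } = inputs;
  if (!sandboxCommand || sandboxCommand.length === 0) {
    return { env: [...env], command: [...cmd] };
  }
  const prefix = `${OPENSHELL_SANDBOX_COMMAND}=`;
  const withoutOld = env.filter((entry) => !entry.startsWith(prefix));
  return {
    env: [...withoutOld, `${prefix}${sandboxCommand.join(" ")}`],
    command: [...sandboxCommand],
  };
}
```

Filtering before appending keeps exactly one `OPENSHELL_SANDBOX_COMMAND` entry, which is the property the updated fixture (`OPENSHELL_SANDBOX_COMMAND=sleep infinity`) appears to exercise.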

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

NVIDIA/NemoClaw#3797: Both PRs modify agents/hermes/start.sh startup flow by building around Hermes/Tirith marker handling (retry_tirith_marker_if_needed); the main PR runs its Hermes gateway stale-runtime cleanup immediately after that marker retry step.
NVIDIA/NemoClaw#3515: Both PRs modify src/lib/onboard/docker-gpu-patch.ts GPU docker clone run args construction logic; one adds OPENSHELL_SANDBOX_COMMAND command/env injection while the other injects security/capability flags.
NVIDIA/NemoClaw#3434: Both PRs modify the Docker GPU sandbox onboarding flow in src/lib/onboard.ts and GPU patch modules, threading extra GPU/sandbox-create options into sandbox recreation/polling logic.

Suggested labels

Integration: Hermes, Sandbox, OpenShell

Suggested reviewers

cv
jyaunches

Poem

🐰 A gateway once left in a stale, tangled mess,
Now cleaned with grace, removing distress!
And sandbox commands, once rigid and bound,
Now flow like clear streams—injection profound! 🌊✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 5.56% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title accurately summarizes the main change: restoring the Spark GPU recreate startup flow by preserving NemoClaw's sandbox command during Docker GPU patch recreation.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.

Comment @coderabbitai help to get the list of available commands and usage tips.

github-actions · 2026-05-21T02:39:06Z

E2E Advisor Recommendation

Required E2E: hermes-e2e, gpu-e2e
Optional E2E: gpu-double-onboard-e2e, rebuild-hermes-e2e, hermes-onboard-security-posture-e2e

Dispatch hint: hermes-e2e,gpu-e2e

Auto-dispatched E2E: hermes-e2e, gpu-e2e via nightly-e2e.yaml at 92d38d3a7817c3afb734d68f98656ec5ed0500c0 — nightly run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

hermes-e2e (high): Covers the real Hermes install/onboard flow, Hermes sandbox gateway health, and live inference. Required because the PR changes agents/hermes/start.sh gateway startup cleanup and the onboarding path that launches Hermes sandboxes.
gpu-e2e (high): Covers the real GPU onboarding path with local Ollama, Docker sandbox creation/recreation, and end-to-end inference through the sandbox. Required because the PR changes Docker GPU patch command propagation during sandbox create, which can decide whether the recreated OpenShell container runs nemoclaw-start instead of an idle command.

Optional E2E

gpu-double-onboard-e2e (high): Useful adjacent coverage for repeated GPU onboarding/recreation and post-reonboard inference. It is not strictly merge-blocking if gpu-e2e passes, but it can catch state or command propagation regressions during a second onboard.
rebuild-hermes-e2e (high): Optional confidence for Hermes lifecycle after sandbox recreation/rebuild, which is adjacent to stale gateway runtime lock cleanup in agents/hermes/start.sh.
hermes-onboard-security-posture-e2e (high): Optional because Hermes start.sh changed process and file cleanup around runtime state. This job validates the full Hermes onboard path with security-posture assertions, but the specific stale lock/GPU recreation behavior is better covered by a targeted future test.

New E2E recommendations

Hermes GPU Docker recreation (high): Existing GPU E2E covers OpenClaw/local Ollama and existing Hermes E2E covers CPU/cloud Hermes, but there is no combined E2E that recreates a Hermes OpenShell Docker sandbox with GPU access and verifies stale gateway.lock cleanup plus Hermes gateway/inference after recreation.
- Suggested test: Add a Hermes GPU sandbox recreation E2E that onboards Hermes with a GPU-enabled Docker OpenShell sandbox, forces the Docker GPU patch/recreate path, verifies OPENSHELL_SANDBOX_COMMAND runs nemoclaw-start, confirms stale gateway.pid/gateway.lock cleanup, and performs Hermes health plus inference.

Dispatch hint

Workflow: .github/workflows/nightly-e2e.yaml
jobs input: hermes-e2e,gpu-e2e

github-actions · 2026-05-21T02:39:48Z

PR Review Advisor

Recommendation: blocked
Confidence: high
Analyzed HEAD: 92d38d3a7817c3afb734d68f98656ec5ed0500c0
Findings: 4 blocker(s), 2 warning(s), 1 suggestion(s)

This is an automated advisory review. A human maintainer must make the final merge decision.

Limitations: Git diff in the prompt was truncated; repository reads were used for the main changed implementation sections, but not every unchanged surrounding line was re-reviewed. No PR scripts, package-manager commands, Docker commands, or tests were executed. Review thread state was unavailable beyond the provided GraphQL nodes; CodeRabbit was still pending. No linked issues were present in trusted context, so acceptance mapping used the PR body clauses and test claims as untrusted evidence. E2E Advisor comments were absent; only the in-progress E2E recommendation check run was available.

Full advisor summary

PR Review Advisor

Base: origin/main
Head: HEAD
Analyzed SHA: 92d38d3a7817c3afb734d68f98656ec5ed0500c0
Recommendation: blocked
Confidence: high

The patch targets real active sandbox/Hermes GPU startup paths, but merge is blocked by pending CI/E2E, mergeStateStatus=BLOCKED, CodeRabbit still pending, large-file budget blockers, and a security concern around serializing sensitive startup env into OPENSHELL_SANDBOX_COMMAND.

Gate status

CI: pending — Head SHA 92d38d3 has pending/in-progress/queued contexts including cli-parity, E2E recommendation, wsl-e2e, macos-e2e, PR review advisor, CodeQL javascript-typescript, CodeQL python, unit-vitest-linux, checks, ShellCheck SARIF, build-sandbox-images, build-sandbox-images-arm64, and CodeRabbit.
Mergeability: fail — GitHub GraphQL reports mergeStateStatus=BLOCKED and reviewDecision=REVIEW_REQUIRED for PR fix(hermes): restore Spark GPU recreate startup #3963.
Review threads: unknown — GraphQL reviewThreads.nodes is empty, but trusted context says no review thread state was available; CodeRabbit issue comment says review is in progress and CodeRabbit status is PENDING.
Risky code tested: warning — Risky areas detected: onboarding/host glue. Unit tests were added for docker GPU patch command preservation and Hermes runtime cleanup, but runtime/sandbox/infrastructure behavior requires E2E validation for this head SHA.

🔴 Blockers

Required CI and review automation have not completed for the head SHA: The latest head SHA has many in-progress or queued contexts, and CodeRabbit is still pending. Mergeability is also blocked.
- Recommendation: Wait for the required checks and review automation to complete successfully for 92d38d3 before considering merge.
- Evidence: GraphQL statusCheckRollup shows IN_PROGRESS/QUEUED contexts including E2E recommendation, wsl-e2e, macos-e2e, CodeQL, unit-vitest-linux, ShellCheck SARIF, build-sandbox-images, and CodeRabbit=PENDING; mergeStateStatus=BLOCKED.
Runtime and sandbox lifecycle changes need E2E confirmation: The PR changes Hermes entrypoint startup, stale runtime cleanup, socat cleanup, Docker GPU container recreation, and OpenShell sandbox command propagation. Unit tests exercise extracted helpers, but cannot prove real OpenShell/Docker/Hermes supervisor reconnect behavior.
- Recommendation: Require the E2E Advisor recommendation to finish and ensure its required E2E jobs pass for this exact head SHA, especially paths covering Docker GPU recreate and Hermes startup.
- Evidence: Changed files include agents/hermes/start.sh, src/lib/onboard.ts, src/lib/onboard/docker-gpu-patch.ts, and src/lib/onboard/docker-gpu-sandbox-create.ts. E2E recommendation, wsl-e2e, and macos-e2e are IN_PROGRESS.
Large-file hotspot grew past the monolith budget (src/lib/onboard/docker-gpu-patch.ts:251): The Docker GPU patch implementation is already a large-file hotspot and grew by 25 lines in this PR.
- Recommendation: Extract the new OpenShell sandbox command serialization/recreate plumbing into a smaller helper module or offset the growth before merge, consistent with the repository's monolith budget policy.
- Evidence: Trusted monolithDeltas reports src/lib/onboard/docker-gpu-patch.ts baseLines=1178, headLines=1203, delta=25, severity=blocker.
Large test hotspot grew past the monolith budget (src/lib/onboard/docker-gpu-patch.test.ts:1): The Docker GPU patch test file is already a large-file hotspot and grew by 63 lines in this PR.
- Recommendation: Move the new OpenShell sandbox command preservation tests into a focused test file or offset the growth before merge.
- Evidence: Trusted monolithDeltas reports src/lib/onboard/docker-gpu-patch.test.ts baseLines=606, headLines=669, delta=63, severity=blocker.

🟡 Warnings

Sandbox startup command is serialized as a space-joined env value (src/lib/onboard/docker-gpu-patch.ts:251): openshellSandboxCommandEnvValue joins command parts with a plain space and writes the result into OPENSHELL_SANDBOX_COMMAND. The command can include env assignments derived from runtime credentials such as TOOL_GATEWAY_USER_TOKEN and BRAVE_API_KEY. Depending on how OpenShell later parses OPENSHELL_SANDBOX_COMMAND, this can expose secrets through Docker inspect/environment metadata and can break or become unsafe for values containing spaces or shell metacharacters.
- Recommendation: Avoid embedding sensitive env assignments in OPENSHELL_SANDBOX_COMMAND, or encode the command with a structured/quoted representation that OpenShell parses without shell interpretation. Add negative tests for token/API-key values containing spaces, quotes, semicolons, and dollar signs, and assert sensitive values are not exposed in diagnostics or persisted environment when not strictly required.
- Evidence: src/lib/onboard.ts line 5501 builds sandboxStartupCommand from envArgs; lines 5480-5488 can include TOOL_GATEWAY_USER_TOKEN and BRAVE_API_KEY. docker-gpu-patch.ts line 251 joins command parts with spaces, and lines 459-467 write OPENSHELL_SANDBOX_COMMAND= into docker run --env.
Patch overlaps with many active PRs touching the same high-risk files: The changed files still exist and the patch is not obviously stale, but there is substantial active work on the same files, including security/sandbox and onboard refactor PRs. This increases drift and conflict risk.
- Recommendation: Before merge, re-check against the latest main and coordinate with overlapping PRs, especially security-related Hermes startup and onboard extraction work.
- Evidence: Open PR overlaps include fix(sandbox): harden startup log creation #3888 touching agents/hermes/start.sh, fix(onboard): scan default CDI dirs for NVIDIA specs (#3575) #3675 touching docker-gpu-patch.ts and docker-gpu-patch.test.ts, and numerous PRs touching src/lib/onboard.ts.
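
The serialization warning above (a space-joined `OPENSHELL_SANDBOX_COMMAND`) is easier to see with a small round-trip experiment. The sketch below is illustrative only: it does not reproduce `openshellSandboxCommandEnvValue`, and the JSON encoding is just one possible structured representation. Whether OpenShell could parse such a value is an assumption this example does not verify.

```typescript
// Naive encoding as described in the warning: parts joined by single spaces.
function joinWithSpaces(parts: string[]): string {
  return parts.join(" ");
}

// The only decoder a space join permits; it cannot tell separators from data.
function splitOnSpaces(value: string): string[] {
  return value.split(" ");
}

// Hypothetical structured alternative: a JSON array keeps argument boundaries.
function encodeStructured(parts: string[]): string {
  return JSON.stringify(parts);
}

function decodeStructured(value: string): string[] {
  const parsed: unknown = JSON.parse(value);
  if (!Array.isArray(parsed) || !parsed.every((p) => typeof p === "string")) {
    throw new Error("expected a JSON array of strings");
  }
  return parsed as string[];
}

// A value with spaces and shell metacharacters, as the review suggests testing.
const sampleCommand = [
  "env",
  "GREETING=hello world; echo $HOME",
  "nemoclaw-start",
];

// Three arguments go in, six come back out of the space-joined form.
const lossy = splitOnSpaces(joinWithSpaces(sampleCommand));

// The structured form returns the original three arguments unchanged.
const lossless = decodeStructured(encodeStructured(sampleCommand));
```

Note that a structured encoding addresses only the parsing ambiguity. It does nothing for the other half of the warning, namely that credentials placed in the command would still be visible in container environment metadata.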

🔵 Suggestions

Add edge-case tests for Hermes gateway process detection (agents/hermes/start.sh:159): The cleanup logic preserves stale runtime files if any cmdline contains a Hermes gateway run pattern. This is intentionally conservative, but false positives could preserve stale locks indefinitely.
- Recommendation: Add tests for non-gateway processes whose arguments contain the text 'hermes gateway run', and for wrapped shell/step-down command lines used by the root path, to document the intended matching behavior.
- Evidence: cmdline_is_hermes_gateway matches '"/hermes gateway run "' or '" hermes gateway run "' in agents/hermes/start.sh.
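
The edge cases this suggestion asks for can be laid out as a small table-driven check. The matcher below is a TypeScript approximation of the two substring patterns quoted in the evidence line, not the shell function itself, and the sample command lines are made up for illustration.

```typescript
// Approximation of the quoted patterns: the space-padded command line must
// contain "/hermes gateway run " or " hermes gateway run ".
function cmdlineIsHermesGateway(cmdline: string): boolean {
  const padded = ` ${cmdline.trim()} `;
  return (
    padded.includes("/hermes gateway run ") ||
    padded.includes(" hermes gateway run ")
  );
}

interface MatchCase {
  cmdline: string;
  note: string;
}

// Illustrative inputs; none are taken from a real process listing.
const matchCases: MatchCase[] = [
  { cmdline: "/usr/local/bin/hermes gateway run", note: "path-prefixed gateway" },
  { cmdline: "bash -c exec hermes gateway run", note: "wrapped shell invocation" },
  { cmdline: "grep hermes gateway run /tmp/notes.txt", note: "unrelated process whose args contain the text" },
  { cmdline: "hermes gateway runner", note: "longer word after the pattern" },
  { cmdline: "socat TCP-LISTEN:8642,fork TCP:127.0.0.1:18642", note: "port forwarder" },
];

// Evaluate every case so the matching behavior is documented in one place.
const matchResults = matchCases.map((c) => ({
  ...c,
  matched: cmdlineIsHermesGateway(c.cmdline),
}));
```

Under these patterns the `grep` line matches even though it is not a gateway, which is the conservative false positive the suggestion describes: stale files would be preserved rather than deleted.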

Acceptance coverage

partial — preserve NemoClaw's env ... nemoclaw-start sandbox command when the Docker GPU patch recreates an OpenShell-managed container: src/lib/onboard.ts now builds sandboxStartupCommand=["env", ...envArgs, "nemoclaw-start"] and passes it to createDockerGpuSandboxCreatePatch. docker-gpu-patch.ts uses options.openshellSandboxCommand as both OPENSHELL_SANDBOX_COMMAND and final docker command args. Unit tests assert replacement of 'sleep infinity'. Remaining concern: serialization is a plain space join and may mishandle or expose sensitive env values.
met — clear stale Hermes runtime/gateway.pid / runtime/gateway.lock state only when no Hermes gateway process is alive: agents/hermes/start.sh adds has_live_hermes_gateway and cleanup_stale_hermes_gateway_runtime, called before gateway launch in both non-root and root paths. test/hermes-start.test.ts covers stale pid/lock removal and preservation when a live gateway cmdline is present.
met — remove orphaned Hermes socat forwarders before launching a fresh gateway: agents/hermes/start.sh adds cleanup_orphan_socat_forwarders and calls it from cleanup_stale_hermes_gateway_runtime before start_socat_forwarder. test/hermes-start.test.ts covers killing a fake orphan socat when no live Hermes gateway exists.
unknown — npm run build:cli: The PR body claims this was run, but trusted CI for the head SHA is still pending and this review did not execute package-manager commands.
unknown — bash -n agents/hermes/start.sh: The PR body claims this was run, but ShellCheck SARIF is still IN_PROGRESS and this review did not execute PR scripts.
unknown — npx vitest run src/lib/onboard/docker-gpu-patch.test.ts test/hermes-start.test.ts: The PR body claims this was run and relevant tests were added, but unit-vitest-linux is QUEUED for the head SHA and this review did not run tests.

Security review

warning — 1. Secrets and Credentials: No new hardcoded secret literals were found. However, the new OPENSHELL_SANDBOX_COMMAND value can be built from envArgs that include TOOL_GATEWAY_USER_TOKEN and BRAVE_API_KEY, causing runtime credentials to be persisted in Docker env/config metadata during GPU recreate.
warning — 2. Input Validation and Data Sanitization: The sandbox command is serialized with parts.join(' ') without escaping or structured encoding. Inputs such as CHAT_UI_URL, proxy env, broker token, and Brave API key are not all constrained to shell-safe/no-space values at this serialization boundary.
pass — 3. Authentication and Authorization: No new endpoints or authorization decisions are introduced. The change preserves existing OpenShell/Hermes startup and credential-rewrite boundaries, subject to the credential exposure warning above.
pass — 4. Dependencies and Third-Party Libraries: No new package dependencies, installers, registries, or version pins are added in the shown diff.
pass — 5. Error Handling and Logging: New logs report stale PID/lock cleanup and socat PIDs without directly printing secrets. Existing diagnostic sanitization tests remain present.
pass — 6. Cryptography and Data Protection: Not applicable — no cryptographic operations or algorithms are introduced or modified.
warning — 7. Configuration and Security Headers: The changed path recreates Docker sandbox containers and continues to add SYS_PTRACE and potentially apparmor=unconfined for GPU support. That behavior appears pre-existing and tested, but it remains high-risk sandbox configuration that needs E2E validation.
warning — 8. Security Testing: Tests cover stale file cleanup, symlink preservation, orphan socat cleanup, and command replacement. Missing negative tests for sensitive startup env exposure and shell/space/metacharacter handling in OPENSHELL_SANDBOX_COMMAND. E2E runtime validation is pending.
warning — 9. Holistic Security Posture: The PR addresses a real sandbox startup regression and avoids deleting locks when a live gateway is detected, but it changes high-risk sandbox lifecycle and credential-adjacent command propagation while CI/E2E and CodeRabbit remain pending.

Test / E2E status

Test depth: e2e_required — Runtime/sandbox/infrastructure paths need real execution coverage: agents/hermes/start.sh, src/lib/onboard.ts, src/lib/onboard/docker-gpu-patch.ts, and src/lib/onboard/docker-gpu-sandbox-create.ts. Unit tests validate helper logic but not OpenShell supervisor reconnect, Docker inspect/env persistence, actual GPU recreate, or Hermes gateway startup.
E2E Advisor: missing
Required E2E jobs: E2E recommendation, wsl-e2e, macos-e2e
Missing for analyzed SHA: E2E recommendation, wsl-e2e, macos-e2e

✅ What looks good

The PR targets files that still exist and directly addresses the reported Spark GPU recreate startup path.
The Docker GPU recreate path now has focused unit coverage for replacing the idle OpenShell sandbox command and adding the command env when absent.
Hermes runtime cleanup tests cover stale pid/lock removal, orphan socat cleanup, and preserving state when a live gateway is present.
The cleanup logic treats legacy Hermes PID symlinks carefully, and existing coverage still guards against reading unsafe Tirith marker symlinks.

Review completeness

Git diff in the prompt was truncated; repository reads were used for the main changed implementation sections, but not every unchanged surrounding line was re-reviewed.
No PR scripts, package-manager commands, Docker commands, or tests were executed.
Review thread state was unavailable beyond the provided GraphQL nodes; CodeRabbit was still pending.
No linked issues were present in trusted context, so acceptance mapping used the PR body clauses and test claims as untrusted evidence.
E2E Advisor comments were absent; only the in-progress E2E recommendation check run was available.
Human maintainer review required: yes

github-actions · 2026-05-21T02:42:28Z

Selective E2E Results — ✅ All requested jobs passed

Run: 26202173538
Target ref: 92d38d3a7817c3afb734d68f98656ec5ed0500c0
Workflow ref: main
Requested jobs: hermes-e2e,gpu-e2e
Summary: 1 passed, 0 failed, 1 skipped

Job	Result
gpu-e2e	⏭️ skipped
hermes-e2e	✅ success

github-actions · 2026-05-21T04:00:44Z

Selective E2E Results — ❌ Some jobs failed

Run: 26202896549
Target ref: 92d38d3a7817c3afb734d68f98656ec5ed0500c0
Workflow ref: main
Requested jobs: all (no filter)
Summary: 46 passed, 1 failed, 2 skipped

Job	Result
bedrock-runtime-compatible-anthropic-e2e	✅ success
brave-search-e2e	✅ success
channels-add-remove-e2e	✅ success
channels-stop-start-e2e	✅ success
cloud-e2e	✅ success
cloud-inference-e2e	✅ success
cloud-onboard-e2e	✅ success
credential-migration-e2e	✅ success
credential-sanitization-e2e	✅ success
device-auth-health-e2e	✅ success
diagnostics-e2e	✅ success
docs-validation-e2e	✅ success
double-onboard-e2e	✅ success
gpu-double-onboard-e2e	⏭️ skipped
gpu-e2e	⏭️ skipped
hermes-discord-e2e	✅ success
hermes-e2e	✅ success
hermes-inference-switch-e2e	✅ success
hermes-onboard-security-posture-e2e	✅ success
hermes-slack-e2e	✅ success
inference-routing-e2e	✅ success
issue-2478-crash-loop-recovery-e2e	✅ success
kimi-inference-compat-e2e	✅ success
launchable-smoke-e2e	✅ success
messaging-compatible-endpoint-e2e	✅ success
messaging-providers-e2e	✅ success
network-policy-e2e	✅ success
onboard-negative-paths-e2e	✅ success
onboard-repair-e2e	✅ success
onboard-resume-e2e	✅ success
openclaw-inference-switch-e2e	❌ failure
openclaw-onboard-security-posture-e2e	✅ success
openclaw-slack-pairing-e2e	✅ success
openshell-gateway-upgrade-e2e	✅ success
overlayfs-autofix-e2e	✅ success
rebuild-hermes-e2e	✅ success
rebuild-hermes-stale-base-e2e	✅ success
rebuild-openclaw-e2e	✅ success
runtime-overrides-e2e	✅ success
sandbox-operations-e2e	✅ success
sandbox-survival-e2e	✅ success
shields-config-e2e	✅ success
skill-agent-e2e	✅ success
snapshot-commands-e2e	✅ success
state-backup-restore-e2e	✅ success
telegram-injection-e2e	✅ success
token-rotation-e2e	✅ success
tunnel-lifecycle-e2e	✅ success
upgrade-stale-sandbox-e2e	✅ success

Failed jobs: openclaw-inference-switch-e2e. Check run artifacts for logs.

senthilr-nv · 2026-05-21T04:10:44Z

Verified on DGX Spark (aarch64, NVIDIA GB10, 122 GB) — fresh setup from main + checkout of fix/spark-hermes-gpu-recreate @ 92d38d3.

Onboard

nemoclaw onboard with NEMOCLAW_AGENT=hermes: 44s end-to-end, no manual intervention.
Previously (main): timed out at 90s waiting for Hermes Agent gateway; required manual nemoclaw-start invocation inside the container.

Bug 1 — GPU recreate drops ENTRYPOINT: fixed
After "Recreating OpenShell Docker sandbox container with NVIDIA GPU access", container now runs:

Entrypoint: ["/opt/openshell/bin/openshell-sandbox"]
Cmd: ["env","CHAT_UI_URL=http://127.0.0.1:8642","NEMOCLAW_DASHBOARD_PORT=8642","nemoclaw-start"]

ps -ef confirms bash /usr/local/bin/nemoclaw-start (PID 60) and hermes gateway run (PID 148) under the openshell-sandbox supervisor.
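
A check like the one above can also be scripted against `docker inspect` output, which is a JSON array whose elements carry `Config.Entrypoint` and `Config.Cmd`. The following is a minimal sketch that works on an already-captured JSON string; the function names are hypothetical and it does not invoke Docker.

```typescript
// Subset of the docker inspect container document used by this sketch.
interface InspectedContainer {
  Config?: {
    Entrypoint?: string[] | null;
    Cmd?: string[] | null;
  };
}

// Extracts Config.Cmd from the first container in a docker inspect payload.
function sandboxCommandFromInspect(inspectJson: string): string[] {
  const parsed = JSON.parse(inspectJson) as InspectedContainer[];
  const first = Array.isArray(parsed) ? parsed[0] : undefined;
  return first?.Config?.Cmd ?? [];
}

// True when the container's command ends with nemoclaw-start, as in the
// Cmd shown above; false for an idle command such as "sleep infinity".
function runsNemoclawStart(inspectJson: string): boolean {
  const cmd = sandboxCommandFromInspect(inspectJson);
  return cmd.length > 0 && cmd[cmd.length - 1] === "nemoclaw-start";
}
```

Comparing only the last element tolerates any number of `env KEY=value` arguments in front of `nemoclaw-start`.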

Bug 2 — Stale PID file kills fresh Hermes: fixed

/sandbox/.hermes/gateway.pid is a regular file (115 B), not a broken symlink.
No "PID file race lost to another gateway instance" error in gateway.log.
Hermes started cleanly on first try.

First-token inference

/v1/models → returns hermes-agent.
/v1/chat/completions (streaming, short prompt): 7.66s end-to-end.
/v1/chat/completions (non-streaming, longer reasoning): 76s, 2232 completion tokens.

LGTM from Spark validation. Ready to merge.

github-actions · 2026-05-21T04:13:20Z

Selective E2E Results — ✅ All requested jobs passed

Run: 26202896549
Target ref: 92d38d3a7817c3afb734d68f98656ec5ed0500c0
Workflow ref: main
Requested jobs: all (no filter)
Summary: 47 passed, 0 failed, 2 skipped

Job	Result
bedrock-runtime-compatible-anthropic-e2e	✅ success
brave-search-e2e	✅ success
channels-add-remove-e2e	✅ success
channels-stop-start-e2e	✅ success
cloud-e2e	✅ success
cloud-inference-e2e	✅ success
cloud-onboard-e2e	✅ success
credential-migration-e2e	✅ success
credential-sanitization-e2e	✅ success
device-auth-health-e2e	✅ success
diagnostics-e2e	✅ success
docs-validation-e2e	✅ success
double-onboard-e2e	✅ success
gpu-double-onboard-e2e	⏭️ skipped
gpu-e2e	⏭️ skipped
hermes-discord-e2e	✅ success
hermes-e2e	✅ success
hermes-inference-switch-e2e	✅ success
hermes-onboard-security-posture-e2e	✅ success
hermes-slack-e2e	✅ success
inference-routing-e2e	✅ success
issue-2478-crash-loop-recovery-e2e	✅ success
kimi-inference-compat-e2e	✅ success
launchable-smoke-e2e	✅ success
messaging-compatible-endpoint-e2e	✅ success
messaging-providers-e2e	✅ success
network-policy-e2e	✅ success
onboard-negative-paths-e2e	✅ success
onboard-repair-e2e	✅ success
onboard-resume-e2e	✅ success
openclaw-inference-switch-e2e	✅ success
openclaw-onboard-security-posture-e2e	✅ success
openclaw-slack-pairing-e2e	✅ success
openshell-gateway-upgrade-e2e	✅ success
overlayfs-autofix-e2e	✅ success
rebuild-hermes-e2e	✅ success
rebuild-hermes-stale-base-e2e	✅ success
rebuild-openclaw-e2e	✅ success
runtime-overrides-e2e	✅ success
sandbox-operations-e2e	✅ success
sandbox-survival-e2e	✅ success
shields-config-e2e	✅ success
skill-agent-e2e	✅ success
snapshot-commands-e2e	✅ success
state-backup-restore-e2e	✅ success
telegram-injection-e2e	✅ success
token-rotation-e2e	✅ success
tunnel-lifecycle-e2e	✅ success
upgrade-stale-sandbox-e2e	✅ success

## Summary

Refreshes NemoClaw release notes for v0.0.47 and v0.0.48, then regenerates the corresponding user-skill references so agent-facing docs match the source pages.

Preview: https://nvidia-preview-docs-release-notes-47-48.docs.buildwithfern.com/nemoclaw/about/release-notes

## Changes

- Adds explicit v0.0.47 and v0.0.48 sections to `docs/about/release-notes.mdx`.
- Documents follow-up WSL Ollama, sandbox image, share mount, and troubleshooting updates from recent release changes.
- Regenerates `nemoclaw-user-*` skill references from the Fern MDX source docs.

## Source Summary

- #4003 -> `docs/about/release-notes.mdx`: Notes the messaging manifest registry work as part of v0.0.48 release coverage.
- #3984 -> `docs/about/release-notes.mdx`: Captures Hermes messaging policy scoping in the v0.0.48 release notes.
- #3963 -> `docs/about/release-notes.mdx`: Captures DGX Spark Hermes GPU recreation startup recovery in the v0.0.48 release notes.
- #3961 -> `docs/about/release-notes.mdx`: Captures Discord loopback proxy routing in the v0.0.48 release notes.
- #3940 -> `docs/about/release-notes.mdx`: Captures installer prompt clarification and express-install behavior in the v0.0.48 release notes.
- #3946 -> `docs/about/release-notes.mdx`: Carries forward the Homebrew preinstall clarification in release coverage.
- #3937 -> `docs/about/release-notes.mdx`: Carries forward the dashboard URL command and post-install next steps coverage.
- #3921 -> `docs/about/release-notes.mdx`: Carries forward managed vLLM default behavior for DGX Spark and DGX Station.
- #3931 -> `docs/about/release-notes.mdx`, `docs/reference/architecture.mdx`: Documents the sandbox `python` to `python3` compatibility symlink.
- #1485 -> `docs/about/release-notes.mdx`, `docs/reference/architecture.mdx`: Documents the sandbox image Docker health check.
- #3784 -> `docs/about/release-notes.mdx`: Captures VM-driver snapshot health-check reliability in release notes.
- #3917 -> `docs/about/release-notes.mdx`: Captures package-based workspace template resolution in release notes.
- #3170 -> `docs/about/release-notes.mdx`: Captures installer checksum compatibility from preferring `sha256sum`.
- #3898 -> `docs/about/release-notes.mdx`: Adds v0.0.47 release coverage for messaging provider scenario validation.
- #3897 -> `docs/about/release-notes.mdx`: Adds v0.0.47 release coverage for baseline onboarding scenario validation.
- #3834 -> `docs/about/release-notes.mdx`: Adds v0.0.47 release coverage for PR review advisor automation.
- #3838 -> `docs/about/release-notes.mdx`: Adds v0.0.47 release coverage for CLI display registry refactoring.

## Type of Change

- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification

- [x] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

`make docs` was attempted but could not complete because `npx fern-api` failed with `403 Forbidden` from `https://registry.npmjs.org/fern-api` in this environment. Pre-commit and pre-push hooks passed after refreshing the local CLI build output with `npm run build:cli`; no build artifacts were committed.

---

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

## Summary by CodeRabbit

* **Documentation**
  * Added WSL onboarding notes for Windows-host Ollama detection, restart guidance, and PowerShell checks.
  * Clarified express-install behavior (non-interactive, sudo prompts) and default sandbox policy selection.
  * Added Windows preparation guidance when installer tooling is missing (winget/App Installer or Docker Desktop).
  * Expanded sandbox docs with Docker health checks, Homebrew/python compatibility helpers, share-mount path validation, Discord troubleshooting, and new v0.0.48/v0.0.47 release notes.
* **Chores**
  * Improved docs preview workflow error handling.

fix(hermes): restore Spark GPU recreate startup

92d38d3

ericksoa added the fix and v0.0.47 Release target labels on May 21, 2026

cv approved these changes May 21, 2026

ericksoa merged commit 449f6f4 into main May 21, 2026
31 checks passed

miyoungc mentioned this pull request May 21, 2026

docs: refresh release notes for v0.0.47 and v0.0.48 #4007

Merged

12 tasks

ericksoa mentioned this pull request May 22, 2026

[Nemoclaw] [All Platforms]Hermes onboarding: sandbox build succeeds but sandbox never reaches Ready and is deleted after 180s timeout #3764

Closed

jyaunches mentioned this pull request May 26, 2026

test(e2e): migrate platform and remote coverage to scenario suites #3816

Closed

wscurran added the bug-fix label (PR fixes a bug or regression) and removed the fix label on Jun 3, 2026
