
fix(hermes): restore Spark GPU recreate startup #3963

Merged
ericksoa merged 1 commit into main from fix/spark-hermes-gpu-recreate
May 21, 2026

@ericksoa ericksoa commented May 21, 2026

Contributor

Summary

  • preserve NemoClaw's env ... nemoclaw-start sandbox command when the Docker GPU patch recreates an OpenShell-managed container
  • clear stale Hermes runtime/gateway.pid / runtime/gateway.lock state only when no Hermes gateway process is alive
  • remove orphaned Hermes socat forwarders before launching a fresh gateway

Evidence

Senthil's Spark debug bundle showed the recreated container running /opt/openshell/bin/openshell-sandbox with sleep infinity, while the image entrypoint was /usr/local/bin/nemoclaw-start. A manual nemoclaw-start then failed with "PID file race lost to another gateway instance" because of stale Hermes runtime lock state.

Tests

  • npm run build:cli
  • bash -n agents/hermes/start.sh
  • npx vitest run src/lib/onboard/docker-gpu-patch.test.ts test/hermes-start.test.ts

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for configuring custom sandbox startup commands in GPU-enabled environments via the OPENSHELL_SANDBOX_COMMAND environment variable.
    • Improved Hermes gateway runtime cleanup to automatically remove stale process files and orphaned port-forwarder processes during startup.
  • Tests

    • Added test coverage for gateway runtime cleanup scenarios and GPU sandbox command configuration behavior.


@ericksoa ericksoa added fix v0.0.47 Release target labels May 21, 2026
@coderabbitai

coderabbitai Bot commented May 21, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c40e8024-9562-492c-9e49-f9e25c9e0221

📥 Commits

Reviewing files that changed from the base of the PR and between dc63189 and 92d38d3.

📒 Files selected for processing (6)
  • agents/hermes/start.sh
  • src/lib/onboard.ts
  • src/lib/onboard/docker-gpu-patch.test.ts
  • src/lib/onboard/docker-gpu-patch.ts
  • src/lib/onboard/docker-gpu-sandbox-create.ts
  • test/hermes-start.test.ts

📝 Walkthrough


This PR adds Hermes gateway runtime cleanup logic to detect and safely remove stale gateway artifacts, and introduces configurability for OpenShell sandbox startup commands through environment-variable injection in Docker GPU patch creation and recreation flows.

Changes

Hermes gateway runtime cleanup

| Layer / File(s) | Summary |
| --- | --- |
| Hermes cleanup helpers and entrypoint integration (agents/hermes/start.sh) | Adds cmdline_is_hermes_gateway(), has_live_hermes_gateway(), cleanup_orphan_socat_forwarders(), remove_stale_gateway_file(), and cleanup_stale_hermes_gateway_runtime() helpers; calls cleanup routine in both non-root and root execution paths after retry_tirith_marker_if_needed and before gateway startup. |
| Hermes gateway cleanup test harness and coverage (test/hermes-start.test.ts) | Provides writeFakeProcCmdline helper and runHermesGatewayRuntimeCleanup test harness to simulate stale gateway state; verifies stale PID/lock removal with legacy symlink preservation, orphan socat termination, and runtime preservation when live gateway is detected. |
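
The cleanup rule described above (remove stale runtime files and orphan socat forwarders only when no Hermes gateway is alive) can be modelled roughly as follows. This is a minimal TypeScript sketch of the decision only, not the shell code in agents/hermes/start.sh; `ProcessInfo`, `CleanupPlan`, and `planGatewayCleanup` are hypothetical names, and processes are assumed to be classified already.

```typescript
// Hypothetical model of the cleanup decision; the real helpers are shell
// functions in agents/hermes/start.sh and may differ in detail.
type ProcessKind = "hermes-gateway" | "socat-forwarder" | "other";

interface ProcessInfo {
  pid: number;
  kind: ProcessKind; // assumed to be classified by the caller
}

interface CleanupPlan {
  removeFiles: string[]; // stale runtime files to delete
  killPids: number[];    // orphan socat forwarders to terminate
}

function planGatewayCleanup(
  processes: ProcessInfo[],
  runtimeFiles: string[],
): CleanupPlan {
  // A live gateway owns its PID/lock files and forwarders: touch nothing.
  const gatewayAlive = processes.some((p) => p.kind === "hermes-gateway");
  if (gatewayAlive) {
    return { removeFiles: [], killPids: [] };
  }
  // No gateway: the runtime files are stale and any forwarder is orphaned.
  return {
    removeFiles: [...runtimeFiles],
    killPids: processes
      .filter((p) => p.kind === "socat-forwarder")
      .map((p) => p.pid),
  };
}
```

Returning a plan instead of deleting files directly keeps the conservative "do nothing while a gateway is alive" branch easy to unit test.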

OpenShell sandbox command configurability

| Layer / File(s) | Summary |
| --- | --- |
| Docker GPU clone options type and helpers (src/lib/onboard/docker-gpu-patch.ts) | Introduces OPENSHELL_SANDBOX_COMMAND constant, extends DockerGpuCloneRunOptions with optional openshellSandboxCommand field, and adds helper to derive env value from command array. |
| Docker GPU clone args builder implementation (src/lib/onboard/docker-gpu-patch.ts) | Reworks buildDockerGpuCloneRunArgs environment and command handling to inject/replace OPENSHELL_SANDBOX_COMMAND env var and use provided sandbox command instead of inspected entrypoint + original Cmd. |
| Docker GPU sandbox recreate and create patch APIs (src/lib/onboard/docker-gpu-patch.ts, src/lib/onboard/docker-gpu-sandbox-create.ts) | Extends recreateOpenShellDockerSandboxWithGpu and createDockerGpuSandboxCreatePatch to accept and thread optional openshellSandboxCommand parameter through clone options. |
| Onboard sandbox command construction and wiring (src/lib/onboard.ts) | Extracts sandboxStartupCommand wrapper and passes computed command to GPU sandbox create patch initialization. |
| Docker GPU patch test coverage (src/lib/onboard/docker-gpu-patch.test.ts) | Updates fixture to include OPENSHELL_SANDBOX_COMMAND=sleep infinity baseline; adds tests verifying command replacement during clone args building and env/command injection; extends sandbox recreation test to assert composed env and command arguments. |
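
The inject/replace behaviour summarised above can be illustrated with a small sketch. It assumes the env is a list of `KEY=value` strings and that the env value is the command joined with spaces (as the review advisor reports); `withSandboxCommand` and its parameters are hypothetical and are not the signature of buildDockerGpuCloneRunArgs.

```typescript
// Hypothetical sketch; only the OPENSHELL_SANDBOX_COMMAND name comes from the PR.
const OPENSHELL_SANDBOX_COMMAND = "OPENSHELL_SANDBOX_COMMAND";

function withSandboxCommand(
  inspectedEnv: string[],    // "KEY=value" entries from the old container
  fallbackCommand: string[], // command used when no override is given
  sandboxCommand?: string[], // optional override, e.g. env ... nemoclaw-start
): { env: string[]; command: string[] } {
  if (!sandboxCommand || sandboxCommand.length === 0) {
    // No override: keep the inspected environment and command as they were.
    return { env: [...inspectedEnv], command: [...fallbackCommand] };
  }
  const prefix = `${OPENSHELL_SANDBOX_COMMAND}=`;
  // Drop any existing entry (for example "sleep infinity"), then add ours.
  const env = inspectedEnv.filter((entry) => !entry.startsWith(prefix));
  env.push(prefix + sandboxCommand.join(" "));
  return { env, command: [...sandboxCommand] };
}
```

With the fixture baseline `OPENSHELL_SANDBOX_COMMAND=sleep infinity`, passing `["env", "CHAT_UI_URL=http://127.0.0.1:8642", "nemoclaw-start"]` replaces the idle entry and becomes the container command.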

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#3797: Both PRs modify agents/hermes/start.sh startup flow by building around Hermes/Tirith marker handling (retry_tirith_marker_if_needed); the main PR runs its Hermes gateway stale-runtime cleanup immediately after that marker retry step.
  • NVIDIA/NemoClaw#3515: Both PRs modify src/lib/onboard/docker-gpu-patch.ts GPU docker clone run args construction logic; one adds OPENSHELL_SANDBOX_COMMAND command/env injection while the other injects security/capability flags.
  • NVIDIA/NemoClaw#3434: Both PRs modify the Docker GPU sandbox onboarding flow in src/lib/onboard.ts and GPU patch modules, threading extra GPU/sandbox-create options into sandbox recreation/polling logic.

Suggested labels

Integration: Hermes, Sandbox, OpenShell

Suggested reviewers

  • cv
  • jyaunches

Poem

🐰 A gateway once left in a stale, tangled mess,
Now cleaned with grace, removing distress!
And sandbox commands, once rigid and bound,
Now flow like clear streams—injection profound! 🌊✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.56% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately summarizes the main change: restoring the Spark GPU recreate startup flow by preserving NemoClaw's sandbox command during Docker GPU patch recreation. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.



@github-actions

Contributor

E2E Advisor Recommendation

Required E2E: hermes-e2e, gpu-e2e
Optional E2E: gpu-double-onboard-e2e, rebuild-hermes-e2e, hermes-onboard-security-posture-e2e

Dispatch hint: hermes-e2e,gpu-e2e

Auto-dispatched E2E: hermes-e2e, gpu-e2e via nightly-e2e.yaml at 92d38d3a7817c3afb734d68f98656ec5ed0500c0 (nightly run)


Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • hermes-e2e (high): Covers the real Hermes install/onboard flow, Hermes sandbox gateway health, and live inference. Required because the PR changes agents/hermes/start.sh gateway startup cleanup and the onboarding path that launches Hermes sandboxes.
  • gpu-e2e (high): Covers the real GPU onboarding path with local Ollama, Docker sandbox creation/recreation, and end-to-end inference through the sandbox. Required because the PR changes Docker GPU patch command propagation during sandbox create, which can decide whether the recreated OpenShell container runs nemoclaw-start instead of an idle command.

Optional E2E

  • gpu-double-onboard-e2e (high): Useful adjacent coverage for repeated GPU onboarding/recreation and post-reonboard inference. It is not strictly merge-blocking if gpu-e2e passes, but it can catch state or command propagation regressions during a second onboard.
  • rebuild-hermes-e2e (high): Optional confidence for Hermes lifecycle after sandbox recreation/rebuild, which is adjacent to stale gateway runtime lock cleanup in agents/hermes/start.sh.
  • hermes-onboard-security-posture-e2e (high): Optional because Hermes start.sh changed process and file cleanup around runtime state. This job validates the full Hermes onboard path with security-posture assertions, but the specific stale lock/GPU recreation behavior is better covered by a targeted future test.

New E2E recommendations

  • Hermes GPU Docker recreation (high): Existing GPU E2E covers OpenClaw/local Ollama and existing Hermes E2E covers CPU/cloud Hermes, but there is no combined E2E that recreates a Hermes OpenShell Docker sandbox with GPU access and verifies stale gateway.lock cleanup plus Hermes gateway/inference after recreation.
    • Suggested test: Add a Hermes GPU sandbox recreation E2E that onboards Hermes with a GPU-enabled Docker OpenShell sandbox, forces the Docker GPU patch/recreate path, verifies OPENSHELL_SANDBOX_COMMAND runs nemoclaw-start, confirms stale gateway.pid/gateway.lock cleanup, and performs Hermes health plus inference.

Dispatch hint

  • Workflow: .github/workflows/nightly-e2e.yaml
  • jobs input: hermes-e2e,gpu-e2e

@github-actions

Contributor

PR Review Advisor

Recommendation: blocked
Confidence: high
Analyzed HEAD: 92d38d3a7817c3afb734d68f98656ec5ed0500c0
Findings: 4 blocker(s), 2 warning(s), 1 suggestion(s)

This is an automated advisory review. A human maintainer must make the final merge decision.

Limitations:

  • Git diff in the prompt was truncated; repository reads were used for the main changed implementation sections, but not every unchanged surrounding line was re-reviewed.
  • No PR scripts, package-manager commands, Docker commands, or tests were executed.
  • Review thread state was unavailable beyond the provided GraphQL nodes; CodeRabbit was still pending.
  • No linked issues were present in trusted context, so acceptance mapping used the PR body clauses and test claims as untrusted evidence.
  • E2E Advisor comments were absent; only the in-progress E2E recommendation check run was available.


Full advisor summary

PR Review Advisor

Base: origin/main
Head: HEAD
Analyzed SHA: 92d38d3a7817c3afb734d68f98656ec5ed0500c0
Recommendation: blocked
Confidence: high

The patch targets real active sandbox/Hermes GPU startup paths, but merge is blocked by pending CI/E2E, mergeStateStatus=BLOCKED, CodeRabbit still pending, large-file budget blockers, and a security concern around serializing sensitive startup env into OPENSHELL_SANDBOX_COMMAND.

Gate status

  • CI: pending — Head SHA 92d38d3 has pending/in-progress/queued contexts including cli-parity, E2E recommendation, wsl-e2e, macos-e2e, PR review advisor, CodeQL javascript-typescript, CodeQL python, unit-vitest-linux, checks, ShellCheck SARIF, build-sandbox-images, build-sandbox-images-arm64, and CodeRabbit.
  • Mergeability: fail — GitHub GraphQL reports mergeStateStatus=BLOCKED and reviewDecision=REVIEW_REQUIRED for PR fix(hermes): restore Spark GPU recreate startup #3963.
  • Review threads: unknown — GraphQL reviewThreads.nodes is empty, but trusted context says no review thread state was available; CodeRabbit issue comment says review is in progress and CodeRabbit status is PENDING.
  • Risky code tested: warning — Risky areas detected: onboarding/host glue. Unit tests were added for docker GPU patch command preservation and Hermes runtime cleanup, but runtime/sandbox/infrastructure behavior requires E2E validation for this head SHA.

🔴 Blockers

  • Required CI and review automation have not completed for the head SHA: The latest head SHA has many in-progress or queued contexts, and CodeRabbit is still pending. Mergeability is also blocked.
    • Recommendation: Wait for the required checks and review automation to complete successfully for 92d38d3 before considering merge.
    • Evidence: GraphQL statusCheckRollup shows IN_PROGRESS/QUEUED contexts including E2E recommendation, wsl-e2e, macos-e2e, CodeQL, unit-vitest-linux, ShellCheck SARIF, build-sandbox-images, and CodeRabbit=PENDING; mergeStateStatus=BLOCKED.
  • Runtime and sandbox lifecycle changes need E2E confirmation: The PR changes Hermes entrypoint startup, stale runtime cleanup, socat cleanup, Docker GPU container recreation, and OpenShell sandbox command propagation. Unit tests exercise extracted helpers, but cannot prove real OpenShell/Docker/Hermes supervisor reconnect behavior.
    • Recommendation: Require the E2E Advisor recommendation to finish and ensure its required E2E jobs pass for this exact head SHA, especially paths covering Docker GPU recreate and Hermes startup.
    • Evidence: Changed files include agents/hermes/start.sh, src/lib/onboard.ts, src/lib/onboard/docker-gpu-patch.ts, and src/lib/onboard/docker-gpu-sandbox-create.ts. E2E recommendation, wsl-e2e, and macos-e2e are IN_PROGRESS.
  • Large-file hotspot grew past the monolith budget (src/lib/onboard/docker-gpu-patch.ts:251): The Docker GPU patch implementation is already a large-file hotspot and grew by 25 lines in this PR.
    • Recommendation: Extract the new OpenShell sandbox command serialization/recreate plumbing into a smaller helper module or offset the growth before merge, consistent with the repository's monolith budget policy.
    • Evidence: Trusted monolithDeltas reports src/lib/onboard/docker-gpu-patch.ts baseLines=1178, headLines=1203, delta=25, severity=blocker.
  • Large test hotspot grew past the monolith budget (src/lib/onboard/docker-gpu-patch.test.ts:1): The Docker GPU patch test file is already a large-file hotspot and grew by 63 lines in this PR.
    • Recommendation: Move the new OpenShell sandbox command preservation tests into a focused test file or offset the growth before merge.
    • Evidence: Trusted monolithDeltas reports src/lib/onboard/docker-gpu-patch.test.ts baseLines=606, headLines=669, delta=63, severity=blocker.

🟡 Warnings

  • Sandbox startup command is serialized as a space-joined env value (src/lib/onboard/docker-gpu-patch.ts:251): openshellSandboxCommandEnvValue joins command parts with a plain space and writes the result into OPENSHELL_SANDBOX_COMMAND. The command can include env assignments derived from runtime credentials such as TOOL_GATEWAY_USER_TOKEN and BRAVE_API_KEY. Depending on how OpenShell later parses OPENSHELL_SANDBOX_COMMAND, this can expose secrets through Docker inspect/environment metadata and can break or become unsafe for values containing spaces or shell metacharacters.
    • Recommendation: Avoid embedding sensitive env assignments in OPENSHELL_SANDBOX_COMMAND, or encode the command with a structured/quoted representation that OpenShell parses without shell interpretation. Add negative tests for token/API-key values containing spaces, quotes, semicolons, and dollar signs, and assert sensitive values are not exposed in diagnostics or persisted environment when not strictly required.
    • Evidence: src/lib/onboard.ts line 5501 builds sandboxStartupCommand from envArgs; lines 5480-5488 can include TOOL_GATEWAY_USER_TOKEN and BRAVE_API_KEY. docker-gpu-patch.ts line 251 joins command parts with spaces, and lines 459-467 write OPENSHELL_SANDBOX_COMMAND= into docker run --env.
  • Patch overlaps with many active PRs touching the same high-risk files: The changed files still exist and the patch is not obviously stale, but there is substantial active work on the same files, including security/sandbox and onboard refactor PRs. This increases drift and conflict risk.
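
The serialization warning above can be made concrete with a short sketch. It shows why a space-joined value cannot be split back reliably once an argument contains spaces, and contrasts it with a JSON array encoding. The JSON form is purely an illustrative assumption: nothing in this PR establishes that OpenShell can parse it, and the key value below is a made-up placeholder.

```typescript
// Space-join, as described in the warning: argument boundaries are lost.
function joinWithSpaces(parts: string[]): string {
  return parts.join(" ");
}

// Hypothetical structured alternative (NOT something OpenShell is known to accept).
function encodeStructured(parts: string[]): string {
  return JSON.stringify(parts);
}

function decodeStructured(value: string): string[] {
  const parsed: unknown = JSON.parse(value);
  if (!Array.isArray(parsed) || !parsed.every((p) => typeof p === "string")) {
    throw new Error("expected a JSON array of strings");
  }
  return parsed as string[];
}

// Placeholder value containing a space, a semicolon, and a dollar sign.
const command = ["env", "BRAVE_API_KEY=abc def; $(id)", "nemoclaw-start"];

// Splitting the joined form on spaces yields 5 parts instead of the original 3.
const naiveSplit = joinWithSpaces(command).split(" ");

// The structured form round-trips to the original 3 arguments.
const roundTrip = decodeStructured(encodeStructured(command));
```

Note that a structured encoding only addresses the quoting problem. The value would still be visible in the container's environment metadata, so the credential-exposure concern needs a separate fix.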

🔵 Suggestions

  • Add edge-case tests for Hermes gateway process detection (agents/hermes/start.sh:159): The cleanup logic preserves stale runtime files if any cmdline contains a Hermes gateway run pattern. This is intentionally conservative, but false positives could preserve stale locks indefinitely.
    • Recommendation: Add tests for non-gateway processes whose arguments contain the text 'hermes gateway run', and for wrapped shell/step-down command lines used by the root path, to document the intended matching behavior.
    • Evidence: cmdline_is_hermes_gateway matches '"/hermes gateway run "' or '" hermes gateway run "' in agents/hermes/start.sh.
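
The false-positive risk in this suggestion can be sketched with a TypeScript model of the matcher. It uses the two substrings quoted in the evidence line; padding the command line with spaces is an assumption of this sketch, and the shell helper in agents/hermes/start.sh may normalise input differently.

```typescript
// TypeScript model of the substring match; not the shell implementation.
function cmdlineIsHermesGateway(cmdline: string): boolean {
  // Pad so the patterns can match at the start or end of the string
  // (assumption of this sketch).
  const padded = ` ${cmdline} `;
  return (
    padded.includes("/hermes gateway run ") ||
    padded.includes(" hermes gateway run ")
  );
}

const samples: Array<[string, boolean]> = [
  // Real gateway invoked by absolute path: matches.
  ["/usr/local/bin/hermes gateway run", cmdlineIsHermesGateway("/usr/local/bin/hermes gateway run")],
  // Gateway wrapped in a shell command line: matches.
  ["sh -c exec hermes gateway run", cmdlineIsHermesGateway("sh -c exec hermes gateway run")],
  // Unrelated process whose arguments contain the text: also matches (false positive).
  ["grep -F hermes gateway run", cmdlineIsHermesGateway("grep -F hermes gateway run")],
  // The entrypoint script itself: does not match.
  ["bash /usr/local/bin/nemoclaw-start", cmdlineIsHermesGateway("bash /usr/local/bin/nemoclaw-start")],
];
```

The third sample is the kind of case the suggested tests would document: any process carrying the phrase in its arguments counts as a live gateway, so stale locks would be preserved while it runs.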

Acceptance coverage

  • partial — preserve NemoClaw's env ... nemoclaw-start sandbox command when the Docker GPU patch recreates an OpenShell-managed container: src/lib/onboard.ts now builds sandboxStartupCommand=["env", ...envArgs, "nemoclaw-start"] and passes it to createDockerGpuSandboxCreatePatch. docker-gpu-patch.ts uses options.openshellSandboxCommand as both OPENSHELL_SANDBOX_COMMAND and final docker command args. Unit tests assert replacement of 'sleep infinity'. Remaining concern: serialization is a plain space join and may mishandle or expose sensitive env values.
  • met — clear stale Hermes runtime/gateway.pid / runtime/gateway.lock state only when no Hermes gateway process is alive: agents/hermes/start.sh adds has_live_hermes_gateway and cleanup_stale_hermes_gateway_runtime, called before gateway launch in both non-root and root paths. test/hermes-start.test.ts covers stale pid/lock removal and preservation when a live gateway cmdline is present.
  • met — remove orphaned Hermes socat forwarders before launching a fresh gateway: agents/hermes/start.sh adds cleanup_orphan_socat_forwarders and calls it from cleanup_stale_hermes_gateway_runtime before start_socat_forwarder. test/hermes-start.test.ts covers killing a fake orphan socat when no live Hermes gateway exists.
  • unknown — npm run build:cli: The PR body claims this was run, but trusted CI for the head SHA is still pending and this review did not execute package-manager commands.
  • unknown — bash -n agents/hermes/start.sh: The PR body claims this was run, but ShellCheck SARIF is still IN_PROGRESS and this review did not execute PR scripts.
  • unknown — npx vitest run src/lib/onboard/docker-gpu-patch.test.ts test/hermes-start.test.ts: The PR body claims this was run and relevant tests were added, but unit-vitest-linux is QUEUED for the head SHA and this review did not run tests.

Security review

  • warning — 1. Secrets and Credentials: No new hardcoded secret literals were found. However, the new OPENSHELL_SANDBOX_COMMAND value can be built from envArgs that include TOOL_GATEWAY_USER_TOKEN and BRAVE_API_KEY, causing runtime credentials to be persisted in Docker env/config metadata during GPU recreate.
  • warning — 2. Input Validation and Data Sanitization: The sandbox command is serialized with parts.join(' ') without escaping or structured encoding. Inputs such as CHAT_UI_URL, proxy env, broker token, and Brave API key are not all constrained to shell-safe/no-space values at this serialization boundary.
  • pass — 3. Authentication and Authorization: No new endpoints or authorization decisions are introduced. The change preserves existing OpenShell/Hermes startup and credential-rewrite boundaries, subject to the credential exposure warning above.
  • pass — 4. Dependencies and Third-Party Libraries: No new package dependencies, installers, registries, or version pins are added in the shown diff.
  • pass — 5. Error Handling and Logging: New logs report stale PID/lock cleanup and socat PIDs without directly printing secrets. Existing diagnostic sanitization tests remain present.
  • pass — 6. Cryptography and Data Protection: Not applicable — no cryptographic operations or algorithms are introduced or modified.
  • warning — 7. Configuration and Security Headers: The changed path recreates Docker sandbox containers and continues to add SYS_PTRACE and potentially apparmor=unconfined for GPU support. That behavior appears pre-existing and tested, but it remains high-risk sandbox configuration that needs E2E validation.
  • warning — 8. Security Testing: Tests cover stale file cleanup, symlink preservation, orphan socat cleanup, and command replacement. Missing negative tests for sensitive startup env exposure and shell/space/metacharacter handling in OPENSHELL_SANDBOX_COMMAND. E2E runtime validation is pending.
  • warning — 9. Holistic Security Posture: The PR addresses a real sandbox startup regression and avoids deleting locks when a live gateway is detected, but it changes high-risk sandbox lifecycle and credential-adjacent command propagation while CI/E2E and CodeRabbit remain pending.

Test / E2E status

  • Test depth: e2e_required — Runtime/sandbox/infrastructure paths need real execution coverage: agents/hermes/start.sh, src/lib/onboard.ts, src/lib/onboard/docker-gpu-patch.ts, and src/lib/onboard/docker-gpu-sandbox-create.ts. Unit tests validate helper logic but not OpenShell supervisor reconnect, Docker inspect/env persistence, actual GPU recreate, or Hermes gateway startup.
  • E2E Advisor: missing
  • Required E2E jobs: E2E recommendation, wsl-e2e, macos-e2e
  • Missing for analyzed SHA: E2E recommendation, wsl-e2e, macos-e2e

✅ What looks good

  • The PR targets files that still exist and directly addresses the reported Spark GPU recreate startup path.
  • The Docker GPU recreate path now has focused unit coverage for replacing the idle OpenShell sandbox command and adding the command env when absent.
  • Hermes runtime cleanup tests cover stale pid/lock removal, orphan socat cleanup, and preserving state when a live gateway is present.
  • The cleanup logic treats legacy Hermes PID symlinks carefully and avoids reading unsafe Tirith marker symlinks from existing coverage.

Review completeness

  • Git diff in the prompt was truncated; repository reads were used for the main changed implementation sections, but not every unchanged surrounding line was re-reviewed.
  • No PR scripts, package-manager commands, Docker commands, or tests were executed.
  • Review thread state was unavailable beyond the provided GraphQL nodes; CodeRabbit was still pending.
  • No linked issues were present in trusted context, so acceptance mapping used the PR body clauses and test claims as untrusted evidence.
  • E2E Advisor comments were absent; only the in-progress E2E recommendation check run was available.
  • Human maintainer review required: yes

@github-actions

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26202173538
Target ref: 92d38d3a7817c3afb734d68f98656ec5ed0500c0
Workflow ref: main
Requested jobs: hermes-e2e,gpu-e2e
Summary: 1 passed, 0 failed, 1 skipped

Job Result
gpu-e2e ⏭️ skipped
hermes-e2e ✅ success

@github-actions

Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 26202896549
Target ref: 92d38d3a7817c3afb734d68f98656ec5ed0500c0
Workflow ref: main
Requested jobs: all (no filter)
Summary: 46 passed, 1 failed, 2 skipped

Job Result
bedrock-runtime-compatible-anthropic-e2e ✅ success
brave-search-e2e ✅ success
channels-add-remove-e2e ✅ success
channels-stop-start-e2e ✅ success
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
credential-migration-e2e ✅ success
credential-sanitization-e2e ✅ success
device-auth-health-e2e ✅ success
diagnostics-e2e ✅ success
docs-validation-e2e ✅ success
double-onboard-e2e ✅ success
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-discord-e2e ✅ success
hermes-e2e ✅ success
hermes-inference-switch-e2e ✅ success
hermes-onboard-security-posture-e2e ✅ success
hermes-slack-e2e ✅ success
inference-routing-e2e ✅ success
issue-2478-crash-loop-recovery-e2e ✅ success
kimi-inference-compat-e2e ✅ success
launchable-smoke-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ✅ success
onboard-negative-paths-e2e ✅ success
onboard-repair-e2e ✅ success
onboard-resume-e2e ✅ success
openclaw-inference-switch-e2e ❌ failure
openclaw-onboard-security-posture-e2e ✅ success
openclaw-slack-pairing-e2e ✅ success
openshell-gateway-upgrade-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-hermes-stale-base-e2e ✅ success
rebuild-openclaw-e2e ✅ success
runtime-overrides-e2e ✅ success
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
state-backup-restore-e2e ✅ success
telegram-injection-e2e ✅ success
token-rotation-e2e ✅ success
tunnel-lifecycle-e2e ✅ success
upgrade-stale-sandbox-e2e ✅ success

Failed jobs: openclaw-inference-switch-e2e. Check run artifacts for logs.

@ericksoa ericksoa merged commit 449f6f4 into main May 21, 2026
31 checks passed
@senthilr-nv

Contributor

Verified on DGX Spark (aarch64, NVIDIA GB10, 122 GB) — fresh setup from main + checkout of fix/spark-hermes-gpu-recreate @ 92d38d3.

Onboard

  • nemoclaw onboard with NEMOCLAW_AGENT=hermes: 44s end-to-end, no manual intervention.
  • Previously (main): timed out at 90s waiting for Hermes Agent gateway; required manual nemoclaw-start invocation inside the container.

Bug 1 — GPU recreate drops ENTRYPOINT: fixed
After "Recreating OpenShell Docker sandbox container with NVIDIA GPU access", container now runs:

Entrypoint: ["/opt/openshell/bin/openshell-sandbox"]
Cmd: ["env","CHAT_UI_URL=http://127.0.0.1:8642","NEMOCLAW_DASHBOARD_PORT=8642","nemoclaw-start"]

ps -ef confirms bash /usr/local/bin/nemoclaw-start (PID 60) and hermes gateway run (PID 148) under the openshell-sandbox supervisor.
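
A check like the one above could be scripted by splitting the recreated container's Cmd array into its env assignments and the real startup command. The sketch below is a hypothetical helper (`splitEnvCommand` is not part of NemoClaw) and handles only the plain `env KEY=value ... command` shape shown in the output; options accepted by the real `env` utility are ignored.

```typescript
// Hypothetical helper for inspecting an "env KEY=value ... command" argv.
function splitEnvCommand(argv: string[]): {
  assignments: Record<string, string>;
  command: string[];
} {
  const assignments: Record<string, string> = {};
  if (argv[0] !== "env") {
    // Not wrapped in env: everything is the command.
    return { assignments, command: [...argv] };
  }
  const rest = argv.slice(1);
  let consumed = 0;
  for (const arg of rest) {
    const eq = arg.indexOf("=");
    if (eq <= 0) break; // first non-assignment starts the command
    assignments[arg.slice(0, eq)] = arg.slice(eq + 1);
    consumed++;
  }
  return { assignments, command: rest.slice(consumed) };
}

// Cmd array as reported in the Spark validation above.
const inspectedCmd = [
  "env",
  "CHAT_UI_URL=http://127.0.0.1:8642",
  "NEMOCLAW_DASHBOARD_PORT=8642",
  "nemoclaw-start",
];
const parsedCmd = splitEnvCommand(inspectedCmd);
```

For the reported Cmd, `parsedCmd.command` is `["nemoclaw-start"]`, which is the property the fix is meant to guarantee after a GPU recreate.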

Bug 2 — Stale PID file kills fresh Hermes: fixed

  • /sandbox/.hermes/gateway.pid is a regular file (115 B), not a broken symlink.
  • No "PID file race lost to another gateway instance" error in gateway.log.
  • Hermes started cleanly on first try.

First-token inference

  • /v1/models → returns hermes-agent.
  • /v1/chat/completions (streaming, short prompt): 7.66s end-to-end.
  • /v1/chat/completions (non-streaming, longer reasoning): 76s, 2232 completion tokens.

LGTM from Spark validation. Ready to merge.

@github-actions

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26202896549
Target ref: 92d38d3a7817c3afb734d68f98656ec5ed0500c0
Workflow ref: main
Requested jobs: all (no filter)
Summary: 47 passed, 0 failed, 2 skipped

Job Result
bedrock-runtime-compatible-anthropic-e2e ✅ success
brave-search-e2e ✅ success
channels-add-remove-e2e ✅ success
channels-stop-start-e2e ✅ success
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
credential-migration-e2e ✅ success
credential-sanitization-e2e ✅ success
device-auth-health-e2e ✅ success
diagnostics-e2e ✅ success
docs-validation-e2e ✅ success
double-onboard-e2e ✅ success
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-discord-e2e ✅ success
hermes-e2e ✅ success
hermes-inference-switch-e2e ✅ success
hermes-onboard-security-posture-e2e ✅ success
hermes-slack-e2e ✅ success
inference-routing-e2e ✅ success
issue-2478-crash-loop-recovery-e2e ✅ success
kimi-inference-compat-e2e ✅ success
launchable-smoke-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ✅ success
onboard-negative-paths-e2e ✅ success
onboard-repair-e2e ✅ success
onboard-resume-e2e ✅ success
openclaw-inference-switch-e2e ✅ success
openclaw-onboard-security-posture-e2e ✅ success
openclaw-slack-pairing-e2e ✅ success
openshell-gateway-upgrade-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-hermes-stale-base-e2e ✅ success
rebuild-openclaw-e2e ✅ success
runtime-overrides-e2e ✅ success
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
state-backup-restore-e2e ✅ success
telegram-injection-e2e ✅ success
token-rotation-e2e ✅ success
tunnel-lifecycle-e2e ✅ success
upgrade-stale-sandbox-e2e ✅ success

miyoungc added a commit that referenced this pull request May 21, 2026
## Summary
Refreshes NemoClaw release notes for v0.0.47 and v0.0.48, then
regenerates the corresponding user-skill references so agent-facing docs
match the source pages.

Preview:
https://nvidia-preview-docs-release-notes-47-48.docs.buildwithfern.com/nemoclaw/about/release-notes

## Changes
- Adds explicit v0.0.47 and v0.0.48 sections to
`docs/about/release-notes.mdx`.
- Documents follow-up WSL Ollama, sandbox image, share mount, and
troubleshooting updates from recent release changes.
- Regenerates `nemoclaw-user-*` skill references from the Fern MDX
source docs.

## Source Summary
- #4003 -> `docs/about/release-notes.mdx`: Notes the messaging manifest
registry work as part of v0.0.48 release coverage.
- #3984 -> `docs/about/release-notes.mdx`: Captures Hermes messaging
policy scoping in the v0.0.48 release notes.
- #3963 -> `docs/about/release-notes.mdx`: Captures DGX Spark Hermes GPU
recreation startup recovery in the v0.0.48 release notes.
- #3961 -> `docs/about/release-notes.mdx`: Captures Discord loopback
proxy routing in the v0.0.48 release notes.
- #3940 -> `docs/about/release-notes.mdx`: Captures installer prompt
clarification and express-install behavior in the v0.0.48 release notes.
- #3946 -> `docs/about/release-notes.mdx`: Carries forward the Homebrew
preinstall clarification in release coverage.
- #3937 -> `docs/about/release-notes.mdx`: Carries forward the dashboard
URL command and post-install next steps coverage.
- #3921 -> `docs/about/release-notes.mdx`: Carries forward managed vLLM
default behavior for DGX Spark and DGX Station.
- #3931 -> `docs/about/release-notes.mdx`,
`docs/reference/architecture.mdx`: Documents the sandbox `python` to
`python3` compatibility symlink.
- #1485 -> `docs/about/release-notes.mdx`,
`docs/reference/architecture.mdx`: Documents the sandbox image Docker
health check.
- #3784 -> `docs/about/release-notes.mdx`: Captures VM-driver snapshot
health-check reliability in release notes.
- #3917 -> `docs/about/release-notes.mdx`: Captures package-based
workspace template resolution in release notes.
- #3170 -> `docs/about/release-notes.mdx`: Captures installer checksum
compatibility from preferring `sha256sum`.
- #3898 -> `docs/about/release-notes.mdx`: Adds v0.0.47 release coverage
for messaging provider scenario validation.
- #3897 -> `docs/about/release-notes.mdx`: Adds v0.0.47 release coverage
for baseline onboarding scenario validation.
- #3834 -> `docs/about/release-notes.mdx`: Adds v0.0.47 release coverage
for PR review advisor automation.
- #3838 -> `docs/about/release-notes.mdx`: Adds v0.0.47 release coverage
for CLI display registry refactoring.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification
- [x] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

`make docs` was attempted but could not complete because `npx fern-api`
failed with `403 Forbidden` from `https://registry.npmjs.org/fern-api`
in this environment. Pre-commit and pre-push hooks passed after
refreshing the local CLI build output with `npm run build:cli`; no build
artifacts were committed.

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

## Summary by CodeRabbit

* **Documentation**
  * Added WSL onboarding notes for Windows-host Ollama detection, restart
    guidance, and PowerShell checks.
  * Clarified express-install behavior (non-interactive, sudo prompts) and
    default sandbox policy selection.
  * Added Windows preparation guidance when installer tooling is missing
    (winget/App Installer or Docker Desktop).
  * Expanded sandbox docs with Docker health checks, Homebrew/python
    compatibility helpers, share-mount path validation, Discord
    troubleshooting, and new v0.0.48/v0.0.47 release notes.
* **Chores**
  * Improved docs preview workflow error handling.

@wscurran wscurran added the bug-fix (PR fixes a bug or regression) label and removed the fix label Jun 3, 2026

Labels

bug-fix PR fixes a bug or regression v0.0.47 Release target


4 participants