fix(hermes): restore Spark GPU recreate startup #3963
No actionable comments were generated in the recent review. 🎉
Run configuration: Path: .coderabbit.yaml; Review profile: CHILL; Plan: Enterprise; Files selected for processing: 6
Walkthrough
This PR adds Hermes gateway runtime cleanup logic to detect and safely remove stale gateway artifacts, and introduces configurability for OpenShell sandbox startup commands through environment-variable injection in Docker GPU patch creation and recreation flows.
Changes
- Hermes gateway runtime cleanup
- OpenShell sandbox command configurability
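To make the "detect and safely remove stale gateway artifacts" rule concrete, here is a minimal sketch of the decision only. It is not the code from `agents/hermes/start.sh`, which is a shell script this page does not show. The names `GatewayRuntimeState`, `parsePid`, and `staleGatewayFiles` are hypothetical, and the liveness probe is passed in as a callback so the logic stays pure.

```typescript
// Hypothetical snapshot of the Hermes runtime directory (names are illustrative).
interface GatewayRuntimeState {
  pidFileContent: string | null; // null when runtime/gateway.pid is absent
  lockExists: boolean; // true when runtime/gateway.lock is present
}

// Parse a PID file body; anything that is not a positive integer counts as unusable.
function parsePid(content: string | null): number | null {
  if (content === null) return null;
  const trimmed = content.trim();
  if (!/^[0-9]+$/.test(trimmed)) return null;
  const pid = Number(trimmed);
  return pid > 0 ? pid : null;
}

// Return the files that are safe to delete. Nothing is deleted while the
// recorded PID still maps to a live gateway (assumed check, supplied by caller).
function staleGatewayFiles(
  state: GatewayRuntimeState,
  isGatewayAlive: (pid: number) => boolean,
): string[] {
  const pid = parsePid(state.pidFileContent);
  if (pid !== null && isGatewayAlive(pid)) {
    return []; // a live gateway owns this state; leave it alone
  }
  const stale: string[] = [];
  if (state.pidFileContent !== null) stale.push("runtime/gateway.pid");
  if (state.lockExists) stale.push("runtime/gateway.lock");
  return stale;
}
```

In Node.js the callback could probe with `process.kill(pid, 0)`, which sends no signal and only checks for existence. A real check would probably also confirm that the PID belongs to a Hermes gateway, since PIDs can be reused after a container is recreated.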
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
PR Review Advisor
Recommendation: blocked
This is an automated advisory review. A human maintainer must make the final merge decision.
Limitations:
- Git diff in the prompt was truncated; repository reads were used for the main changed implementation sections, but not every unchanged surrounding line was re-reviewed.
- No PR scripts, package-manager commands, Docker commands, or tests were executed.
- Review thread state was unavailable beyond the provided GraphQL nodes; CodeRabbit was still pending.
- No linked issues were present in trusted context, so acceptance mapping used the PR body clauses and test claims as untrusted evidence.
- E2E Advisor comments were absent; only the in-progress E2E recommendation check run was available.
Full advisor summary: The patch targets real active sandbox/Hermes GPU startup paths, but merge is blocked by pending CI/E2E, mergeStateStatus=BLOCKED, CodeRabbit still pending, large-file budget blockers, and a security concern around serializing sensitive startup env into OPENSHELL_SANDBOX_COMMAND.
Selective E2E Results: ✅ All requested jobs passed
Run: 26202173538
Selective E2E Results: ❌ Some jobs failed
Run: 26202896549
Verified on DGX Spark (aarch64, NVIDIA GB10, 122 GB) — fresh setup from Onboard
Bug 1 — GPU recreate drops ENTRYPOINT: fixed
Bug 2 — Stale PID file kills fresh Hermes: fixed
First-token inference
LGTM from Spark validation. Ready to merge.
Selective E2E Results: ✅ All requested jobs passed
Run: 26202896549
## Summary

Refreshes NemoClaw release notes for v0.0.47 and v0.0.48, then regenerates the corresponding user-skill references so agent-facing docs match the source pages.

Preview: https://nvidia-preview-docs-release-notes-47-48.docs.buildwithfern.com/nemoclaw/about/release-notes

## Changes

- Adds explicit v0.0.47 and v0.0.48 sections to `docs/about/release-notes.mdx`.
- Documents follow-up WSL Ollama, sandbox image, share mount, and troubleshooting updates from recent release changes.
- Regenerates `nemoclaw-user-*` skill references from the Fern MDX source docs.

## Source Summary

- #4003 -> `docs/about/release-notes.mdx`: Notes the messaging manifest registry work as part of v0.0.48 release coverage.
- #3984 -> `docs/about/release-notes.mdx`: Captures Hermes messaging policy scoping in the v0.0.48 release notes.
- #3963 -> `docs/about/release-notes.mdx`: Captures DGX Spark Hermes GPU recreation startup recovery in the v0.0.48 release notes.
- #3961 -> `docs/about/release-notes.mdx`: Captures Discord loopback proxy routing in the v0.0.48 release notes.
- #3940 -> `docs/about/release-notes.mdx`: Captures installer prompt clarification and express-install behavior in the v0.0.48 release notes.
- #3946 -> `docs/about/release-notes.mdx`: Carries forward the Homebrew preinstall clarification in release coverage.
- #3937 -> `docs/about/release-notes.mdx`: Carries forward the dashboard URL command and post-install next steps coverage.
- #3921 -> `docs/about/release-notes.mdx`: Carries forward managed vLLM default behavior for DGX Spark and DGX Station.
- #3931 -> `docs/about/release-notes.mdx`, `docs/reference/architecture.mdx`: Documents the sandbox `python` to `python3` compatibility symlink.
- #1485 -> `docs/about/release-notes.mdx`, `docs/reference/architecture.mdx`: Documents the sandbox image Docker health check.
- #3784 -> `docs/about/release-notes.mdx`: Captures VM-driver snapshot health-check reliability in release notes.
- #3917 -> `docs/about/release-notes.mdx`: Captures package-based workspace template resolution in release notes.
- #3170 -> `docs/about/release-notes.mdx`: Captures installer checksum compatibility from preferring `sha256sum`.
- #3898 -> `docs/about/release-notes.mdx`: Adds v0.0.47 release coverage for messaging provider scenario validation.
- #3897 -> `docs/about/release-notes.mdx`: Adds v0.0.47 release coverage for baseline onboarding scenario validation.
- #3834 -> `docs/about/release-notes.mdx`: Adds v0.0.47 release coverage for PR review advisor automation.
- #3838 -> `docs/about/release-notes.mdx`: Adds v0.0.47 release coverage for CLI display registry refactoring.

## Type of Change

- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification

- [x] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

`make docs` was attempted but could not complete because `npx fern-api` failed with `403 Forbidden` from `https://registry.npmjs.org/fern-api` in this environment. Pre-commit and pre-push hooks passed after refreshing the local CLI build output with `npm run build:cli`; no build artifacts were committed.

---

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

## Summary by CodeRabbit

* **Documentation**
  * Added WSL onboarding notes for Windows-host Ollama detection, restart guidance, and PowerShell checks.
  * Clarified express-install behavior (non-interactive, sudo prompts) and default sandbox policy selection.
  * Added Windows preparation guidance when installer tooling is missing (winget/App Installer or Docker Desktop).
  * Expanded sandbox docs with Docker health checks, Homebrew/python compatibility helpers, share-mount path validation, Discord troubleshooting, and new v0.0.48/v0.0.47 release notes.
* **Chores**
  * Improved docs preview workflow error handling.
## Summary

- Preserve the `env ... nemoclaw-start` sandbox command when the Docker GPU patch recreates an OpenShell-managed container.
- Clear stale `runtime/gateway.pid` / `runtime/gateway.lock` state only when no Hermes gateway process is alive.
- Stop existing `socat` forwarders before launching a fresh gateway.

## Evidence

Senthil's Spark debug bundle showed the recreated container running `/opt/openshell/bin/openshell-sandbox` with `sleep infinity`, while the image entrypoint was `/usr/local/bin/nemoclaw-start`. A manual `nemoclaw-start` then failed with `PID file race lost to another gateway instance` against stale Hermes runtime lock state.

## Tests

- `npm run build:cli`
- `bash -n agents/hermes/start.sh`
- `npx vitest run src/lib/onboard/docker-gpu-patch.test.ts test/hermes-start.test.ts`

## Summary by CodeRabbit

Release Notes

* **New Features**
  * The OpenShell sandbox startup command is configurable through the `OPENSHELL_SANDBOX_COMMAND` environment variable.
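As an illustration of the `env ... nemoclaw-start` command shape described above, the sketch below builds such a string and wraps it in a Docker `--env` argument. It is a guess at the idea, not the code in `src/lib/onboard/docker-gpu-patch.ts`: the helper names `shellQuote`, `buildSandboxCommand`, and `sandboxCommandDockerArgs` are hypothetical, and how OpenShell actually parses `OPENSHELL_SANDBOX_COMMAND` is assumed rather than confirmed by this page.

```typescript
// Quote a value for a POSIX shell; plain tokens pass through unchanged.
function shellQuote(value: string): string {
  if (/^[A-Za-z0-9_\/.:@%+=-]+$/.test(value)) {
    return value;
  }
  // Close the quote, emit an escaped single quote, reopen the quote.
  return "'" + value.replace(/'/g, "'\\''") + "'";
}

// Build "env NAME=value ... <entrypoint>" (hypothetical helper).
// Names are sorted so the result is deterministic across runs.
function buildSandboxCommand(
  startupEnv: Record<string, string>,
  entrypoint: string,
): string {
  const assignments = Object.entries(startupEnv)
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([name, value]) => {
      if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(name)) {
        throw new Error("invalid environment variable name: " + name);
      }
      return name + "=" + shellQuote(value);
    });
  return ["env", ...assignments, shellQuote(entrypoint)].join(" ");
}

// Assumed wiring: pass the command to the recreated container via `docker run --env`.
function sandboxCommandDockerArgs(command: string): string[] {
  return ["--env", "OPENSHELL_SANDBOX_COMMAND=" + command];
}
```

Every value passed in is serialized into the command string in plain text. That is the concern the PR Review Advisor raised about sensitive startup env, so a caller would likely want to restrict which variables are included.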