fix(install): simplify DGX Spark setup, remove cgroup v2 workaround#1368
Conversation
…rkaround ## Summary - Remove obsolete cgroup v2 cgroupns=host workaround from setup-spark.sh (handled by OpenShell directly since PR OpenShell#329) - Simplify spark-install.md Quick Start to two curl commands - Keep docker group check in setup-spark.sh ## Changes - `scripts/setup-spark.sh`: Remove cgroup v2 daemon.json workaround and Docker restart logic. Keep docker group setup only. - `spark-install.md`: Replace multi-step Quick Start with two curl commands. Update troubleshooting and architecture sections to reflect the cgroup fix is now in OpenShell. ## Testing - Verified on DGX Spark (cgroup v2) without daemon.json cgroupns workaround. Gateway container gets `CgroupnsMode: host` from OpenShell directly. Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThe setup-spark.sh script was refactored to remove Docker daemon.json configuration logic, replacing it with a simpler docker group membership approach. Corresponding documentation was updated to reflect the simplified installation flow and new troubleshooting guidance for Docker cgroup issues. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
|
@zyang-dev you need to fix your commits to have verified signatures. Please do this and let us know |
…VIDIA#1368) <!-- markdownlint-disable MD041 --> ## Summary - Remove obsolete cgroup v2 cgroupns=host workaround from setup-spark.sh (handled by OpenShell directly since PR OpenShell#329) - Simplify spark-install.md Quick Start to two curl commands - Keep docker group check in setup-spark.sh ## Related Issue <!-- Link to the issue: Fixes #NNN or Closes #NNN. Remove this section if none. --> ## Changes - `scripts/setup-spark.sh`: Remove cgroup v2 daemon.json workaround and Docker restart logic. Keep docker group setup only. - `spark-install.md`: Replace multi-step Quick Start with two curl commands. Update troubleshooting and architecture sections to reflect the cgroup fix is now in OpenShell. ## Type of Change <!-- Check the one that applies. --> - [ ] Code change for a new feature, bug fix, or refactor. - [x] Code change with doc updates. - [ ] Doc only. Prose changes without code sample modifications. - [ ] Doc only. Includes code sample changes. ## Testing <!-- What testing was done? --> Verified on DGX Spark (cgroup v2) without daemon.json cgroupns workaround. Gateway container gets `CgroupnsMode: host` from OpenShell directly. - [x] `npx prek run --all-files` passes (or equivalently `make check`). - [x] `npm test` passes. - [ ] `make docs` builds without warnings. (for doc-only changes) ## Checklist ### General - [x] I have read and followed the [contributing guide](https://github.com/NVIDIA/NemoClaw/blob/main/CONTRIBUTING.md). - [ ] I have read and followed the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md). (for doc-only changes) ### Code Changes <!-- Skip if this is a doc-only PR. --> - [x] Formatters applied — `npx prek run --all-files` auto-fixes formatting (or `make format` for targeted runs). - [ ] Tests added or updated for new or changed behavior. - [x] No secrets, API keys, or credentials committed. - [x] Doc pages updated for any user-facing behavior changes (new commands, changed defaults, new features, bug fixes that contradict existing docs). ### Doc Changes <!-- Skip if this PR has no doc changes. --> - [x] Follows the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md). Try running the `update-docs` agent skill to draft changes while complying with the style guide. For example, prompt your agent with "`/update-docs` catch up the docs for the new changes I made in this PR." - [ ] New pages include SPDX license header and frontmatter, if creating a new page. - [x] Cross-references and links verified. --- <!-- DCO sign-off (required by CI). Replace with your real name and email. --> Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Prerequisites simplified: OpenShell CLI is now installed automatically during the installation process. * Quick Start flow streamlined with single-command piped installation from the hosted installer. * Docker troubleshooting updated with improved container-level configuration guidance instead of manual daemon edits. * Removed complex manual Docker setup steps; enhanced architecture diagrams. * **Chores** * Setup scripts improved with streamlined Docker group management and conditional user messaging. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com> Co-authored-by: KJ <kejones@nvidia.com>
…ty (#1470) ## Summary - unify installer and onboarding host detection around shared TypeScript preflight logic - move `deploy` behavior into TypeScript, thin the Brev compatibility wrapper, and harden Brev readiness handling - demote or remove legacy platform-specific setup paths (`setup-spark`, `brev-setup.sh`) in favor of the canonical installer + onboard flow - update docs, CLI help, and Brev E2E coverage to match the new behavior ## What Changed - added shared host assessment and remediation planning in `src/lib/preflight.ts` - wired installer and onboard flows to the same host preflight decisions - changed Podman handling from hard block to unsupported-runtime warning - migrated deploy logic into `src/lib/deploy.ts` - updated `nemoclaw deploy` to use the authenticated Brev CLI, current Brev create flags, explicit GCP provider default, stricter readiness checks, and standard installer/onboard flow - removed `scripts/setup-spark.sh` and reduced `scripts/brev-setup.sh` to a deprecated compatibility wrapper - updated README/docs/help text and hardened the Brev E2E cleanup path ## Validation - `npm run build:cli` - targeted Vitest coverage for `src/lib/preflight.test.ts`, `src/lib/deploy.test.ts`, `test/install-preflight.test.js`, `test/cli.test.js`, `test/runner.test.js` - live Brev validation with `TEST_SUITE=deploy-cli` on `cpu-e2.4vcpu-16gb` - confirmed successful end-to-end remote deploy after waiting for Brev `status=RUNNING`, `build_status=COMPLETED`, `shell_status=READY` ## Related Issues - Fixes #1377 - Addresses #1330 - Addresses #1390 - Related to #1404 ## Credit / Prior Work This branch builds on ideas and prior work from: - #1368 by @zyang-dev for simplifying Spark setup and removing the old cgroup workaround - #1395 and #1468 by @kjw3 for the thin installer/bootstrap direction and installer path reliability - #1450 by @cjagwani for switching Brev flows toward GCP for reliability - #1383 by @13ernkastel for the current Brev create flag compatibility work - #1364 by @WuKongAI-CMU for deploy sync-path fixes - #1362 and #1266 by @jyaunches for the Brev E2E/launchable infrastructure direction - issue ideas from #1377 and #1404 by @zNeill, #1330 by @Marcelo5444, and #1390 by @ericksoa <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Improved host diagnostics with actionable remediation guidance surfaced during installer/onboard preflight. * **Improvements** * macOS (Intel) now recommends Docker Desktop; DGX Spark guidance now uses the standard installer + `nemoclaw onboard`. * Preflight output shows detected runtime and WSL notes; installer prints remediation actions and will skip onboarding on blocking issues. * **Deprecations** * `nemoclaw deploy`, `nemoclaw setup-spark`, and the legacy bootstrap wrapper are now deprecated compatibility paths. * **Documentation** * Quickstart, troubleshooting, and command reference updated to reflect installer+onboard flow and deprecation guidance. * **Tests** * Added/updated tests covering preflight, deploy compatibility, CLI aliases, and deploy e2e scenarios. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…VIDIA#1368) <!-- markdownlint-disable MD041 --> ## Summary - Remove obsolete cgroup v2 cgroupns=host workaround from setup-spark.sh (handled by OpenShell directly since PR OpenShell#329) - Simplify spark-install.md Quick Start to two curl commands - Keep docker group check in setup-spark.sh ## Related Issue <!-- Link to the issue: Fixes #NNN or Closes #NNN. Remove this section if none. --> ## Changes - `scripts/setup-spark.sh`: Remove cgroup v2 daemon.json workaround and Docker restart logic. Keep docker group setup only. - `spark-install.md`: Replace multi-step Quick Start with two curl commands. Update troubleshooting and architecture sections to reflect the cgroup fix is now in OpenShell. ## Type of Change <!-- Check the one that applies. --> - [ ] Code change for a new feature, bug fix, or refactor. - [x] Code change with doc updates. - [ ] Doc only. Prose changes without code sample modifications. - [ ] Doc only. Includes code sample changes. ## Testing <!-- What testing was done? --> Verified on DGX Spark (cgroup v2) without daemon.json cgroupns workaround. Gateway container gets `CgroupnsMode: host` from OpenShell directly. - [x] `npx prek run --all-files` passes (or equivalently `make check`). - [x] `npm test` passes. - [ ] `make docs` builds without warnings. (for doc-only changes) ## Checklist ### General - [x] I have read and followed the [contributing guide](https://github.com/NVIDIA/NemoClaw/blob/main/CONTRIBUTING.md). - [ ] I have read and followed the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md). (for doc-only changes) ### Code Changes <!-- Skip if this is a doc-only PR. --> - [x] Formatters applied — `npx prek run --all-files` auto-fixes formatting (or `make format` for targeted runs). - [ ] Tests added or updated for new or changed behavior. - [x] No secrets, API keys, or credentials committed. - [x] Doc pages updated for any user-facing behavior changes (new commands, changed defaults, new features, bug fixes that contradict existing docs). ### Doc Changes <!-- Skip if this PR has no doc changes. --> - [x] Follows the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md). Try running the `update-docs` agent skill to draft changes while complying with the style guide. For example, prompt your agent with "`/update-docs` catch up the docs for the new changes I made in this PR." - [ ] New pages include SPDX license header and frontmatter, if creating a new page. - [x] Cross-references and links verified. --- <!-- DCO sign-off (required by CI). Replace with your real name and email. --> Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Prerequisites simplified: OpenShell CLI is now installed automatically during the installation process. * Quick Start flow streamlined with single-command piped installation from the hosted installer. * Docker troubleshooting updated with improved container-level configuration guidance instead of manual daemon edits. * Removed complex manual Docker setup steps; enhanced architecture diagrams. * **Chores** * Setup scripts improved with streamlined Docker group management and conditional user messaging. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com> Co-authored-by: KJ <kejones@nvidia.com>
…ty (NVIDIA#1470) ## Summary - unify installer and onboarding host detection around shared TypeScript preflight logic - move `deploy` behavior into TypeScript, thin the Brev compatibility wrapper, and harden Brev readiness handling - demote or remove legacy platform-specific setup paths (`setup-spark`, `brev-setup.sh`) in favor of the canonical installer + onboard flow - update docs, CLI help, and Brev E2E coverage to match the new behavior ## What Changed - added shared host assessment and remediation planning in `src/lib/preflight.ts` - wired installer and onboard flows to the same host preflight decisions - changed Podman handling from hard block to unsupported-runtime warning - migrated deploy logic into `src/lib/deploy.ts` - updated `nemoclaw deploy` to use the authenticated Brev CLI, current Brev create flags, explicit GCP provider default, stricter readiness checks, and standard installer/onboard flow - removed `scripts/setup-spark.sh` and reduced `scripts/brev-setup.sh` to a deprecated compatibility wrapper - updated README/docs/help text and hardened the Brev E2E cleanup path ## Validation - `npm run build:cli` - targeted Vitest coverage for `src/lib/preflight.test.ts`, `src/lib/deploy.test.ts`, `test/install-preflight.test.js`, `test/cli.test.js`, `test/runner.test.js` - live Brev validation with `TEST_SUITE=deploy-cli` on `cpu-e2.4vcpu-16gb` - confirmed successful end-to-end remote deploy after waiting for Brev `status=RUNNING`, `build_status=COMPLETED`, `shell_status=READY` ## Related Issues - Fixes NVIDIA#1377 - Addresses NVIDIA#1330 - Addresses NVIDIA#1390 - Related to NVIDIA#1404 ## Credit / Prior Work This branch builds on ideas and prior work from: - NVIDIA#1368 by @zyang-dev for simplifying Spark setup and removing the old cgroup workaround - NVIDIA#1395 and NVIDIA#1468 by @kjw3 for the thin installer/bootstrap direction and installer path reliability - NVIDIA#1450 by @cjagwani for switching Brev flows toward GCP for reliability - NVIDIA#1383 by @13ernkastel for the current Brev create flag compatibility work - NVIDIA#1364 by @WuKongAI-CMU for deploy sync-path fixes - NVIDIA#1362 and NVIDIA#1266 by @jyaunches for the Brev E2E/launchable infrastructure direction - issue ideas from NVIDIA#1377 and NVIDIA#1404 by @zNeill, NVIDIA#1330 by @Marcelo5444, and NVIDIA#1390 by @ericksoa <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Improved host diagnostics with actionable remediation guidance surfaced during installer/onboard preflight. * **Improvements** * macOS (Intel) now recommends Docker Desktop; DGX Spark guidance now uses the standard installer + `nemoclaw onboard`. * Preflight output shows detected runtime and WSL notes; installer prints remediation actions and will skip onboarding on blocking issues. * **Deprecations** * `nemoclaw deploy`, `nemoclaw setup-spark`, and the legacy bootstrap wrapper are now deprecated compatibility paths. * **Documentation** * Quickstart, troubleshooting, and command reference updated to reflect installer+onboard flow and deprecation guidance. * **Tests** * Added/updated tests covering preflight, deploy compatibility, CLI aliases, and deploy e2e scenarios. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Related Issue
Changes
scripts/setup-spark.sh: Remove cgroup v2 daemon.json workaround and Docker restart logic. Keep docker group setup only.spark-install.md: Replace multi-step Quick Start with two curl commands. Update troubleshooting and architecture sections to reflect the cgroup fix is now in OpenShell.Type of Change
Testing
Verified on DGX Spark (cgroup v2) without daemon.json cgroupns workaround. Gateway container gets
CgroupnsMode: hostfrom OpenShell directly.npx prek run --all-filespasses (or equivalentlymake check).npm testpasses.make docsbuilds without warnings. (for doc-only changes)Checklist
General
Code Changes
npx prek run --all-filesauto-fixes formatting (ormake formatfor targeted runs).Doc Changes
update-docsagent skill to draft changes while complying with the style guide. For example, prompt your agent with "/update-docscatch up the docs for the new changes I made in this PR."Signed-off-by: zyang-dev 267119621+zyang-dev@users.noreply.github.com
Summary by CodeRabbit
Documentation
Chores