Skip to content

fix(install): simplify DGX Spark setup, remove cgroup v2 workaround#1368

Merged
cv merged 3 commits into
mainfrom
fix/spark-install-remove-cgroup-workaround
Apr 3, 2026
Merged

fix(install): simplify DGX Spark setup, remove cgroup v2 workaround#1368
cv merged 3 commits into
mainfrom
fix/spark-install-remove-cgroup-workaround

Conversation

@zyang-dev

@zyang-dev zyang-dev commented Apr 2, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Remove obsolete cgroup v2 cgroupns=host workaround from setup-spark.sh (handled by OpenShell directly since PR OpenShell#329)
  • Simplify spark-install.md Quick Start to two curl commands
  • Keep docker group check in setup-spark.sh

Related Issue

Changes

  • scripts/setup-spark.sh: Remove cgroup v2 daemon.json workaround and Docker restart logic. Keep docker group setup only.
  • spark-install.md: Replace multi-step Quick Start with two curl commands. Update troubleshooting and architecture sections to reflect the cgroup fix is now in OpenShell.

Type of Change

  • Code change for a new feature, bug fix, or refactor.
  • Code change with doc updates.
  • Doc only. Prose changes without code sample modifications.
  • Doc only. Includes code sample changes.

Testing

Verified on DGX Spark (cgroup v2) without daemon.json cgroupns workaround. Gateway container gets CgroupnsMode: host from OpenShell directly.

  • npx prek run --all-files passes (or equivalently make check).
  • npm test passes.
  • make docs builds without warnings. (for doc-only changes)

Checklist

General

Code Changes

  • Formatters applied — npx prek run --all-files auto-fixes formatting (or make format for targeted runs).
  • Tests added or updated for new or changed behavior.
  • No secrets, API keys, or credentials committed.
  • Doc pages updated for any user-facing behavior changes (new commands, changed defaults, new features, bug fixes that contradict existing docs).

Doc Changes

  • Follows the style guide. Try running the update-docs agent skill to draft changes while complying with the style guide. For example, prompt your agent with "/update-docs catch up the docs for the new changes I made in this PR."
  • New pages include SPDX license header and frontmatter, if creating a new page.
  • Cross-references and links verified.

Signed-off-by: zyang-dev 267119621+zyang-dev@users.noreply.github.com

Summary by CodeRabbit

  • Documentation

    • Prerequisites simplified: OpenShell CLI is now installed automatically during the installation process.
    • Quick Start flow streamlined with single-command piped installation from the hosted installer.
    • Docker troubleshooting updated with improved container-level configuration guidance instead of manual daemon edits.
    • Removed complex manual Docker setup steps; enhanced architecture diagrams.
  • Chores

    • Setup scripts improved with streamlined Docker group management and conditional user messaging.

…rkaround

  ## Summary
  - Remove obsolete cgroup v2 cgroupns=host workaround from setup-spark.sh (handled by OpenShell directly since PR OpenShell#329)
  - Simplify spark-install.md Quick Start to two curl commands
  - Keep docker group check in setup-spark.sh

  ## Changes
  - `scripts/setup-spark.sh`: Remove cgroup v2 daemon.json workaround and Docker restart logic. Keep docker group setup only.
  - `spark-install.md`: Replace multi-step Quick Start with two curl commands. Update troubleshooting and architecture sections to reflect the cgroup fix is now in OpenShell.

  ## Testing
  - Verified on DGX Spark (cgroup v2) without daemon.json cgroupns workaround. Gateway container gets `CgroupnsMode: host` from OpenShell directly.

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
@coderabbitai

coderabbitai Bot commented Apr 2, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0ad9e6d4-1ef5-4f5b-ab6d-f7db063bbab1

📥 Commits

Reviewing files that changed from the base of the PR and between 2070c2c and f1dc3db.

📒 Files selected for processing (2)
  • scripts/setup-spark.sh
  • spark-install.md

📝 Walkthrough

Walkthrough

The setup-spark.sh script was refactored to remove Docker daemon.json configuration logic, replacing it with a simpler docker group membership approach. Corresponding documentation was updated to reflect the simplified installation flow and new troubleshooting guidance for Docker cgroup issues.

Changes

Cohort / File(s) Summary
Setup Script & Documentation
scripts/setup-spark.sh, spark-install.md
Simplified Docker setup: removed daemon.json editing and restart logic in favor of docker group membership. Updated install flow to use hosted NemoClaw installer script via piped curl. Replaced manual cgroup v2 workaround instructions with reference to OpenShell container-level cgroupns=host configuration. Updated troubleshooting and architecture diagram accordingly.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 The config JSON hops away,
Docker groups save the day!
No daemon restarts, no cgroup fuss,
Simpler Spark setup for all of us! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and specifically describes the main changes: simplifying the DGX Spark setup process and removing the obsolete cgroup v2 workaround that is now handled by OpenShell.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/spark-install-remove-cgroup-workaround

Comment @coderabbitai help to get the list of available commands and usage tips.

@kjw3

kjw3 commented Apr 2, 2026

Copy link
Copy Markdown
Contributor

@zyang-dev you need to fix your commits to have verified signatures. Please do this and let us know

@cv cv merged commit 02e5bb6 into main Apr 3, 2026
8 checks passed
@zyang-dev zyang-dev deleted the fix/spark-install-remove-cgroup-workaround branch April 3, 2026 16:57
lakamsani pushed a commit to lakamsani/NemoClaw that referenced this pull request Apr 4, 2026
…VIDIA#1368)

<!-- markdownlint-disable MD041 -->
## Summary
- Remove obsolete cgroup v2 cgroupns=host workaround from setup-spark.sh
(handled by OpenShell directly since PR OpenShell#329)
  - Simplify spark-install.md Quick Start to two curl commands
  - Keep docker group check in setup-spark.sh

## Related Issue
<!-- Link to the issue: Fixes #NNN or Closes #NNN. Remove this section
if none. -->

## Changes
- `scripts/setup-spark.sh`: Remove cgroup v2 daemon.json workaround and
Docker restart logic. Keep docker group setup only.
- `spark-install.md`: Replace multi-step Quick Start with two curl
commands. Update troubleshooting and architecture sections to reflect
the cgroup fix is now in OpenShell.

## Type of Change
<!-- Check the one that applies. -->
- [ ] Code change for a new feature, bug fix, or refactor.
- [x] Code change with doc updates.
- [ ] Doc only. Prose changes without code sample modifications.
- [ ] Doc only. Includes code sample changes.

## Testing
<!-- What testing was done? -->
Verified on DGX Spark (cgroup v2) without daemon.json cgroupns
workaround. Gateway container gets `CgroupnsMode: host` from OpenShell
directly.

- [x] `npx prek run --all-files` passes (or equivalently `make check`).
- [x] `npm test` passes.
- [ ] `make docs` builds without warnings. (for doc-only changes)

## Checklist

### General

- [x] I have read and followed the [contributing
guide](https://github.com/NVIDIA/NemoClaw/blob/main/CONTRIBUTING.md).
- [ ] I have read and followed the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md).
(for doc-only changes)

### Code Changes
<!-- Skip if this is a doc-only PR. -->
- [x] Formatters applied — `npx prek run --all-files` auto-fixes
formatting (or `make format` for targeted runs).
- [ ] Tests added or updated for new or changed behavior.
- [x] No secrets, API keys, or credentials committed.
- [x] Doc pages updated for any user-facing behavior changes (new
commands, changed defaults, new features, bug fixes that contradict
existing docs).

### Doc Changes
<!-- Skip if this PR has no doc changes. -->
- [x] Follows the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md).
Try running the `update-docs` agent skill to draft changes while
complying with the style guide. For example, prompt your agent with
"`/update-docs` catch up the docs for the new changes I made in this
PR."
- [ ] New pages include SPDX license header and frontmatter, if creating
a new page.
- [x] Cross-references and links verified.

---
<!-- DCO sign-off (required by CI). Replace with your real name and
email. -->
Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
* Prerequisites simplified: OpenShell CLI is now installed automatically
during the installation process.
* Quick Start flow streamlined with single-command piped installation
from the hosted installer.
* Docker troubleshooting updated with improved container-level
configuration guidance instead of manual daemon edits.
* Removed complex manual Docker setup steps; enhanced architecture
diagrams.

* **Chores**
* Setup scripts improved with streamlined Docker group management and
conditional user messaging.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
Co-authored-by: KJ <kejones@nvidia.com>
ericksoa pushed a commit that referenced this pull request Apr 4, 2026
…ty (#1470)

## Summary
- unify installer and onboarding host detection around shared TypeScript
preflight logic
- move `deploy` behavior into TypeScript, thin the Brev compatibility
wrapper, and harden Brev readiness handling
- demote or remove legacy platform-specific setup paths (`setup-spark`,
`brev-setup.sh`) in favor of the canonical installer + onboard flow
- update docs, CLI help, and Brev E2E coverage to match the new behavior

## What Changed
- added shared host assessment and remediation planning in
`src/lib/preflight.ts`
- wired installer and onboard flows to the same host preflight decisions
- changed Podman handling from hard block to unsupported-runtime warning
- migrated deploy logic into `src/lib/deploy.ts`
- updated `nemoclaw deploy` to use the authenticated Brev CLI, current
Brev create flags, explicit GCP provider default, stricter readiness
checks, and standard installer/onboard flow
- removed `scripts/setup-spark.sh` and reduced `scripts/brev-setup.sh`
to a deprecated compatibility wrapper
- updated README/docs/help text and hardened the Brev E2E cleanup path

## Validation
- `npm run build:cli`
- targeted Vitest coverage for `src/lib/preflight.test.ts`,
`src/lib/deploy.test.ts`, `test/install-preflight.test.js`,
`test/cli.test.js`, `test/runner.test.js`
- live Brev validation with `TEST_SUITE=deploy-cli` on
`cpu-e2.4vcpu-16gb`
- confirmed successful end-to-end remote deploy after waiting for Brev
`status=RUNNING`, `build_status=COMPLETED`, `shell_status=READY`

## Related Issues
- Fixes #1377
- Addresses #1330
- Addresses #1390
- Related to #1404

## Credit / Prior Work
This branch builds on ideas and prior work from:
- #1368 by @zyang-dev for simplifying Spark setup and removing the old
cgroup workaround
- #1395 and #1468 by @kjw3 for the thin installer/bootstrap direction
and installer path reliability
- #1450 by @cjagwani for switching Brev flows toward GCP for reliability
- #1383 by @13ernkastel for the current Brev create flag compatibility
work
- #1364 by @WuKongAI-CMU for deploy sync-path fixes
- #1362 and #1266 by @jyaunches for the Brev E2E/launchable
infrastructure direction
- issue ideas from #1377 and #1404 by @zNeill, #1330 by @Marcelo5444,
and #1390 by @ericksoa


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Improved host diagnostics with actionable remediation guidance
surfaced during installer/onboard preflight.

* **Improvements**
* macOS (Intel) now recommends Docker Desktop; DGX Spark guidance now
uses the standard installer + `nemoclaw onboard`.
* Preflight output shows detected runtime and WSL notes; installer
prints remediation actions and will skip onboarding on blocking issues.

* **Deprecations**
* `nemoclaw deploy`, `nemoclaw setup-spark`, and the legacy bootstrap
wrapper are now deprecated compatibility paths.

* **Documentation**
* Quickstart, troubleshooting, and command reference updated to reflect
installer+onboard flow and deprecation guidance.

* **Tests**
* Added/updated tests covering preflight, deploy compatibility, CLI
aliases, and deploy e2e scenarios.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
gemini2026 pushed a commit to gemini2026/NemoClaw that referenced this pull request Apr 14, 2026
…VIDIA#1368)

<!-- markdownlint-disable MD041 -->
## Summary
- Remove obsolete cgroup v2 cgroupns=host workaround from setup-spark.sh
(handled by OpenShell directly since PR OpenShell#329)
  - Simplify spark-install.md Quick Start to two curl commands
  - Keep docker group check in setup-spark.sh

## Related Issue
<!-- Link to the issue: Fixes #NNN or Closes #NNN. Remove this section
if none. -->

## Changes
- `scripts/setup-spark.sh`: Remove cgroup v2 daemon.json workaround and
Docker restart logic. Keep docker group setup only.
- `spark-install.md`: Replace multi-step Quick Start with two curl
commands. Update troubleshooting and architecture sections to reflect
the cgroup fix is now in OpenShell.

## Type of Change
<!-- Check the one that applies. -->
- [ ] Code change for a new feature, bug fix, or refactor.
- [x] Code change with doc updates.
- [ ] Doc only. Prose changes without code sample modifications.
- [ ] Doc only. Includes code sample changes.

## Testing
<!-- What testing was done? -->
Verified on DGX Spark (cgroup v2) without daemon.json cgroupns
workaround. Gateway container gets `CgroupnsMode: host` from OpenShell
directly.

- [x] `npx prek run --all-files` passes (or equivalently `make check`).
- [x] `npm test` passes.
- [ ] `make docs` builds without warnings. (for doc-only changes)

## Checklist

### General

- [x] I have read and followed the [contributing
guide](https://github.com/NVIDIA/NemoClaw/blob/main/CONTRIBUTING.md).
- [ ] I have read and followed the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md).
(for doc-only changes)

### Code Changes
<!-- Skip if this is a doc-only PR. -->
- [x] Formatters applied — `npx prek run --all-files` auto-fixes
formatting (or `make format` for targeted runs).
- [ ] Tests added or updated for new or changed behavior.
- [x] No secrets, API keys, or credentials committed.
- [x] Doc pages updated for any user-facing behavior changes (new
commands, changed defaults, new features, bug fixes that contradict
existing docs).

### Doc Changes
<!-- Skip if this PR has no doc changes. -->
- [x] Follows the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md).
Try running the `update-docs` agent skill to draft changes while
complying with the style guide. For example, prompt your agent with
"`/update-docs` catch up the docs for the new changes I made in this
PR."
- [ ] New pages include SPDX license header and frontmatter, if creating
a new page.
- [x] Cross-references and links verified.

---
<!-- DCO sign-off (required by CI). Replace with your real name and
email. -->
Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
* Prerequisites simplified: OpenShell CLI is now installed automatically
during the installation process.
* Quick Start flow streamlined with single-command piped installation
from the hosted installer.
* Docker troubleshooting updated with improved container-level
configuration guidance instead of manual daemon edits.
* Removed complex manual Docker setup steps; enhanced architecture
diagrams.

* **Chores**
* Setup scripts improved with streamlined Docker group management and
conditional user messaging.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
Co-authored-by: KJ <kejones@nvidia.com>
gemini2026 pushed a commit to gemini2026/NemoClaw that referenced this pull request Apr 14, 2026
…ty (NVIDIA#1470)

## Summary
- unify installer and onboarding host detection around shared TypeScript
preflight logic
- move `deploy` behavior into TypeScript, thin the Brev compatibility
wrapper, and harden Brev readiness handling
- demote or remove legacy platform-specific setup paths (`setup-spark`,
`brev-setup.sh`) in favor of the canonical installer + onboard flow
- update docs, CLI help, and Brev E2E coverage to match the new behavior

## What Changed
- added shared host assessment and remediation planning in
`src/lib/preflight.ts`
- wired installer and onboard flows to the same host preflight decisions
- changed Podman handling from hard block to unsupported-runtime warning
- migrated deploy logic into `src/lib/deploy.ts`
- updated `nemoclaw deploy` to use the authenticated Brev CLI, current
Brev create flags, explicit GCP provider default, stricter readiness
checks, and standard installer/onboard flow
- removed `scripts/setup-spark.sh` and reduced `scripts/brev-setup.sh`
to a deprecated compatibility wrapper
- updated README/docs/help text and hardened the Brev E2E cleanup path

## Validation
- `npm run build:cli`
- targeted Vitest coverage for `src/lib/preflight.test.ts`,
`src/lib/deploy.test.ts`, `test/install-preflight.test.js`,
`test/cli.test.js`, `test/runner.test.js`
- live Brev validation with `TEST_SUITE=deploy-cli` on
`cpu-e2.4vcpu-16gb`
- confirmed successful end-to-end remote deploy after waiting for Brev
`status=RUNNING`, `build_status=COMPLETED`, `shell_status=READY`

## Related Issues
- Fixes NVIDIA#1377
- Addresses NVIDIA#1330
- Addresses NVIDIA#1390
- Related to NVIDIA#1404

## Credit / Prior Work
This branch builds on ideas and prior work from:
- NVIDIA#1368 by @zyang-dev for simplifying Spark setup and removing the old
cgroup workaround
- NVIDIA#1395 and NVIDIA#1468 by @kjw3 for the thin installer/bootstrap direction
and installer path reliability
- NVIDIA#1450 by @cjagwani for switching Brev flows toward GCP for reliability
- NVIDIA#1383 by @13ernkastel for the current Brev create flag compatibility
work
- NVIDIA#1364 by @WuKongAI-CMU for deploy sync-path fixes
- NVIDIA#1362 and NVIDIA#1266 by @jyaunches for the Brev E2E/launchable
infrastructure direction
- issue ideas from NVIDIA#1377 and NVIDIA#1404 by @zNeill, NVIDIA#1330 by @Marcelo5444,
and NVIDIA#1390 by @ericksoa


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Improved host diagnostics with actionable remediation guidance
surfaced during installer/onboard preflight.

* **Improvements**
* macOS (Intel) now recommends Docker Desktop; DGX Spark guidance now
uses the standard installer + `nemoclaw onboard`.
* Preflight output shows detected runtime and WSL notes; installer
prints remediation actions and will skip onboarding on blocking issues.

* **Deprecations**
* `nemoclaw deploy`, `nemoclaw setup-spark`, and the legacy bootstrap
wrapper are now deprecated compatibility paths.

* **Documentation**
* Quickstart, troubleshooting, and command reference updated to reflect
installer+onboard flow and deprecation guidance.

* **Tests**
* Added/updated tests covering preflight, deploy compatibility, CLI
aliases, and deploy e2e scenarios.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@wscurran wscurran added area: install Install, setup, prerequisites, or uninstall flow area: onboarding Onboarding FSM, provider setup, sandbox launch, or first-run flow platform: dgx-spark Affects DGX Spark hardware or workflows bug-fix PR fixes a bug or regression and removed Platform: DGX Spark labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: install Install, setup, prerequisites, or uninstall flow area: onboarding Onboarding FSM, provider setup, sandbox launch, or first-run flow bug-fix PR fixes a bug or regression platform: dgx-spark Affects DGX Spark hardware or workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants