feat(install): ensure Docker is installed and user is in docker group#3314
Conversation
Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThe installer adds a Linux-only ensure_docker() preflight that verifies ChangesDocker Verification Prerequisite
Sequence Diagram(s)sequenceDiagram
participant User
participant Installer
participant ensure_docker
participant DockerDaemon
participant Systemd
User->>Installer: sees preflight_usage_notice_prompt
Installer->>ensure_docker: call ensure_docker()
ensure_docker->>DockerDaemon: run `docker info`
alt docker available
DockerDaemon-->>ensure_docker: reachable
ensure_docker-->>Installer: return success
else docker missing/unreachable
ensure_docker->>Installer: download & verify install script
ensure_docker->>Installer: run `sudo sh` (install)
ensure_docker->>Systemd: check/enable/start docker unit
Systemd-->>ensure_docker: unit active?
ensure_docker->>ensure_docker: ensure user in `docker` group
alt user added to group
ensure_docker-->>User: print newgrp/re-login guidance and exit 0
else still unreachable
DockerDaemon-->>ensure_docker: unreachable -> error
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/install.sh`:
- Around line 1703-1709: Replace the silent suppression on the systemd start
command: instead of using the trailing "|| true" after "sudo systemctl enable
--now docker", capture the command's exit status and stderr, and log a clear
error message (using the script's info/error helper functions) including the
stderr output if the command fails; keep the subsequent "docker info" check but
provide this explicit log so users see why enabling the docker.service failed
(reference: the "sudo systemctl enable --now docker" invocation and the later
"docker info" check).
- Around line 1695-1701: Replace the direct piping in the Docker install block
(the if ! command -v docker ... sudo sh -c 'curl -fsSL https://get.docker.com |
sh') with a safer flow: curl the script to a temp file, run a verification step
(reuse or add a verify_downloaded_script that checks for a shebang and non-empty
file and optionally SHA-256 if provided), ensure download errors surface
(respecting set -o pipefail) before running sudo sh on the temp file, and delete
the temp file afterward; update the error message inside the Docker install
branch to include failure details if the verification or execution fails.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 784e1f7f-4789-49c7-a616-10cb1a8a547e
📒 Files selected for processing (1)
scripts/install.sh
… errors Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
ericksoa
left a comment
There was a problem hiding this comment.
Thanks for tightening up the Docker bootstrap path. I think this needs one more fix before approval.
scripts/install.sh now uses id -nG "$USER" to decide whether the user is in the docker group. That checks the account's group database entry, not the groups active in the current shell. The common case this flow is trying to handle is: Docker is installed, the user has already been added to docker, but they have not run newgrp docker or logged back in yet. In that state docker info still fails, while id -nG "$USER" already includes docker, so the script skips the group-refresh instructions and falls through to:
Docker is installed but not reachable. Try: sudo systemctl start docker
That sends the user toward the wrong fix when the daemon may already be active; they actually need the same newgrp docker / relogin guidance. Please distinguish active process groups from the persisted user database, e.g. check id -nG for active membership, then if id -nG "$target_user" already contains docker, exit with the group-refresh instructions instead of the daemon error. This should also get a small installer test/stub case so this path does not regress.
…ng docker ready Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/install.sh`:
- Around line 1722-1743: Short-circuit the docker-group logic for root (UID 0)
and stop using the environment $USER: at the top of the block check UID via id
-u and if it equals 0 skip the group membership checks; replace any use of $USER
with the resolved account name from id -un; ensure the id -nG calls that check
persisted/active groups reference the proper username when needed and only call
sudo usermod -aG docker "$(id -un)" when not root, leaving needs_group_refresh
behavior unchanged for non-root accounts.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 6c78c32c-d04e-40b5-99e7-789d58c4e5f8
📒 Files selected for processing (1)
scripts/install.sh
…ser via id -un Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
# Conflicts: # scripts/install.sh
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
test/install-preflight.test.ts (1)
2637-2641: ⚡ Quick winThis test never hits the daemon-start branch.
systemctlScriptdefaultsis-activeto exit0, so this scenario skipssudo systemctl enable --now dockerand only asserts the finaldocker infofailure. If you want coverage for the new daemon remediation path, makeis-activefail here and assert the sudo log captured the enable/start call.Suggested coverage adjustment
- it("reports daemon reachability when the active shell already has docker", () => { - const { result } = runEnsureDockerWithStubs({ + it("tries to start docker when the active shell already has docker access but the daemon is down", () => { + const { result, sudoLog } = runEnsureDockerWithStubs({ dockerScript: `#!/usr/bin/env bash if [ "\${1:-}" = "info" ]; then exit 1; fi exit 0 `, idScript: `#!/usr/bin/env bash case "$*" in "-u") printf '1000\\n' ;; "-un") printf 'alice\\n' ;; "-nG alice") printf 'alice docker\\n' ;; "-nG") printf 'alice docker adm\\n' ;; *) printf 'unexpected id %s\\n' "$*" >&2; exit 99 ;; esac `, + systemctlScript: `#!/usr/bin/env bash +if [ "\${1:-}" = "is-active" ]; then exit 3; fi +if [ "\${1:-}" = "enable" ]; then exit 0; fi +exit 0 +`, }); const output = `${result.stdout}${result.stderr}`; expect(result.status, output).not.toBe(0); + expect(sudoLog).toMatch(/systemctl enable --now docker/); expect(output).toMatch(/Docker is installed but not reachable/); expect(output).toMatch(/sudo systemctl start docker/); expect(output).not.toMatch(/newgrp docker/); });Also applies to: 2730-2752
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/install-preflight.test.ts` around lines 2637 - 2641, The test's mocked systemctlScript currently returns success for "is-active", so the daemon-start remediation path is never exercised; change the systemctlScript used in the test to return a non-zero exit for the "is-active" check (so the code will take the branch that runs sudo systemctl enable --now docker) and add an assertion that the sudo log captured the enable/start invocation (look for the string matching "sudo systemctl enable --now docker" in the test's captured logs). Apply the same change to the parallel test case around the other block (the similar systemctlScript usage at the 2730–2752 area) so both cover the daemon-start branch.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/install.sh`:
- Around line 1753-1758: The exit message when needs_group_refresh=1 currently
tells users to rerun a hardcoded installer one-liner; update the messaging
and/or implementation to instruct the user to re-run the exact command they
originally invoked (preserving flags and environment) instead of the fixed curl
line. Replace the hardcoded "curl -fsSL https://www.nvidia.com/nemoclaw.sh |
bash" suggestion with a reconstructed invocation using the script name and
original arguments (e.g., "$0 $@" or equivalent) and include a note to preserve
any important env vars (NEMOCLAW_AGENT, --non-interactive, --fresh, etc.) so the
re-run uses the same options; keep the same needs_group_refresh exit path and
exit status.
- Around line 1689-1691: The script's fast-path and preflight checks use plain
`docker info`, which respects DOCKER_HOST/contexts; update each occurrence of
`docker info` (the checks in the install.sh preflight/fast-path logic) to
explicitly target the local daemon socket by invoking `docker --host
unix:///var/run/docker.sock info` and, when `$XDG_RUNTIME_DIR` is set
(rootless), try `docker --host unix://$XDG_RUNTIME_DIR/docker.sock info` as a
fallback; ensure the check treats success the same way as before and try the
rootful socket first, then the rootless socket, falling back to the original
`docker info` only if both socket checks fail.
---
Nitpick comments:
In `@test/install-preflight.test.ts`:
- Around line 2637-2641: The test's mocked systemctlScript currently returns
success for "is-active", so the daemon-start remediation path is never
exercised; change the systemctlScript used in the test to return a non-zero exit
for the "is-active" check (so the code will take the branch that runs sudo
systemctl enable --now docker) and add an assertion that the sudo log captured
the enable/start invocation (look for the string matching "sudo systemctl enable
--now docker" in the test's captured logs). Apply the same change to the
parallel test case around the other block (the similar systemctlScript usage at
the 2730–2752 area) so both cover the daemon-start branch.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b8cc6221-f72d-4704-8eb5-03a6a817c62e
📒 Files selected for processing (2)
scripts/install.shtest/install-preflight.test.ts
| # Fast path: docker info works → already set up (root, or already-active group). | ||
| if docker info >/dev/null 2>&1; then | ||
| return 0 |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
Does docker infouse the current Docker context and theDOCKER_HOST/DOCKER_CONTEXT environment variables, and what is the recommended way for a Linux shell script to force a check against the local Docker daemon/socket?
💡 Result:
Yes. docker info (alias of docker system info) connects to whatever daemon endpoint the Docker CLI is configured to use, which is determined by the active Docker context unless overridden by higher-priority connection settings like DOCKER_HOST/--host/-H. The Docker docs state that commands run against the current context “unless overridden with environment variables such as DOCKER_HOST and DOCKER_CONTEXT,” or with --context/--host flags [1]. Also, DOCKER_CONTEXT overrides the context selected via docker context use when the environment variable is set [1][2]. Separately, the CLI reference describes DOCKER_HOST as the “Daemon socket to connect to” and that -c/--context overrides DOCKER_HOST only relative to default context selection; and the context commands explain DOCKER_HOST/DOCKER_CONTEXT and flags affect what daemon is used [3]. The docker info page shows “Client/Context: …” in its output, confirming it reports the client-side context it’s using [4]. Recommended way (Linux shell script) to force a check against the local Docker daemon/socket: run docker info with an explicit host override to the local Unix socket and clear any env overrides that could redirect the CLI. Docker provides troubleshooting guidance that if DOCKER_HOST is set then the Docker client is connecting to a daemon on that host, and recommends unsetting it when you want local behavior [5]. So in a script, do something like: unset DOCKER_HOST DOCKER_CONTEXT export DOCKER_CONTEXT=default # optional; ensures you’re not using a non-local named context # Force local daemon via unix socket (typical rootful Docker): docker --host unix:///var/run/docker.sock info >/dev/null If you’re using rootless Docker, the socket is typically under $XDG_RUNTIME_DIR/docker.sock as documented; you’d use that path instead of /var/run/docker.sock [6]. This approach forces the check against the intended local socket regardless of whatever current context the user/session has selected via docker context use or environment variables such as DOCKER_CONTEXT, because DOCKER_HOST is either unset or the connection is overridden explicitly via --host/socket URI [1][5].
Citations:
- 1: http://docs.docker.com/engine/context/working-with-contexts
- 2: https://docs.docker.com/reference/cli/docker/context/use/
- 3: https://docs.docker.com/engine/reference/commandline/cli/
- 4: https://docs.docker.com/reference/cli/docker/system/info
- 5: https://docs.docker.com/engine/daemon/troubleshoot
- 6: https://docs.docker.com/engine/security/rootless/tips/
🏁 Script executed:
# First, check if the file exists and its size
wc -l scripts/install.sh
# Then read the specific sections mentioned in the review
echo "=== Lines 1680-1695 (around 1689-1691) ==="
sed -n '1680,1695p' scripts/install.sh
echo -e "\n=== Lines 1718-1735 (around 1725-1728) ==="
sed -n '1718,1735p' scripts/install.sh
echo -e "\n=== Lines 1754-1770 (around 1761-1762) ==="
sed -n '1754,1770p' scripts/install.sh
# Check for SPDX header
echo -e "\n=== First 5 lines of file (SPDX header check) ==="
head -5 scripts/install.sh
# Check for shebang
echo -e "\n=== Shebang check ==="
head -1 scripts/install.shRepository: NVIDIA/NemoClaw
Length of output: 2330
Force Docker reachability checks against the local daemon socket, not the active Docker context.
The bare docker info checks on lines 1689-1691, 1725-1728, and 1761-1762 honor the user's current Docker context and DOCKER_HOST environment variables. If the user has Docker pointed at a remote or stale endpoint, the preflight will falsely fail or take the wrong remediation path even though the local daemon is accessible. Explicitly target the local Unix socket using docker --host unix:///var/run/docker.sock info (or the rootless equivalent under $XDG_RUNTIME_DIR/docker.sock).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/install.sh` around lines 1689 - 1691, The script's fast-path and
preflight checks use plain `docker info`, which respects DOCKER_HOST/contexts;
update each occurrence of `docker info` (the checks in the install.sh
preflight/fast-path logic) to explicitly target the local daemon socket by
invoking `docker --host unix:///var/run/docker.sock info` and, when
`$XDG_RUNTIME_DIR` is set (rootless), try `docker --host
unix://$XDG_RUNTIME_DIR/docker.sock info` as a fallback; ensure the check treats
success the same way as before and try the rootful socket first, then the
rootless socket, falling back to the original `docker info` only if both socket
checks fail.
| if [ "$needs_group_refresh" = "1" ]; then | ||
| printf "\n" | ||
| info "Docker group membership is not active in this shell yet. To finish:" | ||
| info " 1) Run: newgrp docker (or log out and log back in)" | ||
| info " 2) Re-run: curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash" | ||
| exit 0 |
There was a problem hiding this comment.
Don’t tell users to rerun a different installer invocation.
This exit path runs after args/env have already been parsed, so the hardcoded one-liner drops flags like --non-interactive, --fresh, NEMOCLAW_AGENT, or explicit acceptance. In headless or customized installs, following this message can change behavior or fail on the next run. Tell the user to rerun the same command they used originally, or reconstruct it from the current args/env.
Suggested tweak
- info " 2) Re-run: curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash"
+ info " 2) Re-run the same installer command you used originally, with the same flags/env."📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if [ "$needs_group_refresh" = "1" ]; then | |
| printf "\n" | |
| info "Docker group membership is not active in this shell yet. To finish:" | |
| info " 1) Run: newgrp docker (or log out and log back in)" | |
| info " 2) Re-run: curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash" | |
| exit 0 | |
| if [ "$needs_group_refresh" = "1" ]; then | |
| printf "\n" | |
| info "Docker group membership is not active in this shell yet. To finish:" | |
| info " 1) Run: newgrp docker (or log out and log back in)" | |
| info " 2) Re-run the same installer command you used originally, with the same flags/env." | |
| exit 0 |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/install.sh` around lines 1753 - 1758, The exit message when
needs_group_refresh=1 currently tells users to rerun a hardcoded installer
one-liner; update the messaging and/or implementation to instruct the user to
re-run the exact command they originally invoked (preserving flags and
environment) instead of the fixed curl line. Replace the hardcoded "curl -fsSL
https://www.nvidia.com/nemoclaw.sh | bash" suggestion with a reconstructed
invocation using the script name and original arguments (e.g., "$0 $@" or
equivalent) and include a note to preserve any important env vars
(NEMOCLAW_AGENT, --non-interactive, --fresh, etc.) so the re-run uses the same
options; keep the same needs_group_refresh exit path and exit status.
ericksoa
left a comment
There was a problem hiding this comment.
Approved after updating the branch in 6a59bfd. The previous blocker is covered now: persisted docker-group membership without active shell membership exits with newgrp/relogin guidance instead of the daemon-start error, while active-group daemon failures still report daemon reachability. I also merged current main to resolve the installer conflict and kept the DGX express prompt path intact. Local validation passed: bash syntax/diff check, installer hash check, focused installer Vitest coverage, and build:cli. GitHub checks are green on the updated head.
## Summary Refreshes the release-prep docs for v0.0.39 based on changes merged since the Friday 4pm doc refresh. Updates the source docs, bumps the docs version metadata, and regenerates the NemoClaw user skills from the refreshed docs. ## Changes - #3314 -> `docs/get-started/prerequisites.md`, `docs/get-started/quickstart.md`, `docs/reference/troubleshooting.md`: Documents installer Docker setup, Docker group activation, and retry guidance. - #3317 -> `docs/get-started/quickstart.md`, `docs/reference/commands.md`: Documents the DGX Spark and DGX Station express install prompt and `NEMOCLAW_NO_EXPRESS`. - #3328 and #3329 -> `docs/security/best-practices.md`, `docs/deployment/sandbox-hardening.md`: Updates sandbox capability hardening docs for the stricter bounding-set and `setpriv` step-down behavior. - #3330, #3335, and #3346 -> `docs/inference/use-local-inference.md`: Documents Windows-host Ollama relaunch behavior, NIM key passthrough, early health-fail diagnostics, and mixed-GPU preflight detail. - #2406, #2883, #3001, #3244, #3267, #3318, #3320, and #3354 -> `docs/about/release-notes.md`: Adds the v0.0.39 release-prep section while keeping the v0.0.38 release notes intact. - Advances the release-prep docs metadata from v0.0.38 to v0.0.39. - Regenerates `.agents/skills/nemoclaw-user-*` from the updated source docs. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [x] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes v0.0.39 * **New Features** * Host alias management commands for easier configuration * Sandbox GPU control options during onboarding * Update command with check and confirmation modes * **Documentation** * Enhanced Linux installer guidance with Docker and group membership handling * Expanded troubleshooting for permission and connectivity issues * Improved capability-dropping security documentation * Updated inference model switching commands * Brev environment-specific troubleshooting * **Improvements** * DGX Spark/Station express install flow * Windows Ollama relay and health-check enhancements * NVIDIA NIM preflight GPU reporting [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3375) <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
On vanilla Linux, install.sh assumed Docker was already installed and the current user could run it without sudo. A user without Docker, or with Docker but not in the
dockergroup, would hit permission errors during phase 3 onboard. This PR adds anensure_dockerstep that runs before phase 1 to install Docker (via the official convenience script) and add the user to the docker group, then exits with instructions to runnewgrp docker(or log out and log back in) so the new group membership is active before re-running install.Changes
scripts/install.sh: newensure_dockerhelper called frommain()before the banner.docker infoalready works (root, or already-active docker group).curl -fsSL https://get.docker.com | shif missing.docker.serviceif not running.dockergroup viasudo usermod -aG docker $USER.1) Run: newgrp docker (or log out and log back in)then2) Re-run: curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesmake docsbuilds without warnings (doc changes only)Signed-off-by: zyang-dev 267119621+zyang-dev@users.noreply.github.com
Summary by CodeRabbit
Chores
Tests