fix(onboard): repair Docker GPU sandbox readiness#3434
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds Docker GPU patching for OpenShell sandboxes on Linux Docker driver gateways, Jetson/Tegra GPU detection and unified-memory handling, an orchestration module to recreate sandbox containers with GPU access and collect diagnostics, create-stream and probe updates, and test/CI additions for GPU runs. ChangesDocker GPU Patching and Jetson Support
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
src/lib/onboard/docker-gpu-patch.ts (1)
576-685: 🏗️ Heavy liftSplit orchestration/diagnostics functions to reduce complexity in a critical path.
These two functions combine many responsibilities (discovery, mode selection, mutation, readiness waiting, file/report generation). Breaking them into smaller units would lower risk and make failure handling easier to reason about.
As per coding guidelines
**/*.{js,ts,tsx}: Keep function complexity low; prefix unused variables with underscore (_).Also applies to: 701-823
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard/docker-gpu-patch.ts` around lines 576 - 685, The function recreateOpenShellDockerSandboxWithGpu is doing too much (discovery, inspect, mode selection, stop/rename, run clone, wait for new ID, wait for exec readiness, cleanup and error decoration); split it into smaller helpers to reduce complexity and improve failure handling: extract container discovery/inspect into a function (e.g., discoverAndInspectContainer that wraps findOpenShellDockerSandboxContainerIds and inspectDockerContainer), extract backup/rename logic into backupAndRenameContainer (uses dockerStop, dockerRename, buildBackupContainerName), extract clone/run-and-wait into runGpuCloneAndWaitForId (uses buildDockerGpuCloneRunArgs, dockerRunDetached, waitForNewContainerId), and extract exec readiness + cleanup into waitForExecReadyAndCleanup (uses waitForOpenShellSandboxExec and dockerRm for backup); ensure the original recreateOpenShellDockerSandboxWithGpu composes these helpers, preserves the DockerGpuPatchFailureContext updates (oldContainerId, backupContainerName, newContainerId, selectedMode, modeAttempts) and still uses decoratePatchError on catch, and prefix any intentionally unused variables with an underscore per guidelines.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 6368-6404: The new verbose Docker GPU patch try/catch and
diagnostic logging in createSandbox() should be moved into one or two small
helpers in a new module (e.g., src/lib/onboard/docker-gpu-patch.ts) so
createSandbox() stays thin: create a helper like
runAndLogDockerGpuPatch(sandboxName, effectiveSandboxGpuConfig, options) that
internally calls dockerGpuPatch.recreateOpenShellDockerSandboxWithGpu(...),
catches errors, calls dockerGpuPatch.collectDockerGpuPatchDiagnostics(...) and
dockerGpuPatch.dockerGpuPatchCleanupCommands(...), performs the
console.error/console.log output now in onboard.ts, and returns a compact result
object (mode label or error/diagnostics info); then replace the current
try/catch blocks in createSandbox() (around
recreateOpenShellDockerSandboxWithGpu and the other two similar blocks) with
single calls to this helper and handle only the small returned result.
- Around line 6509-6523: The diagnostics and cleanup messaging for the Docker
GPU patch should only run when the patch path is active; wrap the block that
calls dockerGpuPatch.collectDockerGpuPatchDiagnostics and
dockerGpuPatch.dockerGpuPatchCleanupCommands in a guard (e.g., if
(useDockerGpuPatch) { ... }) so it only executes when useDockerGpuPatch is
truthy (or alternatively guard on dockerGpuPatchResult being present/activated).
Keep the existing references sandboxName, runCaptureOpenshell,
dockerGpuPatchResult, and the two dockerGpuPatch.* calls inside that guarded
block.
In `@src/lib/onboard/docker-gpu-patch.ts`:
- Around line 217-221: pushNumberFlag currently rejects non-positive numbers
which causes Docker's unlimited swap sentinel HostConfig.MemorySwap = -1 to be
dropped when rebuilding docker run args; update pushNumberFlag to treat -1 as a
valid value (accept Number.isFinite(value) && (value === -1 || value > 0)) and
push the flag with String(value) in that case, and apply the same change to the
other occurrence of this helper used for rebuilding run args so
HostConfig.MemorySwap = -1 is preserved.
---
Nitpick comments:
In `@src/lib/onboard/docker-gpu-patch.ts`:
- Around line 576-685: The function recreateOpenShellDockerSandboxWithGpu is
doing too much (discovery, inspect, mode selection, stop/rename, run clone, wait
for new ID, wait for exec readiness, cleanup and error decoration); split it
into smaller helpers to reduce complexity and improve failure handling: extract
container discovery/inspect into a function (e.g., discoverAndInspectContainer
that wraps findOpenShellDockerSandboxContainerIds and inspectDockerContainer),
extract backup/rename logic into backupAndRenameContainer (uses dockerStop,
dockerRename, buildBackupContainerName), extract clone/run-and-wait into
runGpuCloneAndWaitForId (uses buildDockerGpuCloneRunArgs, dockerRunDetached,
waitForNewContainerId), and extract exec readiness + cleanup into
waitForExecReadyAndCleanup (uses waitForOpenShellSandboxExec and dockerRm for
backup); ensure the original recreateOpenShellDockerSandboxWithGpu composes
these helpers, preserves the DockerGpuPatchFailureContext updates
(oldContainerId, backupContainerName, newContainerId, selectedMode,
modeAttempts) and still uses decoratePatchError on catch, and prefix any
intentionally unused variables with an underscore per guidelines.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b9abc4e9-bbc0-4860-b39b-46f17dd97991
📒 Files selected for processing (11)
src/lib/adapters/docker/container.tssrc/lib/adapters/docker/index.test.tssrc/lib/inference/nim.test.tssrc/lib/inference/nim.tssrc/lib/onboard.tssrc/lib/onboard/docker-gpu-patch.test.tssrc/lib/onboard/docker-gpu-patch.tssrc/lib/onboard/initial-policy.tssrc/lib/sandbox/create-stream.test.tssrc/lib/sandbox/create-stream.tstest/onboard.test.ts
|
❌ Brev E2E (gpu): FAILED on branch |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e/brev-e2e.test.ts`:
- Around line 398-402: The current use of execSync to run a shell-interpolated
piped command is injection-prone; replace the single execSync call with two
execFileSync invocations: call execFileSync('brev',
['search','cpu','--min-vcpu',BREV_MIN_VCPU,'--min-ram',BREV_MIN_RAM,'--min-disk',BREV_MIN_DISK,'--provider',BREV_PROVIDER,'--sort','price'],
{ encoding: 'utf-8', timeout: 180_000 }) to capture the search output, then call
execFileSync('brev', ['create', INSTANCE_NAME, '--startup-script',
`@${setupScriptPath}`, '--detached'], { input: searchOutput, encoding: 'utf-8',
timeout: 180_000, stdio: PIPE_INPUT_STDIO }) so arguments are passed as arrays
(no shell interpolation) and the search output is fed as stdin to create; update
the code that currently uses execSync to use execFileSync and keep the same
timeout/encoding/stdio constants.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ef10bd24-0388-428c-9c61-7784a6e92c0d
📒 Files selected for processing (2)
.github/workflows/e2e-branch-validation.yamltest/e2e/brev-e2e.test.ts
|
❌ Brev E2E (gpu): FAILED on branch |
|
❌ Brev E2E (gpu): FAILED on branch |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Full advisor summaryPi Semantic E2E AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
|
❌ Brev E2E (gpu): FAILED on branch |
|
❌ Brev E2E (gpu): FAILED on branch |
|
❌ Brev E2E (gpu): FAILED on branch |
…x-readiness # Conflicts: # test/onboard.test.ts
|
❌ Brev E2E (gpu): FAILED on branch |
|
❌ Brev E2E (gpu): FAILED on branch |
|
❌ Brev E2E (gpu): FAILED on branch |
Selective E2E Results — ❌ Some jobs failedRun: 25840442921
|
Resolve the two output threads in #3456 left after the core dead-loop fix landed via #3459 + #3434: Sub-bug #3 — `src/lib/onboard.ts` printed `nemoclaw <name> destroy --yes && nemoclaw onboard --gpu` with a literal `<name>` placeholder, and assumed at least one sandbox was registered. When the GPU-passthrough mismatch hit on the State B re-run path with an empty registry (the dead-loop case), the hint was not actionable. Replace with a registry-aware helper at `src/lib/onboard/gpu-recovery.ts` that renders the right shape: - empty registry → suggest `nemoclaw uninstall && nemoclaw onboard --gpu` - one sandbox → suggest destroy --yes --cleanup-gateway for that name - multiple sandboxes → list each, only the last gets --cleanup-gateway Sub-bug #4 — `src/lib/actions/uninstall/run-plan.ts` printed `Destroyed gateway 'nemoclaw' skipped` when the openshell destroy no-op'd (gateway already gone) — the "Destroyed … skipped" wording was self-contradictory. Extend `runOptional` with an `onSkip` option; route the gateway destroy to emit `Gateway 'nemoclaw' already removed or unreachable` on no-op. Tests: - `src/lib/onboard/gpu-recovery.test.ts` (6 tests): forbid literal `<name>` placeholder anywhere in the output; cover empty / single / multi-sandbox cases; defensive filter on whitespace names so a `nemoclaw destroy` rendering can never happen. - `src/lib/actions/uninstall/run-plan.test.ts`: assert the new "already removed or unreachable" wording and the absence of the "Destroyed gateway 'nemoclaw' skipped" string. The core dead loop itself (sub-bugs #1, #2 and State B GPU mismatch) is already addressed by #3459 + #3434 + #3483; #3456 will close once this lands. See the #3456 status comment for the full mapping. Refs #3456. Mirrors (and tightens) the approach in the closed PR #3464, which left the literal `<name>` placeholder in tests per CodeRabbit feedback that was never addressed. Signed-off-by: Charan Jagwani <charjags100@gmail.com>
…e test Why: rebasing onto main pulled in NVIDIA#3434's new "can override the sandbox inference base URL" test, which was written against the pre-WeChat 13-parameter signature of patchStagedDockerfile. The WeChat branch added a wechatConfig: LooseObject parameter at slot 13, so the test's `false` (intended for darwinVmCompat) was landing in wechatConfig and TypeScript flagged the boolean-to-LooseObject mismatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary - Bump the docs metadata and version switcher to `0.0.41`. - Add v0.0.41 release notes plus operator guidance for OpenShell pinning, Docker bridge reachability, Local Ollama proxy reachability, and Docker GPU onboarding diagnostics. - Refresh generated `nemoclaw-user-*` skills from the updated docs. ## Source summary - #3434 -> `docs/reference/commands.md`, `docs/reference/troubleshooting.md`, `docs/about/release-notes.md`: Document Linux Docker-driver GPU onboarding behavior, diagnostics, cleanup guidance, and the `NEMOCLAW_DOCKER_GPU_PATCH` troubleshooting escape hatch. - #3483 -> `docs/about/release-notes.md`: Note that `nemoclaw uninstall` removes all installer-managed OpenShell helper binaries unless `--keep-openshell` is passed. - #3446 -> `docs/reference/commands.md`, `docs/reference/troubleshooting.md`, `docs/about/release-notes.md`: Document blueprint-driven OpenShell install pin resolution and fallback behavior. - #3472 -> `docs/inference/use-local-inference.md`, `docs/reference/troubleshooting.md`, `docs/about/release-notes.md`: Document sandbox-side Local Ollama auth proxy reachability checks and firewall remediation. - #3459 -> `docs/reference/commands.md`, `docs/reference/troubleshooting.md`, `docs/about/release-notes.md`: Document Docker-driver sandbox-to-gateway reachability checks and firewall remediation. ## Test plan - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user` - `make docs` - `git diff --check` - `npm run build:cli` - `npm run typecheck:cli` - pre-commit hooks during `git commit` <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added `nemoclaw inference get` command to check current inference settings * Improved gateway health validation with Linux firewall remediation guidance * **Bug Fixes** * Enhanced proxy readiness validation with sandbox network path probes * Improved local Ollama route onboarding with rerun-safe fixes * Better sandbox-to-gateway connectivity detection * **Documentation** * Expanded troubleshooting guidance for firewall and connectivity issues * Updated CLI reference with new command and environment variable documentation * Added gateway binding and Docker-driver GPU compatibility guidance <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3531) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…3520) > **Draft for visibility.** Issue-autopilot Stages 4-5 of #3456. Will mark ready once batch self-review + CI complete. ## Summary Closes the two remaining output threads in #3456 after the core dead-loop fix already landed on `main` (via #3459, #3434, #3483). Full sub-bug mapping in the [#3456 status comment](#3456 (comment)). - **Sub-bug #3** — `nemoclaw <name> destroy --yes` recovery hint replaced with a registry-aware helper. - **Sub-bug #4** — `Destroyed gateway 'nemoclaw' skipped` self-contradictory wording replaced with `Gateway 'nemoclaw' already removed or unreachable`. ## Acceptance criteria mapping | Sub-bug | Resolution | Evidence | |---|---|---| | #1 dead loop | Already fixed on main (#3459) | out of scope | | #2 firewall diagnostic | Already fixed on main (#3459) | out of scope | | **#3** literal `<name>` placeholder | **This PR** | `src/lib/onboard/gpu-recovery.ts` + `onboard.ts:10387-10405` | | **#4** misleading "skipped" wording | **This PR** | `src/lib/actions/uninstall/run-plan.ts:210-228, 407-414` | | #5 uninstall residuals | Already fixed on main (#3483) | out of scope | ## Behavior matrix `gpuPassthroughRecoveryLines(names)`: | Input | Suggestion | |---|---| | `null` / `[]` | `nemoclaw uninstall && nemoclaw onboard --gpu` | | one sandbox | `nemoclaw <name> destroy --yes --cleanup-gateway && nemoclaw onboard --gpu` | | many sandboxes | each `destroy --yes`, only the last gets `--cleanup-gateway` | ## Test plan ``` npm run typecheck:cli npx vitest run src/lib/onboard/gpu-recovery.test.ts src/lib/actions/uninstall/run-plan.test.ts ``` 22 tests pass (6 new + 16 existing). ## Notes for reviewers - This is the work [#3464 attempted](#3464); that PR was closed without merging after CodeRabbit asked for the `<name>` placeholder to be forbidden in tests via negative assertion. This PR adopts that refinement. - `runOptional` extension is backwards-compatible — existing callers without `onSkip` get the original wording. Closes #3456 once merged. --------- Signed-off-by: Charan Jagwani <charjags100@gmail.com> Co-authored-by: Charan Jagwani <charjags100@gmail.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com>
…d for proc/comm write (#3515) ## Summary The Docker-driver GPU patch ([src/lib/onboard/docker-gpu-patch.ts](src/lib/onboard/docker-gpu-patch.ts)) recreates the OpenShell-managed sandbox container with `--gpus all` plus a reconstruction of the baseline container's flags, then runs a three-step GPU proof in [src/lib/onboard/initial-policy.ts](src/lib/onboard/initial-policy.ts) — including `PROC_COMM_WRITE_PROBE`, which writes to `/proc/<pid>/task/<tid>/comm`. On DGX Spark hosts (#3511) and other Docker-driver Linux hosts where the OpenShell-created baseline container's `CapAdd` does not include `SYS_PTRACE` and its `SecurityOpt` does not include `apparmor=unconfined`, the recreate inherits a flag set the kernel/LSM stack rejects for that proc write. The proof aborts onboarding: ``` ✗ GPU proof failed: /proc/<pid>/task/<tid>/comm write Error: GPU proof failed: /proc/<pid>/task/<tid>/comm write (status 2): sh: 1: cannot create /proc/<pid>/task/<tid>/comm: Permission denied ``` The reporter confirmed a bare `docker run --rm --gpus all ubuntu sh -c "echo test > /proc/1/task/1/comm"` succeeds on the same host, so the kernel itself allows the write under the right Docker flags — the problem is which flags reach the recreated container. This PR makes the GPU recreate self-sufficient for the operations the GPU proof checks, regardless of the non-GPU baseline: - **Always inject `--cap-add SYS_PTRACE`** into the clone, deduped via a `Set` so baselines that already have it stay flat. - **Inject `--security-opt apparmor=unconfined`** only when the baseline did *not* pin a specific apparmor profile. Docker rejects duplicate `--security-opt apparmor=…` entries, so a baseline that explicitly chose `apparmor=docker-default` (or similar) is respected — this stays scoped to the GPU recreate path and does not override deliberate operator choices. ## Related Issue Resolves #3511 Related context: - #3434 — PR that introduced the Docker-driver GPU recreate flow and the `/proc/comm` write probe. This PR hardens the recreate to satisfy what the probe checks across more baselines. ## Changes - `src/lib/onboard/docker-gpu-patch.ts`: dedup `CapAdd` / `SecurityOpt` via `Set`, always inject `SYS_PTRACE`, and inject `apparmor=unconfined` only when no apparmor profile is pinned by the baseline. - `src/lib/onboard/docker-gpu-patch.test.ts`: four new cases — SYS_PTRACE always added; SYS_PTRACE deduped when baseline already has it; apparmor=unconfined injected on empty baselines; baseline-pinned apparmor profile preserved. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) Signed-off-by: Tinson Lai <tinsonl@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * GPU-enabled containers are now configured with the necessary kernel security capabilities during clone/recreate operations. * Security profile handling was fixed to preserve existing security options and avoid unintended overrides or duplicate entries. * **Tests** * Added tests covering GPU container clone behavior across various capability and security-profile scenarios to prevent regressions. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3515) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Tinson Lai <tinsonl@nvidia.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com>
) ## Summary Fixes `host.openshell.internal` DNS resolution inside GPU sandboxes. Closes #3562, #3568. ## Root cause PR #3434 added the Docker GPU patch, which recreates the OpenShell-managed sandbox container with GPU flags after OpenShell creates it without `--gpu`. The recreation gated `--add-host` injection on `networkMode !== "host"`. But `getDockerGpuPatchNetworkMode` defaults to `"host"` (and `NEMOCLAW_DOCKER_GPU_PATCH_NETWORK` is never set), so every GPU recreate silently dropped the OpenShell-injected `host.openshell.internal` entry. Symptom on DGX Spark: `LLM request failed: network connection error` — inference can't reach the Ollama auth proxy because the hostname doesn't resolve in the recreated container. ## The fix `--add-host` writes to `/etc/hosts` in the mount namespace, not the network stack — safe to preserve with `--network=host`. Removed the conditional. `--dns` / `--dns-search` guard kept (those genuinely conflict with host networking). Also flipped `local.test.ts:618` to match the current validator output (pre-existing test drift). ## Tests - All 16/16 `docker-gpu-patch.test.ts` pass. - Existing test 'can switch the recreated sandbox to host networking for OpenShell callbacks' was asserting the bug; now asserts `--add-host` is preserved. 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Docker GPU container networking now defaults to "preserve" unless explicitly set to host, improving predictable network behavior. * Host-to-container callback mapping is always preserved, ensuring external callback endpoints remain reachable after container recreation and improving endpoint resolution. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3623) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Charan Jagwani <cjagwani@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Summary
openshell-gatewaybinary requires newer glibc than the host providesopenshell-gatewayinside anubuntu:24.04Docker container with host networking, the host Docker socket, OpenShell state, and the hostopenshell-sandboxpath mounted through--gpu, then recreates the OpenShell-managed Docker container with Docker GPU flags--gpus, NVIDIA runtime, or CDI based on Docker probes while preserving OpenShell labels and runtime settings/proc/self/task/<tid>/commwrite pluscuInit(0), withnvidia-smioptional for Jetson/Tegra hostsValidation
npm run build:clinpm run typecheck:clinpm run checksnpm run format:checkgit diff --checknpx vitest run src/lib/onboard/docker-driver-gateway-launch.test.ts src/lib/onboard/docker-gpu-patch.test.ts test/onboard.test.tsnpx vitest run test/docker-abstraction-guard.test.ts src/lib/onboard/docker-driver-gateway-launch.test.tsNotes
GLIBC_2.38/GLIBC_2.39required, Jammy has2.35).Summary by CodeRabbit
New Features
Improvements
Tests
Chores