fix(onboard): repair Docker GPU sandbox readiness by ericksoa · Pull Request #3434 · NVIDIA/NemoClaw

ericksoa · 2026-05-13T02:42:49Z

Summary

add a NemoClaw-side OpenShell gateway compatibility patch for Ubuntu 22.04/Brev-style Linux hosts where the installed openshell-gateway binary requires newer glibc than the host provides
when needed, run openshell-gateway inside an ubuntu:24.04 Docker container with host networking, the host Docker socket, OpenShell state, and the host openshell-sandbox path mounted through
add a NemoClaw-side Docker GPU patch for Linux Docker-driver sandbox onboarding that creates without OpenShell --gpu, then recreates the OpenShell-managed Docker container with Docker GPU flags
select --gpus, NVIDIA runtime, or CDI based on Docker probes while preserving OpenShell labels and runtime settings
make GPU proof require /proc/self/task/<tid>/comm write plus cuInit(0), with nvidia-smi optional for Jetson/Tegra hosts
leave failed GPU sandboxes/containers in place with diagnostics and a cleanup command instead of auto-deleting them

Validation

npm run build:cli
npm run typecheck:cli
npm run checks
npm run format:check
git diff --check
npx vitest run src/lib/onboard/docker-driver-gateway-launch.test.ts src/lib/onboard/docker-gpu-patch.test.ts test/onboard.test.ts
npx vitest run test/docker-abstraction-guard.test.ts src/lib/onboard/docker-driver-gateway-launch.test.ts
dispatched Brev GPU validation on current PR head: https://github.com/NVIDIA/NemoClaw/actions/runs/25781993350

Notes

Brev Ubuntu 22.04 reproduces the OpenShell gateway glibc blocker (GLIBC_2.38/GLIBC_2.39 required, Jammy has 2.35).
Ubuntu 24.04 GPU validation reportedly gets past gateway startup and sees the GPU inside the sandbox, which matches the glibc diagnosis.
This patch keeps the U22 path viable without requiring the user or launchable image to move to Ubuntu 24.04 first.

Summary by CodeRabbit

New Features
- Docker container rename command; Jetson/Tegra GPU detection; Docker GPU‑patch flow to enable GPU access for sandboxes.
Improvements
- Optional nvidia‑smi probe; configurable sandbox readiness timeout; sandbox creation can omit legacy --gpu flag when using patch; improved GPU patch diagnostics and manual cleanup guidance; quicker create-phase transition detection.
Tests
- New and updated unit and E2E GPU tests and harness updates.
Chores
- CI workflow adds a GPU test-suite option.

coderabbitai · 2026-05-13T02:43:00Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds Docker GPU patching for OpenShell sandboxes on Linux Docker driver gateways, Jetson/Tegra GPU detection and unified-memory handling, an orchestration module to recreate sandbox containers with GPU access and collect diagnostics, create-stream and probe updates, and test/CI additions for GPU runs.

Changes

Docker GPU Patching and Jetson Support

Layer / File(s)	Summary
Docker container rename helper `src/lib/adapters/docker/container.ts`, `src/lib/adapters/docker/index.test.ts`	New `dockerRename` function wraps `dockerRun` with the `rename` command to back up containers before GPU-mode recreation.
NVIDIA Jetson/Tegra hardware detection `src/lib/inference/nim.ts`, `src/lib/inference/nim.test.ts`	Extended `UNIFIED_MEMORY_GPU_TAGS` to include `Jetson` and `Tegra`, added `jetson` to `NvidiaPlatform` type, and implemented host detection helpers (`readHostMemoryMB`, `hasTegraDeviceNodeSignal`, `detectTegraHostGpu`, `detectNvidiaPlatform` updates). Added test validating Jetson detection via firmware/device-node signals and host RAM fallback.
Docker GPU patch orchestration module `src/lib/onboard/docker-gpu-patch.ts`, `src/lib/onboard/docker-gpu-patch.test.ts`	New ~800-line module implementing GPU mode probing and selection (`gpus` → `nvidia-runtime` → `cdi`), container recreation (stop/rename backup → run GPU-enabled clone preserving relevant settings), OpenShell reconnection polling, diagnostics collection (timestamped failure dir with Docker ps/inspect/logs and optional OpenShell captures), and public APIs for mode builders, selection, cleanup commands, and diagnostics. Tests validate mode selection, arg construction, CDI probing, and orchestration.
Sandbox readiness timeout and GPU patch integration `src/lib/onboard.ts`	Added `getSandboxReadyTimeoutSecs` (supports `NEMOCLAW_SANDBOX_READY_TIMEOUT`), updated `buildSandboxGpuCreateArgs` to accept `suppressGpuFlag`, wired `dockerGpuPatch.recreateOpenShellDockerSandboxWithGpu` into sandbox creation when applicable, adjusted readiness polling to use dynamic timeout, and added failure diagnostics/manual cleanup guidance for GPU-patch mode. Exported new timeout helper.
Stream phase transition and optional NVIDIA probe `src/lib/onboard/initial-policy.ts`, `src/lib/sandbox/create-stream.ts`, `src/lib/sandbox/create-stream.test.ts`	Added `NVIDIA_SMI_OPTIONAL_PROBE` (runs `nvidia-smi` only if present) and switched first GPU proof command to use it; updated create-stream to transition to `"create"` phase on `Built image` lines, and added a timer-driven test for the transition and heartbeats.
Test harness and integration updates `test/onboard.test.ts`, `.github/workflows/e2e-branch-validation.yaml`, `test/e2e/brev-e2e.test.ts`	Extended test harness internals to expose `getSandboxReadyTimeoutSecs` and `buildSandboxGpuCreateArgs` options, updated tests to reflect optional `nvidia-smi` probe and timeout overrides, added GPU path CI/workflow choices and Brev GPU E2E orchestration including GPU VM provisioning and runtime preparation.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

Docker, fix, E2E, CI/CD

Suggested reviewers

cv
jyaunches
prekshivyas

Poem

🐇 I hopped into code with a gleam,
I nudged Jetson chips into the scheme,
Containers renamed and modes explored,
Sandboxes rebuilt while diagnostics roared—
A carrot-streaked patch, now the GPUs dream.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 8.06% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title 'fix(onboard): repair Docker GPU sandbox readiness' is clearly related to the main change—a comprehensive Docker GPU patching system for Linux Docker-driver sandbox onboarding. The title accurately summarizes the primary focus.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/docker-gpu-sandbox-readiness

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

cv

Ready on green

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

src/lib/onboard/docker-gpu-patch.ts (1)

576-685: 🏗️ Heavy lift

Split orchestration/diagnostics functions to reduce complexity in a critical path.

These two functions combine many responsibilities (discovery, mode selection, mutation, readiness waiting, file/report generation). Breaking them into smaller units would lower risk and make failure handling easier to reason about.

As per coding guidelines **/*.{js,ts,tsx}: Keep function complexity low; prefix unused variables with underscore (_).

Also applies to: 701-823

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/docker-gpu-patch.ts` around lines 576 - 685, The function
recreateOpenShellDockerSandboxWithGpu is doing too much (discovery, inspect,
mode selection, stop/rename, run clone, wait for new ID, wait for exec
readiness, cleanup and error decoration); split it into smaller helpers to
reduce complexity and improve failure handling: extract container
discovery/inspect into a function (e.g., discoverAndInspectContainer that wraps
findOpenShellDockerSandboxContainerIds and inspectDockerContainer), extract
backup/rename logic into backupAndRenameContainer (uses dockerStop,
dockerRename, buildBackupContainerName), extract clone/run-and-wait into
runGpuCloneAndWaitForId (uses buildDockerGpuCloneRunArgs, dockerRunDetached,
waitForNewContainerId), and extract exec readiness + cleanup into
waitForExecReadyAndCleanup (uses waitForOpenShellSandboxExec and dockerRm for
backup); ensure the original recreateOpenShellDockerSandboxWithGpu composes
these helpers, preserves the DockerGpuPatchFailureContext updates
(oldContainerId, backupContainerName, newContainerId, selectedMode,
modeAttempts) and still uses decoratePatchError on catch, and prefix any
intentionally unused variables with an underscore per guidelines.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 6368-6404: The new verbose Docker GPU patch try/catch and
diagnostic logging in createSandbox() should be moved into one or two small
helpers in a new module (e.g., src/lib/onboard/docker-gpu-patch.ts) so
createSandbox() stays thin: create a helper like
runAndLogDockerGpuPatch(sandboxName, effectiveSandboxGpuConfig, options) that
internally calls dockerGpuPatch.recreateOpenShellDockerSandboxWithGpu(...),
catches errors, calls dockerGpuPatch.collectDockerGpuPatchDiagnostics(...) and
dockerGpuPatch.dockerGpuPatchCleanupCommands(...), performs the
console.error/console.log output now in onboard.ts, and returns a compact result
object (mode label or error/diagnostics info); then replace the current
try/catch blocks in createSandbox() (around
recreateOpenShellDockerSandboxWithGpu and the other two similar blocks) with
single calls to this helper and handle only the small returned result.
- Around line 6509-6523: The diagnostics and cleanup messaging for the Docker
GPU patch should only run when the patch path is active; wrap the block that
calls dockerGpuPatch.collectDockerGpuPatchDiagnostics and
dockerGpuPatch.dockerGpuPatchCleanupCommands in a guard (e.g., if
(useDockerGpuPatch) { ... }) so it only executes when useDockerGpuPatch is
truthy (or alternatively guard on dockerGpuPatchResult being present/activated).
Keep the existing references sandboxName, runCaptureOpenshell,
dockerGpuPatchResult, and the two dockerGpuPatch.* calls inside that guarded
block.

In `@src/lib/onboard/docker-gpu-patch.ts`:
- Around line 217-221: pushNumberFlag currently rejects non-positive numbers
which causes Docker's unlimited swap sentinel HostConfig.MemorySwap = -1 to be
dropped when rebuilding docker run args; update pushNumberFlag to treat -1 as a
valid value (accept Number.isFinite(value) && (value === -1 || value > 0)) and
push the flag with String(value) in that case, and apply the same change to the
other occurrence of this helper used for rebuilding run args so
HostConfig.MemorySwap = -1 is preserved.

---

Nitpick comments:
In `@src/lib/onboard/docker-gpu-patch.ts`:
- Around line 576-685: The function recreateOpenShellDockerSandboxWithGpu is
doing too much (discovery, inspect, mode selection, stop/rename, run clone, wait
for new ID, wait for exec readiness, cleanup and error decoration); split it
into smaller helpers to reduce complexity and improve failure handling: extract
container discovery/inspect into a function (e.g., discoverAndInspectContainer
that wraps findOpenShellDockerSandboxContainerIds and inspectDockerContainer),
extract backup/rename logic into backupAndRenameContainer (uses dockerStop,
dockerRename, buildBackupContainerName), extract clone/run-and-wait into
runGpuCloneAndWaitForId (uses buildDockerGpuCloneRunArgs, dockerRunDetached,
waitForNewContainerId), and extract exec readiness + cleanup into
waitForExecReadyAndCleanup (uses waitForOpenShellSandboxExec and dockerRm for
backup); ensure the original recreateOpenShellDockerSandboxWithGpu composes
these helpers, preserves the DockerGpuPatchFailureContext updates
(oldContainerId, backupContainerName, newContainerId, selectedMode,
modeAttempts) and still uses decoratePatchError on catch, and prefix any
intentionally unused variables with an underscore per guidelines.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b9abc4e9-bbc0-4860-b39b-46f17dd97991

📥 Commits

Reviewing files that changed from the base of the PR and between 6f4e276 and 6bf3354.

📒 Files selected for processing (11)

src/lib/adapters/docker/container.ts
src/lib/adapters/docker/index.test.ts
src/lib/inference/nim.test.ts
src/lib/inference/nim.ts
src/lib/onboard.ts
src/lib/onboard/docker-gpu-patch.test.ts
src/lib/onboard/docker-gpu-patch.ts
src/lib/onboard/initial-policy.ts
src/lib/sandbox/create-stream.test.ts
src/lib/sandbox/create-stream.ts
test/onboard.test.ts

github-actions · 2026-05-13T02:52:25Z

❌ Brev E2E (gpu): FAILED on branch fix/docker-gpu-sandbox-readiness — See logs

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/brev-e2e.test.ts`:
- Around line 398-402: The current use of execSync to run a shell-interpolated
piped command is injection-prone; replace the single execSync call with two
execFileSync invocations: call execFileSync('brev',
['search','cpu','--min-vcpu',BREV_MIN_VCPU,'--min-ram',BREV_MIN_RAM,'--min-disk',BREV_MIN_DISK,'--provider',BREV_PROVIDER,'--sort','price'],
{ encoding: 'utf-8', timeout: 180_000 }) to capture the search output, then call
execFileSync('brev', ['create', INSTANCE_NAME, '--startup-script',
`@${setupScriptPath}`, '--detached'], { input: searchOutput, encoding: 'utf-8',
timeout: 180_000, stdio: PIPE_INPUT_STDIO }) so arguments are passed as arrays
(no shell interpolation) and the search output is fed as stdin to create; update
the code that currently uses execSync to use execFileSync and keep the same
timeout/encoding/stdio constants.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ef10bd24-0388-428c-9c61-7784a6e92c0d

📥 Commits

Reviewing files that changed from the base of the PR and between 6bf3354 and e8690db.

📒 Files selected for processing (2)

.github/workflows/e2e-branch-validation.yaml
test/e2e/brev-e2e.test.ts

github-actions · 2026-05-13T02:56:44Z

❌ Brev E2E (gpu): FAILED on branch fix/docker-gpu-sandbox-readiness — See logs

github-actions · 2026-05-13T03:01:46Z

❌ Brev E2E (gpu): FAILED on branch fix/docker-gpu-sandbox-readiness — See logs

github-actions · 2026-05-13T03:08:58Z

E2E Advisor Recommendation

Required E2E: e2e-branch-validation:gpu, nightly-e2e:gpu-e2e, nightly-e2e:gpu-double-onboard-e2e, e2e-branch-validation:full
Optional E2E: ollama-proxy-e2e:ollama-proxy-e2e, nightly-e2e:inference-routing-e2e, nightly-e2e:openclaw-inference-switch-e2e, nightly-e2e:sandbox-operations-e2e, nightly-e2e:onboard-resume-e2e, regression-e2e:gateway-health-honest-e2e, nightly-e2e:overlayfs-autofix-e2e

Dispatch hint: gpu-e2e,gpu-double-onboard-e2e,inference-routing-e2e,sandbox-operations-e2e,onboard-resume-e2e,overlayfs-autofix-e2e

Workflow run

Full advisor summary

Pi Semantic E2E Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

e2e-branch-validation:gpu (high): This PR introduces the 'gpu' test_suite in e2e-branch-validation.yaml and the entire Linux Docker-driver GPU sandbox stack (docker-gpu-patch, docker-gpu-sandbox-create, docker-gpu-local-inference, docker-driver-gateway-launch host-mode, sandbox bridge reachability skip). The new suite provisions a Brev GPU VM and runs the Ollama GPU sandbox proof from source; without it the new code paths are unvalidated end-to-end.
nightly-e2e:gpu-e2e (high): Existing Ollama-on-GPU sandbox proof on an NVKS GPU runner. Directly exercises local.ts host-URL selection, docker-gpu-local-inference, sandbox GPU create args, and the host-network gateway path. The most authoritative existing GPU regression for this PR.
nightly-e2e:gpu-double-onboard-e2e (high): Re-onboard with Ollama proxy token consistency (ollama proxy token diverges from stored token after re-onboard, causing persistent HTTP 401 on inference #2553). onboard.ts and local inference URL helpers were both refactored; double-onboard exercises gateway reuse, drift detection, and Ollama proxy state — all areas this PR rewrites for the Docker-driver path.
e2e-branch-validation:full (medium): onboard.ts, dockerfile-patch.ts, initial-policy.ts, sandbox-gpu-create.ts, and the gateway launch path all changed. The 'full' Brev branch-validation suite is the standard gate that installs from source on a clean VM and runs install→onboard→sandbox verify→inference→CLI ops, catching regressions in the refactored onboard/gateway paths on a real Linux host.

Optional E2E

ollama-proxy-e2e:ollama-proxy-e2e (low): local.ts host-URL behavior and docker-gpu-local-inference touch the Ollama auth proxy routing. Useful confidence check that the existing Ollama proxy flow still works with the new sandbox host URL override.
nightly-e2e:inference-routing-e2e (medium): Validates inference.local routing into the sandbox; getLocalProviderBaseUrl and host-URL normalization changes could subtly shift routing behavior even on non-GPU hosts.
nightly-e2e:openclaw-inference-switch-e2e (medium): Inference switching writes the host URL into openclaw.json; the new sandbox host URL override path is adjacent to this code.
nightly-e2e:sandbox-operations-e2e (medium): create-stream.ts and the sandbox create argv builder changed (new --gpu/--gpu-device handling, docker rename helper). Sandbox operations exercises that lifecycle.
nightly-e2e:onboard-resume-e2e (medium): onboard.ts now records sandbox GPU mode/device and resumes from registry — resume coverage is relevant.
regression-e2e:gateway-health-honest-e2e (medium): Gateway launch identity, drift detection, and reachability checks were rewritten. Health-honesty regression catches false-positive 'gateway healthy' logs on Linux Docker-driver hosts.
nightly-e2e:overlayfs-autofix-e2e (medium): getGatewayStartEnv path is touched indirectly through Docker-driver gateway env handling; overlayfs autofix consumes that env on Linux.

New E2E recommendations

linux-docker-driver-gpu-sandbox (medium): The new Docker-driver GPU host-network mode (NEMOCLAW_DOCKER_GPU_PATCH=host) replaces the sandbox bridge with the host network and routes inference at 127.0.0.1 instead of host.openshell.internal. There is no existing E2E that asserts (a) the sandbox can reach the gateway when bridge reachability is skipped, AND (b) Ollama is reachable from inside the sandbox via the new host URL, AND (c) the escape-hatch NEMOCLAw_DOCKER_GPU_PATCH=0 falls back cleanly. test-gpu-e2e.sh covers the happy path but not the patch-disable fallback.
- Suggested test: test/e2e/test-gpu-docker-driver-patch-disable.sh — onboard with NEMOCLAW_DOCKER_GPU_PATCH=0 on a Linux Docker-driver GPU host; assert onboard either succeeds without host-network patching or fails with the documented escape-hatch guidance, and that the sandbox bridge gateway reachability check runs.
gpu-detection (low): detectTegraHostGpu() now classifies hosts as platform=jetson when nvidia-smi is absent but Tegra device nodes (/dev/nvhost-gpu) or firmware model indicate Jetson. No existing E2E exercises a host without nvidia-smi, so this fallback is unit-tested only.
- Suggested test: Add a Jetson/Tegra preflight smoke (mocked or on a Jetson runner if available) that asserts nemoclaw onboard preflight reports 'NVIDIA GPU detected' and platform=jetson when nvidia-smi is missing but Tegra device nodes exist.
docker-driver-gateway (medium): buildDockerDriverGatewayRuntimeIdentity now produces distinct drift/identity binaries for host-mode launches. Existing gateway-health-honest-e2e covers GLIBC-style crashes but not the host-mode identity drift case where a previous bridge-mode gateway is still listening on the port.
- Suggested test: Regression test: pre-start a bridge-mode gateway, then run onboard with GPU host-mode patch enabled; assert the stale gateway is detected as drift and recreated rather than reused.

Dispatch hint

Workflow: nightly-e2e.yaml
jobs input: gpu-e2e,gpu-double-onboard-e2e,inference-routing-e2e,sandbox-operations-e2e,onboard-resume-e2e,overlayfs-autofix-e2e

github-actions · 2026-05-13T03:18:41Z

❌ Brev E2E (gpu): FAILED on branch fix/docker-gpu-sandbox-readiness — See logs

github-actions · 2026-05-13T03:37:12Z

❌ Brev E2E (gpu): FAILED on branch fix/docker-gpu-sandbox-readiness — See logs

github-actions · 2026-05-13T03:56:03Z

❌ Brev E2E (gpu): FAILED on branch fix/docker-gpu-sandbox-readiness — See logs

…x-readiness # Conflicts: # test/onboard.test.ts

github-actions · 2026-05-13T04:28:29Z

❌ Brev E2E (gpu): FAILED on branch fix/docker-gpu-sandbox-readiness — See logs

github-actions · 2026-05-13T04:37:23Z

❌ Brev E2E (gpu): FAILED on branch fix/docker-gpu-sandbox-readiness — See logs

github-actions · 2026-05-13T04:41:16Z

❌ Brev E2E (gpu): FAILED on branch fix/docker-gpu-sandbox-readiness — See logs

github-actions · 2026-05-14T04:17:35Z

Selective E2E Results — ❌ Some jobs failed

Run: 25840442921
Target ref: a7aff406df029f96806c8cd68d264f332db32b0b
Workflow ref: main
Requested jobs: all (no filter)
Summary: 38 passed, 1 failed, 2 skipped

Job	Result
brave-search-e2e	✅ success
cloud-e2e	✅ success
cloud-inference-e2e	✅ success
cloud-onboard-e2e	✅ success
credential-migration-e2e	✅ success
credential-sanitization-e2e	✅ success
deployment-services-e2e	✅ success
device-auth-health-e2e	✅ success
diagnostics-e2e	✅ success
docs-validation-e2e	✅ success
double-onboard-e2e	✅ success
gpu-double-onboard-e2e	⏭️ skipped
gpu-e2e	⏭️ skipped
hermes-discord-e2e	✅ success
hermes-e2e	✅ success
hermes-inference-switch-e2e	✅ success
hermes-slack-e2e	✅ success
inference-routing-e2e	✅ success
issue-2478-crash-loop-recovery-e2e	✅ success
kimi-inference-compat-e2e	✅ success
launchable-smoke-e2e	✅ success
messaging-compatible-endpoint-e2e	✅ success
messaging-providers-e2e	✅ success
network-policy-e2e	✅ success
onboard-repair-e2e	✅ success
onboard-resume-e2e	✅ success
openclaw-inference-switch-e2e	❌ failure
openshell-gateway-upgrade-e2e	✅ success
overlayfs-autofix-e2e	✅ success
rebuild-hermes-e2e	✅ success
rebuild-hermes-stale-base-e2e	✅ success
rebuild-openclaw-e2e	✅ success
runtime-overrides-e2e	✅ success
sandbox-operations-e2e	✅ success
sandbox-survival-e2e	✅ success
shields-config-e2e	✅ success
skill-agent-e2e	✅ success
snapshot-commands-e2e	✅ success
telegram-injection-e2e	✅ success
token-rotation-e2e	✅ success
upgrade-stale-sandbox-e2e	✅ success

Failed jobs: openclaw-inference-switch-e2e. Check run artifacts for logs.

Resolve the two output threads in #3456 left after the core dead-loop fix landed via #3459 + #3434: Sub-bug #3 — `src/lib/onboard.ts` printed `nemoclaw <name> destroy --yes && nemoclaw onboard --gpu` with a literal `<name>` placeholder, and assumed at least one sandbox was registered. When the GPU-passthrough mismatch hit on the State B re-run path with an empty registry (the dead-loop case), the hint was not actionable. Replace with a registry-aware helper at `src/lib/onboard/gpu-recovery.ts` that renders the right shape: - empty registry → suggest `nemoclaw uninstall && nemoclaw onboard --gpu` - one sandbox → suggest destroy --yes --cleanup-gateway for that name - multiple sandboxes → list each, only the last gets --cleanup-gateway Sub-bug #4 — `src/lib/actions/uninstall/run-plan.ts` printed `Destroyed gateway 'nemoclaw' skipped` when the openshell destroy no-op'd (gateway already gone) — the "Destroyed … skipped" wording was self-contradictory. Extend `runOptional` with an `onSkip` option; route the gateway destroy to emit `Gateway 'nemoclaw' already removed or unreachable` on no-op. Tests: - `src/lib/onboard/gpu-recovery.test.ts` (6 tests): forbid literal `<name>` placeholder anywhere in the output; cover empty / single / multi-sandbox cases; defensive filter on whitespace names so a `nemoclaw destroy` rendering can never happen. - `src/lib/actions/uninstall/run-plan.test.ts`: assert the new "already removed or unreachable" wording and the absence of the "Destroyed gateway 'nemoclaw' skipped" string. The core dead loop itself (sub-bugs #1, #2 and State B GPU mismatch) is already addressed by #3459 + #3434 + #3483; #3456 will close once this lands. See the #3456 status comment for the full mapping. Refs #3456. Mirrors (and tightens) the approach in the closed PR #3464, which left the literal `<name>` placeholder in tests per CodeRabbit feedback that was never addressed. Signed-off-by: Charan Jagwani <charjags100@gmail.com>

…e test Why: rebasing onto main pulled in NVIDIA#3434's new "can override the sandbox inference base URL" test, which was written against the pre-WeChat 13-parameter signature of patchStagedDockerfile. The WeChat branch added a wechatConfig: LooseObject parameter at slot 13, so the test's `false` (intended for darwinVmCompat) was landing in wechatConfig and TypeScript flagged the boolean-to-LooseObject mismatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

## Summary - Bump the docs metadata and version switcher to `0.0.41`. - Add v0.0.41 release notes plus operator guidance for OpenShell pinning, Docker bridge reachability, Local Ollama proxy reachability, and Docker GPU onboarding diagnostics. - Refresh generated `nemoclaw-user-*` skills from the updated docs. ## Source summary - #3434 -> `docs/reference/commands.md`, `docs/reference/troubleshooting.md`, `docs/about/release-notes.md`: Document Linux Docker-driver GPU onboarding behavior, diagnostics, cleanup guidance, and the `NEMOCLAW_DOCKER_GPU_PATCH` troubleshooting escape hatch. - #3483 -> `docs/about/release-notes.md`: Note that `nemoclaw uninstall` removes all installer-managed OpenShell helper binaries unless `--keep-openshell` is passed. - #3446 -> `docs/reference/commands.md`, `docs/reference/troubleshooting.md`, `docs/about/release-notes.md`: Document blueprint-driven OpenShell install pin resolution and fallback behavior. - #3472 -> `docs/inference/use-local-inference.md`, `docs/reference/troubleshooting.md`, `docs/about/release-notes.md`: Document sandbox-side Local Ollama auth proxy reachability checks and firewall remediation. - #3459 -> `docs/reference/commands.md`, `docs/reference/troubleshooting.md`, `docs/about/release-notes.md`: Document Docker-driver sandbox-to-gateway reachability checks and firewall remediation. ## Test plan - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user` - `make docs` - `git diff --check` - `npm run build:cli` - `npm run typecheck:cli` - pre-commit hooks during `git commit`  ## Summary by CodeRabbit * **New Features** * Added `nemoclaw inference get` command to check current inference settings * Improved gateway health validation with Linux firewall remediation guidance * **Bug Fixes** * Enhanced proxy readiness validation with sandbox network path probes * Improved local Ollama route onboarding with rerun-safe fixes * Better sandbox-to-gateway connectivity detection * **Documentation** * Expanded troubleshooting guidance for firewall and connectivity issues * Updated CLI reference with new command and environment variable documentation * Added gateway binding and Docker-driver GPU compatibility guidance  [![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3531)

…3520) > **Draft for visibility.** Issue-autopilot Stages 4-5 of #3456. Will mark ready once batch self-review + CI complete. ## Summary Closes the two remaining output threads in #3456 after the core dead-loop fix already landed on `main` (via #3459, #3434, #3483). Full sub-bug mapping in the [#3456 status comment](#3456 (comment)). - **Sub-bug #3** — `nemoclaw <name> destroy --yes` recovery hint replaced with a registry-aware helper. - **Sub-bug #4** — `Destroyed gateway 'nemoclaw' skipped` self-contradictory wording replaced with `Gateway 'nemoclaw' already removed or unreachable`. ## Acceptance criteria mapping | Sub-bug | Resolution | Evidence | |---|---|---| | #1 dead loop | Already fixed on main (#3459) | out of scope | | #2 firewall diagnostic | Already fixed on main (#3459) | out of scope | | **#3** literal `<name>` placeholder | **This PR** | `src/lib/onboard/gpu-recovery.ts` + `onboard.ts:10387-10405` | | **#4** misleading "skipped" wording | **This PR** | `src/lib/actions/uninstall/run-plan.ts:210-228, 407-414` | | #5 uninstall residuals | Already fixed on main (#3483) | out of scope | ## Behavior matrix `gpuPassthroughRecoveryLines(names)`: | Input | Suggestion | |---|---| | `null` / `[]` | `nemoclaw uninstall && nemoclaw onboard --gpu` | | one sandbox | `nemoclaw <name> destroy --yes --cleanup-gateway && nemoclaw onboard --gpu` | | many sandboxes | each `destroy --yes`, only the last gets `--cleanup-gateway` | ## Test plan ``` npm run typecheck:cli npx vitest run src/lib/onboard/gpu-recovery.test.ts src/lib/actions/uninstall/run-plan.test.ts ``` 22 tests pass (6 new + 16 existing). ## Notes for reviewers - This is the work [#3464 attempted](#3464); that PR was closed without merging after CodeRabbit asked for the `<name>` placeholder to be forbidden in tests via negative assertion. This PR adopts that refinement. - `runOptional` extension is backwards-compatible — existing callers without `onSkip` get the original wording. Closes #3456 once merged. --------- Signed-off-by: Charan Jagwani <charjags100@gmail.com> Co-authored-by: Charan Jagwani <charjags100@gmail.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com>

…d for proc/comm write (#3515) ## Summary The Docker-driver GPU patch ([src/lib/onboard/docker-gpu-patch.ts](src/lib/onboard/docker-gpu-patch.ts)) recreates the OpenShell-managed sandbox container with `--gpus all` plus a reconstruction of the baseline container's flags, then runs a three-step GPU proof in [src/lib/onboard/initial-policy.ts](src/lib/onboard/initial-policy.ts) — including `PROC_COMM_WRITE_PROBE`, which writes to `/proc/<pid>/task/<tid>/comm`. On DGX Spark hosts (#3511) and other Docker-driver Linux hosts where the OpenShell-created baseline container's `CapAdd` does not include `SYS_PTRACE` and its `SecurityOpt` does not include `apparmor=unconfined`, the recreate inherits a flag set the kernel/LSM stack rejects for that proc write. The proof aborts onboarding: ``` ✗ GPU proof failed: /proc/<pid>/task/<tid>/comm write Error: GPU proof failed: /proc/<pid>/task/<tid>/comm write (status 2): sh: 1: cannot create /proc/<pid>/task/<tid>/comm: Permission denied ``` The reporter confirmed a bare `docker run --rm --gpus all ubuntu sh -c "echo test > /proc/1/task/1/comm"` succeeds on the same host, so the kernel itself allows the write under the right Docker flags — the problem is which flags reach the recreated container. This PR makes the GPU recreate self-sufficient for the operations the GPU proof checks, regardless of the non-GPU baseline: - **Always inject `--cap-add SYS_PTRACE`** into the clone, deduped via a `Set` so baselines that already have it stay flat. - **Inject `--security-opt apparmor=unconfined`** only when the baseline did *not* pin a specific apparmor profile. Docker rejects duplicate `--security-opt apparmor=…` entries, so a baseline that explicitly chose `apparmor=docker-default` (or similar) is respected — this stays scoped to the GPU recreate path and does not override deliberate operator choices. ## Related Issue Resolves #3511 Related context: - #3434 — PR that introduced the Docker-driver GPU recreate flow and the `/proc/comm` write probe. This PR hardens the recreate to satisfy what the probe checks across more baselines. ## Changes - `src/lib/onboard/docker-gpu-patch.ts`: dedup `CapAdd` / `SecurityOpt` via `Set`, always inject `SYS_PTRACE`, and inject `apparmor=unconfined` only when no apparmor profile is pinned by the baseline. - `src/lib/onboard/docker-gpu-patch.test.ts`: four new cases — SYS_PTRACE always added; SYS_PTRACE deduped when baseline already has it; apparmor=unconfined injected on empty baselines; baseline-pinned apparmor profile preserved. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) Signed-off-by: Tinson Lai <tinsonl@nvidia.com>  ## Summary by CodeRabbit * **Bug Fixes** * GPU-enabled containers are now configured with the necessary kernel security capabilities during clone/recreate operations. * Security profile handling was fixed to preserve existing security options and avoid unintended overrides or duplicate entries. * **Tests** * Added tests covering GPU container clone behavior across various capability and security-profile scenarios to prevent regressions.  [![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3515)   --------- Signed-off-by: Tinson Lai <tinsonl@nvidia.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com>

) ## Summary Fixes `host.openshell.internal` DNS resolution inside GPU sandboxes. Closes #3562, #3568. ## Root cause PR #3434 added the Docker GPU patch, which recreates the OpenShell-managed sandbox container with GPU flags after OpenShell creates it without `--gpu`. The recreation gated `--add-host` injection on `networkMode !== "host"`. But `getDockerGpuPatchNetworkMode` defaults to `"host"` (and `NEMOCLAW_DOCKER_GPU_PATCH_NETWORK` is never set), so every GPU recreate silently dropped the OpenShell-injected `host.openshell.internal` entry. Symptom on DGX Spark: `LLM request failed: network connection error` — inference can't reach the Ollama auth proxy because the hostname doesn't resolve in the recreated container. ## The fix `--add-host` writes to `/etc/hosts` in the mount namespace, not the network stack — safe to preserve with `--network=host`. Removed the conditional. `--dns` / `--dns-search` guard kept (those genuinely conflict with host networking). Also flipped `local.test.ts:618` to match the current validator output (pre-existing test drift). ## Tests - All 16/16 `docker-gpu-patch.test.ts` pass. - Existing test 'can switch the recreated sandbox to host networking for OpenShell callbacks' was asserting the bug; now asserts `--add-host` is preserved. 🤖 Generated with [Claude Code](https://claude.com/claude-code)  ## Summary by CodeRabbit * **Bug Fixes** * Docker GPU container networking now defaults to "preserve" unless explicitly set to host, improving predictable network behavior. * Host-to-container callback mapping is always preserved, ensuring external callback endpoints remain reachable after container recreation and improving endpoint resolution.  [![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3623)   --------- Signed-off-by: Charan Jagwani <cjagwani@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com>

fix(onboard): repair Docker GPU sandbox readiness

6bf3354

github-advanced-security AI found potential problems May 13, 2026

View reviewed changes

Comment thread src/lib/onboard/docker-gpu-patch.ts Fixed

cv approved these changes May 13, 2026

View reviewed changes

coderabbitai Bot reviewed May 13, 2026

View reviewed changes

Comment thread src/lib/onboard.ts Outdated

Comment thread src/lib/onboard.ts Outdated

Comment thread src/lib/onboard/docker-gpu-patch.ts

test: add Brev GPU branch validation

e8690db

github-advanced-security AI found potential problems May 13, 2026

View reviewed changes

Comment thread test/e2e/brev-e2e.test.ts Fixed

Comment thread test/e2e/brev-e2e.test.ts Fixed

coderabbitai Bot reviewed May 13, 2026

View reviewed changes

Comment thread test/e2e/brev-e2e.test.ts Outdated

ci: use Brev API key auth for GPU validation

06b06ee

ci: build CLI before Brev validation

1d58140

ci: allow Brev GPU provisioning to settle

61d845f

ci: use smaller default Brev GPU

f0bb0ab

ci: preflight Brev GPU candidates

3f9f175

ericksoa changed the title ~~Fix Docker GPU sandbox readiness~~ fix(onboard): repair Docker GPU sandbox readiness May 13, 2026

ericksoa added 2 commits May 12, 2026 21:15

fix(onboard): containerize incompatible OpenShell gateway

38a5e8f

Merge remote-tracking branch 'origin/main' into fix/docker-gpu-sandbo…

e2b4759

…x-readiness # Conflicts: # test/onboard.test.ts

github-advanced-security AI found potential problems May 13, 2026

View reviewed changes

Comment thread src/lib/onboard.ts Fixed

Comment thread src/lib/onboard/docker-cdi.ts Fixed

fix(onboard): use docker adapter for gateway cleanup

48705a5

fix(onboard): address gateway compatibility scan findings

81df8a4

fix(ci): keep Brev CPU argv typed

0b113d4

cv merged commit 3205c12 into main May 13, 2026
28 checks passed

prekshivyas deleted the fix/docker-gpu-sandbox-readiness branch May 13, 2026 23:15

laitingsheng mentioned this pull request May 14, 2026

fix(onboard): ensure GPU recreate has SYS_PTRACE + apparmor=unconfined for proc/comm write #3515

Merged

12 tasks

cjagwani mentioned this pull request May 14, 2026

[All Platforms][Onboard] ./install.sh and nemoclaw onboard enter dead loop — sandbox can't reach gateway, "firewall" hint is misleading #3456

Closed

cjagwani mentioned this pull request May 14, 2026

fix(onboard,uninstall): replace misleading recovery messages (#3456) #3520

Merged

miyoungc mentioned this pull request May 14, 2026

docs(release): refresh 0.0.41 documentation #3531

Merged

coderabbitai Bot mentioned this pull request May 14, 2026

fix(onboard): allow proc writes for Docker GPU patch #3543

Merged

12 tasks

cv mentioned this pull request May 15, 2026

[Brev][Ubuntu 22.04][Onboard] NVIDIA Endpoints provider setup fails with tcp Connection refused on v0.0.39 #3396

Closed

coderabbitai Bot mentioned this pull request May 15, 2026

fix(onboard): bump default sandbox-ready timeout to 300s for GPU images #3357

Closed

7 tasks

cjagwani mentioned this pull request May 15, 2026

fix(onboard): preserve --add-host on GPU sandbox recreate (#3562) #3623

Merged

This was referenced May 16, 2026

test(onboard): split helper coverage #3640

Merged

Ggg #3666

Closed

ericksoa mentioned this pull request May 21, 2026

Brev GPU E2E fails at OpenShell Docker bridge gateway reachability #3959

Open

coderabbitai Bot mentioned this pull request May 21, 2026

fix(hermes): restore Spark GPU recreate startup #3963

Merged

coderabbitai Bot mentioned this pull request Jun 1, 2026

fix(inference): prove WSL Docker Desktop GPUs and report sandbox CUDA proof state #4599

Merged

8 tasks

Conversation

ericksoa commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Validation

Notes

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

Uh oh!

cv left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 13, 2026

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

github-actions Bot commented May 13, 2026

Uh oh!

github-actions Bot commented May 13, 2026

Uh oh!

github-actions Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Advisor Recommendation

Pi Semantic E2E Advisor

Required E2E

Optional E2E

New E2E recommendations

Dispatch hint

Uh oh!

github-actions Bot commented May 13, 2026

Uh oh!

github-actions Bot commented May 13, 2026

Uh oh!

github-actions Bot commented May 13, 2026

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 13, 2026

Uh oh!

github-actions Bot commented May 13, 2026

Uh oh!

github-actions Bot commented May 13, 2026

Uh oh!

Uh oh!

github-actions Bot commented May 14, 2026

Selective E2E Results — ❌ Some jobs failed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

ericksoa commented May 13, 2026 •

edited

Loading

coderabbitai Bot commented May 13, 2026 •

edited

Loading

github-actions Bot commented May 13, 2026 •

edited

Loading