Skip to content

fix(onboard): repair Docker GPU sandbox readiness#3434

Merged
cv merged 49 commits into
mainfrom
fix/docker-gpu-sandbox-readiness
May 13, 2026
Merged

fix(onboard): repair Docker GPU sandbox readiness#3434
cv merged 49 commits into
mainfrom
fix/docker-gpu-sandbox-readiness

Conversation

@ericksoa

@ericksoa ericksoa commented May 13, 2026

Copy link
Copy Markdown
Contributor

Summary

  • add a NemoClaw-side OpenShell gateway compatibility patch for Ubuntu 22.04/Brev-style Linux hosts where the installed openshell-gateway binary requires newer glibc than the host provides
  • when needed, run openshell-gateway inside an ubuntu:24.04 Docker container with host networking, the host Docker socket, OpenShell state, and the host openshell-sandbox path mounted through
  • add a NemoClaw-side Docker GPU patch for Linux Docker-driver sandbox onboarding that creates without OpenShell --gpu, then recreates the OpenShell-managed Docker container with Docker GPU flags
  • select --gpus, NVIDIA runtime, or CDI based on Docker probes while preserving OpenShell labels and runtime settings
  • make GPU proof require /proc/self/task/<tid>/comm write plus cuInit(0), with nvidia-smi optional for Jetson/Tegra hosts
  • leave failed GPU sandboxes/containers in place with diagnostics and a cleanup command instead of auto-deleting them

Validation

  • npm run build:cli
  • npm run typecheck:cli
  • npm run checks
  • npm run format:check
  • git diff --check
  • npx vitest run src/lib/onboard/docker-driver-gateway-launch.test.ts src/lib/onboard/docker-gpu-patch.test.ts test/onboard.test.ts
  • npx vitest run test/docker-abstraction-guard.test.ts src/lib/onboard/docker-driver-gateway-launch.test.ts
  • dispatched Brev GPU validation on current PR head: https://github.com/NVIDIA/NemoClaw/actions/runs/25781993350

Notes

  • Brev Ubuntu 22.04 reproduces the OpenShell gateway glibc blocker (GLIBC_2.38/GLIBC_2.39 required, Jammy has 2.35).
  • Ubuntu 24.04 GPU validation reportedly gets past gateway startup and sees the GPU inside the sandbox, which matches the glibc diagnosis.
  • This patch keeps the U22 path viable without requiring the user or launchable image to move to Ubuntu 24.04 first.

Summary by CodeRabbit

  • New Features

    • Docker container rename command; Jetson/Tegra GPU detection; Docker GPU‑patch flow to enable GPU access for sandboxes.
  • Improvements

    • Optional nvidia‑smi probe; configurable sandbox readiness timeout; sandbox creation can omit legacy --gpu flag when using patch; improved GPU patch diagnostics and manual cleanup guidance; quicker create-phase transition detection.
  • Tests

    • New and updated unit and E2E GPU tests and harness updates.
  • Chores

    • CI workflow adds a GPU test-suite option.

Review Change Stack

@coderabbitai

coderabbitai Bot commented May 13, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds Docker GPU patching for OpenShell sandboxes on Linux Docker driver gateways, Jetson/Tegra GPU detection and unified-memory handling, an orchestration module to recreate sandbox containers with GPU access and collect diagnostics, create-stream and probe updates, and test/CI additions for GPU runs.

Changes

Docker GPU Patching and Jetson Support

Layer / File(s) Summary
Docker container rename helper
src/lib/adapters/docker/container.ts, src/lib/adapters/docker/index.test.ts
New dockerRename function wraps dockerRun with the rename command to back up containers before GPU-mode recreation.
NVIDIA Jetson/Tegra hardware detection
src/lib/inference/nim.ts, src/lib/inference/nim.test.ts
Extended UNIFIED_MEMORY_GPU_TAGS to include Jetson and Tegra, added jetson to NvidiaPlatform type, and implemented host detection helpers (readHostMemoryMB, hasTegraDeviceNodeSignal, detectTegraHostGpu, detectNvidiaPlatform updates). Added test validating Jetson detection via firmware/device-node signals and host RAM fallback.
Docker GPU patch orchestration module
src/lib/onboard/docker-gpu-patch.ts, src/lib/onboard/docker-gpu-patch.test.ts
New ~800-line module implementing GPU mode probing and selection (gpusnvidia-runtimecdi), container recreation (stop/rename backup → run GPU-enabled clone preserving relevant settings), OpenShell reconnection polling, diagnostics collection (timestamped failure dir with Docker ps/inspect/logs and optional OpenShell captures), and public APIs for mode builders, selection, cleanup commands, and diagnostics. Tests validate mode selection, arg construction, CDI probing, and orchestration.
Sandbox readiness timeout and GPU patch integration
src/lib/onboard.ts
Added getSandboxReadyTimeoutSecs (supports NEMOCLAW_SANDBOX_READY_TIMEOUT), updated buildSandboxGpuCreateArgs to accept suppressGpuFlag, wired dockerGpuPatch.recreateOpenShellDockerSandboxWithGpu into sandbox creation when applicable, adjusted readiness polling to use dynamic timeout, and added failure diagnostics/manual cleanup guidance for GPU-patch mode. Exported new timeout helper.
Stream phase transition and optional NVIDIA probe
src/lib/onboard/initial-policy.ts, src/lib/sandbox/create-stream.ts, src/lib/sandbox/create-stream.test.ts
Added NVIDIA_SMI_OPTIONAL_PROBE (runs nvidia-smi only if present) and switched first GPU proof command to use it; updated create-stream to transition to "create" phase on Built image lines, and added a timer-driven test for the transition and heartbeats.
Test harness and integration updates
test/onboard.test.ts, .github/workflows/e2e-branch-validation.yaml, test/e2e/brev-e2e.test.ts
Extended test harness internals to expose getSandboxReadyTimeoutSecs and buildSandboxGpuCreateArgs options, updated tests to reflect optional nvidia-smi probe and timeout overrides, added GPU path CI/workflow choices and Brev GPU E2E orchestration including GPU VM provisioning and runtime preparation.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

Docker, fix, E2E, CI/CD

Suggested reviewers

  • cv
  • jyaunches
  • prekshivyas

Poem

🐇 I hopped into code with a gleam,
I nudged Jetson chips into the scheme,
Containers renamed and modes explored,
Sandboxes rebuilt while diagnostics roared—
A carrot-streaked patch, now the GPUs dream.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 8.06% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title 'fix(onboard): repair Docker GPU sandbox readiness' is clearly related to the main change—a comprehensive Docker GPU patching system for Linux Docker-driver sandbox onboarding. The title accurately summarizes the primary focus.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/docker-gpu-sandbox-readiness

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread src/lib/onboard/docker-gpu-patch.ts Fixed

@cv cv left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ready on green

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
src/lib/onboard/docker-gpu-patch.ts (1)

576-685: 🏗️ Heavy lift

Split orchestration/diagnostics functions to reduce complexity in a critical path.

These two functions combine many responsibilities (discovery, mode selection, mutation, readiness waiting, file/report generation). Breaking them into smaller units would lower risk and make failure handling easier to reason about.

As per coding guidelines **/*.{js,ts,tsx}: Keep function complexity low; prefix unused variables with underscore (_).

Also applies to: 701-823

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/docker-gpu-patch.ts` around lines 576 - 685, The function
recreateOpenShellDockerSandboxWithGpu is doing too much (discovery, inspect,
mode selection, stop/rename, run clone, wait for new ID, wait for exec
readiness, cleanup and error decoration); split it into smaller helpers to
reduce complexity and improve failure handling: extract container
discovery/inspect into a function (e.g., discoverAndInspectContainer that wraps
findOpenShellDockerSandboxContainerIds and inspectDockerContainer), extract
backup/rename logic into backupAndRenameContainer (uses dockerStop,
dockerRename, buildBackupContainerName), extract clone/run-and-wait into
runGpuCloneAndWaitForId (uses buildDockerGpuCloneRunArgs, dockerRunDetached,
waitForNewContainerId), and extract exec readiness + cleanup into
waitForExecReadyAndCleanup (uses waitForOpenShellSandboxExec and dockerRm for
backup); ensure the original recreateOpenShellDockerSandboxWithGpu composes
these helpers, preserves the DockerGpuPatchFailureContext updates
(oldContainerId, backupContainerName, newContainerId, selectedMode,
modeAttempts) and still uses decoratePatchError on catch, and prefix any
intentionally unused variables with an underscore per guidelines.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 6368-6404: The new verbose Docker GPU patch try/catch and
diagnostic logging in createSandbox() should be moved into one or two small
helpers in a new module (e.g., src/lib/onboard/docker-gpu-patch.ts) so
createSandbox() stays thin: create a helper like
runAndLogDockerGpuPatch(sandboxName, effectiveSandboxGpuConfig, options) that
internally calls dockerGpuPatch.recreateOpenShellDockerSandboxWithGpu(...),
catches errors, calls dockerGpuPatch.collectDockerGpuPatchDiagnostics(...) and
dockerGpuPatch.dockerGpuPatchCleanupCommands(...), performs the
console.error/console.log output now in onboard.ts, and returns a compact result
object (mode label or error/diagnostics info); then replace the current
try/catch blocks in createSandbox() (around
recreateOpenShellDockerSandboxWithGpu and the other two similar blocks) with
single calls to this helper and handle only the small returned result.
- Around line 6509-6523: The diagnostics and cleanup messaging for the Docker
GPU patch should only run when the patch path is active; wrap the block that
calls dockerGpuPatch.collectDockerGpuPatchDiagnostics and
dockerGpuPatch.dockerGpuPatchCleanupCommands in a guard (e.g., if
(useDockerGpuPatch) { ... }) so it only executes when useDockerGpuPatch is
truthy (or alternatively guard on dockerGpuPatchResult being present/activated).
Keep the existing references sandboxName, runCaptureOpenshell,
dockerGpuPatchResult, and the two dockerGpuPatch.* calls inside that guarded
block.

In `@src/lib/onboard/docker-gpu-patch.ts`:
- Around line 217-221: pushNumberFlag currently rejects non-positive numbers
which causes Docker's unlimited swap sentinel HostConfig.MemorySwap = -1 to be
dropped when rebuilding docker run args; update pushNumberFlag to treat -1 as a
valid value (accept Number.isFinite(value) && (value === -1 || value > 0)) and
push the flag with String(value) in that case, and apply the same change to the
other occurrence of this helper used for rebuilding run args so
HostConfig.MemorySwap = -1 is preserved.

---

Nitpick comments:
In `@src/lib/onboard/docker-gpu-patch.ts`:
- Around line 576-685: The function recreateOpenShellDockerSandboxWithGpu is
doing too much (discovery, inspect, mode selection, stop/rename, run clone, wait
for new ID, wait for exec readiness, cleanup and error decoration); split it
into smaller helpers to reduce complexity and improve failure handling: extract
container discovery/inspect into a function (e.g., discoverAndInspectContainer
that wraps findOpenShellDockerSandboxContainerIds and inspectDockerContainer),
extract backup/rename logic into backupAndRenameContainer (uses dockerStop,
dockerRename, buildBackupContainerName), extract clone/run-and-wait into
runGpuCloneAndWaitForId (uses buildDockerGpuCloneRunArgs, dockerRunDetached,
waitForNewContainerId), and extract exec readiness + cleanup into
waitForExecReadyAndCleanup (uses waitForOpenShellSandboxExec and dockerRm for
backup); ensure the original recreateOpenShellDockerSandboxWithGpu composes
these helpers, preserves the DockerGpuPatchFailureContext updates
(oldContainerId, backupContainerName, newContainerId, selectedMode,
modeAttempts) and still uses decoratePatchError on catch, and prefix any
intentionally unused variables with an underscore per guidelines.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b9abc4e9-bbc0-4860-b39b-46f17dd97991

📥 Commits

Reviewing files that changed from the base of the PR and between 6f4e276 and 6bf3354.

📒 Files selected for processing (11)
  • src/lib/adapters/docker/container.ts
  • src/lib/adapters/docker/index.test.ts
  • src/lib/inference/nim.test.ts
  • src/lib/inference/nim.ts
  • src/lib/onboard.ts
  • src/lib/onboard/docker-gpu-patch.test.ts
  • src/lib/onboard/docker-gpu-patch.ts
  • src/lib/onboard/initial-policy.ts
  • src/lib/sandbox/create-stream.test.ts
  • src/lib/sandbox/create-stream.ts
  • test/onboard.test.ts

Comment thread src/lib/onboard.ts Outdated
Comment thread src/lib/onboard.ts Outdated
Comment thread src/lib/onboard/docker-gpu-patch.ts
@github-actions

Copy link
Copy Markdown
Contributor

Brev E2E (gpu): FAILED on branch fix/docker-gpu-sandbox-readinessSee logs

Comment thread test/e2e/brev-e2e.test.ts Fixed
Comment thread test/e2e/brev-e2e.test.ts Fixed

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/brev-e2e.test.ts`:
- Around line 398-402: The current use of execSync to run a shell-interpolated
piped command is injection-prone; replace the single execSync call with two
execFileSync invocations: call execFileSync('brev',
['search','cpu','--min-vcpu',BREV_MIN_VCPU,'--min-ram',BREV_MIN_RAM,'--min-disk',BREV_MIN_DISK,'--provider',BREV_PROVIDER,'--sort','price'],
{ encoding: 'utf-8', timeout: 180_000 }) to capture the search output, then call
execFileSync('brev', ['create', INSTANCE_NAME, '--startup-script',
`@${setupScriptPath}`, '--detached'], { input: searchOutput, encoding: 'utf-8',
timeout: 180_000, stdio: PIPE_INPUT_STDIO }) so arguments are passed as arrays
(no shell interpolation) and the search output is fed as stdin to create; update
the code that currently uses execSync to use execFileSync and keep the same
timeout/encoding/stdio constants.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ef10bd24-0388-428c-9c61-7784a6e92c0d

📥 Commits

Reviewing files that changed from the base of the PR and between 6bf3354 and e8690db.

📒 Files selected for processing (2)
  • .github/workflows/e2e-branch-validation.yaml
  • test/e2e/brev-e2e.test.ts

Comment thread test/e2e/brev-e2e.test.ts Outdated
@github-actions

Copy link
Copy Markdown
Contributor

Brev E2E (gpu): FAILED on branch fix/docker-gpu-sandbox-readinessSee logs

@github-actions

Copy link
Copy Markdown
Contributor

Brev E2E (gpu): FAILED on branch fix/docker-gpu-sandbox-readinessSee logs

@github-actions

github-actions Bot commented May 13, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: e2e-branch-validation:gpu, nightly-e2e:gpu-e2e, nightly-e2e:gpu-double-onboard-e2e, e2e-branch-validation:full
Optional E2E: ollama-proxy-e2e:ollama-proxy-e2e, nightly-e2e:inference-routing-e2e, nightly-e2e:openclaw-inference-switch-e2e, nightly-e2e:sandbox-operations-e2e, nightly-e2e:onboard-resume-e2e, regression-e2e:gateway-health-honest-e2e, nightly-e2e:overlayfs-autofix-e2e

Dispatch hint: gpu-e2e,gpu-double-onboard-e2e,inference-routing-e2e,sandbox-operations-e2e,onboard-resume-e2e,overlayfs-autofix-e2e

Workflow run

Full advisor summary

Pi Semantic E2E Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • e2e-branch-validation:gpu (high): This PR introduces the 'gpu' test_suite in e2e-branch-validation.yaml and the entire Linux Docker-driver GPU sandbox stack (docker-gpu-patch, docker-gpu-sandbox-create, docker-gpu-local-inference, docker-driver-gateway-launch host-mode, sandbox bridge reachability skip). The new suite provisions a Brev GPU VM and runs the Ollama GPU sandbox proof from source; without it the new code paths are unvalidated end-to-end.
  • nightly-e2e:gpu-e2e (high): Existing Ollama-on-GPU sandbox proof on an NVKS GPU runner. Directly exercises local.ts host-URL selection, docker-gpu-local-inference, sandbox GPU create args, and the host-network gateway path. The most authoritative existing GPU regression for this PR.
  • nightly-e2e:gpu-double-onboard-e2e (high): Re-onboard with Ollama proxy token consistency (ollama proxy token diverges from stored token after re-onboard, causing persistent HTTP 401 on inference #2553). onboard.ts and local inference URL helpers were both refactored; double-onboard exercises gateway reuse, drift detection, and Ollama proxy state — all areas this PR rewrites for the Docker-driver path.
  • e2e-branch-validation:full (medium): onboard.ts, dockerfile-patch.ts, initial-policy.ts, sandbox-gpu-create.ts, and the gateway launch path all changed. The 'full' Brev branch-validation suite is the standard gate that installs from source on a clean VM and runs install→onboard→sandbox verify→inference→CLI ops, catching regressions in the refactored onboard/gateway paths on a real Linux host.

Optional E2E

  • ollama-proxy-e2e:ollama-proxy-e2e (low): local.ts host-URL behavior and docker-gpu-local-inference touch the Ollama auth proxy routing. Useful confidence check that the existing Ollama proxy flow still works with the new sandbox host URL override.
  • nightly-e2e:inference-routing-e2e (medium): Validates inference.local routing into the sandbox; getLocalProviderBaseUrl and host-URL normalization changes could subtly shift routing behavior even on non-GPU hosts.
  • nightly-e2e:openclaw-inference-switch-e2e (medium): Inference switching writes the host URL into openclaw.json; the new sandbox host URL override path is adjacent to this code.
  • nightly-e2e:sandbox-operations-e2e (medium): create-stream.ts and the sandbox create argv builder changed (new --gpu/--gpu-device handling, docker rename helper). Sandbox operations exercises that lifecycle.
  • nightly-e2e:onboard-resume-e2e (medium): onboard.ts now records sandbox GPU mode/device and resumes from registry — resume coverage is relevant.
  • regression-e2e:gateway-health-honest-e2e (medium): Gateway launch identity, drift detection, and reachability checks were rewritten. Health-honesty regression catches false-positive 'gateway healthy' logs on Linux Docker-driver hosts.
  • nightly-e2e:overlayfs-autofix-e2e (medium): getGatewayStartEnv path is touched indirectly through Docker-driver gateway env handling; overlayfs autofix consumes that env on Linux.

New E2E recommendations

  • linux-docker-driver-gpu-sandbox (medium): The new Docker-driver GPU host-network mode (NEMOCLAW_DOCKER_GPU_PATCH=host) replaces the sandbox bridge with the host network and routes inference at 127.0.0.1 instead of host.openshell.internal. There is no existing E2E that asserts (a) the sandbox can reach the gateway when bridge reachability is skipped, AND (b) Ollama is reachable from inside the sandbox via the new host URL, AND (c) the escape-hatch NEMOCLAw_DOCKER_GPU_PATCH=0 falls back cleanly. test-gpu-e2e.sh covers the happy path but not the patch-disable fallback.
    • Suggested test: test/e2e/test-gpu-docker-driver-patch-disable.sh — onboard with NEMOCLAW_DOCKER_GPU_PATCH=0 on a Linux Docker-driver GPU host; assert onboard either succeeds without host-network patching or fails with the documented escape-hatch guidance, and that the sandbox bridge gateway reachability check runs.
  • gpu-detection (low): detectTegraHostGpu() now classifies hosts as platform=jetson when nvidia-smi is absent but Tegra device nodes (/dev/nvhost-gpu) or firmware model indicate Jetson. No existing E2E exercises a host without nvidia-smi, so this fallback is unit-tested only.
    • Suggested test: Add a Jetson/Tegra preflight smoke (mocked or on a Jetson runner if available) that asserts nemoclaw onboard preflight reports 'NVIDIA GPU detected' and platform=jetson when nvidia-smi is missing but Tegra device nodes exist.
  • docker-driver-gateway (medium): buildDockerDriverGatewayRuntimeIdentity now produces distinct drift/identity binaries for host-mode launches. Existing gateway-health-honest-e2e covers GLIBC-style crashes but not the host-mode identity drift case where a previous bridge-mode gateway is still listening on the port.
    • Suggested test: Regression test: pre-start a bridge-mode gateway, then run onboard with GPU host-mode patch enabled; assert the stale gateway is detected as drift and recreated rather than reused.

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: gpu-e2e,gpu-double-onboard-e2e,inference-routing-e2e,sandbox-operations-e2e,onboard-resume-e2e,overlayfs-autofix-e2e

@github-actions

Copy link
Copy Markdown
Contributor

Brev E2E (gpu): FAILED on branch fix/docker-gpu-sandbox-readinessSee logs

@github-actions

Copy link
Copy Markdown
Contributor

Brev E2E (gpu): FAILED on branch fix/docker-gpu-sandbox-readinessSee logs

@github-actions

Copy link
Copy Markdown
Contributor

Brev E2E (gpu): FAILED on branch fix/docker-gpu-sandbox-readinessSee logs

@ericksoa ericksoa changed the title Fix Docker GPU sandbox readiness fix(onboard): repair Docker GPU sandbox readiness May 13, 2026
Comment thread src/lib/onboard.ts Fixed
Comment thread src/lib/onboard/docker-cdi.ts Fixed
@github-actions

Copy link
Copy Markdown
Contributor

Brev E2E (gpu): FAILED on branch fix/docker-gpu-sandbox-readinessSee logs

@github-actions

Copy link
Copy Markdown
Contributor

Brev E2E (gpu): FAILED on branch fix/docker-gpu-sandbox-readinessSee logs

@github-actions

Copy link
Copy Markdown
Contributor

Brev E2E (gpu): FAILED on branch fix/docker-gpu-sandbox-readinessSee logs

@cv cv merged commit 3205c12 into main May 13, 2026
28 checks passed
@prekshivyas prekshivyas deleted the fix/docker-gpu-sandbox-readiness branch May 13, 2026 23:15
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 25840442921
Target ref: a7aff406df029f96806c8cd68d264f332db32b0b
Workflow ref: main
Requested jobs: all (no filter)
Summary: 38 passed, 1 failed, 2 skipped

Job Result
brave-search-e2e ✅ success
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
credential-migration-e2e ✅ success
credential-sanitization-e2e ✅ success
deployment-services-e2e ✅ success
device-auth-health-e2e ✅ success
diagnostics-e2e ✅ success
docs-validation-e2e ✅ success
double-onboard-e2e ✅ success
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-discord-e2e ✅ success
hermes-e2e ✅ success
hermes-inference-switch-e2e ✅ success
hermes-slack-e2e ✅ success
inference-routing-e2e ✅ success
issue-2478-crash-loop-recovery-e2e ✅ success
kimi-inference-compat-e2e ✅ success
launchable-smoke-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ✅ success
onboard-repair-e2e ✅ success
onboard-resume-e2e ✅ success
openclaw-inference-switch-e2e ❌ failure
openshell-gateway-upgrade-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-hermes-stale-base-e2e ✅ success
rebuild-openclaw-e2e ✅ success
runtime-overrides-e2e ✅ success
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
telegram-injection-e2e ✅ success
token-rotation-e2e ✅ success
upgrade-stale-sandbox-e2e ✅ success

Failed jobs: openclaw-inference-switch-e2e. Check run artifacts for logs.

cjagwani pushed a commit that referenced this pull request May 14, 2026
Resolve the two output threads in #3456 left after the core dead-loop fix
landed via #3459 + #3434:

Sub-bug #3 — `src/lib/onboard.ts` printed
  `nemoclaw <name> destroy --yes && nemoclaw onboard --gpu`
with a literal `<name>` placeholder, and assumed at least one sandbox
was registered. When the GPU-passthrough mismatch hit on the State B
re-run path with an empty registry (the dead-loop case), the hint was
not actionable. Replace with a registry-aware helper at
`src/lib/onboard/gpu-recovery.ts` that renders the right shape:
  - empty registry → suggest `nemoclaw uninstall && nemoclaw onboard --gpu`
  - one sandbox → suggest destroy --yes --cleanup-gateway for that name
  - multiple sandboxes → list each, only the last gets --cleanup-gateway

Sub-bug #4 — `src/lib/actions/uninstall/run-plan.ts` printed
  `Destroyed gateway 'nemoclaw' skipped`
when the openshell destroy no-op'd (gateway already gone) — the
"Destroyed … skipped" wording was self-contradictory. Extend
`runOptional` with an `onSkip` option; route the gateway destroy to
emit `Gateway 'nemoclaw' already removed or unreachable` on no-op.

Tests:
- `src/lib/onboard/gpu-recovery.test.ts` (6 tests): forbid literal
  `<name>` placeholder anywhere in the output; cover empty / single /
  multi-sandbox cases; defensive filter on whitespace names so a
  `nemoclaw  destroy` rendering can never happen.
- `src/lib/actions/uninstall/run-plan.test.ts`: assert the new
  "already removed or unreachable" wording and the absence of the
  "Destroyed gateway 'nemoclaw' skipped" string.

The core dead loop itself (sub-bugs #1, #2 and State B GPU mismatch)
is already addressed by #3459 + #3434 + #3483; #3456 will close once
this lands. See the #3456 status comment for the full mapping.

Refs #3456. Mirrors (and tightens) the approach in the closed PR #3464,
which left the literal `<name>` placeholder in tests per CodeRabbit
feedback that was never addressed.

Signed-off-by: Charan Jagwani <charjags100@gmail.com>
sandl99 added a commit to sandl99/NemoClaw that referenced this pull request May 14, 2026
…e test

Why: rebasing onto main pulled in NVIDIA#3434's new
"can override the sandbox inference base URL" test, which was written
against the pre-WeChat 13-parameter signature of patchStagedDockerfile.
The WeChat branch added a wechatConfig: LooseObject parameter at slot
13, so the test's `false` (intended for darwinVmCompat) was landing in
wechatConfig and TypeScript flagged the boolean-to-LooseObject mismatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cv pushed a commit that referenced this pull request May 14, 2026
## Summary
- Bump the docs metadata and version switcher to `0.0.41`.
- Add v0.0.41 release notes plus operator guidance for OpenShell
pinning, Docker bridge reachability, Local Ollama proxy reachability,
and Docker GPU onboarding diagnostics.
- Refresh generated `nemoclaw-user-*` skills from the updated docs.

## Source summary
- #3434 -> `docs/reference/commands.md`,
`docs/reference/troubleshooting.md`, `docs/about/release-notes.md`:
Document Linux Docker-driver GPU onboarding behavior, diagnostics,
cleanup guidance, and the `NEMOCLAW_DOCKER_GPU_PATCH` troubleshooting
escape hatch.
- #3483 -> `docs/about/release-notes.md`: Note that `nemoclaw uninstall`
removes all installer-managed OpenShell helper binaries unless
`--keep-openshell` is passed.
- #3446 -> `docs/reference/commands.md`,
`docs/reference/troubleshooting.md`, `docs/about/release-notes.md`:
Document blueprint-driven OpenShell install pin resolution and fallback
behavior.
- #3472 -> `docs/inference/use-local-inference.md`,
`docs/reference/troubleshooting.md`, `docs/about/release-notes.md`:
Document sandbox-side Local Ollama auth proxy reachability checks and
firewall remediation.
- #3459 -> `docs/reference/commands.md`,
`docs/reference/troubleshooting.md`, `docs/about/release-notes.md`:
Document Docker-driver sandbox-to-gateway reachability checks and
firewall remediation.

## Test plan
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user`
- `make docs`
- `git diff --check`
- `npm run build:cli`
- `npm run typecheck:cli`
- pre-commit hooks during `git commit`

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Added `nemoclaw inference get` command to check current inference
settings
* Improved gateway health validation with Linux firewall remediation
guidance

* **Bug Fixes**
  * Enhanced proxy readiness validation with sandbox network path probes
  * Improved local Ollama route onboarding with rerun-safe fixes
  * Better sandbox-to-gateway connectivity detection

* **Documentation**
* Expanded troubleshooting guidance for firewall and connectivity issues
* Updated CLI reference with new command and environment variable
documentation
  * Added gateway binding and Docker-driver GPU compatibility guidance

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3531)

<!-- review_stack_entry_end -->

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
cv added a commit that referenced this pull request May 14, 2026
…3520)

> **Draft for visibility.** Issue-autopilot Stages 4-5 of #3456. Will
mark ready once batch self-review + CI complete.

## Summary

Closes the two remaining output threads in #3456 after the core
dead-loop fix already landed on `main` (via #3459, #3434, #3483). Full
sub-bug mapping in the [#3456 status
comment](#3456 (comment)).

- **Sub-bug #3** — `nemoclaw <name> destroy --yes` recovery hint
replaced with a registry-aware helper.
- **Sub-bug #4** — `Destroyed gateway 'nemoclaw' skipped`
self-contradictory wording replaced with `Gateway 'nemoclaw' already
removed or unreachable`.

## Acceptance criteria mapping

| Sub-bug | Resolution | Evidence |
|---|---|---|
| #1 dead loop | Already fixed on main (#3459) | out of scope |
| #2 firewall diagnostic | Already fixed on main (#3459) | out of scope
|
| **#3** literal `<name>` placeholder | **This PR** |
`src/lib/onboard/gpu-recovery.ts` + `onboard.ts:10387-10405` |
| **#4** misleading "skipped" wording | **This PR** |
`src/lib/actions/uninstall/run-plan.ts:210-228, 407-414` |
| #5 uninstall residuals | Already fixed on main (#3483) | out of scope
|

## Behavior matrix

`gpuPassthroughRecoveryLines(names)`:

| Input | Suggestion |
|---|---|
| `null` / `[]` | `nemoclaw uninstall && nemoclaw onboard --gpu` |
| one sandbox | `nemoclaw <name> destroy --yes --cleanup-gateway &&
nemoclaw onboard --gpu` |
| many sandboxes | each `destroy --yes`, only the last gets
`--cleanup-gateway` |

## Test plan

```
npm run typecheck:cli
npx vitest run src/lib/onboard/gpu-recovery.test.ts src/lib/actions/uninstall/run-plan.test.ts
```

22 tests pass (6 new + 16 existing).

## Notes for reviewers

- This is the work [#3464
attempted](#3464); that PR was
closed without merging after CodeRabbit asked for the `<name>`
placeholder to be forbidden in tests via negative assertion. This PR
adopts that refinement.
- `runOptional` extension is backwards-compatible — existing callers
without `onSkip` get the original wording.

Closes #3456 once merged.

---------

Signed-off-by: Charan Jagwani <charjags100@gmail.com>
Co-authored-by: Charan Jagwani <charjags100@gmail.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
cv added a commit that referenced this pull request May 14, 2026
…d for proc/comm write (#3515)

## Summary

The Docker-driver GPU patch
([src/lib/onboard/docker-gpu-patch.ts](src/lib/onboard/docker-gpu-patch.ts))
recreates the OpenShell-managed sandbox container with `--gpus all` plus
a reconstruction of the baseline container's flags, then runs a
three-step GPU proof in
[src/lib/onboard/initial-policy.ts](src/lib/onboard/initial-policy.ts) —
including `PROC_COMM_WRITE_PROBE`, which writes to
`/proc/<pid>/task/<tid>/comm`.

On DGX Spark hosts (#3511) and other Docker-driver Linux hosts where the
OpenShell-created baseline container's `CapAdd` does not include
`SYS_PTRACE` and its `SecurityOpt` does not include
`apparmor=unconfined`, the recreate inherits a flag set the kernel/LSM
stack rejects for that proc write. The proof aborts onboarding:

```
✗ GPU proof failed: /proc/<pid>/task/<tid>/comm write
Error: GPU proof failed: /proc/<pid>/task/<tid>/comm write (status 2):
  sh: 1: cannot create /proc/<pid>/task/<tid>/comm: Permission denied
```

The reporter confirmed a bare `docker run --rm --gpus all ubuntu sh -c
"echo test > /proc/1/task/1/comm"` succeeds on the same host, so the
kernel itself allows the write under the right Docker flags — the
problem is which flags reach the recreated container.

This PR makes the GPU recreate self-sufficient for the operations the
GPU proof checks, regardless of the non-GPU baseline:

- **Always inject `--cap-add SYS_PTRACE`** into the clone, deduped via a
`Set` so baselines that already have it stay flat.
- **Inject `--security-opt apparmor=unconfined`** only when the baseline
did *not* pin a specific apparmor profile. Docker rejects duplicate
`--security-opt apparmor=…` entries, so a baseline that explicitly chose
`apparmor=docker-default` (or similar) is respected — this stays scoped
to the GPU recreate path and does not override deliberate operator
choices.

## Related Issue
Resolves #3511

Related context:

- #3434 — PR that introduced the Docker-driver GPU recreate flow and the
`/proc/comm` write probe. This PR hardens the recreate to satisfy what
the probe checks across more baselines.

## Changes
- `src/lib/onboard/docker-gpu-patch.ts`: dedup `CapAdd` / `SecurityOpt`
via `Set`, always inject `SYS_PTRACE`, and inject `apparmor=unconfined`
only when no apparmor profile is pinned by the baseline.
- `src/lib/onboard/docker-gpu-patch.test.ts`: four new cases —
SYS_PTRACE always added; SYS_PTRACE deduped when baseline already has
it; apparmor=unconfined injected on empty baselines; baseline-pinned
apparmor profile preserved.

## Type of Change

- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification

- [x] `npx prek run --all-files` passes
- [x] `npm test` passes
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only)
- [ ] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* GPU-enabled containers are now configured with the necessary kernel
security capabilities during clone/recreate operations.
* Security profile handling was fixed to preserve existing security
options and avoid unintended overrides or duplicate entries.

* **Tests**
* Added tests covering GPU container clone behavior across various
capability and security-profile scenarios to prevent regressions.

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3515)

<!-- review_stack_entry_end -->
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
cv added a commit that referenced this pull request May 15, 2026
)

## Summary

Fixes `host.openshell.internal` DNS resolution inside GPU sandboxes.
Closes #3562, #3568.

## Root cause

PR #3434 added the Docker GPU patch, which recreates the
OpenShell-managed sandbox container with GPU flags after OpenShell
creates it without `--gpu`. The recreation gated `--add-host` injection
on `networkMode !== "host"`. But `getDockerGpuPatchNetworkMode` defaults
to `"host"` (and `NEMOCLAW_DOCKER_GPU_PATCH_NETWORK` is never set), so
every GPU recreate silently dropped the OpenShell-injected
`host.openshell.internal` entry.

Symptom on DGX Spark: `LLM request failed: network connection error` —
inference can't reach the Ollama auth proxy because the hostname doesn't
resolve in the recreated container.

## The fix

`--add-host` writes to `/etc/hosts` in the mount namespace, not the
network stack — safe to preserve with `--network=host`. Removed the
conditional. `--dns` / `--dns-search` guard kept (those genuinely
conflict with host networking).

Also flipped `local.test.ts:618` to match the current validator output
(pre-existing test drift).

## Tests

- All 16/16 `docker-gpu-patch.test.ts` pass.
- Existing test 'can switch the recreated sandbox to host networking for
OpenShell callbacks' was asserting the bug; now asserts `--add-host` is
preserved.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Docker GPU container networking now defaults to "preserve" unless
explicitly set to host, improving predictable network behavior.
* Host-to-container callback mapping is always preserved, ensuring
external callback endpoints remain reachable after container recreation
and improving endpoint resolution.

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3623)

<!-- review_stack_entry_end -->
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
This was referenced May 16, 2026
@wscurran wscurran added area: packaging Packages, images, registries, installers, or distribution area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression platform: container Affects Docker, containerd, Podman, or images and removed area: packaging Packages, images, registries, installers, or distribution Docker labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression platform: container Affects Docker, containerd, Podman, or images

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants