fix(onboard): prefer CDI GPU mode over --gpus on CDI hosts by jason-ma-nv · Pull Request #4956 · NVIDIA/NemoClaw

jason-ma-nv · 2026-06-08T12:06:15Z

Summary

On Docker-driver GPU hosts that advertise an NVIDIA CDI spec (e.g. /etc/cdi/nvidia.yaml on Ubuntu 24.04/26.04), nemoclaw onboard selected --gpus all for the GPU patch recreate. The OpenShell supervisor then never reconnected, the sandbox entered the Error phase before the GPU proof, and onboard aborted with exit 1. This PR reorders GPU mode selection to prefer the CDI mode (--device nvidia.com/gpu=all) ahead of --gpus whenever a CDI spec is detected, matching how OpenShell's gateway start --gpu injects GPU devices.

Related Issue

Fixes #4948

Changes

src/lib/onboard/docker-gpu-patch.ts: buildDockerGpuModeCandidates now puts the cdi candidate first when cdiAvailable is true; --gpus and the NVIDIA runtime remain as fallbacks if the CDI probe fails. Non-CDI hosts are unaffected (order unchanged: gpus, nvidia-runtime). Jetson path unchanged.
src/lib/onboard/docker-gpu-patch.test.ts: adds a repro test asserting that on a CDI host where the --gpus probe would pass, the cdi mode is selected; updates the existing candidate-order assertion and stale comments to the corrected ordering.
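
The ordering described above can be sketched as follows. This is a minimal illustration, not the code in docker-gpu-patch.ts: the function name `sketchGpuModeCandidates` and the `GpuMode` type are hypothetical, and the Jetson path is left out entirely.

```typescript
// Hypothetical sketch: names and types are illustrative, not the real module's API.
type GpuMode = "cdi" | "gpus" | "nvidia-runtime";

function sketchGpuModeCandidates(cdiAvailable: boolean): GpuMode[] {
  // Legacy candidates keep their relative order on every host.
  const fallbacks: GpuMode[] = ["gpus", "nvidia-runtime"];
  // On hosts with a detected CDI spec, the CDI candidate is tried first.
  return cdiAvailable ? ["cdi", ...fallbacks] : fallbacks;
}

console.log(sketchGpuModeCandidates(true).join(" > "));
console.log(sketchGpuModeCandidates(false).join(" > "));
```

Keeping the fallbacks in a single list means the non-CDI order cannot drift from the CDI-host fallback order.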

Why this is the right layer

The create-only probe (docker create --gpus all) is accepted on these hosts, so --gpus all looks viable. At runtime, however, it diverges from OpenShell's CDI-based injection, and the supervisor then never reconnects to the recreated container. Preferring CDI when a spec is present removes that divergence.
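
To make the reasoning concrete, here is a hedged sketch of first-passing-probe selection. The helper `selectFirstPassing` and the `Selection` shape are assumptions for illustration. Only "--gpus first" is known about the old order; the position of the other candidates in `oldOrder` is a guess.

```typescript
// Hypothetical sketch of probe-driven selection; not the project's implementation.
type GpuMode = "cdi" | "gpus" | "nvidia-runtime";

interface Selection {
  selected: GpuMode | null;
  attempted: GpuMode[];
}

// Walk the candidates in order and pick the first one whose probe passes.
function selectFirstPassing(
  candidates: readonly GpuMode[],
  probePasses: (mode: GpuMode) => boolean,
): Selection {
  const attempted: GpuMode[] = [];
  for (const mode of candidates) {
    attempted.push(mode);
    if (probePasses(mode)) return { selected: mode, attempted };
  }
  return { selected: null, attempted };
}

// Stand-in for an affected host: every create-only probe is accepted.
const everyProbePasses = (_mode: GpuMode): boolean => true;

// Only "gpus first" comes from the PR; the rest of the old order is assumed.
const oldOrder: GpuMode[] = ["gpus", "cdi", "nvidia-runtime"];
const newOrder: GpuMode[] = ["cdi", "gpus", "nvidia-runtime"];

console.log(selectFirstPassing(oldOrder, everyProbePasses).selected);
console.log(selectFirstPassing(newOrder, everyProbePasses).selected);
```

Because the probe only checks that container creation is accepted, it cannot tell the two modes apart on these hosts, so candidate order alone decides the outcome.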

Validation caveat

The supervisor-reconnect failure is a runtime symptom that only manifests on real GPU + Docker-CDI hardware. Unit tests pin the deterministic mode-selection decision; final confirmation requires the GPU E2E path (e2e-branch-validation:gpu) on an affected host. The existing NEMOCLAW_DOCKER_GPU_PATCH=0 escape hatch is unchanged.

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only (prose changes, no code sample modifications)
Doc only (includes code sample changes)

Verification

npx prek run --all-files passes
npm test passes
Tests added or updated for new or changed behavior
No secrets, API keys, or credentials committed
Docs updated for user-facing behavior changes
npm run docs builds without warnings (doc changes only)
Doc pages follow the style guide (doc changes only)
New doc pages include SPDX header and frontmatter (new pages only)

Verification notes: the full onboard suite passes (npx vitest run src/lib/onboard/ — 1189 tests, including the new #4948 repro), npm run typecheck:cli passes, and Biome is clean on the changed files. The full npm test run has 16 pre-existing, environment-dependent failures (network/port-binding e2e-framework fixtures, a MODULE_NOT_FOUND in ssrf-parity, and missing Docker/OpenShell fixtures in fetch-guard-patch-regression) that are unrelated to this change and do not reference the modified module — hence npm test is left unchecked.

Signed-off-by: Jason Ma jama@nvidia.com

Summary by CodeRabbit

Bug Fixes
- Improved Docker GPU detection: when the Container Device Interface (CDI) is available it is now preferred, otherwise the system falls back to legacy GPU options — improving GPU compatibility and reliability.
Tests
- Added regression tests covering CDI-first selection, fallbacks to other GPU modes, and confirming CDI-based launches omit legacy GPU flags.

On Docker-driver GPU hosts that advertise an NVIDIA CDI spec (e.g. /etc/cdi/nvidia.yaml on Ubuntu 24.04/26.04), `nemoclaw onboard` selected the `--gpus all` mode for the GPU patch recreate. `docker create --gpus all` is accepted on these hosts so the create-only probe passed, but OpenShell's `gateway start --gpu` injects devices from the CDI spec, so a container recreated via the legacy --gpus path diverges from how the supervisor expects the GPU container to be wired up. The supervisor never reconnected, the sandbox entered Error phase before the GPU proof, and onboard aborted with exit 1.

Reorder GPU mode candidates so the CDI mode (`--device nvidia.com/gpu=all`) is preferred ahead of --gpus whenever a CDI spec is detected; --gpus and the NVIDIA runtime remain as fallbacks if the CDI probe fails. Non-CDI hosts are unaffected (candidate order unchanged).

Note: the supervisor-reconnect failure is a runtime symptom on real GPU hardware; final validation requires the GPU E2E path (see "verify" below).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-08T12:06:47Z

E2E Advisor Recommendation

Required E2E: gpu-repo-local-ollama-openclaw
Optional E2E: gpu-e2e, gpu-double-onboard-e2e

Dispatch hint: gpu-repo-local-ollama-openclaw

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

gpu-repo-local-ollama-openclaw (high): Required because this is the typed Docker-CDI GPU scenario on the self-hosted GPU runner. It validates repo-current onboarding, Docker-CDI GPU sandbox creation/recreation, local Ollama inference, and Ollama proxy reachability—the exact runtime path changed by the PR.

Optional E2E

gpu-e2e (high): Useful redundant confidence from the legacy nightly GPU E2E script path. It validates install/onboard plus local Ollama GPU inference on an ephemeral GPU runner, but is less targeted than the typed Docker-CDI scenario.
gpu-double-onboard-e2e (high): Optional adjacent coverage for repeated GPU onboarding/recreate behavior and persisted Ollama proxy token consistency after re-onboard. It is relevant to lifecycle regressions but not the primary CDI mode-selection fix.

New E2E recommendations

docker-cdi-gpu-recreate-mode (high): Existing GPU E2E proves the sandbox works, but does not appear to explicitly assert the recreated container used CDI (--device nvidia.com/gpu=all) instead of legacy --gpus all, nor that transient Error phases beyond the old debounce window recover without rollback.
- Suggested test: Add a focused Docker-CDI GPU recreate regression that captures the patched create option/container inspect output and verifies CDI mode plus successful supervisor reconnect after a transient Error phase.
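
One way the suggested assertion could look is sketched below. The helper `inspectGpuCreateArgs` and the sample argument list are hypothetical; the sketch only checks the two flags named in the recommendation and does not call Docker.

```typescript
// Hypothetical helper: classifies a captured container-create argument list.
interface GpuArgCheck {
  usesCdi: boolean;
  usesLegacyGpus: boolean;
}

function inspectGpuCreateArgs(argv: readonly string[]): GpuArgCheck {
  const check: GpuArgCheck = { usesCdi: false, usesLegacyGpus: false };
  for (let i = 0; i < argv.length; i += 1) {
    const arg = argv[i];
    // Accept both "--device <value>" and "--device=<value>" spellings.
    if (arg === "--device" && argv[i + 1] === "nvidia.com/gpu=all") check.usesCdi = true;
    if (arg === "--device=nvidia.com/gpu=all") check.usesCdi = true;
    // Accept both "--gpus <value>" and "--gpus=<value>" spellings.
    if (arg === "--gpus" || arg.indexOf("--gpus=") === 0) check.usesLegacyGpus = true;
  }
  return check;
}

// Illustrative argument list; the container name and image are made up.
const captured = ["create", "--name", "example-sandbox", "--device", "nvidia.com/gpu=all", "example-image"];
console.log(inspectGpuCreateArgs(captured));
```

A regression test could fail when `usesCdi` is false or `usesLegacyGpus` is true for the recreated container on a CDI host.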

Dispatch hint

Workflow: .github/workflows/e2e-scenarios.yaml
jobs input: gpu-repo-local-ollama-openclaw

github-actions · 2026-06-08T12:06:48Z

E2E Scenario Advisor Recommendation

Required scenario E2E: gpu-repo-local-ollama-openclaw
Optional scenario E2E: None

Dispatch required scenario E2E:

gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

gpu-repo-local-ollama-openclaw: Changes affect the Docker GPU patch and supervisor reconnect path, including CDI-first GPU mode selection and reconnect debounce behavior. The dispatchable scenario that exercises Docker CDI GPU onboarding is gpu-repo-local-ollama-openclaw.
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw

Optional scenario E2E

None.

Relevant changed files

src/lib/onboard/docker-gpu-patch.ts
src/lib/onboard/docker-gpu-supervisor-reconnect.ts

coderabbitai · 2026-06-08T12:08:44Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f306d9cd-4edf-45a9-83a2-5b71b5dd0117

📥 Commits

Reviewing files that changed from the base of the PR and between db2f9f6 and 591651e.

📒 Files selected for processing (2)

src/lib/onboard/docker-gpu-patch-mode-selection.test.ts
src/lib/onboard/docker-gpu-patch.test.ts

💤 Files with no reviewable changes (1)

src/lib/onboard/docker-gpu-patch.test.ts

📝 Walkthrough

Walkthrough

Reorders GPU patch mode probing to prefer CDI when Docker reports a readable NVIDIA CDI spec, reorganizes related re-exports, and adds tests validating CDI-first selection, fallback to --gpus and nvidia-runtime, and propagation of the CDI device flag into recreate flows.

Changes

CDI-first GPU patch mode candidate selection

| Layer / File(s) | Summary |
| --- | --- |
| CDI-first candidate ordering and re-exports `src/lib/onboard/docker-gpu-patch.ts` | Re-export section from `./docker-gpu-supervisor-reconnect` is restructured; `buildDockerGpuModeCandidates` now places the CDI candidate before `--gpus` and `nvidia-runtime` when `cdiAvailable` is true. |
| New CDI-first mode selection tests `src/lib/onboard/docker-gpu-patch-mode-selection.test.ts` | Adds Vitest suite that stubs CDI host probes and inspects Docker recreate flows to assert CDI preference, fallback to `gpus`, fallback to `nvidia-runtime`, and that `recreateOpenShellDockerSandboxWithGpu` uses the CDI `--device nvidia.com/gpu=all` flag and omits `--gpus`. |
| Update existing unit tests and comments `src/lib/onboard/docker-gpu-patch.test.ts` | Renamed/updated unit test expectations to reflect CDI-first ordering and adjusted a comment to state CDI candidates are preferred ahead of `--gpus` on CDI hosts. |
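
As a rough illustration of how a selected mode might translate into container-create flags, consider the sketch below. The function `gpuArgsForMode` is hypothetical. The `cdi` and `gpus` flags are the ones quoted in this PR; the `--runtime nvidia` flags for the NVIDIA runtime mode are an assumption, since the PR does not spell them out.

```typescript
// Hypothetical mapping from a selected GPU mode to container-create flags.
type GpuMode = "cdi" | "gpus" | "nvidia-runtime";

function gpuArgsForMode(mode: GpuMode): string[] {
  switch (mode) {
    case "cdi":
      // CDI device request, as quoted in the PR description.
      return ["--device", "nvidia.com/gpu=all"];
    case "gpus":
      // Legacy GPU request, as quoted in the PR description.
      return ["--gpus", "all"];
    case "nvidia-runtime":
      // Assumption: the PR does not state the exact flags for this mode.
      return ["--runtime", "nvidia"];
  }
}

console.log(gpuArgsForMode("cdi").join(" "));
```

Returning the flags as separate array elements, rather than one shell string, keeps the arguments safe to pass to a process-spawning API without quoting.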

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

[Ubuntu 24.04][Onboard] nemoclaw onboard --gpu always fails with Docker GPU patch supervisor reconnect timeout on aarch64 dual-GPU host #4950: Changes to prefer CDI first and to supervisor-reconnect re-exports affect GPU-mode selection and reconnect behavior referenced by this issue.

Possibly related PRs

NVIDIA/NemoClaw#4407: Shares related diagnostics and docker-gpu-patch plumbing changes (getSandboxFailurePhase, captureDockerGpuPatchSandboxSnapshot, classifyDockerGpuPatchFailure).

Suggested labels

Docker, platform: container, fix, Sandbox, bug-fix

Suggested reviewers

cv
prekshivyas

Poem

🐰 I found a CDI flag bright and small,
I nudged it forward, now it leads the call.
No legacy flags trailing behind the cart,
The sandbox wakes with GPU in its heart.
Hooray — CDI first, a hoppity new start!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the primary change: preferring CDI GPU mode over --gpus on CDI hosts, directly addressing the main objective of the PR. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue `#4948` by reordering GPU mode candidates to prefer CDI first when available, ensuring supervisor reconnection and successful onboarding on CDI-capable Docker hosts. |
| Out of Scope Changes check | ✅ Passed | All code changes are directly scoped to GPU mode candidate ordering and selection logic; no unrelated modifications are present. |

github-actions · 2026-06-08T12:09:28Z

PR Review Advisor

Findings: 1 needs attention, 1 worth checking, 0 nice ideas
Since last review: 0 prior items resolved, 2 still apply, 0 new items found

Review findings

🛠️ Needs attention

Linked issue runtime acceptance is still not proven (src/lib/onboard/docker-gpu-patch-mode-selection.test.ts:108): Issue [Ubuntu 24.04][Onboard] onboard cannot create a GPU-enabled sandbox on Docker-driver GPU host #4948 requires a normal first onboard on an affected Docker-CDI GPU host to create a GPU-enabled sandbox, reconnect the OpenShell supervisor, run the GPU proof, reach Ready, and complete onboard. The changed tests prove deterministic CDI-first selection, fallback ordering, CDI argv propagation, and a mocked long Error-phase reconnect recovery, but they still stop at mocked Docker/OpenShell boundaries and do not observe the full first-onboard runtime path.
- Recommendation: Add or identify targeted runtime/integration validation for a normal first onboard on an affected Ubuntu Docker-CDI GPU host with no existing NemoClaw sandbox, asserting the recreated container uses `--device nvidia.com/gpu=all` rather than `--gpus all`, the supervisor reconnects, the GPU proof runs, the sandbox reaches Ready, and onboard completes.
- Evidence: The new recreate test calls `recreateOpenShellDockerSandboxWithGpu()` directly with mocked `dockerRunDetached` and mocked `runOpenshell`; the reconnect test simulates `runOpenshell`/`runCaptureOpenshell` responses. The linked issue Expected Result says: "The sandbox is created with GPU access, the OpenShell supervisor reconnects to the GPU-enabled container, the GPU proof runs, the sandbox reaches Ready, and the first onboard completes."

🔎 Worth checking

Sandbox lifecycle security confidence still depends on real Docker-CDI validation (src/lib/onboard/docker-gpu-patch.ts:509): No direct sandbox escape, credential leak, SSRF bypass, policy bypass, workflow trusted-code-boundary issue, or unsafe shell-string execution issue was found. However, this is a security-sensitive sandbox lifecycle path: the PR changes which GPU device-passthrough mode is used for an OpenShell-managed container and expands the transient Error-phase debounce from about 30 seconds to about 2 minutes. Unit tests validate selection and mocked reconnect behavior, but not the real Docker-CDI container/supervisor/policy lifecycle.
- Recommendation: Cover the real Docker-CDI host behavior in runtime validation: after CDI recreate, the supervisor reconnects before timeout, the sandbox does not remain in terminal Error during GPU enablement, policy expectations are preserved, the GPU proof runs under the expected container wiring, and persistent failures still roll back to the pre-patch sandbox.
- Evidence: `buildDockerGpuModeCandidates()` now prepends CDI when `cdiAvailable` is true, and `DOCKER_GPU_SUPERVISOR_RECONNECT_ERROR_PHASE_DEFAULT_DEBOUNCE_POLLS` is now 60. The changed tests assert `dockerRunDetached` receives `--device nvidia.com/gpu=all` and mock supervisor reconnect success/recovery rather than exercising real Docker/OpenShell lifecycle behavior.
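
The debounce behavior mentioned above can be sketched under stated assumptions. The function `reconnectsWithinDebounce`, the `Provisioning` phase name, and the exact counting rule are hypothetical. The value 60 comes from the finding; the comparison value 15 is only inferred from "about 30 seconds to about 2 minutes" at a constant poll interval.

```typescript
// Hypothetical sketch of an Error-phase debounce over a sequence of polled phases.
type Phase = "Provisioning" | "Error" | "Ready";

// Returns true if Ready is observed before Error has persisted for more than
// debouncePolls consecutive polls; returns false otherwise.
function reconnectsWithinDebounce(polls: readonly Phase[], debouncePolls: number): boolean {
  let consecutiveErrors = 0;
  for (const phase of polls) {
    if (phase === "Ready") return true;
    if (phase === "Error") {
      consecutiveErrors += 1;
      // A persistent Error beyond the window is treated as terminal.
      if (consecutiveErrors > debouncePolls) return false;
    } else {
      // Any non-Error phase resets the window.
      consecutiveErrors = 0;
    }
  }
  return false;
}

// Simulated transient failure: 20 Error polls, then Ready.
const transient: Phase[] = [];
for (let i = 0; i < 20; i += 1) transient.push("Error");
transient.push("Ready");

console.log(reconnectsWithinDebounce(transient, 60));
console.log(reconnectsWithinDebounce(transient, 15));
```

In this model a longer window tolerates a slower CDI recreate, while a sandbox that stays in Error past the window is still reported as failed so the rollback path can run.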

🌱 Nice ideas

None.

Consider writing more tests for

**Runtime validation:** a normal first onboard on an affected Ubuntu Docker-CDI GPU host with no existing NemoClaw sandbox selects `--device nvidia.com/gpu=all`, not `--gpus all`, and onboard completes.
**Runtime validation:** after CDI recreate, the OpenShell supervisor reconnects before timeout and `openshell sandbox list` reaches Ready.
**Runtime validation:** the direct sandbox GPU proof runs successfully after CDI recreate and records verified CUDA usability.
**Runtime validation:** during CDI GPU-enable, a transient Error phase recovers without aborting, while a persistent Error still rolls back to the pre-patch sandbox.
**Composed/unit validation:** the `createDockerGpuSandboxCreatePatch` first-onboard create-time path wires CDI detection through `recreateOpenShellDockerSandboxWithGpu`, not only direct calls to the recreate helper.
Rationale for the five items above: unit coverage is strong for deterministic mode selection, fallback ordering, Docker argv propagation, and mocked supervisor debounce. The linked bug is a runtime/sandbox infrastructure path involving Docker-CDI hardware, OpenShell supervisor reconnect, GPU proof, Ready phase, and first-onboard completion.
**Acceptance clause:** On a Docker-driver GPU host (NVIDIA GPU auto-detected), `nemoclaw onboard` cannot bring up a GPU-enabled sandbox. — add test evidence or identify existing coverage. The production change targets Docker GPU patch mode selection, but no changed test runs `nemoclaw onboard` on a Docker-driver GPU host.
**Acceptance clause:** While creating the sandbox, onboard enables GPU passthrough — this is the standard create-then-GPU-enable path that runs on a normal FIRST onboard whenever a GPU is present on a Docker-driver gateway (gated by NEMOCLAW_DOCKER_GPU_PATCH); — add test evidence or identify existing coverage. The new CDI test directly calls `recreateOpenShellDockerSandboxWithGpu()` and proves CDI args reach `dockerRunDetached`; it does not exercise the full first-onboard create-time orchestration with no existing NemoClaw sandbox.
**Acceptance clause:** The OpenShell supervisor never reconnects to the GPU-enabled container, — add test evidence or identify existing coverage. The reconnect test simulates a transient Error phase followed by mocked supervisor success, but no changed test observes a real OpenShell supervisor reconnecting to a real Docker-CDI GPU-enabled container.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

…4948)

Address PR review-advisor findings on the CDI-first GPU mode change:

- Move the CDI mode-selection tests out of the docker-gpu-patch.test.ts monolith into a focused docker-gpu-patch-mode-selection.test.ts spec, offsetting the flagged monolith growth (back to ~baseline line count).
- Pin the fallback chain on a CDI host: CDI probe fails -> --gpus selected; CDI and --gpus probes fail -> NVIDIA runtime selected (attempt order starts with cdi in both cases).
- Add a recreate-boundary assertion: recreateOpenShellDockerSandboxWithGpu passes --device nvidia.com/gpu=all to dockerRunDetached on a CDI host and never emits --gpus, proving the selected CDI mode reaches the real recreate command (the patched_create_option the issue logs).

All four new tests fail under the previous --gpus-first ordering, confirming they pin the fix rather than restating it. No production code change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

wscurran · 2026-06-08T14:43:08Z

✨
Related open issues:

#4948 [Ubuntu 24.04][Onboard] onboard cannot create a GPU-enabled sandbox on Docker-driver GPU host

copy-pr-bot · 2026-06-09T23:30:38Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-06-09T23:31:50Z

Selective E2E Results — ⚠️ No requested jobs ran

Run: 27242628510
Target ref: fix/onboard-prefer-cdi-gpu-mode-4948
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

| Job | Result |
| --- | --- |
| gpu-e2e | ⏭️ skipped |

cv · 2026-06-10T00:08:11Z

Runtime validation update for #4948 on PR head 5811e0375ae9610ed8cdc7b11ff09363e4fcd9f4:

Standard PR CI is green: gh pr checks reports all 30 checks passing.
Manual typed GPU scenario run: https://github.com/NVIDIA/NemoClaw/actions/runs/27243425713
Scenario: gpu-repo-local-ollama-openclaw on the linux-amd64-gpu-rtxpro6000-latest-1 Docker-CDI GPU runner.
The first rerun before the debounce follow-up failed in onboarding with patched_create_option=--device nvidia.com/gpu=all and supervisor reconnect timeout. After the follow-up commit, onboarding passed.

Relevant acceptance evidence from the second run artifacts:

Onboarding phase: passed.
Docker GPU mode selected: --device nvidia.com/gpu=all.
Dashboard reached live state and sandbox creation completed.
GPU proof passed through nvidia-smi, /proc/<pid>/task/<tid>/comm write, and cuInit(0) via libcuda.so.1.
Sandbox CUDA usability was proven.
Sandbox local inference route reached https://inference.local/v1/models with HTTP 200.
OpenClaw reached ready state.

The manual scenario run is still red, but only after the linked issue acceptance point: two later runtime provider assertions failed because the host Ollama socket disappeared between probes (curl could not connect to 127.0.0.1:11434). The smoke checks had already passed: gateway health, sandbox listed, sandbox shell, and models-health. I am treating that as a typed-scenario/provider-runner issue rather than a blocker for the Docker-CDI onboard fix.

cv

Standard PR CI is green. I resolved the main conflict, added CDI source-boundary coverage, and followed up on the GPU runtime failure. The manual Docker-CDI scenario now passes the linked issue acceptance point: CDI recreate, supervisor reconnect/onboard, Ready/dashboard, GPU/CUDA proof, and the local inference route. The remaining red in that manual run is later host-Ollama runtime probe churn, not the onboard fix.

cv · 2026-06-10T00:59:07Z

Current-head runtime validation update for #4948 on PR head 70921b56d73eb384632e7663817927e1b680ab82:

Standard PR CI is green: gh pr checks reports all 30 checks passing after the branch was synced with main.
Manual typed GPU scenario rerun: https://github.com/NVIDIA/NemoClaw/actions/runs/27245221685
Scenario: gpu-repo-local-ollama-openclaw on the Docker-CDI GPU runner.

Acceptance evidence from the current-head artifacts:

Onboarding phase: passed.
Docker GPU mode selected: --device nvidia.com/gpu=all.
Dashboard reached live state and sandbox creation completed.
GPU proof passed through nvidia-smi, /proc/<pid>/task/<tid>/comm write, and cuInit(0) via libcuda.so.1.
Sandbox CUDA usability was proven.
Sandbox local inference route reached https://inference.local/v1/models with HTTP 200.
OpenClaw reached ready state.
State/smoke checks passed: gateway healthy, sandbox running/listed, sandbox shell, and models-health.

The typed scenario still exits red after the linked issue acceptance point, with the same provider-runner signature as the previous run: runtime.ollama.models-health saw the local Ollama model list, then runtime.ollama.chat-completion failed because curl could not connect to host 127.0.0.1:11434; runtime.ollama-auth-proxy.auth-enforcement is classified as provider-transient. I am still treating that as host-Ollama/provider-runner churn rather than a blocker for the Docker-CDI onboard fix.

## Summary

- Add v0.0.62 release notes from Discussion #5100 and link release highlights to the relevant docs pages.
- Document the release's GPU sandbox recreation, sandbox-side local inference verification, and Hermes dashboard port guard in the command and inference references.
- Refresh generated NemoClaw user skills for the release-prep docs set.

## Source Summary

- #4956 -> `docs/reference/commands.mdx`: Document CDI-first Docker GPU recreation behavior for Linux Docker-driver sandboxes.
- #5024 -> `docs/inference/use-local-inference.mdx`: Document sandbox-runtime verification of the `inference.local` local inference route.
- #5018 -> `docs/reference/commands.mdx`: Document Jetson/Tegra device-node group propagation for sandbox CUDA initialization.
- #5012, #4763, #4706, #5030, #5015 -> `docs/about/release-notes.mdx`: Summarize onboarding and recovery reliability fixes, including the reserved Hermes API port guard.
- #5017 and #5043 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Summarize mutable OpenClaw config recovery and host-side `agents list` coverage.
- #5010 and #5016 -> `docs/about/release-notes.mdx`: Summarize Hermes upstream metadata visibility and WhatsApp QR rendering reliability.
- #5045 and prior source docs in the v0.0.62 range -> `.agents/skills/`: Refresh generated user-skill references from the current docs source.

## Skipped

- #5019 -> skipped for new prose because it touched `openclaw-sandbox-permissive.yaml`, which matches `docs/.docs-skip`. Existing source docs remain the source for generated skill synchronization.

## Verification

- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx`
- `npm run docs` (passes; Fern reports 0 errors and 1 hidden warning)
- Pre-commit hooks passed during commit, including docs-to-skills verification, markdown lint, gitleaks, and skills YAML tests.

## Summary by CodeRabbit

* **New Features**
  * Added `nemoclaw <name> agents list` command.
  * v0.0.62 release notes added summarizing onboarding and recovery improvements.
* **Bug Fixes**
  * Improved GPU sandbox onboarding reliability (NVIDIA CDI path, Jetson/Tegra device handling).
  * Better local inference verification and recovery for Linux Docker-driver GPU sandboxes.
  * Quieter/earlier handling of onboarding drift and port collisions.
* **Documentation**
  * Expanded GPU passthrough, inference verification, writable paths (`/dev/pts`), port 8642 restriction, and command examples.

---------

Co-authored-by: Prekshi Vyas <34834085+prekshivyas@users.noreply.github.com>

jason-ma-nv self-assigned this Jun 8, 2026

jason-ma-nv added the v0.0.61 Release target label Jun 8, 2026

wscurran added area: providers Inference provider integrations and provider behavior bug-fix PR fixes a bug or regression labels Jun 8, 2026

wscurran requested a review from cv June 8, 2026 16:08

cv added v0.0.62 Release target and removed v0.0.61 Release target labels Jun 8, 2026

merge(main): resolve CDI GPU patch conflicts

c1c7fb1

fix(onboard): extend GPU reconnect debounce

5811e03

cv approved these changes Jun 10, 2026

View reviewed changes

Merge branch 'main' into fix/onboard-prefer-cdi-gpu-mode-4948

70921b5

cv merged commit 8827570 into main Jun 10, 2026
31 of 32 checks passed

cv deleted the fix/onboard-prefer-cdi-gpu-mode-4948 branch June 10, 2026 01:04

miyoungc mentioned this pull request Jun 10, 2026

docs: refresh v0.0.62 release docs #5157

Merged
