
fix(onboard): prefer CDI GPU mode over --gpus on CDI hosts #4956

Merged
cv merged 5 commits into main from fix/onboard-prefer-cdi-gpu-mode-4948
Jun 10, 2026

Conversation

@jason-ma-nv

@jason-ma-nv jason-ma-nv commented Jun 8, 2026

Contributor

Summary

On Docker-driver GPU hosts that advertise an NVIDIA CDI spec (e.g. /etc/cdi/nvidia.yaml on Ubuntu 24.04/26.04), nemoclaw onboard selected --gpus all for the GPU patch recreate. As a result, the OpenShell supervisor never reconnected, the sandbox entered Error phase before the GPU proof, and onboard aborted with exit 1. This PR reorders GPU mode selection to prefer the CDI mode (--device nvidia.com/gpu=all) ahead of --gpus whenever a CDI spec is detected, matching how OpenShell's gateway start --gpu injects GPU devices.

Related Issue

Fixes #4948

Changes

  • src/lib/onboard/docker-gpu-patch.ts: buildDockerGpuModeCandidates now puts the cdi candidate first when cdiAvailable is true; --gpus and the NVIDIA runtime remain as fallbacks if the CDI probe fails. Non-CDI hosts are unaffected (order unchanged: gpus, nvidia-runtime). Jetson path unchanged.
  • src/lib/onboard/docker-gpu-patch.test.ts: adds a repro test asserting that on a CDI host where the --gpus probe would pass, the cdi mode is selected; updates the existing candidate-order assertion and stale comments to the corrected ordering.
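
The ordering rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from docker-gpu-patch.ts: the function name `buildCandidatesSketch`, its single boolean parameter, and the `DockerGpuMode` type are assumptions, and the Jetson path is left out because the PR does not describe it.

```typescript
// Hypothetical sketch: the real buildDockerGpuModeCandidates signature
// is not shown in this PR, so names and types here are assumptions.
type DockerGpuMode = "cdi" | "gpus" | "nvidia-runtime";

function buildCandidatesSketch(cdiAvailable: boolean): DockerGpuMode[] {
  // Order used on non-CDI hosts, which the PR says is unchanged.
  const legacy: DockerGpuMode[] = ["gpus", "nvidia-runtime"];
  // On CDI hosts the cdi candidate goes first; the others stay as fallbacks.
  return cdiAvailable ? ["cdi", ...legacy] : legacy;
}

console.log(buildCandidatesSketch(true));
console.log(buildCandidatesSketch(false));
```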

Why this is the right layer

The create-only probe (docker create --gpus all) is accepted on these hosts, so --gpus all looks viable, but a container recreated that way diverges at runtime from OpenShell's CDI-based injection, and the supervisor then never reconnects to the recreated container. Preferring CDI when a spec is present removes that divergence.
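
To make the probe concrete, here is a hedged sketch of how each mode might map to docker create arguments. The --gpus all and --device nvidia.com/gpu=all flags come from this PR. The --runtime nvidia mapping, the helper names, and the image name are assumptions for illustration. The sketch only builds argv arrays and does not run Docker.

```typescript
// Hypothetical helpers; only the cdi and gpus flag values come from the PR text.
type DockerGpuMode = "cdi" | "gpus" | "nvidia-runtime";

function gpuFlagsFor(mode: DockerGpuMode): string[] {
  switch (mode) {
    case "cdi":
      return ["--device", "nvidia.com/gpu=all"];
    case "gpus":
      return ["--gpus", "all"];
    case "nvidia-runtime":
      // Assumption: the PR only says "the NVIDIA runtime" without naming the flag.
      return ["--runtime", "nvidia"];
  }
}

// Builds the argv for a create-only probe. A probe of this kind checks that
// Docker accepts the flags, not that the container behaves correctly at runtime.
function createProbeArgv(mode: DockerGpuMode, image: string): string[] {
  return ["create", ...gpuFlagsFor(mode), image];
}

// "example-image" is a placeholder, not an image used by NemoClaw.
console.log(createProbeArgv("cdi", "example-image").join(" "));
```

Because both argv shapes are accepted by Docker on a CDI host, a probe like this cannot distinguish them, which is why the fix changes the ordering rather than the probe.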

Validation caveat

The supervisor-reconnect failure is a runtime symptom that only manifests on real GPU + Docker-CDI hardware. Unit tests pin the deterministic mode-selection decision; final confirmation requires the GPU E2E path (e2e-branch-validation:gpu) on an affected host. The existing NEMOCLAW_DOCKER_GPU_PATCH=0 escape hatch is unchanged.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Verification notes: the full onboard suite passes (npx vitest run src/lib/onboard/ — 1189 tests, including the new #4948 repro), npm run typecheck:cli passes, and Biome is clean on the changed files. The full npm test run has 16 pre-existing, environment-dependent failures (network/port-binding e2e-framework fixtures, a MODULE_NOT_FOUND in ssrf-parity, and missing Docker/OpenShell fixtures in fetch-guard-patch-regression) that are unrelated to this change and do not reference the modified module — hence npm test is left unchecked.


Signed-off-by: Jason Ma <jama@nvidia.com>

Summary by CodeRabbit

  • Bug Fixes

    • Improved Docker GPU detection: when the Container Device Interface (CDI) is available it is now preferred, otherwise the system falls back to legacy GPU options — improving GPU compatibility and reliability.
  • Tests

    • Added regression tests covering CDI-first selection, fallbacks to other GPU modes, and confirming CDI-based launches omit legacy GPU flags.

On Docker-driver GPU hosts that advertise an NVIDIA CDI spec (e.g.
/etc/cdi/nvidia.yaml on Ubuntu 24.04/26.04), `nemoclaw onboard` selected
the `--gpus all` mode for the GPU patch recreate. `docker create --gpus
all` is accepted on these hosts so the create-only probe passed, but
OpenShell's `gateway start --gpu` injects devices from the CDI spec, so a
container recreated via the legacy --gpus path diverges from how the
supervisor expects the GPU container to be wired up. The supervisor never
reconnected, the sandbox entered Error phase before the GPU proof, and
onboard aborted with exit 1.

Reorder GPU mode candidates so the CDI mode (`--device nvidia.com/gpu=all`)
is preferred ahead of --gpus whenever a CDI spec is detected; --gpus and the
NVIDIA runtime remain as fallbacks if the CDI probe fails. Non-CDI hosts are
unaffected (candidate order unchanged).

Note: the supervisor-reconnect failure is a runtime symptom on real GPU
hardware; final validation requires the GPU E2E path (see "verify" below).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jason-ma-nv jason-ma-nv self-assigned this Jun 8, 2026
@github-actions

github-actions Bot commented Jun 8, 2026

Contributor

E2E Advisor Recommendation

Required E2E: gpu-repo-local-ollama-openclaw
Optional E2E: gpu-e2e, gpu-double-onboard-e2e

Dispatch hint: gpu-repo-local-ollama-openclaw

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • gpu-repo-local-ollama-openclaw (high): Required because this is the typed Docker-CDI GPU scenario on the self-hosted GPU runner. It validates repo-current onboarding, Docker-CDI GPU sandbox creation/recreation, local Ollama inference, and Ollama proxy reachability—the exact runtime path changed by the PR.

Optional E2E

  • gpu-e2e (high): Useful redundant confidence from the legacy nightly GPU E2E script path. It validates install/onboard plus local Ollama GPU inference on an ephemeral GPU runner, but is less targeted than the typed Docker-CDI scenario.
  • gpu-double-onboard-e2e (high): Optional adjacent coverage for repeated GPU onboarding/recreate behavior and persisted Ollama proxy token consistency after re-onboard. It is relevant to lifecycle regressions but not the primary CDI mode-selection fix.

New E2E recommendations

  • docker-cdi-gpu-recreate-mode (high): Existing GPU E2E proves the sandbox works, but does not appear to explicitly assert the recreated container used CDI (--device nvidia.com/gpu=all) instead of legacy --gpus all, nor that transient Error phases beyond the old debounce window recover without rollback.
    • Suggested test: Add a focused Docker-CDI GPU recreate regression that captures the patched create option/container inspect output and verifies CDI mode plus successful supervisor reconnect after a transient Error phase.
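
One way the suggested container-inspect assertion could look is sketched below. It assumes Docker records a CDI device request under HostConfig.DeviceRequests with Driver set to "cdi", and records --gpus all as a request with a "gpu" capability. That matches my understanding of recent Docker Engine releases but should be confirmed on the target runner. The function name and return labels are hypothetical.

```typescript
// Hypothetical classifier over `docker inspect <container>` JSON output.
// The DeviceRequests field shapes are an assumption about Docker's output.
interface DeviceRequest {
  Driver?: string;
  DeviceIDs?: string[] | null;
  Capabilities?: string[][] | null;
}
interface InspectEntry {
  HostConfig?: { DeviceRequests?: DeviceRequest[] | null };
}

type GpuWiring = "cdi" | "gpus" | "mixed" | "none";

function classifyGpuWiring(inspectJson: string): GpuWiring {
  // docker inspect prints a JSON array with one entry per inspected object.
  const entries = JSON.parse(inspectJson) as InspectEntry[];
  const requests = entries[0]?.HostConfig?.DeviceRequests ?? [];
  const hasCdi = requests.some(
    (r) => r.Driver === "cdi" && (r.DeviceIDs ?? []).includes("nvidia.com/gpu=all"),
  );
  const hasLegacy = requests.some(
    (r) => r.Driver !== "cdi" && (r.Capabilities ?? []).some((caps) => caps.includes("gpu")),
  );
  if (hasCdi && hasLegacy) return "mixed";
  if (hasCdi) return "cdi";
  return hasLegacy ? "gpus" : "none";
}

const sample = JSON.stringify([
  { HostConfig: { DeviceRequests: [{ Driver: "cdi", DeviceIDs: ["nvidia.com/gpu=all"] }] } },
]);
console.log(classifyGpuWiring(sample));
```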

Dispatch hint

  • Workflow: .github/workflows/e2e-scenarios.yaml
  • jobs input: gpu-repo-local-ollama-openclaw

@github-actions

github-actions Bot commented Jun 8, 2026

Contributor

E2E Scenario Advisor Recommendation

Required scenario E2E: gpu-repo-local-ollama-openclaw
Optional scenario E2E: None

Dispatch required scenario E2E:

  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

  • gpu-repo-local-ollama-openclaw: Changes affect the Docker GPU patch and supervisor reconnect path, including CDI-first GPU mode selection and reconnect debounce behavior. The dispatchable scenario that exercises Docker CDI GPU onboarding is gpu-repo-local-ollama-openclaw.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw

Optional scenario E2E

  • None.

Relevant changed files

  • src/lib/onboard/docker-gpu-patch.ts
  • src/lib/onboard/docker-gpu-supervisor-reconnect.ts

@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f306d9cd-4edf-45a9-83a2-5b71b5dd0117

📥 Commits

Reviewing files that changed from the base of the PR and between db2f9f6 and 591651e.

📒 Files selected for processing (2)
  • src/lib/onboard/docker-gpu-patch-mode-selection.test.ts
  • src/lib/onboard/docker-gpu-patch.test.ts
💤 Files with no reviewable changes (1)
  • src/lib/onboard/docker-gpu-patch.test.ts

📝 Walkthrough

Walkthrough

Reorders GPU patch mode probing to prefer CDI when Docker reports a readable NVIDIA CDI spec, reorganizes related re-exports, and adds tests validating CDI-first selection, fallback to --gpus and nvidia-runtime, and propagation of the CDI device flag into recreate flows.
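
The fallback behavior summarized above amounts to picking the first candidate whose probe passes. A minimal sketch, assuming a synchronous probe callback and hypothetical names (the real probing in docker-gpu-patch.ts is not shown in this thread):

```typescript
// Hypothetical selection loop; the probe callback stands in for the real
// Docker create-only probe, which this sketch does not run.
type DockerGpuMode = "cdi" | "gpus" | "nvidia-runtime";

interface Selection {
  selected: DockerGpuMode | null;
  attempted: DockerGpuMode[];
}

function selectGpuMode(
  candidates: readonly DockerGpuMode[],
  probe: (mode: DockerGpuMode) => boolean,
): Selection {
  const attempted: DockerGpuMode[] = [];
  for (const mode of candidates) {
    attempted.push(mode);
    // Stop at the first mode the probe accepts.
    if (probe(mode)) return { selected: mode, attempted };
  }
  return { selected: null, attempted };
}

// CDI host where the CDI probe fails: --gpus is selected, but cdi is tried first.
const cdiHost: DockerGpuMode[] = ["cdi", "gpus", "nvidia-runtime"];
console.log(selectGpuMode(cdiHost, (mode) => mode !== "cdi"));
```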

Changes

CDI-first GPU patch mode candidate selection

| Layer | File(s) | Summary |
|---|---|---|
| CDI-first candidate ordering and re-exports | src/lib/onboard/docker-gpu-patch.ts | Re-export section from ./docker-gpu-supervisor-reconnect is restructured; buildDockerGpuModeCandidates now places the CDI candidate before --gpus and nvidia-runtime when cdiAvailable is true. |
| New CDI-first mode selection tests | src/lib/onboard/docker-gpu-patch-mode-selection.test.ts | Adds Vitest suite that stubs CDI host probes and inspects Docker recreate flows to assert CDI preference, fallback to gpus, fallback to nvidia-runtime, and that recreateOpenShellDockerSandboxWithGpu uses the CDI --device nvidia.com/gpu=all flag and omits --gpus. |
| Update existing unit tests and comments | src/lib/onboard/docker-gpu-patch.test.ts | Renamed/updated unit test expectations to reflect CDI-first ordering and adjusted a comment to state CDI candidates are preferred ahead of --gpus on CDI hosts. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Possibly related PRs

  • NVIDIA/NemoClaw#4407: Shares related diagnostics and docker-gpu-patch plumbing changes (getSandboxFailurePhase, captureDockerGpuPatchSandboxSnapshot, classifyDockerGpuPatchFailure).

Suggested labels

Docker, platform: container, fix, Sandbox, bug-fix

Suggested reviewers

  • cv
  • prekshivyas

Poem

🐰 I found a CDI flag bright and small,
I nudged it forward, now it leads the call.
No legacy flags trailing behind the cart,
The sandbox wakes with GPU in its heart.
Hooray — CDI first, a hoppity new start!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the primary change: preferring CDI GPU mode over --gpus on CDI hosts, directly addressing the main objective of the PR. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue #4948 by reordering GPU mode candidates to prefer CDI first when available, ensuring supervisor reconnection and successful onboarding on CDI-capable Docker hosts. |
| Out of Scope Changes check | ✅ Passed | All code changes are directly scoped to GPU mode candidate ordering and selection logic; no unrelated modifications are present. |


@github-actions

github-actions Bot commented Jun 8, 2026

Contributor

PR Review Advisor

Findings: 1 needs attention, 1 worth checking, 0 nice ideas
Since last review: 0 prior items resolved, 2 still apply, 0 new items found

Review findings

🛠️ Needs attention

  • Linked issue runtime acceptance is still not proven (src/lib/onboard/docker-gpu-patch-mode-selection.test.ts:108): Issue #4948 ("[Ubuntu 24.04][Onboard] onboard cannot create a GPU-enabled sandbox on Docker-driver GPU host") requires a normal first onboard on an affected Docker-CDI GPU host to create a GPU-enabled sandbox, reconnect the OpenShell supervisor, run the GPU proof, reach Ready, and complete onboard. The changed tests prove deterministic CDI-first selection, fallback ordering, CDI argv propagation, and a mocked long Error-phase reconnect recovery, but they still stop at mocked Docker/OpenShell boundaries and do not observe the full first-onboard runtime path.
    • Recommendation: Add or identify targeted runtime/integration validation for a normal first onboard on an affected Ubuntu Docker-CDI GPU host with no existing NemoClaw sandbox, asserting the recreated container uses `--device nvidia.com/gpu=all` rather than `--gpus all`, the supervisor reconnects, the GPU proof runs, the sandbox reaches Ready, and onboard completes.
    • Evidence: The new recreate test calls `recreateOpenShellDockerSandboxWithGpu()` directly with mocked `dockerRunDetached` and mocked `runOpenshell`; the reconnect test simulates `runOpenshell`/`runCaptureOpenshell` responses. The linked issue Expected Result says: "The sandbox is created with GPU access, the OpenShell supervisor reconnects to the GPU-enabled container, the GPU proof runs, the sandbox reaches Ready, and the first onboard completes."

🔎 Worth checking

  • Sandbox lifecycle security confidence still depends on real Docker-CDI validation (src/lib/onboard/docker-gpu-patch.ts:509): No direct sandbox escape, credential leak, SSRF bypass, policy bypass, workflow trusted-code-boundary issue, or unsafe shell-string execution issue was found. However, this is a security-sensitive sandbox lifecycle path: the PR changes which GPU device-passthrough mode is used for an OpenShell-managed container and expands the transient Error-phase debounce from about 30 seconds to about 2 minutes. Unit tests validate selection and mocked reconnect behavior, but not the real Docker-CDI container/supervisor/policy lifecycle.
    • Recommendation: Cover the real Docker-CDI host behavior in runtime validation: after CDI recreate, the supervisor reconnects before timeout, the sandbox does not remain in terminal Error during GPU enablement, policy expectations are preserved, the GPU proof runs under the expected container wiring, and persistent failures still roll back to the pre-patch sandbox.
    • Evidence: `buildDockerGpuModeCandidates()` now prepends CDI when `cdiAvailable` is true, and `DOCKER_GPU_SUPERVISOR_RECONNECT_ERROR_PHASE_DEFAULT_DEBOUNCE_POLLS` is now 60. The changed tests assert `dockerRunDetached` receives `--device nvidia.com/gpu=all` and mock supervisor reconnect success/recovery rather than exercising real Docker/OpenShell lifecycle behavior.
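
The Error-phase debounce mentioned in this finding can be illustrated with a small sketch. It assumes the debounce counts consecutive Error polls and resets on any other phase; the thread does not confirm that. The function name and the "Provisioning" phase are hypothetical. If 60 polls correspond to about 2 minutes, that would imply roughly 2 seconds between polls, which is an inference rather than something stated here.

```typescript
// Hypothetical debounce over a recorded sequence of sandbox phases, one per poll.
// "Error" and "Ready" appear in the PR; "Provisioning" is a placeholder phase.
type Phase = "Provisioning" | "Error" | "Ready";
type Outcome = "ready" | "failed" | "timeout";

function waitOutcome(phases: readonly Phase[], debouncePolls: number): Outcome {
  let consecutiveErrors = 0;
  for (const phase of phases) {
    if (phase === "Ready") return "ready";
    if (phase === "Error") {
      consecutiveErrors += 1;
      // Only treat Error as terminal once it has persisted for the full window.
      if (consecutiveErrors >= debouncePolls) return "failed";
    } else {
      consecutiveErrors = 0;
    }
  }
  // Ran out of polls without reaching Ready or a persistent Error.
  return "timeout";
}

// A transient Error lasting 40 polls, followed by recovery.
const transient: Phase[] = [...Array<Phase>(40).fill("Error"), "Ready"];
console.log(waitOutcome(transient, 60)); // recovers within a 60-poll window
console.log(waitOutcome(transient, 15)); // a shorter window gives up first
```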

🌱 Nice ideas

  • None.
Consider writing more tests for the cases below. Unit coverage is strong for deterministic mode selection, fallback ordering, Docker argv propagation, and mocked supervisor debounce, but the linked bug is a runtime/sandbox infrastructure path involving Docker-CDI hardware, OpenShell supervisor reconnect, GPU proof, Ready phase, and first-onboard completion.
  • **Runtime validation:** normal first onboard on an affected Ubuntu Docker-CDI GPU host with no existing NemoClaw sandbox selects `--device nvidia.com/gpu=all`, not `--gpus all`, and onboard completes.
  • **Runtime validation:** after CDI recreate, the OpenShell supervisor reconnects before timeout and `openshell sandbox list` reaches Ready.
  • **Runtime validation:** the direct sandbox GPU proof runs successfully after CDI recreate and records verified CUDA usability.
  • **Runtime validation:** during CDI GPU-enable, a transient Error phase recovers without aborting, while a persistent Error still rolls back to the pre-patch sandbox.
  • **Composed/unit validation:** the `createDockerGpuSandboxCreatePatch` first-onboard create-time path wires CDI detection through `recreateOpenShellDockerSandboxWithGpu`, not only direct calls to the recreate helper.
  • **Acceptance clause:** "On a Docker-driver GPU host (NVIDIA GPU auto-detected), `nemoclaw onboard` cannot bring up a GPU-enabled sandbox." Add test evidence or identify existing coverage. The production change targets Docker GPU patch mode selection, but no changed test runs `nemoclaw onboard` on a Docker-driver GPU host.
  • **Acceptance clause:** "While creating the sandbox, onboard enables GPU passthrough" (the standard create-then-GPU-enable path that runs on a normal FIRST onboard whenever a GPU is present on a Docker-driver gateway, gated by NEMOCLAW_DOCKER_GPU_PATCH). Add test evidence or identify existing coverage. The new CDI test directly calls `recreateOpenShellDockerSandboxWithGpu()` and proves CDI args reach `dockerRunDetached`; it does not exercise the full first-onboard create-time orchestration with no existing NemoClaw sandbox.
  • **Acceptance clause:** "The OpenShell supervisor never reconnects to the GPU-enabled container." Add test evidence or identify existing coverage. The reconnect test simulates a transient Error phase followed by mocked supervisor success, but no changed test observes a real OpenShell supervisor reconnecting to a real Docker-CDI GPU-enabled container.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

…4948)

Address PR review-advisor findings on the CDI-first GPU mode change:

- Move the CDI mode-selection tests out of the docker-gpu-patch.test.ts
  monolith into a focused docker-gpu-patch-mode-selection.test.ts spec,
  offsetting the flagged monolith growth (back to ~baseline line count).
- Pin the fallback chain on a CDI host: CDI probe fails -> --gpus selected;
  CDI and --gpus probes fail -> NVIDIA runtime selected (attempt order
  starts with cdi in both cases).
- Add a recreate-boundary assertion: recreateOpenShellDockerSandboxWithGpu
  passes --device nvidia.com/gpu=all to dockerRunDetached on a CDI host and
  never emits --gpus, proving the selected CDI mode reaches the real recreate
  command (the patched_create_option the issue logs).

All four new tests fail under the previous --gpus-first ordering, confirming
they pin the fix rather than restating it. No production code change.
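
The recreate-boundary assertion described above boils down to a check on the argv handed to the Docker launcher. A minimal sketch of that check, with a hypothetical helper name and sample argv (the real test asserts on the mocked dockerRunDetached call):

```typescript
// Hypothetical predicate: true when argv carries the CDI device flag
// and no legacy --gpus flag in either "--flag value" or "--flag=value" form.
function usesCdiOnly(argv: readonly string[]): boolean {
  const hasCdiDevice = argv.some(
    (arg, i) =>
      (arg === "--device" && argv[i + 1] === "nvidia.com/gpu=all") ||
      arg === "--device=nvidia.com/gpu=all",
  );
  const hasLegacyGpus = argv.some((arg) => arg === "--gpus" || arg.startsWith("--gpus="));
  return hasCdiDevice && !hasLegacyGpus;
}

// Sample argv values are illustrative, not captured from a real run.
console.log(usesCdiOnly(["run", "-d", "--device", "nvidia.com/gpu=all", "example-image"]));
console.log(usesCdiOnly(["run", "-d", "--gpus", "all", "example-image"]));
```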

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jason-ma-nv jason-ma-nv added the v0.0.61 Release target label Jun 8, 2026
@wscurran wscurran added area: providers Inference provider integrations and provider behavior bug-fix PR fixes a bug or regression labels Jun 8, 2026
@wscurran

wscurran commented Jun 8, 2026

Contributor

@wscurran wscurran requested a review from cv June 8, 2026 16:08
@cv cv added v0.0.62 Release target and removed v0.0.61 Release target labels Jun 8, 2026
@copy-pr-bot

copy-pr-bot Bot commented Jun 9, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

Selective E2E Results — ⚠️ No requested jobs ran

Run: 27242628510
Target ref: fix/onboard-prefer-cdi-gpu-mode-4948
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |

@cv

cv commented Jun 10, 2026

Collaborator

Runtime validation update for #4948 on PR head 5811e0375ae9610ed8cdc7b11ff09363e4fcd9f4:

  • Standard PR CI is green: gh pr checks reports all 30 checks passing.
  • Manual typed GPU scenario run: https://github.com/NVIDIA/NemoClaw/actions/runs/27243425713
  • Scenario: gpu-repo-local-ollama-openclaw on the linux-amd64-gpu-rtxpro6000-latest-1 Docker-CDI GPU runner.
  • The first rerun before the debounce follow-up failed in onboarding with patched_create_option=--device nvidia.com/gpu=all and supervisor reconnect timeout. After the follow-up commit, onboarding passed.

Relevant acceptance evidence from the second run artifacts:

  • Onboarding phase: passed.
  • Docker GPU mode selected: --device nvidia.com/gpu=all.
  • Dashboard reached live state and sandbox creation completed.
  • GPU proof passed through nvidia-smi, /proc/<pid>/task/<tid>/comm write, and cuInit(0) via libcuda.so.1.
  • Sandbox CUDA usability was proven.
  • Sandbox local inference route reached https://inference.local/v1/models with HTTP 200.
  • OpenClaw reached ready state.

The manual scenario run is still red, but only after the linked issue acceptance point: two later runtime provider assertions failed because the host Ollama socket disappeared between probes (curl could not connect to 127.0.0.1:11434). The smoke checks had already passed: gateway health, sandbox listed, sandbox shell, and models-health. I am treating that as a typed-scenario/provider-runner issue rather than a blocker for the Docker-CDI onboard fix.

@cv cv left a comment

Collaborator


Standard PR CI is green. I resolved the main conflict, added CDI source-boundary coverage, and followed up on the GPU runtime failure. The manual Docker-CDI scenario now passes the linked issue acceptance point: CDI recreate, supervisor reconnect/onboard, Ready/dashboard, GPU/CUDA proof, and the local inference route. The remaining red in that manual run is later host-Ollama runtime probe churn, not the onboard fix.

@cv

cv commented Jun 10, 2026

Collaborator

Current-head runtime validation update for #4948 on PR head 70921b56d73eb384632e7663817927e1b680ab82:

Acceptance evidence from the current-head artifacts:

  • Onboarding phase: passed.
  • Docker GPU mode selected: --device nvidia.com/gpu=all.
  • Dashboard reached live state and sandbox creation completed.
  • GPU proof passed through nvidia-smi, /proc/<pid>/task/<tid>/comm write, and cuInit(0) via libcuda.so.1.
  • Sandbox CUDA usability was proven.
  • Sandbox local inference route reached https://inference.local/v1/models with HTTP 200.
  • OpenClaw reached ready state.
  • State/smoke checks passed: gateway healthy, sandbox running/listed, sandbox shell, and models-health.

The typed scenario still exits red after the linked issue acceptance point, with the same provider-runner signature as the previous run: runtime.ollama.models-health saw the local Ollama model list, then runtime.ollama.chat-completion failed because curl could not connect to host 127.0.0.1:11434; runtime.ollama-auth-proxy.auth-enforcement is classified as provider-transient. I am still treating that as host-Ollama/provider-runner churn rather than a blocker for the Docker-CDI onboard fix.

@cv cv merged commit 8827570 into main Jun 10, 2026
31 of 32 checks passed
@cv cv deleted the fix/onboard-prefer-cdi-gpu-mode-4948 branch June 10, 2026 01:04
jyaunches pushed a commit that referenced this pull request Jun 10, 2026
## Summary
- Add v0.0.62 release notes from Discussion #5100 and link release
highlights to the relevant docs pages.
- Document the release's GPU sandbox recreation, sandbox-side local
inference verification, and Hermes dashboard port guard in the command
and inference references.
- Refresh generated NemoClaw user skills for the release-prep docs set.

## Source Summary
- #4956 -> `docs/reference/commands.mdx`: Document CDI-first Docker GPU
recreation behavior for Linux Docker-driver sandboxes.
- #5024 -> `docs/inference/use-local-inference.mdx`: Document
sandbox-runtime verification of the `inference.local` local inference
route.
- #5018 -> `docs/reference/commands.mdx`: Document Jetson/Tegra
device-node group propagation for sandbox CUDA initialization.
- #5012, #4763, #4706, #5030, #5015 -> `docs/about/release-notes.mdx`:
Summarize onboarding and recovery reliability fixes, including the
reserved Hermes API port guard.
- #5017 and #5043 -> `docs/about/release-notes.mdx`,
`docs/reference/commands.mdx`: Summarize mutable OpenClaw config
recovery and host-side `agents list` coverage.
- #5010 and #5016 -> `docs/about/release-notes.mdx`: Summarize Hermes
upstream metadata visibility and WhatsApp QR rendering reliability.
- #5045 and prior source docs in the v0.0.62 range -> `.agents/skills/`:
Refresh generated user-skill references from the current docs source.

## Skipped
- #5019 -> skipped for new prose because it touched
`openclaw-sandbox-permissive.yaml`, which matches `docs/.docs-skip`.
Existing source docs remain the source for generated skill
synchronization.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `npm run docs` (passes; Fern reports 0 errors and 1 hidden warning)
- Pre-commit hooks passed during commit, including docs-to-skills
verification, markdown lint, gitleaks, and skills YAML tests.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  * Added `nemoclaw <name> agents list` command.
  * v0.0.62 release notes added summarizing onboarding and recovery
    improvements.

* **Bug Fixes**
  * Improved GPU sandbox onboarding reliability (NVIDIA CDI path,
    Jetson/Tegra device handling).
  * Better local inference verification and recovery for Linux
    Docker-driver GPU sandboxes.
  * Quieter/earlier handling of onboarding drift and port collisions.

* **Documentation**
  * Expanded GPU passthrough, inference verification, writable paths
    (`/dev/pts`), port 8642 restriction, and command examples.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Prekshi Vyas <34834085+prekshivyas@users.noreply.github.com>

Labels

area: providers Inference provider integrations and provider behavior bug-fix PR fixes a bug or regression v0.0.62 Release target


Development

Successfully merging this pull request may close these issues.

[Ubuntu 24.04][Onboard] onboard cannot create a GPU-enabled sandbox on Docker-driver GPU host

3 participants