[DGX Station][Onboard] cuInit(0) SIGSEGV in verifyDirectSandboxGpu — onboard aborts (v0.0.43 GB300 aarch64)

## Description

On NVIDIA DGX Station (NVIDIA GB300, aarch64, driver 590.48.01) with the released v0.0.43, `nemoclaw onboard` (provider=build, NVIDIA Endpoints) passes preflight, starts the Docker-driver gateway with GPU passthrough enabled, builds the sandbox image, and creates the sandbox — but then `verifyDirectSandboxGpu()` segfaults inside libcuda.so.1 at `cuInit(0)`. The working sandbox is deleted and onboard exits 1, so the user must start over.

Host-level GPU passthrough is healthy: `docker run --device nvidia.com/gpu=all ubuntu:24.04 nvidia-smi -L` and `docker run --gpus all ubuntu:24.04 nvidia-smi -L` both report `GPU 0: NVIDIA GB300 (UUID GPU-2afbcedf-5635-498a-ff00-13cdfa19571f)` and `ldconfig -p` shows libcuda.so.1 mounted correctly. So the segfault is specific to nemoclaw's GPU verifier binary, not a generic CDI / driver issue.
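One caveat on the checks above: `nvidia-smi` talks to the driver through NVML and does not call `cuInit`, so a standalone `cuInit(0)` probe run in the same container would help confirm whether the crash is specific to the verifier. The following is a minimal sketch of such a probe, not something that was run for this report. It assumes `python3` is available wherever it is executed; `probe_cuinit` and `CHILD` are hypothetical names, and only the standard `ctypes` and `subprocess` modules are used.

```python
import subprocess
import sys

# Child snippet: load the driver library and call cuInit(0).
# It runs in a separate process so a segfault cannot take down the caller.
CHILD = r"""
import ctypes, sys
try:
    lib = ctypes.CDLL(sys.argv[1])
except OSError:
    sys.exit(2)          # library missing or not loadable
lib.cuInit.argtypes = [ctypes.c_uint]
lib.cuInit.restype = ctypes.c_int
sys.exit(0 if lib.cuInit(0) == 0 else 1)
"""


def probe_cuinit(libname="libcuda.so.1"):
    """Call cuInit(0) in a child process and summarize how the child ended."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD, libname],
        capture_output=True,
        text=True,
    )
    rc = proc.returncode
    if rc < 0:
        # On POSIX, subprocess reports death by signal N as returncode -N.
        return f"crashed with signal {-rc}"
    if rc == 0:
        return "cuInit(0) returned CUDA_SUCCESS"
    if rc == 2:
        return "library could not be loaded"
    return "cuInit(0) returned an error code"


if __name__ == "__main__":
    print(probe_cuinit())
```

If this probe also reports `crashed with signal 11` inside the sandbox image but succeeds in a plain `ubuntu:24.04` container, that would point at the sandbox environment rather than at the verifier binary alone.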

Reproduced in 2 of 2 v0.0.43 sanity-matrix runs:
- Pipeline 51356759 dgx-station job 318847742 (2026-05-14 19:30 PDT): died at 8 min with same cuInit(0) SIGSEGV
- Pipeline 51397581 dgx-station job 318985352 (2026-05-15 05:10 PDT, retried after clean state): died at 7m30s with same cuInit(0) SIGSEGV

Related

NVBug 6150354 [DGX Spark][Onboard] — same `verifyDirectSandboxGpu` function but a different step (`/proc/self/task/tid/comm write` proof rejected with "command argument contains newline"). That was on PR #3001 / v0.1.0; ours is on the released v0.0.43, and the failure mode is libcuda SIGSEGV not gRPC string-formatting.
NVBug 6180214 [All Platforms (GPU)][Onboard] — different code path: gateway started without GPU passthrough refuses recreate-sandbox. Filed by the same author on 2026-05-15.

Environment


Device: NVIDIA DGX Station (galaxy-ts2-052)
OS: Ubuntu 24.04 LTS (Linux 6.17.0-1008-nvidia-64k)
Architecture: aarch64
GPU: NVIDIA GB300 (UUID GPU-2afbcedf-5635-498a-ff00-13cdfa19571f, 284208 MB)
Driver: NVIDIA 590.48.01 (CUDA Version 13.1)
NVIDIA CTK: NVIDIA Container Toolkit 1.18.2
Node.js: v22.22.2
npm: 10.9.7
Docker: 29.1.3, build f52814d
OpenShell CLI: openshell 0.0.39
NemoClaw: nemoclaw v0.0.43
OpenClaw: N/A (sandbox deleted before commit)

Steps to Reproduce


1. Fresh DGX Station with Ubuntu 24.04 (no prior nemoclaw state), driver 590.48.01, docker 29.1.3, nvidia-container-toolkit 1.18.2, /etc/cdi/nvidia.yaml present.
2. Sanity-check host GPU passthrough works:
 docker run --rm --device nvidia.com/gpu=all ubuntu:24.04 nvidia-smi -L
 Should show GPU 0: NVIDIA GB300 (this is OK on the broken host).
3. Install v0.0.43 and run a fresh onboard:
 NEMOCLAW_NON_INTERACTIVE=1 \
 NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 \
 NEMOCLAW_PROVIDER=build \
 NVIDIA_API_KEY= \
 NEMOCLAW_INSTALL_TAG=v0.0.43 \
 curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash

Expected Result


Onboard completes: gateway healthy, sandbox image built, sandbox commits with GPU access, the user can `nemoclaw connect` and use openclaw inside.

Actual Result


Preflight, gateway start, sandbox image build, sandbox create all succeed. Then `verifyDirectSandboxGpu()` segfaults at cuInit(0):

 NVIDIA GPU detected; enabling OpenShell GPU passthrough. Use --no-gpu to opt out.
 Docker-driver GPU patch will use host networking; local inference providers will use sandbox loopback.

 [2/8] Starting OpenShell gateway
 Docker-driver GPU host networking active; skipping sandbox bridge gateway reachability probe.
 ✓ Docker-driver gateway is healthy

 ... (sandbox image build through Step 58/65, sandbox create succeeds) ...

 The failed sandbox/container has been left in place for inspection.
 Manual cleanup:
 openshell sandbox delete "my-assistant"

 /home/gitlab-runner/.nemoclaw/source/dist/lib/onboard.js:1380
 throw new Error(`GPU proof failed: ${proof.label} (status ${statusText})${diagnosticSuffix}`);
 ^
 Error: GPU proof failed: cuInit(0) via libcuda.so.1 (status 139): Segmentation fault (core dumped)
 at verifyDirectSandboxGpu (.../dist/lib/onboard.js:1380:15)
 at createSandbox (.../dist/lib/onboard.js:4685:13)
 at process.processTicksAndRejections (node:internal/process/task_queues:103:5)
 at async Object.onboard [as runOnboard] (.../dist/lib/onboard.js:8692:27)
 at async runOnboardCommand (.../dist/lib/onboard/legacy-command.js:207:5)
 at async runOnboardAction (.../dist/lib/actions/onboard.js:26:5)
 at async runOnboardAction (.../dist/lib/actions/global.js:28:5)
 at async OnboardCliCommand.run (.../dist/lib/commands/onboard.js:18:9)
 Node.js v22.22.2
 Curl Bash Installation exit code: 1

Status 139 = SIGSEGV. The verifier binary that nemoclaw spawns to run cuInit(0) inside the sandbox is crashing. Plain `docker run --gpus all ubuntu:24.04 nvidia-smi -L` on the same host succeeds, so generic CDI / driver / device-node mounting are not the cause.
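For reference, the mapping from status 139 to SIGSEGV follows the usual shell convention of reporting a signal death as 128 plus the signal number (139 - 128 = 11). A small sketch of that arithmetic, assuming the status in the error message is a shell-style exit status; `decode_exit_status` is a hypothetical helper, not part of nemoclaw:

```python
import signal


def decode_exit_status(status):
    """Interpret a shell-style exit status (128 + N means killed by signal N)."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by {name} (signal {signum})"
    return f"exited with status {status}"


print(decode_exit_status(139))  # killed by SIGSEGV (signal 11)
```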

## Bug Details

| Field | Value |
|-------|-------|
| Priority | Unprioritized |
| Action | Dev - Open - To fix |
| Disposition | Open issue |
| Module | Machine Learning - NemoClaw |
| Keyword | NemoClaw, NemoClaw_Automation, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Onboard, NemoClaw_Sandbox, NemoClaw-SWQA-RelBlckr-Recommended, NemoClaw-SWQA-Sprint4-Blocker |

---
[NVB#6180869]
