Description
On NVIDIA DGX Station (NVIDIA GB300, aarch64, driver 590.48.01) with the released v0.0.43, nemoclaw onboard (provider=build, NVIDIA Endpoints) passes preflight, starts the Docker-driver gateway with GPU passthrough enabled, builds the sandbox image, and creates the sandbox — but then verifyDirectSandboxGpu() segfaults inside libcuda.so.1 at cuInit(0). The working sandbox is deleted and onboard exits 1, so the user must start over.
Host-level GPU passthrough is healthy: docker run --device nvidia.com/gpu=all ubuntu:24.04 nvidia-smi -L and docker run --gpus all ubuntu:24.04 nvidia-smi -L both report GPU 0: NVIDIA GB300 (UUID GPU-2afbcedf-5635-498a-ff00-13cdfa19571f) and ldconfig -p shows libcuda.so.1 mounted correctly. So the segfault is specific to nemoclaw's GPU verifier binary, not a generic CDI / driver issue.
This is reproducible 2/2 in v0.0.43 sanity matrix runs:
- Pipeline 51356759 dgx-station job 318847742 (2026-05-14 19:30 PDT): died at 8 min with same cuInit(0) SIGSEGV
- Pipeline 51397581 dgx-station job 318985352 (2026-05-15 05:10 PDT, retried after clean state): died at 7m30s with same cuInit(0) SIGSEGV
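One way to test the claim that the crash is specific to the verifier rather than to libcuda itself is to call cuInit(0) from a separate, minimal process inside the sandbox. The sketch below is not nemoclaw's verifier; `probe_cuinit` and its result labels are hypothetical names chosen for illustration. It assumes a Python interpreter is available where the probe runs. The call is made in a child process so that a segfault is reported as a signal number instead of killing the caller.

```python
import subprocess
import sys

# Child program: load the driver library and call cuInit(0).
# cuInit(unsigned int Flags) is the CUDA Driver API entry point; 0 means CUDA_SUCCESS.
_CHILD = r"""
import ctypes, sys
try:
    lib = ctypes.CDLL(sys.argv[1])
except OSError:
    sys.exit(2)  # library could not be loaded
lib.cuInit.argtypes = [ctypes.c_uint]
lib.cuInit.restype = ctypes.c_int
sys.exit(0 if lib.cuInit(0) == 0 else 3)
"""


def probe_cuinit(lib_name="libcuda.so.1", timeout=60):
    """Hypothetical helper: run cuInit(0) in a child process and classify the result.

    Returns a (label, code) tuple:
      ("ok", 0)           cuInit returned CUDA_SUCCESS
      ("load_failed", 2)  the library could not be loaded
      ("signal", N)       the child was killed by signal N (11 is SIGSEGV)
      ("error", rc)       any other non-zero exit, e.g. a non-zero CUresult
    """
    proc = subprocess.run(
        [sys.executable, "-c", _CHILD, lib_name],
        capture_output=True,
        timeout=timeout,
    )
    rc = proc.returncode
    if rc < 0:
        # On POSIX, subprocess reports death-by-signal as a negative return code.
        return ("signal", -rc)
    if rc == 0:
        return ("ok", 0)
    if rc == 2:
        return ("load_failed", rc)
    return ("error", rc)
```

If this probe also returned ("signal", 11) inside the sandbox, the crash would be reproducible without the verifier binary, which would point toward the sandbox environment rather than the verifier itself.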
Related
NVBug 6150354 [DGX Spark][Onboard]: same verifyDirectSandboxGpu function but a different step (the /proc/self/task/tid/comm write proof was rejected with "command argument contains newline"). That was on PR #3001 / v0.1.0; this report is on the released v0.0.43, and the failure mode is a libcuda SIGSEGV, not gRPC string formatting.
NVBug 6180214 [All Platforms (GPU)][Onboard]: different code path (a gateway started without GPU passthrough refuses recreate-sandbox). Filed by the same author on 2026-05-15.
Environment
Device: NVIDIA DGX Station (galaxy-ts2-052)
OS: Ubuntu 24.04 LTS (Linux 6.17.0-1008-nvidia-64k)
Architecture: aarch64
GPU: NVIDIA GB300 (UUID GPU-2afbcedf-5635-498a-ff00-13cdfa19571f, 284208 MB)
Driver: NVIDIA 590.48.01 (CUDA Version 13.1)
NVIDIA CTK: NVIDIA Container Toolkit 1.18.2
Node.js: v22.22.2
npm: 10.9.7
Docker: 29.1.3, build f52814d
OpenShell CLI: openshell 0.0.39
NemoClaw: nemoclaw v0.0.43
OpenClaw: N/A (sandbox deleted before commit)
Steps to Reproduce
- Fresh DGX Station with Ubuntu 24.04 (no prior nemoclaw state), driver 590.48.01, docker 29.1.3, nvidia-container-toolkit 1.18.2, /etc/cdi/nvidia.yaml present.
- Sanity-check host GPU passthrough works:
docker run --rm --device nvidia.com/gpu=all ubuntu:24.04 nvidia-smi -L
Should show GPU 0: NVIDIA GB300 (this is OK on the broken host).
- Install v0.0.43 and run a fresh onboard:
NEMOCLAW_NON_INTERACTIVE=1
NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1
NEMOCLAW_PROVIDER=build
NVIDIA_API_KEY=
NEMOCLAW_INSTALL_TAG=v0.0.43
curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash
Expected Result
Onboard completes: gateway healthy, sandbox image built, sandbox commits with GPU access, and the user can run nemoclaw connect and use openclaw inside.
Actual Result
Preflight, gateway start, sandbox image build, sandbox create all succeed. Then verifyDirectSandboxGpu() segfaults at cuInit(0):
NVIDIA GPU detected; enabling OpenShell GPU passthrough. Use --no-gpu to opt out.
Docker-driver GPU patch will use host networking; local inference providers will use sandbox loopback.
[2/8] Starting OpenShell gateway
Docker-driver GPU host networking active; skipping sandbox bridge gateway reachability probe.
✓ Docker-driver gateway is healthy
... (sandbox image build through Step 58/65, sandbox create succeeds) ...
The failed sandbox/container has been left in place for inspection.
Manual cleanup:
openshell sandbox delete "my-assistant"
/home/gitlab-runner/.nemoclaw/source/dist/lib/onboard.js:1380
throw new Error(`GPU proof failed: ${proof.label} (status ${statusText})${diagnosticSuffix}`);
^
Error: GPU proof failed: cuInit(0) via libcuda.so.1 (status 139): Segmentation fault (core dumped)
at verifyDirectSandboxGpu (.../dist/lib/onboard.js:1380:15)
at createSandbox (.../dist/lib/onboard.js:4685:13)
at process.processTicksAndRejections (node:internal/process/task_queues:103:5)
at async Object.onboard [as runOnboard] (.../dist/lib/onboard.js:8692:27)
at async runOnboardCommand (.../dist/lib/onboard/legacy-command.js:207:5)
at async runOnboardAction (.../dist/lib/actions/onboard.js:26:5)
at async runOnboardAction (.../dist/lib/actions/global.js:28:5)
at async OnboardCliCommand.run (.../dist/lib/commands/onboard.js:18:9)
Node.js v22.22.2
Curl Bash Installation exit code: 1
Status 139 = SIGSEGV. The verifier binary that nemoclaw spawns to run cuInit(0) inside the sandbox is crashing. Plain docker run --gpus all ubuntu:24.04 nvidia-smi -L on the same host succeeds, so generic CDI / driver / device-node mounting are not the cause.
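The mapping from status 139 to SIGSEGV follows the shell convention that a process killed by signal N is reported with exit status 128 + N (139 - 128 = 11, and signal 11 is SIGSEGV on Linux). A minimal sketch for triaging such lines from pipeline logs is shown below; `decode_proof_failure` and its regular expression are hypothetical, written against the single error line quoted in this report, and assume that nemoclaw reports the status using that convention.

```python
import re
import signal

# Matches the error line format seen in this report, e.g.
# "GPU proof failed: cuInit(0) via libcuda.so.1 (status 139): ..."
_PROOF_RE = re.compile(r"GPU proof failed: (?P<label>.+?) \(status (?P<status>\d+)\)")


def decode_proof_failure(line):
    """Hypothetical helper: extract the proof label and status from a log line.

    Statuses above 128 are interpreted with the shell convention
    status = 128 + signal number. Returns None if the line does not match.
    """
    match = _PROOF_RE.search(line)
    if match is None:
        return None
    status = int(match.group("status"))
    sig_name = None
    if status > 128:
        try:
            sig_name = signal.Signals(status - 128).name
        except ValueError:
            sig_name = None  # not a known signal number on this platform
    return {"label": match.group("label"), "status": status, "signal": sig_name}
```

Applied to the error line above, this yields the label "cuInit(0) via libcuda.so.1", status 139, and signal SIGSEGV, which makes it easy to confirm that both pipeline runs failed in the same proof step.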
Bug Details
| Field | Value |
| --- | --- |
| Priority | Unprioritized |
| Action | Dev - Open - To fix |
| Disposition | Open issue |
| Module | Machine Learning - NemoClaw |
| Keyword | NemoClaw, NemoClaw_Automation, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Onboard, NemoClaw_Sandbox, NemoClaw-SWQA-RelBlckr-Recommended, NemoClaw-SWQA-Sprint4-Blocker |
[NVB#6180869]