[Ubuntu 24.04][Onboard] nemoclaw onboard --gpu always fails with Docker GPU patch supervisor reconnect timeout on aarch64 dual-GPU host #4950

@hulynn

Description

On an Ubuntu 24.04 aarch64 host with two GPUs (RTX PRO 6000 Blackwell and GB300), nemoclaw onboard --gpu consistently fails at the Docker GPU patch stage. After NemoClaw recreates the sandbox container with --gpus, the OpenShell supervisor cannot reconnect (openshell sandbox exec -n <name> -- true keeps failing), the sandbox enters the Error phase, and the pre-patch sandbox is rolled back.

The rolled-back container is left degraded: only PID 1 (openshell-sandbox) and sleep infinity are running, with no nemoclaw-start and no openclaw gateway, yet nemoclaw status still reports Phase: Ready.

Tried 3 GPU mode variants, ALL fail:

  • --gpu auto (--gpus all) → sandbox Error phase
  • --gpu explicit (--gpus all) → sandbox Error phase
  • --sandbox-gpu-device <UUID> (--gpus device=<uuid>) → patch failed, pre-patch restored

Blocks ALL test cases requiring GPU sandbox on aarch64 (e.g. T6115528 GPU sandbox DNS, T6115545 Shields Down on GPU sandbox).

Escape hatch NEMOCLAW_DOCKER_GPU_PATCH=0 works but disables GPU passthrough entirely.

Note: the NemoClaw-side rollback logic works correctly. The actual reconnect failure is in OpenShell, not in NemoClaw's TypeScript, but the user-visible impact on NemoClaw is total: no --gpu onboard succeeds on this host.

Environment

Host:           galaxy-sku2-018 (10.176.173.194)
OS:             Ubuntu 24.04.4 LTS
Architecture:   aarch64
Kernel:         6.17.0-1021-nvidia-64k
GPUs:           NVIDIA RTX PRO 6000 Blackwell Max-Q (97887 MB) + NVIDIA GB300 (256703 MB)
NVIDIA driver:  610.43.02 (CUDA 13.3)
Docker:         29.2.1
nvidia-container-toolkit: 1.19.0 (CDI specs at /var/run/cdi/nvidia.yaml — nvidia-ctk cdi list shows 5 devices)
NemoClaw:       v0.0.60
OpenShell:      0.0.44 (docker driver)

Steps to Reproduce

  1. On aarch64 dual-GPU host, install NemoClaw v0.0.60:

    curl -fsSL https://www.nvidia.com/nemoclaw.sh \
      | NEMOCLAW_INSTALL_TAG=v0.0.60 NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 bash
  2. Set credentials and run --gpu onboard:

    export NVIDIA_API_KEY=nvapi-...
    export NEMOCLAW_PROVIDER=build
    export NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1
    nemoclaw onboard --gpu --name dns-gpu-test --non-interactive
  3. Observe Docker GPU patch failure message.

  4. Inspect the resulting container:

    docker exec $(docker ps --filter name=openshell-dns-gpu --format '{{.Names}}' | head -1) ps -ef

Expected Result

Onboard completes successfully with Sandbox GPU enabled. Sandbox container has nemoclaw-start + openclaw gateway running. nvidia-smi works inside the sandbox.

Actual Result

Onboard ends with:

Docker GPU patch failed.
OpenShell supervisor did not reconnect to the GPU-enabled container; pre-patch sandbox restored.
OpenShell sandbox entered Error phase before the GPU proof could run.
  sandbox_phase=Error
  patched_create_option=--gpus all
Diagnostics saved: /localhome/local-mercl/.nemoclaw/onboard-failures/2026-06-08T10-09-03-095Z-dns-gpu-test-docker-gpu-patch
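
For triage across several failed runs, the key=value lines in the output above can be pulled out with a small helper. This is only a sketch: `summary_field` is a hypothetical name, and it assumes the fields appear one per line as `key=value`, optionally indented, as in the output shown here. The layout of summary.txt in the diagnostics directory is not shown in this report, so applying the helper to that file is an assumption.

```shell
# summary_field KEY
# Hypothetical helper: reads text on stdin and prints the value of the first
# "KEY=value" line, ignoring leading whitespace. Matching is literal, not regex.
summary_field() {
  awk -v k="$1" '
    { sub(/^[ \t]+/, "") }                 # drop leading indentation
    index($0, k "=") == 1 {                # line starts with "KEY="
      print substr($0, length(k) + 2)      # everything after the "="
      exit
    }
  '
}
```

Example use, assuming summary.txt follows the same layout: `summary_field sandbox_phase < summary.txt`.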

docker exec into the restored container shows only:

UID     PID  PPID  CMD
root    1    0     /opt/openshell/bin/openshell-sandbox
sandbox 94   1     sleep infinity
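
Because nemoclaw status reports Phase: Ready for this container, a process-level check may be a more reliable health signal. The sketch below reads `ps -ef` output on stdin and reports which of the expected processes are absent. `check_sandbox_processes` is a hypothetical helper, and it assumes the healthy processes show up with `nemoclaw-start` and `openclaw` somewhere in their command lines, which this report does not confirm.

```shell
# check_sandbox_processes
# Hypothetical helper: reads a process listing on stdin, prints one
# "missing: NAME" line per expected process that is absent, and returns
# non-zero if any are missing. The process names are assumptions.
check_sandbox_processes() {
  local listing proc missing=0
  listing="$(cat)"
  for proc in nemoclaw-start openclaw; do
    if ! printf '%s\n' "$listing" | grep -qF -- "$proc"; then
      echo "missing: $proc"
      missing=1
    fi
  done
  return "$missing"
}
```

It could be fed from the inspection command in the reproduction steps, for example `docker exec <container> ps -ef | check_sandbox_processes`.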

Logs

Diagnostic directory on host:
  /localhome/local-mercl/.nemoclaw/onboard-failures/2026-06-08T10-09-03-095Z-dns-gpu-test-docker-gpu-patch/
    summary.txt (failure metadata: sandbox_phase, patched_create_option, etc.)
    docker-inspect.json
    docker-logs.txt
    openshell-sandbox-get.txt (Error phase reading)
    openshell-sandbox-list.txt

Code references:
  src/lib/onboard/docker-gpu-supervisor-reconnect.ts:118-148 (detects + rolls back correctly)
  src/lib/onboard/docker-gpu-sandbox-create.ts:165-214 (reconnect wait + failure exit)

Suggested investigation:
  Root cause is in OpenShell supervisor's ability to reconnect to a --gpus-patched
  container on aarch64 + kernel 6.17 nvidia-64k + dual GPU. May be specific to
  nvidia-container-toolkit 1.19.0 + this Blackwell GPU combo. Route to OpenShell
  team for the underlying fix; NemoClaw side is correct.
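
To help isolate the reconnect failure from NemoClaw's own wait logic, the same probe can be retried by hand with a deadline. The sketch below is a generic retry loop, not NemoClaw's implementation: `wait_until` is a hypothetical helper, and the timeout and interval values in the usage line are arbitrary, since the values NemoClaw uses are not stated in this report.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it exits 0 (returns 0) or the
# deadline passes (returns 1). COMMAND always runs at least once.
wait_until() {
  local timeout="$1" interval="$2" deadline
  shift 2
  deadline=$(( $(date +%s) + timeout ))
  while true; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
}
```

For example, `wait_until 120 5 openshell sandbox exec -n dns-gpu-test -- true` would retry the probe from the description every 5 seconds for up to about 2 minutes.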

NVB#6282407

Metadata

Assignees

No one assigned

    Labels

    NV QA: Bugs found by the NVIDIA QA Team
    UAT: Issues flagged for User Acceptance Testing.

