[WSL2][Sandbox] nemoclaw rebuild destroys --no-gpu sandbox without recreating when host has no NVIDIA GPU (CDI preflight) #3985

@wangericnv

Description

On a host without an NVIDIA GPU (Windows ARM WSL2, ARM64 Snapdragon X), running nemoclaw <sandbox> rebuild --yes on a sandbox that was originally onboarded with --no-gpu first deletes the old sandbox and its image, then fails to recreate it: the recreate step (onboard --resume) enforces a Docker CDI GPU preflight check that cannot pass on a non-NVIDIA host. The sandbox is left destroyed. The workspace backup is preserved at rebuild-backups/<sb>/<timestamp>, but the sandbox itself is gone.

This is the same class of non-atomic failure as NVB#6103453 / GH#2273 (Ubuntu trigger, fixed 2026-04-27), but it is reached through a different upstream code path: on hosts with no NVIDIA GPU it surfaces via the CDI preflight.

Environment

Device:        ARM64 WSL2 host (Snapdragon X laptop, no NVIDIA GPU)
OS:            Ubuntu 24.04.4 LTS Noble Numbat inside WSL2
Architecture:  aarch64 (Snapdragon X)
Node.js:       v22.22.2
npm:           10.9.7
Docker:        29.1.3, build 29.1.3-0ubuntu3~24.04.2
OpenShell CLI: 0.0.39
NemoClaw:      v0.1.0 (main HEAD cfa817b)
OpenClaw:      2026.4.24 (cbcfdf6, bundled)

Steps to Reproduce

  1. On a host with no NVIDIA GPU (e.g. Windows ARM WSL2 Snapdragon X), install NemoClaw v0.1.0 and OpenShell 0.0.39.
  2. Onboard a CPU-only sandbox:
    export NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1
    export NEMOCLAW_NON_INTERACTIVE=1
    export NEMOCLAW_PROVIDER=ollama
    nemoclaw onboard --fresh --non-interactive --yes \
        --yes-i-accept-third-party-software --no-gpu \
        --name arm64-test --agent openclaw
    This completes successfully. sandboxes.json records gpuEnabled: false and sandboxGpuEnabled: false, and the container is healthy.
  3. Trigger a rebuild:
    nemoclaw arm64-test rebuild --yes

Expected Result

  • Rebuild should respect the originating sandbox's recorded gpuEnabled: false setting from sandboxes.json and skip the GPU/CDI preflight when recreating a --no-gpu sandbox.
  • Alternatively, the rebuild preflight should fail fast before destroying the old sandbox, so the user is never left in a half-destroyed state.

Either way, rebuild should give the same atomicity guarantee that the fix for NVB#6103453 / GH#2273 introduced for the Ubuntu trigger.
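
The two expectations above amount to one ordering rule: decide the GPU requirement from the recorded sandbox entry, run every check that can block recreation, and only then destroy. The sketch below illustrates that ordering in Python. It is not NemoClaw's implementation: SandboxRecord, PreflightError, rebuild, and the three callables are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SandboxRecord:
    """Hypothetical subset of a sandboxes.json entry."""
    name: str
    gpu_enabled: bool  # mirrors the recorded gpuEnabled field


class PreflightError(RuntimeError):
    """Raised before anything has been destroyed."""


def rebuild(
    record: SandboxRecord,
    cdi_available: Callable[[], bool],
    destroy: Callable[[str], None],
    recreate: Callable[[str, bool], None],
) -> None:
    # Take the GPU requirement from the recorded setting, not from the host.
    needs_gpu = record.gpu_enabled

    # Run every check that could block recreation while the old sandbox
    # still exists. A --no-gpu sandbox never reaches the CDI check.
    if needs_gpu and not cdi_available():
        raise PreflightError(
            f"CDI GPU support missing; sandbox '{record.name}' left untouched"
        )

    # Only now perform the destructive step, then recreate.
    destroy(record.name)
    recreate(record.name, needs_gpu)
```

With this ordering, a CPU-only sandbox on a host without CDI support is destroyed and recreated normally, while a GPU sandbox on the same host fails before destroy is ever called.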

Actual Result

Rebuild runs destroy and then recreate, but the recreate step fails after the destroy has already completed. End state: the sandbox and its container are gone, and sandboxes.json holds a stub entry with imageTag=null, agentVersion=null, openshellVersion=null. The user must recover manually by:

  1. Setting NEMOCLAW_SANDBOX_GPU=0 in the environment
  2. Running nemoclaw onboard --resume
  3. Restoring workspace via nemoclaw <sb> snapshot restore "<timestamp>"

This is data loss with a recovery path: the user can lose state if they do not notice the rebuild-backups directory and the recovery hint.
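
The manual recovery can be derived mechanically from the backup path that rebuild prints, since the path layout is rebuild-backups/<sb>/<timestamp>. The Python sketch below builds the environment override and the two commands from that path. recovery_plan is a hypothetical helper, not part of NemoClaw; it only assembles the commands quoted in this report and does not run them.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Tuple


def recovery_plan(backup_path: str) -> Tuple[Dict[str, str], List[List[str]]]:
    """Return (extra environment, commands) for recovering a destroyed sandbox.

    Assumes the backup path ends in rebuild-backups/<sandbox>/<timestamp>,
    as shown in the rebuild output.
    """
    parts = PurePosixPath(backup_path).parts
    if len(parts) < 3 or parts[-3] != "rebuild-backups":
        raise ValueError(f"unexpected backup path layout: {backup_path}")
    sandbox, timestamp = parts[-2], parts[-1]

    # Step 1: force CPU sandbox behavior so the CDI preflight is not enforced.
    env = {"NEMOCLAW_SANDBOX_GPU": "0"}

    commands = [
        # Step 2: recreate the sandbox.
        ["nemoclaw", "onboard", "--resume"],
        # Step 3: restore the workspace from the preserved backup.
        ["nemoclaw", sandbox, "snapshot", "restore", timestamp],
    ]
    return env, commands
```

To execute the plan, each command could be passed to subprocess.run(cmd, env={**os.environ, **env}, check=True), so that a failed recreate stops the sequence before the restore is attempted.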

Logs

=== rebuild output (~/ARM64_phase4b.log) ===
================ STEP C: rebuild ================
Deleted: sha256:eca3cb3e2ddf7e57bf9f3874568e0c1a15c28f23cab2c2a5fd8d4cbe97cf46f1
Deleted: sha256:28ca10eed4a473ec0bb9e2eba52c042923f08111461ff126ce7d8f77bc867b63
(... ~23 more Deleted layers ...)
  Removed Docker image openshell/sandbox-from:1779349457
  ✓ Old sandbox deleted

  Creating new sandbox with current image...
  [non-interactive] Agent: OpenClaw

  NemoClaw Onboarding
  (non-interactive mode)
  (resume mode)
  ===================

  [1/8] Preflight checks
  ──────────────────────────────────────────────────
  [resume] Skipping preflight (cached)

  ✗ Docker CDI GPU support was not detected.
    Install/configure NVIDIA Container Toolkit CDI, then restart Docker:
      sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
      sudo systemctl restart docker
    Or force CPU sandbox behavior with NEMOCLAW_SANDBOX_GPU=0.

  Recreate failed after sandbox was destroyed.
  Backup is preserved at: /home/lab/.nemoclaw/rebuild-backups/arm64-test/2026-05-21T08-16-40-021Z

  To recover manually:
    1. Fix the issue above (missing credential, Docker problem, etc.)
    2. Run: nemoclaw onboard --resume
       This will recreate sandbox 'arm64-test'.
    3. Then restore your workspace state:
       nemoclaw arm64-test snapshot restore "2026-05-21T08-16-40-021Z"

=== sandboxes.json after rebuild (stub entry, missing all runtime fields) ===
{
  "sandboxes": {
    "arm64-test": {
      "name": "arm64-test",
      "createdAt": "2026-05-21T08:27:52.160Z",
      "provider": "ollama-local",
      "gpuEnabled": false,
      ...
      "openshellDriver": null,
      "openshellVersion": null,
      "agentVersion": null,
      "imageTag": null
    }
  },
  "defaultSandbox": "arm64-test"
}

=== docker ps ===
(empty — no container exists)
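
The stub entry in the sandboxes.json dump above can be detected programmatically, which may help when verifying a fix or auditing a host after a failed rebuild. The Python sketch below is a minimal check under one assumption: an entry counts as a stub when all four runtime fields shown as null in the dump (openshellDriver, openshellVersion, agentVersion, imageTag) are null or absent. find_stub_sandboxes and RUNTIME_FIELDS are hypothetical names, not part of NemoClaw.

```python
import json
from typing import List

# Runtime fields that appear as null in the stub entry from this report.
RUNTIME_FIELDS = ("openshellDriver", "openshellVersion", "agentVersion", "imageTag")


def find_stub_sandboxes(registry_text: str) -> List[str]:
    """Return names of sandboxes whose runtime fields are all null or missing."""
    registry = json.loads(registry_text)
    stubs = []
    for name, entry in registry.get("sandboxes", {}).items():
        # dict.get returns None for both an explicit JSON null and a missing key.
        if all(entry.get(field) is None for field in RUNTIME_FIELDS):
            stubs.append(name)
    return stubs
```

Running it against the contents of sandboxes.json after the failed rebuild would be expected to report arm64-test, since every runtime field in that entry is null.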

Related

NVB#6103453 / GH#2273: nemoclaw rebuild is not atomic (Ubuntu trigger). Filed 2026-04-22 by Eric Wang, fixed 2026-04-27 (BugAction: QA - Closed - Verified). It is the same class of non-atomic destroy-then-fail-recreate behavior, but this report has a different upstream trigger (CDI preflight on a no-NVIDIA-GPU host) that the original fix did not cover.

Discovered during ARM64 validation for PR #3925 QA (#3925), but not introduced by that PR: it reproduces on main HEAD cfa817b.


NVB#6199735

Metadata

Labels

  • NV QA (Bugs found by the NVIDIA QA Team)
  • UAT (Issues flagged for User Acceptance Testing.)
  • area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery)
  • platform: wsl (Affects Windows Subsystem for Linux)
