[Ubuntu 24.04][CLI&UX] nemoclaw <name> status reports Phase=Error for paused Docker-driver sandbox with GPU passthrough (spec expects Phase=Ready) #4495

@hulynn

Description

On Ubuntu 24.04 with a GPU-enabled sandbox, running docker pause on the per-sandbox Docker-driver container makes nemoclaw <name> status report Phase: Error instead of the spec-expected Phase: Ready. A paused container is a transient runtime suspension, not a sandbox error condition: the spec contract (T6064990) requires the phase to remain Ready because the sandbox record itself is still valid.

The bug is environment-specific. Cross-verification on Brev (Ubuntu 22.04 + CPU sandbox) passes the spec check (paused → Phase: Ready); the failure reproduces only on Ubuntu 24.04 + GPU sandbox. Recovery (Phase returns to Ready after docker unpause) works on both.

Environment

Reproduces on (FAIL):
  Device:        KVM VM (libvirt/QEMU x86_64 guest)
  OS:            Ubuntu 24.04.4 LTS (Noble Numbat)
  Architecture:  x86_64
  GPU:           NVIDIA A100 SXM4 40GB (passthrough)
  Sandbox GPU:   enabled (auto)

Does NOT reproduce on (PASS):
  Device:        Brev VM (GCP n2d-standard-4)
  OS:            Ubuntu 22.04 LTS (kernel 6.8.0-1058-gcp)
  GPU:           none (CPU sandbox)

Versions (identical on both hosts):
  Node.js:       v22.22.3
  npm:           10.9.8
  Docker:        29.5.2
  OpenShell CLI: 0.0.44
  NemoClaw:      v0.0.53
  OpenClaw:      2026.5.22 (a374c3a)

Steps to Reproduce

  1. On Ubuntu 24.04 with NVIDIA GPU + nvidia-container-toolkit configured, onboard a fresh sandbox so it ends up with sandbox GPU: enabled:
    NEMOCLAW_NON_INTERACTIVE=1 NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 \
    NVIDIA_API_KEY=nvapi-... nemoclaw onboard --fresh --name my-assistant
  2. Confirm baseline status is Ready:
    nemoclaw my-assistant status   # expects Phase: Ready, EXIT=0
  3. Identify the per-sandbox container and pause it:
    CONTAINER=$(docker ps -a --format '{{.Names}}' | grep -E '^openshell-my-assistant-' | head -1)
    docker pause "$CONTAINER"
    docker ps -a --format '{{.Names}}: {{.Status}}' | grep "$CONTAINER"
    # expect status "Up Xs (Paused)"
  4. Run status again and observe Phase:
    nemoclaw my-assistant status | grep -E 'Phase:|Failure layer:'
  5. Recover:
    docker unpause "$CONTAINER"; sleep 20
    nemoclaw my-assistant status | grep Phase:

Expected Result

Per spec T6064990:

  • Baseline status exits 0, no Failure layer: text appears.
  • After docker pause, status exits 0, shows the same sandbox name, and continues to show Phase: Ready. Output does not contain Failure layer: or gateway_unreachable.
  • After docker unpause + 20s, status exits 0 and still shows Phase: Ready.
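
The expectations above can be written as a small checker. This is only a sketch: `checkStatusOutput` and `StatusRun` are hypothetical names, not part of NemoClaw, and the code assumes the status output contains a line of the form `Phase: <value>` as shown in the logs in this report. The "same sandbox name" check is left out because the report does not show how the name appears in the output.

```typescript
// Hypothetical helper: checks one `nemoclaw <name> status` run against the
// expectations listed above. Not part of NemoClaw.
interface StatusRun {
  exitCode: number; // exit code of `nemoclaw <name> status` itself
  output: string;   // captured stdout
}

function checkStatusOutput(run: StatusRun): string[] {
  const problems: string[] = [];
  if (run.exitCode !== 0) {
    problems.push(`exit code ${run.exitCode}, expected 0`);
  }
  // Assumes a line such as "  Phase: Ready", as in the logs of this report.
  const match = run.output.match(/^\s*Phase:\s*(\S+)/m);
  const phase = match ? match[1] : undefined;
  if (phase !== "Ready") {
    problems.push(`Phase is ${phase ?? "missing"}, expected Ready`);
  }
  // Strings the spec says must not appear in the output.
  for (const forbidden of ["Failure layer:", "gateway_unreachable"]) {
    if (run.output.includes(forbidden)) {
      problems.push(`output contains "${forbidden}"`);
    }
  }
  return problems;
}

// The paused-state output observed on the KVM host yields one problem entry.
console.log(checkStatusOutput({ exitCode: 0, output: "  Phase: Error\n" }));
```

An empty array means the run satisfies the checks; the same function can be applied to the baseline, paused, and post-unpause runs.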

Actual Result

KVM (Ubuntu 24.04 + GPU sandbox) — FAIL on paused-state check:

CONTAINER=openshell-my-assistant-fe4cacf6-...
$ docker pause "$CONTAINER"
  openshell-my-assistant-fe4cacf6-...
$ docker ps -a --format '{{.Names}}: {{.Status}}' | grep $CONTAINER
  openshell-my-assistant-...: Up 45 seconds (Paused)
$ nemoclaw my-assistant status | grep -E 'Phase:|Failure layer:'
  Phase: Error             ← FAIL (spec expects 'Phase: Ready')
$ echo $?
  0                        ← pipeline exit status (grep's status unless pipefail is set)

After docker unpause + 20s, Phase: Ready — recovery works correctly.

Brev (Ubuntu 22.04 + CPU sandbox) — PASSES:

$ docker pause "openshell-my-assistant-6c8a6845-..."
$ nemoclaw my-assistant status | grep Phase:
  Phase: Ready             ← matches spec

Logs

KVM (FAIL):
  $ nemoclaw my-assistant status (baseline)
    Phase: Ready
  $ docker pause openshell-my-assistant-fe4cacf6-2a68-4bc9-a63e-4e5c7e29590d
    openshell-my-assistant-fe4cacf6-2a68-4bc9-a63e-4e5c7e29590d
  $ docker ps -a --format '{{.Names}}: {{.Status}}'
    openshell-my-assistant-fe4cacf6-2a68-4bc9-a63e-4e5c7e29590d: Up 45 seconds (Paused)
  $ nemoclaw my-assistant status | grep -E 'Phase:|Failure layer:'
    Phase: Error
  $ docker unpause openshell-my-assistant-fe4cacf6-2a68-4bc9-a63e-4e5c7e29590d
  (after 20s)
  $ nemoclaw my-assistant status | grep Phase:
    Phase: Ready

Brev (PASS — same procedure):
  $ docker pause openshell-my-assistant-6c8a6845-f9d7-41a0-8646-ab7e819e5ef0
  $ nemoclaw my-assistant status | grep Phase:
    Phase: Ready

Suggested Fix

The status interpretation logic (likely in src/lib/sandbox/status.ts or equivalent) appears to handle Docker's paused state differently for GPU and CPU sandboxes.

Suspected cause: GPU-enabled sandboxes carry additional health probes (GPU device check, NIM connectivity, etc.) that fail silently when the container is paused, and the status mapper treats those failures as a hard Error. CPU sandboxes don't run those probes, so they fall through to the docker-state-only path, which correctly maps paused → Ready.

Recommended fix:

  1. In the status mapper, treat Docker's Paused state as a first-class case that maps to Phase: Ready regardless of GPU probe failure (paused == transient, by design).
  2. Optionally surface a hint line: Note: sandbox container is paused; run 'docker unpause <container>' or 'nemoclaw <name> recover' to resume.
  3. Add a regression test covering both GPU and CPU sandboxes under docker pause.
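
A minimal sketch of fix 1 and the optional hint, assuming the mapper receives Docker's container state plus a list of probe results. Every name here (`mapPhase`, `DockerState`, `ProbeResult`, `StatusView`) is hypothetical and does not come from the NemoClaw source; the real logic may be structured differently, and `DockerState` lists only a subset of the states Docker can report.

```typescript
// Hypothetical types; the real status mapper may look quite different.
type DockerState = "running" | "paused" | "exited" | "dead";
type Phase = "Ready" | "Error";

interface ProbeResult {
  name: string; // illustrative probe name, e.g. "gpu-device"
  ok: boolean;
}

interface StatusView {
  phase: Phase;
  hint?: string;
}

function mapPhase(state: DockerState, probes: ProbeResult[]): StatusView {
  // Handle paused before looking at probes: a paused container cannot
  // answer them, so their failures say nothing about sandbox health.
  if (state === "paused") {
    return {
      phase: "Ready",
      hint:
        "Note: sandbox container is paused; run 'docker unpause <container>' " +
        "or 'nemoclaw <name> recover' to resume.",
    };
  }
  if (state !== "running") {
    return { phase: "Error" };
  }
  // Running container: phase depends on the probes (none for CPU sandboxes).
  return { phase: probes.every((p) => p.ok) ? "Ready" : "Error" };
}

// GPU sandbox (failing probe) and CPU sandbox (no probes), both paused.
console.log(mapPhase("paused", [{ name: "gpu-device", ok: false }]).phase);
console.log(mapPhase("paused", []).phase);
```

The two calls at the end mirror the regression test in item 3: with the paused check placed first, GPU and CPU sandboxes take the same path.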

NVB#6237570

Labels

  • NV QA (Bugs found by the NVIDIA QA Team)
  • area: cli (Command line interface, flags, terminal UX, or output)
  • platform: ubuntu (Affects Ubuntu Linux environments)