Description
On Ubuntu 24.04 with a GPU-enabled sandbox, running docker pause on the per-sandbox Docker-driver container makes nemoclaw <name> status report Phase: Error instead of the spec-expected Phase: Ready. A paused container is a transient runtime suspension, not a sandbox error condition: the spec contract (T6064990) requires the phase to remain Ready because the sandbox record itself is still valid.
The bug is environment-specific. Cross-verification on Brev (Ubuntu 22.04 + CPU sandbox) passes the spec (paused → Phase: Ready); the failure reproduces only on Ubuntu 24.04 + GPU sandbox. Recovery (Phase returns to Ready after docker unpause) works on both.
Environment
Reproduces on (FAIL):
Device: KVM VM (libvirt/QEMU x86_64 guest)
OS: Ubuntu 24.04.4 LTS (Noble Numbat)
Architecture: x86_64
GPU: NVIDIA A100 SXM4 40GB (passthrough)
Sandbox GPU: enabled (auto)
Does NOT reproduce on (PASS):
Device: Brev VM (GCP n2d-standard-4)
OS: Ubuntu 22.04 LTS (kernel 6.8.0-1058-gcp)
GPU: none (CPU sandbox)
Versions (identical on both hosts):
Node.js: v22.22.3
npm: 10.9.8
Docker: 29.5.2
OpenShell CLI: 0.0.44
NemoClaw: v0.0.53
OpenClaw: 2026.5.22 (a374c3a)
Steps to Reproduce
- On Ubuntu 24.04 with an NVIDIA GPU and nvidia-container-toolkit configured, onboard a fresh sandbox so that it ends up with sandbox GPU: enabled:
NEMOCLAW_NON_INTERACTIVE=1 NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 \
NVIDIA_API_KEY=nvapi-... nemoclaw onboard --fresh --name my-assistant
- Confirm baseline status is Ready:
nemoclaw my-assistant status # expects Phase: Ready, EXIT=0
- Identify the per-sandbox container and pause it:
CONTAINER=$(docker ps -a --format '{{.Names}}' | grep -E '^openshell-my-assistant-' | head -1)
docker pause "$CONTAINER"
docker ps -a --format '{{.Names}}: {{.Status}}' | grep "$CONTAINER"
# expect status "Up Xs (Paused)"
- Run status again and observe Phase:
nemoclaw my-assistant status | grep -E 'Phase:|Failure layer:'
- Recover:
docker unpause "$CONTAINER"; sleep 20
nemoclaw my-assistant status | grep Phase:
Expected Result
Per spec T6064990:
- Baseline status exits 0, and no Failure layer: text appears.
- After docker pause, status exits 0, shows the same sandbox name, and continues to show Phase: Ready. Output does not contain Failure layer: or gateway_unreachable.
- After docker unpause + 20s, status exits 0 and still shows Phase: Ready.
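The paused-state expectations above can be expressed as a small checker over one captured status run. This is a minimal sketch, not code from NemoClaw: `StatusRun` and `checkStatusRun` are hypothetical names, and it assumes the caller has already captured the exit code of nemoclaw itself (not of a downstream grep) together with its standard output.

```typescript
// Hypothetical helper for illustration; none of these names exist in NemoClaw.
interface StatusRun {
  exitCode: number; // exit code of `nemoclaw <name> status` itself
  stdout: string;   // captured standard output of that command
}

// Returns the list of violated expectations; an empty list means the run passes.
function checkStatusRun(run: StatusRun, sandboxName: string): string[] {
  const problems: string[] = [];
  if (run.exitCode !== 0) {
    problems.push(`exit code ${run.exitCode}, expected 0`);
  }
  if (!run.stdout.includes(sandboxName)) {
    problems.push("sandbox name missing");
  }
  if (!/Phase:\s*Ready/.test(run.stdout)) {
    problems.push("Phase is not Ready");
  }
  if (run.stdout.includes("Failure layer:")) {
    problems.push("Failure layer: present");
  }
  if (run.stdout.includes("gateway_unreachable")) {
    problems.push("gateway_unreachable present");
  }
  return problems;
}
```

Returning every violated expectation, rather than stopping at the first one, keeps a failing run self-explanatory in a test report.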
Actual Result
KVM (Ubuntu 24.04 + GPU sandbox) — FAIL on paused-state check:
CONTAINER=openshell-my-assistant-fe4cacf6-...
$ docker pause "$CONTAINER"
openshell-my-assistant-fe4cacf6-...
$ docker ps -a --format '{{.Names}}: {{.Status}}' | grep $CONTAINER
openshell-my-assistant-...: Up 45 seconds (Paused)
$ nemoclaw my-assistant status | grep -E 'Phase:|Failure layer:'
Phase: Error ← FAIL (spec expects 'Phase: Ready')
$ echo $?
0 ← exit status of the pipeline (i.e. of grep), not of nemoclaw itself
After docker unpause + 20s, Phase: Ready — recovery works correctly.
Brev (Ubuntu 22.04 + CPU sandbox) — PASSES:
$ docker pause "openshell-my-assistant-6c8a6845-..."
$ nemoclaw my-assistant status | grep Phase:
Phase: Ready ← matches spec
Logs
KVM (FAIL):
$ nemoclaw my-assistant status (baseline)
Phase: Ready
$ docker pause openshell-my-assistant-fe4cacf6-2a68-4bc9-a63e-4e5c7e29590d
openshell-my-assistant-fe4cacf6-2a68-4bc9-a63e-4e5c7e29590d
$ docker ps -a --format '{{.Names}}: {{.Status}}'
openshell-my-assistant-fe4cacf6-2a68-4bc9-a63e-4e5c7e29590d: Up 45 seconds (Paused)
$ nemoclaw my-assistant status | grep -E 'Phase:|Failure layer:'
Phase: Error
$ docker unpause openshell-my-assistant-fe4cacf6-2a68-4bc9-a63e-4e5c7e29590d
(after 20s)
$ nemoclaw my-assistant status | grep Phase:
Phase: Ready
Brev (PASS — same procedure):
$ docker pause openshell-my-assistant-6c8a6845-f9d7-41a0-8646-ab7e819e5ef0
$ nemoclaw my-assistant status | grep Phase:
Phase: Ready
Suggested Fix
The status interpretation logic (likely in src/lib/sandbox/status.ts or equivalent) appears to handle Docker's paused state differently on GPU and CPU sandboxes.
Suspected cause: GPU-enabled sandboxes carry additional health probes (GPU device check, NIM connectivity, etc.) that fail silently when the container is paused, and the status mapper treats those failures as a hard Error. CPU sandboxes don't run those probes, so they fall through to the docker-state-only path, which correctly maps paused → Ready. This is unconfirmed: the two test hosts also differ in OS release (24.04 vs 22.04), so the OS has not been ruled out as a factor.
Recommended fix:
- In the status mapper, treat Docker's Paused state as a first-class case that maps to Phase: Ready regardless of GPU probe failure (paused == transient, by design).
- Optionally surface a hint line:
Note: sandbox container is paused; run 'docker unpause <container>' or 'nemoclaw <name> recover' to resume.
- Add a regression test covering both GPU and CPU sandboxes under docker pause.
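The recommended mapping could look roughly like the sketch below. It is an assumption-laden illustration, not the actual NemoClaw code: `DockerState`, `Phase`, `PhaseResult`, and `mapPhase` are hypothetical names, probe results are collapsed into a single boolean, and every non-running, non-paused state is mapped to Error purely for brevity.

```typescript
// Hypothetical sketch; the real status mapper may be structured differently.
// The state names follow Docker's container status values.
type DockerState =
  | "created" | "restarting" | "running" | "removing"
  | "paused" | "exited" | "dead";

type Phase = "Ready" | "Error";

interface PhaseResult {
  phase: Phase;
  hint?: string; // optional human-readable note shown next to the phase
}

function mapPhase(
  state: DockerState,
  probesOk: boolean,   // assumed aggregate of GPU/NIM health probes
  container: string,
  sandboxName: string,
): PhaseResult {
  // Check paused first so that probe failures caused by the frozen
  // container can never downgrade the phase.
  if (state === "paused") {
    return {
      phase: "Ready",
      hint:
        `Note: sandbox container is paused; run 'docker unpause ${container}' ` +
        `or 'nemoclaw ${sandboxName} recover' to resume.`,
    };
  }
  if (state === "running" && probesOk) {
    return { phase: "Ready" };
  }
  // Simplification: everything else is reported as Error in this sketch.
  return { phase: "Error" };
}
```

Evaluating the paused case before any probe result is the key design choice here: it makes the outcome independent of whether a sandbox runs GPU probes at all, so GPU and CPU sandboxes would take the same path under docker pause.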
NVB#6237570