[DGX Spark][Sandbox] post-reboot first 'nemoclaw <name> status' destroys sandbox via "Removed stale local registry entry" #4423

@wangericnv

Description

After a sudo reboot on DGX Spark with nemoclaw v0.0.53, the openshell-gateway host process does NOT auto-start. When the user runs nemoclaw <name> status to check the sandbox post-reboot, nemoclaw auto-starts the gateway but then concludes the sandbox is "not present in the live OpenShell gateway" and silently REMOVES the local registry entry. Net effect: every host reboot destroys the user's sandbox. The sandbox's Docker backup container is still on disk in Exited (137) state, but nemoclaw list reports "No sandboxes registered" and the user has no straightforward recovery path.

This appears to be a regression / over-correction of the cleanup logic introduced by NVBug 6041649 and NVBug 6059659 — those fixes correctly removed ghost/destroyed sandboxes that the CLI was incorrectly resurrecting from onboard-session metadata. The same "Removed stale local registry entry" code path now fires too aggressively against VALID sandboxes whose containers exist on disk but happen to be missing from a freshly-started gateway's live state because the gateway just came up after reboot.

Environment

Device:        DGX Spark (10.173.104.102, "Spark 110")
OS:            Ubuntu 24.04.4 LTS
Kernel:        6.17.0-1014-nvidia
Architecture:  aarch64
Node.js:       v22.22.3 (via NVM; not in default PATH)
npm:           v10.x (via NVM; not in default PATH)
Docker:        29.2.1, build a5c7197
OpenShell CLI: 0.0.44 (docker driver)
NemoClaw:      v0.0.53
OpenClaw:      v2026.5.22
Sandbox:       my-assistant — provider=ollama-local, model=qwen3.6:35b

Steps to Reproduce

  1. Fresh Spark host (DGX Spark / aarch64) with nemoclaw v0.0.53 installed and my-assistant sandbox onboarded with ollama-local provider.
  2. Verify baseline healthy:
    /home/nvidia/.local/bin/nemoclaw my-assistant status
    Expected output includes: Phase: Ready, Inference (ollama backend): healthy, Inference (auth proxy): healthy.
  3. Record pre-reboot uptime: cat /proc/uptime (e.g. 119214.48 seconds).
  4. From SSH:
    sudo reboot
  5. Wait ~70 seconds for the box to come back. Verify reboot actually happened: cat /proc/uptime (should be small, e.g. 21s).
  6. Run:
    /home/nvidia/.local/bin/nemoclaw my-assistant status
  7. Observe the destructive line: Sandbox 'my-assistant' is not present in the live OpenShell gateway. Removed stale local registry entry.
  8. Confirm sandbox is gone: /home/nvidia/.local/bin/nemoclaw list → No sandboxes registered. Run nemoclaw onboard to get started.
  9. Confirm Docker leftover exists but is dead: docker ps -a | grep openshell → only the *-nemoclaw-gpu-backup-* container in Exited (137) state.
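
The reboot check in steps 3 and 5 can be automated by comparing two /proc/uptime readings. The following is a minimal sketch, not part of nemoclaw; the function names are hypothetical, and the post-reboot sample value is illustrative because the report records it only as 21.xx.

```python
def parse_uptime(text: str) -> float:
    """Return seconds since boot from the contents of /proc/uptime."""
    # /proc/uptime holds two numbers: seconds since boot, then aggregate idle seconds.
    return float(text.split()[0])


def rebooted_between(before: str, after: str) -> bool:
    """True when the later reading is smaller, meaning the uptime counter reset."""
    return parse_uptime(after) < parse_uptime(before)


# Sample readings: the first is from the report, the second is illustrative.
PRE_REBOOT = "119214.48 2364036.82"
POST_REBOOT = "21.50 60.10"
```

On the host, the two strings would come from reading /proc/uptime before and after step 4; rebooted_between(PRE_REBOOT, POST_REBOOT) then confirms the reboot really happened before the status call in step 6 is evaluated.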

Expected Result

Per DevTest 5949417 Scenario B Step 8 (Inference provider reconnection after restart), nemoclaw status after reboot should show "Status healthy" and the sandbox should remain registered. The openshell-gateway should auto-start on boot (e.g. via systemd unit or launchd entry). The sandbox container should be restarted/recreated via the gateway's reconciliation loop, NOT deleted from the user's registry. Inference should reconnect via the Ollama provider (which itself auto-recovers via its own systemd unit on Spark).

At minimum: if the gateway truly cannot find the sandbox in its live state, the CLI should NOT silently delete the user's registry entry on a passive status query — it should warn and require explicit user action (e.g. --prune or interactive confirmation), since the same registry entry remains valid as soon as the gateway finishes reconciliation.
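
The minimum behavior described above can be sketched as a small decision function. This is an illustration under assumptions, not nemoclaw's actual code: the Action enum, the function name, and the prune_requested parameter (standing in for a --prune flag or an interactive confirmation) are all hypothetical.

```python
from enum import Enum


class Action(Enum):
    KEEP = "keep"      # sandbox is visible in the gateway, nothing to do
    WARN = "warn"      # sandbox looks missing, but the registry entry is left alone
    REMOVE = "remove"  # sandbox looks missing and the user explicitly asked for cleanup


def decide_registry_action(in_live_gateway: bool, prune_requested: bool) -> Action:
    """Decide what a status query may do with a local registry entry."""
    if in_live_gateway:
        return Action.KEEP
    if prune_requested:
        # Deletion happens only on explicit opt-in (hypothetical --prune or confirmation).
        return Action.REMOVE
    # A passive status query never deletes; it only reports the mismatch.
    return Action.WARN
```

With this shape, the post-reboot status call from this report would map to decide_registry_action(False, False) and produce a warning, leaving the entry available once the gateway finishes reconciliation.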

Actual Result

  • Reboot succeeds cleanly: uptime resets from 119214s → 21s; Ollama systemd auto-recovers and serves /api/tags normally.
  • openshell-gateway is NOT registered with systemd on Spark — process stays dead after reboot until manually triggered.
  • First post-reboot nemoclaw my-assistant status invocation:
    1. Auto-starts the gateway (Starting OpenShell gateway → Docker-driver gateway is healthy)
    2. Reports Inference: not verified (gateway/sandbox state not verified)
    3. Then destructively prints:
      Sandbox 'my-assistant' is not present in the live OpenShell gateway.
      Removed stale local registry entry.
      
  • After that single call:
    • nemoclaw list → No sandboxes registered.
    • nemoclaw my-assistant status → Sandbox 'my-assistant' does not exist. Run 'nemoclaw onboard' to create one.
  • Docker side: the sandbox's GPU-backup container is still on disk (Exited 137 from the reboot SIGKILL), but the user has no documented path to recover the workspace from it.

Logs

# Pre-reboot baseline (healthy):
$ /home/nvidia/.local/bin/nemoclaw my-assistant status
  Sandbox: my-assistant
    Model:    qwen3.6:35b
    Provider: ollama-local
    Inference (ollama backend): healthy (http://127.0.0.1:11434/api/tags)
    Inference (auth proxy): healthy (http://127.0.0.1:11435/api/tags)
    Host GPU: yes
    Sandbox GPU: enabled (auto)
    OpenShell: 0.0.44 (docker)
    Policies: npm, pypi, huggingface, brew, brave, local-inference, openclaw-pricing
    Connected: no
    Agent:    OpenClaw v2026.5.22
  Id:    16568949-cbc2-497d-8415-2dcebf03b98b
  Name:  my-assistant
  Phase: Ready

# Pre-reboot uptime:
$ cat /proc/uptime
119214.48 2364036.82

# Post-reboot uptime confirms real reboot:
$ cat /proc/uptime
21.xx 60.xx

# Post-reboot ollama auto-recovered:
$ systemctl status ollama
  Active: active (running) since Thu 2026-05-28 08:47:48 UTC
$ curl -sS http://127.0.0.1:11434/api/tags
{"models":[{"name":"qwen3.6:35b", ... }]}

# Post-reboot openshell-gateway NOT auto-started:
$ pgrep -f openshell-gateway
(exit 1, nothing)

# First post-reboot status call — DESTRUCTIVE:
$ /home/nvidia/.local/bin/nemoclaw my-assistant status
✓ Active gateway set to 'nemoclaw'

  [2/8] Starting OpenShell gateway
  ──────────────────────────────────────────────────
  Starting OpenShell Docker-driver gateway...
  Gateway log: /home/nvidia/.local/state/nemoclaw/openshell-docker-gateway/openshell-gateway.log
  ✓ Docker-driver gateway is healthy
✓ Active gateway set to 'nemoclaw'

  Sandbox: my-assistant
    Model:    qwen3.6:35b
    Provider: ollama-local
    Inference: not verified (gateway/sandbox state not verified)
    Host GPU: yes
    Sandbox GPU: enabled (auto)
    OpenShell: 0.0.44 (docker)
    Policies: npm, pypi, huggingface, brew, brave, local-inference, openclaw-pricing
    ...

  Sandbox 'my-assistant' is not present in the live OpenShell gateway.
  Removed stale local registry entry.

# After:
$ /home/nvidia/.local/bin/nemoclaw list
  No sandboxes registered. Run `nemoclaw onboard` to get started.

$ /home/nvidia/.local/bin/nemoclaw my-assistant status
  Sandbox 'my-assistant' does not exist.
  Run 'nemoclaw onboard' to create one.

# Docker leftover (workspace data still on disk but inaccessible via nemoclaw):
$ docker ps -a --format '{{.Names}}\t{{.Status}}'
openshell-my-assistant-16568949-cbc2-497d-8415-2dcebf03b98b-nemoclaw-gpu-backup-1779957499784	Exited (137) 10 minutes ago

# Related dev info:
- Test source: DevTest 5949417 (Inference provider reconnection after restart by provider type), Scenario B (Ollama local), step 8.
- Discovered: 2026-05-28 during v0.0.53 P1 batch on Spark 110.

Related Bugs / not duplicate of

  • NVBug 6041649 (Merc Lau, fixed 2026-04-09): [NemoClaw][spark] [GitHub Issue #1316] nemoclaw list shows ghost sandboxes recovered from onboard session after gateway restart or machine reboot — introduced the "Removed stale local registry entry" cleanup that this regression now mis-applies.
  • NVBug 6059659 (Joyce Chen, fixed 2026-04-15): [NemoClaw][All platforms][Github Issue #1641] nemoclaw list resurrects destroyed sandboxes from onboard-session metadata, but they are no longer accessible — sibling fix in the same cleanup path.

Both prior bugs were about ghost/destroyed sandboxes being incorrectly resurrected as alive. This new bug is the inverse: a VALID sandbox is being incorrectly killed because its Docker container is temporarily missing from the gateway's in-memory state right after a fresh post-reboot gateway start. Same code path, opposite environmental condition.

Proposed Fix

Two complementary changes:

  1. Register openshell-gateway with systemd (or equivalent) so it auto-starts on Spark host boot — matches what ollama.service already does on the same host. Auto-restart should also restart the sandbox container so the user lands in a healthy state without manual intervention.
  2. Tighten the "Removed stale local registry entry" cleanup so it does NOT fire on the very first post-gateway-start status query. Possible heuristics:
    • Wait one full reconciliation tick before declaring a registry entry stale.
    • Cross-check against docker ps -a for the sandbox container name pattern (openshell-<name>-<uuid>*) before deletion — if a backup container exists, prefer recovery over deletion.
    • Make the deletion explicit (require --prune flag or interactive confirmation) instead of a side-effect of a passive status query.
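
The first two heuristics could be combined roughly as follows. This is a hedged sketch rather than the actual cleanup code: the helper names and the ticks_since_gateway_start counter are hypothetical, the UUID shape is inferred from the sandbox Id in the logs, and the input is assumed to be the output of docker ps -a --format '{{.Names}}\t{{.Status}}' as shown in the Logs section.

```python
import re

# Assumed shape of the sandbox Id, e.g. 16568949-cbc2-497d-8415-2dcebf03b98b.
UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"


def sandbox_containers(docker_ps_output: str, sandbox_name: str) -> list[str]:
    """Return container names matching openshell-<name>-<uuid>* in docker ps output."""
    pattern = re.compile(rf"^openshell-{re.escape(sandbox_name)}-{UUID}")
    names = []
    for line in docker_ps_output.splitlines():
        if not line.strip():
            continue
        # Each line is "<name>\t<status>"; only the name is needed here.
        name = line.split("\t", 1)[0].strip()
        if pattern.match(name):
            names.append(name)
    return names


def entry_is_stale(in_live_gateway: bool,
                   ticks_since_gateway_start: int,
                   container_names: list[str]) -> bool:
    """Treat a registry entry as stale only when every check agrees."""
    if in_live_gateway:
        return False
    if ticks_since_gateway_start < 1:
        # The gateway has not completed a reconciliation tick yet; too early to judge.
        return False
    # A leftover container (such as a GPU backup) means recovery should be preferred.
    return not container_names


SAMPLE_PS = (
    "openshell-my-assistant-16568949-cbc2-497d-8415-2dcebf03b98b"
    "-nemoclaw-gpu-backup-1779957499784\tExited (137) 10 minutes ago\n"
)
```

Applied to the state in this report, sandbox_containers(SAMPLE_PS, "my-assistant") finds the backup container, so entry_is_stale returns False and the entry would survive the first post-reboot status call.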

NVB#6235560

Metadata

Labels

  • NV QA (Bugs found by the NVIDIA QA Team)
  • area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery)
  • platform: dgx-spark (Affects DGX Spark hardware or workflows)
  • sprint 6 (Sprint 6)
  • v0.0.65 (Release target)
