[DGX Spark][Sandbox] post-reboot first 'nemoclaw <name> status' destroys sandbox via "Removed stale local registry entry"

## Description

After `sudo reboot` on DGX Spark with nemoclaw v0.0.53, the openshell-gateway host process does NOT auto-start. When the user runs `nemoclaw <name> status` to check the sandbox post-reboot, nemoclaw auto-starts the gateway but then concludes the sandbox is "not present in the live OpenShell gateway" and silently REMOVES the local registry entry. Net effect: every host reboot destroys the user's sandbox. The underlying Docker container backup is still on disk in `Exited (137)` state, but `nemoclaw list` reports "No sandboxes registered" and the user has no straightforward recovery path.

This appears to be a regression (an over-correction) in the cleanup logic introduced by the fixes for NVBug 6041649 and NVBug 6059659. Those fixes correctly removed ghost/destroyed sandboxes that the CLI was incorrectly resurrecting from onboard-session metadata. The same "Removed stale local registry entry" code path now fires too aggressively against VALID sandboxes whose containers exist on disk but are missing from the gateway's live state because the gateway has only just started after the reboot.

## Environment

```text
Device:        DGX Spark (10.173.104.102, "Spark 110")
OS:            Ubuntu 24.04.4 LTS
Kernel:        6.17.0-1014-nvidia
Architecture:  aarch64
Node.js:       v22.22.3 (via NVM; not in default PATH)
npm:           v10.x (via NVM; not in default PATH)
Docker:        29.2.1, build a5c7197
OpenShell CLI: 0.0.44 (docker driver)
NemoClaw:      v0.0.53
OpenClaw:      v2026.5.22
Sandbox:       my-assistant — provider=ollama-local, model=qwen3.6:35b
```

## Steps to Reproduce

1. Fresh Spark host (DGX Spark / aarch64) with nemoclaw v0.0.53 installed and `my-assistant` sandbox onboarded with ollama-local provider.
2. Verify baseline healthy:
   ```bash
   /home/nvidia/.local/bin/nemoclaw my-assistant status
   ```
   Expected output includes: `Phase: Ready`, `Inference (ollama backend): healthy`, `Inference (auth proxy): healthy`.
3. Record pre-reboot uptime: `cat /proc/uptime` (e.g. `119214.48` seconds).
4. From SSH:
   ```bash
   sudo reboot
   ```
5. Wait ~70 seconds for the box to come back. Verify the reboot actually happened: `cat /proc/uptime` (the first field should be small, e.g. about `21` seconds).
6. Run:
   ```bash
   /home/nvidia/.local/bin/nemoclaw my-assistant status
   ```
7. Observe the destructive line: `Sandbox 'my-assistant' is not present in the live OpenShell gateway. Removed stale local registry entry.`
8. Confirm sandbox is gone: `/home/nvidia/.local/bin/nemoclaw list` → `No sandboxes registered. Run nemoclaw onboard to get started.`
9. Confirm Docker leftover exists but is dead: `docker ps -a | grep openshell` → only the `*-nemoclaw-gpu-backup-*` container in `Exited (137)` state.

## Expected Result

Per DevTest 5949417 Scenario B Step 8 (Inference provider reconnection after restart), `nemoclaw status` after reboot should show "Status healthy" and the sandbox should remain registered. The openshell-gateway should auto-start on boot (e.g. via systemd unit or launchd entry). The sandbox container should be restarted/recreated via the gateway's reconciliation loop, NOT deleted from the user's registry. Inference should reconnect via the Ollama provider (which itself auto-recovers via its own systemd unit on Spark).

At minimum: if the gateway truly cannot find the sandbox in its live state, the CLI should NOT silently delete the user's registry entry on a passive `status` query — it should warn and require explicit user action (e.g. `--prune` or interactive confirmation), since the same registry entry remains valid as soon as the gateway finishes reconciliation.
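The minimum behavior described above can be sketched as a small decision helper. This is only an illustration of the proposed policy, not nemoclaw's actual code: `registry_action` and its `yes`/`no` arguments are hypothetical names introduced here.

```bash
#!/usr/bin/env bash
# Hypothetical decision helper; nothing here exists in nemoclaw today.
# Args: $1 = "yes" if the live gateway lists the sandbox, else "no"
#       $2 = "yes" if the user explicitly asked for pruning, else "no"
# Prints one of: keep, warn, remove
registry_action() {
  local in_gateway="$1" prune_requested="$2"

  if [ "$in_gateway" = "yes" ]; then
    echo keep      # gateway knows the sandbox: nothing to clean up
  elif [ "$prune_requested" = "yes" ]; then
    echo remove    # deletion only on explicit user request
  else
    echo warn      # passive query: report the mismatch, keep the entry
  fi
}
```

Under this policy a plain `status` call can at worst yield `warn`, so the registry entry survives until the gateway has finished reconciliation or the user opts in to deletion.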

## Actual Result

- Reboot succeeds cleanly: uptime resets from `119214s` to `21s`; the Ollama systemd unit auto-recovers and serves `/api/tags` normally.
- openshell-gateway is NOT registered with systemd on Spark — process stays dead after reboot until manually triggered.
- First post-reboot `nemoclaw my-assistant status` invocation:
  1. Auto-starts the gateway (`Starting OpenShell gateway` → `Docker-driver gateway is healthy`)
  2. Reports `Inference: not verified (gateway/sandbox state not verified)`
  3. Then destructively prints:
     ```text
     Sandbox 'my-assistant' is not present in the live OpenShell gateway.
     Removed stale local registry entry.
     ```
- After that single call:
  - `nemoclaw list` → `No sandboxes registered.`
  - `nemoclaw my-assistant status` → `Sandbox 'my-assistant' does not exist. Run 'nemoclaw onboard' to create one.`
- Docker side: the sandbox's GPU-backup container is still on disk (`Exited (137)`, i.e. SIGKILLed during the reboot), but the user has no documented path to recover the workspace from it.
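For readers unfamiliar with the `137` status: exit codes above 128 conventionally encode `128 + signal number`, so 137 corresponds to signal 9. A minimal sketch of that arithmetic follows; `signal_from_exit_code` is a helper name made up for this example.

```bash
#!/usr/bin/env bash
# Decode a container exit status into the name of the terminating signal.
# Statuses of 128 or below are treated as "not killed by a signal".
signal_from_exit_code() {
  local code="$1"
  if [ "$code" -gt 128 ]; then
    kill -l "$((code - 128))"   # shell builtin: signal number -> name
  else
    echo none
  fi
}
```

`signal_from_exit_code 137` prints `KILL`. On the affected host the status could be fed in from `docker inspect --format '{{.State.ExitCode}}' <container>`.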

## Logs

```text
# Pre-reboot baseline (healthy):
$ /home/nvidia/.local/bin/nemoclaw my-assistant status
  Sandbox: my-assistant
    Model:    qwen3.6:35b
    Provider: ollama-local
    Inference (ollama backend): healthy (http://127.0.0.1:11434/api/tags)
    Inference (auth proxy): healthy (http://127.0.0.1:11435/api/tags)
    Host GPU: yes
    Sandbox GPU: enabled (auto)
    OpenShell: 0.0.44 (docker)
    Policies: npm, pypi, huggingface, brew, brave, local-inference, openclaw-pricing
    Connected: no
    Agent:    OpenClaw v2026.5.22
  Id:    16568949-cbc2-497d-8415-2dcebf03b98b
  Name:  my-assistant
  Phase: Ready

# Pre-reboot uptime:
$ cat /proc/uptime
119214.48 2364036.82

# Post-reboot uptime confirms real reboot:
$ cat /proc/uptime
21.xx 60.xx

# Post-reboot ollama auto-recovered:
$ systemctl status ollama
  Active: active (running) since Thu 2026-05-28 08:47:48 UTC
$ curl -sS http://127.0.0.1:11434/api/tags
{"models":[{"name":"qwen3.6:35b", ... }]}

# Post-reboot openshell-gateway NOT auto-started:
$ pgrep -f openshell-gateway
(exit 1, nothing)

# First post-reboot status call — DESTRUCTIVE:
$ /home/nvidia/.local/bin/nemoclaw my-assistant status
✓ Active gateway set to 'nemoclaw'

  [2/8] Starting OpenShell gateway
  ──────────────────────────────────────────────────
  Starting OpenShell Docker-driver gateway...
  Gateway log: /home/nvidia/.local/state/nemoclaw/openshell-docker-gateway/openshell-gateway.log
  ✓ Docker-driver gateway is healthy
✓ Active gateway set to 'nemoclaw'

  Sandbox: my-assistant
    Model:    qwen3.6:35b
    Provider: ollama-local
    Inference: not verified (gateway/sandbox state not verified)
    Host GPU: yes
    Sandbox GPU: enabled (auto)
    OpenShell: 0.0.44 (docker)
    Policies: npm, pypi, huggingface, brew, brave, local-inference, openclaw-pricing
    ...

  Sandbox 'my-assistant' is not present in the live OpenShell gateway.
  Removed stale local registry entry.

# After:
$ /home/nvidia/.local/bin/nemoclaw list
  No sandboxes registered. Run `nemoclaw onboard` to get started.

$ /home/nvidia/.local/bin/nemoclaw my-assistant status
  Sandbox 'my-assistant' does not exist.
  Run 'nemoclaw onboard' to create one.

# Docker leftover (workspace data still on disk but inaccessible via nemoclaw):
$ docker ps -a --format '{{.Names}}\t{{.Status}}'
openshell-my-assistant-16568949-cbc2-497d-8415-2dcebf03b98b-nemoclaw-gpu-backup-1779957499784	Exited (137) 10 minutes ago

```

Related dev info:

- Test source: DevTest 5949417 (Inference provider reconnection after restart by provider type), Scenario B (Ollama local), step 8.
- Discovered: 2026-05-28 during v0.0.53 P1 batch on Spark 110.

## Related Bugs / not duplicate of

- NVBug 6041649 (Merc Lau, fixed 2026-04-09): `[NemoClaw][spark] [GitHub Issue #1316] nemoclaw list shows ghost sandboxes recovered from onboard session after gateway restart or machine reboot` — introduced the "Removed stale local registry entry" cleanup that this regression now mis-applies.
- NVBug 6059659 (Joyce Chen, fixed 2026-04-15): `[NemoClaw][All platforms][Github Issue #1641] nemoclaw list resurrects destroyed sandboxes from onboard-session metadata, but they are no longer accessible` — sibling fix in the same cleanup path.

Both prior bugs were about ghost/destroyed sandboxes being incorrectly resurrected as alive. This new bug is the inverse: a VALID sandbox is being incorrectly killed because its Docker container is temporarily missing from the gateway's in-memory state right after a fresh post-reboot gateway start. Same code path, opposite environmental condition.

## Proposed Fix

Two complementary changes:

1. Register openshell-gateway with systemd (or equivalent) so it auto-starts on Spark host boot — matches what `ollama.service` already does on the same host. Auto-restart should also restart the sandbox container so the user lands in a healthy state without manual intervention.
2. Tighten the "Removed stale local registry entry" cleanup so it does NOT fire on the very first post-gateway-start status query. Possible heuristics:
   - Wait one full reconciliation tick before declaring a registry entry stale.
   - Cross-check against `docker ps -a` for the sandbox container name pattern (`openshell-<name>-<uuid>*`) before deletion — if a backup container exists, prefer recovery over deletion.
   - Make the deletion explicit (require `--prune` flag or interactive confirmation) instead of a side-effect of a passive `status` query.
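The `docker ps -a` cross-check from the second heuristic could look roughly like the filter below. It is a sketch under assumptions: `sandbox_containers` is a hypothetical helper, the UUID is assumed to be the lowercase 8-4-4-4-12 hex form seen in the logs, and sandbox names are assumed to contain only letters, digits, and hyphens (so they are safe to embed in a regex).

```bash
#!/usr/bin/env bash
# Reads container names on stdin, one per line, and prints those matching
# the pattern openshell-<name>-<uuid>* for the given sandbox name.
sandbox_containers() {
  local name="$1"
  local uuid='[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
  # "|| true" keeps a no-match result from aborting callers that use set -e.
  grep -E "^openshell-${name}-${uuid}" || true
}
```

A caller might run `docker ps -a --format '{{.Names}}' | sandbox_containers my-assistant` and treat any non-empty output as a reason to keep the registry entry and attempt recovery rather than delete it. Anchoring on the UUID keeps a sandbox named `my` from matching containers that belong to `my-assistant`.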

---
[NVB#6235560](https://nvbugspro.nvidia.com/bug/6235560)

