Description
After sudo reboot on DGX Spark with nemoclaw v0.0.53, the openshell-gateway host process does NOT auto-start. When the user runs nemoclaw <name> status to check the sandbox post-reboot, nemoclaw auto-starts the gateway but then concludes the sandbox is "not present in the live OpenShell gateway" and silently REMOVES the local registry entry. Net effect: every host reboot destroys the user's sandbox. The underlying Docker container backup is still on disk in Exited (137) state, but nemoclaw list reports "No sandboxes registered" and the user has no straightforward recovery path.
This appears to be a regression / over-correction of the cleanup logic introduced by NVBug 6041649 and NVBug 6059659 — those fixes correctly removed ghost/destroyed sandboxes that the CLI was incorrectly resurrecting from onboard-session metadata. The same "Removed stale local registry entry" code path now fires too aggressively against VALID sandboxes whose containers exist on disk but happen to be missing from a freshly-started gateway's live state because the gateway just came up after reboot.
Environment
Device: DGX Spark (10.173.104.102, "Spark 110")
OS: Ubuntu 24.04.4 LTS
Kernel: 6.17.0-1014-nvidia
Architecture: aarch64
Node.js: v22.22.3 (via NVM; not in default PATH)
npm: v10.x (via NVM; not in default PATH)
Docker: 29.2.1, build a5c7197
OpenShell CLI: 0.0.44 (docker driver)
NemoClaw: v0.0.53
OpenClaw: v2026.5.22
Sandbox: my-assistant — provider=ollama-local, model=qwen3.6:35b
Steps to Reproduce
- Fresh Spark host (DGX Spark / aarch64) with nemoclaw v0.0.53 installed and the my-assistant sandbox onboarded with the ollama-local provider.
- Verify baseline healthy:
/home/nvidia/.local/bin/nemoclaw my-assistant status
Expected output includes: Phase: Ready, Inference (ollama backend): healthy, Inference (auth proxy): healthy.
- Record pre-reboot uptime:
cat /proc/uptime (e.g. 119214.48 seconds).
- From SSH, reboot the host:
sudo reboot
- Wait ~70 seconds for the box to come back. Verify reboot actually happened:
cat /proc/uptime (should be small, e.g. 21s).
- Run:
/home/nvidia/.local/bin/nemoclaw my-assistant status
- Observe the destructive line:
Sandbox 'my-assistant' is not present in the live OpenShell gateway. Removed stale local registry entry.
- Confirm sandbox is gone:
/home/nvidia/.local/bin/nemoclaw list → No sandboxes registered. Run nemoclaw onboard to get started.
- Confirm Docker leftover exists but is dead:
docker ps -a | grep openshell → only the *-nemoclaw-gpu-backup-* container in Exited (137) state.
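The uptime comparison in the steps above can be scripted so a reboot is confirmed mechanically rather than by eye. This is a minimal sketch: the helper names uptime_seconds and rebooted_since are hypothetical and not part of nemoclaw, and the only input is the standard /proc/uptime file, whose first field is seconds since boot.

```shell
# Print the uptime in whole seconds from an uptime file (default: /proc/uptime).
# The first field is seconds since boot, e.g. "119214.48 2364036.82".
uptime_seconds() {
    uptime_file="${1:-/proc/uptime}"
    awk '{ printf "%d\n", $1 }' "$uptime_file"
}

# Succeed (exit 0) only if the current uptime is lower than the value
# recorded before the reboot, which means the host really restarted.
# $1 = pre-reboot uptime in whole seconds, $2 = optional uptime file.
rebooted_since() {
    uptime_before="$1"
    uptime_now="$(uptime_seconds "${2:-/proc/uptime}")"
    [ "$uptime_now" -lt "$uptime_before" ]
}
```

Record the pre-reboot value on the SSH client (for example 119214), then after the host is back run rebooted_since 119214 && echo "reboot confirmed" before issuing the first status call.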
Expected Result
Per DevTest 5949417 Scenario B Step 8 (Inference provider reconnection after restart), nemoclaw status after reboot should show "Status healthy" and the sandbox should remain registered. The openshell-gateway should auto-start on boot (e.g. via systemd unit or launchd entry). The sandbox container should be restarted/recreated via the gateway's reconciliation loop, NOT deleted from the user's registry. Inference should reconnect via the Ollama provider (which itself auto-recovers via its own systemd unit on Spark).
At minimum: if the gateway truly cannot find the sandbox in its live state, the CLI should NOT silently delete the user's registry entry on a passive status query — it should warn and require explicit user action (e.g. --prune or interactive confirmation), since the same registry entry remains valid as soon as the gateway finishes reconciliation.
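The auto-start expectation above could be met with a systemd unit along the following lines. This is only a sketch under stated assumptions: NemoClaw v0.0.53 does not ship such a unit, the function name render_gateway_unit and the unit name openshell-gateway.service are hypothetical, and the command that starts the gateway is left as a caller-supplied placeholder because this report does not show how the CLI launches it.

```shell
# Print a candidate systemd unit for the gateway to stdout.
# $1 = full command line that starts the gateway (placeholder; not known from this report)
# $2 = account that owns the sandbox state (e.g. nvidia on the Spark host above)
render_gateway_unit() {
    gateway_cmd="$1"
    run_as="$2"
    cat <<EOF
[Unit]
Description=OpenShell gateway (hypothetical unit, not shipped by NemoClaw)
# The Docker driver needs the Docker daemon, so order after it.
After=docker.service
Wants=docker.service

[Service]
User=${run_as}
ExecStart=${gateway_cmd}
# Bring the gateway back if it exits abnormally.
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
}
```

One way to try it: pipe the output into sudo tee /etc/systemd/system/openshell-gateway.service, then run sudo systemctl daemon-reload followed by sudo systemctl enable --now openshell-gateway.service. A unit like this only covers gateway start-up; it does not by itself prevent the registry deletion described in this report.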
Actual Result
- Reboot succeeds cleanly: uptime resets from 119214s → 21s; the Ollama systemd unit auto-recovers and serves /api/tags normally.
- openshell-gateway is NOT registered with systemd on Spark — process stays dead after reboot until manually triggered.
- First post-reboot nemoclaw my-assistant status invocation:
  - Auto-starts the gateway (Starting OpenShell gateway → Docker-driver gateway is healthy)
  - Reports Inference: not verified (gateway/sandbox state not verified)
  - Then destructively prints:
    Sandbox 'my-assistant' is not present in the live OpenShell gateway.
    Removed stale local registry entry.
- After that single call:
nemoclaw list → No sandboxes registered.
nemoclaw my-assistant status → Sandbox 'my-assistant' does not exist. Run 'nemoclaw onboard' to create one.
- Docker side: the sandbox's GPU-backup container is still on disk (Exited (137), from the SIGKILL at reboot), but the user has no documented path to recover the workspace from it.
Logs
# Pre-reboot baseline (healthy):
$ /home/nvidia/.local/bin/nemoclaw my-assistant status
Sandbox: my-assistant
Model: qwen3.6:35b
Provider: ollama-local
Inference (ollama backend): healthy (http://127.0.0.1:11434/api/tags)
Inference (auth proxy): healthy (http://127.0.0.1:11435/api/tags)
Host GPU: yes
Sandbox GPU: enabled (auto)
OpenShell: 0.0.44 (docker)
Policies: npm, pypi, huggingface, brew, brave, local-inference, openclaw-pricing
Connected: no
Agent: OpenClaw v2026.5.22
Id: 16568949-cbc2-497d-8415-2dcebf03b98b
Name: my-assistant
Phase: Ready
# Pre-reboot uptime:
$ cat /proc/uptime
119214.48 2364036.82
# Post-reboot uptime confirms real reboot:
$ cat /proc/uptime
21.xx 60.xx
# Post-reboot ollama auto-recovered:
$ systemctl status ollama
Active: active (running) since Thu 2026-05-28 08:47:48 UTC
$ curl -sS http://127.0.0.1:11434/api/tags
{"models":[{"name":"qwen3.6:35b", ... }]}
# Post-reboot openshell-gateway NOT auto-started:
$ pgrep -f openshell-gateway
(exit 1, nothing)
# First post-reboot status call — DESTRUCTIVE:
$ /home/nvidia/.local/bin/nemoclaw my-assistant status
✓ Active gateway set to 'nemoclaw'
[2/8] Starting OpenShell gateway
──────────────────────────────────────────────────
Starting OpenShell Docker-driver gateway...
Gateway log: /home/nvidia/.local/state/nemoclaw/openshell-docker-gateway/openshell-gateway.log
✓ Docker-driver gateway is healthy
✓ Active gateway set to 'nemoclaw'
Sandbox: my-assistant
Model: qwen3.6:35b
Provider: ollama-local
Inference: not verified (gateway/sandbox state not verified)
Host GPU: yes
Sandbox GPU: enabled (auto)
OpenShell: 0.0.44 (docker)
Policies: npm, pypi, huggingface, brew, brave, local-inference, openclaw-pricing
...
Sandbox 'my-assistant' is not present in the live OpenShell gateway.
Removed stale local registry entry.
# After:
$ /home/nvidia/.local/bin/nemoclaw list
No sandboxes registered. Run `nemoclaw onboard` to get started.
$ /home/nvidia/.local/bin/nemoclaw my-assistant status
Sandbox 'my-assistant' does not exist.
Run 'nemoclaw onboard' to create one.
# Docker leftover (workspace data still on disk but inaccessible via nemoclaw):
$ docker ps -a --format '{{.Names}}\t{{.Status}}'
openshell-my-assistant-16568949-cbc2-497d-8415-2dcebf03b98b-nemoclaw-gpu-backup-1779957499784 Exited (137) 10 minutes ago
# Related dev info:
- Test source: DevTest 5949417 (Inference provider reconnection after restart by provider type), Scenario B (Ollama local), step 8.
- Discovered: 2026-05-28 during v0.0.53 P1 batch on Spark 110.
Related Bugs / not duplicate of
- NVBug 6041649 (Merc Lau, fixed 2026-04-09):
[NemoClaw][spark] [GitHub Issue #1316] nemoclaw list shows ghost sandboxes recovered from onboard session after gateway restart or machine reboot — introduced the "Removed stale local registry entry" cleanup that this regression now mis-applies.
- NVBug 6059659 (Joyce Chen, fixed 2026-04-15):
[NemoClaw][All platforms][Github Issue #1641] nemoclaw list resurrects destroyed sandboxes from onboard-session metadata, but they are no longer accessible — sibling fix in the same cleanup path.
Both prior bugs were about ghost/destroyed sandboxes being incorrectly resurrected as alive. This new bug is the inverse: a VALID sandbox is being incorrectly killed because its Docker container is temporarily missing from the gateway's in-memory state right after a fresh post-reboot gateway start. Same code path, opposite environmental condition.
Proposed Fix
Two complementary changes:
- Register openshell-gateway with systemd (or equivalent) so it auto-starts on Spark host boot, matching what ollama.service already does on the same host. Auto-restart should also restart the sandbox container so the user lands in a healthy state without manual intervention.
- Tighten the "Removed stale local registry entry" cleanup so it does NOT fire on the very first post-gateway-start status query. Possible heuristics:
  - Wait one full reconciliation tick before declaring a registry entry stale.
  - Cross-check against docker ps -a for the sandbox container name pattern (openshell-<name>-<uuid>*) before deletion; if a backup container exists, prefer recovery over deletion.
  - Make the deletion explicit (require a --prune flag or interactive confirmation) instead of a side effect of a passive status query.
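The three heuristics can be combined into one decision, sketched below in shell. This is not NemoClaw's actual cleanup code, which this report does not show. The function names has_sandbox_container and registry_action, the action names they print, and the idea of expressing "one reconciliation tick" as a grace period in seconds are all assumptions made for illustration.

```shell
# Read container names on stdin (one per line) and succeed if any of them
# starts with "openshell-<sandbox-name>-", the pattern seen in the Logs section.
has_sandbox_container() {
    name_prefix="openshell-$1-"
    while IFS= read -r container_name; do
        case "$container_name" in
            "$name_prefix"*) return 0 ;;
        esac
    done
    return 1
}

# Decide what a passive status query should do with a registry entry.
# $1 = yes/no  gateway currently lists the sandbox
# $2 = seconds since the gateway started
# $3 = grace period in seconds (stand-in for one reconciliation tick)
# $4 = yes/no  a matching Docker container exists on disk
# $5 = yes/no  user explicitly asked for pruning
registry_action() {
    if [ "$1" = yes ]; then
        echo keep       # sandbox is live, nothing to do
    elif [ "$2" -lt "$3" ]; then
        echo wait       # gateway just started, state not yet trustworthy
    elif [ "$4" = yes ]; then
        echo recover    # container still on disk, prefer recovery
    elif [ "$5" = yes ]; then
        echo prune      # deletion only on explicit request
    else
        echo warn       # report the mismatch but leave the registry alone
    fi
}
```

For the container check, the names can come from docker ps -a --format '{{.Names}}' piped into has_sandbox_container my-assistant. In the scenario from this report (gateway a few seconds old, backup container present, no prune request), registry_action no 5 30 yes no prints wait, and once the grace period has passed it prints recover, so the entry is never deleted as a side effect of status.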
NVB#6235560