Platform
macOS laptop using Colima as the container runtime
Symptom
When nemoclaw onboard is rerun after a container-runtime restart that should remove the OpenShell gateway container, onboard preflight does not enter resume-cleanup mode. Instead of detecting stale gateway metadata and then destroying and recreating the gateway, step [1/8] Preflight checks reports a "healthy NemoClaw runtime (OpenShell gateway)" and reuses it. As a result, neither the expected [resume] lines nor the "Gateway metadata is stale (container not running). Cleaning up..." sequence ever appears. This contradicts the test spec for resume-mode stale-gateway handling.
Component area
Onboarding / Resume Mode / Gateway Lifecycle & Metadata Cleanup
Steps to reproduce
Preconditions
- NemoClaw CLI installed; at least one previous successful nemoclaw onboard run exists.
- ~/.nemoclaw/sandboxes.json contains at least one sandbox entry (e.g. my-assistant, ollama-resume, or prachi-gemini).
- ~/.nemoclaw/onboard-session.json exists with "status": "in_progress", produced by interrupting a previous nemoclaw onboard mid-flow with Ctrl+C, as per the test spec. Example verification:
cat ~/.nemoclaw/sandboxes.json
cat ~/.nemoclaw/onboard-session.json
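The two cat commands above leave the check to the reader's eye. A minimal sketch of the same check as a script follows. The function name check_preconditions is hypothetical, and the sketch assumes the session file stores its state as a literal "status": "in_progress" pair, as quoted in the preconditions. It only confirms that sandboxes.json is non-empty; it does not parse the file or count sandbox entries.

```shell
# check_preconditions DIR
# Prints one line per check and returns 0 only if both checks pass.
# Assumption: DIR is the NemoClaw state directory (e.g. "$HOME/.nemoclaw").
check_preconditions() {
  cp_dir="$1"
  cp_rc=0

  # sandboxes.json must exist and be non-empty (-s); its contents are not parsed.
  if [ -s "$cp_dir/sandboxes.json" ]; then
    echo "ok: sandboxes.json present and non-empty"
  else
    echo "missing: sandboxes.json absent or empty"
    cp_rc=1
  fi

  # The session file must record an interrupted run.
  if grep -Eq '"status"[[:space:]]*:[[:space:]]*"in_progress"' \
      "$cp_dir/onboard-session.json" 2>/dev/null; then
    echo "ok: onboard session is in_progress"
  else
    echo "missing: onboard session is not in_progress"
    cp_rc=1
  fi

  return "$cp_rc"
}
```

Usage: check_preconditions "$HOME/.nemoclaw"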
Repro
- Restart the container runtime so the gateway container disappears.
macOS with Colima:
(Analogues for other platforms in the spec: sudo systemctl restart docker on Linux, quit/relaunch Docker Desktop on macOS/WSL2.)
- Verify the NemoClaw/OpenShell gateway container is missing:
docker ps -a | grep openshell-cluster
echo "grep exit code: $?"
The test intends this to show no gateway container (empty output, non-zero grep exit code).
- Run nemoclaw onboard again.
- Observe step [1/8] Preflight checks line by line.
- Allow onboard to continue to step [2/8] and beyond.
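The docker ps -a | grep check in the steps above matches any column and cannot tell a stopped container from a running one. That difference matters here, because a gateway container that came back up after the runtime restart would explain a "healthy" report. The sketch below is one way to separate the three cases. The function name gateway_container_state and the "|" separator are choices made for this example; the sketch assumes Docker's usual Status text, where running containers start with "Up".

```shell
# gateway_container_state
# Reads "name|status" lines on stdin and prints one of:
#   running  - a container named *openshell-cluster* has a Status starting with "Up"
#   stopped  - such a container exists but is not up (Exited, Created, ...)
#   absent   - no such container is listed
gateway_container_state() {
  awk -F '|' '
    $1 ~ /openshell-cluster/ {
      found = 1
      if ($2 ~ /^Up/) running = 1
    }
    END {
      if (running)    print "running"
      else if (found) print "stopped"
      else            print "absent"
    }'
}
```

Usage: docker ps -a --format '{{.Names}}|{{.Status}}' | gateway_container_state

The repro preconditions hold only for "absent" (or, arguably, "stopped"). A "running" result means the restart did not remove the gateway.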
Expected vs Actual result
Expected (per internal test spec)
Under these preconditions (in-progress onboard session, sandbox metadata present, gateway container actually gone), nemoclaw onboard should:
- Start in resume mode (banner indicates resume).
- In step [1/8] Preflight checks, log the following lines in this order:
[resume] Skipping preflight (cached)
Gateway metadata is stale (container not running). Cleaning up...
→ Found forward on sandbox '<name>'
✓ Stopped forward of port 18789 for sandbox <name>
• Destroying gateway nemoclaw...
✓ Gateway nemoclaw destroyed.
✓ Stale gateway metadata cleaned up
[resume] Recorded gateway state is unavailable; recreating it.
- Step [2/8] should then start a fresh gateway cluster automatically, and onboard should reach step [3/8] without additional user intervention beyond the usual prompts.
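The expected sequence above is order-sensitive, so a plain grep for each message is not enough. Below is a minimal sketch of an ordered check against a captured log. The function name check_resume_sequence is hypothetical, and it assumes the onboard output was saved to a file first (for example with nemoclaw onboard 2>&1 | tee onboard.log). It matches fixed fragments of the expected messages and leaves out the sandbox name and the leading status glyphs.

```shell
# check_resume_sequence LOGFILE
# Succeeds only if every expected fragment occurs in LOGFILE, each one on a
# later line than the previous fragment.
check_resume_sequence() {
  crs_log="$1"
  crs_last=0
  while IFS= read -r crs_expected; do
    # Number of the first line after crs_last that contains the fragment.
    crs_n=$(awk -v start="$crs_last" -v pat="$crs_expected" \
      'NR > start && index($0, pat) { print NR; exit }' "$crs_log")
    if [ -z "$crs_n" ]; then
      echo "missing or out of order: $crs_expected"
      return 1
    fi
    crs_last=$crs_n
  done <<'EOF'
[resume] Skipping preflight (cached)
Gateway metadata is stale (container not running). Cleaning up...
Found forward on sandbox
Stopped forward of port 18789 for sandbox
Destroying gateway nemoclaw...
Gateway nemoclaw destroyed.
Stale gateway metadata cleaned up
[resume] Recorded gateway state is unavailable; recreating it.
EOF
  echo "resume cleanup sequence found in order"
}
```

Usage: check_resume_sequence onboard.log

On failure it prints the first fragment that is missing or out of order, which shows where the resume path diverged.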
Actual
On the affected environment, step [1/8] Preflight checks instead shows:
[1/8] Preflight checks
✓ Docker is running
⚠ Container DNS probe inconclusive (reason: no_output).
docker run produced no output (timed out or failed to start)
Proceeding. If the sandbox build later hangs at `npm ci`, see issue #2101.
✓ Container runtime: colima
⚠ Container runtime under-provisioned: 2 vCPU / 1.9 GiB detected
(recommended: 4 vCPU / 8 GiB).
The sandbox build will be slow and may stall on default Colima settings.
Suggested: colima stop && colima start --cpu 4 --memory 8
Set NEMOCLAW_IGNORE_RUNTIME_RESOURCES=1 to silence this check.
Continue with onboarding? [Y/n]: Y
✓ openshell CLI: openshell 0.0.44
✓ Port 8080 already owned by healthy NemoClaw runtime (OpenShell gateway)
✓ Apple GPU detected: Apple M3 Pro (14 cores), 36864 MB unified memory
ⓘ Local NIM unavailable — requires NVIDIA GPU
ⓘ Sandbox GPU: disabled (no NVIDIA GPU detected)
Notably:
- There is no (resume mode) banner.
- There is no "[resume] Skipping preflight (cached)" line.
- There is no "Gateway metadata is stale (container not running). Cleaning up..." sequence, nor any "Destroying gateway nemoclaw..." lines.
- Instead, preflight explicitly reports:
✓ Port 8080 already owned by healthy NemoClaw runtime (OpenShell gateway)
and proceeds as if the gateway is healthy and reusable.
This contradicts the test's expectation that, after a container runtime restart which removes the gateway container, onboard should treat gateway metadata as stale and recreate the gateway, not reuse it.
Failing condition
"resume mode stale gateway cleanup fires when gateway container is gone but metadata persists": Fail.
Under the specified preconditions (status="in_progress" onboard session, sandboxes in sandboxes.json, gateway container expected to have been removed by the runtime restart), nemoclaw onboard preflight does not perform the expected stale-gateway cleanup and recreation sequence, and instead reports a healthy gateway.
This suggests either:
- The runtime restart does not actually remove the gateway container in some configurations, and preconditions cannot be met as documented; or
- The onboard resume logic is mis-detecting gateway health and skipping the stale-metadata cleanup path even when the container is gone.
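Two observations taken right after the runtime restart can separate these two explanations: whether the gateway container is listed, and whether anything is listening on port 8080. The sketch below only encodes that decision. The function name classify_failure and its state keywords are hypothetical, and the wording of each verdict is an interpretation, not NemoClaw output.

```shell
# classify_failure CONTAINER_STATE PORT_STATE
#   CONTAINER_STATE: running | stopped | absent  (from docker ps -a)
#   PORT_STATE:      listening | free            (host port 8080)
classify_failure() {
  case "$1:$2" in
    running:*)
      echo "precondition not met: gateway container survived the restart" ;;
    absent:listening|stopped:listening)
      echo "port 8080 is held while the gateway container is not running: inspect the listener" ;;
    absent:free|stopped:free)
      echo "container gone and port free: reported health looks like a detection bug" ;;
    *)
      echo "unknown inputs: $1 $2"
      return 2 ;;
  esac
}

# Example of gathering the port state on macOS (lsof exits 0 if a listener exists):
#   port_state=free
#   lsof -nP -iTCP:8080 -sTCP:LISTEN >/dev/null 2>&1 && port_state=listening
#   classify_failure absent "$port_state"
```

The first verdict corresponds to the first explanation above and the last verdict to the second. The middle case, where the port is held without a running gateway container, is not covered by either and would need a look at the process that owns the port.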
Environment versions
- NemoClaw CLI: v0.0.50 (example; take the exact version from nemoclaw version).
- OpenShell CLI: openshell 0.0.44.
- Container runtime: colima (2 vCPU / 1.9 GiB; under-provisioned warning present).
- Host OS: macOS (Apple Silicon, M3-class).
Attachments to collect
- ~/.nemoclaw/onboard-session.json (with secrets redacted).
- ~/.nemoclaw/sandboxes.json.
- Output of: