[NemoClaw][MacOS][Onboard] Gateway not treated as stale after container runtime restart; preflight reuses healthy gateway instead of cleaning up #4268

@PrachiShevate-nv


Platform

macOS laptop using Colima as the container runtime

Symptom

When nemoclaw onboard is rerun after a container-runtime restart that should remove the OpenShell gateway container, onboard preflight does not enter resume-cleanup mode. Instead of detecting stale gateway metadata and destroying and recreating the gateway, step [1/8] Preflight checks reports a "healthy NemoClaw runtime (OpenShell gateway)" and reuses it. The expected [resume] lines and the "Gateway metadata is stale (container not running). Cleaning up..." sequence never appear. This contradicts the test spec for resume-mode stale-gateway handling.

Component area

Onboarding / Resume Mode / Gateway Lifecycle & Metadata Cleanup

Steps to reproduce

Preconditions

  • NemoClaw CLI installed; at least one previous successful nemoclaw onboard run exists.

  • ~/.nemoclaw/sandboxes.json contains at least one sandbox entry (e.g. my-assistant, ollama-resume, or prachi-gemini).

  • ~/.nemoclaw/onboard-session.json exists with "status": "in_progress", produced by interrupting a previous nemoclaw onboard mid-flow with Ctrl+C, as per test spec. Example verification:

    cat ~/.nemoclaw/sandboxes.json
    cat ~/.nemoclaw/onboard-session.json
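The session precondition can also be checked mechanically instead of by eye. This is a minimal sketch that assumes only what is quoted above, namely that the session file contains the literal pair "status": "in_progress". The helper name session_in_progress is hypothetical, and nothing else about the file's schema is assumed.

```shell
# Hypothetical helper: succeeds (exit 0) if the given session file exists
# and records an in-progress onboard run. Only the "status" key/value
# quoted in this issue is assumed; the rest of the schema is not inspected.
session_in_progress() {
  [ -f "$1" ] && grep -Eq '"status"[[:space:]]*:[[:space:]]*"in_progress"' "$1"
}

if session_in_progress "$HOME/.nemoclaw/onboard-session.json"; then
  echo "precondition met: onboard session is in_progress"
else
  echo "precondition NOT met: no in_progress onboard session found"
fi
```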

Repro

  1. Restart the container runtime so the gateway container disappears.

    macOS with Colima:

    colima stop
    colima start

    (Analogues for other platforms in the spec: sudo systemctl restart docker on Linux, quit/relaunch Docker Desktop on macOS/WSL2.)

  2. Verify the NemoClaw/OpenShell gateway container is missing:

    docker ps -a | grep openshell-cluster
    echo "grep exit code: $?"

    The test expects no gateway container at this point: empty output and a non-zero grep exit code. The echo reports grep's status because grep is the last command in the pipeline.

  3. Run onboard again:

    nemoclaw onboard

  4. Observe step [1/8] Preflight checks line by line.

  5. Allow onboard to continue to step [2/8] and beyond.
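The grep in step 2 cannot tell a removed container from one that still exists but is stopped, because docker ps -a lists both. A slightly more explicit check is sketched below. The helper name gateway_state and the "name|status" input format are assumptions made for this illustration. The sketch relies only on standard Docker CLI behaviour: the name filter, the --format template, and status strings that begin with "Up" for running containers.

```shell
# Hypothetical helper: reads "name|status" lines on stdin and prints
# "running", "stopped", or "absent".
gateway_state() {
  state="absent"
  while IFS='|' read -r name status || [ -n "$name" ]; do
    [ -n "$name" ] || continue
    case "$status" in
      Up*) state="running"; break ;;   # docker reports running containers as "Up ..."
      *)   state="stopped" ;;          # e.g. "Exited (...)" or "Created"
    esac
  done
  echo "$state"
}

# Intended use (requires a reachable Docker daemon):
#   docker ps -a --filter name=openshell-cluster \
#     --format '{{.Names}}|{{.Status}}' | gateway_state
```

If this prints "running" or "stopped" right after colima stop and colima start, the runtime restart did not remove the gateway container on this setup.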

Expected vs Actual result

Expected (per internal test spec)

Under these preconditions (in-progress onboard session, sandbox metadata present, gateway container actually gone), nemoclaw onboard should:

  • Start in resume mode (banner indicates resume).
  • In step [1/8] Preflight checks, log the following lines in this order:
    • [resume] Skipping preflight (cached)
    • Gateway metadata is stale (container not running). Cleaning up...
    • → Found forward on sandbox '<name>'
    • ✓ Stopped forward of port 18789 for sandbox <name>
    • • Destroying gateway nemoclaw...
    • ✓ Gateway nemoclaw destroyed.
    • ✓ Stale gateway metadata cleaned up
    • [resume] Recorded gateway state is unavailable; recreating it.
  • Step [2/8] should then start a fresh gateway cluster automatically, and onboard should reach step [3/8] without additional user intervention beyond the usual prompts.
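The ordering requirement above can be checked against a captured log rather than by reading the terminal. The sketch below assumes the onboard output has been saved to a file, for example with nemoclaw onboard 2>&1 | tee onboard.log. The function name check_resume_sequence and the log file name are hypothetical. The expected strings are the ones listed above, with the sandbox name left out so that any name matches.

```shell
# Hypothetical helper: succeeds if every expected line fragment appears in
# the log file, in the listed order. Fragments are matched literally.
check_resume_sequence() {
  log="$1"
  last=0
  while IFS= read -r expected; do
    # Line number of the first match located after the previous match.
    n=$(awk -v start="$last" -v pat="$expected" \
          'NR > start && index($0, pat) { print NR; exit }' "$log")
    if [ -z "$n" ]; then
      echo "missing or out of order: $expected"
      return 1
    fi
    last=$n
  done <<'EOF'
[resume] Skipping preflight (cached)
Gateway metadata is stale (container not running). Cleaning up...
Found forward on sandbox
Stopped forward of port 18789 for sandbox
Destroying gateway nemoclaw...
Gateway nemoclaw destroyed.
Stale gateway metadata cleaned up
[resume] Recorded gateway state is unavailable; recreating it.
EOF
  echo "resume cleanup sequence found in order"
}

# Usage: check_resume_sequence onboard.log
```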

Actual

On the affected environment, step [1/8] Preflight checks instead shows:

[1/8] Preflight checks
✓ Docker is running
⚠ Container DNS probe inconclusive (reason: no_output).
   docker run produced no output (timed out or failed to start)
   Proceeding. If the sandbox build later hangs at `npm ci`, see issue #2101.
✓ Container runtime: colima
⚠ Container runtime under-provisioned: 2 vCPU / 1.9 GiB detected
   (recommended: 4 vCPU / 8 GiB).
   The sandbox build will be slow and may stall on default Colima settings.
   Suggested: colima stop && colima start --cpu 4 --memory 8
   Set NEMOCLAW_IGNORE_RUNTIME_RESOURCES=1 to silence this check.
Continue with onboarding? [Y/n]: Y
✓ openshell CLI: openshell 0.0.44
✓ Port 8080 already owned by healthy NemoClaw runtime (OpenShell gateway)
✓ Apple GPU detected: Apple M3 Pro (14 cores), 36864 MB unified memory
ⓘ Local NIM unavailable — requires NVIDIA GPU
ⓘ Sandbox GPU: disabled (no NVIDIA GPU detected)

Notably:

  • There is no (resume mode) banner.

  • There is no [resume] Skipping preflight (cached) line.

  • There is no "Gateway metadata is stale (container not running). Cleaning up..." sequence, and there are no "Destroying gateway nemoclaw..." lines.

  • Instead, preflight explicitly reports:

    ✓ Port 8080 already owned by healthy NemoClaw runtime (OpenShell gateway)

    and proceeds as if the gateway is healthy and reusable.

This contradicts the test's expectation that, after a container runtime restart which removes the gateway container, onboard should treat gateway metadata as stale and recreate the gateway, not reuse it.

Failing condition

"resume mode stale gateway cleanup fires when gateway container is gone but metadata persists": Fail.

Under the specified preconditions (onboard session with status="in_progress", sandboxes present in sandboxes.json, gateway container expected to have been removed by the runtime restart), nemoclaw onboard preflight does not perform the expected stale-gateway cleanup and recreation sequence. It reports a healthy gateway instead.

This suggests either:

  • The runtime restart does not actually remove the gateway container in some configurations, and preconditions cannot be met as documented; or
  • The onboard resume logic is mis-detecting gateway health and skipping the stale-metadata cleanup path even when the container is gone.
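One way to separate the two explanations is to record the gateway container's state immediately before rerunning nemoclaw onboard and map it to the hypothesis it supports. The sketch below is an interpretation, not confirmed behaviour. It assumes that "stale" means "container not running", as the expected log message suggests, so a stopped container is treated the same as a missing one. The function name and the three state labels are hypothetical.

```shell
# Hypothetical triage helper.
# $1: observed gateway container state just before `nemoclaw onboard`:
#     "running", "stopped", or "absent".
triage_hypothesis() {
  case "$1" in
    running)
      echo "preconditions not met: gateway container survived the runtime restart" ;;
    stopped|absent)
      echo "resume logic suspect: gateway not running, yet preflight reported it healthy" ;;
    *)
      echo "unknown state: $1" >&2
      return 2 ;;
  esac
}

# Usage: triage_hypothesis absent
```

If the container is still running after the restart, the documented repro steps need a different way to remove it on Colima before the resume path can be exercised.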

Environment versions

  • NemoClaw CLI: v0.0.50 (example; confirm the exact version with nemoclaw version).
  • OpenShell CLI: openshell 0.0.44.
  • Container runtime: colima (2 vCPU / 1.9 GiB; under-provisioned warning present).
  • Host OS: macOS (Apple Silicon, M3-class).

Attachments to collect

  • ~/.nemoclaw/onboard-session.json (with secrets redacted).

  • ~/.nemoclaw/sandboxes.json.

  • Output of:

    docker ps -a
    colima status
    nemoclaw onboard --verbose

Metadata

Labels

  • NV QA (Bugs found by the NVIDIA QA Team)
  • area: cli (Command line interface, flags, terminal UX, or output)
  • platform: macos (Affects macOS, including Apple Silicon)