[Nemoclaw][All Platforms] nemoclaw connect times out with 'Status: unknown' after gateway docker kill; no auto-recovery or clear recovery guidance #3821

@zNeill

Description

After forcibly killing the OpenShell gateway container (openshell-cluster…) with docker kill, nemoclaw connect for an existing sandbox waits the full 120s connect timeout and then fails with Status: unknown and a generic timeout message, instead of either auto‑recovering the gateway or immediately reporting a clear “gateway is down, here’s how to recover” error. There is no explicit guidance to re‑run nemoclaw onboard or restart the gateway container, despite the test expecting either automatic recovery or actionable recovery instructions in this scenario.

Environment

  • Platform: Linux (e.g. Ubuntu 22.04 / 24.04 / 26.04)
  • GPU: Any supported GPU
  • Docker: Installed and running (supported NemoClaw/OpenShell runtime)
  • NemoClaw CLI: v0.0.45
  • OpenShell gateway:
    • Running as a Docker container, name matching openshell-cluster / openshell-cluster-nemoclaw as in the OpenShell/NemoClaw guides.
  • Sandboxes:
    • At least one sandbox onboarded and healthy, e.g. prachi-new-sb, with a working inference provider (e.g. nvidia-prod or ollama-local).
  • NemoClaw / OpenShell versions:
    • Exact NemoClaw CLI and OpenShell versions can be read from the output of nemoclaw version and openshell --version.

Steps to Reproduce

Preconditions

  1. Ensure NemoClaw CLI is installed and Docker is running.
  2. Ensure OpenShell gateway is running as a container:
     
    docker ps | grep openshell-cluster

    You should see a container whose name includes openshell-cluster or openshell-cluster-nemoclaw.


  3. Ensure at least one sandbox is onboarded and healthy, e.g. prachi-new-sb:

     
    nemoclaw status
    nemoclaw list

    nemoclaw status should show healthy, and nemoclaw list should show prachi-new-sb with a valid model/provider.
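
The preconditions above can be checked in one pass. This is a minimal sketch that uses only the commands already shown in this report; the function names gateway_running and preflight and the variable GATEWAY_FILTER are hypothetical. It does not parse nemoclaw output, since the exact format is not shown here, and simply prints it for inspection.

```shell
#!/usr/bin/env bash
# GATEWAY_FILTER is an illustrative variable, not a NemoClaw setting.
GATEWAY_FILTER="${GATEWAY_FILTER:-openshell-cluster}"

# Succeeds if at least one running container name matches the filter.
gateway_running() {
  # docker ps -q prints one container ID per running match; empty means none.
  [ -n "$(docker ps -q --filter "name=${GATEWAY_FILTER}" 2>/dev/null)" ]
}

# Reports the gateway state, then prints status and sandbox list for review.
preflight() {
  if gateway_running; then
    echo "gateway: running"
  else
    echo "gateway: NOT running"
    return 1
  fi
  nemoclaw status
  nemoclaw list
}
```

After sourcing the snippet, run preflight and confirm that the status output reports healthy and that the sandbox appears in the list before starting the repro.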

Repro steps

  1. Confirm overall NemoClaw health:
     
    nemoclaw status
  2. Force‑kill the OpenShell gateway container:
     
    docker kill $(docker ps -q --filter name=openshell-cluster)

    (Adjust the filter to match your actual gateway container name, e.g. openshell-cluster-nemoclaw.)


  3. Wait ~30 seconds to give any background retry logic a chance to run.

  4. Attempt to connect to an existing sandbox (example: prachi-new-sb):

     
    nemoclaw prachi-new-sb connect


  5. Observe the status output and final result.

  6. After the command exits, capture the exit code:

     
    echo $?


  7. Optionally, check gateway state:

     
    docker ps | grep openshell-cluster
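
Repro steps 2 to 6 can be scripted so that the exit code and the time spent in connect are recorded together. This is a sketch under stated assumptions: run_repro, SANDBOX, GATEWAY_FILTER and SETTLE_SECONDS are hypothetical names chosen for illustration, and only the docker and nemoclaw invocations already shown above are used.

```shell
#!/usr/bin/env bash
# Illustrative variables, not NemoClaw settings.
SANDBOX="${SANDBOX:-prachi-new-sb}"
GATEWAY_FILTER="${GATEWAY_FILTER:-openshell-cluster}"
SETTLE_SECONDS="${SETTLE_SECONDS:-30}"

run_repro() {
  # Step 2: force-kill every running container whose name matches the filter.
  ids="$(docker ps -q --filter "name=${GATEWAY_FILTER}")"
  if [ -z "$ids" ]; then
    echo "no running gateway container matches '${GATEWAY_FILTER}'" >&2
    return 2
  fi
  docker kill $ids >/dev/null   # unquoted on purpose: one argument per ID

  # Step 3: give any background retry logic a chance to run.
  sleep "$SETTLE_SECONDS"

  # Steps 4-6: connect, then report exit code and wall-clock duration.
  start="$(date +%s)"
  rc=0
  nemoclaw "$SANDBOX" connect || rc=$?
  end="$(date +%s)"
  echo "connect exit code: ${rc}, elapsed: $((end - start))s"
  return "$rc"
}
```

With the behavior described in this report, the final line would be expected to show a non-zero exit code and an elapsed time close to the 120s connect timeout.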

Expected Result

  1. nemoclaw status initially reports healthy (before gateway kill).

2–4. After killing the gateway and waiting:

  • When you run nemoclaw prachi-new-sb connect under these conditions, one of two behaviors should occur:

    a) Auto‑recovery supported:
    • NemoClaw (or OpenShell) detects the missing gateway and restarts it automatically (e.g. via Docker) within the connect window.
    • The gateway becomes healthy again.
    • nemoclaw prachi-new-sb connect succeeds: sandbox becomes ready and you get a shell inside the sandbox.
    • Inside the sandbox, running:
       
      openclaw agent --agent main -m "hello" --session-id recovery

      produces a successful inference response.

    b) Auto‑recovery NOT supported:

    • nemoclaw prachi-new-sb connect should quickly fail with a clear error message indicating that the OpenShell gateway is down or unreachable.
    • The error output should include explicit recovery steps, for example:
       
      OpenShell gateway is not running or unreachable.
      Run: nemoclaw onboard to recreate the gateway, then retry your connect command.
    • The command should exit non‑zero.
    • A follow‑up nemoclaw status should reflect the degraded state or point to the missing gateway with guidance.
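
Until the CLI handles this case itself, behavior (b) can be approximated with a thin wrapper that checks for the gateway container before connecting. This is a workaround sketch, not a description of how NemoClaw implements or should implement the check: connect_fail_fast, GATEWAY_FILTER and the exit code 3 are hypothetical, and the message text is copied from the example above.

```shell
#!/usr/bin/env bash
# GATEWAY_FILTER is an illustrative variable, not a NemoClaw setting.
GATEWAY_FILTER="${GATEWAY_FILTER:-openshell-cluster}"

# Usage: connect_fail_fast <sandbox-name>
connect_fail_fast() {
  sandbox="$1"
  # No matching running container: report immediately instead of waiting
  # for the connect timeout.
  if [ -z "$(docker ps -q --filter "name=${GATEWAY_FILTER}" 2>/dev/null)" ]; then
    {
      echo "OpenShell gateway is not running or unreachable."
      echo "Run: nemoclaw onboard to recreate the gateway, then retry your connect command."
    } >&2
    return 3   # arbitrary non-zero code chosen for this sketch
  fi
  nemoclaw "$sandbox" connect
}
```

For example, connect_fail_fast prachi-new-sb returns within about a second when the container is gone. A container that is running but unhealthy would still pass this check, so the wrapper only covers the docker kill scenario in this report.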

Actual Result

After killing the gateway and attempting to connect to an existing sandbox, the user sees:

 
    nemoclaw prachi-new-sb connect
    Waiting for sandbox 'prachi-new-sb' to be ready...
    Status: 2026-05-19 (103s elapsed)
    Status: unknown (120s elapsed)
    Timed out after 120s waiting for sandbox 'prachi-new-sb'.
    Check: openshell sandbox list
    Override timeout: NEMOCLAW_CONNECT_TIMEOUT=300 nemoclaw prachi-new-sb connect

Key observations:

  • nemoclaw connect waits the full default 120 seconds before failing.
  • The only status it shows during the failure is Status: unknown and a generic timeout message.
  • There is no mention that the OpenShell gateway was killed or is not running.
  • There is no explicit recovery guidance such as “Run nemoclaw onboard to recreate the gateway” or “Restart the openshell-cluster container.”
  • The only suggestion is to run openshell sandbox list or increase the timeout via NEMOCLAW_CONNECT_TIMEOUT, which does not directly solve the gateway‑down scenario.

This behavior does not match the test’s expected result, which allows only two outcomes:

  • Either the gateway auto‑recovers within a reasonable time, and nemoclaw connect succeeds, or
  • nemoclaw connect fails fast with a clear, actionable error message indicating that the gateway is down and how to recover (e.g. re‑run nemoclaw onboard), instead of leaving the user with Status: unknown after a long timeout.
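
The two acceptable outcomes can be expressed as a small pass/fail check over the captured exit code, elapsed time and output. This is a sketch only: classify_connect and FAIL_FAST_SECONDS are hypothetical, the 15 second threshold is an assumption (the report does not quantify "fails fast"), and matching on the word "gateway" is a rough stand-in for "clear, actionable error".

```shell
#!/usr/bin/env bash
# Illustrative threshold for "fails fast"; not taken from the test spec.
FAIL_FAST_SECONDS="${FAIL_FAST_SECONDS:-15}"

# Usage: classify_connect <exit-code> <elapsed-seconds> <captured-output>
classify_connect() {
  rc="$1"; elapsed="$2"; output="$3"
  # Outcome 1: connect succeeded (auto-recovery or healthy gateway).
  if [ "$rc" -eq 0 ]; then
    echo "pass: connected"
    return 0
  fi
  # Outcome 2: non-zero exit that names the gateway, within the threshold.
  case "$output" in
    *[Gg]ateway*)
      if [ "$elapsed" -le "$FAIL_FAST_SECONDS" ]; then
        echo "pass: failed fast with gateway guidance"
        return 0
      fi
      echo "fail: gateway mentioned, but only after ${elapsed}s"
      return 1
      ;;
  esac
  echo "fail: no gateway guidance in output"
  return 1
}
```

Fed the output shown under Actual Result (exit code non-zero, 120s elapsed, no mention of the gateway), the check reports "fail: no gateway guidance in output".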

In other words, after a deliberate gateway kill (docker kill …openshell-cluster…), nemoclaw connect currently times out with a vague “Status: unknown” and generic timeout suggestion, without auto‑recovery or explicit gateway‑recovery instructions, which makes post‑failure recovery harder than necessary.

Bug Details

Field | Value
Priority | Unprioritized
Action | Dev - Open - To fix
Disposition | Open issue
Module | Machine Learning - NemoClaw
Keyword | NemoClaw, NEMOCLAW_GH_SYNC_APPROVAL

[NVB#6192671]

Labels: NV QA (Bugs found by the NVIDIA QA Team); area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery)
