[Nemoclaw][All Platforms] nemoclaw connect times out with 'Status: unknown' after gateway docker kill; no auto-recovery or clear recovery guidance

## Description

After the OpenShell gateway container (<code>openshell-cluster…</code>) is forcibly killed with <code>docker kill</code>, <code>nemoclaw connect</code> for an existing sandbox waits out the full 120s connect timeout and then fails with <code>Status: unknown</code> and a generic timeout message. It neither auto‑recovers the gateway nor fails immediately with a clear “gateway is down, here’s how to recover” error, and it gives no guidance to re‑run <code>nemoclaw onboard</code> or restart the gateway container. The test for this scenario expects either automatic recovery or actionable recovery instructions.

Environment
<ul><li>
Platform: Linux (e.g. Ubuntu 22.04 / 24.04 / 26.04)</li><li>
GPU: Any supported GPU</li><li>
Docker: Installed and running (supported NemoClaw/OpenShell runtime)</li><li>
NemoClaw CLI: v0.0.45</li><li>
OpenShell gateway:
<ul>
	<li>
Running as a Docker container, name matching <code>openshell-cluster</code> / <code>openshell-cluster-nemoclaw</code> as in the OpenShell/NemoClaw guides.	</li>

</ul>
</li><li>
Sandboxes:
<ul>
	<li>
At least one sandbox onboarded and healthy, e.g. <code>prachi-new-sb</code>, with a working inference provider (e.g. <code>nvidia-prod</code> or <code>ollama-local</code>).	</li></ul>
</li><li>
NemoClaw / OpenShell versions:
<ul>
	<li>
NemoClaw CLI version and OpenShell version can be taken from <code>nemoclaw version</code> and <code>openshell --version</code> output (fill in your exact versions, similar to other issues).	</li></ul>
</li></ul>

Steps to Reproduce

Preconditions
<ol><li>
Ensure NemoClaw CLI is installed and Docker is running.</li><li>
Ensure OpenShell gateway is running as a container:
<pre><code>docker ps | grep openshell-cluster</code></pre>


You should see a container whose name includes <code>openshell-cluster</code> or <code>openshell-cluster-nemoclaw</code>.</li><li>
Ensure at least one sandbox is onboarded and healthy, e.g. <code>prachi-new-sb</code>:
<pre><code>nemoclaw status
nemoclaw list</code></pre>


<code>nemoclaw status</code> should show healthy, and <code>nemoclaw list</code> should show <code>prachi-new-sb</code> with a valid model/provider.</li></ol>

Repro steps
<ol><li>
Confirm overall NemoClaw health:
<pre><code>nemoclaw status</code></pre></li><li>
Force‑kill the OpenShell gateway container:
<pre><code>docker kill $(docker ps -q --filter name=openshell-cluster)</code></pre>


(Adjust the filter to match your actual gateway container name, e.g. <code>openshell-cluster-nemoclaw</code>.)</li><li>
Wait ~30 seconds to give any background retry logic a chance to run.</li><li>
Attempt to connect to an existing sandbox (example: <code>prachi-new-sb</code>):
<pre><code>nemoclaw prachi-new-sb connect</code></pre></li><li>
Observe the status output and final result.</li><li>
After the command exits, capture the exit code:
<pre><code>echo $?</code></pre></li><li>
Optionally, check gateway state:
<pre><code>docker ps | grep openshell-cluster</code></pre></li></ol>
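
The repro steps above can be collected into one shell function so that the exit code and wall-clock duration are captured in a single run. This is only a sketch: `repro_gateway_kill` and the `SANDBOX`, `GATEWAY_FILTER`, and `SETTLE_SECONDS` variables are hypothetical names introduced here, not NemoClaw settings, and the only commands used are the ones already listed in this report.

```shell
# Sketch of the repro steps as one function. SANDBOX, GATEWAY_FILTER and
# SETTLE_SECONDS are illustrative variable names, not NemoClaw settings.

repro_gateway_kill() {
  local sandbox="${SANDBOX:-prachi-new-sb}"
  local filter="${GATEWAY_FILTER:-openshell-cluster}"
  local settle="${SETTLE_SECONDS:-30}"
  local ids start rc elapsed

  # Step 2: force-kill every running container whose name matches the filter.
  ids="$(docker ps -q --filter "name=${filter}")"
  if [ -z "$ids" ]; then
    echo "no running container matches '${filter}'; nothing to kill" >&2
    return 2
  fi
  # Intentionally unquoted: $ids may hold several IDs, one per line.
  docker kill $ids >/dev/null

  # Step 3: give any background retry logic a chance to run.
  sleep "$settle"

  # Steps 4-6: connect, then record the exit code and wall-clock duration.
  start="$(date +%s)"
  rc=0
  nemoclaw "$sandbox" connect || rc=$?
  elapsed=$(( $(date +%s) - start ))
  echo "connect exit code: ${rc}, elapsed: ${elapsed}s"

  # Step 7: show whether a gateway container is running again.
  docker ps --filter "name=${filter}"
  return "$rc"
}
```

After sourcing the file, `SANDBOX=prachi-new-sb repro_gateway_kill` runs the sequence once. The function returns the exit code of the connect attempt, or 2 if no matching container was running to begin with.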

Expected Result
<ol><li>
<code>nemoclaw status</code> initially reports healthy (before gateway kill).</li></ol>

2–4. After killing the gateway and waiting:
<ul><li>
When you run <code>nemoclaw prachi-new-sb connect</code> under these conditions, one of two behaviors should occur:

a) Auto‑recovery supported:
<ul>
	<li>
NemoClaw (or OpenShell) detects the missing gateway and restarts it automatically (e.g. via Docker) within the connect window.	</li>	<li>
The gateway becomes healthy again.	</li>	<li>
<code>nemoclaw prachi-new-sb connect</code> succeeds: sandbox becomes ready and you get a shell inside the sandbox.	</li>	<li>
Inside the sandbox, running:
	<pre><code>openclaw agent --agent main -m "hello" --session-id recovery</code></pre>


produces a successful inference response.	</li></ul>

b) Auto‑recovery NOT supported:
<ul>
	<li>
<code>nemoclaw prachi-new-sb connect</code> should quickly fail with a clear error message indicating that the OpenShell gateway is down or unreachable.	</li>	<li>
The error output should include explicit recovery steps, for example:
	<pre><code>OpenShell gateway is not running or unreachable. Run: nemoclaw onboard to recreate the gateway, then retry your connect command.</code></pre></li>	<li>
The command should exit non‑zero.	</li>	<li>
A follow‑up <code>nemoclaw status</code> should reflect the degraded state or point to the missing gateway with guidance.	</li></ul>
</li></ul>
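
Behavior (b) can be approximated today with a user-side wrapper that checks for a running gateway container before calling connect. The sketch below is not how NemoClaw implements or should implement the check: `gateway_running`, `connect_fail_fast`, and `GATEWAY_FILTER` are hypothetical names, and it assumes the gateway is identifiable by the same `docker ps` name filter used in the repro steps.

```shell
# Hypothetical wrapper, not part of NemoClaw: check for a running gateway
# container before calling connect, so a dead gateway fails in seconds.

gateway_running() {
  # Succeeds only if docker ps lists at least one running container
  # whose name matches the filter.
  local filter="${GATEWAY_FILTER:-openshell-cluster}"
  [ -n "$(docker ps -q --filter "name=${filter}")" ]
}

connect_fail_fast() {
  local sandbox="$1"
  if ! gateway_running; then
    {
      echo "OpenShell gateway is not running or unreachable."
      echo "Run: nemoclaw onboard to recreate the gateway, then retry:"
      echo "  nemoclaw ${sandbox} connect"
    } >&2
    return 1
  fi
  # Gateway container is up: hand over to the real connect command.
  nemoclaw "$sandbox" connect
}
```

Calling `connect_fail_fast prachi-new-sb` with the gateway killed returns 1 immediately with the recovery text on stderr, instead of waiting for the connect timeout. A running container does not prove the gateway is healthy, so this only covers the "container is gone" case described in this report.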

Actual Result

After killing the gateway and attempting to connect to an existing sandbox, the user sees:
<pre><code>nemoclaw prachi-new-sb connect
Waiting for sandbox 'prachi-new-sb' to be ready...
Status: 2026-05-19 (103s elapsed)
Status: unknown (120s elapsed)
Timed out after 120s waiting for sandbox 'prachi-new-sb'.
Check: openshell sandbox list
Override timeout: NEMOCLAW_CONNECT_TIMEOUT=300 nemoclaw prachi-new-sb connect</code></pre>


Key observations:
<ul><li>
<code>nemoclaw connect</code> waits the full default 120 seconds before failing.</li><li>
The only status it shows during the failure is <code>Status: unknown</code> and a generic timeout message.</li><li>
There is no mention that the OpenShell gateway was killed or is not running.</li><li>
There is no explicit recovery guidance such as “Run <code>nemoclaw onboard</code> to recreate the gateway” or “Restart the <code>openshell-cluster</code> container.”</li><li>
The only suggestions are to run <code>openshell sandbox list</code> or to increase the timeout via <code>NEMOCLAW_CONNECT_TIMEOUT</code>, neither of which addresses the gateway‑down scenario.</li></ul>

This does not match the test’s expected behavior, which is that:
<ul><li>
Either the gateway auto‑recovers within a reasonable time, and <code>nemoclaw connect</code> succeeds, or</li><li>
<code>nemoclaw connect</code> fails fast with a clear, actionable error message indicating that the gateway is down and how to recover (e.g. re‑run <code>nemoclaw onboard</code>), instead of leaving the user with <code>Status: unknown</code> after a long timeout.</li></ul>
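
The fail-fast half of that expectation could be checked mechanically against a captured run. The sketch below is one possible acceptance check, not the actual test: `check_fail_fast`, `FAIL_FAST_BUDGET`, the 15-second default budget, and the matched phrases are all assumptions chosen for illustration.

```shell
# Illustrative acceptance check for the fail-fast path. The 15-second budget
# and the matched phrases are assumptions, not values from the test plan.

check_fail_fast() {
  local output="$1"
  local rc="$2"
  local elapsed="$3"
  local budget="${FAIL_FAST_BUDGET:-15}"
  local failed=0

  # The command must exit non-zero.
  [ "$rc" -ne 0 ] || { echo "FAIL: exit code is 0"; failed=1; }
  # It must fail well before the 120s connect timeout.
  [ "$elapsed" -le "$budget" ] || {
    echo "FAIL: took ${elapsed}s (budget ${budget}s)"; failed=1;
  }
  # The message must name the gateway as the problem.
  case "$output" in
    *[Gg]ateway*) ;;
    *) echo "FAIL: output does not mention the gateway"; failed=1 ;;
  esac
  # The message must include the recovery command.
  case "$output" in
    *"nemoclaw onboard"*) ;;
    *) echo "FAIL: output has no 'nemoclaw onboard' recovery step"; failed=1 ;;
  esac
  return "$failed"
}
```

Fed the output shown under Actual Result with an elapsed time of 120, this check reports the duration, the missing gateway mention, and the missing recovery step.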

In short, after a deliberate gateway kill (<code>docker kill …openshell-cluster…</code>), <code>nemoclaw connect</code> times out with a vague <code>Status: unknown</code> and a generic timeout suggestion, with no auto‑recovery and no explicit gateway‑recovery instructions, which makes recovery harder than necessary.

## Bug Details

| Field | Value |
|-------|-------|
| Priority | Unprioritized |
| Action | Dev - Open - To fix |
| Disposition | Open issue |
| Module | Machine Learning - NemoClaw |
| Keyword | NemoClaw, NEMOCLAW_GH_SYNC_APPROVAL |

---
[NVB#6192671]

[Nemoclaw][All Platforms] nemoclaw connect times out with 'Status: unknown' after gateway docker kill; no auto-recovery or clear recovery guidance #3821