Description
After the OpenShell gateway container (openshell-cluster…) is forcibly killed with docker kill, nemoclaw connect for an existing sandbox waits the full 120s connect timeout, then fails with Status: unknown and a generic timeout message. It neither auto‑recovers the gateway nor immediately reports a clear “gateway is down, here’s how to recover” error. There is no explicit guidance to re‑run nemoclaw onboard or restart the gateway container, even though the test expects either automatic recovery or actionable recovery instructions in this scenario.
Environment
- Platform: Linux (e.g. Ubuntu 22.04 / 24.04 / 26.04)
- GPU: Any supported GPU
- Docker: Installed and running (supported NemoClaw/OpenShell runtime)
- NemoClaw CLI: v0.0.45
- OpenShell gateway:
  - Running as a Docker container, name matching openshell-cluster / openshell-cluster-nemoclaw as in the OpenShell/NemoClaw guides.
- Sandboxes:
  - At least one sandbox onboarded and healthy, e.g. prachi-new-sb, with a working inference provider (e.g. nvidia-prod or ollama-local).
- NemoClaw / OpenShell versions:
  - NemoClaw CLI version and OpenShell version can be taken from nemoclaw version and openshell --version output (fill in your exact versions, similar to other issues).
Steps to Reproduce
Preconditions
- Ensure NemoClaw CLI is installed and Docker is running.
- Ensure OpenShell gateway is running as a container:

  ```bash
  docker ps | grep openshell-cluster
  ```

  You should see a container whose name includes openshell-cluster or openshell-cluster-nemoclaw.
- Ensure at least one sandbox is onboarded and healthy, e.g. prachi-new-sb:

  ```bash
  nemoclaw status
  nemoclaw list
  ```

  nemoclaw status should show healthy, and nemoclaw list should show prachi-new-sb with a valid model/provider.
Repro steps
1. Confirm overall NemoClaw health:

   ```bash
   nemoclaw status
   ```
2. Force‑kill the OpenShell gateway container:

   ```bash
   docker kill $(docker ps -q --filter name=openshell-cluster)
   ```

   (Adjust the filter to match your actual gateway container name, e.g. openshell-cluster-nemoclaw.)
3. Wait ~30 seconds to give any background retry logic a chance to run.
4. Attempt to connect to an existing sandbox (example: prachi-new-sb):

   ```bash
   nemoclaw prachi-new-sb connect
   ```

   Observe the status output and final result.
5. After the command exits, capture the exit code:

   ```bash
   echo $?
   ```
6. Optionally, check gateway state:

   ```bash
   docker ps | grep openshell-cluster
   ```
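To make the elapsed time and exit code easy to attach to a report like this one, the connect attempt can be run through a small timing helper. This is a minimal sketch that assumes bash; time_command is a hypothetical helper name and is not part of NemoClaw or OpenShell.

```bash
# Hypothetical helper (not part of NemoClaw/OpenShell): run a command, then
# report its exit code and wall-clock duration on stderr.
# Usage: time_command <cmd> [args...]
time_command() {
  local start=$SECONDS rc=0
  "$@" || rc=$?   # capture the exit code without aborting under `set -e`
  echo "exit_code=${rc} elapsed_s=$((SECONDS - start))" >&2
  return "$rc"
}

# Repro flow from the steps above (left commented out so that sourcing this
# file has no side effects):
#   docker kill $(docker ps -q --filter name=openshell-cluster)
#   sleep 30
#   time_command nemoclaw prachi-new-sb connect
```

The summary line goes to stderr so it does not mix with the command's own standard output.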
Expected Result
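1. nemoclaw status initially reports healthy (before gateway kill).

2–4. After killing the gateway and waiting, when running nemoclaw prachi-new-sb connect under these conditions, one of two behaviors should occur:

a) Auto‑recovery supported:
- nemoclaw prachi-new-sb connect succeeds: sandbox becomes ready and you get a shell inside the sandbox.
- openclaw agent --agent main -m "hello" --session-id recovery produces a successful inference response.

b) Auto‑recovery NOT supported:
- nemoclaw prachi-new-sb connect should quickly fail with a clear error message indicating that the OpenShell gateway is down or unreachable, for example: “OpenShell gateway is not running or unreachable. Run: nemoclaw onboard to recreate the gateway, then retry your connect command.”
- nemoclaw status should reflect the degraded state or point to the missing gateway with guidance.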
Actual Result
After killing the gateway and attempting to connect to an existing sandbox, the user sees:
```bash
nemoclaw prachi-new-sb connect
Waiting for sandbox 'prachi-new-sb' to be ready...
Status: 2026-05-19 (103s elapsed)
Status: unknown (120s elapsed)
Timed out after 120s waiting for sandbox 'prachi-new-sb'.
Check: openshell sandbox list
Override timeout: NEMOCLAW_CONNECT_TIMEOUT=300 nemoclaw prachi-new-sb connect
```
Key observations:
- nemoclaw connect waits the full default 120 seconds before failing.
- The only status it shows during the failure is Status: unknown and a generic timeout message.
- There is no mention that the OpenShell gateway was killed or is not running.
- There is no explicit recovery guidance such as “Run nemoclaw onboard to recreate the gateway” or “Restart the openshell-cluster container.”
- The only suggestion is to run openshell sandbox list or increase the timeout via NEMOCLAW_CONNECT_TIMEOUT, which does not directly solve the gateway‑down scenario.
This behavior does not match the test’s expected behavior, which is that:
- Either the gateway auto‑recovers within a reasonable time and nemoclaw connect succeeds, or
- nemoclaw connect fails fast with a clear, actionable error message indicating that the gateway is down and how to recover (e.g. re‑run nemoclaw onboard), instead of leaving the user with Status: unknown after a long timeout.
In other words, after a deliberate gateway kill (docker kill …openshell-cluster…), nemoclaw connect currently times out with a vague “Status: unknown” and generic timeout suggestion, without auto‑recovery or explicit gateway‑recovery instructions, which makes post‑failure recovery harder than necessary.
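As a stopgap, the fail‑fast behavior requested here can be approximated outside the CLI with a thin wrapper that checks for the gateway container before connecting. The sketch below is illustrative only: gateway_running and connect_or_explain are hypothetical names, the openshell-cluster name filter and the message text are taken from this report, and nothing here reflects how NemoClaw itself does or should implement the check.

```bash
# Hypothetical wrapper (not part of NemoClaw/OpenShell).

# Succeeds if at least one running container matches the name filter.
# Assumption: the gateway container name contains "openshell-cluster".
gateway_running() {
  local name_filter="${1:-openshell-cluster}"
  [ -n "$(docker ps -q --filter "name=${name_filter}" --filter "status=running")" ]
}

# Fails immediately with recovery guidance when the gateway is missing,
# instead of waiting for the connect timeout to expire.
connect_or_explain() {
  local sandbox="$1"
  if ! gateway_running; then
    echo "OpenShell gateway is not running or unreachable." >&2
    echo "Run: nemoclaw onboard to recreate the gateway, then retry your connect command." >&2
    return 1
  fi
  nemoclaw "$sandbox" connect
}
```

For example, connect_or_explain prachi-new-sb returns 1 right away after the gateway has been killed, and otherwise hands off to nemoclaw prachi-new-sb connect. The check only confirms that a matching container is running, not that the gateway is healthy.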
Bug Details
| Field | Value |
| --- | --- |
| Priority | Unprioritized |
| Action | Dev - Open - To fix |
| Disposition | Open issue |
| Module | Machine Learning - NemoClaw |
| Keyword | NemoClaw, NEMOCLAW_GH_SYNC_APPROVAL |
[NVB#6192671]