Description
When the host Docker daemon is stopped, nemoclaw <name> status does NOT emit the Failure layer: docker_unreachable — Docker daemon is not reachable. header that the gateway-failure-classifier already defines. Instead the command prints a stale-looking status block (including a misleading "Inference: healthy" line — that probe hits the remote provider directly and doesn't go through the local Docker-hosted gateway) and exits with code 0. The only weak signal of trouble is Phase: Provisioning deep inside the sandbox detail block.
Root cause traced to a missed branch in status.ts: the classifier is only invoked from gateway-state failure branches, but on these hosts the openshell-gateway runs as a HOST process (verified via ps), independent of Docker. Stopping Docker doesn't take the gateway down, so the gateway probe still returns state: "present" and the docker_unreachable layer is never triggered.
Reproduced on both an Ubuntu 24.04 / no-GPU box and Ubuntu 24.04 / RTX 5090 box.
Environment
Device: Two Ubuntu 24.04 hosts:
- a1u2n2g-0096-02 / 10.176.178.129 (no GPU)
- 2u2g-gen-0690 / 10.57.211.27 (RTX 5090)
OS: Ubuntu 24.04.4 LTS
Architecture: x86_64
Node.js: v22.22.3
npm: 10.9.8
Docker: Docker version 29.4.1, build 055a478
OpenShell CLI: openshell 0.0.44
NemoClaw: nemoclaw v0.0.52
OpenClaw: 2026.4.24 (ollama-base) / 2026.5.22 (gpu-sb)
Steps to Reproduce
- Onboard any sandbox (e.g.
ollama-base, gpu-sb).
- Baseline:
nemoclaw <name> status # exit 0, no Failure layer text — OK
- Stop Docker:
sudo systemctl stop docker docker.socket
systemctl is-active docker # → "inactive"
- Run:
nemoclaw <name> status; echo $?
- Inspect the full output for any
Failure layer: line and inspect the exit code.
Expected Result
Stdout begins with the spec-defined header — verbatim:
Failure layer: docker_unreachable — Docker daemon is not reachable.
The header appears BEFORE any actionable lifecycle hint (e.g. before any "Run nemoclaw onboard" or "run nemoclaw <name> connect" guidance) so a user/script can identify the root cause without scrolling. Stale fields ("Inference: healthy") are suppressed or annotated. Exit code 1.
Actual Result
$ sudo systemctl stop docker docker.socket
$ systemctl is-active docker
inactive
$ nemoclaw ollama-base status
Sandbox: ollama-base
Model: gpt-4o-mini
Provider: openai-api
Inference: healthy (https://api.openai.com/v1/models) ← STALE — local stack is down but this probe hits remote OpenAI directly
Host GPU: no
Sandbox GPU: disabled (auto)
OpenShell: 0.0.39 (docker)
Policies: huggingface, brew, brave, local-inference, slack, npm, pypi
Connected: yes (1 session)
Permissions: not configured (default mutable state)
Agent: OpenClaw v2026.4.24
Sandbox:
Id: 61857c38-bbc4-4b25-8d84-a92791ed3c7a
Name: ollama-base
Phase: Provisioning ← only signal that something is wrong
...
$ echo $?
0 ← should be non-zero
(No Failure layer: text anywhere in the output. Same pattern on the GPU host against gpu-sb.)
Code Analysis
1) The classifier already implements docker_unreachable correctly — src/lib/actions/sandbox/gateway-failure-classifier.ts:
export type GatewayFailureLayer =
| "docker_unreachable"
| "container_missing"
| "container_exited_port_conflict"
| "container_exited"
| "gateway_unreachable";
...
export async function classifyGatewayFailure(
_sandboxName: string,
opts?: { runners?: GatewayFailureRunners },
): Promise<GatewayFailureResult> {
const runners = opts?.runners ?? defaultRunners;
if (!runners.dockerInfo()) { // ← docker check runs first
return {
layer: "docker_unreachable",
detail: "Docker daemon is not reachable (docker info failed or timed out).",
};
}
...
}
const LAYER_HEADERS = {
docker_unreachable: "Failure layer: docker_unreachable — Docker daemon is not reachable.",
...
};
The header string at LAYER_HEADERS["docker_unreachable"] matches the spec verbatim.
2) status.ts WIRES the classifier — but only in branches that assume the gateway probe ALREADY FAILED — src/lib/actions/sandbox/status.ts:
async function printGatewayFailureLayerHeader(sandboxName: string): Promise<void> {
const failure = await classifyGatewayFailure(sandboxName);
console.log(` ${getLayerHeader(failure.layer)}`);
}
export async function showSandboxStatus(sandboxName: string): Promise<void> {
...
let lookup: SandboxGatewayState;
try {
lookup = await getReconciledSandboxGatewayState(sandboxName, {
getState: getSandboxGatewayStateForStatus,
});
} catch (err) { ... }
...
if (lookup.state === "present") { // ← happy path — NO classifier call
console.log(lookup.output);
const phase = parseSandboxPhase(lookup.output || "");
if (phase && phase !== "Ready") {
console.log(` Sandbox '${sandboxName}' is stuck in '${phase}' phase.`);
...
}
} else if (lookup.state === "wrong_gateway_active") { ... }
else if (lookup.state === "missing") {
...
await printGatewayFailureLayerHeader(sandboxName); // ← only called when gateway lookup itself fails
...
} else if (lookup.state === "gateway_unreachable_after_restart") {
await printGatewayFailureLayerHeader(sandboxName);
...
} else if (lookup.state === "gateway_missing_after_restart") {
await printGatewayFailureLayerHeader(sandboxName);
...
} else {
await printGatewayFailureLayerHeader(sandboxName);
...
}
}
The Failure-layer header only prints when the gateway probe returns one of the failure states above (missing, gateway_unreachable_after_restart, gateway_missing_after_restart, default fallback). When the probe returns state: "present", the header is never reached.
3) Why the probe returns "present" with Docker down — the gateway is a HOST process, not a container:
$ ps -ef | grep openshell-gateway | grep -v grep
local-m+ 18801 1 0 May26 ? 00:00:45 /localhome/local-mercl/.local/bin/openshell-gateway
The openshell-gateway binary runs directly on the host. Stopping Docker takes down the sandbox containers (so Phase flips Ready → Provisioning), but the gateway keeps responding on its TCP port. getReconciledSandboxGatewayState sees a reachable gateway and classifies the state as present — the docker_unreachable branch is skipped entirely.
The classifier's own dockerInfo() check (which would catch this) is therefore never invoked.
Suggested fix:
a. Probe runners.dockerInfo() (or getSandboxDockerHealth(sandboxName)) UPFRONT in showSandboxStatus, BEFORE the gateway-state lookup. If it fails AND the sandbox's recorded openshellDriver is "docker", emit the Failure layer: docker_unreachable … header first, suppress the cached Inference: healthy line (or annotate it as stale), and set process.exitCode = 1 before continuing.
b. Optionally extend the same pattern to the other Failure layers so any layer detected by the classifier is surfaced even when the gateway probe happens to succeed (e.g. gateway healthy but container exited because docker restarted mid-cycle).
c. The Inference: healthy probe hits the remote provider URL directly. When the local stack is down, replace it with Inference: unknown (local stack unreachable) or skip the probe — the current behavior is actively misleading.
Logs
Not captured (host-level — systemctl status docker shows the daemon inactive; the bug is in nemoclaw's failure detection, not in the docker stack itself).
NVB#6229524
Description
When the host Docker daemon is stopped,
nemoclaw <name> statusdoes NOT emit theFailure layer: docker_unreachable — Docker daemon is not reachable.header that the gateway-failure-classifier already defines. Instead the command prints a stale-looking status block (including a misleading "Inference: healthy" line — that probe hits the remote provider directly and doesn't go through the local Docker-hosted gateway) and exits with code 0. The only weak signal of trouble isPhase: Provisioningdeep inside the sandbox detail block.Root cause traced to a missed branch in
status.ts: the classifier is only invoked from gateway-state failure branches, but on these hosts theopenshell-gatewayruns as a HOST process (verified viaps), independent of Docker. Stopping Docker doesn't take the gateway down, so the gateway probe still returnsstate: "present"and thedocker_unreachablelayer is never triggered.Reproduced on both an Ubuntu 24.04 / no-GPU box and Ubuntu 24.04 / RTX 5090 box.
Environment
Steps to Reproduce
ollama-base,gpu-sb).sudo systemctl stop docker docker.socket systemctl is-active docker # → "inactive"Failure layer:line and inspect the exit code.Expected Result
Stdout begins with the spec-defined header — verbatim:
The header appears BEFORE any actionable lifecycle hint (e.g. before any "Run
nemoclaw onboard" or "runnemoclaw <name> connect" guidance) so a user/script can identify the root cause without scrolling. Stale fields ("Inference: healthy") are suppressed or annotated. Exit code 1.Actual Result
(No
Failure layer:text anywhere in the output. Same pattern on the GPU host againstgpu-sb.)Code Analysis
1) The classifier already implements
docker_unreachablecorrectly —src/lib/actions/sandbox/gateway-failure-classifier.ts:The header string at
LAYER_HEADERS["docker_unreachable"]matches the spec verbatim.2)
status.tsWIRES the classifier — but only in branches that assume the gateway probe ALREADY FAILED —src/lib/actions/sandbox/status.ts:The Failure-layer header only prints when the gateway probe returns one of the failure states above (
missing,gateway_unreachable_after_restart,gateway_missing_after_restart, default fallback). When the probe returnsstate: "present", the header is never reached.3) Why the probe returns "present" with Docker down — the gateway is a HOST process, not a container:
The
openshell-gatewaybinary runs directly on the host. Stopping Docker takes down the sandbox containers (soPhaseflips Ready → Provisioning), but the gateway keeps responding on its TCP port.getReconciledSandboxGatewayStatesees a reachable gateway and classifies the state aspresent— thedocker_unreachablebranch is skipped entirely.The classifier's own
dockerInfo()check (which would catch this) is therefore never invoked.Suggested fix:
a. Probe
runners.dockerInfo()(orgetSandboxDockerHealth(sandboxName)) UPFRONT inshowSandboxStatus, BEFORE the gateway-state lookup. If it fails AND the sandbox's recordedopenshellDriveris"docker", emit theFailure layer: docker_unreachable …header first, suppress the cachedInference: healthyline (or annotate it as stale), and setprocess.exitCode = 1before continuing.b. Optionally extend the same pattern to the other Failure layers so any layer detected by the classifier is surfaced even when the gateway probe happens to succeed (e.g. gateway healthy but container exited because docker restarted mid-cycle).
c. The
Inference: healthyprobe hits the remote provider URL directly. When the local stack is down, replace it withInference: unknown (local stack unreachable)or skip the probe — the current behavior is actively misleading.Logs
Not captured (host-level —
systemctl status dockershows the daemon inactive; the bug is in nemoclaw's failure detection, not in the docker stack itself).NVB#6229524