[Ubuntu 24.04][CLI&UX] nemoclaw <name> status does not emit "Failure layer: docker_unreachable" when Docker daemon is stopped; exits 0 with stale "Inference: healthy" #4313

@hulynn

Description

When the host Docker daemon is stopped, nemoclaw <name> status does not emit the "Failure layer: docker_unreachable — Docker daemon is not reachable." header that the gateway-failure-classifier already defines. Instead, the command prints a stale-looking status block and exits with code 0. The block includes a misleading "Inference: healthy" line, because that probe hits the remote provider directly and does not go through the local Docker-hosted stack. The only signal of trouble is Phase: Provisioning, deep inside the sandbox detail block.

Root cause traced to a missed branch in status.ts: the classifier is only invoked from gateway-state failure branches, but on these hosts the openshell-gateway runs as a HOST process (verified via ps), independent of Docker. Stopping Docker doesn't take the gateway down, so the gateway probe still returns state: "present" and the docker_unreachable layer is never triggered.

Reproduced on two Ubuntu 24.04 hosts: one without a GPU and one with an RTX 5090.

Environment

Device:        Two Ubuntu 24.04 hosts:
               - a1u2n2g-0096-02 / 10.176.178.129 (no GPU)
               - 2u2g-gen-0690   / 10.57.211.27   (RTX 5090)
OS:            Ubuntu 24.04.4 LTS
Architecture:  x86_64
Node.js:       v22.22.3
npm:           10.9.8
Docker:        Docker version 29.4.1, build 055a478
OpenShell CLI: openshell 0.0.44
NemoClaw:      nemoclaw v0.0.52
OpenClaw:      2026.4.24 (ollama-base) / 2026.5.22 (gpu-sb)

Steps to Reproduce

  1. Onboard any sandbox (e.g. ollama-base, gpu-sb).
  2. Baseline:
    nemoclaw <name> status            # exit 0, no Failure layer text — OK
  3. Stop Docker:
    sudo systemctl stop docker docker.socket
    systemctl is-active docker         # → "inactive"
  4. Run:
    nemoclaw <name> status; echo $?
  5. Inspect the full output for any Failure layer: line and inspect the exit code.

Expected Result

Stdout begins with the spec-defined header — verbatim:

Failure layer: docker_unreachable — Docker daemon is not reachable.

The header appears BEFORE any actionable lifecycle hint (e.g. before any "Run nemoclaw onboard" or "run nemoclaw <name> connect" guidance) so a user/script can identify the root cause without scrolling. Stale fields ("Inference: healthy") are suppressed or annotated. Exit code 1.
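The expected result above can be turned into a small automated check. The following is a minimal sketch, not part of nemoclaw: `StatusRun` and `checkDockerDownStatus` are hypothetical names, and the check is stricter than the spec because it treats any remaining "Inference: healthy" line as a failure, even though an annotated line would also be acceptable.

```typescript
// Header text copied verbatim from the expected result in this issue.
const EXPECTED_HEADER =
  "Failure layer: docker_unreachable — Docker daemon is not reachable.";

// Hypothetical shape: captured stdout and exit code of one `status` run.
interface StatusRun {
  stdout: string;
  exitCode: number;
}

// Returns a list of deviations from the expected Docker-down behavior.
function checkDockerDownStatus(run: StatusRun): string[] {
  const problems: string[] = [];
  const lines = run.stdout
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);

  // The header must be the first thing a user or script sees.
  if (lines[0] !== EXPECTED_HEADER) {
    problems.push("Failure layer header is not the first non-empty line");
  }
  // Simplification: any unannotated or annotated "healthy" line is flagged.
  if (lines.some((line) => line.startsWith("Inference: healthy"))) {
    problems.push("stale 'Inference: healthy' line is still printed");
  }
  if (run.exitCode !== 1) {
    problems.push(`exit code is ${run.exitCode}, expected 1`);
  }
  return problems;
}
```

Feeding it the stdout and exit code captured in step 4 should return an empty list once the bug is fixed.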

Actual Result

$ sudo systemctl stop docker docker.socket
$ systemctl is-active docker
inactive

$ nemoclaw ollama-base status

  Sandbox: ollama-base
    Model:    gpt-4o-mini
    Provider: openai-api
    Inference: healthy (https://api.openai.com/v1/models)   ← STALE — local stack is down but this probe hits remote OpenAI directly
    Host GPU: no
    Sandbox GPU: disabled (auto)
    OpenShell: 0.0.39 (docker)
    Policies: huggingface, brew, brave, local-inference, slack, npm, pypi
    Connected: yes (1 session)
    Permissions: not configured (default mutable state)
    Agent:    OpenClaw v2026.4.24

Sandbox:
  Id:    61857c38-bbc4-4b25-8d84-a92791ed3c7a
  Name:  ollama-base
  Phase: Provisioning                                       ← only signal that something is wrong
  ...

$ echo $?
0                                                           ← should be non-zero

(No Failure layer: text anywhere in the output. Same pattern on the GPU host against gpu-sb.)

Code Analysis

1) The classifier already implements docker_unreachable correctly

src/lib/actions/sandbox/gateway-failure-classifier.ts:

export type GatewayFailureLayer =
  | "docker_unreachable"
  | "container_missing"
  | "container_exited_port_conflict"
  | "container_exited"
  | "gateway_unreachable";
...
export async function classifyGatewayFailure(
  _sandboxName: string,
  opts?: { runners?: GatewayFailureRunners },
): Promise<GatewayFailureResult> {
  const runners = opts?.runners ?? defaultRunners;

  if (!runners.dockerInfo()) {                              // ← docker check runs first
    return {
      layer: "docker_unreachable",
      detail: "Docker daemon is not reachable (docker info failed or timed out).",
    };
  }
  ...
}

const LAYER_HEADERS = {
  docker_unreachable: "Failure layer: docker_unreachable — Docker daemon is not reachable.",
  ...
};

The header string at LAYER_HEADERS["docker_unreachable"] matches the spec verbatim.
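The issue does not show how the default `dockerInfo` runner is implemented. As an illustration only, a synchronous runner consistent with the detail message ("docker info failed or timed out") might look like the sketch below. `commandSucceeds` and the 5000 ms default are assumptions, not the project's code.

```typescript
import { spawnSync } from "node:child_process";

// Returns true only if the command exits with status 0 within the timeout.
// A missing binary, a non-zero exit, or a timeout all count as failure,
// because spawnSync then reports a non-zero or null status.
function commandSucceeds(
  command: string,
  args: string[],
  timeoutMs: number,
): boolean {
  const result = spawnSync(command, args, {
    stdio: "ignore",
    timeout: timeoutMs,
  });
  return result.status === 0;
}

// Hypothetical runner: a failed or slow `docker info` means "unreachable".
function dockerInfo(timeoutMs = 5000): boolean {
  return commandSucceeds("docker", ["info"], timeoutMs);
}
```

Keeping the runner synchronous matches how the classifier calls it (`!runners.dockerInfo()`), where a Promise would always be truthy.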

2) status.ts WIRES the classifier, but only in branches that assume the gateway probe ALREADY FAILED

src/lib/actions/sandbox/status.ts:

async function printGatewayFailureLayerHeader(sandboxName: string): Promise<void> {
  const failure = await classifyGatewayFailure(sandboxName);
  console.log(`  ${getLayerHeader(failure.layer)}`);
}

export async function showSandboxStatus(sandboxName: string): Promise<void> {
  ...
  let lookup: SandboxGatewayState;
  try {
    lookup = await getReconciledSandboxGatewayState(sandboxName, {
      getState: getSandboxGatewayStateForStatus,
    });
  } catch (err) { ... }
  ...
  if (lookup.state === "present") {                          // ← happy path — NO classifier call
    console.log(lookup.output);
    const phase = parseSandboxPhase(lookup.output || "");
    if (phase && phase !== "Ready") {
      console.log(`  Sandbox '${sandboxName}' is stuck in '${phase}' phase.`);
      ...
    }
  } else if (lookup.state === "wrong_gateway_active") { ... }
  else if (lookup.state === "missing") {
    ...
    await printGatewayFailureLayerHeader(sandboxName);       // ← only called when gateway lookup itself fails
    ...
  } else if (lookup.state === "gateway_unreachable_after_restart") {
    await printGatewayFailureLayerHeader(sandboxName);
    ...
  } else if (lookup.state === "gateway_missing_after_restart") {
    await printGatewayFailureLayerHeader(sandboxName);
    ...
  } else {
    await printGatewayFailureLayerHeader(sandboxName);
    ...
  }
}

The Failure-layer header only prints when the gateway probe returns one of the failure states above (missing, gateway_unreachable_after_restart, gateway_missing_after_restart, default fallback). When the probe returns state: "present", the header is never reached.

3) Why the probe returns "present" with Docker down — the gateway is a HOST process, not a container:

$ ps -ef | grep openshell-gateway | grep -v grep
local-m+   18801   1  0 May26  ? 00:00:45 /localhome/local-mercl/.local/bin/openshell-gateway

The openshell-gateway binary runs directly on the host. Stopping Docker takes down the sandbox containers (so Phase flips Ready → Provisioning), but the gateway keeps responding on its TCP port. getReconciledSandboxGatewayState sees a reachable gateway and classifies the state as present — the docker_unreachable branch is skipped entirely.

The classifier's own dockerInfo() check (which would catch this) is therefore never invoked.
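The gap can be reduced to a small model of the branch structure quoted from status.ts. This is a simplified sketch for illustration: `LookupState`, `"other"`, `"other_layer"`, and `layerForStatus` are hypothetical names, and the real function prints far more than a layer.

```typescript
type LookupState =
  | "present"
  | "wrong_gateway_active"
  | "missing"
  | "gateway_unreachable_after_restart"
  | "gateway_missing_after_restart"
  | "other"; // hypothetical stand-in for the final else branch

// "other_layer" is a placeholder for the classifier's remaining layers.
type ReportedLayer = "docker_unreachable" | "other_layer" | null;

// Models which failure layer the current wiring would report.
function layerForStatus(
  state: LookupState,
  dockerReachable: boolean,
): ReportedLayer {
  if (state === "present" || state === "wrong_gateway_active") {
    // The classifier is never consulted, so Docker state is ignored.
    return null;
  }
  // Failure branches: the classifier runs and checks Docker first.
  return dockerReachable ? "other_layer" : "docker_unreachable";
}
```

With a host-process gateway and Docker stopped, the call is `layerForStatus("present", false)`, which returns `null`: no Failure layer header is printed.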

Suggested fix:

a. Probe runners.dockerInfo() (or getSandboxDockerHealth(sandboxName)) UPFRONT in showSandboxStatus, BEFORE the gateway-state lookup. If it fails AND the sandbox's recorded openshellDriver is "docker", emit the Failure layer: docker_unreachable … header first, suppress the misleading Inference: healthy line (or annotate it as stale), and set process.exitCode = 1 before continuing.

b. Optionally extend the same pattern to the other Failure layers so any layer detected by the classifier is surfaced even when the gateway probe happens to succeed (e.g. gateway healthy but container exited because docker restarted mid-cycle).

c. The Inference: healthy probe hits the remote provider URL directly. When the local stack is down, replace it with Inference: unknown (local stack unreachable) or skip the probe — the current behavior is actively misleading.
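Suggestions (a) and (c) could be combined into one preflight step. The sketch below only illustrates the decision logic under stated assumptions: `PreflightInput`, `PreflightResult`, and `applyDockerPreflight` are hypothetical names, and the real change would live inside showSandboxStatus and set process.exitCode rather than return a value.

```typescript
// Header text copied verbatim from LAYER_HEADERS in this issue.
const DOCKER_DOWN_HEADER =
  "Failure layer: docker_unreachable — Docker daemon is not reachable.";

interface PreflightInput {
  openshellDriver: string; // recorded driver for the sandbox, e.g. "docker"
  dockerReachable: boolean; // result of an upfront Docker probe
  inferenceLine: string; // line the status command would otherwise print
}

interface PreflightResult {
  headerLines: string[]; // lines to print before anything else
  inferenceLine: string; // possibly replaced inference line
  exitCode: number; // value to assign to process.exitCode
}

function applyDockerPreflight(input: PreflightInput): PreflightResult {
  const dockerDown =
    input.openshellDriver === "docker" && !input.dockerReachable;
  if (!dockerDown) {
    // Docker is fine, or this sandbox does not use Docker: change nothing.
    return { headerLines: [], inferenceLine: input.inferenceLine, exitCode: 0 };
  }
  return {
    headerLines: [DOCKER_DOWN_HEADER],
    // Replacement text taken from suggestion (c).
    inferenceLine: "Inference: unknown (local stack unreachable)",
    exitCode: 1,
  };
}
```

Gating on the recorded driver keeps the header from appearing for sandboxes that do not depend on Docker.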

Logs

Not captured (host-level — systemctl status docker shows the daemon inactive; the bug is in nemoclaw's failure detection, not in the docker stack itself).


NVB#6229524

Metadata

Labels

NV QA (Bugs found by the NVIDIA QA Team)
area: cli (Command line interface, flags, terminal UX, or output)
platform: ubuntu (Affects Ubuntu Linux environments)
