[Ubuntu 24.04][Sandbox] Docker-driver HEALTHCHECK always (unhealthy) — marker file always created #4710

@hulynn

Description

v0.0.57 added new HEALTHCHECK branching logic to the Dockerfile to fix #4503 (NVBug 6240502: pgrep failing in Docker-driver mode). Per the design comment in scripts/nemoclaw-start.sh:200-208:

  • Marker file /tmp/nemoclaw-gateway-local present: the gateway runs in THIS container (standalone deployments).
  • Marker file absent: the gateway runs on the host (OpenShell Docker-driver, #4503). The in-container probe cannot observe it, so HEALTHCHECK should exit 0 and defer to OpenShell host-side delivery-chain monitoring.

Bug: scripts/nemoclaw-start.sh:210 calls mark_in_container_gateway() unconditionally whenever NEMOCLAW_CMD is empty. NEMOCLAW_CMD is also empty for Docker-driver sandboxes (no one-shot command), so the marker IS created in Docker-driver mode. The Dockerfile HEALTHCHECK short-circuit [ -f /tmp/nemoclaw-gateway-local ] || exit 0 therefore never fires, and the probe falls through to the original pgrep check — which still fails because the gateway runs on the host, not in the sandbox container.

The end-to-end user-visible symptom (container marked (unhealthy) by Docker) is unchanged from v0.0.53. The #4503 fix is a no-op for the Docker-driver deployments it was meant to repair.

Environment

Device:        KVM VM (10.57.212.209, x86_64, NVIDIA Blackwell)
OS:            Ubuntu 24.04.4 LTS (kernel 6.17.0-23-generic)
Architecture:  x86_64
Node.js:       v22.22.3
npm:           10.9.8
Docker:        29.5.2 (native, not Colima)
OpenShell CLI: 0.0.44 (docker-driver)
NemoClaw:      v0.0.57 (installed via NEMOCLAW_INSTALL_TAG=v0.0.57)
OpenClaw:      2026.5.22
Gateway mode:  Docker driver (openshell-gateway runs as host process)

Steps to Reproduce

  1. On Ubuntu 24.04 with Docker-driver openshell, onboard a fresh sandbox:

    NEMOCLAW_NON_INTERACTIVE=1 NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 \
      NVIDIA_API_KEY=nvapi-... \
      nemoclaw onboard --fresh --name lynn-test

  2. Wait at least 60 seconds past the 45-second StartPeriod.

  3. Inspect container health state:

    SBC=$(docker ps -a --format '{{.Names}}' | grep ^openshell-lynn-test-)
    docker inspect "$SBC" --format '{{json .State.Health}}' | python3 -m json.tool

  4. Confirm the marker file IS present in Docker-driver mode (it should NOT be):

    docker exec "$SBC" ls -la /tmp/nemoclaw-gateway-local

  5. Confirm the gateway runs on the host, not in the container:

    docker exec "$SBC" pgrep --ignore-ancestors -f 'openclaw[ -]gateway'
    pgrep -a openshell-gateway   # on host

Expected Result

Docker HEALTHCHECK reports (healthy) for Docker-driver sandboxes because the gateway lives on the host and the in-container probe cannot observe it. Per the explicit design in nemoclaw-start.sh:200-208 and the Dockerfile short-circuit, this is achieved when /tmp/nemoclaw-gateway-local is ABSENT in Docker-driver sandboxes.

Specifically at step 4, the marker file should NOT exist; step 3 should show Status="healthy" with FailingStreak=0.

Actual Result

Step 3 — container Health:

Status: "unhealthy"
FailingStreak: 119 (after start-period elapsed)
Log entries: all exit=1, output=""

Step 4 — marker file IS present (bug):

-rw-r--r-- 1 sandbox sandbox 0 Jun  3 07:27 /tmp/nemoclaw-gateway-local

Step 5 — gateway location:

  • Inside container: pgrep finds nothing (exit 1)
  • On host: pgrep -a openshell-gateway returns the gateway PID

Container marked (unhealthy) by Docker indefinitely. End state identical to v0.0.53 — the #4503 fix has no visible effect.

Root Cause Analysis

Dockerfile HEALTHCHECK (verified via docker inspect):

port="${NEMOCLAW_DASHBOARD_PORT:-${OPENCLAW_GATEWAY_PORT:-}}"
if [ -z "$port" ]; then port=18789; fi
rc=0
curl -sf --max-time 3 "http://127.0.0.1:${port}/health" > /dev/null 2>&1 || rc=$?
if [ "$rc" = 0 ]; then exit 0; fi              # healthy
if [ "$rc" != 7 ]; then exit 1; fi             # curl failed for a reason other than "could not connect" (exit 7)
[ -f /tmp/nemoclaw-gateway-local ] || exit 0;  # Docker-driver short-circuit
pgrep --ignore-ancestors -f 'openclaw[ -]gateway' > /dev/null 2>&1 || exit 1;
[ -s /tmp/gateway.log ]

scripts/nemoclaw-start.sh:200-218:

# Comment: marker present = gateway in THIS container; absent = gateway on host
# (OpenShell docker-driver sandboxes run it on the host — #4503)
mark_in_container_gateway() {
  : >/tmp/nemoclaw-gateway-local 2>/dev/null || true
}
if [ ${#NEMOCLAW_CMD[@]} -eq 0 ]; then
  mark_in_container_gateway   # ← unconditional, no driver detection
fi

The marker is created any time NEMOCLAW_CMD is empty — which is true for Docker-driver sandboxes (they are not one-shot command containers). No check distinguishes "this container will start the gateway" from "gateway lives on host". The Docker-driver short-circuit in the Dockerfile HEALTHCHECK is therefore unreachable.
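
To make the unreachable branch concrete, the probe's decision tree can be modelled as a pure function of its four inputs. This is an illustrative sketch only, not the shipped probe: the function name healthcheck_verdict and its 0/1 flag arguments are hypothetical and stand in for the real curl, marker-file, pgrep, and gateway.log checks shown above.

```shell
#!/usr/bin/env bash
# Illustrative model of the HEALTHCHECK decision tree; not the shipped probe.
# Usage: healthcheck_verdict CURL_RC MARKER PGREP_OK LOG_OK
#   CURL_RC   exit code of the curl request to /health
#   MARKER    1 if /tmp/nemoclaw-gateway-local exists, else 0
#   PGREP_OK  1 if pgrep finds a gateway process in the container, else 0
#   LOG_OK    1 if /tmp/gateway.log is non-empty, else 0
healthcheck_verdict() {
  local curl_rc=$1 marker=$2 pgrep_ok=$3 log_ok=$4
  if [ "$curl_rc" = 0 ]; then return 0; fi    # /health answered
  if [ "$curl_rc" != 7 ]; then return 1; fi   # curl failed, but not "could not connect"
  if [ "$marker" != 1 ]; then return 0; fi    # no marker: gateway is on the host, defer
  if [ "$pgrep_ok" != 1 ]; then return 1; fi  # marker present but no gateway process
  [ "$log_ok" = 1 ]                           # gateway.log must be non-empty
}

# Docker-driver sandbox: curl exits 7, pgrep finds nothing, gateway.log is non-empty.
rc=0; healthcheck_verdict 7 1 0 1 || rc=$?
echo "marker present (observed in v0.0.57): exit $rc"
rc=0; healthcheck_verdict 7 0 0 1 || rc=$?
echo "marker absent (intended): exit $rc"
```

With the inputs observed in this report (curl exit 7, marker present, pgrep failing), the model returns 1. Flipping only the marker flag returns 0, which is the outcome the design comment describes for Docker-driver sandboxes.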

Suggested Fix

Gate mark_in_container_gateway() on a driver-aware signal in scripts/nemoclaw-start.sh, e.g.:

  • Check OPENSHELL_DRIVER env var (= "docker" → skip marker)
  • Check the same gating logic the script already uses "further below" to decide whether THIS container will actually launch the gateway, and only create the marker on that path.

Either change keeps the marker semantics consistent with the design comment and lets the Dockerfile HEALTHCHECK short-circuit work as intended.
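
A minimal sketch of the first option, under stated assumptions: OPENSHELL_DRIVER is the proposed signal and may not exist or carry this value in real sandboxes, and the helpers gateway_runs_in_container and maybe_mark_in_container_gateway, along with the MARKER_FILE override, are hypothetical names introduced only for illustration. The one-shot command is passed as positional arguments to stand in for the NEMOCLAW_CMD array.

```shell
#!/usr/bin/env bash
# Sketch only. OPENSHELL_DRIVER is the signal proposed in this issue and is an
# assumption; MARKER_FILE is overridable purely so the sketch can be exercised
# without touching the real marker path.
MARKER_FILE="${MARKER_FILE:-/tmp/nemoclaw-gateway-local}"

mark_in_container_gateway() {
  : >"$MARKER_FILE" 2>/dev/null || true
}

# Succeeds only when this container is expected to launch the gateway itself.
gateway_runs_in_container() {
  [ "${OPENSHELL_DRIVER:-}" != "docker" ]
}

# Arguments: the one-shot command, if any (stands in for NEMOCLAW_CMD).
maybe_mark_in_container_gateway() {
  if [ "$#" -eq 0 ] && gateway_runs_in_container; then
    mark_in_container_gateway
  fi
}
```

In scripts/nemoclaw-start.sh the call site would then become something like maybe_mark_in_container_gateway "${NEMOCLAW_CMD[@]}". If the script already computes whether this container will launch the gateway, reusing that condition inside gateway_runs_in_container would likely be more robust than relying on an environment variable.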

Logs

$ docker inspect openshell-lynn-test-470242a0-0a49-4044-a581-4dbe590331c3 \
    --format '{{json .State.Health}}' | python3 -m json.tool
{
    "Status": "unhealthy",
    "FailingStreak": 119,
    "Log": [
        {"Start": "2026-06-03T08:28:25", "ExitCode": 1, "Output": ""},
        {"Start": "2026-06-03T08:28:55", "ExitCode": 1, "Output": ""},
        {"Start": "2026-06-03T08:29:25", "ExitCode": 1, "Output": ""},
        {"Start": "2026-06-03T08:29:56", "ExitCode": 1, "Output": ""},
        {"Start": "2026-06-03T08:30:26", "ExitCode": 1, "Output": ""}
    ]
}

$ docker exec openshell-lynn-test-... bash -c '
  ls -la /tmp/nemoclaw-gateway-local
  curl -v --max-time 3 http://127.0.0.1:18789/health 2>&1 | tail -5
  pgrep --ignore-ancestors -f "openclaw[ -]gateway"; echo pgrep_exit=$?
  ls -la /tmp/gateway.log'
-rw-r--r-- 1 sandbox sandbox    0 Jun  3 07:27 /tmp/nemoclaw-gateway-local
* connect to 127.0.0.1 port 18789 from 127.0.0.1 port 47306 failed: Connection refused
* Failed to connect to 127.0.0.1 port 18789 after 0 ms: Could not connect to server
curl: (7) Failed to connect to 127.0.0.1 port 18789 after 0 ms: Could not connect to server
pgrep_exit=1
-rw-r--r-- 1 sandbox sandbox 5219 Jun  3 07:28 /tmp/gateway.log

Related


NVB#6262928

Labels

  • NV QA: Bugs found by the NVIDIA QA Team
  • area: sandbox: OpenShell sandbox lifecycle, runtime, config, or recovery
  • platform: container: Affects Docker, containerd, Podman, or images
  • platform: ubuntu: Affects Ubuntu Linux environments
