Description
v0.0.57 added new HEALTHCHECK branching logic to the Dockerfile to fix #4503 (NVBug 6240502: pgrep failing in Docker-driver mode). The design comment in scripts/nemoclaw-start.sh:200-208 defines the contract: if the marker file /tmp/nemoclaw-gateway-local is present, the gateway runs in THIS container (standalone deployments); if it is absent, the gateway runs on the host (OpenShell Docker-driver, #4503), where the in-container probe cannot observe it, so HEALTHCHECK should exit 0 and defer to OpenShell host-side delivery-chain monitoring.
Bug: scripts/nemoclaw-start.sh:210 calls mark_in_container_gateway() whenever NEMOCLAW_CMD is empty, with no further condition. NEMOCLAW_CMD is also empty for Docker-driver sandboxes (no one-shot command), so the marker IS created in Docker-driver mode. The Dockerfile HEALTHCHECK short-circuit [ -f /tmp/nemoclaw-gateway-local ] || exit 0 therefore never fires, and the probe falls through to the original pgrep check, which still fails because the gateway runs on the host, not in the sandbox container.
The end-to-end user-visible symptom (container marked (unhealthy) by Docker) is unchanged from v0.0.53. The #4503 fix is a no-op for the Docker-driver deployments it was meant to repair.
Environment
Device: KVM VM (10.57.212.209, x86_64, NVIDIA Blackwell)
OS: Ubuntu 24.04.4 LTS (kernel 6.17.0-23-generic)
Architecture: x86_64
Node.js: v22.22.3
npm: 10.9.8
Docker: 29.5.2 (native, not Colima)
OpenShell CLI: 0.0.44 (docker-driver)
NemoClaw: v0.0.57 (installed via NEMOCLAW_INSTALL_TAG=v0.0.57)
OpenClaw: 2026.5.22
Gateway mode: Docker driver (openshell-gateway runs as host process)
Steps to Reproduce
1. On Ubuntu 24.04 with Docker-driver OpenShell, onboard a fresh sandbox:
   NEMOCLAW_NON_INTERACTIVE=1 NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 \
   NVIDIA_API_KEY=nvapi-... \
   nemoclaw onboard --fresh --name lynn-test
2. Wait at least 60 seconds past the 45-second StartPeriod.
3. Inspect container health state:
   SBC=$(docker ps -a --format '{{.Names}}' | grep ^openshell-lynn-test-)
   docker inspect "$SBC" --format '{{json .State.Health}}' | python3 -m json.tool
4. Confirm the marker file IS present in Docker-driver mode (it should NOT be):
   docker exec "$SBC" ls -la /tmp/nemoclaw-gateway-local
5. Confirm the gateway runs on the host, not in the container:
   docker exec "$SBC" pgrep --ignore-ancestors -f 'openclaw[ -]gateway'
   pgrep -a openshell-gateway # on host
Expected Result
Docker HEALTHCHECK reports (healthy) for Docker-driver sandboxes because the gateway lives on the host and the in-container probe cannot observe it. Per the explicit design in nemoclaw-start.sh:200-208 and the Dockerfile short-circuit, this is achieved when /tmp/nemoclaw-gateway-local is ABSENT in Docker-driver sandboxes.
Specifically at step 4, the marker file should NOT exist; step 3 should show Status="healthy" with FailingStreak=0.
Actual Result
Step 3 — container Health:
Status: "unhealthy"
FailingStreak: 119 (after start-period elapsed)
Log entries: all exit=1, output=""
Step 4 — marker file IS present (bug):
-rw-r--r-- 1 sandbox sandbox 0 Jun 3 07:27 /tmp/nemoclaw-gateway-local
Step 5 — gateway location:
- Inside container: pgrep finds nothing (exit 1)
- On host: pgrep -a openshell-gateway returns the gateway PID
Docker marks the container (unhealthy) indefinitely. The end state is identical to v0.0.53; the #4503 fix has no visible effect.
Root Cause Analysis
Dockerfile HEALTHCHECK (verified via docker inspect):
port="${NEMOCLAW_DASHBOARD_PORT:-${OPENCLAW_GATEWAY_PORT:-}}"
if [ -z "$port" ]; then port=18789; fi
rc=0
curl -sf --max-time 3 "http://127.0.0.1:${port}/health" > /dev/null 2>&1 || rc=$?
if [ "$rc" = 0 ]; then exit 0; fi # healthy
if [ "$rc" != 7 ]; then exit 1; fi # non-refused HTTP failure
[ -f /tmp/nemoclaw-gateway-local ] || exit 0; # Docker-driver short-circuit
pgrep --ignore-ancestors -f 'openclaw[ -]gateway' > /dev/null 2>&1 || exit 1;
[ -s /tmp/gateway.log ]
scripts/nemoclaw-start.sh:200-218:
# Comment: marker present = gateway in THIS container; absent = gateway on host
# (OpenShell docker-driver sandboxes run it on the host — #4503)
mark_in_container_gateway() {
: >/tmp/nemoclaw-gateway-local 2>/dev/null || true
}
if [ ${#NEMOCLAW_CMD[@]} -eq 0 ]; then
mark_in_container_gateway # ← unconditional, no driver detection
fi
The marker is created any time NEMOCLAW_CMD is empty — which is true for Docker-driver sandboxes (they are not one-shot command containers). No check distinguishes "this container will start the gateway" from "gateway lives on host". The Docker-driver short-circuit in the Dockerfile HEALTHCHECK is therefore unreachable.
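The branch order can be modelled as a small pure function to show why the marker decides the outcome once curl is refused. This is only a sketch, not the shipped HEALTHCHECK: probe_verdict and its four arguments are hypothetical names that stand in for the curl exit code, the marker test, the pgrep result, and the gateway.log size test.

```shell
#!/bin/sh
# Hypothetical model of the HEALTHCHECK branches quoted above.
# Args: curl_rc  marker_present(0|1)  gateway_proc_found(0|1)  log_nonempty(0|1)
# Returns 0 for healthy and 1 for unhealthy, like the probe's exit status.
probe_verdict() {
  curl_rc=$1
  marker=$2
  proc=$3
  log=$4
  if [ "$curl_rc" = 0 ]; then return 0; fi   # /health answered
  if [ "$curl_rc" != 7 ]; then return 1; fi  # failed, but not "connection refused"
  if [ "$marker" != 1 ]; then return 0; fi   # no marker: gateway lives on the host
  if [ "$proc" != 1 ]; then return 1; fi     # marker present, no gateway process
  [ "$log" = 1 ]                             # healthy only if gateway.log is non-empty
}

# Docker-driver sandbox as observed: curl refused, marker present, no process.
probe_verdict 7 1 0 1 && echo healthy || echo unhealthy   # prints "unhealthy"
# Same sandbox with the marker absent, as the design comment intends.
probe_verdict 7 0 0 1 && echo healthy || echo unhealthy   # prints "healthy"
```

With every other input held fixed, only the marker's presence flips the verdict, which matches the observed FailingStreak.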
Suggested Fix
Gate mark_in_container_gateway() on a driver-aware signal in scripts/nemoclaw-start.sh, e.g.:
- Check the OPENSHELL_DRIVER env var (= "docker" → skip marker).
- Reuse the gating logic the script already applies further down to decide whether THIS container will actually launch the gateway, and create the marker only on that path.
Either change keeps the marker semantics consistent with the design comment and lets the Dockerfile HEALTHCHECK short-circuit work as intended.
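A rough sketch of the first option follows. It rests on assumptions: OPENSHELL_DRIVER is the variable proposed above and is not confirmed to be set by OpenShell, maybe_mark_gateway is a hypothetical wrapper name, and NEMOCLAW_GATEWAY_MARKER is a hypothetical path override added only so the function can be exercised outside a sandbox. The real script tests the NEMOCLAW_CMD array directly; here the word count is passed as an argument.

```shell
#!/bin/sh
# Sketch: create the marker only when this container will host the gateway.
# ASSUMPTION: OPENSHELL_DRIVER=docker identifies Docker-driver sandboxes.
# ASSUMPTION: NEMOCLAW_GATEWAY_MARKER is a test-only override of the marker path.

mark_in_container_gateway() {
  marker="${NEMOCLAW_GATEWAY_MARKER:-/tmp/nemoclaw-gateway-local}"
  # Same truncating write as the original function; failures are ignored.
  : >"$marker" 2>/dev/null || true
}

# $1 = number of words in the one-shot command (0 means no one-shot command).
maybe_mark_gateway() {
  cmd_words=$1
  if [ "$cmd_words" -ne 0 ]; then
    return 0   # one-shot command container: existing behavior, no marker
  fi
  if [ "${OPENSHELL_DRIVER:-}" = "docker" ]; then
    return 0   # gateway runs on the host: leave the marker absent
  fi
  mark_in_container_gateway
}
```

Under these assumptions the call site in scripts/nemoclaw-start.sh would become maybe_mark_gateway "${#NEMOCLAW_CMD[@]}". If OpenShell exposes no such variable, the second option (marking only on the path that actually launches the gateway) avoids depending on it.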
Logs
$ docker inspect openshell-lynn-test-470242a0-0a49-4044-a581-4dbe590331c3 \
--format '{{json .State.Health}}' | python3 -m json.tool
{
"Status": "unhealthy",
"FailingStreak": 119,
"Log": [
{"Start": "2026-06-03T08:28:25", "ExitCode": 1, "Output": ""},
{"Start": "2026-06-03T08:28:55", "ExitCode": 1, "Output": ""},
{"Start": "2026-06-03T08:29:25", "ExitCode": 1, "Output": ""},
{"Start": "2026-06-03T08:29:56", "ExitCode": 1, "Output": ""},
{"Start": "2026-06-03T08:30:26", "ExitCode": 1, "Output": ""}
]
}
$ docker exec openshell-lynn-test-... bash -c '
ls -la /tmp/nemoclaw-gateway-local
curl -v --max-time 3 http://127.0.0.1:18789/health 2>&1 | tail -5
pgrep --ignore-ancestors -f "openclaw[ -]gateway"; echo pgrep_exit=$?
ls -la /tmp/gateway.log'
-rw-r--r-- 1 sandbox sandbox 0 Jun 3 07:27 /tmp/nemoclaw-gateway-local
* connect to 127.0.0.1 port 18789 from 127.0.0.1 port 47306 failed: Connection refused
* Failed to connect to 127.0.0.1 port 18789 after 0 ms: Could not connect to server
curl: (7) Failed to connect to 127.0.0.1 port 18789 after 0 ms: Could not connect to server
pgrep_exit=1
-rw-r--r-- 1 sandbox sandbox 5219 Jun 3 07:28 /tmp/gateway.log
Related
NVB#6262928