[Ubuntu 24.04][Sandbox][GitHub Issue #4503] sandbox Docker HEALTHCHECK always (unhealthy) — pgrep openclaw-gateway runs inside container but gateway runs on host #4503

@hulynn

Description

In Docker-driver mode, the sandbox container's HEALTHCHECK runs pgrep --ignore-ancestors -f 'openclaw[ -]gateway' inside the container to verify that the gateway process is alive. In this mode, however, the gateway process (openshell-gateway) is launched by the OpenShell CLI and runs on the host, not inside the sandbox container. The pgrep probe therefore always matches nothing and exits 1, so the HEALTHCHECK never transitions out of (unhealthy).
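As a small illustration of which command lines the probe's pattern accepts, the sketch below applies the same extended regular expression with grep -E to a few sample strings. The helper name matches_probe_pattern and the sample strings are hypothetical; only the pattern and the process name openshell-gateway come from this report.

```shell
#!/bin/sh
# pgrep -f matches an extended regular expression against each process's
# full command line. grep -E uses the same regex flavor, so it can show
# which command lines the HEALTHCHECK pattern would accept.
# matches_probe_pattern is a hypothetical helper, not part of NemoClaw.
matches_probe_pattern() {
  if printf '%s\n' "$1" | grep -Eq 'openclaw[ -]gateway'; then
    echo match
  else
    echo no-match
  fi
}

# Sample command lines (made up for illustration).
matches_probe_pattern 'openclaw gateway'
matches_probe_pattern 'openclaw-gateway'
matches_probe_pattern 'openshell-gateway'
```

The first two calls print match and the third prints no-match: the pattern requires the literal text openclaw followed by a space or a hyphen and then gateway, so a command line containing only openshell-gateway is not accepted by it.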

The container is marked (unhealthy) from the very first check (30s after start) and stays unhealthy indefinitely. NemoClaw itself correctly reports Phase: Ready because it queries OpenShell state rather than Docker health, but any tooling or monitoring that inspects docker ps or the Docker API will show the container as permanently degraded.

Environment

Device:        KVM VM (10.57.211.27, x86_64)
OS:            Ubuntu 24.04.4 LTS (kernel 6.17.0-23)
Docker:        29.5.2
OpenShell CLI: 0.0.44 (docker)
NemoClaw:      v0.0.53-9-gea10007
OpenClaw:      2026.5.22
Gateway mode:  Docker driver (gateway runs as host process)

Steps to Reproduce

  1. On Ubuntu 24.04 (Docker mode), onboard a fresh sandbox:
    NEMOCLAW_NON_INTERACTIVE=1 NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 \
    NVIDIA_API_KEY=<key> nemoclaw onboard --fresh --name my-assistant
  2. Wait ~90 seconds for the initial healthcheck window to expire.
  3. Check Docker container health:
    docker ps -a --format "{{.Names}}: {{.Status}}" | grep openshell
  4. Inspect healthcheck logs:
    docker inspect <container> --format "{{json .State.Health.Log}}" | python3 -m json.tool
  5. Manually run the probe inside the container:
    docker exec <container> bash -c 'pgrep --ignore-ancestors -f "openclaw[ -]gateway"; echo $?'
  6. Check that the gateway is actually running on the host:
    pgrep -a openshell-gateway   # non-empty on host
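For repeated runs, the check in step 3 can be scripted. The sketch below is a hypothetical helper (health_from_status is not part of any tool named in this report) that classifies the Status text printed by docker ps, relying on the (healthy), (unhealthy), and (health: starting) suffixes Docker appends when an image defines a HEALTHCHECK.

```shell
#!/bin/sh
# health_from_status is a hypothetical helper: it maps one line of
# `docker ps` Status text to a single health word.
health_from_status() {
  case "$1" in
    *"(unhealthy)"*)        echo unhealthy ;;
    *"(healthy)"*)          echo healthy ;;
    *"(health: starting)"*) echo starting ;;
    *)                      echo none ;;   # no HEALTHCHECK or not running
  esac
}

# Example (requires a running Docker daemon):
#   docker ps -a --format '{{.Names}}: {{.Status}}' | grep openshell |
#     while IFS= read -r line; do
#       printf '%s => %s\n' "$line" "$(health_from_status "$line")"
#     done
```

With the status line reported under Actual Result, the helper prints unhealthy.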

Expected Result

The Docker HEALTHCHECK should pass when the sandbox is functional. In Docker mode the probe should check a service that runs inside the container (e.g. the OpenClaw gateway HTTP endpoint at http://127.0.0.1:<port>/health inside the container's network namespace), or be disabled entirely for Docker-driver deployments where the gateway is a host-side process.

Actual Result

openshell-my-assistant-dfd9ebfe-...: Up 5 minutes (unhealthy)

Healthcheck log — all 5 entries:
  {"ExitCode":1,"Output":""}  (repeated every 30s)

docker exec <container> pgrep -f "openclaw[ -]gateway"
→ (empty — gateway not in container, exits 1)

pgrep -a openshell-gateway  → PID present on host

nemoclaw my-assistant status → Phase: Ready   ✓
docker ps                   → (unhealthy)     ✗
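To summarize the healthcheck log without reading the JSON by hand, a sketch along the following lines could count failing entries. count_failed_checks is a hypothetical helper, and the docker inspect template in the comment is one possible way to emit one exit code per line; <container> remains a placeholder.

```shell
#!/bin/sh
# count_failed_checks is a hypothetical helper. It reads one healthcheck
# exit code per line on stdin, skips blank lines, and prints
# "<failed>/<total>".
count_failed_checks() {
  awk 'NF { total++; if ($1 != 0) failed++ }
       END { printf "%d/%d\n", failed, total }'
}

# Example (requires Docker; <container> is a placeholder):
#   docker inspect <container> \
#     --format '{{range .State.Health.Log}}{{println .ExitCode}}{{end}}' |
#     count_failed_checks
```

For the five log entries shown above, each with ExitCode 1, the helper would print 5/5.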

Suggested Fix

In the Dockerfile HEALTHCHECK (or the container startup script that configures it), detect or parameterize the gateway mode:

  • Docker-driver mode: probe the in-container OpenClaw HTTP endpoint directly (e.g. curl -sf http://127.0.0.1:18789/health) rather than looking for an external process. The endpoint is reachable inside the container's network namespace because the port is forwarded.
  • Alternatively, skip the pgrep branch in Docker mode and rely solely on the /tmp/gateway.log non-empty check plus the HTTP probe.

The shouldUseContainerizedGateway flag in src/lib/onboard/docker-driver-gateway-launch.ts already encodes the Docker-vs-k3s distinction at runtime; the same condition should drive which HEALTHCHECK variant is embedded in the image.
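One way to read this suggestion is a single probe script that picks its check from a mode value set at image build or container start. The sketch below is only an illustration under stated assumptions: the mode names http and process, the helper select_probe, and the GATEWAY_PROBE_MODE variable are hypothetical, and port 18789 and the /health path are taken from the example in this report rather than verified against the image.

```shell
#!/bin/sh
# select_probe is a hypothetical helper that prints the probe command
# for a given mode. Mode names and the default port are assumptions.
select_probe() {
  mode="${1:-process}"
  port="${2:-18789}"
  case "$mode" in
    http)
      # Docker-driver mode: probe the in-container HTTP endpoint.
      echo "curl -sf --max-time 5 http://127.0.0.1:${port}/health" ;;
    process)
      # Existing behavior: look for the gateway process by command line.
      echo "pgrep --ignore-ancestors -f 'openclaw[ -]gateway'" ;;
    *)
      echo "unknown probe mode: $mode" >&2
      return 2 ;;
  esac
}

# Example wiring inside a healthcheck script (not executed here):
#   cmd="$(select_probe "${GATEWAY_PROBE_MODE:-process}")" || exit 1
#   sh -c "$cmd" >/dev/null || exit 1
```

Keeping the selection in one function means the image ships a single HEALTHCHECK entry point, and whatever sets the mode (for example, the same condition that drives shouldUseContainerizedGateway) only has to pass one value. The wiring exits with 1 on any failure because Docker treats exit code 1 as unhealthy and reserves exit code 2.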


NVB#6240502

Metadata

Labels

  • NV QA (Bugs found by the NVIDIA QA Team)
  • area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery)
  • platform: container (Affects Docker, containerd, Podman, or images)
  • platform: ubuntu (Affects Ubuntu Linux environments)
