webui container not restarted after Node.js heap OOM due to TCP-only health checks #2172

@junhaoliao

Bug

(This issue was first observed by @goynam - many thanks for raising it during our offline discussions!)

The webui container's Node.js process can hit the V8 heap limit ("FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory") but remain running in a degraded state (GC-thrashing, high CPU, unresponsive to HTTP requests) instead of exiting. Because the health checks in both Docker Compose and Helm only test TCP port connectivity rather than HTTP responsiveness, the container is never marked unhealthy and is never restarted.

Observed behavior:

  • The Node.js process prints V8 OOM errors to stderr but does not exit (PID 1 stays alive).
  • The process enters a GC-thrashing loop (~50% CPU, ~5 GB RSS), unable to serve any HTTP requests.
  • docker inspect reports the container as running and healthy.
  • HTTP requests to the webui time out, making the UI completely unresponsive.
  • The container is never restarted: the restart policy (on-failure) triggers only on process exit, and the healthcheck (< /dev/tcp/webui/4000) passes as long as the TCP port is open.

Expected behavior:

  • Either the webui process should exit on OOM (so that the container restarts), or the health check should detect the unresponsive state and trigger a restart.

Affected configurations:

  1. Docker Compose (tools/deployment/package/docker-compose-all.yaml, lines 403-410):

    healthcheck:
      test: ["CMD", "bash", "-c", "< /dev/tcp/webui/4000"]

    TCP-only check; does not verify the application can serve requests.

  2. Helm (tools/deployment/package-helm/templates/webui-deployment.yaml, lines 96-102):

    readinessProbe:
      tcpSocket:
        port: "webui"
    livenessProbe:
      tcpSocket:
        port: "webui"

    Same TCP-only approach. Kubernetes will not restart the pod since the liveness probe passes even
    when the application is unresponsive.
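
For contrast with the TCP probes above, an HTTP-based variant might look like the sketch below. The `/` path and all timing values are assumptions chosen for illustration, not values from the chart. Any route that the server answers without heavy work would do.

```yaml
# Hypothetical replacement for the probes in webui-deployment.yaml.
# path and timing values are assumptions, not taken from the chart.
readinessProbe:
  httpGet:
    path: "/"          # assumed route; any cheap route served by the webui works
    port: "webui"      # same named port the TCP probes already use
  periodSeconds: 10
  timeoutSeconds: 2    # a GC-thrashing process should fail to answer in time
livenessProbe:
  httpGet:
    path: "/"
    port: "webui"
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3  # restart the container after 3 consecutive failures
```

With `httpGet`, the kubelet treats a timeout or a non-2xx/3xx status as a failure, so a process that still holds the port open but cannot answer requests would eventually fail the liveness probe and be restarted.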

Possible fixes (non-exhaustive):

  • Change healthchecks/probes to use HTTP (e.g., httpGet on a known route in Helm, or curl -f --max-time 2 http://webui:4000/ in Compose).
  • Set --max-old-space-size on the Node.js command to make V8 abort on OOM rather than GC-thrash indefinitely.
  • Set a container memory limit (deploy.resources.limits.memory in Compose, or resources.limits.memory in Helm) so the kernel OOM-killer terminates the process.
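
A minimal Compose sketch combining the three ideas above could look like the following. The heap cap, the memory limit, and the healthcheck timings are assumed values for illustration. It also assumes that `curl` is available inside the webui image and that `NODE_OPTIONS` is an acceptable way to pass the flag to the Node.js process.

```yaml
# Hypothetical excerpt for the webui service in docker-compose-all.yaml.
# All numeric values below are assumptions, not taken from the repository.
services:
  webui:
    environment:
      # Caps the V8 old-generation heap at 2048 MiB (assumed value).
      NODE_OPTIONS: "--max-old-space-size=2048"
    deploy:
      resources:
        limits:
          # Assumed value; kept above the V8 heap cap so the kernel
          # OOM-killer acts only as a backstop.
          memory: "3g"
    healthcheck:
      # HTTP check instead of the TCP-only check; assumes curl is in the image.
      test: ["CMD", "curl", "-f", "--max-time", "2", "http://webui:4000/"]
      interval: 10s
      timeout: 3s
      retries: 3
```

One caveat: outside of Swarm mode, Docker only marks a container as unhealthy and does not restart it on its own. The HTTP healthcheck therefore makes the failure visible in `docker inspect`, but a restart would still depend on the process exiting (via the heap cap or the memory limit) or on an external supervisor.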

CLP version

3b4d13f

Environment

  • Docker Compose (tools/deployment/package/) and Helm (tools/deployment/package-helm/)
  • Observed on Linux 6.8.0-106-generic with Docker

Reproduction steps

  1. Deploy the CLP package using Docker Compose or Helm with default configuration.
  2. Use the webui under a workload that causes memory pressure on the Node.js server process (e.g., large result sets, many concurrent socket connections, or repeated searches that accumulate in-memory state).
  3. Wait until the Node.js process hits the V8 heap limit and prints "FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory" to stderr.
  4. Observe that the container/pod remains in a running/healthy state despite the webui being completely unresponsive to HTTP requests.
  5. Confirm with docker exec <container> ps aux that the Node.js process is still alive, consuming high CPU (GC-thrashing) and ~5 GB RSS.
  6. Confirm with curl --max-time 5 http://<webui-host>:<port>/ that the HTTP endpoint times out.

Labels

bug (Something isn't working)
