
[Bug] plugin-runtime-deps lock staleness check uses PID alone, blocks Docker gateway restarts (PID is always 7) #74346

@jhsmith409


Summary

shouldRemoveRuntimeDepsLock (in the bundled plugin-runtime-deps installer) decides a stale lock is "fresh" whenever owner.pid matches a live PID. Inside Docker, the gateway's Node process is always PID 1 (or PID 7 with init: true) in its container PID namespace. Two different incarnations of the gateway share the same PID, so the new process inspects a lock left behind by the previous one, sees its own PID listed as the owner, and treats the lock as live — even though the writer is long gone.

Result: gateway hangs at starting… for the full lock-wait window (5 min) and then keeps retrying. We've seen 13+ minute hangs that only resolve when the operator manually removes ~/.openclaw/plugin-runtime-deps/openclaw-<version>/.openclaw-runtime-deps.lock.

Affected versions

Reproduced on ghcr.io/openclaw/openclaw:2026.4.24 and :2026.4.25-beta.4. Code path is unchanged on current main.

Source

/app/dist/bundled-runtime-deps-BdEAdjwi.js (in the v2026.4.24 dist), corresponding to bundled-runtime-deps.ts:

function shouldRemoveRuntimeDepsLock(owner, nowMs) {
  if (!owner) return true;
  if (typeof owner.pid === "number") return !isAlive(owner.pid);
  return typeof owner.createdAtMs === "number"
    && nowMs - owner.createdAtMs > BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS;
}

The early return short-circuits the time-based fallback: createdAtMs is only consulted when pid is missing. As long as the recorded PID belongs to a live process — which it always does inside the container, because the new gateway itself holds that PID — the time-based stale check never fires.
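The short-circuit is easy to demonstrate in isolation. A minimal sketch, assuming isAlive is the usual process.kill(pid, 0) liveness probe (the actual bundled implementation may differ):

```javascript
const BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS = 5 * 60_000;

// Typical liveness probe: signal 0 checks existence without signalling.
// EPERM means the process exists but belongs to another user.
function isAlive(pid) {
  try { process.kill(pid, 0); return true; }
  catch (err) { return err.code === "EPERM"; }
}

// Mirrors the buggy predicate quoted above.
function shouldRemoveRuntimeDepsLock(owner, nowMs) {
  if (!owner) return true;
  if (typeof owner.pid === "number") return !isAlive(owner.pid); // short-circuit
  return typeof owner.createdAtMs === "number"
    && nowMs - owner.createdAtMs > BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS;
}

// A lock written 10 minutes ago, "owned" by a PID that happens to be live
// (we use our own PID to stand in for the recycled container PID):
const now = Date.now();
const staleOwner = { pid: process.pid, createdAtMs: now - 10 * 60_000 };
console.log(shouldRemoveRuntimeDepsLock(staleOwner, now)); // false: lock kept
```

Despite being well past any reasonable staleness window, the lock is judged fresh because the PID probe wins.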

Reproduction

  1. docker compose up -d openclaw-gateway — gateway starts cleanly, writes ~/.openclaw/plugin-runtime-deps/openclaw-<version>/.openclaw-runtime-deps.lock/owner.json with {"pid": 7, "createdAtMs": <T0>}.
  2. Force-kill or hard-restart the container in a way that prevents Node's normal shutdown cleanup. We hit this via docker compose down && docker compose up -d, but anything that bypasses graceful exit (OOM, container kill, sigkill) reproduces it.
  3. New container starts. The new Node process is also PID 7 inside the container.
  4. bundled-runtime-deps.ts:withBundledRuntimeDepsInstallRootLock calls removeRuntimeDepsLockIfStale(lockDir, nowMs). It reads the leftover owner.json and calls isAlive(7), which returns true (the new process is PID 7).
  5. Lock is not removed. mkdirSync(lockDir) returns EEXIST. Loop spins waiting for the lock until BUNDLED_RUNTIME_DEPS_LOCK_TIMEOUT_MS = 5 * 60_000 elapses, then errors and is retried by the supervisor — the gateway log stays parked at starting… with no further entries.

We have repeatedly worked around this with:

docker compose down openclaw-gateway
rm -rf data/config/plugin-runtime-deps/openclaw-<version>-*/.openclaw-runtime-deps.lock
docker compose up -d openclaw-gateway

and after the lock removal, the gateway boots in ~35 seconds.

Why the bug doesn't surface outside Docker

On a host with a normal PID namespace, the previous Node's PID is gone after exit, isAlive(<old-pid>) returns false, and the lock is removed. The bug is invisible. It only bites in containers where PIDs are recycled deterministically.

Recommended fixes (any one would help)

  1. Always consult createdAtMs, even when pid is set. A lock older than BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS is stale regardless of PID, and a dead owner PID should still remove the lock immediately. Two-line change, replacing the early return:

    if (typeof owner.pid === "number" && !isAlive(owner.pid)) return true;
    return typeof owner.createdAtMs === "number"
      && nowMs - owner.createdAtMs > BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS;
  2. Use process start time alongside the PID (Linux: /proc/<pid>/stat field 22, starttime, in clock ticks since boot). Two PID-7 processes in different container incarnations have different start times, so isAlive(pid) && startTimeMatches(pid, owner.startTime) distinguishes them.

  3. Use flock(2) on a sentinel file instead of the mkdir lock plus owner.json. The kernel releases the lock when the holding process exits (cleanly or not), so stale locks don't persist across container restarts.

  4. Document the workaround in the Docker install docs and have the gateway's startup script rm -rf any lock dir whose owner.json.createdAtMs is older than e.g. 30s before invoking the gateway.

(1) is the smallest change and the lowest risk. (3) is the most architecturally sound but a bigger refactor.

Adjacent context

This isn't the only failure mode involving plugin-runtime-deps — #73520 covers stale cross-version directories causing crash-loops on openclaw update, and #71818 / #71599 covered runtime-deps re-install loops on cold start. This issue is distinct: same version, same installation, just an unsafe staleness predicate that happens to short-circuit on container PID reuse.
