
[Bug] plugin-runtime-deps lock staleness check uses PID alone, blocks Docker gateway restarts (PID is always 7) #74346

@jhsmith409


Summary

shouldRemoveRuntimeDepsLock (in the bundled plugin-runtime-deps installer) decides a stale lock is "fresh" whenever owner.pid matches a live PID. Inside Docker, the gateway's Node process is always PID 1 (or PID 7 with init: true) in its container PID namespace. Two different incarnations of the gateway share the same PID, so the new process inspects a lock left behind by the previous one, sees its own PID listed as the owner, and treats the lock as live — even though the writer is long gone.

Result: gateway hangs at starting… for the full lock-wait window (5 min) and then keeps retrying. We've seen 13+ minute hangs that only resolve when the operator manually removes ~/.openclaw/plugin-runtime-deps/openclaw-<version>/.openclaw-runtime-deps.lock.

Affected versions

Reproduced on ghcr.io/openclaw/openclaw:2026.4.24 and :2026.4.25-beta.4. Code path is unchanged on current main.

Source

/app/dist/bundled-runtime-deps-BdEAdjwi.js (in the v2026.4.24 dist), corresponding to bundled-runtime-deps.ts:

function shouldRemoveRuntimeDepsLock(owner, nowMs) {
  if (!owner) return true;
  if (typeof owner.pid === "number") return !isAlive(owner.pid);
  return typeof owner.createdAtMs === "number"
    && nowMs - owner.createdAtMs > BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS;
}

The early return short-circuits the time-based fallback: createdAtMs is only consulted when pid is missing. As long as the recorded PID belongs to a live process — which it always does inside the container, because the new gateway itself holds that PID — the time-based stale check never fires.
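The short-circuit is easy to demonstrate in isolation. A minimal sketch, assuming isAlive is the usual process.kill(pid, 0) liveness probe (the actual bundled implementation may differ):

```javascript
const BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS = 5 * 60_000;

// Typical liveness probe: signal 0 checks existence without signalling.
// EPERM means the process exists but belongs to another user.
function isAlive(pid) {
  try { process.kill(pid, 0); return true; }
  catch (err) { return err.code === "EPERM"; }
}

// Mirrors the buggy predicate quoted above.
function shouldRemoveRuntimeDepsLock(owner, nowMs) {
  if (!owner) return true;
  if (typeof owner.pid === "number") return !isAlive(owner.pid); // short-circuit
  return typeof owner.createdAtMs === "number"
    && nowMs - owner.createdAtMs > BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS;
}

// A lock written 10 minutes ago, "owned" by a PID that happens to be live
// (we use our own PID to stand in for the recycled container PID):
const now = Date.now();
const staleOwner = { pid: process.pid, createdAtMs: now - 10 * 60_000 };
console.log(shouldRemoveRuntimeDepsLock(staleOwner, now)); // false: lock kept
```

Despite being well past any reasonable staleness window, the lock is judged fresh because the PID probe wins.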

Reproduction

  1. docker compose up -d openclaw-gateway — gateway starts cleanly, writes ~/.openclaw/plugin-runtime-deps/openclaw-<version>/.openclaw-runtime-deps.lock/owner.json with {"pid": 7, "createdAtMs": <T0>}.
  2. Force-kill or hard-restart the container in a way that prevents Node's normal shutdown cleanup. We hit this via docker compose down && docker compose up -d, but anything that bypasses graceful exit (OOM, container kill, sigkill) reproduces it.
  3. New container starts. The new Node process is also PID 7 inside the container.
  4. bundled-runtime-deps.ts:withBundledRuntimeDepsInstallRootLock calls removeRuntimeDepsLockIfStale(lockDir, nowMs). It reads the leftover owner.json and calls isAlive(7), which returns true (the new process is PID 7).
  5. Lock is not removed. mkdirSync(lockDir) returns EEXIST. Loop spins waiting for the lock until BUNDLED_RUNTIME_DEPS_LOCK_TIMEOUT_MS = 5 * 60_000 elapses, then errors and is retried by the supervisor — the gateway log stays parked at starting… with no further entries.

We have repeatedly worked around this with:

docker compose down openclaw-gateway
rm -rf data/config/plugin-runtime-deps/openclaw-<version>-*/.openclaw-runtime-deps.lock
docker compose up -d openclaw-gateway

and after the lock removal, the gateway boots in ~35 seconds.

Why the bug doesn't surface outside Docker

On a host with a normal PID namespace, the previous Node's PID is gone after exit, isAlive(<old-pid>) returns false, and the lock is removed. The bug is invisible. It only bites in containers where PIDs are recycled deterministically.

Recommended fixes (any one would help)

  1. Always consult createdAtMs, even when pid is set. A lock older than BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS is stale regardless of PID, and a dead owner PID should still remove the lock immediately. Two-line change, replacing the early return:

    if (typeof owner.pid === "number" && !isAlive(owner.pid)) return true;
    return typeof owner.createdAtMs === "number"
      && nowMs - owner.createdAtMs > BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS;
  2. Use process start time alongside the PID (Linux: /proc/<pid>/stat field 22, starttime, in clock ticks since boot). Two PID-7 processes in different container incarnations have different start times, so isAlive(pid) && startTimeMatches(pid, owner.startTime) distinguishes them.

  3. Use flock(2) on a sentinel file instead of the mkdir lock plus owner.json. The kernel releases the lock when the holding process exits (cleanly or not), so stale locks don't persist across container restarts.

  4. Document the workaround in the Docker install docs and have the gateway's startup script rm -rf any lock dir whose owner.json.createdAtMs is older than e.g. 30s before invoking the gateway.

(1) is the smallest change and the lowest risk. (3) is the most architecturally sound but a bigger refactor.

Adjacent context

This isn't the only failure mode involving plugin-runtime-deps — #73520 covers stale cross-version directories causing crash-loops on openclaw update, and #71818 / #71599 covered runtime-deps re-install loops on cold start. This issue is distinct: same version, same installation, just an unsafe staleness predicate that happens to short-circuit on container PID reuse.
