Summary
shouldRemoveRuntimeDepsLock (in the bundled plugin-runtime-deps installer) decides a stale lock is "fresh" whenever owner.pid matches a live PID. Inside Docker, the gateway's Node process is always PID 1 (or PID 7 with init: true) in its container PID namespace. Two different incarnations of the gateway share the same PID, so the new process inspects a lock left behind by the previous one, sees its own PID listed as the owner, and treats the lock as live — even though the writer is long gone.
Result: gateway hangs at starting… for the full lock-wait window (5 min) and then keeps retrying. We've seen 13+ minute hangs that only resolve when the operator manually removes ~/.openclaw/plugin-runtime-deps/openclaw-<version>/.openclaw-runtime-deps.lock.
Affected versions
Reproduced on ghcr.io/openclaw/openclaw:2026.4.24 and :2026.4.25-beta.4. Code path is unchanged on current main.
Source
/app/dist/bundled-runtime-deps-BdEAdjwi.js (in the v2026.4.24 dist), corresponding to bundled-runtime-deps.ts:
function shouldRemoveRuntimeDepsLock(owner, nowMs) {
  if (!owner) return true;
  if (typeof owner.pid === "number") return !isAlive(owner.pid);
  return typeof owner.createdAtMs === "number"
    && nowMs - owner.createdAtMs > BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS;
}
The early return on a numeric pid short-circuits the time-based fallback — createdAtMs is only consulted when pid is absent. As long as the recorded PID maps to a live process (which it always does inside the container running the new gateway), the time-based stale check never fires.
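The misfire is easy to reproduce in isolation. Below is a minimal model of the shipped predicate; the constant's value and the isAlive stub are assumptions for illustration, while the predicate body is quoted from the dist. A lock written 13 minutes ago is still treated as live because its recorded PID matches a running process:

```javascript
// Minimal model of the buggy predicate. The constant value and the isAlive
// stub are assumptions for illustration; the predicate body is from the dist.
const BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS = 5 * 60_000; // assumed 5 min window
const isAlive = (pid) => pid === 7; // stub: inside the container, PID 7 is always live

function shouldRemoveRuntimeDepsLock(owner, nowMs) {
  if (!owner) return true;
  if (typeof owner.pid === "number") return !isAlive(owner.pid);
  return typeof owner.createdAtMs === "number"
    && nowMs - owner.createdAtMs > BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS;
}

// Lock written 13 minutes ago by the previous container, which was also PID 7:
const staleOwner = { pid: 7, createdAtMs: Date.now() - 13 * 60_000 };
console.log(shouldRemoveRuntimeDepsLock(staleOwner, Date.now())); // false: lock kept
```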
Reproduction
1. docker compose up -d openclaw-gateway — gateway starts cleanly and writes ~/.openclaw/plugin-runtime-deps/openclaw-<version>/.openclaw-runtime-deps.lock/owner.json with {"pid": 7, "createdAtMs": <T0>}.
2. Force-kill or hard-restart the container in a way that prevents Node's normal shutdown cleanup. We hit this via docker compose down && docker compose up -d, but anything that bypasses graceful exit (OOM, container kill, SIGKILL) reproduces it.
3. New container starts. The new Node process is also PID 7 inside the container.
4. bundled-runtime-deps.ts:withBundledRuntimeDepsInstallRootLock calls removeRuntimeDepsLockIfStale(lockDir, nowMs), which reads the leftover owner.json and calls isAlive(7) → true (the new process is PID 7). The lock is not removed.
5. mkdirSync(lockDir) throws EEXIST. The wait loop spins until BUNDLED_RUNTIME_DEPS_LOCK_TIMEOUT_MS = 5 * 60_000 elapses, then errors out and is retried by the supervisor — the gateway log stays parked at starting… with no further entries.
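The body of isAlive isn't shown in the dist excerpt; the sketch below uses the common Node idiom (an assumption on our part, not a quote from the codebase). A zero-signal kill only answers "does some process currently own this PID?", which is exactly why PID reuse defeats it:

```javascript
// Hypothetical isAlive; the usual Node idiom, assumed here. Signal 0 performs
// an existence check without delivering anything, so it answers only "does
// some process own this PID right now?"; in a container, that is trivially
// true for a reused PID.
function isAlive(pid) {
  try {
    process.kill(pid, 0); // no signal sent; throws if the PID doesn't exist
    return true;
  } catch (err) {
    return err.code === "EPERM"; // process exists but is owned by another user
  }
}

console.log(isAlive(process.pid)); // true: the current process "proves" the old lock live
```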
We have repeatedly worked around this with:
docker compose down openclaw-gateway
rm -rf data/config/plugin-runtime-deps/openclaw-<version>-*/.openclaw-runtime-deps.lock
docker compose up -d openclaw-gateway
and after the lock removal, gateway boots in ~35 seconds.
Why the bug doesn't surface outside Docker
On a host with a normal PID namespace, the previous Node's PID is gone after exit, isAlive(<old-pid>) returns false, and the lock is removed. The bug is invisible. It only bites in containers where PIDs are recycled deterministically.
Recommended fixes (any one would help)
1. Always consult createdAtMs even when pid is set. A lock older than BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS is stale regardless of PID, and a lock with a dead owner should still be removed immediately. Small change to the body:

   if (typeof owner.pid === "number" && !isAlive(owner.pid)) return true; // dead owner
   return typeof owner.createdAtMs === "number"
     && nowMs - owner.createdAtMs > BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS; // a "live" PID may be recycled
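A quick sanity check of fix (1), one way it might be written: the stale window is assumed at 5 minutes and isAlive is stubbed to always report live, modeling the container. A lock past the window is now removed even though its PID looks alive, while a genuinely fresh lock is still respected:

```javascript
// Sketch of an amended predicate per fix (1): a dead owner is removed at once,
// and a "live" PID no longer bypasses the age check. Constant and stub are
// assumptions for illustration.
const BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS = 5 * 60_000;
const isAlive = () => true; // model the container: the recorded PID always looks alive

function shouldRemoveRuntimeDepsLock(owner, nowMs) {
  if (!owner) return true;
  if (typeof owner.pid === "number" && !isAlive(owner.pid)) return true; // dead owner
  return typeof owner.createdAtMs === "number"
    && nowMs - owner.createdAtMs > BUNDLED_RUNTIME_DEPS_LOCK_STALE_MS; // PID may be recycled
}

const now = Date.now();
console.log(shouldRemoveRuntimeDepsLock({ pid: 7, createdAtMs: now - 13 * 60_000 }, now)); // true: removed
console.log(shouldRemoveRuntimeDepsLock({ pid: 7, createdAtMs: now - 10_000 }, now)); // false: fresh
```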
2. Use process start time alongside the PID (Linux: /proc/<pid>/stat field 22, starttime, in clock ticks since boot). Two PID-7 processes in different container incarnations have different start times, so isAlive(pid) && startTimeMatches(pid, owner.startTime) distinguishes them.
3. Use flock(2) on a sentinel file instead of the mkdir lock plus owner JSON. The kernel releases the lock when the holding process exits, cleanly or not, so stale locks cannot persist across container restarts.
4. Document the workaround in the Docker install docs, and have the gateway's startup script rm -rf any lock dir whose owner.json createdAtMs is older than e.g. 30 s before invoking the gateway.
(1) is the smallest change and the lowest risk. (3) is the most architecturally sound but a bigger refactor.
Adjacent context
This isn't the only failure mode involving plugin-runtime-deps — #73520 covers stale cross-version directories causing crash-loops on openclaw update, and #71818 / #71599 covered runtime-deps re-install loops on cold start. This issue is distinct: same version, same installation, just an unsafe staleness predicate that happens to short-circuit on container PID reuse.