
fix(plugins): disambiguate runtime-deps lock owners by process start-time (Docker PID reuse)#74361

Merged
steipete merged 3 commits into openclaw:main from jhsmith409:fix/runtime-deps-lock-startTime on Apr 29, 2026

Conversation

@jhsmith409
Contributor

Summary

shouldRemoveRuntimeDepsLock short-circuits on isAlive(owner.pid) alone, which is unsafe inside containers because PIDs are recycled deterministically — the new gateway is always PID 1 (or PID 7 with init: true) in its container PID namespace, so a stale lock left behind by a previous incarnation looks "live" to the new one and never gets reclaimed.

This PR captures pidStartTimeMs at lock-acquisition time and consults it in the staleness check. When both sides have start-time evidence and they disagree, the lock is treated as stale. When evidence is missing on either side (legacy locks, non-Linux hosts), the existing PID-alive-means-fresh behavior is preserved exactly.

Closes #74346.

The bug

Setup: OpenClaw on Docker (ghcr.io/openclaw/openclaw:2026.4.24+), Linux host. Reproduced both with docker compose down/up and with a hard-killed gateway (SIGKILL, OOM kill, container kill).

  1. Old gateway dies without graceful cleanup — .openclaw-runtime-deps.lock/owner.json is left behind with {"pid": 7, "createdAtMs": <T0>}.
  2. New container starts. The new Node process becomes PID 7 in its namespace.
  3. Lock-acquisition path calls removeRuntimeDepsLockIfStale(lockDir, nowMs), which calls shouldRemoveRuntimeDepsLock(owner, nowMs). Owner has pid: 7. isAlive(7) returns true (the new process is PID 7). The function returns false — lock stays.
  4. mkdirSync(lockDir) returns EEXIST. The wait loop spins until BUNDLED_RUNTIME_DEPS_LOCK_TIMEOUT_MS (5 min) elapses, then errors out and is restarted by the supervisor. Cycle repeats indefinitely.

Operators have been working around this by stopping the container and manually removing .openclaw-runtime-deps.lock. After lock removal the gateway boots in ~35 seconds.
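The failing predicate in step 3 above can be sketched as follows (a minimal hypothetical sketch using the names from this PR, not the project's actual code):

```ts
// Simplified sketch of the legacy staleness predicate. Inside a container
// the new gateway reuses the dead owner's PID (e.g. always PID 7 with
// init: true), so isAlive(owner.pid) is true and the lock never expires.
type LockOwner = { pid: number; createdAtMs: number };

function isAlive(pid: number): boolean {
  try {
    process.kill(pid, 0); // signal 0: existence probe, no signal delivered
    return true;
  } catch {
    return false;
  }
}

function shouldRemoveRuntimeDepsLock(owner: LockOwner): boolean {
  // Legacy rule: a live PID means a fresh lock, regardless of age.
  return !isAlive(owner.pid);
}
```

With a recycled PID this returns false on every check, which is exactly the indefinite hang described above.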

The fix

Strictly-additive change to the lock owner record and the staleness predicate:

  1. New pidStartTimeMs field on the lock owner record. Computed once at module load as Date.now() - process.uptime() * 1000 and persisted alongside pid / createdAtMs in owner.json.
  2. New readProcessStartTimeMs(pid) helper that returns the live PID's start-time in epoch ms on Linux (parses /proc/<pid>/stat field 22, anchored on the last ) to handle process names with spaces/parens), or null elsewhere. Hard-coded 100 for _SC_CLK_TCK since the constant cancels out — both ends of the comparison use the same conversion.
  3. shouldRemoveRuntimeDepsLock extended: when owner.pidStartTimeMs is set AND readStartTimeMs(owner.pid) returns a value AND the two disagree, the lock is stale. When evidence is missing on either side, the existing PID-alive-means-fresh path is taken unchanged.
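The read side in item 2 might look roughly like this on Linux (a hedged sketch: the helper name and formula follow the description above, and the hard-coded 100 ticks/s reflects the stated _SC_CLK_TCK assumption, not a measured value):

```ts
import { readFileSync } from "node:fs";
import os from "node:os";

// Sketch of readProcessStartTimeMs as described above. /proc/<pid>/stat
// field 22 is starttime in clock ticks since boot; field 2 (comm) may
// contain spaces or parentheses, so parsing anchors on the LAST ')'.
// Assumes 100 ticks/s (_SC_CLK_TCK); the constant cancels out because
// both sides of the comparison use the same conversion.
function readProcessStartTimeMs(pid: number): number | null {
  try {
    const stat = readFileSync(`/proc/${pid}/stat`, "utf8");
    // Everything after "<pid> (<comm>) " starts at field 3 (state), so
    // field 22 (starttime) lands at index 19.
    const fields = stat.slice(stat.lastIndexOf(")") + 2).split(" ");
    const starttimeTicks = Number(fields[19]);
    if (!Number.isFinite(starttimeTicks)) return null;
    // Epoch ms = now - uptime + ticks-since-boot at 10 ms per tick.
    return Math.round(Date.now() - os.uptime() * 1000 + starttimeTicks * 10);
  } catch {
    return null; // no /proc (macOS, Windows): caller takes the legacy path
  }
}
```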

Why strictly additive

  • Wire format: lock files written by older releases lack pidStartTimeMs. They continue to take the legacy path. No migration step needed.
  • Predicate: when start-time evidence cannot be confirmed (legacy lock, non-Linux host), behavior is identical to today's code. The pre-existing test "does not expire active runtime-deps install locks by age alone" continues to pass.
  • Platforms: macOS / Windows hosts return null from readProcessStartTimeMs (no /proc) and take the legacy path. Linux hosts (including Docker Desktop's Linux VM, which all Linux containers run inside on macOS / Windows) get the disambiguation.
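Taken together, the extended predicate reduces to something like the following sketch (the collaborators are injected as parameters here for testability; the actual module reads them from scope):

```ts
// Extended predicate: a start-time mismatch marks the lock stale; any
// missing evidence (legacy lock, non-Linux host) falls through to the
// legacy PID-alive check unchanged.
type LockOwner = { pid: number; createdAtMs: number; pidStartTimeMs?: number };

function shouldRemoveRuntimeDepsLock(
  owner: LockOwner,
  isAlive: (pid: number) => boolean,
  readStartTimeMs: (pid: number) => number | null,
): boolean {
  if (typeof owner.pidStartTimeMs === "number") {
    const liveStartTimeMs = readStartTimeMs(owner.pid);
    // Both sides have evidence and they disagree: the PID was recycled.
    if (liveStartTimeMs !== null && liveStartTimeMs !== owner.pidStartTimeMs) {
      return true;
    }
  }
  // Legacy path: a live PID means a fresh lock.
  return !isAlive(owner.pid);
}
```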

Tests

src/plugins/bundled-runtime-deps.test.ts adds four new cases:

  • matching start-time → lock is fresh (same incarnation re-checking its own lock).
  • mismatched start-time, PID alive → lock is stale (the Docker PID-reuse case this PR targets).
  • PID alive, no live start-time available → fall back to legacy behavior. Asserts that even with pidStartTimeMs set on the owner, we don't stomp the lock when the live side cannot be verified.
  • (existing test) "does not expire active runtime-deps install locks by age alone" → still passes unchanged, because the legacy path is preserved exactly when evidence is missing.

All 94 tests in bundled-runtime-deps.test.ts pass locally.

Verified

  • pnpm vitest run --reporter verbose --no-coverage src/plugins/bundled-runtime-deps.test.ts — 94/94 pass.
  • pnpm tsc --noEmit -p tsconfig.json — no new errors related to this file.

Notes for review

  • The capture timing (Date.now() - process.uptime() * 1000) drifts a few ms relative to /proc's starttime jiffies. That's fine because the comparison only runs against the same PID's start-time read from /proc later — equality holds within rounding tolerance and we use Math.round on both sides.
  • I deliberately kept this as a "fix the predicate" change rather than a flock(2) rewrite. A flock-based design (kernel-released on process exit, no staleness logic needed) would be cleaner architecturally but is a much larger change and would require picking a Node lockfile library. Happy to follow up with an RFC if you'd like to go that direction; this PR addresses the immediate Docker hang while preserving every existing invariant.

@greptile-apps
Contributor

greptile-apps Bot commented Apr 29, 2026

Greptile Summary

This PR fixes the Docker PID-reuse hang in shouldRemoveRuntimeDepsLock by capturing a pidStartTimeMs at lock-acquisition time and comparing it to the live PID's start time when evaluating staleness. The approach is well-reasoned and additive, but there is one P1 correctness issue:

  • False-positive stale detection: CURRENT_PROCESS_START_TIME_MS is stored as Math.round(Date.now() - process.uptime() * 1000) (millisecond precision), while readProcessStartTimeMs derives the start time from /proc/<pid>/stat jiffies (10 ms kernel resolution). For any process that starts mid-jiffy the two values differ by up to ~10 ms, so the strict !== at line 606 will fire on a perfectly live lock. The fix is to derive CURRENT_PROCESS_START_TIME_MS using the same jiffy-based formula so both sides are quantized identically.

Confidence Score: 3/5

Not safe to merge as-is — the jiffy-vs-high-res mismatch can produce false-positive stale verdicts that stomp live locks.

A single P1 bug: the strict equality comparison between a high-res stored timestamp and a jiffy-quantized live read will false-positive for any process that started mid-jiffy, which is the common case. This can cause a healthy process's lock to be deleted by a concurrent waiter, breaking the mutual-exclusion guarantee the lock provides.

src/plugins/bundled-runtime-deps.ts — specifically the CURRENT_PROCESS_START_TIME_MS definition (line 462) and its use as pidStartTimeMs (line 698), and the comparison at line 606.



Comment thread on src/plugins/bundled-runtime-deps.ts (Outdated)
// before.
if (typeof owner.pidStartTimeMs === "number") {
const liveStartTimeMs = readStartTimeMs(owner.pid);
if (liveStartTimeMs !== null && liveStartTimeMs !== owner.pidStartTimeMs) {
Contributor


P1 Strict !== comparison between high-res and jiffy-quantized start times produces false positives

owner.pidStartTimeMs is written as Math.round(Date.now() - process.uptime() * 1000) (millisecond precision), while readStartTimeMs computes Math.round(Date.now() - os.uptime() * 1000 + starttimeJiffies * 10) where starttimeJiffies is truncated to 10 ms resolution by the kernel. For a live lock owner these two formulas can disagree by up to ~10 ms (the start time falls in the middle of a jiffy), so the strict !== can fire on a completely valid, live lock — causing an unrelated process to stomp it.

Concrete example: process A starts 1000.005 s after boot → CURRENT_PROCESS_START_TIME_MS ≈ T_boot + 1_000_005 ms, but readProcessStartTimeMs(A.pid) resolves starttimeJiffies = 100000 → returns T_boot + 1_000_000 ms. The 5 ms difference triggers a false-positive eviction.

The simplest fix is to derive CURRENT_PROCESS_START_TIME_MS from the same jiffy-based formula so both sides of the comparison are quantized identically:

// Use the same jiffy-quantized formula as readProcessStartTimeMs so that
// the stored value and the live read are always comparable.
const CURRENT_PROCESS_START_TIME_MS = readProcessStartTimeMs(process.pid);

Then narrow the pidStartTimeMs field type to number | null and skip writing it (or write null) when readProcessStartTimeMs returns null (non-Linux). The staleness check already guards typeof owner.pidStartTimeMs === "number", so null will fall through to legacy behavior unchanged.


@clawsweeper
Contributor

clawsweeper Bot commented Apr 29, 2026

Codex review: needs maintainer review before merge.

What this changes:

This PR updates bundled runtime-deps lock ownership to persist a process start-time marker, treats live-PID locks as stale when the live PID has a different start time, and adds regression tests for matching, mismatched, and missing start-time evidence cases.

Maintainer follow-up before merge:

This is an open contributor implementation PR for a confirmed Docker startup bug; the next action is maintainer review and CI, especially deciding whether to adjust the PR to reuse the existing starttime helper/units before merge.

Best possible solution:

Land a narrow runtime-deps lock hardening fix, either in this PR or an equivalent replacement, that records a comparable process-start identity, treats a live PID with mismatched identity as stale in Docker, preserves legacy and non-Linux lock behavior, and adds focused regression coverage around active locks, PID reuse, and missing start-time evidence.

Acceptance criteria:

  • pnpm test src/plugins/bundled-runtime-deps.test.ts
  • pnpm exec oxfmt --check --threads=1 src/plugins/bundled-runtime-deps.ts src/plugins/bundled-runtime-deps.test.ts
  • pnpm check:changed in Testbox before merge

What I checked:

  • Current main still short-circuits on PID liveness: shouldRemoveRuntimeDepsLock() returns !isAlive(owner.pid) as soon as owner.pid is present, so createdAtMs and lock age are ignored when the same numeric PID is alive in a new Docker process. (src/plugins/bundled-runtime-deps.ts:526, aaa194c58b5a)
  • Current lock owner has no start-time identity: The synchronous filesystem lock writer stores only pid and createdAtMs in owner.json; the async writer has the same payload shape. (src/plugins/bundled-runtime-deps.ts:615, aaa194c58b5a)
  • Existing tests preserve the live-PID behavior but do not cover PID reuse: The current test suite asserts that an old lock with a live PID is not expired by age alone, then covers dead-PID and ownerless cases; there is no current main test for live PID plus mismatched process start identity. (src/plugins/bundled-runtime-deps.test.ts:2097, aaa194c58b5a)
  • PR diff is focused on the relevant code and tests: The PR patch modifies only src/plugins/bundled-runtime-deps.ts and src/plugins/bundled-runtime-deps.test.ts, adding pidStartTimeMs, a Linux /proc/<pid>/stat reader, predicate checks for mismatched start time, and focused tests. (src/plugins/bundled-runtime-deps.ts:456, 231c7f5107d8)
  • Existing shared PID-reuse helper and session-lock precedent: src/shared/pid-alive.ts already exposes getProcessStartTime(pid) using /proc/<pid>/stat field 22, and session write locks already compare stored starttime to detect recycled-pid; the PR should be reviewed against that existing contract. (src/shared/pid-alive.ts:39, aaa194c58b5a)
  • Docker setup matches the reported persistence/PID-reuse surface: The bundled Compose file mounts openclaw-plugin-runtime-deps at /var/lib/openclaw/plugin-runtime-deps and enables init: true, matching a persistent runtime-deps lock directory across container replacement with stable low gateway PIDs. (docker-compose.yml:26, aaa194c58b5a)

Likely related people:

  • steipete: Recent GitHub history for src/plugins/bundled-runtime-deps.ts shows repeated runtime-deps and plugin mirror maintenance, including cache, pruning, staging, and repair work in the same surface. (role: recent adjacent maintainer; confidence: high; commits: 52a7e2264c92, b4ffef5c5fd8, f71f5bc5866f; files: src/plugins/bundled-runtime-deps.ts, src/plugins/bundled-runtime-deps.test.ts)
  • vincentkoc: The runtime-deps staging history and the existing recycled-PID/session-lock hardening both point to this area; local blame also attributes the current bundled-runtime-deps and shared PID helper baseline to the available grafted history. (role: introduced/adjacent owner; confidence: medium; commits: d49ebe7bdee3, 5a2200b28003, c357235fe6e6; files: src/plugins/bundled-runtime-deps.ts, src/shared/pid-alive.ts, src/agents/session-write-lock.ts)
  • masatohoshino: Added the adjacent dead-PID stale runtime-deps lock regression coverage, which sits next to the missing Docker PID-reuse cases this PR adds. (role: recent regression-test contributor; confidence: medium; commits: 016f5ae862e1; files: src/plugins/bundled-runtime-deps.test.ts)

Remaining risk / open question:

  • The PR introduces a new epoch-millisecond /proc helper instead of reusing the existing tick-based getProcessStartTime(pid) helper; exact equality between Date.now() - process.uptime() and a /proc/os.uptime()-derived epoch should be reviewed so active same-process locks are not misclassified.
  • The PR's local test claims were not independently rerun in this read-only sweep; maintainer CI/Testbox should validate the branch before merge.

Codex review notes: model gpt-5.5, reasoning high; reviewed against aaa194c58b5a.

Jim Smith and others added 3 commits April 29, 2026 16:43
…time

`shouldRemoveRuntimeDepsLock` previously trusted `isAlive(owner.pid)`
alone when deciding whether a lock could be reclaimed. That works fine
on a normal host: when the writer dies the PID is gone and `isAlive`
returns false. Inside Docker it does not — every Node gateway process
runs as PID 1 (or PID 7 with `init: true`) in its container PID
namespace, so a stale lock left behind by a previous incarnation looks
"alive" to the new one. The 5-minute lock-wait timeout then fires and
the supervisor restarts, and the cycle repeats indefinitely. Operators
have to manually remove `.openclaw-runtime-deps.lock` to recover.

This change records `pidStartTimeMs` alongside `pid` and `createdAtMs`
when the lock is acquired, and consults it in the staleness check.
When both sides have start-time evidence and they disagree, the lock
is treated as stale; otherwise the existing PID-alive-means-fresh
behavior is preserved exactly. The capture point uses
`Date.now() - process.uptime() * 1000` once at module load, and the
read side uses `/proc/<pid>/stat` field 22 on Linux (returning null
elsewhere so legacy semantics still apply on macOS/Windows hosts).

This is strictly additive on the wire format and the predicate:
existing lock files without `pidStartTimeMs` continue to take the same
code path they did before, and platforms that cannot resolve a live
PID's start-time fall back to the same legacy behavior.

Refs openclaw#74346.
@steipete force-pushed the fix/runtime-deps-lock-startTime branch from 1442409 to 42574e4 on April 29, 2026 at 15:48
@openclaw-barnacle Bot added the app: web-ui label Apr 29, 2026
@clawsweeper
Contributor

clawsweeper Bot commented Apr 29, 2026

Codex review: needs maintainer review before merge.

What this changes:

The PR records a process starttime in bundled runtime-deps lock owner files, compares it with the live PID starttime to reclaim recycled-PID locks, adds focused regression tests, and adds a changelog entry.

Maintainer follow-up before merge:

This is an open contributor implementation PR tied to open bug #74346; the remaining action is maintainer review, CI/rebase handling, and a product call on legacy lock recovery rather than a separate automated replacement PR.

Review details

Best possible solution:

Land a narrow runtime-deps lock hardening change that uses a comparable process identity for Docker PID reuse, preserves non-Linux and active-lock safety, covers the predicate with regression tests, and clearly documents or handles the one-time behavior for legacy lock files written before the new starttime field existed.

Acceptance criteria:

  • pnpm test src/plugins/bundled-runtime-deps.test.ts
  • pnpm exec oxfmt --check --threads=1 src/plugins/bundled-runtime-deps.ts src/plugins/bundled-runtime-deps.test.ts
  • pnpm check:changed in Testbox before merge

What I checked:

  • Current main still uses PID-only freshness: On current main, shouldRemoveRuntimeDepsLock returns !isAlive(owner.pid) as soon as a pid is present, so createdAtMs is not considered for same-number live PIDs. (src/plugins/bundled-runtime-deps.ts:526, 53e0874864cb)
  • Current main owner records lack process identity: Both the sync and async runtime-deps lock writers on current main persist only pid and createdAtMs, so a later Docker process with the same numeric PID cannot distinguish the old owner. (src/plugins/bundled-runtime-deps.ts:617, 53e0874864cb)
  • Existing helper is the right dependency contract: src/shared/pid-alive.ts already exposes getProcessStartTime(pid), documented as /proc field 22 starttime in clock ticks for detecting PID recycling, which matches the final PR approach. (src/shared/pid-alive.ts:39, 53e0874864cb)
  • PR diff is focused and uses shared starttime: The final PR diff imports getProcessStartTime, stores starttime when available in both lock writers, compares liveStarttime against owner.starttime before falling back to PID-alive behavior, and adds matching/mismatched/missing-starttime tests. (src/plugins/bundled-runtime-deps.ts:457, 1442409ab4e2)
  • Docker surface matches the report: The bundled compose service persists OPENCLAW_PLUGIN_STAGE_DIR to the openclaw-plugin-runtime-deps volume and runs with init: true, matching the reported low-PID reuse plus persistent lock directory scenario. (docker-compose.yml:26, 53e0874864cb)
  • Security review pass: The PR touches only CHANGELOG.md, src/plugins/bundled-runtime-deps.ts, and src/plugins/bundled-runtime-deps.test.ts. It adds no workflow, dependency, lockfile, install-script, publishing, artifact-download, permission, or secret-handling changes. (1442409ab4e2)

Likely related people:

  • steipete: Local blame/log in this checkout attributes the bundled runtime-deps lock code, tests, and shared PID helper surface to Peter Steinberger, and the PR branch includes a maintainer hardening commit by steipete that switches the implementation to the shared starttime contract. (role: recent maintainer and adjacent owner; confidence: high; commits: 66cdbccc8a2a, 1442409ab4e2; files: src/plugins/bundled-runtime-deps.ts, src/plugins/bundled-runtime-deps.test.ts, src/shared/pid-alive.ts)
  • vincentkoc: The provided related review context points to prior recycled-PID/session-lock hardening and shared PID helper history around src/shared/pid-alive.ts and src/agents/session-write-lock.ts; current code confirms that surface is the closest precedent for this runtime-deps fix. (role: adjacent owner; confidence: medium; commits: 5a2200b28003, c357235fe6e6; files: src/shared/pid-alive.ts, src/agents/session-write-lock.ts)

Remaining risk / open question:

  • The final patch intentionally preserves legacy PID-alive behavior when owner.json has no starttime. That keeps active legacy locks safe, but already-stuck Docker installs with old pid/createdAtMs-only locks may still need a one-time manual cleanup or documentation/doctor follow-up.
  • The provided PR metadata says mergeable=false, and current main has continued changelog movement since the PR base; maintainer review should confirm whether the branch needs a rebase or conflict fix before CI can be trusted.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 53e0874864cb.

@steipete merged commit b48f6ca into openclaw:main Apr 29, 2026
66 of 69 checks passed
@steipete
Contributor

Maintainer follow-up pushed.

  • Reused the existing getProcessStartTime() helper so the lock owner and live process compare the same raw Linux /proc/<pid>/stat starttime units.
  • Preserved legacy/non-Linux behavior: if starttime evidence is unavailable, live PID locks still stay active.
  • Added/kept regression coverage for Docker-style same-PID/different-starttime reuse and same-PID/same-starttime active installs.
  • Added a tiny separate UI lint unblock from current main so the rebased PR could pass CI.
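
The merged shape, comparing raw tick units on both sides, might look roughly like this (getProcessStartTime here is a stand-in sketch for the shared helper in src/shared/pid-alive.ts, not its actual code):

```ts
import { readFileSync } from "node:fs";

// Stand-in for the shared getProcessStartTime(pid) helper: returns the
// raw /proc/<pid>/stat field-22 starttime in clock ticks, or null when
// /proc is unavailable.
function getProcessStartTime(pid: number): number | null {
  try {
    const stat = readFileSync(`/proc/${pid}/stat`, "utf8");
    const fields = stat.slice(stat.lastIndexOf(")") + 2).split(" ");
    const ticks = Number(fields[19]); // field 22: starttime, clock ticks
    return Number.isFinite(ticks) ? ticks : null;
  } catch {
    return null;
  }
}

// Because the stored value and the live read are the SAME raw units from
// the SAME kernel source, strict equality cannot false-positive the way a
// wall-clock-derived millisecond comparison could.
function isRecycledPid(ownerPid: number, storedStarttime: number): boolean {
  const live = getProcessStartTime(ownerPid);
  return live !== null && live !== storedStarttime;
}
```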

Local proof:

  • pnpm exec oxfmt --check --threads=1 src/plugins/bundled-runtime-deps.ts src/plugins/bundled-runtime-deps.test.ts CHANGELOG.md
  • pnpm test src/plugins/bundled-runtime-deps.test.ts -- --reporter=verbose
  • pnpm check:changed

Thanks @jhsmith409.


Successfully merging this pull request may close these issues.

[Bug] plugin-runtime-deps lock staleness check uses PID alone, blocks Docker gateway restarts (PID is always 7)
