fix(plugins): disambiguate runtime-deps lock owners by process start-time (Docker PID reuse) #74361
Conversation
Greptile Summary: This PR fixes the Docker PID-reuse hang in the bundled runtime-deps lock staleness check.
Confidence Score: 3/5. Not safe to merge as-is: the jiffy-vs-high-res mismatch can produce false-positive stale verdicts that stomp live locks. A single P1 bug: the strict equality comparison between a high-res stored timestamp and a jiffy-quantized live read will false-positive for any process that started mid-jiffy, which is the common case. This can cause a healthy process's lock to be deleted by a concurrent waiter, breaking the mutual-exclusion guarantee the lock provides. See src/plugins/bundled-runtime-deps.ts: the `CURRENT_PROCESS_START_TIME_MS` definition (line 462), its use as `pidStartTimeMs` (line 698), and the comparison at line 606.
```ts
// before.
if (typeof owner.pidStartTimeMs === "number") {
  const liveStartTimeMs = readStartTimeMs(owner.pid);
  if (liveStartTimeMs !== null && liveStartTimeMs !== owner.pidStartTimeMs) {
```
**Strict `!==` comparison between high-res and jiffy-quantized start times produces false positives**

`owner.pidStartTimeMs` is written as `Math.round(Date.now() - process.uptime() * 1000)` (sub-millisecond precision), while `readStartTimeMs` computes `Math.round(Date.now() - os.uptime() * 1000 + starttimeJiffies * 10)` where `starttimeJiffies` is truncated to 10 ms resolution by the kernel. For a live lock owner these two formulas can disagree by up to ~10 ms (the start time falls in the middle of a jiffy), so the strict `!==` can fire on a completely valid, live lock — causing an unrelated process to stomp it.

Concrete example: process A starts 1000.005 s after boot → `CURRENT_PROCESS_START_TIME_MS ≈ T_boot + 1_000_005 ms`, but `readProcessStartTimeMs(A.pid)` resolves `starttimeJiffies = 100000` → returns `T_boot + 1_000_000 ms`. The 5 ms difference triggers a false-positive eviction.

The simplest fix is to derive `CURRENT_PROCESS_START_TIME_MS` from the same jiffy-based formula so both sides of the comparison are quantized identically:

```ts
// Use the same jiffy-quantized formula as readProcessStartTimeMs so that
// the stored value and the live read are always comparable.
const CURRENT_PROCESS_START_TIME_MS = readProcessStartTimeMs(process.pid);
```

Then narrow the `pidStartTimeMs` field type to `number | null` and skip writing it (or write `null`) when `readProcessStartTimeMs` returns `null` (non-Linux). The staleness check already guards `typeof owner.pidStartTimeMs === "number"`, so `null` will fall through to legacy behavior unchanged.
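The quantization mismatch is easy to reproduce with the two formulas side by side. A minimal sketch, with hypothetical helper names and an assumed boot epoch, using the 100 Hz `USER_HZ` that `/proc` times are reported in:

```typescript
// One jiffy as seen through /proc is 10 ms (USER_HZ = 100 on Linux).
const JIFFY_MS = 10;

// High-resolution capture, analogous to Date.now() - process.uptime() * 1000.
function highResStartTimeMs(bootEpochMs: number, startOffsetMs: number): number {
  return Math.round(bootEpochMs + startOffsetMs);
}

// /proc-based read: the kernel truncates the start offset to whole jiffies.
function jiffyQuantizedStartTimeMs(bootEpochMs: number, startOffsetMs: number): number {
  const starttimeJiffies = Math.floor(startOffsetMs / JIFFY_MS);
  return Math.round(bootEpochMs + starttimeJiffies * JIFFY_MS);
}

// A process starting 1000.005 s after boot disagrees by 5 ms under strict equality.
const boot = 1_700_000_000_000; // assumed boot epoch for illustration
const stored = highResStartTimeMs(boot, 1_000_005);      // boot + 1_000_005
const live = jiffyQuantizedStartTimeMs(boot, 1_000_005); // boot + 1_000_000
console.log(stored !== live); // true: a false-positive "stale" verdict
```

Quantizing both sides through the same formula makes the comparison exact, which is why the suggested fix routes the stored value through `readProcessStartTimeMs` as well.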
Codex review: needs maintainer review before merge.
- What this changes: This PR updates bundled runtime-deps lock ownership to persist a process start-time marker, treats live-PID locks as stale when the live PID has a different start time, and adds regression tests for matching, mismatched, and missing start-time evidence cases.
- Maintainer follow-up before merge: This is an open contributor implementation PR for a confirmed Docker startup bug; the next action is maintainer review and CI, especially deciding whether to adjust the PR to reuse the existing starttime helper/units before merge.
- Best possible solution: Land a narrow runtime-deps lock hardening fix, either in this PR or an equivalent replacement, that records a comparable process-start identity, treats a live PID with mismatched identity as stale in Docker, preserves legacy and non-Linux lock behavior, and adds focused regression coverage around active locks, PID reuse, and missing start-time evidence.

Codex review notes: model gpt-5.5, reasoning high; reviewed against aaa194c58b5a.
fix(plugins): disambiguate runtime-deps lock owners by process start-time

`shouldRemoveRuntimeDepsLock` previously trusted `isAlive(owner.pid)` alone when deciding whether a lock could be reclaimed. That works fine on a normal host: when the writer dies the PID is gone and `isAlive` returns false. Inside Docker it does not — every Node gateway process runs as PID 1 (or PID 7 with `init: true`) in its container PID namespace, so a stale lock left behind by a previous incarnation looks "alive" to the new one. The 5-minute lock-wait timeout then fires and the supervisor restarts, and the cycle repeats indefinitely. Operators have to manually remove `.openclaw-runtime-deps.lock` to recover.

This change records `pidStartTimeMs` alongside `pid` and `createdAtMs` when the lock is acquired, and consults it in the staleness check. When both sides have start-time evidence and they disagree, the lock is treated as stale; otherwise the existing PID-alive-means-fresh behavior is preserved exactly. The capture point uses `Date.now() - process.uptime() * 1000` once at module load, and the read side uses `/proc/<pid>/stat` field 22 on Linux (returning null elsewhere so legacy semantics still apply on macOS/Windows hosts).

This is strictly additive on the wire format and the predicate: existing lock files without `pidStartTimeMs` continue to take the same code path they did before, and platforms that cannot resolve a live PID's start-time fall back to the same legacy behavior. Refs openclaw#74346.
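For reference, a self-contained sketch of the `/proc` read side described above. The helper name is taken from the review thread; the PR's actual implementation may differ in details:

```typescript
import { readFileSync } from "node:fs";
import os from "node:os";

// Returns the live PID's start time in epoch ms on Linux, null elsewhere.
function readProcessStartTimeMs(pid: number): number | null {
  if (process.platform !== "linux") return null;
  let stat: string;
  try {
    stat = readFileSync(`/proc/${pid}/stat`, "utf8");
  } catch {
    return null; // PID vanished or /proc unavailable
  }
  // Field 2 (comm) may contain spaces/parens, so anchor on the LAST ")".
  const rest = stat.slice(stat.lastIndexOf(")") + 2).split(" ");
  // rest[0] is field 3 (state); starttime is field 22 overall, i.e. rest[19].
  const starttimeJiffies = Number(rest[19]);
  if (!Number.isFinite(starttimeJiffies)) return null;
  // /proc times are reported in USER_HZ, which is 100 on Linux, so one
  // jiffy here is 10 ms regardless of the kernel's internal HZ.
  return Math.round(Date.now() - os.uptime() * 1000 + starttimeJiffies * 10);
}
```

Worth noting: `os.uptime()` has one-second resolution, so the boot-epoch anchor itself can jitter between calls on top of the jiffy truncation of `starttime`.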
Force-pushed 1442409 to 42574e4.
Codex review: needs maintainer review before merge.
- What this changes: The PR records a process starttime in bundled runtime-deps lock owner files, compares it with the live PID starttime to reclaim recycled-PID locks, adds focused regression tests, and adds a changelog entry.
- Maintainer follow-up before merge: This is an open contributor implementation PR tied to open bug #74346; the remaining action is maintainer review, CI/rebase handling, and a product call on legacy lock recovery rather than a separate automated replacement PR.
- Best possible solution: Land a narrow runtime-deps lock hardening change that uses a comparable process identity for Docker PID reuse, preserves non-Linux and active-lock safety, covers the predicate with regression tests, and clearly documents or handles the one-time behavior for legacy lock files written before the new starttime field existed.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 53e0874864cb.
Maintainer follow-up pushed.
Thanks @jhsmith409.
Summary
`shouldRemoveRuntimeDepsLock` short-circuits on `isAlive(owner.pid)` alone, which is unsafe inside containers because PIDs are recycled deterministically — the new gateway is always PID 1 (or PID 7 with `init: true`) in its container PID namespace, so a stale lock left behind by a previous incarnation looks "live" to the new one and never gets reclaimed.

This PR captures `pidStartTimeMs` at lock-acquisition time and consults it in the staleness check. When both sides have start-time evidence and they disagree, the lock is treated as stale. When evidence is missing on either side (legacy locks, non-Linux hosts), the existing PID-alive-means-fresh behavior is preserved exactly.

Closes #74346.
The bug
Setup: OpenClaw on Docker (`ghcr.io/openclaw/openclaw:2026.4.24+`), Linux host. Reproduced both with `docker compose down`/`up` and with a hard-killed gateway (SIGKILL, OOM, container kill).

1. The old gateway dies uncleanly; `.openclaw-runtime-deps.lock/owner.json` is left behind with `{"pid": 7, "createdAtMs": <T0>}`.
2. The new gateway calls `removeRuntimeDepsLockIfStale(lockDir, nowMs)`, which calls `shouldRemoveRuntimeDepsLock(owner, nowMs)`. Owner has `pid: 7`.
3. `isAlive(7)` returns `true` (the new process is PID 7). The function returns `false` — lock stays.
4. `mkdirSync(lockDir)` returns `EEXIST`. The wait loop spins until `BUNDLED_RUNTIME_DEPS_LOCK_TIMEOUT_MS` (5 min) elapses, then errors out and is restarted by the supervisor. Cycle repeats indefinitely.

Operators have been working around this by stopping the container and manually removing `.openclaw-runtime-deps.lock`. After lock removal the gateway boots in ~35 seconds.

The fix
Strictly-additive change to the lock owner record and the staleness predicate:
- New `pidStartTimeMs` field on the lock owner record. Computed once at module load as `Date.now() - process.uptime() * 1000` and persisted alongside `pid`/`createdAtMs` in `owner.json`.
- New `readProcessStartTimeMs(pid)` helper that returns the live PID's start-time in epoch ms on Linux (parses `/proc/<pid>/stat` field 22, anchored on the last `)` to handle process names with spaces/parens), or `null` elsewhere. Hard-codes `100` for `_SC_CLK_TCK` since the constant cancels out — both ends of the comparison use the same conversion.
- `shouldRemoveRuntimeDepsLock` extended: when `owner.pidStartTimeMs` is set AND `readStartTimeMs(owner.pid)` returns a value AND the two disagree, the lock is stale. When evidence is missing on either side, the existing PID-alive-means-fresh path is taken unchanged.

Why strictly additive
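A minimal sketch of the extended predicate, under an assumed record shape and a simplified signature (the real function also takes `nowMs` and carries more fields):

```typescript
type LockOwner = { pid: number; createdAtMs: number; pidStartTimeMs?: number };

// A live PID answers signal 0; a reclaimed/absent PID raises ESRCH.
function isAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch {
    return false;
  }
}

function shouldRemoveRuntimeDepsLock(
  owner: LockOwner,
  readStartTimeMs: (pid: number) => number | null,
): boolean {
  // New path: both sides have start-time evidence and they disagree -> stale.
  if (typeof owner.pidStartTimeMs === "number") {
    const live = readStartTimeMs(owner.pid);
    if (live !== null && live !== owner.pidStartTimeMs) return true;
  }
  // Legacy path, unchanged: a live PID means the lock is fresh.
  return !isAlive(owner.pid);
}
```

With the current process as the lock owner, a mismatched start time marks the lock stale even though the PID is alive, while a matching or unreadable start time falls through to the legacy PID-alive-means-fresh behavior.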
- Existing lock files have no `pidStartTimeMs`. They continue to take the legacy path. No migration step needed.
- The existing test `does not expire active runtime-deps install locks by age alone` continues to pass.
- Non-Linux hosts return `null` from `readProcessStartTimeMs` (no `/proc`) and take the legacy path. Linux hosts (including Docker Desktop's Linux VM, which all Linux containers run inside on macOS / Windows) get the disambiguation.

Tests
`src/plugins/bundled-runtime-deps.test.ts` adds four new cases, including the conservative path: with `pidStartTimeMs` set on the owner, we don't stomp the lock when the live side cannot be verified.

All 94 tests in `bundled-runtime-deps.test.ts` pass locally.

Verified

- `pnpm vitest run --reporter verbose --no-coverage src/plugins/bundled-runtime-deps.test.ts` — 94/94 pass.
- `pnpm tsc --noEmit -p tsconfig.json` — no new errors related to this file.

Notes for review
- The capture formula (`Date.now() - process.uptime() * 1000`) drifts a few ms relative to `/proc`'s starttime jiffies. That's fine because the comparison only runs against the same PID's start-time read from `/proc` later — equality holds within rounding tolerance and we use `Math.round` on both sides.
- I considered a `flock(2)` rewrite. A `flock`-based design (kernel-released on process exit, no staleness logic needed) would be cleaner architecturally but is a much larger change and would require picking a Node lockfile library. Happy to follow up with an RFC if you'd like to go that direction; this PR addresses the immediate Docker hang while preserving every existing invariant.