fix(daemon): enforce single-daemon invariant via atomic bind() (kz8.1)#4
Conversation
There was a problem hiding this comment.
Pull request overview
This PR replaces the daemon single-instance mechanism from a pidfile-based mutex (with race potential) to a bind()-based invariant on the Unix socket path, and consolidates daemon “warm-up” spawning into a single obtain-daemon primitive in both Node and Rust.
Changes:
- Introduces
src/daemon/acquire.ts(obtainDaemon/obtainDaemonAsync) with a spawn-dedup “spawn.lock” optimization. - Updates the daemon to enforce singleton via atomic socket bind, and makes the pidfile diagnostic-only.
- Removes the old detached spawn helper and adds targeted tests for the new acquire/bind behavior.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| test/daemon-acquire.test.ts | Adds unit tests for obtain-daemon behavior and bind()-based singleton semantics. |
| src/install/index.ts | Switches daemon warm-up from spawnDaemonDetached() to obtainDaemonAsync(). |
| src/index.ts | Switches daemon warm-up from spawnDaemonDetached() to obtainDaemonAsync() on daemon-miss. |
| src/daemon/spawn.ts | Removes the old fire-and-forget spawn helper (replaced by obtain-daemon). |
| src/daemon/server.ts | Replaces pidfile exclusion with bind()/EADDRINUSE arbitration + stale socket recovery; pidfile becomes diagnostic-only. |
| src/daemon/paths.ts | Adds spawnLockPath() for caller-side spawn dedup. |
| src/daemon/acquire.ts | New obtain-daemon primitive implementing connect-probe + spawn-lock + readiness polling. |
| rust-client/src/main.rs | Replaces detached spawn fallback with a flock-gated obtain_daemon_kick() that mirrors the Node obtain-daemon flow. |
Comments suppressed due to low confidence (1)
src/daemon/server.ts:136
- In
isSocketAlive, the 50ms timer treats a slow connect as "not alive". In theEADDRINUSErecovery path, a false-negative here leads to unlinkingsockPatheven though a daemon may actually be listening, which would make the live daemon unreachable via pathname. To make this safe, treat timeouts as "assume alive" (do not unlink), increase the timeout substantially, and/or require multiple failed probes before unlinking.
sock.once("connect", () => done(true));
sock.once("error", () => done(false));
// 50ms is generous; localhost AF_UNIX connect is sub-ms when a listener
// exists. Anything slower implies "no listener" in practice.
setTimeout(() => done(false), 50).unref();
});
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Address review threads on PR #4: 1. handleAddressInUse: silently ignoring unlinkSync failure left the system in the worst state — no daemon + stale socket blocking future starts. The unlink failure now logs an error and exits non-zero so the supervisor notices instead of seeing a clean exit. [LAW:no-defensive-null-guards] 2. isSocketAlive: added a `settled` guard + explicit clearTimeout on settle, mirroring canConnect in acquire.ts. The connect/error/timeout race could otherwise call destroy() and resolve() more than once. 3. tryAcquireSpawnLock: non-EEXIST openSync errors (EACCES, ENOTDIR) were silently mapped to "contended", causing obtainDaemon to spin until its deadline and return a misleading "timeout" reason. Widened the return type to `{ kind: "held" | "contended" | "error"; reason }`; unrecoverable errors surface as { kind: "failed", reason: "spawn-lock: ..." } and the call returns promptly. [LAW:no-silent-fallbacks] 4. Doc comments: rust-client mirror lives in main.rs (obtain_daemon section), not a separate acquire.rs. New test verifies the error path returns < 500ms instead of spinning to the 2s deadline. Refs: brandon-process-discipline-kz8.1
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
src/daemon/acquire.ts:80
spawnFn()is treated as a boolean-returning hook, but if the real spawn implementation (or a test hook) throws synchronously,obtainDaemon()will reject instead of returning{ kind: "failed" }. Consider guarding thespawnFn()call and translating thrown errors into afailedresult (with a reason) so callers consistently get anObtainResult.
if (await canConnect(socketPath(), settings.connectTimeoutMs)) {
return { kind: "attached" };
}
if (!spawnFn()) {
return { kind: "failed", reason: "spawn returned false" };
}
// Poll for the new daemon to bind.
Address review threads from second Copilot pass on PR #4: 1. obtainDaemonAsync was async-first — the first await (connect probe) suspended before child_process.spawn ran, and the caller's immediate process.exit(0) killed the process before the await could resume. The daemon was never warmed in the daemon-miss path. Replaced with a synchronous obtainDaemonKick(opts?) that does lock + spawn + release in one synchronous turn. Switched both fire-and-forget callers (src/index.ts, src/install/index.ts) to it. 2. Cross-runtime spawn-lock semantics were inconsistent: Rust used flock-on-persistent-file (leaves the file behind on release); Node used open(path, "wx") (existence == held). A Rust release left the file in place, which Node then saw as EEXIST and treated as contention until the 10s staleness window expired. Aligned both runtimes on existence-as-lock: Rust now uses OpenOptions::create_new(true) + explicit remove_file on release. Removed AsRawFd import and the libc::flock call path. [LAW:one-type-per-behavior] 3. obtainDaemon is typed Promise<ObtainResult>, but synchronous setup (fs.mkdirSync) could throw, violating the non-throwing contract. Extracted ensureStateDir() that returns a reason string; obtainDaemon converts to { kind: "failed", reason }. obtainDaemonKick silently bails on the same setup failure (it's fire-and-forget with no caller to report to). New tests: synchronous kick proven to complete before microtasks run; state-dir setup failure surfaces as typed failed result (not throw); kick does not double-spawn when lock is already held. Storm test rebuilt + re-run: 10 concurrent `node dist/index.mjs daemon` spawns into a fresh state dir → exactly 1 daemon survives. Refs: brandon-process-discipline-kz8.1
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
rust-client/src/main.rs:383
- The obtain-daemon section comment refers to "flock on spawn.lock" and says the lock auto-releases on process exit via advisory locks. The implementation now uses a lock file created with
create_new(true)and released by unlinking (with time-based staleness reclaim on crash), so the comment is no longer accurate. Please revise this comment to reflect the new existence-as-lock behavior.
// [LAW:single-enforcer] One entry point on the Rust side for "obtain a
// daemon." The atomic bind() inside the daemon is the load-bearing exclusion;
// this flock on spawn.lock is a thundering-herd optimization that prevents N
// concurrent clients from each forking a Node process when one would do.
//
// [LAW:dataflow-not-control-flow] The caller does not get to choose whether
// to spawn — it asks for a daemon, this function decides. The bind() inside
// the daemon ensures that even if this function spawns "redundantly" the
// duplicate daemon exits immediately at bind().
//
// Fire-and-forget shape: the current render is already lost (we hit
// obtain_daemon_kick on a render failure); we just want a daemon to be alive
// for the next refresh. The lock auto-releases when the client process exits
// (advisory locks vanish on close). Total work inside this fn is bounded by
// a few syscalls plus an optional fork+execve.
Address review threads on PR #4 (round 3): 1. src/daemon/server.ts runDaemon header referenced obtainDaemon() as the client recovery path, but callers actually use obtainDaemonKick() (the synchronous variant added in the previous commit). Comment now lists both entry points with their semantics: kick = fire-and-forget, obtainDaemon = caller waits for readiness. 2. rust-client/src/main.rs header and the obtain-daemon-primitive section header both still described the spawn dedup as "flock-gated", but flock was replaced with existence-as-lock (open with O_CREAT | O_EXCL, release by unlink) to align with Node's primitive. Comments now accurately describe the cross-runtime lock mechanism. Doc-only — no behavior change. Refs: brandon-process-discipline-kz8.1
…lock release Address review threads on PR #4 (round 4): 1. writePidfileDiagnostic: fs.writeFileSync({mode: 0o600}) only applies the mode on file creation, leaving a stale pidfile with broader permissions from a prior run untouched. Added explicit fs.chmodSync(pidPath(), 0o600) after the write so 0600 is the invariant regardless of prior state. 2. canConnect (acquire.ts): had a `settled` guard preventing double-resolve but never cleared the timeout, so under tight polling loops (obtainDaemon calls it repeatedly) no-op timers accumulated until the daemon attached or failed. Matches isSocketAlive's pattern: keep the timer handle, clear it on settle. 3. obtain_daemon_kick doc comment in Rust still claimed "the lock auto- releases when the client process exits (advisory locks vanish on close)" — residual flock-era wording. Comment now accurately describes the existence-as-lock release (explicit remove_file) and the crash-recovery path (time-based staleness reclaim for files older than 10s). Refs: brandon-process-discipline-kz8.1
Address review thread on PR #4 (round 5): obtainDaemon is typed Promise<ObtainResult>, but spawnFn() was invoked without a try/catch. A synchronous throw from child_process.spawn (ENOENT for the node binary, invalid options, etc.) — or a test-provided spawn that throws — would reject the promise instead of returning the typed failure result. Wrapped the spawnFn() invocation in try/catch and converted thrown errors to { kind: "failed", reason: \`spawn threw: \${message}\` }. The existing finally{} block already released the spawn lock on throw — only the typed-conversion was missing. Same shape as the round-2 fix for mkdirSync. New test injects a throwing spawn fn and asserts the typed-failure result instead of a rejected promise. Refs: brandon-process-discipline-kz8.1
…ilability Address review threads on PR #4 (round 6): 1. server.ts handleAddressInUse: between `isSocketAlive` returning false and `unlinkSync` running, a concurrent recoverer could unlink+bind the path; our unlink would then remove their live socket, leaving two daemons. Added a second isSocketAlive check immediately before unlink — if a live listener appeared between checks, we exit instead of stomping on it. Race window goes from O(handler latency) to sub-microsecond. 2. acquire.ts: STALE_LOCK_MS (10s) was significantly longer than the default totalTimeoutMs (2s), so a crashed spawn-lock holder could block new daemon spawns for up to 10s — making the spawn-lock load-bearing for availability. [LAW:dataflow-not-control-flow] says the lock is an optimization, bind() is the load-bearing exclusion. Added `lockFallbackMs` (default totalTimeoutMs / 2): after that much contention without a daemon appearing, bypass the lock and spawn anyway. The bind() inside the daemon arbitrates duplicates, so a stuck lock degrades to bind-arbitrated contention instead of a multi-second gap. Extracted shared spawn+poll logic into `spawnAndWaitForReady` so the lock-held and lock-fallback paths can't drift. Both go through the same helper; reasonSuffix tags the failure with " (lock-fallback)" for diagnostics when the fallback fires. New test exercises the fallback path: plants a fresh (non-stale) lock and asserts spawn fires after lockFallbackMs (200ms in the test). Storm test re-run: 10 concurrent daemon spawns → 1 survivor. Refs: brandon-process-discipline-kz8.1
Address review thread on PR #4 (round 7): obtainDaemonKick treated any non-"held" lock outcome (including unrecoverable errors like EACCES on spawn.lock) as an immediate silent return, which made the spawn-lock load-bearing for availability — exactly what [LAW:dataflow-not-control-flow] forbids. spawn.lock is an optimization; bind() inside the daemon is the load-bearing exclusion. Split the kick's lock-outcome handling into three explicit branches: - "contended": return silently (another caller is spawning) - "error": emit a terse stderr warning and spawn anyway; bind() arbitrates - "held": normal locked-spawn path Extracted `safeSpawn` helper so the throw-capture behavior is shared between the locked and unlocked branches (no drift). New test exercises the lock-error fallback: makes daemonDir read-only so openSync("wx") returns EACCES, asserts the kick spawns anyway + emits the stderr warning. Refs: brandon-process-discipline-kz8.1
Address review thread on PR #4 (round 13): `Promise.resolve(fn(...)).finally(cleanup)` does NOT catch synchronous throws from `fn` — the throw propagates before `Promise.resolve` runs, bypassing the finally. Tests that triggered a sync throw inside the helper would leak XDG_STATE_HOME state and temp directories. Refactored to async/await with try-finally so cleanup runs unconditionally regardless of how `fn` fails. Refs: brandon-process-discipline-kz8.1
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
src/daemon/server.ts:337
- The shutdown() comment still refers to a “singleton mutex” (“…the invariant the singleton mutex relies on”), but the singleton enforcer is now the atomic bind() on the Unix socket. Consider updating this wording to reference the bind/socket-singleton invariant to avoid confusing future maintenance.
// [LAW:single-enforcer] Exactly one path out of the process. Backstop with
// SIGKILL because we previously observed shut-down daemons staying alive in
// uv__io_poll — something (event loop handle, swallowed exception in a
// post-end log write) was preventing process.exit from actually firing.
// The hard kill makes "shutdown was called" mechanically equivalent to
// "process is gone", which is the invariant the singleton mutex relies on.
const kill = setTimeout(() => process.kill(process.pid, "SIGKILL"), 500);
Address review thread on PR #4 (round 14): safeSpawn captured throws but ignored the boolean return from spawnFn. spawnDaemonDetachedReal returns false when process.argv[1] is missing (can happen with unusual exec environments); a test/alternate spawn hook can also signal failure that way. The kick path then went silently into the void — exactly the silent-failure mode the rest of this code is designed to avoid. safeSpawn now writes a terse stderr warning when spawnFn returns false. Both failure modes (throw and false-return) are now visible. Refs: brandon-process-discipline-kz8.1
The pidfile-EEXIST check in acquirePidfile() was a postcondition race — N concurrent spawners would each pay full Node startup, then N-1 lose the EEXIST check after expensive module init. Under multi-session Claude Code load this produced spawn-storms on every daemon restart. Replace with bind()-based exclusion: two daemons cannot both bind() the same Unix socket path. The kernel makes duplicate-daemon unrepresentable at the syscall level. bindOrAttachAndExit() distinguishes "live daemon listening" (connect succeeds → exit cleanly) from "stale socket file from crashed daemon" (connect fails → unlink + rebind once). Add obtainDaemon()/obtain_daemon_kick() primitives — one per runtime — that collapse three previously-independent spawn sites (Rust client on failure, Node fallback, install postlude) onto a single boundary. Each runtime uses native primitives for the caller-side spawn dedup: - Rust: libc::flock(LOCK_EX | LOCK_NB) on spawn.lock (auto-release on process exit) - Node: wx-create on spawn.lock with time-based staleness reclaim The flock is a thundering-herd optimization; the load-bearing exclusion is the bind() inside the daemon. Even if a "redundant" daemon slips past the lock it exits immediately at bind(). Verified: 10 concurrent `node dist/index.mjs daemon` spawns into a fresh state dir → exactly 1 daemon survives. [LAW:single-enforcer] [LAW:dataflow-not-control-flow] [LAW:no-defensive-null-guards] [LAW:one-type-per-behavior] Refs: brandon-process-discipline-kz8.1
Address review threads on PR #4: 1. handleAddressInUse: silently ignoring unlinkSync failure left the system in the worst state — no daemon + stale socket blocking future starts. The unlink failure now logs an error and exits non-zero so the supervisor notices instead of seeing a clean exit. [LAW:no-defensive-null-guards] 2. isSocketAlive: added a `settled` guard + explicit clearTimeout on settle, mirroring canConnect in acquire.ts. The connect/error/timeout race could otherwise call destroy() and resolve() more than once. 3. tryAcquireSpawnLock: non-EEXIST openSync errors (EACCES, ENOTDIR) were silently mapped to "contended", causing obtainDaemon to spin until its deadline and return a misleading "timeout" reason. Widened the return type to `{ kind: "held" | "contended" | "error"; reason }`; unrecoverable errors surface as { kind: "failed", reason: "spawn-lock: ..." } and the call returns promptly. [LAW:no-silent-fallbacks] 4. Doc comments: rust-client mirror lives in main.rs (obtain_daemon section), not a separate acquire.rs. New test verifies the error path returns < 500ms instead of spinning to the 2s deadline. Refs: brandon-process-discipline-kz8.1
Address review threads from second Copilot pass on PR #4: 1. obtainDaemonAsync was async-first — the first await (connect probe) suspended before child_process.spawn ran, and the caller's immediate process.exit(0) killed the process before the await could resume. The daemon was never warmed in the daemon-miss path. Replaced with a synchronous obtainDaemonKick(opts?) that does lock + spawn + release in one synchronous turn. Switched both fire-and-forget callers (src/index.ts, src/install/index.ts) to it. 2. Cross-runtime spawn-lock semantics were inconsistent: Rust used flock-on-persistent-file (leaves the file behind on release); Node used open(path, "wx") (existence == held). A Rust release left the file in place, which Node then saw as EEXIST and treated as contention until the 10s staleness window expired. Aligned both runtimes on existence-as-lock: Rust now uses OpenOptions::create_new(true) + explicit remove_file on release. Removed AsRawFd import and the libc::flock call path. [LAW:one-type-per-behavior] 3. obtainDaemon is typed Promise<ObtainResult>, but synchronous setup (fs.mkdirSync) could throw, violating the non-throwing contract. Extracted ensureStateDir() that returns a reason string; obtainDaemon converts to { kind: "failed", reason }. obtainDaemonKick silently bails on the same setup failure (it's fire-and-forget with no caller to report to). New tests: synchronous kick proven to complete before microtasks run; state-dir setup failure surfaces as typed failed result (not throw); kick does not double-spawn when lock is already held. Storm test rebuilt + re-run: 10 concurrent `node dist/index.mjs daemon` spawns into a fresh state dir → exactly 1 daemon survives. Refs: brandon-process-discipline-kz8.1
Address review threads on PR #4 (round 3): 1. src/daemon/server.ts runDaemon header referenced obtainDaemon() as the client recovery path, but callers actually use obtainDaemonKick() (the synchronous variant added in the previous commit). Comment now lists both entry points with their semantics: kick = fire-and-forget, obtainDaemon = caller waits for readiness. 2. rust-client/src/main.rs header and the obtain-daemon-primitive section header both still described the spawn dedup as "flock-gated", but flock was replaced with existence-as-lock (open with O_CREAT | O_EXCL, release by unlink) to align with Node's primitive. Comments now accurately describe the cross-runtime lock mechanism. Doc-only — no behavior change. Refs: brandon-process-discipline-kz8.1
…lock release Address review threads on PR #4 (round 4): 1. writePidfileDiagnostic: fs.writeFileSync({mode: 0o600}) only applies the mode on file creation, leaving a stale pidfile with broader permissions from a prior run untouched. Added explicit fs.chmodSync(pidPath(), 0o600) after the write so 0600 is the invariant regardless of prior state. 2. canConnect (acquire.ts): had a `settled` guard preventing double-resolve but never cleared the timeout, so under tight polling loops (obtainDaemon calls it repeatedly) no-op timers accumulated until the daemon attached or failed. Matches isSocketAlive's pattern: keep the timer handle, clear it on settle. 3. obtain_daemon_kick doc comment in Rust still claimed "the lock auto- releases when the client process exits (advisory locks vanish on close)" — residual flock-era wording. Comment now accurately describes the existence-as-lock release (explicit remove_file) and the crash-recovery path (time-based staleness reclaim for files older than 10s). Refs: brandon-process-discipline-kz8.1
Address review thread on PR #4 (round 5): obtainDaemon is typed Promise<ObtainResult>, but spawnFn() was invoked without a try/catch. A synchronous throw from child_process.spawn (ENOENT for the node binary, invalid options, etc.) — or a test-provided spawn that throws — would reject the promise instead of returning the typed failure result. Wrapped the spawnFn() invocation in try/catch and converted thrown errors to { kind: "failed", reason: \`spawn threw: \${message}\` }. The existing finally{} block already released the spawn lock on throw — only the typed-conversion was missing. Same shape as the round-2 fix for mkdirSync. New test injects a throwing spawn fn and asserts the typed-failure result instead of a rejected promise. Refs: brandon-process-discipline-kz8.1
…ilability Address review threads on PR #4 (round 6): 1. server.ts handleAddressInUse: between `isSocketAlive` returning false and `unlinkSync` running, a concurrent recoverer could unlink+bind the path; our unlink would then remove their live socket, leaving two daemons. Added a second isSocketAlive check immediately before unlink — if a live listener appeared between checks, we exit instead of stomping on it. Race window goes from O(handler latency) to sub-microsecond. 2. acquire.ts: STALE_LOCK_MS (10s) was significantly longer than the default totalTimeoutMs (2s), so a crashed spawn-lock holder could block new daemon spawns for up to 10s — making the spawn-lock load-bearing for availability. [LAW:dataflow-not-control-flow] says the lock is an optimization, bind() is the load-bearing exclusion. Added `lockFallbackMs` (default totalTimeoutMs / 2): after that much contention without a daemon appearing, bypass the lock and spawn anyway. The bind() inside the daemon arbitrates duplicates, so a stuck lock degrades to bind-arbitrated contention instead of a multi-second gap. Extracted shared spawn+poll logic into `spawnAndWaitForReady` so the lock-held and lock-fallback paths can't drift. Both go through the same helper; reasonSuffix tags the failure with " (lock-fallback)" for diagnostics when the fallback fires. New test exercises the fallback path: plants a fresh (non-stale) lock and asserts spawn fires after lockFallbackMs (200ms in the test). Storm test re-run: 10 concurrent daemon spawns → 1 survivor. Refs: brandon-process-discipline-kz8.1
Address review thread on PR #4 (round 7): obtainDaemonKick treated any non-"held" lock outcome (including unrecoverable errors like EACCES on spawn.lock) as an immediate silent return, which made the spawn-lock load-bearing for availability — exactly what [LAW:dataflow-not-control-flow] forbids. spawn.lock is an optimization; bind() inside the daemon is the load-bearing exclusion. Split the kick's lock-outcome handling into three explicit branches: - "contended": return silently (another caller is spawning) - "error": emit a terse stderr warning and spawn anyway; bind() arbitrates - "held": normal locked-spawn path Extracted `safeSpawn` helper so the throw-capture behavior is shared between the locked and unlocked branches (no drift). New test exercises the lock-error fallback: makes daemonDir read-only so openSync("wx") returns EACCES, asserts the kick spawns anyway + emits the stderr warning. Refs: brandon-process-discipline-kz8.1
…rrides stale-fresh lock Address review threads on PR #4 (round 8): 1. acquire.ts obtainDaemonKick: on contention, the kick returned silently even if the lock holder had crashed mid-spawn. The lock would stay "fresh" until STALE_LOCK_MS (10s), creating a 10-second availability gap. Added KICK_CONTENDED_OVERRIDE_MS (2000ms) — when the spawn.lock file's age exceeds this, the kick assumes the holder crashed and spawns unlocked with a stderr warning. Threshold is calibrated above the slowest legitimate holder (obtainDaemon's lock-held path can hold ~1.6s for spawnReadyTimeoutMs=1500ms) and well above the kick's own typical hold (<10ms). [LAW:dataflow-not-control-flow] 2. server.ts isSocketAlive treated the 50ms connect timeout as "definitely no listener," which risked unlinking a real (but momentarily slow) daemon's socket. Introduced 3-state probeSocket: "alive" (connect succeeded), "dead" (only on ECONNREFUSED, ENOENT, or ENOTSOCK — definitive), "unknown" (timeout, EPERM, other errors). isSocketAlive treats "unknown" as alive, so the stale-socket recovery path won't unlink an ambiguous socket — the conservative choice avoids destructive false negatives. 3. Side effect of (2): "dead" now includes ENOENT from connect, so the socket file may genuinely be missing by the time we unlinkSync. Added ENOENT tolerance — a missing file already satisfies the goal ("make the path bindable"). Other unlink failures still hard-fail loudly. Refs: brandon-process-discipline-kz8.1
…t file Address review thread on PR #4 (round 9): The "stale socket file" test name promised verification that connect fails with ECONNREFUSED, but the body only checked EADDRINUSE + unlink + rebind. Added an explicit net.connect probe assertion: a plain-file at the socket path must surface one of the three "definitively dead" codes that the round-8 probeSocket treats as unlink-eligible (ENOTSOCK / ECONNREFUSED / ENOENT). This now validates the full premise behind handleAddressInUse's stale-socket recovery: the connect-probe outcome must be one of the codes probeSocket maps to "dead" for the recovery path to actually fire on plain-file stale sockets. Refs: brandon-process-discipline-kz8.1
…leanup Address review threads on PR #4 (round 10): 1. acquire.ts tryAcquireSpawnLock: stale-lock reclaim path treated any unlinkSync failure as a fatal lock error. ENOENT during the reclaim means someone (the rightful holder cleaning up, or another concurrent reclaimer) already removed the file — that IS the desired post-condition. Return to the retry loop so openSync can succeed instead of returning { kind: "error" }. Same shape as the ENOENT-on-unlink tolerance added to server.ts in round 8. 2. test/daemon-acquire.test.ts: removed the manual stderr.write capture-and-reassignment in the "spawns unlocked when lock acquisition errors" test. mockRestore() correctly restores the original; the captured-bound-and-reassigned wrapper was subtly replacing the restored original with a wrapper, which could affect other spies on process.stderr in the same test run. Refs: brandon-process-discipline-kz8.1
…rity Address review thread on PR #4 (round 11): try_acquire_spawn_lock returned Option<PathBuf>, conflating contention and error: any non-EEXIST openSync failure (EACCES, ENOTDIR) and any unrecoverable unlink failure mapped to None, which obtain_daemon_kick treated as "another caller is spawning, trust them." This made the spawn-lock load-bearing for availability on the Rust path — exactly the [LAW:dataflow-not-control-flow] violation we already fixed on the Node side in PR rounds 7-8. Introduced a 3-way LockOutcome enum (Held / Contended / Error) and rewrote obtain_daemon_kick to handle each variant explicitly: - Held: spawn under lock, release by unlink (unchanged) - Contended: if spawn.lock age > 2s (KICK_CONTENDED_OVERRIDE_MS, Rust mirror of the Node constant), assume crashed holder and spawn unlocked with stderr warning; else silent return - Error: stderr "spawn-lock unavailable (reason)" + spawn unlocked ENOENT during stale-lock unlink is also treated as benign — the file is already gone (desired post-condition), so we retry the openSync. Mirrors the round-10 Node-side fix. Both runtimes now have full feature parity for kick-path crash recovery: same 2s contention threshold, same 10s STALE_LOCK, same error-becomes-unlocked-spawn behavior. [LAW:one-type-per-behavior] Refs: brandon-process-discipline-kz8.1
Address review thread on PR #4 (round 12): fs.statSync(...).mode returns the file mode including file-type bits (e.g. 0o040000 for directories). Passing that whole value back to chmodSync is non-portable — some platforms accept permission bits only. Both originalMode captures now mask with & 0o7777 (covers standard rwx + sticky/setuid/setgid). Refs: brandon-process-discipline-kz8.1
Address review thread on PR #4 (round 13): `Promise.resolve(fn(...)).finally(cleanup)` does NOT catch synchronous throws from `fn` — the throw propagates before `Promise.resolve` runs, bypassing the finally. Tests that triggered a sync throw inside the helper would leak XDG_STATE_HOME state and temp directories. Refactored to async/await with try-finally so cleanup runs unconditionally regardless of how `fn` fails. Refs: brandon-process-discipline-kz8.1
Address review thread on PR #4 (round 14): safeSpawn captured throws but ignored the boolean return from spawnFn. spawnDaemonDetachedReal returns false when process.argv[1] is missing (can happen with unusual exec environments); a test/alternate spawn hook can also signal failure that way. The kick path then went silently into the void — exactly the silent-failure mode the rest of this code is designed to avoid. safeSpawn now writes a terse stderr warning when spawnFn returns false. Both failure modes (throw and false-return) are now visible. Refs: brandon-process-discipline-kz8.1
b3e41db to
2643c35
Compare
…ust truncate (5mi) (#99) Sheriff finding #4 [LAW:one-source-of-truth]: the red-error FG/BG/RESET literals were hand-duplicated across error-glyph.ts, server.ts's composeWithDiagnostics, and the Rust mirror, with a false "avoid daemon->client coupling" comment justifying the copies. - New leaf src/render/diagnostic-style.ts owns the diagnostic visual identity (error trio + warning pair + reset); error-glyph.ts and server.ts import it — the same daemon->render direction already blessed for diagnostic-text.ts. The false-dichotomy comments are gone. - check-protocol's three style rows repoint to the leaf in this same commit (missing anchor fails loudly); PREFIX/MAX_MESSAGE_LEN stay anchored in error-glyph.ts (glyph shape, not duplicated in TS). - Folded divergence: Rust truncate now collapses whitespace runs and trims like TS sanitizeAndTruncate, making the byte-identical mirror claim true for multi-control inputs ("\r\n" -> one space in both). The Rust test that pinned the divergence is corrected [LAW:behavior-not-structure], and both suites now pin the same collapse/trim fixtures (incl. the U+FEFF edge JS \s implies). Verified: typecheck, lint, check:protocol (12/12), jest glyph/diagnostic /render-cache suites (31), cargo test (19) all green; residue grep shows no raw style literals outside the leaf and the enforced Rust mirror. Co-authored-by: elton-prawn <264370887+elton-prawn@users.noreply.github.com>
Summary
Replaces the pidfile-EEXIST race in
acquirePidfile()with bind()-based exclusion: two daemons cannot both bind the same Unix socket path, so the kernel makes duplicate-daemon unrepresentable at the syscall level.Three previously-independent daemon spawn sites (Rust client failure path, Node fallback, install postlude) collapse onto a single
obtainDaemon/obtain_daemon_kickprimitive per runtime. Caller-side flock-based spawn dedup is a thundering-herd optimization on top of the load-bearing bind().Test plan
test/daemon-acquire.test.tsadds 6 targeted testsnode dist/index.mjs daemonspawns into a fresh state dir → exactly 1 daemon survivespnpm typecheckcleanpnpm check:protocolagrees (PROTOCOL_VERSION=2)cargo build --releasecleangrep -rn 'acquirePidfile\|spawnDaemonDetached\|spawn_detached_daemon' src/ rust-client/src/returns only the migration-history commentUniversal-law citations
[LAW:single-enforcer][LAW:dataflow-not-control-flow][LAW:no-defensive-null-guards][LAW:one-type-per-behavior]Tracking
Refs:
brandon-process-discipline-kz8.1(child ofkz8epic: process-spawn pileup).Next in the epic (blocked by this landing): kz8.2 single launch boundary + spawn metering.