Skip to content

fix(daemon): enforce single-daemon invariant via atomic bind() (kz8.1)#4

Merged
brandon-fryslie merged 16 commits into
mainfrom
bf/single-daemon-flock
May 16, 2026
Merged

fix(daemon): enforce single-daemon invariant via atomic bind() (kz8.1)#4
brandon-fryslie merged 16 commits into
mainfrom
bf/single-daemon-flock

Conversation

@brandon-fryslie

Copy link
Copy Markdown
Contributor

Summary

Replaces the pidfile-EEXIST race in acquirePidfile() with bind()-based exclusion: two daemons cannot both bind the same Unix socket path, so the kernel makes duplicate-daemon unrepresentable at the syscall level.

Three previously-independent daemon spawn sites (Rust client failure path, Node fallback, install postlude) collapse onto a single obtainDaemon/obtain_daemon_kick primitive per runtime. Caller-side flock-based spawn dedup is a thundering-herd optimization on top of the load-bearing bind().

Test plan

  • All 659 existing tests pass; new test/daemon-acquire.test.ts adds 6 targeted tests
  • Storm test (manual): 10 concurrent node dist/index.mjs daemon spawns into a fresh state dir → exactly 1 daemon survives
  • pnpm typecheck clean
  • pnpm check:protocol agrees (PROTOCOL_VERSION=2)
  • cargo build --release clean
  • grep -rn 'acquirePidfile\|spawnDaemonDetached\|spawn_detached_daemon' src/ rust-client/src/ returns only the migration-history comment

Universal-law citations

[LAW:single-enforcer] [LAW:dataflow-not-control-flow] [LAW:no-defensive-null-guards] [LAW:one-type-per-behavior]

Tracking

Refs: brandon-process-discipline-kz8.1 (child of kz8 epic: process-spawn pileup).

Next in the epic (blocked by this landing): kz8.2 single launch boundary + spawn metering.

Copilot AI review requested due to automatic review settings May 16, 2026 02:30

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR replaces the daemon single-instance mechanism from a pidfile-based mutex (with race potential) to a bind()-based invariant on the Unix socket path, and consolidates daemon “warm-up” spawning into a single obtain-daemon primitive in both Node and Rust.

Changes:

  • Introduces src/daemon/acquire.ts (obtainDaemon / obtainDaemonAsync) with a spawn-dedup “spawn.lock” optimization.
  • Updates the daemon to enforce singleton via atomic socket bind, and makes the pidfile diagnostic-only.
  • Removes the old detached spawn helper and adds targeted tests for the new acquire/bind behavior.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
test/daemon-acquire.test.ts Adds unit tests for obtain-daemon behavior and bind()-based singleton semantics.
src/install/index.ts Switches daemon warm-up from spawnDaemonDetached() to obtainDaemonAsync().
src/index.ts Switches daemon warm-up from spawnDaemonDetached() to obtainDaemonAsync() on daemon-miss.
src/daemon/spawn.ts Removes the old fire-and-forget spawn helper (replaced by obtain-daemon).
src/daemon/server.ts Replaces pidfile exclusion with bind()/EADDRINUSE arbitration + stale socket recovery; pidfile becomes diagnostic-only.
src/daemon/paths.ts Adds spawnLockPath() for caller-side spawn dedup.
src/daemon/acquire.ts New obtain-daemon primitive implementing connect-probe + spawn-lock + readiness polling.
rust-client/src/main.rs Replaces detached spawn fallback with a flock-gated obtain_daemon_kick() that mirrors the Node obtain-daemon flow.
Comments suppressed due to low confidence (1)

src/daemon/server.ts:136

  • In isSocketAlive, the 50ms timer treats a slow connect as "not alive". In the EADDRINUSE recovery path, a false-negative here leads to unlinking sockPath even though a daemon may actually be listening, which would make the live daemon unreachable via pathname. To make this safe, treat timeouts as "assume alive" (do not unlink), increase the timeout substantially, and/or require multiple failed probes before unlinking.
    sock.once("connect", () => done(true));
    sock.once("error", () => done(false));
    // 50ms is generous; localhost AF_UNIX connect is sub-ms when a listener
    // exists. Anything slower implies "no listener" in practice.
    setTimeout(() => done(false), 50).unref();
  });

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/daemon/server.ts Outdated
Comment thread src/daemon/server.ts Outdated
Comment thread src/daemon/acquire.ts Outdated
Comment thread src/daemon/acquire.ts Outdated
brandon-fryslie added a commit that referenced this pull request May 16, 2026
Address review threads on PR #4:

1. handleAddressInUse: silently ignoring unlinkSync failure left the system
   in the worst state — no daemon + stale socket blocking future starts.
   The unlink failure now logs an error and exits non-zero so the supervisor
   notices instead of seeing a clean exit. [LAW:no-defensive-null-guards]

2. isSocketAlive: added a `settled` guard + explicit clearTimeout on settle,
   mirroring canConnect in acquire.ts. The connect/error/timeout race could
   otherwise call destroy() and resolve() more than once.

3. tryAcquireSpawnLock: non-EEXIST openSync errors (EACCES, ENOTDIR) were
   silently mapped to "contended", causing obtainDaemon to spin until its
   deadline and return a misleading "timeout" reason. Widened the return
   type to `{ kind: "held" | "contended" | "error"; reason }`; unrecoverable
   errors surface as { kind: "failed", reason: "spawn-lock: ..." } and the
   call returns promptly. [LAW:no-silent-fallbacks]

4. Doc comments: rust-client mirror lives in main.rs (obtain_daemon section),
   not a separate acquire.rs.

New test verifies the error path returns < 500ms instead of spinning to the
2s deadline.

Refs: brandon-process-discipline-kz8.1
@brandon-fryslie brandon-fryslie requested a review from Copilot May 16, 2026 02:39

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

src/daemon/acquire.ts:80

  • spawnFn() is treated as a boolean-returning hook, but if the real spawn implementation (or a test hook) throws synchronously, obtainDaemon() will reject instead of returning { kind: "failed" }. Consider guarding the spawnFn() call and translating thrown errors into a failed result (with a reason) so callers consistently get an ObtainResult.
        if (await canConnect(socketPath(), settings.connectTimeoutMs)) {
          return { kind: "attached" };
        }
        if (!spawnFn()) {
          return { kind: "failed", reason: "spawn returned false" };
        }
        // Poll for the new daemon to bind.

Comment thread src/index.ts Outdated
Comment thread src/install/index.ts
Comment thread rust-client/src/main.rs Outdated
Comment thread src/daemon/acquire.ts Outdated
Comment thread src/daemon/acquire.ts Outdated
brandon-fryslie added a commit that referenced this pull request May 16, 2026
Address review threads from second Copilot pass on PR #4:

1. obtainDaemonAsync was async-first — the first await (connect probe)
   suspended before child_process.spawn ran, and the caller's immediate
   process.exit(0) killed the process before the await could resume. The
   daemon was never warmed in the daemon-miss path. Replaced with a
   synchronous obtainDaemonKick(opts?) that does lock + spawn + release in
   one synchronous turn. Switched both fire-and-forget callers
   (src/index.ts, src/install/index.ts) to it.

2. Cross-runtime spawn-lock semantics were inconsistent: Rust used
   flock-on-persistent-file (leaves the file behind on release); Node used
   open(path, "wx") (existence == held). A Rust release left the file in
   place, which Node then saw as EEXIST and treated as contention until
   the 10s staleness window expired. Aligned both runtimes on
   existence-as-lock: Rust now uses OpenOptions::create_new(true) +
   explicit remove_file on release. Removed AsRawFd import and the
   libc::flock call path. [LAW:one-type-per-behavior]

3. obtainDaemon is typed Promise<ObtainResult>, but synchronous setup
   (fs.mkdirSync) could throw, violating the non-throwing contract.
   Extracted ensureStateDir() that returns a reason string; obtainDaemon
   converts to { kind: "failed", reason }. obtainDaemonKick silently bails
   on the same setup failure (it's fire-and-forget with no caller to
   report to).

New tests: synchronous kick proven to complete before microtasks run;
state-dir setup failure surfaces as typed failed result (not throw);
kick does not double-spawn when lock is already held.

Storm test rebuilt + re-run: 10 concurrent `node dist/index.mjs daemon`
spawns into a fresh state dir → exactly 1 daemon survives.

Refs: brandon-process-discipline-kz8.1
@brandon-fryslie brandon-fryslie requested a review from Copilot May 16, 2026 02:51

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

rust-client/src/main.rs:383

  • The obtain-daemon section comment refers to "flock on spawn.lock" and says the lock auto-releases on process exit via advisory locks. The implementation now uses a lock file created with create_new(true) and released by unlinking (with time-based staleness reclaim on crash), so the comment is no longer accurate. Please revise this comment to reflect the new existence-as-lock behavior.
// [LAW:single-enforcer] One entry point on the Rust side for "obtain a
// daemon." The atomic bind() inside the daemon is the load-bearing exclusion;
// this flock on spawn.lock is a thundering-herd optimization that prevents N
// concurrent clients from each forking a Node process when one would do.
//
// [LAW:dataflow-not-control-flow] The caller does not get to choose whether
// to spawn — it asks for a daemon, this function decides. The bind() inside
// the daemon ensures that even if this function spawns "redundantly" the
// duplicate daemon exits immediately at bind().
//
// Fire-and-forget shape: the current render is already lost (we hit
// obtain_daemon_kick on a render failure); we just want a daemon to be alive
// for the next refresh. The lock auto-releases when the client process exits
// (advisory locks vanish on close). Total work inside this fn is bounded by
// a few syscalls plus an optional fork+execve.

Comment thread src/daemon/server.ts Outdated
Comment thread rust-client/src/main.rs Outdated
brandon-fryslie added a commit that referenced this pull request May 16, 2026
Address review threads on PR #4 (round 3):

1. src/daemon/server.ts runDaemon header referenced obtainDaemon() as the
   client recovery path, but callers actually use obtainDaemonKick() (the
   synchronous variant added in the previous commit). Comment now lists
   both entry points with their semantics: kick = fire-and-forget,
   obtainDaemon = caller waits for readiness.

2. rust-client/src/main.rs header and the obtain-daemon-primitive section
   header both still described the spawn dedup as "flock-gated", but
   flock was replaced with existence-as-lock (open with O_CREAT | O_EXCL,
   release by unlink) to align with Node's primitive. Comments now
   accurately describe the cross-runtime lock mechanism.

Doc-only — no behavior change.

Refs: brandon-process-discipline-kz8.1
@brandon-fryslie brandon-fryslie requested a review from Copilot May 16, 2026 02:58

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Comment thread rust-client/src/main.rs Outdated
Comment thread src/daemon/server.ts
Comment thread src/daemon/acquire.ts Outdated
brandon-fryslie added a commit that referenced this pull request May 16, 2026
…lock release

Address review threads on PR #4 (round 4):

1. writePidfileDiagnostic: fs.writeFileSync({mode: 0o600}) only applies the
   mode on file creation, leaving a stale pidfile with broader permissions
   from a prior run untouched. Added explicit fs.chmodSync(pidPath(), 0o600)
   after the write so 0600 is the invariant regardless of prior state.

2. canConnect (acquire.ts): had a `settled` guard preventing double-resolve
   but never cleared the timeout, so under tight polling loops (obtainDaemon
   calls it repeatedly) no-op timers accumulated until the daemon attached
   or failed. Matches isSocketAlive's pattern: keep the timer handle, clear
   it on settle.

3. obtain_daemon_kick doc comment in Rust still claimed "the lock auto-
   releases when the client process exits (advisory locks vanish on close)" —
   residual flock-era wording. Comment now accurately describes the
   existence-as-lock release (explicit remove_file) and the crash-recovery
   path (time-based staleness reclaim for files older than 10s).

Refs: brandon-process-discipline-kz8.1
@brandon-fryslie brandon-fryslie requested a review from Copilot May 16, 2026 03:05

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Comment thread src/daemon/acquire.ts Outdated
brandon-fryslie added a commit that referenced this pull request May 16, 2026
Address review thread on PR #4 (round 5):

obtainDaemon is typed Promise<ObtainResult>, but spawnFn() was invoked
without a try/catch. A synchronous throw from child_process.spawn (ENOENT
for the node binary, invalid options, etc.) — or a test-provided spawn
that throws — would reject the promise instead of returning the typed
failure result.

Wrapped the spawnFn() invocation in try/catch and converted thrown errors
to { kind: "failed", reason: \`spawn threw: \${message}\` }. The existing
finally{} block already released the spawn lock on throw — only the
typed-conversion was missing. Same shape as the round-2 fix for mkdirSync.

New test injects a throwing spawn fn and asserts the typed-failure result
instead of a rejected promise.

Refs: brandon-process-discipline-kz8.1
@brandon-fryslie brandon-fryslie requested a review from Copilot May 16, 2026 03:10

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Comment thread src/daemon/server.ts
Comment thread src/daemon/acquire.ts
brandon-fryslie added a commit that referenced this pull request May 16, 2026
…ilability

Address review threads on PR #4 (round 6):

1. server.ts handleAddressInUse: between `isSocketAlive` returning false and
   `unlinkSync` running, a concurrent recoverer could unlink+bind the path;
   our unlink would then remove their live socket, leaving two daemons.
   Added a second isSocketAlive check immediately before unlink — if a
   live listener appeared between checks, we exit instead of stomping on
   it. Race window goes from O(handler latency) to sub-microsecond.

2. acquire.ts: STALE_LOCK_MS (10s) was significantly longer than the
   default totalTimeoutMs (2s), so a crashed spawn-lock holder could block
   new daemon spawns for up to 10s — making the spawn-lock load-bearing
   for availability. [LAW:dataflow-not-control-flow] says the lock is an
   optimization, bind() is the load-bearing exclusion. Added
   `lockFallbackMs` (default totalTimeoutMs / 2): after that much
   contention without a daemon appearing, bypass the lock and spawn
   anyway. The bind() inside the daemon arbitrates duplicates, so a stuck
   lock degrades to bind-arbitrated contention instead of a multi-second
   gap.

Extracted shared spawn+poll logic into `spawnAndWaitForReady` so the
lock-held and lock-fallback paths can't drift. Both go through the same
helper; reasonSuffix tags the failure with " (lock-fallback)" for
diagnostics when the fallback fires.

New test exercises the fallback path: plants a fresh (non-stale) lock
and asserts spawn fires after lockFallbackMs (200ms in the test).

Storm test re-run: 10 concurrent daemon spawns → 1 survivor.

Refs: brandon-process-discipline-kz8.1
@brandon-fryslie brandon-fryslie requested a review from Copilot May 16, 2026 03:20

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Comment thread src/daemon/acquire.ts
brandon-fryslie added a commit that referenced this pull request May 16, 2026
Address review thread on PR #4 (round 7):

obtainDaemonKick treated any non-"held" lock outcome (including
unrecoverable errors like EACCES on spawn.lock) as an immediate silent
return, which made the spawn-lock load-bearing for availability — exactly
what [LAW:dataflow-not-control-flow] forbids. spawn.lock is an
optimization; bind() inside the daemon is the load-bearing exclusion.

Split the kick's lock-outcome handling into three explicit branches:
- "contended": return silently (another caller is spawning)
- "error": emit a terse stderr warning and spawn anyway; bind() arbitrates
- "held": normal locked-spawn path

Extracted `safeSpawn` helper so the throw-capture behavior is shared
between the locked and unlocked branches (no drift).

New test exercises the lock-error fallback: makes daemonDir read-only so
openSync("wx") returns EACCES, asserts the kick spawns anyway + emits
the stderr warning.

Refs: brandon-process-discipline-kz8.1
@brandon-fryslie brandon-fryslie requested a review from Copilot May 16, 2026 03:27
brandon-fryslie added a commit that referenced this pull request May 16, 2026
Address review thread on PR #4 (round 13):

`Promise.resolve(fn(...)).finally(cleanup)` does NOT catch synchronous
throws from `fn` — the throw propagates before `Promise.resolve` runs,
bypassing the finally. Tests that triggered a sync throw inside the
helper would leak XDG_STATE_HOME state and temp directories.

Refactored to async/await with try-finally so cleanup runs
unconditionally regardless of how `fn` fails.

Refs: brandon-process-discipline-kz8.1
@brandon-fryslie brandon-fryslie requested a review from Copilot May 16, 2026 04:09

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

src/daemon/server.ts:337

  • The shutdown() comment still refers to a “singleton mutex” (“…the invariant the singleton mutex relies on”), but the singleton enforcer is now the atomic bind() on the Unix socket. Consider updating this wording to reference the bind/socket-singleton invariant to avoid confusing future maintenance.
  // [LAW:single-enforcer] Exactly one path out of the process. Backstop with
  // SIGKILL because we previously observed shut-down daemons staying alive in
  // uv__io_poll — something (event loop handle, swallowed exception in a
  // post-end log write) was preventing process.exit from actually firing.
  // The hard kill makes "shutdown was called" mechanically equivalent to
  // "process is gone", which is the invariant the singleton mutex relies on.
  const kill = setTimeout(() => process.kill(process.pid, "SIGKILL"), 500);

Comment thread src/daemon/acquire.ts
brandon-fryslie added a commit that referenced this pull request May 16, 2026
Address review thread on PR #4 (round 14):

safeSpawn captured throws but ignored the boolean return from spawnFn.
spawnDaemonDetachedReal returns false when process.argv[1] is missing
(can happen with unusual exec environments); a test/alternate spawn
hook can also signal failure that way. The kick path then went
silently into the void — exactly the silent-failure mode the rest of
this code is designed to avoid.

safeSpawn now writes a terse stderr warning when spawnFn returns
false. Both failure modes (throw and false-return) are now visible.

Refs: brandon-process-discipline-kz8.1
@brandon-fryslie brandon-fryslie requested a review from Copilot May 16, 2026 04:15

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.

The pidfile-EEXIST check in acquirePidfile() was a postcondition race —
N concurrent spawners would each pay full Node startup, then N-1 lose
the EEXIST check after expensive module init. Under multi-session
Claude Code load this produced spawn-storms on every daemon restart.

Replace with bind()-based exclusion: two daemons cannot both bind() the
same Unix socket path. The kernel makes duplicate-daemon unrepresentable
at the syscall level. bindOrAttachAndExit() distinguishes "live daemon
listening" (connect succeeds → exit cleanly) from "stale socket file
from crashed daemon" (connect fails → unlink + rebind once).

Add obtainDaemon()/obtain_daemon_kick() primitives — one per runtime —
that collapse three previously-independent spawn sites (Rust client on
failure, Node fallback, install postlude) onto a single boundary. Each
runtime uses native primitives for the caller-side spawn dedup:
- Rust: libc::flock(LOCK_EX | LOCK_NB) on spawn.lock (auto-release on
  process exit)
- Node: wx-create on spawn.lock with time-based staleness reclaim

The flock is a thundering-herd optimization; the load-bearing exclusion
is the bind() inside the daemon. Even if a "redundant" daemon slips past
the lock it exits immediately at bind().

Verified: 10 concurrent `node dist/index.mjs daemon` spawns into a
fresh state dir → exactly 1 daemon survives.

[LAW:single-enforcer] [LAW:dataflow-not-control-flow]
[LAW:no-defensive-null-guards] [LAW:one-type-per-behavior]

Refs: brandon-process-discipline-kz8.1
Address review threads on PR #4:

1. handleAddressInUse: silently ignoring unlinkSync failure left the system
   in the worst state — no daemon + stale socket blocking future starts.
   The unlink failure now logs an error and exits non-zero so the supervisor
   notices instead of seeing a clean exit. [LAW:no-defensive-null-guards]

2. isSocketAlive: added a `settled` guard + explicit clearTimeout on settle,
   mirroring canConnect in acquire.ts. The connect/error/timeout race could
   otherwise call destroy() and resolve() more than once.

3. tryAcquireSpawnLock: non-EEXIST openSync errors (EACCES, ENOTDIR) were
   silently mapped to "contended", causing obtainDaemon to spin until its
   deadline and return a misleading "timeout" reason. Widened the return
   type to `{ kind: "held" | "contended" | "error"; reason }`; unrecoverable
   errors surface as { kind: "failed", reason: "spawn-lock: ..." } and the
   call returns promptly. [LAW:no-silent-fallbacks]

4. Doc comments: rust-client mirror lives in main.rs (obtain_daemon section),
   not a separate acquire.rs.

New test verifies the error path returns < 500ms instead of spinning to the
2s deadline.

Refs: brandon-process-discipline-kz8.1
Address review threads from second Copilot pass on PR #4:

1. obtainDaemonAsync was async-first — the first await (connect probe)
   suspended before child_process.spawn ran, and the caller's immediate
   process.exit(0) killed the process before the await could resume. The
   daemon was never warmed in the daemon-miss path. Replaced with a
   synchronous obtainDaemonKick(opts?) that does lock + spawn + release in
   one synchronous turn. Switched both fire-and-forget callers
   (src/index.ts, src/install/index.ts) to it.

2. Cross-runtime spawn-lock semantics were inconsistent: Rust used
   flock-on-persistent-file (leaves the file behind on release); Node used
   open(path, "wx") (existence == held). A Rust release left the file in
   place, which Node then saw as EEXIST and treated as contention until
   the 10s staleness window expired. Aligned both runtimes on
   existence-as-lock: Rust now uses OpenOptions::create_new(true) +
   explicit remove_file on release. Removed AsRawFd import and the
   libc::flock call path. [LAW:one-type-per-behavior]

3. obtainDaemon is typed Promise<ObtainResult>, but synchronous setup
   (fs.mkdirSync) could throw, violating the non-throwing contract.
   Extracted ensureStateDir() that returns a reason string; obtainDaemon
   converts to { kind: "failed", reason }. obtainDaemonKick silently bails
   on the same setup failure (it's fire-and-forget with no caller to
   report to).

New tests: synchronous kick proven to complete before microtasks run;
state-dir setup failure surfaces as typed failed result (not throw);
kick does not double-spawn when lock is already held.

Storm test rebuilt + re-run: 10 concurrent `node dist/index.mjs daemon`
spawns into a fresh state dir → exactly 1 daemon survives.

Refs: brandon-process-discipline-kz8.1
Address review threads on PR #4 (round 3):

1. src/daemon/server.ts runDaemon header referenced obtainDaemon() as the
   client recovery path, but callers actually use obtainDaemonKick() (the
   synchronous variant added in the previous commit). Comment now lists
   both entry points with their semantics: kick = fire-and-forget,
   obtainDaemon = caller waits for readiness.

2. rust-client/src/main.rs header and the obtain-daemon-primitive section
   header both still described the spawn dedup as "flock-gated", but
   flock was replaced with existence-as-lock (open with O_CREAT | O_EXCL,
   release by unlink) to align with Node's primitive. Comments now
   accurately describe the cross-runtime lock mechanism.

Doc-only — no behavior change.

Refs: brandon-process-discipline-kz8.1
…lock release

Address review threads on PR #4 (round 4):

1. writePidfileDiagnostic: fs.writeFileSync({mode: 0o600}) only applies the
   mode on file creation, leaving a stale pidfile with broader permissions
   from a prior run untouched. Added explicit fs.chmodSync(pidPath(), 0o600)
   after the write so 0600 is the invariant regardless of prior state.

2. canConnect (acquire.ts): had a `settled` guard preventing double-resolve
   but never cleared the timeout, so under tight polling loops (obtainDaemon
   calls it repeatedly) no-op timers accumulated until the daemon attached
   or failed. Matches isSocketAlive's pattern: keep the timer handle, clear
   it on settle.

3. obtain_daemon_kick doc comment in Rust still claimed "the lock auto-
   releases when the client process exits (advisory locks vanish on close)" —
   residual flock-era wording. Comment now accurately describes the
   existence-as-lock release (explicit remove_file) and the crash-recovery
   path (time-based staleness reclaim for files older than 10s).

Refs: brandon-process-discipline-kz8.1
Address review thread on PR #4 (round 5):

obtainDaemon is typed Promise<ObtainResult>, but spawnFn() was invoked
without a try/catch. A synchronous throw from child_process.spawn (ENOENT
for the node binary, invalid options, etc.) — or a test-provided spawn
that throws — would reject the promise instead of returning the typed
failure result.

Wrapped the spawnFn() invocation in try/catch and converted thrown errors
to { kind: "failed", reason: \`spawn threw: \${message}\` }. The existing
finally{} block already released the spawn lock on throw — only the
typed-conversion was missing. Same shape as the round-2 fix for mkdirSync.

New test injects a throwing spawn fn and asserts the typed-failure result
instead of a rejected promise.

Refs: brandon-process-discipline-kz8.1
…ilability

Address review threads on PR #4 (round 6):

1. server.ts handleAddressInUse: between `isSocketAlive` returning false and
   `unlinkSync` running, a concurrent recoverer could unlink+bind the path;
   our unlink would then remove their live socket, leaving two daemons.
   Added a second isSocketAlive check immediately before unlink — if a
   live listener appeared between checks, we exit instead of stomping on
   it. Race window goes from O(handler latency) to sub-microsecond.

2. acquire.ts: STALE_LOCK_MS (10s) was significantly longer than the
   default totalTimeoutMs (2s), so a crashed spawn-lock holder could block
   new daemon spawns for up to 10s — making the spawn-lock load-bearing
   for availability. [LAW:dataflow-not-control-flow] says the lock is an
   optimization, bind() is the load-bearing exclusion. Added
   `lockFallbackMs` (default totalTimeoutMs / 2): after that much
   contention without a daemon appearing, bypass the lock and spawn
   anyway. The bind() inside the daemon arbitrates duplicates, so a stuck
   lock degrades to bind-arbitrated contention instead of a multi-second
   gap.

Extracted shared spawn+poll logic into `spawnAndWaitForReady` so the
lock-held and lock-fallback paths can't drift. Both go through the same
helper; reasonSuffix tags the failure with " (lock-fallback)" for
diagnostics when the fallback fires.

New test exercises the fallback path: plants a fresh (non-stale) lock
and asserts spawn fires after lockFallbackMs (200ms in the test).

Storm test re-run: 10 concurrent daemon spawns → 1 survivor.

Refs: brandon-process-discipline-kz8.1
Address review thread on PR #4 (round 7):

obtainDaemonKick treated any non-"held" lock outcome (including
unrecoverable errors like EACCES on spawn.lock) as an immediate silent
return, which made the spawn-lock load-bearing for availability — exactly
what [LAW:dataflow-not-control-flow] forbids. spawn.lock is an
optimization; bind() inside the daemon is the load-bearing exclusion.

Split the kick's lock-outcome handling into three explicit branches:
- "contended": return silently (another caller is spawning)
- "error": emit a terse stderr warning and spawn anyway; bind() arbitrates
- "held": normal locked-spawn path

Extracted `safeSpawn` helper so the throw-capture behavior is shared
between the locked and unlocked branches (no drift).

New test exercises the lock-error fallback: makes daemonDir read-only so
openSync("wx") returns EACCES, asserts the kick spawns anyway + emits
the stderr warning.

Refs: brandon-process-discipline-kz8.1
…rrides stale-fresh lock

Address review threads on PR #4 (round 8):

1. acquire.ts obtainDaemonKick: on contention, the kick returned silently
   even if the lock holder had crashed mid-spawn. The lock would stay
   "fresh" until STALE_LOCK_MS (10s), creating a 10-second availability
   gap. Added KICK_CONTENDED_OVERRIDE_MS (2000ms) — when the spawn.lock
   file's age exceeds this, the kick assumes the holder crashed and
   spawns unlocked with a stderr warning. Threshold is calibrated above
   the slowest legitimate holder (obtainDaemon's lock-held path can hold
   ~1.6s for spawnReadyTimeoutMs=1500ms) and well above the kick's own
   typical hold (<10ms). [LAW:dataflow-not-control-flow]

2. server.ts isSocketAlive treated the 50ms connect timeout as
   "definitely no listener," which risked unlinking a real (but
   momentarily slow) daemon's socket. Introduced 3-state probeSocket:
   "alive" (connect succeeded), "dead" (only on ECONNREFUSED, ENOENT,
   or ENOTSOCK — definitive), "unknown" (timeout, EPERM, other errors).
   isSocketAlive treats "unknown" as alive, so the stale-socket
   recovery path won't unlink an ambiguous socket — the conservative
   choice avoids destructive false negatives.

3. Side effect of (2): "dead" now includes ENOENT from connect, so the
   socket file may genuinely be missing by the time we unlinkSync.
   Added ENOENT tolerance — a missing file already satisfies the goal
   ("make the path bindable"). Other unlink failures still hard-fail
   loudly.

Refs: brandon-process-discipline-kz8.1
…t file

Address review thread on PR #4 (round 9):

The "stale socket file" test name promised verification that connect
fails with ECONNREFUSED, but the body only checked EADDRINUSE +
unlink + rebind. Added an explicit net.connect probe assertion: a
plain-file at the socket path must surface one of the three
"definitively dead" codes that the round-8 probeSocket treats as
unlink-eligible (ENOTSOCK / ECONNREFUSED / ENOENT).

This now validates the full premise behind handleAddressInUse's
stale-socket recovery: the connect-probe outcome must be one of the
codes probeSocket maps to "dead" for the recovery path to actually
fire on plain-file stale sockets.

Refs: brandon-process-discipline-kz8.1
…leanup

Address review threads on PR #4 (round 10):

1. acquire.ts tryAcquireSpawnLock: stale-lock reclaim path treated any
   unlinkSync failure as a fatal lock error. ENOENT during the reclaim
   means someone (the rightful holder cleaning up, or another concurrent
   reclaimer) already removed the file — that IS the desired
   post-condition. Return to the retry loop so openSync can succeed
   instead of returning { kind: "error" }. Same shape as the
   ENOENT-on-unlink tolerance added to server.ts in round 8.

2. test/daemon-acquire.test.ts: removed the manual stderr.write
   capture-and-reassignment in the "spawns unlocked when lock acquisition
   errors" test. mockRestore() correctly restores the original; the
   captured-bound-and-reassigned wrapper was subtly replacing the
   restored original with a wrapper, which could affect other spies on
   process.stderr in the same test run.

Refs: brandon-process-discipline-kz8.1
…rity

Address review thread on PR #4 (round 11):

try_acquire_spawn_lock returned Option<PathBuf>, conflating contention
and error: any non-EEXIST openSync failure (EACCES, ENOTDIR) and any
unrecoverable unlink failure mapped to None, which obtain_daemon_kick
treated as "another caller is spawning, trust them." This made the
spawn-lock load-bearing for availability on the Rust path — exactly
the [LAW:dataflow-not-control-flow] violation we already fixed on
the Node side in PR rounds 7-8.

Introduced a 3-way LockOutcome enum (Held / Contended / Error) and
rewrote obtain_daemon_kick to handle each variant explicitly:
- Held: spawn under lock, release by unlink (unchanged)
- Contended: if spawn.lock age > 2s (KICK_CONTENDED_OVERRIDE_MS,
  Rust mirror of the Node constant), assume crashed holder and
  spawn unlocked with stderr warning; else silent return
- Error: stderr "spawn-lock unavailable (reason)" + spawn unlocked

ENOENT during stale-lock unlink is also treated as benign — the file
is already gone (desired post-condition), so we retry the openSync.
Mirrors the round-10 Node-side fix.

Both runtimes now have full feature parity for kick-path crash
recovery: same 2s contention threshold, same 10s STALE_LOCK, same
error-becomes-unlocked-spawn behavior. [LAW:one-type-per-behavior]

Refs: brandon-process-discipline-kz8.1
Address review thread on PR #4 (round 12):

fs.statSync(...).mode returns the file mode including file-type bits
(e.g. 0o040000 for directories). Passing that whole value back to
chmodSync is non-portable — some platforms accept permission bits
only. Both originalMode captures now mask with & 0o7777 (covers
standard rwx + sticky/setuid/setgid).

Refs: brandon-process-discipline-kz8.1
Address review thread on PR #4 (round 13):

`Promise.resolve(fn(...)).finally(cleanup)` does NOT catch synchronous
throws from `fn` — the throw propagates before `Promise.resolve` runs,
bypassing the finally. Tests that triggered a sync throw inside the
helper would leak XDG_STATE_HOME state and temp directories.

Refactored to async/await with try-finally so cleanup runs
unconditionally regardless of how `fn` fails.

Refs: brandon-process-discipline-kz8.1
Address review thread on PR #4 (round 14):

safeSpawn captured throws but ignored the boolean return from spawnFn.
spawnDaemonDetachedReal returns false when process.argv[1] is missing
(can happen with unusual exec environments); a test/alternate spawn
hook can also signal failure that way. The kick path then went
silently into the void — exactly the silent-failure mode the rest of
this code is designed to avoid.

safeSpawn now writes a terse stderr warning when spawnFn returns
false. Both failure modes (throw and false-return) are now visible.

Refs: brandon-process-discipline-kz8.1
@brandon-fryslie brandon-fryslie force-pushed the bf/single-daemon-flock branch from b3e41db to 2643c35 Compare May 16, 2026 15:19
@brandon-fryslie brandon-fryslie merged commit 93c2134 into main May 16, 2026
6 checks passed
@brandon-fryslie brandon-fryslie deleted the bf/single-daemon-flock branch May 16, 2026 15:21
brandon-fryslie added a commit that referenced this pull request Jun 10, 2026
…ust truncate (5mi) (#99)

Sheriff finding #4 [LAW:one-source-of-truth]: the red-error FG/BG/RESET
literals were hand-duplicated across error-glyph.ts, server.ts's
composeWithDiagnostics, and the Rust mirror, with a false "avoid
daemon->client coupling" comment justifying the copies.

- New leaf src/render/diagnostic-style.ts owns the diagnostic visual
  identity (error trio + warning pair + reset); error-glyph.ts and
  server.ts import it — the same daemon->render direction already
  blessed for diagnostic-text.ts. The false-dichotomy comments are gone.
- check-protocol's three style rows repoint to the leaf in this same
  commit (missing anchor fails loudly); PREFIX/MAX_MESSAGE_LEN stay
  anchored in error-glyph.ts (glyph shape, not duplicated in TS).
- Folded divergence: Rust truncate now collapses whitespace runs and
  trims like TS sanitizeAndTruncate, making the byte-identical mirror
  claim true for multi-control inputs ("\r\n" -> one space in both).
  The Rust test that pinned the divergence is corrected
  [LAW:behavior-not-structure], and both suites now pin the same
  collapse/trim fixtures (incl. the U+FEFF edge JS \s implies).

Verified: typecheck, lint, check:protocol (12/12), jest glyph/diagnostic
/render-cache suites (31), cargo test (19) all green; residue grep shows
no raw style literals outside the leaf and the enforced Rust mirror.

Co-authored-by: elton-prawn <264370887+elton-prawn@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants