Compendium (0.42.34.0): stale dead-PID cycle locks never reaped, disconnect() 10s hang on PgBouncer txn-mode CLI, and incomplete #1737 cooperative-abort #1972

@garrytan-agents


Compendium: 3 production-observed defects on 0.42.34.0 (single deployment, 386K-page brain)

Filing as a consolidated report from a live, heavily-loaded deployment (386,190 pages, 2.6M typed edges, ~14K page-writes/day, Supabase split-pool + PgBouncer transaction-mode on :6543, multi-source federation: default + straylight-brain). All three are reproducible here and are observability/liveness gaps rather than data-corruption bugs. Version 0.42.34.0 @ 099d9a8f.

Cross-links to adjacent open issues noted per-bug; none of these fully covers the specific failure below.


BUG 1 — Stale gbrain_cycle_locks rows held by dead PIDs are never reaped except on-contention; a crashed sync/cycle silently blocks that source until something else happens to contend

Observed: After a worker recycle, gbrain_cycle_locks retained rows whose holder_pid was a dead process:

gbrain-sync:straylight-brain   pid=737217 (DEAD)  held=00:36:12
gbrain-cycle:straylight-brain  pid=746387 (DEAD)  held=00:06:59

737217 exited >36 min ago — past the 5-min LOCK_TTL_MS (cycle.ts:17) and past the 30-min sync TTL — yet the row persists. kill -0 737217 → no such process.

Root cause: lock reclaim is lazy / on-contention only. tryAcquireDbLock (cycle.ts:77) steals a row only when a new acquirer shows up AND (age > TTL OR pid not alive on this host) — see the comment at cycle.ts:90 (...older than LOCK_TTL_MS OR the PID is no longer alive on this host). There is no background sweep. So when a sync for source S crashes (worker killed, OOM, recycle), its gbrain-sync:S row strands. The next sync of S will eventually steal it — but until then gbrain status shows the source as "syncing," and any liveness/health probe keyed on lock presence reports a phantom in-flight cycle. With per-source locks, a low-traffic source can stay falsely "locked" for hours.

Why TTL doesn't save you: TTL expiry is also only evaluated at acquire time. A dead-pid lock with an expired TTL is still a live row in the table until contention; nothing deletes it proactively.
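The acquire-time condition described above can be sketched as a pure predicate. This is an illustration under assumptions, not the code in cycle.ts: the row shape, the `isStealable` name, and the injected `alive` callback are all hypothetical.

```typescript
// Hypothetical row shape; the real gbrain_cycle_locks columns may differ.
interface HeldLock {
  holderPid: number;
  acquiredAtMs: number;
}

// Steal condition as described in the report: age > TTL OR holder pid not alive.
// The report's point is that this is evaluated only when a new acquirer arrives.
function isStealable(
  row: HeldLock,
  nowMs: number,
  ttlMs: number,
  alive: (pid: number) => boolean,
): boolean {
  return nowMs - row.acquiredAtMs > ttlMs || !alive(row.holderPid);
}

// A probe keyed on row presence alone never consults the predicate,
// so it keeps reporting the lock as held until someone contends.
function looksLockedToPresenceProbe(row: HeldLock | undefined): boolean {
  return row !== undefined;
}
```

With a row that is both expired and held by a dead pid, `isStealable` returns true while `looksLockedToPresenceProbe` still returns true, which is the phantom in-flight state observed here.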

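Ask: a periodic reaper (supervisor tick, or worker idle-loop) that runs the equivalent of DELETE FROM gbrain_cycle_locks WHERE holder_host = <this host> AND NOT pid_alive(holder_pid). pid_alive is pseudocode here: the liveness check would run in the reaper process, not in SQL. Scoping the delete to this host keeps it safe on multi-host deployments. It is cheap and removes the phantom-lock class entirely.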
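A minimal sketch of the selection step such a reaper might use, assuming a Node/Bun runtime. The `LockRow` shape and the `findReapableLocks` name are hypothetical. `process.kill(pid, 0)` is the standard way to probe for a process without delivering a signal.

```typescript
// Hypothetical row shape; the real gbrain_cycle_locks columns may differ.
interface LockRow {
  lockKey: string;
  holderHost: string;
  holderPid: number;
}

// Signal 0 delivers nothing; it only checks existence and permission.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user: treat as alive.
    return (err as { code?: string }).code === "EPERM";
  }
}

// Select only rows held on this host whose holder process is gone.
// Rows from other hosts are never touched, because their pids cannot be probed locally.
function findReapableLocks(
  rows: LockRow[],
  thisHost: string,
  alive: (pid: number) => boolean = isPidAlive,
): LockRow[] {
  return rows.filter((r) => r.holderHost === thisHost && !alive(r.holderPid));
}
```

The caller would then delete the returned rows by key in one statement. A pid can be reused by an unrelated process, so a real reaper might also want to compare a start time or token stored with the lock; that is outside this sketch.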

Adjacent: #1470 (runCycle swallows lock.release() errors, stranding rows — same table, the release side of this same gap), #1591 (--break-lock parity — the manual workaround that exists because there's no auto-reaper).


BUG 2 — engine.disconnect() hangs ≥10s on every one-shot CLI invocation against a PgBouncer transaction-mode pooler; only the hard-deadline force-exit saves it

Observed: every one-shot gbrain query ... (and other short CLI commands) ends with:

[cli] engine.disconnect() did not return within 10000ms — force-exiting

100% reproducible. The command's actual stdout (e.g. relational-query results from the new #1959 path) is swallowed by / races with the force-exit banner, so one-shot CLI output is effectively broken for piping/capture.

Root cause: ConnectionManager.disconnect() (connection-manager.ts:402) awaits this._directPool.end() then this._readPool.end(). Against PgBouncer in transaction mode (:6543, the documented default, with prepared statements already disabled for this reason), pool.end() waits for a graceful drain that doesn't resolve the way it does against a session-mode server, so it blocks until the 10s DISCONNECT_HARD_DEADLINE_MS (cli.ts:371) fires. The hard deadline was added as a band-aid (cli.ts:1933), but because the underlying end() never completes, (a) every short CLI call pays 10s of added latency and (b) the command's output races the warning.

Ask: in transaction-pooler mode, disconnect() should call pool.end() with a short internal timeout and fall back to destroying sockets (or skip the graceful drain entirely for one-shot CLI, since the kernel reclaims sockets on exit anyway, per the note in timeout.ts:13). Either way the 10s tax and the output race should disappear.
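One way the short internal timeout could look, as a sketch only. `Endable`, `endWithTimeout`, and the outcome strings are hypothetical names; the only assumption about the pool is that `end()` returns a promise, as described above.

```typescript
// Minimal stand-in for a pool: anything whose end() returns a promise.
interface Endable {
  end(): Promise<void>;
}

type EndOutcome = "drained" | "timed-out";

// Race the graceful drain against a short timer instead of waiting indefinitely.
async function endWithTimeout(pool: Endable, timeoutMs: number): Promise<EndOutcome> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<EndOutcome>((resolve) => {
    timer = setTimeout(() => resolve("timed-out"), timeoutMs);
  });
  try {
    const drained = pool.end().then((): EndOutcome => "drained");
    return await Promise.race([drained, timedOut]);
  } finally {
    // Always clear the timer so it cannot keep the process alive after a clean drain.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

On "timed-out" the caller would fall back to whatever forceful teardown the driver offers, or simply let the one-shot process exit. Flushing stdout before calling this would also take the command's output out of the race.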

Adjacent cluster (all disconnect-lifecycle, none is this exact transaction-mode drain-hang): #1499 (lint-phase disconnect kills shared pool), #1729 (dream disconnects singleton mid-cycle), #1745 ("connect() has not been called" reproduces), #1887 (post-print write-back races connection teardown), #1617 (disconnect-call audit failures).


BUG 3 — #1737 cooperative-abort is incomplete: handlers still hit "handler ignored abort signal (force-evicted)", and stall-death accounting under-counts

Observed (24h, this deployment): of 116 dead autopilot-cycle jobs, the error breakdown was:

[113x] max stalled count exceeded
  [2x] handler ignored abort signal (force-evicted)
  [2x] aborted: watchdog

#1737 (shipped 0.42.29) threaded AbortSignal through embed-backfill → autopilot-cycle → runPhaseEmbed and is a real improvement (new-death rate dropped to ~0 after upgrading + recycle). But the 30s universal grace-eviction (worker.ts:758) still fires handler ignored abort signal (force-evicted) (worker.ts:873) — meaning some phase inside the cycle is not checking the signal between batches, so the cooperative bail isn't fully wired on every code path the cycle can take. The force-evict is the safety net catching the gap, not the gap being closed.

Ask: audit the remaining in-cycle phases reachable from autopilot-cycle for throwIfAborted() / isAborted() checkpoints between units of work (the src/core/abort-check.ts helpers from #1737 exist; they just aren't called everywhere). Any phase that can run >30s without a check will keep tripping the force-evict. Candidates to verify: the non-embed phases (lint, extract, backlinks, consolidate) that #1737 didn't explicitly thread.
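The checkpoint pattern being asked for, sketched with the standard `AbortSignal.throwIfAborted()` rather than the project's own helpers in src/core/abort-check.ts, whose signatures are not shown here. `runBatches` and its parameters are hypothetical.

```typescript
// Process batches one at a time, checking the signal before each unit of work.
// Returns the number of batches completed; throws the signal's abort reason if aborted.
async function runBatches<T>(
  batches: T[],
  signal: AbortSignal,
  handle: (batch: T) => Promise<void>,
): Promise<number> {
  let completed = 0;
  for (const batch of batches) {
    // Checkpoint: bail out cooperatively instead of waiting for force-eviction.
    signal.throwIfAborted();
    await handle(batch);
    completed++;
  }
  return completed;
}
```

The worst-case delay before a phase notices an abort is the duration of one `handle` call, so batch size bounds how close a phase can get to the 30s grace window.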

Adjacent: #1678 (RSS-watchdog crash-loop / mis-reported deaths — same job, the watchdog side), #1470 (lock-release on the killed cycle).


Environment / repro context

  • 0.42.34.0 @ 099d9a8, single host, Linux x64, Bun, Node v22.
  • Supabase split-pool (direct :5432 + read :6543), PgBouncer transaction mode, GBRAIN_PREPARE unset (auto-disabled on 6543).
  • Federation: default + straylight-brain sources.
  • Bugs 1 & 3 surface under recycle/crash; Bug 2 is 100% deterministic on any one-shot CLI call.

Happy to provide pg_stat_activity snapshots, lock-table dumps, or run a patched build against this brain to verify fixes.
