Compendium: 3 production-observed defects on 0.42.34.0 (single deployment, 386K-page brain)
Filing as a consolidated report from a live, heavily-loaded deployment (386,190 pages, 2.6M typed edges, ~14K page-writes/day, Supabase split-pool + PgBouncer transaction-mode on :6543, multi-source federation: default + straylight-brain). All three are reproducible here and are observability/liveness gaps rather than data-corruption bugs. Version 0.42.34.0 @ 099d9a8f.
Cross-links to adjacent open issues noted per-bug; none of these fully covers the specific failure below.
BUG 1 — Stale gbrain_cycle_locks rows held by dead PIDs are never reaped except on-contention; a crashed sync/cycle silently blocks that source until something else happens to contend
Observed: After a worker recycle, gbrain_cycle_locks retained rows whose holder_pid was a dead process:
gbrain-sync:straylight-brain pid=737217 (DEAD) held=00:36:12
gbrain-cycle:straylight-brain pid=746387 (DEAD) held=00:06:59
737217 exited >36 min ago — past the 5-min LOCK_TTL_MS (cycle.ts:17) and past the 30-min sync TTL — yet the row persists. kill -0 737217 → no such process.
Root cause: lock reclaim is lazy / on-contention only. tryAcquireDbLock (cycle.ts:77) steals a row only when a new acquirer shows up AND (age > TTL OR pid not alive on this host) — see the comment at cycle.ts:90 (...older than LOCK_TTL_MS OR the PID is no longer alive on this host). There is no background sweep. So when a sync for source S crashes (worker killed, OOM, recycle), its gbrain-sync:S row strands. The next sync of S will eventually steal it — but until then gbrain status shows the source as "syncing," and any liveness/health probe keyed on lock presence reports a phantom in-flight cycle. With per-source locks, a low-traffic source can stay falsely "locked" for hours.
Why TTL doesn't save you: TTL expiry is also only evaluated at acquire time. A dead-pid lock with an expired TTL is still a live row in the table until contention; nothing deletes it proactively.
Ask: a periodic reaper (supervisor tick, or worker idle-loop) that DELETE FROM gbrain_cycle_locks WHERE holder_host = <this host> AND NOT pid_alive(holder_pid) — host-scoped so it's safe with multi-host. Cheap, removes the phantom-lock class entirely.
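A minimal sketch of the selection logic such a reaper could use, not taken from the gbrain source. The LockRow shape, the id column, and the names isPidAlive / findReapableLocks are hypothetical; only holder_host and holder_pid come from the report above. The liveness probe uses process.kill(pid, 0), which checks for existence without delivering a signal.

```typescript
// Hypothetical row shape: holder_host / holder_pid follow the report,
// the `id` field is an assumption about how rows are keyed.
interface LockRow {
  id: string;
  holder_host: string;
  holder_pid: number;
}

// Signal 0 performs the existence check without delivering a signal.
function isPidAlive(pid: number): boolean {
  // Guard: pid 0 and negative pids address process groups, not one process.
  if (!Number.isInteger(pid) || pid <= 0) return false;
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user: still alive.
    return (err as { code?: string }).code === "EPERM";
  }
}

// Returns ids of rows owned by this host whose holder process is gone.
// Rows from other hosts are skipped because their PIDs cannot be probed here.
function findReapableLocks(
  rows: LockRow[],
  thisHost: string,
  pidAlive: (pid: number) => boolean = isPidAlive,
): string[] {
  return rows
    .filter((row) => row.holder_host === thisHost && !pidAlive(row.holder_pid))
    .map((row) => row.id);
}
```

The returned ids would feed the host-scoped DELETE. A recycled PID can make a dead holder look alive, so the existing TTL check would still be needed as a backstop.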
Adjacent: #1470 (runCycle swallows lock.release() errors, stranding rows — same table, the release side of this same gap), #1591 (--break-lock parity — the manual workaround that exists because there's no auto-reaper).
BUG 2 — engine.disconnect() hangs ≥10s on every one-shot CLI invocation against a PgBouncer transaction-mode pooler; only the hard-deadline force-exit saves it
Observed: every one-shot gbrain query ... (and other short CLI commands) ends with:
[cli] engine.disconnect() did not return within 10000ms — force-exiting
100% reproducible. The command's actual stdout (e.g. relational-query results from the new #1959 path) is swallowed by / races with the force-exit banner, so one-shot CLI output is effectively broken for piping/capture.
Root cause: ConnectionManager.disconnect() (connection-manager.ts:402) awaits this._directPool.end() then this._readPool.end(). Against PgBouncer in transaction mode (:6543, the documented default — prepared statements already disabled for this reason), pool.end() waits for a graceful drain that doesn't resolve the way it does against a session-mode server, so it blocks until the 10s DISCONNECT_HARD_DEADLINE_MS (cli.ts:371) fires. The hard-deadline was added as a band-aid (cli.ts:1933) but the underlying end() never completing means: (a) 10s added latency to every short CLI call, (b) output races the warning.
Ask: in transaction-pooler mode, disconnect() should pool.end() with a short internal timeout and fall back to destroying sockets (or skip graceful drain entirely for one-shot CLI — the kernel reclaims sockets on exit anyway, per the note in timeout.ts:13). Either way the 10s tax + output race should disappear.
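One way the short internal timeout could look, as a sketch under assumptions: EndablePool is a hypothetical minimal interface (the only thing assumed about the real pool is an end() that returns a promise), and endWithTimeout is an illustrative name, not an existing gbrain helper. What to do after a timeout (destroy sockets, or just exit) is left to the caller.

```typescript
// Hypothetical minimal pool shape for this sketch; the real pool type may differ.
interface EndablePool {
  end(): Promise<void>;
}

type EndOutcome = "drained" | "timed-out";

// Races pool.end() against a timer. Resolves "drained" if end() settles first
// and "timed-out" otherwise. A rejection from end() propagates to the caller.
// The timer is always cleared so it cannot keep the event loop alive.
async function endWithTimeout(
  pool: EndablePool,
  timeoutMs: number,
): Promise<EndOutcome> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<EndOutcome>((resolve) => {
    timer = setTimeout(() => resolve("timed-out"), timeoutMs);
  });
  try {
    return await Promise.race([
      pool.end().then((): EndOutcome => "drained"),
      timeout,
    ]);
  } finally {
    clearTimeout(timer);
  }
}
```

With a budget well under the 10s hard deadline, a one-shot CLI call could flush stdout, attempt the drain, and exit on "timed-out" without printing the force-exit banner.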
Adjacent cluster (all disconnect-lifecycle, none is this exact transaction-mode drain-hang): #1499 (lint-phase disconnect kills shared pool), #1729 (dream disconnects singleton mid-cycle), #1745 ("connect() has not been called" reproduces), #1887 (post-print write-back races connection teardown), #1617 (disconnect-call audit failures).
BUG 3 — #1737 cooperative-abort is incomplete: handlers still hit "handler ignored abort signal (force-evicted)", and stall-death accounting under-counts
Observed (24h, this deployment): of 116 dead autopilot-cycle jobs, the error breakdown was:
[113x] max stalled count exceeded
[2x] handler ignored abort signal (force-evicted)
[2x] aborted: watchdog
#1737 (shipped 0.42.29) threaded AbortSignal through embed-backfill → autopilot-cycle → runPhaseEmbed and is a real improvement (new-death rate dropped to ~0 after upgrading + recycle). But the 30s universal grace-eviction (worker.ts:758) still fires handler ignored abort signal (force-evicted) (worker.ts:873) — meaning some phase inside the cycle is not checking the signal between batches, so the cooperative bail isn't fully wired on every code path the cycle can take. The force-evict is the safety net catching the gap, not the gap being closed.
Ask: audit the remaining in-cycle phases reachable from autopilot-cycle for throwIfAborted() / isAborted() check points between units of work (the src/core/abort-check.ts helpers from #1737 exist — they just aren't called everywhere). Any phase that can run >30s without a check will keep tripping the force-evict. Candidates to verify: the non-embed phases (lint, extract, backlinks, consolidate) that #1737 didn't explicitly thread.
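For reference, the kind of check point the audit would look for, sketched with the standard AbortSignal.throwIfAborted() rather than the src/core/abort-check.ts helpers (whose signatures are not shown here). runPhaseInBatches and its parameters are hypothetical names for illustration.

```typescript
// Hypothetical phase loop: the batch shape and handler are illustrative only.
// Returns the number of batches that completed before any abort.
async function runPhaseInBatches<T>(
  batches: T[][],
  handleBatch: (batch: T[]) => Promise<void>,
  signal: AbortSignal,
): Promise<number> {
  let completed = 0;
  for (const batch of batches) {
    // Check point between units of work: throws signal.reason once aborted.
    signal.throwIfAborted();
    await handleBatch(batch);
    completed += 1;
  }
  return completed;
}
```

A loop like this bails within one batch of the abort, so it only stays inside the 30s grace window if a single batch is bounded well below 30s. A phase with one long-running unit would need the signal passed into that unit as well.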
Adjacent: #1678 (RSS-watchdog crash-loop / mis-reported deaths — same job, the watchdog side), #1470 (lock-release on the killed cycle).
Environment / repro context
- 0.42.34.0 @ 099d9a8, single host, Linux x64, Bun, Node v22.
- Supabase split-pool (direct :5432 + read :6543), PgBouncer transaction mode, GBRAIN_PREPARE unset (auto-disabled on 6543).
- Federation: default + straylight-brain sources.
- Bugs 1 & 3 surface under recycle/crash; Bug 2 is 100% deterministic on any one-shot CLI call.
Happy to provide pg_stat_activity snapshots, lock-table dumps, or run a patched build against this brain to verify fixes.