Compendium (0.42.34.0): stale dead-PID cycle locks never reaped, disconnect() 10s hang on PgBouncer txn-mode CLI, and incomplete #1737 cooperative-abort

## Compendium: 3 production-observed defects on 0.42.34.0 (single deployment, 386K-page brain)

Filing as a consolidated report from a live, heavily-loaded deployment (386,190 pages, 2.6M typed edges, ~14K page-writes/day, Supabase split-pool + PgBouncer transaction-mode on `:6543`, multi-source federation: `default` + `straylight-brain`). All three are reproducible here and are observability/liveness gaps rather than data-corruption bugs. Version `0.42.34.0` @ `099d9a8f`.

Adjacent open issues are cross-linked under each bug; none of them fully covers the specific failure described.

---

### BUG 1 — Stale `gbrain_cycle_locks` rows held by **dead PIDs** are never reaped except on-contention; a crashed sync/cycle silently blocks that source until something else happens to contend

**Observed:** After a worker recycle, `gbrain_cycle_locks` retained rows whose `holder_pid` was a **dead process**:

```
gbrain-sync:straylight-brain   pid=737217 (DEAD)  held=00:36:12
gbrain-cycle:straylight-brain  pid=746387 (DEAD)  held=00:06:59
```

`737217` exited >36 min ago — **past the 5-min `LOCK_TTL_MS`** (cycle.ts:17) and past the 30-min sync TTL — yet the row persists. `kill -0 737217` → no such process.

**Root cause:** lock reclaim is **lazy / on-contention only**. `tryAcquireDbLock` (cycle.ts:77) steals a row only when a *new* acquirer shows up AND (`age > TTL` OR `pid not alive on this host`) — see the comment at cycle.ts:90 (`...older than LOCK_TTL_MS OR the PID is no longer alive on this host`). There is **no background sweep**. So when a sync for source `S` crashes (worker killed, OOM, recycle), its `gbrain-sync:S` row strands. The **next** sync of `S` will eventually steal it — but until then `gbrain status` shows the source as "syncing," and any liveness/health probe keyed on lock presence reports a phantom in-flight cycle. With per-source locks, a low-traffic source can stay falsely "locked" for hours.

**Why TTL doesn't save you:** TTL expiry is also only evaluated *at acquire time*. A dead-pid lock with an expired TTL is still a live row in the table until contention; nothing deletes it proactively.

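**Ask:** a periodic reaper (supervisor tick or worker idle-loop) that runs the equivalent of `DELETE FROM gbrain_cycle_locks WHERE holder_host = <this host> AND NOT pid_alive(holder_pid)`. Here `pid_alive` stands for the host-local liveness check the acquire path already performs, not a SQL function, so the reaper would test PIDs in-process and delete by key. Host-scoping keeps it safe with multi-host. It is cheap and removes the phantom-lock class entirely.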
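A minimal sketch of the selection step such a reaper could use, assuming a row shape with `lockKey`, `holderHost`, and `holderPid` fields (hypothetical names; the real `gbrain_cycle_locks` columns may differ). It uses the standard signal-0 probe via `process.kill` and leaves the actual `DELETE` to the caller:

```typescript
// Hypothetical row shape for illustration; not taken from the gbrain schema.
interface LockRow {
  lockKey: string;
  holderHost: string;
  holderPid: number;
}

// Signal 0 runs the existence/permission check without delivering a signal.
function pidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but is owned by another user, so treat it as alive.
    // Any other error (typically ESRCH) is treated as "no such process".
    return (err as { code?: string }).code === "EPERM";
  }
}

// Rows a host-scoped reaper may delete: held on this host by a PID that is gone.
// Rows from other hosts are never touched, because their PIDs cannot be probed here.
function reapableLocks(
  rows: LockRow[],
  thisHost: string,
  isAlive: (pid: number) => boolean = pidAlive,
): LockRow[] {
  return rows.filter((r) => r.holderHost === thisHost && !isAlive(r.holderPid));
}
```

One caveat with any PID probe: the OS can recycle a PID, so a dead holder may occasionally look alive. The existing TTL check would still be needed as the backstop for that case.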

Adjacent: #1470 (runCycle swallows `lock.release()` errors, stranding rows — same table, the *release* side of this same gap), #1591 (`--break-lock` parity — the manual workaround that exists *because* there's no auto-reaper).

---

### BUG 2 — `engine.disconnect()` hangs ≥10s on every one-shot CLI invocation against a PgBouncer transaction-mode pooler; only the hard-deadline force-exit saves it

**Observed:** every one-shot `gbrain query ...` (and other short CLI commands) ends with:

```
[cli] engine.disconnect() did not return within 10000ms — force-exiting
```

100% reproducible. The command's actual stdout (e.g. relational-query results from the new #1959 path) is **swallowed by / races with** the force-exit banner, so one-shot CLI output is effectively broken for piping/capture.

**Root cause:** `ConnectionManager.disconnect()` (connection-manager.ts:402) awaits `this._directPool.end()` then `this._readPool.end()`. Against PgBouncer in **transaction mode** (`:6543`, the documented default, with prepared statements already disabled for this reason), `pool.end()` waits for a graceful drain that doesn't resolve the way it does against a session-mode server, so it blocks until the 10s `DISCONNECT_HARD_DEADLINE_MS` (cli.ts:371) fires. The hard deadline was added as a band-aid (cli.ts:1933), but because the underlying `end()` never completes, (a) every short CLI call pays 10s of added latency and (b) command output races the warning.

**Ask:** in transaction-pooler mode, `disconnect()` should `pool.end()` with a short internal timeout and fall back to destroying sockets (or skip graceful drain entirely for one-shot CLI — the kernel reclaims sockets on exit anyway, per the note in timeout.ts:13). Either way the 10s tax + output race should disappear.
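The short internal timeout could look roughly like the sketch below. It assumes only that the pool exposes an async `end()`; the `Closable` interface, `endWithTimeout`, and the result labels are hypothetical names for illustration, not gbrain or driver APIs:

```typescript
// Anything with an async end(); the concrete pool type is an assumption.
interface Closable {
  end(): Promise<void>;
}

type CloseResult = "drained" | "timed_out";

// Races the graceful drain against a short timer so a hung end() cannot
// hold the CLI open until the 10s hard deadline.
async function endWithTimeout(pool: Closable, timeoutMs: number): Promise<CloseResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<CloseResult>((resolve) => {
    timer = setTimeout(() => resolve("timed_out"), timeoutMs);
  });
  try {
    return await Promise.race([
      pool.end().then((): CloseResult => "drained"),
      timeout,
    ]);
  } finally {
    // Clear the timer so it does not keep the event loop alive after a clean drain.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

On `"timed_out"` the original `end()` promise is still pending, so the caller would still need to destroy the sockets or exit the process. If `end()` rejects, the rejection propagates to the caller unchanged.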

Adjacent cluster (all disconnect-lifecycle, none is this exact transaction-mode drain-hang): #1499 (lint-phase disconnect kills shared pool), #1729 (dream disconnects singleton mid-cycle), #1745 ("connect() has not been called" reproduces), #1887 (post-print write-back races connection teardown), #1617 (disconnect-call audit failures).

---

### BUG 3 — `#1737` cooperative-abort is incomplete: handlers still hit "handler ignored abort signal (force-evicted)", and stall-death accounting under-counts

**Observed (24h, this deployment):** of 116 dead `autopilot-cycle` jobs, the error breakdown was:

```
[113x] max stalled count exceeded
  [2x] handler ignored abort signal (force-evicted)
  [2x] aborted: watchdog
```

`#1737` (shipped 0.42.29) threaded `AbortSignal` through embed-backfill → autopilot-cycle → runPhaseEmbed and is a real improvement (new-death rate dropped to ~0 after upgrading + recycle). **But** the 30s universal grace-eviction (worker.ts:758) still fires `handler ignored abort signal (force-evicted)` (worker.ts:873) — meaning some phase inside the cycle is **not** checking the signal between batches, so the cooperative bail isn't fully wired on every code path the cycle can take. The force-evict is the safety net catching the gap, not the gap being closed.

**Ask:** audit the remaining in-cycle phases reachable from `autopilot-cycle` for `throwIfAborted()` / `isAborted()` checkpoints between units of work (the `src/core/abort-check.ts` helpers from #1737 exist; they just aren't called everywhere). Any phase that can run >30s without a check will keep tripping the force-evict. Candidates to verify: the non-embed phases (lint, extract, backlinks, consolidate) that #1737 didn't explicitly thread.
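For reference, the between-batches checkpoint pattern looks roughly like this. The sketch uses the standard `AbortSignal.throwIfAborted()` method rather than the `abort-check.ts` helpers, whose exact signatures are not shown in this report; `processInBatches` and `handleBatch` are hypothetical names:

```typescript
// Runs handleBatch over items in fixed-size batches, checking the signal
// before each batch so an abort is honored within one batch's duration.
async function processInBatches<T>(
  items: T[],
  batchSize: number,
  signal: AbortSignal,
  handleBatch: (batch: T[]) => Promise<void>,
): Promise<number> {
  let processed = 0;
  for (let i = 0; i < items.length; i += batchSize) {
    // Throws signal.reason if the signal has been aborted; otherwise a no-op.
    signal.throwIfAborted();
    const batch = items.slice(i, i + batchSize);
    await handleBatch(batch);
    processed += batch.length;
  }
  return processed;
}
```

The check only bounds abort latency to the duration of one batch, so a phase whose single unit of work can exceed the 30s grace window would also need the signal passed into `handleBatch` itself.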

Adjacent: #1678 (RSS-watchdog crash-loop / mis-reported deaths — same job, the watchdog side), #1470 (lock-release on the killed cycle).

---

### Environment / repro context
- 0.42.34.0 @ 099d9a8f, single host, Linux x64, Bun, Node v22.
- Supabase split-pool (direct `:5432` + read `:6543`), PgBouncer transaction mode, `GBRAIN_PREPARE` unset (auto-disabled on 6543).
- Federation: `default` + `straylight-brain` sources.
- Bugs 1 & 3 surface under recycle/crash; Bug 2 is 100% deterministic on any one-shot CLI call.

Happy to provide `pg_stat_activity` snapshots, lock-table dumps, or run a patched build against this brain to verify fixes.

