Postgres engine: post-print [last-retrieved] write-back races connection teardown (write CONNECTION_ENDED) → best-effort failure + intermittent ~100% CPU non-exit hang. #1259's drain doesn't cover the Postgres adapter (#1887)

Summary

After gbrain query/search prints its (correct, fast) results, gbrain performs a best-effort "last-retrieved" write-back. On the native Postgres storage engine, the DB connection/pool is closed before that write-back runs, so it races teardown. Two observed outcomes from the same race:

Fast-fail (common): the write-back hits a dead socket and the process exits, printing [last-retrieved] write-back failed (best-effort): write CONNECTION_ENDED localhost:5432. Results are correct; the line is noise, but it signals the ordering bug.
Hang (intermittent): when the write-back blocks on the pool instead of erroring, the CLI pegs one CPU core at ~100% and never exits; it must be SIGKILLed (timeout/killpg; rc 137).

This is the post-print teardown problem that PR #1259 ("drain last-retrieved writes before CLI disconnect") fixed for PGLite (#1247 / #1269 / #1290 / #1343, all PGLite, #1343 Closed). On the Postgres engine adapter the drain ordering isn't honored: the connection is ended before the last-retrieved write is drained. So the fix is engine-specific (PGLite only), and the Postgres path either regressed or was never covered.

Note: this is not #1368 (the pre-output expansion-path spin, where no results print). Here results are printed (fast, correct), the failure is in teardown, and it reproduces with --expand false, so the expander is ruled out.

Observed (live, 2026-06-05, gbrain 0.41.0.0, Postgres engine)

$ gbrain query "test" --expand false
[0.8326] meetings/2026-04-08-... -- # Insight Spine — ...
[0.8190] people/richard-jerrett -- # Richard Jerrett ...
... (20 correct, well-ranked rows in ~1-3s) ...
[0.5132] people/melody-achi -- # Melody Achi ...
[last-retrieved] write-back failed (best-effort): write CONNECTION_ENDED localhost:5432
$            # exited this run; on other runs the same point hangs at ~100% CPU until killed

Steps to reproduce

Brain on the native Postgres engine (engine: postgres, Homebrew postgresql@17 + pgvector 0.8.0; config per the #1671 migration). Healthy: gbrain doctor → brain_score 100/100, connection: Connected, pgvector: Extension installed.
Run gbrain query "<any query returning at least one page>" --expand false
Results print in ~1-3s, then either (a) the [last-retrieved] write-back failed ... write CONNECTION_ENDED localhost:5432 line prints and the process exits, or (b) the process pegs one core at ~100% CPU and never exits (intermittent — the two branches of the same teardown race).
Reproduces with --expand false (not the #1368 expander) on a freshly-rebuilt, integrity-clean brain (not corruption-dependent).
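
The two branches described in the steps above can be told apart mechanically when the repro is repeated. A minimal sketch, assuming a harness that records each run's combined output and whether the process exited on its own; RunRecord, classifyRun, and tally are hypothetical names for illustration, not part of gbrain:

```typescript
// Hypothetical harness-side types; nothing here is gbrain API.
type RunOutcome = "clean" | "fast-fail" | "hang";

interface RunRecord {
  output: string;  // combined stdout + stderr captured by the harness
  exited: boolean; // false if the harness had to kill the process
}

// Matches the best-effort failure line quoted in this report.
const WRITE_BACK_FAILURE =
  /\[last-retrieved\] write-back failed \(best-effort\): write CONNECTION_ENDED/;

function classifyRun(run: RunRecord): RunOutcome {
  // A process that never exited on its own is the hang branch.
  if (!run.exited) return "hang";
  // A process that exited after printing the failure line is the fast-fail branch.
  return WRITE_BACK_FAILURE.test(run.output) ? "fast-fail" : "clean";
}

function tally(runs: RunRecord[]): Record<RunOutcome, number> {
  const counts: Record<RunOutcome, number> = { clean: 0, "fast-fail": 0, hang: 0 };
  for (const run of runs) counts[classifyRun(run)] += 1;
  return counts;
}
```

Tallying a few dozen runs this way gives a rough fast-fail to hang ratio, which may help confirm that both come from the same race.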

Environment

gbrain version: 0.41.0.0 (local git checkout, bun link; runs src/cli.ts directly — no dist/)
Storage engine: native PostgreSQL 17.10 (Homebrew postgresql@17, launchd) + pgvector 0.8.0 (built from source)
OS: macOS 26.5.1 / Darwin 25F80 (arm64), Mac mini M4 Pro
Bun: 1.3.14 · Node: v22.22.2
Embedding: llama-server:daniel-embed (bge-m3 Q8_0, 1024d) via llama-swap on 127.0.0.1:11435
gbrain doctor: brain_score 100/100; 194 pages / 411 chunks; schema v93

Likely cause (evidenced)

The CONNECTION_ENDED localhost:5432 error shows that the last-retrieved write-back fires after the Postgres connection/pool is closed. #1259 drains the last-retrieved writes before CLI disconnect for the PGLite path; the Postgres engine adapter evidently tears the connection down first (or doesn't await the same drain), so the write-back either errors on a dead socket (fast-fail, process exits; observed) or blocks on the pool, keeping the event loop alive and spinning a core (the ~100% CPU non-exit hang).

Fix direction: ensure the last-retrieved write is awaited and drained before pool.end() / connection close on the Postgres engine, or make the drain hook engine-agnostic (shared shutdown) rather than wired to the PGLite disconnect.
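
To make the ordering concrete, here is a minimal sketch of an engine-agnostic drain. It is not gbrain's actual code: FakePool stands in for the real Postgres pool, WriteBackDrain and runCli are hypothetical names, and the SQL string is illustrative only. It models only the fast-fail branch (a write that errors after the pool has ended), not the hang.

```typescript
// Hypothetical stand-in for the Postgres pool; not gbrain's real adapter.
class FakePool {
  private ended = false;

  async query(_sql: string): Promise<void> {
    // Simulate network latency so teardown can overtake the write.
    await new Promise<void>((resolve) => setTimeout(resolve, 10));
    if (this.ended) throw new Error("write CONNECTION_ENDED localhost:5432");
  }

  async end(): Promise<void> {
    this.ended = true;
  }
}

// Engine-agnostic registry of in-flight best-effort writes (assumed design).
class WriteBackDrain {
  readonly failures: string[] = [];
  private pending = new Set<Promise<void>>();

  track(write: Promise<void>): void {
    const settled: Promise<void> = write
      .catch((err: unknown) => {
        // Best-effort: record the failure instead of throwing.
        this.failures.push(err instanceof Error ? err.message : String(err));
      })
      .then(() => {
        this.pending.delete(settled);
      });
    this.pending.add(settled);
  }

  // Wait for tracked writes, but never longer than timeoutMs.
  async drain(timeoutMs: number): Promise<void> {
    if (this.pending.size === 0) return;
    let timer: ReturnType<typeof setTimeout> | undefined;
    const deadline = new Promise<void>((resolve) => {
      timer = setTimeout(resolve, timeoutMs);
    });
    await Promise.race([Promise.all(Array.from(this.pending)), deadline]);
    if (timer !== undefined) clearTimeout(timer); // do not keep the event loop alive
  }
}

async function runCli(drainFirst: boolean): Promise<string[]> {
  const pool = new FakePool();
  const drain = new WriteBackDrain();
  // Results have been printed; the write-back is started without awaiting it.
  drain.track(pool.query("UPDATE pages SET last_retrieved = now()"));
  if (drainFirst) await drain.drain(2000); // proposed ordering: drain, then end
  await pool.end();
  await drain.drain(2000); // only so this demo can observe the outcome
  return drain.failures;
}
```

With drainFirst set to false the tracked write settles after end() and records the CONNECTION_ENDED failure; with true it completes before teardown and nothing is recorded. The bounded drain is a design choice in this sketch so that a stuck write cannot itself keep the process from exiting.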

Impact

The CONNECTION_ENDED line is cosmetic (results are correct) but signals the bug; the intermittent hang breaks any harness/cron that waits for process exit — every batch invocation must wrap each call in a timeout/SIGKILL and parse stdout.
Affects native-Postgres deployments. On macOS 26 (Tahoe), Postgres is the only working engine (PGLite WASM aborts there; see #223/#1670), so this hits exactly the users forced onto Postgres by the OS update.

Workaround

Wrap each invocation in a kill-after-output bound (own process group, SIGKILL after results print, keep stdout; treat rc 137 as success).
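
A minimal sketch of such a wrapper, assuming a POSIX host and a Node-compatible runtime. runBounded, startupMs, and quietMs are hypothetical names; the idea is to kill the child's whole process group once output has arrived and then gone quiet without the process exiting.

```typescript
import { spawn } from "node:child_process";

interface BoundedResult {
  stdout: string;
  code: number | null;   // null when the child was killed by a signal
  signal: string | null; // "SIGKILL" when the wrapper had to kill it
  killed: boolean;
}

function runBounded(
  cmd: string,
  args: string[],
  opts: { startupMs: number; quietMs: number },
): Promise<BoundedResult> {
  return new Promise((resolve, reject) => {
    // detached: true puts the child in its own process group on POSIX.
    const child = spawn(cmd, args, { detached: true, stdio: ["ignore", "pipe", "inherit"] });
    let stdout = "";
    let killed = false;

    const killGroup = (): void => {
      killed = true;
      if (child.pid !== undefined) {
        try {
          process.kill(-child.pid, "SIGKILL"); // negative pid targets the group
        } catch {
          // The group may already be gone; nothing to do.
        }
      }
    };

    // Allow startupMs for the first output, then quietMs of silence after any output.
    let timer: ReturnType<typeof setTimeout> = setTimeout(killGroup, opts.startupMs);
    child.stdout?.setEncoding("utf8");
    child.stdout?.on("data", (chunk: string) => {
      stdout += chunk;
      clearTimeout(timer);
      timer = setTimeout(killGroup, opts.quietMs);
    });

    child.on("error", (err) => {
      clearTimeout(timer);
      reject(err);
    });
    child.on("close", (code, signal) => {
      clearTimeout(timer);
      resolve({ stdout, code, signal, killed });
    });
  });
}
```

A caller would treat a result with killed set and non-empty stdout as success, which corresponds to treating rc 137 as success in a shell. Note that Node reports a signal-killed child as code null with signal "SIGKILL" rather than as exit code 137.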
Happy to add a process.getActiveResourcesInfo() dump from a hang instance if needed, or open a PR if the fix is awaiting the last-retrieved drain before pool.end() on the Postgres adapter.
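
For reference, the dump mentioned above could be collected with something like the following sketch. summarizeActiveResources and dumpActiveResources are hypothetical helper names. process.getActiveResourcesInfo() is a Node.js API, and whether the runtime in use provides it is an assumption, so the call is guarded.

```typescript
// Count resource types so repeated entries (for example many sockets) stand out.
function summarizeActiveResources(resources: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const name of resources) counts[name] = (counts[name] ?? 0) + 1;
  return counts;
}

// Returns an empty summary if the runtime does not provide the API.
function dumpActiveResources(): Record<string, number> {
  const proc = process as unknown as { getActiveResourcesInfo?: () => string[] };
  const resources =
    typeof proc.getActiveResourcesInfo === "function" ? proc.getActiveResourcesInfo() : [];
  return summarizeActiveResources(resources);
}
```

This could be called from an unref'ed timer armed shortly after results print. If the hang is a synchronous busy loop rather than a live handle, the timer would never fire, and an external sampler would be needed instead.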
