
Postgres engine: post-print [last-retrieved] write-back races connection teardown (write CONNECTION_ENDED) → best-effort failure + intermittent ~100% CPU non-exit hang. #1259's drain doesn't cover the Postgres adapter #1887

@danieltartaro

Summary

After gbrain query/search prints its (correct, fast) results, gbrain performs a best-effort "last-retrieved" write-back. On the native Postgres storage engine, the DB connection/pool is closed before that write-back runs, so it races teardown. Two observed outcomes from the same race:

Fast-fail (common): the write-back hits a dead socket and the process exits after printing [last-retrieved] write-back failed (best-effort): write CONNECTION_ENDED localhost:5432. Results are correct and the line is noise, but it signals the ordering bug.
Hang (intermittent): when the write-back blocks on the pool instead of erroring, the CLI pegs one CPU core at ~100% and never exits; it must be SIGKILLed (timeout/killpg; rc 137).

This is the post-print teardown that PR #1259 ("drain last-retrieved writes before CLI disconnect") fixed for PGLite (#1247 / #1269 / #1290 / #1343, all PGLite, #1343 Closed). On the Postgres engine adapter the drain ordering isn't honored: the connection is ended before the last-retrieved write is drained. So the fix is engine-specific, and the Postgres path either regressed or was never covered.

Note: this is not #1368 (the pre-output expansion-path spin, where no results print). Here results are printed (fast, correct), the failure is in teardown, and it reproduces with --expand false, so the expander is ruled out.

Observed (live, 2026-06-05, gbrain 0.41.0.0, Postgres engine)

$ gbrain query "test" --expand false
[0.8326] meetings/2026-04-08-... -- # Insight Spine — ...
[0.8190] people/richard-jerrett -- # Richard Jerrett ...
... (20 correct, well-ranked rows in ~1-3s) ...
[0.5132] people/melody-achi -- # Melody Achi ...
[last-retrieved] write-back failed (best-effort): write CONNECTION_ENDED localhost:5432
$ # exited this run; on other runs the same point hangs at ~100% CPU until killed

Steps to reproduce

1. Brain on the native Postgres engine (engine: postgres, Homebrew postgresql@17 + pgvector 0.8.0; config per the #1671 migration). Healthy: gbrain doctor → brain_score 100/100, connection: Connected, pgvector: Extension installed.
2. Run gbrain query "" --expand false
3. Results print in ~1-3s, then either (a) the [last-retrieved] write-back failed ... write CONNECTION_ENDED localhost:5432 line prints and the process exits, or (b) the process pegs one core at ~100% CPU and never exits (intermittent; these are the two branches of the same teardown race).
Reproduces with --expand false (not the #1368 expander) on a freshly-rebuilt, integrity-clean brain (not corruption-dependent).

Environment

gbrain version: 0.41.0.0 (local git checkout, bun link; runs src/cli.ts directly — no dist/)
Storage engine: native PostgreSQL 17.10 (Homebrew postgresql@17, launchd) + pgvector 0.8.0 (built from source)
OS: macOS 26.5.1 / Darwin 25F80 (arm64), Mac mini M4 Pro
Bun: 1.3.14 · Node: v22.22.2
Embedding: llama-server:daniel-embed (bge-m3 Q8_0, 1024d) via llama-swap on 127.0.0.1:11435
gbrain doctor: brain_score 100/100; 194 pages / 411 chunks; schema v93

Likely cause (evidenced)

The CONNECTION_ENDED localhost:5432 confirms the last-retrieved write-back fires after the Postgres connection/pool is closed. #1259 drains the last-retrieved writes before CLI disconnect for the PGLite path; the Postgres engine adapter evidently tears the connection down first (or doesn't await the same drain), so the write-back either errors on a dead socket (fast-fail, process exits — observed) or blocks on the pool, keeping the event loop alive and spinning a core (the ~100% CPU non-exit hang).

Fix direction: ensure the last-retrieved write is awaited and drained before pool.end() / connection close on the Postgres engine, or make the drain hook engine-agnostic (shared shutdown) rather than wired to the PGLite disconnect.
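
A minimal sketch of the engine-agnostic variant of that fix direction, in TypeScript. The names `PendingWrites`, `Closable`, and `shutdown` are hypothetical and are not gbrain's actual API; the sketch only assumes that each best-effort write is a promise and that the engine exposes some async `end()`. Every write is registered when it starts, and shutdown waits for the registered writes to settle (bounded by a timeout so a stuck write cannot hold the process open) before the connection is closed.

```typescript
// Hypothetical names throughout; this is not gbrain's real shutdown code.

// Anything with an async close, e.g. a Postgres pool or a PGLite handle.
interface Closable {
  end(): Promise<void>;
}

class PendingWrites {
  private readonly inflight = new Set<Promise<unknown>>();

  // Register a best-effort write so shutdown can wait for it.
  track<T>(write: Promise<T>): Promise<T> {
    this.inflight.add(write);
    const forget = () => {
      this.inflight.delete(write);
    };
    // Remove the entry whether the write succeeds or fails.
    write.then(forget, forget);
    return write;
  }

  // Wait for the writes tracked so far to settle, but never longer than timeoutMs.
  async drain(timeoutMs: number): Promise<"drained" | "timeout"> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<"timeout">((resolve) => {
      timer = setTimeout(() => resolve("timeout"), timeoutMs);
    });
    const settled = Promise.allSettled([...this.inflight]).then(
      () => "drained" as const,
    );
    try {
      return await Promise.race([settled, timeout]);
    } finally {
      clearTimeout(timer);
    }
  }
}

// Shared shutdown: drain first, close the connection second.
async function shutdown(
  writes: PendingWrites,
  pool: Closable,
  timeoutMs = 2000,
): Promise<"drained" | "timeout"> {
  const outcome = await writes.drain(timeoutMs);
  await pool.end();
  return outcome;
}
```

Because `shutdown` takes any `Closable`, the same ordering would apply to both engines instead of living in one adapter's disconnect path. The sketch waits only for writes registered before `drain` is called; a real implementation would also need to stop accepting new writes at that point.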

Impact

The CONNECTION_ENDED line is cosmetic (results are correct) but signals the bug. The intermittent hang breaks any harness or cron job that waits for process exit: batch callers must wrap every invocation in a timeout/SIGKILL and parse stdout.
Affects native-Postgres deployments. On macOS 26 (Tahoe) Postgres is the only working engine (PGLite WASM aborts there, see #223/#1670), so the bug hits exactly the users the OS update forced onto Postgres.

Workaround

Wrap each invocation in a kill-after-output bound (own process group, SIGKILL after results print, keep stdout; treat rc 137 as success).
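
One way to sketch that workaround in TypeScript, assuming a POSIX system and Node's `child_process` module. `runBounded`, `graceMs`, and `hardLimitMs` are hypothetical names chosen for illustration. The child is started in its own process group, its stdout is kept, and the whole group is SIGKILLed either a short grace period after the last stdout chunk or at a hard deadline, whichever comes first.

```typescript
import { spawn } from "node:child_process";

interface BoundedResult {
  stdout: string;
  exitCode: number | null; // null when the child died from a signal
  signal: string | null;
  killedByWrapper: boolean; // true when this wrapper sent the SIGKILL
}

// Hypothetical helper: run a command, keep its stdout, and kill its process
// group graceMs after the last output chunk or after hardLimitMs overall.
function runBounded(
  cmd: string,
  args: string[],
  graceMs: number,
  hardLimitMs: number,
): Promise<BoundedResult> {
  return new Promise((resolve, reject) => {
    // detached: true makes the child the leader of a new process group (POSIX).
    const child = spawn(cmd, args, {
      detached: true,
      stdio: ["ignore", "pipe", "inherit"],
    });
    let stdout = "";
    let killedByWrapper = false;
    let graceTimer: ReturnType<typeof setTimeout> | undefined;

    const killGroup = () => {
      killedByWrapper = true;
      try {
        // A negative pid targets the whole process group.
        if (child.pid !== undefined) process.kill(-child.pid, "SIGKILL");
      } catch {
        // The group is already gone; nothing to do.
      }
    };
    const hardTimer = setTimeout(killGroup, hardLimitMs);

    child.stdout?.on("data", (chunk: Buffer) => {
      stdout += chunk.toString("utf8");
      // Restart the grace period each time more output arrives.
      clearTimeout(graceTimer);
      graceTimer = setTimeout(killGroup, graceMs);
    });
    child.on("error", (err) => {
      clearTimeout(hardTimer);
      clearTimeout(graceTimer);
      reject(err);
    });
    child.on("close", (code, signal) => {
      clearTimeout(hardTimer);
      clearTimeout(graceTimer);
      resolve({ stdout, exitCode: code, signal, killedByWrapper });
    });
  });
}
```

A caller could treat a result with `killedByWrapper === true` and non-empty `stdout` as success, which mirrors treating rc 137 as success in a shell harness (Node reports the kill as `exitCode: null` with `signal: "SIGKILL"` rather than 137). The grace period has to be longer than the gap between result chunks, otherwise slow output would be cut off.
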
Happy to add a process.getActiveResourcesInfo() dump from a hang instance if needed, or open a PR if the fix is awaiting the last-retrieved drain before pool.end() on the Postgres adapter.
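
A rough sketch of how such a dump could be captured, in TypeScript. `summarizeActiveResources` and `armHangDump` are hypothetical helpers, not gbrain code. `process.getActiveResourcesInfo()` is an experimental Node API, and whether the runtime in use provides it is an assumption, so the sketch checks for it first. If the hang is a synchronous spin that starves the event loop, the timer below would never fire and a different capture method would be needed.

```typescript
// Hypothetical helpers for capturing what keeps the process alive.
type ResourceCounts = Record<string, number>;

// Count active resources by type, e.g. { Timeout: 1, TCPSocketWrap: 2 }.
// Returns an empty object when the runtime lacks the API.
function summarizeActiveResources(): ResourceCounts {
  const proc = process as unknown as {
    getActiveResourcesInfo?: () => string[];
  };
  if (typeof proc.getActiveResourcesInfo !== "function") return {};
  const counts: ResourceCounts = {};
  for (const kind of proc.getActiveResourcesInfo()) {
    counts[kind] = (counts[kind] ?? 0) + 1;
  }
  return counts;
}

// Arm a one-shot dump to stderr. The timer is unref'd so it does not itself
// keep a healthy process from exiting.
function armHangDump(delayMs: number): ReturnType<typeof setTimeout> {
  const timer = setTimeout(() => {
    console.error("[hang-dump]", JSON.stringify(summarizeActiveResources()));
  }, delayMs);
  timer.unref();
  return timer;
}
```

Calling `armHangDump(5000)` right after the results are printed would log the resource summary only on runs where the process is still alive five seconds later.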
