Problem: getConfig lacks retry protection on transient connection loss
Environment
- gbrain v0.41.26.0 (commit 2aed39b)
- Postgres engine behind a connection pooler (port 5433)
Background
The conversation_facts_backfill cycle phase (added in v0.41.11.0, included in ALL_PHASES by default) calls engine.getConfig() to check cycle.conversation_facts_backfill.enabled. When the Postgres connection briefly drops — a transient event — getConfig() throws immediately with "No database connection: connect() has not been called" and crashes the entire dream cycle with exit code 1.
This is the same class of transient connection loss that batch operations (addLinksBatch, addTimelineEntriesBatch, upsertChunks) handle via the withRetry/batchRetry mechanism introduced in v0.41.19.0 (#1537). But getConfig() was not covered.
Root cause
PostgresEngine.getConfig() at src/core/postgres-engine.ts:4432-4436 uses a bare this.sql call with no retry wrapper:
async getConfig(key: string): Promise<string | null> {
const sql = this.sql;
const rows = await sql`SELECT value FROM config WHERE key = ${key}`;
return rows.length > 0 ? (rows[0].value as string) : null;
}
The sql getter (postgres-engine.ts:122-125) has a two-tier fallback:
get sql(): ReturnType<typeof postgres> {
if (this._sql) return this._sql;
return db.getConnection();
}
When a pooler (PgBouncer/Supavisor) recycles a connection, this._sql is set to null during disconnect() (postgres-engine.ts:196-214). If the global singleton in db.ts is also null (which happens when its disconnect() was called as part of the same pooler-initiated teardown, or when no separate module-level connection was ever opened), db.getConnection() throws GBrainError('No database connection', 'connect() has not been called', ...).
The conversation_facts_backfill phase passes this error through to the cycle runner with no try/catch, causing the entire gbrain dream process to exit with code 1.
Impact
gbrain dream exits with code 1 when conversation_facts_backfill runs after a transient connection drop
- Reproduced in deployment: all prior dream phases (extract, patterns, consolidate, etc.) succeeded against the same engine, then
conversation_facts_backfill hit a dropped connection and crashed
- The non-zero exit propagates to cron alerts and operator confusion
Existing defense
v0.41.19.0 (#1537) added withRetry(BULK_RETRY_OPTS) — 3 retries, decorrelated jitter, up to ~12s total — but ONLY to addLinksBatch, addTimelineEntriesBatch, and upsertChunks. The commit message states that retry becomes a data-primitive contract inherited by all callers, but getConfig() and other unguarded methods (listConfigKeys) were not covered — only batch writes were addressed. The CI lint guard (check-no-double-retry.sh) only prevents double-wrapping engine batch methods — it does not ensure all engine methods have retry.
Suggested fix
Add withRetry protection to getConfig():
import { withRetry, BULK_RETRY_OPTS } from './retry';
async getConfig(key: string): Promise<string | null> {
return withRetry(async () => {
const sql = this.sql;
const rows = await sql`SELECT value FROM config WHERE key = ${key}`;
return rows.length > 0 ? (rows[0].value as string) : null;
}, BULK_RETRY_OPTS);
}
BULK_RETRY_OPTS is used for consistency with the existing engine-level retry pattern. While getConfig is a single-row read (not a bulk write), the same retry parameters are appropriate: transient pooler disconnects affect read and write operations identically, and a 12s worst-case wait is negligible for a config read.
An alternative approach — wrapping the sql getter itself to auto-retry — would introduce double-retry risk: methods already wrapped in withRetry would retry at both the getter level and their own wrapper level. The engine-level fix is the safer choice.
Problem:
getConfiglacks retry protection on transient connection lossEnvironment
Background
The
conversation_facts_backfillcycle phase (added in v0.41.11.0, included inALL_PHASESby default) callsengine.getConfig()to checkcycle.conversation_facts_backfill.enabled. When the Postgres connection briefly drops — a transient event —getConfig()throws immediately with"No database connection: connect() has not been called"and crashes the entire dream cycle with exit code 1.This is the same class of transient connection loss that batch operations (
addLinksBatch,addTimelineEntriesBatch,upsertChunks) handle via thewithRetry/batchRetrymechanism introduced in v0.41.19.0 (#1537). ButgetConfig()was not covered.Root cause
PostgresEngine.getConfig()atsrc/core/postgres-engine.ts:4432-4436uses a barethis.sqlcall with no retry wrapper:The
sqlgetter (postgres-engine.ts:122-125) has a two-tier fallback:When a pooler (PgBouncer/Supavisor) recycles a connection,
this._sqlis set tonullduringdisconnect()(postgres-engine.ts:196-214). If the global singleton indb.tsis alsonull(which happens when itsdisconnect()was called as part of the same pooler-initiated teardown, or when no separate module-level connection was ever opened),db.getConnection()throwsGBrainError('No database connection', 'connect() has not been called', ...).The
conversation_facts_backfillphase passes this error through to the cycle runner with no try/catch, causing the entiregbrain dreamprocess to exit with code 1.Impact
gbrain dreamexits with code 1 whenconversation_facts_backfillruns after a transient connection dropconversation_facts_backfillhit a dropped connection and crashedExisting defense
v0.41.19.0 (#1537) added
withRetry(BULK_RETRY_OPTS)— 3 retries, decorrelated jitter, up to ~12s total — but ONLY toaddLinksBatch,addTimelineEntriesBatch, andupsertChunks. The commit message states that retry becomes a data-primitive contract inherited by all callers, butgetConfig()and other unguarded methods (listConfigKeys) were not covered — only batch writes were addressed. The CI lint guard (check-no-double-retry.sh) only prevents double-wrapping engine batch methods — it does not ensure all engine methods have retry.Suggested fix
Add
withRetryprotection togetConfig():BULK_RETRY_OPTSis used for consistency with the existing engine-level retry pattern. WhilegetConfigis a single-row read (not a bulk write), the same retry parameters are appropriate: transient pooler disconnects affect read and write operations identically, and a 12s worst-case wait is negligible for a config read.An alternative approach — wrapping the
sqlgetter itself to auto-retry — would introduce double-retry risk: methods already wrapped inwithRetrywould retry at both the getter level and their own wrapper level. The engine-level fix is the safer choice.