Skip to content

getConfig lacks retry protection on transient connection loss #1603

@justemu

Description

@justemu

Problem: getConfig lacks retry protection on transient connection loss

Environment

  • gbrain v0.41.26.0 (commit 2aed39b)
  • Postgres engine behind a connection pooler (port 5433)

Background

The conversation_facts_backfill cycle phase (added in v0.41.11.0, included in ALL_PHASES by default) calls engine.getConfig() to check cycle.conversation_facts_backfill.enabled. When the Postgres connection briefly drops — a transient event — getConfig() throws immediately with "No database connection: connect() has not been called" and crashes the entire dream cycle with exit code 1.

This is the same class of transient connection loss that batch operations (addLinksBatch, addTimelineEntriesBatch, upsertChunks) handle via the withRetry/batchRetry mechanism introduced in v0.41.19.0 (#1537). But getConfig() was not covered.

Root cause

PostgresEngine.getConfig() at src/core/postgres-engine.ts:4432-4436 uses a bare this.sql call with no retry wrapper:

async getConfig(key: string): Promise<string | null> {
    const sql = this.sql;
    const rows = await sql`SELECT value FROM config WHERE key = ${key}`;
    return rows.length > 0 ? (rows[0].value as string) : null;
}

The sql getter (postgres-engine.ts:122-125) has a two-tier fallback:

get sql(): ReturnType<typeof postgres> {
    if (this._sql) return this._sql;
    return db.getConnection();
}

When a pooler (PgBouncer/Supavisor) recycles a connection, this._sql is set to null during disconnect() (postgres-engine.ts:196-214). If the global singleton in db.ts is also null (which happens when its disconnect() was called as part of the same pooler-initiated teardown, or when no separate module-level connection was ever opened), db.getConnection() throws GBrainError('No database connection', 'connect() has not been called', ...).

The conversation_facts_backfill phase passes this error through to the cycle runner with no try/catch, causing the entire gbrain dream process to exit with code 1.

Impact

  • gbrain dream exits with code 1 when conversation_facts_backfill runs after a transient connection drop
  • Reproduced in deployment: all prior dream phases (extract, patterns, consolidate, etc.) succeeded against the same engine, then conversation_facts_backfill hit a dropped connection and crashed
  • The non-zero exit propagates to cron alerts and operator confusion

Existing defense

v0.41.19.0 (#1537) added withRetry(BULK_RETRY_OPTS) — 3 retries, decorrelated jitter, up to ~12s total — but ONLY to addLinksBatch, addTimelineEntriesBatch, and upsertChunks. The commit message states that retry becomes a data-primitive contract inherited by all callers, but getConfig() and other unguarded methods (listConfigKeys) were not covered — only batch writes were addressed. The CI lint guard (check-no-double-retry.sh) only prevents double-wrapping engine batch methods — it does not ensure all engine methods have retry.

Suggested fix

Add withRetry protection to getConfig():

import { withRetry, BULK_RETRY_OPTS } from './retry';

async getConfig(key: string): Promise<string | null> {
    return withRetry(async () => {
        const sql = this.sql;
        const rows = await sql`SELECT value FROM config WHERE key = ${key}`;
        return rows.length > 0 ? (rows[0].value as string) : null;
    }, BULK_RETRY_OPTS);
}

BULK_RETRY_OPTS is used for consistency with the existing engine-level retry pattern. While getConfig is a single-row read (not a bulk write), the same retry parameters are appropriate: transient pooler disconnects affect read and write operations identically, and a 12s worst-case wait is negligible for a config read.

An alternative approach — wrapping the sql getter itself to auto-retry — would introduce double-retry risk: methods already wrapped in withRetry would retry at both the getter level and their own wrapper level. The engine-level fix is the safer choice.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions