getConfig lacks retry protection on transient connection loss

## Problem: `getConfig` lacks retry protection on transient connection loss

### Environment

- gbrain v0.41.26.0 (commit 2aed39b3)
- Postgres engine behind a connection pooler (port 5433)

### Background

The `conversation_facts_backfill` cycle phase (added in v0.41.11.0, included in `ALL_PHASES` by default) calls `engine.getConfig()` to check `cycle.conversation_facts_backfill.enabled`. When the Postgres connection drops briefly (a transient event), `getConfig()` throws immediately with `"No database connection: connect() has not been called"`, which crashes the entire dream cycle with exit code 1.

This is the same class of transient connection loss that batch operations (`addLinksBatch`, `addTimelineEntriesBatch`, `upsertChunks`) handle via the `withRetry`/`batchRetry` mechanism introduced in v0.41.19.0 (#1537). But `getConfig()` was not covered.

### Root cause

`PostgresEngine.getConfig()` at `src/core/postgres-engine.ts:4432-4436` uses a bare `this.sql` call with no retry wrapper:

```typescript
async getConfig(key: string): Promise<string | null> {
    const sql = this.sql;
    const rows = await sql`SELECT value FROM config WHERE key = ${key}`;
    return rows.length > 0 ? (rows[0].value as string) : null;
}
```

The `sql` getter (`postgres-engine.ts:122-125`) has a two-tier fallback:

```typescript
get sql(): ReturnType<typeof postgres> {
    if (this._sql) return this._sql;
    return db.getConnection();
}
```

When a pooler (PgBouncer/Supavisor) recycles a connection, `this._sql` is set to `null` during `disconnect()` (`postgres-engine.ts:196-214`). If the global singleton in `db.ts` is also `null` (which happens when its `disconnect()` was called as part of the same pooler-initiated teardown, or when no separate module-level connection was ever opened), `db.getConnection()` throws `GBrainError('No database connection', 'connect() has not been called', ...)`.

The `conversation_facts_backfill` phase passes this error through to the cycle runner with no try/catch, causing the entire `gbrain dream` process to exit with code 1.
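The failure path can be reproduced in isolation. The following is a minimal sketch, not the gbrain code: `FakeEngine`, `NoConnectionError`, and the module-level `globalConnection` are hypothetical stand-ins for `PostgresEngine`, `GBrainError`, and the singleton in `db.ts`, and the fake connection is synchronous for brevity.

```typescript
// Hypothetical stand-in for a postgres.js connection: looks up one config key.
type FakeSql = (key: string) => string[];

// Hypothetical stand-in for GBrainError.
class NoConnectionError extends Error {}

// Stand-in for the module-level singleton in db.ts; never opened in this sketch.
let globalConnection: FakeSql | null = null;

function getConnection(): FakeSql {
  if (globalConnection === null) {
    throw new NoConnectionError(
      "No database connection: connect() has not been called",
    );
  }
  return globalConnection;
}

const KEY = "cycle.conversation_facts_backfill.enabled";

class FakeEngine {
  private _sql: FakeSql | null = (key) => (key === KEY ? ["true"] : []);

  // Tier 1: the instance connection. Tier 2: the module-level singleton.
  get sql(): FakeSql {
    if (this._sql) return this._sql;
    return getConnection();
  }

  // Mirrors the teardown described above: the instance connection is nulled.
  disconnect(): void {
    this._sql = null;
  }

  // Bare read with no retry wrapper, like the method quoted above.
  getConfig(key: string): string | null {
    const rows = this.sql(key);
    return rows[0] ?? null;
  }
}

const engine = new FakeEngine();
const before = engine.getConfig(KEY); // succeeds while the connection is up

engine.disconnect(); // simulate the pooler-initiated teardown

let failure = "";
try {
  engine.getConfig(KEY); // both tiers are now null, so the getter throws
} catch (err) {
  failure = err instanceof Error ? err.message : String(err);
}

console.log(before, "|", failure);
```

In this sketch the second read fails because neither tier holds a connection, which is the state the phase encounters when it calls `getConfig()` after the drop.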

### Impact

- `gbrain dream` exits with code 1 when `conversation_facts_backfill` runs after a transient connection drop
- Reproduced in deployment: all prior dream phases (extract, patterns, consolidate, etc.) succeeded against the same engine, then `conversation_facts_backfill` hit a dropped connection and crashed
- The non-zero exit triggers cron alerts and confuses operators, even though the underlying fault was transient

### Existing defense

v0.41.19.0 (#1537) added `withRetry(BULK_RETRY_OPTS)` (3 retries, decorrelated jitter, up to ~12s total), but only to `addLinksBatch`, `addTimelineEntriesBatch`, and `upsertChunks`. The commit message states that retry becomes a data-primitive contract inherited by all callers, yet only batch writes were addressed; `getConfig()` and other unguarded methods (`listConfigKeys`) were left out. The CI lint guard (`check-no-double-retry.sh`) only prevents double-wrapping of engine batch methods; it does not ensure that every engine method has retry protection.
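For readers unfamiliar with the pattern, here is a rough sketch of a bounded retry with decorrelated jitter. It is not the gbrain implementation: `withRetrySketch`, `nextDelayMs`, `RetrySketchOpts`, and `SKETCH_RETRY_OPTS` are hypothetical names, and the delay values are assumptions chosen only so that 3 retries capped at 4s each give a worst case of about 12s. The real `BULK_RETRY_OPTS` values and the real transient-error check are not shown in this issue.

```typescript
// Hypothetical options shape; the real BULK_RETRY_OPTS may differ.
interface RetrySketchOpts {
  maxRetries: number;
  baseDelayMs: number;
  maxDelayMs: number;
  isTransient: (err: unknown) => boolean;
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
  random?: () => number; // injectable source of randomness in [0, 1)
}

// Assumed values: 3 retries * 4000 ms cap = 12 s worst-case total wait.
const SKETCH_RETRY_OPTS: RetrySketchOpts = {
  maxRetries: 3,
  baseDelayMs: 250,
  maxDelayMs: 4000,
  // Assumed heuristic: treat any error mentioning "connection" as transient.
  isTransient: (err) => err instanceof Error && /connection/i.test(err.message),
};

// Decorrelated jitter: draw uniformly from [base, prev * 3], then cap.
function nextDelayMs(prevMs: number, opts: RetrySketchOpts, rand: number): number {
  const upper = Math.max(opts.baseDelayMs, prevMs * 3);
  const drawn = opts.baseDelayMs + rand * (upper - opts.baseDelayMs);
  return Math.min(opts.maxDelayMs, drawn);
}

async function withRetrySketch<T>(
  fn: () => Promise<T>,
  opts: RetrySketchOpts,
): Promise<T> {
  const sleep =
    opts.sleep ??
    ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  const random = opts.random ?? Math.random;
  let delay = opts.baseDelayMs;

  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up once retries are exhausted or the error is not transient.
      if (attempt >= opts.maxRetries || !opts.isTransient(err)) throw err;
      delay = nextDelayMs(delay, opts, random());
      await sleep(delay);
    }
  }
}
```

Each delay depends on the previous one rather than on the attempt number, which spreads out retries from concurrent callers while the cap keeps the total wait bounded.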

### Suggested fix

Add `withRetry` protection to `getConfig()`:

```typescript
import { withRetry, BULK_RETRY_OPTS } from './retry';

async getConfig(key: string): Promise<string | null> {
    return withRetry(async () => {
        const sql = this.sql;
        const rows = await sql`SELECT value FROM config WHERE key = ${key}`;
        return rows.length > 0 ? (rows[0].value as string) : null;
    }, BULK_RETRY_OPTS);
}
```

`BULK_RETRY_OPTS` is used for consistency with the existing engine-level retry pattern. While `getConfig` is a single-row read (not a bulk write), the same retry parameters are appropriate: transient pooler disconnects affect read and write operations identically, and a 12s worst-case wait is negligible for a config read.

An alternative approach, wrapping the `sql` getter itself to auto-retry, would introduce double-retry risk: methods already wrapped in `withRetry` would retry at both the getter level and their own wrapper level. Wrapping `getConfig()` at the engine-method level, as above, is the safer choice.
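To illustrate why nested retry layers are a concern, the sketch below counts attempts when one retry wrapper runs inside another. `retrySync` and `alwaysFails` are hypothetical helpers written for this example (synchronous and without delays), not part of gbrain; the retry count of 3 matches the figure quoted for `BULK_RETRY_OPTS`.

```typescript
// Hypothetical helper: call fn, retrying up to maxRetries times on any error.
function retrySync<T>(fn: () => T, maxRetries: number): T {
  for (let attempt = 0; ; attempt++) {
    try {
      return fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err;
    }
  }
}

let attempts = 0;

// Simulates an operation that fails on every call.
const alwaysFails = (): never => {
  attempts++;
  throw new Error("connection dropped");
};

try {
  // Inner layer: 1 try + 3 retries. The outer layer repeats all of that 4 times.
  retrySync(() => retrySync(alwaysFails, 3), 3);
} catch {
  // Expected: the operation never succeeds.
}

console.log(attempts); // 16 = (1 + 3) * (1 + 3)
```

With real backoff delays, the waiting time multiplies in the same way, so a persistent outage would hold the caller far longer than the single-layer worst case.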
