
autopilot: reconnect loop after 5 consecutive failures logs "config.database_url undefined" indefinitely, never exits #1162

@colin477

Summary

When the autopilot hits its 5-consecutive-cycle-failure threshold and logs "5 consecutive cycle failures. Stopping autopilot.", the process does not actually exit. Instead it enters a reconnect-retry loop that logs three lines every 5 minutes forever:

[2026-05-18T15:16:53] [reconnect] ERROR: undefined is not an object (evaluating 'config.database_url')
[2026-05-18T15:16:53] [dispatch] ERROR: No database connection: connect() has not been called. Fix: Run gbrain init --supabase or gbrain init --url <connection_string>
[2026-05-18T15:16:53] [health]   ERROR: No database connection: connect() has not been called. Fix: Run gbrain init --supabase or gbrain init --url <connection_string>

`~/.gbrain/config.json` on disk is still valid throughout (e.g. `gbrain query` from the same machine works fine). The autopilot's in-process config reference has become undefined, and the reconnect path doesn't re-read it from disk before retrying.

Repro

  1. Create connection pressure that causes autopilot cycles to fail. In my case the trigger was Supabase session-pool exhaustion (`(EMAXCONNSESSION) max clients reached in session mode - max clients are limited to pool_size: 15`) from `gbrain serve` processes accumulated over several days; see companion issue [link to serve-accumulation issue]. Any sustained DB-unreachability of similar shape should reproduce.
  2. Wait for 5 consecutive autopilot-cycle job failures.
  3. Observe "5 consecutive cycle failures. Stopping autopilot." in `~/.gbrain/autopilot.err`.
  4. Run `ps aux | grep gbrain.*autopilot`: the process is still running.
  5. Run `tail -f ~/.gbrain/autopilot.err`: the three-line reconnect/dispatch/health error block repeats every 5 minutes (the autopilot interval), forever.
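
To confirm the state from steps 3 to 5 without watching the log by hand, a small check over the log text may help. This is a minimal sketch, not part of gbrain: `inspectAutopilotLog` and `ZombieReport` are hypothetical names, and the only assumptions about the log format are the "Stopping autopilot." text and the `[timestamp] [reconnect] ERROR:` line shape shown above.

```typescript
// Hypothetical diagnostic helper; not part of gbrain.
// Line shapes are taken from the log excerpt in this issue.
const STOP_MARKER = "Stopping autopilot.";
const RECONNECT_ERROR = /^\[([^\]]+)\] \[reconnect\]\s+ERROR:/;

interface ZombieReport {
  stopSeen: boolean;                   // the "Stopping autopilot." line was logged
  reconnectErrorsAfterStop: number;    // reconnect errors logged after the latest stop line
  lastReconnectErrorAt: string | null; // timestamp of the latest such error, if any
}

function inspectAutopilotLog(logText: string): ZombieReport {
  let stopSeen = false;
  let reconnectErrorsAfterStop = 0;
  let lastReconnectErrorAt: string | null = null;

  for (const line of logText.split("\n")) {
    if (line.includes(STOP_MARKER)) {
      // Only count errors that follow the most recent stop message.
      stopSeen = true;
      reconnectErrorsAfterStop = 0;
      lastReconnectErrorAt = null;
      continue;
    }
    const match = RECONNECT_ERROR.exec(line);
    if (stopSeen && match) {
      reconnectErrorsAfterStop += 1;
      lastReconnectErrorAt = match[1] ?? null;
    }
  }
  return { stopSeen, reconnectErrorsAfterStop, lastReconnectErrorAt };
}
```

Pass it the contents of `~/.gbrain/autopilot.err`. A non-zero `reconnectErrorsAfterStop` while step 4 still lists the process is consistent with the loop described here.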

Diagnosis

Stack trace from the most recent occurrence:

GBrainError: No database connection: connect() has not been called.
    at getConnection (src/core/db.ts:153)
    at transaction (src/core/postgres-engine.ts:524)
    at failJob (src/core/minions/queue.ts:846)
    at executeJob (src/core/minions/worker.ts:715)

The [reconnect] log line specifically reads `undefined is not an object (evaluating 'config.database_url')`, meaning that the in-process `config` reference itself is undefined at the moment of retry, not just the connection. This points to a state-handoff bug in the reconnect path rather than a transient network issue:

  • The cycle-failure threshold path probably nulls the config object, or lets it go out of scope, before logging "Stopping autopilot."
  • The reconnect retry then runs against that nulled config, fails on the dereference, and the process stays alive in a non-functional zombie state.
  • The "Stopping autopilot." message is misleading — it logs intent but doesn't actually terminate the process or restart with a fresh config-load.
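
The hypothesis in the bullets above can be illustrated with a toy model. This is not gbrain's code: `stopAutopilot`, `reconnect` and `reconnectTick` are hypothetical names, and the structure is only a guess at the shape of the bug. The point is that once the shared reference is cleared, every retry throws a `TypeError` on the dereference, and a loop that catches and logs the error never terminates on its own. The exact message text depends on the JavaScript engine; Bun runs on JavaScriptCore, which phrases this error the way the [reconnect] line does.

```typescript
// Illustration only: names and structure are hypothetical, not gbrain's code.
interface Config {
  database_url: string;
}

// Shared in-process reference, loaded once at startup.
let config: Config | undefined = { database_url: "postgres://example.invalid/db" };

function stopAutopilot(): void {
  // Hypothesised bug: state is cleared, but nothing exits the process.
  config = undefined;
}

function reconnect(): string {
  // The non-null assertion mirrors code that assumes config is always set.
  // With config undefined, this dereference throws a TypeError at runtime.
  return config!.database_url;
}

function reconnectTick(): string {
  try {
    return `connected to ${reconnect()}`;
  } catch (err) {
    // Catching and logging without exiting is what keeps the loop alive.
    return `[reconnect] ERROR: ${(err as Error).message}`;
  }
}
```

Calling `reconnectTick()` on a timer after `stopAutopilot()` would produce one error line per interval indefinitely, which matches the observed behaviour but does not prove this is the actual code path.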

Workaround (verified 2026-05-18)

# Force-kill the zombie autopilot
kill -KILL <pid>
rm -f ~/.gbrain/autopilot.lock

# Whatever watchdog you have (launchd / cron / systemd) respawns autopilot
# with a clean config-load and the loop is gone. Verified next cycle reports:
#   [cycle] score=70 elapsed=1s next=300s

Suggested fix directions

  1. After logging "Stopping autopilot.", actually exit with a non-zero code so the watchdog can respawn cleanly. Don't drop into a reconnect loop.
  2. OR: have the reconnect path re-read `~/.gbrain/config.json` from disk on each retry instead of relying on the in-process config reference.
  3. OR: when `config` is detected as undefined, treat it as fatal and exit instead of continuing to retry against an undefined reference.

Any of the three closes the loop. (1) is probably cleanest — preserves the watchdog contract.
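
A rough sketch of what these directions could look like follows. Everything here is hypothetical and not taken from the gbrain source: `CycleFailureTracker`, `parseConfig` and `AutopilotConfig` are illustrative names, and `database_url` is the only config field assumed, because it is the one named in the error. The exit and log functions are injected so the sketch stays runtime-agnostic and testable.

```typescript
// Sketch only: every name here is hypothetical, not gbrain's API.
interface AutopilotConfig {
  database_url: string;
}

// Options 2 and 3: parse freshly read config text on each retry and fail
// loudly when database_url is missing, instead of dereferencing undefined.
function parseConfig(rawJson: string): AutopilotConfig {
  const parsed: unknown = JSON.parse(rawJson);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("config is not a JSON object");
  }
  const url = (parsed as Record<string, unknown>).database_url;
  if (typeof url !== "string" || url.length === 0) {
    throw new Error("config.database_url is missing");
  }
  return { database_url: url };
}

// Option 1: once the threshold is reached, log and exit non-zero so the
// watchdog (launchd / cron / systemd) can respawn with a clean config load.
class CycleFailureTracker {
  private consecutive = 0;
  private readonly threshold: number;
  private readonly exit: (code: number) => void;
  private readonly log: (line: string) => void;

  constructor(
    threshold: number,
    exit: (code: number) => void,
    log: (line: string) => void,
  ) {
    this.threshold = threshold;
    this.exit = exit;
    this.log = log;
  }

  recordSuccess(): void {
    this.consecutive = 0;
  }

  recordFailure(): void {
    this.consecutive += 1;
    if (this.consecutive >= this.threshold) {
      this.log(`${this.consecutive} consecutive cycle failures. Stopping autopilot.`);
      this.exit(1); // non-zero so the watchdog treats this as a failure
    }
  }
}

// In the real process the wiring might look like:
//   new CycleFailureTracker(5, (code) => process.exit(code), console.error)
```

With option 1 in place, the reconnect path is never reached after the threshold, so options 2 and 3 become defence in depth rather than the primary fix.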

Environment

  • gbrain: 0.33.1.1
  • Bun: 1.3.11
  • Platform: macOS Darwin 25.5.0 (arm64)
  • Engine: postgres (Supabase, session-mode pooler port 5432)
  • Discovered while diagnosing a Supabase session-pool exhaustion incident from accumulated `serve` processes; the upstream pool issue was unrelated but exposed this autopilot reconnect bug.

Happy to PR option 1 if the maintainer team agrees on the direction.
