Problem
When the Supabase DB has a transient outage (port 5432 ECONNREFUSED), the worker supervisor hits its crash limit and sets max_crashes_exceeded, then stops permanently. It will not recover on its own even after the DB comes back online.
This is a poor failure mode for a transient network/DB issue. The supervisor should distinguish between application crashes (legitimate reason to stop retrying) and connection failures (should retry indefinitely with exponential backoff).
Observed behavior
~/.gbrain/worker-supervisor-error.log fills with:
Cannot connect to database: connect ECONNREFUSED <ip>:5432
After N crashes the supervisor sets max_crashes_exceeded, and gbrain jobs supervisor status shows:
Supervisor: not running
⚠ Max crashes exceeded at <timestamp>
Recovery requires manual intervention: gbrain jobs supervisor start --detach
Environment
- gbrain 0.42.26.0
- Supabase-hosted Postgres (direct connection, port 5432)
- macOS, launchd-managed supervisor
Suggested fix
Two options (either would solve it):
- Distinguish crash types: Don't count DB connection failures toward max_crashes. Only count application-level crashes. Connection failures should retry with exponential backoff indefinitely (or with a configurable longer window).
- Configurable crash backoff: Add a supervisor.connection_error_retry_indefinitely config flag (default: true) so the supervisor keeps retrying on ECONNREFUSED/ETIMEDOUT without hitting the crash ceiling.
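A rough sketch of how the two options could fit together, in TypeScript. Everything named here (classifyCrash, backoffDelayMs, onWorkerExit, SupervisorState, SupervisorOptions, and the 1 s base / 60 s cap) is hypothetical and only for illustration; I have not looked at how the supervisor tracks crashes internally, so treat this as a description of the desired behavior rather than a patch.

```typescript
// Hypothetical sketch: none of these names are taken from the gbrain codebase.
type CrashKind = "connection" | "application";

// Error codes treated as transient connection failures (from the issue text).
const CONNECTION_ERROR_CODES = ["ECONNREFUSED", "ETIMEDOUT"];

// Classify a worker failure by its Node-style `code`, falling back to the message.
function classifyCrash(err: unknown): CrashKind {
  const code = (err as { code?: unknown } | null)?.code;
  if (typeof code === "string" && CONNECTION_ERROR_CODES.includes(code)) {
    return "connection";
  }
  const message = err instanceof Error ? err.message : String(err);
  return CONNECTION_ERROR_CODES.some((c) => message.includes(c))
    ? "connection"
    : "application";
}

// Exponential backoff: baseMs * 2^attempt, capped at capMs (assumed defaults).
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 60000): number {
  return Math.min(capMs, baseMs * Math.pow(2, Math.min(attempt, 30)));
}

interface SupervisorState {
  crashCount: number;          // application-level crashes only
  connectionAttempts: number;  // consecutive connection failures
}

interface SupervisorOptions {
  maxCrashes: number;
  // Mirrors the proposed supervisor.connection_error_retry_indefinitely flag.
  connectionErrorRetryIndefinitely: boolean;
}

type Decision =
  | { action: "restart"; delayMs: number }
  | { action: "stop"; reason: "max_crashes_exceeded" };

// Decide what to do after a worker exits with an error.
function onWorkerExit(
  state: SupervisorState,
  err: unknown,
  opts: SupervisorOptions
): Decision {
  const isConnection = classifyCrash(err) === "connection";
  if (isConnection && opts.connectionErrorRetryIndefinitely) {
    // Connection failures back off and retry; they never touch crashCount.
    const delayMs = backoffDelayMs(state.connectionAttempts);
    state.connectionAttempts += 1;
    return { action: "restart", delayMs };
  }
  // Everything else counts toward the crash ceiling, as it does today.
  state.crashCount += 1;
  if (state.crashCount >= opts.maxCrashes) {
    return { action: "stop", reason: "max_crashes_exceeded" };
  }
  return { action: "restart", delayMs: 0 };
}
```

With the flag on, a long run of ECONNREFUSED exits only ever produces restart decisions with delays growing to the cap, while application crashes still stop the supervisor at maxCrashes. A real implementation would presumably also reset connectionAttempts once the worker connects successfully, which this sketch leaves out.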
The current behavior turns a 10-minute Supabase hiccup into a permanently dead supervisor requiring manual intervention.