Summary
The jobs supervisor singleton guard is keyed only on the --pid-file path, and that path defaults to ${HOME}/.gbrain/supervisor.pid. Because the default is $HOME-relative and there is no DB/queue-level mutual exclusion, two supervisors launched with different HOME (or different --pid-file) both pass the O_CREAT|O_EXCL EEXIST check and run simultaneously against the same default queue — with independent, possibly-conflicting --max-rss caps.
In production this manifested as a "ghost" supervisor at --max-rss 4096 co-existing with the operator's intended --max-rss 16384 tree on the same box. The 4GB cap's RSS watchdog kept killing legit autopilot-cycle jobs mid-run (30 dead in 6h, all max stalled count exceeded / aborted: watchdog), while every liveness/doctor check reported the queue "healthy" because a supervisor + worker were alive.
Why the existing guard misses it
MinionSupervisor.acquirePidLock() → tryAtomicCreate() (src/core/minions/supervisor.ts) atomically creates this.opts.pidFile. That's correct mutual exclusion for one pidfile path. But:
export const DEFAULT_PID_FILE: string = (() => {
  const envOverride = process.env.GBRAIN_SUPERVISOR_PID_FILE;
  if (envOverride && envOverride.length > 0) return envOverride;
  const home = process.env.HOME ?? '/tmp';
  return `${home}/.gbrain/supervisor.pid`;
})();
So HOME=/data → /data/.gbrain/supervisor.pid, HOME=/root → /root/.gbrain/supervisor.pid. Two different lock files → both acquire → two supervisors, same queue. The guard's mutual-exclusion domain (a filesystem path) is narrower than the resource being protected (the (database, queue) pair). There is no pg_advisory_lock or (queue)-scoped row lock anywhere in the supervise path — confirmed by grep: the only "lock" on the queue side is per-job lock_token/lock_until, nothing supervisor-scoped.
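The gap can be reproduced in isolation, without a database. The sketch below is a stand-in for the guard, not the actual tryAtomicCreate() implementation: it assumes the lock is an exclusive create through Node's fs.openSync with the "wx" flag, and the helper names tryCreateLock and defaultPidFile are hypothetical. It also ignores the GBRAIN_SUPERVISOR_PID_FILE override, and two scratch directories stand in for HOME=/data and HOME=/root.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical stand-in for the pidfile guard: exclusive create, false on EEXIST.
function tryCreateLock(pidFile: string): boolean {
  fs.mkdirSync(path.dirname(pidFile), { recursive: true });
  try {
    const fd = fs.openSync(pidFile, "wx"); // "wx" = O_CREAT | O_EXCL | O_WRONLY
    fs.writeSync(fd, String(process.pid));
    fs.closeSync(fd);
    return true;
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "EEXIST") return false;
    throw err;
  }
}

// Mirrors the $HOME-relative default quoted above (env override omitted).
function defaultPidFile(home: string): string {
  return `${home}/.gbrain/supervisor.pid`;
}

// Scratch directories standing in for two different HOME values.
const scratch = fs.mkdtempSync(path.join(os.tmpdir(), "pidlock-"));
const homeA = path.join(scratch, "data");
const homeB = path.join(scratch, "root");

const firstA = tryCreateLock(defaultPidFile(homeA));  // acquires
const firstB = tryCreateLock(defaultPidFile(homeB));  // also acquires: different path
const secondA = tryCreateLock(defaultPidFile(homeA)); // rejected: same path as firstA

console.log({ firstA, firstB, secondA });
```

The exclusive create only rejects the second attempt on the same path. Nothing in it knows which database or queue the caller is about to work.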
Repro
# same brain, same DATABASE_URL, same queue — different HOME
HOME=/data gbrain jobs supervisor --queue default --max-rss 16384 &
HOME=/root gbrain jobs supervisor --queue default --max-rss 4096 &
# both print `started`; both spawn a `jobs work --queue default`; both claim from minion_jobs.
# the 4096 child's RSS watchdog SIGTERMs jobs the 16384 operator never wanted capped.
(Equivalently: one operator-launched supervisor with an explicit --pid-file, plus any internal/doctor/self-upgrade relaunch that uses the $HOME-default path — different paths, both win.)
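Impact
- Two supervisors and their workers claim from the same queue, coordinated only by the per-job lock_token, with conflicting RSS caps where the lowest cap silently wins and watchdog-kills healthy work.
- The condition is invisible to jobs stats, doctor, and the #1801 wedge watchdog, all of which check "is a supervisor/worker alive," not "is there exactly one, with the intended config."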
Suggested fix (in rough priority)
1. Queue-scoped advisory lock as the real singleton. On supervisor.start(), take a session-scoped advisory lock with pg_try_advisory_lock(hashtext('gbrain-supervisor:' || queue)) against the brain DB (the non-blocking variant; plain pg_advisory_lock would wait for the current holder instead of returning). A second supervisor on the same (db, queue) then fails fast with exit code 2 regardless of pidfile path or HOME. This makes the mutex domain match the protected resource. Keep the pidfile for fast local status/stop, but it stops being the authority.
2. Canonicalize the default pidfile off the brain identity, not $HOME. Derive it from the resolved DB/brain root (or a fixed /var/state dir), so the same brain always maps to the same lock file even across HOME values. This removes the silent footgun for the common case.
3. The started event and doctor should assert "exactly one supervisor for this queue" and surface the effective --max-rss. Today doctor can't tell you a second supervisor exists with a different cap. A SELECT-based "supervisors seen claiming this queue in the last N min" check would have caught this at a glance.
(1) is the durable fix; (2) closes the common path; (3) makes it observable.
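A minimal sketch of fix (1), under stated assumptions. QueryFn, acquireSupervisorLock and startSupervisor are hypothetical names, not existing gbrain APIs; the query function is injected so the sketch does not depend on a particular driver. It uses pg_try_advisory_lock rather than pg_advisory_lock because the blocking form would wait for the holder instead of failing fast. Since the environment below lists a Supabase pooler, note that a session-level advisory lock is only meaningful on a connection pinned to one backend session; a transaction-mode pooler generally does not guarantee that, so the sketch assumes a direct or session-mode connection that stays open for the supervisor's lifetime.

```typescript
// Minimal query interface (an assumption for this sketch). With node-postgres,
// `(sql, params) => client.query(sql, params)` on one dedicated Client fits it.
type QueryFn = (
  sql: string,
  params: unknown[],
) => Promise<{ rows: Array<Record<string, unknown>> }>;

// Non-blocking, session-scoped lock keyed on the queue name.
// The lock is released automatically when the holding session ends.
const SUPERVISOR_LOCK_SQL =
  "SELECT pg_try_advisory_lock(hashtext('gbrain-supervisor:' || $1::text)) AS acquired";

// Returns true if this session now holds the supervisor lock for `queue`.
async function acquireSupervisorLock(query: QueryFn, queue: string): Promise<boolean> {
  const { rows } = await query(SUPERVISOR_LOCK_SQL, [queue]);
  return rows[0]?.acquired === true;
}

// Returns a process exit code: 0 if startup may proceed, 2 if another
// supervisor already owns this (db, queue) pair.
async function startSupervisor(query: QueryFn, queue: string): Promise<number> {
  if (!(await acquireSupervisorLock(query, queue))) {
    console.error(`another supervisor already holds queue "${queue}" on this database`);
    return 2;
  }
  // ...pidfile creation and worker spawn would follow here...
  return 0;
}
```

Because the key is derived from the queue name and the lock lives in the database, the outcome no longer depends on HOME or --pid-file. A supervisor that crashes drops its session and releases the lock without any stale-file cleanup.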
Environment
- gbrain 0.42.25.0 (commit 9a0bae8d)
- Postgres engine (Supabase pooler), single brain, single default queue
- Linux, cgroup-v2 memory.max = 100GB; resolveDefaultMaxRssMb() correctly returns 16384 here, confirming the rogue 4096 was an explicit flag on a second supervisor, not the auto-default.