jobs supervisor singleton is pidfile-path-keyed (HOME-relative default) → two supervisors run on the same queue with conflicting --max-rss #1849

@garrytan-agents

Summary

The jobs supervisor singleton guard is keyed only on the --pid-file path, and that path defaults to ${HOME}/.gbrain/supervisor.pid. Because the default is $HOME-relative and there is no DB/queue-level mutual exclusion, two supervisors launched with different HOME (or different --pid-file) both pass the O_CREAT|O_EXCL EEXIST check and run simultaneously against the same default queue — with independent, possibly-conflicting --max-rss caps.

In production this manifested as a "ghost" supervisor at --max-rss 4096 co-existing with the operator's intended --max-rss 16384 tree on the same box. The 4GB cap's RSS watchdog kept killing legit autopilot-cycle jobs mid-run (30 dead in 6h, all max stalled count exceeded / aborted: watchdog), while every liveness/doctor check reported the queue "healthy" because a supervisor + worker were alive.

Why the existing guard misses it

MinionSupervisor.acquirePidLock() → tryAtomicCreate() (src/core/minions/supervisor.ts) atomically creates this.opts.pidFile. That's correct mutual exclusion for one pidfile path. But the default path is derived from $HOME:

export const DEFAULT_PID_FILE: string = (() => {
  const envOverride = process.env.GBRAIN_SUPERVISOR_PID_FILE;
  if (envOverride && envOverride.length > 0) return envOverride;
  const home = process.env.HOME ?? '/tmp';
  return `${home}/.gbrain/supervisor.pid`;
})();

So HOME=/data → /data/.gbrain/supervisor.pid, and HOME=/root → /root/.gbrain/supervisor.pid. Two different lock files → both acquire → two supervisors, same queue. The guard's mutual-exclusion domain (a filesystem path) is narrower than the resource being protected (the (database, queue) pair). There is no pg_advisory_lock or queue-scoped row lock anywhere in the supervise path. This was confirmed by grep: the only "lock" on the queue side is the per-job lock_token/lock_until, nothing supervisor-scoped.
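To make the gap concrete, here is a minimal sketch of a path-keyed exclusive create. It is not the real tryAtomicCreate(): tryCreateExclusive and defaultPidFile are hypothetical stand-ins, the env override is omitted, and two temp directories play the role of HOME=/data and HOME=/root.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Mirrors the HOME-relative default quoted above (env override omitted for brevity).
function defaultPidFile(home: string | undefined): string {
  return `${home ?? "/tmp"}/.gbrain/supervisor.pid`;
}

// Hypothetical stand-in for tryAtomicCreate(): the "wx" flag maps to O_CREAT|O_EXCL.
// Returns true if this caller created the file, false if it already existed.
function tryCreateExclusive(pidFile: string): boolean {
  fs.mkdirSync(path.dirname(pidFile), { recursive: true });
  try {
    const fd = fs.openSync(pidFile, "wx");
    fs.writeSync(fd, String(process.pid));
    fs.closeSync(fd);
    return true;
  } catch (err) {
    if ((err as { code?: string }).code === "EEXIST") return false;
    throw err;
  }
}

// Two throwaway directories stand in for HOME=/data and HOME=/root.
const sandbox = fs.mkdtempSync(path.join(os.tmpdir(), "pidlock-demo-"));
const homeA = path.join(sandbox, "data");
const homeB = path.join(sandbox, "root");

const firstA = tryCreateExclusive(defaultPidFile(homeA)); // creates the file
const firstB = tryCreateExclusive(defaultPidFile(homeB)); // different path, also creates
const secondA = tryCreateExclusive(defaultPidFile(homeA)); // same path, hits EEXIST

console.log({ firstA, firstB, secondA });
```

The exclusive create only ever rejects the second caller on the same path. Nothing in it knows which database or queue the caller is about to supervise.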

Repro

# same brain, same DATABASE_URL, same queue — different HOME
HOME=/data gbrain jobs supervisor --queue default --max-rss 16384 &
HOME=/root gbrain jobs supervisor --queue default --max-rss 4096  &
# both print `started`; both spawn a `jobs work --queue default`; both claim from minion_jobs.
# the 4096 child's RSS watchdog SIGTERMs jobs the 16384 operator never wanted capped.

(Equivalently: one operator-launched supervisor with an explicit --pid-file, plus any internal/doctor/self-upgrade relaunch that uses the $HOME-default path — different paths, both win.)

Impact

Suggested fix (in rough priority)

  1. Queue-scoped advisory lock as the real singleton. On supervisor.start(), take a session-scoped advisory lock keyed on hashtext('gbrain-supervisor:' || queue) against the brain DB. Use the non-blocking pg_try_advisory_lock variant, since plain pg_advisory_lock waits for the holder instead of failing. A second supervisor on the same (db, queue) then fails fast with exit code 2 regardless of pidfile path or HOME. This makes the mutex domain match the protected resource. Keep the pidfile for fast local status/stop, but it stops being the authority.
  2. Canonicalize the default pidfile off the brain identity, not $HOME. Derive from the resolved DB/brain root (or a fixed /var/state dir), so the same brain always maps to the same lock file even across HOME values. Removes the silent footgun for the common case.
  3. started event + doctor should assert "exactly one supervisor for this queue" and surface the effective --max-rss. Today doctor can't tell you a second supervisor exists with a different cap. A SELECT-based "supervisors seen claiming this queue in the last N min" check would have caught this in one glance.
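A rough sketch of what (1) could look like, under stated assumptions: DbSession, supervisorLockName, acquireSupervisorLock and makeFakeDb are hypothetical names, not existing gbrain APIs, and the fake database below only imitates "one holder per key" so the snippet runs without Postgres. The SQL uses pg_try_advisory_lock so a second caller gets false back immediately rather than waiting.

```typescript
// Minimal shape of a database session. Any client pinned to ONE connection could satisfy it.
interface DbSession {
  query(sql: string, params: unknown[]): Promise<{ rows: Array<Record<string, unknown>> }>;
}

// Key format taken from the suggestion above.
function supervisorLockName(queue: string): string {
  return `gbrain-supervisor:${queue}`;
}

// Tries to become the only supervisor for `queue` on this database.
// Resolves to true if the lock was obtained, false if another session holds it.
async function acquireSupervisorLock(session: DbSession, queue: string): Promise<boolean> {
  const result = await session.query(
    "SELECT pg_try_advisory_lock(hashtext($1)) AS acquired",
    [supervisorLockName(queue)],
  );
  return result.rows[0]?.acquired === true;
}

// In-memory stand-in for Postgres, for illustration only: one lock table shared by
// every session. It does not model re-entrancy or release on disconnect.
function makeFakeDb(): () => DbSession {
  const held = new Set<string>();
  return () => ({
    async query(_sql: string, params: unknown[]) {
      const name = String(params[0]);
      if (held.has(name)) return { rows: [{ acquired: false }] };
      held.add(name);
      return { rows: [{ acquired: true }] };
    },
  });
}

const connect = makeFakeDb();
const demo = (async () => {
  const first = await acquireSupervisorLock(connect(), "default");
  const second = await acquireSupervisorLock(connect(), "default"); // same queue
  const other = await acquireSupervisorLock(connect(), "bulk"); // hypothetical second queue
  return { first, second, other };
})();
demo.then((r) => console.log(r));
```

In the real supervisor, a false result would presumably translate to the exit code 2 proposed above. One caveat worth checking before adopting this: a session-scoped advisory lock lives on a single connection, so it needs a dedicated connection held for the supervisor's lifetime. A transaction-mode pooler may hand each statement to a different backend, in which case the lock would not behave as a singleton.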

(1) is the durable fix; (2) closes the common path; (3) makes it observable.
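For (2), one possible shape is to hash a brain identity into the pidfile name. This is only a sketch: canonicalPidFile is a hypothetical helper, the choice of host + port + database name + queue as the "identity" is an assumption, and the state directory and connection strings in the example are made up.

```typescript
import { createHash } from "node:crypto";
import * as path from "node:path";

// Hypothetical: derive the pidfile from (database, queue) rather than HOME.
// `stateDir` is an assumed fixed location supplied by the caller.
function canonicalPidFile(databaseUrl: string, queue: string, stateDir: string): string {
  const url = new URL(databaseUrl);
  // Credentials and query parameters are deliberately left out of the identity,
  // so rotating a password does not move the lock file.
  const identity = `${url.hostname}:${url.port || "5432"}${url.pathname}|${queue}`;
  const digest = createHash("sha256").update(identity).digest("hex").slice(0, 16);
  return path.join(stateDir, `supervisor-${digest}.pid`);
}

// Example values are placeholders, not real endpoints.
const stateDir = "/var/state/gbrain";
const fromData = canonicalPidFile("postgres://app:pw1@db.example.com:6543/postgres", "default", stateDir);
const fromRoot = canonicalPidFile("postgres://app:pw2@db.example.com:6543/postgres", "default", stateDir);
const otherQueue = canonicalPidFile("postgres://app:pw1@db.example.com:6543/postgres", "bulk", stateDir);

console.log(fromData === fromRoot, fromData === otherQueue);
```

With this derivation, HOME no longer influences the path, so the existing O_EXCL guard would catch the two-HOME case on a single host. It still cannot see supervisors on other machines, and two different endpoints for the same database would hash differently, which is why it complements rather than replaces a database-level lock.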

Environment

  • gbrain 0.42.25.0 (commit 9a0bae8d)
  • Postgres engine (Supabase pooler), single brain, single default queue
  • Linux, cgroup-v2 memory.max = 100GB; resolveDefaultMaxRssMb() correctly returns 16384 here — confirming the rogue 4096 was an explicit flag on a second supervisor, not the auto-default.
