jobs supervisor singleton is pidfile-path-keyed (HOME-relative default) → two supervisors run on the same queue with conflicting --max-rss

## Summary

The `jobs supervisor` singleton guard is keyed **only on the `--pid-file` path**, and that path defaults to `${HOME}/.gbrain/supervisor.pid`. Because the default is `$HOME`-relative and there is **no DB/queue-level mutual exclusion**, two supervisors launched with different `HOME` (or different `--pid-file`) both pass the `O_CREAT|O_EXCL` EEXIST check and **run simultaneously against the same `default` queue** — with independent, possibly-conflicting `--max-rss` caps.

In production this manifested as a "ghost" supervisor at `--max-rss 4096` co-existing with the operator's intended `--max-rss 16384` tree on the same box. The 4GB cap's RSS watchdog kept killing legit `autopilot-cycle` jobs mid-run (**30 dead in 6h**, all `max stalled count exceeded` / `aborted: watchdog`), while every liveness/`doctor` check reported the queue "healthy" because *a* supervisor + worker were alive.

## Why the existing guard misses it

`MinionSupervisor.acquirePidLock()` → `tryAtomicCreate()` (`src/core/minions/supervisor.ts`) atomically creates `this.opts.pidFile`. That's correct mutual exclusion **for one pidfile path**. But:

```ts
export const DEFAULT_PID_FILE: string = (() => {
  const envOverride = process.env.GBRAIN_SUPERVISOR_PID_FILE;
  if (envOverride && envOverride.length > 0) return envOverride;
  const home = process.env.HOME ?? '/tmp';
  return `${home}/.gbrain/supervisor.pid`;
})();
```

So `HOME=/data` → `/data/.gbrain/supervisor.pid`, `HOME=/root` → `/root/.gbrain/supervisor.pid`. Two **different lock files → both acquire → two supervisors, same queue**. The guard's mutual-exclusion domain (a filesystem path) is narrower than the resource being protected (the `(database, queue)` pair). There is no `pg_advisory_lock` or `(queue)`-scoped row lock anywhere in the supervise path — confirmed by grep: the only "lock" on the queue side is per-job `lock_token`/`lock_until`, nothing supervisor-scoped.
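To make the divergence concrete, here is a minimal sketch that mirrors the resolution logic above as a pure function. `resolveDefaultPidFile` and its `env` parameter are hypothetical names introduced for illustration; the real code reads `process.env` once at module load.

```ts
// Hypothetical pure mirror of the DEFAULT_PID_FILE logic, with env passed in
// so the effect of different HOME values can be compared side by side.
type Env = Record<string, string | undefined>;

function resolveDefaultPidFile(env: Env): string {
  const envOverride = env.GBRAIN_SUPERVISOR_PID_FILE;
  if (envOverride && envOverride.length > 0) return envOverride;
  const home = env.HOME ?? '/tmp';
  return `${home}/.gbrain/supervisor.pid`;
}

const pidUnderData = resolveDefaultPidFile({ HOME: '/data' });
const pidUnderRoot = resolveDefaultPidFile({ HOME: '/root' });

console.log(pidUnderData); // /data/.gbrain/supervisor.pid
console.log(pidUnderRoot); // /root/.gbrain/supervisor.pid
console.log(pidUnderData === pidUnderRoot); // false: two distinct lock files
```

Nothing in the derived path depends on the database or the queue, which is why the two processes never contend for the same file.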

## Repro

```bash
# same brain, same DATABASE_URL, same queue — different HOME
HOME=/data gbrain jobs supervisor --queue default --max-rss 16384 &
HOME=/root gbrain jobs supervisor --queue default --max-rss 4096  &
# both print `started`; both spawn a `jobs work --queue default`; both claim from minion_jobs.
# the 4096 child's RSS watchdog SIGTERMs jobs the 16384 operator never wanted capped.
```

(Equivalently: one operator-launched supervisor with an explicit `--pid-file`, plus any internal/`doctor`/self-upgrade relaunch that uses the `$HOME`-default path — different paths, both win.)
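The same effect can be shown without gbrain at all. The sketch below assumes the guard behaves like a plain exclusive create; `tryCreateExclusive` is a hypothetical stand-in for `tryAtomicCreate()`, not its actual implementation, and the two directories are throwaway stand-ins for `/data` and `/root`.

```ts
import { closeSync, mkdirSync, mkdtempSync, openSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Hypothetical stand-in for tryAtomicCreate(): the 'wx' flag opens with
// O_CREAT|O_EXCL, so the call fails with EEXIST if the path already exists.
function tryCreateExclusive(path: string): boolean {
  try {
    closeSync(openSync(path, 'wx'));
    return true;
  } catch (err) {
    if ((err as { code?: string }).code === 'EEXIST') return false;
    throw err;
  }
}

// Build <fakeHome>/.gbrain/supervisor.pid the way the HOME-relative default does.
function pidPathFor(fakeHome: string): string {
  mkdirSync(join(fakeHome, '.gbrain'), { recursive: true });
  return join(fakeHome, '.gbrain', 'supervisor.pid');
}

const sandbox = mkdtempSync(join(tmpdir(), 'pidlock-demo-'));
const pidA = pidPathFor(join(sandbox, 'data'));
const pidB = pidPathFor(join(sandbox, 'root'));

console.log(tryCreateExclusive(pidA)); // true: first supervisor wins its lock
console.log(tryCreateExclusive(pidB)); // true: second supervisor also wins, different file
console.log(tryCreateExclusive(pidA)); // false: exclusion only holds per path
```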

## Impact

- Two workers on one queue ⇒ lock contention, double-claims racing on `lock_token`, and **conflicting RSS caps** where the lowest cap silently wins and watchdog-kills healthy work.
- Completely invisible to `jobs stats`, `doctor`, and the #1801 wedge watchdog — all of which check "is *a* supervisor/worker alive," not "is there exactly one, with the intended config."
- The #1801 progress-watchdog and #1824 self-heal can't help: the queue *is* making progress (the other supervisor's worker), so nothing looks wedged while jobs die on the wrong cap.
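The gap between "a supervisor is alive" and "exactly one supervisor, with the intended cap" can be sketched as a small check. Everything here is hypothetical: `SupervisorSighting`, `assessSupervisors`, and the idea that sightings are available at all (for example from heartbeat rows) are assumptions for illustration, not existing gbrain APIs.

```ts
// Hypothetical record of one live supervisor observed on a queue.
interface SupervisorSighting {
  host: string;
  pid: number;
  maxRssMb: number;
}

type Verdict = { ok: true } | { ok: false; reason: string };

// Liveness alone would accept any non-empty list; this check also rejects
// duplicates and a cap that differs from what the operator intended.
function assessSupervisors(seen: SupervisorSighting[], intendedMaxRssMb: number): Verdict {
  const first = seen[0];
  if (first === undefined) return { ok: false, reason: 'no supervisor alive' };
  if (seen.length > 1) {
    const detail = seen.map((s) => `${s.host}:${s.pid}=${s.maxRssMb}MB`).join(', ');
    return { ok: false, reason: `expected exactly 1 supervisor, found ${seen.length} (${detail})` };
  }
  if (first.maxRssMb !== intendedMaxRssMb) {
    return { ok: false, reason: `max-rss is ${first.maxRssMb}MB, expected ${intendedMaxRssMb}MB` };
  }
  return { ok: true };
}

// The incident shape: two supervisors, two caps, one queue.
const verdict = assessSupervisors(
  [
    { host: 'box', pid: 101, maxRssMb: 16384 },
    { host: 'box', pid: 202, maxRssMb: 4096 },
  ],
  16384,
);
console.log(verdict.ok); // false
```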

## Suggested fix (in rough priority)

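1. **Queue-scoped advisory lock as the real singleton.** On `supervisor.start()`, take a session-scoped `pg_try_advisory_lock(hashtext('gbrain-supervisor:' || queue))` against the brain DB (the `try` variant, because plain `pg_advisory_lock` blocks until the lock is free instead of failing). A second supervisor on the same `(db, queue)` then fails fast with exit code 2 regardless of pidfile path or `HOME`. This makes the mutex domain match the protected resource. A session-scoped lock lives only as long as its connection, so the supervisor must hold one dedicated connection for its lifetime; behind a transaction-mode pooler that is not guaranteed, so a direct or session-mode connection is likely required. Keep the pidfile for fast local `status`/`stop`, but it stops being the authority.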
2. **Canonicalize the default pidfile off the brain identity, not `$HOME`.** Derive from the resolved DB/brain root (or a fixed `/var`/state dir), so the same brain always maps to the same lock file even across `HOME` values. Removes the silent footgun for the common case.
3. **`started` event + `doctor` should assert "exactly one supervisor for this queue" and surface the effective `--max-rss`.** Today `doctor` can't tell you a second supervisor exists with a different cap. A `SELECT`-based "supervisors seen claiming this queue in the last N min" check would have caught this in one glance.
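A minimal sketch of (1), assuming a client whose `query(text, values)` resolves to `{ rows }` (the shape used by node-postgres). `Queryable`, `tryAcquireSupervisorLock`, and `makeFakeDb` are hypothetical names; the fake exists only so the sketch runs without a database and does not model Postgres's per-session re-entrancy. It uses `pg_try_advisory_lock`, which returns immediately, so the second caller can exit instead of waiting.

```ts
// Minimal client shape this sketch relies on (an assumption, not a gbrain type).
interface Queryable {
  query(text: string, values: unknown[]): Promise<{ rows: Array<Record<string, unknown>> }>;
}

// Returns true if this session now holds the queue's supervisor lock.
// The caller must keep the same connection open for the supervisor's lifetime.
async function tryAcquireSupervisorLock(db: Queryable, queue: string): Promise<boolean> {
  const res = await db.query(
    "SELECT pg_try_advisory_lock(hashtext('gbrain-supervisor:' || $1::text)) AS acquired",
    [queue],
  );
  return res.rows[0]?.acquired === true;
}

// In-memory stand-in for Postgres: one shared set of held keys, one fake per "session".
function makeFakeDb(held: Set<string>): Queryable {
  return {
    async query(_text, values) {
      const key = String(values[0]);
      const acquired = !held.has(key);
      if (acquired) held.add(key);
      return { rows: [{ acquired }] };
    },
  };
}

async function demo(): Promise<boolean[]> {
  const held = new Set<string>();
  const first = await tryAcquireSupervisorLock(makeFakeDb(held), 'default');
  const second = await tryAcquireSupervisorLock(makeFakeDb(held), 'default');
  const otherQueue = await tryAcquireSupervisorLock(makeFakeDb(held), 'bulk');
  return [first, second, otherQueue];
}

demo().then((r) => console.log(r)); // [ true, false, true ]
```

In a real supervisor the `false` branch would be where the process logs the conflict and exits with code 2.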

(1) is the durable fix; (2) closes the common path; (3) makes it observable.
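For (2), one possible shape is to hash a stable brain identity into the pidfile name. This is a sketch under stated assumptions: `canonicalPidFile`, the `/var/run/gbrain` state directory, and the choice of host, port, database path, and queue as the identity are all hypothetical; a real implementation would need to decide what "same brain" means for its deployment.

```ts
import { createHash } from 'node:crypto';
import { join } from 'node:path';

// Hypothetical: derive the pidfile from (database, queue) instead of $HOME.
// Credentials and query parameters are deliberately left out of the identity.
function canonicalPidFile(
  databaseUrl: string,
  queue: string,
  stateDir: string = '/var/run/gbrain', // assumed location, not a gbrain default
): string {
  const u = new URL(databaseUrl);
  const identity = `${u.hostname}:${u.port || '5432'}${u.pathname}|${queue}`;
  const digest = createHash('sha256').update(identity).digest('hex').slice(0, 16);
  return join(stateDir, `supervisor-${digest}.pid`);
}

// Same database and queue map to one path, whatever HOME or user is in effect.
const asOperator = canonicalPidFile('postgres://ops:secret@db.example.com:6543/postgres', 'default');
const asRoot = canonicalPidFile('postgres://root:other@db.example.com:6543/postgres', 'default');
const otherQueue = canonicalPidFile('postgres://ops:secret@db.example.com:6543/postgres', 'bulk');

console.log(asOperator === asRoot); // true
console.log(asOperator === otherQueue); // false
```

This only narrows the common case; it still guards a single host's filesystem, so it complements the DB-level lock rather than replacing it.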

## Environment

- gbrain `0.42.25.0` (commit `9a0bae8d`)
- Postgres engine (Supabase pooler), single brain, single `default` queue
- Linux, cgroup-v2 memory.max = 100GB; `resolveDefaultMaxRssMb()` correctly returns 16384 here — confirming the rogue `4096` was an **explicit** flag on a second supervisor, not the auto-default.


jobs supervisor singleton is pidfile-path-keyed (HOME-relative default) → two supervisors run on the same queue with conflicting --max-rss #1849

Summary

Why the existing guard misses it

Repro

Impact

Suggested fix (in rough priority)

Environment

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

jobs supervisor singleton is pidfile-path-keyed (HOME-relative default) → two supervisors run on the same queue with conflicting --max-rss #1849

Description

Summary

Why the existing guard misses it

Repro

Impact

Suggested fix (in rough priority)

Environment

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions