Feature: --nice flag on jobs supervisor/work to yield CPU to interactive co-tenants (priority propagated to spawned workers) #1815

@garrytan-agents

Version: gbrain 0.42.10.0 (Postgres engine; minion jobs supervisor + child jobs work)
Type: Feature request
Severity: Medium — production interactive co-tenant (a chat gateway) was starved of CPU for hours by full-throttle brain processing.

TL;DR

jobs supervisor / jobs work run at default process priority and there's no built-in way to make the brain yield CPU to an interactive co-tenant. On a shared box where gbrain runs alongside something latency-sensitive (a chat gateway, a UI server, an editor), a heavy autopilot/embed backlog at full concurrency pins the cores and the foreground process visibly lags. Wrapping the binary in nice from outside almost works — but the supervisor spawns its worker as a child, and operators reasonably expect the priority to propagate to the whole tree. Request: a first-class --nice <n> flag on jobs supervisor (and jobs work) that calls setpriority on itself and is inherited by every spawned child (tini + worker).

Motivation (real incident)

Shared 126GB box: gbrain minion supervisor (concurrency 3) co-located with a chat gateway that handles interactive user turns. After a backlog of ~40 retried jobs + autopilot cycles started draining at full concurrency, load average hit ~7 and the gateway — at default priority, competing head-to-head with a 90%+ CPU worker — fell behind on user-facing responses by minutes.

The fix was simple and correct: run the gbrain tree at nice +15. The brain still gets full concurrency and drains the queue, but it only consumes CPU the interactive process isn't using. Load dropped from ~7 to ~3 with no throughput loss, because the box's other cores were idle, and user latency returned to normal. Cutting concurrency to 1 was the wrong lever; niceness is the right one: full parallelism for throughput, low scheduling priority so foreground work always wins contention.

Why "just wrap it in nice" isn't enough

  1. Propagation expectation. nice -n 15 gbrain jobs supervisor … does set the supervisor's priority, and on Linux child processes inherit their parent's niceness, so in practice the worker comes up niced too. But this is implicit and easy to break: any code path that re-execs, detaches, or resets priority undoes it silently, and nothing in jobs stats/doctor shows whether the tree is niced. A flag makes the intent explicit, testable, and observable.
  2. Discoverability. Operators hit this exact CPU-contention wall and reinvent the nice wrapper from scratch (we did). A documented --nice flag turns tribal ops knowledge into a supported feature everyone benefits from.
  3. Self-contained supervision. For deployments that use gbrain's own jobs supervisor as the top-level process manager (rather than an external wrapper), there's currently no in-band way to set priority at all.
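
The inheritance described in point 1 can be observed directly. The following is a minimal sketch, assuming a Node runtime; `childNiceness` is a hypothetical helper written for this illustration and is not part of gbrain. It spawns a child from the same runtime binary and has the child report its own niceness, which should match the parent's as long as nothing in the spawn path resets it.

```typescript
import { getPriority } from "node:os";
import { spawnSync } from "node:child_process";

// Hypothetical helper: spawn a child from the same runtime binary and ask it
// to print its own niceness. No priority is set here; the child simply
// reports whatever it inherited from this process.
export function childNiceness(): number {
  const result = spawnSync(
    process.execPath,
    ["-e", "console.log(require('node:os').getPriority())"],
    { encoding: "utf8" },
  );
  if (result.error) throw result.error;
  if (result.status !== 0) {
    throw new Error(`child exited with status ${result.status}: ${result.stderr}`);
  }
  return Number.parseInt(result.stdout.trim(), 10);
}

// getPriority() with no pid argument reads the calling process.
const parentNice = getPriority();
const inheritedNice = childNiceness();
console.log(`parent nice=${parentNice} child nice=${inheritedNice}`);
```

Running this script under `nice -n 15` should print 15 for both values. The sketch only shows the default behavior; it cannot detect a code path that later resets priority, which is the gap the proposed flag and status output are meant to close.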

Proposed behavior

  • Add --nice <n> (range -20..19, default unset = no change) to jobs supervisor and jobs work.
  • On startup, the supervisor calls os.setPriority(0, n) on itself (available since Node 10.10; pid 0 means the current process). Negative values usually require elevated privileges and should fail with a clear error.
  • When spawning the child worker, either (a) rely on inheritance (document it), or better (b) pass the same --nice through and have the worker set its own priority at startup, so it's robust to any priority reset in the spawn path.
  • Surface the effective niceness in jobs stats / doctor (e.g. worker: pid=… nice=15) so operators can confirm the tree is yielding as intended.
  • No behavior change when the flag is omitted.
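
The proposed behavior could be sketched roughly as follows. This is an illustration under stated assumptions, not gbrain's implementation: `parseNice`, `applyNice`, and `workerArgs` are hypothetical names, and the worker argv shown is made up. Only `os.setPriority` and `os.getPriority` are real Node APIs.

```typescript
import { getPriority, setPriority } from "node:os";

// Hypothetical helper: validate the value passed to --nice.
// Returns undefined when the flag is absent so callers change nothing.
export function parseNice(raw: string | undefined): number | undefined {
  if (raw === undefined) return undefined;
  const text = raw.trim();
  if (!/^[+-]?\d+$/.test(text)) {
    throw new Error(`--nice expects an integer, got "${raw}"`);
  }
  const n = Number.parseInt(text, 10);
  if (n < -20 || n > 19) {
    throw new Error(`--nice must be within -20..19, got ${n}`);
  }
  return n;
}

// Hypothetical helper: apply the niceness to the current process (if given)
// and return the effective value as reported by the OS.
// setPriority throws if the OS refuses, e.g. lowering niceness unprivileged.
export function applyNice(n: number | undefined): number {
  if (n !== undefined) setPriority(n); // pid omitted = calling process
  return getPriority();
}

// Hypothetical helper for option (b): forward the flag to the worker's argv
// so the worker re-applies it at startup instead of relying on inheritance.
export function workerArgs(base: string[], n: number | undefined): string[] {
  return n === undefined ? base : [...base, "--nice", String(n)];
}
```

A supervisor entry point would call `applyNice(parseNice(value))` once at startup and pass `workerArgs(...)` to its spawn call. Re-applying in the worker is redundant when inheritance works, which is the point: it keeps the tree niced even if the spawn path resets priority.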

Acceptance

  • gbrain jobs supervisor --nice 15 → supervisor and spawned worker both run at OS nice 15 (verifiable via ps -o ni).
  • doctor/jobs stats reports effective worker niceness.
  • Omitting --nice leaves priority untouched (back-compat).
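
The second acceptance item could be backed by reading niceness from the OS rather than echoing the flag value. A minimal sketch, assuming Node; `formatNice` and `reportTree` are hypothetical names, and the output format follows the `worker: pid=… nice=15` example from the proposal.

```typescript
import { getPriority } from "node:os";

// Hypothetical formatter matching the status line suggested in the proposal.
export function formatNice(role: string, pid: number, nice: number): string {
  return `${role}: pid=${pid} nice=${nice}`;
}

// Hypothetical helper: look up the effective niceness of each process by pid.
// getPriority(pid) queries the OS, so the report reflects what the scheduler
// sees. It throws if a pid no longer exists.
export function reportTree(pids: Record<string, number>): string[] {
  return Object.entries(pids).map(([role, pid]) =>
    formatNice(role, pid, getPriority(pid)),
  );
}

// Example: report on the current process only. A supervisor would also pass
// the pid of its spawned worker.
console.log(reportTree({ supervisor: process.pid }).join("\n"));
```

Reporting the OS-observed value means a silent priority reset somewhere in the spawn path would show up in the status output instead of being masked by the configured value.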

Notes


Filed from a production CPU-contention incident, 2026-06-03. Local remediation: spawn the supervisor tree under nice -n 15, kept concurrency at 3 — full throughput, interactive gateway always wins CPU.
