
v0.42.23.0 feat(jobs): --nice scheduling-priority flag for jobs work/supervisor (#1815) #1820

Merged
garrytan merged 15 commits into master from garrytan/jobs-nice-flag
Jun 3, 2026

@garrytan commented Jun 3, 2026

What

Adds a --nice <n> flag to gbrain jobs work and gbrain jobs supervisor that sets the process's OS scheduling priority (POSIX -20..19), propagates to spawned workers and their children, and surfaces the effective niceness in jobs stats, jobs supervisor status, and gbrain doctor. Closes #1815.

Full concurrency, low priority: the work finishes just as fast when the box is idle and yields the CPU when it is busy. In the incident that drove this, renicing the job tree took load from ~7 to ~3 with no measurable throughput loss. This is distinct from the concurrency/inflight cap (#1801) and composes with it (--nice = priority, concurrency = width).

How

  • niceness.ts: parseNiceValue (whole-string parse), applyNiceness (re-reads effective in both success and failure paths), getEffectiveNiceness, formatNice.
  • worker-registry.ts: workers self-register their real pid + requested/effective nice under gbrainPath('workers') (brain-isolated); readWorkers prunes ESRCH (keeps EPERM) with a pid-reuse start-time guard. Cleanup on finally and process.on('exit').
  • supervisor-pid.ts: readSupervisorPid extracted from the copy-pasted PID-file + liveness block (now shared by status/doctor/stats).
  • supervisor.ts: nice opts, extracted testable buildWorkerArgs (appends --nice), emits niceness on started/worker_spawned audit events.
  • doctor.ts: separate supervisor_niceness check (warns on requested ≠ effective) so it can't clobber the supervisor crash-check precedence; registered in doctor-categories.
  • Supervisor applies its renice in the foreground-start path only (after the --detach fork), so the long-lived process gets reniced, not the throwaway parent.
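The "re-reads effective in both success and failure paths" behaviour of applyNiceness can be sketched roughly as below. This is a minimal illustration, not the code in niceness.ts: the NiceResult shape and its field names are assumptions, and the only real APIs used are Node's os.setPriority and os.getPriority (pid 0 means the current process).

```typescript
import * as os from "node:os";

// Hypothetical result shape; the real niceness.ts may report this differently.
interface NiceResult {
  requested: number;
  effective: number;
  ok: boolean;
  error: string | null;
}

function applyNiceness(requested: number): NiceResult {
  let ok = true;
  let error: string | null = null;
  try {
    // pid 0 targets the calling process.
    os.setPriority(0, requested);
  } catch (err) {
    // Typically a permission error when an unprivileged process asks
    // for a lower (more favourable) nice value than it already has.
    ok = false;
    error = err instanceof Error ? err.message : String(err);
  }
  // Re-read after BOTH outcomes so callers report what the OS actually
  // granted, not what was asked for.
  const effective = os.getPriority(0);
  return { requested, effective, ok, error };
}
```

Returning both numbers is what would let a check such as supervisor_niceness warn when requested and effective disagree.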

Test plan

  • 5 new unit test files (niceness, worker-registry, supervisor-pid, parseNiceFlag, buildWorkerArgs). Affected-area suite green (141 tests).
  • E2E mechanism verified: applyNiceness(7) → ps -o ni confirms 7 → registry round-trips → cleanup unlinks.
  • Build clean (bun build --compile).

Reviews

Plan cleared by /plan-eng-review (5 findings resolved) and /codex outside-voice (12 findings: 11 folded, 1 documented). Codex caught the detached-supervisor renice-ordering bug and the parseInt("3.5") gap.
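To illustrate the parseInt("3.5") gap: parseInt stops at the first non-digit, so "3.5" parses as 3 and "10abc" as 10. A whole-string parse might look like the sketch below. It is an assumption-laden illustration rather than the shipped parseNiceValue; in particular, returning null for invalid or out-of-range input (instead of throwing or clamping) is a choice made here for brevity.

```typescript
// POSIX nice range as stated in the PR description.
const NICE_MIN = -20;
const NICE_MAX = 19;

// Hypothetical sketch; the real parseNiceValue in niceness.ts may differ.
function parseNiceValue(raw: string): number | null {
  const text = raw.trim();
  // Anchor the pattern at both ends so trailing junk is rejected,
  // unlike parseInt, which silently ignores it.
  if (!/^[+-]?\d+$/.test(text)) return null;
  const n = Number(text);
  if (n < NICE_MIN || n > NICE_MAX) return null;
  return n;
}

// For comparison, what parseInt alone would accept:
const lenientFloat = Number.parseInt("3.5", 10); // 3
const lenientJunk = Number.parseInt("10abc", 10); // 10
```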

🤖 Generated with Claude Code

Documentation

  • docs/guides/minions-deployment.md: new "Lowering scheduling priority (--nice)" section — full-concurrency-low-priority pattern, GBRAIN_NICE env, how to confirm the effective value.
  • docs/architecture/KEY_FILES.md: added per-file entries for niceness.ts, worker-registry.ts, supervisor-pid.ts; extended the supervisor.ts entry with buildWorkerArgs + nice opts.
  • CHANGELOG.md: 0.42.23.0 entry (user-facing + contributors subsection).

Coverage: --nice flag and GBRAIN_NICE have reference (README/deployment guide) + how-to (deployment guide examples) coverage. No diagram drift. No documentation debt.

garrytan and others added 15 commits June 3, 2026 08:02
OS scheduling-priority primitives for issue #1815:
- niceness.ts: parseNiceValue (whole-string), applyNiceness (re-reads
  effective in success AND failure paths), getEffectiveNiceness, formatNice.
- worker-registry.ts: live workers self-register pid + requested/effective
  nice under gbrainPath('workers'); readWorkers prunes ESRCH (keeps EPERM)
  with a pid-reuse start-time guard.
- supervisor-pid.ts: readSupervisorPid extracted from the copy-pasted
  PID-file + liveness block.
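The ESRCH/EPERM distinction in readWorkers can be sketched as follows. Signal 0 performs an existence check without delivering a signal: ESRCH means the pid is gone (prune the entry), while EPERM means the process exists but belongs to another user (keep it). The WorkerEntry shape, probePid, and pruneDead are hypothetical names for illustration, and the pid-reuse start-time guard is deliberately left out.

```typescript
type Liveness = "alive" | "dead";

// Hypothetical registry entry; the real on-disk format is not shown in this PR page.
interface WorkerEntry {
  pid: number;
  requestedNice: number | null;
  effectiveNice: number;
}

function probePid(pid: number): Liveness {
  try {
    // Signal 0 checks for existence and permission only.
    process.kill(pid, 0);
    return "alive";
  } catch (err) {
    const code = (err as { code?: string }).code;
    // ESRCH: no such process. Anything else (notably EPERM) means the
    // process exists, so the entry is kept.
    return code === "ESRCH" ? "dead" : "alive";
  }
}

function pruneDead(
  entries: readonly WorkerEntry[],
  probe: (pid: number) => Liveness = probePid,
): WorkerEntry[] {
  return entries.filter((entry) => probe(entry.pid) === "alive");
}
```

Injecting the probe keeps the pruning logic testable without spawning real processes.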
Wires the --nice <n> flag (and GBRAIN_NICE env) through the CLI (issue #1815):
- jobs work: applies niceness + registers the worker; cleanup on finally and
  process.on('exit').
- jobs supervisor: applies in the foreground-start path only (after the
  --detach fork), passes the apply result into MinionSupervisor.
- supervisor.ts: nice opts, extracted testable buildWorkerArgs (appends
  --nice), emits niceness on started/worker_spawned audit events.
- jobs stats / supervisor status: surface effective worker + supervisor nice.
- doctor: separate supervisor_niceness check (warns on requested != effective)
  so it can't clobber the supervisor crash-check precedence; registered in
  doctor-categories.
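Extracting the argument construction into a pure function is what makes the propagation testable. A minimal sketch, assuming a signature that takes the base arguments and an optional nice value (the real buildWorkerArgs in supervisor.ts likely takes a richer options object):

```typescript
// Hypothetical signature for illustration only.
function buildWorkerArgs(
  baseArgs: readonly string[],
  nice: number | null,
): string[] {
  const args = [...baseArgs];
  if (nice !== null) {
    // Pass flag and value as separate argv entries so no shell quoting is involved.
    args.push("--nice", String(nice));
  }
  return args;
}

// Example: a supervisor started with --nice 10 spawning a worker.
const spawnArgs = buildWorkerArgs(["jobs", "work"], 10);
```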
Unit tests for issue #1815: parseNiceValue rejects 3.5/10abc that parseInt
would accept; applyNiceness re-reads effective on EPERM; registry ESRCH/EPERM +
pid-reuse guard + brain-isolated path; readSupervisorPid states; parseNiceFlag
flag>env precedence; buildWorkerArgs --nice propagation.
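The flag>env precedence tested here could be expressed as in the sketch below. resolveNice is a hypothetical helper, not the actual parseNiceFlag; throwing on invalid input is an assumption, as is the choice not to fall back to GBRAIN_NICE when the flag itself is malformed.

```typescript
// Hypothetical helper; env is passed in so tests need not mutate process.env.
function resolveNice(
  flagValue: string | undefined,
  env: Record<string, string | undefined>,
): number | null {
  // An explicit --nice wins; GBRAIN_NICE is only consulted when the flag is absent.
  const raw = flagValue ?? env["GBRAIN_NICE"];
  if (raw === undefined) return null;
  if (!/^[+-]?\d+$/.test(raw)) {
    throw new Error(`invalid nice value: ${raw}`);
  }
  const n = Number(raw);
  if (n < -20 || n > 19) {
    throw new Error(`nice value out of range (-20..19): ${raw}`);
  }
  return n;
}
```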
--nice flag for jobs work/supervisor (issue #1815).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- minions-deployment.md: niceness tuning section (full concurrency, low priority).
- KEY_FILES.md: entries for niceness.ts, worker-registry.ts, supervisor-pid.ts;
  supervisor.ts entry notes buildWorkerArgs + nice opts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The enrich_thin cycle phase (src/core/cycle.ts ALL_PHASES, between
conversation_facts_backfill and skillopt) shipped without updating the
e2e phase-order expectation, so dream-cycle-phase-order-pglite failed on
master. Sync the expected list to the real ALL_PHASES order.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ll contract

v0.41.13.0 intentionally dropped the "--break-lock + --all is refused" guard so
cron can self-heal every source in one call (sync.ts runBreakLock iterates
sources under --all). The e2e test still asserted the old exit-1 refusal and
failed on master. Assert the current contract: the combination is accepted and
takes the iterate / no-active-sources path (exit 0, no refusal message).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The native fsevents watcher occasionally missed a freshly written file, timing
out the 15s waitFor (~1/3 on master under load). Three fixes:
- inject a polling chokidar watcher via the source's _watchFactory seam
  (usePolling, 20ms interval) so detection never depends on fsevents timing;
- drop deterministic fixtures BEFORE start so the initial scan
  (ignoreInitial:false) emits them, keeping live-watch coverage only where it's
  robust;
- poll for the dedup hit instead of a fixed 600ms sleep.
15/15 green under stress.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
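Replacing a fixed sleep with a bounded poll generally looks like the sketch below. pollUntil is a hypothetical helper (the test's own waitFor is not shown on this page); the 15s timeout and 20ms interval defaults echo the numbers in the commit message but are otherwise arbitrary.

```typescript
// Hypothetical helper: resolves as soon as check() is truthy and
// rejects once the deadline has passed.
async function pollUntil(
  check: () => boolean | Promise<boolean>,
  timeoutMs = 15_000,
  intervalMs = 20,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    if (await check()) return;
    if (Date.now() >= deadline) {
      throw new Error(`condition not met within ${timeoutMs}ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Compared with a fixed sleep, the poll returns as soon as the condition holds and only pays the full timeout on genuine failure.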
connect-bearer and serve-stdio-roundtrip init a PGLite brain and spawn serve,
but passed {...process.env} through — leaking an ambient DATABASE_URL /
GBRAIN_DATABASE_URL into the subprocess, which then came up on Postgres and
failed the `engine: pglite` assertion. Strip both DB vars from the spawned env
so the tests are deterministic whether or not the shell/CI has a DB URL set.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
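One way to strip the two DB variables while producing a concrete Record<string, string> (so no delete on a narrowly typed literal is needed) is sketched below. envWithoutDbUrls is a hypothetical name; the actual tests may build their env differently.

```typescript
// Variables named in the commit message above.
const DB_URL_VARS = new Set(["DATABASE_URL", "GBRAIN_DATABASE_URL"]);

// Hypothetical helper: copies the environment, dropping DB URLs and undefined
// values so the result is safe for APIs that require string-only env maps.
function envWithoutDbUrls(
  source: Record<string, string | undefined>,
): Record<string, string> {
  const clean: Record<string, string> = {};
  for (const [key, value] of Object.entries(source)) {
    if (value === undefined || DB_URL_VARS.has(key)) continue;
    clean[key] = value;
  }
  return clean;
}
```

A test would then pass envWithoutDbUrls(process.env) as the spawned process's env instead of {...process.env}.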
…flag

# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
The DATABASE_URL/GBRAIN_DATABASE_URL strip used `delete` on a narrowly-typed
env literal (tsc-only failure; bun test doesn't typecheck). Annotate
connect-bearer's env as Record<string,string|undefined> and build serve-stdio's
as a concrete Record<string,string> (StdioClientTransport.env rejects undefined).
Runtime behavior unchanged (7/7 + 3/3 green).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
worker-registry.test.ts sets process.env.GBRAIN_HOME per-test so gbrainPath
resolves to a temp dir, then lazy-imports the module — a process-global
mutation the parallel isolation lint (rule R1) forbids. Rename to
worker-registry.serial.test.ts: it runs in the serial pass (own bun process,
max-concurrency=1) where env mutation is safe, and the lint skips *.serial
files. No logic change (6/6 green); fixes the failing `verify` CI job.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…flag

# Conflicts:
#	CHANGELOG.md
#	VERSION
#	docs/architecture/KEY_FILES.md
#	package.json
#	src/commands/jobs.ts
#	src/core/minions/supervisor.ts
@garrytan garrytan merged commit f11d56c into master Jun 3, 2026
21 checks passed
mgunnin added a commit to mgunnin/gbrain that referenced this pull request Jun 3, 2026
* upstream/master:
  v0.42.23.0 feat(jobs): --nice scheduling-priority flag for jobs work/supervisor (garrytan#1815) (garrytan#1820)
  v0.42.22.0 fix(minions): supervisor progress watchdog + worker DB self-defense — alive-but-wedged worker self-heals (garrytan#1801) (garrytan#1824)
  v0.42.21.0 fix(postgres): module-singleton ownership — canonical landing for the dream-cycle "connect() has not been called" class (garrytan#1404/garrytan#1471/garrytan#1619) (garrytan#1805)
  v0.42.20.0 fix: reliability wave — PGLite capture lock-pin + Postgres reconnect race + search embed-hang (garrytan#1762 garrytan#1745 garrytan#1775) (garrytan#1810)
  v0.42.19.0 fix(skillopt): close the last gap in the AI SDK v6 tool-loop fix (write-capture mapper + regression test) (garrytan#1809)
  v0.42.18.0 fix: sync orphan-pileup watchdog (garrytan#1633) + links-lag µs stamp (garrytan#1768) (garrytan#1807)
  v0.42.17.0 fix(sync): resumable incremental sync — killed mid-import no longer loses progress (garrytan#1794) (garrytan#1808)
  v0.42.16.0 feat(doctor): brain health as a solved problem — cause-ranked doctor + OOM-loop line + auto-drain + pool-reap (garrytan#1685) (garrytan#1802)
  v0.42.15.0 fix: decouple CLI primary output from process.stdout.isTTY (garrytan#1784) (garrytan#1806)
  v0.42.14.0 fix(zero-config): code-* readiness signal + init embedding-key validation + lock self-heal (garrytan#1780) (garrytan#1804)
  v0.42.13.0 fix(search): archive/ content findable by default, demoted not hard-excluded (garrytan#1777) (garrytan#1797)
  v0.42.12.0 feat: self-upgrading gbrain — invocation-riding update check + opt-in auto-upgrade (garrytan#1798)
  v0.42.11.0 feat(skillopt): held-out eval gate, honest receipts, ENFORCE + ablation opts (garrytan#1759)
  v0.42.10.0 feat(extract): opt-in global-basename wikilink resolution (closes garrytan#972) (garrytan#1388)

Development

Successfully merging this pull request may close these issues.

Feature: --nice flag on jobs supervisor/work to yield CPU to interactive co-tenants (priority propagated to spawned workers)