fix: zombie PID lock detection + self-healing jsonb repair in dream cycle#539

Closed
garrytan wants to merge 8 commits into master from fix/dream-jsonb-and-lock

@garrytan garrytan commented Apr 30, 2026

Owner

Problem

The dream cycle's synthesize and patterns phases (v0.23.0) successfully ran subagent jobs that wrote brain pages via brain_put_page tool calls — but the orchestrator consistently reported 0 pages written. All synthesized content was silently lost: never reverse-written to disk, never committed to git, never surfaced in the cycle summary.

Additionally, zombie (defunct) bun processes were holding Postgres cycle locks indefinitely, blocking subsequent cycle runs even though the holder process was dead.

Production impact

  • A 54K-page production brain ran its first synthesize cycle: 8 subagents completed successfully, each writing 2 pages (16 total). The cycle summary reported pages_written: 0.
  • The patterns phase then ran: 1 subagent wrote 3 pattern pages. Summary: patterns_written: 0.
  • 19 brain pages existed only in Supabase with no disk/git representation.
  • Zombie bun workers held cycle locks for 30+ minutes, causing cycle_already_running blocks on every dream attempt.

Error Log

Synthesize cycle summary (production):

{
  "phases": [{
    "phase": "synthesize",
    "status": "ok",
    "summary": "8 transcript(s) synthesized in 398.2s",
    "details": {
      "transcripts_discovered": 90,
      "transcripts_processed": 8,
      "pages_written": 0,
      "reverse_write_count": 0,
      "child_outcomes": [
        {"jobId": 2653, "status": "completed"},
        {"jobId": 2654, "status": "completed"},
        {"jobId": 2655, "status": "completed"},
        {"jobId": 2656, "status": "completed"},
        {"jobId": 2657, "status": "completed"},
        {"jobId": 2658, "status": "completed"},
        {"jobId": 2659, "status": "completed"},
        {"jobId": 2660, "status": "completed"}
      ]
    }
  }]
}

Diagnostic query showing the root cause — input column stored as jsonb string instead of jsonb object:

SELECT id, jsonb_typeof(input) as input_type, input->>'slug' as slug
FROM subagent_tool_executions
WHERE tool_name = 'brain_put_page' AND status = 'complete';

-- id  | input_type | slug
-- 502 | string     | NULL   ← can't extract keys from a string
-- 504 | string     | NULL
-- ...all 16 successful writes return NULL

Zombie lock holder blocking cycle runs:

$ ps -p 1619018 -o pid,stat,cmd
PID  STAT CMD
1619018 Z  [bun] <defunct>

SELECT * FROM gbrain_cycle_locks;
-- id: gbrain-cycle, holder_pid: 1619018, ttl_expires_at: +30min
-- PID 1619018 is a zombie — passes kill(pid,0) but can never release the lock
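
Why the signal-0 probe is fooled can be sketched in a few lines. This is an illustration only: `pidAlive`, `SignalFn`, and the simulated process table are hypothetical names, and a real probe would pass something like `process.kill` as the signal function.

```typescript
// Hypothetical sketch: a signal-0 liveness probe with the signal function
// injected, so the zombie case can be simulated without real processes.
type SignalFn = (pid: number, signal: number) => void;

// Returns true when sending signal 0 does not throw, i.e. the PID still has
// a process-table entry. A zombie keeps its entry until its parent reaps it.
// (A real probe would also have to treat EPERM as "alive"; omitted here.)
function pidAlive(pid: number, sendSignal: SignalFn): boolean {
  try {
    sendSignal(pid, 0);
    return true;
  } catch {
    return false;
  }
}

// Simulated process table: 1619018 is defunct but not yet reaped.
const processTable: Record<number, "running" | "zombie"> = {
  4242: "running",
  1619018: "zombie",
};

// Fake signal sender: throws only when the PID has no table entry at all.
const fakeSignal: SignalFn = (pid) => {
  if (!(pid in processTable)) throw new Error("ESRCH: no such process");
};

const zombieLooksAlive = pidAlive(1619018, fakeSignal);
const missingLooksAlive = pidAlive(99999, fakeSignal);
```

In this model the zombie is indistinguishable from a healthy holder, which is why a second check on the process state is needed.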

What We Tried

  1. Ran gbrain dream --phase synthesize — returned cycle_already_running (zombie lock).
  2. Manually cleared zombie lock rows in gbrain_cycle_locks — synthesize then ran but hit 12h cooldown (it had already run once, producing the 0-pages-written result).
  3. Queried subagent_tool_executions directly — found all 16 brain_put_page calls with status: 'complete' and full content in the input column.
  4. Diagnosed the encoding: jsonb_typeof(input) returned 'string' instead of 'object'. The input ? 'slug' operator (jsonb key-existence) fails on string values, so collectChildPutPageSlugs returned 0 rows.
  5. Traced to root cause: persistToolExecPending() called JSON.stringify(input) before passing to executeRaw() with ::jsonb cast. The postgres library's .unsafe() method sends string parameters as text — Postgres receives '{"slug":"..."}' as a text string and casts it to a jsonb string scalar (wrapped in quotes), not a jsonb object.
  6. Verified with test: passing a raw object to .unsafe() with ::jsonb correctly stores a jsonb object; JSON.stringify() + ::jsonb produces a double-encoded jsonb string.
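
The double-encoding from step 6 can be reproduced in plain TypeScript without a database. `jsonKind` below is a hypothetical stand-in for `jsonb_typeof`, the payload is made up, and the sketch assumes only that an already-stringified value ends up JSON-encoded a second time on its way into the column.

```typescript
type JsonKind = "object" | "array" | "string" | "number" | "boolean" | "null";

// Hypothetical stand-in for Postgres' jsonb_typeof(): decode the JSON text
// once and report what kind of value comes out.
function jsonKind(jsonText: string): JsonKind {
  const value: unknown = JSON.parse(jsonText);
  if (value === null) return "null";
  if (Array.isArray(value)) return "array";
  switch (typeof value) {
    case "string": return "string";
    case "number": return "number";
    case "boolean": return "boolean";
    default: return "object";
  }
}

// Illustrative payload; the slug is invented for this example.
const input = { slug: "wiki/example-page", content: "hello" };

const encodedOnce = JSON.stringify(input);        // JSON text of an object
const encodedTwice = JSON.stringify(encodedOnce); // JSON text of a *string*

const kindOnce = jsonKind(encodedOnce);
const kindTwice = jsonKind(encodedTwice);

// Decoding the inner text a second time recovers the original object.
const recovered = JSON.parse(JSON.parse(encodedTwice) as string) as { slug: string };
```

Key lookups such as `->>'slug'` only make sense on the single-encoded form, which matches the NULL slugs in the diagnostic query.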

Solution

Fix 1: Remove JSON.stringify() from persist functions (already on master, 80b3909)

src/core/minions/handlers/subagent.ts — all three persist functions:

// Before (wrong):
[jobId, messageIdx, toolUseId, toolName, JSON.stringify(input)]

// After (correct):
const jsonbInput = (input != null && typeof input === 'object')
  ? input
  : { _raw: String(input ?? '') };
[jobId, messageIdx, toolUseId, toolName, jsonbInput]

The postgres driver's .unsafe() correctly serialises objects to jsonb objects. JSON.stringify() adds a layer of string escaping that produces a jsonb string scalar.
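
The normalization in the "After" snippet can be pulled into a small function to see how it treats edge cases. `toJsonbInput` is a hypothetical name used only for this sketch; the logic follows the snippet above.

```typescript
// Hypothetical helper wrapping the "After" logic: objects (including arrays)
// pass through untouched, everything else is wrapped in { _raw: "..." }.
function toJsonbInput(input: unknown): object {
  // Same condition as `input != null && typeof input === 'object'`, written
  // in this order so the type narrows to `object`.
  return typeof input === "object" && input !== null
    ? input
    : { _raw: String(input ?? "") };
}

const passthrough = toJsonbInput({ slug: "wiki/example-page" });
const wrappedString = toJsonbInput("not an object");
const wrappedNull = toJsonbInput(null);
```

Note that arrays also satisfy `typeof input === 'object'`, so they are passed through rather than wrapped.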

Fix 2: Self-healing jsonb repair in collectChildPutPageSlugs (this PR)

src/core/cycle/synthesize.ts and src/core/cycle/patterns.ts:

Before querying for slugs, run an idempotent fixup:

UPDATE subagent_tool_executions
   SET input = (input #>> '{}')::jsonb
 WHERE job_id = ANY($1::int[])
   AND jsonb_typeof(input) = 'string'
   AND (input #>> '{}') IS NOT NULL
   AND left(input #>> '{}', 1) = '{'

input #>> '{}' extracts the text content of a jsonb string, then ::jsonb re-parses it as a proper jsonb object. Only touches rows where the encoding is wrong. Ensures older runs' data is accessible without a separate migration.
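
The same repair rule can be modeled on a single value to check that it is idempotent. This is a sketch under stated assumptions: `repairDoubleEncoded` and the `Json` type are hypothetical, and unlike the SQL cast (which would raise on invalid JSON text) the sketch leaves unparseable strings unchanged.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Applies the UPDATE's WHERE clause and SET expression to one value.
function repairDoubleEncoded(input: Json): Json {
  // jsonb_typeof(input) = 'string' AND left(text, 1) = '{'
  if (typeof input !== "string" || input.charAt(0) !== "{") return input;
  try {
    // (input #>> '{}')::jsonb : re-parse the string's text as JSON
    return JSON.parse(input) as Json;
  } catch {
    // Assumption of this sketch: keep the value as-is when it does not parse.
    return input;
  }
}

const repaired = repairDoubleEncoded('{"slug":"wiki/example-page"}');
const repairedAgain = repairDoubleEncoded(repaired); // no further change
```

Running the function a second time is a no-op because the result is no longer a string, which mirrors the `jsonb_typeof(input) = 'string'` guard.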

Fix 3: Zombie PID detection in cycle lock acquisition (this PR)

src/core/cycle.ts:

New isZombiePid(pid) reads /proc/<pid>/status and checks for State: Z:

function isZombiePid(pid: number): boolean {
  try {
    const status = readFileSync(`/proc/${pid}/status`, 'utf-8');
    return /^State:\s+Z/m.test(status);
  } catch { return false; }
}
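
The state check can be exercised separately from the filesystem read. In this sketch `isZombieStatus` is a hypothetical helper that applies the same regex to a string, and the sample status texts are abbreviated (real `/proc/<pid>/status` files have many more fields).

```typescript
// Same regex as isZombiePid, applied to text instead of a file.
function isZombieStatus(statusText: string): boolean {
  return /^State:\s+Z/m.test(statusText);
}

// Abbreviated sample contents of /proc/<pid>/status.
const zombieStatus = "Name:\tbun\nState:\tZ (zombie)\nPid:\t1619018\n";
const sleepingStatus = "Name:\tbun\nState:\tS (sleeping)\nPid:\t4242\n";
// The `m` flag anchors `^` at each line start, so "State: Z" appearing
// later in a line (here inside the Name field) does not match.
const trickyName = "Name:\tState: Z\nState:\tR (running)\n";

const zombieDetected = isZombieStatus(zombieStatus);
const sleeperDetected = isZombieStatus(sleepingStatus);
```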

Postgres lock path (acquirePostgresLock): When the TTL-based upsert returns empty (lock held), check if the holder PID on the same host is a zombie. If so, delete the lock row and retry acquisition.

File lock path (acquireFileLock): The existing pidAlive check now uses !isZombiePid(existingPid) — zombies are treated as dead.
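
The decision made when the lock is already held can be sketched as a pure function. The names here (`LockRow`, `holderHost`, `decideOnHeldLock`) are hypothetical; the source only states that the zombie check applies to a holder PID on the same host.

```typescript
// Hypothetical shape of the lock row fields relevant to the decision.
interface LockRow {
  holderPid: number;
  holderHost: string;
}

type LockDecision = "reclaim" | "wait";

function decideOnHeldLock(
  row: LockRow,
  thisHost: string,
  isZombie: (pid: number) => boolean,
): LockDecision {
  // A PID is only meaningful on the machine that created it, so a holder
  // on another host is left to the TTL.
  if (row.holderHost !== thisHost) return "wait";
  // Same host: a zombie can never release or refresh the lock.
  return isZombie(row.holderPid) ? "reclaim" : "wait";
}

// Simulated zombie check: only PID 1619018 is defunct.
const fakeIsZombie = (pid: number): boolean => pid === 1619018;

const sameHostZombie = decideOnHeldLock({ holderPid: 1619018, holderHost: "worker-1" }, "worker-1", fakeIsZombie);
const otherHostZombie = decideOnHeldLock({ holderPid: 1619018, holderHost: "worker-2" }, "worker-1", fakeIsZombie);
```

Keeping the host comparison first avoids reclaiming a lock because an unrelated local process happens to share the remote holder's PID.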

Behavior matrix

| Scenario | Before | After |
| --- | --- | --- |
| Subagent writes page via brain_put_page | Stored as jsonb string; orchestrator can't find it | Stored as jsonb object; orchestrator finds it |
| Older double-encoded rows in DB | Silent 0-page results forever | Auto-repaired on next cycle run |
| Zombie bun process holds cycle lock | Blocks all cycles for 30 min until TTL expires | Detected via /proc, lock cleared, cycle proceeds |
| Non-zombie process holds lock (normal) | Correctly waits/skips | Unchanged; no false positives |
| macOS/non-Linux (no /proc) | N/A | isZombiePid returns false, falls back to existing TTL behavior |

Results

After deploying fix 1 (80b3909) and manually repairing the data:

  • 16 synthesized pages recovered from DB and committed to git (8 reflections + 8 originals/ideas)
  • 3 pattern pages recovered (cross-session themes)
  • New worker (restarted with fix 1) correctly stores jsonb objects
  • Fix 2 (this PR) ensures any remaining pre-fix rows are auto-repaired
  • Fix 3 (this PR) eliminates the zombie lock blocking pattern

Testing

  • bun run typecheck passes
  • Verified JSON.stringify(obj) + ::jsonb produces jsonb string via production test
  • Verified raw object + ::jsonb produces jsonb object via production test
  • Verified isZombiePid() correctly identifies defunct processes in production
  • Verified self-healing fixup query on 53 double-encoded rows in production
  • Manual end-to-end: synthesize → subagent writes → orchestrator collects slugs → reverse-write to disk → git commit

Related

  • Companion to 80b3909 (fix: jsonb double-encoding in subagent_tool_executions) which fixed the root cause in subagent.ts
  • This PR adds the safety net (self-healing repair + zombie lock detection) on top of that root cause fix

  • Let Codesmith autofix CI failures and bot reviews

Wintermute added 2 commits April 30, 2026 18:35
…ycle

Three fixes for production dream cycle reliability:

1. Zombie PID detection in cycle lock acquisition (cycle.ts):
   - kill(pid, 0) returns success for zombie (defunct) processes
   - Zombies can never release the lock or refresh TTL
   - New isZombiePid() reads /proc/<pid>/status for State: Z
   - acquirePostgresLock() clears zombie holders on same host
   - acquireFileLock() treats zombie holders as dead

2. Self-healing jsonb repair in collectChildPutPageSlugs (synthesize.ts, patterns.ts):
   - Pre-fix runs stored input as jsonb string instead of jsonb object
   - input->>'slug' returned null, so orchestrator reported 0 pages written
   - New fixup query un-wraps double-encoded rows before the SELECT
   - Idempotent: only touches rows where jsonb_typeof(input) = 'string'

3. Root cause fix (already on master via 80b3909):
   - Removed JSON.stringify() from persist functions in subagent.ts
   - Passes objects directly to postgres driver for correct jsonb encoding

JSON.stringify(input) + ::jsonb cast produced a jsonb string value
instead of a jsonb object. The postgres library's unsafe() with a
raw object + ::jsonb correctly stores a jsonb object.

This caused collectChildPutPageSlugs to return 0 results (can't
extract ->>'slug' from a jsonb string), making dream synthesize
report '0 pages written' even though subagents successfully wrote
16 pages to the database.

Fix: pass objects as-is to executeRaw, let the postgres driver
handle serialization. Non-object values wrapped in {_raw: ...}
as a safety fallback.
@garrytan garrytan force-pushed the fix/dream-jsonb-and-lock branch from cfd782e to f443c2e Compare April 30, 2026 18:35
garrytan and others added 3 commits April 30, 2026 20:47
…ter) (#528)

* feat: diff-aware E2E test selector

Adds scripts/select-e2e.ts: reads git diff vs origin/master, classifies
the change set (EMPTY/DOC_ONLY/SRC), and emits the relevant E2E test files
on stdout. Fail-closed by design: any unmapped src/ change runs all E2E.

- scripts/e2e-test-map.ts: hand-tuned path-glob -> test files map
- scripts/select-e2e.ts: pure-function selector with three explicit cases
- scripts/run-e2e.sh: accepts optional file list from argv + --dry-run-list
- test/select-e2e.test.ts: 24 cases including 3 codex regression guards
  (skills/, untracked files, unmapped src/)
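
A fail-closed selector of this kind might look roughly like the sketch below. Everything in it is illustrative: the test map, the file names, and the doc-detection rule are hypothetical, returning an empty list for EMPTY and DOC_ONLY is an assumption, and the sketch applies the run-everything fallback to any unmapped non-doc path (the commit message states it for `src/` changes).

```typescript
type ChangeClass = "EMPTY" | "DOC_ONLY" | "SRC";

// Hypothetical test inventory and path-prefix map, for illustration only.
const ALL_E2E = ["test/e2e/dream.test.ts", "test/e2e/sync.test.ts", "test/e2e/search.test.ts"];
const TEST_MAP: Array<{ prefix: string; tests: string[] }> = [
  { prefix: "src/core/cycle/", tests: ["test/e2e/dream.test.ts"] },
  { prefix: "src/commands/sync", tests: ["test/e2e/sync.test.ts"] },
];

// Assumed doc rule: markdown files and anything under docs/.
function isDoc(path: string): boolean {
  return /\.md$/.test(path) || path.indexOf("docs/") === 0;
}

function classify(changed: string[]): ChangeClass {
  if (changed.length === 0) return "EMPTY";
  return changed.every(isDoc) ? "DOC_ONLY" : "SRC";
}

function selectE2E(changed: string[]): string[] {
  if (classify(changed) !== "SRC") return [];
  const selected: string[] = [];
  for (const path of changed) {
    if (isDoc(path)) continue;
    const hits = TEST_MAP.filter((entry) => path.indexOf(entry.prefix) === 0);
    // Fail closed: one unmapped change is enough to run the full suite.
    if (hits.length === 0) return ALL_E2E.slice();
    for (const hit of hits) {
      for (const test of hit.tests) {
        if (selected.indexOf(test) === -1) selected.push(test);
      }
    }
  }
  return selected;
}
```

The early return on an unmapped path is the fail-closed part: a gap in the map costs extra test time rather than a missed regression.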

* feat: local CI gate via docker compose

Adds bun run ci:local — runs every check GH Actions runs (gitleaks +
unit + 29 E2E files) inside a Docker container that bind-mounts the
repo. Pure bind-mount + named volumes (gbrain-ci-node-modules,
gbrain-ci-bun-cache, gbrain-ci-pg-data) for fast warm restarts.

- docker-compose.ci.yml: pgvector/pgvector:pg16 + oven/bun:1
- scripts/ci-local.sh: orchestrator with --diff, --no-pull, --clean
- gitleaks runs on host (scoped to working dir + branch commits)
- DATABASE_URL unset for unit phase (matches GH Actions split)
- git installed in container at startup (oven/bun:1 omits it)
- Postgres host port via GBRAIN_CI_PG_PORT env (default 5434)

Stronger than PR CI: runs all 29 E2E files vs CI's 2-file Tier 1.

* chore: bump version and changelog (v0.23.1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: document local CI gate for v0.23.1

CLAUDE.md gains key-files entries for docker-compose.ci.yml,
scripts/ci-local.sh, scripts/select-e2e.ts + e2e-test-map.ts, and the
scripts/run-e2e.sh argv tweak. Pre-ship requirements section now lists
the Docker-based local gate as Path A alongside the manual lifecycle.

CONTRIBUTING.md tests section adds the bun run ci:local / ci:local:diff /
ci:select-e2e block with prerequisites (Docker engine + gitleaks) and the
GBRAIN_CI_PG_PORT override.

AGENTS.md "Before shipping" promotes ci:local as the easiest path and
keeps the manual lifecycle as a fallback.

README.md Contributing section points to ci:local for the full gate.

CHANGELOG.md untouched — v0.23.1 entry already finalized.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat: SHARD=N/M env support in scripts/run-e2e.sh

Filters the E2E file list to every M-th file starting at index N (1-indexed).
Sequential execution within a shard preserves the TRUNCATE CASCADE no-race
property documented at the top of the file. Empty-shard handling under
`set -u` uses ${arr[@]:-} fallback.

Standalone change; not yet wired up in ci-local.sh.
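
The shard rule ("every M-th file starting at index N, 1-indexed") can be expressed as a small filter. This is a TypeScript sketch of the rule only, not the shell implementation; `shardFiles` and the file names are hypothetical.

```typescript
// Keeps the files whose 0-based position i satisfies i % M === N - 1.
function shardFiles(files: string[], shard: string): string[] {
  const match = /^(\d+)\/(\d+)$/.exec(shard);
  if (match === null) throw new Error("SHARD must look like N/M, got: " + shard);
  const n = Number(match[1]);
  const m = Number(match[2]);
  if (m < 1 || n < 1 || n > m) throw new Error("SHARD index out of range: " + shard);
  return files.filter((_file, index) => index % m === n - 1);
}

// Nine made-up E2E files split four ways.
const files = ["a", "b", "c", "d", "e", "f", "g", "h", "i"].map(
  (name) => "test/e2e/" + name + ".test.ts",
);
const firstShard = shardFiles(files, "1/4");  // a, e, i
const secondShard = shardFiles(files, "2/4"); // b, f
const emptyShard = shardFiles(files.slice(0, 2), "3/4"); // no files land here
```

Each file lands in exactly one shard, and a shard may be empty when there are fewer files than shards, which is the case the `set -u` fallback guards against.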

* feat: 4-way parallel E2E shards in ci:local

Replaces the single postgres service with 4 (postgres-1..4) on host ports
5434-5437. scripts/ci-local.sh fans 4 workers via xargs -P4 inside the
runner container; each pinned to its own DATABASE_URL via SHARD=N/4.

Wall-time on a 16-core host: ~6 min sequential -> ~1.5-2 min sharded.
Total full-gate wall-time goes from ~25 min to ~3-5 min warm.

Also handles git-worktree (Conductor) layouts: when /app/.git is a file
instead of a directory, parse the gitdir + commondir and bind-mount the
shared host gitdir at its absolute path. Without this, in-container
`git ls-files` (used by scripts/check-trailing-newline.sh and friends)
exits 128 with "not a git repository". Also runs
`git config --global --add safe.directory '*'` inside the container so
the root-uid container can read host-uid gitdir without "dubious
ownership" rejection.

CHANGELOG entry updated to cover the speedup.

- docker-compose.ci.yml: 4 pgvector services + per-shard named volumes
- scripts/ci-local.sh: parallel xargs orchestration + worktree mount fix
- CHANGELOG.md v0.23.1: 4-way sharded wall-time, 36 E2E files, --no-shard flag

* chore: regenerate llms-full.txt for v0.23.1 doc updates

Required by test/build-llms.test.ts case 4 — committed llms-full.txt
must match `bun run build:llms` output. The CHANGELOG + CLAUDE.md
updates in this branch shifted bytes; regen catches up.

* feat: scripts/run-unit-shard.sh + slow-test convention

Tier 1 + Tier 4 plumbing:
- scripts/run-unit-shard.sh: SHARD=N/M filter for unit files (excludes
  test/e2e/*). Excludes *.slow.test.ts (Tier 4 convention) so the fast
  shard fan-out skips known-slow files; CI's `bun run test` still includes
  them via default discovery.
- scripts/run-slow-tests.sh: companion that runs ONLY *.slow.test.ts.
  Wired as `bun run test:slow`.
- scripts/profile-tests.sh: portable awk parser that extracts the top-N
  slowest tests from any captured `bun test` output. Wired as
  `bun run test:profile`. Use it to pick demotion candidates.

* feat: PGLite snapshot fixture for ~4.5x faster cold init (Tier 3)

scripts/build-pglite-snapshot.ts boots a fresh PGLite, runs the full
initSchema() (forward bootstrap + 30 migrations), and dumps the post-init
state to test/fixtures/pglite-snapshot.tar plus a SHA-256 schema hash
sidecar (.version). Both gitignored — built on demand via
`bun run build:pglite-snapshot`.

PGLiteEngine.connect() reads GBRAIN_PGLITE_SNAPSHOT env: validates the
sidecar hash against the in-process MIGRATIONS hash, loads via PGLite's
loadDataDir blob, sets _snapshotLoaded so initSchema() short-circuits.
Measured per-file cold init drops from 828ms → 181ms.

Bootstrap-correctness tests (bootstrap.test.ts,
schema-bootstrap-coverage.test.ts) explicitly delete the env at file
top so they keep exercising the cold path they verify.

* feat: --classify-only + heartbeat tolerance fix (Tiers 2 + flake fix)

- scripts/select-e2e.ts: --classify-only flag emits EMPTY|DOC_ONLY|SRC.
  Used by ci-local.sh's --diff fast-path to skip the heavy gate when
  only docs changed.
- test/progress.test.ts: startHeartbeat tolerance widened to 1-20 over
  200ms (was 2-6 over 85ms). Under 4-way parallel shard load on a
  contended host, setTimeout's effective quantum balloons and the tight
  bound flakes. The test still verifies "fires multiple times, stops
  cleanly" — exact count was never load-bearing.

* feat: 4-way unit + E2E sharding in ci-local.sh + CHANGELOG (Tiers 1-4)

ci-local.sh ties the four tiers together:
- Tier 2: pre-flight diff classification on host. DOC_ONLY exits in ~5s
  (gitleaks only, no postgres, no container).
- Tier 1: guards + typecheck run ONCE before fan-out. xargs -P4 then
  spawns 4 shards inside the runner container, each running unit phase
  (env -u DATABASE_URL bash run-unit-shard.sh) followed by E2E phase
  (DATABASE_URL=postgres-N bash run-e2e.sh) — both sharded N/4. Per-shard
  logs in /tmp/shard-logs/shard-N.log; printed in shard order at the end.
- Tier 3: snapshot fixture built once at runner startup if missing,
  GBRAIN_PGLITE_SNAPSHOT exported so all shards inherit.
- Tier 4: run-unit-shard.sh excludes *.slow.test.ts; run-slow-tests.sh
  + test:slow npm script handle the demoted set.
- --no-shard preserves the legacy single-process flow for debug.

package.json: build:pglite-snapshot, test:slow, test:profile scripts.

Measured wall-time on 16-core host: 100s warm (down from ~22 min cold
single-process). 4 shards × ~640-1024 unit tests each, plus 9 E2E
files each. PGLite snapshot saves 4.5× per cold init (828ms → 181ms).

CHANGELOG.md updated with measured numbers + four-tier breakdown.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rdict model tests (#527)

* v0.23.1 fix: dream self-consumption guard + configurable verdict model

Built-in isDreamOutput() guard in transcript-discovery.ts auto-skips
any transcript whose first 2000 chars contain dream output slug prefixes
(wiki/personal/reflections/, wiki/originals/ideas/, wiki/personal/patterns/,
dream-cycle-summaries/). Prevents infinite recursion if dream output is
ever fed back into the corpus.

judgeSignificance() now accepts a verdictModel parameter, loaded from
dream.synthesize.verdict_model config key. Default: claude-haiku-4-5.

3 new test cases covering the guard.

* feat(dream): replace content-prefix guard with orchestrator-stamped marker

The v0.23.1 prefix-string guard had two flaws caught by codex review.
serializeMarkdown does not embed the page slug into body content, so
the heuristic could miss real dream output. And real conversation
transcripts often cite brain slugs ("earlier I wrote about
wiki/personal/reflections/identity..."), so the heuristic dropped
legitimate transcripts silently.

Swap content inference for explicit identity. renderPageToMarkdown and
writeSummaryPage now stamp `dream_generated: true` + `dream_cycle_date`
into frontmatter at render time. Guard checks for the marker via
DREAM_OUTPUT_MARKER_RE (anchored at frontmatter open, BOM/CRLF
tolerant, scans first 2000 chars, word boundary on `true`). Cannot
drift, cannot false-positive on user text, cannot miss real output.
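
A marker check with these properties could be sketched as follows. This is not the project's `DREAM_OUTPUT_MARKER_RE`: `looksDreamGenerated` is a hypothetical name, the sketch walks frontmatter lines instead of using a single regex, and matching the key case-sensitively is an assumption.

```typescript
const SCAN_LIMIT = 2000;
// Word boundary after `true` so that `truefoo` is rejected.
const MARKER_LINE = /^dream_generated:\s*true\b/;

function looksDreamGenerated(text: string): boolean {
  let head = text.slice(0, SCAN_LIMIT);
  // Tolerate a leading byte-order mark.
  if (head.charCodeAt(0) === 0xfeff) head = head.slice(1);
  // Tolerate CRLF line endings.
  const lines = head.split(/\r?\n/);
  // The frontmatter fence must open the file.
  if (lines[0] !== "---") return false;
  for (let i = 1; i < lines.length; i++) {
    // Closing fence reached: anything after this is body text.
    if (lines[i] === "---") return false;
    if (MARKER_LINE.test(lines[i])) return true;
  }
  return false;
}

const stamped = "---\ntitle: Reflection\ndream_generated: true\n---\nbody";
const citingTranscript = "---\ntitle: Chat\n---\nearlier I wrote dream_generated: true";
```

Because the scan stops at the closing fence, a transcript that merely quotes the marker in its body is not skipped.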

Tests built from a real Page → renderPageToMarkdown → isDreamOutput
round-trip (codex finding #5 — synthetic strings don't prove the
guard catches what synthesize actually produces). Coverage: regression
fixture, false-positive prevention on user transcripts citing slugs,
CRLF+BOM, whitespace/case variants, anchor-at-byte-0, perf bound,
bypass plumbing, dream_generatedfoo word-boundary check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dream): --unsafe-bypass-dream-guard CLI flag

Explicit opt-in to disable the synthesize self-consumption guard. The
flag is intentionally NOT tied to --input — codex review caught that
implicit bypass is a footgun: any caller could synthesize a dream-
generated page directly via --input, get a cached positive verdict,
and silently re-trigger the loop bug.

Plumbing: dream.ts CLI parses the flag → DreamArgs.bypassDreamGuard →
runCycle({ synthBypassDreamGuard }) → SynthesizePhaseOpts.bypassDreamGuard
→ discoverTranscripts({ bypassGuard }) and readSingleTranscript.
Loud stderr warning at phase entry when set so the cost is visible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.23.2 chore: bump version + CHANGELOG for corrected guard architecture

Replaces the v0.23.1 release notes with the v0.23.2 voice describing
the orchestrator-stamped marker approach and the --unsafe-bypass-dream-guard
flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: sync project docs for v0.23.2 marker-based guard

Update CLAUDE.md Key Files entries for src/core/cycle/synthesize.ts,
src/core/cycle/transcript-discovery.ts, and src/commands/dream.ts to
reflect the v0.23.2 dream_generated frontmatter marker that replaces the
v0.23.1 content-prefix self-consumption guard, plus the new
--unsafe-bypass-dream-guard CLI flag.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs: regenerate llms-full.txt for v0.23.2 CLAUDE.md updates

CI's `build-llms generator > committed match generator output` guard
caught drift after the v0.23.2 doc-sync (commit 507edb1) updated three
Key Files entries in CLAUDE.md without re-running `bun run build:llms`.

The llms.txt index didn't drift (no new doc URLs); only the inlined
llms-full.txt bundle needed refreshing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(e2e): round-trip dream-recursion coverage for v0.23.2 marker guard

Three new PGLite E2E cases exercise the actual production loop scenario
end-to-end. Unit tests covered the bug class at the function-pair level
(renderPageToMarkdown → readSingleTranscript). These cover it at the
phase level: runPhaseSynthesize with a real engine, real putPage, real
renderPageToMarkdown, real corpus-dir discovery.

1. Leaked dream output is skipped on next synthesize run. The reflection
   page is inserted, reverse-rendered (which stamps `dream_generated:
   true`), dropped into the corpus dir as .txt, and the next phase run
   reports "no transcripts to process" with a stderr skip log. Verdict
   cache stays untouched so a future legit edit isn't shadowed by a
   stale cached "false".

2. bypassDreamGuard=true at phase entry re-enables ingestion. Same
   marked file gets discovered through the loud-warning path. Proves
   --unsafe-bypass-dream-guard plumbing reaches discoverTranscripts at
   phase scope.

3. Mixed corpus (leaked dream output + real conversation transcript)
   discovers exactly the real one. Pins codex finding #1's headline
   false-positive case: a transcript citing wiki/personal/reflections/
   in body must NOT be skipped.

Stderr capture via process.stderr.write spy with try/finally restore.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(test): use valid PageType 'note' in round-trip E2E fixtures

CI typecheck caught three TS2322 violations in the round-trip E2E
fixtures: 'reflection' is not a member of PageType. Reflections are
filed as 'note' in production (renderPageToMarkdown falls back to 'note'
for unknown types).

No behavior change — the guard test still exercises the same
serializeMarkdown → discoverTranscripts loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(claude): require `bun run typecheck` before push

The pre-ship section listed `bun test` as the unit-test path but didn't
flag the trap: `bun test` (the bun runner) does NOT run TypeScript type
checking. Only `bun run test` (the npm script) does, because it chains
`bun run typecheck` + the four shell pre-checks before the runner.

CI on PR #527 caught a `'reflection'` literal that `PageType` doesn't
admit (PageType is a closed union). The runtime E2E and `bun test`
both passed locally because the runner doesn't gate on TS. The
separate typecheck stage in CI rejected it.

New rule: run `bun run typecheck` (or `bun run test`, which wraps it,
or `bun run ci:local` for the full gate) before pushing. The runner-
alone path is for hot-loop test iteration only.

Also regenerated llms-full.txt for the CLAUDE.md update.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Wintermute <wintermute@garrytan.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: v0.19.0 — skillify loop + AGENTS.md compat + brain-first convention

This is the v0.19.0 release. The branch ships four new CLI commands, a
refactor to check-resolvable, and an expansion of the brain-first
convention for sub-agent tool discovery. The original commit message
described only the convention expansion, undercounting the scope by ~5x;
this amend captures the full release.

NEW COMMANDS

- gbrain skillify scaffold <name>     — 4 stub files + idempotent resolver row
- gbrain skillify check [path]        — 10-item post-task audit (promoted)
- gbrain skillpack list / install     — curated 25-skill bundle, atomic install
- gbrain skillpack diff <name>        — per-file diff preview
- gbrain routing-eval                 — dedicated CI verb for Check 5 fixtures

CHECK-RESOLVABLE REFACTOR

- Accepts AGENTS.md as a resolver file alongside RESOLVER.md, at either
  the skills directory or one level up (workspace root layout).
- Auto-derives the skill manifest by walking skills/*/SKILL.md when
  manifest.json is missing.
- Splits ResolvableReport into errors[] + warnings[] so advisory checks
  (filing audit, routing gaps, DRY violations) don't break CI by default.
- New --strict opt-in flag promotes warnings to exit 1.
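
The errors/warnings split and the `--strict` flag imply an exit-code rule along these lines. The sketch is an assumption-laden illustration: `exitCodeFor` is a hypothetical name, and modeling the report entries as strings is a guess at the shape of `ResolvableReport`.

```typescript
// Assumed shape: the commit message names errors[] and warnings[] only.
interface ResolvableReport {
  errors: string[];
  warnings: string[];
}

function exitCodeFor(report: ResolvableReport, strict: boolean): number {
  // Errors always fail the run.
  if (report.errors.length > 0) return 1;
  // Warnings fail only when the caller opted in with --strict.
  if (strict && report.warnings.length > 0) return 1;
  return 0;
}

const advisoryOnly: ResolvableReport = { errors: [], warnings: ["routing gap"] };
const defaultExit = exitCodeFor(advisoryOnly, false);
const strictExit = exitCodeFor(advisoryOnly, true);
```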

BRAIN-FIRST CONVENTION

- skills/conventions/brain-first.md expanded from 5-step lookup guide to
  full sub-agent reference: tool inventory, lookup chain, score thresholds,
  authority hierarchy, sync rules, entity page conventions, sub-agent
  propagation rule.

PRODUCTION-READINESS HARDENING (this branch's review pass)

- routing-eval --llm: emits stderr placeholder notice + runs structural
  layer only. README, CHANGELOG, CLI help all rewritten consistently.
  Was a silent no-op against documented contract.
- skillpack installer: receipt comment in fence (cumulative-slugs="...")
  preserves single-skill-install accumulation while letting install --all
  prune removed bundle skills cleanly. Unknown rows preserved + stderr
  warning for the operating agent. Pre-v0.19 fences upgrade silently.
- skillify scaffold: resolver-row regex broadened to detect backticked,
  quoted, and bare path forms. No duplicate row on --force after the
  user normalizes formatting.
- scripts/check-privacy.sh: now wired into package.json test chain so
  the wintermute-ban rule is actually enforced. New regression test.
- E2E Tier 2 (LLM skills) promoted from schedule-only to required per-PR
  CI. Local Tier 1 + Tier 2 verified clean.
- Stale v0.17/v0.18 version labels rewritten across new files.

TESTS

- test/routing-eval-cli.test.ts: 4 cases covering --llm warn semantics
- test/privacy-script-wired.test.ts: regression guard for CI wiring
- test/skillpack-install.test.ts: 4 new cases for receipt + cumulative
  + unknown-row preserve+warn + pre-v0.19 upgrade path
- test/skillify-scaffold.test.ts: 4 new cases for broadened regex

VERIFICATION

- bun test: 2237 pass / 18 known PGLite-contention flakes (CI green;
  documented as P3 dev-experience in TODOS.md)
- bun run typecheck: clean
- bun run test:e2e: 18/19 files green (1 pre-existing flake on master,
  not caused by this branch — verified via git stash)
- llms.txt + llms-full.txt regenerated to match README + CHANGELOG

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: scrub banned fork name from public artifacts

The privacy guard wired into the test chain in this branch caught 5
pre-existing references to the banned OpenClaw fork name in CHANGELOG.md
(2x), skills/migrations/v0.19.0.md (1x), src/cli.ts (1x), and
src/commands/sync.ts (1x). All originated in master's v0.19.0 release
notes and migration doc when the privacy script existed but wasn't
wired into CI yet.

Replacements per CLAUDE.md privacy mapping:
- Origin-story copy (CHANGELOG layer narratives, code comments naming
  the production deployment that drove the feature) → "Garry's OpenClaw"
- Reader-facing migration step → "your OpenClaw"

No code semantics changed. Comments + headings only.

Verification: scripts/check-privacy.sh exits 0, full CI guard chain
green (privacy + jsonb + progress + wasm + typecheck).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump VERSION to 0.24.0 + new CHANGELOG entry

Bump branch version above master's v0.21.0 per CLAUDE.md
"CHANGELOG + VERSION are branch-scoped" rule. The new v0.24.0 entry at
the top of CHANGELOG covers what THIS branch adds vs master:

- routing-eval --llm honesty pass (4-surface contract drift fix)
- skillpack installer cumulative-receipt + unknown-row preserve+warn
  (the Codex-caught regression that would have shipped in master if
  the original v0.19.0 had landed without this branch's review pass)
- skillify scaffold resolver-row regex broadening (backtick + quoted
  + bare forms; idempotency contract preserved under hand-editing)
- 5 banned-name leaks scrubbed from public artifacts
- check-privacy.sh wired into CI test chain + regression guard test
- 7 stale v0.17/v0.18 version labels rewritten across 5 files
- Tier 2 (LLM-skills E2E) promoted from schedule-only to required per-PR

VERSION 0.21.0 → 0.24.0
package.json version field synced.
llms.txt + llms-full.txt regenerated (no content drift; sizes match).

Test suite: 62/62 green across the 5 test files this branch added or
extended (routing-eval-cli, privacy-script-wired, skillpack-install,
skillify-scaffold, build-llms).

CI guards: privacy + jsonb + progress + wasm + typecheck all clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v0.24.0

Auto-discovered drift via /document-release after the v0.24.0 hardening
pass landed. All factual corrections clearly warranted by the diff.

CLAUDE.md:
- Skillpack installer: documented the cumulative-slugs receipt comment,
  install --all prune semantics, unknown-row preserve+warn behavior,
  and pre-v0.24 silent upgrade. Was previously vague about
  "tracks a skill manifest so install --update diffs cleanly" without
  explaining what the receipt is or why it matters.
- routing-eval: replaced the false claim that --llm "opts into a Haiku
  tie-break layer for CI." Now correctly describes the placeholder
  semantic landed in v0.24.0 (stderr notice + structural-only run).

README.md:
- Skillpack section: added one paragraph on the receipt comment + the
  user-visible stderr message for hand-added rows. Connects the safe
  rerun promise to the v0.24.0 implementation that actually enforces it.

CONTRIBUTING.md:
- Running tests section: now recommends `bun run test` (full CI guard
  chain + typecheck + tests) before pushing. Names each guard so new
  contributors understand what catches what. The privacy guard (newly
  wired in v0.24.0) is one of these — without `bun run test` you'd skip
  it locally and find out from CI.

llms-full.txt: regenerated to reflect CLAUDE.md changes.

Verification: full guard chain green locally (privacy + jsonb + progress
+ wasm + typecheck).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Garry Tan <garry@ycombinator.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Qodo-Free-For-OSS

Hi, the self-healing UPDATE casts the job_id filter parameter to int[], but subagent_tool_executions.job_id is BIGINT. Once job IDs exceed the 32-bit integer range, this will throw "integer out of range" and break synthesize/patterns provenance collection.

Severity: action required | Category: correctness

How to fix: Use bigint[] for job_id

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

The self-healing UPDATE in dream-cycle slug provenance uses job_id = ANY($1::int[]), but subagent_tool_executions.job_id is BIGINT. This can fail with "integer out of range" once job IDs exceed the 32-bit integer range.

Issue Context

The dream phases rely on this UPDATE to repair pre-v0.23.1 double-encoded jsonb strings so input->>'slug' works.

Fix Focus Areas

  • src/core/cycle/synthesize.ts[448-467]
  • src/core/cycle/patterns.ts[222-241]

Change casts to $1::bigint[] (or remove the cast if the driver reliably binds as bigint array), and ensure the SELECT that follows still uses the same type.
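
The suggested change can be sketched as follows. `REPAIR_SQL` reuses the UPDATE from the PR description with only the array cast changed; `fitsInt4` is a hypothetical helper showing where an `int[]` cast stops working, and modeling job IDs as JavaScript numbers is an assumption of this sketch.

```typescript
// Bounds of Postgres' 32-bit integer type.
const INT4_MAX = 2147483647;
const INT4_MIN = -2147483648;

// True when a job ID would survive a cast to int (and so to int[]).
function fitsInt4(jobId: number): boolean {
  return jobId >= INT4_MIN && jobId <= INT4_MAX;
}

// The self-healing UPDATE with the parameter cast widened to bigint[].
const REPAIR_SQL = `
UPDATE subagent_tool_executions
   SET input = (input #>> '{}')::jsonb
 WHERE job_id = ANY($1::bigint[])
   AND jsonb_typeof(input) = 'string'
   AND (input #>> '{}') IS NOT NULL
   AND left(input #>> '{}', 1) = '{'`;

const currentIdFits = fitsInt4(2653);           // a job ID from the summary above
const futureIdFits = fitsInt4(INT4_MAX + 1);    // first ID that int[] would reject
```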

We noticed a couple of other issues in this PR as well - happy to share if helpful.


Found by Qodo code review. FYI, Qodo is free for open-source.

garrytan and others added 3 commits May 1, 2026 09:02
…ct test (#437)

* feat(v0.22.0): eval_candidates + eval_capture_failures schema (Lane 1A)

R1 substrate for BrainBench-Real, replayed onto master after Cathedral II
landed. Migration v30 (slotted after master's v25-v29 Cathedral II wave)
creates two tables:

  eval_candidates: per-call capture of MCP/CLI/subagent query+search
    traffic. Column set lets gbrain-evals replay with full fidelity —
    source_ids from v0.18 multi-source, vector_enabled/detail_resolved/
    expansion_applied so replay knows what hybridSearch actually did,
    remote + job_id + subagent_id so rows are traceable to their origin.
    query is CHECK-capped at 50KB; PII scrubber (Lane 1B) runs before insert.

  eval_capture_failures: cross-process audit trail. In-process counters
    don't work because `gbrain doctor` runs in a separate process from
    the MCP server. Persistent rows let doctor query capture health via
    COUNT(*) GROUP BY reason over the last 24h.

Both tables get RLS on Postgres gated on BYPASSRLS (matches v24/v29
posture). PGLite ignores RLS; sqlFor split carries only DDL.

5 new BrainEngine methods (breaking-interface addition, drives v0.22.0
minor bump): logEvalCandidate, listEvalCandidates,
deleteEvalCandidatesBefore, logEvalCaptureFailure, listEvalCaptureFailures.
listEvalCandidates uses ORDER BY created_at DESC, id DESC so
`gbrain eval export` is deterministic across same-millisecond inserts.
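
The tie-break on `id` is what makes the ordering deterministic. A small sketch of the same ordering as an in-memory comparator, assuming a simplified row shape (`CandidateRow` is hypothetical and carries only the two sort keys):

```typescript
// Simplified row shape (assumption): only the two sort keys are modeled.
interface CandidateRow {
  id: number;
  created_at: Date;
}

// Comparator equivalent to ORDER BY created_at DESC, id DESC.
function byCreatedAtDescIdDesc(a: CandidateRow, b: CandidateRow): number {
  const byTime = b.created_at.getTime() - a.created_at.getTime();
  // Rows inserted in the same millisecond fall back to the id tie-break.
  return byTime !== 0 ? byTime : b.id - a.id;
}
```

Without the second key, two rows sharing a timestamp could come back in either order between runs.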

Also adds HybridSearchMeta type for the side-channel callback used by
Lane 1C's op-layer capture (no change to hybridSearch return shape —
that respects Cathedral II's existing SearchResult[] contract).

Tests: 14 PGLite round-trip cases + 8 v30 structural assertions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(v0.22.0): PII scrubber + op-layer capture module (Lane 1B)

Replayed onto master post-Cathedral II. Same semantics as the original
v0.21.0 work — only adjusted to import HybridSearchMeta from types.ts
(canonical home) instead of redeclaring it locally.

src/core/eval-capture-scrub.ts — pure-function regex scrubber with 6
pattern families: emails, phones (US + E.164), SSN (year-aware),
Luhn-verified credit cards, JWT-shaped tokens, bearer tokens. Zero
deps. Adversarial-input safe.
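
A rough sketch of two of the six pattern families (emails and Luhn-verified card numbers). The regexes and the `[EMAIL]` / `[CARD]` placeholders are illustrative assumptions, not the patterns in src/core/eval-capture-scrub.ts.

```typescript
// Luhn checksum: walking from the right, double every second digit,
// subtract 9 from any doubled value above 9, and require sum % 10 === 0.
function luhnValid(digits: string): boolean {
  let sum = 0;
  let double = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48;
    if (d < 0 || d > 9) return false;
    if (double) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    double = !double;
  }
  return digits.length > 0 && sum % 10 === 0;
}

// Illustrative patterns only (assumptions, not the module's real regexes).
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const CARD_RE = /\b\d(?:[ -]?\d){12,18}\b/g;

function scrubSketch(text: string): string {
  return text
    .replace(EMAIL_RE, "[EMAIL]")
    // Only digit runs that pass the Luhn check are treated as card numbers.
    .replace(CARD_RE, (m) => (luhnValid(m.replace(/[ -]/g, "")) ? "[CARD]" : m));
}
```

Gating the card pattern on the checksum is what keeps long order numbers and other digit runs from being scrubbed by mistake.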

src/core/eval-capture.ts — op-layer hook helper:
  - buildEvalCandidateInput(ctx, {scrub_pii}) — pure row builder
  - classifyCaptureFailure(err) — Postgres SQLSTATE → reason tag
  - captureEvalCandidate(engine, ctx, opts) — best-effort, never throws
  - isEvalCaptureEnabled / isEvalScrubEnabled — file-plane config checks

GBrainConfig gains `eval?: {capture?, scrub_pii?}`. Both default ON.
File-plane only — `gbrain config set` writes the DB plane, doesn't
control capture.

Tests: 17 scrubber + 21 capture-module cases. Zero regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(v0.22.0): hybridSearch onMeta callback + op-layer capture (Lane 1C)

Replayed onto master. Adapted from the original v0.21.0 work to keep
Cathedral II's contract intact: hybridSearch's return stays
`Promise<SearchResult[]>` (unchanged), and meta surfaces via an optional
`onMeta?: (meta: HybridSearchMeta) => void` callback in HybridSearchOpts.

Cathedral II callers leave onMeta undefined and pay no cost. The
op-layer capture wrapper passes a closure that threads meta into the
captured row so gbrain-evals can distinguish:
  - "with OPENAI_API_KEY" vs "keyword-only fallback" (vector_enabled)
  - "expansion fired" vs "expansion requested + silently fell back" (expansion_applied)
  - what hybridSearch actually used after auto-detect (detail_resolved)

Op-layer capture wired into both `query` and `search` op handlers in
src/core/operations.ts. Single hook site catches MCP dispatch + CLI +
subagent tool-bridge from the same place. Fire-and-forget, never throws,
respects ctx.config.eval.capture off-switch.
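
A sketch of the callback side channel, under simplified assumptions: the `SearchResult` and `HybridSearchMeta` shapes below are reduced stand-ins, and `hybridSearchSketch` / `searchWithCapture` are hypothetical names rather than the functions in the codebase.

```typescript
// Reduced stand-ins for the real types (assumptions).
interface SearchResult {
  slug: string;
  score: number;
}
interface HybridSearchMeta {
  vector_enabled: boolean;
  expansion_applied: boolean;
  detail_resolved: string;
}
interface HybridSearchOpts {
  onMeta?: (meta: HybridSearchMeta) => void;
}

// Stand-in search: the return shape stays SearchResult[]; meta goes only to the callback.
async function hybridSearchSketch(query: string, opts: HybridSearchOpts = {}): Promise<SearchResult[]> {
  const results: SearchResult[] = [{ slug: `notes/${query}`, score: 1 }];
  if (opts.onMeta) {
    opts.onMeta({ vector_enabled: false, expansion_applied: false, detail_resolved: "low" });
  }
  return results;
}

// Op-layer wrapper: a closure collects the meta and threads it into the captured row.
async function searchWithCapture(query: string) {
  const metas: HybridSearchMeta[] = [];
  const results = await hybridSearchSketch(query, { onMeta: (m) => metas.push(m) });
  return { results, captured: { query, meta: metas[0] } };
}
```

Callers that omit `onMeta` take the same code path and get the same return value, which is the property the commit message relies on.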

Tests:
  - test/hybrid-meta.test.ts (8 cases) — onMeta accuracy across the 4
    return paths in hybridSearch + verification that omitting onMeta
    leaves Cathedral II callers unchanged.
  - test/mcp-eval-capture.test.ts (10 cases) — query/search ops capture
    correctly with MCP/CLI/subagent contexts, scrub on/off, capture=false
    off-switch, non-captured ops (list_pages, get_page), F1 failure
    isolation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(v0.22.0): gbrain eval export/prune + doctor eval_capture check (Lane 1D)

Replayed onto master. Same semantics as the original v0.21.0 work.

CLI:
  gbrain eval export [--since DUR] [--limit N] [--tool query|search]
    NDJSON to stdout, every row prefixed with "schema_version":1 per
    docs/eval-capture.md contract. EPIPE-safe streaming, stderr
    heartbeats, deterministic ordering (created_at DESC, id DESC).

  gbrain eval prune --older-than DUR [--dry-run]
    Explicit retention cleanup. Requires --older-than (never deletes
    without a window). Duration strings: 30d, 7d, 1h, 90m, 3600s.
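
A minimal sketch of parsing the duration strings listed above. `parseDurationMs` and `cutoffDate` are hypothetical helpers; the real CLI may accept more forms or reject input differently.

```typescript
// Hypothetical parser for strings such as 30d, 7d, 1h, 90m, 3600s.
function parseDurationMs(input: string): number {
  const match = /^(\d+)([smhd])$/.exec(input.trim());
  if (!match) throw new Error(`invalid duration: ${input}`);
  const amount = Number(match[1]);
  const unit = match[2];
  const perUnit =
    unit === "s" ? 1000 : unit === "m" ? 60000 : unit === "h" ? 3600000 : 86400000;
  return amount * perUnit;
}

// Cutoff for --older-than: rows created before this instant would be candidates for pruning.
function cutoffDate(olderThan: string, now: Date = new Date()): Date {
  return new Date(now.getTime() - parseDurationMs(olderThan));
}
```

Throwing on anything that does not match keeps a typo such as `30` (no unit) from silently turning into a delete window.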

Legacy bare `gbrain eval --qrels …` still works via sub-subcommand
fall-through.

gbrain doctor gains an eval_capture check between markdown_body_completeness
and queue_health: reads eval_capture_failures for the last 24h, groups by
reason, warns when non-zero. Pre-v30 brains get "Skipped (table
unavailable)" — non-fatal.

docs/eval-capture.md ships the stable NDJSON schema reference for
gbrain-evals consumers.

Tests: 9 export cases + 5 prune cases. Doctor check covered by
existing doctor tests on master.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(v0.22.0): public-exports contract test + CI count guard (Lane 2 / R2)

Master locks 17 public subpath exports as gbrain's stable third-party
contract. Zero enforcement existed. This PR locks the surface in two
layers:

1. test/public-exports.test.ts — runtime contract test.
   Reads package.json "exports" at startup. For each subpath, imports
   via the package name ("gbrain/engine"), NOT the relative filesystem
   path — that's the difference between exercising the actual resolver
   and bypassing it. Every subpath gets a canary symbol pinned (e.g.
   gbrain/search/hybrid must export hybridSearch + rrfFusion) so a
   refactor that renames or removes one fails CI before downstream
   consumers (gbrain-evals) silently break.

2. scripts/check-exports-count.sh — CI structural guard.
   Wired into `bun test` after check-jsonb-pattern.sh +
   check-progress-to-stdout.sh + check-wasm-embedded.sh per master's
   precedent. EXPECTED_COUNT=17 baseline — shrinks fail loudly,
   growth also fails so the new canary must be pinned in the runtime
   test deliberately.
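
The two layers can be sketched in TypeScript as follows. This is a hypothetical rendering for illustration: the real count guard is a shell script, and the real contract test imports each subpath through the package resolver, which a self-contained snippet cannot do.

```typescript
// Count guard (assumed logic): any drift from the baseline fails, in either direction.
function checkExportsCount(exportsField: Record<string, unknown>, expected: number): void {
  const actual = Object.keys(exportsField).length;
  if (actual !== expected) {
    const direction = actual < expected ? "shrank" : "grew";
    throw new Error(
      `public exports ${direction}: expected ${expected}, found ${actual}. ` +
        "Update the baseline and pin a canary in the runtime test."
    );
  }
}

// Canary check: returns the pinned symbols that are missing from a module namespace.
function missingCanaries(moduleNamespace: Record<string, unknown>, canaries: string[]): string[] {
  return canaries.filter((name) => !(name in moduleNamespace));
}
```

Failing on growth as well as shrinkage forces whoever adds a subpath to also pin a canary for it.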

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs+e2e(v0.22.0): VERSION/CHANGELOG/CLAUDE/README + Postgres E2E (Lane 3)

Bump VERSION + package.json to 0.22.0 (next free slot after master's
v0.21.0 Code Cathedral II minor).

CHANGELOG.md v0.22.0 entry follows the Garry voice template:
  - Bold 2-line headline
  - Lead paragraph contextualizing v0.20 + v0.21 + v0.22 progression
  - Numbers-that-matter table comparing v0.21.0 → v0.22.0
  - "What this means for you" sectioned by audience
  - "## To take advantage of v0.22.0" operator runbook
  - Itemized changes

CLAUDE.md updates:
  - Key files: 8 new module entries (eval-capture*, eval-export,
    eval-prune, docs/eval-capture.md, public-exports test).
    hybrid.ts entry rewritten to reflect the additive `onMeta` callback
    (return shape unchanged).
  - Key commands: new v0.22.0 section for `gbrain eval export`,
    `gbrain eval prune`, and the doctor `eval_capture` check, with the
    file-plane vs DB-plane config gotcha called out.

README.md: one-paragraph pointer after the BrainBench blurb so anyone
reading the landing page sees the new session-capture feature.

llms.txt + llms-full.txt regenerated to pick up the doc additions.

test/e2e/eval-capture.test.ts (Postgres-only E1 spec):
  - CHECK violation surfaces as Postgres SQLSTATE 23514 on oversize input
  - RLS is actually enabled on both eval_candidates + eval_capture_failures
  - 50 concurrent logEvalCandidate calls — no deadlock, all distinct IDs

Skips gracefully when DATABASE_URL is unset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(todos): P0 — PGLite test-runner concurrency flake

Pre-existing on master, surfaces ~27 false failures when bun test runs all
174 files together. Each failing file passes in isolation. Tracked for a
dedicated investigation branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(v0.22.0): adversarial review post-fixes (doctor RLS, onMeta safety)

Two surgical fixes from /ship adversarial review, plus 6 follow-ups TODO'd
into v0.22.1:

- doctor.ts: distinguish pre-v30 missing-table (42P01, ok skip) from
  RLS-denied SELECT (42501, warn) and other DB errors (warn). The check
  exists specifically to surface capture-failure misconfigs cross-process,
  so silently reporting "ok / skipped" on the most diagnostic class
  defeated the purpose.

- hybrid.ts: wrap onMeta invocation in try/catch via small emitMeta
  helper. The callback is part of the public gbrain/search/hybrid
  contract; a throwing user-supplied closure must never break the search
  hot path.

- TODOS.md: 6 P1 follow-ups (eval prune real COUNT, scrubber CC false
  positives, dead 'scrubber_exception' enum value, id-cursor for
  cross-window dedup, public-export canary pinning, EXPECTED_COUNT dedup).

- TODOS.md: P0 entry for the pre-existing PGLite test-runner concurrency
  flake (~27 false failures in full bun test on master).

- CHANGELOG.md: 2 bullets noting the doctor + onMeta hardening.
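
The two fixes can be sketched roughly as below. The result shape, the messages, and the function names are assumptions; only the SQLSTATE codes (42P01, 42501) and the `emitMeta` idea come from the bullets above.

```typescript
// Hypothetical result shape for the doctor check.
type CheckStatus = { status: "ok" | "warn"; message: string };

function classifyEvalCaptureError(err: { code?: string; message?: string }): CheckStatus {
  if (err.code === "42P01") {
    // undefined_table: the brain predates the eval tables, so a quiet skip is fine.
    return { status: "ok", message: "Skipped (table unavailable)" };
  }
  if (err.code === "42501") {
    // insufficient_privilege: the SELECT was denied, which is worth surfacing.
    return { status: "warn", message: "eval_capture_failures not readable (permission denied)" };
  }
  const detail = err.message === undefined ? "unknown error" : err.message;
  return { status: "warn", message: `eval_capture check failed: ${detail}` };
}

// Guarded callback invocation: a throwing user-supplied closure is swallowed.
function emitMeta<M>(onMeta: ((meta: M) => void) | undefined, meta: M): void {
  if (!onMeta) return;
  try {
    onMeta(meta);
  } catch {
    // Intentionally ignored so the search hot path keeps going.
  }
}
```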

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(version): bump v0.22.0 → v0.25.0 (queue-aware version pick)

Master is at v0.21.0. Open PRs claim v0.21.1 (#432) and v0.24.0 (#387).
v0.25 is the first uncontested slot, so this branch claims it. Pure
rename across VERSION, package.json, CHANGELOG header, and every "v0.22.0"
reference in CLAUDE.md / README.md / TODOS.md / docs/eval-capture.md /
src/ / test/ files. CHANGELOG date bumped to 2026-04-26.

llms.txt + llms-full.txt regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(v0.25.0): gbrain eval replay + contributor doc + CONTRIBUTING link

Closes the gap between "session capture works" (this PR's core) and
"contributors actually use it before merging." Three artifacts:

- src/commands/eval-replay.ts (~340 LOC) — reads NDJSON from `gbrain eval
  export`, re-runs each captured query/search against the current brain,
  computes set-Jaccard@k, top-1 stability, and latency delta. Stable JSON
  shape (schema_version:1) for CI gating; human mode prints a regression
  table sorted worst-first. Pure Bun, zero new deps. Stub-engine tests
  cover Jaccard math, NDJSON parser (including v2 forward-compat
  rejection + line-numbered errors), --limit, --verbose, --json, and
  graceful per-row error handling. 16/16 passing.

- docs/eval-bench.md (~80 lines) — contributor guide. The 4-command loop
  (export → change → replay → diff), metric definitions with healthy
  ranges (Jaccard ≥0.85, top-1 ≥85%, latency Δ within ±50ms), trigger
  paths, CI integration snippet, hand-crafted NDJSON corpus path for
  fresh installs, and the off-switch. Pairs with the existing
  docs/eval-capture.md which is the consumer-facing wire format.

- CONTRIBUTING.md gains a "Running real-world eval benchmarks (touching
  retrieval code)" section with the trigger paths and a link to
  docs/eval-bench.md. Reviewers now have a one-line ask: "did you run
  replay?"
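
For reference, a sketch of the two ranking metrics named above, computed per captured row over result slugs. Function names and the empty-set conventions are assumptions; src/commands/eval-replay.ts may handle edge cases differently.

```typescript
// Set-Jaccard@k: |A ∩ B| / |A ∪ B| over the top-k slugs of the baseline and current runs.
function jaccardAtK(baseline: string[], current: string[], k: number): number {
  const a = new Set(baseline.slice(0, k));
  const b = new Set(current.slice(0, k));
  if (a.size === 0 && b.size === 0) return 1; // assumed convention: two empty sets agree
  let intersection = 0;
  a.forEach((slug) => {
    if (b.has(slug)) intersection++;
  });
  return intersection / (a.size + b.size - intersection);
}

// Top-1 stability for one row: did the first result stay the same?
// Two empty result lists count as stable here (assumed convention).
function top1Stable(baseline: string[], current: string[]): boolean {
  return baseline[0] === current[0];
}
```

Jaccard ignores order inside the top k, so top-1 stability is tracked separately to catch a reshuffle that leaves the set unchanged.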

CLAUDE.md key files updated. CHANGELOG bullets added. llms.txt
regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(v0.25.0): CONTRIBUTOR_MODE flag — capture off by default for users

Eval capture was on for everyone in the v0.25.0 draft. Privacy footgun:
end users had retrieval traffic accumulate in their brain DB without
asking, even with PII scrubbing. Flips to off by default + explicit
opt-in for contributors who actually use the replay loop.

Resolution order in isEvalCaptureEnabled():
  1. config.eval.capture === true            → on
  2. config.eval.capture === false           → off
  3. process.env.GBRAIN_CONTRIBUTOR_MODE === '1' → on
  4. otherwise                                → off

The env var is the contributor-facing toggle (one line in .zshrc, no
JSON edit). Explicit config wins both directions for users who want to
override per-brain.
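
The resolution order can be sketched as a small pure function. The signature is an assumption: the environment is passed in as a parameter here so the logic is easy to exercise, whereas the list above reads `process.env` directly.

```typescript
// Config shape taken from the `eval?: {capture?, scrub_pii?}` description; other fields omitted.
interface EvalConfigSketch {
  eval?: { capture?: boolean; scrub_pii?: boolean };
}

function isEvalCaptureEnabledSketch(
  config: EvalConfigSketch,
  env: Record<string, string | undefined>
): boolean {
  const explicit = config.eval ? config.eval.capture : undefined;
  if (explicit === true) return true; // 1. explicit config on
  if (explicit === false) return false; // 2. explicit config off
  return env.GBRAIN_CONTRIBUTOR_MODE === "1"; // 3. env var on, 4. otherwise off
}
```

Checking the explicit config first is what lets a per-brain `capture: false` win even when the contributor env var is set globally.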

PII scrubbing gate stays independent — default true regardless of
CONTRIBUTOR_MODE — so any brain that does capture still scrubs.

Tests rewritten: env var hygiene per-test (origMode preserved + restored
in finally). 9/9 pass; total v0.25.0 suite is 198/198.

Docs:
- README.md gains a Contributing-section pointer to the env var.
- CONTRIBUTING.md gains a "CONTRIBUTOR_MODE — turn on the dev loop"
  section with verification commands and resolution-order table.
- docs/eval-bench.md leads with the prerequisite (must set the env var
  for the rest of the doc to be useful).
- docs/eval-capture.md "Config" section split into Path A (env var) +
  Path B (config) with explicit resolution-order rules.
- CHANGELOG v0.25.0 entry corrected ("on by default" was wrong) plus a
  new top itemized bullet calling out the gate change.
- CLAUDE.md eval-capture entry annotated with the new gate logic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: post-ship documentation pass for v0.25.0

Cross-references every doc against the final state of the branch
(CONTRIBUTOR_MODE flag, eval replay tool, off-by-default capture):

- README.md: top callout rewritten — was implying capture-on-by-default
  contradicting the gate landed in 7a80ce2. Now leads with
  "contributor opt-in" and links docs/eval-bench.md alongside
  docs/eval-capture.md.
- AGENTS.md: new "Eval retrieval changes" task entry with the
  CONTRIBUTOR_MODE+replay one-liner so non-Claude agents (Codex, Cursor,
  Aider) have the same path.
- CLAUDE.md: "Key commands added in v0.25.0" gains the replay command and
  a CONTRIBUTOR_MODE bullet covering the resolution order.
- CHANGELOG.md: headline rewritten to match the actual feature ("benchmark
  retrieval changes against real captured queries before merging" — was
  "every real query is captured"). Stale "v0.22 ships the substrate"
  → v0.25. Test count corrected 82 → 144 (added 16 replay + 9
  CONTRIBUTOR_MODE + 8 v31-shape tests since the original count). Two
  metric rows added to the numbers table: default-off posture, in-tree
  replay tooling. "To take advantage" block split into user vs
  contributor branches with shell-rc instructions.
- TODOS.md: v0.22.1 follow-up reference corrected to v0.25.1.

llms.txt + llms-full.txt regenerated. Typecheck clean. 198/198 v0.25.0
tests still green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vinsew added a commit to vinsew/gbrain that referenced this pull request May 11, 2026
…nb + doctor

The four `subagent_*` jsonb writers in `src/core/minions/handlers/subagent.ts`
(persistMessage, persistToolExecPending, persistToolExecComplete,
persistToolExecFailed) used `engine.executeRaw` with `JSON.stringify(value)`
plus `$N::jsonb` cast. On the postgres-engine path that goes through
postgres.js's `unsafe()`, that combination produces a jsonb string scalar
instead of an object — exactly the v0.12.0 double-encode shape, on a second
set of tables.

Verified empirically (probe inside this branch, not committed):

  P1 JSON.stringify(obj) + $::jsonb → jsonb_typeof='string'  (broken)
  P3 raw object + $::jsonb           → jsonb_typeof='object' (correct)
  P4 sql.json(obj) via unsafe        → jsonb_typeof='object' (correct)

The four call sites are changed to pass the raw value. postgres.js v3
auto-encodes objects/arrays for jsonb columns (already what queue.ts:233
relies on for `minion_jobs.data`, never broken). PGLite handles raw values
identically (verified — all three patterns produce object jsonb on PGLite).
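
A pure-JavaScript analogy of the double-encode shape, with no database or driver involved. The helper names and the example slug are hypothetical; the functions only mimic what `jsonb_typeof` and `input->>'slug'` report for a stored value.

```typescript
const toolInput = { slug: "people/example" }; // hypothetical tool-call input

const encodedOnce = JSON.stringify(toolInput); // JSON text for an object
const encodedTwice = JSON.stringify(encodedOnce); // JSON text for a string that wraps the above

// Decoding once mirrors what ends up stored in the jsonb column.
const storedOk: unknown = JSON.parse(encodedOnce);
const storedBroken: unknown = JSON.parse(encodedTwice);

// Rough stand-in for jsonb_typeof (only the cases needed here).
function jsonbTypeofSketch(value: unknown): string {
  if (value === null) return "null";
  if (Array.isArray(value)) return "array";
  return typeof value;
}

// Rough stand-in for input->>'slug': null unless the stored value is an object.
function slugOf(stored: unknown): string | null {
  if (typeof stored !== "object" || stored === null || Array.isArray(stored)) return null;
  const slug = (stored as Record<string, unknown>).slug;
  return typeof slug === "string" ? slug : null;
}

// Repair: unwrap one layer when the stored value is a string holding JSON text.
function repairDoubleEncoded(stored: unknown): unknown {
  if (typeof stored !== "string") return stored;
  try {
    return JSON.parse(stored);
  } catch {
    return stored; // a plain string that is not JSON is left alone
  }
}
```

In this analogy `slugOf(storedBroken)` is null while `slugOf(repairDoubleEncoded(storedBroken))` recovers the slug, which matches the symptom and the repair described in this commit message.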

Existing data on the user's brain at the time of fix:
  subagent_messages.content_blocks: 40/40 corrupt
  subagent_tool_executions.input:   39/39 corrupt
  subagent_tool_executions.output:  39/39 corrupt
All 118 rows repaired by `gbrain repair-jsonb` after extending TARGETS.

Symptom this resolves: dream synthesize's orchestrator
(`collectChildPutPageSlugs` in src/core/cycle/synthesize.ts:459) queries
`input->>'slug'` to gather slugs the subagent wrote; on string-typed jsonb
that operator returns NULL, so `pages_written` reports 0 even when child
subagents successfully wrote pages. The reverse-write step that mirrors
DB → markdown is then skipped, leaving real synthesized pages stranded
in the DB. After this fix, the orchestrator picks up slugs correctly and
the full dream cycle closes the loop.

Companion changes:
  - `repair-jsonb` extended from 5 → 8 columns
    (subagent_messages.content_blocks,
    subagent_tool_executions.{input,output}); guards each target
    with `to_regclass()` so pre-v0.15 brains that lack the subagent_*
    tables don't throw at upgrade time.
  - `doctor`'s `jsonb_integrity` check extended to the same 8 columns
    with the same guard. Any future user hitting this bug from upstream
    master gets a clear "X rows double-encoded ... Fix: gbrain
    repair-jsonb" warning instead of silent dream failures.
  - `repair-jsonb.ts` header comment retracted: the prior claim that
    parameterized `$N::jsonb` was always safe was wrong; safety came
    from queue.ts/etc. passing raw objects, not from the binding form.

Upstream history: PR garrytan#525 was a chmod fix mis-attributed by local commits;
the real outstanding upstream PR is garrytan#539 (zombie lock + post-hoc data
repair via UPDATE) which does NOT patch the write path. Master HEAD still
ships the buggy form. This patch is upstream-worthy as a follow-up to

Tests: 245 pass across minions, subagent-handler, cycle-synthesize,
cycle-patterns, repair-jsonb, migrations-v0_12_2, and doctor. typecheck
clean.
vinsew added a commit to vinsew/gbrain that referenced this pull request May 23, 2026
…nb + doctor

@garrytan

Owner Author

Closing as superseded — the work in this PR mostly shipped in two follow-on waves: v0.23.2 closed the synthesize-phase orchestrator slug-collection bug (dream_generated frontmatter marker in renderPageToMarkdown + isDreamOutput guard in transcript-discovery.ts), and v0.28.1 added the zombie-reap layer (installSigchldHandler() in src/core/zombie-reap.ts + tini-as-PID-1 wrapping in supervisor.ts). The 8253-line branch is too stale to land as-is against current master, but the core fixes are in. Thanks for the load-bearing diagnosis.

@garrytan garrytan closed this May 24, 2026
vinsew added a commit to vinsew/gbrain that referenced this pull request Jun 1, 2026
…nb + doctor
