Skip to content

v0.24.0: production-hardening pass on the skillify loop#387

Merged
garrytan merged 9 commits into
masterfrom
feat/brain-first-convention
May 1, 2026
Merged

v0.24.0: production-hardening pass on the skillify loop#387
garrytan merged 9 commits into
masterfrom
feat/brain-first-convention

Conversation

@garrytan

@garrytan garrytan commented Apr 24, 2026

Copy link
Copy Markdown
Owner

Summary

The v0.24.0 release. Production-hardening pass on top of master's existing v0.19.0 skillify work. No new features. No new commands. Every public contract that lied about itself, every silent footgun, every CI guard that wasn't wired up — fixed.

The skillify loop stops lying. Privacy guard runs, --llm is honest, ghost rows go away. Plus: Tier 2 LLM-skill tests now block every PR, not just the nightly cron.

Biggest fixes

  • Skillpack installer would have silently deleted your skills. The original v0.19 design's "rebuild managed block" path was load-bearing wrong — a user installing alpha then later running gbrain skillpack install beta alone would have lost alpha. Codex caught it during cross-model review. The fix preserves cumulative-install semantics via a receipt comment in the fence: <!-- gbrain:skillpack:manifest cumulative-slugs="alpha,beta,..." -->. Old fences upgrade silently. User-added rows survive with a stderr warning telling the operating agent to investigate. install --all is now the only path that prunes; per-skill install never destroys what it didn't install.
  • gbrain routing-eval --llm was a documented feature that did nothing. README, CHANGELOG, and CLI help all said it ran an LLM tie-break layer. The code returned structural-only results with no warning, no error, no signal at all. v0.24.0 makes the flag honest across all four touchpoints. Until the LLM layer ships, --llm emits a stderr placeholder notice and runs structural. CI logs see it. Docs match the code. The release notes don't lie.
  • scripts/check-privacy.sh was orphan code. Existed in the repo to enforce the OpenClaw fork-name ban from CLAUDE.md:550, never wired anywhere. v0.24.0 prepends it to package.json's "test" chain alongside the other check-*.sh guards. A regression test asserts the wiring stays. The first run caught 5 banned-name references that had been sitting in master's CHANGELOG.md, src/cli.ts, src/commands/sync.ts, and skills/migrations/v0.19.0.md for releases — fixed in the same wave.
  • Tier 2 (LLM-skill E2E) was schedule-only. test/e2e/skills.test.ts ran nightly, not per-PR. v0.24.0 promotes it to required per-PR CI using existing repo secrets. ~3-5 min added per PR for real protection against LLM-adjacent regressions.

The numbers that matter

Metric BEFORE v0.24.0 AFTER v0.24.0 Δ
routing-eval --llm matches docs no yes fixed
Public-contract drift surfaces fixed 4 0 −4
Cumulative-install regression test none test 8a +1
Banned-name leaks in tracked files 5 0 −5
check-privacy.sh runs in CI never every PR wired
Tier 2 LLM-skill E2E per-PR gate no yes wired
Stale v0.17/v0.18 labels in new code 7 sites / 5 files 0 −7
Skillify scaffold under hand-edited resolver backtick-only backtick + quoted + bare fixed

Cross-model review trail

CEO + Eng + Codex outside voice. 14 user decisions captured, 0 unresolved, 1 critical Codex catch (the cumulative-install regression that would have shipped). Two-model review caught a one-model-blind spot. Receipt design lives in src/core/skillpack/installer.ts:applyManagedBlock.

Test plan

  • bun install && bun run typecheck && bun test — green (18 known PGLite-contention flakes are local-Mac-only, documented as P3 dev-experience in TODOS.md; CI is unaffected)
  • bun run test:e2e — Tier 1 lifecycle clean (1 pre-existing master flake unrelated to this branch, verified via git stash baseline)
  • bun test test/e2e/openclaw-reference-compat.test.ts — 8/8 green
  • All 4 CI guards (privacy + jsonb + progress + wasm) — green locally
  • 62/62 pass across this branch's 5 new/extended test files: test/routing-eval-cli.test.ts (4), test/privacy-script-wired.test.ts (3), test/skillpack-install.test.ts (+4 = 30), test/skillify-scaffold.test.ts (+4 = 18), test/build-llms.test.ts (7)
  • CI Tier 1 + Tier 2 (now required per-PR after this lands)

Upgrade path

gbrain upgrade does this automatically. To verify:

gbrain --version                                 # should say 0.24.0
gbrain routing-eval --llm 2>&1 | grep -i placeholder
# expect: "[routing-eval] --llm flag is a placeholder in this release..."

gbrain skillpack install <name>
grep "gbrain:skillpack:manifest cumulative-slugs" $OPENCLAW_WORKSPACE/AGENTS.md
# expect a receipt line listing every gbrain-installed slug

grep "check-privacy.sh" package.json
# expect a hit in scripts.test

No schema migration. Existing brains work unchanged.

Files changed

  • VERSION 0.21.0 → 0.24.0
  • package.json version + new check-privacy.sh in test chain + check:privacy alias
  • CHANGELOG.md new top-of-file entry, ~100 lines
  • README.md, CLI help text — --llm rewrite (4-surface honesty pass)
  • src/commands/routing-eval.ts — stderr placeholder notice
  • src/core/skillpack/installer.ts — receipt + cumulative + unknown-row warn (~100 LOC)
  • src/core/skillify/generator.ts — broadened resolver-row regex
  • src/core/routing-eval.ts, src/core/filing-audit.ts, src/commands/check-resolvable.ts, src/commands/skillify.ts, src/commands/skillpack.ts — version label scrub
  • src/cli.ts, src/commands/sync.ts, skills/migrations/v0.19.0.md — banned-name scrub
  • .github/workflows/e2e.yml — Tier 2 promoted to per-PR
  • 2 new test files + 2 extended test files
  • llms.txt, llms-full.txt regenerated to match

🤖 Generated with Claude Code

Documentation

/document-release ran after v0.24.0 landed and updated 3 docs to match:

  • CLAUDE.md — fixed the false claim that routing-eval --llm "opts into a Haiku tie-break layer for CI" (now describes the v0.24.0 placeholder semantic). Documented the skillpack installer's cumulative-slugs receipt + unknown-row preserve+warn behavior.
  • README.md — added one paragraph to the gbrain skillpack install section explaining the receipt comment + the user-visible stderr message for hand-added rows.
  • CONTRIBUTING.md — recommended bun run test (full CI guard chain) over bare bun test for local pre-push verification. Named each guard (privacy, jsonb, progress, wasm) so new contributors understand what catches what.
  • llms-full.txt — regenerated to reflect the CLAUDE.md changes.

No risky/subjective decisions surfaced; all updates were factual corrections clearly warranted by the v0.24.0 diff.

…tion

This is the v0.19.0 release. The branch ships four new CLI commands, a
refactor to check-resolvable, and an expansion of the brain-first
convention for sub-agent tool discovery. The original commit message
described only the convention expansion, undercounting the scope by ~5x;
this amend captures the full release.

NEW COMMANDS

- gbrain skillify scaffold <name>     — 4 stub files + idempotent resolver row
- gbrain skillify check [path]        — 10-item post-task audit (promoted)
- gbrain skillpack list / install     — curated 25-skill bundle, atomic install
- gbrain skillpack diff <name>        — per-file diff preview
- gbrain routing-eval                 — dedicated CI verb for Check 5 fixtures

CHECK-RESOLVABLE REFACTOR

- Accepts AGENTS.md as a resolver file alongside RESOLVER.md, at either
  the skills directory or one level up (workspace root layout).
- Auto-derives the skill manifest by walking skills/*/SKILL.md when
  manifest.json is missing.
- Splits ResolvableReport into errors[] + warnings[] so advisory checks
  (filing audit, routing gaps, DRY violations) don't break CI by default.
- New --strict opt-in flag promotes warnings to exit 1.

BRAIN-FIRST CONVENTION

- skills/conventions/brain-first.md expanded from 5-step lookup guide to
  full sub-agent reference: tool inventory, lookup chain, score thresholds,
  authority hierarchy, sync rules, entity page conventions, sub-agent
  propagation rule.

PRODUCTION-READINESS HARDENING (this branch's review pass)

- routing-eval --llm: emits stderr placeholder notice + runs structural
  layer only. README, CHANGELOG, CLI help all rewritten consistently.
  Was a silent no-op against documented contract.
- skillpack installer: receipt comment in fence (cumulative-slugs="...")
  preserves single-skill-install accumulation while letting install --all
  prune removed bundle skills cleanly. Unknown rows preserved + stderr
  warning for the operating agent. Pre-v0.19 fences upgrade silently.
- skillify scaffold: resolver-row regex broadened to detect backticked,
  quoted, and bare path forms. No duplicate row on --force after the
  user normalizes formatting.
- scripts/check-privacy.sh: now wired into package.json test chain so
  the wintermute-ban rule is actually enforced. New regression test.
- E2E Tier 2 (LLM skills) promoted from schedule-only to required per-PR
  CI. Local Tier 1 + Tier 2 verified clean.
- Stale v0.17/v0.18 version labels rewritten across new files.

TESTS

- test/routing-eval-cli.test.ts: 4 cases covering --llm warn semantics
- test/privacy-script-wired.test.ts: regression guard for CI wiring
- test/skillpack-install.test.ts: 4 new cases for receipt + cumulative
  + unknown-row preserve+warn + pre-v0.19 upgrade path
- test/skillify-scaffold.test.ts: 4 new cases for broadened regex

VERIFICATION

- bun test: 2237 pass / 18 known PGLite-contention flakes (CI green;
  documented as P3 dev-experience in TODOS.md)
- bun run typecheck: clean
- bun run test:e2e: 18/19 files green (1 pre-existing flake on master,
  not caused by this branch — verified via git stash)
- llms.txt + llms-full.txt regenerated to match README + CHANGELOG

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan force-pushed the feat/brain-first-convention branch from 6ee4bf7 to 399b9a4 Compare April 26, 2026 05:38
@garrytan garrytan changed the title feat: expand brain-first convention for sub-agent tool discovery feat: v0.19.0 — skillify loop + AGENTS.md compat + brain-first convention Apr 26, 2026
Garry Tan and others added 3 commits April 25, 2026 22:50
Master moved from v0.19.0 → v0.21.0 (Code Cathedral I + II) while this
branch worked on production-readiness hardening.

Conflicts resolved:
- TODOS.md: kept master's v0.21.0 Code Cathedral II follow-ups (B2,
  A4, C6, cross-file edge resolution) AND this branch's P3 dev-experience
  TODO for PGLite test parallelism on M-series Macs.
- package.json scripts.test: combined both check chains.
  Now runs check-privacy.sh (this branch) + check-jsonb-pattern.sh +
  check-progress-to-stdout.sh + check-wasm-embedded.sh (master) +
  typecheck + bun test --timeout=60000 (master).

CHANGELOG state after merge: master's v0.21.0 (Code Cathedral II) sits
at top, this branch's v0.19.0 (skillify + AGENTS.md compat) sits below
master's later entries. The duplicate v0.19.0 entries (master's
"code-first brain" 0.19.0 dated 2026-04-23 + this branch's
"skillify loop" 0.19.0 dated 2026-04-22) reflect master's existing
state — not introduced by this merge.

VERSION is 0.21.0 (master's). A follow-up commit can bump to v0.21.1
and reframe this branch's CHANGELOG entry above v0.21.0 per CLAUDE.md
convention. Out of scope for this merge.

Verification:
- bun install — 3 new packages from master (web-tree-sitter etc.)
- bun run typecheck — clean
- bun test (this branch's new tests): 55/55 pass
  (skillpack-install, skillify-scaffold, routing-eval-cli, privacy-script-wired)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The privacy guard wired into the test chain in this branch caught 5
pre-existing references to the banned OpenClaw fork name in CHANGELOG.md
(2x), skills/migrations/v0.19.0.md (1x), src/cli.ts (1x), and
src/commands/sync.ts (1x). All originated in master's v0.19.0 release
notes and migration doc when the privacy script existed but wasn't
wired into CI yet.

Replacements per CLAUDE.md privacy mapping:
- Origin-story copy (CHANGELOG layer narratives, code comments naming
  the production deployment that drove the feature) → "Garry's OpenClaw"
- Reader-facing migration step → "your OpenClaw"

No code semantics changed. Comments + headings only.

Verification: scripts/check-privacy.sh exits 0, full CI guard chain
green (privacy + jsonb + progress + wasm + typecheck).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bump branch version above master's v0.21.0 per CLAUDE.md
"CHANGELOG + VERSION are branch-scoped" rule. The new v0.24.0 entry at
the top of CHANGELOG covers what THIS branch adds vs master:

- routing-eval --llm honesty pass (4-surface contract drift fix)
- skillpack installer cumulative-receipt + unknown-row preserve+warn
  (the Codex-caught regression that would have shipped in master if
  the original v0.19.0 had landed without this branch's review pass)
- skillify scaffold resolver-row regex broadening (backtick + quoted
  + bare forms; idempotency contract preserved under hand-editing)
- 5 banned-name leaks scrubbed from public artifacts
- check-privacy.sh wired into CI test chain + regression guard test
- 7 stale v0.17/v0.18 version labels rewritten across 5 files
- Tier 2 (LLM-skills E2E) promoted from schedule-only to required per-PR

VERSION 0.21.0 → 0.24.0
package.json version field synced.
llms.txt + llms-full.txt regenerated (no content drift; sizes match).

Test suite: 62/62 green across the 5 test files this branch added or
extended (routing-eval-cli, privacy-script-wired, skillpack-install,
skillify-scaffold, build-llms).

CI guards: privacy + jsonb + progress + wasm + typecheck all clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan changed the title feat: v0.19.0 — skillify loop + AGENTS.md compat + brain-first convention v0.24.0: production-hardening pass on the skillify loop Apr 26, 2026
garrytan added a commit that referenced this pull request Apr 26, 2026
Master is at v0.21.0. Open PRs claim v0.21.1 (#432) and v0.24.0 (#387).
v0.25 is the first uncontested slot, so this branch claims it. Pure
rename across VERSION, package.json, CHANGELOG header, and every "v0.22.0"
reference in CLAUDE.md / README.md / TODOS.md / docs/eval-capture.md /
src/ / test/ files. CHANGELOG date bumped to 2026-04-26.

llms.txt + llms-full.txt regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Garry Tan and others added 5 commits April 25, 2026 23:05
Auto-discovered drift via /document-release after the v0.24.0 hardening
pass landed. All factual corrections clearly warranted by the diff.

CLAUDE.md:
- Skillpack installer: documented the cumulative-slugs receipt comment,
  install --all prune semantics, unknown-row preserve+warn behavior,
  and pre-v0.24 silent upgrade. Was previously vague about
  "tracks a skill manifest so install --update diffs cleanly" without
  explaining what the receipt is or why it matters.
- routing-eval: replaced the false claim that --llm "opts into a Haiku
  tie-break layer for CI." Now correctly describes the placeholder
  semantic landed in v0.24.0 (stderr notice + structural-only run).

README.md:
- Skillpack section: added one paragraph on the receipt comment + the
  user-visible stderr message for hand-added rows. Connects the safe
  rerun promise to the v0.24.0 implementation that actually enforces it.

CONTRIBUTING.md:
- Running tests section: now recommends `bun run test` (full CI guard
  chain + typecheck + tests) before pushing. Names each guard so new
  contributors understand what catches what. The privacy guard (newly
  wired in v0.24.0) is one of these — without `bun run test` you'd skip
  it locally and find out from CI.

llms-full.txt: regenerated to reflect CLAUDE.md changes.

Verification: full guard chain green locally (privacy + jsonb + progress
+ wasm + typecheck).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Master moved from v0.21.0 → v0.22.9 (10 PRs: source-aware ranking,
frontmatter-guard, autopilot fixes, MCP HTTP transport, schema-bootstrap
self-healing, doctor batch loads, sync error-code summaries).

Conflicts resolved:
- VERSION: kept this branch's 0.24.0 per CLAUDE.md "branch-scoped
  CHANGELOG + VERSION" rule (master's 0.22.9 is below us).
- package.json version: kept 0.24.0.
- CHANGELOG.md: stitched to keep this branch's v0.24.0 entry on top,
  followed by master's new v0.22.9 → v0.22.0 entries, followed by
  v0.21.0 and below. Sequence is now monotonic above v0.21.0.
  Auto-merger had tangled my v0.24.0 entry with master's v0.22.6.1
  via shared "### Itemized changes" tokens; fixed by checkout --ours
  + manual splice.

Privacy scrub (CI guard from this branch caught new master leaks):
- src/core/search/source-boost.ts — hardcoded default boost-map key
  renamed: 'wintermute/chat/' → 'openclaw/chat/'. Behavior delta:
  default downgrade now matches openclaw/chat/ prefix instead of the
  fork-specific name. Users with the old shape can override via
  GBRAIN_SOURCE_BOOST="openclaw/chat/:0.5,your/path/:0.5".
- test/e2e/engine-parity.test.ts, test/e2e/search-swamp.test.ts,
  test/sql-ranking.test.ts — fixture slugs renamed to match the new
  default key.
- CHANGELOG.md (master's v0.22.0 entry prose) — replaced 5 references
  to the banned name in user-facing release notes.
- docs/integrations/pre-commit.md — replaced "fork (Wintermute, Hermes,
  OpenClaw)" with "your OpenClaw" per CLAUDE.md privacy mapping.
- llms-full.txt regenerated.

Verification:
- bun install — 0 new packages
- bun run typecheck — clean
- scripts/check-privacy.sh — exit 0
- This branch's 5 test files (skillpack-install + skillify-scaffold +
  routing-eval-cli + privacy-script-wired + build-llms) — 63/63 pass
- test/sql-ranking.test.ts — 40/40 pass against the renamed key

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Master moved from v0.22.9 → v0.23.0 (9 PRs: dream synthesizes
conversations into brain pages; claw-test friction harness; frontmatter
inference; minions bare-worker self-monitoring; parallel sync; sync
error-code summary; storage tiering; autopilot-cycle phases passthrough).

Conflicts resolved:
- VERSION: kept this branch's 0.24.0 per CLAUDE.md branch-scoped rule.
- package.json: kept 0.24.0; combined test-script chain — both my
  check-privacy.sh AND master's new check-trailing-newline.sh now run
  alongside check-jsonb-pattern.sh + check-progress-to-stdout.sh +
  check-wasm-embedded.sh + typecheck + bun test --timeout=60000.
- CHANGELOG.md: stitched. v0.24.0 stays on top, master's new v0.23.0 +
  v0.22.16 + v0.22.15 + v0.22.14 + v0.22.13 + v0.22.12 + v0.22.11 +
  v0.22.10 entries spliced between v0.24.0 and v0.22.9 (the previous
  merge boundary). Sequence above v0.21.0 monotonically descending.

Verification:
- bun install — 6 new packages (web-tree-sitter wave + bun-types bump)
- All 5 CI guards green: privacy + jsonb + progress + trailing-newline + wasm
- bun run typecheck — clean
- This branch's tests + sql-ranking: 103/103 pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Master moved from v0.23.0 → v0.23.1 (PR #528: local CI gate +
4-tier wall-time optimization, ~13x faster). Plus a jsonb fix that
master shipped + reverted in the same window (no-op for this branch).

Conflicts resolved:
- VERSION: kept this branch's 0.24.0 per CLAUDE.md branch-scoped rule.
- package.json: kept 0.24.0; combined the test-script chain so
  check-privacy.sh (mine) coexists with master's new check-trailing-
  newline.sh + the existing jsonb/progress/wasm/typecheck guards. Also
  pulled in master's new build:pglite-snapshot script.
- CONTRIBUTING.md: combined paragraphs. My "Use bun run test" guard
  description now precedes master's new "Local CI gate (v0.23.1+)"
  block describing bun run ci:local. Both are useful; both stay.
- CHANGELOG.md: my v0.24.0 entry stays on top; master's new v0.23.1
  entry slots between v0.24.0 and v0.23.0. Sequence above v0.21.0
  monotonically descending.

Verification:
- bun install — 0 new packages
- All 5 CI guards green: privacy + jsonb + progress + trailing-newline + wasm
- bun run typecheck — clean
- This branch's tests: 103/103 pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Master moved from v0.23.1 → v0.23.2 (PR #527: orchestrator-stamped
self-consumption marker for the dream cycle + verdict-model unit tests).

Conflicts resolved:
- VERSION: kept this branch's 0.24.0 per CLAUDE.md branch-scoped rule.
- package.json version: kept 0.24.0.
- CHANGELOG.md: my v0.24.0 stays on top; master's new v0.23.2 entry
  spliced between v0.24.0 and v0.23.1. Sequence above v0.21.0 monotonic.

Verification:
- bun install — 0 new packages
- All 5 CI guards green: privacy + jsonb + progress + trailing-newline + wasm
- bun run typecheck — clean
- This branch's tests: 103/103 pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan merged commit 4fc1246 into master May 1, 2026
7 checks passed
garrytan added a commit that referenced this pull request May 1, 2026
…ct test (#437)

* feat(v0.22.0): eval_candidates + eval_capture_failures schema (Lane 1A)

R1 substrate for BrainBench-Real, replayed onto master after Cathedral II
landed. Migration v30 (slotted after master's v25-v29 Cathedral II wave)
creates two tables:

  eval_candidates: per-call capture of MCP/CLI/subagent query+search
    traffic. Column set lets gbrain-evals replay with full fidelity —
    source_ids from v0.18 multi-source, vector_enabled/detail_resolved/
    expansion_applied so replay knows what hybridSearch actually did,
    remote + job_id + subagent_id so rows are traceable to their origin.
    query is CHECK-capped at 50KB; PII scrubber (Lane 1B) runs before insert.

  eval_capture_failures: cross-process audit trail. In-process counters
    don't work because `gbrain doctor` runs in a separate process from
    the MCP server. Persistent rows let doctor query capture health via
    COUNT(*) GROUP BY reason over the last 24h.

Both tables get RLS on Postgres gated on BYPASSRLS (matches v24/v29
posture). PGLite ignores RLS; sqlFor split carries only DDL.

5 new BrainEngine methods (breaking-interface addition, drives v0.22.0
minor bump): logEvalCandidate, listEvalCandidates,
deleteEvalCandidatesBefore, logEvalCaptureFailure, listEvalCaptureFailures.
listEvalCandidates uses ORDER BY created_at DESC, id DESC so
`gbrain eval export` is deterministic across same-millisecond inserts.

Also adds HybridSearchMeta type for the side-channel callback used by
Lane 1C's op-layer capture (no change to hybridSearch return shape —
that respects Cathedral II's existing SearchResult[] contract).

Tests: 14 PGLite round-trip cases + 8 v30 structural assertions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(v0.22.0): PII scrubber + op-layer capture module (Lane 1B)

Replayed onto master post-Cathedral II. Same semantics as the original
v0.21.0 work — only adjusted to import HybridSearchMeta from types.ts
(canonical home) instead of redeclaring it locally.

src/core/eval-capture-scrub.ts — pure-function regex scrubber with 6
pattern families: emails, phones (US + E.164), SSN (year-aware),
Luhn-verified credit cards, JWT-shaped tokens, bearer tokens. Zero
deps. Adversarial-input safe.

src/core/eval-capture.ts — op-layer hook helper:
  - buildEvalCandidateInput(ctx, {scrub_pii}) — pure row builder
  - classifyCaptureFailure(err) — Postgres SQLSTATE → reason tag
  - captureEvalCandidate(engine, ctx, opts) — best-effort, never throws
  - isEvalCaptureEnabled / isEvalScrubEnabled — file-plane config checks

GBrainConfig gains `eval?: {capture?, scrub_pii?}`. Both default ON.
File-plane only — `gbrain config set` writes the DB plane, doesn't
control capture.

Tests: 17 scrubber + 21 capture-module cases. Zero regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(v0.22.0): hybridSearch onMeta callback + op-layer capture (Lane 1C)

Replayed onto master. Adapted from the original v0.21.0 work to keep
Cathedral II's contract intact: hybridSearch's return stays
`Promise<SearchResult[]>` (unchanged), and meta surfaces via an optional
`onMeta?: (meta: HybridSearchMeta) => void` callback in HybridSearchOpts.

Cathedral II callers leave onMeta undefined and pay no cost. The
op-layer capture wrapper passes a closure that threads meta into the
captured row so gbrain-evals can distinguish:
  - "with OPENAI_API_KEY" vs "keyword-only fallback" (vector_enabled)
  - "expansion fired" vs "expansion requested + silently fell back" (expansion_applied)
  - what hybridSearch actually used after auto-detect (detail_resolved)

Op-layer capture wired into both `query` and `search` op handlers in
src/core/operations.ts. Single hook site catches MCP dispatch + CLI +
subagent tool-bridge from the same place. Fire-and-forget, never throws,
respects ctx.config.eval.capture off-switch.

Tests:
  - test/hybrid-meta.test.ts (8 cases) — onMeta accuracy across the 4
    return paths in hybridSearch + verification that omitting onMeta
    leaves Cathedral II callers unchanged.
  - test/mcp-eval-capture.test.ts (10 cases) — query/search ops capture
    correctly with MCP/CLI/subagent contexts, scrub on/off, capture=false
    off-switch, non-captured ops (list_pages, get_page), F1 failure
    isolation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(v0.22.0): gbrain eval export/prune + doctor eval_capture check (Lane 1D)

Replayed onto master. Same semantics as the original v0.21.0 work.

CLI:
  gbrain eval export [--since DUR] [--limit N] [--tool query|search]
    NDJSON to stdout, every row prefixed with "schema_version":1 per
    docs/eval-capture.md contract. EPIPE-safe streaming, stderr
    heartbeats, deterministic ordering (created_at DESC, id DESC).

  gbrain eval prune --older-than DUR [--dry-run]
    Explicit retention cleanup. Requires --older-than (never deletes
    without a window). Duration strings: 30d, 7d, 1h, 90m, 3600s.

Legacy bare `gbrain eval --qrels …` still works via sub-subcommand
fall-through.

gbrain doctor gains an eval_capture check between markdown_body_completeness
and queue_health: reads eval_capture_failures for the last 24h, groups by
reason, warns when non-zero. Pre-v30 brains get "Skipped (table
unavailable)" — non-fatal.

docs/eval-capture.md ships the stable NDJSON schema reference for
gbrain-evals consumers.

Tests: 9 export cases + 5 prune cases. Doctor check covered by
existing doctor tests on master.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(v0.22.0): public-exports contract test + CI count guard (Lane 2 / R2)

Master locks 17 public subpath exports as gbrain's stable third-party
contract. Zero enforcement existed. This PR locks the surface in two
layers:

1. test/public-exports.test.ts — runtime contract test.
   Reads package.json "exports" at startup. For each subpath, imports
   via the package name ("gbrain/engine"), NOT the relative filesystem
   path — that's the difference between exercising the actual resolver
   and bypassing it. Every subpath gets a canary symbol pinned (e.g.
   gbrain/search/hybrid must export hybridSearch + rrfFusion) so a
   refactor that renames or removes one fails CI before downstream
   consumers (gbrain-evals) silently break.

2. scripts/check-exports-count.sh — CI structural guard.
   Wired into `bun test` after check-jsonb-pattern.sh +
   check-progress-to-stdout.sh + check-wasm-embedded.sh per master's
   precedent. EXPECTED_COUNT=17 baseline — shrinks fail loudly,
   growth also fails so the new canary must be pinned in the runtime
   test deliberately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs+e2e(v0.22.0): VERSION/CHANGELOG/CLAUDE/README + Postgres E2E (Lane 3)

Bump VERSION + package.json to 0.22.0 (next free slot after master's
v0.21.0 Code Cathedral II minor).

CHANGELOG.md v0.22.0 entry follows the Garry voice template:
  - Bold 2-line headline
  - Lead paragraph contextualizing v0.20 + v0.21 + v0.22 progression
  - Numbers-that-matter table comparing v0.21.0 → v0.22.0
  - "What this means for you" sectioned by audience
  - "## To take advantage of v0.22.0" operator runbook
  - Itemized changes

CLAUDE.md updates:
  - Key files: 8 new module entries (eval-capture*, eval-export,
    eval-prune, docs/eval-capture.md, public-exports test).
    hybrid.ts entry rewritten to reflect the additive `onMeta` callback
    (return shape unchanged).
  - Key commands: new v0.22.0 section for `gbrain eval export`,
    `gbrain eval prune`, and the doctor `eval_capture` check, with the
    file-plane vs DB-plane config gotcha called out.

README.md: one-paragraph pointer after the BrainBench blurb so anyone
reading the landing page sees the new session-capture feature.

llms.txt + llms-full.txt regenerated to pick up the doc additions.

test/e2e/eval-capture.test.ts (Postgres-only E1 spec):
  - CHECK violation surfaces as Postgres SQLSTATE 23514 on oversize input
  - RLS is actually enabled on both eval_candidates + eval_capture_failures
  - 50 concurrent logEvalCandidate calls — no deadlock, all distinct IDs

Skips gracefully when DATABASE_URL is unset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(todos): P0 — PGLite test-runner concurrency flake

Pre-existing on master, surfaces ~27 false failures when bun test runs all
174 files together. Each failing file passes in isolation. Tracked for a
dedicated investigation branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(v0.22.0): adversarial review post-fixes (doctor RLS, onMeta safety)

Two surgical fixes from /ship adversarial review, plus 6 follow-ups TODO'd
into v0.22.1:

- doctor.ts: distinguish pre-v30 missing-table (42P01, ok skip) from
  RLS-denied SELECT (42501, warn) and other DB errors (warn). The check
  exists specifically to surface capture-failure misconfigs cross-process,
  so silently reporting "ok / skipped" on the most diagnostic class
  defeated the purpose.

- hybrid.ts: wrap onMeta invocation in try/catch via small emitMeta
  helper. The callback is part of the public gbrain/search/hybrid
  contract; a throwing user-supplied closure must never break the search
  hot path.

- TODOS.md: 6 P1 follow-ups (eval prune real COUNT, scrubber CC false
  positives, dead 'scrubber_exception' enum value, id-cursor for
  cross-window dedup, public-export canary pinning, EXPECTED_COUNT dedup).

- TODOS.md: P0 entry for the pre-existing PGLite test-runner concurrency
  flake (~27 false failures in full bun test on master).

- CHANGELOG.md: 2 bullets noting the doctor + onMeta hardening.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(version): bump v0.22.0 → v0.25.0 (queue-aware version pick)

Master is at v0.21.0. Open PRs claim v0.21.1 (#432) and v0.24.0 (#387).
v0.25 is the first uncontested slot, so this branch claims it. Pure
rename across VERSION, package.json, CHANGELOG header, and every "v0.22.0"
reference in CLAUDE.md / README.md / TODOS.md / docs/eval-capture.md /
src/ / test/ files. CHANGELOG date bumped to 2026-04-26.

llms.txt + llms-full.txt regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(v0.25.0): gbrain eval replay + contributor doc + CONTRIBUTING link

Closes the gap between "session capture works" (this PR's core) and
"contributors actually use it before merging." Three artifacts:

- src/commands/eval-replay.ts (~340 LOC) — reads NDJSON from `gbrain eval
  export`, re-runs each captured query/search against the current brain,
  computes set-Jaccard@k, top-1 stability, and latency delta. Stable JSON
  shape (schema_version:1) for CI gating; human mode prints a regression
  table sorted worst-first. Pure Bun, zero new deps. Stub-engine tests
  cover Jaccard math, NDJSON parser (including v2 forward-compat
  rejection + line-numbered errors), --limit, --verbose, --json, and
  graceful per-row error handling. 16/16 passing.

- docs/eval-bench.md (~80 lines) — contributor guide. The 4-command loop
  (export → change → replay → diff), metric definitions with healthy
  ranges (Jaccard ≥0.85, top-1 ≥85%, latency Δ within ±50ms), trigger
  paths, CI integration snippet, hand-crafted NDJSON corpus path for
  fresh installs, and the off-switch. Pairs with the existing
  docs/eval-capture.md which is the consumer-facing wire format.

- CONTRIBUTING.md gains a "Running real-world eval benchmarks (touching
  retrieval code)" section with the trigger paths and a link to
  docs/eval-bench.md. Reviewers now have a one-line ask: "did you run
  replay?"

CLAUDE.md key files updated. CHANGELOG bullets added. llms.txt
regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(v0.25.0): CONTRIBUTOR_MODE flag — capture off by default for users

Eval capture was on for everyone in the v0.25.0 draft. Privacy footgun:
end users had retrieval traffic accumulate in their brain DB without
asking, even with PII scrubbing. Flips to off by default + explicit
opt-in for contributors who actually use the replay loop.

Resolution order in isEvalCaptureEnabled():
  1. config.eval.capture === true            → on
  2. config.eval.capture === false           → off
  3. process.env.GBRAIN_CONTRIBUTOR_MODE === '1' → on
  4. otherwise                                → off

The env var is the contributor-facing toggle (one line in .zshrc, no
JSON edit). Explicit config wins both directions for users who want to
override per-brain.

PII scrubbing gate stays independent — default true regardless of
CONTRIBUTOR_MODE — so any brain that does capture still scrubs.

Tests rewritten: env var hygiene per-test (origMode preserved + restored
in finally). 9/9 pass; total v0.25.0 suite is 198/198.

Docs:
- README.md gains a Contributing-section pointer to the env var.
- CONTRIBUTING.md gains a "CONTRIBUTOR_MODE — turn on the dev loop"
  section with verification commands and resolution-order table.
- docs/eval-bench.md leads with the prerequisite (must set the env var
  for the rest of the doc to be useful).
- docs/eval-capture.md "Config" section split into Path A (env var) +
  Path B (config) with explicit resolution-order rules.
- CHANGELOG v0.25.0 entry corrected ("on by default" was wrong) plus a
  new top itemized bullet calling out the gate change.
- CLAUDE.md eval-capture entry annotated with the new gate logic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: post-ship documentation pass for v0.25.0

Cross-references every doc against the final state of the branch
(CONTRIBUTOR_MODE flag, eval replay tool, off-by-default capture):

- README.md: top callout rewritten — was implying capture-on-by-default
  contradicting the gate landed in 7a80ce2. Now leads with
  "contributor opt-in" and links docs/eval-bench.md alongside
  docs/eval-capture.md.
- AGENTS.md: new "Eval retrieval changes" task entry with the
  CONTRIBUTOR_MODE+replay one-liner so non-Claude agents (Codex, Cursor,
  Aider) have the same path.
- CLAUDE.md: "Key commands added in v0.25.0" gains the replay command and
  a CONTRIBUTOR_MODE bullet covering the resolution order.
- CHANGELOG.md: headline rewritten to match the actual feature ("benchmark
  retrieval changes against real captured queries before merging" — was
  "every real query is captured"). Stale "v0.22 ships the substrate"
  → v0.25. Test count corrected 82 → 144 (added 16 replay + 9
  CONTRIBUTOR_MODE + 8 v31-shape tests since the original count). Two
  metric rows added to the numbers table: default-off posture, in-tree
  replay tooling. "To take advantage" block split into user vs
  contributor branches with shell-rc instructions.
- TODOS.md: v0.22.1 follow-up reference corrected to v0.25.1.

llms.txt + llms-full.txt regenerated. Typecheck clean. 198/198 v0.25.0
tests still green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant