
Correct Hermes Agent link in README.md #16

Closed
vincentamato wants to merge 1 commit into garrytan:master from vincentamato:vincentamato/fix-readme-hermes-link

Conversation

@vincentamato

Corrected the link to the Hermes Agent website in the README

@vincentamato vincentamato changed the title Update Hermes Agent link in README.md Correct Hermes Agent link in README.md Apr 10, 2026
@garrytan
Owner

Thank you for catching the broken Hermes link! This was included in our community fix wave (PR #38, v0.6.1). We went with the GitHub repo URL from PR #34. Really appreciate you jumping on this!

@garrytan garrytan closed this Apr 11, 2026
@vincentamato vincentamato deleted the vincentamato/fix-readme-hermes-link branch April 11, 2026 04:50
@vincentamato
Author

🫡

garrytan added a commit that referenced this pull request Apr 15, 2026

security: fix command injection + plaintext API key in daemon install
- #1: Crontab install used echo pipe with shell-interpolated values.
  Now uses a temp file via crontab(1) and single-quote escaping on all
  interpolated paths. No shell expansion possible.

- #2: OPENAI_API_KEY was baked as plaintext into the launchd plist
  (readable by any local process, backed up by Time Machine). Now uses
  a wrapper script (~/.gbrain/autopilot-run.sh) that sources ~/.zshrc
  at runtime. No secrets in plist or crontab.

- #16: extract.ts used a custom 20-line YAML parser that only handled
  single-line key:value pairs. Multi-line arrays (attendees list with
  - items) were silently ignored. Now uses the project's gray-matter
  parser via parseMarkdown() from src/core/markdown.ts.
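The crontab hardening in #1 can be sketched as follows. This is a minimal illustration of the single-quote escaping plus temp-file-and-crontab(1) approach the message describes, not the project's actual code; `shellQuote`, `cronLine`, and `installCrontab` are hypothetical names.

```typescript
import { execFileSync } from "node:child_process";
import { writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Wrap a value in single quotes; an embedded quote becomes '\'' (close,
// escaped quote, reopen), so nothing inside is ever shell-expanded.
function shellQuote(value: string): string {
  return `'${value.replace(/'/g, `'\\''`)}'`;
}

// Render the cron line with every interpolated path quoted.
function cronLine(intervalMin: number, binPath: string, logPath: string): string {
  return `*/${intervalMin} * * * * ${shellQuote(binPath)} autopilot >> ${shellQuote(logPath)} 2>&1`;
}

// Install via a temp file and crontab(1) instead of `echo ... | crontab -`.
// execFileSync takes an argv array, so no value ever passes through a shell.
function installCrontab(line: string): void {
  const tmp = join(tmpdir(), `crontab-${process.pid}`);
  writeFileSync(tmp, line + "\n");
  execFileSync("crontab", [tmp]);
}
```

With argv-style invocation plus quoting of interpolated paths, a path like `/tmp/auto pilot.log` (or one containing `$(...)`) stays a literal string all the way into the crontab.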

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request Apr 15, 2026
* feat: migrate 8 existing skills to conformance format

Add YAML frontmatter (name, version, description, triggers, tools, mutating),
Contract, Anti-Patterns, and Output Format sections to all existing skills.
Rename Workflow to Phases. Ingest becomes thin router delegating to specialized
ingestion skills (Phase 2).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add RESOLVER.md, conventions directory, and output rules

RESOLVER.md is the skill dispatcher modeled on Wintermute's AGENTS.md.
Categorized routing table: Always-on, Brain ops, Ingestion, Thinking,
Operational, Setup, Identity. Conventions directory extracts cross-cutting
rules (quality, brain-first lookup, model routing, test-before-bulk).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add skills conformance and resolver validation tests

skills-conformance.test.ts validates every skill has YAML frontmatter with
required fields, Contract, Anti-Patterns, and Output Format sections, and
manifest.json coverage. resolver.test.ts validates routing table categories,
skill path existence, and manifest-to-resolver coverage. 50 new tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add 9 brain skills from Wintermute (Phase 2)

Generalized from Wintermute's battle-tested skills:
- signal-detector: always-on idea+entity capture on every message
- brain-ops: brain-first lookup, read-enrich-write loop, source attribution
- idea-ingest: links/articles/tweets with author people page mandatory
- media-ingest: video/audio/PDF/book with entity extraction (absorbs video/youtube/book)
- meeting-ingestion: transcripts with attendee enrichment chaining
- citation-fixer: audit and fix citation formatting
- repo-architecture: filing rules by primary subject
- skill-creator: create skills with conformance standard + MECE check
- daily-task-manager: task lifecycle with priority levels

All Garry-specific references generalized. Core workflows preserved.
Updated RESOLVER.md and manifest.json.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add operational infrastructure + identity layer (Phase 3)

Operational skills:
- daily-task-prep: morning prep with calendar context and open threads
- cross-modal-review: quality gate via second model with refusal routing
- cron-scheduler: schedule staggering, quiet hours, wake-up override, idempotency
- reports: timestamped reports with keyword routing
- testing: skill validation framework (conformance checks)
- soul-audit: 6-phase interview generating SOUL.md, USER.md, ACCESS_POLICY.md, HEARTBEAT.md
- webhook-transforms: external events to brain signals with dead-letter queue

Identity layer:
- SOUL.md template (agent identity, generated by soul-audit)
- USER.md template (user profile, generated by soul-audit)
- ACCESS_POLICY.md template (4-tier access control)
- HEARTBEAT.md template (operational cadence)
- cross-modal.yaml convention (review pairs, refusal routing chain)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update CLAUDE.md with 24 skills, RESOLVER.md, conventions, templates

GBrain is now a GStack mod for agent platforms. Updated architecture description,
key files listing (16 new skill files, RESOLVER.md, conventions, templates), skills
section (24 skills organized by resolver categories), and testing section (new
conformance and resolver tests).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add GStack detection + mod status to gbrain init (Phase 4)

After brain initialization, gbrain init now reports:
- Number of skills loaded (from manifest.json)
- GStack detection (checks known host paths, uses gstack-global-discover if available)
- GStack install instructions if not found
- Resolver and soul-audit pointers

Also adds installDefaultTemplates() for SOUL.md/USER.md/ACCESS_POLICY.md/HEARTBEAT.md
deployment, and detectGStack() using gstack-global-discover with fallback to known paths
(DRY: doesn't reimplement GStack's host detection logic).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: v0.10.0 release documentation

- CHANGELOG: 24 skills, signal detector, RESOLVER.md, soul-audit, access control,
  conventions, conformance standard, GStack detection in init
- README: updated skill section with 24 skills, resolver, conventions
- TODOS: added runtime MCP access control (P1)
- VERSION: 0.9.2 → 0.10.0
- package.json + manifest.json version bumped

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add skill table to CHANGELOG v0.10.0

16-row table detailing every new skill, what it does, and why it matters.
Written to sell the upgrade, not document the implementation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: restore package.json version after merge conflict resolution

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: zero-based README rewrite for GStackBrain v0.10.0

Lead with GStack mod identity. 24 skills table organized by category.
Install block references RESOLVER.md and soul-audit. GBrain+GStack
relationship explained. Removed redundancy (733 -> 406 lines).
All essential content preserved: install, recipes, architecture,
search, commands, engines, voice, knowledge model.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: extract install block to INSTALL_FOR_AGENTS.md, simplify README

The 30-line copy-paste install block becomes one line:
"Retrieve and follow INSTALL_FOR_AGENTS.md"

Benefits: agent always gets latest instructions (no stale copy-paste),
README stays clean, install details live where agents read them.

README now leads with what GBrain does ("gives your agent a brain")
instead of GStack relationship. Removed "requires frontier model" note.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: 3 bugs in init.ts from merge conflict resolution

1. llstatSync typo (merge corruption) → lstatSync
2. __dirname undefined in ESM module → fileURLToPath polyfill
3. require('fs') in ESM → use imported readFileSync

All three would crash gbrain init at runtime. Caught by /review.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add checkResolvable shared core function for resolver validation

Shared function at src/core/check-resolvable.ts validates that all skills
are reachable from RESOLVER.md, detects MECE overlaps (with whitelist for
always-on/router skills), finds gaps in frontmatter triggers, and scans
for DRY violations. Returns structured ResolvableIssue objects with
machine-parseable fix objects alongside human-readable action strings.

Three call sites: bun test, gbrain doctor, skill-creator skill.

Cleans up test/resolver.test.ts: removes stale 9-line skip list, imports
from production check-resolvable.ts instead of reimplementing parsing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: expand doctor with resolver validation, filesystem-first architecture

Doctor now runs filesystem checks (resolver health, skill conformance) before
connecting to DB. New --fast flag skips DB checks. Falls back to filesystem-only
when DB is unavailable. Adds schema_version: 2 to JSON output, composite health
score (0-100), and structured issues array with action strings for agent parsing.

Resolver health check calls checkResolvable() and surfaces actionable fix
instructions. Link integrity check uses engine.getHealth() dead_links count.

CLI routing split: doctor dispatched before connectEngine() so filesystem
checks always run. Fixes Codex-identified blocker where doctor required DB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add adaptive load-aware throttling and fail-improve loop

backoff.ts: System load checking (CPU via os.loadavg, memory via os.freemem),
exponential backoff with 20-attempt max guard, active hours multiplier (2x
slower during waking hours), concurrent process limit (max 2). Windows-safe:
defaults to "proceed" when os.loadavg returns zeros.
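The load-check and backoff behavior described above can be sketched roughly as below. Thresholds, names (`okToProceed`, `backoffMs`), and the 80%/10% cutoffs are illustrative assumptions, not the module's actual values.

```typescript
import { loadavg, cpus, freemem, totalmem } from "node:os";

// Returns true when it is safe to start a background pass.
function okToProceed(): boolean {
  const [oneMin] = loadavg();
  // Windows-safe: os.loadavg() returns zeros there, so default to "proceed".
  if (oneMin === 0) return true;
  const cores = cpus().length || 1;
  const cpuBusy = oneMin / cores > 0.8;          // sustained load above 80% of cores
  const memTight = freemem() / totalmem() < 0.1; // under 10% of memory free
  return !cpuBusy && !memTight;
}

// Exponential backoff with a hard attempt cap, as in the 20-attempt guard above.
function backoffMs(attempt: number, baseMs = 1000, maxAttempts = 20): number {
  if (attempt >= maxAttempts) throw new Error("backoff: attempt cap reached");
  return baseMs * 2 ** attempt;
}
```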

fail-improve.ts: Deterministic-first, LLM-fallback pattern with JSONL failure
logging. Cascade failure handling: when both paths fail, throws LLM error and
logs both. Log rotation at 1000 entries. Call count tracking for deterministic
hit rate metrics. Auto-generates test cases from successful LLM fallbacks.
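The deterministic-first, LLM-fallback pattern reduces to a small wrapper like the sketch below (function and parameter names are hypothetical; logging and rotation are elided):

```typescript
type Extractor<T> = (input: string) => T | null;

// Try the cheap deterministic path first; fall back to the LLM only on a miss.
// A miss is logged so a better deterministic rule can be derived later.
async function failImprove<T>(
  input: string,
  deterministic: Extractor<T>,
  llmFallback: (input: string) => Promise<T>,
  log: (entry: object) => void,
): Promise<T> {
  const fast = deterministic(input);
  if (fast !== null) return fast; // deterministic hit: no model call
  log({ kind: "deterministic_miss", input });
  try {
    return await llmFallback(input);
  } catch (err) {
    log({ kind: "cascade_failure", input }); // both paths failed
    throw err;
  }
}
```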

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add transcription service and enrichment-as-a-service

transcription.ts: Groq Whisper (default) with OpenAI fallback. Files >25MB
segmented via ffmpeg. Provider auto-detection from env vars. Clear error
messages for missing API keys and unsupported formats.

enrichment-service.ts: Global enrichment service callable from any ingest
pathway. Entity slug generation (people/jane-doe, companies/acme-corp),
mention counting via searchKeyword, tier auto-escalation (Tier 3→2→1 based
on mention frequency and source diversity), batch enrichment with backoff
throttling, regex-based entity extraction from text.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add data-research skill with recipe system, extraction, dedup, tracker

New skill: data-research — one parameterized pipeline for any email-to-
structured-data workflow (investor updates, donations, company metrics).
7-phase pipeline: define recipe, search, classify, extract (with extraction
integrity rule), archive, deduplicate, update tracker.

data-research.ts: Recipe validation, MRR/ARR/runway/headcount regex
extraction (battle-tested patterns), dedup with configurable tolerance,
markdown tracker parsing/appending, quarterly/monthly date windowing,
6-phase HTML email stripping with 500KB ReDoS cap.
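A metric-extraction pattern of the kind described can be sketched like this; the regex and `extractMrr` name are illustrative assumptions, not the module's actual patterns:

```typescript
// Matches "MRR: $12.5k", "MRR $40,000", etc. Capture 1 is the number,
// capture 2 an optional k/m magnitude suffix.
const MRR_RE = /\bMRR[:\s]*\$?\s*([\d,]+(?:\.\d+)?)\s*([kKmM])?\b/;

function extractMrr(text: string): number | null {
  const m = MRR_RE.exec(text);
  if (!m) return null;
  let value = Number(m[1].replace(/,/g, "")); // strip thousands separators
  const suffix = (m[2] ?? "").toLowerCase();
  if (suffix === "k") value *= 1_000;
  if (suffix === "m") value *= 1_000_000;
  return value;
}
```

The dedup tolerance mentioned above would then compare values like these within a configurable percentage band before treating two emails as reporting the same figure.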

Registers data-research in manifest.json (25th skill) and RESOLVER.md.
Fixes backoff test robustness for high-load systems.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v0.10.0 infrastructure additions

CLAUDE.md: added 6 new core files (check-resolvable, backoff, fail-improve,
transcription, enrichment-service, data-research), 6 new test files, updated
skill count to 25, test file count to 34.

README.md: updated skill count to 25, added data-research to skills table.

CHANGELOG.md: added Infrastructure section documenting resolver validation,
doctor expansion, adaptive throttling, fail-improve loop, voice transcription,
enrichment service, and data-research skill.

TODOS.md: anonymized personal references.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: doctor.ts use ES module imports, harden backoff test

Replace require('fs') with ES module import in doctor.ts for consistency
with the rest of the file. Backoff test made resilient to parallel test
execution leaking module-level state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: sync --watch routing, dead_links parity, doctor command, embed --slugs

- Move sync to CLI_ONLY so --watch flag reaches runSync() (was routed through
  operation layer which only calls performSync single-pass)
- Hide sync_brain from CLI help (MCP still exposes it)
- Fix performFullSync missing sync state persistence (C1)
- Align Postgres dead_links query to match PGLite (count dangling links, not
  empty-content chunks) (C3)
- Fix doctor recommending nonexistent 'gbrain embed refresh' (C4)
- Refactor doctor outputResults to not call process.exit directly
- Add --slugs flag to embed for targeted page embedding
- Add sync auto-extract + auto-embed after performSync
- Add noExtract to SyncOpts
- Route extract, features, autopilot in CLI_ONLY
- Update help text with new commands

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: extract, features, and autopilot commands

- gbrain extract <links|timeline|all> — batch extraction of links and timeline
  entries from brain markdown files. Broad regex for all .md links (C7: filters
  external URLs). Frontmatter field parsing (company, investors, attendees).
  Directory-based link type inference. JSONL progress on stderr for agents.
  Sync integration hooks (extractLinksForSlugs, extractTimelineForSlugs).

- gbrain features [--json] [--auto-fix] — scan brain usage, pitch unused features
  with the user's own numbers. Priority 1 (data quality): missing embeddings,
  dead links. Priority 2 (unused features): zero links, zero timeline, low
  coverage, unconfigured integrations, no sync. Embedded recipe metadata for
  binary-safe integration detection. Persistence in ~/.gbrain/feature-offers.json.
  Doctor teaser hook. Upgrade hook.

- gbrain autopilot [--repo] [--interval N] — self-maintaining brain daemon.
  Pipeline: sync → extract → embed. Health-based adaptive scheduling
  (brain_score >= 90 doubles interval, < 70 halves it). --install/--uninstall
  for launchd (macOS) and crontab (Linux). Signal handling. Consecutive error
  tracking (stops at 5). Log to ~/.gbrain/autopilot.log.
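The `.md` link extraction with external-URL filtering that `gbrain extract` describes can be sketched as below (regex and name are illustrative, not the shipped implementation):

```typescript
// Broad match for markdown links whose target ends in .md.
const LINK_RE = /\[[^\]]*\]\(([^)]+\.md)\)/g;

function extractMarkdownLinks(markdown: string): string[] {
  const out: string[] = [];
  for (const m of markdown.matchAll(LINK_RE)) {
    const target = m[1];
    if (/^[a-z]+:\/\//i.test(target)) continue; // filter external URLs (C7)
    out.push(target);
  }
  return out;
}
```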

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: hook features scan into post-upgrade flow

After gbrain post-upgrade completes, automatically run gbrain features to show
the user what's new and what to fix. Best-effort (doesn't fail the upgrade).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: brain_score (0-100) in BrainHealth

Weighted composite score computed in getHealth() for both Postgres and PGLite:
  embed_coverage: 0.35, link_density: 0.25, timeline_coverage: 0.15,
  no_orphans: 0.15, no_dead_links: 0.10

Returns 0 for empty brains. Agents use brain_score as a health gate.
Autopilot uses it for adaptive scheduling (>=90 slows down, <70 speeds up).
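Using the weights listed above, the composite and the autopilot scheduling rule reduce to something like this sketch (`brainScore` and `nextIntervalMin` are hypothetical names; component values are assumed to be normalized to [0, 1]):

```typescript
const WEIGHTS = {
  embed_coverage: 0.35,
  link_density: 0.25,
  timeline_coverage: 0.15,
  no_orphans: 0.15,
  no_dead_links: 0.10,
} as const;

// Weighted composite on a 0-100 scale; empty brains score 0.
function brainScore(components: Record<keyof typeof WEIGHTS, number>, pageCount: number): number {
  if (pageCount === 0) return 0;
  let score = 0;
  for (const [key, weight] of Object.entries(WEIGHTS)) {
    score += weight * components[key as keyof typeof WEIGHTS];
  }
  return Math.round(score * 100);
}

// Autopilot's adaptive scheduling: healthy brains slow down, unhealthy speed up.
function nextIntervalMin(current: number, score: number): number {
  if (score >= 90) return current * 2;
  if (score < 70) return Math.max(1, Math.floor(current / 2));
  return current;
}
```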

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: extract and features unit tests

25 tests covering:
- extractMarkdownLinks: relative links, external URL filtering, edge cases
- extractLinksFromFile: slug resolution, frontmatter parsing, directory-based
  type inference (works_at, deal_for, invested_in)
- extractTimelineFromContent: bullet format, header format with detail,
  em/en dash handling, empty content
- features: module exports, brain_score calculation weights, CLI routing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: instruction layer for extract, features, autopilot

Agent-facing tools are invisible without instruction-layer coverage.
- RESOLVER.md: add routing for extract, features, autopilot
- maintain/SKILL.md: add link graph extraction, timeline extraction,
  autopilot check sections

Without these, agents reading skills/ will never discover or run the
new commands. This is the #1 DX finding from the devex review.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.10.1)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: sync CLAUDE.md with v0.10.1 additions

Add extract.ts, features.ts, autopilot.ts to key files.
Add extract.test.ts, features.test.ts to test list.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: adversarial review fixes — 7 issues

- #3: autopilot extract step was a no-op (imported but never called)
- #6: PGLite orphan_pages query aligned with Postgres (check both inbound+outbound)
- #8: embedPage throws instead of process.exit (was killing sync/autopilot)
- #9: dead-links set auto_fixable=false (needs repo path we may not have)
- #10: JSON auto-fix output was dead code (unreachable !jsonMode check)
- #14: autopilot lock file prevents concurrent instances
- #20: --dir without value no longer crashes extract

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* security: fix command injection + plaintext API key in daemon install

- #1: Crontab install used echo pipe with shell-interpolated values.
  Now uses a temp file via crontab(1) and single-quote escaping on all
  interpolated paths. No shell expansion possible.

- #2: OPENAI_API_KEY was baked as plaintext into the launchd plist
  (readable by any local process, backed up by Time Machine). Now uses
  a wrapper script (~/.gbrain/autopilot-run.sh) that sources ~/.zshrc
  at runtime. No secrets in plist or crontab.

- #16: extract.ts used a custom 20-line YAML parser that only handled
  single-line key:value pairs. Multi-line arrays (attendees list with
  - items) were silently ignored. Now uses the project's gray-matter
  parser via parseMarkdown() from src/core/markdown.ts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request Apr 20, 2026
…ntract

Two modules that together wire Cat 8 / Cat 9 / Cat 5 end-to-end scoring.

**eval/runner/judge.ts** — Haiku 4.5 via tool-use `score_answer`. Input is
the structured JudgeEvidence contract (fix #16 from the plan's codex
review): probe + final_answer_text + evidence_refs + tool_call_summary +
ground_truth_pages + rubric. Raw tool output NEVER reaches the judge —
that's the Section-3 defense against paraphrased prompt-injection payloads
in gold/poison.json.

Retry policy: one retry on malformed tool_use response. If the second
attempt is still malformed, score the probe as `judge_failed` (all scores
0, verdict=fail) so the run still completes.

Aggregation: weighted mean across rubric criteria. Canonical thresholds
(pass ≥ 3.5, partial ≥ 2.5 and < 3.5, fail < 2.5) — the judge can propose a
verdict, but the verdict computed from the weighted mean is what the
scorecard records. This prevents the model from inflating or deflating its
own verdict.

Score values are clamped to 0-5 on parse even if the model returns out of
range. `assertNoRawToolOutput(evidence)` is a regression guard that
returns the list of forbidden fields (tool_result, raw_transcript, etc.)
if any leak into the evidence contract.
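The clamp-then-threshold verdict logic described above amounts to a few lines; this is a sketch with illustrative names, not the runner's actual code:

```typescript
type Criterion = { score: number; weight: number };

// The model may return out-of-range scores; clamp to the 0-5 rubric scale.
const clamp = (x: number): number => Math.min(5, Math.max(0, x));

// Verdict comes from the weighted mean, never from the judge's own proposal.
function verdict(criteria: Criterion[]): "pass" | "partial" | "fail" {
  const totalWeight = criteria.reduce((s, c) => s + c.weight, 0);
  const mean = criteria.reduce((s, c) => s + clamp(c.score) * c.weight, 0) / totalWeight;
  if (mean >= 3.5) return "pass";
  if (mean >= 2.5) return "partial";
  return "fail";
}
```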

**eval/runner/adapters/claude-sonnet-with-tools.ts** — The agent adapter.
Implements `Adapter` interface minimally: `init()` spins up PGLite and
seeds it, `query()` throws because the adapter is Cat 8/9-only and emits
a final-answer text, not a RankedDoc[]. Retrieval scorecard stays at 4
adapters.

`runAgentLoop(probeId, text, state, config)` drives the multi-turn loop:
Sonnet → tool_use → tool-bridge.executeTool → tool_result → back to
Sonnet. Turn cap 10. max_tokens 1024. System prompt (brain-first iron
law, citation format, amara context) is cached via cache_control.
Exponential backoff on rate-limit errors (1s, 2s, 4s).

Emits a `Transcript` per eval/schemas/transcript.schema.json — consumed
directly by recorder.ts for the flight-recorder bundle.

`brain_first_ordering` classifies Cat 8's flagship metric: did the agent
call search/get_page BEFORE producing the final answer? The `no_brain_calls`
case (agent answers from general knowledge without ever hitting the brain)
is the compliance failure to surface.
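Since the final answer ends the run, every tool call precedes it, so the flagship classification collapses to a set-membership check. A simplified two-outcome sketch (the tool names come from the message above; the function shape is assumed):

```typescript
const BRAIN_TOOLS = new Set(["search", "get_page"]);

type Classification = "brain_first" | "no_brain_calls";

// Did the agent hit the brain at all before producing its final answer?
function brainFirstOrdering(toolCallsBeforeAnswer: string[]): Classification {
  return toolCallsBeforeAnswer.some((t) => BRAIN_TOOLS.has(t))
    ? "brain_first"
    : "no_brain_calls"; // answered from general knowledge: compliance failure
}
```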

ForbiddenOpError + UnknownToolError from the bridge are caught in the
agent loop and surfaced as tool_result with is_error=true — keeps the
loop going and preserves full audit trail for the judge.

**Tests (35 new):** judge (23) — happy path, retry, fallback, evidence
contract sanitization, rendered prompt does not contain raw tool_result
text, verdict thresholds, score clamping, weighted mean with mixed
weights, parseToolUse rejects malformed input. agent-adapter (12) —
Adapter.query() throws, init() seeds PGLite, end-to-end tool loop with
stubbed Sonnet, turn cap exhaustion, mutating-op rejection surfaces as
tool_result error, extractSlugs regex.

All 12 agent tests take ~23s because PGLite runs 13 schema migrations per
test; the alternative of shared-engine-across-tests was rejected so each
test is isolated.

Total eval suite now: 167 pass, 0 fail.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TFITZ57 added a commit to TFITZ57/gbrain that referenced this pull request Apr 23, 2026
* feat: GStackBrain — 16 new skills, resolver, conventions, identity layer (v0.10.0) (#120)

* feat: migrate 8 existing skills to conformance format

Add YAML frontmatter (name, version, description, triggers, tools, mutating),
Contract, Anti-Patterns, and Output Format sections to all existing skills.
Rename Workflow to Phases. Ingest becomes thin router delegating to specialized
ingestion skills (Phase 2).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add RESOLVER.md, conventions directory, and output rules

RESOLVER.md is the skill dispatcher modeled on Wintermute's AGENTS.md.
Categorized routing table: Always-on, Brain ops, Ingestion, Thinking,
Operational, Setup, Identity. Conventions directory extracts cross-cutting
rules (quality, brain-first lookup, model routing, test-before-bulk).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add skills conformance and resolver validation tests

skills-conformance.test.ts validates every skill has YAML frontmatter with
required fields, Contract, Anti-Patterns, and Output Format sections, and
manifest.json coverage. resolver.test.ts validates routing table categories,
skill path existence, and manifest-to-resolver coverage. 50 new tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add 9 brain skills from Wintermute (Phase 2)

Generalized from Wintermute's battle-tested skills:
- signal-detector: always-on idea+entity capture on every message
- brain-ops: brain-first lookup, read-enrich-write loop, source attribution
- idea-ingest: links/articles/tweets with author people page mandatory
- media-ingest: video/audio/PDF/book with entity extraction (absorbs video/youtube/book)
- meeting-ingestion: transcripts with attendee enrichment chaining
- citation-fixer: audit and fix citation formatting
- repo-architecture: filing rules by primary subject
- skill-creator: create skills with conformance standard + MECE check
- daily-task-manager: task lifecycle with priority levels

All Garry-specific references generalized. Core workflows preserved.
Updated RESOLVER.md and manifest.json.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add operational infrastructure + identity layer (Phase 3)

Operational skills:
- daily-task-prep: morning prep with calendar context and open threads
- cross-modal-review: quality gate via second model with refusal routing
- cron-scheduler: schedule staggering, quiet hours, wake-up override, idempotency
- reports: timestamped reports with keyword routing
- testing: skill validation framework (conformance checks)
- soul-audit: 6-phase interview generating SOUL.md, USER.md, ACCESS_POLICY.md, HEARTBEAT.md
- webhook-transforms: external events to brain signals with dead-letter queue

Identity layer:
- SOUL.md template (agent identity, generated by soul-audit)
- USER.md template (user profile, generated by soul-audit)
- ACCESS_POLICY.md template (4-tier access control)
- HEARTBEAT.md template (operational cadence)
- cross-modal.yaml convention (review pairs, refusal routing chain)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update CLAUDE.md with 24 skills, RESOLVER.md, conventions, templates

GBrain is now a GStack mod for agent platforms. Updated architecture description,
key files listing (16 new skill files, RESOLVER.md, conventions, templates), skills
section (24 skills organized by resolver categories), and testing section (new
conformance and resolver tests).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add GStack detection + mod status to gbrain init (Phase 4)

After brain initialization, gbrain init now reports:
- Number of skills loaded (from manifest.json)
- GStack detection (checks known host paths, uses gstack-global-discover if available)
- GStack install instructions if not found
- Resolver and soul-audit pointers

Also adds installDefaultTemplates() for SOUL.md/USER.md/ACCESS_POLICY.md/HEARTBEAT.md
deployment, and detectGStack() using gstack-global-discover with fallback to known paths
(DRY: doesn't reimplement GStack's host detection logic).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: v0.10.0 release documentation

- CHANGELOG: 24 skills, signal detector, RESOLVER.md, soul-audit, access control,
  conventions, conformance standard, GStack detection in init
- README: updated skill section with 24 skills, resolver, conventions
- TODOS: added runtime MCP access control (P1)
- VERSION: 0.9.2 → 0.10.0
- package.json + manifest.json version bumped

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add skill table to CHANGELOG v0.10.0

16-row table detailing every new skill, what it does, and why it matters.
Written to sell the upgrade, not document the implementation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: restore package.json version after merge conflict resolution

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: zero-based README rewrite for GStackBrain v0.10.0

Lead with GStack mod identity. 24 skills table organized by category.
Install block references RESOLVER.md and soul-audit. GBrain+GStack
relationship explained. Removed redundancy (733 -> 406 lines).
All essential content preserved: install, recipes, architecture,
search, commands, engines, voice, knowledge model.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: extract install block to INSTALL_FOR_AGENTS.md, simplify README

The 30-line copy-paste install block becomes one line:
"Retrieve and follow INSTALL_FOR_AGENTS.md"

Benefits: agent always gets latest instructions (no stale copy-paste),
README stays clean, install details live where agents read them.

README now leads with what GBrain does ("gives your agent a brain")
instead of GStack relationship. Removed "requires frontier model" note.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: 3 bugs in init.ts from merge conflict resolution

1. llstatSync typo (merge corruption) → lstatSync
2. __dirname undefined in ESM module → fileURLToPath polyfill
3. require('fs') in ESM → use imported readFileSync

All three would crash gbrain init at runtime. Caught by /review.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add checkResolvable shared core function for resolver validation

Shared function at src/core/check-resolvable.ts validates that all skills
are reachable from RESOLVER.md, detects MECE overlaps (with whitelist for
always-on/router skills), finds gaps in frontmatter triggers, and scans
for DRY violations. Returns structured ResolvableIssue objects with
machine-parseable fix objects alongside human-readable action strings.

Three call sites: bun test, gbrain doctor, skill-creator skill.

Cleans up test/resolver.test.ts: removes stale 9-line skip list, imports
from production check-resolvable.ts instead of reimplementing parsing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: expand doctor with resolver validation, filesystem-first architecture

Doctor now runs filesystem checks (resolver health, skill conformance) before
connecting to DB. New --fast flag skips DB checks. Falls back to filesystem-only
when DB is unavailable. Adds schema_version: 2 to JSON output, composite health
score (0-100), and structured issues array with action strings for agent parsing.

Resolver health check calls checkResolvable() and surfaces actionable fix
instructions. Link integrity check uses engine.getHealth() dead_links count.

CLI routing split: doctor dispatched before connectEngine() so filesystem
checks always run. Fixes Codex-identified blocker where doctor required DB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add adaptive load-aware throttling and fail-improve loop

backoff.ts: System load checking (CPU via os.loadavg, memory via os.freemem),
exponential backoff with 20-attempt max guard, active hours multiplier (2x
slower during waking hours), concurrent process limit (max 2). Windows-safe:
defaults to "proceed" when os.loadavg returns zeros.

fail-improve.ts: Deterministic-first, LLM-fallback pattern with JSONL failure
logging. Cascade failure handling: when both paths fail, throws LLM error and
logs both. Log rotation at 1000 entries. Call count tracking for deterministic
hit rate metrics. Auto-generates test cases from successful LLM fallbacks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add transcription service and enrichment-as-a-service

transcription.ts: Groq Whisper (default) with OpenAI fallback. Files >25MB
segmented via ffmpeg. Provider auto-detection from env vars. Clear error
messages for missing API keys and unsupported formats.

enrichment-service.ts: Global enrichment service callable from any ingest
pathway. Entity slug generation (people/jane-doe, companies/acme-corp),
mention counting via searchKeyword, tier auto-escalation (Tier 3→2→1 based
on mention frequency and source diversity), batch enrichment with backoff
throttling, regex-based entity extraction from text.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add data-research skill with recipe system, extraction, dedup, tracker

New skill: data-research — one parameterized pipeline for any email-to-
structured-data workflow (investor updates, donations, company metrics).
7-phase pipeline: define recipe, search, classify, extract (with extraction
integrity rule), archive, deduplicate, update tracker.

data-research.ts: Recipe validation, MRR/ARR/runway/headcount regex
extraction (battle-tested patterns), dedup with configurable tolerance,
markdown tracker parsing/appending, quarterly/monthly date windowing,
6-phase HTML email stripping with 500KB ReDoS cap.

Registers data-research in manifest.json (25th skill) and RESOLVER.md.
Fixes backoff test robustness for high-load systems.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
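
The metric-extraction idea can be sketched as below. This is an illustrative pattern only; the battle-tested regexes in data-research.ts are not reproduced here, and the exact forms they accept are assumptions.

```typescript
// Illustrative pattern only; the battle-tested regexes are not reproduced here.
function extractMrr(text: string): number | null {
  // Matches forms like "MRR: $12.5k" or "MRR of $1,200,000".
  const m = text.match(/MRR[^$\d]{0,15}\$?\s?([\d,]+(?:\.\d+)?)\s*([kKmM])?/);
  if (!m) return null;
  let value = parseFloat(m[1].replace(/,/g, ""));
  if (m[2]) value *= m[2].toLowerCase() === "k" ? 1_000 : 1_000_000;
  return value;
}
```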

* docs: update project documentation for v0.10.0 infrastructure additions

CLAUDE.md: added 6 new core files (check-resolvable, backoff, fail-improve,
transcription, enrichment-service, data-research), 6 new test files, updated
skill count to 25, test file count to 34.

README.md: updated skill count to 25, added data-research to skills table.

CHANGELOG.md: added Infrastructure section documenting resolver validation,
doctor expansion, adaptive throttling, fail-improve loop, voice transcription,
enrichment service, and data-research skill.

TODOS.md: anonymized personal references.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: doctor.ts use ES module imports, harden backoff test

Replace require('fs') with ES module import in doctor.ts for consistency
with the rest of the file. Backoff test made resilient to parallel test
execution leaking module-level state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: README rewrite with production brain stats, sample output, new infrastructure

Lead with the flex: 17,888 pages, 4,383 people, 723 companies, 526 meeting
transcripts built in 12 days. Show sample query output so readers see what
they'll get. Document self-improving infrastructure (tier auto-escalation,
fail-improve loop, doctor trajectory). Add data-research recipes to Getting
Data In. Update commands section with doctor --fix, transcribe, research
init/list. Fix stale "24" references to "25".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: README lead with YC President origin and production agent deployments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: README lead with skill philosophy and link to Thin Harness Fat Skills

Skills section now explains: skill files are code, they encode entire
workflows, they call deterministic TypeScript for the parts that shouldn't
be LLM judgment. Links to the tweet and the architecture essay.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: link GStack repo, add 70K stars and 30K daily users

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: remove meeting transcript count from README (sensitive)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: README lead with YC President origin and production agent deployments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: rename political-donations recipe to expense-tracker (sensitivity)

Renamed the built-in data-research recipe from political-donations to
expense-tracker across README, CHANGELOG, SKILL.md, and reports routing.
Same extraction patterns (amounts, dates, recipients), neutral framing.
Also renamed social-radar keyword route to social-mentions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: sync pipeline, extract, features, autopilot (v0.10.1) (#129)

* feat: migrate 8 existing skills to conformance format

Add YAML frontmatter (name, version, description, triggers, tools, mutating),
Contract, Anti-Patterns, and Output Format sections to all existing skills.
Rename Workflow to Phases. Ingest becomes thin router delegating to specialized
ingestion skills (Phase 2).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add RESOLVER.md, conventions directory, and output rules

RESOLVER.md is the skill dispatcher modeled on Wintermute's AGENTS.md.
Categorized routing table: Always-on, Brain ops, Ingestion, Thinking,
Operational, Setup, Identity. Conventions directory extracts cross-cutting
rules (quality, brain-first lookup, model routing, test-before-bulk).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add skills conformance and resolver validation tests

skills-conformance.test.ts validates every skill has YAML frontmatter with
required fields, Contract, Anti-Patterns, and Output Format sections, and
manifest.json coverage. resolver.test.ts validates routing table categories,
skill path existence, and manifest-to-resolver coverage. 50 new tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add 9 brain skills from Wintermute (Phase 2)

Generalized from Wintermute's battle-tested skills:
- signal-detector: always-on idea+entity capture on every message
- brain-ops: brain-first lookup, read-enrich-write loop, source attribution
- idea-ingest: links/articles/tweets with author people page mandatory
- media-ingest: video/audio/PDF/book with entity extraction (absorbs video/youtube/book)
- meeting-ingestion: transcripts with attendee enrichment chaining
- citation-fixer: audit and fix citation formatting
- repo-architecture: filing rules by primary subject
- skill-creator: create skills with conformance standard + MECE check
- daily-task-manager: task lifecycle with priority levels

All Garry-specific references generalized. Core workflows preserved.
Updated RESOLVER.md and manifest.json.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add operational infrastructure + identity layer (Phase 3)

Operational skills:
- daily-task-prep: morning prep with calendar context and open threads
- cross-modal-review: quality gate via second model with refusal routing
- cron-scheduler: schedule staggering, quiet hours, wake-up override, idempotency
- reports: timestamped reports with keyword routing
- testing: skill validation framework (conformance checks)
- soul-audit: 6-phase interview generating SOUL.md, USER.md, ACCESS_POLICY.md, HEARTBEAT.md
- webhook-transforms: external events to brain signals with dead-letter queue

Identity layer:
- SOUL.md template (agent identity, generated by soul-audit)
- USER.md template (user profile, generated by soul-audit)
- ACCESS_POLICY.md template (4-tier access control)
- HEARTBEAT.md template (operational cadence)
- cross-modal.yaml convention (review pairs, refusal routing chain)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update CLAUDE.md with 24 skills, RESOLVER.md, conventions, templates

GBrain is now a GStack mod for agent platforms. Updated architecture description,
key files listing (16 new skill files, RESOLVER.md, conventions, templates), skills
section (24 skills organized by resolver categories), and testing section (new
conformance and resolver tests).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add GStack detection + mod status to gbrain init (Phase 4)

After brain initialization, gbrain init now reports:
- Number of skills loaded (from manifest.json)
- GStack detection (checks known host paths, uses gstack-global-discover if available)
- GStack install instructions if not found
- Resolver and soul-audit pointers

Also adds installDefaultTemplates() for SOUL.md/USER.md/ACCESS_POLICY.md/HEARTBEAT.md
deployment, and detectGStack() using gstack-global-discover with fallback to known paths
(DRY: doesn't reimplement GStack's host detection logic).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: v0.10.0 release documentation

- CHANGELOG: 24 skills, signal detector, RESOLVER.md, soul-audit, access control,
  conventions, conformance standard, GStack detection in init
- README: updated skill section with 24 skills, resolver, conventions
- TODOS: added runtime MCP access control (P1)
- VERSION: 0.9.2 → 0.10.0
- package.json + manifest.json version bumped

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add skill table to CHANGELOG v0.10.0

16-row table detailing every new skill, what it does, and why it matters.
Written to sell the upgrade, not document the implementation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: restore package.json version after merge conflict resolution

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: zero-based README rewrite for GStackBrain v0.10.0

Lead with GStack mod identity. 24 skills table organized by category.
Install block references RESOLVER.md and soul-audit. GBrain+GStack
relationship explained. Removed redundancy (733 -> 406 lines).
All essential content preserved: install, recipes, architecture,
search, commands, engines, voice, knowledge model.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: extract install block to INSTALL_FOR_AGENTS.md, simplify README

The 30-line copy-paste install block becomes one line:
"Retrieve and follow INSTALL_FOR_AGENTS.md"

Benefits: agent always gets latest instructions (no stale copy-paste),
README stays clean, install details live where agents read them.

README now leads with what GBrain does ("gives your agent a brain")
instead of GStack relationship. Removed "requires frontier model" note.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: 3 bugs in init.ts from merge conflict resolution

1. llstatSync typo (merge corruption) → lstatSync
2. __dirname undefined in ESM module → fileURLToPath polyfill
3. require('fs') in ESM → use imported readFileSync

All three would crash gbrain init at runtime. Caught by /review.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add checkResolvable shared core function for resolver validation

Shared function at src/core/check-resolvable.ts validates that all skills
are reachable from RESOLVER.md, detects MECE overlaps (with whitelist for
always-on/router skills), finds gaps in frontmatter triggers, and scans
for DRY violations. Returns structured ResolvableIssue objects with
machine-parseable fix objects alongside human-readable action strings.

Three call sites: bun test, gbrain doctor, skill-creator skill.

Cleans up test/resolver.test.ts: removes stale 9-line skip list, imports
from production check-resolvable.ts instead of reimplementing parsing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: expand doctor with resolver validation, filesystem-first architecture

Doctor now runs filesystem checks (resolver health, skill conformance) before
connecting to DB. New --fast flag skips DB checks. Falls back to filesystem-only
when DB is unavailable. Adds schema_version: 2 to JSON output, composite health
score (0-100), and structured issues array with action strings for agent parsing.

Resolver health check calls checkResolvable() and surfaces actionable fix
instructions. Link integrity check uses engine.getHealth() dead_links count.

CLI routing split: doctor dispatched before connectEngine() so filesystem
checks always run. Fixes Codex-identified blocker where doctor required DB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add adaptive load-aware throttling and fail-improve loop

backoff.ts: System load checking (CPU via os.loadavg, memory via os.freemem),
exponential backoff with 20-attempt max guard, active hours multiplier (2x
slower during waking hours), concurrent process limit (max 2). Windows-safe:
defaults to "proceed" when os.loadavg returns zeros.

fail-improve.ts: Deterministic-first, LLM-fallback pattern with JSONL failure
logging. Cascade failure handling: when both paths fail, throws LLM error and
logs both. Log rotation at 1000 entries. Call count tracking for deterministic
hit rate metrics. Auto-generates test cases from successful LLM fallbacks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add transcription service and enrichment-as-a-service

transcription.ts: Groq Whisper (default) with OpenAI fallback. Files >25MB
segmented via ffmpeg. Provider auto-detection from env vars. Clear error
messages for missing API keys and unsupported formats.

enrichment-service.ts: Global enrichment service callable from any ingest
pathway. Entity slug generation (people/jane-doe, companies/acme-corp),
mention counting via searchKeyword, tier auto-escalation (Tier 3→2→1 based
on mention frequency and source diversity), batch enrichment with backoff
throttling, regex-based entity extraction from text.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add data-research skill with recipe system, extraction, dedup, tracker

New skill: data-research — one parameterized pipeline for any email-to-
structured-data workflow (investor updates, donations, company metrics).
7-phase pipeline: define recipe, search, classify, extract (with extraction
integrity rule), archive, deduplicate, update tracker.

data-research.ts: Recipe validation, MRR/ARR/runway/headcount regex
extraction (battle-tested patterns), dedup with configurable tolerance,
markdown tracker parsing/appending, quarterly/monthly date windowing,
6-phase HTML email stripping with 500KB ReDoS cap.

Registers data-research in manifest.json (25th skill) and RESOLVER.md.
Fixes backoff test robustness for high-load systems.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v0.10.0 infrastructure additions

CLAUDE.md: added 6 new core files (check-resolvable, backoff, fail-improve,
transcription, enrichment-service, data-research), 6 new test files, updated
skill count to 25, test file count to 34.

README.md: updated skill count to 25, added data-research to skills table.

CHANGELOG.md: added Infrastructure section documenting resolver validation,
doctor expansion, adaptive throttling, fail-improve loop, voice transcription,
enrichment service, and data-research skill.

TODOS.md: anonymized personal references.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: doctor.ts use ES module imports, harden backoff test

Replace require('fs') with ES module import in doctor.ts for consistency
with the rest of the file. Backoff test made resilient to parallel test
execution leaking module-level state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: sync --watch routing, dead_links parity, doctor command, embed --slugs

- Move sync to CLI_ONLY so --watch flag reaches runSync() (was routed through
  operation layer which only calls performSync single-pass)
- Hide sync_brain from CLI help (MCP still exposes it)
- Fix performFullSync missing sync state persistence (C1)
- Align Postgres dead_links query to match PGLite (count dangling links, not
  empty-content chunks) (C3)
- Fix doctor recommending nonexistent 'gbrain embed refresh' (C4)
- Refactor doctor outputResults to not call process.exit directly
- Add --slugs flag to embed for targeted page embedding
- Add sync auto-extract + auto-embed after performSync
- Add noExtract to SyncOpts
- Route extract, features, autopilot in CLI_ONLY
- Update help text with new commands

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: extract, features, and autopilot commands

- gbrain extract <links|timeline|all> — batch extraction of links and timeline
  entries from brain markdown files. Broad regex for all .md links (C7: filters
  external URLs). Frontmatter field parsing (company, investors, attendees).
  Directory-based link type inference. JSONL progress on stderr for agents.
  Sync integration hooks (extractLinksForSlugs, extractTimelineForSlugs).

- gbrain features [--json] [--auto-fix] — scan brain usage, pitch unused features
  with the user's own numbers. Priority 1 (data quality): missing embeddings,
  dead links. Priority 2 (unused features): zero links, zero timeline, low
  coverage, unconfigured integrations, no sync. Embedded recipe metadata for
  binary-safe integration detection. Persistence in ~/.gbrain/feature-offers.json.
  Doctor teaser hook. Upgrade hook.

- gbrain autopilot [--repo] [--interval N] — self-maintaining brain daemon.
  Pipeline: sync → extract → embed. Health-based adaptive scheduling
  (brain_score >= 90 doubles interval, < 70 halves it). --install/--uninstall
  for launchd (macOS) and crontab (Linux). Signal handling. Consecutive error
  tracking (stops at 5). Log to ~/.gbrain/autopilot.log.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
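
The health-based adaptive scheduling rule can be sketched as below. Only the >=90 doubling and <70 halving come from the autopilot description; the function name and the one-minute floor are assumptions.

```typescript
// Illustrative adaptive scheduler; the 60s floor is an assumption.
function nextIntervalMs(baseMs: number, brainScore: number): number {
  if (brainScore >= 90) return baseMs * 2;                  // healthy: run half as often
  if (brainScore < 70) return Math.max(baseMs / 2, 60_000); // unhealthy: run twice as often
  return baseMs;
}
```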

* feat: hook features scan into post-upgrade flow

After gbrain post-upgrade completes, automatically run gbrain features to show
the user what's new and what to fix. Best-effort (doesn't fail the upgrade).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: brain_score (0-100) in BrainHealth

Weighted composite score computed in getHealth() for both Postgres and PGLite:
  embed_coverage: 0.35, link_density: 0.25, timeline_coverage: 0.15,
  no_orphans: 0.15, no_dead_links: 0.10

Returns 0 for empty brains. Agents use brain_score as a health gate.
Autopilot uses it for adaptive scheduling (>=90 slows down, <70 speeds up).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
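
The weighted composite can be sketched directly from the listed weights. The weights match the commit message; the metric field names and 0..1 ratio inputs are assumptions about the real getHealth() internals.

```typescript
// Illustrative brain_score composite; field names are assumptions.
// Weights per the commit: embed 0.35, links 0.25, timeline 0.15,
// orphans 0.15, dead links 0.10. Each input is a 0..1 ratio.
interface BrainMetrics {
  embedCoverage: number;
  linkDensity: number;
  timelineCoverage: number;
  noOrphans: number;
  noDeadLinks: number;
  pageCount: number;
}

function brainScore(m: BrainMetrics): number {
  if (m.pageCount === 0) return 0; // empty brains score 0
  const s =
    0.35 * m.embedCoverage +
    0.25 * m.linkDensity +
    0.15 * m.timelineCoverage +
    0.15 * m.noOrphans +
    0.10 * m.noDeadLinks;
  return Math.round(s * 100);
}
```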

* test: extract and features unit tests

25 tests covering:
- extractMarkdownLinks: relative links, external URL filtering, edge cases
- extractLinksFromFile: slug resolution, frontmatter parsing, directory-based
  type inference (works_at, deal_for, invested_in)
- extractTimelineFromContent: bullet format, header format with detail,
  em/en dash handling, empty content
- features: module exports, brain_score calculation weights, CLI routing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: instruction layer for extract, features, autopilot

Agent-facing tools are invisible without instruction-layer coverage.
- RESOLVER.md: add routing for extract, features, autopilot
- maintain/SKILL.md: add link graph extraction, timeline extraction,
  autopilot check sections

Without these, agents reading skills/ will never discover or run the
new commands. This is the #1 DX finding from the devex review.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.10.1)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: sync CLAUDE.md with v0.10.1 additions

Add extract.ts, features.ts, autopilot.ts to key files.
Add extract.test.ts, features.test.ts to test list.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: adversarial review fixes — 7 issues

- #3: autopilot extract step was a no-op (imported but never called)
- #6: PGLite orphan_pages query aligned with Postgres (check both inbound+outbound)
- #8: embedPage throws instead of process.exit (was killing sync/autopilot)
- #9: dead-links set auto_fixable=false (needs repo path we may not have)
- #10: JSON auto-fix output was dead code (unreachable !jsonMode check)
- #14: autopilot lock file prevents concurrent instances
- #20: --dir without value no longer crashes extract

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* security: fix command injection + plaintext API key in daemon install

- #1: Crontab install used echo pipe with shell-interpolated values.
  Now uses a temp file via crontab(1) and single-quote escaping on all
  interpolated paths. No shell expansion possible.

- #2: OPENAI_API_KEY was baked as plaintext into the launchd plist
  (readable by any local process, backed up by Time Machine). Now uses
  a wrapper script (~/.gbrain/autopilot-run.sh) that sources ~/.zshrc
  at runtime. No secrets in plist or crontab.

- #16: extract.ts used a custom 20-line YAML parser that only handled
  single-line key:value pairs. Multi-line arrays (attendees list with
  - items) were silently ignored. Now uses the project's gray-matter
  parser via parseMarkdown() from src/core/markdown.ts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
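
The single-quote escaping in #1 is the standard POSIX technique; a minimal sketch (shellQuote is a hypothetical name, not the autopilot install code):

```typescript
// Standard POSIX single-quote escaping; shellQuote is a hypothetical name.
// Inside single quotes the shell expands nothing, so each embedded quote
// becomes '\'' (close quote, escaped literal quote, reopen quote).
function shellQuote(value: string): string {
  return `'${value.replace(/'/g, `'\\''`)}'`;
}
```

A value like `$(rm -rf ~)` comes out as `'$(rm -rf ~)'`, which the shell treats as a literal string with no expansion.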

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* security: fix wave 3 — 9 vulns (file_upload, SSRF, recipe trust, prompt injection) (#174)

* feat(engine): add cap parameter to clampSearchLimit (H6)

clampSearchLimit(limit, defaultLimit, cap = MAX_SEARCH_LIMIT) — third arg
is a caller-specified cap so operation handlers can enforce limits below
MAX_SEARCH_LIMIT. Backward compatible: existing two-arg callers still cap
at MAX_SEARCH_LIMIT.

This fixes a Codex-caught semantics bug: the prior signature took (limit,
defaultLimit) where the second arg was misread as a cap. clampSearchLimit(x, 20)
was actually allowing values up to 100, not 20.
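
The corrected semantics can be sketched as below. The three-arg signature comes from the commit; MAX_SEARCH_LIMIT = 100 and the undefined/non-finite handling are inferred from the bug description, not confirmed implementation details.

```typescript
// Illustrative; MAX_SEARCH_LIMIT = 100 is inferred from the bug description.
const MAX_SEARCH_LIMIT = 100;

function clampSearchLimit(
  limit: number | undefined,
  defaultLimit: number,
  cap: number = MAX_SEARCH_LIMIT,
): number {
  if (limit === undefined || !Number.isFinite(limit) || limit <= 0) return defaultLimit;
  return Math.min(Math.floor(limit), cap);
}
```

Under the old two-arg reading, clampSearchLimit(500, 20) still allows 100; a caller that wants a stricter bound now passes it explicitly as the third argument.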

* feat(integrations): SSRF defense + recipe trust boundary (B1, B2, Fix 2, Fix 4, B3, B4)

- B1: split loadAllRecipes into trusted (package-bundled) and untrusted
  (cwd/recipes, $GBRAIN_RECIPES_DIR) tiers. Only package-bundled recipes
  get embedded=true. Closes the fake trust boundary that let any cwd-local
  recipe bypass health-check gates.
- B2: hard-block string health_checks for non-embedded recipes (was previously
  only blocked when isUnsafeHealthCheck regex matched, which the cwd recipe
  exploit bypassed). Embedded recipes still get the regex defense.
- Fix 2: gate command DSL health_checks on isEmbedded. Non-embedded
  recipes cannot spawnSync.
- Fix 4 + B3 + B4: gate http DSL health_checks on isEmbedded; for embedded
  recipes, validate URLs via new isInternalUrl() before fetch:
  - Scheme allowlist (http/https only): blocks file:, data:, blob:, ftp:, javascript:
  - IPv4 range check covering hex/octal/decimal/single-integer bypass forms
  - IPv6 loopback ::1 + IPv4-mapped ::ffff: (canonicalized hex hextets handled)
  - Metadata hostnames (AWS, GCP, instance-data) blocked
  - fetch with redirect: 'manual' + per-hop re-validation up to 3 hops

Original PRs #105-109 by @garagon. Wave 3 collector branch reimplemented
the fixes after Codex outside-voice review found that PRs #106/#108 alone
did not actually gate cwd-local recipes (B1) and that PR #108 missed
redirect-following SSRF (B3) and non-http schemes (B4).
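
The hex/octal/decimal/single-integer octet normalization can be sketched as below. This is a simplified illustration: the production isInternalUrl also covers IPv6 loopback, IPv4-mapped ::ffff: forms, partial dotted notation, and metadata hostnames, none of which are handled here.

```typescript
// Simplified illustration of octet normalization; the real isInternalUrl
// also handles IPv6, ::ffff: mapping, and metadata hostnames.
function isPrivateIpv4(host: string): boolean {
  let n: number | null = null;
  if (/^\d+$/.test(host)) {
    n = Number(host); // single-integer bypass form: 2130706433 == 127.0.0.1
  } else {
    const parts = host.split(".");
    if (parts.length === 4) {
      const octets = parts.map((p) =>
        /^0x[0-9a-f]+$/i.test(p) ? parseInt(p, 16) // hex octet: 0x7f
        : /^0[0-7]+$/.test(p)    ? parseInt(p, 8)  // octal octet: 0177
        : /^\d+$/.test(p)        ? parseInt(p, 10)
        : NaN,
      );
      if (octets.every((o) => Number.isInteger(o) && o <= 255)) {
        n = ((octets[0] << 24) | (octets[1] << 16) | (octets[2] << 8) | octets[3]) >>> 0;
      }
    }
  }
  if (n === null || n > 0xffffffff) return false; // not an IPv4 this sketch handles
  const v = n;
  const inRange = (base: number, bits: number) =>
    (v >>> (32 - bits)) === (base >>> (32 - bits));
  return (
    inRange(0x7f000000, 8) ||  // 127.0.0.0/8 loopback
    inRange(0x0a000000, 8) ||  // 10.0.0.0/8
    inRange(0xac100000, 12) || // 172.16.0.0/12
    inRange(0xc0a80000, 16) || // 192.168.0.0/16
    inRange(0xa9fe0000, 16)    // 169.254.0.0/16 (cloud metadata lives here)
  );
}
```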

* feat(file_upload): path/slug/filename validation + remote-caller confinement (Fix 1, B5, H5, M4, Fix 5)

- Fix 1 + B5 + H1: validateUploadPath uses realpathSync + path.relative
  to defeat symlink-parent traversal. lstatSync alone (the original PR #105
  approach) only catches final-component symlinks; a symlinked parent dir
  was still followed to /etc/passwd. Now the entire path chain is resolved.
- H5: validatePageSlug uses an allowlist regex (alphanumeric + hyphens,
  slash-separated segments). Closes URL-encoded traversal (%2e%2e%2f),
  Unicode lookalikes, backslashes, control chars implicitly.
- M4: validateFilename allowlist regex. Rejects control chars, backslash,
  RTL override (\u202E), leading dot/dash. Filename flows into storage_path
  so this matters for every storage backend.
- Fix 5: clamp list_pages and get_ingest_log limits at the operation layer
  via new clampSearchLimit cap parameter (list_pages caps at 100,
  get_ingest_log at 50). Internal bulk commands bypass the operation
  layer and remain uncapped.
- New OperationContext.remote flag distinguishes trusted local CLI from
  untrusted MCP callers. file_upload uses strict cwd confinement when
  remote=true (default), loose mode when remote=false (CLI). MCP stdio
  server sets remote=true; cli.ts and handleToolCall (gbrain call) set
  remote=false.

Original PR #105 by @garagon. Issue #139 reported by @Hybirdss.

* feat(search): query sanitization + structural prompt boundary (Fix 3, M1, M2, M3)

- M1: restructure callHaikuForExpansion to use a system message that declares
  the user query as untrusted data, plus an XML-tagged <user_query> boundary
  in the user message. Layered defense with the existing tool_choice constraint
  (3 layers vs 1).
- Fix 3 (regex sanitizer, defense-in-depth): sanitizeQueryForPrompt strips
  triple-backtick code fences, XML/HTML tags, leading injection prefixes,
  and caps at 500 chars. Original query is still used for downstream search;
  only the LLM-facing copy is sanitized.
- M2: sanitizeExpansionOutput validates the model's alternative_queries array
  before it flows into search. Strips control chars, caps length, dedupes
  case-insensitively, drops empty/non-string items, caps to 2 items.
- M3: console.warn on stripped content NEVER logs the query text — privacy-safe
  debug signal only.

Original PR #107 by @garagon. M1/M2/M3 are wave 3 hardening per Codex review.
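
The Fix 3 sanitizer can be sketched as below. Only the fence/tag stripping, injection-prefix removal, and 500-char cap come from the commit; the exact regexes are assumptions.

```typescript
// Illustrative; the exact regexes in the real sanitizer are assumptions.
function sanitizeQueryForPrompt(query: string): string {
  return query
    .replace(/`{3}[\s\S]*?`{3}/g, " ") // strip triple-backtick code fences
    .replace(/<[^>]{1,100}>/g, " ")    // strip XML/HTML-looking tags
    .replace(/^\s*ignore (all )?previous instructions[:,]?/i, "") // injection prefix
    .replace(/\s+/g, " ")
    .trim()
    .slice(0, 500);                    // hard length cap
}
```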

* chore: bump version and changelog (v0.10.2)

Security wave 3: 9 vulnerabilities closed across file_upload, recipe trust
boundary, SSRF defense, prompt injection, and limit clamping. See CHANGELOG
for full details.

Contributors:
- @garagon (PRs #105-109)
- @Hybirdss (Issue #139)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: sync documentation with v0.10.2 security wave 3

- CLAUDE.md: document OperationContext.remote, new security helpers
  (validateUploadPath, validatePageSlug, validateFilename, isInternalUrl,
  parseOctet, hostnameToOctets, isPrivateIpv4, getRecipeDirs,
  sanitizeQueryForPrompt, sanitizeExpansionOutput), updated clampSearchLimit
  signature, recipe trust boundary, new test files
- docs/integrations/README.md: replace string-form health_check example
  with typed DSL (string checks now hard-block for non-embedded recipes);
  add recipe trust boundary subsection
- docs/mcp/DEPLOY.md: document file_upload remote-caller cwd confinement,
  symlink rejection, slug/filename allowlists

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Minions v7 + v0.11.1 canonical migration + skillify (#130)

* feat: add minion_jobs schema, migration v5, and executeRaw to BrainEngine

Foundation for the Minions job queue system. Adds:
- minion_jobs table (20 columns) with CHECK constraints, partial indexes,
  and RLS. Inspired by BullMQ's job model, adapted for Postgres.
- Migration v5 creates the table for existing databases.
- executeRaw<T>() method on BrainEngine interface for raw SQL access,
  needed by the Minions module for claim queries (FOR UPDATE SKIP LOCKED),
  token-fenced writes, and atomic stall detection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: Minions job queue — queue, worker, backoff, types

BullMQ-inspired Postgres-native job queue built into GBrain. No Redis.
No external dependencies. Postgres transactions replace Lua scripts.

- MinionQueue: submit, claim (FOR UPDATE SKIP LOCKED), complete/fail
  (token-fenced), atomic stall detection (CTE), delayed promotion,
  parent-child resolution, prune, stats
- MinionWorker: handler registry, lock renewal, graceful SIGTERM,
  exponential backoff with jitter, UnrecoverableError bypass
- MinionJobContext: updateProgress(), log(), isActive() for handlers
- 8-state machine: waiting/active/completed/failed/delayed/dead/
  cancelled/waiting-children

Patterns stolen from: BullMQ (lock tokens, stall detection, flows),
Sidekiq (dead set, backoff formula), Inngest (checkpoint/resume).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
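
The Sidekiq-style exponential backoff with jitter can be sketched as below. This is an illustrative formula, not the worker's actual code; the rand parameter is injected only so the sketch is testable, and the [exp/2, exp) jitter window is an assumption.

```typescript
// Illustrative formula; rand is injected only so the sketch is testable.
function retryDelayMs(
  attemptsMade: number,
  baseMs = 1_000,
  rand: () => number = Math.random,
): number {
  // Exponential growth; attempts_made = 0 is treated like the first retry.
  const exp = baseMs * 2 ** Math.max(attemptsMade - 1, 0);
  // Jitter in [exp/2, exp) so simultaneous failures don't retry in lockstep.
  return Math.floor(exp / 2 + rand() * (exp / 2));
}
```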

* test: 43 tests for Minions job queue

Full coverage of the Minions module against PGLite in-memory:
- Queue CRUD (9): submit, get, list, remove, cancel, retry, duplicate
- State machine (6): waiting→active→completed/failed, retry→delayed→waiting
- Backoff (4): exponential, fixed, jitter range, attempts_made=0 edge
- Stall detection (3): detect stalled, counter increment, max→dead
- Dependencies (5): parent waits, fail_parent, continue, remove_dep, orphan
- Worker lifecycle (5): register, start-without-handlers, claim+execute,
  non-Error throws, UnrecoverableError bypass
- Lock management (3): renewal, token mismatch, claim sets lock fields
- Claim mechanics (4): empty queue, priority ordering, name filtering,
  delayed promotion timing
- Cancel & retry (2): cancel active, retry dead

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: Minions CLI commands and MCP operations

Wire Minions into the GBrain CLI and MCP layer:

CLI (gbrain jobs):
  submit <name> [--params JSON] [--follow] [--dry-run]
  list [--status S] [--queue Q] [--limit N]
  get <id> — detailed view with attempt history
  cancel/retry/delete <id>
  prune [--older-than 30d]
  stats — job health dashboard
  work [--queue Q] [--concurrency N] — Postgres-only worker daemon

6 MCP operations (contract-first, auto-exposed via MCP server):
  submit_job, get_job, list_jobs, cancel_job, retry_job, get_job_progress

Built-in handlers: sync, embed, lint, import. --follow runs inline.
Worker daemon blocked on PGLite (exclusive file lock).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update project documentation for Minions job queue

CLAUDE.md: added Minions files to key files, updated operation count (36),
BrainEngine method count (38), test file count (45), added jobs CLI commands.
CHANGELOG.md: added Minions entry to v0.10.0 (background jobs, retry, stall
detection, worker daemon).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: Minions v2 — agent orchestration primitives (pause/resume, inbox, tokens, replay)

Adds the foundation for Minions as universal agent orchestration infrastructure.
GBrain's Postgres-native job queue now supports durable, observable, steerable
background agents. The OpenClaw plugin (separate repo) will consume these via
library import, not MCP, for zero-latency local integration.

## New capabilities

- **Concurrent worker** — Promise pool replaces sequential loop. Per-job
  AbortController for cooperative cancellation. Graceful shutdown waits for
  all in-flight jobs via Promise.allSettled.
- **Pause/resume** — pauseJob clears the lock and fires AbortSignal on active
  jobs. Handlers check ctx.signal.aborted and exit cleanly. resumeJob returns
  paused jobs to waiting. Catch block skips failJob when signal.aborted.
- **Inbox (separate table)** — minion_inbox table for sidechannel messages.
  sendMessage with sender validation (parent job or admin). readInbox is
  token-fenced and marks read_at atomically. Separate table avoids row bloat
  from rewriting JSONB on every send.
- **Token accounting** — tokens_input/tokens_output/tokens_cache_read columns.
  updateTokens accumulates; completeJob rolls child tokens up to parent.
  USD cost computed at read time (no cost_usd column — pricing too volatile).
- **Job replay** — replayJob clones a terminal job with optional data overrides.
  New job, fresh attempts, no parent link.

## Handler contract additions

MinionJobContext now provides:
- `signal: AbortSignal` — cooperative cancellation
- `updateTokens(tokens)` — accumulate token usage
- `readInbox()` — check for sidechannel messages
- `log()` — now accepts string or TranscriptEntry

## MCP operations added

pause_job, resume_job, replay_job, send_job_message — all auto-generate CLI
commands and MCP server endpoints.

## Library exports

package.json exports map adds ./minions and ./engine-factory paths so plugins
can `import { MinionQueue } from 'gbrain/minions'` for direct library use.

## Instruction layer (the teaching)

- skills/minion-orchestrator/SKILL.md — when/how to use Minions, decision
  matrix, lifecycle management, anti-patterns
- skills/conventions/subagent-routing.md — cross-cutting rule: all background
  work goes through Minions
- RESOLVER.md — trigger entries for agent orchestration
- manifest.json — registered

## Schema migration v6

Additive: 3 token columns, paused status, minion_inbox table with unread index.
Full Postgres + PGLite support. No backfill needed.

## Tests

65 tests (was 43): pause/resume (5), inbox (6), tokens (4), replay (4),
concurrent worker context (3), plus all existing coverage.

## What's NOT in this commit

Deferred to follow-up PRs:
- LISTEN/NOTIFY subscribe (needs real Postgres E2E)
- Resource governor (depends on concurrent worker stress testing)
- Routing eval harness (needs API keys + benchmark data)
- OpenClaw plugin (separate @gbrain/openclaw-minions-plugin repo)

See docs/designs/MINIONS_AGENT_ORCHESTRATION.md for full CEO-approved design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(minions): migration v7 — agent_parity_layer schema

Adds columns on minion_jobs (depth, max_children, timeout_ms, timeout_at,
remove_on_complete, remove_on_fail, idempotency_key) plus the new
minion_attachments table. Three partial indexes for bounded scans:
idx_minion_jobs_timeout, idx_minion_jobs_parent_status, and
uniq_minion_jobs_idempotency. Check constraints enforce non-negative depth
and positive child cap / timeout.

Additive migration — existing installs pick it up via ensureSchema on next
use. No user action required.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(minions): extend types for v7 parity layer

Extends MinionJob with depth/max_children/timeout_ms/timeout_at/
remove_on_complete/remove_on_fail/idempotency_key. Extends MinionJobInput
with the same options plus max_spawn_depth override. Adds MinionQueueOpts
(maxSpawnDepth default 5, maxAttachmentBytes default 5 MiB). Adds
AttachmentInput/Attachment shapes and ChildDoneMessage in the InboxMessage
union. rowToMinionJob updated to pick up the new columns.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(minions): attachments validator

New module validateAttachment() gates every attachment write. Rejects empty
filenames, path traversal (.., /, \), null bytes, oversized content (5 MiB
default, per-queue override), invalid base64, and implausible content_type
headers. Returns normalized { filename, content_type, content (Buffer),
sha256, size } on success.

The DB also enforces UNIQUE (job_id, filename) as defense-in-depth for
concurrent addAttachment races — JS-only checks are not sufficient.
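The traversal and null-byte rules above can be sketched as a standalone check (names and rule ordering here are illustrative, not the real validateAttachment module):

```typescript
// Simplified filename gate matching the rejections described above:
// empty names, null bytes, parent refs, and any path separator.
function checkFilename(filename: string): string | null {
  if (filename.length === 0) return "empty filename";
  if (filename.includes("\0")) return "null byte";
  if (filename.includes("..")) return "path traversal";
  if (filename.includes("/") || filename.includes("\\")) return "path separator";
  return null; // valid
}

const results = {
  ok: checkFilename("report.pdf"),
  traversal: checkFilename("../etc/passwd"),
  sep: checkFilename("a/b.txt"),
  nul: checkFilename("x\0.txt"),
  empty: checkFilename(""),
};
```

Returning a reason string rather than throwing keeps the caller free to aggregate all rejections into one error payload.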

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(minions): queue v7 — depth, child cap, timeouts, cascade, idempotency, child_done

Wraps completeJob and failJob in engine.transaction() so parent hook
invocations (resolveParent, failParent, removeChildDependency) fold into
the same transaction as the child update. A process crash between child
and parent can't strand the parent in waiting-children anymore.

Adds v7 behaviors:
- Depth tracking. add() computes depth = parent.depth + 1 and rejects
  past maxSpawnDepth (default 5).
- Per-parent child cap. add() takes SELECT ... FOR UPDATE on the parent,
  counts non-terminal children, rejects when count >= max_children.
  NULL max_children = no cap.
- Per-job wall-clock timeout. claim() populates timeout_at when
  timeout_ms is set. New handleTimeouts() dead-letters expired rows with
  error_text='timeout exceeded'. Terminal — no retry.
- Cascade cancel. cancelJob() walks descendants via recursive CTE with
  depth-100 runaway cap. Returns the root row. Re-parented descendants
  (parent_job_id NULL) are naturally excluded.
- Idempotency. add() uses INSERT ... ON CONFLICT (idempotency_key) DO
  NOTHING RETURNING; falls back to SELECT when RETURNING is empty. Same
  key always yields the same job id.
- child_done inbox. completeJob inserts {type:'child_done', child_id,
  job_name, result} into the parent's inbox in the same transaction as
  the token rollup, guarded by EXISTS so terminal/deleted parents skip
  without FK violation. New readChildCompletions(parent_id, lock_token,
  since?) helper; token-fenced like readInbox.
- removeOnComplete / removeOnFail. Deletes the row after the parent hook
  fires, so parent policy sees consistent state.
- Attachment methods. addAttachment validates via validateAttachment
  then INSERTs; UNIQUE (job_id, filename) backs the JS dup check.
  listAttachments, getAttachment, deleteAttachment round out the API.
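The idempotency contract above — same key always yields the same job id — can be modeled in memory (the real queue uses `INSERT ... ON CONFLICT (idempotency_key) DO NOTHING RETURNING` with a SELECT fallback; this sketch only captures the invariant, not the SQL):

```typescript
// In-memory model of idempotent add(): a hit on the key map plays the
// role of the SELECT-after-empty-RETURNING fallback path.
class IdempotentAdd {
  private byKey = new Map<string, number>();
  private nextId = 1;

  add(idempotencyKey?: string): number {
    if (idempotencyKey !== undefined) {
      const existing = this.byKey.get(idempotencyKey);
      if (existing !== undefined) return existing; // duplicate submit: same id
    }
    const id = this.nextId++;
    if (idempotencyKey !== undefined) this.byKey.set(idempotencyKey, id);
    return id;
  }
}

const q = new IdempotentAdd();
const first = q.add("ingest:2026-04-18");
const second = q.add("ingest:2026-04-18"); // same key → same job id
const third = q.add(); // no key → always a fresh job
```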

Fixes pre-existing inverted status bug: add() now puts children in
waiting/delayed (not waiting-children) and atomically flips the parent
to waiting-children in the same transaction. Tests no longer need
manual UPDATE workarounds.

Two correctness fixes:
- Sibling completion race. Under READ COMMITTED, two grandchildren
  completing concurrently each saw the other as still-active in the
  pre-commit snapshot and neither flipped the parent. Fixed by taking
  SELECT ... FOR UPDATE on the parent row at the start of completeJob
  and failJob transactions, serializing siblings on the parent lock.
- JSONB double-encode. postgres.js conn.unsafe(sql, params) auto-
  JSON-encodes parameters. Calling JSON.stringify(obj) first stored a
  JSON string literal (jsonb_typeof=string) and broke payload->>'key'
  queries silently. Removed JSON.stringify from three call sites
  (child_done inbox post, updateProgress, sendMessage). PGLite tolerated
  both forms so unit tests missed it — real-PG E2E caught it.
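The double-encode failure can be demonstrated without a database — modeling the driver's own JSON encoding as one `JSON.stringify` pass:

```typescript
// postgres.js JSON-encodes a jsonb parameter itself; stringifying first
// means the value is encoded twice and lands as a JSON *string literal*.
const payload = { key: "value" };

// Correct path: pass the raw object, driver encodes once.
const storedRight = JSON.stringify(payload);
// Buggy path: caller stringifies, then the driver encodes again.
const storedWrong = JSON.stringify(JSON.stringify(payload));

// What jsonb_typeof would report after one decode:
const typeRight = typeof JSON.parse(storedRight); // "object"
const typeWrong = typeof JSON.parse(storedWrong); // "string"
```

A `payload->>'key'` query against the string-typed value returns nothing, which is exactly the silent breakage described above.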

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(minions): worker — timeout safety net + handleTimeouts tick

Worker tick now calls handleStalled() first, then handleTimeouts() — stall
requeue wins over timeout dead-letter when both could fire in the same
cycle. handleTimeouts() guards on lock_until > now() so stalled jobs take
the retryable path.

launchJob schedules a per-job setTimeout(timeout_ms) that fires ctx.signal
as a best-effort handler interrupt. The timer is always cleared in .finally
so process exit isn't delayed by a dangling timer. Handlers that respect
AbortSignal stop cleanly; handlers that ignore it still get dead-lettered
by the DB-side handleTimeouts.
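The timer-plus-signal pattern can be sketched as follows (helper names are assumptions, not the actual gbrain worker code):

```typescript
// A timer aborts the signal at timeoutMs; .finally always clears it so a
// dangling timer never delays process exit.
async function runWithTimeout<T>(
  handler: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const ac = new AbortController();
  const timer = setTimeout(() => ac.abort(), timeoutMs);
  try {
    return await handler(ac.signal);
  } finally {
    clearTimeout(timer); // cleared even when the handler throws
  }
}

// A cooperative handler: checks the signal between chunks of work.
async function politeHandler(signal: AbortSignal): Promise<string> {
  for (let i = 0; i < 50; i++) {
    if (signal.aborted) return "stopped early";
    await new Promise((r) => setTimeout(r, 10));
  }
  return "done";
}

const outcome = await runWithTimeout(politeHandler, 25);
```

A handler that never checks the signal runs to completion here — which is why the DB-side dead-letter path has to exist as the real enforcement.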

Removed post-completeJob and post-failJob parent-hook calls from the worker
— those are now inside the queue method transactions. Worker becomes
simpler and crash-safer.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(minions): 33 new unit tests for v7 parity layer

Covers depth cap, per-parent child cap, timeout dead-letter, cascade
cancel (including the re-parent edge case), removeOnComplete /
removeOnFail, idempotency (single + concurrent), child_done inbox
(posted in txn + survives child removeOnComplete + since cursor),
attachment validation (oversize, path traversal, null byte, duplicates,
base64), AbortSignal firing on pause mid-handler, catch-block skipping
failJob when aborted, worker in-flight bookkeeping, token-rollup guard
when parent already terminal, and setTimeout safety-net cleanup.

Existing tests updated to remove the inverted-status manual UPDATE
workarounds that the add() fix made obsolete.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(e2e): Minions v7 concurrency + OpenClaw resilience coverage

minions-concurrency.test.ts spins two MinionWorker instances against the
test Postgres, submits 20 jobs, and asserts zero double-claims (every job
runs exactly once). This is the only test that actually proves FOR UPDATE
SKIP LOCKED under real concurrency — PGLite runs on a single connection
and can't exercise the race.

minions-resilience.test.ts covers the six OpenClaw daily pains:
1. Spawn-storm caps are enforced under concurrent submit.
2. Agent stall → handleStalled() requeues; handleTimeouts() skips (lock_until guard).
3. Forgotten dispatches are recoverable via the child_done inbox.
4. Cascade cancel stops grandchildren mid-flight.
5. Deep-tree fan-in (parent → 3 children → 2 grandchildren each) completes with the full inbox chain.
6. Parent crash/recovery resumes from persisted state.

helpers.ts extends ALL_TABLES with minion_attachments, minion_inbox, and
minion_jobs (FK dependents first) so E2E teardown doesn't leak rows
between runs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore: release v0.11.0 — Minions v7 agent orchestration primitives

Bumps VERSION / package.json to 0.11.0. Adds CHANGELOG entry covering
depth tracking, max_children, per-job timeouts, cascade cancel,
idempotency keys, child_done inbox, removeOnComplete/Fail, attachments,
migration v7, plus the two correctness fixes (sibling completion race
and JSONB double-encode).

TODOS.md captures the four v7 follow-ups: per-queue rate limiting,
repeat/cron scheduler, worker event emitter, and waitForChildren
convenience helpers.

1066 unit + 105 E2E = 1171 tests passing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(minions): unify JSONB inserts, tighten nullish coalescing

Three non-blocker cleanups from post-ship review of v0.11.0:

- queue.ts add() and completeJob(): pre-stringifying with JSON.stringify
  while other sites pass raw objects with $n::jsonb casts. postgres.js
  double-encodes if you stringify first — works on PGLite (text→JSONB
  auto-cast), fails silently on real PG. Unify on raw object + explicit
  $n::jsonb cast.
- queue.ts readChildCompletions: since clause used sent_at > $2 relying
  on PG's implicit text→TIMESTAMPTZ coercion. Explicit $2::timestamptz
  is safer and clearer.
- types.ts rowToMinionJob: parent_job_id used || which coerces 0 to null.
  Harmless today (SERIAL IDs start at 1) but ?? is semantically correct.
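The `||` vs `??` distinction in one illustrative snippet:

```typescript
// || treats every falsy value (including 0) as absent; ?? only substitutes
// for null and undefined, so a numeric id of 0 survives.
const withOr = (v: number | null) => v || null;
const withNullish = (v: number | null) => v ?? null;

const a = withOr(0);         // null — id 0 silently lost
const b = withNullish(0);    // 0   — preserved
const c = withNullish(null); // null — absent stays absent
```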

All 110 unit tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(minions): updateProgress missed $1::jsonb cast in unification

Residual from c502b7e — updateProgress was the only remaining JSONB write
without the explicit ::jsonb cast. Not broken (implicit cast works) but
breaks the convention the prior commit unified everywhere else.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* doc: Minions v7 skill count + jobs subcommands (26 skills)

README: bump skill count 25 → 26, add minion-orchestrator row, add
`gbrain jobs` command family block so v0.11.0's headline feature is
actually discoverable from the top-level commands reference.

CLAUDE.md: unit test count 48 → 49 (minions.test.ts expanded), skill
count 25 → 26, add minion-orchestrator to Key files + skills categorization,
expand MinionQueue one-liner to cover v7 primitives (depth/child-cap,
timeouts, idempotency, child_done inbox, removeOnComplete/Fail).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat: Minions adoption UX — smoke test + migration + pain-triggered routing

Teach OpenClaw when to reach for Minions vs native subagents. Ship three
pieces so upgrading from v0.10.x actually lands for real users:

- `gbrain jobs smoke` — one-command health check that submits a `noop` job,
  runs a worker, verifies completion, and prints engine-aware guidance
  (PGLite installs get the "daemon needs Postgres, use --follow" note).
  Fails loud if schema's below v7 so the user knows to `gbrain init`.

- `skills/migrations/v0.11.0.md` — post-upgrade migration file the
  auto-update agent reads. Six steps: apply schema, run smoke, ask user
  via AskUserQuestion which mode they want (always / pain_triggered / off),
  write to `~/.gbrain/preferences.json`, sanity-check handlers, mark done.
  Completeness scores on each option so the recommendation is explicit.

- `skills/conventions/subagent-routing.md` rewritten — was a "MUST use
  Minions for ALL background work" mandate, now reads preferences.json
  on every routing decision and branches on three modes. Mode B
  (pain_triggered) is the default: keep subagents until gateway drops
  state, parallel > 3, runtime > 5min, or user expresses frustration.
  Then pitch the switch in-session with a specific script.

Rename pass: "Minions v7" → "Minions" in README (JOBS block), TODOS.md
(P1 section header + depends-on), CHANGELOG.md v0.11.0 entry. v7 stays
as the internal schema version in code/migration contexts. The product
name is just Minions.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* doc(readme): promote Minions — 6 OpenClaw pains + how each is fixed

The one-line mention in the skills table wasn't doing the work. Added a
dedicated section between "How It Works" and "Getting Data In" that leads
with the six multi-agent failures every OpenClaw user hits daily (spawn
storms, hung handlers, forgotten dispatches, unstructured debugging,
gateway crashes, runaway grandchildren) and maps each pain to the
specific Minions primitive that fixes it.

Includes the smoke test command, the adoption default (pain_triggered),
and a pointer to skills/minion-orchestrator for the full patterns.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(bench): add harness for Minions vs OpenClaw subagent dispatch

Shared harness (openclawDispatch + minionsHandler) using matching
claude-haiku-4-5 calls on both sides so the delta measures queue+
dispatch overhead on top of identical LLM work. Includes
statsFromResults (p50/p95/p99) and formatStats helpers. Uses
`openclaw agent --local` embedded mode; does not test gateway
multi-agent fan-out (documented in the harness header).
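A plausible shape for the percentile math in statsFromResults (an assumption — the real helper may interpolate differently; this uses nearest-rank):

```typescript
// Nearest-rank percentile over sorted latencies.
function percentile(sorted: number[], p: number): number {
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

function statsFromLatencies(ms: number[]) {
  const sorted = [...ms].sort((x, y) => x - y);
  return {
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
  };
}

// One slow outlier dominates the tail but not the median.
const stats = statsFromLatencies([120, 80, 95, 110, 700, 100, 90, 105, 98, 102]);
```

Reporting p95/p99 alongside p50 is what surfaces stampede effects that a mean would average away.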

* test(bench): durability under SIGKILL — Minions vs OpenClaw --local

Headline bench for the claim: when the orchestrator dies mid-dispatch,
Minions rescues via PG state + stall detection; OpenClaw --local loses
in-flight work outright.

Minions side: seed 10 active+expired-lock rows (exact state a SIGKILLed
worker leaves) then run a rescue worker. Expect 10/10 completed.
OpenClaw side: spawn 10 `openclaw agent --local` in parallel, SIGKILL
each at 500ms, count pre-kill delivered output. Expect 0/10 — no
persistence layer, nothing to recover.

Budget: ~$0 (Minions handlers sleep 10ms; OC calls die at 500ms so
partial LLM billing is negligible).

* test(bench): per-dispatch throughput — Minions vs OpenClaw --local

20 serial dispatches each side, identical claude-haiku-4-5 call with the
same trivial prompt. p50/p95/p99 reported via statsFromResults. Serial
(not parallel) so the per-dispatch cost is measured honestly and LLM
token spend stays bounded (~$0.08 total).

Minions: one queue, one worker, one concurrency. Submit → poll to
completion before next submit. OpenClaw: N sequential
`openclaw agent --local` spawns.

* test(bench): fan-out — Minions 10-wide concurrency vs 10 parallel OC spawns

Parent dispatches 10 children, waits for all to return. Minions uses
worker concurrency=10 sharing one warm process; OpenClaw parallel
`openclaw agent --local` spawns, each boots its own runtime.

3 runs × 10 children per run. Reports ok count and wall time per run
plus summary. Honest caveat documented: does not test OC gateway
multi-agent fan-out — that needs a custom WS client and LLM-backed
parent agent. This measures what users script today.

Budget: ~$0.12 LLM spend.

* test(bench): memory — 10 in-flight subagents, single-proc vs 10-proc cost

Measures resident memory for keeping 10 subagents in flight. Minions:
one worker process, concurrency=10 with handlers that park on a
promise — sample RSS of the test process via process.memoryUsage().
OpenClaw: 10 parallel `openclaw agent --local` processes, sum their
RSS via `ps -o rss=`.

Handlers are cheap sleeps, no LLM — we want harness memory, not LLM
client state. Budget: $0.

* test(bench): fan-out — don't gate on OC success rate, report numbers

Initial run showed OC parallel `--local` at 10-wide hits a 43% failure
rate (17/30 succeeded across 3 runs). That's the finding, not a test bug —
process startup stampede + LLM rate limits. Bench now prints error
samples and reports the numbers instead of gating.

Minions side still gates at 90% (30/30 observed in practice).

* doc(benchmarks): Minions vs OpenClaw --local subagent dispatch

Real numbers on four claims: durability, throughput, fan-out, memory.
Same claude-haiku-4-5 call on both sides so the delta is queue+dispatch+
process cost on top of identical LLM work.

Headline: Minions rescues 10/10 from a SIGKILLed worker in 458ms while
OpenClaw --local loses all 10; ~10× faster per dispatch (778ms p50 vs
8086ms p50); ~21× faster at 10-wide fan-out AND 100% reliable vs OC's
43% failure rate; 2 MB vs 814 MB to keep 10 subagents in flight.

Honest caveats section covers what this doesn't test (OC gateway
multi-agent, load tests, other models). Fully reproducible via
test/e2e/bench-vs-openclaw/.

* doc(readme): inject Minions vs OpenClaw bench numbers

Headline deltas now in the Minions section: 10/10 vs 0/10 on crash,
~10× faster per dispatch, ~21× faster fan-out at 10-wide with 0%
failure vs 43%, ~400× less memory. Links to the full bench doc.

Prose first said Minions "fixes all six pains." Now it shows the
numbers that prove it.

* bench: production Wintermute benchmark — Minions 753ms vs sub-agent timeout

Real deployment: 45K-page brain on Render+Supabase. Task: pull 99 tweets,
write brain page, commit, sync. Minions: 753ms, $0. Sub-agent: gateway
timeout (>10s, couldn't even spawn under production load).

Also: 19,240 tweets backfilled across 36 months in 15 min at $0.
Sub-agents would cost $1.08 and fail 40% of spawns.

* bench: tweet ingestion — Minions 719ms vs OpenClaw 12.5s (17×)

Production benchmark with runnable test code:
- test/e2e/bench-vs-openclaw/tweet-ingest.bench.ts (reusable)
- docs/benchmarks/2026-04-18-tweet-ingestion.md (publishable)

Task: pull 100 tweets from X API, write brain page, commit, sync.
Minions: 719ms mean, $0, 100% success.
OpenClaw: 12,480ms mean, $0.03/run, 60% success (gateway timeouts).
At scale: 36-month backfill, 19K tweets, 15 min, $0 vs est. $1.08.

* doc(benchmarks): Wintermute production data point for Minions vs OpenClaw

Adds a production-environment data point to the Minions README section:
one month of tweet ingest on Wintermute (Render + Supabase + 45K-page brain)
ran end-to-end in 753ms for $0.00 via Minions, while the equivalent
sessions_spawn hit the 10s gateway timeout and produced nothing.

Full methodology + logs in docs/benchmarks/2026-04-18-minions-vs-openclaw-production.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(core): preferences.ts + cli-util.ts — foundations for v0.11.1

Adds two foundational modules that apply-migrations (Lane A-4), the
v0.11.0 orchestrator (Lane C-1), and the stopgap script (Lane C-4) all
depend on.

- src/core/preferences.ts: atomic-write ~/.gbrain/preferences.json
  (mktemp + rename, 0o600, forward-compatible for unknown keys) with
  validateMinionMode, loadPreferences, savePreferences. Plus
  appendCompletedMigration + loadCompletedMigrations for the
  ~/.gbrain/migrations/completed.jsonl log (tolerates malformed lines).
  Uses process.env.HOME || homedir() so $HOME overrides work in CI and
  tests; Bun's os.homedir() caches the initial value and ignores later
  mutations.
- src/core/cli-util.ts: promptLine(prompt) helper, extracted from
  src/commands/init.ts:212-224. Shared so init, apply-migrations, and
  the v0.11.0 orchestrator's mode prompt don't each reinvent it.
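The mktemp + rename write can be sketched like this (illustrative; the real preferences.ts adds validation and forward-compat handling):

```typescript
// Write to a temp file in the same directory, then rename over the target.
// rename(2) is atomic on POSIX, so readers see the old or new file, never
// a partial write.
import { writeFileSync, renameSync, readFileSync, mkdtempSync } from "node:fs";
import { join } from "node:path";
import { tmpdir } from "node:os";

function atomicWriteJson(path: string, data: unknown): void {
  const tmp = path + ".tmp-" + process.pid; // same dir: rename stays atomic
  writeFileSync(tmp, JSON.stringify(data, null, 2), { mode: 0o600 });
  renameSync(tmp, path);
}

const dir = mkdtempSync(join(tmpdir(), "prefs-"));
const file = join(dir, "preferences.json");
atomicWriteJson(file, { minion_mode: "pain_triggered" });
const roundTrip = JSON.parse(readFileSync(file, "utf8"));
```

Putting the temp file in the target's own directory matters: rename across filesystems (e.g. from /tmp to $HOME) is not atomic.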

test/preferences.test.ts: 21 unit tests covering load/save atomicity,
0o600 perms, forward-compat for unknown keys, minion_mode validation,
completed.jsonl JSONL append idempotence, auto-ts population, malformed-
line tolerance in loadCompletedMigrations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(init): add --migrate-only flag (schema-only, no saveConfig)

Context: v0.11.0 migration orchestrators need a safe way to re-apply the
schema against an existing brain without risking a config flip. Today
running bare `gbrain init` with no flags defaults to PGLite and calls
saveConfig, which would silently overwrite an existing Postgres
database_url — caught by Codex in the v0.11.1 plan review as a
show-stopper data-loss bug.

The new --migrate-only path:
  - loadConfig() reads the existing config (does NOT call saveConfig)
  - errors out with a clear "run gbrain init first" if no config exists
  - connects via the already-configured engine, calls engine.initSchema(),
    disconnects
  - --json emits structured success/error payloads

Everything downstream in the v0.11.1 migration chain (apply-migrations,
the stopgap bash script, the package.json postinstall hook) will invoke
this flag rather than bare gbrain init.

test/init-migrate-only.test.ts: 4 tests covering the no-config error
path, --json error payload shape, happy-path with a PGLite fixture
(verifies config.json content is byte-identical after the call — the
real invariant), and idempotent rerun.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(migrations): TS registry replaces filesystem migration scan

Context: Codex flagged that bun build --compile produces a self-contained
binary, and the existing findMigrationsDir() in upgrade.ts:145 walks
skills/migrations/v*.md on disk — which fails on a compiled install
because the markdown files aren't bundled. The plan's fix is a TS
registry: migrations are code, imported directly, visible to both source
installs and compiled binaries.

- src/commands/migrations/types.ts: shared Migration, OrchestratorOpts,
  OrchestratorResult types.
- src/commands/migrations/index.ts: exports the migrations[] array,
  getMigration(version), and compareVersions() (semver comparator).
  The feature_pitch data that lived in the MD file frontmatter now
  lives here as a code constant on each Migration, so runPostUpgrade's
  post-upgrade pitch printer can consume it without a filesystem read.
- src/commands/migrations/v0_11_0.ts: stub orchestrator + pitch. The
  full phase implementation lands in Lane C-1; for now the stub throws
  a clear "not yet implemented" so apply-migrations --list (Lane A-4)
  can still enumerate the migration.
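A comparator with the shape compareVersions() is described as having (assuming plain MAJOR.MINOR.PATCH with no pre-release tags — the real implementation may cover more):

```typescript
// Negative if a < b, zero if equal, positive if a > b. Numeric compare per
// segment avoids the lexicographic "0.9" > "0.11" trap.
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const d = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (d !== 0) return d;
  }
  return 0;
}

const newer = compareVersions("0.11.0", "0.9.9"); // > 0: 11 beats 9 numerically
const equal = compareVersions("1.2.3", "1.2.3");  // 0
const older = compareVersions("0.10.5", "0.11.0"); // < 0
```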

test/migrations-registry.test.ts: 9 tests covering ascending-semver
ordering, feature_pitch shape invariants, getMigration lookup, and
compareVersions edge cases (equal / newer / older / single-digit
across major bumps).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cli): gbrain apply-migrations — migration runner CLI

Reads ~/.gbrain/migrations/completed.jsonl, diffs against the TS migration
registry, runs pending orchestrators. Resumes status:"partial" entries
(the stopgap bash script writes these so v0.11.1 apply-migrations can
pick up where it left off). Idempotent: rerunning when up-to-date exits 0.

Flags:
  --list                    Show applied + partial + pending + future.
  --dry-run                 Print the plan; take no action.
  --yes / --non-interactive Skip prompts (used by runPostUpgrade + postinstall).
  --mode <a|p|o>            Preset minion_mode (bypasses the Phase C TTY prompt).
  --migration vX.Y.Z        Force-run one specific version.
  --host-dir <path>         Include $PWD in host-file walk (default is
                            $HOME/.claude + $HOME/.openclaw only).
  --no-autopilot-install    Skip Phase F.

Diff rule (Codex H9): apply when no status:"complete" entry exists AND
migration.version ≤ installed VERSION. Previously proposed rule was
"version > currentVersion", which would SKIP v0.11.0 when running v0.11.1;
regression test in apply-migrations.test.ts pins the correct semantics.

Registered in src/cli.ts CLI_ONLY Set; dispatched before connectEngine so
each phase owns its own engine/subprocess lifecycle (no double-connect
when the orchestrator shells out to init --migrate-only or jobs smoke).

test/apply-migrations.test.ts: 18 unit tests covering parseArgs for every
flag, indexCompleted/statusForVersion correctness (including stopgap-then-
complete transition), and buildPlan's four buckets (applied / par…
garrytan added a commit that referenced this pull request Apr 24, 2026
* fix(link-extraction): v0.10.5 drive works_at + advises accuracy on rich prose

Extends inferLinkType patterns to cover rich-prose phrasings that miss with
v0.10.4 regexes. Targets the residuals called out in TODOS.md: works_at at
58% type accuracy, advises at 41%.

WORKS_AT_RE additions:
- Rank-prefixed: "senior engineer at", "staff engineer at", "principal/lead"
- Discipline-prefixed: "backend/frontend/full-stack/ML/data/security engineer at"
- Possessive time: "his/her/their/my time at"
- Leadership beyond "leads engineering": "heads up X at", "manages engineering at",
  "runs product at", "leads the [team] at"
- Role nouns: "role at", "position at", "tenure as", "stint as"
- Promotion patterns: "promoted to staff/senior/principal at"

ADVISES_RE additions:
- Advisory capacity: "in an advisory capacity", "advisory engagement/partnership/contract"
- "as an advisor": "joined as an advisor", "serves as technical advisor"
- Prefixed advisor nouns: "strategic/technical/security/product/industry advisor to|at"
- Consulting: "consults for", "consulting role at|with"

New EMPLOYEE_ROLE_RE page-level prior: fires when the page describes the subject
as an employee (senior/staff/principal engineer, director, VP, CTO/CEO/CFO) at
some company. Biases outbound company refs toward works_at when per-edge context
is possessive or narrative without an explicit work verb. Scoped to person -> company
links only. Precedence: investor > advisor > employee (investors often hold board
seats which would otherwise mis-classify as advise/works_at).

ADVISOR_ROLE_RE broadened from "full-time/professional/advises multiple" to catch
any page that self-identifies the subject as an advisor ("is an advisor",
"serves as advisor", possessive "her advisory work/role/engagement").

Tests: 65 pass (16 new v0.10.5 coverage tests + 4 regression guards against
v0.10.4 tightenings). Templated benchmark still 88.9% type_accuracy (10/10 on
works_at and advises). Rich-prose measurement requires the multi-axis report
upgrade (next commit) to validate retroactively.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(eval): type-accuracy runner on rich-prose corpus + wire into all.ts

New Category 2 in BrainBench: per-link-type accuracy measured directly on the
240-page rich-prose world-v1 corpus. Distinct from Cat 1's retrieval metrics,
this measures whether inferLinkType() correctly classifies extracted edges
when the prose varies (the 58% works_at and 41% advises residuals that v0.10.5
regexes targeted).

How it works:
  1. Loads all pages from eval/data/world-v1/
  2. Derives GOLD expected edges from each page's _facts metadata
     (founders → founded, investors → invested_in, advisors → advises,
      employees → works_at, attendees → attended, primary_affiliation +
      role drives person-page outbound type)
  3. Runs extractPageLinks() on each page → INFERRED edges
  4. Per (from, to) pair, compares inferred type vs gold type
  5. Emits per-link-type table: correct / mistyped / missed / spurious +
     type accuracy + recall + precision + strict F1 (triple match)
  6. Full confusion matrix rows=gold, cols=inferred

v0.10.5 validation on 240-page corpus (up from pre-v0.10.5 baselines):
  - works_at:    58%  → 100.0%   (+42 pts) — 10/10 correct, 0 mistyped
  - advises:     41%  → 88.2%    (+47 pts) — 15/17 correct
  - attended:    —    → 100.0%   131/134 recall
  - founded:    100%  → 100.0%   40/40
  - invested_in: 89%  → 92.0%    69/75
  - Overall:    88.5% → 95.7%    type accuracy (conditional on edge found)

Strict F1 overall: 53.7%. Lower because the _facts-based gold set only
captures core relationships; rich prose extracts many peripheral mentions
(190 spurious "mentions" edges) that aren't bugs but are correctly-typed
prose references without a _facts counterpart. Spurious counts are signal
for future type-precision tuning, not failure.

Wired into eval/runner/all.ts as Cat 2 so every full benchmark run includes
the rich-prose type accuracy table alongside retrieval metrics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(eval): Phase 2 adapter interface + EXT-1 ripgrep+BM25 baseline

Phase 2 credibility unlock: BrainBench now compares gbrain to external
baselines on the same corpus and queries. Transforms the benchmark from
internal ablation ("gbrain-graph beats gbrain-grep") to category comparison
("gbrain-graph beats classic BM25 by 32 pts P@5"). This is the #1 fix
from the 4-review arc — addresses Codex's core critique that v1's
before/after was self-referential.

Added:
  eval/runner/types.ts                      — Adapter interface (v1.1 spec)
  eval/runner/adapters/ripgrep-bm25.ts      — EXT-1 classic IR baseline
  eval/runner/adapters/ripgrep-bm25.test.ts — 11 unit tests, all pass
  eval/runner/multi-adapter.ts              — side-by-side scorer

Adapter interface (eng pass 2 spec):
  - Thin 3-method Strategy: init(rawPages, config), query(q, state), snapshot(state)
  - BrainState is opaque to runner (never inspected)
  - Raw pages passed in-memory; gold/ never crosses adapter boundary
    (structural ingestion-boundary enforcement)
  - PoisonDisposition enum reserved for future poison-resistance scoring

EXT-1 ripgrep+BM25:
  - Classic Lucene-variant IDF + k1/b tuned at standard 1.5/0.75
  - Title tokens double-weighted for entity-page slug-match bias
  - Stopword filter, alphanumeric tokenization, stable lexicographic tie-break
  - Pure in-memory inverted index — no external deps, ~100 LOC core
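A toy version of the EXT-1 scoring core — Lucene-variant IDF with k1=1.5, b=0.75 as stated above. Tokenization and title weighting are omitted; this only illustrates the ranking math, not the adapter:

```typescript
type Doc = { id: string; tokens: string[] };

// BM25 score of one document for a tokenized query.
function bm25(query: string[], doc: Doc, docs: Doc[], k1 = 1.5, b = 0.75): number {
  const avgLen = docs.reduce((s, d) => s + d.tokens.length, 0) / docs.length;
  let score = 0;
  for (const term of query) {
    const df = docs.filter((d) => d.tokens.includes(term)).length;
    if (df === 0) continue;
    const idf = Math.log(1 + (docs.length - df + 0.5) / (df + 0.5)); // Lucene variant
    const tf = doc.tokens.filter((t) => t === term).length;
    score += (idf * tf * (k1 + 1)) /
      (tf + k1 * (1 - b + (b * doc.tokens.length) / avgLen));
  }
  return score;
}

const docs: Doc[] = [
  { id: "acme", tokens: ["acme", "raised", "a", "round", "from", "nova", "capital"] },
  { id: "misc", tokens: ["weekly", "sync", "notes", "about", "the", "roadmap"] },
];
const query = ["nova", "capital"];
const ranked = [...docs].sort((x, y) => bm25(query, y, docs) - bm25(query, x, docs));
```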

First side-by-side results on 240-page rich-prose corpus, 145 relational queries:

| Adapter       | P@5    | R@5    | Correct top-5 |
|---------------|--------|--------|---------------|
| gbrain-after  | 49.1%  | 97.9%  | 248/261       |
| ripgrep-bm25  | 17.1%  | 62.4%  | 124/261       |
| Delta         | +32.0  | +35.5  | +124          |

gbrain-after is the hybrid graph+grep config from PR #188. Ripgrep+BM25 is
a genuinely strong classic-IR baseline (BM25 is what Lucene/Elasticsearch
ship). gbrain's ~+32-point lead on relational queries reflects real work
by the knowledge graph layer: typed links + traversePaths surface the
correct answers in top-K that BM25 only pulls in via partial-text overlap.

Next in Phase 2: EXT-2 vector-only RAG + EXT-3 hybrid-without-graph
adapters. Both plug into the same Adapter interface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(eval): Phase 2 EXT-2 vector-only RAG adapter

Second external baseline for BrainBench. Pure cosine-similarity ranking
using the SAME text-embedding-3-large model gbrain uses internally —
apples-to-apples on the embedding layer so any gbrain lead reflects the
graph + hybrid fusion, not a better embedder.

Files:
  eval/runner/adapters/vector-only.ts      ~130 LOC
  eval/runner/adapters/vector-only.test.ts 6 unit tests (cosine math)

Design:
  - One vector per page (title + compiled_truth + timeline, capped 8K chars).
  - No chunking (intentional; chunked vector RAG would be EXT-2b later).
  - No keyword fallback (that's EXT-3 hybrid-without-graph).
  - Embeddings in batches of 50 via existing src/core/embedding.ts (retry+backoff).
  - Cost on 240 pages: ~$0.02/run.
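The ranking core of a vector-only baseline reduces to cosine similarity over page embeddings (illustrative; the real adapter gets its vectors from text-embedding-3-large):

```typescript
// Cosine similarity: dot product over the product of magnitudes.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

const parallel = cosine([1, 2, 3], [2, 4, 6]); // ≈ 1: same direction
const orthogonal = cosine([1, 0], [0, 1]);     // 0: unrelated
```

This is also why the adapter fails on "Who invested in X?" queries: a page can point in a similar direction to the query without ever mentioning X.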

Three-adapter side-by-side on 240-page rich-prose corpus, 145 relational queries:

| Adapter       | P@5    | R@5    | Correct top-5 |
|---------------|--------|--------|---------------|
| gbrain-after  | 49.1%  | 97.9%  | 248/261       |
| ripgrep-bm25  | 17.1%  | 62.4%  | 124/261       |
| vector-only   | 10.8%  | 40.7%  |  78/261       |

Interesting finding: vector-only scores WORSE than BM25 on relational queries
like "Who invested in X?" — exact entity match matters more than semantic
similarity for these templates. BM25 nails the entity-name term; vector-only
returns topically-similar-but-not-mentioning pages. This is the known failure
mode of pure-vector RAG on precise relational/identity queries. Real-world
vector RAG systems always add keyword fallback; EXT-3 (hybrid-without-graph)
will be that fairer comparator.

gbrain's lead widens in vector-only comparison: +38.4 pts P@5, +57.2 pts R@5.
The graph layer is doing the heavy lifting for relational traversal; pure
vector RAG can't express "traverse 'attended' edges from this meeting page."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(eval): Phase 2 EXT-3 hybrid-without-graph adapter — graph isolated

Third and closest-to-gbrain external baseline. Runs gbrain's full hybrid
search (vector + keyword + RRF fusion + dedup) WITHOUT the knowledge-graph
layer. Same engine, same embedder, same chunking, same hybrid fusion —
only traversePaths + typed-link extraction turned off.

This is the decisive comparator for "does the knowledge graph do useful
work?" Same everything-else, only graph differs. Any lead gbrain-after has
over EXT-3 is 100% attributable to the graph layer.

Files:
  eval/runner/adapters/hybrid-nograph.ts   — ~110 LOC

Implementation:
  - New PGLiteEngine per run; auto_link set to 'false' (belt).
  - importFromContent() used instead of bare putPage() so chunks +
    embeddings get populated (hybridSearch needs them).
  - NO runExtract() call — typed links/timeline stay empty (suspenders).
  - hybridSearch(engine, q.text) answers every query. Aggregate chunks
    to page-level by best chunk score.

FOUR-adapter side-by-side on 240-page rich-prose corpus, 145 relational queries:

| Adapter         | P@5    | R@5    | Correct/Gold |
|-----------------|--------|--------|--------------|
| gbrain-after    | 49.1%  | 97.9%  | 248/261      |
| hybrid-nograph  | 17.8%  | 65.1%  | 129/261      |
| ripgrep-bm25    | 17.1%  | 62.4%  | 124/261      |
| vector-only     | 10.8%  | 40.7%  |  78/261      |

The headline delta nobody can hand-wave away:
  gbrain-after → hybrid-nograph  = +31.4 P@5, +32.9 R@5
  hybrid-nograph → ripgrep-bm25  = +0.7 P@5,  +2.7 R@5

Hybrid search (vector+keyword+RRF) over pure BM25 gains ~1 point. The
knowledge graph layer over hybrid gains ~31 points. The graph is doing
the work; adding it to a retrieval stack is what actually moves the needle
on relational queries. The vector/keyword/BM25 debate is a footnote.

Timing: hybrid-nograph init is ~2 min (embeds 240 pages once); query loop
is fast. gbrain-after is ~1.5s total because traversePaths doesn't need
embeddings. Runs at ~$0.02 Opus-equivalent in embedding cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(eval): Phase 2 query validator + Tier 5 Fuzzy + Tier 5.5 synthetic + N=5 tolerance bands

Closes multiple Phase 2 items in one commit since they form a cohesive
package: query schema enforcement + new query tiers + per-query-set
statistical rigor.

Added:
  eval/runner/queries/validator.ts               — hand-rolled Query schema validator
  eval/runner/queries/validator.test.ts          — 24 unit tests, all pass
  eval/runner/queries/tier5-fuzzy.ts             — 30 hand-authored Tier 5 Fuzzy/Vibe queries
  eval/runner/queries/tier5_5-synthetic.ts       — 50 SYNTHETIC-labeled outsider-style queries (author: "synthetic-outsider-v1")
  eval/runner/queries/index.ts                   — aggregator + validateAll()

Modified:
  eval/runner/multi-adapter.ts                   — N=5 runs per adapter (BRAINBENCH_N override), page-order shuffle, mean±stddev reporting

Query validator (hand-rolled, no zod dep to match gbrain codebase style):
  - Temporal verb regex enforces as_of_date (per eng pass 2 spec):
    /\\b(is|was|were|current|now|at the time|during|as of|when did)\\b/i
  - Validates tier enum, expected_output_type enum, gold shape per type
  - gold.relevant must be non-empty slug[] for cited-source-pages queries
  - abstention requires gold.expected_abstention === true
  - externally-authored tier requires author field
  - batch validation catches duplicate IDs
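The temporal-verb rule above can be sketched as a single check — a simplified stand-in for validator.ts (field names follow this commit; the real validator also checks tier enums, gold shape, and duplicate IDs):

```typescript
// Simplified sketch of the temporal-verb rule: queries whose text contains a
// temporal verb must carry an as_of_date. QueryLike is an illustrative
// subset of the real Query type.
const TEMPORAL = /\b(is|was|were|current|now|at the time|during|as of|when did)\b/i;

interface QueryLike { id: string; text: string; as_of_date?: string; }

function temporalErrors(q: QueryLike): string[] {
  const errs: string[] = [];
  if (TEMPORAL.test(q.text) && !q.as_of_date) {
    errs.push(`${q.id}: temporal query is missing as_of_date`);
  }
  return errs;
}
```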

Tier 5 Fuzzy/Vibe (30 queries, hand-authored):
  - Vague recall: "Someone who was a senior engineer at a biotech company..."
  - Trait-based: "The engineer who pushed back on microservices"
  - Cultural/epithet: "Who is known as a 'systems builder' in security?"
  - Abstention bait: "Which Layer 1 project did the crypto guy leave?" (prose
    mentions but never names; good systems abstain)
  - Addresses Codex's circularity critique — vague queries where graph-heavy
    systems shouldn't inherently win.

Tier 5.5 Synthetic Outsider (50 queries, AI-authored placeholder):
  - Clearly labeled author: "synthetic-outsider-v1"
  - Phrasing variety not in the 4 template families:
    * fragment style ("crypto founder Goldman Sachs background")
    * polite/natural ("Can you pull up what we have on...")
    * comparison ("What is the difference between X and Y?")
    * follow-up ("And who else advises Orbit Labs?")
    * typos/misspellings ("adam lopez bioinformatcis")
    * similarity ("Find me someone like Alice Davis...")
    * imperative ("Pull up Alice Davis")
  - Real Tier 5.5 from outside researchers supersedes synthetic via
    PRs to eval/external-authors/ (docs ship in follow-up commit).

N=5 tolerance bands:
  - Default N=5, override via BRAINBENCH_N env var (e.g. BRAINBENCH_N=1 for dev loops)
  - Per-run seeded Fisher-Yates shuffle of page ingest order (LCG seed = run_idx+1)
  - Surfaces order-dependent adapter bugs (tie-break-by-first-seen etc.)
  - Reports mean ± sample-stddev per metric
  - "stddev = 0" is an honest signal that the adapter is deterministic, not a bug.
    LLM-judge metrics (future) will naturally produce non-zero stddev.
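The per-run shuffle above can be sketched like this; the commit only specifies seed = run_idx + 1, so the LCG constants here are the common Numerical Recipes values, chosen for illustration:

```typescript
// Seeded Fisher-Yates shuffle driven by a small LCG, as used for the per-run
// page-ingest-order shuffle. Deterministic for a fixed seed; never mutates
// the input array.
function lcg(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

function seededShuffle<T>(items: T[], seed: number): T[] {
  const out = items.slice();
  const rand = lcg(seed);
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}
```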

Validation: all 80 Tier 5 + 5.5 queries pass validateAll(). 24 validator
unit tests pass.

Next commit: world.html contributor explorer (Phase 3).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(eval): Phase 3 world.html explorer + eval:* CLI surface

Contributor DX magical moment. Static HTML explorer renders the full
canonical world (240 entities) as an explorable tree, opens in any browser,
zero install. Every string HTML-entity-encoded (XSS-safe — direct vuln
class per eng pass 2, confidence 9/10).

Added:
  eval/generators/world-html.ts         — renderer (~240 LOC; single-file
                                          HTML with inline CSS + minimal JS)
  eval/generators/world-html.test.ts    — 16 tests (XSS + rendering correctness)
  eval/cli/world-view.ts                — render + open in default browser
  eval/cli/query-validate.ts            — CLI wrapper for queries/validator
  eval/cli/query-new.ts                 — scaffold a query template

Modified:
  package.json                          — 7 new eval:* scripts
  .gitignore                            — ignore generated world.html

package.json scripts shipped:
  bun run test:eval                 all eval unit tests (57 pass)
  bun run eval:run                  full 4-adapter N=5 side-by-side
  bun run eval:run:dev              N=1 fast dev iteration
  bun run eval:world:view           render world.html + open in browser
  bun run eval:world:render         render only (CI-friendly, --no-open)
  bun run eval:query:validate       validate built-in T5+T5.5 (or a file path)
  bun run eval:query:new            scaffold a new Query JSON template
  bun run eval:type-accuracy        per-link-type accuracy report

XSS safety:
  escapeHtml() encodes the 5 critical chars (& < > " '). Tested directly
  with representative Opus-generated attacks:
    <img src=x onerror=alert('xss')>  → &lt;img src=x onerror=alert(&#39;xss&#39;)&gt;
    <script>fetch('/steal')</script>  → &lt;script&gt;fetch(&#39;/steal&#39;)&lt;/script&gt;
  Ledger metadata (generated_at, model) also escaped — covers the less
  obvious attack surface where Opus could emit tag-like content into the
  metadata file.
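A minimal escapeHtml covering the five critical characters — a sketch consistent with the examples above, not necessarily the shipped code:

```typescript
// Encodes the 5 HTML-critical characters. Ampersand goes first so already-
// encoded output is not double-escaped further down the chain. The &#39;
// form for single quotes matches the examples in this commit message.
function escapeHtml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}
```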

world.html structure:
  - Left rail: entities grouped by type with counts (companies, people,
    meetings, concepts), alphabetical within type
  - Right pane: per-entity cards with title + slug + compiled_truth +
    timeline + canonical _facts as collapsed JSON
  - URL fragment deep-links (#people/alice-chen)
  - Sticky rail on desktop; responsive stack on mobile
  - Vanilla JS for active-link highlighting on scroll (no framework)

Generated file: ~1MB for 240 entities (full prose). Gitignored; rebuild
with `bun run eval:world:view`. Regeneration is ~50ms.

Contributor TTHW (Tier 5.5 query authoring):
  1. bun run eval:world:view                         # see entities
  2. bun run eval:query:new --tier externally-authored --author "@me"
  3. edit template with real slug + query text
  4. bun run eval:query:validate path/to/file.json
  5. submit PR

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(eval): Phase 3 contributor docs + CI workflow for eval/ tests

Ships the contributor-onboarding surface promised in the plan. With this
commit, external researchers have a self-serve path from clone to PR in
under 5 minutes.

Added:
  eval/README.md                                — 5-minute quickstart,
                                                  directory map, methodology
                                                  one-pager, adapter scorecard
  eval/CONTRIBUTING.md                          — three contributor paths:
                                                    1. Write Tier 5.5 queries
                                                    2. Submit an external adapter
                                                    3. Reproduce a scorecard
  eval/RUNBOOK.md                               — operational troubleshooting:
                                                  generation failures, runner
                                                  failures, query validation,
                                                  world.html rendering, CI
  eval/CREDITS.md                               — contributor attribution
                                                  (synthetic-outsider-v1 labeled
                                                  as placeholder; real submissions
                                                  land here)
  .github/PULL_REQUEST_TEMPLATE/tier5-queries.md — structured PR template
                                                  for Tier 5.5 submissions
  .github/workflows/eval-tests.yml              — CI: validates queries,
                                                  runs all eval unit tests,
                                                  renders world.html on every PR
                                                  touching eval/** or
                                                  src/core/link-extraction.ts

CI scope (intentionally narrow):
  - Triggers on paths: eval/**, src/core/link-extraction.ts, src/core/search/**
  - Runs: bun run eval:query:validate (80 queries), test:eval (57 tests),
          eval:world:render (smoke-test the HTML renderer)
  - Pinned actions by commit SHA (matches existing .github/workflows/test.yml)
  - Zero API calls — all Opus/OpenAI paths stubbed or skipped in unit tests
  - Fast: ~30s total wall clock

Contributor TTHW (clone → first merged PR):
  - Path 1 (Tier 5.5 queries): ~5 min
  - Path 2 (external adapter): ~30 min for a simple adapter
  - Path 3 (reproduce scorecard): ~15 min wall clock (N=5 run)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(eval): teardown PGLite engines so bun run eval:run exits 0

The multi-adapter runner left PGLite engines alive after each run.
GbrainAfterAdapter and HybridNoGraphAdapter both instantiate a
PGLiteEngine in init() but never disconnect it; Bun's shutdown path
exits with code 99 when embedded-Postgres workers outlive main().

Added optional `teardown?(state)` to the Adapter interface, implemented
it on both engine-backed adapters, and call it from scoreOneRun after
the N=5 loop. ripgrep-bm25 and vector-only hold no DB resources and
don't need a teardown.

Verified: gbrain-after, hybrid-nograph, ripgrep-bm25, vector-only all
exit 0 at N=1. Full test:eval passes (57 tests). No metric change.

* docs(bench): 2026-04-19 multi-adapter scorecard

Reproducibility run of the 4-adapter side-by-side at commit b81373d
(branch garrytan/gbrain-evals). N=5, 240-page corpus, 145 relational
queries from world-v1.

Headline: gbrain-after 49.1% P@5 / 97.9% R@5. hybrid-nograph 17.8% /
65.1%. ripgrep-bm25 17.1% / 62.4%. vector-only 10.8% / 40.7%. All
adapters deterministic (stddev = 0 across the 5 runs per adapter).

Matches the scorecard in eval/README.md byte-for-byte for the three
deterministic adapters; hybrid-nograph matches within tolerance bands.

* docs(bench): 2026-04-19 gbrain v0.11.1 vs v0.12.1 regression comparison

Runs the same eval harness against two gbrain src/ trees on the same
240-page corpus and 145 queries. Patches the v0.11 copy's gbrain-after
adapter to use getLinks/getBacklinks (v0.11 has no traversePaths)
with identical direction+linkType semantics.

gbrain-after P@5 22.1% -> 49.1% (+27 pts); R@5 54.6% -> 97.9% (+43
pts); correct-in-top-5 99 -> 248 (+149). hybrid-nograph flat at 17.8%
/ 65.1% on both (v0.12 didn't touch hybridSearch / chunking).

Driver is extraction quality, not graph presence: v0.12 emits 499
typed links (v0.11: 136, x3.7) and 2,208 timeline entries (v0.11: 27,
x82) on the same 240 pages. Sharpens the April-18 "graph layer does
the work" claim -- on v0.11 that architecture only beat hybrid-nograph
by 4.3 points; the 31-point lead in the multi-adapter scorecard comes
from graph + high-quality extract in combination.

* feat(eval): BrainBench v1 portable JSON schemas + gold templates

Adds the v1→v2 contract boundary for BrainBench. 6 JSON schemas at
eval/schemas/ pin the shape of every artifact a stack must emit to be
scorable: corpus-manifest, public-probe (PublicQuery with gold stripped),
tool-schema (12 read + 3 dry_run tools, 32K tool-output cap), transcript,
scorecard (N ∈ {1, 5, 10}), evidence-contract (structured judge input).

8 gold file templates at eval/data/gold/ scaffold the sealed qrels,
contradictions, poison items, and citation labels. Empty-but-valid
skeletons; Day 3b fills them with real content once the amara-life-v1
corpus generates.

48 tests validate schema syntax, $schema/$id/title/type headers,
round-trip stability, and cross-schema coherence (new Page types in
manifest enum, tool counts, token cap, N enum).

When v2 ports to Python + Inspect AI + Docker, these schemas are the
boundary. Same fixtures, same tool contracts, zero rework.

* feat(eval): amara-life-v1 skeleton + Page.type enum for email/slack/cal/note

Deterministic procedural generator for the twin-amara-lite fictional-life
corpus (BrainBench v1 Cat 5/8/9/11 target). 15 contacts picked from
world-v1, 50 emails + 300 Slack messages across 4 channels + 20 calendar
events + 8 meeting transcripts + 40 first-person notes. Mulberry32 PRNG
gives byte-identical output under reseed.

Plants 10 contradictions + 5 stale facts + 5 poison items + 3 implicit
preferences at deterministic positions. Fixture_ids are unique across the
corpus so gold/contradictions.json + gold/poison.json + gold/implicit-
preferences.json can cross-reference by stable ID.

PageType extended in both src/core/types.ts and eval/runner/types.ts to
include email | slack | calendar-event | note (+ meeting on the production
side). src/core/markdown.ts inferType() heuristics updated for the new
one-slash slug prefixes (emails/em-NNNN, slack/sl-NNNN, cal/evt-NNNN,
notes/YYYY-MM-DD-topic, meeting/mtg-NNNN).

17 tests cover counts (50/300/20/8/40), perturbation counts (exact
10/5/5/3), seed determinism + divergence, slug regex conformance (matches
eval/runner/queries/validator.ts:131 one-slash rule), unique fixture_ids,
amara-in-every-email invariant, calendar dtstart < dtend, and Amara-is-
attendee on every meeting.

* feat(eval): amara-life-gen.ts with structured cache key + $20 cost gate

Opus prose expansion of the amara-life-v1 skeleton. Per-item structured
cache key = sha256({schema_version, template_id, template_hash, model_id,
model_params, seed, item_spec_hash}). Prompt-template tweak changes
template_hash; only those items regenerate. Schema bump changes
schema_version; everything invalidates cleanly. Interrupted runs resume
from the last cached item; zero re-spend.
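The structured cache key can be sketched as canonical-JSON hashing. Field names come from this commit; the canonicalization implementation is illustrative, not the repo's:

```typescript
import { createHash } from "crypto";

// Canonicalize by sorting object keys recursively so key order never changes
// the hash, then sha256 the result. Any field change (schema_version bump,
// template tweak, different seed) yields a different key.
function canonical(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonical).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const entries = Object.keys(value as object).sort()
      .map(k => `${JSON.stringify(k)}:${canonical((value as Record<string, unknown>)[k])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

function cacheKey(parts: {
  schema_version: number; template_id: string; template_hash: string;
  model_id: string; model_params: object; seed: number; item_spec_hash: string;
}): string {
  return createHash("sha256").update(canonical(parts)).digest("hex");
}
```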

Cost-gated at $20 hard-stop with Anthropic input/output pricing tracking.
Dry-run mode (--dry-run) executes the full pipeline with stub bodies for
smoke-testing the I/O layout without LLM spend. --max N caps items per
type for debugging. --force ignores cache.

Writes per-format outputs under eval/data/amara-life-v1/:
  inbox/emails.jsonl (one email per line with body_text appended)
  slack/messages.jsonl (one message per line with text appended)
  calendar.ics (RFC-5545 VEVENT format, templated — no LLM)
  meetings/<id>.md (transcript with YAML frontmatter)
  notes/<YYYY-MM-DD-topic>.md (first-person journal)
  docs/*.md (6 reference docs, templated — no LLM)
  corpus-manifest.json (per eval/schemas/corpus-manifest.schema.json,
    including per-item content_sha256 and generator_cache_key)

Perturbation hints (contradiction, stale-fact, poison, implicit-
preference) flow through the prompt so Opus weaves the specific claim
into each item's body. Poison items are hand-crafted to include
paraphrased prompt-injection attempts (not literal 'IGNORE ALL
PREVIOUS' — defense is the structured-evidence judge contract at
Day 5, not regex redaction).

New package.json scripts:
  eval:generate-amara-life       # real run (~$12 Opus estimated)
  eval:generate-amara-life:dry   # smoke test, zero spend

test:eval extended to include test/eval/. 10 cache-key tests cover
determinism, invalidation across every field of the key, canonical JSON
stability under object-key reorder, and per-skeleton-item spec-hash
uniqueness (50 distinct hashes for 50 distinct emails).

* chore: bump version and changelog (v0.15.0)

Resets package.json from stale 0.13.1 to 0.15.0 (matches VERSION).
v0.14.0 shipped with the stale package.json version; this sync catches
that up and moves to v0.15.0 in one step.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update CLAUDE.md + README + eval/README for v0.15.0 BrainBench

CLAUDE.md: adds a full BrainBench section to the Key Files list — 14 new
entries covering eval/README.md, multi-adapter.ts, types.ts (with new
PublicPage/PublicQuery), adapters/, queries/, type-accuracy.ts,
adversarial.ts, all.ts, world.ts/gen.ts, world-html.ts, amara-life.ts,
amara-life-gen.ts, schemas/, data/world-v1/, data/gold/,
data/amara-life-v1/, docs/benchmarks/, and test/eval/. Adds 3 new
test/eval/ lines to the unit-tests catalog.

eval/README.md: file tree updated to reflect v0.15 additions —
data/amara-life-v1/, data/gold/, schemas/, generators/amara-life.ts +
amara-life-gen.ts, runner/all.ts + adversarial.ts.

README.md: updates hero benchmark numbers (L7 intro + L353 mid-page)
from v0.10.5 PR #188 numbers (R@5 83→95, P@5 39→45) to current v0.12.1
4-adapter numbers (P@5 49.1% · R@5 97.9% · +31.4 pts vs hybrid-nograph).
Adds the v0.11→v0.12 regression comparison as the secondary reference.
Deeper-section tables (L422+) labeled "BrainBench v1 (PR #188)" are
preserved as historical data.

CHANGELOG is untouched — /ship already wrote the v0.15.0 entry.
TODOS.md is untouched — Cat 5/6/8/9/11 remain open (only foundations
shipped in v0.15.0; Cat runners ship in v1 Complete follow-ups).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(eval): Day 4 — pdf-parse + flight-recorder + tool-bridge (dry_run + expand:false)

Three infrastructure modules for BrainBench v1 Complete Cats 5/8/9/11.

**eval/runner/loaders/pdf.ts** — Thin pdf-parse wrapper. Lazy import keeps
pdf-parse out of the module-load path (avoids library debug-mode side
effects). Size cap (50MB default), encryption detection, structured error
classes (PdfEncryptedError, PdfTooLargeError, PdfParseError). Only Cat 11
multimodal will import this; production bundle never sees pdf-parse.

**eval/runner/tool-bridge.ts** — Maps 12 read-only operations from
src/core/operations.ts to Anthropic tool definitions + adds 3 dry_run write
tools. Three structural invariants enforced:

  1. No hidden LLM calls. `operations.query` defaults expand=true which
     routes through expansion.ts → Haiku. Bridge strips `expand` from the
     query tool's input schema AND executor hard-sets expand:false. Zero
     nested Haiku calls in any agent trace.

  2. Mutating ops throw ForbiddenOpError. put_page, add_link, delete_page,
     etc. are rejected by name. Agents record intent via dry_run_put_page /
     dry_run_add_link / dry_run_add_timeline_entry which persist to the
     flight-recorder without mutating the engine. This is how Cat 8's
     back_link_compliance + citation_format metrics measure anything with
     a read-only tool surface.

  3. Poison tagged by the bridge, not the judge. Every tool result is
     scanned for slugs matching gold/poison.json fixtures. Matched
     fixture_ids flow into tool_call_summary.saw_poison_items for the
     structured-evidence judge contract. Judge never reads raw tool
     output — Section-3 defense against paraphrased prompt injections
     (poison payloads never reach the judge model at all).

32K-token cap (~128K chars) with "…[truncated]" suffix.
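The cap itself is simple; a sketch assuming the ~4-chars-per-token approximation implied by the 128K-char figure (constant name hypothetical):

```typescript
// Caps tool output at ~32K tokens via the ~4 chars/token approximation and
// appends the truncation marker from the commit message.
const MAX_TOOL_OUTPUT_CHARS = 32_000 * 4;

function capToolOutput(s: string, maxChars = MAX_TOOL_OUTPUT_CHARS): string {
  return s.length <= maxChars ? s : s.slice(0, maxChars) + "…[truncated]";
}
```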

**eval/runner/recorder.ts** — Per-run flight-recorder bundle emitter. Full
6-artifact bundle (transcript.md, brain-export.json, entity-graph.json,
citations.json, scorecard.json, judge-notes.md) when the adapter provides
an AdapterExport; 3-artifact fallback (transcript + scorecard +
judge-notes) otherwise. Atomic writes via tmp+rename. Collision-safe:
duplicate directory names get incremental -2, -3 suffix. `safeStringify`
handles circular references without throwing and JSON-serializes
Float32Array embeddings.

**package.json:** adds pdf-parse@2.4.5 as a devDependency. Scoped to eval/
use only; production gbrain binary unaffected.

**Tests:** 63 new — 30 tool-bridge, 21 recorder, 12 pdf-loader. All pass.
Fake engine uses a Proxy with `__default__` fallback so poison-matching
tests don't have to mock the exact engine method name that each operation
calls (some route via searchKeyword, others via getPage — proxy handles
both uniformly).

Total eval suite now: 132 pass, 0 fail, 923 expect() calls.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(eval): Day 5 — agent adapter + judge with structured evidence contract

Two modules that together wire Cat 8 / Cat 9 / Cat 5 end-to-end scoring.

**eval/runner/judge.ts** — Haiku 4.5 via tool-use `score_answer`. Input is
the structured JudgeEvidence contract (fix #16 from the plan's codex
review): probe + final_answer_text + evidence_refs + tool_call_summary +
ground_truth_pages + rubric. Raw tool output NEVER reaches the judge —
that's the Section-3 defense against paraphrased prompt-injection payloads
in gold/poison.json.

Retry policy: one retry on malformed tool_use response. If the second
attempt is still malformed, score the probe as `judge_failed` (all scores
0, verdict=fail) so the run still completes.

Aggregation: weighted mean across rubric criteria. Canonical thresholds
(pass at mean ≥3.5, fail below 2.5, partial in between) — the judge can
propose a verdict, but the computed verdict from the weighted mean is what
the scorecard records.
This prevents the model from inflating or deflating its own verdict.

Score values are clamped to 0-5 on parse even if the model returns out of
range. `assertNoRawToolOutput(evidence)` is a regression guard that
returns the list of forbidden fields (tool_result, raw_transcript, etc.)
if any leak into the evidence contract.
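The aggregation and verdict logic can be sketched as follows — a minimal stand-in for judge.ts, with the boundary placement (≥3.5 pass, <2.5 fail) as described above:

```typescript
// Weighted mean across rubric criteria with per-score clamping to 0-5, plus
// the canonical verdict thresholds. Only this computed verdict is recorded,
// regardless of what the judge model proposes.
interface CriterionScore { score: number; weight: number; }

const clamp = (x: number) => Math.min(5, Math.max(0, x));

function weightedMean(scores: CriterionScore[]): number {
  const totalW = scores.reduce((s, c) => s + c.weight, 0);
  if (totalW === 0) return 0;
  return scores.reduce((s, c) => s + clamp(c.score) * c.weight, 0) / totalW;
}

function verdict(scores: CriterionScore[]): "pass" | "partial" | "fail" {
  const mean = weightedMean(scores);
  if (mean >= 3.5) return "pass";
  if (mean < 2.5) return "fail";
  return "partial";
}
```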

**eval/runner/adapters/claude-sonnet-with-tools.ts** — The agent adapter.
Implements `Adapter` interface minimally: `init()` spins up PGLite and
seeds it, `query()` throws because the adapter is Cat 8/9-only and emits
a final-answer text, not a RankedDoc[]. Retrieval scorecard stays at 4
adapters.

`runAgentLoop(probeId, text, state, config)` drives the multi-turn loop:
Sonnet → tool_use → tool-bridge.executeTool → tool_result → back to
Sonnet. Turn cap 10. max_tokens 1024. System prompt (brain-first iron
law, citation format, amara context) is cached via cache_control.
Exponential backoff on rate-limit errors (1s, 2s, 4s).

Emits a `Transcript` per eval/schemas/transcript.schema.json — consumed
directly by recorder.ts for the flight-recorder bundle.

`brain_first_ordering` classifies Cat 8's flagship metric: did the agent
call search/get_page BEFORE producing the final answer? The `no_brain_calls`
case (agent answers from general knowledge without ever hitting the brain)
is the compliance failure to surface.

ForbiddenOpError + UnknownToolError from the bridge are caught in the
agent loop and surfaced as tool_result with is_error=true — keeps the
loop going and preserves full audit trail for the judge.

**Tests (35 new):** judge (23) — happy path, retry, fallback, evidence
contract sanitization, rendered prompt does not contain raw tool_result
text, verdict thresholds, score clamping, weighted mean with mixed
weights, parseToolUse rejects malformed input. agent-adapter (12) —
Adapter.query() throws, init() seeds PGLite, end-to-end tool loop with
stubbed Sonnet, turn cap exhaustion, mutating-op rejection surfaces as
tool_result error, extractSlugs regex.

All 12 agent tests take ~23s because PGLite runs 13 schema migrations per
test; sharing one engine across tests was rejected so that each test stays
isolated.

Total eval suite now: 167 pass, 0 fail.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(eval): Day 6 — adversarial-injections + Cat 6 prose-scale + Cat 11 multi-modal

Three modules that together cover BrainBench v1 Cat 6 (prose-scale
extraction fidelity) and Cat 11 (multi-modal ingest fidelity).

**eval/runner/adversarial-injections.ts** — 6 deterministic content
transforms shared by Cat 10 (adversarial.ts, 22 hand-crafted cases) and
Cat 6 (prose-scale variants). Each injection produces a modified content
string + a structured GoldDelta describing what the extractor MUST and
MUST NOT produce. Kinds:
  - code_fence_leak — fake [X](people/fake) inside ``` fence, must NOT extract
  - inline_code_slug — `people/fake` in backticks, must NOT extract
  - substring_collision — "SamAI" near real `people/sam`, exactly one link
  - ambiguous_role — "works with" vs "works at", downgrade type to mentions
  - prose_only_mention — strip markdown link syntax, bare name → mentions only
  - multi_entity_sentence — pack 4+ entities into one clause, extract all

Mulberry32 PRNG keeps variant generation deterministic under fixed seed.
Codex flagged the original plan's wording ("extract injection engine from
adversarial.ts") as overstated — adversarial.ts is a static case list,
not a reusable engine. This module is NEW code.
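Mulberry32 is a well-known public-domain 32-bit PRNG; a reference sketch of the standard algorithm (not copied from this repo):

```typescript
// Mulberry32: tiny 32-bit PRNG with good statistical quality for fixtures.
// A fixed seed always reproduces the same sequence, which is what makes
// variant generation deterministic.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t = (t + Math.imul(t ^ (t >>> 7), t | 61)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}
```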

**eval/runner/cat6-prose-scale.ts** — Runner. Loads world-v1, applies all
6 injection kinds to sampled base pages (default 50 variants per kind ×
6 kinds = 300 variants), runs extractPageLinks on each, compares to gold
delta. Emits per-kind + overall metrics (precision, recall, F1,
code_fence_leak_rate, substring_fp_rate, pages_with_links_coverage,
mean_links_per_page). **v1 verdict is always "baseline_only"** — no
gating threshold per codex fix #9 (current extractor residuals make
>0.80 unreachable; v1 records a baseline, regression guard triggers on
drop below it).

**eval/runner/cat11-multimodal.ts** — PDF + HTML + audio runners.
Fixtures load from eval/data/multimodal/<modality>/fixtures.json
manifests; each modality skips gracefully when manifest missing or
(audio) when neither GROQ_API_KEY nor OPENAI_API_KEY is set. Metrics:
  - PDF: char-level similarity via Levenshtein + optional entity_recall
  - HTML: word-recall over normalized tokens (multiset semantics)
  - Audio: WER (word error rate) via Levenshtein on word sequences
Fixtures are NOT committed; a future eval:fetch-multimodal script will
download them hash-verified from public sources (arXiv CC-licensed
papers, Wikipedia CC-BY-SA, Common Voice CC0).
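The three distance metrics share a Levenshtein core; a WER sketch (illustrative, not cat11-multimodal.ts itself — charSimilarity runs the same recurrence over characters instead of words):

```typescript
// Word error rate: Levenshtein edit distance over word sequences divided by
// the reference length. Generic over element type so the same core serves
// character-level similarity.
function levenshtein<T>(a: T[], b: T[]): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost);
    }
  }
  return dp[a.length][b.length];
}

function wer(reference: string, hypothesis: string): number {
  const ref = reference.split(/\s+/).filter(Boolean);
  const hyp = hypothesis.split(/\s+/).filter(Boolean);
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;
  return levenshtein(ref, hyp) / ref.length;
}
```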

Injectable audio transcriber (`opts.transcribe`) means tests don't need
GROQ/OpenAI keys — stubbed transcriptions exercise the WER math path
directly.

**Tests (60 new):** adversarial-injections (19) — per-kind assertions +
dispatcher coverage + slug regex conformance; cat6 (12) — variant
determinism, scoreVariant shape, aggregate per-kind + overall metrics,
corpus resolver slug rules; cat11 (29) — charSimilarity / wordRecall /
wer math, htmlToText strips scripts + decodes entities, HTML modality
with real fixtures, audio modality gracefully skips without key + uses
stub transcriber correctly.

All 60 tests pass in 48ms + 41ms.
Total eval suite now: 227 pass, 0 fail.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(eval): Day 7 — Cat 5 provenance runner + structured classify_claim judge

**eval/runner/cat5-provenance.ts** — BrainBench Cat 5 scoring. Samples
claims from gbrain brain-export and classifies each against its source
material via a dedicated Haiku judge (classify_claim tool with a
three-label enum: supported | unsupported | over-generalized).

Separate from judge.ts by design: Cat 5 is a single three-way
classification per claim, not a weighted rubric. Rather than overload
judge.ts with a mode switch, Cat 5 has its own tool definition
(CLASSIFY_CLAIM_TOOL) and prompt. The retry-once pattern, $20 cost gate
semantics, and structured parsing are mirrored from judge.ts so failures
look the same across Cats.

Metric: `citation_accuracy` = fraction where predicted label equals
gold expected_label. Threshold (informational): >0.90 per design-doc
METRICS.md. v1 ships with `enableThreshold: false` so the verdict is
always baseline_only — we don't have hand-authored gold claims yet, and
codex flagged that threshold gating should wait until the amara-life-v1
corpus + gold file authoring lands in Day 3b.

runCat5 uses a bounded-concurrency worker pool (default 4) to respect
Haiku rate limits across 100+ claim batches. Evidence pages are looked
up by slug from a caller-provided pagesBySlug map — missing pages don't
crash, they just pass an empty source list to the judge (correct
behavior for genuinely unsupported claims).

**Tests (23):** classifyClaim happy/retry/fallback paths with stubbed
Haiku, aggregate accuracy math, threshold gating (pass/fail vs
baseline_only), runCat5 concurrency + missing-page handling,
renderClaimPrompt embeds claim + sources correctly, parseClassification
rejects invalid enum values + plain-text responses.

Total eval suite now: 250 pass, 0 fail.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(eval): Day 8 — Cat 8 skill compliance + Cat 9 end-to-end workflows

**eval/runner/cat8-skill-compliance.ts** — Deterministic, judge-free Cat 8
scoring. Replays inbound signals through the agent adapter (Day 5) and
extracts four iron-law metrics directly from the tool-bridge state:

  - brain_first_compliance: agent called search/get_page BEFORE producing
    its final answer. Non-compliance = hallucinating from general knowledge.
  - back_link_compliance: every dry_run_put_page intent has at least one
    markdown [Name](slug) back-link in its compiled_truth.
  - citation_format: timeline entries use canonical `- **YYYY-MM-DD** |
    Source — Summary`; long final answers cite at least one slug.
  - tier_escalation: simple probes use light tooling (≥1 brain call);
    complex probes require ≥2 brain calls or a dry_run write when
    expects_dry_run_write is set.

No judge call required — everything is computable from
`tool_bridge_state.made_dry_run_writes` + `count_by_tool` + final_answer
regex. Fast, deterministic, reproducible.

Bounded concurrency (p-limit style) worker pool at default 4 to keep
Sonnet rate limits comfortable across 100-probe batches.

**eval/runner/cat9-workflows.ts** — Rubric-graded Cat 9. 5 canonical
workflows (meeting_ingestion, email_to_brain, daily_task_prep, briefing,
sync) × ~10 scenarios each. Each scenario runs through the agent adapter,
then judge.ts scores the answer against a per-scenario rubric.

`buildEvidence(scenario, agentResult, pagesBySlug)` composes the
JudgeEvidence contract: resolves ground_truth_slugs to full
GroundTruthPage[] from a slug-map, pulls tool_call_summary directly from
tool_bridge_state (no raw tool_result content — Section-3 defense),
attaches rubric from the scenario.

Per-workflow rollup: each workflow gets its own pass_rate so the verdict
can fail one workflow without failing the whole Cat. Overall verdict
requires every populated workflow's pass_rate ≥ threshold (default 0.80)
when enableThreshold=true.

Both Cats default to verdict=baseline_only in v1 per codex fix #9: real
thresholds return after 10-probe Haiku-vs-hand-score calibration (κ > 0.7)
runs against the Day 3b amara-life-v1 corpus.

**Tests (23):** Cat 8 per-metric scorer unit tests covering every branch
(brain_first ordering, back-link compliance on mixed writes, long vs
short answer citation requirement, tier escalation for simple/complex/
writey probes, finalAnswerCiteCount dedups across syntaxes). Cat 9
buildEvidence contract shape — evidence_refs flow from agent, missing
slugs skip gracefully, no raw_transcript/tool_result leakage to judge.
Cat 9 runCat9 integration with stubbed agent + mixed-verdict judge
produces fractional pass rates correctly.

Total eval suite now: 273 pass, 0 fail.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(eval): Day 9 — sealed qrels via PublicPage + PublicQuery at adapter boundary

Codex fixes #1, #2, #3 from the plan's outside-voice review. Enforcement
shifts from SOFT-VIA-TYPE-COMMENT to SOFT-VIA-SANITIZED-OBJECT. Hard
enforcement via process isolation waits for BrainBench v2 Docker sandbox.

**eval/runner/types.ts** additions:
  - `PublicPage = Pick<Page, 'slug' | 'type' | 'title' | 'compiled_truth' |
    'timeline'>` — the exact 5 fields adapters should see. No _facts.
    No frontmatter (a known hiding spot for accidental gold leaks).
  - `sanitizePage(p: Page): PublicPage` — returns a NEW object with the 5
    fields only. Cannot be bypassed by `(page as any)._facts` because the
    field does not exist on the sanitized object.
  - `PublicQuery = Omit<Query, 'gold'>` — strips the gold field.
  - `sanitizeQuery(q: Query): PublicQuery` — enumerates public fields
    explicitly (not spread+delete) so no prototype weirdness leaves gold
    reachable.
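
The sealing pattern is a plain allowlist copy. A minimal sketch with stand-in
types (the real Page/Query shapes live in eval/runner/types.ts and carry more
fields):

```typescript
// Stand-in shapes; the real Page type lives in eval/runner/types.ts.
type Page = {
  slug: string; type: string; title: string;
  compiled_truth: string; timeline: string[];
  _facts?: unknown;       // gold: must never reach an adapter
  frontmatter?: unknown;  // known hiding spot for accidental gold leaks
};
type PublicPage = Pick<Page, "slug" | "type" | "title" | "compiled_truth" | "timeline">;

// Allowlist copy: a NEW object with exactly the five public fields.
// `(sanitized as any)._facts` is undefined because the field never existed
// on the returned object, so no cast can resurrect it.
function sanitizePage(p: Page): PublicPage {
  return {
    slug: p.slug,
    type: p.type,
    title: p.title,
    compiled_truth: p.compiled_truth,
    timeline: p.timeline,
  };
}
```

The same enumerate-explicitly idea (rather than spread+delete) backs
sanitizeQuery.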

**eval/runner/multi-adapter.ts** — scoreOneRun now calls sanitizePage /
sanitizeQuery before passing to adapter.init / adapter.query. The scorer
retains the full Query shape (including gold.relevant) for precision /
recall computation. Adapter signatures unchanged — the sealing is at the
OBJECT level, not the type level. This keeps existing adapters
(ripgrep-bm25, vector-only, hybrid-nograph, gbrain-after) binary-compatible.
Verified: no existing adapter reads q.gold or page._facts, so the change
is safe without further adapter updates.

**test/eval/sealed-qrels.test.ts** (17 tests):
  - sanitizePage strips _facts + frontmatter + arbitrary hidden keys
  - Output has exactly the 5 public keys (deep introspection)
  - Proxy tripwire simulates a malicious adapter: any access to _facts or
    gold throws `sealed-qrels violation`
  - sanitizeQuery retains optional fields (as_of_date, tags, author,
    acceptable_variants, known_failure_modes) but omits undefined ones
  - Honest documentation of the seal's limits: filesystem bypass and
    Proxy attacks would still work in v1; Docker isolation (v2) is the
    real enforcement

Every existing eval test still passes (273 before + 17 sealed-qrels = 290).

Total eval suite now: 290 pass, 0 fail.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(eval): Day 10 — all.ts rewrite + llm-budget + BrainBench N tiers

Final wiring of BrainBench v1 Complete. all.ts now orchestrates the full
Cat catalog (1-12) via a mix of subprocess dispatch (Cats 1, 2, 3, 4, 6,
7, 10, 11, 12 — standalone runners with CLI entry points) and
programmatic invocation (Cats 5, 8, 9 — require runtime inputs that
can't come via CLI flags). Subprocess Cats run concurrently under a
p-limit(2) bound to cap peak memory at ~800MB (two PGLite instances
at ~400MB each).

Cats 5/8/9 show as "programmatic" in the report with a one-line
reference to their `runCatN({...})` harness API. They're deliberately
skipped from the master runner because their inputs (claim catalog,
probe catalog, scenario catalog, pre-seeded agent state, evidence
pagesBySlug) are task-specific and assembled at the caller.

**eval/runner/all.ts** — rewritten:
  - CATEGORIES is a tagged union of SubprocessCategory | ProgrammaticCategory
  - runCatSubprocess spawns Bun with piped stdout/stderr, 10-min timeout
    per Cat (exit code 124 + SIGTERM on timeout; no hung subprocesses)
  - runConcurrently is a bounded worker pool preserving input order
  - buildReport emits the full markdown with per-Cat elapsed times,
    migration-noise filter, and a separate programmatic-only section
  - Honors BRAINBENCH_N (1/5/10 for smoke/iteration/published),
    BRAINBENCH_CONCURRENCY (default 2),
    BRAINBENCH_LLM_CONCURRENCY (default 4, consumed by llm-budget)
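
An order-preserving bounded pool like runConcurrently can be sketched as
follows (illustrative only, not the actual all.ts implementation):

```typescript
// Bounded worker pool that preserves input order in its results.
async function runConcurrently<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;
  const workerCount = Math.min(Math.max(limit, 1), items.length);
  const workers = Array.from({ length: workerCount }, async () => {
    while (next < items.length) {
      const i = next++;                // claim an index (single-threaded JS: no race)
      results[i] = await fn(items[i]); // result lands at its input position
    }
  });
  await Promise.all(workers);
  return results;
}
```

Empty input yields zero workers and an empty result array, matching the
"handles empty input" test case above.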

**eval/runner/llm-budget.ts** — shared LLM rate-limit semaphore. A full
N=10 published scorecard makes ~900 Anthropic calls (150 Cat 8/9 probes
× N=10 + 100 Cat 5 claims × N=10). Without coordination, concurrent
adapters trigger 429s on per-minute limits.

  - LlmBudget class: acquireSlot/releaseSlot + withLlmSlot(fn) wrapper
    that releases on success AND throw (try/finally)
  - getDefaultLlmBudget() singleton reads BRAINBENCH_LLM_CONCURRENCY,
    falls back to 4 on missing/garbage values
  - capacity enforced ≥1 (rejects 0/negative)
  - Double-release is a no-op (guards against upstream double-call bugs)
  - Active + waiting counts exposed for observability / tests
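
A counting semaphore with those properties can be sketched as below (only the
method names come from this commit; the internals are assumptions):

```typescript
// Sketch of an LlmBudget-style counting semaphore.
class LlmBudget {
  private active = 0;
  private waiting: Array<() => void> = [];

  constructor(private readonly capacity: number) {
    if (!Number.isFinite(capacity) || capacity < 1) {
      throw new Error("LlmBudget capacity must be >= 1");
    }
  }

  acquireSlot(): Promise<void> {
    if (this.active < this.capacity) {
      this.active++;
      return Promise.resolve();
    }
    // At capacity: park the caller until a slot frees up.
    return new Promise((resolve) => {
      this.waiting.push(() => { this.active++; resolve(); });
    });
  }

  releaseSlot(): void {
    if (this.active === 0) return; // double-release is a no-op
    this.active--;
    this.waiting.shift()?.();      // hand the freed slot to the next waiter
  }

  // Releases on success AND on throw (try/finally).
  async withLlmSlot<T>(fn: () => Promise<T>): Promise<T> {
    await this.acquireSlot();
    try {
      return await fn();
    } finally {
      this.releaseSlot();
    }
  }
}
```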

**package.json** scripts:
  - eval:brainbench           — default N=5 iteration
  - eval:brainbench:smoke     — N=1 for fast iteration
  - eval:brainbench:published — N=10 for committed baselines
  - eval:cat6 / eval:cat11    — individual new subprocess Cats

**Tests (24):** CATEGORIES catalog enforces the exact Cat-number partition
(subprocess: 1,2,3,4,6,7,10,11,12; programmatic: 5,8,9). runConcurrently
respects the cap (observable via peak in-flight counter), preserves input
order under non-uniform delays, handles empty input. LlmBudget enforces
capacity, releases on throw, honors env var, rejects 0/negative.
buildReport filters migration noise, counts passed/failed/programmatic
correctly, includes every Cat + programmatic-only section.

Full eval suite now: 314 pass, 0 fail (15 test files).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(eval): drop top_p from amara-life-gen Opus params + gitignore _cache/

Two fixes surfaced during the Day 3b real-corpus run against Opus 4.5:

**eval/generators/amara-life-gen.ts** — Current Opus rejects
`temperature` and `top_p` together:
```
400 invalid_request_error: `temperature` and `top_p` cannot both be
specified for this model. Please use only one.
```
top_p=1.0 was a no-op (no nucleus truncation), so removing it has zero
semantic effect. The field is still part of MODEL_PARAMS for the cache
key so any past cache entries (none in v1) would invalidate cleanly
on the next schema version bump.

**.gitignore** — `eval/data/amara-life-v1/_cache/` is runtime Opus
cache (398 files, ~1.6MB). Regenerable from seed; no point in source
control. The corpus itself (inbox/slack/calendar/meetings/notes/docs +
corpus-manifest.json with per-item content_sha256) stays committable
for reproducibility, just the cache directory gets excluded.

Real corpus generation ran cleanly after these two fixes: 398 LLM
calls, 84,424 input / 38,062 output tokens, $4.12 spent (vs $20 cap,
vs $12 estimate). All 418 items produced. Poison fixtures use
subtle paraphrased injection ("for anyone on your team who might be
triaging this thread later…") — exactly the pattern that defeats
regex redaction and requires the structured-evidence judge contract
from Day 5.

Corpus itself stays local (will move to the brainbench sibling repo
during the v0.16 split per the design doc). No eval/data/amara-life-v1/
content landing in this PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump version to 0.18.0

Renumbered from 0.17.0 per the gbrain-versioning slot. Other work is
landing on master around this PR; 0.18 is the slot locked for this
BrainBench v1 Complete release. Also pushed the "brainbench split"
forward reference in the CHANGELOG from v0.18 → v0.19 to match.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: extract BrainBench to sibling gbrain-evals repo

BrainBench lived in this repo through v0.17, which meant every gbrain install
pulled down ~5MB of eval corpus, benchmark reports, and a pdf-parse devDep
that the 99% of users who never run benchmarks don't need.

v0.18 moves the full eval harness, 14 eval test files (314 tests), all
docs/benchmarks scorecards, and the pdf-parse devDep to
github.com/garrytan/gbrain-evals. That repo depends on gbrain via GitHub URL
and consumes it through a new public exports map.

What stays in gbrain:
- Page.type enum extensions (email | slack | calendar-event | note | meeting)
  useful for any ingested format, not just evals
- inferType() heuristics for /emails/, /slack/, /cal/, /notes/, /meetings/
- 11 new public exports covering the gbrain internals gbrain-evals consumes
  (gbrain/engine, gbrain/pglite-engine, gbrain/search/hybrid, etc.) — now
  gbrain's stable third-party contract

What moved:
- eval/ — 4.6MB of schemas, runners, adapters, generators, CLI tools
- test/eval/ — 14 test files, 314 tests
- docs/benchmarks/ — all scorecards and regression reports
- eval:* package.json scripts
- pdf-parse devDep

Tests: 1760 pass, 0 fail, 174 skipped (E2E require DATABASE_URL).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Merge origin/master into garrytan/gbrain-evals

Master landed significant work since this branch was cut (v0.15.x → v0.16.x →
v0.17.0 gbrain dream + runCycle → v0.18.0 multi-source brains → v0.18.1 RLS
hardening). Bumped this branch's version from the claimed 0.18.0 to 0.19.0
because master already owns 0.18.x.

Conflicts resolved:
- VERSION: 0.19.0 (was 0.18.0 on HEAD vs 0.18.1 on master)
- package.json: 0.19.0, kept all 11 eval-facing exports, merged master's
  typescript devDep + postinstall script + test script (typecheck added)
- src/core/types.ts: union of both PageType additions. Master had added
  `meeting | note`; this branch added `email | slack | calendar-event`
  for inbox/chat/calendar ingest. Final enum carries all five.
- CHANGELOG.md: renumbered the BrainBench-extraction entry to 0.19.0 and
  placed it above master's 0.18.1 RLS entry. Tweaked copy ("In v0.17 it
  lived inside this repo" → "Previously it lived inside this repo") to
  stop implying a specific version that never shipped.
- CLAUDE.md: adjusted "BrainBench in a sibling repo" heading from
  (v0.18+) → (v0.19+).
- docs/benchmarks/2026-04-18-minions-vs-openclaw-production.md:
  resolved modify-vs-delete conflict in favor of delete (the extraction).
- scripts/llms-config.ts: dropped the docs/benchmarks/ entry (directory
  no longer exists here; lives in gbrain-evals).
- llms.txt / llms-full.txt: regenerated after the config change.
- bun.lock: accepted master's (master already dropped pdf-parse as a
  drive-by; aligned with our removal).

Tests: 2094 pass, 236 skip, 18 fail. Spot-checked failures — build-llms,
dream, orphans tests all pass in isolation. Failures reproduce only under
full-suite parallel load and are pre-existing master flakiness (matches the
graph-quality flake noted in the earlier summary). Not merge-introduced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump to v0.20.0

Master is now at v0.18.2 (migration hardening + RLS + multi-source brains).
BrainBench extraction ships as v0.20.0 to leave v0.19 free for any in-flight
work on other branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: remove eval-tests workflow (moved to gbrain-evals)

The Eval tests workflow ran `bun run eval:query:validate`, `test:eval`, and
`eval:world:render` — all three scripts moved to the gbrain-evals repo when
BrainBench was extracted in v0.20.0. The workflow has been failing on master
since the split because the scripts no longer exist here.

Eval CI now runs from gbrain-evals's own workflows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(tests): bump PGLite hook timeouts to 60s for parallel-load stability

Six test files spin up PGLite + 20 migrations + git repos in beforeEach/
beforeAll hooks. Under 136-way parallel test file execution, bun's default
5s hook timeout wasn't enough, producing 18 flaky failures that only
reproduced under full-suite parallel load (all 6 files passed in isolation).

Root cause: PGLite.create() + initSchema() takes ~3-5s under idle load, but
under 136 concurrent WASM instantiations the OS thrashes and hooks stall
well past 5s. The bunfig.toml `timeout = 60_000` applies to TESTS, not HOOKS
— bun requires per-hook timeouts as the third beforeEach/beforeAll argument.

Files touched (hook timeouts added, no test logic changed):
- test/dream.test.ts           — 5 describe blocks × before/afterEach
- test/orphans.test.ts         — 1 beforeEach + afterEach
- test/core/cycle.test.ts      — shared beforeAll + afterAll
- test/brain-allowlist.test.ts — beforeAll + afterAll
- test/extract-db.test.ts      — beforeAll + afterAll
- test/multi-source-integration.test.ts — beforeAll + afterAll

Results: 2317 pass / 0 fail (was 2253 pass / 18 fail).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: coverage for inferType() BrainBench corpus dirs

Closes the 1 gap surfaced by Step 7 coverage audit. 9 table-driven
assertions covering the new Page.type branches:
  emails/*.md, email/*.md       -> 'email'
  slack/*.md                    -> 'slack'
  cal/*.md, calendar/*.md       -> 'calendar-event'
  notes/*.md, note/*.md         -> 'note'
  meetings/*.md, meeting/*.md   -> 'meeting'

The fixtures use realistic paths from the amara-life-v1 corpus in the
sibling gbrain-evals repo (em-0001, sl-0037, evt-0042, mtg-0003) so the
test doubles as a contract check between the two repos.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(TODOS): mark BrainBench Cats 5/6/8/9/11 + v0.10.5 inferLinkType as completed

All five BrainBench categories shipped in v0.20.0 (to the gbrain-evals
sibling repo). v0.10.5 inferLinkType regex expansion shipped in-tree.

Remaining P1 BrainBench work: Cat 1+2 at full scale (2-3K pages) —
currently 240 pages in world-v1 corpus.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: sync CLAUDE.md + polish CHANGELOG voice for v0.20.0

CLAUDE.md: add v0.19 commands to key-files list (skillify, skillpack,
routing-eval, filing-audit, skill-manifest, resolver-filenames);
add 8 new test files + openclaw-reference-compat E2E to test index;
repoint the release-summary template's benchmark source from
`docs/benchmarks/[latest].md` to `gbrain-evals/docs/benchmarks/` since
those files now live in the sibling repo.

CHANGELOG voice polish for v0.20.0: replace em dashes with periods,
parens, or ellipses per project style guide. No content changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: regenerate llms-full.txt after CLAUDE.md + CHANGELOG edits (fixes CI)

The v0.20.0 doc-sync commit (9e567bb) added 7 new v0.19 modules to the
CLAUDE.md Key Files index and polished CHANGELOG voice. Both are
includeInFull: true inputs to llms-full.txt but the generator wasn't
re-run, so the drift-detection guard (test/build-llms.test.ts) failed CI.

One-line fix: regenerate. No content changes beyond what the two source
docs already carry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

garrytan added a commit that referenced this pull request May 7, 2026
#696)

* feat: recency boost for search (v0.27.0) — temporal intent auto-detection, date filters, configurable decay

New search pipeline stage: keyword + vector → RRF → cosine re-score → backlink boost → recency boost → dedup

- applyRecencyBoost: hyperbolic decay, two strengths (moderate 30-day halflife, aggressive 7-day halflife)
- Auto-enabled when intent.ts detects temporal/event queries (detail='high')
- Manual override via SearchOpts.recencyBoost (0/1/2)
- Date filtering: afterDate/beforeDate on all three search paths (keyword, keywordChunks, vector)
- getPageTimestamps on both Postgres and PGLite engines
- 15 tests passing (boost math + intent classification)

* v0.29.1 schema: pages.{effective_date, effective_date_source, import_filename, salience_touched_at} + expression index

Migration v38 adds 4 nullable columns to pages and an expression index on
COALESCE(effective_date, updated_at) to support the new since/until date
filters. All additive — no behavior change in the default search path; only
consulted when callers opt into the new salience='on' / recency='on' axes
or pass since/until.

  effective_date         — content date (event_date / date / published /
                           filename-date / fallback). Read by recency boost
                           and date-filter paths only. Auto-link doesn't
                           touch it (immune to updated_at churn).
  effective_date_source  — sentinel for the doctor's effective_date_health
                           check ('event_date' | 'date' | 'published' |
                           'filename' | 'fallback').
  import_filename        — basename without extension, captured at import.
                           Used for filename-date precedence on daily/,
                           meetings/. Older rows leave it NULL.
  salience_touched_at    — bumped by recompute_emotional_weight when
                           emotional_weight changes. Salience window uses
                           GREATEST(updated_at, salience_touched_at) so
                           newly-salient old pages enter the recent salience
                           query.

Index strategy: a partial index on effective_date alone wouldn't help the
COALESCE expression in since/until filters (planner can't use it for the
negative side). The expression index ((COALESCE(effective_date, updated_at)))
is what actually accelerates the filter.

Postgres uses CONCURRENTLY + v14-style pg_index.indisvalid pre-drop guard
for prior failed CONCURRENTLY runs; PGLite uses plain CREATE INDEX. Mirror
of v34's pattern.

src/schema.sql + src/core/pglite-schema.ts updated for fresh installs;
src/core/schema-embedded.ts regenerated via bun run build:schema.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: computeEffectiveDate helper + putPage integration

Pure helper computing a page's effective_date from frontmatter precedence:
  1. event_date (meeting/event pages)
  2. date (dated essays)
  3. published (writing/)
  4. filename-date (leading YYYY-MM-DD in basename)
  5. updated_at (fallback)
  6. created_at (last resort)

Per-prefix override: for daily/ and meetings/ slugs, filename-date jumps
to position 1 — the filename is the user's primary signal there.

Returns {date, source}. The source label powers the doctor's
effective_date_health check to detect "fell back to updated_at" rows that
look populated but are functionally a NULL.

Range validation: parsed value must be in [1990-01-01, NOW + 1 year].
Out-of-range values drop to the next chain element.
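
A sketch of that precedence chain, assuming illustrative field names and a
leading-YYYY-MM-DD filename regex (the real helper may differ in detail):

```typescript
// Sketch of the effective_date precedence walk.
type Frontmatter = { event_date?: string; date?: string; published?: string };

const MIN_MS = Date.parse("1990-01-01");

function inRange(ms: number, nowMs: number): boolean {
  return ms >= MIN_MS && ms <= nowMs + 365 * 86_400_000; // [1990-01-01, NOW + 1y]
}

function computeEffectiveDate(
  fm: Frontmatter,
  filename: string | null,
  slug: string,
  updatedAt: Date | null,
  createdAt: Date,
  nowMs = Date.now(),
): { date: Date; source: string } {
  const filenameDate = filename?.match(/^(\d{4}-\d{2}-\d{2})/)?.[1];
  // daily/ and meetings/ promote filename-date to position 1
  const promoted = slug.startsWith("daily/") || slug.startsWith("meetings/");
  const chain: Array<[string, string | undefined]> = promoted
    ? [["filename", filenameDate], ["event_date", fm.event_date],
       ["date", fm.date], ["published", fm.published]]
    : [["event_date", fm.event_date], ["date", fm.date],
       ["published", fm.published], ["filename", filenameDate]];
  for (const [source, raw] of chain) {
    if (!raw) continue;
    const ms = Date.parse(raw);
    // Parse failures and out-of-range values drop to the next chain element.
    if (!Number.isNaN(ms) && inRange(ms, nowMs)) return { date: new Date(ms), source };
  }
  return { date: updatedAt ?? createdAt, source: "fallback" };
}
```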

Wired into importFromContent + importFromFile. The put_page MCP op derives
filename from slug-tail when no caller-supplied filename is available.

putPage SQL on both engines extended to write the new columns. ON CONFLICT
uses COALESCE(EXCLUDED.x, pages.x) so callers that don't know about the
new columns (auto-link, code reindex) preserve existing values rather than
blanking them. SELECT projection extended to return them; rowToPage threads
them through.

21 unit tests covering: precedence chain default order, per-prefix override,
parse failure fall-through, range validation [1990, NOW+1y], parseDateLoose
shape variants. All pass; typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: backfill orchestrator + library function for existing pages

src/core/backfill-effective-date.ts is the shared library function. Walks
pages in keyset-paginated batches (id > last_id ORDER BY id LIMIT 1000),
runs computeEffectiveDate per row, UPDATEs effective_date +
effective_date_source. Resumable via the `backfill.effective_date.last_id`
checkpoint key in the config table — a killed process can re-run and pick
up without re-doing rows. Idempotent: a full re-walk produces the same
writes.

Postgres-only: SET LOCAL statement_timeout = '600s' per batch. Doesn't
refuse the migration on low session settings (codex pass-2 #16).

src/commands/migrations/v0_29_1.ts is the orchestrator (4 phases mirroring
v0_12_2). Phase A schema (gbrain init --migrate-only), Phase B backfill
(via the library function), Phase C verify (count NULL effective_date),
Phase D record (handled by runner). The library function is reusable from
the gbrain reindex-frontmatter CLI command in the next commit.

import_filename stays NULL for backfilled rows — pre-v0.29.1 imports
didn't capture it. computeEffectiveDate uses the slug-tail when filename
is NULL; daily/2024-03-15 backfilled gets effective_date from the slug.

Registered in src/commands/migrations/index.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: gbrain reindex-frontmatter CLI command

Recovery / explicit-rebuild path for pages.effective_date. Used when:
  - User edited frontmatter dates after import
  - Post-upgrade backfill orchestrator finished but the user wants to
    re-walk a subset (e.g. just meetings/) after fixing some frontmatter
  - Precedence rules change between releases

Thin wrapper over backfillEffectiveDate from commit 3 — same code path
the v0_29_1 orchestrator uses; one source of truth.

Flags mirror reindex-code:
  --source <id>      Scope to one sources row (placeholder; the library
                     doesn't filter by source today, tracked v0.30+)
  --slug-prefix P    Scope to slugs starting with P (e.g. 'meetings/')
  --dry-run          Print what WOULD change, no DB writes
  --yes              Skip confirmation prompt (required for non-TTY non-JSON)
  --json             Machine-readable result envelope
  --force            Re-apply even when computed value matches existing

Wired into src/cli.ts. CLI handles its own engine lifecycle (creates +
disconnects).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: recency-decay map + buildRecencyComponentSql (pure, unused)

src/core/search/recency-decay.ts mirrors source-boost.ts in shape but
drives RECENCY ONLY (per D9 codex resolution). Salience is a separate
orthogonal axis; this map does not feed it.

DEFAULT_RECENCY_DECAY: 10 generic prefixes (no fork-specific names).
  - concepts/      evergreen (halflifeDays=0)
  - originals/     180d × 0.5 (long-tail decay; new essays nudged)
  - writing/       365d × 0.4
  - daily/         14d × 1.5  (aggressive — freshness IS the signal)
  - meetings/      60d × 1.0
  - chat/          7d × 1.0
  - media/x/       7d × 1.5
  - media/articles/ 90d × 0.5
  - people/companies/ 365d × 0.3
  - deals/         180d × 0.5

DEFAULT_FALLBACK: 90d × 0.5 for unmatched slugs.

Override priority: defaults < gbrain.yml recency: < env (GBRAIN_RECENCY_DECAY)
< per-call SearchOpts.recency_decay.

parseRecencyDecayEnv format: comma-separated prefix:halflifeDays:coefficient
triples. Refuses LOUD on parse error (RecencyDecayParseError) — codex
pass-2 #M3 finding. No silent fallback like source-boost's parser.

parseRecencyDecayYaml takes already-parsed YAML; throws on bad shape.

buildRecencyComponentSql in sql-ranking.ts emits a CASE expression with
longest-prefix-first ordering, evergreen short-circuit (literal 0 when
halflifeDays=0 or coefficient=0), and EXTRACT(EPOCH ...) for non-zero
branches. Output: ((CASE WHEN p.slug LIKE 'daily/%' THEN 1.5 * 14.0 /
(14.0 + EXTRACT(EPOCH FROM (NOW() - <dateExpr>))/86400.0) ... END))

Typed NowExpr enum prevents SQL injection (codex pass-1 #5). Tests pass
{ kind: 'fixed', isoUtc } for deterministic output; production NOW().
The 'fixed' branch escapes single quotes via escapeSqlLiteral.

25 unit tests covering: env parser shape, env error cases, yaml parser
shape, merge precedence (defaults < yaml < env < caller), CASE longest-
prefix-first ordering, evergreen short-circuit, NowExpr fixed/now,
single-quote injection defense, empty decayMap fallback path, default
map composition (no fork names, concepts/ evergreen, daily/ aggressive).

Pure module. Zero consumers in this commit; commit 6 wires it into
getRecentSalience, commit 10 wires it into the post-fusion stage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: refactor getRecentSalience to consume buildRecencyComponentSql

Both engines (Postgres + PGLite) now build the salience formula's third
term via buildRecencyComponentSql instead of inlining 1.0 / (1 + days_old).
Parameters: empty decayMap + fallback { halflifeDays: 1, coefficient: 1.0 }.
Math expands to 1 * 1.0 / (1.0 + days_old) = 1 / (1 + days_old) — same
numeric output as v0.29.0.

This is a no-behavior-change refactor preparing for commit 7's recency_bias
param. recency_bias='flat' (default) reproduces v0.29.0 exactly; 'on'
swaps in DEFAULT_RECENCY_DECAY for per-prefix decay.

Single source of truth for the recency math: same builder feeds the
salience query AND (in commit 10) the post-fusion applyRecencyBoost stage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: get_recent_salience gains recency_bias param (default 'flat')

SalienceOpts.recency_bias: 'flat' | 'on' added; default 'flat' preserves
v0.29.0 ranking verbatim. Pass 'on' to opt into per-prefix decay map
(concepts/originals/writing/ evergreen; daily/, media/x/, chat/ aggressive
decay).

When recency_bias='on', the salience query reads
COALESCE(p.effective_date, p.updated_at) instead of bare p.updated_at, so
the recency component is immune to auto-link updated_at churn — old
concepts/ pages just-touched by auto-link don't suddenly look fresh.

Both engines (Postgres + PGLite) wire the param through. resolveRecencyDecayMap()
honors gbrain.yml + GBRAIN_RECENCY_DECAY env at runtime.

MCP op surface: get_recent_salience gains the param with a load-bearing
description teaching the agent when to use 'on' vs 'flat' (current state →
on; mattering across all time → flat).

No silent v0.29.0 behavior change — opt-in only (per D11 codex resolution).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: recompute_emotional_weight writes salience_touched_at; window picks up newly-salient pages

setEmotionalWeightBatch on both engines now bumps salience_touched_at to
NOW() ONLY when the new emotional_weight differs from the existing one
(IS DISTINCT FROM, NULL-safe). No-op writes (same weight) leave the
column alone — preserves "actual change" semantics.

getRecentSalience window changes from
  WHERE p.updated_at >= boundary
to
  WHERE GREATEST(p.updated_at, COALESCE(p.salience_touched_at, p.updated_at)) >= boundary

Closes codex pass-1 finding #4: pages whose emotional_weight just changed
in the dream cycle (because tags or takes shifted) but whose updated_at
is older than the salience window now correctly enter the recent-salience
results. Without this, "Garry just added a take to a 6-month-old page"
stayed invisible to get_recent_salience until the next content edit.

COALESCE(salience_touched_at, p.updated_at) handles pre-v0.29.1 rows
where salience_touched_at is NULL — they fall back to p.updated_at and
behave identically to v0.29.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: merge intent.ts → query-intent.ts; emit 3 suggestions per query

D1 + D4 + D6 + D8: single regex-pass classifier returning
{intent, suggestedDetail, suggestedSalience, suggestedRecency}.

intent + suggestedDetail are v0.29.0 behavior verbatim (legacy intent.ts
deleted; classifyQueryIntent + autoDetectDetail compat shims preserved).

NEW for v0.29.1 — two orthogonal recency-axis suggestions:

  suggestedSalience: 'off' | 'on' | 'strong'
  suggestedRecency:  'off' | 'on' | 'strong'

Resolution rules (per D6 narrow temporal-bound exception):
  - CANONICAL patterns (who is X / what is Y / code / graph) → both off
  - UNLESS an EXPLICIT_TEMPORAL_BOUND also matches (today / right now /
    this week / since X / last N days), in which case temporal-bound wins
  - STRONG_RECENCY (today / right now / this morning / just now) → strong
  - RECENCY_ON (latest / recent / this week / meeting prep / catch up
    / remind me / status update) → on
  - SALIENCE_ON (catch up / remind me / status update / prep me /
    what's going on / what matters) → on
  - default → off for both axes (v0.29.1 prime-directive: pure opt-in)

Salience and recency are TRULY orthogonal (per D9). A query like
"latest news on AI" → recency='on' but salience='off' (the user wants
fresh, not emotionally-weighted). "What's going on with widget-co" →
both on. "Who is X right now" → both 'strong'/'on' (temporal bound
beats canonical 'who is').
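
The two-axis resolution can be sketched roughly like this (the patterns here
are illustrative subsets of the rule names above, not the real regex tables,
and only the two new axes are shown):

```typescript
// Sketch of the orthogonal salience/recency suggestion logic.
type Level = "off" | "on" | "strong";

function classifyRecencyAxes(q: string): { salience: Level; recency: Level } {
  const temporalBound = /\b(today|right now|this week|last \d+ days)\b/i.test(q);
  const canonical = /^(who is|what is)\b/i.test(q);
  // Canonical patterns force both axes off UNLESS an explicit temporal
  // bound also matches, in which case the temporal bound wins.
  if (canonical && !temporalBound) return { salience: "off", recency: "off" };
  const recency: Level = /\b(today|right now|this morning|just now)\b/i.test(q)
    ? "strong"
    : /\b(latest|recent|this week|catch up|remind me|status update)\b/i.test(q)
      ? "on"
      : "off";
  const salience: Level =
    /\b(catch up|remind me|status update|prep me|what matters)\b/i.test(q)
      ? "on"
      : "off";
  return { salience, recency }; // default: both off (pure opt-in)
}
```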

intent.ts deleted; test/intent.test.ts renamed → test/query-intent-legacy.test.ts
(unchanged behavior coverage). New test/query-intent.test.ts adds 21
cases covering all three axes' interactions: canonical wins on bare
'who is', temporal bound overrides, "catch me up" matches with up to 15
chars between, "today" → strong, intent vs recency independence.

Updated callers:
  - src/core/search/hybrid.ts (autoDetectDetail import)
  - test/recency-boost.test.ts (classifyQueryIntent import)
  - test/benchmark-search-quality.ts (autoDetectDetail import)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: applySalienceBoost + applyRecencyBoost + runPostFusionStages wrapper

D9 + codex pass-1 #2 + #3 + pass-2 #4: salience and recency are TRULY
ORTHOGONAL post-fusion stages, both running from ALL THREE hybridSearch
return paths (keyword-only, embed-failure-fallback, full-hybrid).

NEW src/core/search/hybrid.ts exports:
  - applySalienceBoost(results, scores, strength)
      score *= 1 + k * log(1 + score) where k = 0.15 (on) or 0.30 (strong)
      No time component. Pure mattering signal.
  - applyRecencyBoost(results, dates, strength, decayMap, fallback, nowMs?)
      Per-prefix decay factor: 1 + strengthMul * coefficient * halflife / (halflife + days_old)
      strengthMul: 1.0 (on) or 1.5 (strong)
      Evergreen prefixes (halflifeDays=0) skipped (factor 1.0).
      Pure recency signal. Independent of mattering.
  - runPostFusionStages(engine, results, opts)
      Wraps backlink + salience + recency. Called from EACH return path so
      keyless installs and embed failures get the same boost surface as
      the full hybrid path.
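
Numerically, the two boost factors described above reduce to the following (a
sketch of the formulas only; the real functions map over result arrays and
resolve strength from SearchOpts):

```typescript
// Salience: pure mattering signal, no time component.
// k = 0.15 for 'on', 0.30 for 'strong'.
function salienceFactor(salienceScore: number, strength: "on" | "strong"): number {
  const k = strength === "strong" ? 0.30 : 0.15;
  return 1 + k * Math.log(1 + salienceScore);
}

// Recency: pure age signal, per-prefix hyperbolic decay.
// strengthMul = 1.0 for 'on', 1.5 for 'strong'.
function recencyFactor(
  halflifeDays: number,
  coefficient: number,
  daysOld: number,
  strength: "on" | "strong",
): number {
  if (halflifeDays === 0 || coefficient === 0) return 1.0; // evergreen: no boost
  const strengthMul = strength === "strong" ? 1.5 : 1.0;
  return 1 + (strengthMul * coefficient * halflifeDays) / (halflifeDays + daysOld);
}
```

For the daily/ rule (14d × 1.5), a page written today gets factor 2.5 at
strength 'on', decaying toward 1.0 as it ages; an evergreen prefix stays at
exactly 1.0 regardless of age.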

NEW engine methods (composite-keyed for multi-source isolation):
  - getEffectiveDates(refs: Array<{slug, source_id}>): Map<key, Date>
      Returns COALESCE(effective_date, updated_at, created_at). Key format:
      `${source_id}::${slug}`. Mirror of getBacklinkCounts shape.
  - getSalienceScores(refs: Array<{slug, source_id}>): Map<key, number>
      Returns emotional_weight × 5 + ln(1 + take_count). Composite key.

Deprecated (kept for back-compat through v0.29.x):
  - SearchOpts.afterDate / beforeDate (alias for since/until)
  - SearchOpts.recencyBoost: 0|1|2 (alias for recency: 'off'|'on'|'strong')
  - getPageTimestamps (use getEffectiveDates instead)

NEW SearchOpts fields:
  - salience: 'off' | 'on' | 'strong'
  - recency:  'off' | 'on' | 'strong'
  - since:    string (ISO-8601 or relative, replaces afterDate)
  - until:    string (replaces beforeDate)

Resolution: caller-explicit > legacy alias (recencyBoost) > heuristic
(classifyQuery's suggestedSalience / suggestedRecency).

Deleted: src/core/search/recency.ts (added in PR #618, now replaced) +
test/recency-boost.test.ts (its scope is replaced by query-intent.test.ts +
future post-fusion tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Wintermute <wintermute@garrytan.com>

* v0.29.1: query op gains salience + recency + since + until params; PGLite since/until parity

Combines commits 12 + 13 of the plan.

Query op surface (src/core/operations.ts):
  - salience: 'off' | 'on' | 'strong' (with load-bearing description)
  - recency:  'off' | 'on' | 'strong'
  - since:    string (ISO-8601 or relative; replaces deprecated afterDate)
  - until:    string (replaces deprecated beforeDate)

Tool descriptions teach the calling agent:
  - salience axis = mattering, no time component
  - recency axis = age decay, no mattering signal
  - omit either to let gbrain auto-detect from query text via classifyQuery

hybrid.ts maps since/until → afterDate/beforeDate at the engine call
boundary so PR #618's existing engine plumbing keeps working without
rename. Codex pass-1 #10 finding closed.

PGLite engine (codex pass-1 #10): since/until parity added to all three
search methods (searchKeyword, searchKeywordChunks, searchVector). SQL
filter against COALESCE(p.effective_date, p.updated_at, p.created_at)
so date filtering matches user content-date intent (a meeting was on
event_date, not when it got reimported). Filter is applied INSIDE the
HNSW inner CTE in searchVector so HNSW's candidate pool already
excludes out-of-range pages — preserves pagination contract.

This also closes existing cross-engine drift: pre-v0.29.1 Postgres had
afterDate/beforeDate from PR #618; PGLite had nothing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: migration v39 — eval_candidates capture columns for replay reproducibility

D11 codex pass-2 resolution: extend eval_candidates with 7 new nullable
columns so `gbrain eval replay` can reproduce captured runs of agent-explicit
salience + recency choices.

Without these columns, replays of the new axis params drift. The live
behavior depends on the resolved {salience, recency} values; v0.29.0's
schema doesn't capture them.

  as_of_ts            TIMESTAMPTZ  — brain's logical NOW at capture
                                     (replay uses this instead of wall-clock)
  salience_param      TEXT         — what the caller passed (NULL if omitted)
  recency_param       TEXT         — same
  salience_resolved   TEXT         — final value applied
  recency_resolved    TEXT         — same
  salience_source     TEXT         — 'caller' or 'auto_heuristic'
  recency_source      TEXT         — same

All nullable + additive. Pre-v0.29.1 rows stay valid. NDJSON
schema_version STAYS at 1 — consumers ignore unknown fields (codex
pass-1 #C2 dissolves; no cross-repo coordination needed).

ADD COLUMN with no DEFAULT is metadata-only on PG 11+ and PGLite —
instant on tables of any size.

src/schema.sql + src/core/pglite-schema.ts mirror the additions for fresh
installs; src/core/schema-embedded.ts regenerated. eval_capture.ts
populates the new fields in commit 16 (docs + ship).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: doctor checks — effective_date_health + salience_health

effective_date_health: sample-1000 scan detects three classes of
problems (codex pass-1 #5 resolution via the effective_date_source
sentinel column added in commit 1):

  fallback_with_fm_date — page fell back to updated_at even though
                          frontmatter has parseable event_date / date /
                          published. The "wrong but populated" residual
                          that earlier review iterations missed.
  future_dated          — effective_date > NOW() + 1 year (corrupt
                          or typo'd century).
  pre_1990              — effective_date < 1990-01-01 (epoch math gone
                          wrong, bad parse).

Sample of last 1000 pages by default — fast on 200K-page brains. Fix
hint: gbrain reindex-frontmatter.
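The three problem classes can be sketched as a row classifier. The row shape and helper name below are assumptions for illustration, not the shipped check:

```typescript
// Hedged sketch of the three effective_date_health problem classes.
interface PageRow {
  effectiveDate: Date;
  effectiveDateSource: string;   // sentinel written at import ('fallback', 'date', ...)
  frontmatterDate: Date | null;  // parseable event_date / date / published, if any
}

function classifyEffectiveDate(row: PageRow, now: Date): string | null {
  const yearMs = 365 * 24 * 3600 * 1000;
  if (row.effectiveDateSource === 'fallback' && row.frontmatterDate !== null) {
    return 'fallback_with_fm_date'; // "wrong but populated" residual
  }
  if (row.effectiveDate.getTime() > now.getTime() + yearMs) return 'future_dated';
  if (row.effectiveDate.getTime() < Date.parse('1990-01-01')) return 'pre_1990';
  return null; // healthy
}
```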

salience_health: detects pages with active takes whose emotional_weight
is still 0 (recompute_emotional_weight phase hasn't run since the
take landed). Reports the brain's non-zero emotional_weight count as
an informational baseline. Fix hint: gbrain dream --phase
recompute_emotional_weight.

Both checks gracefully skip on pre-v0.29.1 brains (column doesn't
exist → 42703) without surfacing as warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: docs + skills convention + CHANGELOG + version bump

- VERSION 0.29.0 → 0.29.1
- package.json version bump
- CHANGELOG.md: full release-summary + itemized + "To take advantage"
  block per the project's voice rules. Two-line headline + concrete
  pathology framing (existing callers unchanged; new axes opt-in;
  agent in charge per the prime directive).
- skills/conventions/salience-and-recency.md: agent-readable decision
  rules. "Current state → on. Canonical truth → off." plus the narrow
  temporal-bound exception. Cross-cutting convention propagates to
  brain skills via RESOLVER.md.
- skills/migrations/v0.29.1.md: agent-readable upgrade instructions.
  Verify steps + behavior-change reference + recovery commands.

The build-time tool-description generator from D2 (extract decision
tables from skills/conventions/salience-and-recency.md, embed into
operations.ts at build time) is deferred to a follow-up commit. The
tool descriptions on the query op + get_recent_salience are inline in
operations.ts for v0.29.1; the auto-gen + CI staleness gate land in
v0.29.2 if drift becomes a problem in practice.

148 unit tests pass across the v0.29.1 surface (effective-date,
recency-decay, query-intent, migrate, salience, recompute-emotional-weight).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Wintermute <wintermute@garrytan.com>

---------

Co-authored-by: Wintermute <wintermute@garrytan.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request May 8, 2026
…t without being asked (#592)

* v0.29 foundation: emotional_weight column + formula + anomaly stats

Migration v34 adds pages.emotional_weight REAL DEFAULT 0.0 (column-only,
no index — salience query orders by computed score, not raw weight).
Embedded DDL (schema.sql + pglite-schema.ts + schema-embedded.ts)
mirrors the column so fresh installs don't need migration replay.

types.ts gains: PageFilters.sort enum + PAGE_SORT_SQL whitelist (engines
hardcoded ORDER BY updated_at DESC; threading lands in the next commit);
SalienceOpts/SalienceResult, AnomaliesOpts/AnomalyResult,
EmotionalWeightInputRow/EmotionalWeightWriteRow contracts.

cycle/emotional-weight.ts: pure-function score in [0..1] from tags +
takes (anglocentric default seed list; user-overridable via config key
emotional_weight.high_tags). cycle/anomaly.ts: meanStddev + cohort
threshold helpers with zero-stddev fallback (count > mean + 1) so rare
cohorts don't produce NaN sigmas.
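The zero-stddev fallback can be sketched in a few lines. Exact shipped semantics may differ; this only illustrates the "count > mean + 1, never NaN" rule described above:

```typescript
// Illustrative mean/stddev helper with the zero-stddev fallback: when
// stddev is 0, a cohort is anomalous if count > mean + 1.
function meanStddev(xs: number[]): { mean: number; stddev: number } {
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  const variance = xs.reduce((a, x) => a + (x - mean) ** 2, 0) / xs.length;
  return { mean, stddev: Math.sqrt(variance) };
}

function isAnomalous(count: number, baseline: number[], sigma: number): boolean {
  const { mean, stddev } = meanStddev(baseline);
  if (stddev === 0) return count > mean + 1; // finite fallback, never NaN
  return (count - mean) / stddev > sigma;
}
```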

Test coverage: migrate v34 structural assertions + 14-case formula
unit + 13-case anomaly stats unit. Codex review fixes baked in:
formula clamped to [0,1]; per-take weight clamped to [0,1] before
averaging; zero-stddev fallback finite, never NaN.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29 engine: batch emotional-weight methods + listPages sort

BrainEngine adds 4 methods, both engines implement:

- batchLoadEmotionalInputs(slugs?): CTE-shaped read with per-table
  pre-aggregates. A page with N tags + M takes never produces N×M rows
  (codex C4#4) — page_tags + page_takes CTEs aggregate independently,
  then LEFT JOIN to pages.

- setEmotionalWeightBatch(rows): UPDATE FROM unnest($1::text[],
  $2::text[], $3::real[]) composite-keyed on (slug, source_id). Multi-
  source brains can't fan out (codex C4#3) — pages.slug is unique only
  within source_id. Same shape that v0.18 link batches use.
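  The unnest shape can be sketched as three parallel arrays bound to one
  statement. A hedged illustration; the SQL text below is not the shipped
  query:

```typescript
// Hedged sketch of the composite-keyed batch write: three parallel arrays
// bound to unnest so one statement updates every (slug, source_id) pair.
interface WeightRow { slug: string; sourceId: string; weight: number }

function buildBatchUpdate(rows: WeightRow[]): {
  sql: string;
  params: [string[], string[], number[]];
} {
  const sql = `
    UPDATE pages p
    SET emotional_weight = v.weight
    FROM unnest($1::text[], $2::text[], $3::real[]) AS v(slug, source_id, weight)
    WHERE p.slug = v.slug AND p.source_id = v.source_id`;
  return {
    sql,
    params: [
      rows.map((r) => r.slug),
      rows.map((r) => r.sourceId),
      rows.map((r) => r.weight),
    ],
  };
}
```

  Joining on both columns is what keeps a slug shared across two sources from
  fanning out.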

- getRecentSalience: time boundary computed in JS, bound as TIMESTAMPTZ.
  SQL identical across engines (codex C5/D5 — avoids dialect drift on
  $1::interval binding which has zero current uses on PGLite).

- findAnomalies: tag + type cohort baselines via generate_series-
  densified daily-count CTEs (codex C4#6). Sparse-day rare cohorts get
  correct (mean, stddev) instead of biased upward by zero-omission.
  Year cohort deferred to v0.30.

listPages threads the new PageFilters.sort enum through both engines.
Was hardcoded ORDER BY updated_at DESC; now PAGE_SORT_SQL whitelist
maps the 4 enum values to literal SQL fragments — no injection surface.
postgres.js uses sql.unsafe; PGLite splices the fragment directly.

Regression tests (PGLite, no DATABASE_URL needed):

- multi-source-emotional-weight: same slug under two source_ids,
  setEmotionalWeightBatch on one of them, asserts the other survives
  untouched. Direct codex C4#3 guard.

- list-pages-regression (IRON RULE): old call shape (type, tag, limit)
  still returns updated_desc default; new sort=updated_asc reverses;
  sort=created_desc orders by created_at; sort=slug alphabetical;
  unsupported sort enum falls back to default (defense in depth).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29 cycle: new recompute_emotional_weight phase

Adds a 9th cycle phase between extract and embed. Sees the union of
syncPagesAffected + synthesizeWrittenSlugs for incremental mode (so
synthesize-written pages get their weight computed too — codex C2 caught
that the prior plan threaded only sync). Full mode (no incremental
anchors) walks every page; users hit this path on first upgrade via
gbrain dream --phase recompute_emotional_weight.

Phase orchestrator (cycle/recompute-emotional-weight.ts) is two SQL
round-trips total regardless of brain size:
  1. batchLoadEmotionalInputs(slugs?) → per-page tag/take inputs.
  2. computeEmotionalWeight in memory (pure function).
  3. setEmotionalWeightBatch(rows) → composite-keyed UPDATE FROM unnest.
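The three steps above can be sketched end to end. Engine method names follow the commit text; everything else (shapes, the compute callback, the phase function name) is illustrative:

```typescript
// Hedged sketch of the two-round-trip phase.
interface EmotionalInput { slug: string; sourceId: string; tags: string[]; takeWeights: number[] }
interface WeightRow { slug: string; sourceId: string; weight: number }

async function recomputeEmotionalWeightPhase(
  engine: {
    batchLoadEmotionalInputs(slugs?: string[]): Promise<EmotionalInput[]>;
    setEmotionalWeightBatch(rows: WeightRow[]): Promise<void>;
  },
  computeEmotionalWeight: (input: EmotionalInput) => number,
  affectedSlugs?: string[], // undefined = full mode, walk every page
): Promise<number> {
  if (affectedSlugs && affectedSlugs.length === 0) return 0; // no DB read, no write
  const inputs = await engine.batchLoadEmotionalInputs(affectedSlugs); // round-trip 1
  const rows = inputs.map((i) => ({
    slug: i.slug,
    sourceId: i.sourceId,
    weight: computeEmotionalWeight(i), // pure function, in memory
  }));
  await engine.setEmotionalWeightBatch(rows); // round-trip 2
  return rows.length;
}
```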

Empty affectedSlugs short-circuits (no DB read, no write). Dry-run
computes weights and reports the would-write count without touching
the DB. Engine throw bubbles into status:fail with code
RECOMPUTE_EMOTIONAL_WEIGHT_FAIL — cycle continues to the next phase.

Plumbing:
- CyclePhase type adds 'recompute_emotional_weight'.
- ALL_PHASES + NEEDS_LOCK_PHASES include it.
- CycleReport.totals adds pages_emotional_weight_recomputed (additive,
  schema_version stays "1").
- runCycle's totals rollup + status derivation honor the new field.
- synthesize.ts emits writtenSlugs in details so cycle.ts can union
  with syncPagesAffected for incremental backfill.

Tests: 7-case unit (fake-engine), 3-case PGLite e2e (full mode + dry-
run + ALL_PHASES position), 1000-page perf budget (<5s on PGLite).

Codex C2 → A: clean separation. Phase doesn't modify runExtractCore;
runs on its own seam after the existing 8 phases plus synthesize.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29 ops: get_recent_salience + find_anomalies + get_recent_transcripts

Three new MCP operations + a transcripts library:

- get_recent_salience: pages ranked by emotional + activity salience.
  Subagent-allow-listed. params: days (default 14), limit (default 20,
  capped 100), slugPrefix (renamed from `kind` per codex C4#10 to
  avoid collision with PageKind/TakeKind).

- find_anomalies: cohort-level activity outliers (tag + type).
  Subagent-allow-listed. Year cohort deferred to v0.30.

- get_recent_transcripts: raw .txt transcripts from the dream-cycle
  corpus dirs. LOCAL-ONLY: rejects ctx.remote === true with
  permission_denied (codex C3). NOT in the subagent allow-list — all
  subagent calls run with remote=true, would always reject (footgun if
  visible). Cycle's synthesize phase calls discoverTranscripts
  directly, so subagents that need transcripts go through the library
  function, not the op.

Tool descriptions extracted to src/core/operations-descriptions.ts so
they're pinnable in tests and stable for the Tier-2 LLM routing eval.
Redirects on query/search/list_pages: personal/emotional questions
should reach the new ops, not semantic search. Anti-flattery hint on
query: "Do NOT assume words like crazy, notable, or big mean
impressive — they often mean difficult or emotionally charged."

list_pages gains updated_after (string ISO) and sort enum params,
surfacing the engine threading from the prior commit.

src/core/transcripts.ts: filesystem walk shared by the gated MCP op
and the (commit 5) CLI command. Reuses discoverTranscripts corpus-dir
resolution + isDreamOutput from cycle/transcript-discovery.ts. Trust
gate lives in the op handler, not the library — the library is
trusted by both the gated op and the local CLI.

Allow-list: 11 → 13 (add salience + anomalies; transcripts excluded
per codex C3, with a comment explaining why).

Tests: 21-case description pin (catches accidental edits that change
LLM-facing surface); 11-case transcripts unit covering trust gate,
mtime window, dream-output skip, summary truncation, no corpus_dir;
2-case salience type-contract smoke (full Garry-test fixture in commit
6's e2e suite).

Codex C1: routing-eval fixtures (skills/<x>/routing-eval.jsonl)
deliberately NOT shipped — routing-eval.ts is substring-match on
resolver triggers, not MCP tool routing. Real coverage lands as
test/e2e/salience-llm-routing.test.ts in commit 6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29 CLI: gbrain salience / anomalies / transcripts

Three new CLI commands wired into src/cli.ts dispatch + CLI_ONLY set +
help text:

- gbrain salience [--days N] [--limit N] [--kind PREFIX] [--json]
- gbrain anomalies [--since YYYY-MM-DD] [--lookback-days N] [--sigma N] [--json]
- gbrain transcripts recent [--days N] [--full] [--json]

Each command file mirrors src/commands/orphans.ts shape: pure data fn
+ JSON formatter + human formatter. Calls into engine.getRecentSalience
/ findAnomalies (already shipped) and src/core/transcripts.ts.

salience and anomalies show ranked rows with per-cohort
mean/stddev/sigma. transcripts honors `--full` (caps at 100KB/file)
vs default summary (first non-empty line + ~250 chars). All three
emit JSON with --json for agent consumption.

`--kind` is accepted as a slug-prefix shorthand on `gbrain salience`
even though the underlying op param is `slugPrefix` (kept the CLI
flag short; the MCP-facing param uses the more-explicit name to
align with PageKind/TakeKind/slugPrefix vocabulary).

CLI_ONLY set in src/cli.ts gains the three new command names so
they don't get forwarded to MCP-only routing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29 e2e: Garry-test fixtures + Postgres parity + LLM routing eval

PGLite e2e (no DATABASE_URL needed):

- salience-pglite: the Garry test. 7 wedding-tagged pages updated today
  + 100 background pages backdated across 30 days via raw SQL UPDATE
  (codex C4#7 — engine.putPage stamps updated_at = now(), so seeding
  via the engine alone can't reproduce historical recency windows).
  Asserts wedding pages outrank random-tag noise in the 7-day window;
  slugPrefix filter narrows correctly; days=0 boundary case; limit cap.

- anomalies-pglite: same fixture shape (7 wedding pages today, 100
  background backdated). findAnomalies with sigma=3 returns the
  wedding-tag cohort with sigma_observed > 3 vs near-zero baseline;
  page_slugs sample carries the wedding pages; date with no activity
  returns []; high sigma threshold suppresses borderline cohorts
  (zero-stddev fallback stays finite — no NaN sigma).

Postgres-gated e2e:

- engine-parity-salience: PGLite ↔ Postgres parity for getRecentSalience
  and findAnomalies. Same fixture into both engines; top-result and
  cohort-set match. Closes the v0.22.0-style parity gap for the new
  v0.29 SQL idioms (EXTRACT(EPOCH ...), generate_series, CTE chain).

Tier-2 LLM routing eval (ANTHROPIC_API_KEY-gated):

- salience-llm-routing: calls Claude with v0.29 tool descriptions and
  12 personal-query phrasings ("anything crazy lately", "what's been
  going on with me", etc.). Asserts the chosen tool is in the v0.29
  set, not query() / search(). ~$0.10 per CI run on Haiku. Tests the
  ACTUAL ship criterion — replaces the discarded fake-coverage
  routing-eval.jsonl fixtures (codex C1 → B).

This is the only test that proves the description edits drive routing.
Without it, we'd ship description changes and only learn from
production behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.0: ship-prep — VERSION + CHANGELOG + CLAUDE Key Files

VERSION + package.json bump 0.28.0 → 0.29.0.

CHANGELOG.md adds a v0.29.0 release-summary in the GStack/Garry voice
plus the "To take advantage of v0.29.0" block. Headline two-liner:
"The brain tells you what's hot without being asked. Salience +
anomaly detection ship. Search rewards hypotheses; salience surfaces
them." Numbers-that-matter table covers engine surface delta, MCP op
delta, allow-list delta, cycle-phase delta, schema migration, list_pages
param surface, and test count. Itemized changes section lists the
schema migration + new cycle phase + new MCP ops + redirect
descriptions + subagent allow-list rules + new tests + a contributor
note clarifying that routing-eval.ts is not the right surface for
testing MCP tool routing (use the Tier-2 LLM eval pattern instead).

CLAUDE.md Key Files updated for the v0.29 surface:

- src/core/engine.ts: notes the 4 new methods + PageFilters.sort threading.
- src/core/migrate.ts: v34 (pages_emotional_weight) entry.
- src/core/cycle.ts: 8 → 9 phases, recompute_emotional_weight inserted
  between patterns and embed; totals.pages_emotional_weight_recomputed.
- src/core/cycle/emotional-weight.ts (NEW): formula + override path.
- src/core/cycle/anomaly.ts (NEW): stats helpers + zero-stddev fallback.
- src/core/cycle/recompute-emotional-weight.ts (NEW): phase orchestrator.
- src/core/transcripts.ts (NEW): library shared by gated MCP op + CLI.
- src/core/operations-descriptions.ts (NEW): pinned tool descriptions.
- src/core/minions/tools/brain-allowlist.ts: 11 → 13 entries; comment
  on why get_recent_transcripts is excluded.
- src/commands/salience.ts / anomalies.ts / transcripts.ts (NEW): CLI surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1 feat: recency + salience as two orthogonal options on query op (#696)

* feat: recency boost for search (v0.27.0) — temporal intent auto-detection, date filters, configurable decay

New search pipeline stage: keyword + vector → RRF → cosine re-score → backlink boost → recency boost → dedup

- applyRecencyBoost: hyperbolic decay, two strengths (moderate 30-day halflife, aggressive 7-day halflife)
- Auto-enabled when intent.ts detects temporal/event queries (detail='high')
- Manual override via SearchOpts.recencyBoost (0/1/2)
- Date filtering: afterDate/beforeDate on all three search paths (keyword, keywordChunks, vector)
- getPageTimestamps on both Postgres and PGLite engines
- 15 tests passing (boost math + intent classification)

* v0.29.1 schema: pages.{effective_date, effective_date_source, import_filename, salience_touched_at} + expression index

Migration v38 adds 4 nullable columns to pages and an expression index on
COALESCE(effective_date, updated_at) to support the new since/until date
filters. All additive — no behavior change in the default search path; only
consulted when callers opt into the new salience='on' / recency='on' axes
or pass since/until.

  effective_date         — content date (event_date / date / published /
                           filename-date / fallback). Read by recency boost
                           and date-filter paths only. Auto-link doesn't
                           touch it (immune to updated_at churn).
  effective_date_source  — sentinel for the doctor's effective_date_health
                           check ('event_date' | 'date' | 'published' |
                           'filename' | 'fallback').
  import_filename        — basename without extension, captured at import.
                           Used for filename-date precedence on daily/,
                           meetings/. Older rows leave it NULL.
  salience_touched_at    — bumped by recompute_emotional_weight when
                           emotional_weight changes. Salience window uses
                           GREATEST(updated_at, salience_touched_at) so
                           newly-salient old pages enter the recent salience
                           query.

Index strategy: a partial index on effective_date alone wouldn't help the
COALESCE expression in since/until filters (planner can't use it for the
negative side). The expression index ((COALESCE(effective_date, updated_at)))
is what actually accelerates the filter.

Postgres uses CONCURRENTLY + v14-style pg_index.indisvalid pre-drop guard
for prior failed CONCURRENTLY runs; PGLite uses plain CREATE INDEX. Mirror
of v34's pattern.

src/schema.sql + src/core/pglite-schema.ts updated for fresh installs;
src/core/schema-embedded.ts regenerated via bun run build:schema.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: computeEffectiveDate helper + putPage integration

Pure helper computing a page's effective_date from frontmatter precedence:
  1. event_date (meeting/event pages)
  2. date (dated essays)
  3. published (writing/)
  4. filename-date (leading YYYY-MM-DD in basename)
  5. updated_at (fallback)
  6. created_at (last resort)

Per-prefix override: for daily/ and meetings/ slugs, filename-date jumps
to position 1 — the filename is the user's primary signal there.

Returns {date, source}. The source label powers the doctor's
effective_date_health check to detect "fell back to updated_at" rows that
look populated but are functionally a NULL.

Range validation: parsed value must be in [1990-01-01, NOW + 1 year].
Out-of-range values drop to the next chain element.
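The precedence chain plus range validation can be sketched as follows. Field names follow the commit text; the parsing and the created_at last resort are simplified, so treat this as an illustration rather than the shipped helper:

```typescript
// Hedged sketch of computeEffectiveDate's precedence chain.
interface Frontmatter { event_date?: string; date?: string; published?: string }

function inRange(d: Date, now: Date): boolean {
  const max = now.getTime() + 365 * 24 * 3600 * 1000; // [1990-01-01, NOW + 1 year]
  return d.getTime() >= Date.parse('1990-01-01') && d.getTime() <= max;
}

function computeEffectiveDate(
  slug: string,
  fm: Frontmatter,
  filename: string | null,
  updatedAt: Date,
  now: Date,
): { date: Date; source: string } {
  const filenameDate = filename?.match(/^(\d{4}-\d{2}-\d{2})/)?.[1]; // leading YYYY-MM-DD
  const chain: Array<[string, string | undefined]> = [
    ['event_date', fm.event_date],
    ['date', fm.date],
    ['published', fm.published],
    ['filename', filenameDate],
  ];
  // Per-prefix override: daily/ and meetings/ promote filename-date to position 1.
  if (slug.startsWith('daily/') || slug.startsWith('meetings/')) {
    chain.unshift(['filename', filenameDate]);
  }
  for (const [source, raw] of chain) {
    if (!raw) continue;
    const d = new Date(raw);
    // Parse failures and out-of-range values drop to the next chain element.
    if (!Number.isNaN(d.getTime()) && inRange(d, now)) return { date: d, source };
  }
  return { date: updatedAt, source: 'fallback' }; // updated_at fallback
}
```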

Wired into importFromContent + importFromFile. The put_page MCP op derives
filename from slug-tail when no caller-supplied filename is available.

putPage SQL on both engines extended to write the new columns. ON CONFLICT
uses COALESCE(EXCLUDED.x, pages.x) so callers that don't know about the
new columns (auto-link, code reindex) preserve existing values rather than
blanking them. SELECT projection extended to return them; rowToPage threads
them through.

21 unit tests covering: precedence chain default order, per-prefix override,
parse failure fall-through, range validation [1990, NOW+1y], parseDateLoose
shape variants. All pass; typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: backfill orchestrator + library function for existing pages

src/core/backfill-effective-date.ts is the shared library function. Walks
pages in keyset-paginated batches (id > last_id ORDER BY id LIMIT 1000),
runs computeEffectiveDate per row, UPDATEs effective_date +
effective_date_source. Resumable via the `backfill.effective_date.last_id`
checkpoint key in the config table — a killed process can re-run and pick
up without re-doing rows. Idempotent: a full re-walk produces the same
writes.
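The keyset walk with a checkpoint can be sketched against a minimal DB interface. The config key matches the commit text; the interface and batch helper below are assumptions:

```typescript
// Illustrative keyset-paginated, checkpoint-resumable backfill walk.
const CHECKPOINT_KEY = 'backfill.effective_date.last_id';

async function backfillEffectiveDate(db: {
  getConfig(key: string): Promise<string | null>;
  setConfig(key: string, value: string): Promise<void>;
  pageBatchAfter(lastId: number, limit: number): Promise<Array<{ id: number }>>;
  updateEffectiveDate(id: number): Promise<void>;
}): Promise<number> {
  let lastId = Number((await db.getConfig(CHECKPOINT_KEY)) ?? 0);
  let total = 0;
  for (;;) {
    // id > lastId ORDER BY id LIMIT 1000 — stable under concurrent writes.
    const batch = await db.pageBatchAfter(lastId, 1000);
    if (batch.length === 0) break;
    for (const row of batch) await db.updateEffectiveDate(row.id);
    lastId = batch[batch.length - 1].id;
    await db.setConfig(CHECKPOINT_KEY, String(lastId)); // resumable checkpoint
    total += batch.length;
  }
  return total;
}
```

A killed process re-runs from the stored id, and a full re-walk produces the same writes, which is the idempotence claim above.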

Postgres-only: SET LOCAL statement_timeout = '600s' per batch. Doesn't
refuse the migration on low session settings (codex pass-2 #16).

src/commands/migrations/v0_29_1.ts is the orchestrator (4 phases mirroring
v0_12_2). Phase A schema (gbrain init --migrate-only), Phase B backfill
(via the library function), Phase C verify (count NULL effective_date),
Phase D record (handled by runner). The library function is reusable from
the gbrain reindex-frontmatter CLI command in the next commit.

import_filename stays NULL for backfilled rows — pre-v0.29.1 imports
didn't capture it. computeEffectiveDate uses the slug-tail when filename
is NULL; daily/2024-03-15 backfilled gets effective_date from the slug.

Registered in src/commands/migrations/index.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: gbrain reindex-frontmatter CLI command

Recovery / explicit-rebuild path for pages.effective_date. Used when:
  - User edited frontmatter dates after import
  - Post-upgrade backfill orchestrator finished but the user wants to
    re-walk a subset (e.g. just meetings/) after fixing some frontmatter
  - Precedence rules change between releases

Thin wrapper over backfillEffectiveDate from commit 3 — same code path
the v0_29_1 orchestrator uses; one source of truth.

Flags mirror reindex-code:
  --source <id>      Scope to one sources row (placeholder; the library
                     doesn't filter by source today, tracked v0.30+)
  --slug-prefix P    Scope to slugs starting with P (e.g. 'meetings/')
  --dry-run          Print what WOULD change, no DB writes
  --yes              Skip confirmation prompt (required for non-TTY non-JSON)
  --json             Machine-readable result envelope
  --force            Re-apply even when computed value matches existing

Wired into src/cli.ts. CLI handles its own engine lifecycle (creates +
disconnects).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: recency-decay map + buildRecencyComponentSql (pure, unused)

src/core/search/recency-decay.ts mirrors source-boost.ts in shape but
drives RECENCY ONLY (per D9 codex resolution). Salience is a separate
orthogonal axis; this map does not feed it.

DEFAULT_RECENCY_DECAY: 10 generic prefixes (no fork-specific names).
  - concepts/      evergreen (halflifeDays=0)
  - originals/     180d × 0.5 (long-tail decay; new essays nudged)
  - writing/       365d × 0.4
  - daily/         14d × 1.5  (aggressive — freshness IS the signal)
  - meetings/      60d × 1.0
  - chat/          7d × 1.0
  - media/x/       7d × 1.5
  - media/articles/ 90d × 0.5
  - people/companies/ 365d × 0.3
  - deals/         180d × 0.5

DEFAULT_FALLBACK: 90d × 0.5 for unmatched slugs.

Override priority: defaults < gbrain.yml recency: < env (GBRAIN_RECENCY_DECAY)
< per-call SearchOpts.recency_decay.

parseRecencyDecayEnv format: comma-separated prefix:halflifeDays:coefficient
triples. Refuses LOUD on parse error (RecencyDecayParseError) — codex
pass-2 #M3 finding. No silent fallback like source-boost's parser.
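The loud env parser can be sketched as follows. The error class name matches the commit text; the parsing details are illustrative:

```typescript
// Sketch of the refuse-LOUD env parser: comma-separated
// prefix:halflifeDays:coefficient triples, throwing on any malformed entry.
class RecencyDecayParseError extends Error {}

type DecayEntry = { halflifeDays: number; coefficient: number };

function parseRecencyDecayEnv(raw: string): Map<string, DecayEntry> {
  const map = new Map<string, DecayEntry>();
  for (const entry of raw.split(',')) {
    const parts = entry.split(':');
    if (parts.length !== 3) throw new RecencyDecayParseError(`bad triple: ${entry}`);
    const [prefix, h, c] = parts;
    const halflifeDays = Number(h);
    const coefficient = Number(c);
    if (!prefix || !Number.isFinite(halflifeDays) || !Number.isFinite(coefficient)) {
      // No silent fallback: a typo in the env var stops the process.
      throw new RecencyDecayParseError(`bad triple: ${entry}`);
    }
    map.set(prefix, { halflifeDays, coefficient });
  }
  return map;
}
```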

parseRecencyDecayYaml takes already-parsed YAML; throws on bad shape.

buildRecencyComponentSql in sql-ranking.ts emits a CASE expression with
longest-prefix-first ordering, evergreen short-circuit (literal 0 when
halflifeDays=0 or coefficient=0), and EXTRACT(EPOCH ...) for non-zero
branches. Output: ((CASE WHEN p.slug LIKE 'daily/%' THEN 1.5 * 14.0 /
(14.0 + EXTRACT(EPOCH FROM (NOW() - <dateExpr>))/86400.0) ... END))

Typed NowExpr enum prevents SQL injection (codex pass-1 #5). Tests pass
{ kind: 'fixed', isoUtc } for deterministic output; production NOW().
The 'fixed' branch escapes single quotes via escapeSqlLiteral.

25 unit tests covering: env parser shape, env error cases, yaml parser
shape, merge precedence (defaults < yaml < env < caller), CASE longest-
prefix-first ordering, evergreen short-circuit, NowExpr fixed/now,
single-quote injection defense, empty decayMap fallback path, default
map composition (no fork names, concepts/ evergreen, daily/ aggressive).

Pure module. Zero consumers in this commit; commit 6 wires it into
getRecentSalience, commit 10 wires it into the post-fusion stage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: refactor getRecentSalience to consume buildRecencyComponentSql

Both engines (Postgres + PGLite) now build the salience formula's third
term via buildRecencyComponentSql instead of inlining 1.0 / (1 + days_old).
Parameters: empty decayMap + fallback { halflifeDays: 1, coefficient: 1.0 }.
Math expands to 1 * 1.0 / (1.0 + days_old) = 1 / (1 + days_old) — same
numeric output as v0.29.0.

This is a no-behavior-change refactor preparing for commit 7's recency_bias
param. recency_bias='flat' (default) reproduces v0.29.0 exactly; 'on'
swaps in DEFAULT_RECENCY_DECAY for per-prefix decay.

Single source of truth for the recency math: same builder feeds the
salience query AND (in commit 10) the post-fusion applyRecencyBoost stage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: get_recent_salience gains recency_bias param (default 'flat')

SalienceOpts.recency_bias: 'flat' | 'on' added; default 'flat' preserves
v0.29.0 ranking verbatim. Pass 'on' to opt into per-prefix decay map
(concepts/originals/writing/ evergreen; daily/, media/x/, chat/ aggressive
decay).

When recency_bias='on', the salience query reads
COALESCE(p.effective_date, p.updated_at) instead of bare p.updated_at, so
the recency component is immune to auto-link updated_at churn — old
concepts/ pages just-touched by auto-link don't suddenly look fresh.

Both engines (Postgres + PGLite) wire the param through. resolveRecencyDecayMap()
honors gbrain.yml + GBRAIN_RECENCY_DECAY env at runtime.

MCP op surface: get_recent_salience gains the param with a load-bearing
description teaching the agent when to use 'on' vs 'flat' (current state →
on; mattering across all time → flat).

No silent v0.29.0 behavior change — opt-in only (per D11 codex resolution).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: recompute_emotional_weight writes salience_touched_at; window picks up newly-salient pages

setEmotionalWeightBatch on both engines now bumps salience_touched_at to
NOW() ONLY when the new emotional_weight differs from the existing one
(IS DISTINCT FROM, NULL-safe). No-op writes (same weight) leave the
column alone — preserves "actual change" semantics.

getRecentSalience window changes from
  WHERE p.updated_at >= boundary
to
  WHERE GREATEST(p.updated_at, COALESCE(p.salience_touched_at, p.updated_at)) >= boundary

Closes codex pass-1 finding #4: pages whose emotional_weight just changed
in the dream cycle (because tags or takes shifted) but whose updated_at
is older than the salience window now correctly enter the recent-salience
results. Without this, "Garry just added a take to a 6-month-old page"
stayed invisible to get_recent_salience until the next content edit.

COALESCE(salience_touched_at, p.updated_at) handles pre-v0.29.1 rows
where salience_touched_at is NULL — they fall back to p.updated_at and
behave identically to v0.29.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: merge intent.ts → query-intent.ts; emit 3 suggestions per query

D1 + D4 + D6 + D8: single regex-pass classifier returning
{intent, suggestedDetail, suggestedSalience, suggestedRecency}.

intent + suggestedDetail are v0.29.0 behavior verbatim (legacy intent.ts
deleted; classifyQueryIntent + autoDetectDetail compat shims preserved).

NEW for v0.29.1 — two orthogonal recency-axis suggestions:

  suggestedSalience: 'off' | 'on' | 'strong'
  suggestedRecency:  'off' | 'on' | 'strong'

Resolution rules (per D6 narrow temporal-bound exception):
  - CANONICAL patterns (who is X / what is Y / code / graph) → both off
  - UNLESS an EXPLICIT_TEMPORAL_BOUND also matches (today / right now /
    this week / since X / last N days), in which case temporal-bound wins
  - STRONG_RECENCY (today / right now / this morning / just now) → strong
  - RECENCY_ON (latest / recent / this week / meeting prep / catch up
    / remind me / status update) → on
  - SALIENCE_ON (catch up / remind me / status update / prep me /
    what's going on / what matters) → on
  - default → off for both axes (v0.29.1 prime-directive: pure opt-in)

Salience and recency are TRULY orthogonal (per D9). A query like
"latest news on AI" → recency='on' but salience='off' (the user wants
fresh, not emotionally-weighted). "What's going on with widget-co" →
both on. "Who is X right now" → both 'strong'/'on' (temporal bound
beats canonical 'who is').
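The resolution rules above can be sketched with abbreviated pattern lists. These regexes are samples for illustration, not the shipped classifier:

```typescript
// Illustrative three-axis resolution with sample patterns.
type Axis = 'off' | 'on' | 'strong';

const CANONICAL = /\b(who is|what is)\b/i;
const EXPLICIT_TEMPORAL_BOUND = /\b(today|right now|this week|last \d+ days)\b/i;
const STRONG_RECENCY = /\b(today|right now|this morning|just now)\b/i;
const RECENCY_ON = /\b(latest|recent|this week|catch up|remind me|status update)\b/i;
const SALIENCE_ON = /\b(catch up|remind me|status update|prep me|what's going on|what matters)\b/i;

function classifyAxes(q: string): { suggestedSalience: Axis; suggestedRecency: Axis } {
  // Canonical queries suppress both axes unless an explicit temporal bound wins.
  if (CANONICAL.test(q) && !EXPLICIT_TEMPORAL_BOUND.test(q)) {
    return { suggestedSalience: 'off', suggestedRecency: 'off' };
  }
  const suggestedRecency: Axis =
    STRONG_RECENCY.test(q) ? 'strong' : RECENCY_ON.test(q) ? 'on' : 'off';
  const suggestedSalience: Axis = SALIENCE_ON.test(q) ? 'on' : 'off';
  return { suggestedSalience, suggestedRecency };
}
```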

intent.ts deleted; test/intent.test.ts renamed → test/query-intent-legacy.test.ts
(unchanged behavior coverage). New test/query-intent.test.ts adds 21
cases covering all three axes' interactions: canonical wins on bare
'who is', temporal bound overrides, "catch me up" matches with up to 15
chars between, "today" → strong, intent vs recency independence.

Updated callers:
  - src/core/search/hybrid.ts (autoDetectDetail import)
  - test/recency-boost.test.ts (classifyQueryIntent import)
  - test/benchmark-search-quality.ts (autoDetectDetail import)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: applySalienceBoost + applyRecencyBoost + runPostFusionStages wrapper

D9 + codex pass-1 #2 + #3 + pass-2 #4: salience and recency are TRULY
ORTHOGONAL post-fusion stages, both running from ALL THREE hybridSearch
return paths (keyword-only, embed-failure-fallback, full-hybrid).

NEW src/core/search/hybrid.ts exports:
  - applySalienceBoost(results, scores, strength)
      score *= 1 + k * log(1 + score) where k = 0.15 (on) or 0.30 (strong)
      No time component. Pure mattering signal.
  - applyRecencyBoost(results, dates, strength, decayMap, fallback, nowMs?)
      Per-prefix decay factor: 1 + strengthMul * coefficient * halflife / (halflife + days_old)
      strengthMul: 1.0 (on) or 1.5 (strong)
      Evergreen prefixes (halflifeDays=0) skipped (factor 1.0).
      Pure recency signal. Independent of mattering.
  - runPostFusionStages(engine, results, opts)
      Wraps backlink + salience + recency. Called from EACH return path so
      keyless installs and embed failures get the same boost surface as
      the full hybrid path.
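The two boost formulas above, as runnable per-result factors (function shapes simplified — the shipped versions operate on full result arrays keyed by `${source_id}::${slug}`):

```typescript
// Pure mattering signal: no time component.
function salienceFactor(score: number, strength: 'on' | 'strong'): number {
  const k = strength === 'strong' ? 0.30 : 0.15;
  return 1 + k * Math.log(1 + score);
}

// Pure recency signal: independent of mattering.
function recencyFactor(
  daysOld: number,
  halflifeDays: number,
  coefficient: number,
  strength: 'on' | 'strong',
): number {
  if (halflifeDays === 0) return 1.0; // evergreen prefix: skipped entirely
  const strengthMul = strength === 'strong' ? 1.5 : 1.0;
  return 1 + strengthMul * coefficient * halflifeDays / (halflifeDays + daysOld);
}
```

With daily/'s 14d × 1.5 entry, a page written today gets factor 2.5 at strength 'on'; an evergreen prefix always gets 1.0.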

NEW engine methods (composite-keyed for multi-source isolation):
  - getEffectiveDates(refs: Array<{slug, source_id}>): Map<key, Date>
      Returns COALESCE(effective_date, updated_at, created_at). Key format:
      `${source_id}::${slug}`. Mirror of getBacklinkCounts shape.
  - getSalienceScores(refs: Array<{slug, source_id}>): Map<key, number>
      Returns emotional_weight × 5 + ln(1 + take_count). Composite key.
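The score shape getSalienceScores returns, written out as a standalone expression (the shipped version computes this in SQL; the helper name here is illustrative):

```typescript
// emotional_weight × 5 + ln(1 + take_count)
function salienceScore(emotionalWeight: number, takeCount: number): number {
  return emotionalWeight * 5 + Math.log(1 + takeCount);
}
```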

Deprecated (kept for back-compat through v0.29.x):
  - SearchOpts.afterDate / beforeDate (alias for since/until)
  - SearchOpts.recencyBoost: 0|1|2 (alias for recency: 'off'|'on'|'strong')
  - getPageTimestamps (use getEffectiveDates instead)

NEW SearchOpts fields:
  - salience: 'off' | 'on' | 'strong'
  - recency:  'off' | 'on' | 'strong'
  - since:    string (ISO-8601 or relative, replaces afterDate)
  - until:    string (replaces beforeDate)

Resolution: caller-explicit > legacy alias (recencyBoost) > heuristic
(classifyQuery's suggestedSalience / suggestedRecency).

Deleted: src/core/search/recency.ts (PR #618's version, replaced) +
test/recency-boost.test.ts (its scope is replaced by query-intent.test.ts +
future post-fusion tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Wintermute <wintermute@garrytan.com>

* v0.29.1: query op gains salience + recency + since + until params; PGLite since/until parity

Combines commits 12 + 13 of the plan.

Query op surface (src/core/operations.ts):
  - salience: 'off' | 'on' | 'strong' (with load-bearing description)
  - recency:  'off' | 'on' | 'strong'
  - since:    string (ISO-8601 or relative; replaces deprecated afterDate)
  - until:    string (replaces deprecated beforeDate)

Tool descriptions teach the calling agent:
  - salience axis = mattering, no time component
  - recency axis = age decay, no mattering signal
  - omit either to let gbrain auto-detect from query text via classifyQuery

hybrid.ts maps since/until → afterDate/beforeDate at the engine call
boundary so PR #618's existing engine plumbing keeps working without
rename. Codex pass-1 #10 finding closed.

PGLite engine (codex pass-1 #10): since/until parity added to all three
search methods (searchKeyword, searchKeywordChunks, searchVector). SQL
filter against COALESCE(p.effective_date, p.updated_at, p.created_at)
so date filtering matches user content-date intent (a meeting was on
event_date, not when it got reimported). Filter is applied INSIDE the
HNSW inner CTE in searchVector so HNSW's candidate pool already
excludes out-of-range pages — preserves pagination contract.

This also closes existing cross-engine drift: pre-v0.29.1 Postgres had
afterDate/beforeDate from PR #618; PGLite had nothing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: migration v39 — eval_candidates capture columns for replay reproducibility

D11 codex pass-2 resolution: extend eval_candidates with 7 new nullable
columns so `gbrain eval replay` can reproduce captured runs of agent-explicit
salience + recency choices.

Without these columns, replays of the new axis params drift. The live
behavior depends on the resolved {salience, recency} values; v0.29.0's
schema doesn't capture them.

  as_of_ts            TIMESTAMPTZ  — brain's logical NOW at capture
                                     (replay uses this instead of wall-clock)
  salience_param      TEXT         — what the caller passed (NULL if omitted)
  recency_param       TEXT         — same
  salience_resolved   TEXT         — final value applied
  recency_resolved    TEXT         — same
  salience_source     TEXT         — 'caller' or 'auto_heuristic'
  recency_source      TEXT         — same

All nullable + additive. Pre-v0.29.1 rows stay valid. NDJSON
schema_version STAYS at 1 — consumers ignore unknown fields (codex
pass-1 #C2 dissolves; no cross-repo coordination needed).

ADD COLUMN with no DEFAULT is metadata-only on PG 11+ and PGLite —
instant on tables of any size.
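The additive DDL shape for reference (exact statement text is an assumption — the real migration lives in the migrate machinery; no DEFAULT on any column, so the ALTER stays metadata-only):

```typescript
// Sketch of the v39 eval_candidates capture columns.
const V39_EVAL_CANDIDATES_DDL = `
  ALTER TABLE eval_candidates
    ADD COLUMN IF NOT EXISTS as_of_ts          TIMESTAMPTZ,
    ADD COLUMN IF NOT EXISTS salience_param    TEXT,
    ADD COLUMN IF NOT EXISTS recency_param     TEXT,
    ADD COLUMN IF NOT EXISTS salience_resolved TEXT,
    ADD COLUMN IF NOT EXISTS recency_resolved  TEXT,
    ADD COLUMN IF NOT EXISTS salience_source   TEXT,
    ADD COLUMN IF NOT EXISTS recency_source    TEXT;
`;
```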

src/schema.sql + src/core/pglite-schema.ts mirror the additions for fresh
installs; src/core/schema-embedded.ts regenerated. eval_capture.ts
populates the new fields in commit 16 (docs + ship).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: doctor checks — effective_date_health + salience_health

effective_date_health: sample-1000 scan detects three classes of
problems (codex pass-1 #5 resolution via the effective_date_source
sentinel column added in commit 1):

  fallback_with_fm_date  — page fell back to updated_at even though
                           frontmatter has parseable event_date / date /
                           published. The "wrong but populated" residual
                           that earlier review iterations missed.
  future_dated            — effective_date > NOW() + 1 year (corrupt
                            or typo'd century).
  pre_1990                — effective_date < 1990-01-01 (epoch math gone
                            wrong, bad parse).

Samples the last 1000 pages by default — fast on 200K-page brains. Fix
hint: gbrain reindex-frontmatter.

salience_health: detects pages with active takes whose emotional_weight
is still 0 (recompute_emotional_weight phase hasn't run since the
take landed). Reports the brain's non-zero emotional_weight count as
an informational baseline. Fix hint: gbrain dream --phase
recompute_emotional_weight.

Both checks gracefully skip on pre-v0.29.1 brains (column doesn't
exist → 42703) without surfacing as warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: docs + skills convention + CHANGELOG + version bump

- VERSION 0.29.0 → 0.29.1
- package.json version bump
- CHANGELOG.md: full release-summary + itemized + "To take advantage"
  block per the project's voice rules. Two-line headline + concrete
  pathology framing (existing callers unchanged; new axes opt-in;
  agent in charge per the prime directive).
- skills/conventions/salience-and-recency.md: agent-readable decision
  rules. "Current state → on. Canonical truth → off." plus the narrow
  temporal-bound exception. Cross-cutting convention propagates to
  brain skills via RESOLVER.md.
- skills/migrations/v0.29.1.md: agent-readable upgrade instructions.
  Verify steps + behavior-change reference + recovery commands.

The build-time tool-description generator from D2 (extract decision
tables from skills/conventions/salience-and-recency.md, embed into
operations.ts at build time) is deferred to a follow-up commit. The
tool descriptions on the query op + get_recent_salience are inline in
operations.ts for v0.29.1; the auto-gen + CI staleness gate land in
v0.29.2 if drift becomes a problem in practice.

148 unit tests pass across the v0.29.1 surface (effective-date,
recency-decay, query-intent, migrate, salience, recompute-emotional-weight).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Wintermute <wintermute@garrytan.com>

---------

Co-authored-by: Wintermute <wintermute@garrytan.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Wintermute <wintermute@garrytan.com>
garrytan added a commit that referenced this pull request May 8, 2026
… what's hot without being asked (#730)

* v0.29 foundation: emotional_weight column + formula + anomaly stats

Migration v34 adds pages.emotional_weight REAL DEFAULT 0.0 (column-only,
no index — salience query orders by computed score, not raw weight).
Embedded DDL (schema.sql + pglite-schema.ts + schema-embedded.ts)
mirrors the column so fresh installs don't need migration replay.

types.ts gains: PageFilters.sort enum + PAGE_SORT_SQL whitelist (engines
hardcoded ORDER BY updated_at DESC; threading lands in the next commit);
SalienceOpts/SalienceResult, AnomaliesOpts/AnomalyResult,
EmotionalWeightInputRow/EmotionalWeightWriteRow contracts.

cycle/emotional-weight.ts: pure-function score in [0..1] from tags +
takes (anglocentric default seed list; user-overridable via config key
emotional_weight.high_tags). cycle/anomaly.ts: meanStddev + cohort
threshold helpers with zero-stddev fallback (count > mean + 1) so rare
cohorts don't produce NaN sigmas.
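The zero-stddev fallback in sketch form (helper names assumed; population stddev shown). With stddev = 0, the usual (count − mean) / stddev would be NaN or Inf, so rare cohorts flag on the simple `count > mean + 1` rule instead:

```typescript
function meanStddev(xs: number[]): { mean: number; stddev: number } {
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  const variance = xs.reduce((a, b) => a + (b - mean) ** 2, 0) / xs.length;
  return { mean, stddev: Math.sqrt(variance) };
}

function isAnomalous(count: number, mean: number, stddev: number, sigma: number): boolean {
  if (stddev === 0) return count > mean + 1; // finite fallback, never NaN
  return (count - mean) / stddev > sigma;
}
```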

Test coverage: migrate v34 structural assertions + 14-case formula
unit + 13-case anomaly stats unit. Codex review fixes baked in:
formula clamped to [0,1]; per-take weight clamped to [0,1] before
averaging; zero-stddev fallback finite, never NaN.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29 engine: batch emotional-weight methods + listPages sort

BrainEngine adds 4 methods, both engines implement:

- batchLoadEmotionalInputs(slugs?): CTE-shaped read with per-table
  pre-aggregates. A page with N tags + M takes never produces N×M rows
  (codex C4#4) — page_tags + page_takes CTEs aggregate independently,
  then LEFT JOIN to pages.

- setEmotionalWeightBatch(rows): UPDATE FROM unnest($1::text[],
  $2::text[], $3::real[]) composite-keyed on (slug, source_id). Multi-
  source brains can't fan out (codex C4#3) — pages.slug is unique only
  within source_id. Same shape that v0.18 link batches use.

- getRecentSalience: time boundary computed in JS, bound as TIMESTAMPTZ.
  SQL identical across engines (codex C5/D5 — avoids dialect drift on
  $1::interval binding which has zero current uses on PGLite).

- findAnomalies: tag + type cohort baselines via generate_series-
  densified daily-count CTEs (codex C4#6). Sparse-day rare cohorts get
  correct (mean, stddev) instead of biased upward by zero-omission.
  Year cohort deferred to v0.30.

listPages threads the new PageFilters.sort enum through both engines.
Was hardcoded ORDER BY updated_at DESC; now PAGE_SORT_SQL whitelist
maps the 4 enum values to literal SQL fragments — no injection surface.
postgres.js uses sql.unsafe; PGLite splices the fragment directly.
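The whitelist pattern in sketch form (enum values from this commit; the literal fragment text is an assumption). Only these four literals can ever reach the SQL string — an unknown key falls back to the default:

```typescript
const PAGE_SORT_SQL: Record<string, string> = {
  updated_desc: 'ORDER BY updated_at DESC',
  updated_asc: 'ORDER BY updated_at ASC',
  created_desc: 'ORDER BY created_at DESC',
  slug: 'ORDER BY slug ASC',
};

function sortFragment(sort?: string): string {
  // Defense in depth: unsupported enum values (or injection attempts)
  // resolve to the pre-v0.29 default ordering.
  return PAGE_SORT_SQL[sort ?? 'updated_desc'] ?? PAGE_SORT_SQL.updated_desc;
}
```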

Regression tests (PGLite, no DATABASE_URL needed):

- multi-source-emotional-weight: same slug under two source_ids,
  setEmotionalWeightBatch on one of them, asserts the other survives
  untouched. Direct codex C4#3 guard.

- list-pages-regression (IRON RULE): old call shape (type, tag, limit)
  still returns updated_desc default; new sort=updated_asc reverses;
  sort=created_desc orders by created_at; sort=slug alphabetical;
  unsupported sort enum falls back to default (defense in depth).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29 cycle: new recompute_emotional_weight phase

Adds a 9th cycle phase between extract and embed. Sees the union of
syncPagesAffected + synthesizeWrittenSlugs for incremental mode (so
synthesize-written pages get their weight computed too — codex C2 caught
that the prior plan threaded only sync). Full mode (no incremental
anchors) walks every page; users hit this path on first upgrade via
gbrain dream --phase recompute_emotional_weight.

Phase orchestrator (cycle/recompute-emotional-weight.ts) is two SQL
round-trips total regardless of brain size:
  1. batchLoadEmotionalInputs(slugs?) → per-page tag/take inputs.
  2. computeEmotionalWeight in memory (pure function).
  3. setEmotionalWeightBatch(rows) → composite-keyed UPDATE FROM unnest.

Empty affectedSlugs short-circuits (no DB read, no write). Dry-run
computes weights and reports the would-write count without touching
the DB. Engine throw bubbles into status:fail with code
RECOMPUTE_EMOTIONAL_WEIGHT_FAIL — cycle continues to the next phase.
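The three-step body above in code form (method names from this commit; row shapes simplified). Two round-trips regardless of brain size, with the empty-slugs short-circuit first:

```typescript
type EwRow = { slug: string; sourceId: string; tags: string[]; takeWeights: number[] };

async function recomputeEmotionalWeightSketch(
  engine: {
    batchLoadEmotionalInputs(slugs?: string[]): Promise<EwRow[]>;
    setEmotionalWeightBatch(rows: Array<{ slug: string; sourceId: string; weight: number }>): Promise<void>;
  },
  computeEmotionalWeight: (tags: string[], takeWeights: number[]) => number,
  affectedSlugs?: string[], // undefined = full mode, walk every page
): Promise<number> {
  if (affectedSlugs && affectedSlugs.length === 0) return 0; // no DB read, no write
  const inputs = await engine.batchLoadEmotionalInputs(affectedSlugs); // round-trip 1
  const rows = inputs.map((r) => ({
    slug: r.slug,
    sourceId: r.sourceId,
    weight: computeEmotionalWeight(r.tags, r.takeWeights), // pure, in memory
  }));
  await engine.setEmotionalWeightBatch(rows); // round-trip 2
  return rows.length;
}
```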

Plumbing:
- CyclePhase type adds 'recompute_emotional_weight'.
- ALL_PHASES + NEEDS_LOCK_PHASES include it.
- CycleReport.totals adds pages_emotional_weight_recomputed (additive,
  schema_version stays "1").
- runCycle's totals rollup + status derivation honor the new field.
- synthesize.ts emits writtenSlugs in details so cycle.ts can union
  with syncPagesAffected for incremental backfill.

Tests: 7-case unit (fake-engine), 3-case PGLite e2e (full mode + dry-
run + ALL_PHASES position), 1000-page perf budget (<5s on PGLite).

Codex C2 → A: clean separation. Phase doesn't modify runExtractCore;
runs on its own seam after the existing 8 phases plus synthesize.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29 ops: get_recent_salience + find_anomalies + get_recent_transcripts

Three new MCP operations + a transcripts library:

- get_recent_salience: pages ranked by emotional + activity salience.
  Subagent-allow-listed. params: days (default 14), limit (default 20,
  capped 100), slugPrefix (renamed from `kind` per codex C4#10 to
  avoid collision with PageKind/TakeKind).

- find_anomalies: cohort-level activity outliers (tag + type).
  Subagent-allow-listed. Year cohort deferred to v0.30.

- get_recent_transcripts: raw .txt transcripts from the dream-cycle
  corpus dirs. LOCAL-ONLY: rejects ctx.remote === true with
  permission_denied (codex C3). NOT in the subagent allow-list — all
  subagent calls run with remote=true, would always reject (footgun if
  visible). Cycle's synthesize phase calls discoverTranscripts
  directly, so subagents that need transcripts go through the library
  function, not the op.

Tool descriptions extracted to src/core/operations-descriptions.ts so
they're pinnable in tests and stable for the Tier-2 LLM routing eval.
Redirects on query/search/list_pages: personal/emotional questions
should reach the new ops, not semantic search. Anti-flattery hint on
query: "Do NOT assume words like crazy, notable, or big mean
impressive — they often mean difficult or emotionally charged."

list_pages gains updated_after (string ISO) and sort enum params,
surfacing the engine threading from the prior commit.

src/core/transcripts.ts: filesystem walk shared by the gated MCP op
and the (commit 5) CLI command. Reuses discoverTranscripts corpus-dir
resolution + isDreamOutput from cycle/transcript-discovery.ts. Trust
gate lives in the op handler, not the library — the library is
trusted by both the gated op and the local CLI.

Allow-list: 11 → 13 (add salience + anomalies; transcripts excluded
per codex C3, with a comment explaining why).

Tests: 21-case description pin (catches accidental edits that change
LLM-facing surface); 11-case transcripts unit covering trust gate,
mtime window, dream-output skip, summary truncation, no corpus_dir;
2-case salience type-contract smoke (full Garry-test fixture in commit
6's e2e suite).

Codex C1: routing-eval fixtures (skills/<x>/routing-eval.jsonl)
deliberately NOT shipped — routing-eval.ts is substring-match on
resolver triggers, not MCP tool routing. Real coverage lands as
test/e2e/salience-llm-routing.test.ts in commit 6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29 CLI: gbrain salience / anomalies / transcripts

Three new CLI commands wired into src/cli.ts dispatch + CLI_ONLY set +
help text:

- gbrain salience [--days N] [--limit N] [--kind PREFIX] [--json]
- gbrain anomalies [--since YYYY-MM-DD] [--lookback-days N] [--sigma N] [--json]
- gbrain transcripts recent [--days N] [--full] [--json]

Each command file mirrors src/commands/orphans.ts shape: pure data fn
+ JSON formatter + human formatter. Calls into engine.getRecentSalience
/ findAnomalies (already shipped) and src/core/transcripts.ts.

salience and anomalies show ranked rows with per-cohort
mean/stddev/sigma. transcripts honors `--full` (caps at 100KB/file)
vs default summary (first non-empty line + ~250 chars). All three
emit JSON with --json for agent consumption.

`--kind` is accepted as a slug-prefix shorthand on `gbrain salience`
even though the underlying op param is `slugPrefix` (kept the CLI
flag short; the MCP-facing param uses the more-explicit name to
align with PageKind/TakeKind/slugPrefix vocabulary).

CLI_ONLY set in src/cli.ts gains the three new command names so
they don't get forwarded to MCP-only routing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29 e2e: Garry-test fixtures + Postgres parity + LLM routing eval

PGLite e2e (no DATABASE_URL needed):

- salience-pglite: the Garry test. 7 wedding-tagged pages updated today
  + 100 background pages backdated across 30 days via raw SQL UPDATE
  (codex C4#7 — engine.putPage stamps updated_at = now(), so seeding
  via the engine alone can't reproduce historical recency windows).
  Asserts wedding pages outrank random-tag noise in the 7-day window;
  slugPrefix filter narrows correctly; days=0 boundary case; limit cap.

- anomalies-pglite: same fixture shape (7 wedding pages today, 100
  background backdated). findAnomalies with sigma=3 returns the
  wedding-tag cohort with sigma_observed > 3 vs near-zero baseline;
  page_slugs sample carries the wedding pages; date with no activity
  returns []; high sigma threshold suppresses borderline cohorts
  (zero-stddev fallback stays finite — no NaN sigma).

Postgres-gated e2e:

- engine-parity-salience: PGLite ↔ Postgres parity for getRecentSalience
  and findAnomalies. Same fixture into both engines; top-result and
  cohort-set match. Closes the v0.22.0-style parity gap for the new
  v0.29 SQL idioms (EXTRACT(EPOCH ...), generate_series, CTE chain).

Tier-2 LLM routing eval (ANTHROPIC_API_KEY-gated):

- salience-llm-routing: calls Claude with v0.29 tool descriptions and
  12 personal-query phrasings ("anything crazy lately", "what's been
  going on with me", etc.). Asserts the chosen tool is in the v0.29
  set, not query() / search(). ~$0.10 per CI run on Haiku. Tests the
  ACTUAL ship criterion — replaces the discarded fake-coverage
  routing-eval.jsonl fixtures (codex C1 → B).

This is the only test that proves the description edits drive routing.
Without it, we'd ship description changes and only learn from
production behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.0: ship-prep — VERSION + CHANGELOG + CLAUDE Key Files

VERSION + package.json bump 0.28.0 → 0.29.0.

CHANGELOG.md adds a v0.29.0 release-summary in the GStack/Garry voice
plus the "To take advantage of v0.29.0" block. Headline two-liner:
"The brain tells you what's hot without being asked. Salience +
anomaly detection ship. Search rewards hypotheses; salience surfaces
them." Numbers-that-matter table covers engine surface delta, MCP op
delta, allow-list delta, cycle-phase delta, schema migration, list_pages
param surface, and test count. Itemized changes section lists the
schema migration + new cycle phase + new MCP ops + redirect
descriptions + subagent allow-list rules + new tests + a contributor
note clarifying that routing-eval.ts is not the right surface for
testing MCP tool routing (use the Tier-2 LLM eval pattern instead).

CLAUDE.md Key Files updated for the v0.29 surface:

- src/core/engine.ts: notes the 4 new methods + PageFilters.sort threading.
- src/core/migrate.ts: v34 (pages_emotional_weight) entry.
- src/core/cycle.ts: 8 → 9 phases, recompute_emotional_weight inserted
  between patterns and embed; totals.pages_emotional_weight_recomputed.
- src/core/cycle/emotional-weight.ts (NEW): formula + override path.
- src/core/cycle/anomaly.ts (NEW): stats helpers + zero-stddev fallback.
- src/core/cycle/recompute-emotional-weight.ts (NEW): phase orchestrator.
- src/core/transcripts.ts (NEW): library shared by gated MCP op + CLI.
- src/core/operations-descriptions.ts (NEW): pinned tool descriptions.
- src/core/minions/tools/brain-allowlist.ts: 11 → 13 entries; comment
  on why get_recent_transcripts is excluded.
- src/commands/salience.ts / anomalies.ts / transcripts.ts (NEW): CLI surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1 feat: recency + salience as two orthogonal options on query op (#696)

* feat: recency boost for search (v0.27.0) — temporal intent auto-detection, date filters, configurable decay

New search pipeline stage: keyword + vector → RRF → cosine re-score → backlink boost → recency boost → dedup

- applyRecencyBoost: hyperbolic decay, two strengths (moderate 30-day halflife, aggressive 7-day halflife)
- Auto-enabled when intent.ts detects temporal/event queries (detail='high')
- Manual override via SearchOpts.recencyBoost (0/1/2)
- Date filtering: afterDate/beforeDate on all three search paths (keyword, keywordChunks, vector)
- getPageTimestamps on both Postgres and PGLite engines
- 15 tests passing (boost math + intent classification)

* v0.29.1 schema: pages.{effective_date, effective_date_source, import_filename, salience_touched_at} + expression index

Migration v38 adds 4 nullable columns to pages and an expression index on
COALESCE(effective_date, updated_at) to support the new since/until date
filters. All additive — no behavior change in the default search path; only
consulted when callers opt into the new salience='on' / recency='on' axes
or pass since/until.

  effective_date         — content date (event_date / date / published /
                           filename-date / fallback). Read by recency boost
                           and date-filter paths only. Auto-link doesn't
                           touch it (immune to updated_at churn).
  effective_date_source  — sentinel for the doctor's effective_date_health
                           check ('event_date' | 'date' | 'published' |
                           'filename' | 'fallback').
  import_filename        — basename without extension, captured at import.
                           Used for filename-date precedence on daily/,
                           meetings/. Older rows leave it NULL.
  salience_touched_at    — bumped by recompute_emotional_weight when
                           emotional_weight changes. Salience window uses
                           GREATEST(updated_at, salience_touched_at) so
                           newly-salient old pages enter the recent salience
                           query.

Index strategy: a partial index on effective_date alone wouldn't help the
COALESCE expression in since/until filters (planner can't use it for the
negative side). The expression index ((COALESCE(effective_date, updated_at)))
is what actually accelerates the filter.

Postgres uses CONCURRENTLY + v14-style pg_index.indisvalid pre-drop guard
for prior failed CONCURRENTLY runs; PGLite uses plain CREATE INDEX. Mirror
of v34's pattern.

src/schema.sql + src/core/pglite-schema.ts updated for fresh installs;
src/core/schema-embedded.ts regenerated via bun run build:schema.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: computeEffectiveDate helper + putPage integration

Pure helper computing a page's effective_date from frontmatter precedence:
  1. event_date (meeting/event pages)
  2. date (dated essays)
  3. published (writing/)
  4. filename-date (leading YYYY-MM-DD in basename)
  5. updated_at (fallback)
  6. created_at (last resort)

Per-prefix override: for daily/ and meetings/ slugs, filename-date jumps
to position 1 — the filename is the user's primary signal there.

Returns {date, source}. The source label powers the doctor's
effective_date_health check to detect "fell back to updated_at" rows that
look populated but are functionally a NULL.

Range validation: parsed value must be in [1990-01-01, NOW + 1 year].
Out-of-range values drop to the next chain element.
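The chain + range validation condensed into a sketch (signature assumed; the shipped helper also applies the daily/ and meetings/ per-prefix override and looser date parsing):

```typescript
function computeEffectiveDateSketch(
  fm: { event_date?: string; date?: string; published?: string },
  filenameDate: string | undefined, // leading YYYY-MM-DD pulled from basename
  updatedAt: Date,
  now = new Date(),
): { date: Date; source: string } {
  const chain: Array<[string, string | undefined]> = [
    ['event_date', fm.event_date],
    ['date', fm.date],
    ['published', fm.published],
    ['filename', filenameDate],
  ];
  const min = new Date('1990-01-01');
  const max = new Date(now.getTime() + 365 * 24 * 3600 * 1000); // NOW + 1 year
  for (const [source, raw] of chain) {
    if (!raw) continue;
    const d = new Date(raw);
    if (Number.isNaN(d.getTime()) || d < min || d > max) continue; // fall through
    return { date: d, source };
  }
  return { date: updatedAt, source: 'fallback' };
}
```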

Wired into importFromContent + importFromFile. The put_page MCP op derives
filename from slug-tail when no caller-supplied filename is available.

putPage SQL on both engines extended to write the new columns. ON CONFLICT
uses COALESCE(EXCLUDED.x, pages.x) so callers that don't know about the
new columns (auto-link, code reindex) preserve existing values rather than
blanking them. SELECT projection extended to return them; rowToPage threads
them through.

21 unit tests covering: precedence chain default order, per-prefix override,
parse failure fall-through, range validation [1990, NOW+1y], parseDateLoose
shape variants. All pass; typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: backfill orchestrator + library function for existing pages

src/core/backfill-effective-date.ts is the shared library function. Walks
pages in keyset-paginated batches (id > last_id ORDER BY id LIMIT 1000),
runs computeEffectiveDate per row, UPDATEs effective_date +
effective_date_source. Resumable via the `backfill.effective_date.last_id`
checkpoint key in the config table — a killed process can re-run and pick
up without re-doing rows. Idempotent: a full re-walk produces the same
writes.

Postgres-only: SET LOCAL statement_timeout = '600s' per batch. Doesn't
refuse the migration on low session settings (codex pass-2 #16).
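The keyset loop skeleton (engine/row shapes simplified; checkpoint key name from this commit). Checkpointing after each batch is what makes a killed run resumable, and re-walking the same rows is harmless because the writes are idempotent:

```typescript
async function backfillLoopSketch(
  fetchBatch: (afterId: number, limit: number) => Promise<Array<{ id: number }>>,
  processBatch: (rows: Array<{ id: number }>) => Promise<void>,
  saveCheckpoint: (lastId: number) => Promise<void>, // backfill.effective_date.last_id
  startAfterId = 0,
  batchSize = 1000,
): Promise<number> {
  let lastId = startAfterId;
  let total = 0;
  for (;;) {
    const rows = await fetchBatch(lastId, batchSize); // WHERE id > $1 ORDER BY id LIMIT $2
    if (rows.length === 0) return total;
    await processBatch(rows);
    lastId = rows[rows.length - 1].id;
    await saveCheckpoint(lastId); // a killed process re-runs from here
    total += rows.length;
  }
}
```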

src/commands/migrations/v0_29_1.ts is the orchestrator (4 phases mirroring
v0_12_2). Phase A schema (gbrain init --migrate-only), Phase B backfill
(via the library function), Phase C verify (count NULL effective_date),
Phase D record (handled by runner). The library function is reusable from
the gbrain reindex-frontmatter CLI command in the next commit.

import_filename stays NULL for backfilled rows — pre-v0.29.1 imports
didn't capture it. computeEffectiveDate uses the slug-tail when filename
is NULL; daily/2024-03-15 backfilled gets effective_date from the slug.

Registered in src/commands/migrations/index.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: gbrain reindex-frontmatter CLI command

Recovery / explicit-rebuild path for pages.effective_date. Used when:
  - User edited frontmatter dates after import
  - Post-upgrade backfill orchestrator finished but the user wants to
    re-walk a subset (e.g. just meetings/) after fixing some frontmatter
  - Precedence rules change between releases

Thin wrapper over backfillEffectiveDate from commit 3 — same code path
the v0_29_1 orchestrator uses; one source of truth.

Flags mirror reindex-code:
  --source <id>      Scope to one sources row (placeholder; the library
                     doesn't filter by source today, tracked v0.30+)
  --slug-prefix P    Scope to slugs starting with P (e.g. 'meetings/')
  --dry-run          Print what WOULD change, no DB writes
  --yes              Skip confirmation prompt (required for non-TTY non-JSON)
  --json             Machine-readable result envelope
  --force            Re-apply even when computed value matches existing

Wired into src/cli.ts. CLI handles its own engine lifecycle (creates +
disconnects).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: recency-decay map + buildRecencyComponentSql (pure, unused)

src/core/search/recency-decay.ts mirrors source-boost.ts in shape but
drives RECENCY ONLY (per D9 codex resolution). Salience is a separate
orthogonal axis; this map does not feed it.

DEFAULT_RECENCY_DECAY: 10 generic prefixes (no fork-specific names).
  - concepts/      evergreen (halflifeDays=0)
  - originals/     180d × 0.5 (long-tail decay; new essays nudged)
  - writing/       365d × 0.4
  - daily/         14d × 1.5  (aggressive — freshness IS the signal)
  - meetings/      60d × 1.0
  - chat/          7d × 1.0
  - media/x/       7d × 1.5
  - media/articles/ 90d × 0.5
  - people/companies/ 365d × 0.3
  - deals/         180d × 0.5

DEFAULT_FALLBACK: 90d × 0.5 for unmatched slugs.

Override priority: defaults < gbrain.yml recency: < env (GBRAIN_RECENCY_DECAY)
< per-call SearchOpts.recency_decay.

parseRecencyDecayEnv format: comma-separated prefix:halflifeDays:coefficient
triples. Refuses LOUD on parse error (RecencyDecayParseError) — codex
pass-2 #M3 finding. No silent fallback like source-boost's parser.
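A sketch of the loud parser (error class name from this commit; validation details assumed). Format: `daily/:14:1.5,concepts/:0:0` — any malformed triple throws instead of degrading:

```typescript
class RecencyDecayParseError extends Error {}

type DecayEntry = { halflifeDays: number; coefficient: number };

function parseRecencyDecayEnvSketch(raw: string): Record<string, DecayEntry> {
  const out: Record<string, DecayEntry> = {};
  for (const triple of raw.split(',')) {
    const parts = triple.split(':');
    if (parts.length !== 3) throw new RecencyDecayParseError(`bad triple: ${triple}`);
    const [prefix, h, c] = parts;
    const halflifeDays = Number(h);
    const coefficient = Number(c);
    if (!prefix || Number.isNaN(halflifeDays) || Number.isNaN(coefficient)) {
      // Refuse LOUD — no silent fallback.
      throw new RecencyDecayParseError(`bad triple: ${triple}`);
    }
    out[prefix] = { halflifeDays, coefficient };
  }
  return out;
}
```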

parseRecencyDecayYaml takes already-parsed YAML; throws on bad shape.

buildRecencyComponentSql in sql-ranking.ts emits a CASE expression with
longest-prefix-first ordering, evergreen short-circuit (literal 0 when
halflifeDays=0 or coefficient=0), and EXTRACT(EPOCH ...) for non-zero
branches. Output: ((CASE WHEN p.slug LIKE 'daily/%' THEN 1.5 * 14.0 /
(14.0 + EXTRACT(EPOCH FROM (NOW() - <dateExpr>))/86400.0) ... END))

Typed NowExpr enum prevents SQL injection (codex pass-1 #5). Tests pass
{ kind: 'fixed', isoUtc } for deterministic output; production NOW().
The 'fixed' branch escapes single quotes via escapeSqlLiteral.

25 unit tests covering: env parser shape, env error cases, yaml parser
shape, merge precedence (defaults < yaml < env < caller), CASE longest-
prefix-first ordering, evergreen short-circuit, NowExpr fixed/now,
single-quote injection defense, empty decayMap fallback path, default
map composition (no fork names, concepts/ evergreen, daily/ aggressive).

Pure module. Zero consumers in this commit; commit 6 wires it into
getRecentSalience, commit 10 wires it into the post-fusion stage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: refactor getRecentSalience to consume buildRecencyComponentSql

Both engines (Postgres + PGLite) now build the salience formula's third
term via buildRecencyComponentSql instead of inlining 1.0 / (1 + days_old).
Parameters: empty decayMap + fallback { halflifeDays: 1, coefficient: 1.0 }.
Math expands to 1 * 1.0 / (1.0 + days_old) = 1 / (1 + days_old) — same
numeric output as v0.29.0.

This is a no-behavior-change refactor preparing for commit 7's recency_bias
param. recency_bias='flat' (default) reproduces v0.29.0 exactly; 'on'
swaps in DEFAULT_RECENCY_DECAY for per-prefix decay.

Single source of truth for the recency math: same builder feeds the
salience query AND (in commit 10) the post-fusion applyRecencyBoost stage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: get_recent_salience gains recency_bias param (default 'flat')

SalienceOpts.recency_bias: 'flat' | 'on' added; default 'flat' preserves
v0.29.0 ranking verbatim. Pass 'on' to opt into per-prefix decay map
(concepts/originals/writing/ evergreen; daily/, media/x/, chat/ aggressive
decay).

When recency_bias='on', the salience query reads
COALESCE(p.effective_date, p.updated_at) instead of bare p.updated_at, so
the recency component is immune to auto-link updated_at churn — old
concepts/ pages just-touched by auto-link don't suddenly look fresh.

Both engines (Postgres + PGLite) wire the param through. resolveRecencyDecayMap()
honors gbrain.yml + GBRAIN_RECENCY_DECAY env at runtime.

MCP op surface: get_recent_salience gains the param with a load-bearing
description teaching the agent when to use 'on' vs 'flat' (current state →
on; mattering across all time → flat).

No silent v0.29.0 behavior change — opt-in only (per D11 codex resolution).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: recompute_emotional_weight writes salience_touched_at; window picks up newly-salient pages

setEmotionalWeightBatch on both engines now bumps salience_touched_at to
NOW() ONLY when the new emotional_weight differs from the existing one
(IS DISTINCT FROM, NULL-safe). No-op writes (same weight) leave the
column alone — preserves "actual change" semantics.

getRecentSalience window changes from
  WHERE p.updated_at >= boundary
to
  WHERE GREATEST(p.updated_at, COALESCE(p.salience_touched_at, p.updated_at)) >= boundary

Closes codex pass-1 finding #4: pages whose emotional_weight just changed
in the dream cycle (because tags or takes shifted) but whose updated_at
is older than the salience window now correctly enter the recent-salience
results. Without this, "Garry just added a take to a 6-month-old page"
stayed invisible to get_recent_salience until the next content edit.

COALESCE(salience_touched_at, p.updated_at) handles pre-v0.29.1 rows
where salience_touched_at is NULL — they fall back to p.updated_at and
behave identically to v0.29.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
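The conditional-touch write and the widened window predicate can be sketched as two SQL fragments. Table and column names follow the commit text; the parameter layout and the single-row form are illustrative (the real `setEmotionalWeightBatch` batches rows across both engines).

```typescript
// Sketch only. IS DISTINCT FROM is the NULL-safe inequality test, so a no-op
// write (same weight, or NULL on both sides) leaves salience_touched_at alone.
// In Postgres, column references in SET expressions read the OLD row value,
// which is exactly what the change test needs.
const TOUCH_SQL = `
UPDATE pages SET
  emotional_weight    = $2,
  salience_touched_at = CASE
    WHEN emotional_weight IS DISTINCT FROM $2 THEN NOW()
    ELSE salience_touched_at
  END
WHERE id = $1`;

// Window predicate: a page enters the recent-salience window if EITHER its
// content changed OR its salience was touched inside the boundary. COALESCE
// keeps pre-v0.29.1 rows (NULL salience_touched_at) on v0.29.0 behavior.
const WINDOW_SQL = `
WHERE GREATEST(p.updated_at,
               COALESCE(p.salience_touched_at, p.updated_at)) >= $1`;
```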

* v0.29.1: merge intent.ts → query-intent.ts; emit 3 suggestions per query

D1 + D4 + D6 + D8: single regex-pass classifier returning
{intent, suggestedDetail, suggestedSalience, suggestedRecency}.

intent + suggestedDetail are v0.29.0 behavior verbatim (legacy intent.ts
deleted; classifyQueryIntent + autoDetectDetail compat shims preserved).

NEW for v0.29.1 — two orthogonal recency-axis suggestions:

  suggestedSalience: 'off' | 'on' | 'strong'
  suggestedRecency:  'off' | 'on' | 'strong'

Resolution rules (per D6 narrow temporal-bound exception):
  - CANONICAL patterns (who is X / what is Y / code / graph) → both off
  - UNLESS an EXPLICIT_TEMPORAL_BOUND also matches (today / right now /
    this week / since X / last N days), in which case temporal-bound wins
  - STRONG_RECENCY (today / right now / this morning / just now) → strong
  - RECENCY_ON (latest / recent / this week / meeting prep / catch up
    / remind me / status update) → on
  - SALIENCE_ON (catch up / remind me / status update / prep me /
    what's going on / what matters) → on
  - default → off for both axes (v0.29.1 prime-directive: pure opt-in)

Salience and recency are TRULY orthogonal (per D9). A query like
"latest news on AI" → recency='on' but salience='off' (the user wants
fresh, not emotionally-weighted). "What's going on with widget-co" →
both on. "Who is X right now" → recency 'strong', salience 'on' (the
temporal bound beats canonical 'who is').

intent.ts deleted; test/intent.test.ts renamed → test/query-intent-legacy.test.ts
(unchanged behavior coverage). New test/query-intent.test.ts adds 21
cases covering all three axes' interactions: canonical wins on bare
'who is', temporal bound overrides, "catch me up" matches with up to 15
chars between, "today" → strong, intent vs recency independence.

Updated callers:
  - src/core/search/hybrid.ts (autoDetectDetail import)
  - test/recency-boost.test.ts (classifyQueryIntent import)
  - test/benchmark-search-quality.ts (autoDetectDetail import)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
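The resolution rules above can be sketched as a single regex pass. This is a simplified illustration: the pattern lists are abbreviated, the real classifier also returns `intent` and `suggestedDetail`, and the exact pattern sets live in src/core, so treat every regex here as an assumption.

```typescript
// Hypothetical sketch of the classifier's two recency-axis suggestions.
type Axis = "off" | "on" | "strong";

const CANONICAL = /\b(who is|what is|define)\b/i;
const EXPLICIT_TEMPORAL_BOUND = /\b(today|right now|this week|last \d+ days)\b/i;
const STRONG_RECENCY = /\b(today|right now|this morning|just now)\b/i;
// "catch ... up" allows up to two words between, approximating the
// "up to 15 chars between" behavior the tests pin.
const RECENCY_ON = /\b(latest|recent|this week|catch(?:\s+\w+){0,2}\s+up|remind me|status update)\b/i;
const SALIENCE_ON = /\b(catch(?:\s+\w+){0,2}\s+up|remind me|status update|what'?s going on|what matters)\b/i;

function classifyQuery(q: string): { suggestedSalience: Axis; suggestedRecency: Axis } {
  // Canonical lookups default both axes off, UNLESS an explicit temporal
  // bound also matches (the narrow D6 exception: temporal-bound wins).
  if (CANONICAL.test(q) && !EXPLICIT_TEMPORAL_BOUND.test(q)) {
    return { suggestedSalience: "off", suggestedRecency: "off" };
  }
  const recency: Axis =
    STRONG_RECENCY.test(q) ? "strong" : RECENCY_ON.test(q) ? "on" : "off";
  const salience: Axis = SALIENCE_ON.test(q) ? "on" : "off";
  return { suggestedSalience: salience, suggestedRecency: recency };
}
```

Note how the axes stay independent: "latest news on AI" trips only the recency patterns, while "catch me up" trips both.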

* v0.29.1: applySalienceBoost + applyRecencyBoost + runPostFusionStages wrapper

D9 + codex pass-1 #2 + #3 + pass-2 #4: salience and recency are TRULY
ORTHOGONAL post-fusion stages, both running from ALL THREE hybridSearch
return paths (keyword-only, embed-failure-fallback, full-hybrid).

NEW src/core/search/hybrid.ts exports:
  - applySalienceBoost(results, scores, strength)
      score *= 1 + k * log(1 + score) where k = 0.15 (on) or 0.30 (strong)
      No time component. Pure mattering signal.
  - applyRecencyBoost(results, dates, strength, decayMap, fallback, nowMs?)
      Per-prefix decay factor: 1 + strengthMul * coefficient * halflife / (halflife + days_old)
      strengthMul: 1.0 (on) or 1.5 (strong)
      Evergreen prefixes (halflifeDays=0) skipped (factor 1.0).
      Pure recency signal. Independent of mattering.
  - runPostFusionStages(engine, results, opts)
      Wraps backlink + salience + recency. Called from EACH return path so
      keyless installs and embed failures get the same boost surface as
      the full hybrid path.

NEW engine methods (composite-keyed for multi-source isolation):
  - getEffectiveDates(refs: Array<{slug, source_id}>): Map<key, Date>
      Returns COALESCE(effective_date, updated_at, created_at). Key format:
      `${source_id}::${slug}`. Mirror of getBacklinkCounts shape.
  - getSalienceScores(refs: Array<{slug, source_id}>): Map<key, number>
      Returns emotional_weight × 5 + ln(1 + take_count). Composite key.

Deprecated (kept for back-compat through v0.29.x):
  - SearchOpts.afterDate / beforeDate (alias for since/until)
  - SearchOpts.recencyBoost: 0|1|2 (alias for recency: 'off'|'on'|'strong')
  - getPageTimestamps (use getEffectiveDates instead)

NEW SearchOpts fields:
  - salience: 'off' | 'on' | 'strong'
  - recency:  'off' | 'on' | 'strong'
  - since:    string (ISO-8601 or relative, replaces afterDate)
  - until:    string (replaces beforeDate)

Resolution: caller-explicit > legacy alias (recencyBoost) > heuristic
(classifyQuery's suggestedSalience / suggestedRecency).

Deleted: src/core/search/recency.ts (from PR #618; replaced) +
test/recency-boost.test.ts (its scope is covered by query-intent.test.ts +
future post-fusion tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Wintermute <wintermute@garrytan.com>
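The two boost formulas can be sketched as pure functions. Assumptions: the `{ key, score }` result shape and the in-place mutation are illustrative; the real stages take engine-fetched Maps keyed `${source_id}::${slug}` and run inside `runPostFusionStages`.

```typescript
// Minimal sketch of the two orthogonal post-fusion stages.
type Strength = "on" | "strong";

function applySalienceBoost(
  results: { key: string; score: number }[],
  salience: Map<string, number>, // emotional_weight * 5 + ln(1 + take_count)
  strength: Strength,
): void {
  const k = strength === "strong" ? 0.3 : 0.15;
  for (const r of results) {
    const s = salience.get(r.key) ?? 0;
    // score *= 1 + k * log(1 + salience). Pure mattering, no time component;
    // a page with zero salience is left exactly as fused.
    r.score *= 1 + k * Math.log(1 + s);
  }
}

function recencyFactor(
  daysOld: number,
  decay: { halflifeDays: number; coefficient: number },
  strength: Strength,
): number {
  if (decay.halflifeDays === 0) return 1.0; // evergreen prefix: skipped
  const mul = strength === "strong" ? 1.5 : 1.0;
  // 1 + strengthMul * coefficient * halflife / (halflife + days_old):
  // decays smoothly toward 1.0 as the page ages past its half-life.
  return 1 + (mul * decay.coefficient * decay.halflifeDays) / (decay.halflifeDays + daysOld);
}
```

At `daysOld === halflifeDays` the recency factor sits exactly halfway between its day-zero peak and 1.0, which is the half-life intuition the decay map encodes.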

* v0.29.1: query op gains salience + recency + since + until params; PGLite since/until parity

Combines commits 12 + 13 of the plan.

Query op surface (src/core/operations.ts):
  - salience: 'off' | 'on' | 'strong' (with load-bearing description)
  - recency:  'off' | 'on' | 'strong'
  - since:    string (ISO-8601 or relative; replaces deprecated afterDate)
  - until:    string (replaces deprecated beforeDate)

Tool descriptions teach the calling agent:
  - salience axis = mattering, no time component
  - recency axis = age decay, no mattering signal
  - omit either to let gbrain auto-detect from query text via classifyQuery

hybrid.ts maps since/until → afterDate/beforeDate at the engine call
boundary so PR #618's existing engine plumbing keeps working without
rename. Codex pass-1 #10 finding closed.

PGLite engine (codex pass-1 #10): since/until parity added to all three
search methods (searchKeyword, searchKeywordChunks, searchVector). SQL
filter against COALESCE(p.effective_date, p.updated_at, p.created_at)
so date filtering matches user content-date intent (a meeting was on
event_date, not when it got reimported). Filter is applied INSIDE the
HNSW inner CTE in searchVector so HNSW's candidate pool already
excludes out-of-range pages — preserves pagination contract.

This also closes existing cross-engine drift: pre-v0.29.1 Postgres had
afterDate/beforeDate from PR #618; PGLite had nothing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29.1: migration v39 — eval_candidates capture columns for replay reproducibility

D11 codex pass-2 resolution: extend eval_candidates with 7 new nullable
columns so `gbrain eval replay` can reproduce captured runs of agent-explicit
salience + recency choices.

Without these columns, replays of the new axis params drift. The live
behavior depends on the resolved {salience, recency} values; v0.29.0's
schema doesn't capture them.

  as_of_ts            TIMESTAMPTZ  — brain's logical NOW at capture
                                     (replay uses this instead of wall-clock)
  salience_param      TEXT         — what the caller passed (NULL if omitted)
  recency_param       TEXT         — same
  salience_resolved   TEXT         — final value applied
  recency_resolved    TEXT         — same
  salience_source     TEXT         — 'caller' or 'auto_heuristic'
  recency_source      TEXT         — same

All nullable + additive. Pre-v0.29.1 rows stay valid. NDJSON
schema_version STAYS at 1 — consumers ignore unknown fields (codex
pass-1 #C2 dissolves; no cross-repo coordination needed).

ADD COLUMN with no DEFAULT is metadata-only on PG 11+ and PGLite —
instant on tables of any size.

src/schema.sql + src/core/pglite-schema.ts mirror the additions for fresh
installs; src/core/schema-embedded.ts regenerated. eval_capture.ts
populates the new fields in commit 16 (docs + ship).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
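The additive shape of the migration can be sketched as one ALTER. Column names and the `eval_candidates` table come from the commit text; the exact DDL layout is illustrative.

```typescript
// Sketch of migration v39's DDL. All columns nullable with no DEFAULT, so the
// ALTER is a catalog-only change: no table rewrite, instant at any row count.
const MIGRATION_SQL = `
ALTER TABLE eval_candidates
  ADD COLUMN as_of_ts          TIMESTAMPTZ,
  ADD COLUMN salience_param    TEXT,
  ADD COLUMN recency_param     TEXT,
  ADD COLUMN salience_resolved TEXT,
  ADD COLUMN recency_resolved  TEXT,
  ADD COLUMN salience_source   TEXT,
  ADD COLUMN recency_source    TEXT`;
```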

* v0.29.1: doctor checks — effective_date_health + salience_health

effective_date_health: sample-1000 scan detects three classes of
problems (codex pass-1 #5 resolution via the effective_date_source
sentinel column added in commit 1):

  fallback_with_fm_date  — page fell back to updated_at even though
                           frontmatter has parseable event_date / date /
                           published. The "wrong but populated" residual
                           that earlier review iterations missed.
  future_dated            — effective_date > NOW() + 1 year (corrupt
                            or typo'd century).
  pre_1990                — effective_date < 1990-01-01 (epoch math gone
                            wrong, bad parse).

Sample of last 1000 pages by default — fast on 200K-page brains. Fix
hint: gbrain reindex-frontmatter.

salience_health: detects pages with active takes whose emotional_weight
is still 0 (recompute_emotional_weight phase hasn't run since the
take landed). Reports the brain's non-zero emotional_weight count as
an informational baseline. Fix hint: gbrain dream --phase
recompute_emotional_weight.

Both checks gracefully skip on pre-v0.29.1 brains (column doesn't
exist → 42703) without surfacing as warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
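The two date-sanity classes can be sketched as a small classifier, assuming a bare `Date` input; the real check runs as SQL over a 1000-page sample and also consults `effective_date_source` for the `fallback_with_fm_date` class, which needs frontmatter access and is omitted here.

```typescript
// Sketch of the future_dated / pre_1990 sanity classes from the doctor check.
type DateProblem = "future_dated" | "pre_1990" | null;

function classifyEffectiveDate(effective: Date, now = new Date()): DateProblem {
  const oneYearOut = new Date(now.getTime() + 365 * 24 * 3600 * 1000);
  if (effective > oneYearOut) return "future_dated"; // corrupt or typo'd century
  if (effective < new Date("1990-01-01")) return "pre_1990"; // epoch math gone wrong
  return null; // plausible content date
}
```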

* v0.29.1: docs + skills convention + CHANGELOG + version bump

- VERSION 0.29.0 → 0.29.1
- package.json version bump
- CHANGELOG.md: full release-summary + itemized + "To take advantage"
  block per the project's voice rules. Two-line headline + concrete
  pathology framing (existing callers unchanged; new axes opt-in;
  agent in charge per the prime directive).
- skills/conventions/salience-and-recency.md: agent-readable decision
  rules. "Current state → on. Canonical truth → off." plus the narrow
  temporal-bound exception. Cross-cutting convention propagates to
  brain skills via RESOLVER.md.
- skills/migrations/v0.29.1.md: agent-readable upgrade instructions.
  Verify steps + behavior-change reference + recovery commands.

The build-time tool-description generator from D2 (extract decision
tables from skills/conventions/salience-and-recency.md, embed into
operations.ts at build time) is deferred to a follow-up commit. The
tool descriptions on the query op + get_recent_salience are inline in
operations.ts for v0.29.1; the auto-gen + CI staleness gate land in
v0.29.2 if drift becomes a problem in practice.

148 unit tests pass across the v0.29.1 surface (effective-date,
recency-decay, query-intent, migrate, salience, recompute-emotional-weight).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Wintermute <wintermute@garrytan.com>

---------

Co-authored-by: Wintermute <wintermute@garrytan.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29 master-rebase fixups: renumber + drift cleanup

- v0.29.1 migrations renumber v38/v39 → v41/v42 (master shipped takes_table at
  v37 + access_tokens_permissions at v38; v0.27.1 took v39). My v0.29.0
  emotional_weight slots in at v40; v0.29.1's pages_recency_columns lands at
  v41 and eval_candidates_recency_capture at v42.
- src/core/utils.ts comment refs updated v37 → v40 (emotional_weight) and
  v38 → v41 (effective_date/etc).
- test/brain-allowlist.test.ts: size assertion 11 → 13 + the new
  get_recent_salience / find_anomalies positive checks + the explicit
  get_recent_transcripts negative check (v0.29 added the salience pair to
  the allow-list; transcripts are deliberately excluded because all
  subagent calls have remote=true and the v0.29 trust gate rejects them —
  visibility would be a footgun).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29 CI fixups: privacy allow-list + cycle phase count + migration plan

Three CI test failures on PR #730, all caused by master-side state the
v0.29 cherry-picks didn't yet account for:

1. scripts/check-privacy.sh allow-lists test/recency-decay.test.ts
   The v0.29.1 recency-decay test asserts that DEFAULT_RECENCY_DECAY's
   keys do NOT include fork-specific path prefixes. Because the assertion
   has to name the banned tokens to assert their absence, the privacy
   guard flagged the literal occurrence. Same exception class as
   CHANGELOG.md, CLAUDE.md, and scripts/check-privacy.sh itself —
   meta-rule enforcement requires mentioning what the rule forbids.

2. test/core/cycle.serial.test.ts: 9 → 10 phases.
   The yieldBetweenPhases test was written for v0.26.5 (9 phases incl.
   purge). v0.29 added a 10th phase (recompute_emotional_weight)
   between patterns and embed; the test's expected hookCalls and
   report.phases.length needed bumping.

3. test/apply-migrations.test.ts: append '0.29.1' to skippedFuture lists.
   v0.29.1 added a new entry to src/commands/migrations/index.ts; the
   buildPlan test snapshots the exact ordered list of versions, so it
   needs the new entry in both the fresh-install case and the Codex H9
   regression case.

All three verified locally:
  - bash scripts/check-privacy.sh → exit 0
  - bun test test/apply-migrations.test.ts → 18/18 pass
  - bun test test/core/cycle.serial.test.ts → 28/28 pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29 CI fixup: regenerate llms-full.txt to match CLAUDE.md state

build-llms test asserts the committed llms.txt + llms-full.txt match
what the generator produces from the current source tree. CLAUDE.md
got new v0.29 Key Files entries (recompute_emotional_weight phase,
emotional-weight formula, anomaly stats, transcripts library, salience
ops, etc.) without a corresponding regen. `bun run build:llms` brings
llms-full.txt back in sync; llms.txt is byte-for-byte identical so
only the larger inline bundle changed.

Verified locally: bun test test/build-llms.test.ts → 7/7 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29 e2e: cover tool-surfaces + MCP dispatch path

Two gaps were uncovered when reviewing v0.29 coverage against the new
contracts the cherry-picks landed onto master.

1. test/v0_29-tool-surfaces.test.ts (unit, 9 cases)

   Existing tests pin the description constants module and the
   BRAIN_TOOL_ALLOWLIST set membership, but nothing checked the two
   filters that ACT on those constants:

   - serve-http.ts:745 filters operations by !op.localOnly to build the
     HTTP MCP tool list. Without a test, anyone removing `localOnly: true`
     from get_recent_transcripts would silently expose it to remote
     callers — defense-in-depth on top of the in-handler ctx.remote check
     would be the only guard. Now pinned: get_recent_transcripts is
     hidden, salience + anomalies stay visible.

   - buildBrainTools surfaces the v0.29 ops as `brain_get_recent_salience`
     and `brain_find_anomalies`, and EXCLUDES `brain_get_recent_transcripts`
     (codex C3 footgun gate — all subagent calls are remote=true, the op
     would always reject). Now pinned.

   Both filters are pure functions; no DB / engine.connect needed.

2. test/e2e/v0_29-mcp-dispatch-pglite.test.ts (e2e, 5 cases)

   Existing v0.29 e2e tests call engine methods directly. None went
   through the full dispatchToolCall pipeline that stdio MCP and HTTP
   MCP both use. The new file covers:

   - get_recent_salience returns ranked rows via dispatch (top result
     is the wedding-tagged page from the seeded fixture).
   - find_anomalies returns the AnomalyResult shape via dispatch.
   - get_recent_transcripts rejects with permission_denied when
     ctx.remote === true (the in-handler trust gate is the last line if
     localOnly ever drops).
   - get_recent_transcripts succeeds with ctx.remote === false (CLI
     path) and returns [] when no corpus dir is configured.
   - Unknown tool name returns the standard isError + "Unknown tool"
     envelope (regression guard for dispatch shape).

Verified locally — all 14 cases pass:
  bun test test/v0_29-tool-surfaces.test.ts                          → 9 pass
  bun test test/e2e/v0_29-mcp-dispatch-pglite.test.ts                → 5 pass

Re-ran the full v0.29 PGLite e2e suite to confirm no regressions:
  salience-pglite.test.ts                       5 pass
  anomalies-pglite.test.ts                      4 pass
  cycle-recompute-emotional-weight-pglite.test  3 pass
  list-pages-regression.test.ts                 6 pass
  multi-source-emotional-weight-pglite.test     4 pass
  backfill-perf-pglite.test.ts                  1 pass
  v0_29-mcp-dispatch-pglite.test.ts             5 pass
  -----
  Total: 28 pass / 0 fail
  Postgres parity test (DATABASE_URL gated)     7 skip (correct)
  LLM routing eval (ANTHROPIC_API_KEY gated)   12 skip (correct)
  bun run typecheck                             clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.29 CI fixup: drop unused PGLiteEngine in tool-surfaces test

scripts/check-test-isolation.sh's R3 + R4 lints flagged the new
test/v0_29-tool-surfaces.test.ts for instantiating PGLiteEngine outside
a beforeAll() block (R3) and lacking the matching afterAll(disconnect)
(R4). The intent of those rules is to prevent engine leaks across the
shard process — every PGLiteEngine must follow the canonical
beforeAll(connect+initSchema) / afterAll(disconnect) pattern.

The fix here is upstream of the rule, not a workaround: this test never
needed an engine. buildBrainTools doesn't issue any SQL at registry-build
time — it only reads `engine.kind` for the put_page namespace-wrap
branch. A `{ kind: 'pglite' } as unknown as BrainEngine` fake-engine
literal keeps the test pure-function: no WASM cold-start, no connect
lifecycle, no test-isolation rule fired.

Verified locally:
  bash scripts/check-test-isolation.sh → OK (257 non-serial unit files)
  bun test test/v0_29-tool-surfaces.test.ts → 9 pass
  bun run typecheck → clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Wintermute <wintermute@garrytan.com>