feat(exports): add ./enrichment to package.json exports map#555

Open
chengzehsu wants to merge 1 commit into garrytan:master from chengzehsu:feat/exports-map-enrichment

chengzehsu commented May 1, 2026

Summary

Adds a single ./enrichment entry to the exports map in package.json,
pointing to the existing src/core/enrichment-service.ts. Purely additive
— no file is moved, no symbol is renamed, no behavior changes.

     "./search/expansion": "./src/core/search/expansion.ts",
-    "./extract": "./src/commands/extract.ts"
+    "./extract": "./src/commands/extract.ts",
+    "./enrichment": "./src/core/enrichment-service.ts"
   },

Motivation

External consumers — e.g. user-built ingestion workers that want to plug
entity enrichment into their own pipelines — need to call the symbols
that src/core/enrichment-service.ts already exports:

  • extractAndEnrich
  • enrichEntity
  • enrichEntities
  • extractEntities
  • slugifyEntity
  • entityPagePath

Today, because enrichment-service.ts is not in the exports map, the
only way to reach those symbols from outside the package is a deep import:

import { extractAndEnrich } from "gbrain/src/core/enrichment-service";

That has two problems:

  1. Outside the public contract. Any internal reshuffle of src/core/
    silently breaks downstream consumers, because nothing pins this path
    as part of the API surface.
  2. Not type-resolvable by default. TypeScript with an exports-aware
    moduleResolution (node16, nodenext, or bundler) rejects deep imports
    that are not declared in exports, so consumers have to add custom
    paths mappings in their tsconfig.json just to compile.

After this change, consumers can write:

import { extractAndEnrich } from "gbrain/enrichment";

…in the same shape as the existing gbrain/extract, gbrain/operations,
gbrain/engine, etc.

Discovery

Found while wiring extractAndEnrich into an external eco-brain-worker
service as a Phase 1.5 follow-up — the deep-import workaround was the
clear sharp edge, and a 1-line addition to exports cleans it up for
every future consumer.

Test plan

  • package.json parses as valid JSON (verified locally with python -m json.tool).
  • Diff is exactly the 2 lines shown above (1 added entry + trailing comma on the previous line).
  • No other file touched.
  • CI green on this PR.

External consumers (e.g. user-built workers that wire entity enrichment
into ingestion pipelines) need to call `extractAndEnrich`, `enrichEntity`,
`enrichEntities`, `extractEntities`, `slugifyEntity`, and `entityPagePath`
as a library. Today the only way to reach `src/core/enrichment-service.ts`
from outside the repo is a deep import path like
`gbrain/src/core/enrichment-service.ts`, which:

- Sits outside the public exports contract, so it could break silently on
  any internal restructure.
- Is not resolvable by TypeScript without a custom `paths` mapping in the
  consumer's tsconfig.

This change adds a single `./enrichment` entry to the `exports` map, in
the same style as the existing `./extract`, `./operations`, `./engine`,
etc. entries. It is purely additive — no file is moved, no symbol is
renamed, and there is no behavioral change. Consumers can now write:

  import { extractAndEnrich } from "gbrain/enrichment";

The motivating use case is a Phase 1.5 follow-up in eco-brain-worker,
where this gap was discovered while wiring `extractAndEnrich` into the
worker's ingestion flow.

garrytan added a commit that referenced this pull request May 18, 2026
…mp (T15)

Adds the v0.36.0.0 admin SPA Calibration tab. Per the design review,
the approved variant-B (Linear calm clarity) layout: single-column flow,
generous whitespace, ONE big sparkline as hero, then patterns, then
domain bars, then abandoned threads.

D23 server-rendered SVG architecture:

  src/core/calibration/svg-renderer.ts — pure functions. data → SVG
  string. No DOM, no React, no chart library dep. Inlines the admin
  design tokens (#0a0a0f bg, #3b82f6 accent, etc.) so the SVG is
  visually consistent with the rest of the admin SPA.

  Four chart renderers:
    - renderBrierTrend({ series }) — sparkline w/ baseline reference
      at 0.25 (always-50% baseline)
    - renderDomainBars({ bars }) — horizontal accuracy bars per domain
    - renderAbandonedThreadsCard(threads) — D30/TD4 'revisit now' link
      per row, points at /admin/calibration/revisit/<takeId>
    - renderPatternStatementsCard(statements) — D29/TD3 clickable
      drill-down links per row, point at /admin/calibration/pattern/<i>

  XSS posture: all caller-controlled strings pass through escapeXml().
  Numeric inputs are .toFixed()-coerced. Admin SPA renders via
  dangerouslySetInnerHTML inside a TrustedSVG wrapper component;
  endpoint is gated by requireAdmin middleware.
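
The XSS posture described above can be sketched roughly as follows. `escapeXml` and the `.toFixed()` coercion are named in the commit message; the function body, the `renderLabelledBar` helper, its parameters, and the SVG layout are hypothetical illustrations and are not the actual svg-renderer.ts code.

```typescript
// Sketch only: escapeXml is named in the commit, but this body is an assumed implementation.
export function escapeXml(input: string): string {
  return input
    .replace(/&/g, "&amp;") // must run first so later entities are not double-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical helper (not in the source): one horizontal accuracy bar with a label.
export function renderLabelledBar(label: string, accuracy: number): string {
  // Non-finite input falls back to 0; out-of-range accuracy is clamped to [0, 1].
  const safe = Number.isFinite(accuracy) ? accuracy : 0;
  const clamped = Math.min(1, Math.max(0, safe));
  // Numeric values only reach the markup through toFixed(), never as raw strings.
  const width = (clamped * 200).toFixed(1);
  return (
    `<svg xmlns="http://www.w3.org/2000/svg" width="320" height="24">` +
    `<rect x="100" y="4" width="${width}" height="16" fill="#3b82f6"/>` +
    `<text x="0" y="16" fill="#777">${escapeXml(label)}</text>` +
    `</svg>`
  );
}
```

Because the renderer is a pure data-to-string function, the escaping can be unit-tested without a DOM, which matches the "pure functions" framing in the commit.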

  /admin/api/calibration/profile — returns the active profile row as JSON.
  /admin/api/calibration/charts/:type — returns image/svg+xml markup
    for type ∈ {brier-trend, domain-bars, pattern-statements,
                abandoned-threads}. Cache-Control: private, max-age=60.

  brier-trend currently renders a single-point series from the active
  profile (the time-series view across calibration_profiles.generated_at
  history is a v0.37 follow-up once we have multiple snapshots).
  abandoned-threads pulls the top 5 abandoned rows via the same SQL the
  doctor check uses.

CalibrationPage React component (admin/src/pages/Calibration.tsx):
  Fetches profile + 4 charts. Loading / error / cold-brain states all
  handled. Layout includes the audit annotations (partial-grade badge,
  voice-gate-fell-back-to-template badge) per the approved mockup.
  TrustedSVG wrapper isolates the dangerouslySetInnerHTML to the SVG
  surface only.

App.tsx nav: added 'calibration' page route + sidebar nav item, hash
routing extended to support #calibration.

TD2 contrast bump:
  admin/src/index.css --text-muted: #555 → #777. Old value was contrast
  4.0 on the #0a0a0f bg — below WCAG AA 4.5 for body text. New value is
  ~5.5, passes AA. Improvement is global across Dashboard, Agents,
  RequestLog, and the new Calibration tab — single-line CSS change with
  ~10x the impact.

admin/dist/ rebuilt via `bun run build` (vite). 36 modules transformed.

Tests: 19 cases in test/svg-renderer.test.ts.
  escapeXml (1): canonical entities.
  renderBrierTrend (6): empty state, polyline for 2+ points, clamp
  beyond yMax, design tokens inlined, XSS safety on date strings,
  text-anchor end on right label.
  renderDomainBars (4): empty state, label/accuracy/n rendering,
  out-of-range accuracy clamp, XSS safety on labels.
  renderAbandonedThreadsCard (4): empty state, row rendering with
  revisit link, claim truncation at 70 chars, custom revisitHref override.
  renderPatternStatementsCard (4): empty state, anchor count matches
  statement count, XSS safety, custom drillHref override.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

garrytan added a commit that referenced this pull request May 18, 2026
Promotes the admin SPA's de facto design tokens (landed v0.26.0) to a
canonical DESIGN.md at the repo root. This is the calibration target
for /plan-design-review and /design-review going forward — when a
question is "does this UI fit the system?", the answer is here.

Captures the system as it stands today:

  Voice (5 surfaces, all routed through gateVoice() with mode-specific
  rubrics): pattern_statement, nudge, forecast_blurb, dashboard_caption,
  morning_pulse. Friend-not-doctor; concrete data over abstract metrics;
  no preachy / clinical / corporate language.

  Color tokens: 10 CSS variables from admin/src/index.css inlined into
  the SVG renderer (src/core/calibration/svg-renderer.ts). Dark theme
  is the only theme — admin is an operator tool. WCAG contrast
  documented per token; TD2's #555 → #777 bump on --text-muted noted.

  Typography: Inter for UI, JetBrains Mono for numbers/slugs/data.
  Type scale (18 / 14 / 13 / 12 / 11) documented as de facto, not yet
  formalized.

  Spacing scale: 4 / 8 / 16 / 24 / 32px. Linear-app density.

  Layout: sidebar 200px, max content 720px (text) / 960px (tables).
  No 3-column feature grids, no icons in colored circles, no
  decorative blobs.

  Charts: server-rendered SVG via pure functions in
  src/core/calibration/svg-renderer.ts. XSS posture documented:
  server-side escapeXml on caller-controlled strings, numeric inputs
  .toFixed()-coerced, admin SPA renders via <TrustedSVG> wrapper.

  Interaction patterns: keyboard nav required (J/K/space/u/q on the
  propose-queue), loading/empty/error states ARE features.

  v0.37+ roadmap: type scale formalization, animation tokens, component
  library extraction. Light mode explicitly NOT planned.

The doc is a living target, not a frozen spec. Major changes route
through /plan-design-review per the existing review chain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

garrytan added a commit that referenced this pull request May 19, 2026
… wrong (#1139)

* schema: v0.36.0.0 Hindsight calibration tables (migrations v67-v71)

Foundation commit for the Hindsight-inspired calibration wave. Adds four
new tables + one perf index, all source-scoped from day 1 per v0.34.1
discipline:

- calibration_profiles (v67): per-holder LLM-narrative aggregation of
  TakesScorecard data. published BOOL gates E8 cross-brain mount sharing
  (default false). grade_completion REAL surfaces partial-grade state to
  the dashboard. active_bias_tags TEXT[] with GIN index feeds E3 (calibration-
  aware contradictions) and E7 (real-time nudge matching).

- take_proposals (v68): propose_takes phase queue. Idempotency cache via
  (source_id, page_slug, content_hash, prompt_version) unique index mirrors
  the v0.23 dream_verdicts pattern. proposal_run_id supports --rollback by
  run. dedup_against_fence_rows JSONB audit column records what canonical
  takes the LLM was told to dedupe against at proposal time.

- take_grade_cache (v69): grade_takes verdict cache. Composite PK on
  (take_id, prompt_version, judge_model_id, evidence_signature) — prompt
  edits OR evidence changes cleanly invalidate prior verdicts. applied=false
  default + auto-resolve-off-by-default (D17) means every fresh install
  needs operator opt-in before grade verdicts mutate the takes table.

- take_nudge_log (v70): E7 nudge cooldown state. Polymorphic FK — a nudge
  fires on either a canonical take OR a pending proposal (CDX-5 fix). CHECK
  constraint enforces exactly-one-set. channel column lets future routing
  (webhook, admin SPA toast) reuse the same cooldown semantics.

- takes_resolved_at_idx (v71): partial index for the Brier-trend
  aggregation queries. Engine-aware handler — Postgres uses CONCURRENTLY
  to avoid the ShareLock; PGLite uses plain CREATE.

Every table carries wave_version TEXT NOT NULL DEFAULT 'v0.36.0.0' so the
v0.36.0.0 calibration --undo-wave command (lands later in the wave) can
reverse just this wave's writes.

Plan: ~/.claude/plans/system-instruction-you-are-working-rippling-knuth.md
covers the design rationale (D17/D18/D21 + CDX findings).

Schema parity:
- src/schema.sql for fresh Postgres installs
- src/core/pglite-schema.ts for fresh PGLite installs
- src/core/schema-embedded.ts auto-regenerated from schema.sql
- src/core/migrate.ts for upgrade-in-place from older brains

VERSION bumped to 0.36.0.0 for the wave. CHANGELOG entry lands at /ship.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* core: BaseCyclePhase abstract class enforces source-scope + budget contracts

D21 from the eng review. Three new v0.36.0.0 cycle phases (propose_takes,
grade_takes, calibration_profile) share enough structure that the
duplication-vs-abstraction trade tips toward a shared base. Without this
scaffold, source-isolation discipline would drift exactly the way it
drifted in v0.34.1 — except this time across three new surfaces at once.

What this enforces:

1. Phase signature is uniform: run(ctx, opts) → PhaseResult.

2. ctx.sourceId / ctx.auth.allowedSources MUST be threaded through every
   engine call. The base class surfaces a scope() helper that wraps
   sourceScopeOpts(ctx) and is the only sanctioned way to read source-
   scoped data. Forgetting to thread source scope becomes a TypeScript
   compile error, not a runtime leak. Closes the v0.34.1 leak class
   structurally for every new phase.

3. Budget meter wraps run() automatically. Subclass declares budgetUsdKey
   + budgetUsdDefault; base reads the resolved cap from config and creates
   the BudgetMeter. Subclass calls this.checkBudget() before each LLM
   submit; budget-exhausted phase still returns status='ok' (clean abort)
   so the cycle report shows partial completion, not failure.

4. Error envelope is uniform. Thrown errors get caught and converted to
   status='fail' with a phase-specific error.code via the subclass's
   mapErrorCode() hook.

5. Progress reporter integration. Base accepts the reporter via opts;
   subclasses call this.tick() instead of touching the reporter directly,
   so the phase name in the progress stream is always correct.
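
A minimal sketch of the contract listed above, under stated assumptions. The names `run`, `scope`, `checkBudget`, `mapErrorCode`, and `budgetUsdDefault` come from the commit message; every type shape, the `execute` hook, the `budgetUsd` option, and both demo subclasses are hypothetical. Treating an empty `allowedSources` as "match nothing" is one reading of the MUST-NOT-widen-scope regression, not confirmed behavior.

```typescript
// Sketch only: type shapes and method bodies are assumptions, not the real base class.
interface PhaseContext { sourceId?: string; auth?: { allowedSources?: string[] } }
interface PhaseOpts { budgetUsd?: number }
interface SourceScope { sourceId?: string; sourceIds?: string[] }
interface PhaseResult {
  phase: string;
  status: "ok" | "warn" | "fail";
  details: Record<string, unknown>;
  error?: { code: string; message: string };
}

abstract class BaseCyclePhase {
  protected abstract readonly name: string;
  protected abstract readonly budgetUsdDefault: number;
  private capUsd = 0;
  private spentUsd = 0;

  // Hypothetical hook: subclasses put their phase logic here.
  protected abstract execute(ctx: PhaseContext, opts: PhaseOpts): Promise<Record<string, unknown>>;

  protected mapErrorCode(_err: unknown): string { return "PHASE_FAILED"; }

  // An empty allowedSources array stays an empty filter; it never becomes "no filter".
  protected scope(ctx: PhaseContext): SourceScope {
    if (ctx.sourceId !== undefined) return { sourceId: ctx.sourceId };
    const allowed = ctx.auth?.allowedSources;
    if (allowed !== undefined) return { sourceIds: allowed.slice() };
    return {};
  }

  // Returns false when the next call would exceed the cap; otherwise records the spend.
  protected checkBudget(nextCostUsd: number): boolean {
    if (this.spentUsd + nextCostUsd > this.capUsd) return false;
    this.spentUsd += nextCostUsd;
    return true;
  }

  // Uniform envelope: thrown errors become status='fail' with a mapped code.
  async run(ctx: PhaseContext, opts: PhaseOpts = {}): Promise<PhaseResult> {
    this.capUsd = opts.budgetUsd ?? this.budgetUsdDefault;
    this.spentUsd = 0;
    try {
      return { phase: this.name, status: "ok", details: await this.execute(ctx, opts) };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      return { phase: this.name, status: "fail", details: {}, error: { code: this.mapErrorCode(err), message } };
    }
  }
}

// Hypothetical subclasses used only to exercise the sketch.
class DemoPhase extends BaseCyclePhase {
  protected readonly name = "demo";
  protected readonly budgetUsdDefault = 1;
  protected async execute(ctx: PhaseContext): Promise<Record<string, unknown>> {
    if (!this.checkBudget(0.4)) return { budget_exhausted: true };
    return { scope: this.scope(ctx) };
  }
}

class FailingPhase extends BaseCyclePhase {
  protected readonly name = "failing";
  protected readonly budgetUsdDefault = 1;
  protected mapErrorCode(): string { return "DEMO_FAILED"; }
  protected async execute(): Promise<Record<string, unknown>> { throw new Error("boom"); }
}
```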

Tests: 13 cases in test/core/base-phase.test.ts cover source-scope
threading (5 cases including the empty-allowedSources-MUST-NOT-widen-scope
regression), PhaseResult shape including the error envelope path (3
cases), dry-run propagation (2 cases), and budget meter construction
(3 cases including config-key override).

Synthesize.ts / patterns.ts (existing pre-v0.36 phases) deliberately do
NOT retrofit to this base in v0.36.0.0 — too much churn for a refactor
that doesn't pay off until v0.37+. Future phases use this by default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* cycle: propose_takes phase + take_proposals queue write path (T3)

LLM-based take extraction from markdown prose. Walks pages updated since
last cycle, sends each page's body to a tuned extractor, writes the
extracted gradeable claims to the take_proposals queue. User accepts /
rejects via `gbrain takes propose --review` (lands in Lane C).

Cycle wiring:
  lint → backlinks → sync → synthesize → extract → extract_facts →
    resolve_symbol_edges → patterns → recompute_emotional_weight →
    consolidate → propose_takes (NEW) → grade_takes (NEW; T4) →
    calibration_profile (NEW; T6) → embed → orphans → purge

CyclePhase enum extended with 3 new entries; ALL_PHASES + NEEDS_LOCK_PHASES
updated. All three new phases acquire the cycle lock (writes to
take_proposals / take_grade_cache / calibration_profiles).

Idempotency contract:
  The (source_id, page_slug, content_hash, prompt_version) composite unique
  index on take_proposals means an unchanged page never re-spends LLM
  tokens. Bumping PROPOSE_TAKES_PROMPT_VERSION cleanly invalidates the
  cache so a tuned prompt re-runs proposals on every page. Mirrors the
  v0.23 dream_verdicts pattern.

F2 fence dedup:
  The phase reads the page's existing `<!-- gbrain:takes:begin -->` fence
  (when present) and passes the canonical take rows to the extractor as
  "things you have already captured." Prevents duplicate proposals when
  prose is appended to a page that already has takes. Records the fence
  rows the LLM was told to dedupe against on the take_proposals row for
  audit (dedup_against_fence_rows JSONB).

Auto-resolve posture:
  propose_takes only WRITES proposals to the queue. Nothing in this phase
  mutates the canonical takes table. Operator opt-in via the queue review
  CLI (Lane C) is the only path from queue to canonical fence (D17).

Prompt tuning status (v0.36.0.0 ship state):
  The default extractor prompt is annotated `v0.36.0.0-stub`. The real
  tuned prompt arrives via T19 synthetic corpus build (50 anonymized
  pages, 3-model parallel extraction, user reviews disagreement set,
  F1 ≥ 0.85 on training corpus + F1 ≥ 0.8 on ground-truth holdout).
  Until T19 lands, propose_takes runs but produces best-effort candidates
  the user reviews manually.

Architecture:
  ProposeTakesPhase extends BaseCyclePhase (T2). Inherits source-scope
  threading via scope(), budget metering via this.checkBudget(), error
  envelope wrapping. budgetUsdKey: cycle.propose_takes.budget_usd
  (default $5/cycle). Budget exhaustion mid-page returns status='warn'
  with details.budget_exhausted=true — clean partial-completion semantics.

  Test seam: opts.extractor injection so the phase can run hermetically
  without touching the gateway. defaultExtractor (production path) calls
  gateway.chat with the EXTRACT_TAKES_PROMPT and parses the JSON array
  output via parseExtractorOutput.

  parseExtractorOutput defends against common LLM output sins: markdown
  code fence wrapping, leading prose, single-object instead of array,
  unknown kind values, weight out of [0,1], rows missing claim_text or
  exceeding 500 chars.
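
The defensive parsing described above might look roughly like this. Only the function name and the list of failure modes come from the commit message. The `ProposedTake` shape, the set of known `kind` values, the default weight of 0.5, and the choice to drop invalid rows and clamp out-of-range weights are all assumptions made for illustration.

```typescript
// Sketch only: row shape, KNOWN_KINDS, and the drop-vs-clamp policy are assumptions.
interface ProposedTake { claim_text: string; kind: string; weight: number }

const KNOWN_KINDS = new Set(["fact", "take", "bet", "hunch"]); // hypothetical label set

export function parseExtractorOutput(raw: string): ProposedTake[] {
  const text = raw.trim();
  // Skip leading prose and any code-fence wrapper: start at the first [ or {.
  const start = text.search(/[\[{]/);
  if (start === -1) return [];
  // End at the last ] or } so a trailing fence or sign-off is ignored.
  const end = Math.max(text.lastIndexOf("]"), text.lastIndexOf("}"));
  if (end < start) return [];

  let parsed: unknown;
  try {
    parsed = JSON.parse(text.slice(start, end + 1));
  } catch {
    return []; // unrecoverable output yields no proposals
  }

  // A single object instead of an array is treated as a one-element array.
  const rows: unknown[] = Array.isArray(parsed) ? parsed : [parsed];
  const out: ProposedTake[] = [];
  for (const row of rows) {
    if (typeof row !== "object" || row === null) continue;
    const r = row as Record<string, unknown>;
    const claim = typeof r.claim_text === "string" ? r.claim_text.trim() : "";
    if (claim.length === 0 || claim.length > 500) continue; // missing or oversized claim
    if (typeof r.kind !== "string" || !KNOWN_KINDS.has(r.kind)) continue; // unknown kind
    const w = typeof r.weight === "number" && Number.isFinite(r.weight) ? r.weight : 0.5;
    out.push({ claim_text: claim, kind: r.kind, weight: Math.min(1, Math.max(0, w)) });
  }
  return out;
}
```

Returning an empty array on unrecoverable input keeps the phase moving to the next page; whether the real parser does that or surfaces a warning is not stated in the commit.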

Tests: 25 cases in test/propose-takes.test.ts cover the 4 pure helpers
(parseExtractorOutput, contentHash, hasCompleteFence,
extractExistingTakesForDedup) + 7 phase integration scenarios (happy path,
cache hit, fence dedup, extractor failure, empty pages, skipPagesWithFence,
proposal_run_id stability).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* cycle: grade_takes phase + take_grade_cache verdict pipeline (T4)

Walks unresolved takes that are old enough to have outcome data, retrieves
evidence from the brain, asks a judge model to verdict each one. Writes
verdicts to take_grade_cache. Optionally — only when operator has flipped
the opt-in config flag — auto-applies high-confidence verdicts to the
canonical takes table via engine.resolveTake.

Auto-resolve posture (D17 — DISABLED by default):
  On a fresh install, grade_takes runs and writes verdicts to the cache,
  but applied=false on every row. Operator reviews the queue, then flips
  `cycle.grade_takes.auto_resolve.enabled: true` once trust is earned.
  Mirrors the propose_takes review-queue posture: queue exists, mutation
  requires explicit opt-in.

Conservative threshold (D12):
  When auto_resolve.enabled is true, a verdict auto-applies only when
  confidence >= 0.95 (single-judge path). T5 ensemble path lands next,
  tightening this further with 3/3 unanimous requirement.

  'unresolvable' verdict NEVER auto-applies even at confidence=1.0 —
  there's no canonical column for "we tried and there's no evidence yet."

Evidence retrieval status (v0.36.0.0 ship state):
  The default evidence retriever returns an "evidence-retrieval not yet
  wired" placeholder. Most verdicts produced by the stub-judge against
  the stub-evidence will be 'unresolvable'. Real retrieval (hybrid search
  over pages newer than the take's since_date, optionally augmented by a
  gateway web-search recipe in v0.37+) lands as a follow-up. Documented
  limitation per CDX-8 + D17 — the phase ships now so the wiring is real
  and the cache table accumulates verdicts even if early ones are
  conservative.

Cache key:
  Composite primary key on take_grade_cache is
  (take_id, prompt_version, judge_model_id, evidence_signature). Prompt
  edits OR evidence changes OR judge swap cleanly invalidate prior
  verdicts. Mirrors the v0.32.6 eval_contradictions_cache pattern.

  evidence_signature = SHA-256 of (judge_model_id + '|' + evidence_text)
  so identical evidence under a different judge does NOT collide.
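
The signature formula above can be sketched with Node's built-in crypto module. The function name `evidenceSignature` appears in the test summary of this commit; the parameter names and the hex encoding are assumptions.

```typescript
import { createHash } from "node:crypto";

// Sketch only: hex output and parameter names are assumed.
// Hashes judge_model_id + '|' + evidence_text, as stated in the commit message.
export function evidenceSignature(judgeModelId: string, evidenceText: string): string {
  return createHash("sha256")
    .update(`${judgeModelId}|${evidenceText}`, "utf8")
    .digest("hex");
}
```

Because the judge id is part of the hashed input, the same evidence text graded by two different judges yields two different signatures, so the two verdicts occupy separate cache rows.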

Architecture:
  GradeTakesPhase extends BaseCyclePhase. Inherits source-scope threading,
  budget metering (cycle.grade_takes.budget_usd, default $3/cycle), error
  envelope. Test seam: opts.judge + opts.evidenceRetriever injection so
  the phase runs hermetically.

  parseJudgeOutput defends against fence-wrapping, leading prose,
  out-of-range confidence (clamps to [0,1]), invalid verdict labels,
  oversized reasoning (truncated at 400 chars). Returns null on
  unrecoverable parse — caller treats null as "judge_output_parse_failed
  / unresolvable at confidence 0.0" so the row still lands in cache with
  the parse failure surfaced via warnings.

  takeIsOldEnough gates on since_date (default 6 months). Tolerates
  YYYY-MM-DD and YYYY-MM formats. Returns false on null/unparseable
  since_date so takes without dates never get graded (we'd be
  hallucinating temporal context).
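
A minimal sketch of the age gate, assuming a UTC calendar-month comparison. The function name, the two accepted formats, the 6-month default, and the false-on-null rule come from the commit message; the parameter names, the injectable `now`, and treating a `YYYY-MM` value as the first day of that month are assumptions.

```typescript
// Sketch only: month arithmetic and the day-1 convention for YYYY-MM are assumptions.
export function takeIsOldEnough(
  sinceDate: string | null | undefined,
  now: Date = new Date(),
  minAgeMonths = 6,
): boolean {
  if (!sinceDate) return false; // no date means the take is never graded
  const m = /^(\d{4})-(\d{2})(?:-(\d{2}))?$/.exec(sinceDate.trim());
  if (!m) return false; // unparseable
  const year = Number(m[1]);
  const month = Number(m[2]);
  const day = m[3] ? Number(m[3]) : 1;
  if (month < 1 || month > 12 || day < 1 || day > 31) return false;
  const since = new Date(Date.UTC(year, month - 1, day));
  // Reject dates that overflowed into the next month, such as 2025-02-31.
  if (since.getUTCMonth() !== month - 1) return false;
  // Cutoff is "now minus N calendar months"; the take must be at or before it.
  const cutoff = new Date(
    Date.UTC(now.getUTCFullYear(), now.getUTCMonth() - minAgeMonths, now.getUTCDate()),
  );
  return since.getTime() <= cutoff.getTime();
}
```

Passing `now` explicitly keeps the function pure, which makes the too-recent gate straightforward to test.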

Tests: 23 cases covering parseJudgeOutput (7 cases), evidenceSignature
(3), takeIsOldEnough (5), and 8 phase integration scenarios — happy path,
D17 auto-resolve-off default, D12 above-threshold auto-apply, below-
threshold cache-only, unresolvable-NEVER-applies, cache hit, too-recent
gate, judge-throw warning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* cycle: grade_takes ensemble tiebreaker for borderline verdicts (T5 / E2)

Multi-judge ensemble tiebreaker, additive on top of T4's single-judge
foundation. Reuses gateway.chat as the per-model judge interface; runs
three judges in parallel via Promise.allSettled. Pure aggregation logic
in aggregateEnsemble() — no SQL, no LLM, hermetically testable.

When ensemble fires (T5 trigger band):
  Only when ALL of:
    - opts.useEnsemble === true (default false)
    - opts.ensembleJudges array is non-empty
    - single-model confidence in [0.6, 0.95) (configurable via
      opts.ensembleTriggerBand)
    - single-model verdict !== 'unresolvable'

  Above 0.95 the single judge is already sufficient (T4 path). Below 0.6
  the verdict is clearly review-only — ensemble wouldn't change the
  posture. 'unresolvable' from single-judge means no evidence yet; calling
  three more judges on the same evidence won't manufacture some.
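
The four trigger conditions above reduce to a single predicate. This is a sketch: the option names `useEnsemble`, `ensembleJudges`, and `ensembleTriggerBand` come from the commit message, while the function name `shouldRunEnsemble`, the tuple shape of the band, and representing judges as string ids are hypothetical.

```typescript
// Sketch only: shouldRunEnsemble and these type shapes are hypothetical.
interface SingleVerdict { verdict: string; confidence: number }
interface EnsembleOpts {
  useEnsemble?: boolean;
  ensembleJudges?: string[]; // assumed to be model ids
  ensembleTriggerBand?: [number, number]; // [inclusive low, exclusive high]
}

export function shouldRunEnsemble(single: SingleVerdict, opts: EnsembleOpts = {}): boolean {
  const [low, high] = opts.ensembleTriggerBand ?? [0.6, 0.95];
  return (
    opts.useEnsemble === true &&
    (opts.ensembleJudges?.length ?? 0) > 0 &&
    single.confidence >= low &&
    single.confidence < high &&
    single.verdict !== "unresolvable"
  );
}
```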

Conservative auto-apply (D12):
  Ensemble verdict auto-applies via engine.resolveTake only when ALL of:
    - autoResolve === true (operator opt-in per D17)
    - ensemble.agreement === 3 (3/3 unanimous)
    - ensemble.minConfidence >= ensembleThreshold (default 0.85)
    - winning verdict !== 'unresolvable'

  Schema-level monotonic-tightening guard for ensembleThreshold lives in
  the takes resolution layer.

Cache identity:
  When ensemble fires, the cache row's judge_model_id becomes
  'ensemble:<modelA>+<modelB>+<modelC>' — a future re-run with different
  ensemble membership doesn't collide with prior verdicts. evidence_signature
  is recomputed because it includes the judge_model_id.

aggregateEnsemble (pure):
  - 3/3 unanimous → agreement=3, minConfidence=min across the three
  - 2/3 majority → agreement=2, minConfidence across the agreeing two
  - 1/1/1 disagreement → tie-break: prefer non-'unresolvable', then
    alphabetical for determinism
  - 'unresolvable' from one model NEVER tips a 2-vote majority toward
    'unresolvable' — by-label tally only counts a model toward its own
    label
  - All three judges failing (allSettled rejected) → verdict='unresolvable'
    with agreement=0; auto-apply path blocked
  - Single judge survives + two fail → agreement=1; the lone verdict wins
    but auto-apply gated by the 3/3 requirement
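
The aggregation rules above can be sketched as a pure function. `aggregateEnsemble`, `agreement`, `minConfidence`, and the 0.85 default threshold come from the commit message; the input shape (a `null` entry for a judge that failed), the verdict labels other than 'unresolvable', and the `shouldAutoApply` helper are assumptions.

```typescript
// Sketch only: input/output shapes and shouldAutoApply are hypothetical.
interface JudgeVerdict { verdict: string; confidence: number }
interface EnsembleResult { verdict: string; agreement: number; minConfidence: number }

// null marks a judge whose promise rejected under Promise.allSettled.
export function aggregateEnsemble(results: Array<JudgeVerdict | null>): EnsembleResult {
  const failed: EnsembleResult = { verdict: "unresolvable", agreement: 0, minConfidence: 0 };
  const byLabel = new Map<string, number[]>();
  for (const r of results) {
    if (r === null) continue;
    const list = byLabel.get(r.verdict) ?? [];
    list.push(r.confidence); // each judge counts only toward its own label
    byLabel.set(r.verdict, list);
  }
  const ranked = Array.from(byLabel.entries()).sort(([la, ca], [lb, cb]) => {
    if (ca.length !== cb.length) return cb.length - ca.length; // most votes first
    const ua = la === "unresolvable" ? 1 : 0;
    const ub = lb === "unresolvable" ? 1 : 0;
    if (ua !== ub) return ua - ub; // tie: prefer a resolvable label
    return la < lb ? -1 : la > lb ? 1 : 0; // then alphabetical for determinism
  });
  const top = ranked[0];
  if (!top) return failed; // every judge failed
  const [verdict, confidences] = top;
  return { verdict, agreement: confidences.length, minConfidence: Math.min(...confidences) };
}

// Conservative gate: operator opt-in, 3/3 unanimous, confident, and resolvable.
export function shouldAutoApply(e: EnsembleResult, autoResolve: boolean, ensembleThreshold = 0.85): boolean {
  return autoResolve && e.agreement === 3 && e.minConfidence >= ensembleThreshold && e.verdict !== "unresolvable";
}
```

Keeping the tally free of SQL and LLM calls is what lets the commit's test suite cover every branch hermetically.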

Tests: 16 cases.
  aggregateEnsemble (6): 3/3, 2/3, 1/1/1, unresolvable-tipping-resistance,
  all-failed, partial-failed-but-survives.
  Phase trigger conditions (5): useEnsemble=false default, useEnsemble=true
  in borderline band, single >= 0.95 skip, single < 0.6 skip, single =
  'unresolvable' skip.
  Phase auto-apply rules (5): 3/3+threshold+autoResolve, 2/3 majority no
  apply, 3/3 below threshold no apply, one ensemble judge throws still
  aggregates from allSettled, empty ensembleJudges falls through to
  single.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* cycle: calibration_profile phase + shared voice gate across surfaces (T6)

The calibration narrative layer. Reads TakesScorecard, asks an LLM to
write 2-4 conversational pattern statements ("right on tactics, late on
macro by 18 months"), passes them through the voice gate, derives active
bias tags, writes the row to calibration_profiles. This is the read-side
that E1 (think anti-bias rewrite), E3 (contradictions join), E6
(dashboard), and E7 (real-time nudges) all consume.

Voice gate (D24 — single function, multiple surfaces):
  ALL five calibration UX surfaces import the same gateVoice() function
  from src/core/calibration/voice-gate.ts. Mode parameter
  ('pattern_statement' | 'nudge' | 'forecast_blurb' | 'dashboard_caption'
  | 'morning_pulse') drives surface-specific tuning via the rubric the
  gate ships to its Haiku judge. NO forked implementations — voice
  rubric drift would defeat the gate.

  Each mode's rubric explicitly forbids preachy / clinical / corporate
  voice; a structural test pins this. Anchors the cross-cutting voice
  rule from /plan-ceo-review D2-D8.

Fallback policy (D11):
  Up to 2 generation attempts (configurable). On both rejects → fall back
  to a hand-written template from src/core/calibration/templates.ts.
  Templates are intentionally short and a little "robotic" — they're the
  safety net, not the destination. voice_gate_passed=false +
  voice_gate_attempts get persisted on the calibration_profiles row so
  the operator can review the failing examples and tune the rubric over
  time. Suppressing the surface silently is NEVER an option — that's how
  voice quality silently degrades.
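
The retry-then-template policy above might be sketched as follows. The five mode names, the two-attempt default, and the `voice_gate_passed` / `voice_gate_attempts` fields come from the commit message. The dependency-injection shape (`generate`, `judge`, `template`, `maxAttempts`) is hypothetical, and the real `gateVoice` signature may differ.

```typescript
// Sketch only: GateDeps and the return shape are assumptions, not the real voice-gate.ts API.
type VoiceMode = "pattern_statement" | "nudge" | "forecast_blurb" | "dashboard_caption" | "morning_pulse";

interface GateDeps {
  generate: (attempt: number) => Promise<string>; // produces a candidate line
  judge: (text: string, mode: VoiceMode) => Promise<boolean>; // true = voice passes
  template: string; // hand-written fallback for this mode
  maxAttempts?: number;
}

interface GateResult { text: string; voice_gate_passed: boolean; voice_gate_attempts: number }

export async function gateVoice(mode: VoiceMode, deps: GateDeps): Promise<GateResult> {
  const maxAttempts = deps.maxAttempts ?? 2;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const candidate = await deps.generate(attempt);
    if (await deps.judge(candidate, mode)) {
      return { text: candidate, voice_gate_passed: true, voice_gate_attempts: attempt };
    }
  }
  // Every attempt was rejected: emit the template rather than suppressing the surface.
  return { text: deps.template, voice_gate_passed: false, voice_gate_attempts: maxAttempts };
}
```

Returning the attempt count alongside the pass flag gives the caller what it needs to persist the audit fields on the calibration_profiles row.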

  parseJudgeOutput defaults to 'academic' on parse failure (it NEVER
  defaults to a pass) so a garbled Haiku output falls through to the
  template rather than letting unverified text reach the user.

calibration_profile phase:
  Extends BaseCyclePhase. Cold-brain skip: <5 resolved takes → no row
  written, no LLM call. Otherwise: scorecard via engine.getScorecard()
  → patterns via voice-gated generator → bias tags via separate
  generator (best-effort; failure logs warning, phase continues).

  The DB INSERT lands in the v67 calibration_profiles row with
  source_id, holder, the patterns, voice gate audit fields, active bias
  tags, and grade_completion (F1 fix — partial-grade state surfaces to
  the dashboard "60% graded" badge).

  Budget gate at $0.50/cycle default (mostly Haiku). When the budget
  check that runs before the LLM call finds the cap exhausted, the phase
  returns status='warn' without writing the row.

  Per-domain scorecards are a placeholder for v0.36.0.0 ship state —
  the F12 batchGetTakesScorecards() engine method that powers per-domain
  rendering lands in Lane C alongside the CLI/MCP surface.

Architecture:
  parsePatternStatementsOutput is tolerant of LLM emitting numbered
  lists / bulleted lines despite the prompt asking for plain lines.
  Caps at 4 patterns + drops excessively long lines (>200 chars).

  parseBiasTagsOutput lowercases input + drops non-kebab-case tokens
  (defends against the LLM emitting "Over-Confident Geography" with
  spaces or capitals). Caps at 4 tags.
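
A minimal sketch of the bias-tag parser described above. The lowercase step, the kebab-case filter, and the cap of 4 come from the commit message; splitting on newlines and commas, stripping list bullets, and de-duplicating tags are assumptions.

```typescript
// Sketch only: delimiters, bullet stripping, and de-duplication are assumed behavior.
const KEBAB_CASE = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;

export function parseBiasTagsOutput(raw: string, maxTags = 4): string[] {
  const tokens = raw
    .toLowerCase()
    .split(/[\n,]+/)
    .map((t) => t.trim().replace(/^[-*•]\s+/, "")); // tolerate "- tag" style bullets
  const tags = new Set<string>();
  for (const token of tokens) {
    if (tags.size >= maxTags) break;
    // "over-confident geography" (with a space) fails the kebab test and is dropped.
    if (KEBAB_CASE.test(token)) tags.add(token);
  }
  return Array.from(tags);
}
```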

Tests: 43 cases across two new test files.
  voice-gate.test.ts (24): parseJudgeOutput (7), gateVoice happy path
  (3), fallback path (5), mode parity (2), templates (7).
  calibration-profile.test.ts (19): parsers (10), pickFallbackSlots
  (3), phase integration (6 — cold-brain skip, happy path, voice gate
  fallback, grade_completion plumbed through, bias-tags failure
  non-fatal, source_id scope reaches INSERT).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* cli: gbrain calibration + get_calibration_profile MCP op (T7)

Public-facing read surface for the v0.36.0.0 calibration wave. CLI prints
the active calibration profile; MCP op exposes the same data path for
agents. Mirror of the v0.29 salience/anomalies shape (pure data fn + JSON
formatter + human formatter + thin CLI dispatch).

CLI: `gbrain calibration`
  Flags:
    --holder <id>         specific holder (default 'garry')
    --json                machine output for piping
    --regenerate          run calibration_profile phase now
    --undo-wave <ver>     [placeholder — wires in Lane D / T17]
    ab-report             [placeholder — wires in Lane D / T18]

  Human output:
    Calibration profile — holder: garry, source: default
    Generated: <local timestamp>
    [Note: built on 60% graded — partial completion this cycle.]   (when grade_completion < 0.9)
    [Note: voice gate fell back to template (2 attempts).]         (when voice_gate_passed=false)

    Resolved: 12 takes
    Brier:    0.210 (lower is better)
    Accuracy: 60.0%
    Partial:  10.0%

    Pattern statements:
      • You called early-stage tactics well — 8 of 10 held up.

    Active bias tags: over-confident-geography

  Cold-brain fallback message names the exact dream command to run.

MCP: `get_calibration_profile` (scope: read)
  Param: holder?: string (defaults to 'garry')
  Returns: latest CalibrationProfileRow | null

  Source-scoping via sourceScopeOpts(ctx): scalar source-bound clients see
  only their source; federated_read scopes see the union of allowed sources;
  no source filter when neither is set (CLI default path).

  Throws GBrainError('INVALID_HOLDER') on empty/non-string holder so
  remote callers get a structured error instead of a SQL-shape failure.

Architecture:
  getLatestProfile is the pure data fn — engine + opts → CalibrationProfileRow | null.
  Reused by both the CLI and the MCP op. Source-scoped via the standard
  v0.34.1 spread pattern (scalar sourceId vs sourceIds array).

  formatProfileText is pure — null → cold-brain message, populated → full
  printout. Annotates partial-grade rows and voice-gate-fallback rows so
  the operator sees data-quality status inline.

  parseArgs is exported via __testing for unit coverage. Sub-command
  ('ab-report') vs flag distinction is intentional — keeps the surface
  parallel with `gbrain eval cross-modal` etc.

Tests: 21 cases.
  parseArgs (6 cases): empty, --holder, --json, --regenerate, --undo-wave, ab-report.
  getLatestProfile (5 cases): happy, null, scalar source scope, federated array
    scope, no-source-filter default.
  formatProfileText (5 cases): cold-brain, happy, partial-grade note, voice-fallback
    note, published-to-mounts note.
  getCalibrationProfileOp (5 cases): default holder, scalar source scope,
    federated scope union, returns-null-on-unknown-holder, throws on empty holder.

Lane D follow-ups: --undo-wave (T17) and ab-report (T18) print a clear
"lands in Lane D" stderr line + exit 2; the surfaces exist for early
testers, the implementations land next.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* think: --with-calibration + anti-bias prompt rewrite (T8 / E1, D22)

Optional anti-bias rewrite mode for `gbrain think`. When set, the active
calibration profile gets injected per the D22 placement spec (AFTER
retrieval evidence, BEFORE the user's question). The bias filter applies
to QUESTION FRAMING, not evidence interpretation — matches LLM-as-judge
best practice (bias prompts near end of context perform better).

Default behavior unchanged (R1 regression guard): omitting
--with-calibration produces the v0.28-vintage user-message shape with the
question first, then retrieval. Existing think users see no change.

Two user-message shapes in buildThinkUserMessage:

  Default (no calibration):
    Question: X
    <pages>...</pages>
    <takes>...</takes>
    <graph>...</graph>
    Respond with a single JSON object...

  With calibration (D22):
    <pages>...</pages>
    <takes>...</takes>
    <graph>...</graph>
    <calibration holder="garry">
      Track record: Brier 0.210 (lower is better).
      Active patterns:
        - You called early-stage tactics well — 8 of 10 held up.
      Active bias tags: over-confident-geography
    </calibration>
    Question: X
    Respond...

  Calibration block is built by buildCalibrationBlock (exported for the
  E3 contradictions probe to render the same shape).
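
The two message shapes above can be sketched like this. `buildThinkUserMessage` and `buildCalibrationBlock` are named in the commit message; the parameter shapes are hypothetical, and the closing instruction is passed in as an opaque string because its full wording is elided in the source.

```typescript
// Sketch only: input shapes are assumptions; tag layout follows the example in the commit.
interface CalibrationInput { holder: string; brier: number | null; patterns: string[]; biasTags: string[] }
interface ThinkParts {
  question: string;
  pages: string;
  takes: string;
  graph: string;
  instruction: string; // closing "Respond..." line, supplied by the caller
  calibrationBlock?: string;
}

export function buildCalibrationBlock(c: CalibrationInput): string {
  const lines: string[] = [`<calibration holder="${c.holder}">`];
  // A null Brier score is omitted entirely rather than rendered as "Brier null".
  if (c.brier !== null) lines.push(`  Track record: Brier ${c.brier.toFixed(3)} (lower is better).`);
  if (c.patterns.length > 0) {
    lines.push("  Active patterns:");
    for (const p of c.patterns) lines.push(`    - ${p}`);
  }
  if (c.biasTags.length > 0) lines.push(`  Active bias tags: ${c.biasTags.join(", ")}`);
  lines.push("</calibration>");
  return lines.join("\n");
}

export function buildThinkUserMessage(p: ThinkParts): string {
  const retrieval = [`<pages>${p.pages}</pages>`, `<takes>${p.takes}</takes>`, `<graph>${p.graph}</graph>`];
  const question = `Question: ${p.question}`;
  const parts = p.calibrationBlock
    ? [...retrieval, p.calibrationBlock, question, p.instruction] // D22: retrieval, calibration, question
    : [question, ...retrieval, p.instruction]; // default shape: question first
  return parts.join("\n");
}
```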

System prompt extension (withCalibration:true):
  - Names BOTH the user's PRIOR (default reasoning) AND the COUNTER-PRIOR
    from their hedged-domain self.
  - References active bias tags by name when relevant ("this fits the
    over-confident-geography pattern").
  - Does NOT silently substitute the debiased answer. ALWAYS surfaces
    both priors transparently.
  - Adds a "Calibration" section between Conflicts and Gaps in the
    answer body.

RunThinkOpts extension:
  - withCalibration?: boolean — opt-in
  - calibrationHolder?: string — defaults to 'garry'

  When withCalibration=true and no profile exists, runThink falls back to
  baseline behavior + pushes NO_CALIBRATION_PROFILE to warnings (visible
  to the operator). When the calibration fetch fails, CALIBRATION_FETCH_FAILED
  warning surfaces with the underlying error. Either path keeps think working;
  the calibration loop is enhancement, not requirement.

CLI: `gbrain think "<q>" --with-calibration [--calibration-holder <id>]`

Tests: 11 cases.
  buildThinkSystemPrompt (4 cases): R1 regression — default/false/omitted
  → no anti-bias rules; with calibration → adds PRIOR + COUNTER-PRIOR +
  bias-tag reference; preserves existing hard rules.

  buildCalibrationBlock (3 cases): happy path, null brier omitted (not
  "Brier null"), empty patterns + tags still well-formed.

  buildThinkUserMessage (4 cases): R1 regression — without calibration:
  question first; D22 placement — retrieval → calibration → question →
  instruction; graph + calibration ordering; empty retrieval blocks render
  placeholders without breaking shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* contradictions: calibration-profile join (T9 / E3)

Cross-references each contradiction finding against the active calibration
profile. When a contradiction's domain matches an active bias tag (e.g.
"over-confident-geography" or "late-on-macro-tech"), the output gains a
one-line bias context explaining which pattern this fits.

Pure functions only — no DB writes, no LLM calls. The probe runner imports
tagFindingWithCalibration() and applies it to each finding before emitting.
When no profile exists or no tags match, the helper returns null and the
runner emits the unchanged finding (regression R2 — contradictions output
is byte-identical to v0.32.6 when no calibration profile is present).

Match heuristic (v0.36.0.0 ship-state):
  Bias tags are kebab-case axis-then-domain slugs ('over-confident-geography').
  computeDomainHint() extracts a domain hint from the finding's slugs +
  holder + verdict text:
    - wiki/companies/... → hiring | market-timing
    - wiki/people/... → founder-behavior
    - macro / geography / tactics / ai segments in slug → matching tag
  First-match-wins for ordering determinism.

  Match is intentionally fuzzy — the v0.32.6 contradictions probe doesn't
  yet carry structured domain metadata. v0.37+ structured-domain-on-takes
  (Hindsight-style enum) tightens this.
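
The slug heuristic can be pictured roughly as follows. This is a sketch under assumptions: it looks only at the slug (the text says the real `computeDomainHint()` also considers holder and verdict text), and the function names and the exact tag-matching rule are hypothetical.

```typescript
// Hypothetical slug-only version of the domain-hint heuristic.
function sketchDomainHints(slug: string): string[] {
  if (slug.startsWith("wiki/companies/")) return ["hiring", "market-timing"];
  if (slug.startsWith("wiki/people/")) return ["founder-behavior"];
  const segments = slug.split(/[\/-]/);
  return ["macro", "geography", "tactics", "ai"].filter((d) => segments.includes(d));
}

// Returns the first active tag that mentions one of the hints, or null.
// Assumption: "first-match-wins" means iterating tags in their stored order.
function sketchMatchTag(slug: string, activeBiasTags: string[]): string | null {
  const hints = sketchDomainHints(slug);
  for (const tag of activeBiasTags) {
    const words = tag.split("-");
    if (hints.some((h) => words.includes(h) || tag.endsWith(h))) return tag;
  }
  return null;
}
```

Returning `null` on a miss is what lets the runner emit the unchanged finding, which is the R2 behavior described above.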

Output:
  Returns { bias_tag: string, context: string } | null.
  Context format: "This contradiction fits your active bias pattern
  \"<tag>\" (Brier 0.31). Verdict: contradiction; severity: medium.
  Consider reviewing both sides through the lens of that pattern."

Tests: 13 cases.
  R2 regression (2): null profile → null tag; empty active_bias_tags → null tag.
  computeDomainHint (5): companies / people / macro / geography / unknown
  paths produce expected hints.
  Match path (4): macro→late-on-macro-tech, geography→over-confident-geography,
  mismatch returns null, first-match-wins with multiple candidate tags.
  buildBiasContextString (2): emits tag+verdict+severity+Brier; omits
  Brier when null (no "Brier null" leak).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* calibration: Brier-trend forecast at write time (T10 / E5)

Pure math layer over existing TakesScorecard data. Zero new LLM cost, zero
new schema. Surfaces the user's historical Brier for the take's
(holder, domain) bucket at write time so they see "your historical Brier
in macro takes is 0.31" before committing the take.

Voice-gate-rendered output:
  The user-facing string goes through gateVoice mode='forecast_blurb' via
  templates.ts (already in T6). This module is the pure data layer; the
  template renders the math into the conversational voice.

v0.36.0.0 ship state:
  Bucket dimension is the DOMAIN (slug-prefix). The conviction-weight
  bucket dimension would need a new engine method
  (engine.batchGetTakeBucketStats per F11) — deferred to v0.37+. Until
  then, forecast = historical Brier in this holder's domain.

  resolveDomainPrefix() keeps slug-prefix-looking domain hints
  ('companies/', 'wiki/macro') and falls back to overall for free-form
  hints ('macro tech', 'geography'). Hindsight-style structured domain
  on takes (CDX-11 mitigation TODO) tightens this in v0.37+.

MIN_BUCKET_N = 5:
  Below this sample size, the forecast returns predicted_brier=null with
  insufficient_data=true. Template renders "Forecast unavailable: only N
  resolved takes at this conviction yet" instead of a noisy estimate.
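
A small sketch of the math behind this gate. The Brier score itself is the standard mean squared error between a stated probability and the 0/1 outcome; the `BucketStats` and `Forecast` shapes and the function names are assumptions for illustration, not the module's real types.

```typescript
const MIN_BUCKET_N = 5;

interface ResolvedTake { probability: number; outcome: 0 | 1 }
interface BucketStats { n: number; brier: number | null }
interface Forecast { predicted_brier: number | null; insufficient_data: boolean; n: number }

// Mean squared error between stated probability and what actually happened.
function brierScore(takes: ResolvedTake[]): number | null {
  if (takes.length === 0) return null;
  const sum = takes.reduce((acc, t) => acc + (t.probability - t.outcome) ** 2, 0);
  return sum / takes.length;
}

// Below MIN_BUCKET_N the bucket is too small to report a number at all.
function sketchForecast(bucket: BucketStats): Forecast {
  if (bucket.n < MIN_BUCKET_N || bucket.brier === null) {
    return { predicted_brier: null, insufficient_data: true, n: bucket.n };
  }
  return { predicted_brier: bucket.brier, insufficient_data: false, n: bucket.n };
}
```

Always answering 50% scores exactly 0.25 under this formula, which is why 0.25 works as the "no skill" reference line elsewhere in this wave.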

Architecture:
  computeForecast(input) — pure function, takes scorecards already
  fetched; ideal for tests + reuse across batched paths.
  forecastForTake(engine, input) — convenience wrapper, 1-2 engine
  round-trips (no domain → 1; with domain → 2).
  batchForecast(engine, inputs[]) — memoizes per (holder, domainPrefix);
  N inputs collapse to ≤2*unique_holders unique engine calls. Used by
  the propose-queue review flow (50 candidates → 1-2 scorecard fetches).
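
The memoization in `batchForecast` can be sketched as a promise cache keyed on (holder, domainPrefix). The `Fetch` callback stands in for whatever engine call the real module makes; its signature, the `Scorecard` shape, and the key format are assumptions.

```typescript
type Scorecard = { brier: number | null; n: number };
type Fetch = (holder: string, domainPrefix?: string) => Promise<Scorecard>;
interface ForecastInput { holder: string; domainPrefix?: string }

// Caches the in-flight promise, so repeated inputs share one engine call.
function sketchBatchForecast(fetchScorecard: Fetch, inputs: ForecastInput[]): Promise<Scorecard[]> {
  const cache = new Map<string, Promise<Scorecard>>();
  const get = (holder: string, domainPrefix?: string): Promise<Scorecard> => {
    const key = `${holder}|${domainPrefix ?? ""}`;
    let hit = cache.get(key);
    if (!hit) {
      hit = fetchScorecard(holder, domainPrefix);
      cache.set(key, hit);
    }
    return hit;
  };
  return Promise.all(inputs.map((i) => get(i.holder, i.domainPrefix)));
}
```

Caching the promise rather than the resolved value means concurrent lookups for the same bucket collapse even before the first fetch has returned.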

Tests: 14 cases.
  computeForecast (4): insufficient_data branch, stable forecast,
    overall fallback, MIN_BUCKET_N export.
  resolveDomainPrefix (5): undefined/empty/whitespace → undefined;
    slug-prefix → kept; free-form → undefined.
  forecastForTake (3): 1-call overall, 2-call domain, free-form fallback.
  batchForecast (2): cache collapse for repeat queries; different holders
    do not collapse.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* calibration: gstack-learnings coupling on incorrect resolutions (T11 / E4)

When the grade_takes phase auto-resolves a take as 'incorrect' or 'partial',
optionally write a learning entry to gstack's per-project learnings.jsonl
so other gstack skills (plan-ceo-review, ship, investigate, ...) can pull
it as context when relevant. The brain teaches every other tool about
the user's track record.

Config gate (D5 / CDX-17 mitigation):
  `cycle.grade_takes.write_gstack_learnings` defaults FALSE. External
  users may not have gstack installed; the gstack-learnings binary API
  isn't stable yet. Garry's brain flips it true to opt in.

Quality gate:
  Only 'incorrect' and 'partial' verdicts trigger the write. 'correct'
  resolutions are noise (we expected the take to hold up — no learning).
  'unresolvable' has no canonical column. Defense-in-depth runtime guard
  in writeIncorrectResolution() rejects ineligible qualities with
  reason='quality_not_eligible' so a caller misuse never surfaces a
  malformed learning entry.

Auto-apply only:
  Coupling fires only when grade_takes both auto-applies AND the verdict
  is incorrect/partial AND the config flag is enabled. Manual resolutions
  via `gbrain takes resolve` intentionally DO NOT propagate to gstack —
  manual writes already carry operator intent; the calibration loop is
  the noise-prone path that earns coupling.

Namespace:
  Every entry's key starts with 'gbrain:calibration:v0.36.0.0:'. Lane D
  `gbrain calibration --undo-wave v0.36.0.0` (T17) filters on this prefix
  for the optional gstack-scrub step. First active bias tag suffixes the
  key (e.g. 'take-42:over-confident-geography') so future analysis can
  group learnings by bias pattern.
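
As an illustration of the key scheme, a sketch that builds a namespaced key and tests membership by prefix. The helper names are hypothetical, and the no-tag key shape is an assumption (the text only shows the tagged form).

```typescript
const NAMESPACE = "gbrain:calibration:v0.36.0.0:";

// Key is the namespace, then take-<id>, then the first active bias tag if any.
function sketchLearningKey(takeId: number, activeBiasTags: string[]): string {
  const base = `take-${takeId}`;
  const suffix = activeBiasTags.length > 0 ? `${base}:${activeBiasTags[0]}` : base;
  return NAMESPACE + suffix;
}

// A prefix check is all an undo or scrub step needs to find this wave's entries.
function sketchIsWaveKey(key: string): boolean {
  return key.startsWith(NAMESPACE);
}
```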

Architecture:
  buildLearningEntry — pure. Truncates claim at 200 chars + ellipsis;
  emits Pattern: line when activeBiasTags present; defaults confidence
  to 0.8 when caller omits it.

  writeIncorrectResolution — async wrapper. Honors config gate; honors
  quality gate; calls the injected writer (or defaultGstackWriter in
  production). Failures are non-fatal: returns
  { written: false, reason: 'write_failed' | 'binary_missing', error }.
  The grade_takes phase logs to result.warnings and continues — gstack
  coupling failure NEVER aborts a cycle.

  defaultGstackWriter — shells out to gstack-learnings-log binary via
  execFileSync. Throws GBrainError('GSTACK_BINARY_NOT_FOUND') when the
  binary isn't on PATH; writeIncorrectResolution classifies that error
  to reason='binary_missing' so the operator sees the install hint
  instead of a generic write_failed.

  Wired into grade-takes.ts after engine.resolveTake() inside the
  auto-apply block. Only fires when shouldApply=true.
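
The gating and failure classification described above might look like this in outline. It is a sketch, not the shipped code: the `config_disabled` reason name, the option names, and the use of an error `code` property to detect the missing binary are all assumptions.

```typescript
type Quality = "correct" | "incorrect" | "partial" | "unresolvable";
type SkipReason = "config_disabled" | "quality_not_eligible";
type FailReason = "write_failed" | "binary_missing";
type WriteResult =
  | { written: true }
  | { written: false; reason: SkipReason | FailReason; error?: string };

// Config gate first, then quality gate; null means "go ahead and write".
function sketchGate(enabled: boolean, quality: Quality): SkipReason | null {
  if (!enabled) return "config_disabled";
  if (quality !== "incorrect" && quality !== "partial") return "quality_not_eligible";
  return null;
}

// Distinguish "binary not installed" from any other writer failure.
function sketchClassify(err: unknown): FailReason {
  const code = typeof err === "object" && err !== null ? (err as { code?: unknown }).code : undefined;
  return code === "GSTACK_BINARY_NOT_FOUND" ? "binary_missing" : "write_failed";
}

async function sketchWriteResolution(opts: {
  enabled: boolean;
  quality: Quality;
  entry: string;
  writer: (entry: string) => void | Promise<void>;
}): Promise<WriteResult> {
  const skip = sketchGate(opts.enabled, opts.quality);
  if (skip) return { written: false, reason: skip };
  try {
    await opts.writer(opts.entry);
    return { written: true };
  } catch (err) {
    // Never rethrow: a coupling failure must not abort the calling cycle.
    const message = err instanceof Error ? err.message : String(err);
    return { written: false, reason: sketchClassify(err), error: message };
  }
}
```

Injecting the writer keeps the gates testable without the binary on PATH, which matches the test list below.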

Tests: 14 cases.
  buildLearningEntry (7): canonical shape, partial vs incorrect wording,
  bias-tag suffix, no-tag fallback, claim truncation, default confidence,
  no-reasoning omission.
  writeIncorrectResolution (7): config gate, quality gate, happy path,
  writer-throw graceful degrade, binary-missing classification, async
  writer awaited, partial quality writes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* doctor: 4 calibration checks — abandoned/freshness/drift/voice (T12)

Adds the four calibration doctor checks per the eng-review spec.

abandoned_threads:
  Counts active high-conviction takes (weight >= 0.7) older than 12 months
  that have never been superseded. Signal, not error — always status='ok'
  with a count. The hint sends users to `gbrain calibration` for details.

calibration_freshness:
  Warns when the active profile is older than 7 days (configurable via
  the same env-var pattern other freshness checks use). Cold-brain branch
  (no profile yet) returns ok without scolding. Hint points at
  `gbrain calibration --regenerate`.

grade_confidence_drift (CDX-11 mitigation):
  Surfaces the count of auto-applied grade verdicts. Below 30: returns
  "need 30+ for drift detection". At/above 30: returns "drift math
  arrives in v0.37+". The surface is wired; the actual
  confidence-vs-accuracy correlation math is a v0.37+ follow-up once we
  have 30+ auto-applied verdicts to measure against. Closes the CDX-11
  hole structurally — the operator sees the surface even before the math
  is meaningful.

voice_gate_health:
  Tracks voice gate failure rate over the last 7 days. <30% fail rate →
  ok (template fallback is fine in isolation). >=30% → warn with hint
  to review src/core/calibration/voice-gate.ts rubric. Anchors the
  cross-cutting voice rule observability story.

All four checks return status='warn' with a diagnostic message on
engine errors — non-blocking, never throws. Matches the existing doctor
check pattern (see checkSyncFreshness for prior art).
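
A rough sketch of the "never throws" pattern together with two of the thresholds. The `CheckResult` shape and helper names are assumptions, the real checks query the engine asynchronously, and treating exactly 7 days as still fresh is a guess at the boundary behavior.

```typescript
type CheckResult = { name: string; status: "ok" | "warn"; message: string };

const DAY_MS = 24 * 60 * 60 * 1000;
const STALE_AFTER_DAYS = 7;

function sketchFreshness(generatedAt: Date | null, now: Date): CheckResult {
  const name = "calibration_freshness";
  // Cold brain: no profile yet is not a problem.
  if (generatedAt === null) return { name, status: "ok", message: "no calibration profile yet" };
  const ageDays = (now.getTime() - generatedAt.getTime()) / DAY_MS;
  if (ageDays > STALE_AFTER_DAYS) {
    return { name, status: "warn", message: `profile is ${ageDays.toFixed(1)} days old` };
  }
  return { name, status: "ok", message: "profile is fresh" };
}

function sketchVoiceGateHealth(failures: number, total: number): CheckResult {
  const name = "voice_gate_health";
  const rate = total === 0 ? 0 : failures / total;
  const status = rate >= 0.3 ? "warn" : "ok";
  return { name, status, message: `fail rate ${(rate * 100).toFixed(0)}% over 7 days` };
}

// Wraps any check so an engine error becomes a warn result instead of a throw.
function sketchSafeCheck(name: string, run: () => CheckResult): CheckResult {
  try {
    return run();
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    return { name, status: "warn", message: `check failed: ${detail}` };
  }
}
```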

Wired into runDoctor after checkRerankerHealth (the v0.35 cluster), in
the canonical block 10 slot.

Tests: 15 cases. 4 per check (happy path, alt-status, engine-throw
diagnostic, plus boundary tests for the freshness staleness gate at
exactly 7 days and the grade drift gate at 30 applied verdicts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* calibration: E7 nudge + 14-day cooldown (T13 / D16 F3)

Real-time pattern surfacing when a newly-committed high-conviction take
matches an active bias pattern. Conversational nudge text via the
templates module; 14-day cooldown per (take_id, nudge_pattern) via
take_nudge_log to prevent the feedback loop where each cycle re-fires
the same nudge on the same take.

Threshold gates (D16 F3):
  - holder match (profile.holder === take.holder)
  - conviction-weight > 0.7 (strict greater than)
  - take's slug-derived domain hint matches an active bias tag
    (takeDomainHint — same heuristic as eval-contradictions/calibration-join.ts
    for cross-surface consistency)

Cooldown gate:
  Before firing, probe take_nudge_log for (take_id, nudge_pattern) rows
  with fired_at >= now() - 14 days. Any hit → silently skip. After firing,
  insert a new row with channel='stderr' so the next 14 days are gated.

Feedback-loop prevention:
  User hedges a take in response to a nudge (e.g. weight 0.85 → 0.65).
  Even though the take's `weight` field changed, the cooldown row for
  the over-confident-geography pattern is still there from the original
  fire — so the next cycle's evaluateAndFireNudge() silently skips. The
  user reset path (gbrain takes nudge --reset N) clears the cooldown to
  re-arm.
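
The threshold and cooldown gates can be sketched with an in-memory array standing in for `take_nudge_log`. Everything here is illustrative: the type shapes, the `below_conviction` and `no_matching_tag` reason names, and the simplified tag match (a tag matches when it ends with a slug segment) are assumptions.

```typescript
interface Take { id: number; holder: string; weight: number; slug: string }
interface Profile { holder: string; activeBiasTags: string[] }
interface NudgeRow { takeId: number; pattern: string; firedAt: Date }
interface RuleResult { matched: boolean; reason: string; matchedTag?: string }

const COOLDOWN_MS = 14 * 24 * 60 * 60 * 1000;

function sketchRule(take: Take, profile: Profile | null): RuleResult {
  if (!profile) return { matched: false, reason: "no_profile" };
  if (profile.holder !== take.holder) return { matched: false, reason: "wrong_holder" };
  // Strictly greater than 0.7: a take at exactly 0.7 is not eligible.
  if (!(take.weight > 0.7)) return { matched: false, reason: "below_conviction" };
  const segments = take.slug.split("/").filter((s) => s.length > 0);
  const tag = profile.activeBiasTags.find((t) => segments.some((s) => t.endsWith(s)));
  if (!tag) return { matched: false, reason: "no_matching_tag" };
  return { matched: true, reason: "matched", matchedTag: tag };
}

function sketchInCooldown(log: NudgeRow[], takeId: number, pattern: string, now: Date): boolean {
  const cutoff = now.getTime() - COOLDOWN_MS;
  return log.some((r) => r.takeId === takeId && r.pattern === pattern && r.firedAt.getTime() >= cutoff);
}

// Returns the nudge text when it fires, or null when any gate says skip.
function sketchEvaluateAndFire(take: Take, profile: Profile | null, log: NudgeRow[], now: Date): string | null {
  const rule = sketchRule(take, profile);
  if (!rule.matched || !rule.matchedTag) return null;
  if (sketchInCooldown(log, take.id, rule.matchedTag, now)) return null;
  log.push({ takeId: take.id, pattern: rule.matchedTag, firedAt: now });
  return `Take ${take.id} (conviction ${take.weight}) fits your "${rule.matchedTag}" pattern.`;
}
```

Because the rule check runs before the cooldown probe, a below-threshold take never touches the log, which lines up with the "no cooldown query fired" test case.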

Output channel (v0.36.0.0 ship state):
  STDERR only. Schema's `channel` column already supports multi-channel
  (webhook, admin SPA toast); routing those is a v0.37+ follow-up.

Architecture:
  evaluateNudgeRule(take, profile) — pure rule check. Returns
  { matched, reason, matchedTag }. No engine call.
  checkCooldown(engine, takeId, pattern) — engine probe, returns boolean.
  recordNudgeFire(engine, opts) — INSERT into take_nudge_log.
  evaluateAndFireNudge(opts) — full pipeline. Returns NudgeDecision.
  resetNudgeCooldown(engine, takeId) — DELETE...RETURNING for the CLI.

  buildNudgeText delegates to templates.ts nudgeTemplate (D24 mode='nudge'
  voice). v0.36.0.0 ship state uses the template directly; LLM-generated
  nudge text via the voice gate lands in v0.37+ when we have production
  examples to tune from.

Tests: 22 cases.
  takeDomainHint (5): companies/people/macro/geography/unrecognized.
  evaluateNudgeRule (6): no_profile, wrong_holder, conviction-at-threshold-
  is-NOT-eligible (strict >), no matching tag, happy match,
  first-match-wins for multiple candidate tags.
  checkCooldown (3): true on row hit, false on no row, cutoff date param
  verifies the 14-day boundary.
  evaluateAndFireNudge (4): happy fire (text contains hush command +
  matched tag), cooldown silent skip (no INSERT, no stderr), no_profile
  short-circuit, below-conviction short-circuit (no cooldown query fired).
  buildNudgeText (2): hush command shape, conviction value embedded.
  resetNudgeCooldown (2): returns count, idempotent on zero rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* calibration: E8 team-brain sharing + D18 cross-brain query semantics (T14)

Cross-brain calibration profile resolution per the D18 4-rule contract.
Pins all four cross-brain leak surfaces in dedicated unit tests so future
mount features can't silently regress this security model.

D18 semantics (committed):

  Rule 1 — LOCAL-FIRST ORDERING.
    Query the local brain first. If a profile exists, return it. Do NOT
    also query mounts (avoids stale-mount-overrides-fresh-local).
    Verified: mountResolver is NOT called when local has a hit.

  Rule 2 — MOUNT FALLBACK.
    Only when local has no profile AND canReadMounts=true, walk the
    mounts in priority order. First match wins. Each mount-side row
    must have published=true to be visible (D15 asymmetric opt-in).

  Rule 3 — CROSS-BRAIN ATTRIBUTION.
    Every returned profile carries source_brain_id + from_mount flag.
    Consumers (E1 think rewrite, E3 contradictions, E7 nudge, E6
    dashboard) MUST surface this via attributionSuffix() so the user
    sees which brain answered.

  Rule 4 — SUBAGENT PROHIBITION.
    canReadMountsForCtx() classifier returns FALSE for subagent loops
    without trusted-workspace allowedSlugPrefixes. Closes the
    OAuth-token-to-cross-brain-leak surface — subagents see ONLY their
    local-brain results regardless of which holder they query.

    Exception: trusted cycle phases (synthesize/patterns) pass
    allowedSlugPrefixes set and ARE allowed to read mounts. Pinned in
    the classifier test.

Architecture:
  queryAcrossBrains(localEngine, opts) — pure orchestrator. Composes
  getLatestProfile() from src/commands/calibration.ts. Mount engine
  access is via opts.mountResolver — production wires this to the
  v0.19+ gbrain mounts subsystem; tests inject a stub returning an
  ordered list of mocked engines. Decouples cross-brain LOGIC from
  multi-engine PLUMBING.

  canReadMountsForCtx(ctx) — pure classifier table. Drives the rule-4
  gate. Production callers compose it from OperationContext.

  attributionSuffix(result) — pure formatter. Emits the "(from mounted
  brain: <id>)" suffix when from_mount=true; empty string when local.
  Mandatory for user-visible cross-brain consumers.
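
A compact sketch of rules 1 to 3 with stubbed brains. The `BrainLike` interface, the synchronous `getProfile`, and the leading space in the suffix are assumptions; the real `queryAcrossBrains` composes `getLatestProfile()` against engines.

```typescript
interface ProfileRow { holder: string; published: boolean; brier: number | null }
interface BrainLike { id: string; getProfile(holder: string): ProfileRow | null }
interface CrossBrainResult { profile: ProfileRow; source_brain_id: string; from_mount: boolean }

function sketchQueryAcrossBrains(
  local: BrainLike,
  opts: { holder: string; canReadMounts: boolean; mountResolver: () => BrainLike[] },
): CrossBrainResult | null {
  // Rule 1: a local hit returns immediately; mounts are never consulted.
  const localHit = local.getProfile(opts.holder);
  if (localHit) return { profile: localHit, source_brain_id: local.id, from_mount: false };
  // Rule 4 feeds in here: callers without mount access stop at local.
  if (!opts.canReadMounts) return null;
  // Rule 2: walk mounts in priority order; only published rows are visible.
  for (const mount of opts.mountResolver()) {
    const row = mount.getProfile(opts.holder);
    if (row && row.published) return { profile: row, source_brain_id: mount.id, from_mount: true };
  }
  return null;
}

// Rule 3: every user-visible consumer appends this so the source brain is clear.
function sketchAttributionSuffix(r: CrossBrainResult): string {
  return r.from_mount ? ` (from mounted brain: ${r.source_brain_id})` : "";
}
```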

Tests: 15 cases pinned to the 4 D18 rules + 4 supplementary structural
checks.
  D18-1: published=false profile on mount stays hidden.
  D18-2/3: subagent context cannot fall back to mounts (2 cases — null
    on local-empty + canReadMounts=false, local hit still returned).
  D18-4: attribution surfaces source_brain_id (3 cases — mount answer
    flag, local answer flag, attributionSuffix formatter).
  Rule 1 local-first ordering (2 cases — mountResolver NOT called on
    local hit, IS called on local empty).
  Mount priority order (3 cases — first published=true wins, all
    published=false returns null, no mounts configured returns null
    without throwing).
  canReadMountsForCtx classifier (4 cases — local CLI true, MCP
    non-subagent true, subagent without trusted-workspace false,
    subagent WITH trusted-workspace true).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* admin: E6 Calibration tab + D23 server-rendered SVG + TD2 contrast bump (T15)

Adds the v0.36.0.0 admin SPA Calibration tab. Per the design review,
the approved variant-B (Linear calm clarity) layout: single-column flow,
generous whitespace, ONE big sparkline as hero, then patterns, then
domain bars, then abandoned threads.

D23 server-rendered SVG architecture:

  src/core/calibration/svg-renderer.ts — pure functions. data → SVG
  string. No DOM, no React, no chart library dep. Inlines the admin
  design tokens (#0a0a0f bg, #3b82f6 accent, etc.) so the SVG is
  visually consistent with the rest of the admin SPA.

  Four chart renderers:
    - renderBrierTrend({ series }) — sparkline w/ baseline reference
      at 0.25 (always-50% baseline)
    - renderDomainBars({ bars }) — horizontal accuracy bars per domain
    - renderAbandonedThreadsCard(threads) — D30/TD4 'revisit now' link
      per row, points at /admin/calibration/revisit/<takeId>
    - renderPatternStatementsCard(statements) — D29/TD3 clickable
      drill-down links per row, point at /admin/calibration/pattern/<i>

  XSS posture: all caller-controlled strings pass through escapeXml().
  Numeric inputs are .toFixed()-coerced. Admin SPA renders via
  dangerouslySetInnerHTML inside a TrustedSVG wrapper component;
  endpoint is gated by requireAdmin middleware.
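
To make the "data in, SVG string out" idea concrete, here is a minimal sketch of an escaping helper and a sparkline. It is not the shipped `svg-renderer.ts`: the function names, default dimensions, `yMax`, the empty-state text, and the layout are assumptions; only the escaping, clamping, the 0.25 baseline, and the inlined colors follow the description.

```typescript
function sketchEscapeXml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

interface BrierPoint { date: string; brier: number }

function sketchSparkline(series: BrierPoint[], width = 600, height = 120, yMax = 0.5): string {
  const open = `<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 ${width} ${height}">`;
  const bg = `<rect width="${width}" height="${height}" fill="#0a0a0f"/>`;
  if (series.length === 0) {
    return `${open}${bg}<text x="8" y="20" fill="#777">No data yet</text></svg>`;
  }
  // Clamp values so an out-of-range score cannot draw outside the viewBox.
  const clamp = (v: number) => Math.min(Math.max(v, 0), yMax);
  const x = (i: number) => (series.length === 1 ? width / 2 : (i / (series.length - 1)) * width);
  const y = (v: number) => height - (clamp(v) / yMax) * height;
  const points = series.map((p, i) => `${x(i).toFixed(1)},${y(p.brier).toFixed(1)}`).join(" ");
  const baseY = y(0.25).toFixed(1);
  // Dates are caller-controlled strings, so they go through the escaper.
  const label = sketchEscapeXml(series[series.length - 1].date);
  return (
    open + bg +
    `<line x1="0" y1="${baseY}" x2="${width}" y2="${baseY}" stroke="#777" stroke-dasharray="4"/>` +
    `<polyline points="${points}" fill="none" stroke="#3b82f6"/>` +
    `<text x="${width}" y="${height}" text-anchor="end" fill="#777">${label}</text>` +
    `</svg>`
  );
}
```

Coercing every number through `.toFixed()` before interpolation means only the escaped label can carry caller text into the markup.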

  /admin/api/calibration/profile — returns the active profile row as JSON.
  /admin/api/calibration/charts/:type — returns image/svg+xml markup
    for type ∈ {brier-trend, domain-bars, pattern-statements,
                abandoned-threads}. Cache-Control: private, max-age=60.

  brier-trend currently renders a single-point series from the active
  profile (the time-series view across calibration_profiles.generated_at
  history is a v0.37 follow-up once we have multiple snapshots).
  abandoned-threads pulls the top 5 abandoned rows via the same SQL the
  doctor check uses.

CalibrationPage React component (admin/src/pages/Calibration.tsx):
  Fetches profile + 4 charts. Loading / error / cold-brain states all
  handled. Layout includes the audit annotations (partial-grade badge,
  voice-gate-fell-back-to-template badge) per the approved mockup.
  TrustedSVG wrapper isolates the dangerouslySetInnerHTML to the SVG
  surface only.

App.tsx nav: added 'calibration' page route + sidebar nav item, hash
routing extended to support #calibration.

TD2 contrast bump:
  admin/src/index.css --text-muted: #555 → #777. Old value was contrast
  4.0 on the #0a0a0f bg — below WCAG AA 4.5 for body text. New value is
  ~5.5, passes AA. Improvement is global across Dashboard, Agents,
  RequestLog, and the new Calibration tab — single-line CSS change with
  ~10x the impact.
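
The contrast figures above are the commit's own. For anyone who wants to recheck them, the WCAG 2.x contrast ratio is straightforward to compute; the helper names below are illustrative and not part of the repo.

```typescript
// Parses "#rgb" or "#rrggbb" into 8-bit channels.
function hexToRgb(hex: string): [number, number, number] {
  const h = hex.replace("#", "");
  const full = h.length === 3 ? h.split("").map((c) => c + c).join("") : h;
  const n = parseInt(full, 16);
  return [(n >> 16) & 255, (n >> 8) & 255, n & 255];
}

// sRGB channel to linear light.
function channelToLinear(c8: number): number {
  const c = c8 / 255;
  return c <= 0.04045 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

function relativeLuminance(hex: string): number {
  const [r, g, b] = hexToRgb(hex);
  return 0.2126 * channelToLinear(r) + 0.7152 * channelToLinear(g) + 0.0722 * channelToLinear(b);
}

// (lighter + 0.05) / (darker + 0.05); ranges from 1 to 21.
function contrastRatio(fg: string, bg: string): number {
  const a = relativeLuminance(fg);
  const b = relativeLuminance(bg);
  return (Math.max(a, b) + 0.05) / (Math.min(a, b) + 0.05);
}
```

Calling `contrastRatio("#777", "#0a0a0f")` and `contrastRatio("#555", "#0a0a0f")` gives the before and after values to compare against the 4.5 AA threshold for body text.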

admin/dist/ rebuilt via `bun run build` (vite). 36 modules transformed.

Tests: 19 cases in test/svg-renderer.test.ts.
  escapeXml (1): canonical entities.
  renderBrierTrend (6): empty state, polyline for 2+ points, clamp
  beyond yMax, design tokens inlined, XSS safety on date strings,
  text-anchor end on right label.
  renderDomainBars (4): empty state, label/accuracy/n rendering,
  out-of-range accuracy clamp, XSS safety on labels.
  renderAbandonedThreadsCard (4): empty state, row rendering with
  revisit link, claim truncation at 70 chars, custom revisitHref override.
  renderPatternStatementsCard (4): empty state, anchor count matches
  statement count, XSS safety, custom drillHref override.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* recall: calibration footer formatter for morning pulse (T16)

Pure formatter that turns a CalibrationProfileRow + optional abandoned-
threads list into the conversational block the morning pulse will surface:

  Calibration this quarter:
    Brier 0.18 (solid).
    Right on early-stage tactics, late on macro by 18 months.
    Over-confident on team execution; under-calibrated on regulatory risk.

  Threads you opened and never came back to:
    · AI search platform differentiation         (17 months silent)
    · International expansion playbook           (12 months silent)

Cold-brain branch: returns empty string when no profile or < 5 resolved
takes. Caller decides whether to render the block; cold-brain absence
is the cleanest non-event.

Brier trend note maps the absolute value to conversational copy:
  <= 0.10 → "(strong calibration)"
  <= 0.20 → "(solid)"
  <= 0.25 → "(near baseline)"
  > 0.25  → "(worse than always-50% baseline — review your high-conviction calls)"

  v0.36.0.0 ship state has only the current profile snapshot. The
  "was 0.22 90d ago — improving" comparison shape arrives when we
  accumulate generated_at history across multiple cycles.
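
The bracket mapping above reduces to a few ordered comparisons. This sketch returns a bucket id rather than the copy strings, and adds a column-width truncation helper; both function names and the single-character ellipsis are assumptions.

```typescript
type BrierBucket = "strong" | "solid" | "near_baseline" | "worse_than_baseline";

// Boundaries are inclusive on the upper edge, matching the <= brackets above.
function sketchBrierBucket(brier: number): BrierBucket {
  if (brier <= 0.1) return "strong";
  if (brier <= 0.2) return "solid";
  if (brier <= 0.25) return "near_baseline";
  return "worse_than_baseline";
}

// Fits text into a fixed column, ending with an ellipsis when it is cut.
function sketchTruncate(text: string, width: number): string {
  if (text.length <= width) return text;
  return text.slice(0, Math.max(width - 1, 0)) + "…";
}
```

With the sample footer's Brier of 0.18, `sketchBrierBucket` lands in the "solid" bracket, consistent with the example output shown earlier.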

R3 regression posture:
  This module is the FORMATTER only. Wiring into `gbrain recall`'s text
  output is intentionally NOT in this commit — runRecall's surface
  stays unchanged. v0.37 wires it under --show-calibration (opt-in
  initially, default-on later). For now the formatter is callable from
  the admin tab + custom CLI scripts that want it.

Architecture:
  buildRecallCalibrationFooter(opts) — pure. opts.profile required,
  opts.abandonedThreads optional, opts.threadColumnWidth defaults to 50.

  Caps at 4 patterns + 5 abandoned threads to keep the footer scannable.
  Truncates long abandoned-thread claim text to fit the column width with
  a trailing ellipsis.

Tests: 14 cases.
  Cold-brain branch (3): null profile, < 5 resolved, zero resolved.
  Happy path (7): header + Brier + patterns, trend note ranges (4
  brackets), null brier omits the Brier line but keeps header, caps at
  4 patterns.
  Abandoned threads (4): omit section when none, emit when present,
  cap at 5, truncate long claim with column-width override.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* calibration: --undo-wave reversal command (T17 / D18 CDX-3)

Implements the undo-wave reversal flow. Every new row written by the
v0.36.0.0 calibration wave carries wave_version='v0.36.0.0' so a precise
revert is possible without touching pre-wave data.

CLI surface (replaces the v0.36.0.0 ship-state placeholder):
  gbrain calibration --undo-wave v0.36.0.0 [--dry-run] [--scrub-gstack] [--json]

Reversal scope (4 steps):

  Step 1 — UNSET takes.resolved_* columns for takes auto-applied by this
  wave. Identifies wave-applied takes via take_grade_cache.applied=true
  + wave_version match. Cross-checks resolved_by='gbrain:grade_takes' to
  ensure we're not un-resolving a take a manual `gbrain takes resolve`
  override has since claimed. Manual resolutions persist; only auto-grade
  resolutions revert.

  Step 1b — Mark take_grade_cache rows applied=false post-undo so the
  audit trail shows they WERE applied but this wave was reverted. The
  CDX-11 confidence-drift check filters on applied=true and gets a
  cleaner sample post-undo.

  Step 2 — DELETE FROM calibration_profiles WHERE wave_version = ?.

  Step 3 — DELETE FROM take_nudge_log WHERE wave_version = ?.

  Step 4 — Optional gstack-learnings-prune via the binary, scoped to the
  GSTACK_LEARNING_NAMESPACE prefix. Opt-in via --scrub-gstack. Best-effort:
  a missing binary or a failed prune logs a warning and suggests the manual
  command; the rest of the undo still succeeds.

Dry-run posture:
  --dry-run computes the counts via SELECT COUNT(*) shapes without
  emitting any UPDATE or DELETE. The same UndoWaveResult shape is returned, so
  the operator sees exactly what would be reverted before committing.

  --dry-run intentionally skips the gstack scrub (filesystem write) too;
  ship-state safety call.

Idempotency:
  Re-running --undo-wave on a brain that's already reverted is a no-op.
  Each query filters on wave_version; no matching rows → zero counts.

Architecture:
  undoWave(engine, opts) — async, returns UndoWaveResult. Pure data
  layer; no stderr writes, no process exits. CLI dispatch in
  src/commands/calibration.ts handles printing.

  v0.36.0.0 ship state runs steps 1-3 sequentially (no transaction).
  Partial reversal is recoverable via re-run since each step is
  idempotent on wave_version match. A future enhancement (v0.37+) can
  wrap in engine.transaction once that surface lands in BrainEngine.
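
The dry-run and idempotency properties can be illustrated against an in-memory store. This is a sketch only: it flattens `takes` and `take_grade_cache` into one row type, and the `Store` shape, the count field names, and the function name are assumptions rather than the real `UndoWaveResult`.

```typescript
interface WaveRow { wave_version: string }
interface GradedTake extends WaveRow { applied: boolean; resolved_by: string | null }
interface Store { takes: GradedTake[]; profiles: WaveRow[]; nudges: WaveRow[] }
interface UndoCounts { takes_unresolved: number; profiles_deleted: number; nudges_deleted: number; dry_run: boolean }

function sketchUndoWave(store: Store, wave: string, dryRun: boolean): UndoCounts {
  // Only auto-graded resolutions from this wave are eligible; manual ones persist.
  const isWaveTake = (t: GradedTake) =>
    t.wave_version === wave && t.applied && t.resolved_by === "gbrain:grade_takes";
  const inWave = (r: WaveRow) => r.wave_version === wave;
  const counts: UndoCounts = {
    takes_unresolved: store.takes.filter(isWaveTake).length,
    profiles_deleted: store.profiles.filter(inWave).length,
    nudges_deleted: store.nudges.filter(inWave).length,
    dry_run: dryRun,
  };
  if (dryRun) return counts; // counts only, nothing mutated
  for (const t of store.takes) {
    if (isWaveTake(t)) {
      t.resolved_by = null; // step 1: unset the resolution
      t.applied = false; // step 1b: mark the cache row as reverted
    }
  }
  store.profiles = store.profiles.filter((r) => !inWave(r)); // step 2
  store.nudges = store.nudges.filter((r) => !inWave(r)); // step 3
  return counts;
}
```

Because every step filters on the wave version and on `applied`, a second run finds nothing to do, which is the re-run recovery story described above.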

Tests: 8 cases in test/undo-wave.test.ts.
  Dry-run posture (1): counts emitted, NO UPDATE/DELETE SQL fired.
  Happy path (3): all 4 steps execute, resolved_by filter scopes UPDATE
  to wave-applied resolutions, custom resolvedByLabel honored.
  Empty wave (2): zero counts when no matching rows, idempotent re-run.
  Wave-version parameter threading (2): supplied version threads
  through all queries, different wave versions don't collide.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* calibration: A/B harness for think + ab-report (T18 / D19 CDX-18)

Structural answer to CDX-18 (anti-bias rewrite may make advice worse).
We don't have to guess whether calibration helps — we measure.

Architecture:
  runAbTrial(input) — calls thinkRunner TWICE on the same question
  (baseline + --with-calibration), surfaces both answers to a
  preferenceResolver, persists the trial to think_ab_results.

  buildAbReport(engine, { days }) — aggregates the table over the last
  N days (default 30). Computes win counts, ties, neither, and a
  with_calibration_win_rate over DECISIVE trials only (excludes
  neither/tie). Flags calibration_net_negative when n >= 20 AND win
  rate < 45%.

  formatAbReport(report, days) — pretty-prints for stdout; emits the
  calibration_net_negative warning block when triggered.

CLI:
  gbrain calibration ab-report [--days N] [--json]
    Reads the table, prints the breakdown. Replaces the v0.36.0.0
    ship-state placeholder in src/commands/calibration.ts.

  gbrain think --ab "<question>"
    Wires into runAbTrial via the dispatch in src/commands/think.ts —
    follow-up commit. This commit lands the harness layer + schema +
    report surface; the --ab flag itself flips on in a one-line wiring
    commit when the runRecall path is ready.

Schema (migration v72 / think_ab_results):
  source_id, wave_version, ran_at, question, baseline_answer,
  with_calibration_answer, preferred (CHECK in {baseline,
  with_calibration, neither, tie}), model_id, notes.

  CHECK constraint enforces preferred enum. Default wave_version
  'v0.36.0.0' stamped so --undo-wave can scrub these too.

  Index on (source_id, ran_at DESC) supports the report's
  "last N days" query.

  schema.sql + pglite-schema.ts both updated for fresh-install parity.
  schema-embedded.ts regenerated via build:schema.

calibration_net_negative threshold (D19):
  Triggers when:
    - decisive_trials (baseline + with_calibration) >= 20
    - with_calibration_win_rate < 0.45 (NOT <= — exact 45% is OK)

  Small-sample guard (n < 20) prevents the warning from firing on
  early data with sampling noise. Confidence-flat threshold (no Wilson
  CI yet) keeps the math simple; v0.37+ adds CI bounds.
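
The threshold logic above amounts to a short aggregation. A sketch, assuming the report is computed from a plain list of `preferred` values; the function name and the `null` win rate for zero decisive trials are assumptions, while the field names follow the text.

```typescript
type Preferred = "baseline" | "with_calibration" | "neither" | "tie";

function sketchAbSummary(prefs: Preferred[]) {
  const counts: Record<Preferred, number> = { baseline: 0, with_calibration: 0, neither: 0, tie: 0 };
  for (const p of prefs) counts[p]++;
  // Only decisive trials enter the win rate; ties and "neither" are excluded.
  const decisive = counts.baseline + counts.with_calibration;
  const winRate = decisive === 0 ? null : counts.with_calibration / decisive;
  // Strictly below 45% and at least 20 decisive trials, per D19.
  const netNegative = decisive >= 20 && winRate !== null && winRate < 0.45;
  return {
    counts,
    decisive_trials: decisive,
    with_calibration_win_rate: winRate,
    calibration_net_negative: netNegative,
  };
}
```

For example, 8 calibration wins against 12 baseline wins gives 20 decisive trials at a 40% win rate and trips the flag, while 9 against 11 sits exactly on 45% and does not.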

Tests: 12 cases in test/think-ab.test.ts.
  runAbTrial (4): both runner calls fire, preferenceResolver receives
    both answers, INSERT row params shape, throws when thinkRunner
    missing.
  buildAbReport (5): zero trials, aggregation, net_negative trigger at
    n>=20 + win<45%, no trigger at n<20 (small-sample guard), no
    trigger at exact 45% boundary.
  formatAbReport (3): zero-state message, decisive-trials breakdown,
    net_negative warning block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* calibration: pattern drill-down route + revisit-now CLI (TD3 / D29 + TD4 / D30)

TD3 (D29) — clickable pattern drill-down endpoint:
  GET /admin/api/calibration/pattern/:id (requireAdmin)
  Returns the pattern statement at index `id` plus the top 25 resolved
  takes for the holder, sorted by weight desc. v0.36.0.0 ship-state
  approximation: surfaces broad provenance evidence (top resolved
  takes). v0.37+ stores per-pattern source_take_ids[] on a
  calibration_profile_patterns join table so the drill-down shows the
  EXACT takes that drove the pattern.

  Surfaces a `provenance_note` field in the response so the operator
  sees the v0.36.0.0-vs-v0.37 fidelity boundary inline.

  The admin SPA's renderPatternStatementsCard SVG already emits anchor
  tags pointing at /admin/calibration/pattern/<i> (T15 ship state).
  This route makes those anchors clickable — closes the trust loop that
  was the rationale for D29 ("pattern statements without their evidence
  are dressed-up LLM hallucinations").

TD4 (D30) — `gbrain takes revisit <slug>` editor-open action:
  Adds the `revisit` subcommand to gbrain takes. Opens $EDITOR (falling
  back to vi) on the source markdown file for the slug. Appends a
  `<!-- gbrain:revisit -->` cursor marker at the bottom of the page on
  first invocation so the editor opens with intent visible.

  Reads sync.repo_path from config to locate the brain repo. Refuses to
  proceed with a clear error when the repo isn't configured or the page
  doesn't exist.

  spawnSync with stdio:'inherit' so the editor takes the terminal. Exit
  status surfaced on failure.

  The SVG renderer's revisit-now anchor for each abandoned thread row
  emits /admin/calibration/revisit/<takeId>. A small route handler that
  resolves take_id → page_slug then dispatches `gbrain takes revisit`
  via spawn is a v0.37 follow-up — the CLI command exists now so
  developers can wire it directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: DESIGN.md — formalize de facto design tokens (TD1)

Promotes the admin SPA's de facto design tokens (landed v0.26.0) to a
canonical DESIGN.md at the repo root. This is the calibration target
for /plan-design-review and /design-review going forward — when a
question is "does this UI fit the system?", the answer is here.

Captures the system as it stands today:

  Voice (5 surfaces, all routed through gateVoice() with mode-specific
  rubrics): pattern_statement, nudge, forecast_blurb, dashboard_caption,
  morning_pulse. Friend-not-doctor; concrete data over abstract metrics;
  no preachy / clinical / corporate language.

  Color tokens: 10 CSS variables from admin/src/index.css inlined into
  the SVG renderer (src/core/calibration/svg-renderer.ts). Dark theme
  is the only theme — admin is an operator tool. WCAG contrast
  documented per token; TD2's #555 → #777 bump on --text-muted noted.

  Typography: Inter for UI, JetBrains Mono for numbers/slugs/data.
  Type scale (18 / 14 / 13 / 12 / 11) documented as de facto, not yet
  formalized.

  Spacing scale: 4 / 8 / 16 / 24 / 32px. Linear-app density.

  Layout: sidebar 200px, max content 720px (text) / 960px (tables).
  No 3-column feature grids, no icons in colored circles, no
  decorative blobs.

  Charts: server-rendered SVG via pure functions in
  src/core/calibration/svg-renderer.ts. XSS posture documented:
  server-side escapeXml on caller-controlled strings, numeric inputs
  .toFixed()-coerced, admin SPA renders via <TrustedSVG> wrapper.

  Interaction patterns: keyboard nav required (J/K/space/u/q on the
  propose-queue), loading/empty/error states ARE features.

  v0.37+ roadmap: type scale formalization, animation tokens, component
  library extraction. Light mode explicitly NOT planned.

The doc is a living target, not a frozen spec. Major changes route
through /plan-design-review per the existing review chain.
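
The XSS posture listed under Charts (escape caller-controlled strings, coerce numeric inputs with .toFixed()) can be sketched as below. This is an illustration under stated assumptions, not the code in src/core/calibration/svg-renderer.ts: the body of `escapeXml` is a conventional stand-in, and `renderLabel` is a hypothetical function.

```typescript
// Stand-in for the renderer's escapeXml; the real implementation may differ.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;") // must run first so later entities are not double-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical pure render function for one text label.
function renderLabel(x: number, y: number, label: string): string {
  // toFixed() turns each number into a string that contains no markup
  // characters, so a numeric input cannot break out of the attribute.
  const px = x.toFixed(1);
  const py = y.toFixed(1);
  // The label is caller-controlled, so it is escaped before interpolation.
  return `<text x="${px}" y="${py}">${escapeXml(label)}</text>`;
}

console.log(renderLabel(10, 20.456, '<b>"x"</b>'));
// prints: <text x="10.0" y="20.5">&lt;b&gt;&quot;x&quot;&lt;/b&gt;</text>
```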

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* calibration: synthetic corpus scaffold + privacy CI guard (T19 + T20)

T19 — synthetic corpus scaffold for extract-takes prompt tuning.
  test/fixtures/calibration/extract-takes-corpus/ — 5 representative
  pages across 4 genres (essay, people, companies, meetings/decisions).
  v0.36.0.0 ships a SMALL representative corpus as proof of structure;
  the full 50-page training set + 10-page holdout gets generated by the
  operator via `gbrain calibration build-corpus` (v0.37 follow-up
  subcommand) or by hand with the privacy guard catching violations
  either way.

  Privacy contract per D13': every page is SYNTHETIC. None of the
  names/companies/funds/deals/events refer to anything real. Placeholder
  names per CLAUDE.md: alice-example, charlie-example, acme-example,
  widget-co, fund-a/b/c, acme-seed, widget-series-a, meetings/2026-04-03.

  test/fixtures/calibration/README.md spells out the privacy contract,
  generation flow, and what the corpus is (stable regression set for
  the extract-takes prompt) vs is not (real anything).

T20 — privacy CI guard (CDX-14 mitigation).
  scripts/check-synthetic-corpus-privacy.sh greps the corpus for:
    1. Explicit dollar amounts ($50M, $1.2B etc) — would suggest the
       page memorized a real round size.
    2. Out-of-range year references (informational only for v0.36.0.0;
       deferred to a manual review checklist).
    3. Pages that reference ZERO placeholder names — suggests the page
       might be referring to real entities. Essay-genre fixtures
       exempt (they're anonymized PG-style writing by design).

  Wired into `bun run verify` (CI gate) so contributors can't accidentally
  land a synthetic fixture that leaks real-world specificity. The intent
  is fail-fast on accidental leakage; the operator can update the
  allowlist if a generic dollar amount is intentional.

  Closes CDX-14: 'CC reads real brain pages locally, writes nothing
  still risks privacy if any generated synthetic fixture memorizes
  structure-specific facts. Placeholder names are not enough.'

The corpus shipped here is intentionally small but covers the four
core gbrain page genres (essay, people, companies, meetings/decisions).
The v0.37 corpus-build subcommand will fan out to 50 with the operator
spot-checking + the CI guard enforcing the privacy contract.
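
The two enforcing checks in T20 (explicit dollar amounts, pages with zero placeholder names) can be sketched as follows. The actual guard is a shell script; this TypeScript version is only an illustration. The regular expression, the `CorpusPage` shape, and the function name are assumptions; the placeholder names are the ones listed above. The informational year check is left out.

```typescript
// Placeholder names taken from the privacy contract in this commit message.
const PLACEHOLDER_NAMES: readonly string[] = [
  "alice-example", "charlie-example", "acme-example", "widget-co",
  "fund-a", "fund-b", "fund-c", "acme-seed", "widget-series-a",
];

// Assumed pattern: a dollar sign followed by digits, optionally suffixed K/M/B.
const DOLLAR_AMOUNT = /\$\d[\d,.]*[KMB]?/i;

// Hypothetical page shape for the sketch.
interface CorpusPage {
  path: string;
  genre: string;
  body: string;
}

function findPrivacyViolations(page: CorpusPage): string[] {
  const violations: string[] = [];

  // Check 1: an explicit dollar amount may mean a real round size was memorized.
  const amount = page.body.match(DOLLAR_AMOUNT);
  if (amount) {
    violations.push(`${page.path}: explicit dollar amount ${amount[0]}`);
  }

  // Check 3: a page with zero placeholder names may refer to real entities.
  // Essay-genre fixtures are exempt, as stated in the commit message.
  const mentionsPlaceholder = PLACEHOLDER_NAMES.some((name) => page.body.includes(name));
  if (!mentionsPlaceholder && page.genre !== "essay") {
    violations.push(`${page.path}: no placeholder names found`);
  }

  return violations;
}
```

A CI wrapper would run this over every fixture and fail when any page returns a non-empty list.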

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: R1-R5 IRON RULE regression inventory (T21)

Per /plan-eng-review D26 IRON RULE: regressions get added to the test
suite as critical requirements, no AskUserQuestion needed. Pins five
regressions identified during the v0.36.0.0 wave's coverage diagram:

  R1: think baseline UNCHANGED when --with-calibration absent.
      Covered structurally by test/think-with-calibration.test.ts plus
      assertion-pinned in this file (default user message: question
      first, then retrieval; system prompt: no anti-bias section).

  R2: contradictions probe output UNCHANGED when no calibration profile.
      Covered structurally by test/eval-contradictions-calibration-join.test.ts
      plus pinned here (null profile → null tag, byte-identical to v0.32.6).

  R3: takes resolution flow works when grade_takes phase disabled.
      Pinned import-surface coupling: takes-resolution.ts has zero
      dependency on grade_takes module. If a future refactor accidentally
      couples them, this test fails to compile.

  R4: search/list_pages/get_page work identically through new source_id paths.
      Marker test referencing existing v0.34.1 source-isolation suite at
      test/source-isolation-pglite.test.ts. v0.36.0.0 does NOT modify
      those code paths; the existing tests catch any accidental coupling.

  R5: existing search modes (conservative/balanced/tokenmax) unaffected.
      Marker test referencing existing test/search-mode.test.ts. The
      calibration code DOES NOT IMPORT from src/core/search/mode.ts.

Plus an inventory test that confirms all 5 regressions have an
'addressed' status — fail-loud if a future contributor removes a
guard without updating the inventory.

7 tests total. Pure functions, no engine, hermetic.
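
One way the inventory check could be shaped, as a sketch only: the type names, the summaries, and the `unaddressedRegressions` helper are hypothetical and not copied from the test file.

```typescript
type RegressionId = "R1" | "R2" | "R3" | "R4" | "R5";
type GuardStatus = "addressed" | "unaddressed";

interface RegressionEntry {
  summary: string;
  status: GuardStatus;
}

// Record<RegressionId, ...> makes the compiler reject an inventory that omits an ID.
const REGRESSION_INVENTORY: Record<RegressionId, RegressionEntry> = {
  R1: { summary: "think baseline unchanged without --with-calibration", status: "addressed" },
  R2: { summary: "contradictions probe unchanged without a profile", status: "addressed" },
  R3: { summary: "takes resolution works with grade_takes disabled", status: "addressed" },
  R4: { summary: "search/list_pages/get_page identical via source_id paths", status: "addressed" },
  R5: { summary: "existing search modes unaffected", status: "addressed" },
};

// Returns the IDs whose guard is not marked as addressed.
function unaddressedRegressions(
  inventory: Record<RegressionId, RegressionEntry>,
): RegressionId[] {
  return (Object.keys(inventory) as RegressionId[]).filter(
    (id) => inventory[id].status !== "addressed",
  );
}

// A fail-loud test would assert that this list is empty.
const missing = unaddressedRegressions(REGRESSION_INVENTORY);
if (missing.length > 0) {
  throw new Error(`regressions without guards: ${missing.join(", ")}`);
}
```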

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: v0.36.0.0 CHANGELOG + CLAUDE.md anchors + calibration convention skill

CHANGELOG entry: the user-facing release notes. Leads with the headline
("the brain learns how you tend to be wrong, then argues against your
blind spots on every advice call"), 5 'what you can now do' bullets in
GStack voice, itemized changes by lane, and the 'To take advantage of
v0.36.0.0' upgrade checklist per the CLAUDE.md required-block contract.

CLAUDE.md anchors: new 'v0.36.0.0 Hindsight calibration wave (key files
cluster)' block inserted before the v0.31.1 thin-client section. 23 new
files / extensions annotated with one-paragraph descriptions each,
linking back to the convention skill at skills/conventions/calibration.md
for the agent-facing rules.

skills/conventions/calibration.md: the agent-facing convention skill.
Tells future contributors which calibration touchpoint applies to
their task — voice gate? BaseCyclePhase? source-scope thread? doctor
warning? cross-brain query rules? auto-resolve threshold posture? Test
seam patterns. Bug class to avoid (the v0.34.1 source-isolation leak
shape).

Version trio (per CLAUDE.md mandatory audit):
  VERSION:     0.36.0.0
  package.json: 0.36.0.0
  CHANGELOG:   ## [0.36.0.0] - 2026-05-17

llms.txt + llms-full.txt regenerated via `bun run build:llms` after
the CLAUDE.md edit (per the explicit CLAUDE.md mandate "Any CLAUDE.md
edit MUST be followed by `bun run build:llms`"). The `test/build-llms.test.ts`
guard runs in CI shard 1; the committed bundles are checked against
fresh generator output.

bun run verify is clean. typecheck clean. Privacy CI guard passes
(0 violations across 6 corpus pages). All ready for /ship.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* cycle: wire propose_takes / grade_takes / calibration_profile into runCycle (T-fix)

The three new v0.36.0.0 phases were declared in CyclePhase / ALL_PHASES /
NEEDS_LOCK_PHASES but the runCycle orchestrator never dispatched them.
ALL_PHASES advertised them, gbrain dream --phase propose_takes accepted
them, but `gbrain dream` (default) silently skipped all three.

Adds a single dispatch block between consolidate and embed that:
  - builds an OperationContext on the fly (trusted-workspace caller,
    remote: false, sourceId resolved via the same helper sync uses)
  - dispatches the three phases in the order ALL_PHASES declares
  - records the same skipped-phase shape (no_database) when engine is null

Pinned by test/core/cycle.serial.test.ts "default: all 6 phases run in
order" which was already failing against ALL_PHASES (the test name lags
the actual phase count; left as-is since renaming churns history).
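
The dispatch block described above can be sketched as follows. This is a simplified, synchronous illustration: every type and function name here is an assumption, the phase order is taken from the commit title rather than from ALL_PHASES itself, and the real orchestrator may well be asynchronous.

```typescript
type CalibrationPhase = "propose_takes" | "grade_takes" | "calibration_profile";

// Assumed order, following the commit title.
const CALIBRATION_PHASES: readonly CalibrationPhase[] = [
  "propose_takes",
  "grade_takes",
  "calibration_profile",
];

// Hypothetical minimal context: a trusted local caller bound to one source.
interface OperationContext {
  remote: false;
  sourceId: string;
}

interface PhaseResult {
  phase: CalibrationPhase;
  status: "ok" | "skipped";
  reason?: "no_database";
}

type PhaseRunner = (ctx: OperationContext) => void;

function dispatchCalibrationPhases(
  engine: object | null,
  sourceId: string,
  runners: Record<CalibrationPhase, PhaseRunner>,
): PhaseResult[] {
  const results: PhaseResult[] = [];
  // Built once, on the fly, and shared by all three phases.
  const ctx: OperationContext = { remote: false, sourceId };

  for (const phase of CALIBRATION_PHASES) {
    if (engine === null) {
      // Without an engine, record the skip instead of running the phase.
      results.push({ phase, status: "skipped", reason: "no_database" });
      continue;
    }
    runners[phase](ctx);
    results.push({ phase, status: "ok" });
  }
  return results;
}
```

Iterating over one ordered list keeps the run order and the reported order in step, which is the property the serial cycle test pins.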

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* calibration: expand synthetic corpus + add hand-labeled ground-truth (T19)

Adds 8 new synthetic pages modeled on the genre mix observed in the
real brain (concepts-with-timeline, meeting-notes, daily-journal,
people-pages, essays). Companion .gradeable-claims.json files carry
hand-labeled answer keys — what a tuned propose_takes prompt SHOULD
extract per page. Closes the F1 gate gap from the plan's T19/D19:

  Training corpus (test/fixtures/calibration/extract-takes-corpus/):
    + concept-startup-market-dynamics.md     (10 claims)
    + meeting-2026-04-10-fundraise-fund-a.md (6 claims)
    + daily-2026-04-15.md                    (5 claims)

  Blind holdout (test/fixtures/calibration/holdout/):
    + concept-founder-execution.md           (6 claims, F1 >= 0.80)
    + daily-2026-04-18.md                    (4 claims, F1 >= 0.80)
    + meeting-2026-04-17-hiring-charlie.md   (5 claims, F1 >= 0.80)
    + essay-on-conviction.md                 (7 claims, F1 >= 0.80)
    + people-bob-example.md                  (5 claims, F1 >= 0.80)

Privacy:
  - No real-brain content read into any committed artifact. Pages
    written from scratch using the canonical placeholder set
    (alice-example, charlie-example, bob-example, acme-example,
    widget-co, fund-a/b/c). Real-name grep confirms zero leakage:
    wintermute, garrytan, paul-graham, sam-altman, etc. → 0 hits.
  - scripts/check-synthetic-corpus-privacy.sh passes: 0 violations
    across 14 pages (was 6).

Genre fidelity:
  - concept-with-timeline pages mirror the dated-assertion structure
    real brain uses (verb framing varies: "argues / predicts / I
    think / I bet / strong conviction / moderate conviction").
  - meeting-notes pages carry both prose claims (extracted via
    hedging language) and explicit ## Takes sections.
  - daily-journal pages test probabilistic framing ("75/25 in favor",
    "call it ~0.5") and self-tagged conviction values.
  - essay-on-conviction is the meta-page that names the author's
    own bias patterns — primary signal for calibration_profile.
  - people pages test claim-about-third-party extraction.

Each JSON ground-truth lists per-claim:
  - claim_text + kind (prediction|judgment|bet) + domain
  - conviction (0..1)
  - since_date
  - rationale (why this claim is gradeable + how a tuned prompt
    should infer conviction from the prose)

This is the corpus that gates the T19 prompt-tune iteration:
  - F1 >= 0.85 on training (10+6+5 = 21 claims across 3 pages
    plus the existing 5 fixtures already shipped)
  - F1 >= 0.80 on holdout (27 claims across 5 pages)
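
For reference, a sketch of how the two F1 gates could be evaluated. The thresholds are the ones stated above; the function names are hypothetical, and how an extracted claim gets matched to a hand-labeled claim is not specified here, so the sketch takes the match count as an input.

```typescript
// F1 is the harmonic mean of precision and recall.
function f1Score(matched: number, extracted: number, expected: number): number {
  // Convention for the sketch: no extractions or no labels scores 0.
  if (extracted === 0 || expected === 0) return 0;
  const precision = matched / extracted; // share of extracted claims that are correct
  const recall = matched / expected; // share of labeled claims that were found
  if (precision + recall === 0) return 0;
  return (2 * precision * recall) / (precision + recall);
}

// Thresholds from the commit message.
const F1_GATES = { training: 0.85, holdout: 0.8 } as const;

function passesGate(split: keyof typeof F1_GATES, f1: number): boolean {
  return f1 >= F1_GATES[split];
}

// Hypothetical run: 27 labeled holdout claims, 25 extracted, 22 of them matched.
const holdoutF1 = f1Score(22, 25, 27);
console.log(holdoutF1.toFixed(3), passesGate("holdout", holdoutF1));
// prints: 0.846 true
```

The same score would miss the training gate, since 0.846 is below 0.85.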

Plan reference: ~/.claude/plans/system-instruction-you-are-working-rippling-knuth.md
Privacy gate: scripts/check-synthetic-corpus-privacy.sh (wired into bun run verify).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* calibration: tune propose…