feat(exports): add ./enrichment to package.json exports map by chengzehsu · Pull Request #555 · garrytan/gbrain

chengzehsu · 2026-05-01T14:42:21Z

Summary

Adds a single ./enrichment entry to the exports map in package.json,
pointing to the existing src/core/enrichment-service.ts. Purely additive
— no file is moved, no symbol is renamed, no behavior changes.

     "./search/expansion": "./src/core/search/expansion.ts",
-    "./extract": "./src/commands/extract.ts"
+    "./extract": "./src/commands/extract.ts",
+    "./enrichment": "./src/core/enrichment-service.ts"
   },

Motivation

External consumers — e.g. user-built ingestion workers that want to plug
entity enrichment into their own pipelines — need to call the symbols
that src/core/enrichment-service.ts already exports:

extractAndEnrich
enrichEntity
enrichEntities
extractEntities
slugifyEntity
entityPagePath

Today, because enrichment-service.ts is not in the exports map, the
only way to reach those symbols from outside the package is a deep import:

import { extractAndEnrich } from "gbrain/src/core/enrichment-service";

That has two problems:

Outside the public contract. Any internal reshuffle of src/core/
silently breaks downstream consumers, because nothing pins this path
as part of the API surface.
Not type-resolvable by default. TypeScript with default
moduleResolution rejects deep imports that are not declared in
exports, so consumers have to add custom paths mappings in their
tsconfig.json just to compile.

After this change, consumers can write:

import { extractAndEnrich } from "gbrain/enrichment";

…in the same shape as the existing gbrain/extract, gbrain/operations,
gbrain/engine, etc.

Discovery

Found while wiring extractAndEnrich into an external eco-brain-worker
service as a Phase 1.5 follow-up — the deep-import workaround was the
clear sharp edge, and a 1-line addition to exports cleans it up for
every future consumer.

Test plan

package.json parses as valid JSON (verified locally with python -m json.tool).
Diff is exactly the 2 lines shown above (1 added entry + trailing comma on the previous line).
No other file touched.
CI green on this PR.

^{Need help on this PR? Tag @codesmith with what you need.}

Let Codesmith autofix CI failures and bot reviews

External consumers (e.g. user-built workers that wire entity enrichment into ingestion pipelines) need to call `extractAndEnrich`, `enrichEntity`, `enrichEntities`, `extractEntities`, `slugifyEntity`, and `entityPagePath` as a library. Today the only way to reach `src/core/enrichment-service.ts` from outside the repo is a deep import path like `gbrain/src/core/enrichment-service.ts`, which: - Sits outside the public exports contract, so it could break silently on any internal restructure. - Is not resolvable by TypeScript without a custom `paths` mapping in the consumer's tsconfig. This change adds a single `./enrichment` entry to the `exports` map, in the same style as the existing `./extract`, `./operations`, `./engine`, etc. entries. It is purely additive — no file is moved, no symbol is renamed, and there is no behavioral change. Consumers can now write: import { extractAndEnrich } from "gbrain/enrichment"; The motivating use case is a Phase 1.5 follow-up in eco-brain-worker, where this gap was discovered while wiring `extractAndEnrich` into the worker's ingestion flow.

…mp (T15) Adds the v0.36.0.0 admin SPA Calibration tab. Per the design review, the approved variant-B (Linear calm clarity) layout: single-column flow, generous whitespace, ONE big sparkline as hero, then patterns, then domain bars, then abandoned threads. D23 server-rendered SVG architecture: src/core/calibration/svg-renderer.ts — pure functions. data → SVG string. No DOM, no React, no chart library dep. Inlines the admin design tokens (#0a0a0f bg, #3b82f6 accent, etc.) so the SVG is visually consistent with the rest of the admin SPA. Four chart renderers: - renderBrierTrend({ series }) — sparkline w/ baseline reference at 0.25 (always-50% baseline) - renderDomainBars({ bars }) — horizontal accuracy bars per domain - renderAbandonedThreadsCard(threads) — D30/TD4 'revisit now' link per row, points at /admin/calibration/revisit/<takeId> - renderPatternStatementsCard(statements) — D29/TD3 clickable drill-down links per row, point at /admin/calibration/pattern/<i> XSS posture: all caller-controlled strings pass through escapeXml(). Numeric inputs are .toFixed()-coerced. Admin SPA renders via dangerouslySetInnerHTML inside a TrustedSVG wrapper component; endpoint is gated by requireAdmin middleware. /admin/api/calibration/profile — returns the active profile row as JSON. /admin/api/calibration/charts/:type — returns image/svg+xml markup for type ∈ {brier-trend, domain-bars, pattern-statements, abandoned-threads}. Cache-Control: private, max-age=60. brier-trend currently renders a single-point series from the active profile (the time-series view across calibration_profiles.generated_at history is a v0.37 follow-up once we have multiple snapshots). abandoned-threads pulls the top 5 abandoned rows via the same SQL the doctor check uses. CalibrationPage React component (admin/src/pages/Calibration.tsx): Fetches profile + 4 charts. Loading / error / cold-brain states all handled. Layout includes the audit annotations (partial-grade badge, voice-gate-fell-back-to-template badge) per the approved mockup. TrustedSVG wrapper isolates the dangerouslySetInnerHTML to the SVG surface only. App.tsx nav: added 'calibration' page route + sidebar nav item, hash routing extended to support #calibration. TD2 contrast bump: admin/src/index.css --text-muted: #555 → #777. Old value was contrast 4.0 on the #0a0a0f bg — below WCAG AA 4.5 for body text. New value is ~5.5, passes AA. Improvement is global across Dashboard, Agents, RequestLog, and the new Calibration tab — single-line CSS change with ~10x the impact. admin/dist/ rebuilt via `bun run build` (vite). 36 modules transformed. Tests: 19 cases in test/svg-renderer.test.ts. escapeXml (1): canonical entities. renderBrierTrend (6): empty state, polyline for 2+ points, clamp beyond yMax, design tokens inlined, XSS safety on date strings, text-anchor end on right label. renderDomainBars (4): empty state, label/accuracy/n rendering, out-of-range accuracy clamp, XSS safety on labels. renderAbandonedThreadsCard (4): empty state, row rendering with revisit link, claim truncation at 70 chars, custom revisitHref override. renderPatternStatementsCard (4): empty state, anchor count matches statement count, XSS safety, custom drillHref override. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Promotes the admin SPA's de facto design tokens (landed v0.26.0) to a canonical DESIGN.md at the repo root. This is the calibration target for /plan-design-review and /design-review going forward — when a question is "does this UI fit the system?", the answer is here. Captures the system as it stands today: Voice (5 surfaces, all routed through gateVoice() with mode-specific rubrics): pattern_statement, nudge, forecast_blurb, dashboard_caption, morning_pulse. Friend-not-doctor; concrete data over abstract metrics; no preachy / clinical / corporate language. Color tokens: 10 CSS variables from admin/src/index.css inlined into the SVG renderer (src/core/calibration/svg-renderer.ts). Dark theme is the only theme — admin is an operator tool. WCAG contrast documented per token; TD2's #555 → #777 bump on --text-muted noted. Typography: Inter for UI, JetBrains Mono for numbers/slugs/data. Type scale (18 / 14 / 13 / 12 / 11) documented as de facto, not yet formalized. Spacing scale: 4 / 8 / 16 / 24 / 32px. Linear-app density. Layout: sidebar 200px, max content 720px (text) / 960px (tables). No 3-column feature grids, no icons in colored circles, no decorative blobs. Charts: server-rendered SVG via pure functions in src/core/calibration/svg-renderer.ts. XSS posture documented: server-side escapeXml on caller-controlled strings, numeric inputs .toFixed()-coerced, admin SPA renders via <TrustedSVG> wrapper. Interaction patterns: keyboard nav required (J/K/space/u/q on the propose-queue), loading/empty/error states ARE features. v0.37+ roadmap: type scale formalization, animation tokens, component library extraction. Light mode explicitly NOT planned. The doc is a living target, not a frozen spec. Major changes route through /plan-design-review per the existing review chain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… wrong (#1139) * schema: v0.36.0.0 Hindsight calibration tables (migrations v67-v71) Foundation commit for the Hindsight-inspired calibration wave. Adds four new tables + one perf index, all source-scoped from day 1 per v0.34.1 discipline: - calibration_profiles (v67): per-holder LLM-narrative aggregation of TakesScorecard data. published BOOL gates E8 cross-brain mount sharing (default false). grade_completion REAL surfaces partial-grade state to the dashboard. active_bias_tags TEXT[] with GIN index feeds E3 (calibration- aware contradictions) and E7 (real-time nudge matching). - take_proposals (v68): propose_takes phase queue. Idempotency cache via (source_id, page_slug, content_hash, prompt_version) unique index mirrors the v0.23 dream_verdicts pattern. proposal_run_id supports --rollback by run. dedup_against_fence_rows JSONB audit column records what canonical takes the LLM was told to dedupe against at proposal time. - take_grade_cache (v69): grade_takes verdict cache. Composite PK on (take_id, prompt_version, judge_model_id, evidence_signature) — prompt edits OR evidence changes cleanly invalidate prior verdicts. applied=false default + auto-resolve-off-by-default (D17) means every fresh install needs operator opt-in before grade verdicts mutate the takes table. - take_nudge_log (v70): E7 nudge cooldown state. Polymorphic FK — a nudge fires on either a canonical take OR a pending proposal (CDX-5 fix). CHECK constraint enforces exactly-one-set. channel column lets future routing (webhook, admin SPA toast) reuse the same cooldown semantics. - takes_resolved_at_idx (v71): partial index for the Brier-trend aggregation queries. Engine-aware handler — Postgres uses CONCURRENTLY to avoid the ShareLock; PGLite uses plain CREATE. Every table carries wave_version TEXT NOT NULL DEFAULT 'v0.36.0.0' so the v0.36.0.0 calibration --undo-wave command (lands later in the wave) can reverse just this wave's writes. Plan: ~/.claude/plans/system-instruction-you-are-working-rippling-knuth.md covers the design rationale (D17/D18/D21 + CDX findings). Schema parity: - src/schema.sql for fresh Postgres installs - src/core/pglite-schema.ts for fresh PGLite installs - src/core/schema-embedded.ts auto-regenerated from schema.sql - src/core/migrate.ts for upgrade-in-place from older brains VERSION bumped to 0.36.0.0 for the wave. CHANGELOG entry lands at /ship. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * core: BaseCyclePhase abstract class enforces source-scope + budget contracts D21 from the eng review. Three new v0.36.0.0 cycle phases (propose_takes, grade_takes, calibration_profile) share enough structure that the duplication-vs-abstraction trade tips toward a shared base. Without this scaffold, source-isolation discipline would drift exactly the way it drifted in v0.34.1 — except this time across three new surfaces at once. What this enforces: 1. Phase signature is uniform: run(ctx, opts) → PhaseResult. 2. ctx.sourceId / ctx.auth.allowedSources MUST be threaded through every engine call. The base class surfaces a scope() helper that wraps sourceScopeOpts(ctx) and is the only sanctioned way to read source- scoped data. Forgetting to thread source scope becomes a TypeScript compile error, not a runtime leak. Closes the v0.34.1 leak class structurally for every new phase. 3. Budget meter wraps run() automatically. Subclass declares budgetUsdKey + budgetUsdDefault; base reads the resolved cap from config and creates the BudgetMeter. Subclass calls this.checkBudget() before each LLM submit; budget-exhausted phase still returns status='ok' (clean abort) so the cycle report shows partial completion, not failure. 4. Error envelope is uniform. Thrown errors get caught and converted to status='fail' with a phase-specific error.code via the subclass's mapErrorCode() hook. 5. Progress reporter integration. Base accepts the reporter via opts; subclasses call this.tick() instead of touching the reporter directly, so the phase name in the progress stream is always correct. Tests: 13 cases in test/core/base-phase.test.ts cover source-scope threading (5 cases including the empty-allowedSources-MUST-NOT-widen-scope regression), PhaseResult shape including the error envelope path (3 cases), dry-run propagation (2 cases), and budget meter construction (3 cases including config-key override). Synthesize.ts / patterns.ts (existing pre-v0.36 phases) deliberately do NOT retrofit to this base in v0.36.0.0 — too much churn for a refactor that doesn't pay off until v0.37+. Future phases use this by default. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cycle: propose_takes phase + take_proposals queue write path (T3) LLM-based take extraction from markdown prose. Walks pages updated since last cycle, sends each page's body to a tuned extractor, writes the extracted gradeable claims to the take_proposals queue. User accepts / rejects via `gbrain takes propose --review` (lands in Lane C). Cycle wiring: lint → backlinks → sync → synthesize → extract → extract_facts → resolve_symbol_edges → patterns → recompute_emotional_weight → consolidate → propose_takes (NEW) → grade_takes (NEW; T4) → calibration_profile (NEW; T6) → embed → orphans → purge CyclePhase enum extended with 3 new entries; ALL_PHASES + NEEDS_LOCK_PHASES updated. All three new phases acquire the cycle lock (writes to take_proposals / take_grade_cache / calibration_profiles). Idempotency contract: The (source_id, page_slug, content_hash, prompt_version) composite unique index on take_proposals means an unchanged page never re-spends LLM tokens. Bumping PROPOSE_TAKES_PROMPT_VERSION cleanly invalidates the cache so a tuned prompt re-runs proposals on every page. Mirrors the v0.23 dream_verdicts pattern. F2 fence dedup: The phase reads the page's existing `` fence (when present) and passes the canonical take rows to the extractor as "things you have already captured." Prevents duplicate proposals when prose is appended to a page that already has takes. Records the fence rows the LLM was told to dedupe against on the take_proposals row for audit (dedup_against_fence_rows JSONB). Auto-resolve posture: propose_takes only WRITES proposals to the queue. Nothing in this phase mutates the canonical takes table. Operator opt-in via the queue review CLI (Lane C) is the only path from queue to canonical fence (D17). Prompt tuning status (v0.36.0.0 ship state): The default extractor prompt is annotated `v0.36.0.0-stub`. The real tuned prompt arrives via T19 synthetic corpus build (50 anonymized pages, 3-model parallel extraction, user reviews disagreement set, F1 ≥ 0.85 on training corpus + F1 ≥ 0.8 on ground-truth holdout). Until T19 lands, propose_takes runs but produces best-effort candidates the user reviews manually. Architecture: ProposeTakesPhase extends BaseCyclePhase (T2). Inherits source-scope threading via scope(), budget metering via this.checkBudget(), error envelope wrapping. budgetUsdKey: cycle.propose_takes.budget_usd (default $5/cycle). Budget exhaustion mid-page returns status='warn' with details.budget_exhausted=true — clean partial-completion semantics. Test seam: opts.extractor injection so the phase can run hermetically without touching the gateway. defaultExtractor (production path) calls gateway.chat with the EXTRACT_TAKES_PROMPT and parses the JSON array output via parseExtractorOutput. parseExtractorOutput defends against common LLM output sins: markdown code fence wrapping, leading prose, single-object instead of array, unknown kind values, weight out of [0,1], rows missing claim_text or exceeding 500 chars. Tests: 25 cases in test/propose-takes.test.ts cover the 4 pure helpers (parseExtractorOutput, contentHash, hasCompleteFence, extractExistingTakesForDedup) + 7 phase integration scenarios (happy path, cache hit, fence dedup, extractor failure, empty pages, skipPagesWithFence, proposal_run_id stability). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cycle: grade_takes phase + take_grade_cache verdict pipeline (T4) Walks unresolved takes that are old enough to have outcome data, retrieves evidence from the brain, asks a judge model to verdict each one. Writes verdicts to take_grade_cache. Optionally — only when operator has flipped the opt-in config flag — auto-applies high-confidence verdicts to the canonical takes table via engine.resolveTake. Auto-resolve posture (D17 — DISABLED by default): On a fresh install, grade_takes runs and writes verdicts to the cache, but applied=false on every row. Operator reviews the queue, then flips `cycle.grade_takes.auto_resolve.enabled: true` once trust is earned. Mirrors the propose_takes review-queue posture: queue exists, mutation requires explicit opt-in. Conservative threshold (D12): When auto_resolve.enabled is true, a verdict auto-applies only when confidence >= 0.95 (single-judge path). T5 ensemble path lands next, tightening this further with 3/3 unanimous requirement. 'unresolvable' verdict NEVER auto-applies even at confidence=1.0 — there's no canonical column for "we tried and there's no evidence yet." Evidence retrieval status (v0.36.0.0 ship state): The default evidence retriever returns an "evidence-retrieval not yet wired" placeholder. Most verdicts produced by the stub-judge against the stub-evidence will be 'unresolvable'. Real retrieval (hybrid search over pages newer than the take's since_date, optionally augmented by a gateway web-search recipe in v0.37+) lands as a follow-up. Documented limitation per CDX-8 + D17 — the phase ships now so the wiring is real and the cache table accumulates verdicts even if early ones are conservative. Cache key: Composite primary key on take_grade_cache is (take_id, prompt_version, judge_model_id, evidence_signature). Prompt edits OR evidence changes OR judge swap cleanly invalidate prior verdicts. Mirrors the v0.32.6 eval_contradictions_cache pattern. evidence_signature = SHA-256 of (judge_model_id + '|' + evidence_text) so identical evidence under a different judge does NOT collide. Architecture: GradeTakesPhase extends BaseCyclePhase. Inherits source-scope threading, budget metering (cycle.grade_takes.budget_usd, default $3/cycle), error envelope. Test seam: opts.judge + opts.evidenceRetriever injection so the phase runs hermetically. parseJudgeOutput defends against fence-wrapping, leading prose, out-of-range confidence (clamps to [0,1]), invalid verdict labels, oversized reasoning (truncated at 400 chars). Returns null on unrecoverable parse — caller treats null as "judge_output_parse_failed / unresolvable at confidence 0.0" so the row still lands in cache with the parse failure surfaced via warnings. takeIsOldEnough gates on since_date (default 6 months). Tolerates YYYY-MM-DD and YYYY-MM formats. Returns false on null/unparseable since_date so takes without dates never get graded (we'd be hallucinating temporal context). Tests: 23 cases covering parseJudgeOutput (7 cases), evidenceSignature (3), takeIsOldEnough (5), and 8 phase integration scenarios — happy path, D17 auto-resolve-off default, D12 above-threshold auto-apply, below- threshold cache-only, unresolvable-NEVER-applies, cache hit, too-recent gate, judge-throw warning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cycle: grade_takes ensemble tiebreaker for borderline verdicts (T5 / E2) Multi-judge ensemble tiebreaker, additive on top of T4's single-judge foundation. Reuses gateway.chat as the per-model judge interface; runs three judges in parallel via Promise.allSettled. Pure aggregation logic in aggregateEnsemble() — no SQL, no LLM, hermetically testable. When ensemble fires (T5 trigger band): Only when ALL of: - opts.useEnsemble === true (default false) - opts.ensembleJudges array is non-empty - single-model confidence in [0.6, 0.95) (configurable via opts.ensembleTriggerBand) - single-model verdict !== 'unresolvable' Above 0.95 the single judge is already sufficient (T4 path). Below 0.6 the verdict is clearly review-only — ensemble wouldn't change the posture. 'unresolvable' from single-judge means no evidence yet; calling three more judges on the same evidence won't manufacture some. Conservative auto-apply (D12): Ensemble verdict auto-applies via engine.resolveTake only when ALL of: - autoResolve === true (operator opt-in per D17) - ensemble.agreement === 3 (3/3 unanimous) - ensemble.minConfidence >= ensembleThreshold (default 0.85) - winning verdict !== 'unresolvable' Schema-level monotonic-tightening guard for ensembleThreshold lives in the takes resolution layer. Cache identity: When ensemble fires, the cache row's judge_model_id becomes 'ensemble:<modelA>+<modelB>+<modelC>' — a future re-run with different ensemble membership doesn't collide with prior verdicts. evidence_signature is recomputed because it includes the judge_model_id. aggregateEnsemble (pure): - 3/3 unanimous → agreement=3, minConfidence=min across the three - 2/3 majority → agreement=2, minConfidence across the agreeing two - 1/1/1 disagreement → tie-break: prefer non-'unresolvable', then alphabetical for determinism - 'unresolvable' from one model NEVER tips a 2-vote majority toward 'unresolvable' — by-label tally only counts a model toward its own label - All three judges failing (allSettled rejected) → verdict='unresolvable' with agreement=0; auto-apply path blocked - Single judge survives + two fail → agreement=1; the lone verdict wins but auto-apply gated by the 3/3 requirement Tests: 16 cases. aggregateEnsemble (6): 3/3, 2/3, 1/1/1, unresolvable-tipping-resistance, all-failed, partial-failed-but-survives. Phase trigger conditions (5): useEnsemble=false default, useEnsemble=true in borderline band, single >= 0.95 skip, single < 0.6 skip, single = 'unresolvable' skip. Phase auto-apply rules (5): 3/3+threshold+autoResolve, 2/3 majority no apply, 3/3 below threshold no apply, one ensemble judge throws still aggregates from allSettled, empty ensembleJudges falls through to single. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cycle: calibration_profile phase + shared voice gate across surfaces (T6) The calibration narrative layer. Reads TakesScorecard, asks an LLM to write 2-4 conversational pattern statements ("right on tactics, late on macro by 18 months"), passes them through the voice gate, derives active bias tags, writes the row to calibration_profiles. This is the read-side that E1 (think anti-bias rewrite), E3 (contradictions join), E6 (dashboard), and E7 (real-time nudges) all consume. Voice gate (D24 — single function, multiple surfaces): ALL five calibration UX surfaces import the same gateVoice() function from src/core/calibration/voice-gate.ts. Mode parameter ('pattern_statement' | 'nudge' | 'forecast_blurb' | 'dashboard_caption' | 'morning_pulse') drives surface-specific tuning via the rubric the gate ships to its Haiku judge. NO forked implementations — voice rubric drift would defeat the gate. Each mode's rubric explicitly forbids preachy / clinical / corporate voice; a structural test pins this. Anchors the cross-cutting voice rule from /plan-ceo-review D2-D8. Fallback policy (D11): Up to 2 generation attempts (configurable). On both rejects → fall back to a hand-written template from src/core/calibration/templates.ts. Templates are intentionally short and a little "robotic" — they're the safety net, not the destination. voice_gate_passed=false + voice_gate_attempts get persisted on the calibration_profiles row so the operator can review the failing examples and tune the rubric over time. Suppressing the surface silently is NEVER an option — that's how voice quality silently degrades. parseJudgeOutput defaults to 'academic' on parse failure (NEVER passes pass-through) so a Haiku output garble falls through to the template rather than letting unverified text reach the user. calibration_profile phase: Extends BaseCyclePhase. Cold-brain skip: <5 resolved takes → no row written, no LLM call. Otherwise: scorecard via engine.getScorecard() → patterns via voice-gated generator → bias tags via separate generator (best-effort; failure logs warning, phase continues). The DB INSERT lands in the v67 calibration_profiles row with source_id, holder, the patterns, voice gate audit fields, active bias tags, and grade_completion (F1 fix — partial-grade state surfaces to the dashboard "60% graded" badge). Budget gate at $0.50/cycle default (mostly Haiku). Below-budget before-LLM-call check returns status='warn' without writing the row. Per-domain scorecards are a placeholder for v0.36.0.0 ship state — the F12 batchGetTakesScorecards() engine method that powers per-domain rendering lands in Lane C alongside the CLI/MCP surface. Architecture: parsePatternStatementsOutput is tolerant of LLM emitting numbered lists / bulleted lines despite the prompt asking for plain lines. Caps at 4 patterns + drops excessively long lines (>200 chars). parseBiasTagsOutput lowercases input + drops non-kebab-case tokens (defends against the LLM emitting "Over-Confident Geography" with spaces or capitals). Caps at 4 tags. Tests: 43 cases across two new test files. voice-gate.test.ts (24): parseJudgeOutput (7), gateVoice happy path (3), fallback path (5), mode parity (2), templates (7). calibration-profile.test.ts (19): parsers (10), pickFallbackSlots (3), phase integration (6 — cold-brain skip, happy path, voice gate fallback, grade_completion plumbed through, bias-tags failure non-fatal, source_id scope reaches INSERT). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cli: gbrain calibration + get_calibration_profile MCP op (T7) Public-facing read surface for the v0.36.0.0 calibration wave. CLI prints the active calibration profile; MCP op exposes the same data path for agents. Mirror of the v0.29 salience/anomalies shape (pure data fn + JSON formatter + human formatter + thin CLI dispatch). CLI: `gbrain calibration` Flags: --holder <id> specific holder (default 'garry') --json machine output for piping --regenerate run calibration_profile phase now --undo-wave <ver> [placeholder — wires in Lane D / T17] ab-report [placeholder — wires in Lane D / T18] Human output: Calibration profile — holder: garry, source: default Generated: <local timestamp> [Note: built on 60% graded — partial completion this cycle.] (when grade_completion < 0.9) [Note: voice gate fell back to template (2 attempts).] (when voice_gate_passed=false) Resolved: 12 takes Brier: 0.210 (lower is better) Accuracy: 60.0% Partial: 10.0% Pattern statements: • You called early-stage tactics well — 8 of 10 held up. Active bias tags: over-confident-geography Cold-brain fallback message names the exact dream command to run. MCP: `get_calibration_profile` (scope: read) Param: holder?: string (defaults to 'garry') Returns: latest CalibrationProfileRow | null Source-scoping via sourceScopeOpts(ctx): scalar source-bound clients see only their source; federated_read scopes see the union of allowed sources; no source filter when neither is set (CLI default path). Throws GBrainError('INVALID_HOLDER') on empty/non-string holder so remote callers get a structured error instead of a SQL-shape failure. Architecture: getLatestProfile is the pure data fn — engine + opts → CalibrationProfileRow | null. Reused by both the CLI and the MCP op. Source-scoped via the standard v0.34.1 spread pattern (scalar sourceId vs sourceIds array). formatProfileText is pure — null → cold-brain message, populated → full printout. Annotates partial-grade rows and voice-gate-fallback rows so the operator sees data-quality status inline. parseArgs is exported via __testing for unit coverage. Sub-command ('ab-report') vs flag distinction is intentional — keeps the surface parallel with `gbrain eval cross-modal` etc. Tests: 21 cases. parseArgs (6 cases): empty, --holder, --json, --regenerate, --undo-wave, ab-report. getLatestProfile (5 cases): happy, null, scalar source scope, federated array scope, no-source-filter default. formatProfileText (5 cases): cold-brain, happy, partial-grade note, voice-fallback note, published-to-mounts note. getCalibrationProfileOp (5 cases): default holder, scalar source scope, federated scope union, returns-null-on-unknown-holder, throws on empty holder. Lane D follow-ups: --undo-wave (T17) and ab-report (T18) print a clear "lands in Lane D" stderr line + exit 2; the surfaces exist for early testers, the implementations land next. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * think: --with-calibration + anti-bias prompt rewrite (T8 / E1, D22) Optional anti-bias rewrite mode for `gbrain think`. When set, the active calibration profile gets injected per the D22 placement spec (AFTER retrieval evidence, BEFORE the user's question). The bias filter applies to QUESTION FRAMING, not evidence interpretation — matches LLM-as-judge best practice (bias prompts near end of context perform better). Default behavior unchanged (R1 regression guard): omitting --with-calibration produces the v0.28-vintage user-message shape with the question first, then retrieval. Existing think users see no change. Two user-message shapes in buildThinkUserMessage: Default (no calibration): Question: X <pages>...</pages> <takes>...</takes> <graph>...</graph> Respond with a single JSON object... With calibration (D22): <pages>...</pages> <takes>...</takes> <graph>...</graph> <calibration holder="garry"> Track record: Brier 0.210 (lower is better). Active patterns: - You called early-stage tactics well — 8 of 10 held up. Active bias tags: over-confident-geography </calibration> Question: X Respond... Calibration block is built by buildCalibrationBlock (exported for the E3 contradictions probe to render the same shape). System prompt extension (withCalibration:true): - Names BOTH the user's PRIOR (default reasoning) AND the COUNTER-PRIOR from their hedged-domain self. - References active bias tags by name when relevant ("this fits the over-confident-geography pattern"). - Does NOT silently substitute the debiased answer. ALWAYS surfaces both priors transparently. - Adds a "Calibration" section between Conflicts and Gaps in the answer body. RunThinkOpts extension: - withCalibration?: boolean — opt-in - calibrationHolder?: string — defaults to 'garry' When withCalibration=true and no profile exists, runThink falls back to baseline behavior + pushes NO_CALIBRATION_PROFILE to warnings (visible to the operator). When the calibration fetch fails, CALIBRATION_FETCH_FAILED warning surfaces with the underlying error. Either path keeps think working; the calibration loop is enhancement, not requirement. CLI: `gbrain think "<q>" --with-calibration [--calibration-holder <id>]` Tests: 11 cases. buildThinkSystemPrompt (4 cases): R1 regression — default/false/omitted → no anti-bias rules; with calibration → adds PRIOR + COUNTER-PRIOR + bias-tag reference; preserves existing hard rules. buildCalibrationBlock (3 cases): happy path, null brier omitted (not "Brier null"), empty patterns + tags still well-formed. buildThinkUserMessage (4 cases): R1 regression — without calibration: question first; D22 placement — retrieval → calibration → question → instruction; graph + calibration ordering; empty retrieval blocks render placeholders without breaking shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * contradictions: calibration-profile join (T9 / E3) Cross-references each contradiction finding against the active calibration profile. When a contradiction's domain matches an active bias tag (e.g. "over-confident-geography" or "late-on-macro-tech"), the output gains a one-line bias context explaining which pattern this fits. Pure functions only — no DB writes, no LLM calls. The probe runner imports tagFindingWithCalibration() and applies it to each finding before emitting. When no profile exists or no tags match, the helper returns null and the runner emits the unchanged finding (regression R2 — contradictions output is byte-identical to v0.32.6 when no calibration profile is present). Match heuristic (v0.36.0.0 ship-state): Bias tags are kebab-case axis-then-domain slugs ('over-confident-geography'). computeDomainHint() extracts a domain hint from the finding's slugs + holder + verdict text: - wiki/companies/... → hiring | market-timing - wiki/people/... → founder-behavior - macro / geography / tactics / ai segments in slug → matching tag First-match-wins for ordering determinism. Match is intentionally fuzzy — the v0.32.6 contradictions probe doesn't yet carry structured domain metadata. v0.37+ structured-domain-on-takes (Hindsight-style enum) tightens this. Output: Returns { bias_tag: string, context: string } | null. Context format: "This contradiction fits your active bias pattern \"<tag>\" (Brier 0.31). Verdict: contradiction; severity: medium. Consider reviewing both sides through the lens of that pattern." Tests: 13 cases. R2 regression (2): null profile → null tag; empty active_bias_tags → null tag. computeDomainHint (5): companies / people / macro / geography / unknown paths produce expected hints. Match path (4): macro→late-on-macro-tech, geography→over-confident-geography, mismatch returns null, first-match-wins with multiple candidate tags. buildBiasContextString (2): emits tag+verdict+severity+Brier; omits Brier when null (no "Brier null" leak). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * calibration: Brier-trend forecast at write time (T10 / E5) Pure math layer over existing TakesScorecard data. Zero new LLM cost, zero new schema. Surfaces the user's historical Brier for the take's (holder, domain) bucket at write time so they see "your historical Brier in macro takes is 0.31" before committing the take. Voice-gate-rendered output: The user-facing string goes through gateVoice mode='forecast_blurb' via templates.ts (already in T6). This module is the pure data layer; the template renders the math into the conversational voice. v0.36.0.0 ship state: Bucket dimension is the DOMAIN (slug-prefix). The conviction-weight bucket dimension would need a new engine method (engine.batchGetTakeBucketStats per F11) — deferred to v0.37+. Until then, forecast = historical Brier in this holder's domain. resolveDomainPrefix() keeps slug-prefix-looking domain hints ('companies/', 'wiki/macro') and falls back to overall for free-form hints ('macro tech', 'geography'). Hindsight-style structured domain on takes (CDX-11 mitigation TODO) tightens this in v0.37+. MIN_BUCKET_N = 5: Below this sample size, the forecast returns predicted_brier=null with insufficient_data=true. Template renders "Forecast unavailable: only N resolved takes at this conviction yet" instead of a noisy estimate. Architecture: computeForecast(input) — pure function, takes scorecards already fetched; ideal for tests + reuse across batched paths. forecastForTake(engine, input) — convenience wrapper, 1-2 engine round-trips (no domain → 1; with domain → 2). batchForecast(engine, inputs[]) — memoizes per (holder, domainPrefix); N inputs collapse to ≤2*unique_holders unique engine calls. Used by the propose-queue review flow (50 candidates → 1-2 scorecard fetches). Tests: 14 cases. computeForecast (4): insufficient_data branch, stable forecast, overall fallback, MIN_BUCKET_N export. resolveDomainPrefix (5): undefined/empty/whitespace → undefined; slug-prefix → kept; free-form → undefined. forecastForTake (3): 1-call overall, 2-call domain, free-form fallback. batchForecast (2): cache collapse for repeat queries; different holders do not collapse. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * calibration: gstack-learnings coupling on incorrect resolutions (T11 / E4) When the grade_takes phase auto-resolves a take as 'incorrect' or 'partial', optionally write a learning entry to gstack's per-project learnings.jsonl so other gstack skills (plan-ceo-review, ship, investigate, ...) can pull it as context when relevant. The brain teaches every other tool about the user's track record. Config gate (D5 / CDX-17 mitigation): `cycle.grade_takes.write_gstack_learnings` defaults FALSE. External users may not have gstack installed; the gstack-learnings binary API isn't stable yet. Garry's brain flips it true to opt in. Quality gate: Only 'incorrect' and 'partial' verdicts trigger the write. 'correct' resolutions are noise (we expected the take to hold up — no learning). 'unresolvable' has no canonical column. Defense-in-depth runtime guard in writeIncorrectResolution() rejects ineligible qualities with reason='quality_not_eligible' so a caller misuse never surfaces a malformed learning entry. Auto-apply only: Coupling fires only when grade_takes both auto-applies AND the verdict is incorrect/partial AND the config flag is enabled. Manual resolutions via `gbrain takes resolve` intentionally DO NOT propagate to gstack — manual writes already carry operator intent; the calibration loop is the noise-prone path that earns coupling. Namespace: Every entry's key starts with 'gbrain:calibration:v0.36.0.0:'. Lane D `gbrain calibration --undo-wave v0.36.0.0` (T17) filters on this prefix for the optional gstack-scrub step. First active bias tag suffixes the key (e.g. 'take-42:over-confident-geography') so future analysis can group learnings by bias pattern. Architecture: buildLearningEntry — pure. Truncates claim at 200 chars + ellipsis; emits Pattern: line when activeBiasTags present; defaults confidence to 0.8 when caller omits it. writeIncorrectResolution — async wrapper. Honors config gate; honors quality gate; calls the injected writer (or defaultGstackWriter in production). Failures are non-fatal: returns { written: false, reason: 'write_failed' | 'binary_missing', error }. The grade_takes phase logs to result.warnings and continues — gstack coupling failure NEVER aborts a cycle. defaultGstackWriter — shells out to gstack-learnings-log binary via execFileSync. Throws GBrainError('GSTACK_BINARY_NOT_FOUND') when the binary isn't on PATH; writeIncorrectResolution classifies that error to reason='binary_missing' so the operator sees the install hint instead of a generic write_failed. Wired into grade-takes.ts after engine.resolveTake() inside the auto-apply block. Only fires when shouldApply=true. Tests: 14 cases. buildLearningEntry (7): canonical shape, partial vs incorrect wording, bias-tag suffix, no-tag fallback, claim truncation, default confidence, no-reasoning omission. writeIncorrectResolution (7): config gate, quality gate, happy path, writer-throw graceful degrade, binary-missing classification, async writer awaited, partial quality writes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * doctor: 4 calibration checks — abandoned/freshness/drift/voice (T12) Adds the four calibration doctor checks per the eng-review spec. abandoned_threads: Counts active high-conviction takes (weight >= 0.7) older than 12 months that have never been superseded. Signal, not error — always status='ok' with a count. The hint sends users to `gbrain calibration` for details. calibration_freshness: Warns when the active profile is older than 7 days (configurable via the same env-var pattern other freshness checks use). Cold-brain branch (no profile yet) returns ok without scolding. Hint points at `gbrain calibration --regenerate`. grade_confidence_drift (CDX-11 mitigation): Surfaces the count of auto-applied grade verdicts. Below 30: returns "need 30+ for drift detection". At/above 30: returns "drift math arrives in v0.37+". The surface is wired; the actual confidence-vs-accuracy correlation math is a v0.37+ follow-up once we have 30+ auto-applied verdicts to measure against. Closes the CDX-11 hole structurally — the operator sees the surface even before the math is meaningful. voice_gate_health: Tracks voice gate failure rate over the last 7 days. <30% fail rate → ok (template fallback is fine in isolation). >=30% → warn with hint to review src/core/calibration/voice-gate.ts rubric. Anchors the cross-cutting voice rule observability story. All four checks return status='warn' with a diagnostic message on engine errors — non-blocking, never throws. Matches the existing doctor check pattern (see checkSyncFreshness for prior art). Wired into runDoctor after checkRerankerHealth (the v0.35 cluster), in the canonical block 10 slot. Tests: 15 cases. 4 per check (happy path, alt-status, engine-throw diagnostic, plus boundary tests for the freshness staleness gate at exactly 7 days and the grade drift gate at 30 applied verdicts). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * calibration: E7 nudge + 14-day cooldown (T13 / D16 F3) Real-time pattern surfacing when a newly-committed high-conviction take matches an active bias pattern. Conversational nudge text via the templates module; 14-day cooldown per (take_id, nudge_pattern) via take_nudge_log to prevent the feedback loop where each cycle re-fires the same nudge on the same take. Threshold gates (D16 F3): - holder match (profile.holder === take.holder) - conviction-weight > 0.7 (strict greater than) - take's slug-derived domain hint matches an active bias tag (takeDomainHint — same heuristic as eval-contradictions/calibration-join.ts for cross-surface consistency) Cooldown gate: Before firing, probe take_nudge_log for (take_id, nudge_pattern) rows with fired_at >= now() - 14 days. Any hit → silently skip. After firing, insert a new row with channel='stderr' so the next 14 days are gated. Feedback-loop prevention: User hedges a take in response to a nudge (e.g. weight 0.85 → 0.65). Even though the take's `weight` field changed, the cooldown row for the over-confident-geography pattern is still there from the original fire — so the next cycle's evaluateAndFireNudge() silently skips. The user reset path (gbrain takes nudge --reset N) clears the cooldown to re-arm. Output channel (v0.36.0.0 ship state): STDERR only. Schema's `channel` column already supports multi-channel (webhook, admin SPA toast); routing those is a v0.37+ follow-up. Architecture: evaluateNudgeRule(take, profile) — pure rule check. Returns { matched, reason, matchedTag }. No engine call. checkCooldown(engine, takeId, pattern) — engine probe, returns boolean. recordNudgeFire(engine, opts) — INSERT into take_nudge_log. evaluateAndFireNudge(opts) — full pipeline. Returns NudgeDecision. resetNudgeCooldown(engine, takeId) — DELETE...RETURNING for the CLI. buildNudgeText delegates to templates.ts nudgeTemplate (D24 mode='nudge' voice). v0.36.0.0 ship state uses the template directly; LLM-generated nudge text via the voice gate lands in v0.37+ when we have production examples to tune from. Tests: 22 cases. takeDomainHint (5): companies/people/macro/geography/unrecognized. evaluateNudgeRule (6): no_profile, wrong_holder, conviction-at-threshold- is-NOT-eligible (strict >), no matching tag, happy match, first-match-wins for multiple candidate tags. checkCooldown (3): true on row hit, false on no row, cutoff date param verifies the 14-day boundary. evaluateAndFireNudge (4): happy fire (text contains hush command + matched tag), cooldown silent skip (no INSERT, no stderr), no_profile short-circuit, below-conviction short-circuit (no cooldown query fired). buildNudgeText (2): hush command shape, conviction value embedded. resetNudgeCooldown (2): returns count, idempotent on zero rows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * calibration: E8 team-brain sharing + D18 cross-brain query semantics (T14) Cross-brain calibration profile resolution per the D18 4-rule contract. Pins all four cross-brain leak surfaces in dedicated unit tests so future mount features can't silently regress this security model. D18 semantics (committed): Rule 1 — LOCAL-FIRST ORDERING. Query the local brain first. If a profile exists, return it. Do NOT also query mounts (avoids stale-mount-overrides-fresh-local). Verified: mountResolver is NOT called when local has a hit. Rule 2 — MOUNT FALLBACK. Only when local has no profile AND canReadMounts=true, walk the mounts in priority order. First match wins. Each mount-side row must have published=true to be visible (D15 asymmetric opt-in). Rule 3 — CROSS-BRAIN ATTRIBUTION. Every returned profile carries source_brain_id + from_mount flag. Consumers (E1 think rewrite, E3 contradictions, E7 nudge, E6 dashboard) MUST surface this via attributionSuffix() so the user sees which brain answered. Rule 4 — SUBAGENT PROHIBITION. canReadMountsForCtx() classifier returns FALSE for subagent loops without trusted-workspace allowedSlugPrefixes. Closes the OAuth-token-to-cross-brain-leak surface — subagents see ONLY their local-brain results regardless of which holder they query. Exception: trusted cycle phases (synthesize/patterns) pass allowedSlugPrefixes set and ARE allowed to read mounts. Pinned in the classifier test. Architecture: queryAcrossBrains(localEngine, opts) — pure orchestrator. Composes getLatestProfile() from src/commands/calibration.ts. Mount engine access is via opts.mountResolver — production wires this to the v0.19+ gbrain mounts subsystem; tests inject a stub returning an ordered list of mocked engines. Decouples cross-brain LOGIC from multi-engine PLUMBING. canReadMountsForCtx(ctx) — pure classifier table. Drives the rule-4 gate. Production callers compose it from OperationContext. attributionSuffix(result) — pure formatter. Emits the "(from mounted brain: <id>)" suffix when from_mount=true; empty string when local. Mandatory for user-visible cross-brain consumers. Tests: 15 cases pinned to the 4 D18 rules + 4 supplementary structural checks. D18-1: published=false profile on mount stays hidden. D18-2/3: subagent context cannot fall back to mounts (2 cases — null on local-empty + canReadMounts=false, local hit still returned). D18-4: attribution surfaces source_brain_id (3 cases — mount answer flag, local answer flag, attributionSuffix formatter). Rule 1 local-first ordering (2 cases — mountResolver NOT called on local hit, IS called on local empty). Mount priority order (3 cases — first published=true wins, all published=false returns null, no mounts configured returns null without throwing). canReadMountsForCtx classifier (4 cases — local CLI true, MCP non-subagent true, subagent without trusted-workspace false, subagent WITH trusted-workspace true). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * admin: E6 Calibration tab + D23 server-rendered SVG + TD2 contrast bump (T15) Adds the v0.36.0.0 admin SPA Calibration tab. Per the design review, the approved variant-B (Linear calm clarity) layout: single-column flow, generous whitespace, ONE big sparkline as hero, then patterns, then domain bars, then abandoned threads. D23 server-rendered SVG architecture: src/core/calibration/svg-renderer.ts — pure functions. data → SVG string. No DOM, no React, no chart library dep. Inlines the admin design tokens (#0a0a0f bg, #3b82f6 accent, etc.) so the SVG is visually consistent with the rest of the admin SPA. Four chart renderers: - renderBrierTrend({ series }) — sparkline w/ baseline reference at 0.25 (always-50% baseline) - renderDomainBars({ bars }) — horizontal accuracy bars per domain - renderAbandonedThreadsCard(threads) — D30/TD4 'revisit now' link per row, points at /admin/calibration/revisit/<takeId> - renderPatternStatementsCard(statements) — D29/TD3 clickable drill-down links per row, point at /admin/calibration/pattern/<i> XSS posture: all caller-controlled strings pass through escapeXml(). Numeric inputs are .toFixed()-coerced. Admin SPA renders via dangerouslySetInnerHTML inside a TrustedSVG wrapper component; endpoint is gated by requireAdmin middleware. /admin/api/calibration/profile — returns the active profile row as JSON. /admin/api/calibration/charts/:type — returns image/svg+xml markup for type ∈ {brier-trend, domain-bars, pattern-statements, abandoned-threads}. Cache-Control: private, max-age=60. brier-trend currently renders a single-point series from the active profile (the time-series view across calibration_profiles.generated_at history is a v0.37 follow-up once we have multiple snapshots). abandoned-threads pulls the top 5 abandoned rows via the same SQL the doctor check uses. CalibrationPage React component (admin/src/pages/Calibration.tsx): Fetches profile + 4 charts. Loading / error / cold-brain states all handled. Layout includes the audit annotations (partial-grade badge, voice-gate-fell-back-to-template badge) per the approved mockup. TrustedSVG wrapper isolates the dangerouslySetInnerHTML to the SVG surface only. App.tsx nav: added 'calibration' page route + sidebar nav item, hash routing extended to support #calibration. TD2 contrast bump: admin/src/index.css --text-muted: #555 → #777. Old value was contrast 4.0 on the #0a0a0f bg — below WCAG AA 4.5 for body text. New value is ~5.5, passes AA. Improvement is global across Dashboard, Agents, RequestLog, and the new Calibration tab — single-line CSS change with ~10x the impact. admin/dist/ rebuilt via `bun run build` (vite). 36 modules transformed. Tests: 19 cases in test/svg-renderer.test.ts. escapeXml (1): canonical entities. renderBrierTrend (6): empty state, polyline for 2+ points, clamp beyond yMax, design tokens inlined, XSS safety on date strings, text-anchor end on right label. renderDomainBars (4): empty state, label/accuracy/n rendering, out-of-range accuracy clamp, XSS safety on labels. renderAbandonedThreadsCard (4): empty state, row rendering with revisit link, claim truncation at 70 chars, custom revisitHref override. renderPatternStatementsCard (4): empty state, anchor count matches statement count, XSS safety, custom drillHref override. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * recall: calibration footer formatter for morning pulse (T16) Pure formatter that turns a CalibrationProfileRow + optional abandoned- threads list into the conversational block the morning pulse will surface: Calibration this quarter: Brier 0.18 (solid). Right on early-stage tactics, late on macro by 18 months. Over-confident on team execution; under-calibrated on regulatory risk. Threads you opened and never came back to: · AI search platform differentiation (17 months silent) · International expansion playbook (12 months silent) Cold-brain branch: returns empty string when no profile or < 5 resolved takes. Caller decides whether to render the block; cold-brain absence is the cleanest non-event. Brier trend note maps the absolute value to conversational copy: <= 0.10 → "(strong calibration)" <= 0.20 → "(solid)" <= 0.25 → "(near baseline)" > 0.25 → "(worse than always-50% baseline — review your high-conviction calls)" v0.36.0.0 ship state has only the current profile snapshot. The "was 0.22 90d ago — improving" comparison shape arrives when we accumulate generated_at history across multiple cycles. R3 regression posture: This module is the FORMATTER only. Wiring into `gbrain recall`'s text output is intentionally NOT in this commit — runRecall's surface stays unchanged. v0.37 wires it under --show-calibration (opt-in initially, default-on later). For now the formatter is callable from the admin tab + custom CLI scripts that want it. Architecture: buildRecallCalibrationFooter(opts) — pure. opts.profile required, opts.abandonedThreads optional, opts.threadColumnWidth defaults to 50. Caps at 4 patterns + 5 abandoned threads to keep the footer scannable. Truncates long abandoned-thread claim text to fit the column width with a trailing ellipsis. Tests: 14 cases. Cold-brain branch (3): null profile, < 5 resolved, zero resolved. Happy path (7): header + Brier + patterns, trend note ranges (4 brackets), null brier omits the Brier line but keeps header, caps at 4 patterns. Abandoned threads (4): omit section when none, emit when present, cap at 5, truncate long claim with column-width override. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * calibration: --undo-wave reversal command (T17 / D18 CDX-3) Implements the undo-wave reversal flow. Every new row written by the v0.36.0.0 calibration wave carries wave_version='v0.36.0.0' so a precise revert is possible without touching pre-wave data. CLI surface (replaces the v0.36.0.0 ship-state placeholder): gbrain calibration --undo-wave v0.36.0.0 [--dry-run] [--scrub-gstack] [--json] Reversal scope (4 steps): Step 1 — UNSET takes.resolved_* columns for takes auto-applied by this wave. Identifies wave-applied takes via take_grade_cache.applied=true + wave_version match. Cross-checks resolved_by='gbrain:grade_takes' to ensure we're not un-resolving a take a manual `gbrain takes resolve` override has since claimed. Manual resolutions persist; only auto-grade resolutions revert. Step 1b — Mark take_grade_cache rows applied=false post-undo so the audit trail shows they WERE applied but this wave was reverted. The CDX-11 confidence-drift check filters on applied=true and gets a cleaner sample post-undo. Step 2 — DELETE FROM calibration_profiles WHERE wave_version = ?. Step 3 — DELETE FROM take_nudge_log WHERE wave_version = ?. Step 4 — Optional gstack-learnings-prune via the binary, scoped to the GSTACK_LEARNING_NAMESPACE prefix. Opt-in via --scrub-gstack. Best-effort: binary-missing or failure logs a warning + suggests the manual command; the rest of the undo still succeeded. Dry-run posture: --dry-run computes the counts via SELECT COUNT(*) shapes without emitting any UPDATE or DELETE. Same UndoWaveResult shape returned so operator sees exactly what would be reverted before committing. --dry-run intentionally skips the gstack scrub (filesystem write) too; ship-state safety call. Idempotency: Re-running --undo-wave on a brain that's already reverted is a no-op. Each query filters on wave_version; no matching rows → zero counts. Architecture: undoWave(engine, opts) — async, returns UndoWaveResult. Pure data layer; no stderr writes, no process exits. CLI dispatch in src/commands/calibration.ts handles printing. v0.36.0.0 ship state runs steps 1-3 sequentially (no transaction). Partial reversal is recoverable via re-run since each step is idempotent on wave_version match. A future enhancement (v0.37+) can wrap in engine.transaction once that surface lands in BrainEngine. Tests: 8 cases in test/undo-wave.test.ts. Dry-run posture (1): counts emitted, NO UPDATE/DELETE SQL fired. Happy path (3): all 4 steps execute, resolved_by filter scopes UPDATE to wave-applied resolutions, custom resolvedByLabel honored. Empty wave (2): zero counts when no matching rows, idempotent re-run. Wave-version parameter threading (2): supplied version threads through all queries, different wave versions don't collide. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * calibration: A/B harness for think + ab-report (T18 / D19 CDX-18) Structural answer to CDX-18 (anti-bias rewrite may make advice worse). We don't have to guess whether calibration helps — we measure. Architecture: runAbTrial(input) — calls thinkRunner TWICE on the same question (baseline + --with-calibration), surfaces both answers to a preferenceResolver, persists the trial to think_ab_results. buildAbReport(engine, { days }) — aggregates the table over the last N days (default 30). Computes win counts, ties, neither, and a with_calibration_win_rate over DECISIVE trials only (excludes neither/tie). Flags calibration_net_negative when n >= 20 AND win rate < 45%. formatAbReport(report, days) — pretty-prints for stdout; emits the calibration_net_negative warning block when triggered. CLI: gbrain calibration ab-report [--days N] [--json] Reads the table, prints the breakdown. Replaces the v0.36.0.0 ship-state placeholder in src/commands/calibration.ts. gbrain think --ab "<question>" Wires into runAbTrial via the dispatch in src/commands/think.ts — follow-up commit. This commit lands the harness layer + schema + report surface; the --ab flag itself flips on in a one-line wiring commit when the runRecall path is ready. Schema (migration v72 / think_ab_results): source_id, wave_version, ran_at, question, baseline_answer, with_calibration_answer, preferred (CHECK in {baseline, with_calibration, neither, tie}), model_id, notes. CHECK constraint enforces preferred enum. Default wave_version 'v0.36.0.0' stamped so --undo-wave can scrub these too. Index on (source_id, ran_at DESC) supports the report's "last N days" query. schema.sql + pglite-schema.ts both updated for fresh-install parity. schema-embedded.ts regenerated via build:schema. calibration_net_negative threshold (D19): Triggers when: - decisive_trials (baseline + with_calibration) >= 20 - with_calibration_win_rate < 0.45 (NOT <= — exact 45% is OK) Small-sample guard (n < 20) prevents the warning from firing on early data with sampling noise. Confidence-flat threshold (no Wilson CI yet) keeps the math simple; v0.37+ adds CI bounds. Tests: 12 cases in test/think-ab.test.ts. runAbTrial (4): both runner calls fire, preferenceResolver receives both answers, INSERT row params shape, throws when thinkRunner missing. buildAbReport (5): zero trials, aggregation, net_negative trigger at n>=20 + win<45%, no trigger at n<20 (small-sample guard), no trigger at exact 45% boundary. formatAbReport (3): zero-state message, decisive-trials breakdown, net_negative warning block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * calibration: pattern drill-down route + revisit-now CLI (TD3 / D29 + TD4 / D30) TD3 (D29) — clickable pattern drill-down endpoint: GET /admin/api/calibration/pattern/:id (requireAdmin) Returns the pattern statement at index `id` plus the top 25 resolved takes for the holder, sorted by weight desc. v0.36.0.0 ship-state approximation: surfaces broad provenance evidence (top resolved takes). v0.37+ stores per-pattern source_take_ids[] on a calibration_profile_patterns join table so the drill-down shows the EXACT takes that drove the pattern. Surfaces a `provenance_note` field in the response so the operator sees the v0.36.0.0-vs-v0.37 fidelity boundary inline. The admin SPA's renderPatternStatementsCard SVG already emits anchor tags pointing at /admin/calibration/pattern/<i> (T15 ship state). This route makes those anchors clickable — closes the trust loop that was the rationale for D29 ("pattern statements without their evidence are dressed-up LLM hallucinations"). TD4 (D30) — `gbrain takes revisit <slug>` editor-open action: Adds the `revisit` subcommand to gbrain takes. Opens $EDITOR (falling back to vi) on the source markdown file for the slug. Appends a `` cursor marker at the bottom of the page on first invocation so the editor opens with intent visible. Reads sync.repo_path from config to locate the brain repo. Refuses to proceed with a clear error when the repo isn't configured or the page doesn't exist. spawnSync with stdio:'inherit' so the editor takes the terminal. Exit status surfaced on failure. The SVG renderer's revisit-now anchor for each abandoned thread row emits /admin/calibration/revisit/<takeId>. A small route handler that resolves take_id → page_slug then dispatches `gbrain takes revisit` via spawn is a v0.37 follow-up — the CLI command exists now so developers can wire it directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: DESIGN.md — formalize de facto design tokens (TD1) Promotes the admin SPA's de facto design tokens (landed v0.26.0) to a canonical DESIGN.md at the repo root. This is the calibration target for /plan-design-review and /design-review going forward — when a question is "does this UI fit the system?", the answer is here. Captures the system as it stands today: Voice (5 surfaces, all routed through gateVoice() with mode-specific rubrics): pattern_statement, nudge, forecast_blurb, dashboard_caption, morning_pulse. Friend-not-doctor; concrete data over abstract metrics; no preachy / clinical / corporate language. Color tokens: 10 CSS variables from admin/src/index.css inlined into the SVG renderer (src/core/calibration/svg-renderer.ts). Dark theme is the only theme — admin is an operator tool. WCAG contrast documented per token; TD2's #555 → #777 bump on --text-muted noted. Typography: Inter for UI, JetBrains Mono for numbers/slugs/data. Type scale (18 / 14 / 13 / 12 / 11) documented as de facto, not yet formalized. Spacing scale: 4 / 8 / 16 / 24 / 32px. Linear-app density. Layout: sidebar 200px, max content 720px (text) / 960px (tables). No 3-column feature grids, no icons in colored circles, no decorative blobs. Charts: server-rendered SVG via pure functions in src/core/calibration/svg-renderer.ts. XSS posture documented: server-side escapeXml on caller-controlled strings, numeric inputs .toFixed()-coerced, admin SPA renders via <TrustedSVG> wrapper. Interaction patterns: keyboard nav required (J/K/space/u/q on the propose-queue), loading/empty/error states ARE features. v0.37+ roadmap: type scale formalization, animation tokens, component library extraction. Light mode explicitly NOT planned. The doc is a living target, not a frozen spec. Major changes route through /plan-design-review per the existing review chain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * calibration: synthetic corpus scaffold + privacy CI guard (T19 + T20) T19 — synthetic corpus scaffold for extract-takes prompt tuning. test/fixtures/calibration/extract-takes-corpus/ — 5 representative pages across 4 genres (essay, people, companies, meetings, decisions). v0.36.0.0 ships a SMALL representative corpus as proof of structure; the full 50-page training set + 10-page holdout gets generated by the operator via `gbrain calibration build-corpus` (v0.37 follow-up subcommand) or by hand with the privacy guard catching violations either way. Privacy contract per D13': every page is SYNTHETIC. None of the names/companies/funds/deals/events refer to anything real. Placeholder names per CLAUDE.md: alice-example, charlie-example, acme-example, widget-co, fund-a/b/c, acme-seed, widget-series-a, meetings/2026-04-03. test/fixtures/calibration/README.md spells out the privacy contract, generation flow, and what the corpus is (stable regression set for the extract-takes prompt) vs is not (real anything). T20 — privacy CI guard (CDX-14 mitigation). scripts/check-synthetic-corpus-privacy.sh greps the corpus for: 1. Explicit dollar amounts ($50M, $1.2B etc) — would suggest the page memorized a real round size. 2. Out-of-range year references (informational only for v0.36.0.0; deferred to a manual review checklist). 3. Pages that reference ZERO placeholder names — suggests the page might be referring to real entities. Essay-genre fixtures exempt (they're anonymized PG-style writing by design). Wired into `bun run verify` (CI gate) so contributors can't accidentally land a synthetic fixture that leaks real-world specificity. The intent is fail-fast on accidental leakage; the operator can update the allowlist if a generic dollar amount is intentional. Closes CDX-14: 'CC reads real brain pages locally, writes nothing still risks privacy if any generated synthetic fixture memorizes structure-specific facts. Placeholder names are not enough.' The corpus shipped here is intentionally small but covers the four core gbrain page genres (essay, people, companies, meetings/decisions). The v0.37 corpus-build subcommand will fan out to 50 with the operator spot-checking + the CI guard enforcing the privacy contract. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: R1-R5 IRON RULE regression inventory (T21) Per /plan-eng-review D26 IRON RULE: regressions get added to the test suite as critical requirements, no AskUserQuestion needed. Pins five regressions identified during the v0.36.0.0 wave's coverage diagram: R1: think baseline UNCHANGED when --with-calibration absent. Covered structurally by test/think-with-calibration.test.ts plus assertion-pinned in this file (default user message: question first, then retrieval; system prompt: no anti-bias section). R2: contradictions probe output UNCHANGED when no calibration profile. Covered structurally by test/eval-contradictions-calibration-join.test.ts plus pinned here (null profile → null tag, byte-identical to v0.32.6). R3: takes resolution flow works when grade_takes phase disabled. Pinned import-surface coupling: takes-resolution.ts has zero dependency on grade_takes module. If a future refactor accidentally couples them, this test fails to compile. R4: search/list_pages/get_page work identically through new source_id paths. Marker test referencing existing v0.34.1 source-isolation suite at test/source-isolation-pglite.test.ts. v0.36.0.0 does NOT modify those code paths; the existing tests catch any accidental coupling. R5: existing search modes (conservative/balanced/tokenmax) unaffected. Marker test referencing existing test/search-mode.test.ts. The calibration code DOES NOT IMPORT from src/core/search/mode.ts. Plus an inventory test that confirms all 5 regressions have an 'addressed' status — fail-loud if a future contributor removes a guard without updating the inventory. 7 tests total. Pure functions, no engine, hermetic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: v0.36.0.0 CHANGELOG + CLAUDE.md anchors + calibration convention skill CHANGELOG entry: the user-facing release notes. Leads with the headline ("the brain learns how you tend to be wrong, then argues against your blind spots on every advice call"), 5 'what you can now do' bullets in GStack voice, itemized changes by lane, and the 'To take advantage of v0.36.0.0' upgrade checklist per the CLAUDE.md required-block contract. CLAUDE.md anchors: new 'v0.36.0.0 Hindsight calibration wave (key files cluster)' block inserted before the v0.31.1 thin-client section. 23 new files / extensions annotated with one-paragraph descriptions each, linking back to the convention skill at skills/conventions/calibration.md for the agent-facing rules. skills/conventions/calibration.md: the agent-facing convention skill. Tells future contributors which calibration touchpoint applies to their task — voice gate? BaseCyclePhase? source-scope thread? doctor warning? cross-brain query rules? auto-resolve threshold posture? Test seam patterns. Bug class to avoid (the v0.34.1 source-isolation leak shape). Version trio (per CLAUDE.md mandatory audit): VERSION: 0.36.0.0 package.json: 0.36.0.0 CHANGELOG: ## [0.36.0.0] - 2026-05-17 llms.txt + llms-full.txt regenerated via `bun run build:llms` after the CLAUDE.md edit (per the explicit CLAUDE.md mandate "Any CLAUDE.md edit MUST be followed by `bun run build:llms`"). The `test/build-llms.test.ts` guard runs in CI shard 1; the committed bundles are checked against fresh generator output. bun run verify is clean. typecheck clean. Privacy CI guard passes (0 violations across 6 corpus pages). All ready for /ship. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cycle: wire propose_takes / grade_takes / calibration_profile into runCycle (T-fix) The three new v0.36.0.0 phases were declared in CyclePhase / ALL_PHASES / NEEDS_LOCK_PHASES but the runCycle orchestrator never dispatched them. ALL_PHASES advertised them, gbrain dream --phase propose_takes accepted them, but `gbrain dream` (default) silently skipped all three. Adds a single dispatch block between consolidate and embed that: - builds an OperationContext on the fly (trusted-workspace caller, remote: false, sourceId resolved via the same helper sync uses) - dispatches the three phases in the order ALL_PHASES declares - records the same skipped-phase shape (no_database) when engine is null Pinned by test/core/cycle.serial.test.ts "default: all 6 phases run in order" which was already failing against ALL_PHASES (the test name lags the actual phase count; left as-is since renaming churns history). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * calibration: expand synthetic corpus + add hand-labeled ground-truth (T19) Adds 8 new synthetic pages modeled on the genre mix observed in the real brain (concepts-with-timeline, meeting-notes, daily-journal, people-pages, essays). Companion .gradeable-claims.json files carry hand-labeled answer keys — what a tuned propose_takes prompt SHOULD extract per page. Closes the F1 gate gap from the plan's T19/D19: Training corpus (test/fixtures/calibration/extract-takes-corpus/): + concept-startup-market-dynamics.md (10 claims) + meeting-2026-04-10-fundraise-fund-a.md (6 claims) + daily-2026-04-15.md (5 claims) Blind holdout (test/fixtures/calibration/holdout/): + concept-founder-execution.md (6 claims, F1 >= 0.80) + daily-2026-04-18.md (4 claims, F1 >= 0.80) + meeting-2026-04-17-hiring-charlie.md (5 claims, F1 >= 0.80) + essay-on-conviction.md (7 claims, F1 >= 0.80) + people-bob-example.md (5 claims, F1 >= 0.80) Privacy: - No real-brain content read into any committed artifact. Pages written from scratch using the canonical placeholder set (alice-example, charlie-example, bob-example, acme-example, widget-co, fund-a/b/c). Real-name grep confirms zero leakage: wintermute, garrytan, paul-graham, sam-altman, etc. → 0 hits. - scripts/check-synthetic-corpus-privacy.sh passes: 0 violations across 14 pages (was 6). Genre fidelity: - concept-with-timeline pages mirror the dated-assertion structure real brain uses (verb framing varies: "argues / predicts / I think / I bet / strong conviction / moderate conviction"). - meeting-notes pages carry both prose claims (extracted via hedging language) and explicit ## Takes sections. - daily-journal pages test probabilistic framing ("75/25 in favor", "call it ~0.5") and self-tagged conviction values. - essay-on-conviction is the meta-page that names the author's own bias patterns — primary signal for calibration_profile. - people pages test claim-about-third-party extraction. Each JSON ground-truth lists per-claim: - claim_text + kind (prediction|judgment|bet) + domain - conviction (0..1) - since_date - rationale (why this claim is gradeable + how a tuned prompt should infer conviction from the prose) This is the corpus that gates the T19 prompt-tune iteration: - F1 >= 0.85 on training (10+6+5 = 21 claims across 3 pages plus the existing 5 fixtures already shipped) - F1 >= 0.80 on holdout (27 claims across 5 pages) Plan reference: ~/.claude/plans/system-instruction-you-are-working-rippling-knuth.md Privacy gate: scripts/check-synthetic-corpus-privacy.sh (wired into bun run verify). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * calibration: tune propose…

garrytan mentioned this pull request May 18, 2026

v0.36.1.0 Hindsight calibration wave: brain learns how you tend to be wrong #1139

Merged

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(exports): add ./enrichment to package.json exports map#555

feat(exports): add ./enrichment to package.json exports map#555
chengzehsu wants to merge 1 commit into
garrytan:masterfrom
chengzehsu:feat/exports-map-enrichment

chengzehsu commented May 1, 2026 •

edited by blacksmith-sh Bot

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

chengzehsu commented May 1, 2026 • edited by blacksmith-sh Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

Discovery

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

chengzehsu commented May 1, 2026 •

edited by blacksmith-sh Bot

Loading