
v0.42.11.0 feat(skillopt): held-out eval gate, honest receipts, ENFORCE + ablation opts #1759

Merged
garrytan merged 16 commits into master from garrytan/skillopt-eval-explainer on Jun 3, 2026

garrytan commented Jun 2, 2026


Summary

Makes gbrain skillopt (self-improving skills) honest and tamper-resistant — the SkillOpt eval-readiness wave. Track A of the "prove the feature works" plan; the real-LLM benchmark suite (Track B) lands separately in gbrain-evals.

Held-out validation gate (F11) — now actually wired. runHeldOutGate existed but the orchestrator never called it and --held-out was never parsed. Now --held-out <path> is parsed and threaded through every caller (CLI, batch, fleet, background job, run_skillopt MCP op), and the gate runs at checkpoint acceptance — a candidate that climbs the benchmark but regresses on the independent held-out set is refused, on the mutate AND no-mutate/fleet paths.
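
As a rough illustration of the acceptance rule described above, the sketch below refuses a candidate that improves on the benchmark but regresses on the held-out set. The `Scores` and `GateDecision` shapes, the `heldOutGate` name, and the `tolerance` parameter are assumptions made for this example; the actual `runHeldOutGate` signature is not shown in this PR.

```typescript
// Hypothetical shapes for illustration only; not the real gbrain types.
interface Scores {
  benchmark: number; // score on the benchmark the optimizer climbs
  heldOut: number;   // score on the independent held-out set
}

interface GateDecision {
  accept: boolean;
  reason: string;
}

// Accept only when the benchmark improves AND the held-out score does not regress.
function heldOutGate(baseline: Scores, candidate: Scores, tolerance = 0): GateDecision {
  if (candidate.benchmark <= baseline.benchmark) {
    return { accept: false, reason: "no benchmark improvement" };
  }
  if (candidate.heldOut < baseline.heldOut - tolerance) {
    return { accept: false, reason: "held-out regression" };
  }
  return { accept: true, reason: "improves benchmark without held-out regression" };
}

const decision = heldOutGate(
  { benchmark: 0.5, heldOut: 0.6 },
  { benchmark: 0.9, heldOut: 0.4 },
);
console.log(decision.reason);
```

The point of the second check is that a benchmark gain alone is not trusted: the candidate also has to hold up on tasks it was never tuned against.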

Bundled-skill safety (D16) in core mutation policy. Mutating a shipped skill in place now requires a non-empty (>=5), benchmark-disjoint held-out set or hard-refuses (exit 2) with a proposed.md fallback. Enforced in assertBundledMutationHeldOut, so it fires for every entry point (they all funnel through runSkillOpt).
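
A minimal sketch of the D16 precondition, assuming tasks carry a `task_id` field (as the disjointness fix in this PR suggests) and using the minimum size of 5 quoted above. The function name `assertHeldOutUsable` and the error messages are hypothetical; the real check lives in `assertBundledMutationHeldOut`, whose body is not shown here.

```typescript
// Hypothetical task shape; only task_id matters for this check.
interface Task {
  task_id: string;
}

// Value taken from the PR text; the real constant derives from D_SEL_MIN_SIZE.
const MIN_HELD_OUT_SIZE = 5;

// Throws when the held-out set is too small or shares any task with the benchmark.
function assertHeldOutUsable(benchmark: Task[], heldOut: Task[]): void {
  if (heldOut.length < MIN_HELD_OUT_SIZE) {
    throw new Error(
      `held-out set has ${heldOut.length} tasks; need at least ${MIN_HELD_OUT_SIZE}`,
    );
  }
  const benchmarkIds = new Set(benchmark.map((t) => t.task_id));
  const overlap = heldOut.filter((t) => benchmarkIds.has(t.task_id)).map((t) => t.task_id);
  if (overlap.length > 0) {
    throw new Error(`held-out set overlaps benchmark on: ${overlap.join(", ")}`);
  }
}

const benchmark: Task[] = ["b1", "b2", "b3"].map((task_id) => ({ task_id }));
const heldOut: Task[] = ["h1", "h2", "h3", "h4", "h5"].map((task_id) => ({ task_id }));
assertHeldOutUsable(benchmark, heldOut); // passes: 5 tasks, none shared
```

A caller would translate the thrown error into the hard refusal (exit 2) and the proposed.md fallback described above.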

Honest receipts + final-test. receipt.baseline_sel_score was hardcoded 0 (real value discarded); now populated, plus a real final-test eval (test_score + baseline_test_score) via a shared scoreSkillOnTasks primitive. --no-mutate now writes proposed.md (was a stub); --max-runtime-min is enforced.

Security hardening. The run_skillopt MCP op validates skill_name (kebab-only) and confines caller-supplied benchmark/held-out paths to the skills dir for remote callers — closes an arbitrary-file-read / existence-oracle for admin OAuth tokens.
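
The two checks above can be sketched roughly as follows, using Node's `fs` and `path` modules. The function names (`assertKebab`, `canonicalize`, `confineToDir`) and the exact kebab pattern are assumptions for illustration, not the code in this PR. Canonicalizing the nearest existing ancestor lets the check work for files that do not exist yet and for a skills directory reached through a symlink.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Assumed kebab-case pattern: lowercase alphanumerics separated by single hyphens.
const KEBAB = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;

function assertKebab(name: string): string {
  if (!KEBAB.test(name)) throw new Error(`invalid skill_name: ${name}`);
  return name;
}

// Resolve symlinks on the nearest ancestor that exists, then re-append the rest.
function canonicalize(p: string): string {
  let current = path.resolve(p);
  const tail: string[] = [];
  while (!fs.existsSync(current)) {
    const parent = path.dirname(current);
    if (parent === current) break; // reached the filesystem root
    tail.unshift(path.basename(current));
    current = parent;
  }
  return path.join(fs.realpathSync(current), ...tail);
}

// Returns the canonical target, or throws if it escapes baseDir.
function confineToDir(baseDir: string, candidate: string): string {
  const base = canonicalize(baseDir);
  const target = canonicalize(path.resolve(base, candidate));
  const rel = path.relative(base, target);
  if (rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel)) {
    throw new Error(`path escapes skills dir: ${candidate}`);
  }
  return target;
}

// Demo against a throwaway directory standing in for the skills dir.
const demoSkillsDir = fs.mkdtempSync(path.join(os.tmpdir(), "skills-"));
console.log(confineToDir(demoSkillsDir, `${assertKebab("my-skill")}/held-out.jsonl`));
```

Validating the name first means any path derived from it stays inside the directory by construction; the confinement check then covers paths the caller supplies directly.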

Eval-internal ablation opts (NOT on the CLI): reflectMode, disableValidationGate, optimizerMode ('one-shot-rewrite'), recorded in the receipt + audit for replayability. They drive the cat30/31/32/33 SkillOpt benchmark suite in gbrain-evals.

Test Coverage

New test/skillopt/rollout.test.ts (rollout had zero coverage); held-out ENFORCE + one-shot-rewrite unit cases; e2e for F11 block/allow, bundled no-mutate write, the three ablation opts, maxRuntimeMin abort, receipt-score honesty, held-out/benchmark disjointness, and D2 no-DB-pollution. Targeted suite (skillopt + operations trust-boundary + autocut/search-mode merge seam + core): 355 pass / 0 fail. Full skillopt dir: 211 pass. bun run verify: 29/29. bun run typecheck: clean.

Note: the full sharded unit suite was environmentally slow locally (PGLite-heavy + loaded machine); CI runs the full shard matrix on this PR.

Pre-Landing Review

6 findings (3 critical, 3 informational), all actioned:

  • [FIXED] [security] run_skillopt MCP op passed benchmark/held-out paths to readFileSync with no confinement → arbitrary file read for remote admin callers. Added kebab skill_name validation + path confinement to skillsDir for remote callers.
  • [FIXED] [testing] maxRuntimeMin abort branch + receipt baseline/test-score (the headline fix) were unasserted → added tests.
  • [FIXED] [maint] MIN_HELD_OUT_SIZE now derives from D_SEL_MIN_SIZE; 0.5 → ROLLOUT_SUCCESS_THRESHOLD; MCP param held_out → held_out_path (consistency).

Adversarial Review

Claude + Codex both flagged (consensus, block-until-fixed):

  • [FIXED] skill_name path traversal — confinement covered caller-supplied paths but not the skillName-derived defaults. Now validated kebab-only at the op boundary, so derived paths are contained by construction.
  • [FIXED] One-shot fence-strip truncation — the non-anchored fence regex could truncate a body containing a code sample. Now only unwraps a whole-response fence.
  • [FIXED] Held-out/benchmark non-disjointness — pointing --held-out at a copy of the benchmark voided the gate. Now rejected on task_id overlap.
  • [FIXED] Confinement false-block under symlinked skillsDir (macOS /tmp → /private/tmp, Conductor worktrees): canonicalize the nearest existing ancestor.
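
To illustrate the fence-strip finding, here is a minimal sketch of an anchored unwrap: the fence is removed only when it opens on the first line and closes on the last line of the trimmed response, so a body that merely contains a code sample is returned untouched. The function name is hypothetical and the real regex in the PR is not shown. The sketch does not try to disambiguate a response made of two separate fenced blocks, which would need extra handling.

```typescript
// Built with repeat() so the sketch itself contains no literal fence marker.
const FENCE = "`".repeat(3);

// Anchored at both ends: opening fence (with optional info string) first, closing fence last.
const WHOLE_RESPONSE_FENCE = new RegExp(
  "^" + FENCE + "[^\\n]*\\n([\\s\\S]*?)\\n" + FENCE + "$",
);

function unwrapWholeResponseFence(response: string): string {
  const match = WHOLE_RESPONSE_FENCE.exec(response.trim());
  // No whole-response fence: return the text exactly as received.
  return match ? match[1] : response;
}

// A response that is entirely one fenced block gets unwrapped.
const wrapped = FENCE + "markdown\n# Skill\nbody\n" + FENCE;
console.log(unwrapWholeResponseFence(wrapped));

// A body with an embedded code sample is left intact.
const embedded = "Use this:\n" + FENCE + "ts\nconst x = 1;\n" + FENCE + "\nDone.";
console.log(unwrapWholeResponseFence(embedded) === embedded);
```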

Plan Completion

Track A complete (T0 North Star, T1 load-bearing fixes, T2 ablation opts, T3 tests). Track B (real-LLM cat30/31/32/33 in gbrain-evals) is a separate repo/PR, intentionally not in this PR.

TODOS

Added 4 v0.42+ SkillOpt follow-ups (promoteCandidate DRY extraction, bundled-detection hardening, preflight ablation-opt awareness, maxRuntimeMin in held-out/final-test phases).

Test plan

  • bun run typecheck clean
  • bun run verify 29/29
  • Targeted suite 355 pass / 0 fail (skillopt + operations trust-boundary + merge seam + core)
  • Full skillopt dir 211 pass
  • Full sharded unit matrix — runs on CI (slow locally)

🤖 Generated with Claude Code

Documentation

Docs synced for the v0.42.9.0 skillopt eval-readiness wave: skills/skill-optimizer/SKILL.md (bundled-skill held-out requirement + honest receipt fields), docs/guides/skillopt.md (--held-out flag + F11 gate + D16 row), docs/tutorials/improving-skills-with-skillopt.md (Step 5 bundled command now shows --allow-mutate-bundled --held-out). CHANGELOG/CLAUDE.md/TODOS/llms updated in the release commit.


Also in this PR: CLAUDE.md thin-resolver restructure (folded in)

Separate concern from skillopt, folded into this batch as atomic, bisect-friendly commits
(75992b77, c825ef8f, 163f044e, fa2f9de2).

Problem: CLAUDE.md had grown to 591,854 bytes (~147k tokens auto-loaded every session,
~77% of the llms-full.txt one-fetch bundle, which had just blown its 750KB budget). Root cause
was structural: the per-file index + command/test sections were append-only by mandate, so every
release chained another **vX.Y.Z:** clause forever.

Fix: CLAUDE.md becomes a thin orientation + resolver (gbrain's own thin-dispatcher/fat-detail
pattern). The per-file index, thin-client routing, test discipline, and the verbose release
process move to on-demand docs; CLAUDE.md keeps the North Star, architecture + cross-cutting
invariants, the IRON RULES, and a reference map that routes to detail.

| Commit | What |
|---|---|
| 75992b77 | CI-cache prerequisite: ci-cache-hash.sh keeps relocated policy docs (docs/TESTING.md, docs/RELEASING.md) test-affecting so a change to them still invalidates the cache (closes a false-pass path before any policy moved). Pinned by tests. |
| c825ef8f | Verbatim relocation → docs/architecture/KEY_FILES.md, docs/architecture/thin-client.md, docs/TESTING.md; resolver + cross-cutting invariants lifted into CLAUDE.md. Content-preserving. |
| 163f044e | Compress reference docs to current-state (393 entries, every invariant preserved) + scripts/check-key-files-current-state.sh recurrence guard (bans **v0. chains + CLAUDE.md size cap, wired into verify) + content-contract tests + revert the bundle band-aid. |
| fa2f9de2 | Verbose release process → docs/RELEASING.md; ship IRON RULES + version-locations table kept inline. |

Result: CLAUDE.md 591,854 → 39,181 bytes (93%), llms-full.txt 740KB → 204KB.
Zero src/ changes. The bloat cannot recur — verify fails on re-introduced append-history or
an over-cap CLAUDE.md.

Reviews: eng-review (6 findings, folded) + codex outside-voice (11 findings, 9 folded as
mandatory hardening incl. the CI-cache fix, content-contract tests, measured bundle).

Verification (docs delta is zero-src): bun run verify 30/30; build-llms drift+budget,
doc-history guard, ci-cache contract = 46/46; targeted doc-touching suite (public-exports,
resolver, skill-trigger-index) 199/0.

garrytan and others added 16 commits June 1, 2026 20:37
…on opts

Wire the F11 held-out gate into the orchestrator at checkpoint acceptance
(runHeldOutGate was dead code); parse + thread --held-out through CLI, batch,
fleet, background job, and the run_skillopt MCP op. Populate the real
receipt.baseline_sel_score (was hardcoded 0) and add a final-test eval
(test_score + baseline_test_score) via a shared scoreSkillOnTasks primitive.
Fix the --no-mutate proposed.md write (was a stub) and enforce maxRuntimeMin.

D16 ENFORCE in core mutation policy (assertBundledMutationHeldOut): mutating a
bundled skill in place requires a non-empty (>=5), benchmark-disjoint held-out
set or hard-refuses. Add three eval-internal ablation opts (reflectMode,
disableValidationGate, optimizerMode='one-shot-rewrite') recorded in the
receipt + audit; ROLLOUT_SUCCESS_THRESHOLD named constant.

Security: run_skillopt MCP op validates skill_name (kebab-only) and confines
caller-supplied benchmark/held-out paths to the skills dir for remote callers.
…eceipt honesty

New test/skillopt/rollout.test.ts (rollout had zero coverage). Held-out ENFORCE
unit cases + one-shot-rewrite fence handling (whole-response unwrap, embedded-fence
preserved, error path). E2E: F11 held-out BLOCKS/ALLOWS, bundled no-mutate write,
reflectMode/disableValidationGate/optimizerMode, maxRuntimeMin abort, receipt
baseline/test-score honesty, held-out/benchmark disjointness, D2 no-DB-pollution.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…for v0.42.9.0

Wire --held-out into the skill-optimizer SKILL.md, guide flags/safety tables, and
the tutorial's bundled-skill step: mutating a bundled skill in place now requires
--allow-mutate-bundled AND --held-out (>=5 benchmark-disjoint tasks) or it
hard-refuses. Add the --held-out flag row + F11 held-out gate to the guide; update
the receipt contract to the honest baseline/test-score fields.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…val-explainer

# Conflicts:
#	CHANGELOG.md
#	TODOS.md
#	VERSION
#	package.json
…again

The ai@6.x bump tightened ModelMessage + tool-schema validation, which
silently broke every multi-turn tool loop. Both `gbrain skillopt` rollouts
and production background `subagent` jobs route through `chat()`/`toolLoop`
and crashed the moment the model called a tool ("messages do not match the
ModelMessage[] schema" / "schema is not a function"). Surfaced end-to-end
by the SkillOpt real-LLM eval.

Three fixes:
- chat(): wrap tool defs with the SDK's `jsonSchema()` helper instead of a
  bare `{jsonSchema}` object (v6 asSchema() treated the bare object as a
  thunk and threw).
- chat(): new exported pure `toModelMessages()` converts gbrain's
  provider-neutral ChatMessage[] into v6 ModelMessage[] — tool results ride
  a dedicated `role:'tool'` message with structured `{type,value}` output;
  null output preserved as json null. Load-bearing for the production
  subagent path, not just skillopt.
- rollout.ts: replace the inline params→schema mapper (dropped `items` on
  array params) with the shared `paramDefToSchema` single source of truth.
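
The message conversion described in the second fix might look roughly like the sketch below. All types here are local stand-ins invented for the example: the `ChatMessage` shape is hypothetical, and `SketchModelMessage` only mirrors the structure the commit message describes (a dedicated `role:'tool'` message carrying a structured `{type, value}` output). It is not the SDK's own type definition, and the text/json split is an assumption.

```typescript
// Hypothetical provider-neutral message shape, invented for this sketch.
type ChatMessage =
  | { role: "user" | "assistant"; text: string }
  | { role: "tool_result"; toolCallId: string; toolName: string; output: unknown };

// Structured tool output as described in the commit message: {type, value}.
type ToolOutput =
  | { type: "text"; value: string }
  | { type: "json"; value: unknown };

// Local stand-in for the target message type, not imported from any SDK.
type SketchModelMessage =
  | { role: "user" | "assistant"; content: string }
  | {
      role: "tool";
      content: Array<{
        type: "tool-result";
        toolCallId: string;
        toolName: string;
        output: ToolOutput;
      }>;
    };

function toToolOutput(output: unknown): ToolOutput {
  if (typeof output === "string") return { type: "text", value: output };
  // null and undefined both become an explicit JSON null instead of being dropped.
  return { type: "json", value: output ?? null };
}

// Pure function: no I/O, so it can be unit-tested in isolation.
function toModelMessages(messages: ChatMessage[]): SketchModelMessage[] {
  return messages.map((m): SketchModelMessage => {
    if (m.role === "tool_result") {
      return {
        role: "tool",
        content: [
          {
            type: "tool-result",
            toolCallId: m.toolCallId,
            toolName: m.toolName,
            output: toToolOutput(m.output),
          },
        ],
      };
    }
    return { role: m.role, content: m.text };
  });
}

const converted = toModelMessages([
  { role: "user", text: "find the page" },
  { role: "tool_result", toolCallId: "c1", toolName: "search", output: null },
]);
console.log(JSON.stringify(converted[1]));
```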

Pinned by test/gateway-model-messages.test.ts (8 cases). Folds into the
open v0.42.9.0 PR (#1759) — these complete the eval-readiness wave by
making skillopt actually run against a live model.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…out 0

Surfaced by the SkillOpt real-LLM eval (Track B). Two coupled bugs that made
a budget-capped Haiku run report a vacuous "0/N" measurement in ~2ms with
zero LLM calls — indistinguishable from a real deficient-skill score:

1. Claude Haiku 4.5's canonical dateless id (`claude-haiku-4-5`) was missing
   from anthropic-pricing.ts (only the dated `-20251001` was present). With
   `--max-cost` set, BudgetTracker.reserve() threw no_pricing on the FIRST
   chat() of every rollout. Added the dateless entry (sonnet already had its
   dateless form).
2. runValidationGate swallowed that BUDGET_EXHAUSTED error — runWithLimit
   settled it as {ok:false}, which the gate turned into median:0. A pricing/cap
   crash became a fake score. The gate now scans settled results for
   isMustAbortError() and re-throws so the caller aborts loudly; ordinary
   (non-abort) rollout errors still fail-open to 0 (judge-hiccup posture kept).
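
The second fix can be sketched as below: before taking the median, the gate scans the settled results and re-throws anything marked as must-abort, while ordinary failures still count as 0. The `Settled` shape, the `mustAbort` marker property, and the `gateMedian` name are assumptions for this example; only `isMustAbortError` is named in the commit message, and its real implementation is not shown.

```typescript
// Hypothetical settled-result shape, similar in spirit to Promise.allSettled.
type Settled<T> = { ok: true; value: T } | { ok: false; error: unknown };

// Assumed convention for this sketch: abort-class errors carry mustAbort === true.
function isMustAbortError(e: unknown): boolean {
  return typeof e === "object" && e !== null && (e as { mustAbort?: unknown }).mustAbort === true;
}

function median(values: number[]): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function gateMedian(results: Settled<number>[]): number {
  // Budget or pricing failures must surface loudly, never turn into a score.
  for (const r of results) {
    if (!r.ok && isMustAbortError(r.error)) throw r.error;
  }
  // Ordinary rollout errors keep the fail-open posture and count as 0.
  return median(results.map((r) => (r.ok ? r.value : 0)));
}

const budgetError = Object.assign(new Error("BUDGET_EXHAUSTED: no_pricing"), { mustAbort: true });
const hiccup = new Error("judge timeout");

console.log(gateMedian([{ ok: true, value: 1 }, { ok: false, error: hiccup }, { ok: true, value: 1 }]));
```

With this split, a run that never reached the model fails visibly instead of reporting a "0/N" that looks like a real measurement.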

Pinned by test/skillopt/validate-gate-abort.test.ts (3 cases). Folds into the
open v0.42.9.0 PR (#1759).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…om full bundle

The toolLoop + budget bug-fix annotations grew CLAUDE.md, pushing llms-full.txt
to 756KB over the 750KB FULL_SIZE_BUDGET (the `build-llms > size budget` test
failed, failing the `test` CI job). CLAUDE.md stays inlined by design (it's the
point of the one-fetch bundle), so per the budget comment's own guidance ("ship
with includeInFull=false exclusions") this excludes docs/what-schemas-unlock.md
(15.4KB value-explainer, not load-bearing operational reference) from
llms-full.txt; it stays linked in llms.txt. Bundle now 740KB with ~9KB headroom.
No budget bump — 750KB is near the ~190k-token-context fit ceiling.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
docs/**/*.md is deny-listed from the CI cache hash (test-irrelevant). The
CLAUDE.md restructure moves test/release POLICY into docs/TESTING.md +
docs/RELEASING.md, which DO carry contracts the test suite reads. Without
re-admitting them, a policy-only edit would produce the same cache hash and
skip the test shard that runs the build-llms + doc-history guards (false-pass).

Adds an ALLOW_PATTERNS re-admit step after the deny, scoped to the named
policy docs (not a blanket docs un-deny). Lands FIRST, before any doc moves.

Pinned by 3 new cases in test/scripts/ci-cache-hash.test.ts: TESTING.md +
RELEASING.md edits MUST change the hash; docs/guide.md still must not.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…UDE.md (verbatim)

CLAUDE.md had grown to 592KB / ~147k tokens auto-loaded every session (~77% of
the llms-full.txt single-fetch bundle). The per-file index was append-only by
mandate. This is the exact thin-dispatcher-vs-fat-blob anti-pattern gbrain exists
to fix, so CLAUDE.md becomes a thin orientation + resolver that points at
on-demand docs.

This commit is the VERBATIM move (content-preserving — the next commit compresses):
- docs/architecture/KEY_FILES.md   <- ## Key files + the calibration key-files
  cluster + Schema Cathedral v3 impl detail
- docs/architecture/thin-client.md <- ## Thin-client routing
- docs/TESTING.md                   <- ## Testing
- ## Commands DROPPED (18 'added in vX.Y' history blocks; current surface is the
  gbrain --help output, which the commit message reproduces here:

gbrain 0.41.38.0 -- personal knowledge brain

USAGE
  gbrain <command> [options]

SETUP
  init [--pglite|--supabase|--url]   Create brain (PGLite default, no server)
  migrate --to <supabase|pglite>     Transfer brain between engines
  upgrade                            Self-update
  check-update [--json]              Check for new versions
  doctor [--json] [--fast]            Health check (resolver, skills, pgvector, RLS, embeddings)
  integrations [subcommand]          Manage integration recipes (senses + reflexes)

PAGES
  get <slug>                         Read a page
  put <slug> [< file.md]             Write/update a page
  delete <slug>                      Delete a page
  list [--type T] [--tag T] [-n N]   List pages

SEARCH
  search <query>                     Keyword search (tsvector)
  query <question> [--no-expand]     Hybrid search (RRF + expansion)
  ask <question> [--no-expand]       Alias for query

IMPORT/EXPORT
  import <dir> [--no-embed]          Import markdown directory
  sync [--repo <path>] [flags]       Git-to-brain incremental sync
  sync --watch [--interval N]        Continuous sync (loops until stopped)
  sync --install-cron                Install persistent sync daemon
  export [--dir ./out/]              Export to markdown
  export --restore-only [--repo <p>] Restore missing supabase-only files
        [--type T] [--slug-prefix S] With optional filters

FILES
  files list [slug]                  List stored files
  files upload <file> --page <slug>  Upload file to storage
  files upload-raw <file> --page <s> Smart upload (size routing + .redirect.yaml)
  files signed-url <path>            Generate signed URL (1-hour)
  files sync <dir>                   Bulk upload directory
  files verify                       Verify all uploads

EMBEDDINGS
  embed [<slug>|--all|--stale]       Generate/refresh embeddings

LINKS
  link <from> <to> [--type T]        Create typed link
  unlink <from> <to>                 Remove link
  backlinks <slug>                   Incoming links
  graph <slug> [--depth N]           Traverse link graph (returns nodes)
  graph-query <slug> [--type T]      Edge-based traversal with type/direction filters
        [--depth N] [--direction in|out|both]

TAGS
  tags <slug>                        List tags
  tag <slug> <tag>                   Add tag
  untag <slug> <tag>                 Remove tag

TIMELINE
  timeline [<slug>]                  View timeline
  timeline-add <slug> <date> <text>  Add timeline entry

TOOLS
  extract <links|timeline|all>       Extract links/timeline (idempotent)
        [--source fs|db]             fs (default) walks .md files; db iterates engine pages
        [--dir <brain>]              brain dir for fs source
        [--type T] [--since DATE]    filters (db source)
        [--dry-run] [--json]
  publish <page.md> [--password]     Shareable HTML (strips private data, optional AES-256)
  check-backlinks <check|fix> [dir]  Find/fix missing back-links across brain
  lint <dir|file> [--fix]            Catch LLM artifacts, placeholder dates, bad frontmatter
  orphans [--json] [--count]         Find pages with no inbound wikilinks
  salience [--days N] [--kind P]     v0.29: pages ranked by emotional + activity salience
  anomalies [--since D] [--sigma N]  v0.29: cohort-based statistical anomalies (tag, type)
  transcripts recent [--days N]      v0.29: recent raw .txt transcripts (local-only)
  dream [--dry-run] [--json]         Run the overnight maintenance cycle once (cron-friendly).
                                     See also: autopilot --install (continuous daemon).
  check-resolvable [--json] [--fix]  Validate skill tree (reachability/MECE/DRY)
  report --type <name> --content ... Save timestamped report to brain/reports/

BRAIN (capture / ideate / explore — v0.37/v0.38)
  capture [content] [--file PATH]    Single entrypoint for getting content into the brain
        [--stdin] [--slug s] [--type t]   Inline content / file / stdin; writes to inbox/ by default
        [--source ID] [--quiet|--json]    Multi-source brains: route to a non-default source
  brainstorm <question> [--json]     Bisociation idea generator (hybrid search + far-set + judge)
        [--save|--no-save] [--limit N]
  lsd <question> [--json]            Lateral Synaptic Drift: inverted-judge brainstorm
        [--save|--no-save] [--limit N]    rewarding far-from-obvious + axiomatic inversions

SOURCES (multi-repo / multi-brain)
  sources list                       Show registered sources
  sources add <id> --path <p>        Register a source (id = short name, e.g. 'wiki')
  sources remove <id>                Remove a source + its pages
  sync --all                         Sync all sources with a local_path
  sync --source <id>                 Sync one specific source
  repos ...                          DEPRECATED alias for 'sources' (v0.19.0)

CODE INDEXING (v0.19.0 / v0.20.0 Cathedral II)
  code-def <symbol> [--lang l]       Find the definition of a symbol across code pages
  code-refs <symbol> [--lang l]      Find all references to a symbol (JSON-first)
  code-callers <symbol>              Who calls this symbol? (v0.20.0 A1)
  code-callees <symbol>              What does this symbol call? (v0.20.0 A1)
  query <q> --lang <l>               Filter hybrid search to one language (v0.20.0)
  query <q> --symbol-kind <k>        Filter to symbol type (function|class|method|...) (v0.20.0)
  reconcile-links [--dry-run]        Batch-recompute doc↔impl edges (v0.20.0)
  reindex-code [--source id] [--yes] Explicit code-page reindex (v0.20.0)
  sync --strategy code               Sync code files into the brain

JOBS (Minions)
  jobs submit <name> [--params JSON]  Submit background job [--follow] [--dry-run]
  jobs list [--status S] [--limit N]  List jobs
  jobs get <id>                       Job details + history
  jobs cancel <id>                    Cancel job
  jobs retry <id>                     Re-queue failed/dead job
  jobs prune [--older-than 30d]       Clean old jobs
  jobs stats                          Job health dashboard
  jobs work [--queue Q]               Start worker daemon (Postgres only)

ADMIN
  stats                              Brain statistics
  health                             Brain health dashboard
  history <slug>                     Page version history
  revert <slug> <version-id>         Revert to version
  features [--json] [--auto-fix]     Scan usage + recommend unused features
  autopilot [--repo] [--interval N]  Self-maintaining brain daemon
  config [show|get|set] <key> [val]  Brain config
  storage status [--repo <path>]     Storage tier status and health
        [--json]                     (git-tracked vs supabase-only)
  serve                              MCP server (stdio)
  serve --http [--port N]            HTTP MCP server with OAuth 2.1
    --token-ttl N                    Access token TTL in seconds (default: 3600)
    --enable-dcr                     Enable Dynamic Client Registration
    --public-url URL                 Public issuer URL (required behind proxy/tunnel)
  call <tool> '<json>'               Raw tool invocation
  version                            Version info
  --tools-json                       Tool discovery (JSON)

Run gbrain <command> --help for command-specific help.

  + the per-command KEY_FILES entries; content stays in git)

CLAUDE.md gains: a Reference map (resolver), a Maintaining section (the
anti-disease rule), and a Cross-cutting invariants subsection under Architecture
so the must-never-violate rules (trust fail-closed, sourceScopeOpts isolation,
JSONB trap, engine parity, contract-first, migrations, multi-source) still
auto-load after the index moved out.

Result: CLAUDE.md 592KB -> 61KB; llms-full.txt 740KB -> 210KB (new docs link-only
until compressed). build-llms drift + budget test green; verify 29/29 green.
The pre-move content is recoverable at git show <this^>:CLAUDE.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ence guard

Compresses the verbatim-relocated reference docs from append-only release-history
to current-state-only (the disease cure), then makes recurrence structurally
impossible via a CI guard.

Compression (fan-out subagents + adversarial verify, audited mechanically):
- KEY_FILES.md 453KB -> 356KB; TESTING.md 42KB -> 38KB; thin-client.md already clean.
- 393/393 entries preserved; every src/test/scripts path from the verbatim original
  survives (mechanical comm-check); zero bolded **v0. markers remain.
- Conservative ratio (~22%) because the content is invariant-dense — correctness
  over brevity. Dropped: **vX.Y.Z (#NNN):** clauses, codex/review tags, contributor
  credits, PR-numbers-as-ids, pre-fix/then/was-now history deltas. Kept: every
  exported symbol, invariant, and Pinned-by reference. Verbatim original recoverable
  at git show <relocation-commit>:docs/architecture/KEY_FILES.md.

Recurrence guard (scripts/check-key-files-current-state.sh, wired into verify + check:all):
- HARD: bans the bolded **v0.<digit> marker in the reference docs (scoped — plain
  'as of pgvector 0.7' prose is fine, no false positives).
- HARD: CLAUDE.md size cap (90KB; currently 61KB) — the structural backstop.
- Pinned by test/scripts/check-key-files-current-state.test.ts (7 cases).
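
The actual guard is a shell script; purely as an illustration of its two hard checks, the TypeScript sketch below flags any line containing a bolded `**v0.<digit>` marker and reports when CLAUDE.md exceeds a byte cap. The function name, the violation message format, and the reading of "90KB" as 90 × 1024 bytes are assumptions.

```typescript
// Matches the bolded release marker only; plain prose such as "pgvector 0.7" does not match.
const HISTORY_MARKER = /\*\*v0\.\d/;

function byteLength(text: string): number {
  return new TextEncoder().encode(text).length;
}

// Returns a list of human-readable violations; an empty list means the guard passes.
function checkCurrentState(
  referenceDocs: Record<string, string>,
  claudeMd: string,
  capBytes: number,
): string[] {
  const violations: string[] = [];
  for (const [name, text] of Object.entries(referenceDocs)) {
    text.split("\n").forEach((line, index) => {
      if (HISTORY_MARKER.test(line)) {
        violations.push(`${name}:${index + 1}: bolded version marker`);
      }
    });
  }
  const size = byteLength(claudeMd);
  if (size > capBytes) {
    violations.push(`CLAUDE.md is ${size} bytes, over the ${capBytes}-byte cap`);
  }
  return violations;
}

const found = checkCurrentState(
  { "KEY_FILES.md": "current state\n**v0.42.9.0:** added x\nas of pgvector 0.7" },
  "thin resolver",
  90 * 1024, // assumed interpretation of the 90KB cap
);
console.log(found);
```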

Content contracts (test/build-llms.test.ts, +5 cases per codex outside-voice):
CLAUDE.md keeps inline ship IRON RULES (version format, document-release,
never-hand-roll); AGENTS.md keeps its boot order; llms indexes the new docs;
KEY_FILES stays link-only (not inlined).

Privacy: scrubbed the relocated 'wintermute/chat/' source-boost examples + the
literal harvest-lint regex to generic placeholders (legitimate in allowlisted
CLAUDE.md; genericized for the new public docs per the privacy rule).

Reverts the 284c50a band-aid: re-inlines docs/what-schemas-unlock.md now that the
restructure freed ~530KB of bundle headroom (llms-full.txt 740KB -> 225KB).

verify 30/30 green (incl. new check:doc-history).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The highest-/ship-risk commit (isolated so it can revert alone). Moves the verbose
release + contributor procedure out of CLAUDE.md, keeping every ship-critical IRON
RULE inline so /ship + /document-release (which read CLAUDE.md) cannot regress.

Moved to docs/RELEASING.md: pre-ship test requirements; the CHANGELOG-branch-scoped
+ CHANGELOG voice + release-summary template; the 'To take advantage of vX' block
spec; version migrations + migration-is-canonical; schema state tracking; GitHub
Actions SHA maintenance; PR-descriptions-cover-the-branch; community-PR-wave;
checking-out-PRs-from-garrytan-agents.

Kept INLINE in CLAUDE.md (ship-critical IRON RULES — do NOT move):
- the Version-locations table (5-file sync) + the 3-line consistency audit
- Conductor branch=workspace
- Post-ship /document-release (MANDATORY)
- Privacy + Responsible-disclosure rules (Privacy also anchors the check-privacy
  allowlist — the only place allowed to name the fork)
- PR-title-version-first
- never-hand-roll-ship (Skill routing)
Plus a new ## Releasing pointer ('Before any ship, read docs/RELEASING.md in full')
and a resolver row.

CLAUDE.md 61KB -> 39KB (592KB -> 39KB overall, 93% cut; ~9k tokens auto-loaded vs
~147k). CLAUDE.md size-gate tightened 90KB -> 60KB. The content-contract tests pin
that the inline IRON RULES (MAJOR.MINOR.PATCH.MICRO, document-release, hand-roll
ship) did NOT move out. The moved ranges carry no banned fork name, so RELEASING.md
needs no privacy allowlist entry. verify 30/30; bundle 225KB -> 204KB.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The CLAUDE.md thin-resolver restructure (592KB → 39KB) rides in this
release; record it under the existing v0.42.9.0 For-contributors section.
No version bump — v0.42.9.0 is unreleased and already allocated to this PR.
…grep

The policy-doc re-admit (75992b7) put `\t` inline in the ALLOW patterns
passed to `grep -E`. BSD grep (macOS local) treats `\t` as a tab so it
worked locally; GNU grep (Ubuntu CI) treats it as literal `t`, so nothing
re-admitted and docs/TESTING.md / docs/RELEASING.md stayed deny-listed —
the two policy-doc tests failed on CI shard 6 (1097 pass / 2 fail).

Build ALLOW_RE with `printf '\t(%s)'` so the tab is a real byte, identical
in construction to DENY_RE (line 117), which the CI log shows matches
correctly on GNU grep. End-to-end: editing docs/TESTING.md now flips the
hash; a normal docs/*.md add still does not (deny stays scoped).
Surfaced by the SkillOpt real-LLM eval (Track B). The reflect step was shown
only a pass/fail score and the agent transcript — never WHAT the benchmark
judge rewards. On a skill judged by structure (e.g. "must include a
Confidence: line") the optimizer proposed plausible-but-off edits ("close with
a synthesis") that never satisfied the literal check; every candidate scored 0
on D_sel, the validation gate rejected them all, and the skill text never
changed (optimized === baseline === 0).

Fix: render each benchmark Judge (rule checks / llm rubric / qrels) into
plain-English criteria via new exported describeJudge / describeJudges, and
thread them into the reflect prompt (a SUCCESS CRITERIA block) for both the
loop reflect calls and the one-shot-rewrite path. The orchestrator computes the
distinct criteria across train+sel+test once. The optimizer system prompt now
instructs it to satisfy the criteria through genuine content, never empty
keywords — reward-hacking stays defended by the independent held-out gate
(cat32 confirms the gate catches a keyword-stuffing hack).
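
A rough sketch of the rendering step, assuming three judge kinds that match the ones named above. The field names (`mustInclude`, `rubric`, `relevantIds`), the wording of each criterion, and the `successCriteriaBlock` helper are invented for illustration; only `describeJudge` and `describeJudges` are names from the commit message, and their real signatures may differ.

```typescript
// Hypothetical judge shapes; the real Judge type in gbrain may differ.
type Judge =
  | { kind: "rule"; mustInclude: string[] }
  | { kind: "llm"; rubric: string }
  | { kind: "qrels"; relevantIds: string[] };

// Render one judge into plain-English criteria.
function describeJudge(judge: Judge): string[] {
  switch (judge.kind) {
    case "rule":
      return judge.mustInclude.map((s) => `Output must include "${s}"`);
    case "llm":
      return [`An LLM grader scores the output against this rubric: ${judge.rubric}`];
    case "qrels":
      return [`Retrieved results should include: ${judge.relevantIds.join(", ")}`];
  }
}

// Collect distinct criteria across many tasks, preserving first-seen order.
function describeJudges(judges: Judge[]): string[] {
  const seen = new Set<string>();
  const criteria: string[] = [];
  for (const judge of judges) {
    for (const criterion of describeJudge(judge)) {
      if (!seen.has(criterion)) {
        seen.add(criterion);
        criteria.push(criterion);
      }
    }
  }
  return criteria;
}

// Build the block threaded into the reflect prompt; empty when there is nothing to say.
function successCriteriaBlock(judges: Judge[]): string {
  const criteria = describeJudges(judges);
  if (criteria.length === 0) return "";
  return ["SUCCESS CRITERIA", ...criteria.map((c) => `- ${c}`)].join("\n");
}

const block = successCriteriaBlock([
  { kind: "rule", mustInclude: ["Confidence:"] },
  { kind: "rule", mustInclude: ["Confidence:"] },
  { kind: "llm", rubric: "cites at least one source" },
]);
console.log(block);
```

Deduplicating matters because many benchmark tasks typically share the same judge, and repeating a criterion once per task would only pad the prompt.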

End-to-end this took a deficient skill from 0.00 to 1.00 on a held-out set it
never trained on. Pinned by test/skillopt/reflect.test.ts (describeJudge per
kind, describeJudges dedup, criteria present/absent in the prompt). Folds into
the open v0.42.9.0 PR (#1759).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resolve version collision: master shipped v0.42.10.0 (wikilink
global-basename) while the skillopt-eval-explainer work was in-branch as
v0.42.9.0. Bump the wave to v0.42.11.0 (strictly greater than master) per
the version-locations IRON RULE; keep both CHANGELOG entries.

CLAUDE.md: keep the branch's restructured thin orientation; master's only
change was a per-file-index annotation for the wikilink feature, which this
branch deliberately moved out of CLAUDE.md. Carried that documentation into
docs/architecture/KEY_FILES.md as a current-state link-extraction.ts entry
so nothing is lost.

Regenerated llms.txt / llms-full.txt via build:llms. Version trio audited
(VERSION = package.json = CHANGELOG = 0.42.11.0); typecheck + current-state
guard green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
garrytan changed the title from "v0.42.9.0 feat(skillopt): held-out eval gate, honest receipts, ENFORCE + ablation opts" to "v0.42.11.0 feat(skillopt): held-out eval gate, honest receipts, ENFORCE + ablation opts" on Jun 3, 2026
garrytan merged commit d4211f4 into master on Jun 3, 2026
21 checks passed
mgunnin added a commit to mgunnin/gbrain that referenced this pull request Jun 3, 2026
* upstream/master:
  v0.42.23.0 feat(jobs): --nice scheduling-priority flag for jobs work/supervisor (garrytan#1815) (garrytan#1820)
  v0.42.22.0 fix(minions): supervisor progress watchdog + worker DB self-defense — alive-but-wedged worker self-heals (garrytan#1801) (garrytan#1824)
  v0.42.21.0 fix(postgres): module-singleton ownership — canonical landing for the dream-cycle "connect() has not been called" class (garrytan#1404/garrytan#1471/garrytan#1619) (garrytan#1805)
  v0.42.20.0 fix: reliability wave — PGLite capture lock-pin + Postgres reconnect race + search embed-hang (garrytan#1762 garrytan#1745 garrytan#1775) (garrytan#1810)
  v0.42.19.0 fix(skillopt): close the last gap in the AI SDK v6 tool-loop fix (write-capture mapper + regression test) (garrytan#1809)
  v0.42.18.0 fix: sync orphan-pileup watchdog (garrytan#1633) + links-lag µs stamp (garrytan#1768) (garrytan#1807)
  v0.42.17.0 fix(sync): resumable incremental sync — killed mid-import no longer loses progress (garrytan#1794) (garrytan#1808)
  v0.42.16.0 feat(doctor): brain health as a solved problem — cause-ranked doctor + OOM-loop line + auto-drain + pool-reap (garrytan#1685) (garrytan#1802)
  v0.42.15.0 fix: decouple CLI primary output from process.stdout.isTTY (garrytan#1784) (garrytan#1806)
  v0.42.14.0 fix(zero-config): code-* readiness signal + init embedding-key validation + lock self-heal (garrytan#1780) (garrytan#1804)
  v0.42.13.0 fix(search): archive/ content findable by default, demoted not hard-excluded (garrytan#1777) (garrytan#1797)
  v0.42.12.0 feat: self-upgrading gbrain — invocation-riding update check + opt-in auto-upgrade (garrytan#1798)
  v0.42.11.0 feat(skillopt): held-out eval gate, honest receipts, ENFORCE + ablation opts (garrytan#1759)
  v0.42.10.0 feat(extract): opt-in global-basename wikilink resolution (closes garrytan#972) (garrytan#1388)