v0.42.11.0 feat(skillopt): held-out eval gate, honest receipts, ENFORCE + ablation opts#1759
Merged
…on opts
Wire the F11 held-out gate into the orchestrator at checkpoint acceptance (runHeldOutGate was dead code); parse + thread --held-out through CLI, batch, fleet, background job, and the run_skillopt MCP op. Populate the real receipt.baseline_sel_score (was hardcoded 0) and add a final-test eval (test_score + baseline_test_score) via a shared scoreSkillOnTasks primitive. Fix the --no-mutate proposed.md write (was a stub) and enforce maxRuntimeMin.
D16 ENFORCE in core mutation policy (assertBundledMutationHeldOut): mutating a bundled skill in place requires a non-empty (>=5), benchmark-disjoint held-out set or hard-refuses.
Add three eval-internal ablation opts (reflectMode, disableValidationGate, optimizerMode='one-shot-rewrite') recorded in the receipt + audit; ROLLOUT_SUCCESS_THRESHOLD named constant.
Security: run_skillopt MCP op validates skill_name (kebab-only) and confines caller-supplied benchmark/held-out paths to the skills dir for remote callers.
…eceipt honesty
New test/skillopt/rollout.test.ts (rollout had zero coverage). Held-out ENFORCE unit cases + one-shot-rewrite fence handling (whole-response unwrap, embedded-fence preserved, error path).
E2E: F11 held-out BLOCKS/ALLOWS, bundled no-mutate write, reflectMode/disableValidationGate/optimizerMode, maxRuntimeMin abort, receipt baseline/test-score honesty, held-out/benchmark disjointness, D2 no-DB-pollution.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
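The D16 ENFORCE rule described above (a held-out set of at least 5 tasks that shares no task with the benchmark) can be sketched roughly as follows. This is an illustration under assumptions, not the code in `assertBundledMutationHeldOut`: the `Task` shape, the `HeldOutRefusal` class, and the function name `assertHeldOutUsable` are hypothetical, and the minimum of 5 is taken from the commit text.

```typescript
// Hypothetical task shape; only task_id matters for the disjointness check.
interface Task {
  task_id: string;
}

// Assumed constant: the PR states the minimum is 5 (derived from D_SEL_MIN_SIZE).
const MIN_HELD_OUT_SIZE = 5;

// Hypothetical error type used to signal a hard refusal.
class HeldOutRefusal extends Error {}

// Throws unless the held-out set is large enough and disjoint from the benchmark.
function assertHeldOutUsable(benchmark: Task[], heldOut: Task[]): void {
  if (heldOut.length < MIN_HELD_OUT_SIZE) {
    throw new HeldOutRefusal(
      `held-out set has ${heldOut.length} tasks; need at least ${MIN_HELD_OUT_SIZE}`,
    );
  }
  const benchmarkIds = new Set(benchmark.map((t) => t.task_id));
  const overlap = heldOut
    .map((t) => t.task_id)
    .filter((id) => benchmarkIds.has(id));
  if (overlap.length > 0) {
    throw new HeldOutRefusal(
      `held-out overlaps benchmark on task_id: ${overlap.join(", ")}`,
    );
  }
}
```

Comparing on `task_id` rather than file path means a renamed copy of the benchmark file is still rejected, which matches the disjointness case listed in the E2E coverage.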
…for v0.42.9.0
Wire --held-out into the skill-optimizer SKILL.md, guide flags/safety tables, and the tutorial's bundled-skill step: mutating a bundled skill in place now requires --allow-mutate-bundled AND --held-out (>=5 benchmark-disjoint tasks) or it hard-refuses. Add the --held-out flag row + F11 held-out gate to the guide; update the receipt contract to the honest baseline/test-score fields.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…val-explainer # Conflicts: # CHANGELOG.md # TODOS.md # VERSION # package.json
…again
The ai@6.x bump tightened ModelMessage + tool-schema validation, which
silently broke every multi-turn tool loop. Both `gbrain skillopt` rollouts
and production background `subagent` jobs route through `chat()`/`toolLoop`
and crashed the moment the model called a tool ("messages do not match the
ModelMessage[] schema" / "schema is not a function"). Surfaced end-to-end
by the SkillOpt real-LLM eval.
Three fixes:
- chat(): wrap tool defs with the SDK's `jsonSchema()` helper instead of a
bare `{jsonSchema}` object (v6 asSchema() treated the bare object as a
thunk and threw).
- chat(): new exported pure `toModelMessages()` converts gbrain's
provider-neutral ChatMessage[] into v6 ModelMessage[] — tool results ride
a dedicated `role:'tool'` message with structured `{type,value}` output;
null output preserved as json null. Load-bearing for the production
subagent path, not just skillopt.
- rollout.ts: replace the inline params→schema mapper (dropped `items` on
array params) with the shared `paramDefToSchema` single source of truth.
Pinned by test/gateway-model-messages.test.ts (8 cases). Folds into the
open v0.42.9.0 PR (#1759) — these complete the eval-readiness wave by
making skillopt actually run against a live model.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
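The message conversion described in the second fix can be sketched as below. All types here are local stand-ins: gbrain's real `ChatMessage` and the SDK's `ModelMessage` are not shown in this commit, so `ChatMessage`, `ModelMessageLike`, and `toModelMessagesSketch` are hypothetical names, and mapping string outputs to a `text` variant is an assumption. The two points taken from the commit text are that tool results ride a dedicated `role: 'tool'` message with a structured `{type, value}` output, and that a null output stays JSON null.

```typescript
// Hypothetical provider-neutral input shape.
type ChatMessage =
  | { role: "system" | "user"; content: string }
  | { role: "assistant"; content: string; toolCalls?: { id: string; name: string; input: unknown }[] }
  | { role: "tool_result"; toolCallId: string; toolName: string; output: unknown };

type ToolOutput = { type: "text"; value: string } | { type: "json"; value: unknown };

type AssistantPart =
  | { type: "text"; text: string }
  | { type: "tool-call"; toolCallId: string; toolName: string; input: unknown };

// Hypothetical stand-in for the SDK's message type.
type ModelMessageLike =
  | { role: "system" | "user"; content: string }
  | { role: "assistant"; content: AssistantPart[] }
  | { role: "tool"; content: { type: "tool-result"; toolCallId: string; toolName: string; output: ToolOutput }[] };

// Wrap a raw tool output in a structured {type, value} envelope.
function toStructuredOutput(output: unknown): ToolOutput {
  if (typeof output === "string") return { type: "text", value: output };
  // undefined is not valid JSON, so it becomes null; null itself is preserved.
  return { type: "json", value: output === undefined ? null : output };
}

function toModelMessagesSketch(messages: ChatMessage[]): ModelMessageLike[] {
  return messages.map((m): ModelMessageLike => {
    switch (m.role) {
      case "system":
      case "user":
        return { role: m.role, content: m.content };
      case "assistant": {
        const parts: AssistantPart[] = [];
        if (m.content) parts.push({ type: "text", text: m.content });
        for (const c of m.toolCalls ?? []) {
          parts.push({ type: "tool-call", toolCallId: c.id, toolName: c.name, input: c.input });
        }
        return { role: "assistant", content: parts };
      }
      case "tool_result":
        // Tool results get their own message instead of being folded into user text.
        return {
          role: "tool",
          content: [{
            type: "tool-result",
            toolCallId: m.toolCallId,
            toolName: m.toolName,
            output: toStructuredOutput(m.output),
          }],
        };
    }
  });
}
```

Keeping the converter pure (no I/O, no SDK calls) is what makes it practical to pin with unit tests, as the commit does.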
…out 0
Surfaced by the SkillOpt real-LLM eval (Track B). Two coupled bugs that made
a budget-capped Haiku run report a vacuous "0/N" measurement in ~2ms with
zero LLM calls — indistinguishable from a real deficient-skill score:
1. Claude Haiku 4.5's canonical dateless id (`claude-haiku-4-5`) was missing
from anthropic-pricing.ts (only the dated `-20251001` was present). With
`--max-cost` set, BudgetTracker.reserve() threw no_pricing on the FIRST
chat() of every rollout. Added the dateless entry (sonnet already had its
dateless form).
2. runValidationGate swallowed that BUDGET_EXHAUSTED error — runWithLimit
settled it as {ok:false}, which the gate turned into median:0. A pricing/cap
crash became a fake score. The gate now scans settled results for
isMustAbortError() and re-throws so the caller aborts loudly; ordinary
(non-abort) rollout errors still fail-open to 0 (judge-hiccup posture kept).
Pinned by test/skillopt/validate-gate-abort.test.ts (3 cases). Folds into the
open v0.42.9.0 PR (#1759).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
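The second fix (re-throw must-abort errors, keep fail-open for ordinary rollout errors) can be illustrated with a small sketch. The `Settled` shape, the `mustAbort` tag, and `gateMedian` are assumptions made for this example; the commit names `isMustAbortError()` and `runWithLimit` but does not show their definitions.

```typescript
// Hypothetical settled-result shape, loosely modelled on the {ok:false} the commit mentions.
type Settled = { ok: true; score: number } | { ok: false; error: unknown };

// Assumed convention: abort-class errors carry a mustAbort tag.
function isMustAbortError(err: unknown): err is Error {
  return err instanceof Error && (err as { mustAbort?: unknown }).mustAbort === true;
}

function median(values: number[]): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Scan first, score second: a budget/pricing crash must never be turned into a 0.
function gateMedian(results: Settled[]): number {
  for (const r of results) {
    if (!r.ok && isMustAbortError(r.error)) throw r.error;
  }
  // Ordinary failures still count as 0 (the fail-open posture the commit keeps).
  return median(results.map((r) => (r.ok ? r.score : 0)));
}
```

The ordering is the point of the design: once a failure has been folded into the median, a crashed run and a genuinely deficient skill both read as 0.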
…om full bundle
The toolLoop + budget bug-fix annotations grew CLAUDE.md, pushing llms-full.txt
to 756KB over the 750KB FULL_SIZE_BUDGET (the `build-llms > size budget` test
failed, failing the `test` CI job). CLAUDE.md stays inlined by design (it's the
point of the one-fetch bundle), so per the budget comment's own guidance ("ship
with includeInFull=false exclusions") this excludes docs/what-schemas-unlock.md
(15.4KB value-explainer, not load-bearing operational reference) from
llms-full.txt; it stays linked in llms.txt. Bundle now 740KB with ~9KB headroom.
No budget bump — 750KB is near the ~190k-token-context fit ceiling.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
docs/**/*.md is deny-listed from the CI cache hash (test-irrelevant). The CLAUDE.md restructure moves test/release POLICY into docs/TESTING.md + docs/RELEASING.md, which DO carry contracts the test suite reads. Without re-admitting them, a policy-only edit would produce the same cache hash and skip the test shard that runs the build-llms + doc-history guards (false-pass).
Adds an ALLOW_PATTERNS re-admit step after the deny, scoped to the named policy docs (not a blanket docs un-deny). Lands FIRST, before any doc moves.
Pinned by 3 new cases in test/scripts/ci-cache-hash.test.ts: TESTING.md + RELEASING.md edits MUST change the hash; docs/guide.md still must not.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
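The deny-then-re-admit logic can be sketched in a few lines. The real implementation is a shell script using `grep -E`; this TypeScript version only illustrates the filtering rule, and the two pattern lists are simplified assumptions rather than the script's actual DENY/ALLOW contents.

```typescript
// Assumed, simplified patterns: deny all markdown under docs/, re-admit the two policy docs.
const DENY: RegExp[] = [/^docs\/.*\.md$/];
const ALLOW: RegExp[] = [/^docs\/TESTING\.md$/, /^docs\/RELEASING\.md$/];

// A path feeds the cache hash if it is not denied, or if it is denied but re-admitted.
function hashInputs(paths: string[]): string[] {
  return paths.filter((p) => {
    const denied = DENY.some((re) => re.test(p));
    const readmitted = ALLOW.some((re) => re.test(p));
    return !denied || readmitted;
  });
}
```

Because the allow list names exact files, adding a new ordinary doc under `docs/` still leaves the hash unchanged, which is the scoping the commit calls out.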
…UDE.md (verbatim)
CLAUDE.md had grown to 592KB / ~147k tokens auto-loaded every session (~77% of
the llms-full.txt single-fetch bundle). The per-file index was append-only by
mandate. This is the exact thin-dispatcher-vs-fat-blob anti-pattern gbrain exists
to fix, so CLAUDE.md becomes a thin orientation + resolver that points at
on-demand docs.
This commit is the VERBATIM move (content-preserving — the next commit compresses):
- docs/architecture/KEY_FILES.md <- ## Key files + the calibration key-files
cluster + Schema Cathedral v3 impl detail
- docs/architecture/thin-client.md <- ## Thin-client routing
- docs/TESTING.md <- ## Testing
- ## Commands DROPPED (18 'added in vX.Y' history blocks; current surface is
gbrain 0.41.38.0 -- personal knowledge brain
USAGE
gbrain <command> [options]
SETUP
init [--pglite|--supabase|--url] Create brain (PGLite default, no server)
migrate --to <supabase|pglite> Transfer brain between engines
upgrade Self-update
check-update [--json] Check for new versions
doctor [--json] [--fast] Health check (resolver, skills, pgvector, RLS, embeddings)
integrations [subcommand] Manage integration recipes (senses + reflexes)
PAGES
get <slug> Read a page
put <slug> [< file.md] Write/update a page
delete <slug> Delete a page
list [--type T] [--tag T] [-n N] List pages
SEARCH
search <query> Keyword search (tsvector)
query <question> [--no-expand] Hybrid search (RRF + expansion)
ask <question> [--no-expand] Alias for query
IMPORT/EXPORT
import <dir> [--no-embed] Import markdown directory
sync [--repo <path>] [flags] Git-to-brain incremental sync
sync --watch [--interval N] Continuous sync (loops until stopped)
sync --install-cron Install persistent sync daemon
export [--dir ./out/] Export to markdown
export --restore-only [--repo <p>] Restore missing supabase-only files
[--type T] [--slug-prefix S] With optional filters
FILES
files list [slug] List stored files
files upload <file> --page <slug> Upload file to storage
files upload-raw <file> --page <s> Smart upload (size routing + .redirect.yaml)
files signed-url <path> Generate signed URL (1-hour)
files sync <dir> Bulk upload directory
files verify Verify all uploads
EMBEDDINGS
embed [<slug>|--all|--stale] Generate/refresh embeddings
LINKS
link <from> <to> [--type T] Create typed link
unlink <from> <to> Remove link
backlinks <slug> Incoming links
graph <slug> [--depth N] Traverse link graph (returns nodes)
graph-query <slug> [--type T] Edge-based traversal with type/direction filters
[--depth N] [--direction in|out|both]
TAGS
tags <slug> List tags
tag <slug> <tag> Add tag
untag <slug> <tag> Remove tag
TIMELINE
timeline [<slug>] View timeline
timeline-add <slug> <date> <text> Add timeline entry
TOOLS
extract <links|timeline|all> Extract links/timeline (idempotent)
[--source fs|db] fs (default) walks .md files; db iterates engine pages
[--dir <brain>] brain dir for fs source
[--type T] [--since DATE] filters (db source)
[--dry-run] [--json]
publish <page.md> [--password] Shareable HTML (strips private data, optional AES-256)
check-backlinks <check|fix> [dir] Find/fix missing back-links across brain
lint <dir|file> [--fix] Catch LLM artifacts, placeholder dates, bad frontmatter
orphans [--json] [--count] Find pages with no inbound wikilinks
salience [--days N] [--kind P] v0.29: pages ranked by emotional + activity salience
anomalies [--since D] [--sigma N] v0.29: cohort-based statistical anomalies (tag, type)
transcripts recent [--days N] v0.29: recent raw .txt transcripts (local-only)
dream [--dry-run] [--json] Run the overnight maintenance cycle once (cron-friendly).
See also: autopilot --install (continuous daemon).
check-resolvable [--json] [--fix] Validate skill tree (reachability/MECE/DRY)
report --type <name> --content ... Save timestamped report to brain/reports/
BRAIN (capture / ideate / explore — v0.37/v0.38)
capture [content] [--file PATH] Single entrypoint for getting content into the brain
[--stdin] [--slug s] [--type t] Inline content / file / stdin; writes to inbox/ by default
[--source ID] [--quiet|--json] Multi-source brains: route to a non-default source
brainstorm <question> [--json] Bisociation idea generator (hybrid search + far-set + judge)
[--save|--no-save] [--limit N]
lsd <question> [--json] Lateral Synaptic Drift: inverted-judge brainstorm
[--save|--no-save] [--limit N] rewarding far-from-obvious + axiomatic inversions
SOURCES (multi-repo / multi-brain)
sources list Show registered sources
sources add <id> --path <p> Register a source (id = short name, e.g. 'wiki')
sources remove <id> Remove a source + its pages
sync --all Sync all sources with a local_path
sync --source <id> Sync one specific source
repos ... DEPRECATED alias for 'sources' (v0.19.0)
CODE INDEXING (v0.19.0 / v0.20.0 Cathedral II)
code-def <symbol> [--lang l] Find the definition of a symbol across code pages
code-refs <symbol> [--lang l] Find all references to a symbol (JSON-first)
code-callers <symbol> Who calls this symbol? (v0.20.0 A1)
code-callees <symbol> What does this symbol call? (v0.20.0 A1)
query <q> --lang <l> Filter hybrid search to one language (v0.20.0)
query <q> --symbol-kind <k> Filter to symbol type (function|class|method|...) (v0.20.0)
reconcile-links [--dry-run] Batch-recompute doc↔impl edges (v0.20.0)
reindex-code [--source id] [--yes] Explicit code-page reindex (v0.20.0)
sync --strategy code Sync code files into the brain
JOBS (Minions)
jobs submit <name> [--params JSON] Submit background job [--follow] [--dry-run]
jobs list [--status S] [--limit N] List jobs
jobs get <id> Job details + history
jobs cancel <id> Cancel job
jobs retry <id> Re-queue failed/dead job
jobs prune [--older-than 30d] Clean old jobs
jobs stats Job health dashboard
jobs work [--queue Q] Start worker daemon (Postgres only)
ADMIN
stats Brain statistics
health Brain health dashboard
history <slug> Page version history
revert <slug> <version-id> Revert to version
features [--json] [--auto-fix] Scan usage + recommend unused features
autopilot [--repo] [--interval N] Self-maintaining brain daemon
config [show|get|set] <key> [val] Brain config
storage status [--repo <path>] Storage tier status and health
[--json] (git-tracked vs supabase-only)
serve MCP server (stdio)
serve --http [--port N] HTTP MCP server with OAuth 2.1
--token-ttl N Access token TTL in seconds (default: 3600)
--enable-dcr Enable Dynamic Client Registration
--public-url URL Public issuer URL (required behind proxy/tunnel)
call <tool> '<json>' Raw tool invocation
version Version info
--tools-json Tool discovery (JSON)
Run gbrain <command> --help for command-specific help. + the per-command KEY_FILES entries; content stays in git)
CLAUDE.md gains: a Reference map (resolver), a Maintaining section (the
anti-disease rule), and a Cross-cutting invariants subsection under Architecture
so the must-never-violate rules (trust fail-closed, sourceScopeOpts isolation,
JSONB trap, engine parity, contract-first, migrations, multi-source) still
auto-load after the index moved out.
Result: CLAUDE.md 592KB -> 61KB; llms-full.txt 740KB -> 210KB (new docs link-only
until compressed). build-llms drift + budget test green; verify 29/29 green.
The pre-move content is recoverable at git show <this^>:CLAUDE.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ence guard
Compresses the verbatim-relocated reference docs from append-only release-history
to current-state-only (the disease cure), then makes recurrence structurally
impossible via a CI guard.
Compression (fan-out subagents + adversarial verify, audited mechanically):
- KEY_FILES.md 453KB -> 356KB; TESTING.md 42KB -> 38KB; thin-client.md already clean.
- 393/393 entries preserved; every src/test/scripts path from the verbatim original
survives (mechanical comm-check); zero bolded **v0. markers remain.
- Conservative ratio (~22%) because the content is invariant-dense: correctness
over brevity.
Dropped: **vX.Y.Z (#NNN):** clauses, codex/review tags, contributor credits,
PR-numbers-as-ids, pre-fix/then/was-now history deltas.
Kept: every exported symbol, invariant, and Pinned-by reference.
Verbatim original recoverable at git show <relocation-commit>:docs/architecture/KEY_FILES.md.
Recurrence guard (scripts/check-key-files-current-state.sh, wired into verify + check:all):
- HARD: bans the bolded **v0.<digit> marker in the reference docs (scoped: plain
'as of pgvector 0.7' prose is fine, no false positives).
- HARD: CLAUDE.md size cap (90KB; currently 61KB), the structural backstop.
- Pinned by test/scripts/check-key-files-current-state.test.ts (7 cases).
Content contracts (test/build-llms.test.ts, +5 cases per codex outside-voice):
CLAUDE.md keeps inline ship IRON RULES (version format, document-release,
never-hand-roll); AGENTS.md keeps its boot order; llms indexes the new docs;
KEY_FILES stays link-only (not inlined).
Privacy: scrubbed the relocated 'wintermute/chat/' source-boost examples + the
literal harvest-lint regex to generic placeholders (legitimate in allowlisted
CLAUDE.md; genericized for the new public docs per the privacy rule).
Reverts the 284c50a band-aid: re-inlines docs/what-schemas-unlock.md now that the
restructure freed ~530KB of bundle headroom (llms-full.txt 740KB -> 225KB).
verify 30/30 green (incl. new check:doc-history).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
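The two HARD checks in the recurrence guard can be sketched as follows. The actual guard is a shell script; this TypeScript version is only an illustration, and the function names and the 1024-bytes-per-KB reading of the cap are assumptions.

```typescript
// Bolded release marker: two asterisks, then "v0." and a digit.
const BANNED_MARKER = /\*\*v0\.\d/;

// 90KB is the cap stated in this commit (a later commit in the PR tightens it to 60KB).
// Assumption: KB means 1024 bytes.
const CLAUDE_MD_CAP_BYTES = 90 * 1024;

// Returns the 1-based line numbers that carry a banned history marker.
function findHistoryMarkers(doc: string): number[] {
  return doc
    .split("\n")
    .flatMap((line, i) => (BANNED_MARKER.test(line) ? [i + 1] : []));
}

// Measures UTF-8 bytes, not characters, since the cap is a file-size cap.
function overSizeCap(content: string, capBytes: number = CLAUDE_MD_CAP_BYTES): boolean {
  return new TextEncoder().encode(content).length > capBytes;
}
```

Anchoring the pattern on the bold markers is what keeps plain prose such as "as of pgvector 0.7" from tripping the guard.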
The highest-/ship-risk commit (isolated so it can revert alone). Moves the verbose
release + contributor procedure out of CLAUDE.md, keeping every ship-critical IRON
RULE inline so /ship + /document-release (which read CLAUDE.md) cannot regress.
Moved to docs/RELEASING.md: pre-ship test requirements; the CHANGELOG-branch-scoped
+ CHANGELOG voice + release-summary template; the 'To take advantage of vX' block
spec; version migrations + migration-is-canonical; schema state tracking; GitHub
Actions SHA maintenance; PR-descriptions-cover-the-branch; community-PR-wave;
checking-out-PRs-from-garrytan-agents.
Kept INLINE in CLAUDE.md (ship-critical IRON RULES — do NOT move):
- the Version-locations table (5-file sync) + the 3-line consistency audit
- Conductor branch=workspace
- Post-ship /document-release (MANDATORY)
- Privacy + Responsible-disclosure rules (Privacy also anchors the check-privacy
allowlist — the only place allowed to name the fork)
- PR-title-version-first
- never-hand-roll-ship (Skill routing)
Plus a new ## Releasing pointer ('Before any ship, read docs/RELEASING.md in full')
and a resolver row.
CLAUDE.md 61KB -> 39KB (592KB -> 39KB overall, 93% cut; ~9k tokens auto-loaded vs
~147k). CLAUDE.md size-gate tightened 90KB -> 60KB. The content-contract tests pin
that the inline IRON RULES (MAJOR.MINOR.PATCH.MICRO, document-release, hand-roll
ship) did NOT move out. The moved ranges carry no banned fork name, so RELEASING.md
needs no privacy allowlist entry. verify 30/30; bundle 225KB -> 204KB.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The CLAUDE.md thin-resolver restructure (592KB → 39KB) rides in this release; record it under the existing v0.42.9.0 For-contributors section. No version bump — v0.42.9.0 is unreleased and already allocated to this PR.
…grep
The policy-doc re-admit (75992b7) put `\t` inline in the ALLOW patterns passed to `grep -E`. BSD grep (macOS local) treats `\t` as a tab so it worked locally; GNU grep (Ubuntu CI) treats it as literal `t`, so nothing re-admitted and docs/TESTING.md / docs/RELEASING.md stayed deny-listed, and the two policy-doc tests failed on CI shard 6 (1097 pass / 2 fail).
Build ALLOW_RE with `printf '\t(%s)'` so the tab is a real byte, identical in construction to DENY_RE (line 117), which the CI log shows matches correctly on GNU grep.
End-to-end: editing docs/TESTING.md now flips the hash; a normal docs/*.md add still does not (deny stays scoped).
Surfaced by the SkillOpt real-LLM eval (Track B). The reflect step was shown
only a pass/fail score and the agent transcript — never WHAT the benchmark
judge rewards. On a skill judged by structure (e.g. "must include a
Confidence: line") the optimizer proposed plausible-but-off edits ("close with
a synthesis") that never satisfied the literal check; every candidate scored 0
on D_sel, the validation gate rejected them all, and the skill text never
changed (optimized === baseline === 0).
Fix: render each benchmark Judge (rule checks / llm rubric / qrels) into
plain-English criteria via new exported describeJudge / describeJudges, and
thread them into the reflect prompt (a SUCCESS CRITERIA block) for both the
loop reflect calls and the one-shot-rewrite path. The orchestrator computes the
distinct criteria across train+sel+test once. The optimizer system prompt now
instructs it to satisfy the criteria through genuine content, never empty
keywords — reward-hacking stays defended by the independent held-out gate
(cat32 confirms the gate catches a keyword-stuffing hack).
End-to-end this took a deficient skill from 0.00 to 1.00 on a held-out set it
never trained on. Pinned by test/skillopt/reflect.test.ts (describeJudge per
kind, describeJudges dedup, criteria present/absent in the prompt). Folds into
the open v0.42.9.0 PR (#1759).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
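The criteria rendering can be sketched like this. The three judge kinds come from the commit text; everything else (field names, the wording of each criterion, and the `Sketch`-suffixed function names) is hypothetical, since the real `describeJudge` / `describeJudges` signatures are not shown.

```typescript
// Hypothetical judge shapes for the three kinds the commit lists.
type Judge =
  | { kind: "rule"; checks: { mustInclude: string }[] }
  | { kind: "llm"; rubric: string }
  | { kind: "qrels"; relevantSlugs: string[] };

// Render one judge into plain-English criteria lines.
function describeJudgeSketch(judge: Judge): string[] {
  switch (judge.kind) {
    case "rule":
      return judge.checks.map((c) => `Output must include the literal text "${c.mustInclude}".`);
    case "llm":
      return [`An LLM judge grades the output against this rubric: ${judge.rubric}`];
    case "qrels":
      return [`Retrieved results should include: ${judge.relevantSlugs.join(", ")}.`];
  }
}

// Distinct criteria across all judges, first occurrence order preserved.
function describeJudgesSketch(judges: Judge[]): string[] {
  return Array.from(new Set(judges.flatMap(describeJudgeSketch)));
}

// Build the block that would be threaded into the reflect prompt; empty when there are no criteria.
function successCriteriaBlock(judges: Judge[]): string {
  const criteria = describeJudgesSketch(judges);
  if (criteria.length === 0) return "";
  return ["SUCCESS CRITERIA", ...criteria.map((c) => `- ${c}`)].join("\n");
}
```

Deduplicating before rendering keeps the prompt short when many tasks share one judge, which is the common case for a benchmark built around a single structural check.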
Resolve version collision: master shipped v0.42.10.0 (wikilink global-basename) while the skillopt-eval-explainer work was in-branch as v0.42.9.0. Bump the wave to v0.42.11.0 (strictly greater than master) per the version-locations IRON RULE; keep both CHANGELOG entries.
CLAUDE.md: keep the branch's restructured thin orientation; master's only change was a per-file-index annotation for the wikilink feature, which this branch deliberately moved out of CLAUDE.md. Carried that documentation into docs/architecture/KEY_FILES.md as a current-state link-extraction.ts entry so nothing is lost.
Regenerated llms.txt / llms-full.txt via build:llms. Version trio audited (VERSION = package.json = CHANGELOG = 0.42.11.0); typecheck + current-state guard green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mgunnin added a commit to mgunnin/gbrain that referenced this pull request on Jun 3, 2026
* upstream/master: v0.42.23.0 feat(jobs): --nice scheduling-priority flag for jobs work/supervisor (garrytan#1815) (garrytan#1820) v0.42.22.0 fix(minions): supervisor progress watchdog + worker DB self-defense — alive-but-wedged worker self-heals (garrytan#1801) (garrytan#1824) v0.42.21.0 fix(postgres): module-singleton ownership — canonical landing for the dream-cycle "connect() has not been called" class (garrytan#1404/garrytan#1471/garrytan#1619) (garrytan#1805) v0.42.20.0 fix: reliability wave — PGLite capture lock-pin + Postgres reconnect race + search embed-hang (garrytan#1762 garrytan#1745 garrytan#1775) (garrytan#1810) v0.42.19.0 fix(skillopt): close the last gap in the AI SDK v6 tool-loop fix (write-capture mapper + regression test) (garrytan#1809) v0.42.18.0 fix: sync orphan-pileup watchdog (garrytan#1633) + links-lag µs stamp (garrytan#1768) (garrytan#1807) v0.42.17.0 fix(sync): resumable incremental sync — killed mid-import no longer loses progress (garrytan#1794) (garrytan#1808) v0.42.16.0 feat(doctor): brain health as a solved problem — cause-ranked doctor + OOM-loop line + auto-drain + pool-reap (garrytan#1685) (garrytan#1802) v0.42.15.0 fix: decouple CLI primary output from process.stdout.isTTY (garrytan#1784) (garrytan#1806) v0.42.14.0 fix(zero-config): code-* readiness signal + init embedding-key validation + lock self-heal (garrytan#1780) (garrytan#1804) v0.42.13.0 fix(search): archive/ content findable by default, demoted not hard-excluded (garrytan#1777) (garrytan#1797) v0.42.12.0 feat: self-upgrading gbrain — invocation-riding update check + opt-in auto-upgrade (garrytan#1798) v0.42.11.0 feat(skillopt): held-out eval gate, honest receipts, ENFORCE + ablation opts (garrytan#1759) v0.42.10.0 feat(extract): opt-in global-basename wikilink resolution (closes garrytan#972) (garrytan#1388)
Summary
Makes `gbrain skillopt` (self-improving skills) honest and tamper-resistant: the SkillOpt eval-readiness wave. Track A of the "prove the feature works" plan; the real-LLM benchmark suite (Track B) lands separately in `gbrain-evals`.

Held-out validation gate (F11), now actually wired. `runHeldOutGate` existed but the orchestrator never called it and `--held-out` was never parsed. Now `--held-out <path>` is parsed and threaded through every caller (CLI, batch, fleet, background job, `run_skillopt` MCP op), and the gate runs at checkpoint acceptance: a candidate that climbs the benchmark but regresses on the independent held-out set is refused, on the mutate AND no-mutate/fleet paths.

Bundled-skill safety (D16) in core mutation policy. Mutating a shipped skill in place now requires a non-empty (>=5), benchmark-disjoint held-out set or hard-refuses (exit 2) with a `proposed.md` fallback. Enforced in `assertBundledMutationHeldOut`, so it fires for every entry point (they all funnel through `runSkillOpt`).

Honest receipts + final-test. `receipt.baseline_sel_score` was hardcoded `0` (real value discarded); now populated, plus a real final-test eval (`test_score` + `baseline_test_score`) via a shared `scoreSkillOnTasks` primitive. `--no-mutate` now writes `proposed.md` (was a stub); `--max-runtime-min` is enforced.

Security hardening. The `run_skillopt` MCP op validates `skill_name` (kebab-only) and confines caller-supplied benchmark/held-out paths to the skills dir for remote callers; this closes an arbitrary-file-read / existence-oracle for admin OAuth tokens.

Eval-internal ablation opts (NOT on the CLI): `reflectMode`, `disableValidationGate`, `optimizerMode` (`'one-shot-rewrite'`), recorded in the receipt + audit for replayability. They drive the cat30/31/32/33 SkillOpt benchmark suite in `gbrain-evals`.

Test Coverage
New `test/skillopt/rollout.test.ts` (rollout had zero coverage); held-out ENFORCE + one-shot-rewrite unit cases; e2e for F11 block/allow, bundled no-mutate write, the three ablation opts, `maxRuntimeMin` abort, receipt-score honesty, held-out/benchmark disjointness, and D2 no-DB-pollution. Targeted suite (skillopt + operations trust-boundary + autocut/search-mode merge seam + core): 355 pass / 0 fail. Full skillopt dir: 211 pass. `bun run verify`: 29/29. `bun run typecheck`: clean.

Note: the full sharded unit suite was environmentally slow locally (PGLite-heavy + loaded machine); CI runs the full shard matrix on this PR.
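The security hardening item in the Summary (kebab-only `skill_name`, caller-supplied paths confined to the skills dir) can be sketched as below. This is a minimal illustration, not the op's actual code: the exact kebab pattern, the function names, and the choice to resolve symlinks on the nearest existing ancestor are assumptions made for the example. It uses only Node's built-in `fs` and `path` modules.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Assumed definition of "kebab-only": lowercase alphanumerics joined by single hyphens.
const KEBAB = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;

function isKebabName(name: string): boolean {
  return KEBAB.test(name);
}

// Resolve symlinks on the nearest ancestor that exists, then re-append the missing tail.
// This keeps the check stable when the target file has not been created yet.
function canonicalize(p: string): string {
  let existing = path.resolve(p);
  const rest: string[] = [];
  while (!fs.existsSync(existing)) {
    const parent = path.dirname(existing);
    if (parent === existing) break; // reached the filesystem root
    rest.unshift(path.basename(existing));
    existing = parent;
  }
  return path.join(fs.realpathSync(existing), ...rest);
}

// True when candidate (absolute, or relative to rootDir) stays inside rootDir.
function isInsideDir(rootDir: string, candidate: string): boolean {
  const root = canonicalize(rootDir);
  const target = canonicalize(path.resolve(rootDir, candidate));
  const rel = path.relative(root, target);
  const escapes = rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel);
  return !escapes;
}
```

Validating the name at the op boundary means any path later derived from it cannot contain separators or `..` segments, so only caller-supplied paths need the confinement check.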
Pre-Landing Review
6 findings (3 critical, 3 informational), all actioned:
- `run_skillopt` MCP op passed benchmark/held-out paths to `readFileSync` with no confinement → arbitrary file read for remote admin callers. Added kebab `skill_name` validation + path confinement to skillsDir for remote callers.
- `maxRuntimeMin` abort branch + receipt baseline/test-score (the headline fix) were unasserted → added tests.
- `MIN_HELD_OUT_SIZE` now derives from `D_SEL_MIN_SIZE`; `0.5` → `ROLLOUT_SUCCESS_THRESHOLD`; MCP param `held_out` → `held_out_path` (consistency).

Adversarial Review
Claude + Codex both flagged (consensus, block-until-fixed):
- `skill_name` path traversal: confinement covered caller-supplied paths but not the `skillName`-derived defaults. Now validated kebab-only at the op boundary, so derived paths are contained by construction.
- … `--held-out` at a copy of the benchmark voided the gate. Now rejected on task_id overlap.
- … `/tmp` → `/private/tmp`, Conductor worktrees): canonicalize the nearest existing ancestor.

Plan Completion
Track A complete (T0 North Star, T1 load-bearing fixes, T2 ablation opts, T3 tests). Track B (real-LLM cat30/31/32/33 in `gbrain-evals`) is a separate repo/PR, intentionally not in this PR.

TODOS
Added 4 v0.42+ SkillOpt follow-ups (promoteCandidate DRY extraction, bundled-detection hardening, preflight ablation-opt awareness, maxRuntimeMin in held-out/final-test phases).

Test plan
- `bun run typecheck` clean
- `bun run verify` 29/29

🤖 Generated with Claude Code
Documentation
Docs synced for the v0.42.9.0 skillopt eval-readiness wave: `skills/skill-optimizer/SKILL.md` (bundled-skill held-out requirement + honest receipt fields), `docs/guides/skillopt.md` (`--held-out` flag + F11 gate + D16 row), `docs/tutorials/improving-skills-with-skillopt.md` (Step 5 bundled command now shows `--allow-mutate-bundled --held-out`). CHANGELOG/CLAUDE.md/TODOS/llms updated in the release commit.

Also in this PR: CLAUDE.md thin-resolver restructure (folded in)
Separate concern from skillopt, folded into this batch as atomic, bisect-friendly commits (`75992b77`, `c825ef8f`, `163f044e`, `fa2f9de2`).

Problem: CLAUDE.md had grown to 591,854 bytes (~147k tokens auto-loaded every session, ~77% of the `llms-full.txt` one-fetch bundle, which had just blown its 750KB budget). Root cause was structural: the per-file index + command/test sections were append-only by mandate, so every release chained another `**vX.Y.Z:**` clause forever.

Fix: CLAUDE.md becomes a thin orientation + resolver (gbrain's own thin-dispatcher/fat-detail pattern). The per-file index, thin-client routing, test discipline, and the verbose release process move to on-demand docs; CLAUDE.md keeps the North Star, architecture + cross-cutting invariants, the IRON RULES, and a reference map that routes to detail.

- `75992b77`: `ci-cache-hash.sh` keeps relocated policy docs (`docs/TESTING.md`, `docs/RELEASING.md`) test-affecting so a change to them still invalidates the cache (closes a false-pass path before any policy moved). Pinned by tests.
- `c825ef8f`: `docs/architecture/KEY_FILES.md`, `docs/architecture/thin-client.md`, `docs/TESTING.md`; resolver + cross-cutting invariants lifted into CLAUDE.md. Content-preserving.
- `163f044e`: `scripts/check-key-files-current-state.sh` recurrence guard (bans `**v0.` chains + CLAUDE.md size cap, wired into `verify`) + content-contract tests + revert the bundle band-aid.
- `fa2f9de2`: `docs/RELEASING.md`; ship IRON RULES + version-locations table kept inline.

Result: CLAUDE.md 591,854 → 39,181 bytes (93%), llms-full.txt 740KB → 204KB. Zero `src/` changes. The bloat cannot recur: `verify` fails on re-introduced append-history or an over-cap CLAUDE.md.
Reviews: eng-review (6 findings, folded) + codex outside-voice (11 findings, 9 folded as mandatory hardening incl. the CI-cache fix, content-contract tests, measured bundle).

Verification (docs delta is zero-src): `bun run verify` 30/30; build-llms drift+budget, doc-history guard, ci-cache contract = 46/46; targeted doc-touching suite (public-exports, resolver, skill-trigger-index) 199/0.