v0.42.11.0 feat(skillopt): held-out eval gate, honest receipts, ENFORCE + ablation opts by garrytan · Pull Request #1759 · garrytan/gbrain

garrytan · 2026-06-02T03:40:45Z

Summary

Makes gbrain skillopt (self-improving skills) honest and tamper-resistant — the SkillOpt eval-readiness wave. Track A of the "prove the feature works" plan; the real-LLM benchmark suite (Track B) lands separately in gbrain-evals.

Held-out validation gate (F11) — now actually wired. runHeldOutGate existed but the orchestrator never called it and --held-out was never parsed. Now --held-out <path> is parsed and threaded through every caller (CLI, batch, fleet, background job, run_skillopt MCP op), and the gate runs at checkpoint acceptance — a candidate that climbs the benchmark but regresses on the independent held-out set is refused, on the mutate AND no-mutate/fleet paths.

Bundled-skill safety (D16) in core mutation policy. Mutating a shipped skill in place now requires a non-empty (>=5), benchmark-disjoint held-out set or hard-refuses (exit 2) with a proposed.md fallback. Enforced in assertBundledMutationHeldOut, so it fires for every entry point (they all funnel through runSkillOpt).

Honest receipts + final-test. receipt.baseline_sel_score was hardcoded 0 (real value discarded); now populated, plus a real final-test eval (test_score + baseline_test_score) via a shared scoreSkillOnTasks primitive. --no-mutate now writes proposed.md (was a stub); --max-runtime-min is enforced.

Security hardening. The run_skillopt MCP op validates skill_name (kebab-only) and confines caller-supplied benchmark/held-out paths to the skills dir for remote callers — closes an arbitrary-file-read / existence-oracle for admin OAuth tokens.

Eval-internal ablation opts (NOT on the CLI): reflectMode, disableValidationGate, optimizerMode ('one-shot-rewrite'), recorded in the receipt + audit for replayability. They drive the cat30/31/32/33 SkillOpt benchmark suite in gbrain-evals.

Test Coverage

New test/skillopt/rollout.test.ts (rollout had zero coverage); held-out ENFORCE + one-shot-rewrite unit cases; e2e for F11 block/allow, bundled no-mutate write, the three ablation opts, maxRuntimeMin abort, receipt-score honesty, held-out/benchmark disjointness, and D2 no-DB-pollution. Targeted suite (skillopt + operations trust-boundary + autocut/search-mode merge seam + core): 355 pass / 0 fail. Full skillopt dir: 211 pass. bun run verify: 29/29. bun run typecheck: clean.

Note: the full sharded unit suite was environmentally slow locally (PGLite-heavy + loaded machine); CI runs the full shard matrix on this PR.

Pre-Landing Review

6 findings (3 critical, 3 informational), all actioned:

[FIXED] [security] run_skillopt MCP op passed benchmark/held-out paths to readFileSync with no confinement → arbitrary file read for remote admin callers. Added kebab skill_name validation + path confinement to skillsDir for remote callers.
[FIXED] [testing] maxRuntimeMin abort branch + receipt baseline/test-score (the headline fix) were unasserted → added tests.
[FIXED] [maint] MIN_HELD_OUT_SIZE now derives from D_SEL_MIN_SIZE; 0.5 → ROLLOUT_SUCCESS_THRESHOLD; MCP param held_out → held_out_path (consistency).

Adversarial Review

Claude + Codex both flagged (consensus, block-until-fixed):

[FIXED] skill_name path traversal — confinement covered caller-supplied paths but not the skillName-derived defaults. Now validated kebab-only at the op boundary, so derived paths are contained by construction.
[FIXED] One-shot fence-strip truncation — the non-anchored fence regex could truncate a body containing a code sample. Now only unwraps a whole-response fence.
[FIXED] Held-out/benchmark non-disjointness — pointing --held-out at a copy of the benchmark voided the gate. Now rejected on task_id overlap.
[FIXED] Confinement false-block under symlinked skillsDir (macOS /tmp→/private/tmp, Conductor worktrees) — canonicalize the nearest existing ancestor.

Plan Completion

Track A complete (T0 North Star, T1 load-bearing fixes, T2 ablation opts, T3 tests). Track B (real-LLM cat30/31/32/33 in gbrain-evals) is a separate repo/PR, intentionally not in this PR.

TODOS

Added 4 v0.42+ SkillOpt follow-ups (promoteCandidate DRY extraction, bundled-detection hardening, preflight ablation-opt awareness, maxRuntimeMin in held-out/final-test phases).

Test plan

bun run typecheck clean
bun run verify 29/29
Targeted suite 355 pass / 0 fail (skillopt + operations trust-boundary + merge seam + core)
Full skillopt dir 211 pass
Full sharded unit matrix — runs on CI (slow locally)

🤖 Generated with Claude Code

Documentation

Docs synced for the v0.42.9.0 skillopt eval-readiness wave: skills/skill-optimizer/SKILL.md (bundled-skill held-out requirement + honest receipt fields), docs/guides/skillopt.md (--held-out flag + F11 gate + D16 row), docs/tutorials/improving-skills-with-skillopt.md (Step 5 bundled command now shows --allow-mutate-bundled --held-out). CHANGELOG/CLAUDE.md/TODOS/llms updated in the release commit.

Also in this PR: CLAUDE.md thin-resolver restructure (folded in)

Separate concern from skillopt, folded into this batch as atomic, bisect-friendly commits
(75992b77, c825ef8f, 163f044e, fa2f9de2).

Problem: CLAUDE.md had grown to 591,854 bytes (~147k tokens auto-loaded every session,
~77% of the llms-full.txt one-fetch bundle, which had just blown its 750KB budget). Root cause
was structural: the per-file index + command/test sections were append-only by mandate, so every
release chained another **vX.Y.Z:** clause forever.

Fix: CLAUDE.md becomes a thin orientation + resolver (gbrain's own thin-dispatcher/fat-detail
pattern). The per-file index, thin-client routing, test discipline, and the verbose release
process move to on-demand docs; CLAUDE.md keeps the North Star, architecture + cross-cutting
invariants, the IRON RULES, and a reference map that routes to detail.

Commit	What
`75992b77`	CI-cache prerequisite — `ci-cache-hash.sh` keeps relocated policy docs (`docs/TESTING.md`, `docs/RELEASING.md`) test-affecting so a change to them still invalidates the cache (closes a false-pass path before any policy moved). Pinned by tests.
`c825ef8f`	Verbatim relocation → `docs/architecture/KEY_FILES.md`, `docs/architecture/thin-client.md`, `docs/TESTING.md`; resolver + cross-cutting invariants lifted into CLAUDE.md. Content-preserving.
`163f044e`	Compress reference docs to current-state (393 entries, every invariant preserved) + `scripts/check-key-files-current-state.sh` recurrence guard (bans `**v0.` chains + CLAUDE.md size cap, wired into `verify`) + content-contract tests + revert the bundle band-aid.
`fa2f9de2`	Verbose release process → `docs/RELEASING.md`; ship IRON RULES + version-locations table kept inline.

Result: CLAUDE.md 591,854 → 39,181 bytes (93%), llms-full.txt 740KB → 204KB.
Zero src/ changes. The bloat cannot recur — verify fails on re-introduced append-history or
an over-cap CLAUDE.md.

Reviews: eng-review (6 findings, folded) + codex outside-voice (11 findings, 9 folded as
mandatory hardening incl. the CI-cache fix, content-contract tests, measured bundle).

Verification (docs delta is zero-src): bun run verify 30/30; build-llms drift+budget,
doc-history guard, ci-cache contract = 46/46; targeted doc-touching suite (public-exports,
resolver, skill-trigger-index) 199/0.

…on opts Wire the F11 held-out gate into the orchestrator at checkpoint acceptance (runHeldOutGate was dead code); parse + thread --held-out through CLI, batch, fleet, background job, and the run_skillopt MCP op. Populate the real receipt.baseline_sel_score (was hardcoded 0) and add a final-test eval (test_score + baseline_test_score) via a shared scoreSkillOnTasks primitive. Fix the --no-mutate proposed.md write (was a stub) and enforce maxRuntimeMin. D16 ENFORCE in core mutation policy (assertBundledMutationHeldOut): mutating a bundled skill in place requires a non-empty (>=5), benchmark-disjoint held-out set or hard-refuses. Add three eval-internal ablation opts (reflectMode, disableValidationGate, optimizerMode='one-shot-rewrite') recorded in the receipt + audit; ROLLOUT_SUCCESS_THRESHOLD named constant. Security: run_skillopt MCP op validates skill_name (kebab-only) and confines caller-supplied benchmark/held-out paths to the skills dir for remote callers.

…eceipt honesty New test/skillopt/rollout.test.ts (rollout had zero coverage). Held-out ENFORCE unit cases + one-shot-rewrite fence handling (whole-response unwrap, embedded-fence preserved, error path). E2E: F11 held-out BLOCKS/ALLOWS, bundled no-mutate write, reflectMode/disableValidationGate/optimizerMode, maxRuntimeMin abort, receipt baseline/test-score honesty, held-out/benchmark disjointness, D2 no-DB-pollution.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…for v0.42.9.0 Wire --held-out into the skill-optimizer SKILL.md, guide flags/safety tables, and the tutorial's bundled-skill step: mutating a bundled skill in place now requires --allow-mutate-bundled AND --held-out (>=5 benchmark-disjoint tasks) or it hard-refuses. Add the --held-out flag row + F11 held-out gate to the guide; update the receipt contract to the honest baseline/test-score fields. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…val-explainer # Conflicts: # CHANGELOG.md # TODOS.md # VERSION # package.json

…again The ai@6.x bump tightened ModelMessage + tool-schema validation, which silently broke every multi-turn tool loop. Both `gbrain skillopt` rollouts and production background `subagent` jobs route through `chat()`/`toolLoop` and crashed the moment the model called a tool ("messages do not match the ModelMessage[] schema" / "schema is not a function"). Surfaced end-to-end by the SkillOpt real-LLM eval. Three fixes: - chat(): wrap tool defs with the SDK's `jsonSchema()` helper instead of a bare `{jsonSchema}` object (v6 asSchema() treated the bare object as a thunk and threw). - chat(): new exported pure `toModelMessages()` converts gbrain's provider-neutral ChatMessage[] into v6 ModelMessage[] — tool results ride a dedicated `role:'tool'` message with structured `{type,value}` output; null output preserved as json null. Load-bearing for the production subagent path, not just skillopt. - rollout.ts: replace the inline params→schema mapper (dropped `items` on array params) with the shared `paramDefToSchema` single source of truth. Pinned by test/gateway-model-messages.test.ts (8 cases). Folds into the open v0.42.9.0 PR (#1759) — these complete the eval-readiness wave by making skillopt actually run against a live model. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…out 0 Surfaced by the SkillOpt real-LLM eval (Track B). Two coupled bugs that made a budget-capped Haiku run report a vacuous "0/N" measurement in ~2ms with zero LLM calls — indistinguishable from a real deficient-skill score: 1. Claude Haiku 4.5's canonical dateless id (`claude-haiku-4-5`) was missing from anthropic-pricing.ts (only the dated `-20251001` was present). With `--max-cost` set, BudgetTracker.reserve() threw no_pricing on the FIRST chat() of every rollout. Added the dateless entry (sonnet already had its dateless form). 2. runValidationGate swallowed that BUDGET_EXHAUSTED error — runWithLimit settled it as {ok:false}, which the gate turned into median:0. A pricing/cap crash became a fake score. The gate now scans settled results for isMustAbortError() and re-throws so the caller aborts loudly; ordinary (non-abort) rollout errors still fail-open to 0 (judge-hiccup posture kept). Pinned by test/skillopt/validate-gate-abort.test.ts (3 cases). Folds into the open v0.42.9.0 PR (#1759). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…om full bundle The toolLoop + budget bug-fix annotations grew CLAUDE.md, pushing llms-full.txt to 756KB over the 750KB FULL_SIZE_BUDGET (the `build-llms > size budget` test failed, failing the `test` CI job). CLAUDE.md stays inlined by design (it's the point of the one-fetch bundle), so per the budget comment's own guidance ("ship with includeInFull=false exclusions") this excludes docs/what-schemas-unlock.md (15.4KB value-explainer, not load-bearing operational reference) from llms-full.txt; it stays linked in llms.txt. Bundle now 740KB with ~9KB headroom. No budget bump — 750KB is near the ~190k-token-context fit ceiling. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs/**/*.md is deny-listed from the CI cache hash (test-irrelevant). The CLAUDE.md restructure moves test/release POLICY into docs/TESTING.md + docs/RELEASING.md, which DO carry contracts the test suite reads. Without re-admitting them, a policy-only edit would produce the same cache hash and skip the test shard that runs the build-llms + doc-history guards (false-pass). Adds an ALLOW_PATTERNS re-admit step after the deny, scoped to the named policy docs (not a blanket docs un-deny). Lands FIRST, before any doc moves. Pinned by 3 new cases in test/scripts/ci-cache-hash.test.ts: TESTING.md + RELEASING.md edits MUST change the hash; docs/guide.md still must not. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…UDE.md (verbatim) CLAUDE.md had grown to 592KB / ~147k tokens auto-loaded every session (~77% of the llms-full.txt single-fetch bundle). The per-file index was append-only by mandate. This is the exact thin-dispatcher-vs-fat-blob anti-pattern gbrain exists to fix, so CLAUDE.md becomes a thin orientation + resolver that points at on-demand docs. This commit is the VERBATIM move (content-preserving — the next commit compresses): - docs/architecture/KEY_FILES.md <- ## Key files + the calibration key-files cluster + Schema Cathedral v3 impl detail - docs/architecture/thin-client.md <- ## Thin-client routing - docs/TESTING.md <- ## Testing - ## Commands DROPPED (18 'added in vX.Y' history blocks; current surface is gbrain 0.41.38.0 -- personal knowledge brain USAGE gbrain <command> [options] SETUP init [--pglite|--supabase|--url] Create brain (PGLite default, no server) migrate --to <supabase|pglite> Transfer brain between engines upgrade Self-update check-update [--json] Check for new versions doctor [--json] [--fast] Health check (resolver, skills, pgvector, RLS, embeddings) integrations [subcommand] Manage integration recipes (senses + reflexes) PAGES get <slug> Read a page put <slug> [< file.md] Write/update a page delete <slug> Delete a page list [--type T] [--tag T] [-n N] List pages SEARCH search <query> Keyword search (tsvector) query <question> [--no-expand] Hybrid search (RRF + expansion) ask <question> [--no-expand] Alias for query IMPORT/EXPORT import <dir> [--no-embed] Import markdown directory sync [--repo <path>] [flags] Git-to-brain incremental sync sync --watch [--interval N] Continuous sync (loops until stopped) sync --install-cron Install persistent sync daemon export [--dir ./out/] Export to markdown export --restore-only [--repo <p>] Restore missing supabase-only files [--type T] [--slug-prefix S] With optional filters FILES files list [slug] List stored files files upload <file> --page <slug> Upload file to storage files upload-raw <file> --page <s> Smart upload (size routing + .redirect.yaml) files signed-url <path> Generate signed URL (1-hour) files sync <dir> Bulk upload directory files verify Verify all uploads EMBEDDINGS embed [<slug>|--all|--stale] Generate/refresh embeddings LINKS link <from> <to> [--type T] Create typed link unlink <from> <to> Remove link backlinks <slug> Incoming links graph <slug> [--depth N] Traverse link graph (returns nodes) graph-query <slug> [--type T] Edge-based traversal with type/direction filters [--depth N] [--direction in|out|both] TAGS tags <slug> List tags tag <slug> <tag> Add tag untag <slug> <tag> Remove tag TIMELINE timeline [<slug>] View timeline timeline-add <slug> <date> <text> Add timeline entry TOOLS extract <links|timeline|all> Extract links/timeline (idempotent) [--source fs|db] fs (default) walks .md files; db iterates engine pages [--dir <brain>] brain dir for fs source [--type T] [--since DATE] filters (db source) [--dry-run] [--json] publish <page.md> [--password] Shareable HTML (strips private data, optional AES-256) check-backlinks <check|fix> [dir] Find/fix missing back-links across brain lint <dir|file> [--fix] Catch LLM artifacts, placeholder dates, bad frontmatter orphans [--json] [--count] Find pages with no inbound wikilinks salience [--days N] [--kind P] v0.29: pages ranked by emotional + activity salience anomalies [--since D] [--sigma N] v0.29: cohort-based statistical anomalies (tag, type) transcripts recent [--days N] v0.29: recent raw .txt transcripts (local-only) dream [--dry-run] [--json] Run the overnight maintenance cycle once (cron-friendly). See also: autopilot --install (continuous daemon). check-resolvable [--json] [--fix] Validate skill tree (reachability/MECE/DRY) report --type <name> --content ... Save timestamped report to brain/reports/ BRAIN (capture / ideate / explore — v0.37/v0.38) capture [content] [--file PATH] Single entrypoint for getting content into the brain [--stdin] [--slug s] [--type t] Inline content / file / stdin; writes to inbox/ by default [--source ID] [--quiet|--json] Multi-source brains: route to a non-default source brainstorm <question> [--json] Bisociation idea generator (hybrid search + far-set + judge) [--save|--no-save] [--limit N] lsd <question> [--json] Lateral Synaptic Drift: inverted-judge brainstorm [--save|--no-save] [--limit N] rewarding far-from-obvious + axiomatic inversions SOURCES (multi-repo / multi-brain) sources list Show registered sources sources add <id> --path <p> Register a source (id = short name, e.g. 'wiki') sources remove <id> Remove a source + its pages sync --all Sync all sources with a local_path sync --source <id> Sync one specific source repos ... DEPRECATED alias for 'sources' (v0.19.0) CODE INDEXING (v0.19.0 / v0.20.0 Cathedral II) code-def <symbol> [--lang l] Find the definition of a symbol across code pages code-refs <symbol> [--lang l] Find all references to a symbol (JSON-first) code-callers <symbol> Who calls this symbol? (v0.20.0 A1) code-callees <symbol> What does this symbol call? (v0.20.0 A1) query <q> --lang <l> Filter hybrid search to one language (v0.20.0) query <q> --symbol-kind <k> Filter to symbol type (function|class|method|...) (v0.20.0) reconcile-links [--dry-run] Batch-recompute doc↔impl edges (v0.20.0) reindex-code [--source id] [--yes] Explicit code-page reindex (v0.20.0) sync --strategy code Sync code files into the brain JOBS (Minions) jobs submit <name> [--params JSON] Submit background job [--follow] [--dry-run] jobs list [--status S] [--limit N] List jobs jobs get <id> Job details + history jobs cancel <id> Cancel job jobs retry <id> Re-queue failed/dead job jobs prune [--older-than 30d] Clean old jobs jobs stats Job health dashboard jobs work [--queue Q] Start worker daemon (Postgres only) ADMIN stats Brain statistics health Brain health dashboard history <slug> Page version history revert <slug> <version-id> Revert to version features [--json] [--auto-fix] Scan usage + recommend unused features autopilot [--repo] [--interval N] Self-maintaining brain daemon config [show|get|set] <key> [val] Brain config storage status [--repo <path>] Storage tier status and health [--json] (git-tracked vs supabase-only) serve MCP server (stdio) serve --http [--port N] HTTP MCP server with OAuth 2.1 --token-ttl N Access token TTL in seconds (default: 3600) --enable-dcr Enable Dynamic Client Registration --public-url URL Public issuer URL (required behind proxy/tunnel) call <tool> '<json>' Raw tool invocation version Version info --tools-json Tool discovery (JSON) Run gbrain <command> --help for command-specific help. + the per-command KEY_FILES entries; content stays in git) CLAUDE.md gains: a Reference map (resolver), a Maintaining section (the anti-disease rule), and a Cross-cutting invariants subsection under Architecture so the must-never-violate rules (trust fail-closed, sourceScopeOpts isolation, JSONB trap, engine parity, contract-first, migrations, multi-source) still auto-load after the index moved out. Result: CLAUDE.md 592KB -> 61KB; llms-full.txt 740KB -> 210KB (new docs link-only until compressed). build-llms drift + budget test green; verify 29/29 green. The pre-move content is recoverable at git show <this^>:CLAUDE.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ence guard Compresses the verbatim-relocated reference docs from append-only release-history to current-state-only (the disease cure), then makes recurrence structurally impossible via a CI guard. Compression (fan-out subagents + adversarial verify, audited mechanically): - KEY_FILES.md 453KB -> 356KB; TESTING.md 42KB -> 38KB; thin-client.md already clean. - 393/393 entries preserved; every src/test/scripts path from the verbatim original survives (mechanical comm-check); zero bolded **v0. markers remain. - Conservative ratio (~22%) because the content is invariant-dense — correctness over brevity. Dropped: **vX.Y.Z (#NNN):** clauses, codex/review tags, contributor credits, PR-numbers-as-ids, pre-fix/then/was-now history deltas. Kept: every exported symbol, invariant, and Pinned-by reference. Verbatim original recoverable at git show <relocation-commit>:docs/architecture/KEY_FILES.md. Recurrence guard (scripts/check-key-files-current-state.sh, wired into verify + check:all): - HARD: bans the bolded **v0.<digit> marker in the reference docs (scoped — plain 'as of pgvector 0.7' prose is fine, no false positives). - HARD: CLAUDE.md size cap (90KB; currently 61KB) — the structural backstop. - Pinned by test/scripts/check-key-files-current-state.test.ts (7 cases). Content contracts (test/build-llms.test.ts, +5 cases per codex outside-voice): CLAUDE.md keeps inline ship IRON RULES (version format, document-release, never-hand-roll); AGENTS.md keeps its boot order; llms indexes the new docs; KEY_FILES stays link-only (not inlined). Privacy: scrubbed the relocated 'wintermute/chat/' source-boost examples + the literal harvest-lint regex to generic placeholders (legitimate in allowlisted CLAUDE.md; genericized for the new public docs per the privacy rule). Reverts the 284c50a band-aid: re-inlines docs/what-schemas-unlock.md now that the restructure freed ~530KB of bundle headroom (llms-full.txt 740KB -> 225KB). verify 30/30 green (incl. new check:doc-history). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The highest-/ship-risk commit (isolated so it can revert alone). Moves the verbose release + contributor procedure out of CLAUDE.md, keeping every ship-critical IRON RULE inline so /ship + /document-release (which read CLAUDE.md) cannot regress. Moved to docs/RELEASING.md: pre-ship test requirements; the CHANGELOG-branch-scoped + CHANGELOG voice + release-summary template; the 'To take advantage of vX' block spec; version migrations + migration-is-canonical; schema state tracking; GitHub Actions SHA maintenance; PR-descriptions-cover-the-branch; community-PR-wave; checking-out-PRs-from-garrytan-agents. Kept INLINE in CLAUDE.md (ship-critical IRON RULES — do NOT move): - the Version-locations table (5-file sync) + the 3-line consistency audit - Conductor branch=workspace - Post-ship /document-release (MANDATORY) - Privacy + Responsible-disclosure rules (Privacy also anchors the check-privacy allowlist — the only place allowed to name the fork) - PR-title-version-first - never-hand-roll-ship (Skill routing) Plus a new ## Releasing pointer ('Before any ship, read docs/RELEASING.md in full') and a resolver row. CLAUDE.md 61KB -> 39KB (592KB -> 39KB overall, 93% cut; ~9k tokens auto-loaded vs ~147k). CLAUDE.md size-gate tightened 90KB -> 60KB. The content-contract tests pin that the inline IRON RULES (MAJOR.MINOR.PATCH.MICRO, document-release, hand-roll ship) did NOT move out. The moved ranges carry no banned fork name, so RELEASING.md needs no privacy allowlist entry. verify 30/30; bundle 225KB -> 204KB. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The CLAUDE.md thin-resolver restructure (592KB → 39KB) rides in this release; record it under the existing v0.42.9.0 For-contributors section. No version bump — v0.42.9.0 is unreleased and already allocated to this PR.

…grep The policy-doc re-admit (75992b7) put `\t` inline in the ALLOW patterns passed to `grep -E`. BSD grep (macOS local) treats `\t` as a tab so it worked locally; GNU grep (Ubuntu CI) treats it as literal `t`, so nothing re-admitted and docs/TESTING.md / docs/RELEASING.md stayed deny-listed — the two policy-doc tests failed on CI shard 6 (1097 pass / 2 fail). Build ALLOW_RE with `printf '\t(%s)'` so the tab is a real byte, identical in construction to DENY_RE (line 117), which the CI log shows matches correctly on GNU grep. End-to-end: editing docs/TESTING.md now flips the hash; a normal docs/*.md add still does not (deny stays scoped).

Surfaced by the SkillOpt real-LLM eval (Track B). The reflect step was shown only a pass/fail score and the agent transcript — never WHAT the benchmark judge rewards. On a skill judged by structure (e.g. "must include a Confidence: line") the optimizer proposed plausible-but-off edits ("close with a synthesis") that never satisfied the literal check; every candidate scored 0 on D_sel, the validation gate rejected them all, and the skill text never changed (optimized === baseline === 0). Fix: render each benchmark Judge (rule checks / llm rubric / qrels) into plain-English criteria via new exported describeJudge / describeJudges, and thread them into the reflect prompt (a SUCCESS CRITERIA block) for both the loop reflect calls and the one-shot-rewrite path. The orchestrator computes the distinct criteria across train+sel+test once. The optimizer system prompt now instructs it to satisfy the criteria through genuine content, never empty keywords — reward-hacking stays defended by the independent held-out gate (cat32 confirms the gate catches a keyword-stuffing hack). End-to-end this took a deficient skill from 0.00 to 1.00 on a held-out set it never trained on. Pinned by test/skillopt/reflect.test.ts (describeJudge per kind, describeJudges dedup, criteria present/absent in the prompt). Folds into the open v0.42.9.0 PR (#1759). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Resolve version collision: master shipped v0.42.10.0 (wikilink global-basename) while the skillopt-eval-explainer work was in-branch as v0.42.9.0. Bump the wave to v0.42.11.0 (strictly greater than master) per the version-locations IRON RULE; keep both CHANGELOG entries. CLAUDE.md: keep the branch's restructured thin orientation; master's only change was a per-file-index annotation for the wikilink feature, which this branch deliberately moved out of CLAUDE.md. Carried that documentation into docs/architecture/KEY_FILES.md as a current-state link-extraction.ts entry so nothing is lost. Regenerated llms.txt / llms-full.txt via build:llms. Version trio audited (VERSION = package.json = CHANGELOG = 0.42.11.0); typecheck + current-state guard green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* upstream/master: v0.42.23.0 feat(jobs): --nice scheduling-priority flag for jobs work/supervisor (garrytan#1815) (garrytan#1820) v0.42.22.0 fix(minions): supervisor progress watchdog + worker DB self-defense — alive-but-wedged worker self-heals (garrytan#1801) (garrytan#1824) v0.42.21.0 fix(postgres): module-singleton ownership — canonical landing for the dream-cycle "connect() has not been called" class (garrytan#1404/garrytan#1471/garrytan#1619) (garrytan#1805) v0.42.20.0 fix: reliability wave — PGLite capture lock-pin + Postgres reconnect race + search embed-hang (garrytan#1762 garrytan#1745 garrytan#1775) (garrytan#1810) v0.42.19.0 fix(skillopt): close the last gap in the AI SDK v6 tool-loop fix (write-capture mapper + regression test) (garrytan#1809) v0.42.18.0 fix: sync orphan-pileup watchdog (garrytan#1633) + links-lag µs stamp (garrytan#1768) (garrytan#1807) v0.42.17.0 fix(sync): resumable incremental sync — killed mid-import no longer loses progress (garrytan#1794) (garrytan#1808) v0.42.16.0 feat(doctor): brain health as a solved problem — cause-ranked doctor + OOM-loop line + auto-drain + pool-reap (garrytan#1685) (garrytan#1802) v0.42.15.0 fix: decouple CLI primary output from process.stdout.isTTY (garrytan#1784) (garrytan#1806) v0.42.14.0 fix(zero-config): code-* readiness signal + init embedding-key validation + lock self-heal (garrytan#1780) (garrytan#1804) v0.42.13.0 fix(search): archive/ content findable by default, demoted not hard-excluded (garrytan#1777) (garrytan#1797) v0.42.12.0 feat: self-upgrading gbrain — invocation-riding update check + opt-in auto-upgrade (garrytan#1798) v0.42.11.0 feat(skillopt): held-out eval gate, honest receipts, ENFORCE + ablation opts (garrytan#1759) v0.42.10.0 feat(extract): opt-in global-basename wikilink resolution (closes garrytan#972) (garrytan#1388)

garrytan and others added 16 commits June 1, 2026 20:37

chore: bump version and changelog (v0.42.9.0)

fb5c493

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/master' into garrytan/skillopt-e…

029178f

…val-explainer # Conflicts: # CHANGELOG.md # TODOS.md # VERSION # package.json

docs(changelog): note CLAUDE.md restructure in v0.42.9.0

5da5235

The CLAUDE.md thin-resolver restructure (592KB → 39KB) rides in this release; record it under the existing v0.42.9.0 For-contributors section. No version bump — v0.42.9.0 is unreleased and already allocated to this PR.

garrytan changed the title ~~v0.42.9.0 feat(skillopt): held-out eval gate, honest receipts, ENFORCE + ablation opts~~ v0.42.11.0 feat(skillopt): held-out eval gate, honest receipts, ENFORCE + ablation opts Jun 3, 2026

garrytan merged commit d4211f4 into master Jun 3, 2026
21 checks passed

garrytan mentioned this pull request Jun 3, 2026

v0.42.19.0 fix(skillopt): close the last gap in the AI SDK v6 tool-loop fix (write-capture mapper + regression test) #1809

Merged

dcarolan1 mentioned this pull request Jun 5, 2026

multi-turn subagent loop drops tool-result user message → "Tool result is missing for tool call" with DeepSeek (and some other providers) #1886

Open

felix1904 mentioned this pull request Jun 10, 2026

gateway toolLoop: no resume reconciliation — crash mid-tool-round leaves a dangling tool-call tail that dead-letters every retry (ModelMessage[] schema / missing tool results) #2039

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.42.11.0 feat(skillopt): held-out eval gate, honest receipts, ENFORCE + ablation opts#1759

v0.42.11.0 feat(skillopt): held-out eval gate, honest receipts, ENFORCE + ablation opts#1759
garrytan merged 16 commits into
masterfrom
garrytan/skillopt-eval-explainer

garrytan commented Jun 2, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

garrytan commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test Coverage

Pre-Landing Review

Adversarial Review

Plan Completion

TODOS

Test plan

Documentation

Also in this PR: CLAUDE.md thin-resolver restructure (folded in)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

garrytan commented Jun 2, 2026 •

edited

Loading