v0.28.4 feat(skillpack): enhance skillify with cross-modal eval quality gate #674
Merged
garrytan merged 9 commits into garrytan:master on May 7, 2026
Updates skillify from v1.0.0 to v2.0.0 with the key innovation: cross-modal evaluation runs BEFORE tests (step 3) to establish quality, then tests lock in the proven-good behavior.

Key changes:
- 11-item checklist (was 10): adds cross-modal eval as step 3
- Cross-modal eval uses 3 models to score output on 5 dimensions
- Quality gate: all dimensions ≥ 7 average before proceeding to tests
- Prevents a tests-first approach from locking in mediocre output
- References cross-modal-review skill for eval pipeline
- Updated all gbrain-specific paths (bun test, scripts/*.ts)
- Maintains compatibility with gbrain check-resolvable workflow

The meta-skill for turning raw features into properly-skilled, tested, resolvable capabilities. Cross-modal eval ensures output quality before tests cement the behavior.
Applied top improvements from GPT-5.5 + Opus 4-7 + DeepSeek V4 Pro:
- Named 3 frontier models explicitly with provider table
- Inlined eval prompt template with CONTEXT param + scoring calibration
- Defined aggregation math: mean >= 7 AND no single dim < 5
- Added eval receipt JSON schema
- Structured 3-cycle fix loop with before/after delta tracking
- Added worked example (summarize-pr, end-to-end)
- Added cost guardrails (skip < 200 tokens, max 9 API calls)
- Added representative input selection rule
- Added SKILL.md frontmatter template (copy-paste ready)
- Added Phase 0 decision gate (is this worth skillifying?)

Also includes cross-modal-eval runner recipe with robust JSON parsing for LLMs that return malformed JSON (3-tier repair).
Superseded by `gbrain eval cross-modal` (next commit). The .mjs script was the original PR's hand-rolled provider stack; the replacement reuses src/core/ai/gateway.ts so config/auth/model-aliasing comes from the canonical recipe registry instead of a parallel stack. No code references the .mjs (it was invoked by skill prose only), so this delete is independently safe to bisect through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure-logic foundation for the new `gbrain eval cross-modal` command
(wired in the next commit). All five modules are self-contained — no
CLI surface, no I/O outside the receipt writer's mkdirSync. Imported
from src/core/ai/gateway.ts at runtime via gwChat (no config impact
at load time).
Modules:
- json-repair.ts: parseModelJSON 4-strategy fallback chain.
Adversarial nuclear-option throws rather than
fabricating scores (Q6 + Q3 in plan).
- aggregate.ts: verdict logic. PASS = (>=2 successes) AND
(every dim mean >= 7) AND (every dim min
across models >= 5). INCONCLUSIVE when <2/3
models returned parseable scores — closes the
v1 .mjs `Object.values({}).every(...) === true`
empty-array silent-PASS bug (Q2 + Q3).
- receipt-name.ts: receipt filename binds (slug, sha8 of SKILL.md)
so `gbrain skillify check` can detect stale
audits (T10 in plan).
- receipt-write.ts: thin wrapper over writeFileSync that auto-mkdirs
the parent directory. Standalone module because
gbrainPath() does NOT auto-mkdir (T5 plan
correction — Codex caught this).
- runner.ts: orchestrator. Promise.allSettled across 3 slots
per cycle; up to 3 cycles; stops early on PASS
or INCONCLUSIVE. Default slots: openai:gpt-4o /
anthropic:claude-opus-4-7 / google:gemini-1.5-pro.
estimateCost() exports a small per-model
pricing table (drifts; refresh alongside
model-family bumps).
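
The verdict rule described for aggregate.ts can be sketched as below. This is a minimal illustration, not the module's actual code: the `ModelScores` and `Verdict` types, the `aggregate` name, and the convention of passing `null` for a model whose output could not be parsed are all assumptions. The success count is checked first, because a per-dimension check over zero results is vacuously true (`[].every(...)` returns `true`), which is the shape of the silent-PASS bug mentioned above.

```typescript
// Hypothetical types for illustration; the real aggregate.ts may differ.
type ModelScores = Record<string, number>; // dimension name -> score
type Verdict = "PASS" | "FAIL" | "INCONCLUSIVE";

// `results` holds one entry per model; null marks an unparseable response.
function aggregate(results: Array<ModelScores | null>): Verdict {
  const ok = results.filter((r): r is ModelScores => r !== null);

  // Guard before any per-dimension check: with fewer than two successes
  // there is not enough signal, and an empty list would pass every() trivially.
  if (ok.length < 2) return "INCONCLUSIVE";

  // Assumption: every successful model reports the same dimensions.
  const dims = Object.keys(ok[0] ?? {});
  const passes = dims.every((dim) => {
    const values = ok.map((r) => r[dim] ?? 0);
    const mean = values.reduce((a, b) => a + b, 0) / values.length;
    // Mean threshold plus a floor on the lowest single-model score.
    return mean >= 7 && Math.min(...values) >= 5;
  });
  return passes ? "PASS" : "FAIL";
}
```

Under these assumptions, three `null` results give INCONCLUSIVE rather than PASS, and a dimension scored 10, 10, 4 fails on the floor even though its mean is 8.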
Tests (32 cases total, all green):
- json-repair.test.ts: 10 cases (clean JSON, fences, trailing
commas, single quotes, embedded newlines,
mismatched braces, nuclear-option success
+ adversarial throws, empty input,
numeric-shorthand scores).
- aggregate.test.ts: 8 cases pinning Q2/Q3/dedup. The 0-of-3
INCONCLUSIVE case is the regression guard
for the v1 silent-PASS bug.
- cli.test.ts: 12 cases on receipt-name / receipt-write /
GBRAIN_HOME isolation. Uses withEnv()
helper for env mutation (R1 isolation rule).
Verifies bisect-clean: typecheck passes, all 32 unit cases green.
The runner.ts import of gateway.chat() is dead until commit 3 wires
the CLI surface.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
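
The json-repair idea in the commit above (try progressively more lenient parses, then throw instead of inventing scores) can be illustrated with a short sketch. The function name `parseLenient` and the specific repair steps are assumptions chosen for illustration; the source does not spell out which four strategies `parseModelJSON` uses, and this sketch does not handle single quotes or embedded newlines.

```typescript
// Hypothetical sketch of a fallback chain; not the real parseModelJSON.
function parseLenient(raw: string): unknown {
  const trimmed = raw.trim();

  // Strip a leading and trailing markdown code fence, if present.
  const unfenced = trimmed
    .replace(/^`{3}(?:json)?\s*/i, "")
    .replace(/`{3}$/, "")
    .trim();

  // Keep only the outermost {...} span to drop surrounding chatter.
  const start = unfenced.indexOf("{");
  const end = unfenced.lastIndexOf("}");
  const sliced = start >= 0 && end > start ? unfenced.slice(start, end + 1) : unfenced;

  // Remove trailing commas before a closing brace or bracket.
  // (Naive: this would also touch such sequences inside string values.)
  const noTrailingCommas = sliced.replace(/,\s*([}\]])/g, "$1");

  for (const candidate of [trimmed, unfenced, sliced, noTrailingCommas]) {
    try {
      return JSON.parse(candidate);
    } catch {
      // Try the next, more aggressive candidate.
    }
  }
  // Fail loudly rather than fabricate a result.
  throw new Error("could not parse model output as JSON");
}
```

The important design point is the last line: when every repair fails, the caller sees an error and can count the model as a failed slot, so a garbage response never turns into a score.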
User-facing surface for the multi-model quality gate. Three different-
provider frontier models score the OUTPUT against the TASK on a 5-dim
rubric. Verdict drives exit code: 0 PASS, 1 FAIL, 2 INCONCLUSIVE
(<2/3 models returned parseable scores per Q3 in plan).
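
For scripts that wrap the command, the verdict-to-exit-code contract above amounts to a small lookup. The sketch below uses hypothetical names (`GateVerdict`, `EXIT_CODE`, `verdictFromExitCode`); only the three code values come from the description.

```typescript
type GateVerdict = "PASS" | "FAIL" | "INCONCLUSIVE";

// Exit codes as stated in the commit message: 0 PASS, 1 FAIL, 2 INCONCLUSIVE.
const EXIT_CODE: Record<GateVerdict, number> = {
  PASS: 0,
  FAIL: 1,
  INCONCLUSIVE: 2,
};

// Reverse lookup for a wrapper script; returns undefined for any other code.
function verdictFromExitCode(code: number): GateVerdict | undefined {
  const entries = Object.entries(EXIT_CODE) as Array<[GateVerdict, number]>;
  const match = entries.find(([, value]) => value === code);
  return match?.[0];
}
```

Keeping INCONCLUSIVE distinct from FAIL lets a caller retry on provider outages without treating them as a quality regression.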
Wiring touches three files:
- src/commands/eval-cross-modal.ts (new, ~290 lines)
CLI handler. Self-configures the AI gateway from loadConfig() +
process.env so it works without `gbrain init` (the cli.ts no-DB
branch bypasses connectEngine()). Defaults: cycles=3 in TTY,
cycles=1 in non-TTY (T11 partial cost guardrail — limits scripted
bulk spend; full --budget-usd hard cap is a v0.27.x TODO). Prints
estimated max-cost-per-cycle to stderr before each run. Uses
gbrainPath('eval-receipts') for receipt directory.
- src/cli.ts (no-DB dispatch branch, 5-line addition)
Special-cases `eval cross-modal` BEFORE the existing
handleCliOnly path that requires connectEngine(). Mirrors the
`dream` no-DB pattern but doesn't even attempt the connect — the
command never touches the DB. New users can run the gate before
`gbrain init` (T3 in plan).
- src/commands/eval.ts (sub-subcommand dispatch)
Adds `cross-modal` alongside `export`/`prune`/`replay`. The
cli.ts branch takes precedence in the user-facing path; this
branch only fires when callers re-enter runEvalCommand with an
existing engine. Engine is intentionally unused — the handler
self-routes.
- test/e2e/cross-modal-eval.test.ts (new, 4 cases)
Mocked-fetch E2E. Lives at test/e2e/* (NOT *.serial.test.ts) per
plan T8: test/e2e/* is exempt from the test-isolation lint and
already runs serially via scripts/run-e2e.sh, so the
mock.module() call doesn't need a quarantine rename. Cases:
PASS / FAIL (mean<7) / FAIL (min<5 — Q2 floor) / INCONCLUSIVE
(2 mock 5xx — Q3 contract).
The runner from commit 2 now has live callers. typecheck passes;
the 4 E2E cases all green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
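
The runner behavior described in these commits (all slots in parallel via `Promise.allSettled`, a bounded number of cycles, early stop on PASS or INCONCLUSIVE) can be sketched as follows. All names here (`Scorer`, `runCycles`, the `judge` callback) are hypothetical, and the sketch leaves out the fix step that the skill prose places between cycles; it only shows the control flow.

```typescript
type SlotScores = Record<string, number>;
type CycleVerdict = "PASS" | "FAIL" | "INCONCLUSIVE";
// Hypothetical: one async call to one model, resolving to its scores.
type Scorer = () => Promise<SlotScores>;

async function runCycles(
  slots: Scorer[],
  judge: (ok: SlotScores[]) => CycleVerdict,
  maxCycles: number,
): Promise<{ verdict: CycleVerdict; cyclesRun: number }> {
  let verdict: CycleVerdict = "INCONCLUSIVE";
  let cyclesRun = 0;

  while (cyclesRun < maxCycles) {
    cyclesRun += 1;
    // allSettled never rejects, so one failing provider cannot sink the cycle.
    const settled = await Promise.allSettled(slots.map((call) => call()));
    const ok = settled
      .filter((s): s is PromiseFulfilledResult<SlotScores> => s.status === "fulfilled")
      .map((s) => s.value);

    verdict = judge(ok);
    // Only a FAIL is worth another cycle; PASS and INCONCLUSIVE stop early.
    if (verdict !== "FAIL") break;
  }
  return { verdict, cyclesRun };
}
```

With `maxCycles` set to 1, as the non-TTY default does, the loop body runs at most once, which caps a scripted run at one call per slot.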
Promotes the skillify contract from 10 to 11 items. The 11th item
(cross-modal eval) is `required:false` per T7 in the plan — a
missing or stale receipt surfaces in the audit output but does not
fail the gate. Existing skills keep their current required-score;
the bump is additive, not breaking.
Changes:
- src/commands/skillify.ts
Header jsdoc updated 10-item -> 11-item. No code-flow changes.
- src/commands/skillify-check.ts (the per-skill audit; not
src/commands/skillpack-check.ts which is a different command —
plan T6 corrected the conflation in the original plan)
New informational item at position 11. Reuses
findReceiptForSkill() helper from
src/core/cross-modal-eval/receipt-name.ts to detect:
* found — receipt matches current SKILL.md sha-8
* stale — receipt exists for an older SKILL.md
* missing — no receipt yet
Audit output cases pass through to existing pretty/JSON formats.
- src/core/skillify/templates.ts
Scaffolded SKILL.md now includes a "Phase 3: Cross-modal eval
(informational)" section with copy-paste `gbrain eval cross-modal`
invocation, pass criteria, and receipt-naming convention. Helps
new skill authors discover the gate.
- test/skillify-scaffold.test.ts
New T9 case verifies the scaffold emits the Phase 3 section,
points at the correct command, documents the receipt path, and
appends exactly one resolver row. Replaces the original plan's
`gbrain skillify scaffold demo-eleven` shell verification (which
Codex caught as invalid + repo-mutating).
Verifies: typecheck passes; scaffold test 19/19 (was 18, +1 T9 case).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
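
The found / stale / missing distinction in the audit above follows from binding a receipt to a hash of the SKILL.md content. The sketch below shows one way that could work. It rests on labeled assumptions: that "sha-8" means the first 8 hex characters of a SHA-256 digest, and that receipts are named `<slug>-<sha8>.json`. Neither is confirmed by the source, and `receiptStatus` is a hypothetical stand-in, not the real `findReceiptForSkill()`.

```typescript
import { createHash } from "node:crypto";

type ReceiptStatus = "found" | "stale" | "missing";

// Assumption: sha-8 = first 8 hex chars of SHA-256 over the file content.
function sha8(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex").slice(0, 8);
}

// Assumption: hypothetical naming scheme `<slug>-<sha8>.json`.
function receiptName(slug: string, skillMd: string): string {
  return `${slug}-${sha8(skillMd)}.json`;
}

// `existing` is the list of file names in the receipt directory.
function receiptStatus(slug: string, skillMd: string, existing: string[]): ReceiptStatus {
  if (existing.includes(receiptName(slug, skillMd))) return "found";

  // Any other receipt for the same slug was written for different content.
  const prefix = `${slug}-`;
  const hasOlder = existing.some(
    (name) => name.startsWith(prefix) && /^[0-9a-f]{8}\.json$/.test(name.slice(prefix.length)),
  );
  return hasOlder ? "stale" : "missing";
}
```

Because the hash is part of the file name, editing SKILL.md invalidates the old receipt without any timestamp comparison: the expected name simply no longer exists.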
Documentation catches up with the new behavior shipped in commits 1-4.
- skills/skillify/SKILL.md (1.0.0 -> 1.1.0)
Full rewrite. Frontmatter version is additive (T7 in plan); the
11th item is informational, not breaking. Phase 3 now points at
`gbrain eval cross-modal` with copy-paste invocation, default
slot table, pass criteria, receipt-naming convention, cycles +
cost guardrails (T11 partial cap), provider configuration via
the AI gateway, and the cycle-1/2/3 fix loop. Adds Output Format
section (skills-conformance.test.ts requires it). Drops the
original `(or lib/cross-modal-eval.ts)` parenthetical (Q5 plan
correction — that path never existed).
- skills/cross-modal-review/SKILL.md
Adds 4-line Relationship section pointing at `gbrain eval
cross-modal` (D3 plan reciprocal). Distinguishes the manual
second-opinion gate (this skill) from the automated multi-model
score-and-iterate gate (the new command).
- CLAUDE.md
Key Files entries for src/commands/eval-cross-modal.ts and the
five new src/core/cross-modal-eval/* modules. Commands list
gains the `gbrain eval cross-modal` entry under v0.27.x. Notes
the non-TTY default 1-cycle behavior + the gbrainPath('eval-
receipts') resolution.
- TODOS.md
Four v0.27.x follow-ups filed under a new "cross-modal-eval"
section: full --budget-usd cap (T11 follow-up), subagent
integration (recovers cross-process rate-leases T4 deferred),
skill adoption telemetry (revisit T7=C with data after 30 days),
docs/cross-modal-eval.md user guide.
- llms-full.txt
Regenerated via `bun run build:llms` to match the CLAUDE.md
edits — sync guard at test/build-llms.test.ts requires this.
Verifies: typecheck passes; skills-conformance 199/199 green;
build-llms 7/7 green; full unit fast loop 3861/3861 green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-modal-eval

# Conflicts:
#	TODOS.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request on May 7, 2026
…x-wave

Conflicts resolved:
- VERSION: kept 0.28.5 (ahead of master's 0.28.4)
- package.json: kept 0.28.5
- CHANGELOG.md: kept v0.28.5 entry above master's v0.28.4 entry

Master added v0.28.4 (skillify cross-modal eval quality gate, #674) and a new src/commands/eval-cross-modal.ts. Orthogonal to this fix wave: no code-level conflicts. llms-full.txt and src/core/schema-embedded.ts regenerated post-merge. Typecheck clean.
garrytan added a commit that referenced this pull request on May 7, 2026
….28.6

Master shipped three v0.28.x patch releases without the takes feature while v0.28-release was in flight:
- v0.28.1: zombie process accumulation + health endpoint timeout (#637)
- v0.28.3: restart-sweep to detect dropped Telegram messages (#675)
- v0.28.4: skillify cross-modal eval quality gate (#674)

Master's v0.28.0 slot was consumed without the takes layer ever landing, so this release ships the original takes feature as v0.28.6 (skipping v0.28.5 to leave space for any in-flight master patches). The migration orchestrator file (v0_28_0.ts) and migration skill doc (skills/migrations/v0.28.0.md) keep their original version keys; those identify the migration version, not the release version.

Conflicts resolved:
- VERSION → 0.28.6 (was 0.28.0; master had 0.28.4)
- package.json → 0.28.6 (auto-merged ai-sdk deps from master's v0.27)
- CHANGELOG.md → renamed top entry "## [0.28.0]" → "## [0.28.6]" with date 2026-05-06; rebuilt the "To take advantage of" block (was truncated by stale === markers from a prior merge); preserved master's v0.28.4/v0.28.3/v0.28.1 entries beneath
- src/cli.ts auto-merged (CLI_ONLY has providers + takes/think both)

Verified post-merge:
- bun run verify: PASS (privacy + jsonb + progress + test-isolation + wasm + admin-build + typecheck)
- 133 tests pass: migrate + apply-migrations + takes-engine + takes-fence
- migrations v37 (takes) + v38 (access_tokens_permissions) apply cleanly on top of master's v35 (auto-RLS) + v36 (subagent persistence)
Summary
This PR went through plan-eng-review (14 decisions) + codex consult-mode (11 cross-model tensions) and was rewritten end-to-end on top of the original 2 commits. 25 decisions resolved across two review rounds. Plan file: `~/.claude/plans/radiant-napping-lerdorf.md`.

Commits on this branch (after rewrite):
- `cross-modal-eval.mjs` (superseded; had 3 critical bugs)
- `gbrain eval cross-modal` CLI subcommand (handler + cli.ts no-DB branch + eval.ts dispatch + 4 mocked-fetch E2E cases)
- (… `required:false`, additive not breaking)

The `gbrain eval cross-modal` command: three different-provider frontier models score the OUTPUT against the TASK on a 5-dim rubric. Verdict drives exit code: `0` PASS, `1` FAIL, `2` INCONCLUSIVE (<2/3 model successes). Reuses `src/core/ai/gateway.ts:chat()` so config/auth/aliasing comes from the canonical recipe registry. Bypasses `connectEngine()` via the cli.ts no-DB branch, so first-run users can run the gate before `gbrain init`. Receipts bind to a SHA-8 of the SKILL.md content so `gbrain skillify check` can detect stale audits.

Test Coverage
Tests: 3867 → 3903 (+36 new). Full unit fast loop 3903/3903 pass, RC=0.
Pre-Landing Review
plan-eng-review (round 1): 13 issues found, 0 critical gaps remaining; all 14 decisions resolved per-finding via AskUserQuestion. The thesis was right; the original `.mjs` had 3 critical correctness bugs (hardcoded `/data/.env`, all-models-fail returning silent PASS via `Object.values({}).every(...) === true`, missing min-score floor).

codex consult-mode (round 2): 11 cross-model tensions surfaced, 5 substantive plan errors caught that plan-eng-review missed:
- … `src/core/ai/gateway.ts`
- `gbrain eval` dispatch required `connectEngine()` so first-run users couldn't run the gate
- `rate-leases` helper requires a `minion_jobs.id` that a CLI eval doesn't have
- `gbrainPath` semantics were wrong (gbrainPath does NOT auto-mkdir; resolves to `<GBRAIN_HOME>/.gbrain/...`)
- … `skillify-check.ts` with `skillpack-check.ts` (different files)

All 11 codex tensions resolved per-finding. The plan is materially better for it.

Plus: the conformance test required an `## Output Format` section the SKILL.md rewrite dropped; caught and added.

Eval Results
No prompt-related files changed in this PR — evals skipped.
Plan Completion
25 decisions resolved (14 plan-eng-review + 11 codex tensions). 8 actions implemented:
- `gbrain eval cross-modal` command: `src/commands/eval-cross-modal.ts`, `src/core/cross-modal-eval/{5 modules}`, `src/cli.ts`, `src/commands/eval.ts`
- `recipes/cross-modal-eval/` removed
- `skills/cross-modal-review/SKILL.md`
- `skills/skillify/SKILL.md` (v1.1.0)

Verification Results
- `bun run typecheck`: clean (RC=0)
- `bun run check:{privacy,jsonb,progress,wasm,test-isolation}`: all clean
- `bun run test`: 3903/3903 pass, RC=0 (post-merge)

TODOS
4 v0.27.x+ follow-ups filed under a new `cross-modal-eval` section in `TODOS.md`:
- `--budget-usd` hard cap + per-call cost telemetry (P2). Full cost guardrail to complement the partial T11=B safety net (TTY-aware default cycles + cost-estimate print).
- Subagent integration: `gbrain eval cross-modal` to be invokable as a `gbrain agent run` child job to recover the cross-process rate-leases that T4=A explicitly deferred.
- Skill adoption telemetry (revisit T7=C with data after 30 days).
- `docs/cross-modal-eval.md` user guide (P3). Mirror `docs/eval-bench.md` precedent.

Documentation
- `CLAUDE.md`: Key Files entries for the new command + 5 core modules; Commands list updated under v0.27.x.
- `skills/skillify/SKILL.md`: full rewrite to v1.1.0 (informational 11th item, Phase 3 cross-modal eval section, Output Format, Anti-Patterns including correlated-blind-spot warning).
- `skills/cross-modal-review/SKILL.md`: Relationship section pointing at the new command.
- `TODOS.md`: 4 follow-ups filed.
- `llms-full.txt`: regenerated via `bun run build:llms`.

Test plan
- (… `OPENAI_API_KEY` + `ANTHROPIC_API_KEY` + `GOOGLE_GENERATIVE_AI_API_KEY` in shell; run after merge)

🤖 Generated with Claude Code