v0.28.4 feat(skillpack): enhance skillify with cross-modal eval quality gate#674

Merged
garrytan merged 9 commits into garrytan:master from garrytan-agents:feat/skillify-cross-modal-eval
May 7, 2026

Conversation

Contributor

garrytan-agents commented May 6, 2026

Summary

This PR went through plan-eng-review (14 decisions) + codex consult-mode (11 cross-model tensions) and was rewritten end-to-end on top of the original 2 commits. 25 decisions resolved across two review rounds. Plan file: ~/.claude/plans/radiant-napping-lerdorf.md.

Commits on this branch (after rewrite):

  • chore(recipes): remove cross-modal-eval.mjs (superseded — had 3 critical bugs)
  • feat(eval): cross-modal-eval core module + unit tests (5 new core files, 3 unit test files, 32 cases)
  • feat(eval): wire gbrain eval cross-modal CLI subcommand (handler + cli.ts no-DB branch + eval.ts dispatch + 4 mocked-fetch E2E cases)
  • feat(skillify): add informational 11th item (T7=C — required:false, additive not breaking)
  • docs: skillify SKILL.md v1.0.0 → 1.1.0, cross-modal-review Relationship section, CLAUDE.md key files, TODOS.md follow-ups
  • chore: bump version and changelog (v0.28.4)

The gbrain eval cross-modal command: three different-provider frontier models score the OUTPUT against the TASK on a 5-dim rubric. Verdict drives exit code: 0 PASS, 1 FAIL, 2 INCONCLUSIVE (<2/3 model successes). Reuses src/core/ai/gateway.ts:chat() so config/auth/aliasing comes from the canonical recipe registry. Bypasses connectEngine() via the cli.ts no-DB branch — first-run users can run the gate before gbrain init. Receipts bind to a SHA-8 of the SKILL.md content so gbrain skillify check can detect stale audits.
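The verdict-to-exit-code contract described above can be sketched as follows. This is an illustrative sketch, not the actual handler code — the `Verdict` type and `verdictExitCode` name are assumptions.

```typescript
// Hypothetical sketch of the verdict → exit-code contract. The real command
// derives the verdict from 3-model rubric scores; here we only pin the mapping.
type Verdict = "PASS" | "FAIL" | "INCONCLUSIVE";

function verdictExitCode(v: Verdict): number {
  switch (v) {
    case "PASS":
      return 0;
    case "FAIL":
      return 1;
    case "INCONCLUSIVE":
      return 2; // fewer than 2 of 3 models returned parseable scores
  }
}
```

This keeps the gate scriptable: a CI step can run the command and branch on the process exit code alone.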

Test Coverage

NEW CODEPATHS                                            COVERAGE
[+] src/core/cross-modal-eval/json-repair.ts             [★★★ TESTED] 10 cases (4-strategy fallback chain pinned)
[+] src/core/cross-modal-eval/aggregate.ts               [★★★ TESTED] 8 cases (Q2 floor + Q3 INCONCLUSIVE regression guard)
[+] src/core/cross-modal-eval/receipt-name.ts            [★★★ TESTED] 12 cases (sha8, slug-bind, stale detection)
[+] src/core/cross-modal-eval/receipt-write.ts           [★★★ TESTED] auto-mkdir verified
[+] src/core/cross-modal-eval/runner.ts                  [★★★ TESTED] 4 mocked-fetch E2E (PASS / FAIL-mean / FAIL-floor / INCONCLUSIVE)
[+] src/commands/eval-cross-modal.ts                     [★★ TESTED]  CLI integration via mocked E2E
[+] src/cli.ts no-DB branch                              [★★★ TESTED] dream-pattern parity verified

COVERAGE: 32 unit + 4 mocked-fetch E2E = 36 cases. All green.
QUALITY: ★★★ verdict-contract regressions pinned (PASS / FAIL / INCONCLUSIVE all asserted).

Tests: 3867 → 3903 (+36 new). Full unit fast loop 3903/3903 pass, RC=0.

Pre-Landing Review

plan-eng-review (round 1): 13 issues found, 0 critical gaps remaining; all 14 decisions resolved per-finding via AskUserQuestion. The thesis was right; the original .mjs had 3 critical correctness bugs (hardcoded /data/.env, all-models-fail returning silent PASS via Object.values({}).every(...) === true, missing min-score floor).
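The all-models-fail silent-PASS bug is worth seeing in miniature. This is an illustrative reproduction of the bug class, not the original .mjs code: `every()` on an empty array is vacuously true, so an empty score object sails through the gate.

```typescript
// Minimal reproduction of the empty-object silent-PASS bug described above.
const scores: Record<string, number> = {}; // all three model calls failed

// Object.values({}) is [], and [].every(...) is vacuously true.
const buggyPass = Object.values(scores).every((s) => s >= 7);
console.log(buggyPass); // true — silent PASS despite zero evidence

// The fix shape: require a quorum of parseable responses before judging.
const n = Object.values(scores).length;
const fixedVerdict =
  n < 2 ? "INCONCLUSIVE"
  : Object.values(scores).every((s) => s >= 7) ? "PASS"
  : "FAIL";
console.log(fixedVerdict); // "INCONCLUSIVE"
```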

codex consult-mode (round 2): 11 cross-model tensions surfaced, 5 substantive plan errors caught that plan-eng-review missed:

  1. Plan rolled a parallel provider stack instead of reusing src/core/ai/gateway.ts
  2. gbrain eval dispatch required connectEngine() so first-run users couldn't run the gate
  3. rate-leases helper requires a minion_jobs.id that a CLI eval doesn't have
  4. Proposed gbrainPath semantics were wrong (gbrainPath does NOT auto-mkdir; resolves to <GBRAIN_HOME>/.gbrain/...)
  5. Original plan conflated skillify-check.ts with skillpack-check.ts (different files)

All 11 codex tensions resolved per-finding. The plan is materially better for it.

Plus: the conformance test requires an ## Output Format section that the SKILL.md rewrite had dropped — caught and re-added.

Eval Results

No prompt-related files changed in this PR — evals skipped.

Plan Completion

25 decisions resolved (14 plan-eng-review + 11 codex tensions). 8 actions implemented:

Action → Files:

  1. New gbrain eval cross-modal command: src/commands/eval-cross-modal.ts, src/core/cross-modal-eval/{5 modules}, src/cli.ts, src/commands/eval.ts
  2. Delete .mjs script: recipes/cross-modal-eval/ removed
  3. Tests (3 unit + 1 E2E): 4 test files, 36 cases
  4. cross-modal-review Relationship section: skills/cross-modal-review/SKILL.md
  5. skillify SKILL.md rewrite: skills/skillify/SKILL.md (v1.1.0)
  6. CLI 11th item (informational) + scaffold: 4 files
  7. CLAUDE.md updates: Key Files + Commands list
  8. TODOS.md follow-ups: 4 v0.27.x+ items filed

Verification Results

  • bun run typecheck: clean (RC=0)
  • bun run check:{privacy,jsonb,progress,wasm,test-isolation}: all clean
  • bun run test: 3903/3903 pass, RC=0 (post-merge)
  • New module tests: 36/36 pass (32 unit + 4 mocked-fetch E2E)
  • E2E suite (real Postgres on port 5435): 8 pre-existing failures across 4 files (claw-test, dream-cycle-eight-phase-pglite, mechanical, serve-http-oauth) — verified as pre-existing on master, none in code I touched

TODOS

4 v0.27.x+ follow-ups filed under a new cross-modal-eval section in TODOS.md:

  1. --budget-usd hard cap + per-call cost telemetry (P2). Full cost guardrail to complement the partial T11=B safety net (TTY-aware default cycles + cost-estimate print).
  2. Subagent integration (P2). Wire gbrain eval cross-modal to be invokable as a gbrain agent run child job to recover the cross-process rate-leases that T4=A explicitly deferred.
  3. Skill adoption telemetry (P3). Track receipt count vs skill count; revisit T7=C ("required:false forever") with data after 30 days.
  4. docs/cross-modal-eval.md user guide (P3). Mirror docs/eval-bench.md precedent.

Documentation

  • CLAUDE.md — Key Files entries for the new command + 5 core modules; Commands list updated under v0.27.x.
  • skills/skillify/SKILL.md — full rewrite to v1.1.0 (informational 11th item, Phase 3 cross-modal eval section, Output Format, Anti-Patterns including correlated-blind-spot warning).
  • skills/cross-modal-review/SKILL.md — Relationship section pointing at the new command.
  • TODOS.md — 4 follow-ups filed.
  • llms-full.txt — regenerated via bun run build:llms.

Test plan

  • Unit tests pass (3903/3903)
  • New module tests pass (36/36 across 4 test files)
  • Typecheck clean
  • Pre-existing E2E failures verified against master (none in touched code)
  • Codex outside-voice review surfaced + resolved (11 tensions)
  • Real-API smoke test (requires OPENAI_API_KEY + ANTHROPIC_API_KEY + GOOGLE_GENERATIVE_AI_API_KEY in shell — run after merge)

🤖 Generated with Claude Code

Updates skillify from v1.0.0 to v2.0.0 with the key innovation:
cross-modal evaluation runs BEFORE tests (step 3) to establish
quality, then tests lock in the proven-good behavior.

Key changes:
- 11-item checklist (was 10) - adds cross-modal eval as step 3
- Cross-modal eval uses 3 models to score output on 5 dimensions
- Quality gate: all dimensions ≥ 7 average before proceeding to tests
- Prevents locking in mediocrity through tests-first approach
- References cross-modal-review skill for eval pipeline
- Updated all gbrain-specific paths (bun test, scripts/*.ts)
- Maintains compatibility with gbrain check-resolvable workflow

The meta-skill for turning raw features into properly-skilled,
tested, resolvable capabilities. Cross-modal eval ensures output
quality before tests cement the behavior.
Applied top improvements from GPT-5.5 + Opus 4-7 + DeepSeek V4 Pro:
- Named 3 frontier models explicitly with provider table
- Inlined eval prompt template with CONTEXT param + scoring calibration
- Defined aggregation math: mean >= 7 AND no single dim < 5
- Added eval receipt JSON schema
- Structured 3-cycle fix loop with before/after delta tracking
- Added worked example (summarize-pr, end-to-end)
- Added cost guardrails (skip < 200 tokens, max 9 API calls)
- Added representative input selection rule
- Added SKILL.md frontmatter template (copy-paste ready)
- Added Phase 0 decision gate (is this worth skillifying?)

Also includes cross-modal-eval runner recipe with robust JSON
parsing for LLMs that return malformed JSON (3-tier repair).
garrytan changed the title from "feat(skillpack): enhance skillify with cross-modal eval quality gate" to "v0.28.4 feat(skillpack): enhance skillify with cross-modal eval quality gate" on May 6, 2026
garrytan and others added 7 commits May 6, 2026 20:08
Superseded by `gbrain eval cross-modal` (next commit). The .mjs script
was the original PR's hand-rolled provider stack; the replacement reuses
src/core/ai/gateway.ts so config/auth/model-aliasing comes from the
canonical recipe registry instead of a parallel stack.

No code references the .mjs (it was invoked by skill prose only), so
this delete is independently safe to bisect through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure-logic foundation for the new `gbrain eval cross-modal` command
(wired in the next commit). All five modules are self-contained — no
CLI surface, no I/O outside the receipt writer's mkdirSync. Imported
from src/core/ai/gateway.ts at runtime via gwChat (no config impact
at load time).

Modules:
  - json-repair.ts:    parseModelJSON 4-strategy fallback chain.
                       Adversarial nuclear-option throws rather than
                       fabricating scores (Q6 + Q3 in plan).
  - aggregate.ts:      verdict logic. PASS = (>=2 successes) AND
                       (every dim mean >= 7) AND (every dim min
                       across models >= 5). INCONCLUSIVE when <2/3
                       models returned parseable scores — closes the
                       v1 .mjs `Object.values({}).every(...) === true`
                       empty-array silent-PASS bug (Q2 + Q3).
  - receipt-name.ts:   receipt filename binds (slug, sha8 of SKILL.md)
                       so `gbrain skillify check` can detect stale
                       audits (T10 in plan).
  - receipt-write.ts:  thin wrapper over writeFileSync that auto-mkdirs
                       the parent directory. Standalone module because
                       gbrainPath() does NOT auto-mkdir (T5 plan
                       correction — Codex caught this).
  - runner.ts:         orchestrator. Promise.allSettled across 3 slots
                       per cycle; up to 3 cycles; stops early on PASS
                       or INCONCLUSIVE. Default slots: openai:gpt-4o /
                       anthropic:claude-opus-4-7 / google:gemini-1.5-pro.
                       estimateCost() exports a small per-model
                       pricing table (drifts; refresh alongside
                       model-family bumps).
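The aggregate.ts verdict rule described above (mean >= 7 per dimension, per-model floor >= 5, >= 2 of 3 parseable responses) can be sketched like this. Types and names are illustrative; the real module may differ in shape.

```typescript
// Sketch of the verdict rule: PASS = (>=2 successes) AND (every dim mean >= 7)
// AND (every dim min across models >= 5). INCONCLUSIVE below the quorum —
// never a vacuous PASS on an empty score set.
type Scores = Record<string, number>; // one model's score per rubric dimension

function aggregate(
  modelScores: Scores[],
  dims: string[],
): "PASS" | "FAIL" | "INCONCLUSIVE" {
  if (modelScores.length < 2) return "INCONCLUSIVE"; // <2/3 parseable responses
  for (const dim of dims) {
    const vals = modelScores.map((s) => s[dim]);
    const mean = vals.reduce((a, b) => a + b, 0) / vals.length;
    if (mean < 7) return "FAIL";            // mean floor
    if (Math.min(...vals) < 5) return "FAIL"; // single-model floor (Q2)
  }
  return "PASS";
}
```

Note the two independent floors: a high mean cannot mask one model scoring a dimension 4.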

Tests (32 cases total, all green):
  - json-repair.test.ts:  10 cases (clean JSON, fences, trailing
                          commas, single quotes, embedded newlines,
                          mismatched braces, nuclear-option success
                          + adversarial throws, empty input,
                          numeric-shorthand scores).
  - aggregate.test.ts:    8 cases pinning Q2/Q3/dedup. The 0-of-3
                          INCONCLUSIVE case is the regression guard
                          for the v1 silent-PASS bug.
  - cli.test.ts:          12 cases on receipt-name / receipt-write /
                          GBRAIN_HOME isolation. Uses withEnv()
                          helper for env mutation (R1 isolation rule).
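A fallback chain of the shape those json-repair tests pin down could look like this sketch. The strategy set (as-is, fence strip, trailing-comma removal, quote normalization) and their ordering are assumptions; the real json-repair.ts may implement them differently, and its fourth "nuclear option" strategy is not reproduced here.

```typescript
// Sketch of a cumulative repair chain: each strategy mutates the text,
// then we retry JSON.parse. If every strategy fails we throw — never
// fabricate scores from unparseable model output.
function parseModelJSONSketch(raw: string): unknown {
  const strategies: Array<(s: string) => string> = [
    (s) => s,                                          // 1: as-is
    (s) => s.replace(/^`{3}(?:json)?\s*|\s*`{3}$/g, ""), // 2: strip code fences
    (s) => s.replace(/,\s*([}\]])/g, "$1"),            // 3: drop trailing commas
    (s) => s.replace(/'/g, '"'),                       // 4: single → double quotes
  ];
  let text = raw.trim();
  for (const fix of strategies) {
    text = fix(text);
    try {
      return JSON.parse(text);
    } catch {
      // fall through to the next strategy
    }
  }
  throw new Error("unparseable model output");
}
```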

Verifies bisect-clean: typecheck passes, all 32 unit cases green.
The runner.ts import of gateway.chat() is dead until commit 3 wires
the CLI surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User-facing surface for the multi-model quality gate. Three different-
provider frontier models score the OUTPUT against the TASK on a 5-dim
rubric. Verdict drives exit code: 0 PASS, 1 FAIL, 2 INCONCLUSIVE
(<2/3 models returned parseable scores per Q3 in plan).

Wiring touches three files:

  - src/commands/eval-cross-modal.ts (new, ~290 lines)
    CLI handler. Self-configures the AI gateway from loadConfig() +
    process.env so it works without `gbrain init` (the cli.ts no-DB
    branch bypasses connectEngine()). Defaults: cycles=3 in TTY,
    cycles=1 in non-TTY (T11 partial cost guardrail — limits scripted
    bulk spend; full --budget-usd hard cap is a v0.27.x TODO). Prints
    estimated max-cost-per-cycle to stderr before each run. Uses
    gbrainPath('eval-receipts') for receipt directory.
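The TTY-aware cycle default described here could be sketched as follows; constant and function names are illustrative, not the actual handler code.

```typescript
// Hedged sketch of the T11 partial cost guardrail: interactive runs get the
// full 3-cycle fix loop, scripted (non-TTY) runs default to 1 cycle to limit
// bulk spend. An explicit cycles flag, if passed, would win.
const DEFAULT_TTY_CYCLES = 3;
const DEFAULT_NON_TTY_CYCLES = 1;

function defaultCycles(isTTY: boolean, explicit?: number): number {
  if (explicit !== undefined) return explicit;
  return isTTY ? DEFAULT_TTY_CYCLES : DEFAULT_NON_TTY_CYCLES;
}
```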

  - src/cli.ts (no-DB dispatch branch, 5-line addition)
    Special-cases `eval cross-modal` BEFORE the existing
    handleCliOnly path that requires connectEngine(). Mirrors the
    `dream` no-DB pattern but doesn't even attempt the connect — the
    command never touches the DB. New users can run the gate before
    `gbrain init` (T3 in plan).

  - src/commands/eval.ts (sub-subcommand dispatch)
    Adds `cross-modal` alongside `export`/`prune`/`replay`. The
    cli.ts branch takes precedence in the user-facing path; this
    branch only fires when callers re-enter runEvalCommand with an
    existing engine. Engine is intentionally unused — the handler
    self-routes.

  - test/e2e/cross-modal-eval.test.ts (new, 4 cases)
    Mocked-fetch E2E. Lives at test/e2e/* (NOT *.serial.test.ts) per
    plan T8: test/e2e/* is exempt from the test-isolation lint and
    already runs serially via scripts/run-e2e.sh, so the
    mock.module() call doesn't need a quarantine rename. Cases:
    PASS / FAIL (mean<7) / FAIL (min<5 — Q2 floor) / INCONCLUSIVE
    (2 mock 5xx — Q3 contract).

The runner from commit 2 now has live callers. typecheck passes;
the 4 E2E cases all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Promotes the skillify contract from 10 to 11 items. The 11th item
(cross-modal eval) is `required:false` per T7 in the plan — a
missing or stale receipt surfaces in the audit output but does not
fail the gate. Existing skills keep their current required-score;
the bump is additive, not breaking.

Changes:

  - src/commands/skillify.ts
    Header jsdoc updated 10-item -> 11-item. No code-flow changes.

  - src/commands/skillify-check.ts (the per-skill audit; not
    src/commands/skillpack-check.ts which is a different command —
    plan T6 corrected the conflation in the original plan)
    New informational item at position 11. Reuses
    findReceiptForSkill() helper from
    src/core/cross-modal-eval/receipt-name.ts to detect:
      * found  — receipt matches current SKILL.md sha-8
      * stale  — receipt exists for an older SKILL.md
      * missing — no receipt yet
    Audit output cases pass through to existing pretty/JSON formats.
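The found/stale/missing classification could be sketched like this, assuming "sha-8" means the first 8 hex characters of a SHA-256 over the SKILL.md content; the real receipt-name.ts helpers and signatures may differ.

```typescript
// Illustrative sketch of receipt staleness detection. A receipt binds to a
// content hash, so editing SKILL.md flips its receipt from found to stale.
import { createHash } from "node:crypto";

function sha8(content: string): string {
  return createHash("sha256").update(content).digest("hex").slice(0, 8);
}

type ReceiptStatus = "found" | "stale" | "missing";

function classifyReceipt(
  skillMd: string,
  receiptSha8: string | null, // null when no receipt exists for the skill
): ReceiptStatus {
  if (receiptSha8 === null) return "missing";
  return receiptSha8 === sha8(skillMd) ? "found" : "stale";
}
```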

  - src/core/skillify/templates.ts
    Scaffolded SKILL.md now includes a "Phase 3: Cross-modal eval
    (informational)" section with copy-paste `gbrain eval cross-modal`
    invocation, pass criteria, and receipt-naming convention. Helps
    new skill authors discover the gate.

  - test/skillify-scaffold.test.ts
    New T9 case verifies the scaffold emits the Phase 3 section,
    points at the correct command, documents the receipt path, and
    appends exactly one resolver row. Replaces the original plan's
    `gbrain skillify scaffold demo-eleven` shell verification (which
    Codex caught as invalid + repo-mutating).

Verifies: typecheck passes; scaffold test 19/19 (was 18, +1 T9 case).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documentation catches up with the new behavior shipped in commits 1-4.

  - skills/skillify/SKILL.md (1.0.0 -> 1.1.0)
    Full rewrite. Frontmatter version is additive (T7 in plan); the
    11th item is informational, not breaking. Phase 3 now points at
    `gbrain eval cross-modal` with copy-paste invocation, default
    slot table, pass criteria, receipt-naming convention, cycles +
    cost guardrails (T11 partial cap), provider configuration via
    the AI gateway, and the cycle-1/2/3 fix loop. Adds Output Format
    section (skills-conformance.test.ts requires it). Drops the
    original `(or lib/cross-modal-eval.ts)` parenthetical (Q5 plan
    correction — that path never existed).

  - skills/cross-modal-review/SKILL.md
    Adds 4-line Relationship section pointing at `gbrain eval
    cross-modal` (D3 plan reciprocal). Distinguishes the manual
    second-opinion gate (this skill) from the automated multi-model
    score-and-iterate gate (the new command).

  - CLAUDE.md
    Key Files entries for src/commands/eval-cross-modal.ts and the
    five new src/core/cross-modal-eval/* modules. Commands list
    gains the `gbrain eval cross-modal` entry under v0.27.x. Notes
    the non-TTY default 1-cycle behavior + the
    gbrainPath('eval-receipts') resolution.

  - TODOS.md
    Four v0.27.x follow-ups filed under a new "cross-modal-eval"
    section: full --budget-usd cap (T11 follow-up), subagent
    integration (recovers cross-process rate-leases T4 deferred),
    skill adoption telemetry (revisit T7=C with data after 30 days),
    docs/cross-modal-eval.md user guide.

  - llms-full.txt
    Regenerated via `bun run build:llms` to match the CLAUDE.md
    edits — sync guard at test/build-llms.test.ts requires this.

Verifies: typecheck passes; skills-conformance 199/199 green;
build-llms 7/7 green; full unit fast loop 3861/3861 green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan merged commit a1a2671 into garrytan:master May 7, 2026
7 checks passed
garrytan added a commit that referenced this pull request May 7, 2026
…x-wave

Conflicts resolved:
- VERSION: kept 0.28.5 (ahead of master's 0.28.4)
- package.json: kept 0.28.5
- CHANGELOG.md: kept v0.28.5 entry above master's v0.28.4 entry

Master added v0.28.4 (skillify cross-modal eval quality gate, #674) and a
new src/commands/eval-cross-modal.ts. Orthogonal to this fix wave — no
code-level conflicts.

llms-full.txt and src/core/schema-embedded.ts regenerated post-merge.
Typecheck clean.
garrytan added a commit that referenced this pull request May 7, 2026
….28.6

Master shipped three v0.28.x patch releases without the takes feature
while v0.28-release was in flight:
- v0.28.1: zombie process accumulation + health endpoint timeout (#637)
- v0.28.3: restart-sweep — detect dropped Telegram messages (#675)
- v0.28.4: skillify cross-modal eval quality gate (#674)

Master's v0.28.0 slot was consumed without the takes layer ever landing,
so this release ships the original takes feature as v0.28.6 (skipping
v0.28.5 to leave space for any in-flight master patches).

The migration orchestrator file (v0_28_0.ts) and migration skill doc
(skills/migrations/v0.28.0.md) keep their original version keys —
those identify the migration version, not the release version.

Conflicts resolved:
- VERSION → 0.28.6 (was 0.28.0; master had 0.28.4)
- package.json → 0.28.6 (auto-merged ai-sdk deps from master's v0.27)
- CHANGELOG.md → renamed top entry "## [0.28.0]" → "## [0.28.6]" with
  date 2026-05-06; rebuilt the "To take advantage of" block (was
  truncated by stale === markers from a prior merge); preserved master's
  v0.28.4/v0.28.3/v0.28.1 entries beneath
- src/cli.ts auto-merged (CLI_ONLY has providers + takes/think both)

Verified post-merge:
- bun run verify: PASS (privacy + jsonb + progress + test-isolation +
  wasm + admin-build + typecheck)
- 133 tests pass: migrate + apply-migrations + takes-engine + takes-fence
- migrations v37 (takes) + v38 (access_tokens_permissions) apply cleanly
  on top of master's v35 (auto-RLS) + v36 (subagent persistence)