v0.28.4 feat(skillpack): enhance skillify with cross-modal eval quality gate#674

Merged
garrytan merged 9 commits into garrytan:master from garrytan-agents:feat/skillify-cross-modal-eval
May 7, 2026

Conversation

Contributor

garrytan-agents commented May 6, 2026

Summary

This PR went through plan-eng-review (14 decisions) + codex consult-mode (11 cross-model tensions) and was rewritten end-to-end on top of the original 2 commits. 25 decisions resolved across two review rounds. Plan file: ~/.claude/plans/radiant-napping-lerdorf.md.

Commits on this branch (after rewrite):

  • chore(recipes): remove cross-modal-eval.mjs (superseded — had 3 critical bugs)
  • feat(eval): cross-modal-eval core module + unit tests (5 new core files, 3 unit test files, 32 cases)
  • feat(eval): wire gbrain eval cross-modal CLI subcommand (handler + cli.ts no-DB branch + eval.ts dispatch + 4 mocked-fetch E2E cases)
  • feat(skillify): add informational 11th item (T7=C — required:false, additive not breaking)
  • docs: skillify SKILL.md v1.0.0 → 1.1.0, cross-modal-review Relationship section, CLAUDE.md key files, TODOS.md follow-ups
  • chore: bump version and changelog (v0.28.4)

The gbrain eval cross-modal command: three different-provider frontier models score the OUTPUT against the TASK on a 5-dim rubric. Verdict drives exit code: 0 PASS, 1 FAIL, 2 INCONCLUSIVE (<2/3 model successes). Reuses src/core/ai/gateway.ts:chat() so config/auth/aliasing comes from the canonical recipe registry. Bypasses connectEngine() via the cli.ts no-DB branch — first-run users can run the gate before gbrain init. Receipts bind to a SHA-8 of the SKILL.md content so gbrain skillify check can detect stale audits.
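The verdict-to-exit-code contract described above can be sketched as follows. This is an illustrative sketch, not the actual handler code — the `Verdict` type and `verdictExitCode` name are assumptions.

```typescript
// Hypothetical sketch of the verdict → exit-code contract. The real command
// derives the verdict from 3-model rubric scores; here we only pin the mapping.
type Verdict = "PASS" | "FAIL" | "INCONCLUSIVE";

function verdictExitCode(v: Verdict): number {
  switch (v) {
    case "PASS":
      return 0;
    case "FAIL":
      return 1;
    case "INCONCLUSIVE":
      return 2; // fewer than 2 of 3 models returned parseable scores
  }
}
```

This keeps the gate scriptable: a CI step can run the command and branch on the process exit code alone.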

Test Coverage

NEW CODEPATHS                                            COVERAGE
[+] src/core/cross-modal-eval/json-repair.ts             [★★★ TESTED] 10 cases (4-strategy fallback chain pinned)
[+] src/core/cross-modal-eval/aggregate.ts               [★★★ TESTED] 8 cases (Q2 floor + Q3 INCONCLUSIVE regression guard)
[+] src/core/cross-modal-eval/receipt-name.ts            [★★★ TESTED] 12 cases (sha8, slug-bind, stale detection)
[+] src/core/cross-modal-eval/receipt-write.ts           [★★★ TESTED] auto-mkdir verified
[+] src/core/cross-modal-eval/runner.ts                  [★★★ TESTED] 4 mocked-fetch E2E (PASS / FAIL-mean / FAIL-floor / INCONCLUSIVE)
[+] src/commands/eval-cross-modal.ts                     [★★ TESTED]  CLI integration via mocked E2E
[+] src/cli.ts no-DB branch                              [★★★ TESTED] dream-pattern parity verified

COVERAGE: 32 unit + 4 mocked-fetch E2E = 36 cases. All green.
QUALITY: ★★★ verdict-contract regressions pinned (PASS / FAIL / INCONCLUSIVE all asserted).

Tests: 3867 → 3903 (+36 new). Full unit fast loop 3903/3903 pass, RC=0.

Pre-Landing Review

plan-eng-review (round 1): 13 issues found, 0 critical gaps remaining; all 14 decisions resolved per-finding via AskUserQuestion. The thesis was right; the original .mjs had 3 critical correctness bugs (hardcoded /data/.env, all-models-fail returning silent PASS via Object.values({}).every(...) === true, missing min-score floor).
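The all-models-fail silent-PASS bug is worth seeing in miniature. This is an illustrative reproduction of the bug class, not the original .mjs code: `every()` on an empty array is vacuously true, so an empty score object sails through the gate.

```typescript
// Minimal reproduction of the empty-object silent-PASS bug described above.
const scores: Record<string, number> = {}; // all three model calls failed

// Object.values({}) is [], and [].every(...) is vacuously true.
const buggyPass = Object.values(scores).every((s) => s >= 7);
console.log(buggyPass); // true — silent PASS despite zero evidence

// The fix shape: require a quorum of parseable responses before judging.
const n = Object.values(scores).length;
const fixedVerdict =
  n < 2 ? "INCONCLUSIVE"
  : Object.values(scores).every((s) => s >= 7) ? "PASS"
  : "FAIL";
console.log(fixedVerdict); // "INCONCLUSIVE"
```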

codex consult-mode (round 2): 11 cross-model tensions surfaced, 5 substantive plan errors caught that plan-eng-review missed:

  1. Plan rolled a parallel provider stack instead of reusing src/core/ai/gateway.ts
  2. gbrain eval dispatch required connectEngine() so first-run users couldn't run the gate
  3. rate-leases helper requires a minion_jobs.id that a CLI eval doesn't have
  4. Proposed gbrainPath semantics were wrong (gbrainPath does NOT auto-mkdir; resolves to <GBRAIN_HOME>/.gbrain/...)
  5. Original plan conflated skillify-check.ts with skillpack-check.ts (different files)

All 11 codex tensions resolved per-finding. The plan is materially better for it.

Plus: the conformance test requires an ## Output Format section that the SKILL.md rewrite had dropped — caught and re-added.

Eval Results

No prompt-related files changed in this PR — evals skipped.

Plan Completion

25 decisions resolved (14 plan-eng-review + 11 codex tensions). 8 actions implemented:

Action → Files:

  1. New gbrain eval cross-modal command: src/commands/eval-cross-modal.ts, src/core/cross-modal-eval/{5 modules}, src/cli.ts, src/commands/eval.ts
  2. Delete .mjs script: recipes/cross-modal-eval/ removed
  3. Tests (3 unit + 1 E2E): 4 test files, 36 cases
  4. cross-modal-review Relationship section: skills/cross-modal-review/SKILL.md
  5. skillify SKILL.md rewrite: skills/skillify/SKILL.md (v1.1.0)
  6. CLI 11th item (informational) + scaffold: 4 files
  7. CLAUDE.md updates: Key Files + Commands list
  8. TODOS.md follow-ups: 4 v0.27.x+ items filed

Verification Results

  • bun run typecheck: clean (RC=0)
  • bun run check:{privacy,jsonb,progress,wasm,test-isolation}: all clean
  • bun run test: 3903/3903 pass, RC=0 (post-merge)
  • New module tests: 36/36 pass (32 unit + 4 mocked-fetch E2E)
  • E2E suite (real Postgres on port 5435): 8 pre-existing failures across 4 files (claw-test, dream-cycle-eight-phase-pglite, mechanical, serve-http-oauth) — verified as pre-existing on master, none in code I touched

TODOS

4 v0.27.x+ follow-ups filed under a new cross-modal-eval section in TODOS.md:

  1. --budget-usd hard cap + per-call cost telemetry (P2). Full cost guardrail to complement the partial T11=B safety net (TTY-aware default cycles + cost-estimate print).
  2. Subagent integration (P2). Wire gbrain eval cross-modal to be invokable as a gbrain agent run child job to recover the cross-process rate-leases that T4=A explicitly deferred.
  3. Skill adoption telemetry (P3). Track receipt count vs skill count; revisit T7=C ("required:false forever") with data after 30 days.
  4. docs/cross-modal-eval.md user guide (P3). Mirror docs/eval-bench.md precedent.

Documentation

  • CLAUDE.md — Key Files entries for the new command + 5 core modules; Commands list updated under v0.27.x.
  • skills/skillify/SKILL.md — full rewrite to v1.1.0 (informational 11th item, Phase 3 cross-modal eval section, Output Format, Anti-Patterns including correlated-blind-spot warning).
  • skills/cross-modal-review/SKILL.md — Relationship section pointing at the new command.
  • TODOS.md — 4 follow-ups filed.
  • llms-full.txt — regenerated via bun run build:llms.

Test plan

  • Unit tests pass (3903/3903)
  • New module tests pass (36/36 across 4 test files)
  • Typecheck clean
  • Pre-existing E2E failures verified against master (none in touched code)
  • Codex outside-voice review surfaced + resolved (11 tensions)
  • Real-API smoke test (requires OPENAI_API_KEY + ANTHROPIC_API_KEY + GOOGLE_GENERATIVE_AI_API_KEY in shell — run after merge)

🤖 Generated with Claude Code

Updates skillify from v1.0.0 to v2.0.0 with the key innovation:
cross-modal evaluation runs BEFORE tests (step 3) to establish
quality, then tests lock in the proven-good behavior.

Key changes:
- 11-item checklist (was 10) - adds cross-modal eval as step 3
- Cross-modal eval uses 3 models to score output on 5 dimensions
- Quality gate: all dimensions ≥ 7 average before proceeding to tests
- Prevents locking in mediocrity through tests-first approach
- References cross-modal-review skill for eval pipeline
- Updated all gbrain-specific paths (bun test, scripts/*.ts)
- Maintains compatibility with gbrain check-resolvable workflow

The meta-skill for turning raw features into properly-skilled,
tested, resolvable capabilities. Cross-modal eval ensures output
quality before tests cement the behavior.
Applied top improvements from GPT-5.5 + Opus 4-7 + DeepSeek V4 Pro:
- Named 3 frontier models explicitly with provider table
- Inlined eval prompt template with CONTEXT param + scoring calibration
- Defined aggregation math: mean >= 7 AND no single dim < 5
- Added eval receipt JSON schema
- Structured 3-cycle fix loop with before/after delta tracking
- Added worked example (summarize-pr, end-to-end)
- Added cost guardrails (skip < 200 tokens, max 9 API calls)
- Added representative input selection rule
- Added SKILL.md frontmatter template (copy-paste ready)
- Added Phase 0 decision gate (is this worth skillifying?)

Also includes cross-modal-eval runner recipe with robust JSON
parsing for LLMs that return malformed JSON (3-tier repair).
garrytan changed the title from "feat(skillpack): enhance skillify with cross-modal eval quality gate" to "v0.28.4 feat(skillpack): enhance skillify with cross-modal eval quality gate" on May 6, 2026
garrytan and others added 7 commits May 6, 2026 20:08
Superseded by `gbrain eval cross-modal` (next commit). The .mjs script
was the original PR's hand-rolled provider stack; the replacement reuses
src/core/ai/gateway.ts so config/auth/model-aliasing comes from the
canonical recipe registry instead of a parallel stack.

No code references the .mjs (it was invoked by skill prose only), so
this delete is independently safe to bisect through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure-logic foundation for the new `gbrain eval cross-modal` command
(wired in the next commit). All five modules are self-contained — no
CLI surface, no I/O outside the receipt writer's mkdirSync. Imported
from src/core/ai/gateway.ts at runtime via gwChat (no config impact
at load time).

Modules:
  - json-repair.ts:    parseModelJSON 4-strategy fallback chain.
                       Adversarial nuclear-option throws rather than
                       fabricating scores (Q6 + Q3 in plan).
  - aggregate.ts:      verdict logic. PASS = (>=2 successes) AND
                       (every dim mean >= 7) AND (every dim min
                       across models >= 5). INCONCLUSIVE when <2/3
                       models returned parseable scores — closes the
                       v1 .mjs `Object.values({}).every(...) === true`
                       empty-array silent-PASS bug (Q2 + Q3).
  - receipt-name.ts:   receipt filename binds (slug, sha8 of SKILL.md)
                       so `gbrain skillify check` can detect stale
                       audits (T10 in plan).
  - receipt-write.ts:  thin wrapper over writeFileSync that auto-mkdirs
                       the parent directory. Standalone module because
                       gbrainPath() does NOT auto-mkdir (T5 plan
                       correction — Codex caught this).
  - runner.ts:         orchestrator. Promise.allSettled across 3 slots
                       per cycle; up to 3 cycles; stops early on PASS
                       or INCONCLUSIVE. Default slots: openai:gpt-4o /
                       anthropic:claude-opus-4-7 / google:gemini-1.5-pro.
                       estimateCost() exports a small per-model
                       pricing table (drifts; refresh alongside
                       model-family bumps).
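The aggregate.ts verdict rule described above (mean >= 7 per dimension, per-model floor >= 5, >= 2 of 3 parseable responses) can be sketched like this. Types and names are illustrative; the real module may differ in shape.

```typescript
// Sketch of the verdict rule: PASS = (>=2 successes) AND (every dim mean >= 7)
// AND (every dim min across models >= 5). INCONCLUSIVE below the quorum —
// never a vacuous PASS on an empty score set.
type Scores = Record<string, number>; // one model's score per rubric dimension

function aggregate(
  modelScores: Scores[],
  dims: string[],
): "PASS" | "FAIL" | "INCONCLUSIVE" {
  if (modelScores.length < 2) return "INCONCLUSIVE"; // <2/3 parseable responses
  for (const dim of dims) {
    const vals = modelScores.map((s) => s[dim]);
    const mean = vals.reduce((a, b) => a + b, 0) / vals.length;
    if (mean < 7) return "FAIL";            // mean floor
    if (Math.min(...vals) < 5) return "FAIL"; // single-model floor (Q2)
  }
  return "PASS";
}
```

Note the two independent floors: a high mean cannot mask one model scoring a dimension 4.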

Tests (32 cases total, all green):
  - json-repair.test.ts:  10 cases (clean JSON, fences, trailing
                          commas, single quotes, embedded newlines,
                          mismatched braces, nuclear-option success
                          + adversarial throws, empty input,
                          numeric-shorthand scores).
  - aggregate.test.ts:    8 cases pinning Q2/Q3/dedup. The 0-of-3
                          INCONCLUSIVE case is the regression guard
                          for the v1 silent-PASS bug.
  - cli.test.ts:          12 cases on receipt-name / receipt-write /
                          GBRAIN_HOME isolation. Uses withEnv()
                          helper for env mutation (R1 isolation rule).
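A fallback chain of the shape those json-repair tests pin down could look like this sketch. The strategy set (as-is, fence strip, trailing-comma removal, quote normalization) and their ordering are assumptions; the real json-repair.ts may implement them differently, and its fourth "nuclear option" strategy is not reproduced here.

```typescript
// Sketch of a cumulative repair chain: each strategy mutates the text,
// then we retry JSON.parse. If every strategy fails we throw — never
// fabricate scores from unparseable model output.
function parseModelJSONSketch(raw: string): unknown {
  const strategies: Array<(s: string) => string> = [
    (s) => s,                                          // 1: as-is
    (s) => s.replace(/^`{3}(?:json)?\s*|\s*`{3}$/g, ""), // 2: strip code fences
    (s) => s.replace(/,\s*([}\]])/g, "$1"),            // 3: drop trailing commas
    (s) => s.replace(/'/g, '"'),                       // 4: single → double quotes
  ];
  let text = raw.trim();
  for (const fix of strategies) {
    text = fix(text);
    try {
      return JSON.parse(text);
    } catch {
      // fall through to the next strategy
    }
  }
  throw new Error("unparseable model output");
}
```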

Verifies bisect-clean: typecheck passes, all 32 unit cases green.
The runner.ts import of gateway.chat() is dead until commit 3 wires
the CLI surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User-facing surface for the multi-model quality gate. Three different-
provider frontier models score the OUTPUT against the TASK on a 5-dim
rubric. Verdict drives exit code: 0 PASS, 1 FAIL, 2 INCONCLUSIVE
(<2/3 models returned parseable scores per Q3 in plan).

Wiring touches three files:

  - src/commands/eval-cross-modal.ts (new, ~290 lines)
    CLI handler. Self-configures the AI gateway from loadConfig() +
    process.env so it works without `gbrain init` (the cli.ts no-DB
    branch bypasses connectEngine()). Defaults: cycles=3 in TTY,
    cycles=1 in non-TTY (T11 partial cost guardrail — limits scripted
    bulk spend; full --budget-usd hard cap is a v0.27.x TODO). Prints
    estimated max-cost-per-cycle to stderr before each run. Uses
    gbrainPath('eval-receipts') for receipt directory.
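The TTY-aware cycle default described here could be sketched as follows; constant and function names are illustrative, not the actual handler code.

```typescript
// Hedged sketch of the T11 partial cost guardrail: interactive runs get the
// full 3-cycle fix loop, scripted (non-TTY) runs default to 1 cycle to limit
// bulk spend. An explicit cycles flag, if passed, would win.
const DEFAULT_TTY_CYCLES = 3;
const DEFAULT_NON_TTY_CYCLES = 1;

function defaultCycles(isTTY: boolean, explicit?: number): number {
  if (explicit !== undefined) return explicit;
  return isTTY ? DEFAULT_TTY_CYCLES : DEFAULT_NON_TTY_CYCLES;
}
```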

  - src/cli.ts (no-DB dispatch branch, 5-line addition)
    Special-cases `eval cross-modal` BEFORE the existing
    handleCliOnly path that requires connectEngine(). Mirrors the
    `dream` no-DB pattern but doesn't even attempt the connect — the
    command never touches the DB. New users can run the gate before
    `gbrain init` (T3 in plan).

  - src/commands/eval.ts (sub-subcommand dispatch)
    Adds `cross-modal` alongside `export`/`prune`/`replay`. The
    cli.ts branch takes precedence in the user-facing path; this
    branch only fires when callers re-enter runEvalCommand with an
    existing engine. Engine is intentionally unused — the handler
    self-routes.

  - test/e2e/cross-modal-eval.test.ts (new, 4 cases)
    Mocked-fetch E2E. Lives at test/e2e/* (NOT *.serial.test.ts) per
    plan T8: test/e2e/* is exempt from the test-isolation lint and
    already runs serially via scripts/run-e2e.sh, so the
    mock.module() call doesn't need a quarantine rename. Cases:
    PASS / FAIL (mean<7) / FAIL (min<5 — Q2 floor) / INCONCLUSIVE
    (2 mock 5xx — Q3 contract).

The runner from commit 2 now has live callers. typecheck passes;
the 4 E2E cases all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Promotes the skillify contract from 10 to 11 items. The 11th item
(cross-modal eval) is `required:false` per T7 in the plan — a
missing or stale receipt surfaces in the audit output but does not
fail the gate. Existing skills keep their current required-score;
the bump is additive, not breaking.

Changes:

  - src/commands/skillify.ts
    Header jsdoc updated 10-item -> 11-item. No code-flow changes.

  - src/commands/skillify-check.ts (the per-skill audit; not
    src/commands/skillpack-check.ts which is a different command —
    plan T6 corrected the conflation in the original plan)
    New informational item at position 11. Reuses
    findReceiptForSkill() helper from
    src/core/cross-modal-eval/receipt-name.ts to detect:
      * found  — receipt matches current SKILL.md sha-8
      * stale  — receipt exists for an older SKILL.md
      * missing — no receipt yet
    Audit output cases pass through to existing pretty/JSON formats.
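The found/stale/missing classification could be sketched like this, assuming "sha-8" means the first 8 hex characters of a SHA-256 over the SKILL.md content; the real receipt-name.ts helpers and signatures may differ.

```typescript
// Illustrative sketch of receipt staleness detection. A receipt binds to a
// content hash, so editing SKILL.md flips its receipt from found to stale.
import { createHash } from "node:crypto";

function sha8(content: string): string {
  return createHash("sha256").update(content).digest("hex").slice(0, 8);
}

type ReceiptStatus = "found" | "stale" | "missing";

function classifyReceipt(
  skillMd: string,
  receiptSha8: string | null, // null when no receipt exists for the skill
): ReceiptStatus {
  if (receiptSha8 === null) return "missing";
  return receiptSha8 === sha8(skillMd) ? "found" : "stale";
}
```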

  - src/core/skillify/templates.ts
    Scaffolded SKILL.md now includes a "Phase 3: Cross-modal eval
    (informational)" section with copy-paste `gbrain eval cross-modal`
    invocation, pass criteria, and receipt-naming convention. Helps
    new skill authors discover the gate.

  - test/skillify-scaffold.test.ts
    New T9 case verifies the scaffold emits the Phase 3 section,
    points at the correct command, documents the receipt path, and
    appends exactly one resolver row. Replaces the original plan's
    `gbrain skillify scaffold demo-eleven` shell verification (which
    Codex caught as invalid + repo-mutating).

Verifies: typecheck passes; scaffold test 19/19 (was 18, +1 T9 case).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documentation catches up with the new behavior shipped in commits 1-4.

  - skills/skillify/SKILL.md (1.0.0 -> 1.1.0)
    Full rewrite. Frontmatter version is additive (T7 in plan); the
    11th item is informational, not breaking. Phase 3 now points at
    `gbrain eval cross-modal` with copy-paste invocation, default
    slot table, pass criteria, receipt-naming convention, cycles +
    cost guardrails (T11 partial cap), provider configuration via
    the AI gateway, and the cycle-1/2/3 fix loop. Adds Output Format
    section (skills-conformance.test.ts requires it). Drops the
    original `(or lib/cross-modal-eval.ts)` parenthetical (Q5 plan
    correction — that path never existed).

  - skills/cross-modal-review/SKILL.md
    Adds 4-line Relationship section pointing at `gbrain eval
    cross-modal` (D3 plan reciprocal). Distinguishes the manual
    second-opinion gate (this skill) from the automated multi-model
    score-and-iterate gate (the new command).

  - CLAUDE.md
    Key Files entries for src/commands/eval-cross-modal.ts and the
    five new src/core/cross-modal-eval/* modules. Commands list
    gains the `gbrain eval cross-modal` entry under v0.27.x. Notes
    the non-TTY default 1-cycle behavior + the
    gbrainPath('eval-receipts') resolution.

  - TODOS.md
    Four v0.27.x follow-ups filed under a new "cross-modal-eval"
    section: full --budget-usd cap (T11 follow-up), subagent
    integration (recovers cross-process rate-leases T4 deferred),
    skill adoption telemetry (revisit T7=C with data after 30 days),
    docs/cross-modal-eval.md user guide.

  - llms-full.txt
    Regenerated via `bun run build:llms` to match the CLAUDE.md
    edits — sync guard at test/build-llms.test.ts requires this.

Verifies: typecheck passes; skills-conformance 199/199 green;
build-llms 7/7 green; full unit fast loop 3861/3861 green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan merged commit a1a2671 into garrytan:master May 7, 2026
7 checks passed
garrytan added a commit that referenced this pull request May 7, 2026
…x-wave

Conflicts resolved:
- VERSION: kept 0.28.5 (ahead of master's 0.28.4)
- package.json: kept 0.28.5
- CHANGELOG.md: kept v0.28.5 entry above master's v0.28.4 entry

Master added v0.28.4 (skillify cross-modal eval quality gate, #674) and a
new src/commands/eval-cross-modal.ts. Orthogonal to this fix wave — no
code-level conflicts.

llms-full.txt and src/core/schema-embedded.ts regenerated post-merge.
Typecheck clean.
garrytan added a commit that referenced this pull request May 7, 2026
….28.6

Master shipped three v0.28.x patch releases without the takes feature
while v0.28-release was in flight:
- v0.28.1: zombie process accumulation + health endpoint timeout (#637)
- v0.28.3: restart-sweep — detect dropped Telegram messages (#675)
- v0.28.4: skillify cross-modal eval quality gate (#674)

Master's v0.28.0 slot was consumed without the takes layer ever landing,
so this release ships the original takes feature as v0.28.6 (skipping
v0.28.5 to leave space for any in-flight master patches).

The migration orchestrator file (v0_28_0.ts) and migration skill doc
(skills/migrations/v0.28.0.md) keep their original version keys —
those identify the migration version, not the release version.

Conflicts resolved:
- VERSION → 0.28.6 (was 0.28.0; master had 0.28.4)
- package.json → 0.28.6 (auto-merged ai-sdk deps from master's v0.27)
- CHANGELOG.md → renamed top entry "## [0.28.0]" → "## [0.28.6]" with
  date 2026-05-06; rebuilt the "To take advantage of" block (was
  truncated by stale === markers from a prior merge); preserved master's
  v0.28.4/v0.28.3/v0.28.1 entries beneath
- src/cli.ts auto-merged (CLI_ONLY has providers + takes/think both)

Verified post-merge:
- bun run verify: PASS (privacy + jsonb + progress + test-isolation +
  wasm + admin-build + typecheck)
- 133 tests pass: migrate + apply-migrations + takes-engine + takes-fence
- migrations v37 (takes) + v38 (access_tokens_permissions) apply cleanly
  on top of master's v35 (auto-RLS) + v36 (subagent persistence)