j-rig-binary-eval

Software-grade release discipline for Claude Skills

Binary evaluation harness that treats SKILL.md artifacts as production software. Package integrity, trigger precision, functional quality, regression gating, baseline comparison, model-aware testing, and evidence-backed rollout decisions — all through binary yes/no criteria with external evaluators.

Links: Master Blueprint · Epic Index · Doc Index · Contributing · Security

One-Pager

The Problem

Claude Skills ship on instinct. A skill author writes a SKILL.md, eyeballs it, maybe runs it once, and pushes. There is no regression gate, no trigger precision measurement, no baseline comparison, no model-variance tracking, and no evidence trail for rollout decisions. When a skill breaks silently after a model update, or a description tweak causes sibling confusion across a pack, nobody knows until users complain.

The Solution

J-Rig Binary Eval is a seven-layer evaluation harness that scores every skill change across seven product surfaces before it ships:

Package Integrity — Does it parse, validate, and reference real files?
Trigger Quality — Does it fire on the right prompts and stay silent on the wrong ones?
Functional Quality — Does it complete its task and produce correct artifacts?
Regression Protection — Did this change break anything that previously worked?
Baseline Value — Does the skill actually outperform the naked model?
Model Variance — Does it work across Haiku, Sonnet, and Opus?
Rollout Safety — Any prompt leakage, overreach, or unsafe automation?

Every criterion is binary (yes/no). The evaluator is always separate from the skill under test. Observed behavior outranks claimed behavior.
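As a rough illustration of the three rules above, the sketch below models a binary criterion that tries a deterministic check first and only then defers to an external judge. Every name in it (`Verdict`, `Criterion`, `ExternalJudge`, `judgeCriterion`) is hypothetical and not taken from the j-rig codebase. A real LLM-backed judge would most likely be asynchronous; the sketch keeps it synchronous to stay short.

```typescript
// Hypothetical names throughout; none are taken from the j-rig codebase.
type Verdict = "yes" | "no";

interface Observation {
  output: string; // what the skill actually produced
  artifactPaths: string[]; // files the run created
}

interface Criterion {
  id: string;
  question: string; // must be answerable with yes or no
  // Optional cheap check; returns undefined when it cannot decide.
  deterministic?: (obs: Observation) => Verdict | undefined;
}

// Any judge that lives outside the skill under test (assumed synchronous here).
type ExternalJudge = (criterion: Criterion, obs: Observation) => Verdict;

interface CriterionResult {
  criterionId: string;
  verdict: Verdict;
  decidedBy: "deterministic" | "external-judge";
}

function judgeCriterion(
  criterion: Criterion,
  obs: Observation,
  judge: ExternalJudge,
): CriterionResult {
  // Grade the observed output, never the skill's own description of itself.
  const quick = criterion.deterministic?.(obs);
  if (quick !== undefined) {
    return { criterionId: criterion.id, verdict: quick, decidedBy: "deterministic" };
  }
  // Fall back to the external judge only when the cheap check cannot decide.
  return {
    criterionId: criterion.id,
    verdict: judge(criterion, obs),
    decidedBy: "external-judge",
  };
}
```

Recording `decidedBy` alongside the verdict is one way to keep the evidence trail honest about which results came from a deterministic check and which from a judge.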

W5


Who	Claude Skill authors, skill pack maintainers, enterprise skill library operators
What	Evaluation harness + regression gate + optimization engine for Claude Skills
Where	Local CLI (author workflow), CI/CD (PR gate), team dashboard (reporting)
When	Every skill change: new skill, description edit, body rewrite, model update
Why	Skills are production software — they need release-quality discipline, not vibes

Stack

Layer	Technology
Runtime	TypeScript, Node.js 20+, pnpm
CLI/Parsing	commander, @clack/prompts, picocolors, yaml, unified/remark
Validation	zod
LLM Integration	@anthropic-ai/sdk
Persistence	better-sqlite3, drizzle-orm
Concurrency	p-limit, async-retry
Artifact Extraction	pdf-parse, mammoth
Dashboard (future)	Next.js, Tailwind, shadcn/ui

Key Differentiators

Binary criteria only — if a criterion can't be answered yes or no, it isn't ready. No fuzzy scores, no vibes.
External evaluators — the skill under test never judges itself. Deterministic checks first, LLM judges second.
Sacred regressions — a change that improves average score but breaks a sacred case is rejected. Period.
One change at a time — the optimizer proposes exactly one atomic change per experiment. No multi-variable confusion.
Baseline gating — if the base model already does the job without the skill, the skill gets flagged for obsolete review.
Model-aware — Haiku, Sonnet, and Opus are tested independently. Model variance is product reality, not noise.
Evidence-backed rollout — every ship/no-ship decision comes with a structured evidence trail.
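The sacred-regression and baseline-gating rules in this list can be sketched as two small functions. This is a minimal sketch under stated assumptions: the `CaseOutcome` shape, the function names, and the extra requirement that the overall pass rate must not drop are all illustrative choices, not the project's actual implementation.

```typescript
// Hypothetical shapes for illustration only.
interface CaseOutcome {
  caseId: string;
  passed: boolean;
  sacred: boolean; // sacred cases must never regress
}

interface GateDecision {
  accept: boolean;
  regressedSacredCases: string[];
  passRateBefore: number;
  passRateAfter: number;
}

function passRate(outcomes: CaseOutcome[]): number {
  if (outcomes.length === 0) return 0;
  return outcomes.filter((o) => o.passed).length / outcomes.length;
}

function regressionGate(before: CaseOutcome[], after: CaseOutcome[]): GateDecision {
  const afterById = new Map(
    after.map((o): [string, CaseOutcome] => [o.caseId, o]),
  );
  // A sacred case that passed before and no longer passes (or is missing) is a regression.
  const regressedSacredCases = before
    .filter((o) => o.sacred && o.passed)
    .filter((o) => afterById.get(o.caseId)?.passed !== true)
    .map((o) => o.caseId);
  const passRateBefore = passRate(before);
  const passRateAfter = passRate(after);
  return {
    // Assumption: besides zero sacred regressions, the pass rate must not drop.
    accept: regressedSacredCases.length === 0 && passRateAfter >= passRateBefore,
    regressedSacredCases,
    passRateBefore,
    passRateAfter,
  };
}

// Baseline gating: flag the skill when the naked model does at least as well.
function flagForObsoleteReview(skillPassRate: number, baselinePassRate: number): boolean {
  return baselinePassRate >= skillPassRate;
}
```

Note that `regressionGate` rejects a change whose pass rate rises from 25% to 75% if the single case it breaks is sacred, which is the behavior the "Sacred regressions" bullet describes.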

Operator-Grade System Analysis

Architecture (Seven Layers)

┌─────────────────────────────────────────────────┐
│                   CLI / CI / API                 │  Layer 7: Surfaces
├─────────────────────────────────────────────────┤
│                 Evidence Layer                   │  Layer 6: Persistence
├─────────────────────────────────────────────────┤
│               Optimization Layer                 │  Layer 5: Experiments
├─────────────────────────────────────────────────┤
│                Judgment Layer                    │  Layer 4: Scoring
├─────────────────────────────────────────────────┤
│              Observation Layer                   │  Layer 3: Capture
├─────────────────────────────────────────────────┤
│               Execution Layer                    │  Layer 2: Harness
├─────────────────────────────────────────────────┤
│                  Spec Layer                      │  Layer 1: Contracts
└─────────────────────────────────────────────────┘

Layer	Responsibility	Key Entities
Spec	Human-authored YAML eval contracts, criteria, test cases	`eval_specs`, `criteria`, `test_cases`
Execution	Runs skills against trigger, functional, regression, adversarial, baseline cases	`runs`, `skill_versions`
Observation	Captures outputs, artifacts, cost, latency, timing, observed outcomes	`observed_outcomes`, `outputs`
Judgment	Deterministic checks first, external LLM judges second, calibration, disagreement handling	`criterion_results`
Optimization	Failure clustering, weakest-criterion targeting, single atomic changes, accept/reject/revert	`experiments`
Evidence	Stores runs, scores, artifacts, diffs, regressions, baselines, launch reports	`regressions`, `baselines`, `launch_reports`
CLI/CI/API	Local author workflows, PR gating, team reporting, dashboard	—
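To make the data flow between layers concrete, here is a minimal sketch of one pass from a spec through execution, observation, and judgment into an evidence record. The types and the `runEval` function are hypothetical stand-ins for the entities in the table, the executor is a plain function rather than a real model call, and the Optimization layer is left out entirely.

```typescript
// Layer 1 (Spec): hypothetical in-memory form of an eval contract.
interface TestCase { id: string; prompt: string }
interface BinaryCriterion { id: string; check: (output: string) => boolean }
interface EvalSpec { skill: string; testCases: TestCase[]; criteria: BinaryCriterion[] }

// Layer 2 (Execution): stand-in for whatever actually invokes the skill.
type Executor = (skill: string, prompt: string) => string;

// Layer 3 (Observation) and Layer 4 (Judgment) records.
interface ObservedOutcome { caseId: string; output: string; latencyMs: number }
interface CriterionResult { caseId: string; criterionId: string; pass: boolean }

// Layer 6 (Evidence): everything needed to review the run later.
interface EvidenceRecord {
  skill: string;
  outcomes: ObservedOutcome[];
  results: CriterionResult[];
}

function runEval(spec: EvalSpec, execute: Executor): EvidenceRecord {
  const outcomes: ObservedOutcome[] = [];
  const results: CriterionResult[] = [];
  for (const testCase of spec.testCases) {
    // Execute and capture what actually happened, including timing.
    const started = Date.now();
    const output = execute(spec.skill, testCase.prompt);
    outcomes.push({ caseId: testCase.id, output, latencyMs: Date.now() - started });
    // Judge every criterion against the observed output.
    for (const criterion of spec.criteria) {
      results.push({
        caseId: testCase.id,
        criterionId: criterion.id,
        pass: criterion.check(output),
      });
    }
  }
  return { skill: spec.skill, outcomes, results };
}
```

Keeping the executor as an injected function means the same loop could be driven by a stub in tests and by a real model client in CI, though how j-rig wires this up is not specified here.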

Epic Roadmap (10 Epics, Sequential)

#	Epic	Scope
01	Repo Foundation	Workspace skeleton, governance, CI
02	Spec Layer	YAML eval contracts, criteria schema, test case format
03	Package Integrity	Deterministic structure/metadata validation
04	Evidence Layer	SQLite persistence, run lifecycle, evidence serialization
05	Trigger Harness	Roster builder, trigger simulation, precision/recall
06	Functional Execution	Skill invocation, context injection, artifact capture
07	Judgment Layer	Binary judge engine, calibration, per-model matrix
08	Regression/CLI/CI	Regression comparison, baseline gating, score aggregation, CLI, PR gate
09	Optimizer	Failure clustering, one-change proposals, experiment runner
10	Team Product	Dashboard, eval packs, drift reevaluation, obsolete-review
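Epic 05 mentions precision and recall for the trigger harness. As a minimal sketch, assuming each trigger case records whether the skill was expected to fire and whether it did, the two metrics could be computed as follows. `TriggerCase` and `triggerMetrics` are hypothetical names, and returning 1 when a denominator is zero is a convention chosen for this sketch.

```typescript
// Hypothetical record of one trigger simulation.
interface TriggerCase {
  prompt: string;
  shouldFire: boolean; // expected behavior from the eval spec
  fired: boolean; // observed behavior from the run
}

interface TriggerMetrics {
  truePositives: number;
  falsePositives: number;
  falseNegatives: number;
  trueNegatives: number;
  precision: number; // of the times it fired, how often was that right?
  recall: number; // of the times it should have fired, how often did it?
}

function triggerMetrics(cases: TriggerCase[]): TriggerMetrics {
  let tp = 0;
  let fp = 0;
  let fn = 0;
  let tn = 0;
  for (const c of cases) {
    if (c.fired && c.shouldFire) tp += 1;
    else if (c.fired && !c.shouldFire) fp += 1;
    else if (!c.fired && c.shouldFire) fn += 1;
    else tn += 1;
  }
  return {
    truePositives: tp,
    falsePositives: fp,
    falseNegatives: fn,
    trueNegatives: tn,
    // Convention for this sketch: an empty denominator counts as a perfect score.
    precision: tp + fp === 0 ? 1 : tp / (tp + fp),
    recall: tp + fn === 0 ? 1 : tp / (tp + fn),
  };
}
```

Since the harness accepts only binary criteria, these ratios would presumably be compared against thresholds declared in the spec (for example "precision is at least 0.9: yes or no") rather than reported as fuzzy scores; that threshold mechanism is an assumption, not something the roadmap states.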

Non-Negotiable Design Principles

Criteria must be binary — yes or no, no gradients
Evaluator is always separate — the skill never judges itself
Observed behavior outranks claimed behavior — grade what happened, not what the skill says it does
Regression tests are sacred — a regression on a sacred case blocks release regardless of average improvement
One change at a time — optimizer proposes exactly one atomic change per experiment
Blockers block release — a blocker failure cannot be averaged out
Baseline value matters — if the naked model matches the skill, flag for obsolete review
Model-aware testing is required — Haiku/Sonnet/Opus differences are product reality
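Two of these principles, "blockers block release" and model-aware testing, combine naturally into a single gate. The sketch below is one possible reading, with hypothetical names (`ModelCriterionResult`, `releaseGate`) and lowercase model labels chosen for illustration: any failed blocker on any model stops the release, no matter how high the per-model pass rates are.

```typescript
// Hypothetical labels for the three model tiers named in the principles.
type Model = "haiku" | "sonnet" | "opus";

interface ModelCriterionResult {
  model: Model;
  criterionId: string;
  pass: boolean;
  blocker: boolean; // a failed blocker can never be averaged out
}

interface ReleaseDecision {
  ship: boolean;
  failedBlockers: { model: Model; criterionId: string }[];
  passRateByModel: Partial<Record<Model, number>>;
}

function releaseGate(results: ModelCriterionResult[]): ReleaseDecision {
  const totals = new Map<Model, { passed: number; total: number }>();
  const failedBlockers: { model: Model; criterionId: string }[] = [];

  for (const r of results) {
    // Track each model independently; variance between models is signal.
    const t = totals.get(r.model) ?? { passed: 0, total: 0 };
    t.total += 1;
    if (r.pass) t.passed += 1;
    totals.set(r.model, t);
    if (r.blocker && !r.pass) {
      failedBlockers.push({ model: r.model, criterionId: r.criterionId });
    }
  }

  const passRateByModel: Partial<Record<Model, number>> = {};
  totals.forEach((t, model) => {
    passRateByModel[model] = t.passed / t.total;
  });

  return {
    // Pass rates are reported for the evidence trail but do not override blockers.
    ship: results.length > 0 && failedBlockers.length === 0,
    failedBlockers,
    passRateByModel,
  };
}
```

Treating an empty result set as "do not ship" is a design choice of this sketch: with no observed behavior there is no evidence to back a rollout.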

Reference Library (32 files)

Self-contained library of templates, reference standards, agent patterns, and workflow diagrams under 000-docs/:

Directory	Contents
`templates/skill-templates/`	6 SKILL.md structural patterns
`templates/eval-schemas/`	Eval JSON schemas
`references/skill-standards/`	AgentSkills.io spec, source-of-truth, frontmatter, validation rules
`references/eval-patterns/`	Eval methodology, workflows, output patterns
`references/agents/`	Grader, comparator, analyzer agent patterns
`references/enterprise-standards/`	100-point rubric, production validator schema registry
`references/drift-and-consistency/`	Drift categories, source-of-truth hierarchy
`references/epic-workflows/`	10 ASCII workflow diagrams (one per epic)

Current Status

Phase: Epic 01 complete (repo foundation). Ready for Epic 02 (Spec Layer).

pnpm monorepo with four workspace packages (@j-rig/core, @j-rig/cli, @j-rig/db, @j-rig/dashboard), TypeScript baseline (tsup builds), quality guardrails (ESLint, Prettier, Vitest), and CI/CD workflows.

License

MIT — see LICENSE for details.

Author

Jeremy Longshore — jeremylongshore · Intent Solutions
