feat: add multi-agent consensus eval runner with persistent reputation by kaicianflone · Pull Request #123 · garrytan/gstack

kaicianflone · 2026-03-17T05:36:53Z

Summary

New script: bun run eval:consensus

Runs 5 specialized agents (Doc Architect, API Accuracy, Agent Usability, Completeness Auditor, Style Guardian) against SKILL.md diffs on the current branch. Each agent votes YES/NO/REWRITE with risk scores. Reputation persists across runs — agents that align with consensus earn trust (+3), those that don't get slashed (-2).
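The reputation rule described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation. The type names, the reputation-weighted plurality rule used to pick the consensus, and the starting reputation of 100 are all assumptions. Only the +3 / -2 adjustments and the 10 to 200 bounds (the bounds are stated in the commit message) come from the PR.

```typescript
// Hypothetical types and names; only the +3 / -2 deltas and the
// 10..200 bounds are taken from the PR text.
type Vote = "YES" | "NO" | "REWRITE";

interface AgentVote {
  agent: string;
  vote: Vote;
  risk: number; // risk score; its scale is an assumption
}

type Reputation = Record<string, number>;

const REWARD = 3;
const SLASH = 2;
const FLOOR = 10;
const CEILING = 200;
const DEFAULT_REPUTATION = 100; // assumption: the PR does not state a starting value

// Assumed consensus rule: the option with the highest total reputation wins.
// Ties resolve in the order YES, NO, REWRITE.
function consensusVote(votes: AgentVote[], rep: Reputation): Vote {
  const weight: Record<Vote, number> = { YES: 0, NO: 0, REWRITE: 0 };
  for (const v of votes) {
    weight[v.vote] += rep[v.agent] ?? DEFAULT_REPUTATION;
  }
  let best: Vote = "YES";
  for (const option of ["YES", "NO", "REWRITE"] as Vote[]) {
    if (weight[option] > weight[best]) best = option;
  }
  return best;
}

// Returns a new reputation table: +3 for agents that matched the consensus,
// -2 for the rest, clamped to the [10, 200] range.
function updateReputation(votes: AgentVote[], rep: Reputation): Reputation {
  const consensus = consensusVote(votes, rep);
  const next: Reputation = { ...rep };
  for (const v of votes) {
    const current = rep[v.agent] ?? DEFAULT_REPUTATION;
    const delta = v.vote === consensus ? REWARD : -SLASH;
    next[v.agent] = Math.min(CEILING, Math.max(FLOOR, current + delta));
  }
  return next;
}
```

Returning a fresh object instead of mutating the input keeps each run's update easy to log or diff before it is written back to disk.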

Auto-detects changed SKILL.md files via git diff main
Cross-references diffs against ground truth (browse/SKILL.md for qa skills, main branch version for others)
Configurable: --runs N, --threshold N, --skill <name>, --reset-reputation
Results saved to .data/consensus-evals/ as JSON
Reputation saved to .data/reputation.json
Exit code 1 if any skill falls below a 60% pass rate, so the script is usable as a CI gate
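As a rough sketch of the CI gate in the last item above: the snippet below computes a per-skill pass rate and maps it to an exit code. The `SkillResult` shape and function names are hypothetical, and treating exactly 60% as a pass is an assumption. The PR does not say whether `--threshold` controls this value, so the cutoff is an ordinary parameter here.

```typescript
// Hypothetical result shape; the PR only says results are saved as JSON.
interface SkillResult {
  skill: string;
  passes: number; // runs in which the skill passed
  runs: number;   // total runs for the skill
}

const PASS_RATE = 0.6; // the 60% pass rate named in the PR

// A skill fails if it had no runs or its pass rate is below the cutoff.
// Assumption: a pass rate of exactly 60% counts as passing.
function failingSkills(results: SkillResult[], passRate = PASS_RATE): string[] {
  return results
    .filter((r) => r.runs === 0 || r.passes / r.runs < passRate)
    .map((r) => r.skill);
}

// Exit code 1 if any skill fails, 0 otherwise.
function exitCodeFor(results: SkillResult[], passRate = PASS_RATE): number {
  return failingSkills(results, passRate).length > 0 ? 1 : 0;
}
```

A caller would typically pass the result of `exitCodeFor` to `process.exit` once results and reputation have been written.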

Usage

bun run eval:consensus                     # auto-detect + 5 runs
bun run eval:consensus --skill qa --runs 3 # specific skill
bun run eval:consensus --reset-reputation  # fresh start
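For illustration, the flags shown above could be parsed along these lines. This is a sketch under assumptions rather than the script's real parser: the `Options` shape and helper names are hypothetical, the default of 5 runs is taken from the usage comment, and `--threshold` has no default here because the PR does not state one.

```typescript
// Hypothetical option shape for the four flags listed in the PR.
interface Options {
  runs: number;
  threshold?: number; // no default: the PR does not state one
  skill?: string;
  resetReputation: boolean;
}

function parsePositiveInt(raw: string | undefined, flag: string): number {
  const n = Number(raw);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error(`${flag} needs a positive integer`);
  }
  return n;
}

function parseNumber(raw: string | undefined, flag: string): number {
  const n = Number(raw);
  if (raw === undefined || raw === "" || !Number.isFinite(n)) {
    throw new Error(`${flag} needs a number`);
  }
  return n;
}

// Parses the arguments that follow the script name,
// e.g. process.argv.slice(2) under Bun or Node.
function parseArgs(argv: string[]): Options {
  const opts: Options = { runs: 5, resetReputation: false };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    switch (arg) {
      case "--runs":
        opts.runs = parsePositiveInt(argv[++i], "--runs");
        break;
      case "--threshold":
        opts.threshold = parseNumber(argv[++i], "--threshold");
        break;
      case "--skill": {
        const name = argv[++i];
        if (!name) throw new Error("--skill needs a name");
        opts.skill = name;
        break;
      }
      case "--reset-reputation":
        opts.resetReputation = true;
        break;
      default:
        throw new Error(`Unknown argument: ${arg}`);
    }
  }
  return opts;
}
```

Rejecting unknown arguments outright is a design choice for the sketch; it surfaces typos such as `--run 3` instead of silently falling back to the defaults.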

Requires ANTHROPIC_API_KEY in .env.

Test plan

Script runs, detects changed skills, saves results and reputation
Rate limit retry on 429
157/157 static tests still pass
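The "rate limit retry on 429" item can be illustrated with a small wrapper. The PR does not describe the retry policy, so the exponential backoff, the retry count, the delays, and all names below are assumptions. The request is passed in as a function so the sketch does not depend on any particular HTTP client.

```typescript
// Minimal response shape: anything with an HTTP status code.
interface HttpLikeResponse {
  status: number;
}

// Delay before retry number `attempt` (0-based): base, 2x base, 4x base, ...
function backoffDelayMs(attempt: number, baseDelayMs: number): number {
  return baseDelayMs * 2 ** attempt;
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls `request` and, while it returns 429, waits and retries up to
// `maxRetries` times. The last response is returned either way, so the
// caller still sees a 429 if every attempt was rate limited.
async function withRateLimitRetry<T extends HttpLikeResponse>(
  request: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 1000,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  let response = await request();
  for (let attempt = 0; attempt < maxRetries && response.status === 429; attempt++) {
    await sleep(backoffDelayMs(attempt, baseDelayMs));
    response = await request();
  }
  return response;
}
```

Injecting `sleep` keeps the wrapper testable without real waiting. A production version might also honor a `Retry-After` header when the server sends one.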

🤖 Generated with Claude Code

…lates

Generator changes (scripts/gen-skill-docs.ts):
- Add severity definitions table (Critical/High/Medium/Low with criteria and examples) to Health Score Rubric section
- Add inline fallback for missing qa/templates/qa-report-template.md (agent can proceed without the template file)
- Add inline fallback for missing qa/references/issue-taxonomy.md (agent uses category/severity enums directly)
- Add issue format spec (ISSUE-NNN, title, severity, category, description, repro steps, screenshot paths)

Regenerated qa/SKILL.md and qa-only/SKILL.md.
Tier 1 static tests: 157/157 pass. CI freshness: 12/12 FRESH.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

New script: `bun run eval:consensus`

Runs 5 specialized agents (Doc Architect, API Accuracy, Agent Usability, Completeness Auditor, Style Guardian) against SKILL.md diffs. Each agent votes YES/NO/REWRITE with risk scores. Reputation persists across runs in .data/reputation.json: agents that align with consensus earn trust (+3), those that don't get slashed (-2). Floor at 10, ceiling at 200.

Features:
- Auto-detects changed SKILL.md files on the branch
- Cross-references diffs against ground truth (browse/SKILL.md for qa, main branch version for other skills)
- Configurable: --runs N, --threshold N, --skill <name>, --reset-reputation
- Results saved to .data/consensus-evals/ as JSON
- Exit code 1 if any skill fails (60% pass rate threshold)
- Rate limit retry (429 handling)

Usage:
bun run eval:consensus                     # auto-detect + 5 runs
bun run eval:consensus --skill qa --runs 3 # specific skill, 3 runs
bun run eval:consensus --reset-reputation  # fresh start

Requires: ANTHROPIC_API_KEY in .env

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2233admin · 2026-05-09T19:23:10Z

Message with backticks

2233admin · 2026-05-09T19:24:08Z

Message with backticks

2233admin · 2026-05-09T20:44:04Z

Message with backticks

2233admin · 2026-05-09T20:50:23Z

Message with backticks

2233admin · 2026-05-09T20:50:42Z

Message with backticks

2233admin · 2026-05-09T20:51:01Z

Message with backticks

2233admin · 2026-05-09T20:51:20Z

Message with backticks

2233admin · 2026-05-10T00:26:06Z

Message with backticks

2233admin · 2026-05-10T00:29:33Z

Message with backticks

2233admin · 2026-05-10T02:23:13Z

Message with backticks

Kai Cianflone and others added 2 commits March 16, 2026 20:53

garrytan mentioned this pull request Mar 17, 2026

feat: add multi-agent consensus eval runner with persistent reputation #124

Closed

4 tasks

kaicianflone wants to merge 2 commits into garrytan:main from kaicianflone:feat/consensus-eval-runner

2 participants
