feat: add multi-agent consensus eval runner with persistent reputation #123

Closed
kaicianflone wants to merge 2 commits into garrytan:main from kaicianflone:feat/consensus-eval-runner

Conversation

@kaicianflone

Summary

New script: bun run eval:consensus

Runs 5 specialized agents (Doc Architect, API Accuracy, Agent Usability, Completeness Auditor, Style Guardian) against SKILL.md diffs on the current branch. Each agent votes YES/NO/REWRITE and attaches a risk score. Reputation persists across runs: agents that align with consensus earn trust (+3), and those that don't get slashed (-2).
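To make the reputation rule concrete, here is a minimal sketch of one update step. It is not the code from this PR. The +3/-2 deltas come from the description above and the 10/200 bounds from the commit message. Everything else is an assumption for illustration: the type and function names, the starting reputation of 100, unweighted plurality as the definition of "consensus", and the tie-breaking order.

```typescript
type Vote = "YES" | "NO" | "REWRITE";

interface AgentVote {
  agent: string;
  vote: Vote;
}

// Deltas and bounds as stated in the PR description and commit message.
const REWARD = 3;
const SLASH = 2;
const FLOOR = 10;
const CEILING = 200;

// Assumption: the PR does not state the reputation a new agent starts with.
const INITIAL_REPUTATION = 100;

// Assumption: consensus is the most common vote (unweighted plurality).
// Ties fall back to the more cautious option: NO, then REWRITE, then YES.
function consensusVote(votes: AgentVote[]): Vote {
  const counts: Record<Vote, number> = { YES: 0, NO: 0, REWRITE: 0 };
  for (const { vote } of votes) {
    counts[vote] += 1;
  }
  let winner: Vote = "NO";
  for (const candidate of ["REWRITE", "YES"] as const) {
    if (counts[candidate] > counts[winner]) {
      winner = candidate;
    }
  }
  return winner;
}

// Returns a new reputation map; the input map is left untouched.
function updateReputation(
  votes: AgentVote[],
  reputation: Map<string, number>,
): Map<string, number> {
  const consensus = consensusVote(votes);
  const next = new Map(reputation);
  for (const { agent, vote } of votes) {
    const current = next.get(agent) ?? INITIAL_REPUTATION;
    const delta = vote === consensus ? REWARD : -SLASH;
    // Clamp into [FLOOR, CEILING] after applying the delta.
    next.set(agent, Math.min(CEILING, Math.max(FLOOR, current + delta)));
  }
  return next;
}
```

Returning a fresh map keeps a run's input and output states separate, which makes it easier to write both to disk or compare them. The actual script may mutate state in place.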

  • Auto-detects changed SKILL.md files via git diff main
  • Cross-references diffs against ground truth (browse/SKILL.md for qa skills, main branch version for others)
  • Configurable: --runs N, --threshold N, --skill <name>, --reset-reputation
  • Results saved to .data/consensus-evals/ as JSON
  • Reputation saved to .data/reputation.json
  • Exits with code 1 if any skill misses the 60% pass rate, so it can be used as a CI gate
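The CI gate in the last bullet can be sketched as follows. This is an illustration, not the PR's implementation. It assumes each run of a skill reduces to a single pass/fail boolean, and that a pass rate of exactly 60% counts as passing. The names `passRate`, `failingSkills`, and `exitCodeFor` are hypothetical.

```typescript
// 60% pass rate, as stated in the PR description.
const PASS_RATE_THRESHOLD = 0.6;

// Fraction of runs that passed; a skill with no runs is treated as failing.
function passRate(runs: boolean[]): number {
  if (runs.length === 0) {
    return 0;
  }
  return runs.filter(Boolean).length / runs.length;
}

// Names of the skills whose pass rate is below the threshold.
function failingSkills(results: Record<string, boolean[]>): string[] {
  return Object.entries(results)
    .filter(([, runs]) => passRate(runs) < PASS_RATE_THRESHOLD)
    .map(([skill]) => skill);
}

// 1 if any skill fails the gate, otherwise 0.
function exitCodeFor(results: Record<string, boolean[]>): number {
  return failingSkills(results).length > 0 ? 1 : 0;
}
```

A runner could end with `process.exit(exitCodeFor(results))`, so a CI job fails whenever one skill drops below the threshold.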

Usage

bun run eval:consensus                     # auto-detect + 5 runs
bun run eval:consensus --skill qa --runs 3 # specific skill
bun run eval:consensus --reset-reputation  # fresh start

Requires ANTHROPIC_API_KEY in .env.
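For readers wondering how the flags above might be read, here is one possible sketch using `parseArgs` from `node:util`, which Bun also provides. It is not taken from the PR. The `EvalOptions` shape and the `parseEvalArgs` name are hypothetical. The default of 5 runs follows the usage comment above. The PR does not state the unit or default of `--threshold`, so it is left undefined when omitted.

```typescript
import { parseArgs } from "node:util";

// Hypothetical option shape; field names are illustrative only.
interface EvalOptions {
  runs: number;
  threshold: number | undefined;
  skill: string | undefined;
  resetReputation: boolean;
}

// Parses an integer flag value, falling back when the flag is absent.
function parseIntegerFlag(name: string, raw: string | undefined, fallback: number): number {
  if (raw === undefined) {
    return fallback;
  }
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 1) {
    throw new Error(`--${name} must be a positive integer, got "${raw}"`);
  }
  return value;
}

function parseEvalArgs(argv: string[]): EvalOptions {
  const { values } = parseArgs({
    args: argv,
    options: {
      runs: { type: "string" },
      threshold: { type: "string" },
      skill: { type: "string" },
      "reset-reputation": { type: "boolean" },
    },
    strict: true, // unknown flags throw instead of being silently ignored
  });

  const threshold = values.threshold === undefined ? undefined : Number(values.threshold);
  if (threshold !== undefined && Number.isNaN(threshold)) {
    throw new Error(`--threshold must be a number, got "${values.threshold}"`);
  }

  return {
    runs: parseIntegerFlag("runs", values.runs, 5),
    threshold,
    skill: values.skill,
    resetReputation: values["reset-reputation"] ?? false,
  };
}
```

In a script this would typically be called as `parseEvalArgs(process.argv.slice(2))`.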

Test plan

  • Script runs, detects changed skills, saves results and reputation
  • Rate limit retry on 429
  • 157/157 static tests still pass
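The "rate limit retry on 429" item could look roughly like the sketch below. The PR only says that 429 responses are retried. The exponential backoff, the retry count, the check for a numeric `status` property on the thrown error, and the helper names are all assumptions for illustration.

```typescript
// Assumption: a rate-limited call throws an error object with `status === 429`.
function isRateLimitError(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { status?: unknown }).status === 429
  );
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Retries `call` on 429 errors, doubling the delay after each attempt.
// Any other error, or a 429 after `maxRetries` retries, is rethrown.
async function withRateLimitRetry<T>(
  call: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 1000,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (!isRateLimitError(err) || attempt >= maxRetries) {
        throw err;
      }
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}
```

The injectable `sleep` parameter lets a test exercise the retry path without actually waiting.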

🤖 Generated with Claude Code

Kai Cianflone and others added 2 commits March 16, 2026 20:53
…lates

Generator changes (scripts/gen-skill-docs.ts):
- Add severity definitions table (Critical/High/Medium/Low with
  criteria and examples) to Health Score Rubric section
- Add inline fallback for missing qa/templates/qa-report-template.md
  (agent can proceed without the template file)
- Add inline fallback for missing qa/references/issue-taxonomy.md
  (agent uses category/severity enums directly)
- Add issue format spec (ISSUE-NNN, title, severity, category,
  description, repro steps, screenshot paths)

Regenerated qa/SKILL.md and qa-only/SKILL.md.
Tier 1 static tests: 157/157 pass. CI freshness: 12/12 FRESH.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New script: `bun run eval:consensus`

Runs 5 specialized agents (Doc Architect, API Accuracy, Agent Usability,
Completeness Auditor, Style Guardian) against SKILL.md diffs. Each agent
votes YES/NO/REWRITE with risk scores. Reputation persists across runs
in .data/reputation.json — agents that align with consensus earn trust
(+3), those that don't get slashed (-2). Floor at 10, ceiling at 200.

Features:
- Auto-detects changed SKILL.md files on the branch
- Cross-references diffs against ground truth (browse/SKILL.md for qa,
  main branch version for other skills)
- Configurable: --runs N, --threshold N, --skill <name>, --reset-reputation
- Results saved to .data/consensus-evals/ as JSON
- Exit code 1 if any skill fails (60% pass rate threshold)
- Rate limit retry (429 handling)

Usage:
  bun run eval:consensus                     # auto-detect + 5 runs
  bun run eval:consensus --skill qa --runs 3 # specific skill, 3 runs
  bun run eval:consensus --reset-reputation  # fresh start

Requires: ANTHROPIC_API_KEY in .env

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@2233admin

Message with backticks

9 similar comments

2 participants