feat: add multi-agent consensus eval runner with persistent reputation #123
Closed
kaicianflone wants to merge 2 commits into
…lates

Generator changes (scripts/gen-skill-docs.ts):
- Add severity definitions table (Critical/High/Medium/Low with criteria and examples) to Health Score Rubric section
- Add inline fallback for missing qa/templates/qa-report-template.md (agent can proceed without the template file)
- Add inline fallback for missing qa/references/issue-taxonomy.md (agent uses category/severity enums directly)
- Add issue format spec (ISSUE-NNN, title, severity, category, description, repro steps, screenshot paths)

Regenerated qa/SKILL.md and qa-only/SKILL.md.
Tier 1 static tests: 157/157 pass. CI freshness: 12/12 FRESH.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New script: `bun run eval:consensus`

Runs 5 specialized agents (Doc Architect, API Accuracy, Agent Usability, Completeness Auditor, Style Guardian) against SKILL.md diffs. Each agent votes YES/NO/REWRITE with risk scores. Reputation persists across runs in .data/reputation.json: agents that align with consensus earn trust (+3), those that don't get slashed (-2). Floor at 10, ceiling at 200.

Features:
- Auto-detects changed SKILL.md files on the branch
- Cross-references diffs against ground truth (browse/SKILL.md for qa, main branch version for other skills)
- Configurable: --runs N, --threshold N, --skill <name>, --reset-reputation
- Results saved to .data/consensus-evals/ as JSON
- Exit code 1 if any skill fails (60% pass rate threshold)
- Rate limit retry (429 handling)

Usage:
  bun run eval:consensus                       # auto-detect + 5 runs
  bun run eval:consensus --skill qa --runs 3   # specific skill, 3 runs
  bun run eval:consensus --reset-reputation    # fresh start

Requires: ANTHROPIC_API_KEY in .env

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
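The reputation rule in this commit message (+3 for aligning with consensus, -2 otherwise, clamped to the range 10 to 200) can be sketched as a pure function. This is a minimal illustration, not the code in the PR: the function name `updateReputation`, the shape of the reputation map, and the starting score of 100 for an agent with no prior record are all assumptions.

```typescript
type Vote = "YES" | "NO" | "REWRITE";

// Values taken from the commit message.
const REWARD = 3;
const SLASH = 2;
const FLOOR = 10;
const CEILING = 200;

// Hypothetical helper: returns a new reputation map after one run.
// `startingScore` is an assumption; the PR does not state the initial value.
function updateReputation(
  reputation: Record<string, number>,
  votes: Record<string, Vote>,
  consensus: Vote,
  startingScore = 100,
): Record<string, number> {
  const next: Record<string, number> = { ...reputation };
  for (const [agent, vote] of Object.entries(votes)) {
    const current = next[agent] ?? startingScore;
    const delta = vote === consensus ? REWARD : -SLASH;
    // Clamp so a score never leaves the [FLOOR, CEILING] range.
    next[agent] = Math.min(CEILING, Math.max(FLOOR, current + delta));
  }
  return next;
}
```

Returning a new object instead of mutating the input keeps the step easy to test before the result is written back to a file such as .data/reputation.json.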
Summary
New script: `bun run eval:consensus`

Runs 5 specialized agents (Doc Architect, API Accuracy, Agent Usability, Completeness Auditor, Style Guardian) against SKILL.md diffs on the current branch. Each agent votes YES/NO/REWRITE with risk scores. Reputation persists across runs: agents that align with consensus earn trust (+3), those that don't get slashed (-2).
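One way to picture how five YES/NO/REWRITE votes become a consensus, and how several runs become a pass or fail, is sketched below. The PR does not say how consensus is computed, so the plurality rule, the treatment of ties as "no consensus", and the names `tallyConsensus` and `skillPasses` are assumptions for illustration. The 0.6 default mirrors the 60% pass rate threshold mentioned in the commit message.

```typescript
type Vote = "YES" | "NO" | "REWRITE";

// Hypothetical plurality tally: the most common vote wins.
// A tie for first place (or no votes at all) yields null, meaning no consensus.
function tallyConsensus(votes: Vote[]): Vote | null {
  const counts: Record<Vote, number> = { YES: 0, NO: 0, REWRITE: 0 };
  for (const v of votes) counts[v] += 1;
  const ranked = (Object.keys(counts) as Vote[]).sort(
    (a, b) => counts[b] - counts[a],
  );
  if (votes.length === 0 || counts[ranked[0]] === counts[ranked[1]]) {
    return null;
  }
  return ranked[0];
}

// Hypothetical gate: a skill passes when the share of runs whose
// consensus was YES reaches the pass rate.
function skillPasses(runOutcomes: (Vote | null)[], passRate = 0.6): boolean {
  if (runOutcomes.length === 0) return false;
  const passed = runOutcomes.filter((o) => o === "YES").length;
  return passed / runOutcomes.length >= passRate;
}
```

A runner built this way could exit with code 1 whenever `skillPasses` returns false for any evaluated skill.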
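- Auto-detects changed SKILL.md files on the branch (`git diff main`)
- Configurable: `--runs N`, `--threshold N`, `--skill <name>`, `--reset-reputation`
- Results saved to `.data/consensus-evals/` as JSON
- Reputation persists in `.data/reputation.json`

Usage
  bun run eval:consensus                       # auto-detect + 5 runs
  bun run eval:consensus --skill qa --runs 3   # specific skill, 3 runs
  bun run eval:consensus --reset-reputation    # fresh start

Requires
`ANTHROPIC_API_KEY` in `.env`.

Test plan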
🤖 Generated with Claude Code