feat: add multi-agent consensus eval runner with persistent reputation by kaicianflone · Pull Request #124 · garrytan/gstack

kaicianflone · 2026-03-17T06:20:37Z

Summary

New script: bun run eval:consensus — a CI-ready multi-agent review gate for SKILL.md changes.

How it works

5 specialized AI agents (Doc Architect, API Accuracy, Agent Usability, Completeness Auditor, Style Guardian) each review the diff of changed SKILL.md files against ground truth. Each agent votes YES, NO, or REWRITE and attaches a risk score. A run passes when at least 3 of the 5 agents vote YES (the default threshold, configurable with --threshold).
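The pass rule above can be sketched as a simple tally. This is only an illustration: the `AgentVote` shape, the `runPasses` name, and the assumption that risk is a score between 0 and 1 are hypothetical and are not taken from the PR's code.

```typescript
// Hypothetical types; the PR does not show the runner's actual data shapes.
type Verdict = "YES" | "NO" | "REWRITE";

interface AgentVote {
  agent: string;
  verdict: Verdict;
  risk: number; // assumed to be a score in [0, 1]
}

// A run passes when the number of YES votes reaches the threshold.
// NO and REWRITE both count as "not YES" in this sketch.
function runPasses(votes: AgentVote[], threshold: number = 3): boolean {
  const yesCount = votes.filter((v) => v.verdict === "YES").length;
  return yesCount >= threshold;
}

// Example round: three YES votes out of five.
const exampleVotes: AgentVote[] = [
  { agent: "Doc Architect", verdict: "YES", risk: 0.1 },
  { agent: "API Accuracy", verdict: "YES", risk: 0.2 },
  { agent: "Agent Usability", verdict: "REWRITE", risk: 0.6 },
  { agent: "Completeness Auditor", verdict: "YES", risk: 0.15 },
  { agent: "Style Guardian", verdict: "NO", risk: 0.7 },
];
```

With these example votes, `runPasses(exampleVotes)` is true at the default threshold and false at a threshold of 4.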

Agent reputation persists across runs via ReputationTracker from @consensus-tools/evals:

Agents aligned with consensus: +3 reputation (tracker.payout)
Agents against consensus: -2 reputation (tracker.slash)
Floor at 10, ceiling at 200
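As a rough illustration of the reputation arithmetic listed above, the update for one agent after one run might look like the following. It deliberately does not call ReputationTracker; `updateReputation` and `clamp` are hypothetical local helpers, and how "aligned with consensus" is decided is left to the caller.

```typescript
// Values taken from the PR description.
const PAYOUT = 3; // agents aligned with consensus
const SLASH = 2; // agents against consensus
const FLOOR = 10;
const CEILING = 200;

// Keep a value inside [min, max].
function clamp(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, value));
}

// Hypothetical helper: returns the agent's reputation after one run.
function updateReputation(current: number, alignedWithConsensus: boolean): number {
  const next = alignedWithConsensus ? current + PAYOUT : current - SLASH;
  return clamp(next, FLOOR, CEILING);
}
```

Starting from the reset value of 100, an aligned agent moves to 103 and a dissenting agent to 98; values near the bounds are clamped rather than overshooting.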

Usage

bun run eval:consensus                       # auto-detect changed skills, 5 runs
bun run eval:consensus --skill qa --runs 3   # specific skill, 3 runs
bun run eval:consensus --threshold 4         # require 4/5 YES to pass (default: 3)
bun run eval:consensus --reset-reputation    # reset all agents to 100
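One way the flags above could be parsed is with `parseArgs` from `node:util`. This is a sketch, not the script's actual parser: the `ConsensusOptions` shape and the `parseConsensusArgs` name are assumptions, while the defaults (5 runs, threshold 3) come from the usage comments.

```typescript
import { parseArgs } from "node:util";

// Hypothetical options shape for the runner.
interface ConsensusOptions {
  skill?: string; // undefined means "auto-detect changed skills"
  runs: number;
  threshold: number;
  resetReputation: boolean;
}

function parseConsensusArgs(argv: string[]): ConsensusOptions {
  const { values } = parseArgs({
    args: argv,
    options: {
      skill: { type: "string" },
      runs: { type: "string" },
      threshold: { type: "string" },
      "reset-reputation": { type: "boolean" },
    },
  });
  return {
    skill: values.skill,
    runs: Number(values.runs ?? 5), // default: 5 runs
    threshold: Number(values.threshold ?? 3), // default: 3 of 5 YES votes
    resetReputation: values["reset-reputation"] ?? false,
  };
}
```

Called with `["--skill", "qa", "--runs", "3"]`, this returns the skill `qa` with 3 runs and the default threshold of 3.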

What's from the package vs what's local

From @consensus-tools/evals:

ReputationTracker — persistent reputation with payout(), slash(), syncToAgents(), incrementRounds()
ReputationStorage interface — plugged into a JSON file backend (.data/reputation.json)
AgentPersona type — for the 5 reviewer agent definitions

From @anthropic-ai/sdk (already a devDep):

Anthropic client — direct API calls to Sonnet 4.6

Local to this repo (gstack-specific):

Skill detection — git diff main --name-only to find changed SKILL.md files
Ground truth resolution — browse/SKILL.md for qa/qa-only, git show main:<skill>/SKILL.md for others
Diff guard prompt template
CLI output and JSON result persistence
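The skill detection and ground truth rules in this list could be expressed roughly as below. The function names and the `GroundTruthSource` type are hypothetical, and the sketch assumes each skill lives in its own top-level directory containing a SKILL.md; it takes the text output of `git diff main --name-only` as input rather than running git itself.

```typescript
// Extract skill directory names from `git diff main --name-only` output.
// Assumes paths of the form "<skill>/SKILL.md".
function changedSkills(diffOutput: string): string[] {
  const suffix = "/SKILL.md";
  const skills = new Set<string>();
  for (const line of diffOutput.split("\n")) {
    const path = line.trim();
    if (path.endsWith(suffix)) {
      skills.add(path.slice(0, -suffix.length));
    }
  }
  return Array.from(skills);
}

// Hypothetical description of where the ground truth for a skill comes from.
type GroundTruthSource =
  | { kind: "file"; path: string } // read from the working tree
  | { kind: "git"; ref: string }; // read via `git show <ref>`

function groundTruthSource(skill: string): GroundTruthSource {
  if (skill === "qa" || skill === "qa-only") {
    return { kind: "file", path: "browse/SKILL.md" };
  }
  return { kind: "git", ref: `main:${skill}/SKILL.md` };
}
```

Returning a description of the source instead of the file contents keeps the rule itself easy to test without a git checkout.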

Output

.data/consensus-evals/{branch}-{timestamp}.json — per-run votes, pass rates, agent reputation
.data/reputation.json — persistent reputation state
Exit code 1 if any skill fails to reach a 60% pass rate across its runs (CI gate)
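The CI gate can be pictured as a pass-rate check per skill. In this sketch the names `passRate` and `exitCodeFor` are hypothetical, and it assumes that a pass rate of exactly 60% (for example 3 of 5 runs) counts as passing; the PR does not state how the boundary is handled.

```typescript
// Fraction of runs that passed; an empty list counts as 0.
function passRate(runResults: boolean[]): number {
  if (runResults.length === 0) return 0;
  const passed = runResults.filter((ok) => ok).length;
  return passed / runResults.length;
}

// Returns 1 if any skill is below the minimum pass rate, otherwise 0.
// Assumption: a rate exactly equal to the minimum passes.
function exitCodeFor(
  resultsBySkill: Record<string, boolean[]>,
  minPassRate: number = 0.6,
): number {
  const anyFailed = Object.values(resultsBySkill).some(
    (runs) => passRate(runs) < minPassRate,
  );
  return anyFailed ? 1 : 0;
}
```

A script would pass the returned value to `process.exit` once results and reputation have been written.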

Diff guard results (this PR)

Ran 3 rounds of Sonnet 4.6 diff guard against this PR itself:

Round 1: caught settleEval API misuse (was using A/B comparison API for guard votes) — fixed to direct payout/slash
Round 2: caught stale README reference to @consensus-tools/guards — removed
Round 3: 5/5 REWRITE at 0.60-0.72 risk (false positives from truncated source context — all flagged methods verified to exist)

Requires ANTHROPIC_API_KEY in .env.

Test plan

171/171 static tests pass
Script runs, detects changed skills, saves results + reputation
Rate limit retry on 429
3 rounds of Sonnet 4.6 diff guard — real issues found and fixed
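The "rate limit retry on 429" item could be implemented along these lines. This is a generic sketch rather than the PR's code: `withRateLimitRetry` is a hypothetical helper, the exponential backoff schedule is an assumption, and it assumes the thrown error exposes a numeric `status` property.

```typescript
// Retry an async call when it fails with HTTP 429, using exponential backoff.
// Any other error, or a 429 after the last allowed retry, is rethrown.
async function withRateLimitRetry<T>(
  call: () => Promise<T>,
  maxRetries: number = 3,
  baseDelayMs: number = 1000,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      // Assumption: rate-limit errors carry `status === 429`.
      const status = (err as { status?: number } | null)?.status;
      if (status !== 429 || attempt >= maxRetries) throw err;
      await sleep(baseDelayMs * Math.pow(2, attempt)); // 1s, 2s, 4s, ...
    }
  }
}
```

The `sleep` parameter is injectable so that tests can skip the real delay.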

🤖 Generated with Claude Code

…lates

Generator changes (scripts/gen-skill-docs.ts):
- Add severity definitions table (Critical/High/Medium/Low with criteria and examples) to Health Score Rubric section
- Add inline fallback for missing qa/templates/qa-report-template.md (agent can proceed without the template file)
- Add inline fallback for missing qa/references/issue-taxonomy.md (agent uses category/severity enums directly)
- Add issue format spec (ISSUE-NNN, title, severity, category, description, repro steps, screenshot paths)

Regenerated qa/SKILL.md and qa-only/SKILL.md.
Tier 1 static tests: 157/157 pass. CI freshness: 12/12 FRESH.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

New script: `bun run eval:consensus`

Uses @consensus-tools/evals for:
- ReputationTracker with pluggable ReputationStorage (JSON file backend)
- AgentPersona types for the 5 specialized reviewer agents
- validateScore for LLM output validation
- computeEffectiveWeight from @consensus-tools/guards for rep-weighted voting
- Vercel AI SDK (@ai-sdk/anthropic) for model abstraction

Gstack-specific parts stay local:
- Skill detection (git diff main --name-only)
- Ground truth resolution (browse/SKILL.md for qa, main branch for others)
- Diff guard prompt template
- CLI output formatting and result persistence

171/171 static tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

garrytan · 2026-03-17T14:39:25Z

Closing as spam. The claim that document-release evals hallucinate CLI flags is false — document-release doesn't use any browse CLI commands. This is the 5th PR promoting the same single-author package (@consensus-tools/evals, 12 hours old). All previous attempts (#87, #88, #91, #123) were also closed.

…iting "landed" (#61)

The daemon used to classify a sub-agent result based ONLY on exit code: exit 0 = "landed", non-zero/timed_out = "blocked". But the /land-and-deploy sub-agent (Kimi / Claude / Codex) gracefully declines to merge in real scenarios (failing CI, no PR found, pre-merge gate fails) and still exits 0 (subprocess ran cleanly, verdict lives in prose).

Observed in production on 2026-05-19 (mitosis-control-plane release-queue):

| PR | Queue says | GitHub reality |
| --- | --- | --- |
| garrytan#120 | landed | OPEN, e2e failing |
| garrytan#122 | landed | MERGED ✓ |
| garrytan#124 | landed | MERGED ✓ |
| garrytan#129 | landed | OPEN, e2e failing |

50% false-positive rate. The queue records are stuck on "landed", so discoverQueuedRecords (which filters status === "queued") never re-fans out for them: they are silently abandoned.

Fix: after the land sub-agent returns exit 0, ask GitHub for the authoritative PR state. New helper `verifyPrMerged(prNumber, repoIdentity, cwd)` runs `gh pr view <pr> --repo <owner/name> --json state -q .state` and returns { merged: true } only on MERGED. Anything else (OPEN, CLOSED, gh exited non-zero, repoIdentity unparseable) returns { merged: false, reason: "..." } and the daemon writes "blocked" with a useful lastError.

The repoIdentity is parsed via a strict ^github.com/owner/repo$ regex so a planted record can't sneak shell-specials through to gh: anything that doesn't match the regex gets rejected before any subprocess runs.

Network/auth failures (gh exited non-zero) are reported as not-merged rather than thrown. It is better to block the record and retry than to crash the daemon and leak the lock. The remediation for any blocked record is the existing retryReleaseQueueRecord CLI.

Injection seam: opts.verifyMerged?: typeof verifyPrMerged so tests can control the response. Defaults to the real verifyPrMerged.

Scope: pre-existing bug, not a regression from PR #57. PR garrytan#120 was marked landed at 12:30Z on 2026-05-19, hours before PR #57's work began. PR #57's deploy verification surfaced it by running through a real PR (garrytan#129) with failing e2e.

Tests: 60/60 release-daemon.test.ts (5 new regressions covering the production bug pattern, the happy path, network failures, identity parsing, and the positive parse case). Full orchestrator suite: 1525 pass / 16 fail, all 16 pre-existing on origin/main (confirmed via git stash).

Remediation for stranded records on production right now (PR#120, PR#129 on mitosis-control-plane): once this lands, the daemon needs manual revival of those two records. The retryReleaseQueueRecord CLI only handles blocked→queued, not landed→queued. Suggest a one-shot audit script or extending the retry CLI to support landed records whose GitHub state disagrees. Out of scope for this PR.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
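The identity check and state interpretation described in this commit message might look roughly like the sketch below. It is not the commit's actual code: `parseRepoIdentity`, `interpretPrState`, and the `MergeCheck` type are hypothetical names, and the allowed character set for owner and repo is an assumption. The subprocess call to `gh` is left out so that only the decision logic is shown.

```typescript
// Strict identity pattern: exactly "github.com/<owner>/<repo>".
// Assumption: owner and repo are limited to letters, digits, ".", "_" and "-".
const REPO_IDENTITY = /^github\.com\/([A-Za-z0-9_.-]+)\/([A-Za-z0-9_.-]+)$/;

// Returns null for anything that does not match, so no subprocess is started.
function parseRepoIdentity(identity: string): { owner: string; name: string } | null {
  const match = REPO_IDENTITY.exec(identity);
  if (!match) return null;
  return { owner: match[1], name: match[2] };
}

type MergeCheck = { merged: true } | { merged: false; reason: string };

// Interpret the outcome of `gh pr view <pr> --json state -q .state`.
// Only a clean exit with state MERGED counts as merged; failures are
// reported as not-merged instead of being thrown.
function interpretPrState(exitCode: number, stdout: string): MergeCheck {
  if (exitCode !== 0) {
    return { merged: false, reason: `gh exited with code ${exitCode}` };
  }
  const state = stdout.trim();
  if (state === "MERGED") return { merged: true };
  return { merged: false, reason: `PR state is ${state || "unknown"}` };
}
```

Rejecting unparseable identities before any subprocess runs is what keeps shell-special characters away from `gh` in this design.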

Kai Cianflone and others added 2 commits March 17, 2026 01:41

garrytan closed this Mar 17, 2026

anbangr mentioned this pull request May 20, 2026

fix(release-daemon): verify PR is actually merged on GitHub before writing "landed" anbangr/gstack#61

Merged

2 tasks

kaicianflone wants to merge 2 commits into garrytan:main from kaicianflone:feat/consensus-eval-runner


2 participants
