feat: add multi-agent consensus eval runner with persistent reputation #124

Closed
kaicianflone wants to merge 2 commits into garrytan:main from kaicianflone:feat/consensus-eval-runner

@kaicianflone

Summary

New script: bun run eval:consensus — a CI-ready multi-agent review gate for SKILL.md changes.

How it works

Five specialized AI agents (Doc Architect, API Accuracy, Agent Usability, Completeness Auditor, Style Guardian) each review the diff of changed SKILL.md files against ground truth. Each votes YES, NO, or REWRITE and assigns a risk score. A run passes when at least 3 of the 5 agents vote YES (the default threshold).
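The voting rule can be sketched as follows. This is a minimal illustration, not the runner's actual code: the `AgentVote` shape, the `runPasses` name, and the assumption that risk scores lie in [0, 1] are all hypothetical.

```typescript
// Hypothetical types: the PR does not show the runner's real data shapes.
type Vote = "YES" | "NO" | "REWRITE";

interface AgentVote {
  agent: string;
  vote: Vote;
  risk: number; // risk score, assumed here to lie in [0, 1]
}

// A run passes when the number of YES votes reaches the threshold (default 3).
function runPasses(votes: AgentVote[], threshold = 3): boolean {
  const yesCount = votes.filter((v) => v.vote === "YES").length;
  return yesCount >= threshold;
}

const exampleVotes: AgentVote[] = [
  { agent: "Doc Architect", vote: "YES", risk: 0.1 },
  { agent: "API Accuracy", vote: "YES", risk: 0.2 },
  { agent: "Agent Usability", vote: "REWRITE", risk: 0.6 },
  { agent: "Completeness Auditor", vote: "YES", risk: 0.15 },
  { agent: "Style Guardian", vote: "NO", risk: 0.7 },
];

console.log(runPasses(exampleVotes));    // true: 3 YES votes meet the default threshold
console.log(runPasses(exampleVotes, 4)); // false: a threshold of 4 needs one more YES
```

REWRITE is counted the same as NO here, since only YES votes contribute to passing.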

Agent reputation persists across runs via ReputationTracker from @consensus-tools/evals:

  • Agents aligned with consensus: +3 reputation (tracker.payout)
  • Agents against consensus: -2 reputation (tracker.slash)
  • Floor at 10, ceiling at 200
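The arithmetic behind these rules is roughly the following. This sketch does not call `ReputationTracker`; `nextReputation` and `settleRound` are hypothetical local helpers, and "aligned with consensus" is assumed to mean that an agent voted YES on a passing run or non-YES on a failing one.

```typescript
const PAYOUT = 3;    // added when an agent aligns with consensus
const SLASH = 2;     // subtracted when an agent votes against consensus
const FLOOR = 10;    // reputation never drops below this
const CEILING = 200; // reputation never rises above this
const INITIAL = 100; // value that --reset-reputation restores

type Vote = "YES" | "NO" | "REWRITE";

// Apply one payout or slash, then clamp to [FLOOR, CEILING].
function nextReputation(current: number, aligned: boolean): number {
  const raw = aligned ? current + PAYOUT : current - SLASH;
  return Math.min(CEILING, Math.max(FLOOR, raw));
}

// Assumption: an agent is aligned when its vote agrees with the run outcome.
function settleRound(
  reputation: Record<string, number>,
  votes: Record<string, Vote>,
  runPassed: boolean,
): Record<string, number> {
  const next: Record<string, number> = {};
  for (const agent of Object.keys(votes)) {
    const aligned = (votes[agent] === "YES") === runPassed;
    next[agent] = nextReputation(reputation[agent] ?? INITIAL, aligned);
  }
  return next;
}

console.log(nextReputation(100, true));  // 103
console.log(nextReputation(100, false)); // 98
console.log(nextReputation(199, true));  // 200 (capped at the ceiling)
console.log(nextReputation(11, false));  // 10 (held at the floor)
```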

Usage

bun run eval:consensus                       # auto-detect changed skills, 5 runs
bun run eval:consensus --skill qa --runs 3   # specific skill, 3 runs
bun run eval:consensus --threshold 4         # require 4/5 YES to pass (default: 3)
bun run eval:consensus --reset-reputation    # reset all agents to 100

What's from the package vs what's local

From @consensus-tools/evals:

  • ReputationTracker — persistent reputation with payout(), slash(), syncToAgents(), incrementRounds()
  • ReputationStorage interface — plugged into a JSON file backend (.data/reputation.json)
  • AgentPersona type — for the 5 reviewer agent definitions

From @anthropic-ai/sdk (already a devDep):

  • Anthropic client — direct API calls to Sonnet 4.6

Local to this repo (gstack-specific):

  • Skill detection — git diff main --name-only to find changed SKILL.md files
  • Ground truth resolution — browse/SKILL.md for qa/qa-only, git show main:<skill>/SKILL.md for others
  • Diff guard prompt template
  • CLI output and JSON result persistence
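The skill detection and ground truth rules might look roughly like this. It is a sketch under assumptions: the helper names and the `GroundTruth` type are hypothetical, each skill is assumed to live in a top-level directory containing SKILL.md, and the functions take the text of `git diff main --name-only` as input rather than running git themselves.

```typescript
// Extract skill names from `git diff main --name-only` output.
// Assumption: a skill is a top-level directory holding a SKILL.md file.
function changedSkills(diffOutput: string): string[] {
  const skills = new Set<string>();
  for (const line of diffOutput.split("\n")) {
    const match = /^([^/]+)\/SKILL\.md$/.exec(line.trim());
    if (match) skills.add(match[1]);
  }
  return Array.from(skills).sort();
}

// Where the reference text for a skill comes from.
type GroundTruth =
  | { kind: "file"; path: string }   // read from the working tree
  | { kind: "gitShow"; ref: string }; // read via `git show <ref>`

function groundTruthFor(skill: string): GroundTruth {
  if (skill === "qa" || skill === "qa-only") {
    return { kind: "file", path: "browse/SKILL.md" };
  }
  return { kind: "gitShow", ref: `main:${skill}/SKILL.md` };
}

const sampleDiff = "qa/SKILL.md\nscripts/gen-skill-docs.ts\nqa-only/SKILL.md\n";
console.log(changedSkills(sampleDiff)); // [ 'qa', 'qa-only' ]
console.log(groundTruthFor("qa"));
```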

Output

  • .data/consensus-evals/{branch}-{timestamp}.json — per-run votes, pass rates, agent reputation
  • .data/reputation.json — persistent reputation state
  • Exit code 1 if any skill's pass rate falls below 60% (CI gate)
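One way to read the CI gate is sketched below. The function names and the shape of the per-skill results are hypothetical, and the gate is assumed to fail only when a skill's pass rate is strictly below 60%.

```typescript
// Fraction of runs that passed; an empty list counts as 0 so it fails the gate.
function passRate(runs: boolean[]): number {
  if (runs.length === 0) return 0;
  return runs.filter(Boolean).length / runs.length;
}

// Returns the process exit code: 1 if any skill is below the minimum pass rate.
function exitCodeFor(
  resultsBySkill: Record<string, boolean[]>,
  minRate = 0.6,
): 0 | 1 {
  const anyFailing = Object.keys(resultsBySkill).some(
    (skill) => passRate(resultsBySkill[skill]) < minRate,
  );
  return anyFailing ? 1 : 0;
}

// With the default 5 runs, 3 passing runs is exactly 60% and clears the gate.
console.log(exitCodeFor({ qa: [true, true, true, false, false] })); // 0
console.log(
  exitCodeFor({
    qa: [true, true, true, false, false],
    "qa-only": [true, false, false, false, true], // 40%
  }),
); // 1
```

A script would typically pass the result to `process.exit` once the JSON results have been written.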

Diff guard results (this PR)

Ran 3 rounds of Sonnet 4.6 diff guard against this PR itself:

  • Round 1: caught settleEval API misuse (was using A/B comparison API for guard votes) — fixed to direct payout/slash
  • Round 2: caught stale README reference to @consensus-tools/guards — removed
  • Round 3: 5/5 REWRITE at 0.60-0.72 risk (false positives from truncated source context — all flagged methods verified to exist)

Requires ANTHROPIC_API_KEY in .env.

Test plan

  • 171/171 static tests pass
  • Script runs, detects changed skills, saves results + reputation
  • Rate limit retry on 429
  • 3 rounds of Sonnet 4.6 diff guard — real issues found and fixed

🤖 Generated with Claude Code

Kai Cianflone and others added 2 commits March 17, 2026 01:41
…lates

Generator changes (scripts/gen-skill-docs.ts):
- Add severity definitions table (Critical/High/Medium/Low with
  criteria and examples) to Health Score Rubric section
- Add inline fallback for missing qa/templates/qa-report-template.md
  (agent can proceed without the template file)
- Add inline fallback for missing qa/references/issue-taxonomy.md
  (agent uses category/severity enums directly)
- Add issue format spec (ISSUE-NNN, title, severity, category,
  description, repro steps, screenshot paths)

Regenerated qa/SKILL.md and qa-only/SKILL.md.
Tier 1 static tests: 157/157 pass. CI freshness: 12/12 FRESH.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New script: `bun run eval:consensus`

Uses @consensus-tools/evals for:
- ReputationTracker with pluggable ReputationStorage (JSON file backend)
- AgentPersona types for the 5 specialized reviewer agents
- validateScore for LLM output validation
- computeEffectiveWeight from @consensus-tools/guards for rep-weighted voting
- Vercel AI SDK (@ai-sdk/anthropic) for model abstraction

Gstack-specific parts stay local:
- Skill detection (git diff main --name-only)
- Ground truth resolution (browse/SKILL.md for qa, main branch for others)
- Diff guard prompt template
- CLI output formatting and result persistence

171/171 static tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@garrytan

Owner

Closing as spam. The claim that document-release evals hallucinate CLI flags is false — document-release doesn't use any browse CLI commands. This is the 5th PR promoting the same single-author package (@consensus-tools/evals, 12 hours old). All previous attempts (#87, #88, #91, #123) were also closed.

@garrytan garrytan closed this Mar 17, 2026
anbangr added a commit to anbangr/gstack that referenced this pull request May 20, 2026
…iting "landed" (#61)

The daemon used to classify a sub-agent result based ONLY on exit code:
exit 0 = "landed", non-zero/timed_out = "blocked". But the /land-and-deploy
sub-agent (Kimi / Claude / Codex) gracefully declines to merge in real
scenarios — failing CI, no PR found, pre-merge gate fails — and still
exits 0 (subprocess ran cleanly, verdict lives in prose).

Observed in production on 2026-05-19 (mitosis-control-plane release-queue):

  Queue says                GitHub reality
  PR garrytan#120 landed    OPEN, e2e failing
  PR garrytan#122 landed    MERGED ✓
  PR garrytan#124 landed    MERGED ✓
  PR garrytan#129 landed    OPEN, e2e failing

50% false-positive rate. The queue records are stuck on "landed" so
discoverQueuedRecords (which filters status === "queued") never re-fans
out for them — silently abandoned.

Fix: after the land sub-agent returns exit 0, ask GitHub for the
authoritative PR state. New helper `verifyPrMerged(prNumber, repoIdentity,
cwd)` runs `gh pr view <pr> --repo <owner/name> --json state -q .state`
and returns { merged: true } only on MERGED. Anything else (OPEN, CLOSED,
gh exited non-zero, repoIdentity unparseable) returns { merged: false,
reason: "..." } and the daemon writes "blocked" with a useful lastError.

The repoIdentity is parsed via a strict ^github.com/owner/repo$ regex
so a planted record can't sneak shell-specials through to gh — anything
that doesn't match the regex gets rejected before any subprocess runs.
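
The check described above might look roughly like this. It is a sketch, not the commit's code: the character classes in the regex are an assumption (the message only says the pattern is anchored as ^github.com/owner/repo$), `checkPrMerged` and the `RunGh` callback are hypothetical names, and the gh invocation is injected so the example runs without a subprocess.

```typescript
type MergeCheck = { merged: true } | { merged: false; reason: string };

// Hypothetical stand-in for spawning gh: receives argv, returns exit code and stdout.
type RunGh = (args: string[]) => { exitCode: number; stdout: string };

// Assumed pattern; only plain owner/repo characters are accepted.
const REPO_IDENTITY = /^github\.com\/([A-Za-z0-9_.-]+)\/([A-Za-z0-9_.-]+)$/;

// Returns "owner/name", or null when the identity does not match exactly.
function parseRepoIdentity(identity: string): string | null {
  const match = REPO_IDENTITY.exec(identity);
  return match ? `${match[1]}/${match[2]}` : null;
}

function checkPrMerged(prNumber: number, repoIdentity: string, runGh: RunGh): MergeCheck {
  const repo = parseRepoIdentity(repoIdentity);
  if (repo === null) {
    // Rejected before any subprocess runs.
    return { merged: false, reason: `unparseable repoIdentity: ${repoIdentity}` };
  }
  const result = runGh([
    "pr", "view", String(prNumber), "--repo", repo, "--json", "state", "-q", ".state",
  ]);
  if (result.exitCode !== 0) {
    // Network or auth failure: report not-merged instead of throwing.
    return { merged: false, reason: `gh exited ${result.exitCode}` };
  }
  const state = result.stdout.trim();
  return state === "MERGED"
    ? { merged: true }
    : { merged: false, reason: `PR state is ${state}` };
}

const fakeOpen: RunGh = () => ({ exitCode: 0, stdout: "OPEN\n" });
console.log(checkPrMerged(120, "github.com/owner/repo", fakeOpen));
```

Passing the arguments as an array rather than a shell string is a second line of defence alongside the regex.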

Network/auth failures (gh exited non-zero) are reported as not-merged
rather than thrown — better to block the record and retry than crash
the daemon and leak the lock. The remediation for any blocked record
is the existing retryReleaseQueueRecord CLI.

Injection seam: opts.verifyMerged?: typeof verifyPrMerged so tests can
control the response. Defaults to the real verifyPrMerged.
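
A sketch of how such a seam can be used, under assumptions: `classifyLandResult`, the synchronous `VerifyMerged` signature, and the stub default are all hypothetical, and the real helper likely runs asynchronously and takes more arguments.

```typescript
type MergeCheck = { merged: true } | { merged: false; reason: string };
type VerifyMerged = (prNumber: number) => MergeCheck;

interface ClassifyOpts {
  verifyMerged?: VerifyMerged; // tests inject a fake; production uses the default
}

interface Classification {
  status: "landed" | "blocked";
  lastError?: string;
}

// Stand-in for the real helper, which would shell out to gh.
const defaultVerify: VerifyMerged = () => ({
  merged: false,
  reason: "verification not wired up in this sketch",
});

function classifyLandResult(
  exitCode: number,
  prNumber: number,
  opts: ClassifyOpts = {},
): Classification {
  if (exitCode !== 0) {
    return { status: "blocked", lastError: `sub-agent exited ${exitCode}` };
  }
  // Exit 0 is no longer enough: confirm the merge before writing "landed".
  const verify = opts.verifyMerged ?? defaultVerify;
  const check = verify(prNumber);
  return check.merged
    ? { status: "landed" }
    : { status: "blocked", lastError: check.reason };
}

// A test controls the response without touching GitHub:
const stillOpen: VerifyMerged = () => ({ merged: false, reason: "PR state is OPEN" });
console.log(classifyLandResult(0, 120, { verifyMerged: stillOpen }));
```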

Scope: pre-existing bug, not a regression from PR #57. PR garrytan#120 was
marked landed at 12:30Z on 2026-05-19, hours before PR #57's work
began. PR #57's deploy verification surfaced it by running through a
real PR (garrytan#129) with failing e2e.

Tests: 60/60 release-daemon.test.ts (5 new regressions covering the
production bug pattern, the happy path, network failures, identity
parsing, and the positive parse case). Full orchestrator suite: 1525
pass / 16 fail, all 16 pre-existing on origin/main (confirmed via
git stash).

Remediation for stranded records on production right now (PR#120,
PR#129 on mitosis-control-plane): once this lands, the daemon needs
manual revival of those two records. The retryReleaseQueueRecord CLI
only handles blocked→queued, not landed→queued. Suggest a one-shot
audit script or extending the retry CLI to support landed records
whose GitHub state disagrees. Out of scope for this PR.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>