feat: add multi-agent consensus eval runner with persistent reputation #124
Closed
kaicianflone wants to merge 2 commits into
…lates

Generator changes (scripts/gen-skill-docs.ts):
- Add severity definitions table (Critical/High/Medium/Low with criteria and examples) to Health Score Rubric section
- Add inline fallback for missing qa/templates/qa-report-template.md (agent can proceed without the template file)
- Add inline fallback for missing qa/references/issue-taxonomy.md (agent uses category/severity enums directly)
- Add issue format spec (ISSUE-NNN, title, severity, category, description, repro steps, screenshot paths)

Regenerated qa/SKILL.md and qa-only/SKILL.md.
Tier 1 static tests: 157/157 pass. CI freshness: 12/12 FRESH.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New script: `bun run eval:consensus`

Uses @consensus-tools/evals for:
- ReputationTracker with pluggable ReputationStorage (JSON file backend)
- AgentPersona types for the 5 specialized reviewer agents
- validateScore for LLM output validation
- computeEffectiveWeight from @consensus-tools/guards for rep-weighted voting
- Vercel AI SDK (@ai-sdk/anthropic) for model abstraction

Gstack-specific parts stay local:
- Skill detection (git diff main --name-only)
- Ground truth resolution (browse/SKILL.md for qa, main branch for others)
- Diff guard prompt template
- CLI output formatting and result persistence

171/171 static tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
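The two gstack-specific steps named in this commit message, skill detection and ground truth resolution, might look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's code: `changedSkills`, `groundTruthFor`, and the `GroundTruth` shape are hypothetical names, each skill is assumed to live at `<skill>/SKILL.md`, and qa-only is assumed to share qa's ground truth file.

```typescript
// Hypothetical description of where a skill's ground truth comes from.
type GroundTruth =
  | { kind: "file"; path: string }      // read from the working tree
  | { kind: "git-show"; spec: string }; // read via `git show <spec>`

const SKILL_SUFFIX = "/SKILL.md";

// Extract skill directory names from the text printed by
// `git diff main --name-only` (one path per line).
// Assumption: skills live at <skill>/SKILL.md; all other paths are ignored.
function changedSkills(diffOutput: string): string[] {
  return diffOutput
    .split("\n")
    .map((line) => line.trim())
    .filter((path) => path.slice(-SKILL_SUFFIX.length) === SKILL_SUFFIX)
    .map((path) => path.slice(0, -SKILL_SUFFIX.length))
    // Drop duplicates while keeping first-seen order.
    .filter((skill, index, all) => all.indexOf(skill) === index);
}

// Decide which document a changed skill should be reviewed against.
function groundTruthFor(skill: string): GroundTruth {
  if (skill === "qa" || skill === "qa-only") {
    return { kind: "file", path: "browse/SKILL.md" };
  }
  // Every other skill is compared with its own version on the main branch.
  return { kind: "git-show", spec: "main:" + skill + "/SKILL.md" };
}
```

Keeping both helpers as pure functions over strings means the git subprocess call can stay at the edge of the script, which makes the detection logic easy to test without a repository.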
Owner
Closing as spam. The claim that document-release evals hallucinate CLI flags is false: document-release doesn't use any browse CLI commands. This is the 5th PR promoting the same single-author package (@consensus-tools/evals, 12 hours old). All previous attempts (#87, #88, #91, #123) were also closed.
anbangr added a commit to anbangr/gstack that referenced this pull request on May 20, 2026
…iting "landed" (#61)

The daemon used to classify a sub-agent result based ONLY on exit code: exit 0 = "landed", non-zero/timed_out = "blocked". But the /land-and-deploy sub-agent (Kimi / Claude / Codex) gracefully declines to merge in real scenarios (failing CI, no PR found, pre-merge gate fails) and still exits 0 (subprocess ran cleanly, verdict lives in prose).

Observed in production on 2026-05-19 (mitosis-control-plane release-queue):

  Queue says               GitHub reality
  PR garrytan#120 landed   OPEN, e2e failing
  PR garrytan#122 landed   MERGED ✓
  PR garrytan#124 landed   MERGED ✓
  PR garrytan#129 landed   OPEN, e2e failing

50% false-positive rate. The queue records are stuck on "landed" so discoverQueuedRecords (which filters status === "queued") never re-fans out for them; they are silently abandoned.

Fix: after the land sub-agent returns exit 0, ask GitHub for the authoritative PR state. New helper `verifyPrMerged(prNumber, repoIdentity, cwd)` runs `gh pr view <pr> --repo <owner/name> --json state -q .state` and returns { merged: true } only on MERGED. Anything else (OPEN, CLOSED, gh exited non-zero, repoIdentity unparseable) returns { merged: false, reason: "..." } and the daemon writes "blocked" with a useful lastError.

The repoIdentity is parsed via a strict ^github.com/owner/repo$ regex so a planted record can't sneak shell-specials through to gh: anything that doesn't match the regex gets rejected before any subprocess runs.

Network/auth failures (gh exited non-zero) are reported as not-merged rather than thrown, since it is better to block the record and retry than to crash the daemon and leak the lock. The remediation for any blocked record is the existing retryReleaseQueueRecord CLI.

Injection seam: opts.verifyMerged?: typeof verifyPrMerged so tests can control the response. Defaults to the real verifyPrMerged.

Scope: pre-existing bug, not a regression from PR #57. PR garrytan#120 was marked landed at 12:30Z on 2026-05-19, hours before PR #57's work began. PR #57's deploy verification surfaced it by running through a real PR (garrytan#129) with failing e2e.

Tests: 60/60 release-daemon.test.ts (5 new regressions covering the production bug pattern, the happy path, network failures, identity parsing, and the positive parse case). Full orchestrator suite: 1525 pass / 16 fail, all 16 pre-existing on origin/main (confirmed via git stash).

Remediation for stranded records on production right now (PR#120, PR#129 on mitosis-control-plane): once this lands, the daemon needs manual revival of those two records. The retryReleaseQueueRecord CLI only handles blocked→queued, not landed→queued. Suggest a one-shot audit script or extending the retry CLI to support landed records whose GitHub state disagrees. Out of scope for this PR.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
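The verification step described in this commit message can be sketched as follows. This is a hedged reconstruction, not the commit's actual code: the commit gives the signature `verifyPrMerged(prNumber, repoIdentity, cwd)`, whereas this sketch replaces `cwd` with an injected `runGh` function (a hypothetical `GhRunner` type) so it runs without a subprocess. The character class in the regex, the `MergeCheck` type, the reason strings, and `classifyLandResult` are likewise assumptions made for illustration.

```typescript
// Result shape taken from the commit message; the type name is hypothetical.
type MergeCheck = { merged: true } | { merged: false; reason: string };

// Hypothetical seam: runs `gh` with the given arguments and returns stdout,
// or null when gh exits non-zero (network or auth failure).
type GhRunner = (args: string[]) => string | null;

// Strict identity check: exactly github.com/<owner>/<repo>.
// The allowed character set is an assumption; the point is that shell-specials
// such as ';', '$', or spaces can never match.
const REPO_IDENTITY = /^github\.com\/([A-Za-z0-9_.-]+)\/([A-Za-z0-9_.-]+)$/;

function parseRepoIdentity(identity: string): string | null {
  const match = REPO_IDENTITY.exec(identity);
  return match ? match[1] + "/" + match[2] : null;
}

function verifyPrMerged(prNumber: number, repoIdentity: string, runGh: GhRunner): MergeCheck {
  const repo = parseRepoIdentity(repoIdentity);
  if (repo === null) {
    // Rejected before any subprocess runs.
    return { merged: false, reason: "unparseable repoIdentity" };
  }
  const output = runGh([
    "pr", "view", String(prNumber),
    "--repo", repo, "--json", "state", "-q", ".state",
  ]);
  if (output === null) {
    // Reported as not merged rather than thrown, so the daemon keeps running.
    return { merged: false, reason: "gh exited non-zero" };
  }
  const state = output.trim();
  if (state === "MERGED") {
    return { merged: true };
  }
  return { merged: false, reason: "PR state is " + state };
}

// Exit code 0 alone is no longer enough to record "landed".
function classifyLandResult(
  exitCode: number,
  check: () => MergeCheck,
): { status: "landed" | "blocked"; lastError?: string } {
  if (exitCode !== 0) {
    return { status: "blocked", lastError: "sub-agent exited " + exitCode };
  }
  const result = check();
  if (result.merged) {
    return { status: "landed" };
  }
  return { status: "blocked", lastError: result.reason };
}
```

In a real daemon the `runGh` argument would presumably wrap a subprocess call to the `gh` binary with an argument array rather than a shell string, which, together with the regex, keeps record contents out of any shell interpretation.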
Summary
New script: `bun run eval:consensus`, a CI-ready multi-agent review gate for SKILL.md changes.

How it works
5 specialized AI agents (Doc Architect, API Accuracy, Agent Usability, Completeness Auditor, Style Guardian) each review the diff of changed SKILL.md files against ground truth. Each votes YES/NO/REWRITE with a risk score. A run passes when 3/5 agents vote YES.
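The pass rule above can be sketched as a simple tally. This is a minimal illustration, not the PR's implementation: `AgentVote`, `Verdict`, and `tallyVotes` are hypothetical names, "3/5" is read as "at least 3 YES votes", and the risk values in the example are made up because the description does not state a scale.

```typescript
type Vote = "YES" | "NO" | "REWRITE";

// Hypothetical record of one reviewer agent's decision.
interface AgentVote {
  agent: string;
  vote: Vote;
  risk: number; // risk score; scale not specified in the description
}

interface Verdict {
  yes: number;
  no: number;
  rewrite: number;
  passed: boolean;
}

// Count each kind of vote and apply the YES threshold (assumed: >= 3).
function tallyVotes(votes: AgentVote[], yesThreshold: number = 3): Verdict {
  const count = (kind: Vote): number =>
    votes.filter((v) => v.vote === kind).length;
  const yes = count("YES");
  return {
    yes: yes,
    no: count("NO"),
    rewrite: count("REWRITE"),
    passed: yes >= yesThreshold,
  };
}

// Illustrative round using the five reviewer names from the description.
const exampleRound: AgentVote[] = [
  { agent: "Doc Architect", vote: "YES", risk: 0.1 },
  { agent: "API Accuracy", vote: "YES", risk: 0.2 },
  { agent: "Agent Usability", vote: "REWRITE", risk: 0.5 },
  { agent: "Completeness Auditor", vote: "YES", risk: 0.3 },
  { agent: "Style Guardian", vote: "NO", risk: 0.7 },
];

console.log(tallyVotes(exampleRound));
```

Note that REWRITE is counted separately from NO here but neither contributes toward passing; whether the real runner treats them differently is not stated in the description.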
Agent reputation persists across runs via `ReputationTracker` from `@consensus-tools/evals`:
- … (`tracker.payout`)
- … (`tracker.slash`)

Usage
What's from the package vs what's local
From `@consensus-tools/evals`:
- `ReputationTracker`: persistent reputation with `payout()`, `slash()`, `syncToAgents()`, `incrementRounds()`
- `ReputationStorage` interface: plugged into a JSON file backend (`.data/reputation.json`)
- `AgentPersona` type: for the 5 reviewer agent definitions

From `@anthropic-ai/sdk` (already a devDep):
- `Anthropic` client: direct API calls to Sonnet 4.6

Local to this repo (gstack-specific):
- Skill detection: `git diff main --name-only` to find changed SKILL.md files
- Ground truth resolution: `browse/SKILL.md` for qa/qa-only, `git show main:<skill>/SKILL.md` for others

Output
- `.data/consensus-evals/{branch}-{timestamp}.json`: per-run votes, pass rates, agent reputation
- `.data/reputation.json`: persistent reputation state

Diff guard results (this PR)
Ran 3 rounds of Sonnet 4.6 diff guard against this PR itself:
- `settleEval` API misuse (was using A/B comparison API for guard votes): fixed to direct `payout`/`slash`
- `@consensus-tools/guards`: removed

Requires `ANTHROPIC_API_KEY` in `.env`.

Test plan
🤖 Generated with Claude Code