feat(qa): add Browse CLI Reference section to qa/SKILL.md by kaicianflone · Pull Request #91 · garrytan/gstack

kaicianflone · 2026-03-16T06:53:00Z

Summary

The qa/SKILL.md v2.0.0 uses browse CLI commands ($B goto, $B snapshot -i, $B fill @e3, $B click @e5, $B console --errors, etc.) across all 11 phases but never formally documents their flags, arguments, or valid values. An AI agent must infer correct invocations from scattered usage examples — a gap the LLM-as-judge consistently identifies as the sole blocker between a 4/5 and 5/5 quality score.

This PR adds a Browse CLI Reference section with formal documentation for every $B command used in the QA workflow.

What changed

Browse CLI Reference section added after the intro, before Setup: 7 command subsections covering 10 commands (snapshot, goto, click, fill, screenshot, console, cookie-import, links, viewport, js)
Each command includes a flag table with descriptions, positional argument ordering, and concrete bash examples
snapshot documents all 8 flags (-i, -c, -d, -a, -o, -D, -C, -s) with accurate descriptions cross-referenced against browse/SKILL.md
screenshot clarifies which args are flags (--viewport, --clip) vs positional ([selector|@ref], [path])
snapshot -D documents the "first call stores baseline, second shows changes" behavior
links output format documented as "text → href" (matches browse/SKILL.md)

Process

This change was generated and validated using consensus-tools, an open-source multi-agent consensus framework. The process:

1. Baseline measurement

Scored the default qa/SKILL.md v2.0.0 using an LLM-as-judge (claude-sonnet-4-6) on three dimensions — clarity, completeness, actionability — each 1-5. Ran 15 times across 3 temperatures (0, 0.3, 0.7) for statistical rigor.

Default v2.0.0: 4.0/5 (15/15 identical, zero variance)

The judge's consistent reasoning: "Browse CLI commands are used throughout but never formally documented with their flags and valid arguments — an agent must infer syntax from usage examples alone."

2. Consensus proposal generation

5 AI guard agents with distinct evaluation focuses proposed improvements:

Agent	Focus	Reputation
Doc Architect	Document structure, progressive disclosure	104
API Accuracy Checker	Command correctness, valid values	96
Agent Usability Tester	Zero-guess invocations	104
Completeness Auditor	Missing commands, edge cases	103
Style Guardian	Formatting consistency, accuracy	121

The proposer was given the exact correct flag definitions from browse/SKILL.md to prevent hallucination. Doc Architect generated the accepted proposal — a Browse CLI Reference section with flag tables and examples.

3. Diff guard review (3 rounds)

Each round, all 5 agents reviewed the diff against browse/SKILL.md ground truth for factual accuracy.

Round	Votes	Issues caught	Fixes applied
1	4 YES, 1 NO	`links` described as "URLs and link text" (should be `"text → href"` format); `cookie-import` overstated JSON schema fields	Fixed both
2	4 YES, 1 NO	`screenshot --viewport` falsely described as "default behavior"; `snapshot -D` missing "first call stores baseline" nuance	Fixed both
3	5/5 YES	None — all agents approved (max risk: 0.15)	Clean

Style Guardian caught all 4 issues across the first two rounds, earning the highest reputation (121) through accurate code review.
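As a rough illustration of the review loop, the sketch below replays the three rounds from the table and finds the first round with no NO votes and no open issues. The `Round` type and the unanimity stopping rule are assumptions inferred from the table (the process ended at 5/5), not the documented consensus-tools behavior.

```python
from dataclasses import dataclass


@dataclass
class Round:
    """One diff guard round (hypothetical type; fields mirror the table columns)."""
    number: int
    yes_votes: int
    no_votes: int
    issues_caught: int


# Values copied from the diff guard table above.
rounds = [
    Round(number=1, yes_votes=4, no_votes=1, issues_caught=2),
    Round(number=2, yes_votes=4, no_votes=1, issues_caught=2),
    Round(number=3, yes_votes=5, no_votes=0, issues_caught=0),
]


def first_clean_round(history):
    """Return the number of the first unanimous, issue-free round, or None."""
    for r in history:
        # Assumed stopping rule: every agent votes YES and nothing is flagged.
        if r.no_votes == 0 and r.issues_caught == 0:
            return r.number
    return None


total_issues = sum(r.issues_caught for r in rounds)
print(f"clean at round {first_clean_round(rounds)}, {total_issues} issues fixed on the way")
```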

4. Reputation settlement

Agents earn or lose reputation based on whether their votes align with ground truth. Symmetric ±4 payoffs ensure no dominant strategy (always-YES and always-NO have equal expected value).
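To make the payoff argument concrete, here is a minimal sketch of the expected reputation change for an agent that always votes the same way. It assumes a vote earns +4 when it matches ground truth and loses 4 otherwise; `expected_payoff` and `p_yes_correct` are illustrative names, and the real settlement logic in consensus-tools may differ. Under this reading, the two constant strategies break even with each other only when YES and NO are equally likely to be correct.

```python
def expected_payoff(vote_yes: bool, p_yes_correct: float, stake: float = 4.0) -> float:
    """Expected reputation change for a fixed vote under symmetric +/- stake payoffs.

    p_yes_correct is the (assumed) probability that YES matches ground truth.
    """
    p_aligned = p_yes_correct if vote_yes else 1.0 - p_yes_correct
    return stake * p_aligned - stake * (1.0 - p_aligned)


for p in (0.25, 0.5, 0.75):
    always_yes = expected_payoff(True, p)
    always_no = expected_payoff(False, p)
    print(f"p(YES correct)={p}: always-YES {always_yes:+.1f}, always-NO {always_no:+.1f}")
```

When the share of good proposals drifts away from one half, one constant vote does better than the other, so only agents that track ground truth can accumulate reputation in both regimes.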

After 6 rounds, Style Guardian leads at 121 — it was penalized earlier for being too conservative on good proposals, but recovered by catching real inaccuracies in diff review that other agents missed.

5. Final verification

30 eval runs (15 per version) across 3 temperatures:

Version	Temp 0	Temp 0.3	Temp 0.7	Overall (15 runs)
Default v2.0.0	4/4/4	4/4/4	4/4/4	4.00/5
This PR	5/5/5	5/5/5	5/5/5	5.00/5
Delta	+1.00	+1.00	+1.00	+1.00 (+25%)

30 runs, zero variance for either version at any temperature.
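The summary figures can be reproduced from the table with a few lines of arithmetic. This sketch assumes each "4/4/4" cell lists the three judged dimensions (clarity, completeness, actionability); the table does not say so explicitly, and the `summarize` helper is hypothetical.

```python
from statistics import mean, pvariance

# Scores copied from the table above, keyed by temperature.
# Reading each triple as clarity/completeness/actionability is an assumption.
scores = {
    "Default v2.0.0": {0.0: (4, 4, 4), 0.3: (4, 4, 4), 0.7: (4, 4, 4)},
    "This PR": {0.0: (5, 5, 5), 0.3: (5, 5, 5), 0.7: (5, 5, 5)},
}


def summarize(by_temperature):
    """Return (mean, population variance) over every score for one version."""
    flat = [score for triple in by_temperature.values() for score in triple]
    return mean(flat), pvariance(flat)


baseline_mean, baseline_var = summarize(scores["Default v2.0.0"])
pr_mean, pr_var = summarize(scores["This PR"])

delta = pr_mean - baseline_mean
relative_gain = delta / baseline_mean * 100  # percent improvement over baseline

print(f"{baseline_mean:.2f} -> {pr_mean:.2f} (+{delta:.2f}, +{relative_gain:.0f}%)")
print(f"variance: baseline={baseline_var}, pr={pr_var}")
```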

gstack eval suite: 9/9 pass, zero regressions, cross-skill consistency 5/5.

Accuracy verification

Every flag in the Browse CLI Reference was cross-referenced against browse/SKILL.md:

Flag	Description in PR	Ground truth	Status
`snapshot -i`	Interactive elements only	`-i --interactive`	✅
`snapshot -c`	Compact, no empty nodes	`-c --compact`	✅
`snapshot -d N`	Limit tree depth	`-d --depth`	✅
`snapshot -a`	Annotated screenshot with red overlay	`-a --annotate`	✅
`snapshot -D`	First call stores baseline, second shows changes	`-D --diff`	✅
`snapshot -C`	Cursor-interactive @c refs	`-C --cursor-interactive`	✅
`snapshot -s <sel>`	Scope to CSS selector	`-s --selector`	✅
`snapshot -o <path>`	Output path for annotated screenshot	`-o --output`	✅
`console --errors`	Filter to error/warning	`--errors`	✅
`console --clear`	Reset buffer	`--clear`	✅
`links` output	`"text → href"` format	`"text → href"`	✅
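A cross-reference like this can be automated as a set comparison. The sketch below is illustrative only: `GROUND_TRUTH` holds just the flags listed in the table above (not the full contents of browse/SKILL.md), `audit_flags` is a hypothetical helper, and the `-x` flag in the sample draft is deliberately made up to show what a failed check looks like.

```python
# Flags listed in the table above, grouped by command. This is only the subset
# shown in this PR description, not a complete copy of browse/SKILL.md.
GROUND_TRUTH = {
    "snapshot": {"-i", "-c", "-d", "-a", "-D", "-C", "-s", "-o"},
    "console": {"--errors", "--clear"},
}


def audit_flags(documented, ground_truth):
    """Report flags that appear on only one side, per command (hypothetical helper)."""
    report = {}
    for command in sorted(set(documented) | set(ground_truth)):
        doc = documented.get(command, set())
        truth = ground_truth.get(command, set())
        report[command] = {
            "not_in_ground_truth": sorted(doc - truth),
            "missing_from_docs": sorted(truth - doc),
        }
    return report


# A draft that invents one flag (-x) and omits another (-o).
draft = {
    "snapshot": {"-i", "-c", "-d", "-a", "-D", "-C", "-s", "-x"},
    "console": {"--errors", "--clear"},
}

report = audit_flags(draft, GROUND_TRUTH)
print(report["snapshot"])
```

A check of this kind only covers flag names; the description wording in the second column still needs a human or agent reviewer.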

Test plan

LLM judge: 30 eval runs, default 4.0 → PR 5.0 (+25%), zero variance
Diff guard: 3 rounds, 4 issues caught and fixed, final round 5/5 clean
gstack eval suite: 9/9 pass, zero regressions
Cross-skill consistency: 5/5
Every flag cross-referenced against browse/SKILL.md (11/11 accurate)
Manual review

🤖 Generated with Claude Code via consensus-tools

Diff-guard caught two inaccuracies:
- links: "Returns URLs and link text" → "Returns each link as text → href"
- cookie-import: removed assertion about required fields (browse/SKILL.md doesn't formally specify the JSON schema)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…note

Diff-guard round 2 fixes:
- screenshot: clarify --viewport and --clip are flags (not positional), remove false "default behavior" claim for --viewport
- snapshot -D: add "first call stores baseline, second shows changes" nuance from browse/SKILL.md ground truth

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

kaicianflone · 2026-03-16T07:02:08Z

Closing — need to apply changes to SKILL.md.tmpl (template) instead of SKILL.md directly per CONTRIBUTING.md. Will resubmit.

Kai Cianflone and others added 2 commits March 16, 2026 02:30

garrytan mentioned this pull request Mar 17, 2026

feat: add multi-agent consensus eval runner with persistent reputation #124

Closed

4 tasks

kaicianflone wants to merge 2 commits into garrytan:main from kaicianflone:qa-browse-cli-reference-v2
