feat(qa): add Browse CLI Reference section to qa/SKILL.md #91

Closed
kaicianflone wants to merge 2 commits into garrytan:main from kaicianflone:qa-browse-cli-reference-v2

@kaicianflone kaicianflone commented Mar 16, 2026

Summary

The qa/SKILL.md v2.0.0 uses browse CLI commands ($B goto, $B snapshot -i, $B fill @e3, $B click @e5, $B console --errors, etc.) across all 11 phases but never formally documents their flags, arguments, or valid values. An AI agent must infer correct invocations from scattered usage examples — a gap the LLM-as-judge consistently identifies as the sole blocker between a 4/5 and 5/5 quality score.

This PR adds a Browse CLI Reference section with formal documentation for every $B command used in the QA workflow.

What changed

  • Browse CLI Reference section added after the intro, before Setup — 7 command subsections covering snapshot, goto, click, fill, screenshot, console, cookie-import, links, viewport, js
  • Each command includes a flag table with descriptions, positional argument ordering, and concrete bash examples
  • snapshot documents all 8 flags (-i, -c, -d, -a, -o, -D, -C, -s) with accurate descriptions cross-referenced against browse/SKILL.md
  • screenshot clarifies which args are flags (--viewport, --clip) vs positional ([selector|@ref], [path])
  • snapshot -D documents the "first call stores baseline, second shows changes" behavior
  • links output format documented as "text → href" (matches browse/SKILL.md)
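The "first call stores baseline, second shows changes" behavior noted for snapshot -D can be pictured with a small stand-in. This is only a sketch of the pattern, not the browse implementation: the `SnapshotDiffer` class, the "baseline stored" message, the unified-diff output, and the sample `@e1`/`@e2` lines are all assumptions made for illustration.

```python
import difflib


class SnapshotDiffer:
    """Hypothetical stand-in for a baseline-then-diff snapshot mode."""

    def __init__(self):
        self._baseline = None  # nothing stored until the first call

    def diff(self, snapshot_text: str) -> str:
        # First call: keep the snapshot as the baseline, nothing to compare yet.
        if self._baseline is None:
            self._baseline = snapshot_text
            return "baseline stored"
        # Later calls: report changes relative to the stored baseline.
        # Assumption: the baseline is not replaced by later snapshots.
        changes = difflib.unified_diff(
            self._baseline.splitlines(),
            snapshot_text.splitlines(),
            fromfile="baseline",
            tofile="current",
            lineterm="",
        )
        return "\n".join(changes)


differ = SnapshotDiffer()
first = differ.diff("@e1 button Save")
second = differ.diff("@e1 button Save\n@e2 link Home")
print(first)
print(second)
```

The point for an agent reading the reference is that a single -D call yields no comparison; the second call is the one that carries information.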

Process

This change was generated and validated using consensus-tools, an open-source multi-agent consensus framework. The process:

1. Baseline measurement

Scored the default qa/SKILL.md v2.0.0 using an LLM-as-judge (claude-sonnet-4-6) on three dimensions (clarity, completeness, actionability), each rated 1-5. Ran the judge 15 times across 3 temperatures (0, 0.3, 0.7) to check that the score is stable rather than a sampling artifact.

Default v2.0.0: 4.0/5 (15/15 identical, zero variance)

The judge's consistent reasoning: "Browse CLI commands are used throughout but never formally documented with their flags and valid arguments — an agent must infer syntax from usage examples alone."

2. Consensus proposal generation

5 AI guard agents with distinct evaluation focuses proposed improvements:

| Agent | Focus | Reputation |
|---|---|---|
| Doc Architect | Document structure, progressive disclosure | 104 |
| API Accuracy Checker | Command correctness, valid values | 96 |
| Agent Usability Tester | Zero-guess invocations | 104 |
| Completeness Auditor | Missing commands, edge cases | 103 |
| Style Guardian | Formatting consistency, accuracy | 121 |

The proposer was given the exact correct flag definitions from browse/SKILL.md to prevent hallucination. Doc Architect generated the accepted proposal — a Browse CLI Reference section with flag tables and examples.

3. Diff guard review (3 rounds)

Each round, all 5 agents reviewed the diff against browse/SKILL.md ground truth for factual accuracy.

| Round | Votes | Issues caught | Fixes applied |
|---|---|---|---|
| 1 | 4 YES, 1 NO | links described as "URLs and link text" (should be "text → href" format); cookie-import overstated JSON schema fields | Fixed both |
| 2 | 4 YES, 1 NO | screenshot --viewport falsely described as "default behavior"; snapshot -D missing "first call stores baseline" nuance | Fixed both |
| 3 | 5/5 YES | None; all agents approved (max risk: 0.15) | Clean |

Style Guardian caught all 4 issues across the first two rounds, earning the highest reputation (121) through accurate code review.

4. Reputation settlement

Agents earn or lose reputation based on whether their votes align with ground truth. Symmetric ±4 payoffs ensure no dominant strategy (always-YES and always-NO have equal expected value).

After 6 rounds, Style Guardian leads at 121 — it was penalized earlier for being too conservative on good proposals, but recovered by catching real inaccuracies in diff review that other agents missed.
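The symmetric payoff can be made concrete with a few lines of arithmetic. This is a simplified model, not the consensus-tools settlement code: the function names and the `p_good` parameter (the fraction of diffs that are actually correct) are assumptions introduced here.

```python
PAYOFF = 4  # symmetric stake from the text above: +4 for a correct vote, -4 for a wrong one


def settle(vote: bool, truth: bool) -> int:
    """Reputation change for one vote under the symmetric rule."""
    return PAYOFF if vote == truth else -PAYOFF


def expected_payoff(always_vote: bool, p_good: float) -> float:
    """Expected per-round change for an agent that always casts the same vote.

    p_good is the assumed probability that the diff under review is correct.
    """
    return p_good * settle(always_vote, True) + (1 - p_good) * settle(always_vote, False)


for p in (0.5, 0.75):
    print(p, expected_payoff(True, p), expected_payoff(False, p))
```

Under this model the two blind strategies tie only when good and bad diffs are equally likely (`p_good = 0.5`); at other ratios one of them does better, so the no-dominant-strategy property depends on the mix of proposals the agents see.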

5. Final verification

30 eval runs across 3 temperatures (15 per version):

| Version | Temp 0 | Temp 0.3 | Temp 0.7 | Overall (15 runs) |
|---|---|---|---|---|
| Default v2.0.0 | 4/4/4 | 4/4/4 | 4/4/4 | 4.00/5 |
| This PR | 5/5/5 | 5/5/5 | 5/5/5 | 5.00/5 |
| Delta | +1.00 | +1.00 | +1.00 | +1.00 (+25%) |

30 runs, zero variance in either direction, at any temperature.

gstack eval suite: 9/9 pass, zero regressions, cross-skill consistency 5/5.
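For anyone re-deriving the headline numbers, the mean, variance, and relative delta follow directly from the reported scores. The sketch below assumes each of the 15 runs per version produced a single overall score (4 for the default, 5 for this PR), as the zero-variance results imply; the variable names are illustrative only.

```python
from statistics import mean, pvariance

# Overall judge scores as reported: 15 identical runs per version.
baseline_scores = [4] * 15
pr_scores = [5] * 15


def summarize(scores):
    """Return (mean, population variance) for a list of judge scores."""
    return mean(scores), pvariance(scores)


base_mean, base_var = summarize(baseline_scores)
pr_mean, pr_var = summarize(pr_scores)

delta = pr_mean - base_mean          # absolute improvement on the 1-5 scale
relative = delta / base_mean * 100   # improvement relative to the baseline score

print(base_mean, base_var, pr_mean, pr_var, delta, relative)
```

Note that the +25% figure is relative to the baseline score of 4, not to the 5-point scale.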

Accuracy verification

Every flag in the Browse CLI Reference was cross-referenced against browse/SKILL.md:

| Flag | Description in PR | Ground truth | Status |
|---|---|---|---|
| snapshot -i | Interactive elements only | -i --interactive | Accurate |
| snapshot -c | Compact, no empty nodes | -c --compact | Accurate |
| snapshot -d N | Limit tree depth | -d --depth | Accurate |
| snapshot -a | Annotated screenshot with red overlay | -a --annotate | Accurate |
| snapshot -D | First call stores baseline, second shows changes | -D --diff | Accurate |
| snapshot -C | Cursor-interactive @c refs | -C --cursor-interactive | Accurate |
| snapshot -s <sel> | Scope to CSS selector | -s --selector | Accurate |
| snapshot -o <path> | Output path for annotated screenshot | -o --output | Accurate |
| console --errors | Filter to error/warning | --errors | Accurate |
| console --clear | Reset buffer | --clear | Accurate |
| links output | "text → href" format | "text → href" | Accurate |
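The presence half of this cross-reference is easy to automate. The sketch below is not the check that was actually run for this PR; it assumes the snapshot short-to-long flag pairs from the table above as ground truth, and the `cross_reference` helper is hypothetical. It only confirms that every documented flag exists in the ground truth and vice versa; the accuracy of each description still needs a reviewer.

```python
# Short flag -> long flag pairs for snapshot, copied from the table above.
GROUND_TRUTH = {
    "-i": "--interactive",
    "-c": "--compact",
    "-d": "--depth",
    "-a": "--annotate",
    "-D": "--diff",
    "-C": "--cursor-interactive",
    "-s": "--selector",
    "-o": "--output",
}

# Flags the new reference section documents for snapshot.
DOCUMENTED = ["-i", "-c", "-d", "-a", "-o", "-D", "-C", "-s"]


def cross_reference(documented, ground_truth):
    """Return (flags documented but unknown, flags known but undocumented)."""
    unknown = [flag for flag in documented if flag not in ground_truth]
    undocumented = [flag for flag in ground_truth if flag not in documented]
    return unknown, undocumented


unknown, undocumented = cross_reference(DOCUMENTED, GROUND_TRUTH)
print("unknown:", unknown, "undocumented:", undocumented)
```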

Test plan

  • LLM judge: 30 eval runs, default 4.0 → PR 5.0 (+25%), zero variance
  • Diff guard: 3 rounds, 4 issues caught and fixed, final round 5/5 clean
  • gstack eval suite: 9/9 pass, zero regressions
  • Cross-skill consistency: 5/5
  • Every flag cross-referenced against browse/SKILL.md (11/11 accurate)
  • Manual review

🤖 Generated with Claude Code via consensus-tools

Kai Cianflone and others added 2 commits March 16, 2026 02:30
Diff-guard caught two inaccuracies:
- links: "Returns URLs and link text" → "Returns each link as text → href"
- cookie-import: removed assertion about required fields (browse/SKILL.md
  doesn't formally specify the JSON schema)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…note

Diff-guard round 2 fixes:
- screenshot: clarify --viewport and --clip are flags (not positional),
  remove false "default behavior" claim for --viewport
- snapshot -D: add "first call stores baseline, second shows changes"
  nuance from browse/SKILL.md ground truth

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@kaicianflone

Closing: per CONTRIBUTING.md, these changes need to be applied to SKILL.md.tmpl (the template) rather than to SKILL.md directly. Will resubmit.
