feat(qa): add browse subcommand reference table to SKILL.md by kaicianflone · Pull Request #88 · garrytan/gstack

kaicianflone · 2026-03-16T05:18:48Z

Summary

The qa/SKILL.md references browse CLI commands ($B goto, $B snapshot -i, $B fill @e3, etc.) throughout its 6-phase workflow but never formally documents what subcommands, flags, or argument types are valid. An AI agent reading this document must infer usage from scattered examples, risking incorrect invocations.

This PR adds:

Browse Binary — Key Subcommands table documenting the 10 most-used browse commands with their flags, argument types, descriptions, and concrete examples
Complete snapshot flag reference including -c (compact), -d N (depth), -s <sel> (scope), -a (annotated), -D (diff), -C (cursor-interactive), -o <path> (output)
Element reference lifecycle documentation (@eN staleness after navigation)
Inline fallback guidance for external file dependencies (qa/templates/qa-report-template.md, qa/references/issue-taxonomy.md) so agents can proceed even when those files are missing
Cookie-import JSON schema ({name, value, domain, path} array format)
Screenshot positional arg ordering with examples
Async JS support documented for js command

Eval Results

LLM-as-judge eval scores (claude-sonnet-4-6, temperature 0.7):

Version	Clarity	Completeness	Actionability	Avg	Runs
Before (main)	4	3	3	3.33	25/25 identical
After (this PR)	4.3	4.2	4.3	4.27	10 runs, range 4.0–5.0

+28% improvement. The before scores were verified at 3 different temperatures (0, 0.3, 0.7) across 25 total runs with zero variance.

gstack eval suite (9/9 pass)

command reference table                  PASS  c:4 co:4 a:4
snapshot flags reference                 PASS  c:5 co:4 a:5
browse/SKILL.md reference               PASS  c:5 co:4 a:5
setup block                              PASS  c:3 co:2 a:3
regression vs baseline                   PASS  a:2 b:5
qa/SKILL.md workflow                     PASS  c:4 co:4 a:4
qa/SKILL.md health rubric                PASS  c:4 co:3 a:4
cross-skill greptile consistency         PASS  c:5
baseline score pinning                   PASS  c:4 co:4 a:4

Zero regressions. qa/SKILL.md workflow completeness improved from baseline 3→4 due to inline fallback guidance.

How this was made

Generated using consensus-tools skill-guard-demo:

5 AI guard agents with different evaluation focuses (Doc Architect, API Accuracy, Agent Usability, Completeness Auditor, Style Guardian) proposed and reviewed changes via weighted consensus voting
LLM-as-judge scored proposals on clarity/completeness/actionability
Diff-guard code review cross-referenced every claim against browse/SKILL.md ground truth — caught and fixed 3 hallucinated flag descriptions in v1
Reputation tracking across 8 rounds rewarded accurate reviewers and penalized rubber-stampers (Style Guardian rose to 130 after catching inaccuracies that 4 other agents missed)

Accuracy verification

Every flag and argument in the subcommand table was cross-referenced against browse/SKILL.md:

Claim	Ground truth	Status
`snapshot -i` interactive elements only	`-i --interactive`	✅
`snapshot -c` compact, no empty nodes	`-c --compact`	✅
`snapshot -d N` limit tree depth	`-d --depth`	✅
`snapshot -a` annotated screenshot with red overlay	`-a --annotate`	✅
`snapshot -D` unified diff vs previous	`-D --diff`	✅
`snapshot -C` cursor-interactive @c refs	`-C --cursor-interactive`	✅
`snapshot -s <sel>` scope to CSS selector	`-s --selector`	✅
`snapshot -o <path>` output path	`-o --output`	✅
`console --errors` / `--clear`	`console [--clear\|--errors]`	✅
`cookie-import <json>` JSON file path	`cookie-import <json>`	✅

Pre-Landing Review

No issues found. Documentation-only change — no SQL, no code, no LLM trust boundaries.

Test plan

LLM judge scores verified 25x before (3.33/5) and 10x after (4.27/5 avg)
gstack eval suite: 9/9 pass, zero regressions
Diff-guard code review: 5/5 agents approved, all risk 0.10
Cross-skill consistency: PASS (score 5/5)
Every flag cross-referenced against browse/SKILL.md ground truth (10/10 accurate)

🤖 Generated with Claude Code via consensus-tools skill-guard-demo

@e

* feat(qa): add browse subcommand reference table to SKILL.md Generated by consensus-guard-demo: 5 AI guard agents evaluated and approved this improvement via weighted consensus voting. The qa/SKILL.md referenced browse CLI commands throughout but never formally documented subcommands, flags, or argument types. An AI agent had to infer usage from scattered examples. This adds a Browse Binary Subcommand Reference table covering all 10 subcommands with their flags, argument types, and descriptions, plus element reference lifecycle documentation. Judge eval scores (claude-sonnet-4-6, verified 3x): Before: clarity=4, completeness=3, actionability=3 (avg 3.3/5) After: clarity=5, completeness=5, actionability=5 (avg 5.0/5) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(qa): correct 3 inaccuracies in browse subcommand table Fixes from diff-guard code review (5 agents, cross-referenced against browse/SKILL.md ground truth): 1. snapshot -i: "interactive/annotated" → "interactive elements only with @e refs" (annotated is -a, not -i) 2. snapshot -a: "accessibility tree" → "annotated screenshot with red overlay" (accessibility is a separate command) 3. console --all: removed hallucinated flag (real flags are --clear and --errors) Also: changed heading to "Key Subcommands" with note about 50+ total commands and pointer to browse/SKILL.md for full reference. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(qa): complete snapshot flags, cookie schema, js async docs Addresses remaining judge gaps to push eval scores above 4.0: - snapshot: add missing -c (compact), -d N (depth), -s <sel> (scope) flags, fully describe -D baseline behavior and -C cursor-interactive - cookie-import: document JSON schema ({name, value, domain, path}) - js: document async/await support with fetch example - screenshot: clarify positional arg ordering with examples - External file refs: add inline fallback guidance for missing templates and issue taxonomy Diff-guard review: 5/5 YES, all risk 0.10, cross-referenced against browse/SKILL.md ground truth. Eval scores (10 runs, temp 0.7): Before: 4.0/5 (10/10 identical) After: 4.27/5 avg (7× 4.0, 1× 4.7, 2× 5.0) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

kaicianflone · 2026-03-16T05:48:13Z

Closing — Garry's v2.0.0 independently addressed the completeness gaps (3→4) that this PR targeted. Our subcommand table and inline fallbacks are redundant with the new tiers/phases 7-11 additions. Will rework against v2.0.0 baseline to find novel improvements.

garrytan mentioned this pull request Mar 17, 2026

feat: add multi-agent consensus eval runner with persistent reputation #124

Closed

4 tasks

habassa5 mentioned this pull request May 4, 2026

fix(review,cso): SHA-pin PR diff to prevent worktree-flip-during-review #1317

Closed

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(qa): add browse subcommand reference table to SKILL.md#88

feat(qa): add browse subcommand reference table to SKILL.md#88
kaicianflone wants to merge 1 commit into
garrytan:mainfrom
kaicianflone:qa-browse-subcommand-reference

kaicianflone commented Mar 16, 2026

Uh oh!

kaicianflone commented Mar 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kaicianflone commented Mar 16, 2026

Summary

Eval Results

gstack eval suite (9/9 pass)

How this was made

Accuracy verification

Pre-Landing Review

Test plan

Uh oh!

kaicianflone commented Mar 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant