Skip to content

feat(qa): add browse subcommand reference table to SKILL.md#88

Closed
kaicianflone wants to merge 1 commit into
garrytan:mainfrom
kaicianflone:qa-browse-subcommand-reference
Closed

feat(qa): add browse subcommand reference table to SKILL.md#88
kaicianflone wants to merge 1 commit into
garrytan:mainfrom
kaicianflone:qa-browse-subcommand-reference

Conversation

@kaicianflone

Copy link
Copy Markdown

Summary

The qa/SKILL.md references browse CLI commands ($B goto, $B snapshot -i, $B fill @e3, etc.) throughout its 6-phase workflow but never formally documents what subcommands, flags, or argument types are valid. An AI agent reading this document must infer usage from scattered examples, risking incorrect invocations.

This PR adds:

  • Browse Binary — Key Subcommands table documenting the 10 most-used browse commands with their flags, argument types, descriptions, and concrete examples
  • Complete snapshot flag reference including -c (compact), -d N (depth), -s <sel> (scope), -a (annotated), -D (diff), -C (cursor-interactive), -o <path> (output)
  • Element reference lifecycle documentation (@eN staleness after navigation)
  • Inline fallback guidance for external file dependencies (qa/templates/qa-report-template.md, qa/references/issue-taxonomy.md) so agents can proceed even when those files are missing
  • Cookie-import JSON schema ({name, value, domain, path} array format)
  • Screenshot positional arg ordering with examples
  • Async JS support documented for js command

Eval Results

LLM-as-judge eval scores (claude-sonnet-4-6, temperature 0.7):

Version Clarity Completeness Actionability Avg Runs
Before (main) 4 3 3 3.33 25/25 identical
After (this PR) 4.3 4.2 4.3 4.27 10 runs, range 4.0–5.0

+28% improvement. The before scores were verified at 3 different temperatures (0, 0.3, 0.7) across 25 total runs with zero variance.

gstack eval suite (9/9 pass)

command reference table                  PASS  c:4 co:4 a:4
snapshot flags reference                 PASS  c:5 co:4 a:5
browse/SKILL.md reference               PASS  c:5 co:4 a:5
setup block                              PASS  c:3 co:2 a:3
regression vs baseline                   PASS  a:2 b:5
qa/SKILL.md workflow                     PASS  c:4 co:4 a:4
qa/SKILL.md health rubric                PASS  c:4 co:3 a:4
cross-skill greptile consistency         PASS  c:5
baseline score pinning                   PASS  c:4 co:4 a:4

Zero regressions. qa/SKILL.md workflow completeness improved from baseline 3→4 due to inline fallback guidance.

How this was made

Generated using consensus-tools skill-guard-demo:

  1. 5 AI guard agents with different evaluation focuses (Doc Architect, API Accuracy, Agent Usability, Completeness Auditor, Style Guardian) proposed and reviewed changes via weighted consensus voting
  2. LLM-as-judge scored proposals on clarity/completeness/actionability
  3. Diff-guard code review cross-referenced every claim against browse/SKILL.md ground truth — caught and fixed 3 hallucinated flag descriptions in v1
  4. Reputation tracking across 8 rounds rewarded accurate reviewers and penalized rubber-stampers (Style Guardian rose to 130 after catching inaccuracies that 4 other agents missed)

Accuracy verification

Every flag and argument in the subcommand table was cross-referenced against browse/SKILL.md:

Claim Ground truth Status
snapshot -i interactive elements only -i --interactive
snapshot -c compact, no empty nodes -c --compact
snapshot -d N limit tree depth -d --depth
snapshot -a annotated screenshot with red overlay -a --annotate
snapshot -D unified diff vs previous -D --diff
snapshot -C cursor-interactive @c refs -C --cursor-interactive
snapshot -s <sel> scope to CSS selector -s --selector
snapshot -o <path> output path -o --output
console --errors / --clear console [--clear|--errors]
cookie-import <json> JSON file path cookie-import <json>

Pre-Landing Review

No issues found. Documentation-only change — no SQL, no code, no LLM trust boundaries.

Test plan

  • LLM judge scores verified 25x before (3.33/5) and 10x after (4.27/5 avg)
  • gstack eval suite: 9/9 pass, zero regressions
  • Diff-guard code review: 5/5 agents approved, all risk 0.10
  • Cross-skill consistency: PASS (score 5/5)
  • Every flag cross-referenced against browse/SKILL.md ground truth (10/10 accurate)

🤖 Generated with Claude Code via consensus-tools skill-guard-demo

* feat(qa): add browse subcommand reference table to SKILL.md

Generated by consensus-guard-demo: 5 AI guard agents evaluated and
approved this improvement via weighted consensus voting.

The qa/SKILL.md referenced browse CLI commands throughout but never
formally documented subcommands, flags, or argument types. An AI agent
had to infer usage from scattered examples. This adds a Browse Binary
Subcommand Reference table covering all 10 subcommands with their
flags, argument types, and descriptions, plus element reference
lifecycle documentation.

Judge eval scores (claude-sonnet-4-6, verified 3x):
  Before: clarity=4, completeness=3, actionability=3 (avg 3.3/5)
  After:  clarity=5, completeness=5, actionability=5 (avg 5.0/5)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(qa): correct 3 inaccuracies in browse subcommand table

Fixes from diff-guard code review (5 agents, cross-referenced against
browse/SKILL.md ground truth):

1. snapshot -i: "interactive/annotated" → "interactive elements only
   with @e refs" (annotated is -a, not -i)
2. snapshot -a: "accessibility tree" → "annotated screenshot with red
   overlay" (accessibility is a separate command)
3. console --all: removed hallucinated flag (real flags are --clear
   and --errors)

Also: changed heading to "Key Subcommands" with note about 50+ total
commands and pointer to browse/SKILL.md for full reference.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(qa): complete snapshot flags, cookie schema, js async docs

Addresses remaining judge gaps to push eval scores above 4.0:

- snapshot: add missing -c (compact), -d N (depth), -s <sel> (scope)
  flags, fully describe -D baseline behavior and -C cursor-interactive
- cookie-import: document JSON schema ({name, value, domain, path})
- js: document async/await support with fetch example
- screenshot: clarify positional arg ordering with examples
- External file refs: add inline fallback guidance for missing
  templates and issue taxonomy

Diff-guard review: 5/5 YES, all risk 0.10, cross-referenced against
browse/SKILL.md ground truth.

Eval scores (10 runs, temp 0.7):
  Before: 4.0/5 (10/10 identical)
  After:  4.27/5 avg (7× 4.0, 1× 4.7, 2× 5.0)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@kaicianflone

Copy link
Copy Markdown
Author

Closing — Garry's v2.0.0 independently addressed the completeness gaps (3→4) that this PR targeted. Our subcommand table and inline fallbacks are redundant with the new tiers/phases 7-11 additions. Will rework against v2.0.0 baseline to find novel improvements.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant