feat(qa): add Browse CLI Reference section to qa/SKILL.md #91
Closed
kaicianflone wants to merge 2 commits into
Diff-guard caught two inaccuracies:
- links: "Returns URLs and link text" → "Returns each link as text → href"
- cookie-import: removed assertion about required fields (browse/SKILL.md doesn't formally specify the JSON schema)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…note
Diff-guard round 2 fixes:
- screenshot: clarify --viewport and --clip are flags (not positional), remove false "default behavior" claim for --viewport
- snapshot -D: add "first call stores baseline, second shows changes" nuance from browse/SKILL.md ground truth
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closing: the changes need to be applied to SKILL.md.tmpl (the template) instead of SKILL.md directly, per CONTRIBUTING.md. Will resubmit.
Summary
The `qa/SKILL.md` v2.0.0 uses browse CLI commands (`$B goto`, `$B snapshot -i`, `$B fill @e3`, `$B click @e5`, `$B console --errors`, etc.) across all 11 phases but never formally documents their flags, arguments, or valid values. An AI agent must infer correct invocations from scattered usage examples, a gap the LLM-as-judge consistently identifies as the sole blocker between a 4/5 and 5/5 quality score.
This PR adds a Browse CLI Reference section with formal documentation for every `$B` command used in the QA workflow.
What changed
- Commands documented: `snapshot`, `goto`, `click`, `fill`, `screenshot`, `console`, `cookie-import`, `links`, `viewport`, `js`
- `snapshot` documents all 8 flags (`-i`, `-c`, `-d`, `-a`, `-o`, `-D`, `-C`, `-s`) with accurate descriptions cross-referenced against `browse/SKILL.md`
- `screenshot` clarifies which args are flags (`--viewport`, `--clip`) vs positional (`[selector|@ref]`, `[path]`)
- `snapshot -D` documents the "first call stores baseline, second shows changes" behavior
- `links` output format documented as `"text → href"` (matches `browse/SKILL.md`)
Process
This change was generated and validated using consensus-tools, an open-source multi-agent consensus framework. The process:
1. Baseline measurement
Scored the default `qa/SKILL.md` v2.0.0 using an LLM-as-judge (claude-sonnet-4-6) on three dimensions (clarity, completeness, actionability), each rated 1-5. Ran 15 times across 3 temperatures (0, 0.3, 0.7) for statistical rigor.
Default v2.0.0: 4.0/5 (15/15 identical, zero variance)
The judge's consistent reasoning: "Browse CLI commands are used throughout but never formally documented with their flags and valid arguments — an agent must infer syntax from usage examples alone."
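The repeated-scoring procedure above can be sketched roughly as follows. This is a minimal illustration, not the consensus-tools implementation: `judge_score` is a hypothetical stand-in for the LLM-as-judge call that simply returns the 4.0 reported above, and the even split of 5 runs per temperature is an assumption.

```python
from statistics import mean, pvariance

TEMPERATURES = [0, 0.3, 0.7]
RUNS_PER_TEMPERATURE = 5  # assumption: 15 runs split evenly across 3 temperatures


def judge_score(temperature: float) -> float:
    """Hypothetical stand-in for the LLM-as-judge call.

    Returns the overall score reported in the PR (4.0) for every run;
    a real judge would query a model at the given temperature.
    """
    return 4.0


def run_baseline() -> dict:
    # Collect one overall score per run, across every temperature.
    scores = [
        judge_score(t)
        for t in TEMPERATURES
        for _ in range(RUNS_PER_TEMPERATURE)
    ]
    return {
        "runs": len(scores),
        "mean": mean(scores),
        "variance": pvariance(scores),
    }


summary = run_baseline()
print(summary)
```

Reporting the variance alongside the mean is what makes a claim such as "15/15 identical" checkable rather than anecdotal.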
2. Consensus proposal generation
5 AI guard agents with distinct evaluation focuses proposed improvements:
The proposer was given the exact correct flag definitions from `browse/SKILL.md` to prevent hallucination. Doc Architect generated the accepted proposal: a Browse CLI Reference section with flag tables and examples.
3. Diff guard review (3 rounds)
Each round, all 5 agents reviewed the diff against `browse/SKILL.md` ground truth for factual accuracy.
- Round 1: `links` described as "URLs and link text" (should be `"text → href"` format); `cookie-import` overstated JSON schema fields
- Round 2: `screenshot --viewport` falsely described as "default behavior"; `snapshot -D` missing "first call stores baseline" nuance
Style Guardian caught all 4 issues across the first two rounds, earning the highest reputation (121) through accurate code review.
4. Reputation settlement
Agents earn or lose reputation based on whether their votes align with ground truth. Symmetric ±4 payoffs ensure no dominant strategy (always-YES and always-NO have equal expected value).
After 6 rounds, Style Guardian leads at 121 — it was penalized earlier for being too conservative on good proposals, but recovered by catching real inaccuracies in diff review that other agents missed.
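The symmetric payoff rule can be illustrated with a small expected-value calculation. This is a sketch under stated assumptions, not the consensus-tools settlement code: the function names are hypothetical, and equal expected value is checked only for the case where a proposal is equally likely to be correct or incorrect (a prior the PR does not state).

```python
PAYOFF = 4  # symmetric: +4 when a vote matches ground truth, -4 when it does not


def settle(vote: bool, truth: bool) -> int:
    """Reputation change for a single vote (hypothetical helper)."""
    return PAYOFF if vote == truth else -PAYOFF


def expected_value(vote: bool, p_truth_yes: float) -> float:
    """Expected reputation change for an agent that always casts `vote`."""
    return p_truth_yes * settle(vote, True) + (1 - p_truth_yes) * settle(vote, False)


# Assumption: ground truth is YES half of the time.
ev_always_yes = expected_value(True, 0.5)
ev_always_no = expected_value(False, 0.5)
print(ev_always_yes, ev_always_no)
```

In this sketch the two constant strategies tie only at a 0.5 prior; with any other prior they diverge, so the equal-value property depends on how often proposals turn out to be correct.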
5. Final verification
30 eval runs across 3 temperatures:
30 runs, zero variance in either direction, at any temperature.
gstack eval suite: 9/9 pass, zero regressions, cross-skill consistency 5/5.
Accuracy verification
Every flag in the Browse CLI Reference was cross-referenced against `browse/SKILL.md`:
- `snapshot -i`: `-i --interactive`
- `snapshot -c`: `-c --compact`
- `snapshot -d N`: `-d --depth`
- `snapshot -a`: `-a --annotate`
- `snapshot -D`: `-D --diff`
- `snapshot -C`: `-C --cursor-interactive`
- `snapshot -s <sel>`: `-s --selector`
- `snapshot -o <path>`: `-o --output`
- `console --errors`: `--errors`
- `console --clear`: `--clear`
- `links` output: `"text → href"` format (`"text → href"`)
Test plan
🤖 Generated with Claude Code via consensus-tools