Feature: Dogfood Skill — Agent-Driven Exploratory QA for Web Applications #315

@teknium1

Overview

Inspiration: Vercel's dogfood skill in vercel-labs/agent-browser (PR #538, merged Feb 24, 2026). Also adopted by callstackincubator/agent-device for mobile QA.

Give Hermes Agent the ability to systematically explore and test web applications, find bugs, and produce structured QA reports with reproduction evidence (screenshots, video, console errors). The agent becomes its own QA tester — point it at a URL, it explores every page, clicks every button, tests forms with edge cases, checks console errors, captures visual evidence, and writes a professional bug report.


How Vercel's Dogfood Skill Works

Workflow (5 Phases)

  1. Initialize — Create output dirs (screenshots/, videos/), copy report template, open target URL in agent-browser
  2. Authenticate — If login required, perform auth flow and save browser state for session persistence
  3. Orient & Explore — Take annotated screenshots of each page, test interactive elements, check empty/error states, monitor browser console
  4. Document Issues — Capture evidence immediately as bugs are found:
    • Interactive/behavioral bugs: video recording + step-by-step screenshots
    • Static bugs (typos, alignment): single annotated screenshot
    • Console errors: captured from browser
  5. Wrap Up — Aim for 5-10 well-documented issues, update severity counts in report header, close session
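
The Initialize phase is simple enough to sketch in a few lines of Python. This is only an illustration of the directory setup described above, not Vercel's implementation; the function name `init_session` and the file name `report.md` are assumptions made for the example.

```python
import shutil
from pathlib import Path


def init_session(output_dir: Path, template: Path) -> Path:
    """Create the evidence directories and copy the report template.

    Returns the path of the report file inside output_dir.
    """
    # Phase 1 names these two evidence directories explicitly.
    for sub in ("screenshots", "videos"):
        (output_dir / sub).mkdir(parents=True, exist_ok=True)

    report = output_dir / "report.md"  # hypothetical file name
    if not report.exists():
        # Do not overwrite a report left behind by an interrupted run.
        shutil.copyfile(template, report)
    return report
```

Skipping the copy when a report already exists keeps a rerun from wiping findings written earlier in the same session.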

Issue Taxonomy (7 Categories)

| Category | What to Look For |
|----------|------------------|
| Visual/UI | Layout breakage, rendering issues, missing assets, animation glitches |
| Functional | Broken links (404s), buttons that do nothing, form validation errors, state not persisting, silent failures |
| UX | Missing loading indicators, slow interactions (>300ms), confusing navigation, dead ends, missing confirmation for destructive actions |
| Content | Typos, grammatical errors, placeholder/Lorem Ipsum text, truncated text without tooltips |
| Performance | Slow page loads (>3s), janky scrolling, large layout shifts (CLS), excessive network requests, memory leaks |
| Console/Errors | JS exceptions, unhandled promise rejections, failed network requests (4xx/5xx), CORS errors, deprecation warnings |
| Accessibility | Missing alt text, unlabeled form inputs, poor keyboard navigation, focus traps, insufficient color contrast |

Severity Levels

| Severity | Definition |
|----------|------------|
| Critical | Blocks a core workflow, causes data loss, or crashes the app |
| High | Major feature broken or unusable; no workaround exists |
| Medium | Feature works but with noticeable problems; workaround exists |
| Low | Minor cosmetic or polish issue |
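
The Wrap Up phase updates the severity counts in the report header. A minimal sketch of that tally, assuming findings are tracked as plain severity strings (the function name `severity_summary` is hypothetical):

```python
from collections import Counter

# The four levels defined above, in report order.
SEVERITIES = ("critical", "high", "medium", "low")


def severity_summary(severities: list[str]) -> str:
    """Render the Summary table rows from a list of severity labels."""
    counts = Counter(s.lower() for s in severities)
    unknown = set(counts) - set(SEVERITIES)
    if unknown:
        raise ValueError(f"unknown severity: {sorted(unknown)}")
    rows = ["| Severity | Count |", "|----------|-------|"]
    # Counter returns 0 for levels with no findings.
    rows += [f"| {name.capitalize()} | {counts[name]} |" for name in SEVERITIES]
    return "\n".join(rows)
```

Rejecting unknown labels early keeps a misspelled severity from silently dropping out of the header counts.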

Exploration Checklist (Per Page)

  1. Visual scan — take annotated screenshot; check layout and alignment
  2. Interactive elements — click every button and link; verify feedback
  3. Forms — test empty submissions, invalid inputs, and edge cases
  4. Navigation — verify breadcrumbs, back buttons, and deep links
  5. States — explicitly check empty, loading, error, and overflow states
  6. Console — monitor for JS errors and failed network requests
  7. Responsiveness — test at various viewport sizes
  8. Auth boundaries — test behavior for logged-out users or different roles

Report Template Structure

# Dogfood Report

| Field | Value |
|-------|-------|
| Date | {DATE} |
| App URL | {URL} |
| Session | {SESSION_NAME} |
| Scope | {SCOPE} |

## Summary
| Severity | Count |
|----------|-------|
| Critical | 0 |
| High | 0 |
| Medium | 0 |
| Low | 0 |

## ISSUE-001: {Short title}
- **Severity:** critical / high / medium / low
- **Category:** visual / functional / ux / content / performance / console / accessibility
- **URL:** {page URL}
- **Repro Video:** {path or N/A}

**Description:** {what is wrong, expected vs actual}

**Repro Steps:**
1. Navigate to {URL}
   ![Step 1](screenshots/issue-001-step1.png)
2. {Action}
   ![Step 2](screenshots/issue-001-step2.png)
3. **Observe:** {Result}
   ![Result](screenshots/issue-001-result.png)
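
For illustration, one way to fill the per-issue part of this template programmatically. The `Issue` dataclass and `render_issue` function are hypothetical names for this sketch; only the output layout and the screenshot naming scheme come from the template above.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    number: int
    title: str
    severity: str
    category: str
    url: str
    description: str
    actions: list[str] = field(default_factory=list)  # steps after "Navigate to"
    result: str = ""
    video: str = "N/A"


def render_issue(issue: Issue) -> str:
    """Render one finding in the layout of the report template."""
    tag = f"issue-{issue.number:03d}"
    lines = [
        f"## ISSUE-{issue.number:03d}: {issue.title}",
        f"- **Severity:** {issue.severity}",
        f"- **Category:** {issue.category}",
        f"- **URL:** {issue.url}",
        f"- **Repro Video:** {issue.video}",
        "",
        f"**Description:** {issue.description}",
        "",
        "**Repro Steps:**",
    ]
    # Step 1 is always the navigation; each later action gets its own screenshot.
    steps = [f"Navigate to {issue.url}", *issue.actions]
    for n, step in enumerate(steps, start=1):
        lines.append(f"{n}. {step}")
        lines.append(f"   ![Step {n}](screenshots/{tag}-step{n}.png)")
    lines.append(f"{len(steps) + 1}. **Observe:** {issue.result}")
    lines.append(f"   ![Result](screenshots/{tag}-result.png)")
    return "\n".join(lines) + "\n"
```

Deriving the screenshot paths from the issue number keeps the file names on disk and the links in the report in sync.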

Key Design Decisions from Vercel's Implementation

  • "Repro is everything" — match evidence type to issue type (video for interaction bugs, screenshot for static bugs)
  • Human-pace recording: sleep 1s between actions so videos are watchable by humans
  • Use type, not fill, during recording so input is visible character by character
  • Incremental writing — append to report as you go for crash resilience
  • Test as a user, not a developer — do not read application source code or config files
  • Testable — includes structural tests (CI, no API key needed) + full E2E evals (against a fixture page with intentional bugs)
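
The incremental-writing decision can be sketched as an append-only helper. This is an assumption about how one might implement it in Python, not a description of Vercel's code; `append_to_report` is a hypothetical name.

```python
import os
from pathlib import Path


def append_to_report(report: Path, section: str) -> None:
    """Append one finished section to the report and force it to disk."""
    with report.open("a", encoding="utf-8") as fh:
        # Normalize the trailing newlines so sections are separated by one blank line.
        fh.write(section.rstrip("\n") + "\n\n")
        fh.flush()
        os.fsync(fh.fileno())  # survive a crash right after this call
```

Because each finding is written as soon as it is documented, a crashed or interrupted session still leaves a usable partial report.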

Implementation for Hermes Agent

Browser Automation Tool Options

| Option | Pros | Cons |
|--------|------|------|
| agent-browser (recommended) | Purpose-built for AI agents; Rust CLI + Playwright backend; annotated screenshots with numbered refs (@e1, @e2); security features (domain allowlist, content boundaries, auth vault); snapshot command gives accessibility tree | npm dependency; needs Chromium install |
| Playwright directly | No wrapper dependency; full browser control | More raw commands; no agent-optimized features like ref-based interaction |
| web_extract + vision_analyze | Zero setup; already available | No real interaction (can't click/fill); limited to content extraction |

Recommendation: agent-browser — it's the tool Vercel's skill was built for. Key commands:

agent-browser open <url>                              # Navigate
agent-browser snapshot -i                             # Accessibility tree (interactive elements)
agent-browser screenshot --annotate output.png        # Numbered screenshot
agent-browser click @e2                               # Click by ref
agent-browser fill @e3 "text"                         # Fill input by ref
agent-browser console                                 # Get console output
agent-browser errors                                  # Get JS errors
agent-browser record start output.webm                # Start video recording
agent-browser record stop                             # Stop recording
agent-browser scroll down 300                         # Scroll
agent-browser close                                   # End session
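
A rough sketch of how these commands might be driven from Python for a single page visit. The subcommands and flags are copied from the list above and have not been checked against the agent-browser documentation; `page_visit_commands` and `run_commands` are hypothetical helper names, and the default 1-second pause mirrors the human-pace rule from the design decisions.

```python
import shutil
import subprocess
import time


def page_visit_commands(url: str, screenshot_path: str) -> list[list[str]]:
    """Return the argv lists for one page visit, in order."""
    return [
        ["agent-browser", "open", url],
        ["agent-browser", "snapshot", "-i"],
        ["agent-browser", "screenshot", "--annotate", screenshot_path],
        ["agent-browser", "errors"],
    ]


def run_commands(commands: list[list[str]], pause: float = 1.0) -> list[str]:
    """Run each command in turn, pausing between them, and collect stdout."""
    if shutil.which("agent-browser") is None:
        raise RuntimeError("agent-browser is not on PATH")
    outputs = []
    for argv in commands:
        # check=True raises CalledProcessError on a non-zero exit code.
        done = subprocess.run(argv, capture_output=True, text=True, check=True)
        outputs.append(done.stdout)
        time.sleep(pause)  # human-pace delay between actions
    return outputs
```

Building the argv lists separately from running them makes the command plan easy to inspect or unit-test without a browser installed.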

Dogfood Skill Components

  1. SKILL.md — Main instructions teaching the agent the 5-phase workflow
  2. references/issue-taxonomy.md — Full issue category definitions and severity levels
  3. templates/dogfood-report-template.md — Report template to copy into output dir
  4. Optional: scripts/ — Helper scripts for setup/teardown

The skill should teach Hermes to:

  • Accept a target URL and optional scope/auth parameters
  • Systematically explore using agent-browser commands via terminal
  • Capture evidence (screenshots saved to disk, optional video)
  • Optionally use vision_analyze to verify screenshots show real issues
  • Write a structured report using the template
  • Categorize and severity-rate each issue

Report Output

Markdown report with embedded screenshot paths, suitable for:

  • Saving to disk for review
  • GitHub issue creation (one issue per finding, or summary issue)
  • Telegram/Discord summary delivery
  • Comparison against previous runs

Use Cases

  • Self-QA: Test our own landing page, docs, or web properties
  • PR review augmentation: When a PR changes UI, run dogfood against the preview deployment to catch visual regressions before merge
  • Scheduled QA: Cronjob that runs dogfood against production URLs periodically and reports issues via Telegram/Discord
  • User-facing: "Hey Hermes, QA test this website for me" — instant value for any web project the user is working on

Phased Rollout

Phase 1: Install agent-browser + basic skill

  • npm install -g agent-browser && agent-browser install
  • Create a dogfood skill adapted from Vercel's version
  • Support: open URL → explore pages → take screenshots → basic markdown report
  • Verify headless operation on our server (no display)

Phase 2: Full evidence pipeline

  • Video recording for interaction bugs
  • Console error capture and reporting
  • Issue taxonomy + severity classification
  • Structured report template with per-issue evidence
  • Form testing with edge cases (empty, invalid, overflow inputs)

Phase 3: Integration with Hermes workflows

  • Auto-create GitHub issues from findings
  • Send summary to Telegram/Discord
  • Compare against previous runs (regression detection)
  • Scheduled dogfood cronjobs for monitored URLs
  • Integration with PR review workflow (test preview deployments)
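
Regression detection in Phase 3 could start as a set comparison between two runs. The sketch below assumes each finding can be identified by its URL, category, and title; that key, and the names `Finding` and `diff_runs`, are assumptions for illustration only.

```python
from typing import Iterable, NamedTuple


class Finding(NamedTuple):
    url: str
    category: str
    title: str


def diff_runs(previous: Iterable[Finding],
              current: Iterable[Finding]) -> dict[str, list[Finding]]:
    """Classify findings as new, fixed, or persisting between two runs."""
    prev, curr = set(previous), set(current)
    return {
        "new": sorted(curr - prev),         # present now, absent before
        "fixed": sorted(prev - curr),       # present before, absent now
        "persisting": sorted(curr & prev),  # present in both runs
    }
```

Exact title matching is fragile when the agent words the same bug differently across runs, so a real implementation would likely need a more tolerant key.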

Open Questions

  • Should we vendor agent-browser or treat it as an optional dependency that gets installed on first use?
  • How to handle sites requiring authentication? (agent-browser has an auth vault — is that sufficient, or do we need our own credential storage?)
  • Should vision_analyze validate every screenshot, or trust agent-browser's accessibility tree for issue detection? (hybrid approach — use accessibility tree for functional issues, vision for visual issues?)
  • What's the right scope limit per session? Vercel targets 5-10 issues. Too many = noise, too few = incomplete.
  • Can we run this fully headless on our server (no display)? agent-browser uses headless Chromium so likely yes, but video recording may need xvfb.
  • Should the skill support custom checklists per-project (e.g., "always test the checkout flow with these 3 scenarios")?
